-
Notifications
You must be signed in to change notification settings - Fork 4.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added PerfScore support for Arm64 #751
Conversation
@dotnet/jit-contrib PTAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall structure looks good. I haven't reviewed the actual latencies, and hope that perhaps @tannergooding or @TamarChristinaArm can do so.
In any event, I think it's reasonable to go ahead and merge this and make any needed adjustments later.
src/coreclr/src/jit/emit.cpp
Outdated
// | ||
void emitter::perfScoreUnhandledInstruction(instrDesc* id, insExecutionCharacteristics* pResult) | ||
{ | ||
// Change this to #ifdef DEBUG to assert on any unhnadled instructions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo (unhandled).
I'm not sure why you wouldn't leave this enabled in DEBUG
builds, though I can see that it might be risky. That said, I don't know how often someone would think of re-enabling this if they were adding instructions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is active work with adding instruction in the SIMD area. As I will be out ofthe office shortly I didn't want to block them at this time.
If Tanner and others want I can enable this DEBUG assertion to enforce that they add latencies for any new instructions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice, IMO, to enforce this now while we are doing iterative work; rather than needing to go back and fixup everything later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I created PR #810 to turn this assert on
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few typos and nits
Based upon arm_cortex_a55_software_optimization_guide_v2.pdf
8d2f021
to
c3dacce
Compare
Sure, I'll review them today. |
// Branch Instructions | ||
// | ||
|
||
case IF_BI_0A: // b, bl_local |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious.. what are bl_local
and b_tail
?
case INS_udiv: | ||
if (id->idOpSize() == EA_4BYTE) | ||
{ | ||
result.insThroughput = PERFSCORE_THROUGHPUT_4C; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be 3? Or are you applying an additional penalty? same for the X-form
one below.
// Load/Store Instructions | ||
// | ||
|
||
case IF_LS_1A: // ldr, ldrsw (literal, pc relative immediate) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm assuming it's intentional that none of the loads have a latency? one does seem to be set for ldp
and stp
but the default value is PERFSCORE_LATENCY_ILLEGAL
isn't it?
} | ||
break; | ||
|
||
case IF_LS_2D: // ld1 (vector - multiple structures) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should probably clarify that these are only for the 1 register case.
|
||
case IF_DV_1C: // fcmp vn, #0.0 | ||
result.insThroughput = PERFSCORE_THROUGHPUT_1C; | ||
result.insLatency = PERFSCORE_LATENCY_3C; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is off, I think this should be 1 as well, the explicit compare with 0.0
isn't more expensive than an arbitrary scalar.
case INS_fabs: | ||
case INS_fneg: | ||
result.insThroughput = PERFSCORE_THROUGHPUT_2X; | ||
result.insLatency = (id->idOpSize() == EA_8BYTE) ? PERFSCORE_LATENCY_2C : PERFSCORE_LATENCY_3C / 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm why the different costs for the Q one? shouldn't this just be 4?
if ((id->idInsOpt() == INS_OPTS_2S) || (id->idInsOpt() == INS_OPTS_4S)) | ||
{ | ||
// S-form | ||
result.insThroughput = PERFSCORE_THROUGHPUT_3C; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These seem off, the guide says latency 12, throughput 1/9 for the S variant and latency 22 throughput 1/19 for the D form. These seem to be quite a bit cheaper.
{ | ||
// D-form | ||
assert(id->idInsOpt() == INS_OPTS_2D); | ||
result.insThroughput = PERFSCORE_THROUGHPUT_10C; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
think this should be 19.
if (id->idOpSize() == EA_8BYTE) | ||
{ | ||
// D-form | ||
result.insThroughput = PERFSCORE_THROUGHPUT_6C; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, the scalar and vector ones have the same latency and throughput in the guide. are these intentionally lower?
Based upon arm_cortex_a55_software_optimization_guide_v2.pdf
Compiles all of System.Private.CoreLib.dll for Arm64 with updated PerfScore numbers