Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added PerfScore support for Arm64 #751

Merged
merged 1 commit into from
Dec 12, 2019

Conversation

briansull
Copy link
Contributor

@briansull briansull commented Dec 11, 2019

Based upon arm_cortex_a55_software_optimization_guide_v2.pdf
Compiles all of System.Private.CoreLib.dll for Arm64 with updated PerfScore numbers

@briansull
Copy link
Contributor Author

@dotnet/jit-contrib PTAL

@briansull briansull added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Dec 11, 2019
@BruceForstall
Copy link
Member

Copy link
Contributor

@CarolEidt CarolEidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall structure looks good. I haven't reviewed the actual latencies, and hope that perhaps @tannergooding or @TamarChristinaArm can do so.
In any event, I think it's reasonable to go ahead and merge this and make any needed adjustments later.

//
void emitter::perfScoreUnhandledInstruction(instrDesc* id, insExecutionCharacteristics* pResult)
{
// Change this to #ifdef DEBUG to assert on any unhnadled instructions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo (unhandled).
I'm not sure why you wouldn't leave this enabled in DEBUG builds, though I can see that it might be risky. That said, I don't know how often someone would think of re-enabling this if they were adding instructions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is active work with adding instruction in the SIMD area. As I will be out ofthe office shortly I didn't want to block them at this time.

If Tanner and others want I can enable this DEBUG assertion to enforce that they add latencies for any new instructions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice, IMO, to enforce this now while we are doing iterative work; rather than needing to go back and fixup everything later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created PR #810 to turn this assert on

Copy link
Member

@BruceForstall BruceForstall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few typos and nits

src/coreclr/src/jit/emit.cpp Outdated Show resolved Hide resolved
src/coreclr/src/jit/emit.cpp Outdated Show resolved Hide resolved
src/coreclr/src/jit/emit.cpp Outdated Show resolved Hide resolved
src/coreclr/src/jit/emit.cpp Show resolved Hide resolved
src/coreclr/src/jit/emit.cpp Outdated Show resolved Hide resolved
src/coreclr/src/jit/emitxarch.cpp Outdated Show resolved Hide resolved
src/coreclr/src/jit/emitarm64.cpp Outdated Show resolved Hide resolved
Based upon arm_cortex_a55_software_optimization_guide_v2.pdf
@TamarChristinaArm
Copy link
Contributor

Overall structure looks good. I haven't reviewed the actual latencies, and hope that perhaps @tannergooding or @TamarChristinaArm can do so.

Sure, I'll review them today.

// Branch Instructions
//

case IF_BI_0A: // b, bl_local
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious.. what are bl_local and b_tail?

case INS_udiv:
if (id->idOpSize() == EA_4BYTE)
{
result.insThroughput = PERFSCORE_THROUGHPUT_4C;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be 3? Or are you applying an additional penalty? same for the X-form one below.

// Load/Store Instructions
//

case IF_LS_1A: // ldr, ldrsw (literal, pc relative immediate)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming it's intentional that none of the loads have a latency? one does seem to be set for ldp and stp but the default value is PERFSCORE_LATENCY_ILLEGAL isn't it?

}
break;

case IF_LS_2D: // ld1 (vector - multiple structures)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should probably clarify that these are only for the 1 register case.


case IF_DV_1C: // fcmp vn, #0.0
result.insThroughput = PERFSCORE_THROUGHPUT_1C;
result.insLatency = PERFSCORE_LATENCY_3C;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is off, I think this should be 1 as well, the explicit compare with 0.0 isn't more expensive than an arbitrary scalar.

case INS_fabs:
case INS_fneg:
result.insThroughput = PERFSCORE_THROUGHPUT_2X;
result.insLatency = (id->idOpSize() == EA_8BYTE) ? PERFSCORE_LATENCY_2C : PERFSCORE_LATENCY_3C / 2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm why the different costs for the Q one? shouldn't this just be 4?

if ((id->idInsOpt() == INS_OPTS_2S) || (id->idInsOpt() == INS_OPTS_4S))
{
// S-form
result.insThroughput = PERFSCORE_THROUGHPUT_3C;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These seem off, the guide says latency 12, throughput 1/9 for the S variant and latency 22 throughput 1/19 for the D form. These seem to be quite a bit cheaper.

{
// D-form
assert(id->idInsOpt() == INS_OPTS_2D);
result.insThroughput = PERFSCORE_THROUGHPUT_10C;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

think this should be 19.

if (id->idOpSize() == EA_8BYTE)
{
// D-form
result.insThroughput = PERFSCORE_THROUGHPUT_6C;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, the scalar and vector ones have the same latency and throughput in the guide. are these intentionally lower?

@ghost ghost locked as resolved and limited conversation to collaborators Dec 11, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants