Added PerfScore support for Arm64 #751

Merged

briansull merged 1 commit into dotnet:master from briansull:new-arm64-perfscore

Dec 12, 2019

Contributor

briansull commented Dec 11, 2019

Based upon arm_cortex_a55_software_optimization_guide_v2.pdf
Compiles all of System.Private.CoreLib.dll for Arm64 with updated PerfScore numbers

briansull requested a review from CarolEidt

December 11, 2019 00:24

Contributor Author

briansull commented Dec 11, 2019

@dotnet/jit-contrib PTAL

briansull added the area-CodeGen-coreclr label

Member

BruceForstall commented Dec 11, 2019

cc @tannergooding @TamarChristinaArm

CarolEidt approved these changes

Contributor

CarolEidt left a comment

Overall structure looks good. I haven't reviewed the actual latencies, and hope that perhaps @tannergooding or @TamarChristinaArm can do so.
In any event, I think it's reasonable to go ahead and merge this and make any needed adjustments later.

src/coreclr/src/jit/emit.cpp Outdated

    
    //
    void emitter::perfScoreUnhandledInstruction(instrDesc* id, insExecutionCharacteristics* pResult)
    {
    // Change this to #ifdef DEBUG to assert on any unhnadled instructions

Contributor

CarolEidt Dec 11, 2019

Typo (unhandled).
I'm not sure why you wouldn't leave this enabled in DEBUG builds, though I can see that it might be risky. That said, I don't know how often someone would think of re-enabling this if they were adding instructions.

Contributor Author

briansull Dec 12, 2019

There is active work on adding instructions in the SIMD area. As I will be out of the office shortly, I didn't want to block them at this time.

If Tanner and others want, I can enable this DEBUG assertion to enforce that they add latencies for any new instructions.

Member

tannergooding Dec 12, 2019

It would be nice, IMO, to enforce this now while we are doing iterative work, rather than needing to go back and fix up everything later.

Contributor Author

briansull Dec 12, 2019

I created PR #810 to turn this assert on.
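
The trade-off discussed in this thread (assert on a missing entry versus silently falling back) can be sketched as follows. This is a minimal illustration, not the code in emit.cpp: `ExecCharacteristics`, `perfScoreUnhandled`, and the `PERFSCORE_STRICT` switch are hypothetical names, and the 1-cycle fallback values are an assumption.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical stand-in for the JIT's insExecutionCharacteristics;
// the real struct has more fields and different types.
struct ExecCharacteristics
{
    float throughput; // cycles per instruction
    float latency;    // cycles until the result is available
};

// Hypothetical switch: build with -DPERFSCORE_STRICT=1 to make a missing entry fatal.
#ifndef PERFSCORE_STRICT
#define PERFSCORE_STRICT 0
#endif

// Called when the per-instruction switch has no case for an instruction.
void perfScoreUnhandled(const char* insName, ExecCharacteristics* result)
{
#if PERFSCORE_STRICT
    // Strict mode: report the instruction and stop, so the author adds an entry.
    std::fprintf(stderr, "PerfScore: unhandled instruction: %s\n", insName);
    assert(false && "PerfScore: unhandled instruction");
#else
    (void)insName; // lenient mode: the name is not used
#endif
    // Assumed fallback so that code generation can continue: 1 cycle each.
    result->throughput = 1.0f;
    result->latency    = 1.0f;
}
```

In lenient mode a new instruction gets a plausible but unverified score; in strict mode the build that adds the instruction fails until a latency entry exists, which is the enforcement requested above.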

BruceForstall approved these changes

Member

BruceForstall left a comment

A few typos and nits

src/coreclr/src/jit/emit.cpp Outdated

src/coreclr/src/jit/emit.cpp Outdated

src/coreclr/src/jit/emit.cpp Outdated

src/coreclr/src/jit/emit.cpp

src/coreclr/src/jit/emit.cpp Outdated

src/coreclr/src/jit/emitxarch.cpp Outdated

src/coreclr/src/jit/emitarm64.cpp Outdated


Added PerfScore support for Arm64

c3dacce

Based upon arm_cortex_a55_software_optimization_guide_v2.pdf

briansull force-pushed the new-arm64-perfscore branch from 8d2f021 to c3dacce

December 12, 2019 00:47

briansull merged commit 41715bc into dotnet:master

briansull mentioned this pull request

Enable assert in perfScoreUnhandledInstruction #810

Merged

Contributor

TamarChristinaArm commented Dec 13, 2019

> Overall structure looks good. I haven't reviewed the actual latencies, and hope that perhaps @tannergooding or @TamarChristinaArm can do so.

Sure, I'll review them today.

TamarChristinaArm reviewed

src/coreclr/src/jit/emitarm64.cpp

+                      //  Branch Instructions
+                      //
+                      case IF_BI_0A:                                      // b, bl_local

Contributor

TamarChristinaArm Dec 16, 2019

Curious: what are bl_local and b_tail?

src/coreclr/src/jit/emitarm64.cpp

+                              case INS_udiv:
+                                  if (id->idOpSize() == EA_4BYTE)
+                                  {
+                                      result.insThroughput = PERFSCORE_THROUGHPUT_4C;

Contributor

TamarChristinaArm Dec 16, 2019

Shouldn't this be 3? Or are you applying an additional penalty? Same for the X-form one below.

src/coreclr/src/jit/emitarm64.cpp

+                      //  Load/Store Instructions
+                      //
+                      case IF_LS_1A: // ldr, ldrsw (literal, pc relative immediate)

Contributor

TamarChristinaArm Dec 16, 2019

I'm assuming it's intentional that none of the loads have a latency? One does seem to be set for ldp and stp, but the default value is PERFSCORE_LATENCY_ILLEGAL, isn't it?

src/coreclr/src/jit/emitarm64.cpp

+                          }
+                          break;
+                      case IF_LS_2D: // ld1                         (vector - multiple structures)

Contributor

TamarChristinaArm Dec 16, 2019

This should probably clarify that these values apply only to the one-register case.

src/coreclr/src/jit/emitarm64.cpp

+                      case IF_DV_1C: // fcmp vn, #0.0
+                          result.insThroughput = PERFSCORE_THROUGHPUT_1C;
+                          result.insLatency    = PERFSCORE_LATENCY_3C;

Contributor

TamarChristinaArm Dec 16, 2019

This is off. I think this should be 1 as well; the explicit compare with 0.0 isn't more expensive than a compare with an arbitrary scalar.

src/coreclr/src/jit/emitarm64.cpp

+                              case INS_fabs:
+                              case INS_fneg:
+                                  result.insThroughput = PERFSCORE_THROUGHPUT_2X;
+                                  result.insLatency = (id->idOpSize() == EA_8BYTE) ? PERFSCORE_LATENCY_2C : PERFSCORE_LATENCY_3C / 2;

Contributor

TamarChristinaArm Dec 16, 2019

Hmm, why the different costs for the Q one? Shouldn't this just be 4?

src/coreclr/src/jit/emitarm64.cpp

+                                  if ((id->idInsOpt() == INS_OPTS_2S) || (id->idInsOpt() == INS_OPTS_4S))
+                                  {
+                                      // S-form
+                                      result.insThroughput = PERFSCORE_THROUGHPUT_3C;

Contributor

TamarChristinaArm Dec 16, 2019

These seem off. The guide says latency 12, throughput 1/9 for the S variant, and latency 22, throughput 1/19 for the D form. The values here are quite a bit cheaper.
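
To make the comparison concrete: a throughput quoted as 1/9 means one instruction can be issued every 9 cycles. A small sketch of that conversion follows. `cyclesPerInstruction` is a hypothetical helper, not part of the JIT, and the reading of `PERFSCORE_THROUGHPUT_3C` as "3 cycles per instruction" is an assumption based on the naming.

```cpp
#include <cassert>

// Hypothetical helper: converts a throughput quoted as
// "instructions per cycles" (for example 1/9) into cycles per instruction.
constexpr double cyclesPerInstruction(int instructions, int cycles)
{
    return static_cast<double>(cycles) / static_cast<double>(instructions);
}

// Throughput values quoted in the comment above for the S and D forms.
constexpr double kGuideSFormCycles = cyclesPerInstruction(1, 9);
constexpr double kGuideDFormCycles = cyclesPerInstruction(1, 19);

// Value under review for the S-form, assuming "_3C" means 3 cycles per instruction.
constexpr double kReviewedSFormCycles = 3.0;

static_assert(kReviewedSFormCycles < kGuideSFormCycles,
              "the reviewed S-form value is cheaper than the quoted guide value");
```

Under that assumption, the S-form in the diff is modeled as three times cheaper than the quoted guide figure.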

src/coreclr/src/jit/emitarm64.cpp

+                                  {
+                                      // D-form
+                                      assert(id->idInsOpt() == INS_OPTS_2D);
+                                      result.insThroughput = PERFSCORE_THROUGHPUT_10C;

Contributor

TamarChristinaArm Dec 16, 2019

I think this should be 19.

src/coreclr/src/jit/emitarm64.cpp

+                                  if (id->idOpSize() == EA_8BYTE)
+                                  {
+                                      // D-form
+                                      result.insThroughput = PERFSCORE_THROUGHPUT_6C;

Contributor

TamarChristinaArm Dec 16, 2019

Hmm, the scalar and vector ones have the same latency and throughput in the guide. Are these intentionally lower?

ghost locked as resolved and limited conversation to collaborators
