-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cse tuning #1463
Cse tuning #1463
Conversation
1. Changed csdLiveAcrossCall to a bool (zero-diff)
x64 5 improvements 0 regressions, Total PerfScore diff: -10.72 x86 16 improvements 5 regressions, Total PerfScore diff: -72.95
… two. x64 250 improvements 51 regressions, Total PerfScore diff: -459.09 arm64 162 improvements 16 regressions, Total PerfScore diff: -1712.52
x64 280 improvements 81 regressions, Total PerfScore diff: -274.78 arm64 264 improvements 61 regressions, Total PerfScore diff: -911.00 x86 87 improvements 42 regressions, Total PerfScore diff: -123.46 arm32 195 improvements 81 regressions, Total PerfScore diff: -239.10
x64 125 improvements 136 regressions, Total PerfScore diff: +427.43 arm64 83 improvements 153 regressions, Total PerfScore diff: +260.68 x86 218 improvements 193 regressions, Total PerfScore diff: +199.81 arm32 145 improvements 181 regressions, Total PerfScore diff: -33283.10 arm32 method with improvement: -33864.40 (-2.87% of base) : System.Private.CoreLib.dasm - TypeBuilder:CreateTypeNoLock():TypeInfo:this (2 methods)
x64 61 improvements 61 regressions, Total PerfScore diff: -189.03 arm64 90 improvements 49 regressions, Total PerfScore diff: -463.42 x86 88 improvements 80 regressions, Total PerfScore diff: -238.61 arm32 101 improvements 63 regressions, Total PerfScore diff: -259.50
…ditional caller save register x64 73 improvements 45 regressions, Total PerfScore diff: -279.88 arm64 45 improvements 76 regressions, Total PerfScore diff: -90.94 x86 13 improvements 14 regressions, Total PerfScore diff: -21.55 arm32 45 improvements 33 regressions, Total PerfScore diff: -78.60
…he cse def/use cost for SMALL_CODE No diffs in System.Private.Corelib
…or a hugeFrame x64 199 improvements 50 regressions, Total PerfScore diff: -2061.36 arm64 11 improvements 3 regressions, Total PerfScore diff: -46.84 x86 136 improvements 80 regressions, Total PerfScore diff: -1795.00 arm32 50 improvements 35 regressions, Total PerfScore diff: -132.30
Benchmark Perf Score diff report:
|
System.Private.Corelib Perf Score diff report:
|
@dotnet/jit-contrib PTAL |
Can you also share code size diff reports? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for splitting that PR in such nice small commits.
src/coreclr/src/jit/optcse.cpp
Outdated
@@ -1707,7 +1707,11 @@ class CSE_Heuristic | |||
{ | |||
varTyp = TYP_STRUCT; | |||
} | |||
enregCount += genTypeStSz(varTyp); | |||
#ifdef _TARGET_64BIT_ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not know what happens with structs here (especially with SIMD types), but I would prefer to see something like genTypeSize(varTyp)/TARGET_POINTER_SIZE
or a new method in LclVarDsc
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it's the case that either the struct is promoted, so only it's field lclVars are tracked, or it's a SIMD type, in which case it fits in a single register, or it's not (currently) a register candidate. Additional commenting here would be useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The genTypeSize(TYP_STRUCT) returns 0, also genTypeSize(TYP_SIMD16) returns 16, so we can't use the logic you suggest here.
I will add some comments and make the following change:
enregCount++; // The primitive types, including TYP_SIMD types use one register
#ifndef _TARGET_64BIT_
if (varType == TYP_LONG)
{
enregCount++; // on 32-bit targets longs use two registers
}
#endif
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The previous block does if (varTypeIsStruct(varTyp)) varTyp = TYP_STRUCT;
so we can't see TYP_SIMD8/16/32
here.
With the current impl for structs we will see enregCount += 1
on _TARGET_64BIT_
and enregCount += 0
in #else
case for all structs/simd types. I do not understand that difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code in the previous block was added to preserve old behavior, and should be changed as we tune CSE for optimizing struct types (including SIMD).
@@ -2159,13 +2159,13 @@ class CSE_Heuristic | |||
#else // _TARGET_ARM_ | |||
if (hugeFrame) | |||
{ | |||
cse_def_cost = 10 + (2 * slotCount); // movw/movt r10 and str reg,[sp+r10] | |||
cse_use_cost = 10 + (2 * slotCount); | |||
cse_def_cost = 10 + 2; // movw/movt r10 and str reg,[sp+r10] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is a semantic change here, so it is not only refactor
, are you aware of that?
Do you want to multiply 10 by slotCount
?
Where do these consts come from?
Replace 10 + 2
with 12
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I am aware of this semantic change. It is actually correct, as the cost to generate load when we have a huge frame on ARM32 in terms of code size is 12 bytes for each slot.
For SMALL_CODE we measure the costs in terms of how many bytes of code will be generated.
It looks like I need to visit the code that set hugeFrame to ensure that it does not get set for ARM64. (and adjust the size for a largeFrame on ARM64)
In any case it is very rare that we are ever optimizing for SMALL_CODE as we only do so for class constructors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems reasonable. More comments would be helpful.
src/coreclr/src/jit/optcse.cpp
Outdated
@@ -1578,12 +1578,12 @@ class CSE_Heuristic | |||
sortTab = nullptr; | |||
sortSiz = 0; | |||
|
|||
#ifdef _TARGET_XARCH_ | |||
#ifdef _TARGET_X86_ | |||
if (m_pCompiler->compLongUsed) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why this is needed at all, much less x86 only. Can you explain? It might make sense to count 2 regs for each long lclVar but I'm not sure why this just adds one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was needed by the old x86 tree based codegen.
It may no longer be needed with the RyuJIT based x86 codegen.
I will try removing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing this has positive results for x86
src/coreclr/src/jit/optcse.cpp
Outdated
@@ -1707,7 +1707,11 @@ class CSE_Heuristic | |||
{ | |||
varTyp = TYP_STRUCT; | |||
} | |||
enregCount += genTypeStSz(varTyp); | |||
#ifdef _TARGET_64BIT_ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it's the case that either the struct is promoted, so only it's field lclVars are tracked, or it's a SIMD type, in which case it fits in a single register, or it's not (currently) a register candidate. Additional commenting here would be useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also share code size diff reports?
https://gist.github.com/briansull/8518adc549803963e7f5a41c6d338a64
There are quite a few methods with regressions (both code size and perf score). Have you measured the effect of individual commits to see if any of them are responsible for the majority of regressions? |
Removed increment of enregCount on _TARGET_X86_ when we have compLongUsed: Framework diffs Total PerfScoreUnits of diff: -654.75 (-0.00% of base) diff is an improvement. 79 total methods with Perf Score differences (55 improved, 24 regressed), 146432 unchanged. Fixed setting of largeFrame/hugeFrame for ARM64 Zero framework diffs. :
@@ -1578,25 +1578,25 @@ class CSE_Heuristic | |||
sortTab = nullptr; | |||
sortSiz = 0; | |||
|
|||
#ifdef _TARGET_XARCH_ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This now removed:
Removed increment of enregCount on _TARGET_X86_ when we have compLongUsed:
Framework diffs
Total PerfScoreUnits of diff: -654.75 (-0.00% of base) diff is an improvement.
79 total methods with Perf Score differences (55 improved, 24 regressed), 146432 unchanged.
} | ||
#elif _TARGET_ARM64_ | ||
if (frameSize > 0x1000) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed setting of largeFrame/hugeFrame for ARM64
Zero framework diffs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - only a few comment typos that could be corrected.
Thanks for the detailed comments!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thank you.
Code size diffs and perf scores were only reported for SPC (and some benchmarks). Would have been good to show results over the full set of FX assemblies. |
I have the perf score and code size diffs for the full frameworks as well: |
Frameworks PerfScore diffs:
|
I also have before and after Asm and Dumps for many of the top regressions and improvements:
|
Most of the regressions are due to register allocation issues, with some due to missing a cascaded optimization due to a COMMA node assignment. |
Tuning of the CSE hueristic