-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Align inner loops #44370
Align inner loops #44370
Conversation
@dotnet/jit-contrib , @danmosemsft , @adamsitnik , @DrewScoggins |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left you some preliminary feedback; still looking things over.
src/coreclr/src/jit/emit.cpp
Outdated
|
||
#if defined(TARGET_XARCH) | ||
// https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf | ||
bool isJccAffectedIns = ((lastIns >= INS_i_jmp && lastIns < INS_align) || (lastIns == INS_call) || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This bit seems fragile (lastIns >= INS_i_jmp && lastIns < INS_align)
-- what guarantee is there this will remain true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree but I didn't find a better way to capture all jmp
instructions. Is there any other better way to do that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think that the proper place for handling the INS_align size in branch tightening (emitJumpDistBind
) -- they seem to be the same problem: a set of instructions that are pessimistically estimated at a larger size, then algorithmically shrunk to the final chosen size. In this way, the INS_align padding could be predetermined precisely before the actual memory allocation in allocMem
occurs.
For alignment to be correct, it would require that estimated instruction sizes are correct, and we know there are cases today where we are not correct, and overestimate some sizes. While that is a bug, and we emit warnings in the JitDump about it, it would defeat alignment. Therefore, we could fix this for alignment by first determining if there are any INS_align in the function. If so, any time we see an instruction actual size less than estimated size, we make up the difference by emitting NOPs.
It looks like the code has only been implemented for xarch. Is the plan to validate this first, then expand to support arm32/64? In that case, maybe you should define a FEATURE_ALIGN_LOOPS defined for TARGET_XARCH and then it should be easier to find that relevant code when expanding to handle arm.
src/coreclr/src/jit/emitxarch.cpp
Outdated
|
||
void emitter::emitVariableLoopAlign(unsigned short alignmentBoundary) | ||
{ | ||
unsigned short nPaddingBytes = alignmentBoundary - 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it really worth making these short
instead of int
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just felt that unsigned short
type should be good enough to hold the values in these variables. Do you prefer using unsigned
instead?
Would it make sense to take into account whether the loop contains any calls? If the loop contains calls, it is less likely to benefit from alignment. |
307e011
to
b47af98
Compare
Thanks for the suggestion @jkotas . I have added a check to not perform alignment if there is a call in the loop. |
// Auto-generated message 69e114c which was merged 12/7 removed the intermediate src/coreclr/src/ folder. This PR needs to be updated as it touches files in that directory which causes conflicts. To update your commits you can use this bash script: https://gist.github.com/ViktorHofer/6d24f62abdcddb518b4966ead5ef3783. Feel free to use the comment section of the gist to improve the script for others. |
1f3b8b0
to
5f44f6e
Compare
@AndyAyersMS , @BruceForstall - This PR is ready to review again. I have added several updates in PR description that describes the approach taken to reduce the allocation size. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed everything save for emit.cpp
4aef008
to
b530d14
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few small comments on things I looked at before. I still need to pull down your latest and look at emit.cpp.
I wonder if you could write up some pseudo-code outlining what we do during emission as it is getting hard to keep track of all the adjustments and counter-adjustments that go on.
Something like:
Before final codegen, we have:
- marked some IGs as having align instructions at their ends
- kept track of the last IG that has align instructions
- alignment instructions size is computed assuming no size overestimates
During final codegen, we must concurrently track
- need for alignment padding
- overestimates of non-align instruction sizes upstream of alignment instructions
- overstimates impact on branch distances
If no instructions sizes are over-estimated then each alignment instruction pads to the size computed before final codegen.
To minimize likelihood of over-estimates certain instructions that are known to be prone to over-estimation are kept at the initial sizes if they are upstream of any alignment instructions.
Remaining over-estimated instructions are handled by padding out to their original size estimate....
(proabably have the details wrong here but hopefully you get the idea)
Summary
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comments for emit.cpp.
/* It is fatal to under-estimate the instruction size */ | ||
noway_assert(id->idCodeSize() >= csz); | ||
// It is fatal to under-estimate the instruction size, except for alignment instructions | ||
noway_assert(estimatedSize >= actualSize); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is it that prevents align instructions from hitting this assert?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the idCodeSize()
for align
instruction is always accurate from what we calculated in emitLoopAlignAdjustments()
.
// If igInLoop's next IG is a loop and needs alignment, then igInLoop should be the last IG | ||
// of the current loop and should have backedge to current loop header. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we were going to split these critical edges earlier on...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought this was simplistic instead of having portion of alignment logic in register allocator as well. With this approach, we can simply check if the last IG was also marked as IGF_LOOP_ALIGN
and if yes, then just remove the align
instruction size from it.
src/coreclr/jit/emit.cpp
Outdated
// if currIG has back-edge to dstIG. | ||
// | ||
// Notes: | ||
// If the current loop covers a loop that is already marked as align, then remove |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you describe this differently?
// If the current loop covers a loop that is already marked as align, then remove | |
// If the current loop encloses a loop that is already marked as align, then remove |
Also what happens if the two loops intersect but one doesn't enclose the other?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you give an example?
// 3b. If paddingNeeded > maxPaddingAmount, then recalculate to align to 16B boundary. | ||
// 3b. If paddingNeeded == 0, then return 0. // already aligned at 16B | ||
// 3c. If paddingNeeded > maxPaddingAmount, then return 0. // expensive to align | ||
// 3d. If the loop already fits in minimum 32B blocks, then return 0. // already best aligned |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why wouldn't we check this first? If padding can't reduce the number of chunks then it seems like we should not pad..?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I first try to check the best alignment boundary as part of 3b. Only if we decide to move ahead, we will check for the 3d. So in 3d, i check the current offset of loop with regards to alignmentBoundary (32B or
16B`). If we do it before, we might be just checking the current offset % 32.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a few comments; I should look more
|
||
if (block->bbJumpDest->isLoopAlign()) | ||
{ | ||
GetEmitter()->emitSetLoopBackEdge(block->bbJumpDest); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This condition seems odd to me: Presumably, you've already determined that an "isLoopAlign" block is the top of a loop. But now, any block that jumps to that is considered a back edge. What if the lexically first block of the loop is the target of branches that aren't loop branches: forward branches jumping to the loop top, or non-loop backward branches? (The non-loop backward branches might not matter since presumably your algorithm will hit an in-loop backward branch first, but possibly marking them could confuse other code? E.g., what if the first block in a non-first loop branched back to the top of the first loop? I'm not sure if all these cases can occur, due to loop canonicalization, etc., but this case seems not too specific)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
emitSetLoopBackEdge()
checks the igNum
and accordingly decides if the edge is a back-edge or not.
Have we measured this across a range of hardware to confirm? I'd be worried that this could subtly impact certain SIMD loops due to decoding bandwidth and number of instructions that can fit into the icache. |
No, I haven't measured the performance across various hardware, but currently, we optimize very few methods having inner loops out of the total methods that have inner loop. To give an example, we just align loops in 1515 out of 380K methods in .NET libraries which is around 0.4% of them. We might mark an inner loop needs alignment in a method, but there are various heuristics to determine if the loop is worth aligning and if not, we would skip doing alignment. The prominent reason where alignment is rejected is the loop size. For such methods, we will continue to keep compacted VEX encoding. Even for methods, for which we would align an inner loop, we would continue to disable VEX encoding only up to the last align instruction after which we will re-enable it. So in my previous example, even out of those 0.4% methods, we will see compacted VEX encoding after the inner loop block ends. There will definitely be edge cases which might regress as you pointed, but neither can't prove that will be the case without knowing the exact hardware/code, nor I have enough data to determine cost vs. benefit of not doing loop alignment with this approach. Alternative of not disabling the VEX encoding is to fix the estimation for such instructions, which we have both concluded offline that it will be a non trivial task and might take couple of weeks to accomplish. So given the fact that we are very selective in which inner loop gets aligned and even in those methods, we do not completely disable VEX encoding, improvements we get in normal code by alignment and the challenges to fix the existing over-estimation, I feel, existing option is worth pursing. We should regardless strive to fix the over-estimation of VEX encoding problem which will help fix the regression that we might notice in SIMD loops. By the way, could you elaborate on SIMD loops or share a sample code? |
We also intend to fix the encoding size issues once this change is in. So if there is a perf impact from changing encoding, it should hopefully be short-lived. |
Many of our most performance oriented functions using some sort of vectorization and looping as they often go hand in hand. In many cases these loops are also fairly small (typically between 16 and 64 bytes). For example, in In the case of the AVX loop, the loop is ~63 bytes today and as such likely benefits from both alignment and small encoding on modern hardware Some of this will likely be mitigated by better alignment, but we may also not see improvements or may see regressions due to these changes in other cases.
I do as well, I just am wanting to ensure we are testing any known "hot" scenarios for potential regressions 😄 My main thought is still what we had discussed offline. Which is that the primary concern (that I know of) is that due to instruction size misestimation, we won't want to cause the method allocations to grow too much. However, the maximum difference is likely going to be a handful of bytes. The actual emitted size is In the case where you are aligning both branch targets ( However, in the case of misestimation (today's world), you'd have The same will be true for every subsequent alignment needed, whether the misestimation was due to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you answer the Debug/Release code diff question?
@@ -822,6 +831,60 @@ insGroup* emitter::emitSavIG(bool emitAdd) | |||
} | |||
#endif | |||
|
|||
#ifdef FEATURE_LOOP_ALIGN | |||
// Did we have any align instructions in this group? | |||
if (emitCurIGAlignList) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can there ever be more than a single align instruction in an IG?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case of non-adaptive, we can have multiple align
instructions that all represent a single padding. E.g. for non-adaptive 32B, there will be 3 align instructions
- align (15)
- align (15)
- align (1)
src/coreclr/jit/emit.h
Outdated
@@ -250,6 +250,7 @@ struct insGroup | |||
unsigned int igFuncIdx; // Which function/funclet does this belong to? (Index into Compiler::compFuncInfos array.) | |||
unsigned short igFlags; // see IGF_xxx below | |||
unsigned short igSize; // # of bytes of code in this group | |||
insGroup* igLoopBackEdge; // "last" back-edge that branches back to an aligned loop head. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems potentially expensive, e.g. for MinOpts scenario where it's not used.
Should it be under #if FEATURE_ALIGN_ALIGN
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, makes sense.
src/coreclr/jit/emitxarch.cpp
Outdated
@@ -13705,6 +13815,43 @@ size_t emitter::emitOutputInstr(insGroup* ig, instrDesc* id, BYTE** dp) | |||
emitDispIns(id, false, dspOffs, true, emitCurCodeOffs(*dp), *dp, (dst - *dp)); | |||
} | |||
|
|||
#ifdef FEATURE_LOOP_ALIGN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this code within #ifdef DEBUG
block? That's going to lead to Debug/Release code differences
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It shouldn't be, my bad. I guess the #endif
got removed after rebase. I will fix it.
@tannergooding - Yes, that's the thing we concluded and I went ahead and explored that option. Unfortunately, it was not that simple. I noticed cases where sometimes, because of the combination of multiple instructions involving
We only align backward branch targets because that represent the loop. So, in your example it is Assembly codeG_M21588_IG01:
00007fff`56918d40 C5F877 vzeroupper
G_M21588_IG02:
00007fff`56918d43 4963C1 movsxd rax, r9d
G_M21588_IG03:
00007fff`56918d46 4803C2 add rax, rdx
G_M21588_IG04:
00007fff`56918d49 C5FC57C0 vxorps ymm0, ymm0, ymm0
G_M21588_IG05:
00007fff`56918d4d align
G_M21588_IG06:
00007fff`56918d4d C5FE6F0A vmovdqu ymm1, ymmword ptr[rdx]
G_M21588_IG07:
00007fff`56918d51 C5F564D0 vpcmpgtb ymm2, ymm1, ymm0
G_M21588_IG08:
00007fff`56918d55 C5FDD7CA vpmovmskb ecx, ymm2
G_M21588_IG09:
00007fff`56918d59 83F9FF cmp ecx, -1
G_M21588_IG10:
00007fff`56918d5c 7539 jne SHORT G_M21588_IG24
G_M21588_IG11:
00007fff`56918d5e C5F560D0 vpunpcklbw ymm2, ymm1, ymm0
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (punpcklbw: 2) 32B boundary ...............................
G_M21588_IG12:
00007fff`56918d62 C5F568C8 vpunpckhbw ymm1, ymm1, ymm0
G_M21588_IG13:
00007fff`56918d66 C4E36D46D920 vperm2i128 ymm3, ymm2, ymm1, 32
G_M21588_IG14:
00007fff`56918d6c C4E36D46C931 vperm2i128 ymm1, ymm2, ymm1, 49
G_M21588_IG15:
00007fff`56918d72 C4C17E7F18 vmovdqu ymmword ptr[r8], ymm3
G_M21588_IG16:
00007fff`56918d77 C4C17E7F4820 vmovdqu ymmword ptr[r8+32], ymm1
G_M21588_IG17:
00007fff`56918d7d 4883C220 add rdx, 32
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (add: 1) 32B boundary ...............................
G_M21588_IG18:
00007fff`56918d81 4983C040 add r8, 64
G_M21588_IG19:
00007fff`56918d85 488D48E0 lea rcx, [rax-32]
G_M21588_IG20:
00007fff`56918d89 483BD1 cmp rdx, rcx
G_M21588_IG21:
00007fff`56918d8c 76BF jbe SHORT G_M21588_IG06
G_M21588_IG22:
00007fff`56918d8e B801000000 mov eax, 1
G_M21588_IG23:
00007fff`56918d93 EB02 jmp SHORT G_M21588_IG25
G_M21588_IG24:
00007fff`56918d95 33C0 xor eax, eax
G_M21588_IG25:
00007fff`56918d97 0FB6C0 movzx rax, al
G_M21588_IG26:
00007fff`56918d9a C5F877 vzeroupper
00007fff`56918d9d C3 ret |
858608c
to
b100226
Compare
Perform loop alignment at 32B boundary by adding padding before hot inner loops. Contributes to #43227
Description
Detect all the inner loops (inside
optimizer.cpp
) in a method and mark the corresponding basic blocks that represent loop head withBBF_FIRST_BLOCK_IN_INNERLOOP
. This information is propagated when the blocks are cloned or new ones are created/moved (inflowgraph.cpp
). During codegen (incodegenlinear.cpp
), if we see that the next basic block hasBBF_FIRST_BLOCK_IN_INNERLOOP
, we mark the current IG withIGF_ALIGN_LOOP
. This IG is the one that precedes an IG that is a loop head. We emit one or morealign
instruction (see details in Implementations below) in current IG (the one that precedes the loop head IG). During emitting (emitxarch.cpp
), when we seealign
instruction, we check how much padding should be added based on the target address and emit sequence ofNOP
s.Implementations
In this PR, there are two implementations that determines how much padding should be added to align a loop. Those are non-adaptive and adaptive padding.
Non-Adaptive
In non-adaptive setup, user can specify the boundary
COMPlus_JitAlignLoopBoundary
at which the inner loops should be aligned. By default, this is set to32 byte
boundary. This is simliar to align-loops option that modern processors exposes. For such cases, the maximum padding we will do isCOMPlus_JitAlignLoopBoundary - 1
, thus by default, we will add31 bytes
of padding to align a loop. This implementation would do the padding only if loop size<= COMPlus_JitAlignLoopMaxCodeSize
. By default, this is set to96 bytes
(3 x 32B i.e. 3 chunks of 32B).Adaptive
In my experiments, I realized that having a limit on padding amount helps eliminate some regression that might occur because of execution of series of
NOP
instructions. It is also sensible to adjust the padding amount depending on the size of loop. Lastly, for32B non-adaptive
approach, I noticed that there were cases that we would not perform alignment if the target address or loop size position didn't meet the heuristics (see details in Heuristics below). In that case, we could have tried to align to 16B boundary because that would be still better than not aligning the loop. Taking that all in account, the adaptive approach works as follows:The biggest possible padding that will be added is
15 bytes
for a loop that fits in one32B
chunk. If the loop is bigger and fits in two32B
chunks, then reducing the padding amount to7 bytes
and so forth. The reasoning being that bigger the loop gets, lesser effect the alignment has on it. With this approach, we could align a loop that takes 432B
chunks if padding needed is1 byte
. With 32B non-adaptive approach, we would never align such loops. Also, overall the padding size reduced compared to 32B non-adaptive approach.If we cannot get the loop to align at
32B
boundary, we will try it to align to16B
boundary. We reduce the max padding limit if we get here as seen in table below.Heuristics
We will emit
align
instruction, but during emitting there are multiple reasons for which padding won't be added.Loop hotness: If the inner loop is not hot enough, we will not try to align it. It is controlled by
COMPlus_JitAlignLoopMinBlockWeight
which is set to10
by default.Loop size: As mentioned above, if we determine that loop size exceed the threshold (
96 bytes
for non-adaptive and variable for adaptive), it will not add padding for such loops.Boundary: If we detect that the loop is already emitted at the required alignment boundary, it will not try to align it as there is no need. Likewise, even if the loop doesn't start at the boundary, but starts at certain offset from given alignment boundary that will still make it fall under efficient
32B
chunks. I have tried experimenting the JCC Erratum impact and hence addedCOMPlus_JitAlignLoopForJcc
flag that will try to ensure that the back-edge jump doesn't fall on the exact 32B boundary. This isDEBUG
only flag.Padding limit: In case of adaptive approach, it can decide to not align the loop if padding needed exceeds the threshold of maximum padding that can be added for a loop of given size.
Update: Call inside a loop: If there are calls from inside a loop, there is probably less benefit in aligning a loop.
Code size impact
Below graph demonstrates the code size and allocation size impact. While allocation size for 32B adaptive is little more than the 16B non-adaptive approach, the overall code size for 32B adaptive is smallest among other implementations.
Below graph demonstrates the (allocation size - code size) comparison of various approaches. The difference is highest for 32B non-adaptive implementation and lowest with 16B non-adaptive. 32B adaptive is marginally higher than 16B non-adaptive, but again since the overall code size is minimal as compared to 16B/32B non-adaptive, 32B adaptive is the winner. For me, the goal was to make the diff that matches the blue bar in the graph for non alignment. In other words, I wanted to get to the point where we only allocate memory for which padding is added and not otherwise. As I pointed in #8748 (comment), the only way to not over-allocate memory is if I can predict in advance whether alignment is needed or not. If you see the
Heuristics
section above, I can easily calculate theloopSize
in advance and if I find the loop is big enough, I don't allocate extra space foralign
instruction because we know that we won't align such loop. However, for other heuristics, I need to know the precise size of all instructions before the loop so I can estimate in advance (before allocating the memory) if I will do the alignment. For eg, if the size of all instructions before the loop instruction is% 32
, I know that the loop falls on 32B boundary and there is no need of extra alignment. In that case, I can remove thealign
instruction for that loop and thus while allocating, allocate 15 bytes less. But, today, at few places, we over-estimate the instruction size and we won't know precise size until the point where we emit the machine code at target address (which happens after allocating memory). If we can fix the over-estimation part, we could get the allocation size down (for 32B adaptive approach, it will go down from 0.68% -> 0.07% regression).No-alignment vs. 32B non-adaptive
Summary of Code Size diffs:
Total bytes of base: 51527293
Total bytes of diff: 51646256
Total bytes of delta: 118963 (0.23% of base)
diff is a regression.
Code size impact details
Summary of Allocation Size diffs:
(Lower is better)
Total bytes of base: 51850481
Total bytes of diff: 52495660
Total bytes of delta: 645179 (1.24% of base)
diff is a regression.
Allocation size impact details
No-alignment vs. 16B non-adaptive
Summary of Code Size diffs:
(Lower is better)
Total bytes of base: 51527293
Total bytes of diff: 51583852
Total bytes of delta: 56559 (0.11% of base)
diff is a regression.
Code size impact details
Summary of Allocation Size diffs:
(Lower is better)
Total bytes of base: 51850481
Total bytes of diff: 52161852
Total bytes of delta: 311371 (0.60% of base)
diff is a regression.
Allocation size impact details
No-alignment vs. 32B adaptive
Summary of Code Size diffs:
(Lower is better)
Total bytes of base: 51527293
Total bytes of diff: 51567317
Total bytes of delta: 40024 (0.08% of base)
diff is a regression.
Code size impact details
Summary of Allocation Size diffs:
(Lower is better)
Total bytes of base: 51850481
Total bytes of diff: 52214682
Total bytes of delta: 364201 (0.70% of base)
diff is a regression.
Allocation size impact details
Update
Overestimation fix
Since the allocation size regressed heavily because of over-estimated
align
instructions, in the recent update, I tried to fix that problem. The only reason that exact padding needed to align a loop could not be determined before allocating the memory was because certain instructions were over-estimated and during outputting the they were outputted in compact manner leading to mismatch of the calculation we did to come up with padding amount needed. Majority of over-estimation was occurring because of #21453 where we were optimizing the encoding of certain instructions by trimming the VEX prefix from 3-bytes to 2-bytes. To make sure, that does not affect alignment calculation, I have disabled this optimization until we reach the lastalign
instruction. Disabling compact VEX encoding should not affect performance. The only downside of it is that code size of methods containingalign
instructions will be more, but there won't be any change in amount of memory allocated. For the remaining over-estimated instructions,(see more details on them at #8748 (comment)), we will add aNOP
after them. Again, this will happen until we reach lastalign
instruction of a method after which no compensation for over-estimation will be done.With that, we see that the difference of allocation size vs. code size reduces as seen in below graph.
Summary of Code Size diffs:
(Lower is better)
Total bytes of base: 51624344
Total bytes of diff: 51643618
Total bytes of delta: 19274 (0.04% of base)
diff is a regression.
Summary of Allocation Size diffs:
(Lower is better)
Total bytes of base: 51942836
Total bytes of diff: 51957731
Total bytes of delta: 14895 (0.03% of base)
diff is a regression.
Perf impact
I have been testing various Microbenchmarks with my loop alignment changes (both adaptive and non-adaptive) and sharing my findings in #44051. I also did a run of all Microbenchmarks on my machine for no-alignment, 32B (non-adaptive), 16B (non-adaptive), 32BAdaptive and 32BJCC. Then I compared the results with noise threshold of
1ns
and statistical threshold of 3%.In the following graph, X-axis is benchmarks ID and Y-axis is base/diff. Since both base and diff are measured in
nanoseconds
, higher the ratio, better is the performance.Here are some of the key observations:
If we zoom the graph and make Y-axis to log10 scale, here are some observations:
I did analyzed some regressions and some of them were coming from memory alignment although I did not verified all of them.
Defaults
Looking at the code size and perf impact, 32B adaptive alignment approach is ON by default. We can switch off loop alignment using
COMPlus_JitAlignLoops=0
.Flags
Here are set of flags that are added as part of this PR:
COMPlus_JitAlignLoopMinBlockWeight
: Debug flag that controls the minimum block weight of a loop after which alignment will happen. Default value is10
.COMPlus_JitAlignLoopMaxCodeSize
: Debug flag that controls maximum loop size for which alignment will happen. Default value is96
bytes for non-adaptive and ignored for adaptive.COMPlus_JitAlignLoopBoundary
: Debug flag that controls the boundary at which loop will be aligned. Default value is32
bytes.COMPlus_JitAlignLoopForJcc
: Debug flag that controls if JCC adjustment should be done during non-adaptive alignment. Default value is0
.COMPlus_JitAlignLoopAdaptive
: Debug flag that controls if adaptive or non-adaptive alignment should happen. Default value is1
.COMPlus_JitAlignLoops
: If loop alignment should be done or not. Default value is1
.Other changes
COMPlus_JitDasmWithAddress
is set, also emit the chunk boundary in the disassembly. The alignment boundary is whatever value is set inCOMPlus_JitAlignLoopBoundary
. If an instruction crosses the boundary, the logging will also indicate that.Sample disassembly output
Update
Impacted loops
From my preliminary
pmi
run on .NET libraries, with this feature, we align1908
loops present in1635
methods. This number is less compared to my earlier implementation where I was also aligning loops containing call. In that case, we were aligning4692
loops present in4239
, but since alignment will not benefit such loops anyway, I am disabling alignment if it has call.Edge cases
There were certain edge cases that need to be fixed:
align
instructions for next loop that follows current loop. For such cases, do not take the size ofalign
instruction in to account because it is reserved for the next loop.Follow-ups
If we get this PR in, we will monitor the performance and stability of our Microbenchmarks. If we don't see expected promising results in our perf lab, we will turn OFF this feature. Other follow-ups that we will be doing is:
NOP
might come in the hot path and cycles might be spent for them. The goal is to add padding at various blind spots that occur not just immediately before the loop but anywhere from beginning of the method to the beginning of the loop. An example would be to addNOP
after unconditional jump so that it doesn't get executed and we still our loop aligned.As described above, we need to fix the over-estimation of instructions for xarch so we can precisely determine if loop alignment is needed and if not, avoid allocating memory foralign
instruction.