-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JIT: Enable jump-to-next removal optimization in MinOpts #95340
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch Issue DetailsFollow-up to #94239. In non-debug MinOpts scenarios (like Tier0 code), we should remove branches to the next block regardless of whether (Also, the emitter-level optimization tries to remove jumps to the next block regardless of
|
cc @dotnet/jit-contrib. Diffs. |
@AndyAyersMS PTAL, is it ok if we enable this for MinOpts so long as we aren't generating debuggable code? This gives us a slight TP boost. |
I think enabling this way is ok, though we could restrict it to Tier0 if we really want to keep minops "pure". @BruceForstall what do you think? Do you have any insight into the relatively small improvements on x86? Seems anomalous. |
I was wondering about that. Generally, I'd say we shouldn't do any optimization in MinOpts. But don't we require something like this to get rid of But remind me why we don't want to also do this for debuggable code? |
I think this could impair the debugging experience by removing instructions where we could previously place breakpoints. I cannot confirm if this is a real problem, as I haven't run any diagnostics tests and, according to a couple of JitDumps I've sampled, the number of IL offset entries in the debug info is the same after enabling the optimization. But I guess we aren't looking for brilliant TP/codegen when producing debuggable code anyway, so there isn't much value lost here in disabling this for debuggable code (removing the debug code check might improve TP for non-debug cases, which is desirable, but I imagine the improvement would be minor).
Yes, removing the
I found that odd, too. My guess is maybe the emitter-level optimization was already quite effective on x86, leaving few instances for the new optimization to hit? (The lack of this emitter-level opt on ARM64 probably explains the significant size improvements there). |
Is that optimization x86 specific? There are decent diffs for x64. So perhaps there's something else going on? |
The optimization runs on x64 too, but with an additional restriction, so it probably fires slightly less often. But I think you're right that there's something else at play. I wonder if we align at the end of |
I think we should pursue enabling this without restriction. There seem to be a couple plausible ways to verify we haven't disrupted debuggabilitiy:
|
I'd be happy to! I'll start with your latter suggestion of diffing the debug info, since that seems cheaper. |
I wrote a script to diff the IL offsets, and ran it on the asmdiffs for x64 and arm64, with and without the optimization enabled for all MinOpts contexts. The script diffs the total number of IL offsets, as well as each IL offset (though not the native offsets, as I believe these change based on how many jumps are removed, and where) if the total number of offsets are equal (I can share this script offline if you're interested). I didn't find any IL offset diffs. I also diff'd the IL offsets with the optimization always enabled (regardless of MinOpts/debuggable code/etc) versus never running the optimization, and I found different numbers of IL offsets for a few methods compiled with optimizations; looking at their disasms, these methods also had different numbers of instruction groups. Without the optimization, we sometimes have a jump to the next IG, which is empty and falls through into the next IG. With the optimization, the jump and the empty IG are removed. I'm guessing the differing numbers of IGs explains the differing numbers of IL offsets? In both cases, I didn't see any IL offset diffs for unoptimized code, so I think it should be safe to turn this optimization on for debuggable code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like good enough due diligence for me.
src/coreclr/jit/block.cpp
Outdated
const bool skipJump = tryJumpOpt && JumpsToNext() && !hasAlign() && ((bbFlags & BBF_KEEP_BBJ_ALWAYS) == 0) && | ||
!compiler->fgInDifferentRegions(this, bbJumpDest); | ||
return skipJump; | ||
return (JumpsToNext() && !hasAlign() && !compiler->fgInDifferentRegions(this, bbJumpDest)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: outermost parens are unnecessary
return (JumpsToNext() && !hasAlign() && !compiler->fgInDifferentRegions(this, bbJumpDest)); | |
return JumpsToNext() && !hasAlign() && !compiler->fgInDifferentRegions(this, bbJumpDest); |
Better diffs, and a nice MinOpts TP boost on x64/x86. x86 improvements are still suspiciously small. Will take a look next. Edit: Funnily enough, |
@AndyAyersMS I cherry-picked some of the example methods with notable size improvements on x64 and no improvements on x86, and it looks like the emitter-level optimization was already taking care of many of these unnecessary jumps on x86, but not on x64. On x64, this optimization has an additional check that blocks removing the jump if it is needed for EH reasons. For the example methods where x64 had improvements and x86 had none, removing this check and disabling the block-level optimization still results in the unnecessary jumps being removed. So I don't think the block-level optimization is doing anything odd on x86; the majority of cases in MinOpts were just already covered by the existing optimization. Side note: For the emitter-level optimization on x64, if an instruction is needed after an EH-related call, does that instruction have to be a jump? I wonder if we can replace it with a nop. |
I don't think the actual instruction matters, it just needs to occupy space. So a nop should be fine. |
@AndyAyersMS are you ok with me merging this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good -- thanks for digging into some of the corners here.
Improvement: dotnet/perf-autofiling-issues#25409 |
Follow-up to #94239. In non-debug MinOpts scenarios, we should remove branches to the next block regardless of whether
BBF_NONE_QUIRK
is set; this should yield modest code size and TP improvements.(Also, the emitter-level optimization tries to remove jumps to the next block regardless of
BBF_KEEP_BBJ_ALWAYS
being set. Removing this requirement from the block-level optimization doesn't seem to affect correctness.)