
JIT: Enable jump-to-next removal optimization in MinOpts #95340

Merged: 3 commits merged into dotnet:main on Nov 30, 2023

Conversation

@amanasifkhalid (Member) commented Nov 28, 2023

Follow-up to #94239. In non-debug MinOpts scenarios (like Tier0 code), we should remove branches to the next block regardless of whether BBF_NONE_QUIRK is set; this should yield modest code size and TP improvements.

(Also, the emitter-level optimization tries to remove jumps to the next block regardless of BBF_KEEP_BBJ_ALWAYS being set. Removing this requirement from the block-level optimization doesn't seem to affect correctness.)
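
For context, the block-level check this PR ends up with reduces to roughly the following (a condensed sketch of the final form shown in the review below; the enclosing function signature is an assumption, not verbatim source):

```cpp
// Sketch: a BBJ_ALWAYS block's branch can be removed when it targets the
// lexically-next block, the block carries no alignment padding, and the
// branch does not cross the hot/cold region boundary. The BBF_NONE_QUIRK
// and BBF_KEEP_BBJ_ALWAYS checks are no longer required here.
bool BasicBlock::CanRemoveJumpToNext(Compiler* compiler)
{
    return JumpsToNext() && !hasAlign() && !compiler->fgInDifferentRegions(this, bbJumpDest);
}
```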

@ghost assigned amanasifkhalid Nov 28, 2023
@dotnet-issue-labeler bot added the area-CodeGen-coreclr label Nov 28, 2023
@ghost commented Nov 28, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


@amanasifkhalid marked this pull request as ready for review November 28, 2023 19:09
@amanasifkhalid (Member, Author)

cc @dotnet/jit-contrib. Diffs.

@amanasifkhalid (Member, Author)

@AndyAyersMS PTAL, is it ok if we enable this for MinOpts so long as we aren't generating debuggable code? This gives us a slight TP boost.

@AndyAyersMS (Member)

I think enabling this way is ok, though we could restrict it to Tier0 if we really want to keep MinOpts "pure". @BruceForstall, what do you think?

Do you have any insight into the relatively small improvements on x86? Seems anomalous.

@BruceForstall (Member)

> I think enabling this way is ok, though we could restrict it to Tier0 if we really want to keep MinOpts "pure". @BruceForstall, what do you think?

I was wondering about that. Generally, I'd say we shouldn't do any optimization in MinOpts. But don't we require something like this to get rid of BBF_NONE_QUIRK (and not have a branch at the end of every MinOpts block)? In that case, I'm ok with this.

But remind me why we don't want to also do this for debuggable code?

@amanasifkhalid (Member, Author)

> But remind me why we don't want to also do this for debuggable code?

I think this could impair the debugging experience by removing instructions where we could previously place breakpoints. I can't confirm whether this is a real problem: I haven't run any diagnostics tests, and according to a couple of JitDumps I've sampled, the number of IL offset entries in the debug info is the same after enabling the optimization. But I guess we aren't looking for brilliant TP/codegen when producing debuggable code anyway, so there isn't much value lost in disabling this for debuggable code (removing the debug-code check might improve TP for non-debug cases, which is desirable, but I imagine the improvement would be minor).

> But don't we require something like this to get rid of BBF_NONE_QUIRK (and not have a branch at the end of every MinOpts block)?

Yes, removing the BBF_NONE_QUIRK check while disabling this optimization in MinOpts would significantly regress the asmdiffs, since we no longer have BBJ_NONE. Eventually, we want to remove BBF_NONE_QUIRK, so I will note that if we remove the check now, we get significant size regressions for debuggable code (which might be ok?). If we verify that this optimization doesn't hurt the debugging experience, then maybe we can get rid of the const bool tryJumpOpt = !compiler->opts.compDbgCode || ((bbFlags & BBF_NONE_QUIRK) != 0); check altogether, so we can remove a BBF_NONE_QUIRK usage without regressing debuggable code size. But maybe that's best left for a future PR.
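
For reference, a condensed sketch of the gating being discussed, as it stood at this point in the PR (pieced together from the snippets quoted in this thread; the surrounding emission code is omitted):

```cpp
// In debuggable code, jump-to-next removal is attempted only for blocks
// marked BBF_NONE_QUIRK; everywhere else (including non-debug MinOpts)
// it is attempted unconditionally.
const bool tryJumpOpt = !compiler->opts.compDbgCode || ((bbFlags & BBF_NONE_QUIRK) != 0);
const bool skipJump   = tryJumpOpt && JumpsToNext() && !hasAlign() &&
                        !compiler->fgInDifferentRegions(this, bbJumpDest);
// When skipJump is true, the branch is not emitted and the block falls through.
```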

> Do you have any insight into the relatively small improvements on x86? Seems anomalous.

I found that odd, too. My guess is maybe the emitter-level optimization was already quite effective on x86, leaving few instances for the new optimization to hit? (The lack of this emitter-level opt on ARM64 probably explains the significant size improvements there).

@AndyAyersMS (Member)

> Do you have any insight into the relatively small improvements on x86? Seems anomalous.
>
> I found that odd, too. My guess is maybe the emitter-level optimization was already quite effective on x86, leaving few instances for the new optimization to hit? (The lack of this emitter-level opt on ARM64 probably explains the significant size improvements there.)

Is that optimization x86 specific? There are decent diffs for x64. So perhaps there's something else going on?

@amanasifkhalid (Member, Author)

> Is that optimization x86 specific? There are decent diffs for x64. So perhaps there's something else going on?

The optimization runs on x64 too, but with an additional restriction, so it probably fires slightly less often. But I think you're right that there's something else at play. I wonder if we align at the end of BBJ_ALWAYS blocks more often on x86? That's the only condition that could block the new optimization that stands out to me. I'll take a look.

@AndyAyersMS (Member)

> But remind me why we don't want to also do this for debuggable code?
>
> I think this could impair the debugging experience by removing instructions where we could previously place breakpoints. I can't confirm whether this is a real problem: I haven't run any diagnostics tests, and according to a couple of JitDumps I've sampled, the number of IL offset entries in the debug info is the same after enabling the optimization.

I think we should pursue enabling this without restriction. There seem to be a couple of plausible ways to verify we haven't disrupted debuggability:

  • run the diagnostics tests (you will need to ask for instructions)
  • use SPMI to diff the generated debug info ... I don't recall the details of what sort of comparison we do there (see NearDiffer::compareBoundaries), so we would have to review that part, but presumably if we get the same number of entries with the same IL offsets, we're happy. I think we could hack up SPMI to just diff this aspect in the way we care about and run that over all the MinOpts methods.

@amanasifkhalid (Member, Author)

> I think we should pursue enabling this without restriction.

I'd be happy to! I'll start with your latter suggestion of diffing the debug info, since that seems cheaper.

@amanasifkhalid
Copy link
Member Author

I wrote a script to diff the IL offsets, and ran it on the asmdiffs for x64 and arm64, with and without the optimization enabled for all MinOpts contexts. The script diffs the total number of IL offsets and, if the totals are equal, each individual IL offset (though not the native offsets, as I believe those change based on how many jumps are removed, and where). I can share this script offline if you're interested. I didn't find any IL offset diffs.
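
(For the curious: the comparison the script performs boils down to something like this hypothetical sketch; the function name and container types are illustrative, not the actual script.)

```cpp
#include <cstdint>
#include <vector>

// Two debug-info mappings are treated as equivalent when they contain the
// same number of entries and the same IL offsets, entry by entry. Native
// offsets are deliberately ignored, since they shift as jumps are removed.
bool ILOffsetsMatch(const std::vector<uint32_t>& baseIL, const std::vector<uint32_t>& diffIL)
{
    if (baseIL.size() != diffIL.size())
    {
        return false; // different total number of IL offset entries
    }
    for (size_t i = 0; i < baseIL.size(); i++)
    {
        if (baseIL[i] != diffIL[i])
        {
            return false; // same count, but an individual IL offset moved
        }
    }
    return true;
}
```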

I also diff'd the IL offsets with the optimization always enabled (regardless of MinOpts/debuggable code/etc.) versus never running the optimization, and I found different numbers of IL offsets for a few methods compiled with optimizations; looking at their disasms, these methods also had different numbers of instruction groups. Without the optimization, we sometimes have a jump to the next IG, which is empty and falls through into the IG after it. With the optimization, the jump and the empty IG are removed. I'm guessing the differing numbers of IGs explain the differing numbers of IL offsets?

In both cases, I didn't see any IL offset diffs for unoptimized code, so I think it should be safe to turn this optimization on for debuggable code.

@BruceForstall (Member) left a comment

Seems like good enough due diligence for me.

The change under review:

```diff
- const bool skipJump = tryJumpOpt && JumpsToNext() && !hasAlign() && ((bbFlags & BBF_KEEP_BBJ_ALWAYS) == 0) &&
-                       !compiler->fgInDifferentRegions(this, bbJumpDest);
- return skipJump;
+ return (JumpsToNext() && !hasAlign() && !compiler->fgInDifferentRegions(this, bbJumpDest));
```
A reviewer (Member) commented on the diff:

nit: outermost parens are unnecessary

Suggested change:

```diff
- return (JumpsToNext() && !hasAlign() && !compiler->fgInDifferentRegions(this, bbJumpDest));
+ return JumpsToNext() && !hasAlign() && !compiler->fgInDifferentRegions(this, bbJumpDest);
```

@amanasifkhalid changed the title from "JIT: Enable jump-to-next removal optimization for non-debug code" to "JIT: Enable jump-to-next removal optimization in MinOpts" Nov 30, 2023
@amanasifkhalid (Member, Author) commented Nov 30, 2023

Better diffs, and a nice MinOpts TP boost on x64/x86. x86 improvements are still suspiciously small. Will take a look next.

Edit: Funnily enough, libraries_tests_no_tiered_compilation.run.windows.x86.Release.mch has a big (-0.24%) MinOpts TP improvement on Windows x86, but no size improvements. In its example diffs, instruction counts are lower, but total bytes remain the same.

@amanasifkhalid (Member, Author)

@AndyAyersMS I cherry-picked some of the example methods with notable size improvements on x64 and no improvements on x86, and it looks like the emitter-level optimization was already taking care of many of these unnecessary jumps on x86, but not on x64. On x64, this optimization has an additional check that blocks removing the jump if it is needed for EH reasons. For the example methods where x64 had improvements and x86 had none, removing this check and disabling the block-level optimization still results in the unnecessary jumps being removed. So I don't think the block-level optimization is doing anything odd on x86; the majority of cases in MinOpts were just already covered by the existing optimization.

Side note: For the emitter-level optimization on x64, if an instruction is needed after an EH-related call, does that instruction have to be a jump? I wonder if we can replace it with a nop.

@AndyAyersMS (Member)

I don't think the actual instruction matters; it just needs to occupy space. So a nop should be fine.

@amanasifkhalid (Member, Author)

@AndyAyersMS are you ok with me merging this?

@AndyAyersMS (Member) left a comment

Looks good -- thanks for digging into some of the corners here.

@amanasifkhalid merged commit 4cb123a into dotnet:main Nov 30, 2023
129 checks passed
@EgorBo (Member) commented Dec 5, 2023

Improvement: dotnet/perf-autofiling-issues#25409

@github-actions bot locked and limited conversation to collaborators Jan 5, 2024