
JIT: Enable jump-to-next removal optimization in MinOpts #95340

Merged: 3 commits merged into dotnet:main on Nov 30, 2023

Conversation

@amanasifkhalid (Member) commented Nov 28, 2023

Follow-up to #94239. In non-debug MinOpts scenarios (like Tier0 code), we should remove branches to the next block regardless of whether BBF_NONE_QUIRK is set; this should yield modest code size and TP improvements.

(Also, the emitter-level optimization tries to remove jumps to the next block regardless of BBF_KEEP_BBJ_ALWAYS being set. Removing this requirement from the block-level optimization doesn't seem to affect correctness.)
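
For context, the block-level check this PR ends up with reduces to roughly the following (a condensed sketch of the final form shown in the review below; the enclosing function signature is an assumption, not verbatim source):

```cpp
// Sketch: a BBJ_ALWAYS block's branch can be removed when it targets the
// lexically-next block, the block carries no alignment padding, and the
// branch does not cross the hot/cold region boundary. The BBF_NONE_QUIRK
// and BBF_KEEP_BBJ_ALWAYS checks are no longer required here.
bool BasicBlock::CanRemoveJumpToNext(Compiler* compiler)
{
    return JumpsToNext() && !hasAlign() && !compiler->fgInDifferentRegions(this, bbJumpDest);
}
```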

@ghost assigned amanasifkhalid Nov 28, 2023
@dotnet-issue-labeler bot added the area-CodeGen-coreclr label Nov 28, 2023
@ghost commented Nov 28, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


@amanasifkhalid marked this pull request as ready for review November 28, 2023 19:09
@amanasifkhalid (Member, Author)

cc @dotnet/jit-contrib. Diffs.

@amanasifkhalid (Member, Author)

@AndyAyersMS PTAL, is it ok if we enable this for MinOpts so long as we aren't generating debuggable code? This gives us a slight TP boost.

@AndyAyersMS (Member)

I think enabling this way is ok, though we could restrict it to Tier0 if we really want to keep MinOpts "pure". @BruceForstall, what do you think?

Do you have any insight into the relatively small improvements on x86? Seems anomalous.

@BruceForstall (Member)

> I think enabling this way is ok, though we could restrict it to Tier0 if we really want to keep MinOpts "pure". @BruceForstall, what do you think?

I was wondering about that. Generally, I'd say we shouldn't do any optimization in MinOpts. But don't we require something like this to get rid of BBF_NONE_QUIRK (and not have a branch at the end of every MinOpts block)? In that case, I'm ok with this.

But remind me why we don't want to also do this for debuggable code?

@amanasifkhalid (Member, Author)

> But remind me why we don't want to also do this for debuggable code?

I think this could impair the debugging experience by removing instructions where we could previously place breakpoints. I can't confirm whether this is a real problem: I haven't run any diagnostics tests, and according to a couple of JitDumps I've sampled, the number of IL offset entries in the debug info is the same after enabling the optimization. But I guess we aren't looking for brilliant TP/codegen when producing debuggable code anyway, so there isn't much value lost in disabling this for debuggable code (removing the debug-code check might improve TP for non-debug cases, which is desirable, but I imagine the improvement would be minor).

> But don't we require something like this to get rid of BBF_NONE_QUIRK (and not have a branch at the end of every MinOpts block)?

Yes, removing the BBF_NONE_QUIRK check while disabling this optimization in MinOpts would significantly regress the asmdiffs, since we no longer have BBJ_NONE. Eventually, we want to remove BBF_NONE_QUIRK, so I will note that if we remove the check now, we get significant size regressions for debuggable code (which might be ok?). If we verify that this optimization doesn't hurt the debugging experience, then maybe we can get rid of the const bool tryJumpOpt = !compiler->opts.compDbgCode || ((bbFlags & BBF_NONE_QUIRK) != 0); check altogether, so we can remove a BBF_NONE_QUIRK usage without regressing debuggable code size. But maybe that's best left for a future PR.
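
For reference, a condensed sketch of the gating being discussed, as it stood at this point in the PR (pieced together from the snippets quoted in this thread; the surrounding emission code is omitted):

```cpp
// In debuggable code, jump-to-next removal is attempted only for blocks
// marked BBF_NONE_QUIRK; everywhere else (including non-debug MinOpts)
// it is attempted unconditionally.
const bool tryJumpOpt = !compiler->opts.compDbgCode || ((bbFlags & BBF_NONE_QUIRK) != 0);
const bool skipJump   = tryJumpOpt && JumpsToNext() && !hasAlign() &&
                        !compiler->fgInDifferentRegions(this, bbJumpDest);
// When skipJump is true, the branch is not emitted and the block falls through.
```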

> Do you have any insight into the relatively small improvements on x86? Seems anomalous.

I found that odd, too. My guess is maybe the emitter-level optimization was already quite effective on x86, leaving few instances for the new optimization to hit? (The lack of this emitter-level opt on ARM64 probably explains the significant size improvements there).

@AndyAyersMS (Member)

> Do you have any insight into the relatively small improvements on x86? Seems anomalous.
>
> I found that odd, too. My guess is maybe the emitter-level optimization was already quite effective on x86, leaving few instances for the new optimization to hit? (The lack of this emitter-level opt on ARM64 probably explains the significant size improvements there.)

Is that optimization x86 specific? There are decent diffs for x64. So perhaps there's something else going on?

@amanasifkhalid (Member, Author)

> Is that optimization x86 specific? There are decent diffs for x64. So perhaps there's something else going on?

The optimization runs on x64 too, but with an additional restriction, so it probably fires slightly less often. But I think you're right that there's something else at play. I wonder if we align at the end of BBJ_ALWAYS blocks more often on x86? That's the only condition that could block the new optimization that stands out to me. I'll take a look.

@AndyAyersMS (Member)

> But remind me why we don't want to also do this for debuggable code?
>
> I think this could impair the debugging experience by removing instructions where we could previously place breakpoints. I can't confirm whether this is a real problem: I haven't run any diagnostics tests, and according to a couple of JitDumps I've sampled, the number of IL offset entries in the debug info is the same after enabling the optimization.

I think we should pursue enabling this without restriction. There seem to be a couple of plausible ways to verify we haven't disrupted debuggability:

  • run the diagnostics tests (you will need to ask for instructions)
  • use SPMI to diff the generated debug info ... I don't recall the details of what sort of comparison we do there (see NearDiffer::compareBoundaries), so we would have to review that part, but presumably if we get the same number of entries with the same IL offsets, we're happy. I think we could hack up SPMI to just diff this aspect in the way we care about and run that over all the MinOpts methods.

@amanasifkhalid (Member, Author)

> I think we should pursue enabling this without restriction.

I'd be happy to! I'll start with your latter suggestion of diffing the debug info, since that seems cheaper.

@amanasifkhalid
Copy link
Member Author

I wrote a script to diff the IL offsets, and ran it on the asmdiffs for x64 and arm64, with and without the optimization enabled for all MinOpts contexts. The script diffs the total number of IL offsets and, if the totals are equal, each individual IL offset (though not the native offsets, as I believe those change based on how many jumps are removed, and where). I can share this script offline if you're interested. I didn't find any IL offset diffs.
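
(For the curious: the comparison the script performs boils down to something like this hypothetical sketch; the function name and container types are illustrative, not the actual script.)

```cpp
#include <cstdint>
#include <vector>

// Two debug-info mappings are treated as equivalent when they contain the
// same number of entries and the same IL offsets, entry by entry. Native
// offsets are deliberately ignored, since they shift as jumps are removed.
bool ILOffsetsMatch(const std::vector<uint32_t>& baseIL, const std::vector<uint32_t>& diffIL)
{
    if (baseIL.size() != diffIL.size())
    {
        return false; // different total number of IL offset entries
    }
    for (size_t i = 0; i < baseIL.size(); i++)
    {
        if (baseIL[i] != diffIL[i])
        {
            return false; // same count, but an individual IL offset moved
        }
    }
    return true;
}
```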

I also diff'd the IL offsets with the optimization always enabled (regardless of MinOpts/debuggable code/etc.) versus never running the optimization, and I found different numbers of IL offsets for a few methods compiled with optimizations; looking at their disasms, these methods also had different numbers of instruction groups. Without the optimization, we sometimes have a jump to the next IG, which is empty and falls through into the IG after it. With the optimization, the jump and the empty IG are removed. I'm guessing the differing numbers of IGs explain the differing numbers of IL offsets?

In both cases, I didn't see any IL offset diffs for unoptimized code, so I think it should be safe to turn this optimization on for debuggable code.

@BruceForstall (Member) left a comment

Seems like good enough due diligence for me.

The change under review:

```diff
- const bool skipJump = tryJumpOpt && JumpsToNext() && !hasAlign() && ((bbFlags & BBF_KEEP_BBJ_ALWAYS) == 0) &&
-                       !compiler->fgInDifferentRegions(this, bbJumpDest);
- return skipJump;
+ return (JumpsToNext() && !hasAlign() && !compiler->fgInDifferentRegions(this, bbJumpDest));
```
A reviewer (Member) commented on the diff:

nit: outermost parens are unnecessary

Suggested change:

```diff
- return (JumpsToNext() && !hasAlign() && !compiler->fgInDifferentRegions(this, bbJumpDest));
+ return JumpsToNext() && !hasAlign() && !compiler->fgInDifferentRegions(this, bbJumpDest);
```

@amanasifkhalid changed the title from "JIT: Enable jump-to-next removal optimization for non-debug code" to "JIT: Enable jump-to-next removal optimization in MinOpts" Nov 30, 2023
@amanasifkhalid (Member, Author) commented Nov 30, 2023

Better diffs, and a nice MinOpts TP boost on x64/x86. x86 improvements are still suspiciously small. Will take a look next.

Edit: Funnily enough, libraries_tests_no_tiered_compilation.run.windows.x86.Release.mch has a big (-0.24%) MinOpts TP improvement on Windows x86, but no size improvements. In its example diffs, instruction counts are lower, but total bytes remain the same.

@amanasifkhalid (Member, Author)

@AndyAyersMS I cherry-picked some of the example methods with notable size improvements on x64 and no improvements on x86, and it looks like the emitter-level optimization was already taking care of many of these unnecessary jumps on x86, but not on x64. On x64, this optimization has an additional check that blocks removing the jump if it is needed for EH reasons. For the example methods where x64 had improvements and x86 had none, removing this check and disabling the block-level optimization still results in the unnecessary jumps being removed. So I don't think the block-level optimization is doing anything odd on x86; the majority of cases in MinOpts were just already covered by the existing optimization.

Side note: For the emitter-level optimization on x64, if an instruction is needed after an EH-related call, does that instruction have to be a jump? I wonder if we can replace it with a nop.

@AndyAyersMS (Member)

I don't think the actual instruction matters; it just needs to occupy space. So a nop should be fine.

@amanasifkhalid (Member, Author)

@AndyAyersMS are you ok with me merging this?

@AndyAyersMS (Member) left a comment

Looks good -- thanks for digging into some of the corners here.

@amanasifkhalid merged commit 4cb123a into dotnet:main Nov 30, 2023
129 checks passed
@EgorBo (Member) commented Dec 5, 2023

Improvement: dotnet/perf-autofiling-issues#25409

@github-actions bot locked and limited conversation to collaborators Jan 5, 2024