Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: Make isRemovableJmpCandidate less restrictive on AMD64 #95493

Merged
merged 12 commits into from
Dec 7, 2023

Conversation

amanasifkhalid
Copy link
Member

On AMD64, the emitter-level jump-to-next removal optimization is skipped if succeeding a call instruction, as we need an instruction after a call to support exception handling semantics. We already compute a more specific condition for this in the block-level optimization, and reusing that condition should allow us to do the emitter-level optimization in a few more places. The size improvements were small locally, but I think this should be done if the conditions are identical between the two optimizations.

@ghost ghost assigned amanasifkhalid Dec 1, 2023
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Dec 1, 2023
@ghost
Copy link

ghost commented Dec 1, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

On AMD64, the emitter-level jump-to-next removal optimization is skipped if succeeding a call instruction, as we need an instruction after a call to support exception handling semantics. We already compute a more specific condition for this in the block-level optimization, and reusing that condition should allow us to do the emitter-level optimization in a few more places. The size improvements were small locally, but I think this should be done if the conditions are identical between the two optimizations.

Author: amanasifkhalid
Assignees: amanasifkhalid
Labels:

area-CodeGen-coreclr

Milestone: -

@amanasifkhalid
Copy link
Member Author

@dotnet/jit-contrib PTAL. Diffs are small as expected.

Copy link
Member

@BruceForstall BruceForstall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change leads to removing a jump-to-next after a call that immediately precedes an epilog. This is not allowed, by the comments in this function (line 608) and the documentation in clr-abi.md: https://github.com/dotnet/runtime/blob/main/docs/design/coreclr/botr/clr-abi.md#amd64-padding-info

Note that it would be ok to convert these JMP to NOP.

(btw, if you're changing this file, can you change the reference to "X64 and ARM ABIs.docx" to be "clr-abi.md"?)

@ghost ghost added the needs-author-action An issue or pull request that requires more info or actions from the author. label Dec 4, 2023
@amanasifkhalid
Copy link
Member Author

@BruceForstall thank you for clarifying that; I'll update this change to insert a nop when needed.

(btw, if you're changing this file, can you change the reference to "X64 and ARM ABIs.docx" to be "clr-abi.md"?)

Sure thing.

@ghost ghost removed the needs-author-action An issue or pull request that requires more info or actions from the author. label Dec 4, 2023
@amanasifkhalid
Copy link
Member Author

amanasifkhalid commented Dec 4, 2023

I've adjusted my implementation to always remove a useless jump succeeding a call instruction on AMD64 (assuming all the other optimization checks are met) and emit a nop instead. I added a new member, idjIsAfterCall, to instrDescJmp to indicate the jump will need to be replaced with a nop if removed. Note that this reduced the size of idjOffs on AMD64 to 28 bits instead of 29 -- I don't think this should be a problem?

Size improvements were bigger locally. I noticed the newly-introduced nop isn't showing up in JIT disasms, though the size decreases for the affected instruction groups suggest the nop was emitted. I'm guessing this is because my implementation does not create a new instrDesc for the nop.

@amanasifkhalid
Copy link
Member Author

amanasifkhalid commented Dec 5, 2023

@BruceForstall PTAL. SuperPMI replay failure is the "Method is of low integrity" issue. Thank you!

Edit: Diffs.

Copy link
Member

@BruceForstall BruceForstall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see a few problems:

  1. The NOP is not showing up in disassembly. This is a problem, because it is very confusing when looking at asm diffs or debugging code and comparing against JIT dump/disasm. We accept it for INS_align, but at least that displays something.
  2. It appears that you are replacing all CALL + JMP-to-next with CALL+NOP? That's not required -- only when a CALL precedes an OS epilog.
  3. Several places you use the constant "1" as a code size, because that is the size of an x64 NOP instruction. That needs to be well commented (if that code remains).
  4. We already have emitOutputPreEpilogNOP(). I guess this doesn't work because it happens when we are generating code and go to generate an epilog. In the problematic case, the last instruction is a JMP, we call emitOutputPreEpilogNOP(), it doesn't see the last instruction as a CALL, and so it doesn't insert a NOP. Later on, the emitter::emitRemoveJumpToNextInst() phase comes along and removes the JMP. At that point, it's "too late" to insert the necessary NOP.

We might want to insert a provisional NOP after a CALL before a removable JMP. If a JMP is removed, and the next IG is an epilog, the NOP becomes non-provisional and should have actual size. Unfortunately, that's not great, because we'd also need to "zero out" these NOPs (set code size to zero), and there is no obvious place to do that.

Another (ugly?) alternative is to "bash" the JMP instrDesc with a NOP instrDesc when necessary. Then the NOP would be properly output. This might work if it's the very last instruction in an IG but is pretty gross. This is another implication of the fact we can't add/remove instrDescs once generated.

@ghost ghost added the needs-author-action An issue or pull request that requires more info or actions from the author. label Dec 6, 2023
@ghost ghost removed the needs-author-action An issue or pull request that requires more info or actions from the author. label Dec 6, 2023
@amanasifkhalid
Copy link
Member Author

amanasifkhalid commented Dec 6, 2023

@BruceForstall thank you for the review! I adjusted my implementation to only emit the nop in place of the removed jump if the jump succeeded a call and preceded an OS epilog; this yielded some nice size improvements locally. Also, for printing the newly-added nops in disasms, I've modified emitter::emitOutputInstr to temporarily change the removed jump's instruction description to that of a nop right before printing the disasm, and then change it back after to avoid triggering asserts related to the instrDesc size mismatch -- I'm guessing this is what you meant by "bashing" the jump instrDesc? I've added in comments to explain this process.

I think this implementation should cover the problematic case you mentioned with emitOutputPreEpilogNOP, right? Suppose there is a jmp between a call and an OS epilog. emitOutputPreEpilogNOP should see the jmp and emit nothing. Then in emitRemoveJumpToNextInst, should we decide to remove the jmp, we'll see it is after a call (because idjIsAfterCallBeforeEpilog was set to true in emitIns_J) and before an OS epilog (because we can check for the IGF_EPILOG flag in the jump's target IG), and thus mark it as needing a nop. Then in emitOutputInstr, we emit a nop in lieu of the jmp.

Edit: TP diffs are surprisingly big. This might have to do with the optimization being able to consider more jumps, but the previous run of runtime-coreclr superpmi-diffs didn't have TP regressions this big. I'll investigate tomorrow.

@amanasifkhalid
Copy link
Member Author

I've determined the TP diffs are from the new logic for printing removed jumps as nops in disasms. I've tweaked this logic so that we only incur the TP hit if we're printing disasms, which doesn't seem to be a performance-critical path, anyway.

@amanasifkhalid
Copy link
Member Author

amanasifkhalid commented Dec 6, 2023

TP diffs are better, but still a bit high in MinOpts. @BruceForstall do we have a threshold for how much TP this change is worth? I have a few ideas for improving TP a bit more, but the code is slightly less intuitive; I'll try it locally and see if it's worth a revision.

Edit: Scratch that, I found reverting my changes to emitJmpInstHasNoCode simplifies it and makes it marginally faster; I just had to move a few asserts around in emitOutputInstr. I don't expect a significant reduction in the TP regressions from this, though.

@amanasifkhalid
Copy link
Member Author

I'm surprised TP worsened on Linux x64. I'll investigate in WSL.

@amanasifkhalid
Copy link
Member Author

amanasifkhalid commented Dec 7, 2023

@BruceForstall sorry for all the pings -- TP diffs now peak at +0.1% on Linux x64. I found casting an instrDesc* to an instrDescJmp* and checking idjIsRemovableJmpCandidate is cheaper than checking if the instrDesc's code size is 0 on Linux, but indistinguishable on Windows. I think the rest of the TP impact comes mainly from the additional branches introduced, as well as the optimization now being able to consider a few more jumps. Are you ok with this TP hit? This seems to be a recurring theme in my recent PRs...

Copy link
Member

@BruceForstall BruceForstall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Just a few minor suggestions. The TP change is ok with me.

// a candidate for being removed if it jumps to the next instruction
unsigned idjIsRemovableJmpCandidate : 1;
// Indicates the jump succeeds a call instruction and precedes an OS epilog.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use "follows" instead of "succeeds"?

#else
30;
#endif
#endif // !defined(TARGET_XARCH)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, this comment is not actually helpful.

// Indicates the jump succeeds a call instruction and precedes an OS epilog.
// If this jump is removed, a nop will need to be emitted instead (see clr-abi.md for details).
unsigned idjIsAfterCallBeforeEpilog : 1;
#elif defined(TARGET_XARCH)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You split off amd64, so the only "xarch" left is x86

Suggested change
#elif defined(TARGET_XARCH)
#elif defined(TARGET_X86)

Comment on lines +1872 to +1874
#if defined(TARGET_AMD64)
28;
// Indicates the jump was added at the end of a BBJ_ALWAYS basic block and is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fwiw, all this bit fiddling is silly on 64-bit, where we're going to have 32 bits of padding after the unsigned fields, in which case we should just move all the bits to the end and make them bool :1 fields. Maybe not for 32-bit, though.

But that shouldn't be done in this PR :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I'll come back to this in a follow-up PR.

@amanasifkhalid amanasifkhalid merged commit 303571a into dotnet:main Dec 7, 2023
119 of 139 checks passed
@amanasifkhalid
Copy link
Member Author

Feedback changes are just comments and an ifdef. Failures from previous CI run were known NativeAOT failures.

@github-actions github-actions bot locked and limited conversation to collaborators Jan 7, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants