Remove unnecessary delayfree from xarch FMA and TERNLOG instructions#128350

Open
tannergooding wants to merge 8 commits into
dotnet:mainfrom
tannergooding:remove-unnecessary-delayfree

Conversation

@tannergooding
Member

@tannergooding tannergooding commented May 19, 2026

These instructions are fully reorderable and so, much like various commutative nodes, do not need their operands marked delay-free except in special scenarios.

This resolves #62215
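To illustrate what "fully reorderable" means for FMA, here is a minimal sketch of the three x86 FMA encoding forms (132, 213, 231). It models only which operand slot plays which role, using plain scalar arithmetic rather than a fused operation. The function names `Fma132`, `Fma213`, and `Fma231` are hypothetical and are not taken from the JIT sources. Whichever input of `a * b + c` already sits in the destination register, some form can overwrite it in place, which is why a copy into a separate target is usually avoidable.

```cpp
#include <cassert>

// Each function takes (dst, src2, src3) in instruction operand order and
// returns the value written back to dst. Hypothetical model, not JIT code.
constexpr double Fma132(double dst, double src2, double src3) { return dst * src3 + src2; }
constexpr double Fma213(double dst, double src2, double src3) { return src2 * dst + src3; }
constexpr double Fma231(double dst, double src2, double src3) { return src2 * src3 + dst; }

// Compute a * b + c with a = 2, b = 3, c = 4 (expected result: 10).
// Destination already holds the multiplicand a: use the 213 form (or 132).
static_assert(Fma213(2.0, 3.0, 4.0) == 10.0, "dst = a, src2 = b, src3 = c");
static_assert(Fma132(2.0, 4.0, 3.0) == 10.0, "dst = a, src2 = c, src3 = b");
// Destination already holds the multiplier b: multiplication commutes, so 213 again.
static_assert(Fma213(3.0, 2.0, 4.0) == 10.0, "dst = b, src2 = a, src3 = c");
// Destination already holds the addend c: use the 231 form.
static_assert(Fma231(4.0, 2.0, 3.0) == 10.0, "dst = c, src2 = a, src3 = b");
```

In the real encodings only the third source can be a memory operand, so the choice between the 132 and 213 forms also depends on which input is contained; the sketch ignores that constraint.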

Copilot AI review requested due to automatic review settings May 19, 2026 00:37
@github-actions github-actions Bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 19, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR updates xarch JIT HWIntrinsic register allocation/codegen handling to avoid marking AVX FMA and AVX-512 TernaryLogic operands as “delay-free” in most cases, while adding codegen support for additional operand/target overlap scenarios for TernaryLogic.

Changes:

  • Simplifies LSRA operand-use construction for AVX2/AVX512 FMA intrinsics, only using delay-free constraints when required by CopyUpperBits semantics.
  • Adds a dedicated LSRA path for NI_AVX512_TernaryLogic when the control byte is an immediate, avoiding delay-free uses in that case.
  • Extends genHWIntrinsic_R_R_R_RM_I to adjust TernaryLogic control immediates when the target register overlaps certain operands.
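The control-byte adjustment mentioned in the last bullet can be sketched as follows. This is an illustration of the general technique under stated assumptions, not the code in this PR: the helper name `PermuteTernaryControl` and its parameters are hypothetical. It relies only on the documented `vpternlog` semantics, where the result bit for inputs `(a, b, c)` is bit `(a << 2) | (b << 1) | c` of the 8-bit control.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: recompute a vpternlog control byte after the operands
// have been reordered. slotN names which ORIGINAL operand (0 = A, 1 = B,
// 2 = C) now occupies operand position N of the emitted instruction.
constexpr uint8_t PermuteTernaryControl(uint8_t control, int slot0, int slot1, int slot2)
{
    uint8_t result = 0;
    for (int idx = 0; idx < 8; idx++)
    {
        // Bit values seen in the new operand positions for this table row.
        int orig[3] = {0, 0, 0};
        orig[slot0] = (idx >> 2) & 1;
        orig[slot1] = (idx >> 1) & 1;
        orig[slot2] = idx & 1;

        // Look up what the original control produced for those inputs.
        int origIdx = (orig[0] << 2) | (orig[1] << 1) | orig[2];
        if (((control >> origIdx) & 1) != 0)
        {
            result = static_cast<uint8_t>(result | (1 << idx));
        }
    }
    return result;
}

// 0xF0 selects A; once A and B trade places the same value is in slot 1 (0xCC).
static_assert(PermuteTernaryControl(0xF0, 1, 0, 2) == 0xCC, "A moved to slot 1");
// 0xCA is A ? B : C; swapping B and C gives A ? C : B, which is 0xAC.
static_assert(PermuteTernaryControl(0xCA, 0, 2, 1) == 0xAC, "select with B/C swapped");
// 0x96 is A ^ B ^ C, which is symmetric, so any reordering keeps the control.
static_assert(PermuteTernaryControl(0x96, 1, 0, 2) == 0x96, "xor3 is symmetric");
```

The last assertion uses the same control as the `-106` immediate in the diff quoted later in this thread (0x96 as an unsigned byte).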

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/coreclr/jit/lsraxarch.cpp | Adjusts LSRA use/target preferencing for FMA intrinsics and adds a special LSRA build path for AVX512_TernaryLogic with immediate control byte. |
| src/coreclr/jit/hwintrinsiccodegenxarch.cpp | Adds TernaryLogic-specific handling to rewrite the control byte when operand/target register overlap is detected. |

Comment thread src/coreclr/jit/hwintrinsiccodegenxarch.cpp Outdated
Comment thread src/coreclr/jit/lsraxarch.cpp Outdated
Copilot AI review requested due to automatic review settings May 19, 2026 02:02
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

Copilot AI review requested due to automatic review settings May 19, 2026 12:57
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread src/coreclr/jit/lsraxarch.cpp
Comment thread src/coreclr/jit/lsraxarch.cpp Outdated
Copilot AI review requested due to automatic review settings May 19, 2026 17:21
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comment thread src/coreclr/jit/hwintrinsiccodegenxarch.cpp
@tannergooding tannergooding requested a review from EgorBo May 19, 2026 20:46
@tannergooding
Member Author

CC. @dotnet/jit-contrib, @EgorBo for review.

Diffs are here: https://dev.azure.com/dnceng-public/public/_build/results?buildId=1427007&view=ms.vss-build-web.run-extensions-tab

This removes a large number of unnecessary vmovaps instructions before or after vpternlog and vfmadd instructions, such as:

-       vpternlogq xmm4, xmm0, xmm9, -106
-       vmovaps  xmm0, xmm4
+       vpternlogq xmm0, xmm4, xmm9, -106

The x86 diffs show the largest size improvement due to the limited register set available (XMM0-7 only).

The Linux x64 diffs show a smaller improvement because the allocator ends up selecting XMM16-XMM31 in many cases, and those registers require more bytes to encode. All XMM registers are CALLEE_TRASH there, and most of the SIMD methods don't involve calls.

The Windows x64 diffs show a size regression because the register allocator ends up selecting callee-saved registers more often, which bloats the method prologue/epilogue since those registers have to be saved and restored.

The change is an improvement overall, which can be seen when looking at the three variations together. We probably want to look into the Windows x64 register ordering though, since it really should prefer the EVEX registers over the callee-saved registers.

@EgorBo
Member

EgorBo commented May 19, 2026

> The Windows x64 diffs then show a size regression

I'm trying to understand why a change of only a few lines regressed more contexts than it improved (size-wise). PerfScore says that 6 collections regressed overall and only 3 improved, and yet we should take it 😐

Labels

area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI

Development

Successfully merging this pull request may close these issues.

Improve FMA code generation related to operand last use

3 participants