Remove unnecessary delayfree from xarch FMA and TERNLOG instructions #128350
tannergooding wants to merge 8 commits into
Pull request overview
This PR updates xarch JIT HWIntrinsic register allocation/codegen handling to avoid marking AVX FMA and AVX-512 TernaryLogic operands as “delay-free” in most cases, while adding codegen support for additional operand/target overlap scenarios for TernaryLogic.
Changes:
- Simplifies LSRA operand-use construction for AVX2/AVX512 FMA intrinsics, only using delay-free constraints when required by CopyUpperBits semantics.
- Adds a dedicated LSRA path for `NI_AVX512_TernaryLogic` when the control byte is an immediate, avoiding delay-free uses in that case.
- Extends `genHWIntrinsic_R_R_R_RM_I` to adjust `TernaryLogic` control immediates when the target register overlaps certain operands.
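
The control-byte adjustment described in the last bullet can be illustrated with a small sketch. VPTERNLOG treats its 8-bit immediate as a truth table indexed by one bit from each of the three operands, with the first operand supplying the most significant index bit, so moving an operand into a different slot means permuting that table. The helper names (`EvalTernary`, `RemapTernaryControl`) and the `slotHolds` convention are hypothetical and are not taken from the JIT sources; this only models the idea and is not the code in `genHWIntrinsic_R_R_R_RM_I`.

```cpp
#include <array>
#include <cstdint>

// Evaluate a ternary-logic truth table for one bit position.
// Bit ((a << 2) | (b << 1) | c) of `control` is the result, matching the
// VPTERNLOG convention where the first operand is the most significant index bit.
bool EvalTernary(uint8_t control, bool a, bool b, bool c)
{
    const int index = (a ? 4 : 0) | (b ? 2 : 0) | (c ? 1 : 0);
    return ((control >> index) & 1) != 0;
}

// Hypothetical helper: `control` describes f(A, B, C) for the original operand
// order. After reordering, instruction slot i holds original operand
// slotHolds[i] (0 = A, 1 = B, 2 = C). Returns the control byte that computes
// the same function when the operands sit in their new slots.
uint8_t RemapTernaryControl(uint8_t control, const std::array<int, 3>& slotHolds)
{
    uint8_t result = 0;
    for (int j = 0; j < 8; j++)
    {
        // Bits of j are the values seen in slots 0, 1, 2 of the new encoding.
        const bool slot[3] = {(j & 4) != 0, (j & 2) != 0, (j & 1) != 0};

        // Map each slot value back to the original operand it carries.
        bool orig[3] = {false, false, false};
        for (int i = 0; i < 3; i++)
        {
            orig[slotHolds[i]] = slot[i];
        }

        if (EvalTernary(control, orig[0], orig[1], orig[2]))
        {
            result = static_cast<uint8_t>(result | (1u << j));
        }
    }
    return result;
}
```

For example, a control byte of `0xF0` (select the first operand) becomes `0xCC` once the first two operands trade places, while `0x96` (three-way XOR, which prints as -106 when shown as a signed byte) is unchanged by any reordering because XOR is symmetric in its inputs.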
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/coreclr/jit/lsraxarch.cpp | Adjusts LSRA use/target preferencing for FMA intrinsics and adds a special LSRA build path for AVX512_TernaryLogic with immediate control byte. |
| src/coreclr/jit/hwintrinsiccodegenxarch.cpp | Adds TernaryLogic-specific handling to rewrite the control byte when operand/target register overlap is detected. |
CC. @dotnet/jit-contrib, @EgorBo for review.

Diffs are here: https://dev.azure.com/dnceng-public/public/_build/results?buildId=1427007&view=ms.vss-build-web.run-extensions-tab

This removes a large number of unnecessary [...]

    - vpternlogq xmm4, xmm0, xmm9, -106
    - vmovaps xmm0, xmm4
    + vpternlogq xmm0, xmm4, xmm9, -106

The x86 diffs show the largest size improvement due to the limited register set available ([...]).

The Linux x64 diffs then show a smaller improvement because they end up selecting [...]

The Windows x64 diffs then show a size regression because the register allocator ends up selecting callee-saved registers more often, which bloats the method prologue/epilogue because those registers have to be saved and restored.

The change is overall an improvement, which can be seen when observing the three variations here in unison. We probably want to look a bit into the Windows x64 register ordering though, since it really should be preferencing the EVEX registers over using the callee-saved registers.
I'm trying to understand why we should take this: quite a few lines of changes regressed more contexts than they improved (size-wise), and PerfScore says that 6 collections regressed (overall) and only 3 improved 😐
These instructions are fully reorderable and so, much like various commutative nodes, do not need to be marked delay-free except in special scenarios.
This resolves #62215
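
The "fully reorderable" point can be sketched for FMA, which comes in 132, 213, and 231 forms whose digits say which operands are multiplied and which is added. Whichever input the target register overlaps, some form lets that input sit in the destination slot without an extra copy. The types and the `ChooseFmaEncoding` helper below are hypothetical and use plain integers as register indices; this is a simplified model under those assumptions, not the JIT's actual selection logic, and it ignores memory operands and the CopyUpperBits cases.

```cpp
#include <array>

// The three FMA operand orders: the digits name which operands are
// multiplied (first two digits) and which is added (last digit).
enum class FmaForm { Form132, Form213, Form231 };

struct FmaEncoding
{
    FmaForm form;
    int op1; // register index: destination and first source
    int op2; // register index: second source
    int op3; // register index: third source
};

// Hypothetical helper: pick a form that computes a * b + c with `target`
// in the destination slot, given the register indices holding a, b, and c.
FmaEncoding ChooseFmaEncoding(int target, int a, int b, int c)
{
    if (target == c)
    {
        return {FmaForm::Form231, c, a, b}; // op1 = op2 * op3 + op1
    }
    if (target == b)
    {
        return {FmaForm::Form213, b, a, c}; // op1 = op2 * op1 + op3
    }
    // target == a, or no overlap (a caller would then copy a into target first).
    return {FmaForm::Form213, a, b, c};
}

// Model of what each form computes, for checking the choice above.
// Uses a separate multiply and add, so fused rounding is not modeled.
double Evaluate(const FmaEncoding& e, const std::array<double, 3>& regs)
{
    const double x1 = regs[e.op1];
    const double x2 = regs[e.op2];
    const double x3 = regs[e.op3];
    switch (e.form)
    {
        case FmaForm::Form132: return x1 * x3 + x2;
        case FmaForm::Form213: return x2 * x1 + x3;
        case FmaForm::Form231: return x2 * x3 + x1;
    }
    return 0.0;
}
```

In this model the 132 form is never chosen; it becomes relevant when a specific input has to occupy the third operand slot, for example because it is a memory operand.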