Improve simd and floating-point costing for xarch #127048
tannergooding wants to merge 8 commits into dotnet:main
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Pull request overview
Updates CoreCLR JIT instruction costing on xarch to better reflect modern SIMD and floating-point codegen characteristics, enabling more accurate optimization decisions (e.g., CSE and size/throughput tradeoffs).
Changes:
- Extends xarch HW intrinsic metadata to carry separate integer vs floating-point execution costs.
- Refines GT_HWINTRINSIC and indirection/address costing on xarch, including special-casing various SIMD/FP operations and constants.
- Adds a shared helper to compute address-mode costing (gtGetAddrNodeCost) and expands floating-point indirection costing.
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/coreclr/jit/valuenumfuncs.h | Updates xarch HW intrinsic macro signature to align VN function defs with new intrinsic list fields. |
| src/coreclr/jit/namedintrinsiclist.h | Updates xarch HW intrinsic macro signature to include int/float cost fields. |
| src/coreclr/jit/hwintrinsic.h | Extends HWIntrinsicInfo with intCost/fltCost and adds lookup helpers. |
| src/coreclr/jit/hwintrinsic.cpp | Populates HWIntrinsicInfo array with the new cost fields on xarch (defaults on arm64). |
| src/coreclr/jit/gentree.h | Introduces FLT_IND_COST_EX and clarifies cost constant definitions. |
| src/coreclr/jit/gentree.cpp | Major update to costing/eval-order logic for HW intrinsics, indirections, FP/SIMD constants, and FP ops on xarch; adds gtGetAddrNodeCost. |
| src/coreclr/jit/compiler.h | Declares gtGetAddrNodeCost and fixes trailing whitespace. |
CC @dotnet/jit-contrib, @EgorBo, @kg. This improves gtSetEvalOrder for SIMD/floating-point. It's limited to xarch only in this first iteration, but the basic premise is that it ensures floating-point costing matches current codegen (SSE/SSE2) rather than the x87 FPU codegen it was originally set up for. Likewise, it ensures that hardware intrinsics aren't all set to costEx=1, costSz=1, which means they can now be CSE'd, hoisted, etc.

We see a size regression because we now have trees that are consistently CSE'd/hoisted where they weren't previously. This uses more registers and causes larger prologues/epilogues to accommodate the non-volatile spills. However, the actual diffs in the core code, particularly loops, are much improved. In particular we see more opportunities to do broadcasts, so the method-local constant sections shrink by up to nearly 8% in some methods.
We have a number of places where the core loops show diffs like this, even though the overall method size is regressed by 24 bytes (due to the prologue/epilogue growth):

```diff
- ;; size=88 bbWeight=32 PerfScore 864.00
+ ;; size=76 bbWeight=32 PerfScore 736.00
```

There are some actual regressions too, namely from when CSE decides to hoist a constant:

```diff
- vmulss xmm0, xmm0, dword ptr [reloc @RWD16]
- vdivss xmm0, xmm0, dword ptr [reloc @RWD20]
+ vmovss xmm1, dword ptr [reloc @RWD16]
+ vmovss dword ptr [rbp-0x24], xmm1
+ vmulss xmm0, xmm0, xmm1
+ vmovss xmm2, dword ptr [reloc @RWD20]
+ vmovss dword ptr [rbp-0x28], xmm2
+ vdivss xmm0, xmm0, xmm2
  ...
- vmulss xmm0, xmm0, dword ptr [reloc @RWD16]
- vdivss xmm0, xmm0, dword ptr [reloc @RWD20]
+ vmulss xmm0, xmm0, dword ptr [rbp-0x24]
+ vdivss xmm0, xmm0, dword ptr [rbp-0x28]
```

when ideally we'd not spill at all and just reload from the …
Overall this should be a net improvement, with likely a couple of issues getting filed after the subsequent perf triage. PerfScore needs a similar cleanup, namely in ensuring:
The execution and size costing for SIMD and floating-point on xarch has been updated to more accurately reflect the real numbers.
The floating-point costs/sizes hadn't really been updated since the 32-bit legacy JIT, where they were based on the x87 FPU, which we haven't used in a long time.
Similarly, the hardware intrinsic nodes were basically all set at costEx=1, costSz=1, which would actively prevent CSE and other optimizations from kicking in. This is despite most such operations taking significantly more codegen bytes than general-purpose instructions, and most floating-point instructions taking a minimum of 4 cycles, sometimes more.

This will likely result in a larger set of diffs, but should allow the JIT to make better decisions about what should be optimized based on what is taking the most bytes or cycles. If this goes through, we can do a similar PR for Arm64.