Improve simd and floating-point costing for xarch #127048
tannergooding wants to merge 8 commits into dotnet:main
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Pull request overview
Updates CoreCLR JIT instruction costing on xarch to better reflect modern SIMD and floating-point codegen characteristics, enabling more accurate optimization decisions (e.g., CSE and size/throughput tradeoffs).
Changes:
- Extends xarch HW intrinsic metadata to carry separate integer vs floating-point execution costs.
- Refines GT_HWINTRINSIC and indirection/address costing on xarch, including special-casing various SIMD/FP operations and constants.
- Adds a shared helper to compute address-mode costing (gtGetAddrNodeCost) and expands floating-point indirection costing.
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/coreclr/jit/valuenumfuncs.h | Updates xarch HW intrinsic macro signature to align VN function defs with new intrinsic list fields. |
| src/coreclr/jit/namedintrinsiclist.h | Updates xarch HW intrinsic macro signature to include int/float cost fields. |
| src/coreclr/jit/hwintrinsic.h | Extends HWIntrinsicInfo with intCost/fltCost and adds lookup helpers. |
| src/coreclr/jit/hwintrinsic.cpp | Populates HWIntrinsicInfo array with the new cost fields on xarch (defaults on arm64). |
| src/coreclr/jit/gentree.h | Introduces FLT_IND_COST_EX and clarifies cost constant definitions. |
| src/coreclr/jit/gentree.cpp | Major update to costing/eval-order logic for HW intrinsics, indirections, FP/SIMD constants, and FP ops on xarch; adds gtGetAddrNodeCost. |
| src/coreclr/jit/compiler.h | Declares gtGetAddrNodeCost and fixes trailing whitespace. |
CC @dotnet/jit-contrib, @EgorBo, @kg. This improves gtSetEvalOrder for SIMD/floating-point. It's limited to xarch only in this first iteration, but the basic premise is that it ensures floating-point costing matches current codegen (SSE/SSE2) rather than the x87 FPU codegen it was originally set up for. Likewise, it ensures that hardware intrinsics aren't all set to costEx=1, costSz=1, which means they can now be CSE'd, hoisted, etc.

We see a size regression because we now have trees that are consistently CSE'd/hoisted where they weren't previously. This uses more registers and causes larger prologues/epilogues to accommodate the non-volatile spills. However, the actual diffs in the core code, particularly loops, are much improved. In particular we see more opportunities to do broadcasts, so the method-local constant sections shrink by up to nearly 8% in some methods.
We have a number of places where the core loops show diffs like this, even though the overall method size is regressed by 24 bytes (due to the prologue/epilogue growth):

```diff
- ;; size=88 bbWeight=32 PerfScore 864.00
+ ;; size=76 bbWeight=32 PerfScore 736.00
```

There are some actual regressions too, namely from when CSE decides to hoist a constant:

```diff
- vmulss xmm0, xmm0, dword ptr [reloc @RWD16]
- vdivss xmm0, xmm0, dword ptr [reloc @RWD20]
+ vmovss xmm1, dword ptr [reloc @RWD16]
+ vmovss dword ptr [rbp-0x24], xmm1
+ vmulss xmm0, xmm0, xmm1
+ vmovss xmm2, dword ptr [reloc @RWD20]
+ vmovss dword ptr [rbp-0x28], xmm2
+ vdivss xmm0, xmm0, xmm2
  ...
- vmulss xmm0, xmm0, dword ptr [reloc @RWD16]
- vdivss xmm0, xmm0, dword ptr [reloc @RWD20]
+ vmulss xmm0, xmm0, dword ptr [rbp-0x24]
+ vdivss xmm0, xmm0, dword ptr [rbp-0x28]
```

when ideally we'd not spill at all and just reload from the …
Overall this should be a net improvement, with likely a couple of issues getting filed after the subsequent perf triage. PerfScore needs a similar cleanup, namely in ensuring:
The execution and size costing for SIMD and floating-point on xarch has been updated to more accurately reflect the real numbers.
The floating-point costs/sizes hadn't really been updated since the 32-bit legacy JIT, where they were based on the x87 FPU, which we haven't used in a long time.
Similarly, the hardware intrinsic nodes were basically all set at costEx=1, costSz=1, which would actively prevent CSE and other optimizations from kicking in. This is despite most such operations taking significantly more codegen bytes than general-purpose instructions, and most floating-point instructions taking a minimum of 4 cycles, sometimes more.

This will likely result in a larger set of diffs, but should allow the JIT to make better decisions about what should be optimized based on what is taking the most bytes or cycles. If this goes through, we can do a similar PR for Arm64.