
JIT: Reduce Vector256/512 Sum to a shared per-lane reduction on x64 (#127329)

Open
Copilot wants to merge 7 commits into main from
copilot/fix-vector256-intrinsics-regression

Conversation

Contributor

Copilot AI commented Apr 23, 2026

Vector256<float>.Sum regressed ~71% on .NET 10 vs .NET 8 on AVX-512 hardware. Compiler::gtNewSimdSumNode had a float-specific branch for 32- and 64-byte vectors that recursed on each half, emitting a full V128 horizontal reduction (shuffle + add + shuffle + add + ToScalar) per half plus a scalar add to combine them, duplicating the expensive reduction two or four times per call and bloating the hot loop.

Description

  • src/coreclr/jit/gentree.cpp — In gtNewSimdSumNode on x64, eliminate the duplicated per-half horizontal reduction for floating-point vectors.

    For floating-point, the fix runs the vpermilps/vpermilpd + vaddps/vaddpd horizontal-reduction sequence at the full simd width first. Because those permutes operate within each 128-bit lane, this is effectively 2x V128 (V256) or 4x V128 (V512) reductions happening in parallel with no duplicated work. After this, every 128-bit lane of the vector holds that lane's reduced sum broadcast across the lane.

    Only then are the lanes combined down to a single V128. Floating-point addition is commutative but not associative, so the halve-combine deliberately preserves the prior recursive Sum(lower) + Sum(upper) grouping:

    • V512: extract each of the four 128-bit lanes directly from the original V512 — lane 0 via NI_Vector512_GetLower128 and lanes 1-3 via NI_AVX512_ExtractVector128 with imm 1/2/3 — reusing the same V512 value across all four extractions (three fgMakeMultiUse calls on op1). Combine as (s0 + s1) + (s2 + s3) (where s_i is lane i's sum), matching the prior managed/JIT Sum(lower256) + Sum(upper256) recursion. A fall-through V256 = V512.Lower + V512.Upper was rejected because it would pair lane0+lane2 and lane1+lane3, reordering the FP additions to (s0 + s2) + (s1 + s3). Direct V128 extraction from the V512 is preferred over nested GetLower/GetUpper chains because it keeps the IR smaller and lets the JIT share the original V512 operand across extractions.
    • V256: vector128 = vector256.GetLower() + vector256.GetUpper(), set simdSize = 16, fall through.
    • V128: return vector128.ToScalar().

    For 512-bit widths the 512-bit encodings of the permutes (NI_AVX512_Permute4x32 / NI_AVX512_Permute2x64, which emit vpermilps / vpermilpd under EVEX) are selected via a one-line ternary.

    The integer path is unchanged (integer addition is associative, so reducing halves element-wise before the V128 reduction is safe, and that path was already efficient).

Effect on Vector256<float>.Sum codegen (x64, VEX):

; Before — 64 B / 11 instructions (two full V128 reductions + vaddss)
vmovaps      ymm1, ymm0
vpermilps    xmm2, xmm1, -79
vaddps       xmm1, xmm2, xmm1
vpermilps    xmm2, xmm1, 78
vaddps       xmm1, xmm2, xmm1          ; Sum(lower)
vextractf128 xmm0, ymm0, 1
vpermilps    xmm2, xmm0, -79
vaddps       xmm0, xmm2, xmm0
vpermilps    xmm2, xmm0, 78
vaddps       xmm0, xmm2, xmm0          ; Sum(upper)
vaddss       xmm0, xmm1, xmm0

; After — 44 B / 10 instructions (single shared per-lane reduction, then combine halves)
vpermilps    ymm1, ymm0, -79           ; per-lane shuffle at full 256-bit width (2x V128 in parallel)
vaddps       ymm0, ymm1, ymm0
vpermilps    ymm1, ymm0, 78
vaddps       ymm0, ymm1, ymm0          ; each V128 lane now holds its own sum broadcast
vmovaps      ymm1, ymm0
vextractf128 xmm0, ymm0, 1             ; upper
vaddps       xmm0, xmm0, xmm1          ; + lower

Savings scale further for Vector512<float>.Sum, which was previously doing four full V128 reductions + three scalar adds; it now does a single shared full-width per-lane reduction followed by four direct V128 extractions combined as (s0 + s1) + (s2 + s3).

Behavioral note

Only the per-V128-lane reduction within each lane is done via the shuffle/permute + add sequence, which was already the shape used by the V128 path, so no new within-lane order differences are introduced relative to the prior JIT behavior. The cross-lane halve-combine grouping is deliberately preserved to match the prior recursive Sum(lower) + Sum(upper) shape, given FP addition is not associative.

Note

This PR was authored with assistance from GitHub Copilot.

Agent-Logs-Url: https://github.com/dotnet/runtime/sessions/f98b46df-b011-4898-8d7f-4edea22a5662

Co-authored-by: tannergooding <10487869+tannergooding@users.noreply.github.com>
@dotnet-policy-service
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@tannergooding
Member

CC @dotnet/jit-contrib, @kg, @EgorBo for review. This is an alternative to #126255 that preserves the decision not to use horizontal instructions; instead it ensures we're not doing unnecessary work across the full vector.

@tannergooding
Member

@EgorBot -amd -intel

using System;
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]
public class Vector256RegressionBenchmark
{
    private const float NullHeight = -3.4E+38f;
    private const float FillTolerance = 0.01f;
    private const float NegCutTolerance = -0.01f;

    private float[] _baseElevations;
    private float[] _topElevations;

    [Params(1024, 4096)]
    public int NumElevations { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        var rng = new Random(42);
        _baseElevations = new float[NumElevations];
        _topElevations = new float[NumElevations];
        for (var i = 0; i < NumElevations; i++)
        {
            _baseElevations[i] = rng.NextDouble() < 0.2
                ? NullHeight : (float)(rng.NextDouble() * 100);
            _topElevations[i] = rng.NextDouble() < 0.2
                ? NullHeight : (float)(rng.NextDouble() * 100 + 10);
        }
    }

    [Benchmark(Description = "Vector<float> portable SIMD", Baseline = true)]
    public unsafe (double cut, double fill) PortableVector()
    {
        double cutVol = 0, fillVol = 0;
        var nullVec = new Vector<float>(NullHeight);
        var fillTolVec = new Vector<float>(FillTolerance);
        var negCutTolVec = new Vector<float>(NegCutTolerance);

        fixed (float* bp = _baseElevations, tp = _topElevations)
        {
            var bv = (Vector<float>*)bp;
            var tv = (Vector<float>*)tp;
            for (int i = 0, limit = NumElevations / Vector<float>.Count;
                 i < limit; i++, bv++, tv++)
            {
                var mask = ~(Vector.Equals(*bv, nullVec)
                           | Vector.Equals(*tv, nullVec));
                if (Vector.Sum(mask) == 0) continue;

                var delta = Vector.ConditionalSelect(mask, *tv, Vector<float>.Zero)
                          - Vector.ConditionalSelect(mask, *bv, Vector<float>.Zero);

                var fillMask = Vector.GreaterThan(delta, fillTolVec);
                var usedFill = -Vector.Sum(fillMask);
                if (usedFill > 0)
                    fillVol -= Vector.Dot(delta, Vector.ConvertToSingle(fillMask));

                if (usedFill < Vector<float>.Count)
                {
                    var cutMask = Vector.LessThan(delta, negCutTolVec);
                    var usedCut = -Vector.Sum(cutMask);
                    if (usedCut > 0)
                        cutVol -= Vector.Dot(delta, Vector.ConvertToSingle(cutMask));
                }
            }
        }
        return (cutVol, fillVol);
    }

    [Benchmark(Description = "Vector256<float> explicit SIMD")]
    public unsafe (double cut, double fill) ExplicitVector256()
    {
        if (!Vector256.IsHardwareAccelerated) return (0, 0);

        double cutVol = 0, fillVol = 0;
        var nullVec = Vector256.Create(NullHeight);
        var fillTolVec = Vector256.Create(FillTolerance);
        var negCutTolVec = Vector256.Create(NegCutTolerance);

        fixed (float* bp = _baseElevations, tp = _topElevations)
        {
            var bv = (Vector256<float>*)bp;
            var tv = (Vector256<float>*)tp;
            for (int i = 0, limit = NumElevations / Vector256<float>.Count;
                 i < limit; i++, bv++, tv++)
            {
                var mask = ~(Vector256.Equals(*bv, nullVec)
                           | Vector256.Equals(*tv, nullVec));
                if (Vector256.EqualsAll(mask, Vector256<float>.Zero)) continue;

                var delta = Vector256.ConditionalSelect(mask, *tv, Vector256<float>.Zero)
                          - Vector256.ConditionalSelect(mask, *bv, Vector256<float>.Zero);

                var fillMask = Vector256.GreaterThan(delta, fillTolVec);
                if (Vector256.ExtractMostSignificantBits(fillMask) != 0)
                    fillVol += Vector256.Sum(
                        Vector256.ConditionalSelect(fillMask, delta, Vector256<float>.Zero));

                var cutMask = Vector256.LessThan(delta, negCutTolVec);
                if (Vector256.ExtractMostSignificantBits(cutMask) != 0)
                    cutVol += Vector256.Sum(
                        Vector256.ConditionalSelect(cutMask, delta, Vector256<float>.Zero));
            }
        }
        return (cutVol, fillVol);
    }
}

Contributor

Copilot AI left a comment


Pull request overview

Refactors x64 floating-point SIMD Vector256/Vector512.Sum IR construction to avoid duplicating the V128 horizontal reduction work by performing the permute+add sequence at the full SIMD width and then combining 128-bit lanes down to a single V128.

Changes:

  • Reworks FP gtNewSimdSumNode to do per-128-bit-lane permute+add at full SIMD width (V256/V512) and then reduce lanes via upper/lower half combines.
  • Simplifies permute intrinsic selection (AVX vs AVX512) via ternary and aligns the V512 lane-combine structure with the integer reduction shape.
  • Keeps the integer path as “halve-combine to V128, then V128 reduction”.


Labels

area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI


Development

Successfully merging this pull request may close these issues.

Vector256 explicit intrinsics 71% slower on .NET 10 vs .NET 8 on AVX-512 hardware
