
JIT: Reduce Vector256/512 Sum to a shared per-lane reduction on x64 (#127329)

Open
Copilot wants to merge 7 commits into main from
copilot/fix-vector256-intrinsics-regression

Conversation

Contributor

Copilot AI commented Apr 23, 2026

Vector256<float>.Sum regressed ~71% on .NET 10 vs .NET 8 on AVX-512 hardware. Compiler::gtNewSimdSumNode had a float-specific branch for 32- and 64-byte vectors that recursed on each half, emitting a full V128 horizontal reduction (shuffle + add + shuffle + add + ToScalar) per half plus a scalar add to combine them, duplicating the expensive reduction two or four times per call and bloating the hot loop.

Description

  • src/coreclr/jit/gentree.cpp — In gtNewSimdSumNode on x64, eliminate the duplicated per-half horizontal reduction for floating-point vectors.

    For floating-point, the fix runs the vpermilps/vpermilpd + vaddps/vaddpd horizontal-reduction sequence at the full simd width first. Because those permutes operate within each 128-bit lane, this is effectively 2x V128 (V256) or 4x V128 (V512) reductions happening in parallel with no duplicated work. After this, every 128-bit lane of the vector holds that lane's reduced sum broadcast across the lane.

    Only then are the lanes combined down to a single V128. Floating-point addition is commutative but not associative, so the halve-combine deliberately preserves the prior recursive Sum(lower) + Sum(upper) grouping:

    • V512: extract each of the four 128-bit lanes directly from the original V512 — lane 0 via NI_Vector512_GetLower128 and lanes 1-3 via NI_AVX512_ExtractVector128 with imm 1/2/3 — reusing the same V512 value across all four extractions (three fgMakeMultiUse calls on op1). Combine as (s0 + s1) + (s2 + s3) (where s_i is lane i's sum), matching the prior managed/JIT Sum(lower256) + Sum(upper256) recursion. A fall-through V256 = V512.Lower + V512.Upper was rejected because it would pair lane0+lane2 and lane1+lane3, reordering the FP additions to (s0 + s2) + (s1 + s3). Direct V128 extraction from the V512 is preferred over nested GetLower/GetUpper chains because it keeps the IR smaller and lets the JIT share the original V512 operand across extractions.
    • V256: vector128 = vector256.GetLower() + vector256.GetUpper(), set simdSize = 16, fall through.
    • V128: return vector128.ToScalar().

    For 512-bit widths the 512-bit encodings of the permutes (NI_AVX512_Permute4x32 / NI_AVX512_Permute2x64, which emit vpermilps / vpermilpd under EVEX) are selected via a one-line ternary.

    The integer path is unchanged (integer addition is associative, so reducing halves element-wise before the V128 reduction is safe, and that path was already efficient).

Effect on Vector256<float>.Sum codegen (x64, VEX):

; Before — 64 B / 11 instructions (two full V128 reductions + vaddss)
vmovaps      ymm1, ymm0
vpermilps    xmm2, xmm1, -79
vaddps       xmm1, xmm2, xmm1
vpermilps    xmm2, xmm1, 78
vaddps       xmm1, xmm2, xmm1          ; Sum(lower)
vextractf128 xmm0, ymm0, 1
vpermilps    xmm2, xmm0, -79
vaddps       xmm0, xmm2, xmm0
vpermilps    xmm2, xmm0, 78
vaddps       xmm0, xmm2, xmm0          ; Sum(upper)
vaddss       xmm0, xmm1, xmm0

; After — 44 B / 10 instructions (single shared per-lane reduction, then combine halves)
vpermilps    ymm1, ymm0, -79           ; per-lane shuffle at full 256-bit width (2x V128 in parallel)
vaddps       ymm0, ymm1, ymm0
vpermilps    ymm1, ymm0, 78
vaddps       ymm0, ymm1, ymm0          ; each V128 lane now holds its own sum broadcast
vmovaps      ymm1, ymm0
vextractf128 xmm0, ymm0, 1             ; upper
vaddps       xmm0, xmm0, xmm1          ; + lower

Savings scale further for Vector512<float>.Sum, which was previously doing four full V128 reductions + three scalar adds; it now does a single shared full-width per-lane reduction followed by four direct V128 extractions combined as (s0 + s1) + (s2 + s3).

Behavioral note

Only the per-V128-lane reduction within each lane is done via the shuffle/permute + add sequence, which was already the shape used by the V128 path, so no new within-lane order differences are introduced relative to the prior JIT behavior. The cross-lane halve-combine grouping is deliberately preserved to match the prior recursive Sum(lower) + Sum(upper) shape, given FP addition is not associative.

Note

This PR was authored with assistance from GitHub Copilot.

Agent-Logs-Url: https://github.com/dotnet/runtime/sessions/f98b46df-b011-4898-8d7f-4edea22a5662

Co-authored-by: tannergooding <10487869+tannergooding@users.noreply.github.com>
@dotnet-policy-service
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@tannergooding
Member

CC @dotnet/jit-contrib, @kg, @EgorBo for review. This is an alternative to #126255 that preserves the decision not to use horizontal instructions; instead it ensures we're not doing unnecessary work across the full vector.

@tannergooding
Member

@EgorBot -amd -intel

using System;
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]
public class Vector256RegressionBenchmark
{
    private const float NullHeight = -3.4E+38f;
    private const float FillTolerance = 0.01f;
    private const float NegCutTolerance = -0.01f;

    private float[] _baseElevations;
    private float[] _topElevations;

    [Params(1024, 4096)]
    public int NumElevations { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        var rng = new Random(42);
        _baseElevations = new float[NumElevations];
        _topElevations = new float[NumElevations];
        for (var i = 0; i < NumElevations; i++)
        {
            _baseElevations[i] = rng.NextDouble() < 0.2
                ? NullHeight : (float)(rng.NextDouble() * 100);
            _topElevations[i] = rng.NextDouble() < 0.2
                ? NullHeight : (float)(rng.NextDouble() * 100 + 10);
        }
    }

    [Benchmark(Description = "Vector<float> portable SIMD", Baseline = true)]
    public unsafe (double cut, double fill) PortableVector()
    {
        double cutVol = 0, fillVol = 0;
        var nullVec = new Vector<float>(NullHeight);
        var fillTolVec = new Vector<float>(FillTolerance);
        var negCutTolVec = new Vector<float>(NegCutTolerance);

        fixed (float* bp = _baseElevations, tp = _topElevations)
        {
            var bv = (Vector<float>*)bp;
            var tv = (Vector<float>*)tp;
            for (int i = 0, limit = NumElevations / Vector<float>.Count;
                 i < limit; i++, bv++, tv++)
            {
                var mask = ~(Vector.Equals(*bv, nullVec)
                           | Vector.Equals(*tv, nullVec));
                if (Vector.Sum(mask) == 0) continue;

                var delta = Vector.ConditionalSelect(mask, *tv, Vector<float>.Zero)
                          - Vector.ConditionalSelect(mask, *bv, Vector<float>.Zero);

                var fillMask = Vector.GreaterThan(delta, fillTolVec);
                var usedFill = -Vector.Sum(fillMask);
                if (usedFill > 0)
                    fillVol -= Vector.Dot(delta, Vector.ConvertToSingle(fillMask));

                if (usedFill < Vector<float>.Count)
                {
                    var cutMask = Vector.LessThan(delta, negCutTolVec);
                    var usedCut = -Vector.Sum(cutMask);
                    if (usedCut > 0)
                        cutVol -= Vector.Dot(delta, Vector.ConvertToSingle(cutMask));
                }
            }
        }
        return (cutVol, fillVol);
    }

    [Benchmark(Description = "Vector256<float> explicit SIMD")]
    public unsafe (double cut, double fill) ExplicitVector256()
    {
        if (!Vector256.IsHardwareAccelerated) return (0, 0);

        double cutVol = 0, fillVol = 0;
        var nullVec = Vector256.Create(NullHeight);
        var fillTolVec = Vector256.Create(FillTolerance);
        var negCutTolVec = Vector256.Create(NegCutTolerance);

        fixed (float* bp = _baseElevations, tp = _topElevations)
        {
            var bv = (Vector256<float>*)bp;
            var tv = (Vector256<float>*)tp;
            for (int i = 0, limit = NumElevations / Vector256<float>.Count;
                 i < limit; i++, bv++, tv++)
            {
                var mask = ~(Vector256.Equals(*bv, nullVec)
                           | Vector256.Equals(*tv, nullVec));
                if (Vector256.EqualsAll(mask, Vector256<float>.Zero)) continue;

                var delta = Vector256.ConditionalSelect(mask, *tv, Vector256<float>.Zero)
                          - Vector256.ConditionalSelect(mask, *bv, Vector256<float>.Zero);

                var fillMask = Vector256.GreaterThan(delta, fillTolVec);
                if (Vector256.ExtractMostSignificantBits(fillMask) != 0)
                    fillVol += Vector256.Sum(
                        Vector256.ConditionalSelect(fillMask, delta, Vector256<float>.Zero));

                var cutMask = Vector256.LessThan(delta, negCutTolVec);
                if (Vector256.ExtractMostSignificantBits(cutMask) != 0)
                    cutVol += Vector256.Sum(
                        Vector256.ConditionalSelect(cutMask, delta, Vector256<float>.Zero));
            }
        }
        return (cutVol, fillVol);
    }
}

Contributor

Copilot AI left a comment


Pull request overview

Refactors x64 floating-point SIMD Vector256/Vector512.Sum IR construction to avoid duplicating the V128 horizontal reduction work by performing the permute+add sequence at the full SIMD width and then combining 128-bit lanes down to a single V128.

Changes:

  • Reworks FP gtNewSimdSumNode to do per-128-bit-lane permute+add at full SIMD width (V256/V512) and then reduce lanes via upper/lower half combines.
  • Simplifies permute intrinsic selection (AVX vs AVX512) via ternary and aligns the V512 lane-combine structure with the integer reduction shape.
  • Keeps the integer path as “halve-combine to V128, then V128 reduction”.


Labels

area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI


Development

Successfully merging this pull request may close these issues.

Vector256 explicit intrinsics 71% slower on .NET 10 vs .NET 8 on AVX-512 hardware
