JIT: Reduce Vector256/512 Sum to a shared per-lane reduction on x64 #127329
Open
Conversation
Agent-Logs-Url: https://github.com/dotnet/runtime/sessions/f98b46df-b011-4898-8d7f-4edea22a5662 Co-authored-by: tannergooding <10487869+tannergooding@users.noreply.github.com>
Contributor
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Copilot
AI
changed the title
[WIP] Fix Vector256 explicit intrinsics performance regression on .NET 10
JIT: Reduce Vector256/512 Sum to a single V128 horizontal reduction on x64
Apr 23, 2026
Agent-Logs-Url: https://github.com/dotnet/runtime/sessions/7dd925c6-f17d-475f-a448-008418ad164a Co-authored-by: tannergooding <10487869+tannergooding@users.noreply.github.com>
Copilot
AI
changed the title
JIT: Reduce Vector256/512 Sum to a single V128 horizontal reduction on x64
JIT: Reduce Vector256/512 Sum to a shared per-lane reduction on x64
Apr 23, 2026
Agent-Logs-Url: https://github.com/dotnet/runtime/sessions/873ded98-dc23-47f4-9027-6bfb940355ab Co-authored-by: tannergooding <10487869+tannergooding@users.noreply.github.com>
Copilot
AI
requested review from
Copilot and
tannergooding
and removed request for
Copilot
April 23, 2026 18:36
Agent-Logs-Url: https://github.com/dotnet/runtime/sessions/5ae7cd14-a6cd-4a9c-a559-dae9287ab791 Co-authored-by: tannergooding <10487869+tannergooding@users.noreply.github.com>
tannergooding
approved these changes
Apr 23, 2026
Member
@EgorBot -amd -intel

```csharp
using System;
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]
public class Vector256RegressionBenchmark
{
    private const float NullHeight = -3.4E+38f;
    private const float FillTolerance = 0.01f;
    private const float NegCutTolerance = -0.01f;

    private float[] _baseElevations;
    private float[] _topElevations;

    [Params(1024, 4096)]
    public int NumElevations { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        var rng = new Random(42);
        _baseElevations = new float[NumElevations];
        _topElevations = new float[NumElevations];
        for (var i = 0; i < NumElevations; i++)
        {
            _baseElevations[i] = rng.NextDouble() < 0.2
                ? NullHeight : (float)(rng.NextDouble() * 100);
            _topElevations[i] = rng.NextDouble() < 0.2
                ? NullHeight : (float)(rng.NextDouble() * 100 + 10);
        }
    }

    [Benchmark(Description = "Vector<float> portable SIMD", Baseline = true)]
    public unsafe (double cut, double fill) PortableVector()
    {
        double cutVol = 0, fillVol = 0;
        var nullVec = new Vector<float>(NullHeight);
        var fillTolVec = new Vector<float>(FillTolerance);
        var negCutTolVec = new Vector<float>(NegCutTolerance);
        fixed (float* bp = _baseElevations, tp = _topElevations)
        {
            var bv = (Vector<float>*)bp;
            var tv = (Vector<float>*)tp;
            for (int i = 0, limit = NumElevations / Vector<float>.Count;
                 i < limit; i++, bv++, tv++)
            {
                var mask = ~(Vector.Equals(*bv, nullVec)
                    | Vector.Equals(*tv, nullVec));
                if (Vector.Sum(mask) == 0) continue;
                var delta = Vector.ConditionalSelect(mask, *tv, Vector<float>.Zero)
                    - Vector.ConditionalSelect(mask, *bv, Vector<float>.Zero);
                var fillMask = Vector.GreaterThan(delta, fillTolVec);
                var usedFill = -Vector.Sum(fillMask);
                if (usedFill > 0)
                    fillVol -= Vector.Dot(delta, Vector.ConvertToSingle(fillMask));
                if (usedFill < Vector<float>.Count)
                {
                    var cutMask = Vector.LessThan(delta, negCutTolVec);
                    var usedCut = -Vector.Sum(cutMask);
                    if (usedCut > 0)
                        cutVol -= Vector.Dot(delta, Vector.ConvertToSingle(cutMask));
                }
            }
        }
        return (cutVol, fillVol);
    }

    [Benchmark(Description = "Vector256<float> explicit SIMD")]
    public unsafe (double cut, double fill) ExplicitVector256()
    {
        if (!Vector256.IsHardwareAccelerated) return (0, 0);
        double cutVol = 0, fillVol = 0;
        var nullVec = Vector256.Create(NullHeight);
        var fillTolVec = Vector256.Create(FillTolerance);
        var negCutTolVec = Vector256.Create(NegCutTolerance);
        fixed (float* bp = _baseElevations, tp = _topElevations)
        {
            var bv = (Vector256<float>*)bp;
            var tv = (Vector256<float>*)tp;
            for (int i = 0, limit = NumElevations / Vector256<float>.Count;
                 i < limit; i++, bv++, tv++)
            {
                var mask = ~(Vector256.Equals(*bv, nullVec)
                    | Vector256.Equals(*tv, nullVec));
                if (Vector256.EqualsAll(mask, Vector256<float>.Zero)) continue;
                var delta = Vector256.ConditionalSelect(mask, *tv, Vector256<float>.Zero)
                    - Vector256.ConditionalSelect(mask, *bv, Vector256<float>.Zero);
                var fillMask = Vector256.GreaterThan(delta, fillTolVec);
                if (Vector256.ExtractMostSignificantBits(fillMask) != 0)
                    fillVol += Vector256.Sum(
                        Vector256.ConditionalSelect(fillMask, delta, Vector256<float>.Zero));
                var cutMask = Vector256.LessThan(delta, negCutTolVec);
                if (Vector256.ExtractMostSignificantBits(cutMask) != 0)
                    cutVol += Vector256.Sum(
                        Vector256.ConditionalSelect(cutMask, delta, Vector256<float>.Zero));
            }
        }
        return (cutVol, fillVol);
    }
}
```
Contributor
Pull request overview
Refactors x64 floating-point SIMD Vector256/Vector512.Sum IR construction to avoid duplicating the V128 horizontal reduction work by performing the permute+add sequence at the full SIMD width and then combining 128-bit lanes down to a single V128.
Changes:
- Reworks the FP path of `gtNewSimdSumNode` to do per-128-bit-lane permute+add at full SIMD width (V256/V512) and then reduce lanes via upper/lower half combines.
- Simplifies permute intrinsic selection (`AVX` vs `AVX512`) via a ternary and aligns the V512 lane-combine structure with the integer reduction shape.
- Keeps the integer path as "halve-combine to V128, then V128 reduction".
Agent-Logs-Url: https://github.com/dotnet/runtime/sessions/e62caef7-2b87-4b6c-bc83-401d7672fdfc Co-authored-by: tannergooding <10487869+tannergooding@users.noreply.github.com>
Agent-Logs-Url: https://github.com/dotnet/runtime/sessions/7dae7a9d-0582-4e67-b63d-83e84b191141 Co-authored-by: tannergooding <10487869+tannergooding@users.noreply.github.com>
tannergooding
approved these changes
Apr 23, 2026
Description

`Vector256<float>.Sum` regressed ~71% on .NET 10 vs .NET 8 on AVX-512 hardware. `Compiler::gtNewSimdSumNode` had a float-specific branch for 32- and 64-byte vectors that recursed on each half, emitting a full V128 horizontal reduction (shuffle + add + shuffle + add + `ToScalar`) per half plus a scalar add to combine, duplicating the expensive reduction two/four times per call and bloating the hot loop.
`src/coreclr/jit/gentree.cpp`: in `gtNewSimdSumNode` on x64, eliminate the duplicated per-half horizontal reduction for floating-point vectors.
vpermilps/vpermilpd+vaddps/vaddpdhorizontal-reduction sequence at the full simd width first. Because those permutes operate within each 128-bit lane, this is effectively2x V128(V256) or4x V128(V512) reductions happening in parallel with no duplicated work. After this, every 128-bit lane of the vector holds that lane's reduced sum broadcast across the lane.Only then are the lanes combined down to a single V128. Floating-point addition is commutative but not associative, so the halve-combine deliberately preserves the prior recursive
Sum(lower) + Sum(upper)grouping:NI_Vector512_GetLower128and lanes 1-3 viaNI_AVX512_ExtractVector128with imm1/2/3— reusing the same V512 value across all four extractions (threefgMakeMultiUsecalls onop1). Combine as(s0 + s1) + (s2 + s3)(wheres_iis lanei's sum), matching the prior managed/JITSum(lower256) + Sum(upper256)recursion. A fall-throughV256 = V512.Lower + V512.Upperwas rejected because it would pairlane0+lane2andlane1+lane3, reordering the FP additions to(s0 + s2) + (s1 + s3). Direct V128 extraction from the V512 is preferred over nestedGetLower/GetUpperchains because it keeps the IR smaller and lets the JIT share the original V512 operand across extractions.vector128 = vector256.GetLower() + vector256.GetUpper(), setsimdSize = 16, fall through.return vector128.ToScalar().For 512-bit widths the 512-bit encodings of the permutes (
NI_AVX512_Permute4x32/NI_AVX512_Permute2x64, which emitvpermilps/vpermilpdunder EVEX) are selected via a one-line ternary.The integer path is unchanged (integer addition is associative, so reducing halves element-wise before the V128 reduction is safe, and that path was already efficient).
Effect on `Vector256<float>.Sum` codegen (x64, VEX):

Savings scale further for `Vector512<float>.Sum`, which was previously doing four full V128 reductions + three scalar adds; it now does a single shared full-width per-lane reduction followed by four direct V128 extractions combined as `(s0 + s1) + (s2 + s3)`.

Behavioral note
Only the per-V128-lane reduction within each lane is done via the shuffle/permute + add sequence, which was already the shape used by the V128 path, so no new within-lane order differences are introduced relative to the prior JIT behavior. The cross-lane halve-combine grouping is deliberately preserved to match the prior recursive `Sum(lower) + Sum(upper)` shape, given FP addition is not associative.

Note
This PR was authored with assistance from GitHub Copilot.