[NO-REVIEW] [NO-MERGE] Auto loop vectorization experiment by hez2010 · Pull Request #127853 · dotnet/runtime

hez2010 · 2026-05-06T10:00:54Z

Note

This is a fully vibe-coded experiment that has had neither a careful correctness review nor extensive testing.
It is not intended for review or merging. I'm opening this PR to evaluate its actual impact and to find potential vectorization opportunities within the BCL.

Local SPMI Run Study

Headline:

Mode	Loops vectorized	TP base instr	TP diff instr	TP delta	TP pct	ActualCodeBytes delta	Asm diff contexts	Missing compiles note
Default policy	678	2,843,264,317,554	2,854,738,112,877	+11,473,795,323	+0.40%	+44,604	597	asmdiff missing base=322, diff=313
Aggressive policy	681	2,843,259,400,659	2,854,732,623,273	+11,473,222,614	+0.40%	+44,833	600	asmdiff missing base=322, diff=313

The default policy runs profitability checks and chooses the vector width based on pressure; the aggressive policy bypasses the checks and always uses the maximum available vector size.

Complete report with asm diffs:

autovec-binary-release-asm-metrics-report.md

Artifacts including spmi logs and per-method diffs:

autovec-binary-release-artifacts-with-dasm.zip

cc: @dotnet/jit-contrib

Note

The following content is AI generated.

Summary

This change adds a late HIR auto-vectorization phase to RyuJIT. The phase recognizes profitable counted loops, builds a virtual-lane SLP plan from the scalar loop body, and rewrites the loop into a vector loop plus scalar epilogue. The generated IR uses existing SIMD/HW intrinsic nodes so rationalization, lowering, LSRA, and codegen continue to own target-specific SIMD expansion.

The vectorizer is enabled by default via JitAutoVectorization=1. A second knob, JitAggressiveVectorizing=1, bypasses the profitability policy for investigation and opportunity measurement.

Phase Placement

The new phase is wired as:

VN-DSE
If Conversion
Auto vectorization
Optimize pre-layout
Rationalization
Lowering
LSRA
Codegen

More concretely, PHASE_AUTO_VECTORIZATION runs after VN-based dead-store removal and if-conversion, and before pre-layout flow opts and rationalization.

This placement is intentional:

It runs after loop canonicalization, SSA/VN optimizations, range checks, assertion propagation, range analysis, and IV optimization have already simplified loops.
It runs after VN-DSE, so the vectorizer does not need to preserve stale SSA/VN state for later VN consumers.
It runs after if-conversion, so simple scalar conditional expressions can appear as GT_SELECT and be packed by SLP.
It runs before rationalization, while loops are still HIR BasicBlock / Statement / GenTree form and can be rewritten structurally.
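
To illustrate the if-conversion point above, here is a minimal C++ sketch (not JIT code) of why a conditional that has already become a select expression can be packed lane by lane. The function names and the fixed lane count `kLanes` are hypothetical, and the inner lane loop only stands in for a single vector compare plus select.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical lane count; the real vectorization factor depends on the target.
constexpr std::size_t kLanes = 4;

// Scalar form after if-conversion: the branch has become a select expression.
inline int32_t SelectScalar(int32_t x, int32_t limit)
{
    return (x > limit) ? limit : x;
}

// The same select applied to kLanes adjacent elements. Each lane is
// independent, so this loop models one vector compare followed by one select.
inline void SelectPack(const int32_t* src, int32_t* dst, int32_t limit)
{
    for (std::size_t lane = 0; lane < kLanes; lane++)
    {
        const bool mask = src[lane] > limit;        // per-lane compare
        dst[lane]       = mask ? limit : src[lane]; // per-lane select
    }
}

// Clamp n elements: packed body first, then a scalar tail.
inline void ClampAll(const int32_t* src, int32_t* dst, std::size_t n, int32_t limit)
{
    assert(src != nullptr && dst != nullptr);
    std::size_t i = 0;
    for (; i + kLanes <= n; i += kLanes)
    {
        SelectPack(src + i, dst + i, limit);
    }
    for (; i < n; i++)
    {
        dst[i] = SelectScalar(src[i], limit);
    }
}
```

A loop body that still contained a real branch would not have this lane-independent form, which is presumably why the phase wants if-conversion to run first.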

After rewriting, the phase marks loop/flow/liveness-sensitive state stale and relies on the normal downstream pipeline to repair/consume the resulting HIR.

Design

The implementation is centered on AutoVectorizer in jit/autovectorizer.cpp.

The core pipeline is:

Recompute the loop table.
Visit natural loops in post-order.
Recognize a supported counted-loop shape.
Analyze memory accesses and loop-carried dependences.
Build a virtual-lane SLP plan.
Select the target vector width using the cost policy.
Rewrite the loop into:
- vector-entry check,
- vector body,
- optional runtime overlap checks,
- scalar epilogue guard,
- original scalar loop as the epilogue.
Record Metrics.LoopsVectorized.
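
The rewritten loop shape listed above can be sketched in plain C++ as follows. This is an illustration of the control-flow structure only, not the HIR the phase emits: `kVF`, `ScalarLoop`, and `AddRewritten` are hypothetical names, the lane loop models a single vector load/add/store, and the optional runtime overlap checks are not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical vectorization factor (lanes per vector iteration).
constexpr std::size_t kVF = 8;

// The original scalar loop. After the rewrite it is kept as the epilogue.
inline void ScalarLoop(uint32_t* dst, const uint32_t* a, const uint32_t* b, std::size_t begin, std::size_t end)
{
    for (std::size_t i = begin; i < end; i++)
    {
        dst[i] = a[i] + b[i];
    }
}

// Rewritten shape. Returns the number of vector iterations executed.
inline std::size_t AddRewritten(uint32_t* dst, const uint32_t* a, const uint32_t* b, std::size_t n)
{
    assert((n == 0) || (dst != nullptr && a != nullptr && b != nullptr));

    std::size_t i                = 0;
    std::size_t vectorIterations = 0;

    // Vector-entry check: enter only if at least one full vector fits.
    if (n - i >= kVF)
    {
        do
        {
            // Vector body: this lane loop models one vector load/add/store.
            for (std::size_t lane = 0; lane < kVF; lane++)
            {
                dst[i + lane] = a[i + lane] + b[i + lane];
            }
            i += kVF;
            vectorIterations++;
        } while (n - i >= kVF);
    }

    // Scalar epilogue guard: run the original loop only if a tail remains.
    if (i < n)
    {
        ScalarLoop(dst, a, b, i, n);
    }
    return vectorIterations;
}
```

Writing the guard as `n - i >= kVF` rather than `i + kVF <= n` keeps the comparison free of overflow, since `i` never exceeds `n`.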

The SLP planner does not materialize scalar unrolling in HIR. Instead, it reasons about virtual lanes and directly emits vector IR for the accepted pack:

scalar expression for i
  -> virtual lanes i + 0 ... i + VF - 1
  -> SLP pack
  -> vector load/op/store or vector reduction update

This keeps unsuccessful candidates cheap and avoids expanding scalar IR just to discover that the loop is not vectorizable.

Supported Targets and Width Selection

The phase is enabled for optimized, non-debuggable compilations on SIMD-capable xarch and arm64 targets.

Vector width selection uses the maximum hardware-supported SIMD width for the selected element type, subject to the profitability policy:

xarch:
- 512-bit when AVX512 is available and profitable,
- otherwise 256-bit when AVX2 is available and profitable,
- otherwise 128-bit.
arm64:
- 128-bit AdvSIMD.

The policy considers estimated scalar/vector cost, loop overhead, constant trip count, block hotness, simple memory-loop shape, vector pressure, reduction presence, and code size. JitAggressiveVectorizing=1 bypasses this policy and selects the first legal vector width, which is useful for finding missed opportunities and comparing the production policy against the legal maximum.
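
As a rough model of the width selection described above, the following C++ sketch tries candidate widths from widest to narrowest and lets an aggressive flag skip the profitability callback. Everything here is hypothetical (`TargetCaps`, `ProfitFn`, `SelectVectorSizeBytes`), and two points are assumptions rather than facts from this PR: that candidates are ordered widest first, and that the policy may also reject 128-bit, in which case the loop stays scalar.

```cpp
#include <cassert>

// Hypothetical description of what the target supports (xarch flavored).
struct TargetCaps
{
    bool hasAvx512;
    bool hasAvx2;
};

// Hypothetical profitability callback: true if this width is worth using.
using ProfitFn = bool (*)(unsigned vectorSizeBytes);

// Returns the chosen vector size in bytes, or 0 to keep the scalar loop.
inline unsigned SelectVectorSizeBytes(const TargetCaps& caps, bool aggressive, ProfitFn isProfitable)
{
    assert(isProfitable != nullptr);

    // Assumption: candidates are tried from widest to narrowest.
    const unsigned candidates[] = {64, 32, 16};
    for (unsigned size : candidates)
    {
        if ((size == 64) && !caps.hasAvx512)
        {
            continue; // 512-bit not legal on this target
        }
        if ((size == 32) && !caps.hasAvx2)
        {
            continue; // 256-bit not legal on this target
        }
        // Aggressive mode takes the first legal width without asking the policy.
        if (aggressive || isProfitable(size))
        {
            return size;
        }
    }
    return 0; // nothing accepted: leave the loop scalar
}
```

Under that ordering assumption, "first legal width" and "maximum available vector size" describe the same choice in aggressive mode.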

Covered Loop Shapes

The vectorizer currently handles conservative natural-loop forms:

single-entry natural loops,
one backedge,
one normal exit,
no EH participation in the preheader, loop, or exit,
innermost loops,
canonical counted loops recognized by loop analysis,
post-IV strength-reduced loops produced by IV opts,
local-limit loops where the loop test compares locals directly,
forward and descending unit-stride loops,
<, <=, >, >=, and selected != counted-loop tests,
conditional preheader entries for supported post-IV/local-limit forms,
scalar epilogue for tails.

The phase deliberately rejects unsupported or risky CFG shapes such as EH loops, non-innermost loops, multi-exit loops, and == loop termination.

Covered Memory Forms

The memory analysis supports contiguous element access through:

single-dimensional array address forms,
byref plus index forms,
post-IV local-address forms from strength reduction,
span-like and readonly-span-like morphed byref addressing,
mixed array/span/byref cases when the access and limit proof are recognized,
multiple loads and multiple stores within the fixed analysis budgets,
same-base/same-offset read-modify-write,
obviously safe different-offset access patterns,
runtime overlap checks for selected post-IV alias cases.

The vectorizer rejects volatile accesses, unsupported element types, remaining unproven bounds checks, unsupported address expressions, and dependence patterns that could change scalar semantics.
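
For the runtime overlap checks mentioned in the list above, a minimal C++ sketch of the usual interval test looks like this. It is a generic illustration, not the check the phase emits: `RangesOverlap` and `CanUseVectorBody` are hypothetical names, and the test is deliberately conservative (it also rejects identical ranges, which a same-base read-modify-write would not need to reject).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True when [a, a + aBytes) and [b, b + bBytes) share at least one byte.
inline bool RangesOverlap(const void* a, std::size_t aBytes, const void* b, std::size_t bBytes)
{
    if ((aBytes == 0) || (bBytes == 0))
    {
        return false; // an empty range cannot overlap anything
    }
    const std::uintptr_t aBegin = reinterpret_cast<std::uintptr_t>(a);
    const std::uintptr_t bBegin = reinterpret_cast<std::uintptr_t>(b);
    return (aBegin < bBegin + bBytes) && (bBegin < aBegin + aBytes);
}

// Hypothetical guard: take the vector path only when the source and
// destination ranges are disjoint; otherwise fall back to the scalar loop.
inline bool CanUseVectorBody(const int32_t* src, const int32_t* dst, std::size_t count)
{
    assert(src != nullptr && dst != nullptr);
    const std::size_t bytes = count * sizeof(int32_t);
    return !RangesOverlap(src, bytes, dst, bytes);
}
```

Because the original scalar loop is still present, a failed check costs only the compare and branch; correctness does not depend on the check succeeding.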

Covered Element Types and Operations

Supported element types include the primitive SIMD element types handled by the existing SIMD/HW intrinsic path, including integral and floating-point element types.

The SLP planner covers:

contiguous vector loads and stores,
splatted constants,
splatted invariant scalar locals,
unary ops,
binary ops,
ternary ops for supported scalar intrinsic patterns,
comparisons,
GT_SELECT,
min/max/abs-style intrinsic patterns where supported,
simple reductions.

Reduction support includes vector accumulator setup, vector loop update, and scalar finalization. The implementation supports add/sub reductions and min/max-style reductions for supported element types, including floating-point reduction paths where the scalar semantics are represented by the recognized intrinsic pattern.
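
The three reduction stages named above (accumulator setup, vector loop update, scalar finalization) can be sketched for an integer sum as follows. This is a hand-written model with hypothetical names and a hypothetical lane count, using wrapping unsigned addition so that reordering the additions is exact. Floating-point addition is not associative, so the sketch says nothing about how the PR handles floating-point reductions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical lane count for the accumulator vector.
constexpr std::size_t kLanes = 4;

inline uint32_t SumReduction(const uint32_t* data, std::size_t n)
{
    assert((data != nullptr) || (n == 0));

    // 1. Vector accumulator setup: every lane starts at the identity (0 for add).
    uint32_t acc[kLanes] = {0, 0, 0, 0};

    // 2. Vector loop update: each lane accumulates its own strided partial sum.
    std::size_t i = 0;
    for (; i + kLanes <= n; i += kLanes)
    {
        for (std::size_t lane = 0; lane < kLanes; lane++)
        {
            acc[lane] += data[i + lane];
        }
    }

    // 3. Scalar finalization: horizontally combine the lanes into one value.
    uint32_t sum = 0;
    for (std::size_t lane = 0; lane < kLanes; lane++)
    {
        sum += acc[lane];
    }

    // Scalar epilogue: fold in the tail elements that did not fill a vector.
    for (; i < n; i++)
    {
        sum += data[i];
    }
    return sum;
}
```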

Unsupported forms are still rejected rather than guessed: non-contiguous/gather/scatter memory, arbitrary casts and widening/narrowing packs, modulo, unsupported division forms, unsupported helper/call shapes, complicated address expressions, and control flow that was not simplified into supported straight-line HIR.

Safety Model

The implementation is intentionally conservative. It rejects a candidate unless legality is clear.

Important safety rules include:

Do not vectorize loops in or around EH regions.
Do not introduce potentially throwing preheader work.
Require memory accesses to be proven contiguous and in-bounds, or reject.
Reject volatile accesses and unsupported side effects.
Reject unsupported checked/throwing arithmetic.
Validate dependence between stores and loads before rewriting.
Keep the original scalar loop as the scalar epilogue.
Use runtime overlap checks only for selected forms where the vector rewrite can safely fall back to scalar.

The rewrite preserves the original scalar loop for the tail and redirects control flow through the new vector loop only when the vector trip count and alias checks allow it.
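
One detail of the vector trip count check is visible in a reviewed hunk further down, which widens the induction variable to `TYP_LONG` and then adds or subtracts `VectorizationFactor - 1` depending on the sign of the step. A minimal C++ sketch of why the widening matters is shown below. The function name is hypothetical, and the comparison operators (`<` for ascending, `>` for descending) are an assumption, since the real test depends on the recognized loop shape.

```cpp
#include <cassert>
#include <cstdint>

// Returns true if the last lane touched by a vector iteration starting at
// `iv` still satisfies the loop test against `end`.
inline bool LastLaneInRange(int32_t iv, int32_t end, int32_t vf, bool descending)
{
    assert(vf >= 1);

    // Widen before the add/sub so the last-lane index cannot wrap in 32 bits.
    const int64_t wideIv   = static_cast<int64_t>(iv);
    const int64_t lastLane = descending ? (wideIv - (vf - 1)) : (wideIv + (vf - 1));
    const int64_t wideEnd  = static_cast<int64_t>(end);

    return descending ? (lastLane > wideEnd) : (lastLane < wideEnd);
}
```

With a 32-bit add, an induction variable near `INT32_MAX` would wrap to a negative last-lane index and wrongly pass an ascending `<` test; the 64-bit form rejects it and leaves the work to the scalar loop.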

Diagnostics and Metrics

The phase uses normal JitDump output. Dumps include:

candidate loop shape,
IV/test information,
rejection reasons next to the relevant statement/tree dump,
accepted SLP pack structure,
selected vector size and VF,
scalar statements selected for rewrite,
generated vector trees/statements,
generated CFG edges and branch likelihoods.

This change also adds a new JIT metric LoopsVectorized.

The metric increments once per successfully rewritten loop and can be used by SuperPMI metricdiff to measure vectorization coverage per collection/method.

Files Changed

jit/autovectorizer.cpp
jit/autovectorizer.h
jit/compiler.cpp
jit/compiler.h
jit/compphases.h
jit/jitconfigvalues.h
jit/jitmetadatalist.h
jit/CMakeLists.txt

Validation

Built clr.jit Release with NoPgoOptimize=true.
Built separate Release JIT binaries for:
- auto-vectorization disabled,
- default policy enabled,
- aggressive vectorization enabled.
Ran smoke tests to check vectorization coverage.
Ran SuperPMI throughput diff over the local x64 collections.
Running SuperPMI asm/metric diffs and collecting LoopsVectorized metrics for the final report.

dotnet-policy-service · 2026-05-06T10:02:31Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR introduces a new JIT auto-vectorization optimization pass that analyzes and rewrites qualifying loops into SIMD vector loops (with scalar epilogues), along with associated config knobs, phase plumbing, build integration, and perf metrics.

Changes:

Add AutoVectorizer implementation and integrate it as a new compilation phase.
Introduce new JIT config flags to control auto-vectorization and an “aggressive” mode.
Add a new JIT metadata metric to track the number of loops vectorized.

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 6 comments.

File	Description
src/coreclr/jit/jitmetadatalist.h	Adds a new `LoopsVectorized` metric to track vectorized loops.
src/coreclr/jit/jitconfigvalues.h	Adds config switches to enable/disable auto-vectorization and aggressive vectorizing.
src/coreclr/jit/compphases.h	Registers a new `PHASE_AUTO_VECTORIZATION` phase name.
src/coreclr/jit/compiler.h	Adds `optAutoVectorize()` and grants the new pass friend access.
src/coreclr/jit/compiler.cpp	Wires the new phase into the pipeline when optimizations are enabled.
src/coreclr/jit/autovectorizer.h	Declares the `AutoVectorizer` pass and its planning/rewriting machinery.
src/coreclr/jit/autovectorizer.cpp	Implements loop analysis, SLP planning, profitability heuristics, and CFG/IR rewrite.
src/coreclr/jit/CMakeLists.txt	Adds the new source/header to the JIT build.

+GenTree* AutoVectorizer::BuildVectorReductionOp(LoopVectorizationPlan*                      plan,
+                                                const LoopVectorizationPlan::ReductionInfo& reduction,
+                                                GenTree*                                    op1,
+                                                GenTree*                                    op2)
+{
+#if defined(FEATURE_HW_INTRINSICS) && (defined(TARGET_XARCH) || defined(TARGET_ARM64))
+    const var_types simdType = Compiler::getSIMDTypeForSize(plan->VectorSizeBytes);
+    if (reduction.Oper != GT_INTRINSIC)
+    {
+        return m_compiler->gtNewSimdBinOpNode(GT_ADD, simdType, op1, op2, plan->ElementType, plan->VectorSizeBytes);
+    }
+
+    return BuildVectorMinMaxOp(reduction, op1, op2, simdType, plan->VectorSizeBytes);
+#else
+    unreached();
+#endif
+}


+    for (unsigned i = 0; i < plan->LoadCount; i++)
+    {
+        const LoopVectorizationPlan::ScalarAccess& existing = plan->LoadAccesses[i];
+        if ((existing.Address == access.Address) ||
+            ((existing.BaseLocalIfKnown == access.BaseLocalIfKnown) &&
+             (existing.OffsetLocalIfKnown == access.OffsetLocalIfKnown) &&
+             (existing.IndexOffset == access.IndexOffset) && (existing.PostIVOffset == access.PostIVOffset) &&
+             (existing.ElementType == access.ElementType) && (existing.IsArray == access.IsArray) &&
+             (existing.IsByrefLocal == access.IsByrefLocal) &&
+             (existing.IsByrefBaseWithOffset == access.IsByrefBaseWithOffset) &&
+             (existing.IsByrefWithIndex == access.IsByrefWithIndex)))
+        {
+            *index = i;
+            return true;
+        }
+    }


 CONFIG_STRING(JitObjectStackAllocationTrackFieldsRange, "JitObjectStackAllocationTrackFieldsRange")
 CONFIG_INTEGER(JitObjectStackAllocationDumpConnGraph, "JitObjectStackAllocationDumpConnGraph", 0)

+RELEASE_CONFIG_INTEGER(JitAutoVectorization, "JitAutoVectorization", 1)


+class AutoVectorizer
+{
+public:
+    explicit AutoVectorizer(Compiler* compiler);


+    if (first.IsArray && second.IsArray)
+    {
+        return true;
+    }
+
+    if ((first.IsByrefLocal || first.IsByrefBaseWithOffset || first.IsByrefWithIndex) &&
+        (second.IsByrefLocal || second.IsByrefBaseWithOffset || second.IsByrefWithIndex))
+    {
+        return true;
+    }
+
+    // Array and byref/span bases can still describe the same storage after morphing.


+        if (doAutoVectorization)
+        {
+            // Rewrite HIR loops late, after VN-DSE and if-conversion but before rationalization.
+            //
+            DoPhase(this, PHASE_AUTO_VECTORIZATION, &Compiler::optAutoVectorize);
+        }


hez2010 · 2026-05-06T10:24:09Z

@MihuBot

hez2010 · 2026-05-06T11:10:28Z

@MihuBot

Copilot

Pull request overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 6 comments.

+    if (tree->OperIs(GT_ADD))
+    {
+        GenTree* op1 = tree->AsOp()->gtOp1;
+        GenTree* op2 = tree->AsOp()->gtOp2;
+
+        if (op1->IsCnsIntOrI())
+        {
+            *offset += static_cast<int>(op1->AsIntConCommon()->IconValue());
+            return TryAnalyzeIndexExpr(plan, op2, ivLcl, offset, invariantLcl, sawIv, depth + 1);
+        }
+
+        if (op2->IsCnsIntOrI())
+        {
+            *offset += static_cast<int>(op2->AsIntConCommon()->IconValue());
+            return TryAnalyzeIndexExpr(plan, op1, ivLcl, offset, invariantLcl, sawIv, depth + 1);
+        }


+    LclVarDsc* const ivDsc = m_compiler->lvaGetDesc(plan->InductionVar);
+    GenTree*         iv    = m_compiler->gtNewLclvNode(plan->InductionVar, ivDsc->TypeGet());
+    GenTree*         end   = m_compiler->gtCloneExpr(plan->End);
+
+    GenTree* lastLane = m_compiler->gtNewCastNode(TYP_LONG, iv, false, TYP_LONG);
+    if (plan->VectorizationFactor > 1)
+    {
+        lastLane =
+            m_compiler->gtNewOperNode(plan->Step < 0 ? GT_SUB : GT_ADD, TYP_LONG, lastLane,
+                                      m_compiler->gtNewLconNode(static_cast<int64_t>(plan->VectorizationFactor - 1)));
+    }
+
+    end = m_compiler->gtNewCastNode(TYP_LONG, end, false, TYP_LONG);


+            BasicBlock* const header           = loop->GetHeader();
+            bool              alreadyRewritten = false;
+            for (unsigned rewrittenHeader : rewrittenHeaders)
+            {
+                if (rewrittenHeader == header->bbNum)
+                {
+                    alreadyRewritten = true;
+                    break;
+                }
+            }


hez2010 · 2026-05-06T14:21:29Z

@MihuBot

Copilot

Pull request overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 5 comments.

+
+    if (!changed)
+    {
+        m_compiler->fgInvalidateDfsTree();


+    const LoopVectorizationPlan originalPlan = *plan;
+
+    for (unsigned i = 0; i < vectorSizeCount; i++)
+    {
+        *plan = originalPlan;
+
+        plan->VectorSizeBytes     = vectorSizes[i];


hez2010 · 2026-05-06T16:00:09Z

PMI diffs on System.Private.CoreLib and framework assemblies:

PMI CodeSize Diffs for System.Private.CoreLib.dll, framework assemblies [invoking .cctors] for  default jit

Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 84302919
Total bytes of diff: 84308903
Total bytes of delta: 5984 (0.01 % of base)
Total relative delta: NaN
    diff is a regression.
    relative diff is a regression.


Total byte diff includes -117 bytes from reconciling methods
        Base had    1 unique methods,      117 unique bytes
        Diff had    0 unique methods,        0 unique bytes

Top file regressions (bytes):
        3090 : FSharp.Core.dasm (0.07 % of base)
         446 : System.Numerics.Tensors.dasm (0.04 % of base)
         401 : System.Private.CoreLib.dasm (0.01 % of base)
         255 : System.Text.RegularExpressions.dasm (0.03 % of base)
         244 : System.Diagnostics.Process.dasm (0.16 % of base)
         239 : Microsoft.VisualBasic.Core.dasm (0.05 % of base)
         233 : System.Runtime.Numerics.dasm (0.14 % of base)
         217 : System.Collections.Immutable.dasm (0.01 % of base)
         195 : System.Net.Security.dasm (0.08 % of base)
         143 : System.Data.Common.dasm (0.01 % of base)
         117 : System.Net.Http.dasm (0.01 % of base)
          88 : System.Reflection.Metadata.dasm (0.02 % of base)
          84 : Newtonsoft.Json.dasm (0.01 % of base)
          67 : Microsoft.CodeAnalysis.VisualBasic.dasm (0.00 % of base)
          51 : xunit.runner.utility.netcoreapp10.dasm (0.02 % of base)
          51 : xunit.execution.dotnet.dasm (0.02 % of base)
          38 : System.Net.NameResolution.dasm (0.06 % of base)
          37 : System.Reflection.MetadataLoadContext.dasm (0.02 % of base)
          35 : Microsoft.Extensions.Logging.Abstractions.dasm (0.04 % of base)

Top file improvements (bytes):
         -47 : Microsoft.CodeAnalysis.dasm (-0.00 % of base)

20 total files with Code Size differences (1 improved, 19 regressed), 260 unchanged.

Top method regressions (bytes):
         195 (6.40 % of base) : System.Net.Security.dasm - System.Net.Security.NetSecurityTelemetry:OnEventCommand(System.Diagnostics.Tracing.EventCommandEventArgs):this (FullOpts)
         165 (150.00 % of base) : System.Numerics.Tensors.dasm - System.Numerics.Tensors.TensorOperation+SumOfSquaredDifferences`1[byte]:Invoke(System.ReadOnlySpan`1[byte],byte,System.Span`1[byte]) (FullOpts)
         132 (12.94 % of base) : System.Private.CoreLib.dasm - System.PasteArguments:AppendArgument(byref,System.String) (FullOpts)
         132 (12.94 % of base) : System.Diagnostics.Process.dasm - System.PasteArguments:AppendArgument(byref,System.String) (FullOpts)
         119 (10.21 % of base) : System.Private.CoreLib.dasm - System.Globalization.CalendarData:NormalizeDatePattern(System.String):System.String (FullOpts)
         117 (4.96 % of base) : System.Net.Http.dasm - System.Net.Http.HttpTelemetry:OnEventCommand(System.Diagnostics.Tracing.EventCommandEventArgs):this (FullOpts)
         115 (19.07 % of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.ArrayModule:splitAt$cont@170[byte](int,byte[],Microsoft.FSharp.Core.Unit):System.Tuple`2[byte[],byte[]] (FullOpts)
         108 (6.36 % of base) : System.Data.Common.dasm - System.Data.SqlTypes.SqlDecimal:MpDiv(System.ReadOnlySpan`1[uint],int,System.Span`1[uint],int,System.Span`1[uint],byref,System.Span`1[uint],byref) (FullOpts)
          95 (16.18 % of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.ArrayModule:splitAt$cont@170[long](int,long[],Microsoft.FSharp.Core.Unit):System.Tuple`2[long[],long[]] (FullOpts)
          83 (94.32 % of base) : System.Numerics.Tensors.dasm - System.Numerics.Tensors.TensorOperation+SumOfSquaredDifferences`1[double]:Invoke(System.ReadOnlySpan`1[double],double,System.Span`1[double]) (FullOpts)
          80 (15.04 % of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.ArrayModule:splitAt$cont@170[int](int,int[],Microsoft.FSharp.Core.Unit):System.Tuple`2[int[],int[]] (FullOpts)
          77 (21.10 % of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.ArrayModule:TakeWhile[double](Microsoft.FSharp.Core.FSharpFunc`2[double,bool],double[]):double[] (FullOpts)
          77 (9.45 % of base) : System.Diagnostics.Process.dasm - System.Diagnostics.ProcessUtils:GetNextArgument(System.String,byref):System.String (FullOpts)
          68 (40.48 % of base) : System.Text.RegularExpressions.dasm - System.Text.RegularExpressions.Match:Reset(System.String,int):this (FullOpts) (2 methods)
          67 (23.43 % of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - Microsoft.CodeAnalysis.VisualBasic.Syntax.KeywordTable:EnsureHalfWidth(System.String):System.String (FullOpts)
          67 (10.84 % of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.ArrayModule:splitAt$cont@170[double](int,double[],Microsoft.FSharp.Core.Unit):System.Tuple`2[double[],double[]] (FullOpts)
          67 (10.86 % of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.ArrayModule:splitAt$cont@170[short](int,short[],Microsoft.FSharp.Core.Unit):System.Tuple`2[short[],short[]] (FullOpts)
          67 (20.18 % of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.ArrayModule:Tail[double](double[]):double[] (FullOpts)
          67 (20.24 % of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.ArrayModule:Tail[short](short[]):short[] (FullOpts)
          67 (18.87 % of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.ArrayModule:TakeWhile[byte](Microsoft.FSharp.Core.FSharpFunc`2[byte,bool],byte[]):byte[] (FullOpts)

Top method improvements (bytes):
        -117 (-100.00 % of base) : Microsoft.CodeAnalysis.dasm - Microsoft.CodeAnalysis.SmallDictionary`2[System.__Canon,int]:LeftComplex(Microsoft.CodeAnalysis.SmallDictionary`2+AvlNode[System.__Canon,int]):Microsoft.CodeAnalysis.SmallDictionary`2+AvlNode[System.__Canon,int] (FullOpts) (1 base, 0 diff methods)

Top method regressions (percentages):
         165 (150.00 % of base) : System.Numerics.Tensors.dasm - System.Numerics.Tensors.TensorOperation+SumOfSquaredDifferences`1[byte]:Invoke(System.ReadOnlySpan`1[byte],byte,System.Span`1[byte]) (FullOpts)
          83 (94.32 % of base) : System.Numerics.Tensors.dasm - System.Numerics.Tensors.TensorOperation+SumOfSquaredDifferences`1[double]:Invoke(System.ReadOnlySpan`1[double],double,System.Span`1[double]) (FullOpts)
          67 (69.07 % of base) : System.Numerics.Tensors.dasm - System.Numerics.Tensors.TensorOperation+SumOfSquaredDifferences`1[int]:Invoke(System.ReadOnlySpan`1[int],int,System.Span`1[int]) (FullOpts)
          67 (69.07 % of base) : System.Numerics.Tensors.dasm - System.Numerics.Tensors.TensorOperation+SumOfSquaredDifferences`1[long]:Invoke(System.ReadOnlySpan`1[long],long,System.Span`1[long]) (FullOpts)
          64 (57.66 % of base) : System.Numerics.Tensors.dasm - System.Numerics.Tensors.TensorOperation+SumOfSquaredDifferences`1[short]:Invoke(System.ReadOnlySpan`1[short],short,System.Span`1[short]) (FullOpts)
          39 (48.15 % of base) : Microsoft.VisualBasic.Core.dasm - Microsoft.VisualBasic.CompilerServices.NewLateBinding:ResetCopyback(bool[]) (FullOpts)
          68 (40.48 % of base) : System.Text.RegularExpressions.dasm - System.Text.RegularExpressions.Match:Reset(System.String,int):this (FullOpts) (2 methods)
          35 (40.23 % of base) : System.Data.Common.dasm - System.Data.SqlTypes.SqlDecimal:MpMove(System.ReadOnlySpan`1[uint],int,System.Span`1[uint],byref) (FullOpts)
          48 (39.67 % of base) : System.Runtime.Numerics.dasm - System.Text.ValueStringBuilder`1[byte]:Append(byte,int):this (FullOpts)
          65 (38.69 % of base) : System.Collections.Immutable.dasm - System.Collections.Immutable.ImmutableArray`1+Builder[byte]:AddRange[byte](System.ReadOnlySpan`1[byte]):this (FullOpts)
          44 (37.61 % of base) : Microsoft.VisualBasic.Core.dasm - Microsoft.VisualBasic.CompilerServices.OverloadResolution:CreateMatchTable(int,int):bool[] (FullOpts)
          58 (34.32 % of base) : System.Text.RegularExpressions.dasm - System.Text.RegularExpressions.Symbolic.BitVector:And(System.Text.RegularExpressions.Symbolic.BitVector,System.Text.RegularExpressions.Symbolic.BitVector):System.Text.RegularExpressions.Symbolic.BitVector (FullOpts)
          58 (34.32 % of base) : System.Text.RegularExpressions.dasm - System.Text.RegularExpressions.Symbolic.BitVector:Or(System.Text.RegularExpressions.Symbolic.BitVector,System.Text.RegularExpressions.Symbolic.BitVector):System.Text.RegularExpressions.Symbolic.BitVector (FullOpts)
          53 (31.18 % of base) : System.Reflection.Metadata.dasm - System.Reflection.Metadata.MetadataReader:CombineRowCounts(int[],int[],byte):int[] (FullOpts)
          44 (30.34 % of base) : System.Runtime.Numerics.dasm - System.Text.ValueStringBuilder`1[double]:Append(double,int):this (FullOpts)
          36 (28.12 % of base) : System.Runtime.Numerics.dasm - System.Numerics.NumericsHelpers:DangerousMakeOnesComplement(System.Span`1[nuint]) (FullOpts)
          35 (25.74 % of base) : System.Runtime.Numerics.dasm - System.Text.ValueStringBuilder`1[int]:Append(int,int):this (FullOpts)
          35 (25.74 % of base) : System.Runtime.Numerics.dasm - System.Text.ValueStringBuilder`1[long]:Append(long,int):this (FullOpts)
          35 (25.55 % of base) : Microsoft.Extensions.Logging.Abstractions.dasm - System.Text.ValueStringBuilder:Append(char,int):this (FullOpts)
          35 (25.55 % of base) : System.Reflection.Metadata.dasm - System.Text.ValueStringBuilder:Append(char,int):this (FullOpts)

Top method improvements (percentages):
        -117 (-100.00 % of base) : Microsoft.CodeAnalysis.dasm - Microsoft.CodeAnalysis.SmallDictionary`2[System.__Canon,int]:LeftComplex(Microsoft.CodeAnalysis.SmallDictionary`2+AvlNode[System.__Canon,int]):Microsoft.CodeAnalysis.SmallDictionary`2+AvlNode[System.__Canon,int] (FullOpts) (1 base, 0 diff methods)

117 total methods with Code Size differences (1 improved, 116 regressed), 502070 unchanged.

hez2010 · 2026-05-06T16:21:14Z

CoreLib and framework assemblies full diffs:
method_assembly_diff_report.md

Method list (potential candidates for manual vectorization in the BCL):

(+53 bytes, +22.46 %) Microsoft.FSharp.Collections.ArrayModule:Create[byte](int,byte):byte[]
(+36 bytes, +15.06 %) Microsoft.FSharp.Collections.ArrayModule:Create[short](int,short):short[]
(+36 bytes, +15.13 %) Microsoft.FSharp.Collections.ArrayModule:Create[int](int,int):int[]
(+35 bytes, +13.01 %) Microsoft.FSharp.Collections.ArrayModule:Create[double](int,double):double[]
(+36 bytes, +15.00 %) Microsoft.FSharp.Collections.ArrayModule:Create[long](int,long):long[]
(+56 bytes, +17.02 %) Microsoft.FSharp.Collections.ArrayModule:Tail[byte](byte[]):byte[]
(+67 bytes, +20.24 %) Microsoft.FSharp.Collections.ArrayModule:Tail[short](short[]):short[]
(+60 bytes, +18.29 %) Microsoft.FSharp.Collections.ArrayModule:Tail[int](int[]):int[]
(+67 bytes, +20.18 %) Microsoft.FSharp.Collections.ArrayModule:Tail[double](double[]):double[]
(+60 bytes, +18.24 %) Microsoft.FSharp.Collections.ArrayModule:Tail[long](long[]):long[]
(+53 bytes, +22.46 %) Microsoft.FSharp.Collections.ArrayModule:Replicate[byte](int,byte):byte[]
(+36 bytes, +15.06 %) Microsoft.FSharp.Collections.ArrayModule:Replicate[short](int,short):short[]
(+36 bytes, +15.13 %) Microsoft.FSharp.Collections.ArrayModule:Replicate[int](int,int):int[]
(+35 bytes, +13.01 %) Microsoft.FSharp.Collections.ArrayModule:Replicate[double](int,double):double[]
(+36 bytes, +15.00 %) Microsoft.FSharp.Collections.ArrayModule:Replicate[long](int,long):long[]
(+48 bytes, +7.57 %) Microsoft.FSharp.Collections.ArrayModule:SplitAt[byte](int,byte[]):System.Tuple`2[byte[],byte[]]
(+39 bytes, +6.15 %) Microsoft.FSharp.Collections.ArrayModule:SplitAt[short](int,short[]):System.Tuple`2[short[],short[]]
(+39 bytes, +6.76 %) Microsoft.FSharp.Collections.ArrayModule:SplitAt[int](int,int[]):System.Tuple`2[int[],int[]]
(+39 bytes, +6.14 %) Microsoft.FSharp.Collections.ArrayModule:SplitAt[double](int,double[]):System.Tuple`2[double[],double[]]
(+39 bytes, +6.17 %) Microsoft.FSharp.Collections.ArrayModule:SplitAt[long](int,long[]):System.Tuple`2[long[],long[]]
(+46 bytes, +8.00 %) Microsoft.FSharp.Collections.ArrayModule:Take[byte](int,byte[]):byte[]
(+46 bytes, +7.97 %) Microsoft.FSharp.Collections.ArrayModule:Take[short](int,short[]):short[]
(+46 bytes, +8.76 %) Microsoft.FSharp.Collections.ArrayModule:Take[int](int,int[]):int[]
(+50 bytes, +8.68 %) Microsoft.FSharp.Collections.ArrayModule:Take[double](int,double[]):double[]
(+46 bytes, +8.00 %) Microsoft.FSharp.Collections.ArrayModule:Take[long](int,long[]):long[]
(+67 bytes, +18.87 %) Microsoft.FSharp.Collections.ArrayModule:TakeWhile[byte](Microsoft.FSharp.Core.FSharpFunc`2[byte,bool],byte[]):byte[]
(+62 bytes, +17.13 %) Microsoft.FSharp.Collections.ArrayModule:TakeWhile[short](Microsoft.FSharp.Core.FSharpFunc`2[short,bool],short[]):short[]
(+56 bytes, +18.12 %) Microsoft.FSharp.Collections.ArrayModule:TakeWhile[int](Microsoft.FSharp.Core.FSharpFunc`2[int,bool],int[]):int[]
(+77 bytes, +21.10 %) Microsoft.FSharp.Collections.ArrayModule:TakeWhile[double](Microsoft.FSharp.Core.FSharpFunc`2[double,bool],double[]):double[]
(+62 bytes, +17.37 %) Microsoft.FSharp.Collections.ArrayModule:TakeWhile[long](Microsoft.FSharp.Core.FSharpFunc`2[long,bool],long[]):long[]
(+51 bytes, +11.67 %) Microsoft.FSharp.Collections.ArrayModule:Distinct[byte](byte[]):byte[]
(+35 bytes, +8.01 %) Microsoft.FSharp.Collections.ArrayModule:Distinct[short](short[]):short[]
(+35 bytes, +8.08 %) Microsoft.FSharp.Collections.ArrayModule:Distinct[int](int[]):int[]
(+35 bytes, +7.94 %) Microsoft.FSharp.Collections.ArrayModule:Distinct[double](double[]):double[]
(+35 bytes, +8.05 %) Microsoft.FSharp.Collections.ArrayModule:Distinct[long](long[]):long[]
(+41 bytes, +7.56 %) Microsoft.FSharp.Collections.ArrayModule:DistinctBy[byte,System.Nullable`1[int]](Microsoft.FSharp.Core.FSharpFunc`2[byte,System.Nullable`1[int]],byte[]):byte[]
(+38 bytes, +7.01 %) Microsoft.FSharp.Collections.ArrayModule:DistinctBy[short,System.Nullable`1[int]](Microsoft.FSharp.Core.FSharpFunc`2[short,System.Nullable`1[int]],short[]):short[]
(+38 bytes, +7.06 %) Microsoft.FSharp.Collections.ArrayModule:DistinctBy[int,System.Nullable`1[int]](Microsoft.FSharp.Core.FSharpFunc`2[int,System.Nullable`1[int]],int[]):int[]
(+38 bytes, +6.96 %) Microsoft.FSharp.Collections.ArrayModule:DistinctBy[double,System.Nullable`1[int]](Microsoft.FSharp.Core.FSharpFunc`2[double,System.Nullable`1[int]],double[]):double[]
(+38 bytes, +7.04 %) Microsoft.FSharp.Collections.ArrayModule:DistinctBy[long,System.Nullable`1[int]](Microsoft.FSharp.Core.FSharpFunc`2[long,System.Nullable`1[int]],long[]):long[]
(+53 bytes, +9.74 %) Microsoft.FSharp.Collections.ArrayModule:Partition[byte](Microsoft.FSharp.Core.FSharpFunc`2[byte,bool],byte[]):System.Tuple`2[byte[],byte[]]
(+35 bytes, +6.24 %) Microsoft.FSharp.Collections.ArrayModule:Partition[short](Microsoft.FSharp.Core.FSharpFunc`2[short,bool],short[]):System.Tuple`2[short[],short[]]
(+50 bytes, +9.31 %) Microsoft.FSharp.Collections.ArrayModule:Partition[int](Microsoft.FSharp.Core.FSharpFunc`2[int,bool],int[]):System.Tuple`2[int[],int[]]
(+35 bytes, +6.11 %) Microsoft.FSharp.Collections.ArrayModule:Partition[double](Microsoft.FSharp.Core.FSharpFunc`2[double,bool],double[]):System.Tuple`2[double[],double[]]
(+35 bytes, +6.31 %) Microsoft.FSharp.Collections.ArrayModule:Partition[long](Microsoft.FSharp.Core.FSharpFunc`2[long,bool],long[]):System.Tuple`2[long[],long[]]
(+47 bytes, +15.26 %) Microsoft.FSharp.Collections.ArrayModule:Truncate[byte](int,byte[]):byte[]
(+38 bytes, +12.26 %) Microsoft.FSharp.Collections.ArrayModule:Truncate[short](int,short[]):short[]
(+38 bytes, +14.73 %) Microsoft.FSharp.Collections.ArrayModule:Truncate[int](int,int[]):int[]
(+38 bytes, +12.03 %) Microsoft.FSharp.Collections.ArrayModule:Truncate[double](int,double[]):double[]
(+38 bytes, +12.42 %) Microsoft.FSharp.Collections.ArrayModule:Truncate[long](int,long[]):long[]
(+115 bytes, +19.07 %) Microsoft.FSharp.Collections.ArrayModule:splitAt$cont@170[byte](int,byte[],Microsoft.FSharp.Core.Unit):System.Tuple`2[byte[],byte[]]
(+67 bytes, +10.86 %) Microsoft.FSharp.Collections.ArrayModule:splitAt$cont@170[short](int,short[],Microsoft.FSharp.Core.Unit):System.Tuple`2[short[],short[]]
(+80 bytes, +15.04 %) Microsoft.FSharp.Collections.ArrayModule:splitAt$cont@170[int](int,int[],Microsoft.FSharp.Core.Unit):System.Tuple`2[int[],int[]]
(+67 bytes, +10.84 %) Microsoft.FSharp.Collections.ArrayModule:splitAt$cont@170[double](int,double[],Microsoft.FSharp.Core.Unit):System.Tuple`2[double[],double[]]
(+95 bytes, +16.18 %) Microsoft.FSharp.Collections.ArrayModule:splitAt$cont@170[long](int,long[],Microsoft.FSharp.Core.Unit):System.Tuple`2[long[],long[]]
(+54 bytes, +11.16 %) Microsoft.FSharp.Collections.SeqModule:toArray$cont@1026[byte](System.Collections.Generic.IEnumerator`1[byte],Microsoft.FSharp.Core.Unit):byte[]
(+38 bytes, +7.63 %) Microsoft.FSharp.Collections.SeqModule:toArray$cont@1026[short](System.Collections.Generic.IEnumerator`1[short],Microsoft.FSharp.Core.Unit):short[]
(+38 bytes, +7.69 %) Microsoft.FSharp.Collections.SeqModule:toArray$cont@1026[int](System.Collections.Generic.IEnumerator`1[int],Microsoft.FSharp.Core.Unit):int[]
(+41 bytes, +8.15 %) Microsoft.FSharp.Collections.SeqModule:toArray$cont@1026[double](System.Collections.Generic.IEnumerator`1[double],Microsoft.FSharp.Core.Unit):double[]
(+38 bytes, +7.69 %) Microsoft.FSharp.Collections.SeqModule:toArray$cont@1026[long](System.Collections.Generic.IEnumerator`1[long],Microsoft.FSharp.Core.Unit):long[]
(+48 bytes, +15.89 %) Microsoft.FSharp.Collections.SeqModule:nextChunk@1812[byte](int,System.Collections.Generic.IEnumerator`1[byte],Microsoft.FSharp.Core.Unit):byte[]
(+40 bytes, +13.33 %) Microsoft.FSharp.Collections.SeqModule:nextChunk@1812[short](int,System.Collections.Generic.IEnumerator`1[short],Microsoft.FSharp.Core.Unit):short[]
(+38 bytes, +12.84 %) Microsoft.FSharp.Collections.SeqModule:nextChunk@1812[int](int,System.Collections.Generic.IEnumerator`1[int],Microsoft.FSharp.Core.Unit):int[]
(+38 bytes, +10.76 %) Microsoft.FSharp.Collections.SeqModule:nextChunk@1812[double](int,System.Collections.Generic.IEnumerator`1[double],Microsoft.FSharp.Core.Unit):double[]
(+38 bytes, +12.75 %) Microsoft.FSharp.Collections.SeqModule:nextChunk@1812[long](int,System.Collections.Generic.IEnumerator`1[long],Microsoft.FSharp.Core.Unit):long[]
(+67 bytes, +23.43 %) Microsoft.CodeAnalysis.VisualBasic.Syntax.KeywordTable:EnsureHalfWidth(System.String):System.String
(+35 bytes, +11.63 %) Microsoft.CodeAnalysis.BitVector:AllSet(int):Microsoft.CodeAnalysis.BitVector
(+35 bytes, +1.23 %) Microsoft.CodeAnalysis.Emit.DeltaMetadataWriter:GetDelta(Microsoft.CodeAnalysis.Compilation,System.Guid,System.Reflection.Metadata.Ecma335.MetadataSizes):Microsoft.CodeAnalysis.Emit.EmitBaseline:this
(+35 bytes, +25.55 %) System.Text.ValueStringBuilder:Append(char,int):this
(+39 bytes, +48.15 %) Microsoft.VisualBasic.CompilerServices.NewLateBinding:ResetCopyback(bool[])
(+44 bytes, +37.61 %) Microsoft.VisualBasic.CompilerServices.OverloadResolution:CreateMatchTable(int,int):bool[]
(+47 bytes, +6.19 %) Microsoft.VisualBasic.CompilerServices.OverloadResolution:ReorderArgumentArray(Microsoft.VisualBasic.CompilerServices.Symbols+Method,System.Object[],System.Object[],bool[],int)
(+47 bytes, +0.31 %) Microsoft.VisualBasic.CompilerServices.VBBinder:BindToMethod(int,System.Reflection.MethodBase[],byref,System.Reflection.ParameterModifier[],System.Globalization.CultureInfo,System.String[],byref):System.Reflection.MethodBase:this
(+62 bytes, +5.12 %) Microsoft.VisualBasic.CompilerServices.VBBinder:CreateParamOrder(bool,int[],System.Reflection.ParameterInfo[],System.Object[],System.String[]):System.Exception:this
(+43 bytes, +9.07 %) Newtonsoft.Json.Utilities.CollectionUtils:CopyFromJaggedToMultidimensionalArray(System.Collections.IList,System.Array,int[])
(+41 bytes, +4.67 %) Newtonsoft.Json.Serialization.JsonSerializerInternalWriter:SerializeMultidimensionalArray(Newtonsoft.Json.JsonWriter,System.Array,Newtonsoft.Json.Serialization.JsonArrayContract,Newtonsoft.Json.Serialization.JsonProperty,int,int[]):this
(+65 bytes, +38.69 %) System.Collections.Immutable.ImmutableArray`1+Builder[byte]:AddRange[byte](System.ReadOnlySpan`1[byte]):this
(+38 bytes, +22.35 %) System.Collections.Immutable.ImmutableArray`1+Builder[short]:AddRange[short](System.ReadOnlySpan`1[short]):this
(+38 bytes, +22.89 %) System.Collections.Immutable.ImmutableArray`1+Builder[int]:AddRange[int](System.ReadOnlySpan`1[int]):this
(+38 bytes, +21.84 %) System.Collections.Immutable.ImmutableArray`1+Builder[double]:AddRange[double](System.ReadOnlySpan`1[double]):this
(+38 bytes, +22.62 %) System.Collections.Immutable.ImmutableArray`1+Builder[long]:AddRange[long](System.ReadOnlySpan`1[long]):this
(+35 bytes, +40.23 %) System.Data.SqlTypes.SqlDecimal:MpMove(System.ReadOnlySpan`1[uint],int,System.Span`1[uint],byref)
(+108 bytes, +6.36 %) System.Data.SqlTypes.SqlDecimal:MpDiv(System.ReadOnlySpan`1[uint],int,System.Span`1[uint],int,System.Span`1[uint],byref,System.Span`1[uint],byref)
(+132 bytes, +12.94 %) System.PasteArguments:AppendArgument(byref,System.String)
(+35 bytes, +25.55 %) System.Text.ValueStringBuilder:Append(char,int):this
(+77 bytes, +9.45 %) System.Diagnostics.ProcessUtils:GetNextArgument(System.String,byref):System.String
(+19 bytes, n/a) System.DirectoryServices.Protocols.LdapConnection:Finalize():this
(+117 bytes, +4.96 %) System.Net.Http.HttpTelemetry:OnEventCommand(System.Diagnostics.Tracing.EventCommandEventArgs):this
(+38 bytes, +5.39 %) System.Net.NameResolutionTelemetry:OnEventCommand(System.Diagnostics.Tracing.EventCommandEventArgs):this
(+195 bytes, +6.40 %) System.Net.Security.NetSecurityTelemetry:OnEventCommand(System.Diagnostics.Tracing.EventCommandEventArgs):this
(+165 bytes, +150.00 %) System.Numerics.Tensors.TensorOperation+SumOfSquaredDifferences`1[byte]:Invoke(System.ReadOnlySpan`1[byte],byte,System.Span`1[byte])
(+64 bytes, +57.66 %) System.Numerics.Tensors.TensorOperation+SumOfSquaredDifferences`1[short]:Invoke(System.ReadOnlySpan`1[short],short,System.Span`1[short])
(+67 bytes, +69.07 %) System.Numerics.Tensors.TensorOperation+SumOfSquaredDifferences`1[int]:Invoke(System.ReadOnlySpan`1[int],int,System.Span`1[int])
(+83 bytes, +94.32 %) System.Numerics.Tensors.TensorOperation+SumOfSquaredDifferences`1[double]:Invoke(System.ReadOnlySpan`1[double],double,System.Span`1[double])
(+67 bytes, +69.07 %) System.Numerics.Tensors.TensorOperation+SumOfSquaredDifferences`1[long]:Invoke(System.ReadOnlySpan`1[long],long,System.Span`1[long])
(+47 bytes, n/a) System.Byte:ToString(System.String,System.IFormatProvider):System.String:this
(+43 bytes, +7.52 %) System.DefaultBinder:CreateParamOrder(int[],System.ReadOnlySpan`1[System.Reflection.ParameterInfo],System.String[]):bool
(+132 bytes, +12.94 %) System.PasteArguments:AppendArgument(byref,System.String)
(+119 bytes, +10.21 %) System.Globalization.CalendarData:NormalizeDatePattern(System.String):System.String
(+35 bytes, +25.55 %) System.Text.ValueStringBuilder:Append(char,int):this
(+37 bytes, +6.60 %) System.Runtime.CompilerServices.ConditionalWeakTable`2+Container[System.__Canon,System.__Canon]:Resize(int):System.Runtime.CompilerServices.ConditionalWeakTable`2+Container[System.__Canon,System.__Canon]:this
(+35 bytes, +23.81 %) System.Diagnostics.Tracing.EventCounter:.ctor(System.String,System.Diagnostics.Tracing.EventSource):this
(+35 bytes, +25.55 %) System.Text.ValueStringBuilder:Append(char,int):this
(+53 bytes, +31.18 %) System.Reflection.Metadata.MetadataReader:CombineRowCounts(int[],int[],byte):int[]
(+37 bytes, +7.58 %) System.Reflection.TypeLoading.GetTypeCoreCache+Container:Resize():this
(+48 bytes, +39.67 %) System.Text.ValueStringBuilder`1[byte]:Append(byte,int):this
(+35 bytes, +25.55 %) System.Text.ValueStringBuilder`1[short]:Append(short,int):this
(+35 bytes, +25.74 %) System.Text.ValueStringBuilder`1[int]:Append(int,int):this
(+44 bytes, +30.34 %) System.Text.ValueStringBuilder`1[double]:Append(double,int):this
(+35 bytes, +25.74 %) System.Text.ValueStringBuilder`1[long]:Append(long,int):this
(+36 bytes, +28.12 %) System.Numerics.NumericsHelpers:DangerousMakeOnesComplement(System.Span`1[nuint])
(+34 bytes, +40.48 %) System.Text.RegularExpressions.Match:Reset(System.String,int):this
(+35 bytes, +25.55 %) System.Text.ValueStringBuilder:Append(char,int):this
(+34 bytes, +40.48 %) System.Text.RegularExpressions.Match:Reset(System.String,int):this
(+58 bytes, +34.32 %) System.Text.RegularExpressions.Symbolic.BitVector:And(System.Text.RegularExpressions.Symbolic.BitVector,System.Text.RegularExpressions.Symbolic.BitVector):System.Text.RegularExpressions.Symbolic.BitVector
(+58 bytes, +34.32 %) System.Text.RegularExpressions.Symbolic.BitVector:Or(System.Text.RegularExpressions.Symbolic.BitVector,System.Text.RegularExpressions.Symbolic.BitVector):System.Text.RegularExpressions.Symbolic.BitVector
(+36 bytes, +16.98 %) System.Text.RegularExpressions.Symbolic.BitVector:Not(System.Text.RegularExpressions.Symbolic.BitVector):System.Text.RegularExpressions.Symbolic.BitVector
(+51 bytes, +5.48 %) Xunit.Serialization.XunitSerializationInfo+ArraySerializer:Deserialize(Xunit.Abstractions.IXunitSerializationInfo):this
(+51 bytes, +5.48 %) Xunit.Serialization.XunitSerializationInfo+ArraySerializer:Deserialize(Xunit.Abstractions.IXunitSerializationInfo):this

It seems there are some interesting spots in the tensor and regex libraries that could be manually vectorized.

cc: @tannergooding @stephentoub

hez2010 · 2026-05-06T23:20:42Z

Diffs

The final TP impact seems to be +0.12% to +0.38% for fullopts.

hez2010 · 2026-05-06T23:22:58Z

Closing as I've got everything I was curious about in this experiment.

EgorBo · 2026-05-07T00:17:50Z

Diffs

The final TP impact seems to be +0.12% to +0.38% for fullopts.

I think it's fine to accept a much bigger TP regression for a proper auto-vec. The problem with your diffs (if they're correct; did CI finish?) is that they violate the memory model, as I said on Discord, e.g.:

public void Double(int[] array, int size)
{
    for (int i = 0; i < size; i++)
    {
        array[i] = array[i] * 2;
    }
}

What exactly makes it legal to fold this into a SIMD loop like the diffs in your SPMI report show?

Also, I inspected a few examples and noticed that it vectorized various "let's handle the remaining elements via a plain loop" tails, so presumably an auto-vec like that should rely on PGO or general assertions about the possible size.
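
For readers following along, the rewrite being questioned can be written out by hand. This is only a source-level sketch using `System.Numerics.Vector<T>`, not the JIT's actual output; the class and method names are hypothetical, and it assumes `0 <= size <= array.Length`. It produces the same result as the scalar loop when run on a single thread, and it does not exercise what other threads may observe, which is what the memory-model question is about.

```csharp
using System;
using System.Numerics;

public static class DoubleLoopSketch
{
    // Scalar reference: the loop from the comment above.
    public static void DoubleScalar(int[] array, int size)
    {
        for (int i = 0; i < size; i++)
        {
            array[i] = array[i] * 2;
        }
    }

    // Hand-written "vector loop plus scalar epilogue" shape.
    // Assumption: 0 <= size <= array.Length.
    public static void DoubleVectorShape(int[] array, int size)
    {
        int lanes = Vector<int>.Count; // lane count depends on the hardware
        int i = 0;

        // Vector loop: each iteration loads, doubles, and stores `lanes` elements at once.
        for (; i <= size - lanes; i += lanes)
        {
            var v = new Vector<int>(array, i);
            (v * 2).CopyTo(array, i);
        }

        // Scalar epilogue: handles the remaining size % lanes elements.
        for (; i < size; i++)
        {
            array[i] = array[i] * 2;
        }
    }
}
```

Calling `DoubleVectorShape(array, array.Length)` and `DoubleScalar(array, array.Length)` on copies of the same input yields identical arrays in a single-threaded run. Each vector store writes several elements in one operation, so the sketch says nothing about per-element visibility to concurrent readers.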

tannergooding · 2026-05-07T01:31:25Z

is that they violate the memory model

Notably, the main issue is the violation of atomicity guarantees. Most hardware does not guarantee per-element atomicity of general SIMD loads/stores. While Intel/AMD and Arm64 each guarantee it for some subset of scenarios, those are typically outside what the GC allows us to assert.

The loading of Count elements up front is fine so long as it maintains single-threaded consistency, so it's safe in this example because we know array[i] cannot alias array[i+1]. This would not be safe with ROSpan<T> source and Span<T> dest, as they could overlap without being the same source.
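
The overlap hazard can be reproduced without any SIMD API. The sketch below is an illustration, not code from the PR: it emulates a 4-lane vector by reading four source elements into a scratch buffer before writing any results. The lane count of 4 and all names are assumptions chosen to keep the outcome independent of the machine's vector width.

```csharp
using System;

public static class OverlapSketch
{
    // Element-by-element loop: each store is visible to the next iteration's load.
    public static void DoubleScalar(ReadOnlySpan<int> src, Span<int> dst)
    {
        for (int i = 0; i < src.Length; i++)
        {
            dst[i] = src[i] * 2;
        }
    }

    // Emulates a 4-lane vector loop: all lanes are loaded before any lane is stored.
    // Assumption: dst.Length >= src.Length.
    public static void DoubleBlocked(ReadOnlySpan<int> src, Span<int> dst)
    {
        const int Lanes = 4;
        Span<int> lanes = stackalloc int[Lanes];
        int i = 0;

        for (; i + Lanes <= src.Length; i += Lanes)
        {
            src.Slice(i, Lanes).CopyTo(lanes);   // emulated vector load
            for (int l = 0; l < Lanes; l++)
            {
                dst[i + l] = lanes[l] * 2;       // emulated vector store
            }
        }

        // Scalar tail for the remaining elements.
        for (; i < src.Length; i++)
        {
            dst[i] = src[i] * 2;
        }
    }
}
```

With `data = { 1, 2, 3, 4, 5, 6, 7, 8, 9 }`, `src = data.AsSpan(0, 8)` and `dst = data.AsSpan(1, 8)`, the scalar loop leaves `data` as `1, 2, 4, 8, 16, 32, 64, 128, 256`, because each store feeds the next load. The blocked version leaves `1, 2, 4, 6, 8, 16, 12, 14, 16`. The two disagree from index 3 onward, so single-threaded consistency is lost once source and destination overlap at different offsets.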

hez2010 · 2026-05-07T01:55:15Z

The loading of Count elements up front is fine so long as it maintains single-threaded consistency, so it's safe in this example because we know array[i] cannot alias array[i+1]. This would not be safe with ROSpan<T> source and Span<T> dest, as they could overlap without being the same source.

Yeah. And there's a simple conservative aliasing check to guard the overlapping cases in this prototype.
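
The prototype's check operates on JIT IR and is not shown in this thread. As a rough source-level analogue only, a conservative guard could look like the sketch below, which uses the existing `MemoryExtensions.Overlaps` span extension and falls back to the scalar loop whenever the two ranges share any memory. The class and method names are hypothetical.

```csharp
using System;
using System.Numerics;

public static class GuardedDoubleSketch
{
    // Doubles src into dst; takes the vector path only when the ranges are disjoint.
    public static void DoubleGuarded(ReadOnlySpan<int> src, Span<int> dst)
    {
        if (dst.Length < src.Length)
        {
            throw new ArgumentException("Destination is too short.", nameof(dst));
        }

        dst = dst.Slice(0, src.Length);
        ReadOnlySpan<int> dstView = dst;
        int i = 0;

        // Conservative: any overlap at all (even an exact in-place match) stays scalar.
        if (!src.Overlaps(dstView))
        {
            int lanes = Vector<int>.Count;
            for (; i <= src.Length - lanes; i += lanes)
            {
                var v = new Vector<int>(src.Slice(i));
                (v * 2).CopyTo(dst.Slice(i));
            }
        }

        // Scalar path: the whole loop when overlapping, otherwise just the tail.
        for (; i < src.Length; i++)
        {
            dst[i] = src[i] * 2;
        }
    }
}
```

The guard is deliberately stricter than necessary: an exact in-place update, where source and destination start at the same element, would also be safe to vectorize, but rejecting every overlap keeps the check to a single range comparison.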

hez2010 added 30 commits May 5, 2026 17:38

Add auto vectorizer phase skeleton

1ede29c

Recognize auto vectorization loop candidates

cd2494d

Analyze vectorizable int array accesses

71de3ce

Build SLP plan for auto vectorization

1581429

Rewrite post-IV int array loops with SIMD

393e483

Extend auto vectorization loop support

11381ed

Broaden auto vectorization loop coverage

b6dc9a5

Normalize auto vectorizer diagnostics

591f230

Clean up auto vectorization phase

1184e24

Flip the auto vectorization switch

06a993d

Generalize auto vectorizer SLP packs

4860feb

Recognize local-limit auto vectorization loops

959dcab

Vectorize simple integer reductions

28abfc4

Run jit-format on auto vectorizer

f25800d

Guard auto vectorizer pack recursion

e6000c7

Broaden auto vectorizer loop coverage

cfcf130

Add auto vectorizer width policy

5c5d14d

Broaden auto vectorizer operation coverage

e43c1c8

Broaden auto vectorizer loop legality

1f4f293

Support descending auto-vectorized loops

405da8a

Fix auto vectorizer local-limit guards

bda28a5

Recognize scalar FMA for auto vectorization

e4ad0df

Complete scalar FMA auto-vectorization

76bf16d

Broaden auto vectorizer reductions

a0cb32e

Strengthen auto vectorizer bounds proofs

24f910d

Format auto vectorizer changes

fdb9751

Allow compatible small integer vector packs

0eaf1a2

Handle not-equal auto vector loop tests

bc6c396

Handle conditional-entry descending vector loops

2d55ae2

Improve min max reduction vectorization

3345e72

github-actions Bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) May 6, 2026

dotnet-policy-service Bot added the community-contribution label (Indicates that the PR has been added by a community member) May 6, 2026

MihuBot mentioned this pull request May 6, 2026

[JitDiff X64] [hez2010] [NO-REVIEW] [NO-MERGE] Auto vectorization experiment MihuBot/runtime-utils#1880

Open

hez2010 changed the title ~~[NO-REVIEW] [NO-MERGE] Auto vectorization experiment~~ [NO-REVIEW] [NO-MERGE] Auto loop vectorization experiment May 6, 2026

Copilot AI reviewed May 6, 2026


Guard auto vectorizer compare intrinsics by target

edf84f7

MihuBot mentioned this pull request May 6, 2026

[JitDiff X64] [hez2010] [NO-REVIEW] [NO-MERGE] Auto loop vectorization exper ... MihuBot/runtime-utils#1881

Open

Avoid unsafe loop def queries in auto vectorizer

02fd1cb

Copilot AI review requested due to automatic review settings May 6, 2026 11:10

MihuBot mentioned this pull request May 6, 2026

[JitDiff X64] [hez2010] [NO-REVIEW] [NO-MERGE] Auto loop vectorization exper ... MihuBot/runtime-utils#1882

Open

Copilot AI reviewed May 6, 2026


Synthesize profile after auto vectorization phase

9eb0094

MihuBot mentioned this pull request May 6, 2026

[JitDiff X64] [hez2010] [NO-REVIEW] [NO-MERGE] Auto loop vectorization exper ... MihuBot/runtime-utils#1883

Open

Reject checked arithmetic in auto vectorizer

f016e5d

Copilot AI review requested due to automatic review settings May 6, 2026 14:45

Copilot AI reviewed May 6, 2026


This was referenced May 6, 2026

CI failure on iOS: test process was killed because of OS_REASON_CODESIGNING #127867

Open

[ios-arm64 Release AllSubsets_CoreCLR_Smoke] The app 'net.dot.System.Runtime.Tests' terminated with signal 9 #127872

Open

hez2010 closed this May 6, 2026
