
[NO-REVIEW] [NO-MERGE] Auto loop vectorization experiment#127853

Closed
hez2010 wants to merge 35 commits into dotnet:main from hez2010:unroll-slp-vec

Contributor

@hez2010 hez2010 commented May 6, 2026

Note

This is a fully vibe-coded experiment that has had neither a careful correctness review nor extensive testing.
It is not intended for review or merging. I'm opening this PR to evaluate its actual impact and to find potential vectorization opportunities within the BCL.

Local SPMI Run Study

Headline:

| Mode | Loops vectorized | TP base instr | TP diff instr | TP delta | TP pct | ActualCodeBytes delta | Asm diff contexts | Missing compiles note |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Default policy | 678 | 2,843,264,317,554 | 2,854,738,112,877 | +11,473,795,323 | +0.40% | +44,604 | 597 | asmdiff missing base=322, diff=313 |
| Aggressive policy | 681 | 2,843,259,400,659 | 2,854,732,623,273 | +11,473,222,614 | +0.40% | +44,833 | 600 | asmdiff missing base=322, diff=313 |

Default policy has profitability checks for opportunity analysis, and it will choose vector width based on pressure; aggressive policy bypasses the checks and always uses the maximum available vector size.

Complete report with asm diffs:

autovec-binary-release-asm-metrics-report.md

Artifacts including spmi logs and per-method diffs:

autovec-binary-release-artifacts-with-dasm.zip

cc: @dotnet/jit-contrib


Note

The following content is AI generated.

Summary

This change adds a late HIR auto-vectorization phase to RyuJIT. The phase recognizes profitable counted loops, builds a virtual-lane SLP plan from the scalar loop body, and rewrites the loop into a vector loop plus scalar epilogue. The generated IR uses existing SIMD/HW intrinsic nodes so rationalization, lowering, LSRA, and codegen continue to own target-specific SIMD expansion.

The vectorizer is enabled by default via JitAutoVectorization=1. A second knob, JitAggressiveVectorizing=1, bypasses the profitability policy for investigation and opportunity measurement.

Phase Placement

The new phase is wired as:

VN-DSE
If Conversion
Auto vectorization
Optimize pre-layout
Rationalization
Lowering
LSRA
Codegen

More concretely, PHASE_AUTO_VECTORIZATION runs after VN-based dead-store removal and if-conversion, and before pre-layout flow opts and rationalization.

This placement is intentional:

  • It runs after loop canonicalization, SSA/VN optimizations, range checks, assertion propagation, range analysis, and IV optimization have already simplified loops.
  • It runs after VN-DSE, so the vectorizer does not need to preserve stale SSA/VN state for later VN consumers.
  • It runs after if-conversion, so simple scalar conditional expressions can appear as GT_SELECT and be packed by SLP.
  • It runs before rationalization, while loops are still HIR BasicBlock / Statement / GenTree form and can be rewritten structurally.

After rewriting, the phase marks loop/flow/liveness-sensitive state stale and relies on the normal downstream pipeline to repair/consume the resulting HIR.

Design

The implementation is centered on AutoVectorizer in jit/autovectorizer.cpp.

The core pipeline is:

  1. Recompute the loop table.
  2. Visit natural loops in post-order.
  3. Recognize a supported counted-loop shape.
  4. Analyze memory accesses and loop-carried dependences.
  5. Build a virtual-lane SLP plan.
  6. Select the target vector width using the cost policy.
  7. Rewrite the loop into:
    • vector-entry check,
    • vector body,
    • optional runtime overlap checks,
    • scalar epilogue guard,
    • original scalar loop as the epilogue.
  8. Record Metrics.LoopsVectorized.
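
To make the shape produced by step 7 concrete, here is a minimal source-level sketch of a loop after such a rewrite. It is only an illustration, not code from this PR: `kVF`, `VectorBodyAdd`, and `AddArraysVectorized` are hypothetical names, the "vector body" is emulated with a fixed-width inner loop rather than SIMD nodes, and the optional runtime overlap checks are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical vectorization factor: 4 lanes of uint32_t (one 128-bit vector).
constexpr std::size_t kVF = 4;

// Stand-in for one vector iteration. A SIMD backend would express this as a
// single vector load/add/store; here the lanes are spelled out in a loop.
static void VectorBodyAdd(uint32_t* dst, const uint32_t* a, const uint32_t* b, std::size_t i)
{
    for (std::size_t lane = 0; lane < kVF; lane++)
    {
        dst[i + lane] = a[i + lane] + b[i + lane];
    }
}

// Rewritten form of: for (i = 0; i < n; i++) dst[i] = a[i] + b[i];
// Returns how many vector iterations ran (handy for checking coverage).
std::size_t AddArraysVectorized(uint32_t* dst, const uint32_t* a, const uint32_t* b, std::size_t n)
{
    assert((n == 0) || (dst != nullptr && a != nullptr && b != nullptr));

    std::size_t i                = 0;
    std::size_t vectorIterations = 0;

    // Vector-entry check: a full vector must fit in the remaining range.
    // Written as (n - i) so the comparison cannot overflow; i <= n always holds.
    while (n - i >= kVF)
    {
        VectorBodyAdd(dst, a, b, i);
        i += kVF;
        vectorIterations++;
    }

    // Scalar epilogue: the original scalar loop, resumed at the current i.
    for (; i < n; i++)
    {
        dst[i] = a[i] + b[i];
    }

    return vectorIterations;
}
```

With `n = 10` this runs two vector iterations and leaves two elements for the scalar epilogue; with `n < kVF` the vector loop is skipped entirely and only the original loop runs.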

The SLP planner does not materialize scalar unrolling in HIR. Instead, it reasons about virtual lanes and directly emits vector IR for the accepted pack:

scalar expression for i
  -> virtual lanes i + 0 ... i + VF - 1
  -> SLP pack
  -> vector load/op/store or vector reduction update

This keeps unsuccessful candidates cheap and avoids expanding scalar IR just to discover that the loop is not vectorizable.
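
The idea of planning on virtual lanes without unrolling can be sketched roughly as below. This is an assumption-heavy toy model, not the PR's planner: `ScalarExpr`, `ScalarKind`, `VectorOp`, and `TryBuildPackPlan` are hypothetical names, and the supported operator set is a small subset of the list given later in the description.

```cpp
#include <cassert>
#include <vector>

// Hypothetical scalar node kinds for the statement evaluated at index i.
enum class ScalarKind { LoadAtIv, Constant, InvariantLocal, Add, Mul, Mod };

// Hypothetical vector operations the plan can contain.
enum class VectorOp { VectorLoad, SplatConstant, SplatInvariant, VectorAdd, VectorMul };

struct ScalarExpr
{
    ScalarKind        kind;
    const ScalarExpr* op1 = nullptr;
    const ScalarExpr* op2 = nullptr;
};

// Walks the scalar tree once and appends the matching vector ops in
// post-order. No scalar IR is cloned per lane; lanes i + 0 ... i + VF - 1 are
// implied by the choice of vector op.
static bool TryPlanNode(const ScalarExpr* expr, std::vector<VectorOp>& plan)
{
    assert(expr != nullptr);
    switch (expr->kind)
    {
        case ScalarKind::LoadAtIv:
            plan.push_back(VectorOp::VectorLoad);
            return true;
        case ScalarKind::Constant:
            plan.push_back(VectorOp::SplatConstant);
            return true;
        case ScalarKind::InvariantLocal:
            plan.push_back(VectorOp::SplatInvariant);
            return true;
        case ScalarKind::Add:
        case ScalarKind::Mul:
            if (!TryPlanNode(expr->op1, plan) || !TryPlanNode(expr->op2, plan))
            {
                return false;
            }
            plan.push_back(expr->kind == ScalarKind::Add ? VectorOp::VectorAdd : VectorOp::VectorMul);
            return true;
        case ScalarKind::Mod:
            return false; // unsupported in this sketch: reject, do not guess
    }
    return false;
}

// Returns true and fills `plan` when every node can be packed; otherwise
// leaves `plan` empty so a rejected candidate costs nothing downstream.
bool TryBuildPackPlan(const ScalarExpr& root, std::vector<VectorOp>& plan)
{
    plan.clear();
    if (!TryPlanNode(&root, plan))
    {
        plan.clear();
        return false;
    }
    return true;
}
```

A rejected candidate only costs one tree walk and a discarded list, which is the cheap-failure property the description is after.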

Supported Targets and Width Selection

The phase is enabled for optimized, non-debuggable compilations on SIMD-capable xarch and arm64 targets.

Vector width selection uses the maximum hardware-supported SIMD width for the selected element type, subject to the profitability policy:

  • xarch:
    • 512-bit when AVX512 is available and profitable,
    • otherwise 256-bit when AVX2 is available and profitable,
    • otherwise 128-bit.
  • arm64:
    • 128-bit AdvSIMD.

The policy considers estimated scalar/vector cost, loop overhead, constant trip count, block hotness, simple memory-loop shape, vector pressure, reduction presence, and code size. JitAggressiveVectorizing=1 bypasses this policy and selects the first legal vector width, which is useful for finding missed opportunities and comparing the production policy against the legal maximum.
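
As a rough sketch of the xarch width ladder described above (not the PR's actual cost policy): `TargetCaps` and `SelectVectorSizeBytes` are hypothetical names, and the whole profitability model is reduced to a caller-supplied predicate. Whether the 128-bit fallback can itself be rejected is not modeled here.

```cpp
#include <cassert>
#include <functional>

// Hypothetical description of what the target supports.
struct TargetCaps
{
    bool hasAvx512;
    bool hasAvx2;
};

// Tries the widest legal width first and falls back step by step.
// `isProfitable` stands in for the cost policy; passing a predicate that
// always returns true approximates the aggressive mode.
unsigned SelectVectorSizeBytes(const TargetCaps& caps, const std::function<bool(unsigned)>& isProfitable)
{
    assert(isProfitable != nullptr);

    if (caps.hasAvx512 && isProfitable(64))
    {
        return 64; // 512-bit
    }
    if (caps.hasAvx2 && isProfitable(32))
    {
        return 32; // 256-bit
    }
    return 16; // 128-bit baseline
}
```

For example, on a target with AVX512 where the predicate only accepts widths up to 32 bytes, the sketch picks 256-bit vectors, while an always-true predicate picks 512-bit.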

Covered Loop Shapes

The vectorizer currently handles conservative natural-loop forms:

  • single-entry natural loops,
  • one backedge,
  • one normal exit,
  • no EH participation in the preheader, loop, or exit,
  • innermost loops,
  • canonical counted loops recognized by loop analysis,
  • post-IV strength-reduced loops produced by IV opts,
  • local-limit loops where the loop test compares locals directly,
  • forward and descending unit-stride loops,
  • <, <=, >, >=, and selected != counted-loop tests,
  • conditional preheader entries for supported post-IV/local-limit forms,
  • scalar epilogue for tails.

The phase deliberately rejects unsupported or risky CFG shapes such as EH loops, non-innermost loops, multi-exit loops, and == loop termination.

Covered Memory Forms

The memory analysis supports contiguous element access through:

  • single-dimensional array address forms,
  • byref plus index forms,
  • post-IV local-address forms from strength reduction,
  • span-like and readonly-span-like morphed byref addressing,
  • mixed array/span/byref cases when the access and limit proof are recognized,
  • multiple loads and multiple stores within the fixed analysis budgets,
  • same-base/same-offset read-modify-write,
  • obviously safe different-offset access patterns,
  • runtime overlap checks for selected post-IV alias cases.

The vectorizer rejects volatile accesses, unsupported element types, remaining unproven bounds checks, unsupported address expressions, and dependence patterns that could change scalar semantics.
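
A simplified model of the "contiguous access" proof for an index expression might look like the following. It is a sketch under stated assumptions: `IndexExpr`, `TryAnalyzeIndex`, `IsUnitStrideIndex`, and the depth budget `kMaxDepth` are hypothetical, only `iv + constant` shapes are modeled, and overflow of the accumulated offset is ignored.

```cpp
#include <cassert>
#include <cstdint>

enum class IndexKind { InductionVar, Constant, Add, Other };

struct IndexExpr
{
    IndexKind        kind;
    int64_t          value = 0; // only meaningful for Constant
    const IndexExpr* op1   = nullptr;
    const IndexExpr* op2   = nullptr;
};

// Accumulates constant offsets and counts induction-variable uses.
// Anything that is not IV, constant, or ADD is rejected.
static bool TryAnalyzeIndex(const IndexExpr* expr, int64_t& offset, int& ivCount, int depth = 0)
{
    assert(expr != nullptr);
    constexpr int kMaxDepth = 8; // hypothetical analysis budget
    if (depth > kMaxDepth)
    {
        return false;
    }

    switch (expr->kind)
    {
        case IndexKind::InductionVar:
            ivCount++;
            return true;
        case IndexKind::Constant:
            offset += expr->value;
            return true;
        case IndexKind::Add:
            return TryAnalyzeIndex(expr->op1, offset, ivCount, depth + 1) &&
                   TryAnalyzeIndex(expr->op2, offset, ivCount, depth + 1);
        case IndexKind::Other:
            return false;
    }
    return false;
}

// True when the index is exactly one IV plus a constant, so consecutive
// iterations touch consecutive elements. `i + i` is rejected (stride 2).
bool IsUnitStrideIndex(const IndexExpr* expr, int64_t& offset)
{
    offset      = 0;
    int ivCount = 0;
    return TryAnalyzeIndex(expr, offset, ivCount) && (ivCount == 1);
}
```

In this model `(i + 1) + 2` is accepted with offset 3, while `i + i` and any expression containing an unrecognized node are rejected.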

Covered Element Types and Operations

Supported element types include the primitive SIMD element types handled by the existing SIMD/HW intrinsic path, including integral and floating-point element types.

The SLP planner covers:

  • contiguous vector loads and stores,
  • splatted constants,
  • splatted invariant scalar locals,
  • unary ops,
  • binary ops,
  • ternary ops for supported scalar intrinsic patterns,
  • comparisons,
  • GT_SELECT,
  • min/max/abs-style intrinsic patterns where supported,
  • simple reductions.

Reduction support includes vector accumulator setup, vector loop update, and scalar finalization. The implementation supports add/sub reductions and min/max-style reductions for supported element types, including floating-point reduction paths where the scalar semantics are represented by the recognized intrinsic pattern.
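
The three reduction stages (accumulator setup, vector loop update, scalar finalization) can be illustrated with a plain sum. This is a sketch, not the PR's code: `kLanes` and `SumWithLaneAccumulators` are hypothetical, and the lanes are emulated with an array. It deliberately uses unsigned integers, where reassociating the additions is exact; floating-point addition is not associative, so an analogous float sum could produce different results from the scalar loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLanes = 4; // hypothetical vectorization factor

// Sums `n` elements using per-lane partial sums, then a scalar tail.
uint32_t SumWithLaneAccumulators(const uint32_t* data, std::size_t n)
{
    assert((data != nullptr) || (n == 0));

    // 1. Accumulator setup: the identity value (0 for add) in every lane.
    uint32_t acc[kLanes] = {0, 0, 0, 0};

    // 2. Vector loop update: each lane accumulates its own stripe.
    std::size_t i = 0;
    for (; n - i >= kLanes; i += kLanes)
    {
        for (std::size_t lane = 0; lane < kLanes; lane++)
        {
            acc[lane] += data[i + lane];
        }
    }

    // 3. Scalar finalization: horizontal add of the lane accumulators.
    uint32_t sum = 0;
    for (std::size_t lane = 0; lane < kLanes; lane++)
    {
        sum += acc[lane];
    }

    // 4. Scalar epilogue: the leftover elements continue from the
    //    finalized scalar value.
    for (; i < n; i++)
    {
        sum += data[i];
    }

    return sum;
}
```

Because unsigned arithmetic wraps modulo 2^32, the result matches the scalar loop bit for bit even when the sum overflows.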

Unsupported forms are still rejected rather than guessed: non-contiguous/gather/scatter memory, arbitrary casts and widening/narrowing packs, modulo, unsupported division forms, unsupported helper/call shapes, complicated address expressions, and control flow that was not simplified into supported straight-line HIR.

Safety Model

The implementation is intentionally conservative. It rejects a candidate unless legality is clear.

Important safety rules include:

  • Do not vectorize loops in or around EH regions.
  • Do not introduce potentially throwing preheader work.
  • Require memory accesses to be proven contiguous and in-bounds, or reject.
  • Reject volatile accesses and unsupported side effects.
  • Reject unsupported checked/throwing arithmetic.
  • Validate dependence between stores and loads before rewriting.
  • Keep the original scalar loop as the scalar epilogue.
  • Use runtime overlap checks only for selected forms where the vector rewrite can safely fall back to scalar.

The rewrite preserves the original scalar loop for the tail and redirects control flow through the new vector loop only when the vector trip count and alias checks allow it.
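
A runtime overlap check with a scalar fallback can be sketched as follows. The names `RangesOverlap`, `CopyPath`, and `CopyWithOverlapCheck` are hypothetical, the check is a deliberately conservative "any overlap means scalar" test, and the PR's actual alias conditions may be more precise (for example, it treats same-base/same-offset read-modify-write separately).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Two ranges of `bytes` bytes overlap when their start addresses are closer
// together than the range length.
bool RangesOverlap(const void* p, const void* q, std::size_t bytes)
{
    const auto a = reinterpret_cast<std::uintptr_t>(p);
    const auto b = reinterpret_cast<std::uintptr_t>(q);
    const std::uintptr_t distance = (a > b) ? (a - b) : (b - a);
    return distance < bytes;
}

enum class CopyPath { Vector, Scalar };

// for (i = 0; i < n; i++) dst[i] = src[i];  guarded by an overlap check.
CopyPath CopyWithOverlapCheck(uint32_t* dst, const uint32_t* src, std::size_t n)
{
    assert((n == 0) || (dst != nullptr && src != nullptr));
    constexpr std::size_t kVF = 4; // hypothetical vectorization factor

    if (RangesOverlap(dst, src, n * sizeof(uint32_t)))
    {
        // Fallback: the original scalar loop keeps element-by-element order.
        for (std::size_t i = 0; i < n; i++)
        {
            dst[i] = src[i];
        }
        return CopyPath::Scalar;
    }

    std::size_t i = 0;
    for (; n - i >= kVF; i += kVF)
    {
        // Model a vector load followed by a vector store: all lanes are read
        // before any lane is written.
        uint32_t lanes[kVF];
        for (std::size_t lane = 0; lane < kVF; lane++)
        {
            lanes[lane] = src[i + lane];
        }
        for (std::size_t lane = 0; lane < kVF; lane++)
        {
            dst[i + lane] = lanes[lane];
        }
    }
    for (; i < n; i++) // scalar epilogue
    {
        dst[i] = src[i];
    }
    return CopyPath::Vector;
}
```

The overlapping case shows why the fallback matters: copying a buffer onto itself shifted by one element propagates the first value through the whole range in scalar order, which a load-then-store vector body would not reproduce.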

Diagnostics and Metrics

The phase uses normal JitDump output. Dumps include:

  • candidate loop shape,
  • IV/test information,
  • rejection reasons next to the relevant statement/tree dump,
  • accepted SLP pack structure,
  • selected vector size and VF,
  • scalar statements selected for rewrite,
  • generated vector trees/statements,
  • generated CFG edges and branch likelihoods.

This change also adds a new JIT metric LoopsVectorized.

The metric increments once per successfully rewritten loop and can be used by SuperPMI metricdiff to measure vectorization coverage per collection/method.

Files Changed

  • jit/autovectorizer.cpp
  • jit/autovectorizer.h
  • jit/compiler.cpp
  • jit/compiler.h
  • jit/compphases.h
  • jit/jitconfigvalues.h
  • jit/jitmetadatalist.h
  • jit/CMakeLists.txt

Validation

  • Built clr.jit Release with NoPgoOptimize=true.
  • Built separate Release JIT binaries for:
    • auto-vectorization disabled,
    • default policy enabled,
    • aggressive vectorization enabled.
  • Ran smoke tests to check vectorization coverage.
  • Ran SuperPMI throughput diff over the local x64 collections.
  • Running SuperPMI asm/metric diffs and collecting LoopsVectorized metrics for the final report.

@github-actions github-actions Bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 6, 2026
@dotnet-policy-service dotnet-policy-service Bot added the community-contribution Indicates that the PR has been added by a community member label May 6, 2026
@dotnet-policy-service
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@hez2010 hez2010 changed the title from "[NO-REVIEW] [NO-MERGE] Auto vectorization experiment" to "[NO-REVIEW] [NO-MERGE] Auto loop vectorization experiment" May 6, 2026
Contributor

Copilot AI left a comment


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR introduces a new JIT auto-vectorization optimization pass that analyzes and rewrites qualifying loops into SIMD vector loops (with scalar epilogues), along with associated config knobs, phase plumbing, build integration, and perf metrics.

Changes:

  • Add AutoVectorizer implementation and integrate it as a new compilation phase.
  • Introduce new JIT config flags to control auto-vectorization and an “aggressive” mode.
  • Add a new JIT metadata metric to track the number of loops vectorized.

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| src/coreclr/jit/jitmetadatalist.h | Adds a new LoopsVectorized metric to track vectorized loops. |
| src/coreclr/jit/jitconfigvalues.h | Adds config switches to enable/disable auto-vectorization and aggressive vectorizing. |
| src/coreclr/jit/compphases.h | Registers a new PHASE_AUTO_VECTORIZATION phase name. |
| src/coreclr/jit/compiler.h | Adds optAutoVectorize() and grants the new pass friend access. |
| src/coreclr/jit/compiler.cpp | Wires the new phase into the pipeline when optimizations are enabled. |
| src/coreclr/jit/autovectorizer.h | Declares the AutoVectorizer pass and its planning/rewriting machinery. |
| src/coreclr/jit/autovectorizer.cpp | Implements loop analysis, SLP planning, profitability heuristics, and CFG/IR rewrite. |
| src/coreclr/jit/CMakeLists.txt | Adds the new source/header to the JIT build. |

Comment on lines +3823 to +3839
GenTree* AutoVectorizer::BuildVectorReductionOp(LoopVectorizationPlan* plan,
                                                const LoopVectorizationPlan::ReductionInfo& reduction,
                                                GenTree* op1,
                                                GenTree* op2)
{
#if defined(FEATURE_HW_INTRINSICS) && (defined(TARGET_XARCH) || defined(TARGET_ARM64))
    const var_types simdType = Compiler::getSIMDTypeForSize(plan->VectorSizeBytes);
    if (reduction.Oper != GT_INTRINSIC)
    {
        return m_compiler->gtNewSimdBinOpNode(GT_ADD, simdType, op1, op2, plan->ElementType, plan->VectorSizeBytes);
    }

    return BuildVectorMinMaxOp(reduction, op1, op2, simdType, plan->VectorSizeBytes);
#else
    unreached();
#endif
}
Comment on lines +1252 to +1267
for (unsigned i = 0; i < plan->LoadCount; i++)
{
    const LoopVectorizationPlan::ScalarAccess& existing = plan->LoadAccesses[i];
    if ((existing.Address == access.Address) ||
        ((existing.BaseLocalIfKnown == access.BaseLocalIfKnown) &&
         (existing.OffsetLocalIfKnown == access.OffsetLocalIfKnown) &&
         (existing.IndexOffset == access.IndexOffset) && (existing.PostIVOffset == access.PostIVOffset) &&
         (existing.ElementType == access.ElementType) && (existing.IsArray == access.IsArray) &&
         (existing.IsByrefLocal == access.IsByrefLocal) &&
         (existing.IsByrefBaseWithOffset == access.IsByrefBaseWithOffset) &&
         (existing.IsByrefWithIndex == access.IsByrefWithIndex)))
    {
        *index = i;
        return true;
    }
}
CONFIG_STRING(JitObjectStackAllocationTrackFieldsRange, "JitObjectStackAllocationTrackFieldsRange")
CONFIG_INTEGER(JitObjectStackAllocationDumpConnGraph, "JitObjectStackAllocationDumpConnGraph", 0)

RELEASE_CONFIG_INTEGER(JitAutoVectorization, "JitAutoVectorization", 1)
Comment on lines +7 to +10
class AutoVectorizer
{
public:
    explicit AutoVectorizer(Compiler* compiler);
Comment on lines +1292 to +1303
if (first.IsArray && second.IsArray)
{
    return true;
}

if ((first.IsByrefLocal || first.IsByrefBaseWithOffset || first.IsByrefWithIndex) &&
    (second.IsByrefLocal || second.IsByrefBaseWithOffset || second.IsByrefWithIndex))
{
    return true;
}

// Array and byref/span bases can still describe the same storage after morphing.
Comment on lines +4960 to +4965
if (doAutoVectorization)
{
    // Rewrite HIR loops late, after VN-DSE and if-conversion but before rationalization.
    //
    DoPhase(this, PHASE_AUTO_VECTORIZATION, &Compiler::optAutoVectorize);
}
Contributor Author

hez2010 commented May 6, 2026

@MihuBot

Copilot AI review requested due to automatic review settings May 6, 2026 11:10
Contributor Author

hez2010 commented May 6, 2026

@MihuBot

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 6 comments.

const var_types simdType = Compiler::getSIMDTypeForSize(plan->VectorSizeBytes);
if (reduction.Oper != GT_INTRINSIC)
{
    return m_compiler->gtNewSimdBinOpNode(GT_ADD, simdType, op1, op2, plan->ElementType, plan->VectorSizeBytes);
Comment on lines +4476 to +4491
if (tree->OperIs(GT_ADD))
{
    GenTree* op1 = tree->AsOp()->gtOp1;
    GenTree* op2 = tree->AsOp()->gtOp2;

    if (op1->IsCnsIntOrI())
    {
        *offset += static_cast<int>(op1->AsIntConCommon()->IconValue());
        return TryAnalyzeIndexExpr(plan, op2, ivLcl, offset, invariantLcl, sawIv, depth + 1);
    }

    if (op2->IsCnsIntOrI())
    {
        *offset += static_cast<int>(op2->AsIntConCommon()->IconValue());
        return TryAnalyzeIndexExpr(plan, op1, ivLcl, offset, invariantLcl, sawIv, depth + 1);
    }
Comment on lines +3282 to +3294
LclVarDsc* const ivDsc = m_compiler->lvaGetDesc(plan->InductionVar);
GenTree* iv = m_compiler->gtNewLclvNode(plan->InductionVar, ivDsc->TypeGet());
GenTree* end = m_compiler->gtCloneExpr(plan->End);

GenTree* lastLane = m_compiler->gtNewCastNode(TYP_LONG, iv, false, TYP_LONG);
if (plan->VectorizationFactor > 1)
{
    lastLane =
        m_compiler->gtNewOperNode(plan->Step < 0 ? GT_SUB : GT_ADD, TYP_LONG, lastLane,
                                  m_compiler->gtNewLconNode(static_cast<int64_t>(plan->VectorizationFactor - 1)));
}

end = m_compiler->gtNewCastNode(TYP_LONG, end, false, TYP_LONG);
Comment on lines +1308 to +1319
if (first.IsArray && second.IsArray)
{
    return true;
}

if ((first.IsByrefLocal || first.IsByrefBaseWithOffset || first.IsByrefWithIndex) &&
    (second.IsByrefLocal || second.IsByrefBaseWithOffset || second.IsByrefWithIndex))
{
    return true;
}

// Array and byref/span bases can still describe the same storage after morphing.
Comment on lines +37 to +46
BasicBlock* const header = loop->GetHeader();
bool alreadyRewritten = false;
for (unsigned rewrittenHeader : rewrittenHeaders)
{
    if (rewrittenHeader == header->bbNum)
    {
        alreadyRewritten = true;
        break;
    }
}
Comment on lines +4960 to +4965
if (doAutoVectorization)
{
    // Rewrite HIR loops late, after VN-DSE and if-conversion but before rationalization.
    //
    DoPhase(this, PHASE_AUTO_VECTORIZATION, &Compiler::optAutoVectorize);
}
Contributor Author

hez2010 commented May 6, 2026

@MihuBot

Copilot AI review requested due to automatic review settings May 6, 2026 14:45
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 5 comments.


if (!changed)
{
    m_compiler->fgInvalidateDfsTree();
Comment on lines +380 to +386
const LoopVectorizationPlan originalPlan = *plan;

for (unsigned i = 0; i < vectorSizeCount; i++)
{
    *plan = originalPlan;

    plan->VectorSizeBytes = vectorSizes[i];
Comment on lines +1322 to +1333
if (first.IsArray && second.IsArray)
{
    return true;
}

if ((first.IsByrefLocal || first.IsByrefBaseWithOffset || first.IsByrefWithIndex) &&
    (second.IsByrefLocal || second.IsByrefBaseWithOffset || second.IsByrefWithIndex))
{
    return true;
}

// Array and byref/span bases can still describe the same storage after morphing.
Comment on lines +38 to +47
BasicBlock* const header = loop->GetHeader();
bool alreadyRewritten = false;
for (unsigned rewrittenHeader : rewrittenHeaders)
{
    if (rewrittenHeader == header->bbNum)
    {
        alreadyRewritten = true;
        break;
    }
}
CONFIG_STRING(JitObjectStackAllocationTrackFieldsRange, "JitObjectStackAllocationTrackFieldsRange")
CONFIG_INTEGER(JitObjectStackAllocationDumpConnGraph, "JitObjectStackAllocationDumpConnGraph", 0)

RELEASE_CONFIG_INTEGER(JitAutoVectorization, "JitAutoVectorization", 1)
Contributor Author

hez2010 commented May 6, 2026

PMI diffs on System.Private.CoreLib and framework assemblies:

PMI CodeSize Diffs for System.Private.CoreLib.dll, framework assemblies [invoking .cctors] for  default jit

Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 84302919
Total bytes of diff: 84308903
Total bytes of delta: 5984 (0.01 % of base)
Total relative delta: NaN
    diff is a regression.
    relative diff is a regression.


Total byte diff includes -117 bytes from reconciling methods
        Base had    1 unique methods,      117 unique bytes
        Diff had    0 unique methods,        0 unique bytes

Top file regressions (bytes):
        3090 : FSharp.Core.dasm (0.07 % of base)
         446 : System.Numerics.Tensors.dasm (0.04 % of base)
         401 : System.Private.CoreLib.dasm (0.01 % of base)
         255 : System.Text.RegularExpressions.dasm (0.03 % of base)
         244 : System.Diagnostics.Process.dasm (0.16 % of base)
         239 : Microsoft.VisualBasic.Core.dasm (0.05 % of base)
         233 : System.Runtime.Numerics.dasm (0.14 % of base)
         217 : System.Collections.Immutable.dasm (0.01 % of base)
         195 : System.Net.Security.dasm (0.08 % of base)
         143 : System.Data.Common.dasm (0.01 % of base)
         117 : System.Net.Http.dasm (0.01 % of base)
          88 : System.Reflection.Metadata.dasm (0.02 % of base)
          84 : Newtonsoft.Json.dasm (0.01 % of base)
          67 : Microsoft.CodeAnalysis.VisualBasic.dasm (0.00 % of base)
          51 : xunit.runner.utility.netcoreapp10.dasm (0.02 % of base)
          51 : xunit.execution.dotnet.dasm (0.02 % of base)
          38 : System.Net.NameResolution.dasm (0.06 % of base)
          37 : System.Reflection.MetadataLoadContext.dasm (0.02 % of base)
          35 : Microsoft.Extensions.Logging.Abstractions.dasm (0.04 % of base)

Top file improvements (bytes):
         -47 : Microsoft.CodeAnalysis.dasm (-0.00 % of base)

20 total files with Code Size differences (1 improved, 19 regressed), 260 unchanged.

Top method regressions (bytes):
         195 (6.40 % of base) : System.Net.Security.dasm - System.Net.Security.NetSecurityTelemetry:OnEventCommand(System.Diagnostics.Tracing.EventCommandEventArgs):this (FullOpts)
         165 (150.00 % of base) : System.Numerics.Tensors.dasm - System.Numerics.Tensors.TensorOperation+SumOfSquaredDifferences`1[byte]:Invoke(System.ReadOnlySpan`1[byte],byte,System.Span`1[byte]) (FullOpts)
         132 (12.94 % of base) : System.Private.CoreLib.dasm - System.PasteArguments:AppendArgument(byref,System.String) (FullOpts)
         132 (12.94 % of base) : System.Diagnostics.Process.dasm - System.PasteArguments:AppendArgument(byref,System.String) (FullOpts)
         119 (10.21 % of base) : System.Private.CoreLib.dasm - System.Globalization.CalendarData:NormalizeDatePattern(System.String):System.String (FullOpts)
         117 (4.96 % of base) : System.Net.Http.dasm - System.Net.Http.HttpTelemetry:OnEventCommand(System.Diagnostics.Tracing.EventCommandEventArgs):this (FullOpts)
         115 (19.07 % of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.ArrayModule:splitAt$cont@170[byte](int,byte[],Microsoft.FSharp.Core.Unit):System.Tuple`2[byte[],byte[]] (FullOpts)
         108 (6.36 % of base) : System.Data.Common.dasm - System.Data.SqlTypes.SqlDecimal:MpDiv(System.ReadOnlySpan`1[uint],int,System.Span`1[uint],int,System.Span`1[uint],byref,System.Span`1[uint],byref) (FullOpts)
          95 (16.18 % of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.ArrayModule:splitAt$cont@170[long](int,long[],Microsoft.FSharp.Core.Unit):System.Tuple`2[long[],long[]] (FullOpts)
          83 (94.32 % of base) : System.Numerics.Tensors.dasm - System.Numerics.Tensors.TensorOperation+SumOfSquaredDifferences`1[double]:Invoke(System.ReadOnlySpan`1[double],double,System.Span`1[double]) (FullOpts)
          80 (15.04 % of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.ArrayModule:splitAt$cont@170[int](int,int[],Microsoft.FSharp.Core.Unit):System.Tuple`2[int[],int[]] (FullOpts)
          77 (21.10 % of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.ArrayModule:TakeWhile[double](Microsoft.FSharp.Core.FSharpFunc`2[double,bool],double[]):double[] (FullOpts)
          77 (9.45 % of base) : System.Diagnostics.Process.dasm - System.Diagnostics.ProcessUtils:GetNextArgument(System.String,byref):System.String (FullOpts)
          68 (40.48 % of base) : System.Text.RegularExpressions.dasm - System.Text.RegularExpressions.Match:Reset(System.String,int):this (FullOpts) (2 methods)
          67 (23.43 % of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - Microsoft.CodeAnalysis.VisualBasic.Syntax.KeywordTable:EnsureHalfWidth(System.String):System.String (FullOpts)
          67 (10.84 % of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.ArrayModule:splitAt$cont@170[double](int,double[],Microsoft.FSharp.Core.Unit):System.Tuple`2[double[],double[]] (FullOpts)
          67 (10.86 % of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.ArrayModule:splitAt$cont@170[short](int,short[],Microsoft.FSharp.Core.Unit):System.Tuple`2[short[],short[]] (FullOpts)
          67 (20.18 % of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.ArrayModule:Tail[double](double[]):double[] (FullOpts)
          67 (20.24 % of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.ArrayModule:Tail[short](short[]):short[] (FullOpts)
          67 (18.87 % of base) : FSharp.Core.dasm - Microsoft.FSharp.Collections.ArrayModule:TakeWhile[byte](Microsoft.FSharp.Core.FSharpFunc`2[byte,bool],byte[]):byte[] (FullOpts)

Top method improvements (bytes):
        -117 (-100.00 % of base) : Microsoft.CodeAnalysis.dasm - Microsoft.CodeAnalysis.SmallDictionary`2[System.__Canon,int]:LeftComplex(Microsoft.CodeAnalysis.SmallDictionary`2+AvlNode[System.__Canon,int]):Microsoft.CodeAnalysis.SmallDictionary`2+AvlNode[System.__Canon,int] (FullOpts) (1 base, 0 diff methods)

Top method regressions (percentages):
         165 (150.00 % of base) : System.Numerics.Tensors.dasm - System.Numerics.Tensors.TensorOperation+SumOfSquaredDifferences`1[byte]:Invoke(System.ReadOnlySpan`1[byte],byte,System.Span`1[byte]) (FullOpts)
          83 (94.32 % of base) : System.Numerics.Tensors.dasm - System.Numerics.Tensors.TensorOperation+SumOfSquaredDifferences`1[double]:Invoke(System.ReadOnlySpan`1[double],double,System.Span`1[double]) (FullOpts)
          67 (69.07 % of base) : System.Numerics.Tensors.dasm - System.Numerics.Tensors.TensorOperation+SumOfSquaredDifferences`1[int]:Invoke(System.ReadOnlySpan`1[int],int,System.Span`1[int]) (FullOpts)
          67 (69.07 % of base) : System.Numerics.Tensors.dasm - System.Numerics.Tensors.TensorOperation+SumOfSquaredDifferences`1[long]:Invoke(System.ReadOnlySpan`1[long],long,System.Span`1[long]) (FullOpts)
          64 (57.66 % of base) : System.Numerics.Tensors.dasm - System.Numerics.Tensors.TensorOperation+SumOfSquaredDifferences`1[short]:Invoke(System.ReadOnlySpan`1[short],short,System.Span`1[short]) (FullOpts)
          39 (48.15 % of base) : Microsoft.VisualBasic.Core.dasm - Microsoft.VisualBasic.CompilerServices.NewLateBinding:ResetCopyback(bool[]) (FullOpts)
          68 (40.48 % of base) : System.Text.RegularExpressions.dasm - System.Text.RegularExpressions.Match:Reset(System.String,int):this (FullOpts) (2 methods)
          35 (40.23 % of base) : System.Data.Common.dasm - System.Data.SqlTypes.SqlDecimal:MpMove(System.ReadOnlySpan`1[uint],int,System.Span`1[uint],byref) (FullOpts)
          48 (39.67 % of base) : System.Runtime.Numerics.dasm - System.Text.ValueStringBuilder`1[byte]:Append(byte,int):this (FullOpts)
          65 (38.69 % of base) : System.Collections.Immutable.dasm - System.Collections.Immutable.ImmutableArray`1+Builder[byte]:AddRange[byte](System.ReadOnlySpan`1[byte]):this (FullOpts)
          44 (37.61 % of base) : Microsoft.VisualBasic.Core.dasm - Microsoft.VisualBasic.CompilerServices.OverloadResolution:CreateMatchTable(int,int):bool[] (FullOpts)
          58 (34.32 % of base) : System.Text.RegularExpressions.dasm - System.Text.RegularExpressions.Symbolic.BitVector:And(System.Text.RegularExpressions.Symbolic.BitVector,System.Text.RegularExpressions.Symbolic.BitVector):System.Text.RegularExpressions.Symbolic.BitVector (FullOpts)
          58 (34.32 % of base) : System.Text.RegularExpressions.dasm - System.Text.RegularExpressions.Symbolic.BitVector:Or(System.Text.RegularExpressions.Symbolic.BitVector,System.Text.RegularExpressions.Symbolic.BitVector):System.Text.RegularExpressions.Symbolic.BitVector (FullOpts)
          53 (31.18 % of base) : System.Reflection.Metadata.dasm - System.Reflection.Metadata.MetadataReader:CombineRowCounts(int[],int[],byte):int[] (FullOpts)
          44 (30.34 % of base) : System.Runtime.Numerics.dasm - System.Text.ValueStringBuilder`1[double]:Append(double,int):this (FullOpts)
          36 (28.12 % of base) : System.Runtime.Numerics.dasm - System.Numerics.NumericsHelpers:DangerousMakeOnesComplement(System.Span`1[nuint]) (FullOpts)
          35 (25.74 % of base) : System.Runtime.Numerics.dasm - System.Text.ValueStringBuilder`1[int]:Append(int,int):this (FullOpts)
          35 (25.74 % of base) : System.Runtime.Numerics.dasm - System.Text.ValueStringBuilder`1[long]:Append(long,int):this (FullOpts)
          35 (25.55 % of base) : Microsoft.Extensions.Logging.Abstractions.dasm - System.Text.ValueStringBuilder:Append(char,int):this (FullOpts)
          35 (25.55 % of base) : System.Reflection.Metadata.dasm - System.Text.ValueStringBuilder:Append(char,int):this (FullOpts)

Top method improvements (percentages):
        -117 (-100.00 % of base) : Microsoft.CodeAnalysis.dasm - Microsoft.CodeAnalysis.SmallDictionary`2[System.__Canon,int]:LeftComplex(Microsoft.CodeAnalysis.SmallDictionary`2+AvlNode[System.__Canon,int]):Microsoft.CodeAnalysis.SmallDictionary`2+AvlNode[System.__Canon,int] (FullOpts) (1 base, 0 diff methods)

117 total methods with Code Size differences (1 improved, 116 regressed), 502070 unchanged.

Contributor Author

hez2010 commented May 6, 2026

CoreLib and framework assemblies full diffs:
method_assembly_diff_report.md

Method list (potential candidates for manual vectorization in the BCL):

(+53 bytes, +22.46 %) Microsoft.FSharp.Collections.ArrayModule:Create[byte](int,byte):byte[]
(+36 bytes, +15.06 %) Microsoft.FSharp.Collections.ArrayModule:Create[short](int,short):short[]
(+36 bytes, +15.13 %) Microsoft.FSharp.Collections.ArrayModule:Create[int](int,int):int[]
(+35 bytes, +13.01 %) Microsoft.FSharp.Collections.ArrayModule:Create[double](int,double):double[]
(+36 bytes, +15.00 %) Microsoft.FSharp.Collections.ArrayModule:Create[long](int,long):long[]
(+56 bytes, +17.02 %) Microsoft.FSharp.Collections.ArrayModule:Tail[byte](byte[]):byte[]
(+67 bytes, +20.24 %) Microsoft.FSharp.Collections.ArrayModule:Tail[short](short[]):short[]
(+60 bytes, +18.29 %) Microsoft.FSharp.Collections.ArrayModule:Tail[int](int[]):int[]
(+67 bytes, +20.18 %) Microsoft.FSharp.Collections.ArrayModule:Tail[double](double[]):double[]
(+60 bytes, +18.24 %) Microsoft.FSharp.Collections.ArrayModule:Tail[long](long[]):long[]
(+53 bytes, +22.46 %) Microsoft.FSharp.Collections.ArrayModule:Replicate[byte](int,byte):byte[]
(+36 bytes, +15.06 %) Microsoft.FSharp.Collections.ArrayModule:Replicate[short](int,short):short[]
(+36 bytes, +15.13 %) Microsoft.FSharp.Collections.ArrayModule:Replicate[int](int,int):int[]
(+35 bytes, +13.01 %) Microsoft.FSharp.Collections.ArrayModule:Replicate[double](int,double):double[]
(+36 bytes, +15.00 %) Microsoft.FSharp.Collections.ArrayModule:Replicate[long](int,long):long[]
(+48 bytes, +7.57 %) Microsoft.FSharp.Collections.ArrayModule:SplitAt[byte](int,byte[]):System.Tuple`2[byte[],byte[]]
(+39 bytes, +6.15 %) Microsoft.FSharp.Collections.ArrayModule:SplitAt[short](int,short[]):System.Tuple`2[short[],short[]]
(+39 bytes, +6.76 %) Microsoft.FSharp.Collections.ArrayModule:SplitAt[int](int,int[]):System.Tuple`2[int[],int[]]
(+39 bytes, +6.14 %) Microsoft.FSharp.Collections.ArrayModule:SplitAt[double](int,double[]):System.Tuple`2[double[],double[]]
(+39 bytes, +6.17 %) Microsoft.FSharp.Collections.ArrayModule:SplitAt[long](int,long[]):System.Tuple`2[long[],long[]]
(+46 bytes, +8.00 %) Microsoft.FSharp.Collections.ArrayModule:Take[byte](int,byte[]):byte[]
(+46 bytes, +7.97 %) Microsoft.FSharp.Collections.ArrayModule:Take[short](int,short[]):short[]
(+46 bytes, +8.76 %) Microsoft.FSharp.Collections.ArrayModule:Take[int](int,int[]):int[]
(+50 bytes, +8.68 %) Microsoft.FSharp.Collections.ArrayModule:Take[double](int,double[]):double[]
(+46 bytes, +8.00 %) Microsoft.FSharp.Collections.ArrayModule:Take[long](int,long[]):long[]
(+67 bytes, +18.87 %) Microsoft.FSharp.Collections.ArrayModule:TakeWhile[byte](Microsoft.FSharp.Core.FSharpFunc`2[byte,bool],byte[]):byte[]
(+62 bytes, +17.13 %) Microsoft.FSharp.Collections.ArrayModule:TakeWhile[short](Microsoft.FSharp.Core.FSharpFunc`2[short,bool],short[]):short[]
(+56 bytes, +18.12 %) Microsoft.FSharp.Collections.ArrayModule:TakeWhile[int](Microsoft.FSharp.Core.FSharpFunc`2[int,bool],int[]):int[]
(+77 bytes, +21.10 %) Microsoft.FSharp.Collections.ArrayModule:TakeWhile[double](Microsoft.FSharp.Core.FSharpFunc`2[double,bool],double[]):double[]
(+62 bytes, +17.37 %) Microsoft.FSharp.Collections.ArrayModule:TakeWhile[long](Microsoft.FSharp.Core.FSharpFunc`2[long,bool],long[]):long[]
(+51 bytes, +11.67 %) Microsoft.FSharp.Collections.ArrayModule:Distinct[byte](byte[]):byte[]
(+35 bytes, +8.01 %) Microsoft.FSharp.Collections.ArrayModule:Distinct[short](short[]):short[]
(+35 bytes, +8.08 %) Microsoft.FSharp.Collections.ArrayModule:Distinct[int](int[]):int[]
(+35 bytes, +7.94 %) Microsoft.FSharp.Collections.ArrayModule:Distinct[double](double[]):double[]
(+35 bytes, +8.05 %) Microsoft.FSharp.Collections.ArrayModule:Distinct[long](long[]):long[]
(+41 bytes, +7.56 %) Microsoft.FSharp.Collections.ArrayModule:DistinctBy[byte,System.Nullable`1[int]](Microsoft.FSharp.Core.FSharpFunc`2[byte,System.Nullable`1[int]],byte[]):byte[]
(+38 bytes, +7.01 %) Microsoft.FSharp.Collections.ArrayModule:DistinctBy[short,System.Nullable`1[int]](Microsoft.FSharp.Core.FSharpFunc`2[short,System.Nullable`1[int]],short[]):short[]
(+38 bytes, +7.06 %) Microsoft.FSharp.Collections.ArrayModule:DistinctBy[int,System.Nullable`1[int]](Microsoft.FSharp.Core.FSharpFunc`2[int,System.Nullable`1[int]],int[]):int[]
(+38 bytes, +6.96 %) Microsoft.FSharp.Collections.ArrayModule:DistinctBy[double,System.Nullable`1[int]](Microsoft.FSharp.Core.FSharpFunc`2[double,System.Nullable`1[int]],double[]):double[]
(+38 bytes, +7.04 %) Microsoft.FSharp.Collections.ArrayModule:DistinctBy[long,System.Nullable`1[int]](Microsoft.FSharp.Core.FSharpFunc`2[long,System.Nullable`1[int]],long[]):long[]
(+53 bytes, +9.74 %) Microsoft.FSharp.Collections.ArrayModule:Partition[byte](Microsoft.FSharp.Core.FSharpFunc`2[byte,bool],byte[]):System.Tuple`2[byte[],byte[]]
(+35 bytes, +6.24 %) Microsoft.FSharp.Collections.ArrayModule:Partition[short](Microsoft.FSharp.Core.FSharpFunc`2[short,bool],short[]):System.Tuple`2[short[],short[]]
(+50 bytes, +9.31 %) Microsoft.FSharp.Collections.ArrayModule:Partition[int](Microsoft.FSharp.Core.FSharpFunc`2[int,bool],int[]):System.Tuple`2[int[],int[]]
(+35 bytes, +6.11 %) Microsoft.FSharp.Collections.ArrayModule:Partition[double](Microsoft.FSharp.Core.FSharpFunc`2[double,bool],double[]):System.Tuple`2[double[],double[]]
(+35 bytes, +6.31 %) Microsoft.FSharp.Collections.ArrayModule:Partition[long](Microsoft.FSharp.Core.FSharpFunc`2[long,bool],long[]):System.Tuple`2[long[],long[]]
(+47 bytes, +15.26 %) Microsoft.FSharp.Collections.ArrayModule:Truncate[byte](int,byte[]):byte[]
(+38 bytes, +12.26 %) Microsoft.FSharp.Collections.ArrayModule:Truncate[short](int,short[]):short[]
(+38 bytes, +14.73 %) Microsoft.FSharp.Collections.ArrayModule:Truncate[int](int,int[]):int[]
(+38 bytes, +12.03 %) Microsoft.FSharp.Collections.ArrayModule:Truncate[double](int,double[]):double[]
(+38 bytes, +12.42 %) Microsoft.FSharp.Collections.ArrayModule:Truncate[long](int,long[]):long[]
(+115 bytes, +19.07 %) Microsoft.FSharp.Collections.ArrayModule:splitAt$cont@170[byte](int,byte[],Microsoft.FSharp.Core.Unit):System.Tuple`2[byte[],byte[]]
(+67 bytes, +10.86 %) Microsoft.FSharp.Collections.ArrayModule:splitAt$cont@170[short](int,short[],Microsoft.FSharp.Core.Unit):System.Tuple`2[short[],short[]]
(+80 bytes, +15.04 %) Microsoft.FSharp.Collections.ArrayModule:splitAt$cont@170[int](int,int[],Microsoft.FSharp.Core.Unit):System.Tuple`2[int[],int[]]
(+67 bytes, +10.84 %) Microsoft.FSharp.Collections.ArrayModule:splitAt$cont@170[double](int,double[],Microsoft.FSharp.Core.Unit):System.Tuple`2[double[],double[]]
(+95 bytes, +16.18 %) Microsoft.FSharp.Collections.ArrayModule:splitAt$cont@170[long](int,long[],Microsoft.FSharp.Core.Unit):System.Tuple`2[long[],long[]]
(+54 bytes, +11.16 %) Microsoft.FSharp.Collections.SeqModule:toArray$cont@1026[byte](System.Collections.Generic.IEnumerator`1[byte],Microsoft.FSharp.Core.Unit):byte[]
(+38 bytes, +7.63 %) Microsoft.FSharp.Collections.SeqModule:toArray$cont@1026[short](System.Collections.Generic.IEnumerator`1[short],Microsoft.FSharp.Core.Unit):short[]
(+38 bytes, +7.69 %) Microsoft.FSharp.Collections.SeqModule:toArray$cont@1026[int](System.Collections.Generic.IEnumerator`1[int],Microsoft.FSharp.Core.Unit):int[]
(+41 bytes, +8.15 %) Microsoft.FSharp.Collections.SeqModule:toArray$cont@1026[double](System.Collections.Generic.IEnumerator`1[double],Microsoft.FSharp.Core.Unit):double[]
(+38 bytes, +7.69 %) Microsoft.FSharp.Collections.SeqModule:toArray$cont@1026[long](System.Collections.Generic.IEnumerator`1[long],Microsoft.FSharp.Core.Unit):long[]
(+48 bytes, +15.89 %) Microsoft.FSharp.Collections.SeqModule:nextChunk@1812[byte](int,System.Collections.Generic.IEnumerator`1[byte],Microsoft.FSharp.Core.Unit):byte[]
(+40 bytes, +13.33 %) Microsoft.FSharp.Collections.SeqModule:nextChunk@1812[short](int,System.Collections.Generic.IEnumerator`1[short],Microsoft.FSharp.Core.Unit):short[]
(+38 bytes, +12.84 %) Microsoft.FSharp.Collections.SeqModule:nextChunk@1812[int](int,System.Collections.Generic.IEnumerator`1[int],Microsoft.FSharp.Core.Unit):int[]
(+38 bytes, +10.76 %) Microsoft.FSharp.Collections.SeqModule:nextChunk@1812[double](int,System.Collections.Generic.IEnumerator`1[double],Microsoft.FSharp.Core.Unit):double[]
(+38 bytes, +12.75 %) Microsoft.FSharp.Collections.SeqModule:nextChunk@1812[long](int,System.Collections.Generic.IEnumerator`1[long],Microsoft.FSharp.Core.Unit):long[]
(+67 bytes, +23.43 %) Microsoft.CodeAnalysis.VisualBasic.Syntax.KeywordTable:EnsureHalfWidth(System.String):System.String
(+35 bytes, +11.63 %) Microsoft.CodeAnalysis.BitVector:AllSet(int):Microsoft.CodeAnalysis.BitVector
(+35 bytes, +1.23 %) Microsoft.CodeAnalysis.Emit.DeltaMetadataWriter:GetDelta(Microsoft.CodeAnalysis.Compilation,System.Guid,System.Reflection.Metadata.Ecma335.MetadataSizes):Microsoft.CodeAnalysis.Emit.EmitBaseline:this
(+35 bytes, +25.55 %) System.Text.ValueStringBuilder:Append(char,int):this
(+39 bytes, +48.15 %) Microsoft.VisualBasic.CompilerServices.NewLateBinding:ResetCopyback(bool[])
(+44 bytes, +37.61 %) Microsoft.VisualBasic.CompilerServices.OverloadResolution:CreateMatchTable(int,int):bool[]
(+47 bytes, +6.19 %) Microsoft.VisualBasic.CompilerServices.OverloadResolution:ReorderArgumentArray(Microsoft.VisualBasic.CompilerServices.Symbols+Method,System.Object[],System.Object[],bool[],int)
(+47 bytes, +0.31 %) Microsoft.VisualBasic.CompilerServices.VBBinder:BindToMethod(int,System.Reflection.MethodBase[],byref,System.Reflection.ParameterModifier[],System.Globalization.CultureInfo,System.String[],byref):System.Reflection.MethodBase:this
(+62 bytes, +5.12 %) Microsoft.VisualBasic.CompilerServices.VBBinder:CreateParamOrder(bool,int[],System.Reflection.ParameterInfo[],System.Object[],System.String[]):System.Exception:this
(+43 bytes, +9.07 %) Newtonsoft.Json.Utilities.CollectionUtils:CopyFromJaggedToMultidimensionalArray(System.Collections.IList,System.Array,int[])
(+41 bytes, +4.67 %) Newtonsoft.Json.Serialization.JsonSerializerInternalWriter:SerializeMultidimensionalArray(Newtonsoft.Json.JsonWriter,System.Array,Newtonsoft.Json.Serialization.JsonArrayContract,Newtonsoft.Json.Serialization.JsonProperty,int,int[]):this
(+65 bytes, +38.69 %) System.Collections.Immutable.ImmutableArray`1+Builder[byte]:AddRange[byte](System.ReadOnlySpan`1[byte]):this
(+38 bytes, +22.35 %) System.Collections.Immutable.ImmutableArray`1+Builder[short]:AddRange[short](System.ReadOnlySpan`1[short]):this
(+38 bytes, +22.89 %) System.Collections.Immutable.ImmutableArray`1+Builder[int]:AddRange[int](System.ReadOnlySpan`1[int]):this
(+38 bytes, +21.84 %) System.Collections.Immutable.ImmutableArray`1+Builder[double]:AddRange[double](System.ReadOnlySpan`1[double]):this
(+38 bytes, +22.62 %) System.Collections.Immutable.ImmutableArray`1+Builder[long]:AddRange[long](System.ReadOnlySpan`1[long]):this
(+35 bytes, +40.23 %) System.Data.SqlTypes.SqlDecimal:MpMove(System.ReadOnlySpan`1[uint],int,System.Span`1[uint],byref)
(+108 bytes, +6.36 %) System.Data.SqlTypes.SqlDecimal:MpDiv(System.ReadOnlySpan`1[uint],int,System.Span`1[uint],int,System.Span`1[uint],byref,System.Span`1[uint],byref)
(+132 bytes, +12.94 %) System.PasteArguments:AppendArgument(byref,System.String)
(+35 bytes, +25.55 %) System.Text.ValueStringBuilder:Append(char,int):this
(+77 bytes, +9.45 %) System.Diagnostics.ProcessUtils:GetNextArgument(System.String,byref):System.String
(+19 bytes, n/a) System.DirectoryServices.Protocols.LdapConnection:Finalize():this
(+117 bytes, +4.96 %) System.Net.Http.HttpTelemetry:OnEventCommand(System.Diagnostics.Tracing.EventCommandEventArgs):this
(+38 bytes, +5.39 %) System.Net.NameResolutionTelemetry:OnEventCommand(System.Diagnostics.Tracing.EventCommandEventArgs):this
(+195 bytes, +6.40 %) System.Net.Security.NetSecurityTelemetry:OnEventCommand(System.Diagnostics.Tracing.EventCommandEventArgs):this
(+165 bytes, +150.00 %) System.Numerics.Tensors.TensorOperation+SumOfSquaredDifferences`1[byte]:Invoke(System.ReadOnlySpan`1[byte],byte,System.Span`1[byte])
(+64 bytes, +57.66 %) System.Numerics.Tensors.TensorOperation+SumOfSquaredDifferences`1[short]:Invoke(System.ReadOnlySpan`1[short],short,System.Span`1[short])
(+67 bytes, +69.07 %) System.Numerics.Tensors.TensorOperation+SumOfSquaredDifferences`1[int]:Invoke(System.ReadOnlySpan`1[int],int,System.Span`1[int])
(+83 bytes, +94.32 %) System.Numerics.Tensors.TensorOperation+SumOfSquaredDifferences`1[double]:Invoke(System.ReadOnlySpan`1[double],double,System.Span`1[double])
(+67 bytes, +69.07 %) System.Numerics.Tensors.TensorOperation+SumOfSquaredDifferences`1[long]:Invoke(System.ReadOnlySpan`1[long],long,System.Span`1[long])
(+47 bytes, n/a) System.Byte:ToString(System.String,System.IFormatProvider):System.String:this
(+43 bytes, +7.52 %) System.DefaultBinder:CreateParamOrder(int[],System.ReadOnlySpan`1[System.Reflection.ParameterInfo],System.String[]):bool
(+132 bytes, +12.94 %) System.PasteArguments:AppendArgument(byref,System.String)
(+119 bytes, +10.21 %) System.Globalization.CalendarData:NormalizeDatePattern(System.String):System.String
(+35 bytes, +25.55 %) System.Text.ValueStringBuilder:Append(char,int):this
(+37 bytes, +6.60 %) System.Runtime.CompilerServices.ConditionalWeakTable`2+Container[System.__Canon,System.__Canon]:Resize(int):System.Runtime.CompilerServices.ConditionalWeakTable`2+Container[System.__Canon,System.__Canon]:this
(+35 bytes, +23.81 %) System.Diagnostics.Tracing.EventCounter:.ctor(System.String,System.Diagnostics.Tracing.EventSource):this
(+35 bytes, +25.55 %) System.Text.ValueStringBuilder:Append(char,int):this
(+53 bytes, +31.18 %) System.Reflection.Metadata.MetadataReader:CombineRowCounts(int[],int[],byte):int[]
(+37 bytes, +7.58 %) System.Reflection.TypeLoading.GetTypeCoreCache+Container:Resize():this
(+48 bytes, +39.67 %) System.Text.ValueStringBuilder`1[byte]:Append(byte,int):this
(+35 bytes, +25.55 %) System.Text.ValueStringBuilder`1[short]:Append(short,int):this
(+35 bytes, +25.74 %) System.Text.ValueStringBuilder`1[int]:Append(int,int):this
(+44 bytes, +30.34 %) System.Text.ValueStringBuilder`1[double]:Append(double,int):this
(+35 bytes, +25.74 %) System.Text.ValueStringBuilder`1[long]:Append(long,int):this
(+36 bytes, +28.12 %) System.Numerics.NumericsHelpers:DangerousMakeOnesComplement(System.Span`1[nuint])
(+34 bytes, +40.48 %) System.Text.RegularExpressions.Match:Reset(System.String,int):this
(+35 bytes, +25.55 %) System.Text.ValueStringBuilder:Append(char,int):this
(+34 bytes, +40.48 %) System.Text.RegularExpressions.Match:Reset(System.String,int):this
(+58 bytes, +34.32 %) System.Text.RegularExpressions.Symbolic.BitVector:And(System.Text.RegularExpressions.Symbolic.BitVector,System.Text.RegularExpressions.Symbolic.BitVector):System.Text.RegularExpressions.Symbolic.BitVector
(+58 bytes, +34.32 %) System.Text.RegularExpressions.Symbolic.BitVector:Or(System.Text.RegularExpressions.Symbolic.BitVector,System.Text.RegularExpressions.Symbolic.BitVector):System.Text.RegularExpressions.Symbolic.BitVector
(+36 bytes, +16.98 %) System.Text.RegularExpressions.Symbolic.BitVector:Not(System.Text.RegularExpressions.Symbolic.BitVector):System.Text.RegularExpressions.Symbolic.BitVector
(+51 bytes, +5.48 %) Xunit.Serialization.XunitSerializationInfo+ArraySerializer:Deserialize(Xunit.Abstractions.IXunitSerializationInfo):this
(+51 bytes, +5.48 %) Xunit.Serialization.XunitSerializationInfo+ArraySerializer:Deserialize(Xunit.Abstractions.IXunitSerializationInfo):this

It seems there are some interesting spots in the tensor and regex libraries that could be vectorized manually.

cc: @tannergooding @stephentoub

@hez2010
Contributor Author

hez2010 commented May 6, 2026

Diffs

The final TP impact seems to be +0.12% to +0.38% for fullopts.

@hez2010
Contributor Author

hez2010 commented May 6, 2026

Closing as I've got everything I was curious about in this experiment.

@hez2010 hez2010 closed this May 6, 2026
@EgorBo
Member

EgorBo commented May 7, 2026

> Diffs
>
> The final TP impact seems to be +0.12% to +0.38% for fullopts.

I think it's fine to accept a much bigger TP regression for a proper auto-vec. The problem with your diffs (if they're correct - did CI finish?) is that they violate the memory model, as I said on Discord, e.g.:

```csharp
public void Double(int[] array, int size)
{
    for (int i = 0; i < size; i++)
    {
        array[i] = array[i] * 2;
    }
}
```

What exactly makes it legal to fold this into a SIMD loop, as the diffs in your SPMI report show?
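
For readers following along, the rewrite being questioned has roughly the following shape when written by hand. This is only a sketch: it uses `System.Numerics.Vector<T>` to stand in for the SIMD nodes the JIT phase would emit, the `DoubleSketch` class and its method names are hypothetical, and it assumes `0 <= size <= array.Length` so bounds-check behavior is left out.

```csharp
using System;
using System.Numerics;

public static class DoubleSketch
{
    // Scalar loop from the comment above: one element-sized load and store per iteration.
    public static void DoubleScalar(int[] array, int size)
    {
        for (int i = 0; i < size; i++)
        {
            array[i] = array[i] * 2;
        }
    }

    // Hand-written stand-in for "vector loop plus scalar epilogue".
    // Assumes 0 <= size <= array.Length.
    public static void DoubleVectorized(int[] array, int size)
    {
        int width = Vector<int>.Count;
        int i = 0;

        // Vector loop: each iteration loads and stores `width` elements as one wide access.
        for (; i <= size - width; i += width)
        {
            Vector<int> v = new Vector<int>(array, i);
            (v * 2).CopyTo(array, i);
        }

        // Scalar epilogue for the elements that do not fill a whole vector.
        for (; i < size; i++)
        {
            array[i] = array[i] * 2;
        }
    }
}
```

On a single thread both methods leave the same values in `array`. The difference is in the memory accesses: the scalar loop touches one `int` at a time, while the vector loop reads and writes `Vector<int>.Count` elements per access, which is where the legality question comes from.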


Also, I inspected a few examples and noticed that it vectorized various "let's handle the remaining elements via a plain loop" loops, so presumably an auto-vec like that should rely on PGO or general assertions about the possible size.

@tannergooding
Member

> is that they violate the memory model

Notably, the main issue is that this violates atomicity guarantees. Most hardware does not guarantee per-element atomicity of general SIMD loads/stores. While Intel/AMD and Arm64 each guarantee it for some subset of scenarios, those are typically outside what the GC allows us to assert.

Loading Count elements up front is fine so long as it maintains single-threaded consistency, so it's safe in this example because we know array[i] cannot alias array[i+1]. It would not be safe with a ReadOnlySpan<T> source and a Span<T> dest, as they could overlap without being the same source.
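
The span case can be made concrete with a small single-threaded experiment. This is a sketch, not code from the PR: `OverlapSketch` and its methods are hypothetical names, `Vector<int>` stands in for whatever SIMD nodes the JIT would emit, and both methods assume `dst.Length >= src.Length`. The destination is shifted one element ahead of the source inside the same array, so every scalar store feeds the next scalar load.

```csharp
using System;
using System.Numerics;

public static class OverlapSketch
{
    // Scalar reference: element i is fully written before element i + 1 is read.
    public static void DoubleScalar(ReadOnlySpan<int> src, Span<int> dst)
    {
        for (int i = 0; i < src.Length; i++)
        {
            dst[i] = src[i] * 2;
        }
    }

    // Naive wide version with no overlap check: loads a whole block before storing any of it.
    public static void DoubleWide(ReadOnlySpan<int> src, Span<int> dst)
    {
        int width = Vector<int>.Count;
        int i = 0;
        for (; i <= src.Length - width; i += width)
        {
            Vector<int> v = new Vector<int>(src.Slice(i));
            (v * 2).CopyTo(dst.Slice(i));
        }
        for (; i < src.Length; i++)
        {
            dst[i] = src[i] * 2;
        }
    }

    // Runs both versions with dst starting one element after src in the same array.
    public static bool ResultsDiffer()
    {
        int[] a = new int[33];
        int[] b = new int[33];
        a[0] = 1;
        b[0] = 1;

        // Scalar: a[1] = 2, a[2] = 4, ... because each store is visible to the next load.
        DoubleScalar(a.AsSpan(0, 32), a.AsSpan(1, 32));

        // Wide: b[2] is computed from the stale 0 that was loaded with the first block.
        DoubleWide(b.AsSpan(0, 32), b.AsSpan(1, 32));

        return !a.AsSpan().SequenceEqual(b);
    }
}
```

`ResultsDiffer()` returning `true` shows the two loops disagree without any second thread involved, which is why overlapping spans need either a proof of non-aliasing or a runtime guard.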

@hez2010
Contributor Author

hez2010 commented May 7, 2026

> The loading of Count elements up front is fine so long as it maintains single-threaded consistency, so it's safe in this example because we know array[i] cannot alias array[i+1]. This would not be safe with ROSpan<T> source and Span<T> dest, as they could overlap without being the same source.

Yeah. And there's a simple conservative aliasing check to guard the overlapping cases in this prototype.
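
The prototype's actual check is not shown in this thread. As a rough source-level analogue, assuming the guard simply falls back to the scalar loop whenever the two ranges share any memory, it could be pictured like this. `GuardedSketch` and its method name are hypothetical; `MemoryExtensions.Overlaps` is the existing BCL helper, and the method assumes `dst.Length >= src.Length`.

```csharp
using System;
using System.Numerics;

public static class GuardedSketch
{
    // Assumes dst.Length >= src.Length.
    public static void Double(ReadOnlySpan<int> src, Span<int> dst)
    {
        int i = 0;

        // Conservative guard: take the wide path only when the ranges share no memory.
        // This also rejects src and dst being the exact same range, which would be
        // safe for an element-wise operation; that is the price of a simple check.
        if (!dst.Overlaps(src))
        {
            int width = Vector<int>.Count;
            for (; i <= src.Length - width; i += width)
            {
                Vector<int> v = new Vector<int>(src.Slice(i));
                (v * 2).CopyTo(dst.Slice(i));
            }
        }

        // Scalar loop: the epilogue on the wide path, the whole loop on the fallback path.
        for (; i < src.Length; i++)
        {
            dst[i] = src[i] * 2;
        }
    }
}
```

With this shape, overlapping inputs keep the exact scalar semantics, and only provably disjoint inputs pay for (and benefit from) the wide loop.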


Labels

area-CodeGen-coreclr: CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
community-contribution: Indicates that the PR has been added by a community member
