JIT: PGO value-profiling for non-constant zero-init stackalloc #127970

Draft
EgorBo wants to merge 10 commits into dotnet:main from EgorBo:lclheap-pgo-value-probing

Conversation

EgorBo (Member) commented May 8, 2026

Revive my old PR that tried to do this, now that #127959 is merged

Instrument stackallocs of variable size for PGO, and generalize value probing so we can reuse it for other future ideas.

void Test(int len)
{
    Span<byte> buffer = stackalloc byte[len];
    Consume(buffer);
}

is optimized to include an additional fast path for the most common length:

       cmp      rax, 100
       je       SHORT G_M44326_IG06
       ...
G_M44326_IG06:  ;; offset=0x004F
       test     dword ptr [rsp], esp
       sub      rsp, 112
       lea      rax, [rsp+0x20]
       vxorps   ymm0, ymm0, ymm0
       vmovdqu32 zmmword ptr [rax], zmm0
       vmovdqu32 zmmword ptr [rax+0x30], zmm0

when it is mostly called with the same size (100 in this example), so we zero the buffer with just 3 AVX-512 instructions instead of a slow loop.

Benchmark

using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Bench).Assembly).Run(args);

public class Bench
{
    [Benchmark] public void Stackalloc32()     => Consume(stackalloc byte[GetVar(32)]);
    [Benchmark] public void Stackalloc40()     => Consume(stackalloc byte[GetVar(40)]);
    [Benchmark] public void Stackalloc50()     => Consume(stackalloc byte[GetVar(50)]);
    [Benchmark] public void Stackalloc64()     => Consume(stackalloc byte[GetVar(64)]);
    [Benchmark] public void Stackalloc128()    => Consume(stackalloc byte[GetVar(128)]);
    [Benchmark] public void Stackalloc200()    => Consume(stackalloc byte[GetVar(200)]);
    [Benchmark] public void Stackalloc256()    => Consume(stackalloc byte[GetVar(256)]);
    [Benchmark] public void Stackalloc512()    => Consume(stackalloc byte[GetVar(512)]);
    [Benchmark] public void Stackalloc1024()   => Consume(stackalloc byte[GetVar(1024)]);
    [Benchmark] public void Stackalloc16384()  => Consume(stackalloc byte[GetVar(16384)]);
    [Benchmark] public void Stackalloc524288() => Consume(stackalloc byte[GetVar(524288)]);

    [MethodImpl(MethodImplOptions.NoInlining)] static int GetVar(int a) => a;
    [MethodImpl(MethodImplOptions.NoInlining)] static void Consume(Span<byte> buffer) { }
}

Results

Speedup of PR vs main (>1.00× = PR is faster). Sizes <= 32 bytes are intentionally not specialized (variable-size loop is already cheap there).

| Size    | Linux x64 (EPYC 9V45) | Apple M4 arm64 |
|--------:|----------------------:|---------------:|
| 32      | 1.00×                 | 0.92×          |
| 40      | 1.25×                 | 3.23×          |
| 50      | 1.37×                 | 2.32×          |
| 64      | 1.34×                 | 2.28×          |
| 128     | 1.77×                 | 0.87×          |
| 200     | 2.26×                 | 2.03×          |
| 256     | 2.46×                 | 0.89×          |
| 512     | 3.76×                 | 1.34×          |
| 1,024   | 2.43×                 | 1.27×          |
| 16,384  | 2.47×                 | 1.69×          |
| 524,288 | 2.54×                 | 1.24×          |

Raw runs: EgorBot/Benchmarks#197.

Wins are large on Linux x64 across the board (up to 3.76× at 512 bytes). On Apple M4 the wins are also significant for non-power-of-two sizes (40 / 50 / 64 / 200 -> 2-3×), with two minor regressions at 128 and 256 (-13% / -11%) and parity-or-better elsewhere.

When stackalloc has a non-constant size and the method has compInitMem,
codegen falls back to a slow per-stack-alignment zero-init loop. Use PGO
value probing to record the most popular size, and in Tier1 specialize:

    size == popularSize ? LCLHEAP(popularSize) : LCLHEAP(size)

The constant-size LCLHEAP fast path is unrolled by Lowering as STORE_BLK,
producing efficient SIMD zero-init.
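In C++ terms, the guarded shape described above looks roughly like the following sketch. The names are hypothetical and `__builtin_alloca` merely stands in for GT_LCLHEAP; the real transform operates on JIT IR, not source code. The point is that the taken arm has a compile-time-constant zero-init size, which a compiler can unroll into a few SIMD stores:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Consumer stand-in: touches the whole buffer so the zero-init is observable.
static int sumBuffer(const uint8_t* buf, int len)
{
    int sum = 0;
    for (int i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}

int testGuardedStackalloc(int size)
{
    constexpr int popularSize = 100; // recorded by PGO value probing (assumed)
    uint8_t* buf = static_cast<uint8_t*>(__builtin_alloca(size));
    if (size == popularSize)
    {
        memset(buf, 0, popularSize); // constant size: unrollable SIMD zero-init
    }
    else
    {
        memset(buf, 0, size);        // variable size: generic (slow) zero-init
    }
    return sumBuffer(buf, size);
}
```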

Generalized value-probe infrastructure:
* Introduce GenTreeOpWithILOffset (a GenTreeOp that carries an IL offset),
  used by GT_LCLHEAP so it can participate in value-histogram instrumentation
  alongside calls without a side hash table.
* Centralize value-probe candidate identification in fgprofile.cpp via a
  single IsValueHistogramProbeCandidate helper used by the visitor, schema
  builder, and probe inserter.
* Extract pickProfiledValue helper for sharing between specializations.
* Make GT_LCLHEAP report its potential stack-overflow exception via
  OperExceptions so GTF_EXCEPT survives gtUpdateNodeOperSideEffects and
  if-conversion does not collapse the QMARK arms into a CMOV.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
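A minimal sketch of what a pickProfiledValue-style helper has to do, with a hypothetical signature (the real helper lives in fgprofile-adjacent JIT code and reads histogram schema entries): scan the histogram, pick the most frequent value, and fail when the result is empty or its likelihood is below the gating threshold.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hedged sketch: given (value, count) histogram entries, report the most
// popular value and its likelihood in percent. Fails on an empty histogram
// or a likelihood below `threshold`.
bool pickProfiledValue(const std::vector<std::pair<int64_t, uint32_t>>& histogram,
                       uint32_t threshold, int64_t* value, uint32_t* likelihood)
{
    // Defensively initialize outputs so callers can't observe garbage
    // on the failure path (one of the review fixes in this PR).
    *value      = 0;
    *likelihood = 0;

    uint32_t total = 0;
    uint32_t best  = 0;
    for (const auto& [val, count] : histogram)
    {
        total += count;
        if (count > best)
        {
            best   = count;
            *value = val;
        }
    }
    if (total == 0)
    {
        return false; // empty histogram: nothing to specialize on
    }
    *likelihood = best * 100 / total;
    return *likelihood >= threshold;
}
```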
Copilot AI review requested due to automatic review settings May 8, 2026 19:54
github-actions bot added the area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) label May 8, 2026

EgorBo and others added 2 commits May 8, 2026 22:12
* pickProfiledValue: avoid copying likelyValues[0] before the empty-result
  check; read directly from likelyValues[0] after the early return.
* pickProfiledValue: fix %u format specifier for ssize_t value (use %zd).
* pickProfiledValue: doc comment no longer implies the helper is call-only.
* impProfileLclHeap: replace truncating ((uint32_t)profiledValue > INT_MAX)
  range check with !FitsIn<int>(profiledValue).
* GT_LCLHEAP OperExceptions: reword comment to reflect unconditional
  modeling of stack overflow.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Add ValueProfiledLclHeap, ValueProfiledMemmove, ValueProfiledSequenceEqual
  metrics so SPMI replay/diff can show how often value-profile specialization
  fires per shape.
* impProfileLclHeap: clamp profiledValue to [0, getUnrollThreshold(Memset)]
  so we only specialize when the constant-size LCLHEAP fast path is actually
  unrolled by Lowering. Outside that range the cmp/jne guard adds overhead
  with no codegen win. Also reject negative values that FitsIn<int> would
  otherwise allow.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
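The profitability window from that commit can be sketched as a simple range check. The names are hypothetical stand-ins for the real FitsIn<int> and getUnrollThreshold(Memset) checks, and the lower cutoff folds in the small-size bail-out that a later commit in this PR settles on:

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the clamp described above: only specialize when the profiled
// size is positive, small enough that Lowering would unroll the constant-size
// zero-init, and big enough that the variable-size loop isn't already cheap.
bool shouldSpecializeLclHeap(int64_t profiledSize,
                             int64_t minProfitableSize, // e.g. 32 bytes (assumed)
                             int64_t unrollThreshold)   // getUnrollThreshold(Memset) analog
{
    return (profiledSize > minProfitableSize) && (profiledSize <= unrollThreshold);
}
```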
Copilot AI review requested due to automatic review settings May 8, 2026 20:39
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

Comment thread src/coreclr/jit/importercalls.cpp
EgorBo and others added 2 commits May 8, 2026 23:14
Removes the duplicated getLikelyValues+stress-test+threshold-check block
and the unguarded likelyValues[0] read that could observe uninitialized
data when getLikelyValues returns 0. Now goes through the shared
pickProfiledValue helper which handles the empty-result case correctly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
For profiled values <= DEFAULT_MAX_LOCALLOC_TO_LOCAL_SIZE (32 bytes), use
a zero-init must-init local for the fast path arm instead of LCLHEAP(N).
On ARM64 (and to a lesser extent on x64), the contained-constant LCLHEAP
codegen carries non-trivial overhead (stack probes, outgoing-arg-area
adjustment, separate STORE_BLK) that dwarfs the cost of the slow
variable-size loop for very small allocations.

The local is zeroed at function entry by must-init regardless of which
arm runs; for sizes <= 32 bytes that's a single SIMD store at most.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 8, 2026 21:23

Both impDuplicateWithProfiledArg and impProfileLclHeap had a TODO to set
weights for the QMARK branches and were leaving the default 50/50 split.
Use the histogram likelihood (already computed and the gating threshold
for creating the QMARK in the first place) so morph's QMARK->control-flow
expansion sets accurate edge probabilities for the cond/then/else blocks.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
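The weight fix amounts to turning the already-computed histogram likelihood into then/else probabilities instead of the default 50/50 split. A hypothetical sketch (the real code sets weights on QMARK arms in JIT IR):

```cpp
#include <cassert>

// Sketch: derive QMARK arm weights from the histogram likelihood (percent).
// `thenWeight` guards the specialized (popular-size) arm.
void setQmarkWeights(unsigned likelihoodPct, double* thenWeight, double* elseWeight)
{
    *thenWeight = likelihoodPct / 100.0;
    *elseWeight = 1.0 - *thenWeight;
}
```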
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

Comment thread src/coreclr/jit/importercalls.cpp Outdated
Comment thread src/coreclr/jit/importer.cpp
Comment thread src/coreclr/jit/fgprofile.cpp
Comment thread src/coreclr/jit/gentree.h Outdated
EgorBo and others added 2 commits May 9, 2026 00:01
The promote-to-local fast path turned out to be more complexity than it
was worth: at sizes <= DEFAULT_MAX_LOCALLOC_TO_LOCAL_SIZE (32) the
variable-size LCLHEAP loop is already fast enough that the cmp/jne
guard plus any constant-size codegen does not pay off. Just bail out
in that range.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 8, 2026 22:23

EgorBo commented May 8, 2026

Note

AI-generated benchmark (Copilot CLI). Re-run with a larger size set after skipping the <= 32-byte specialization.

@EgorBot -linux_amd -osx_arm64


Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 7 comments.

Comment thread src/coreclr/jit/gentree.cpp
Comment thread src/coreclr/jit/gentree.cpp Outdated
Comment thread src/coreclr/jit/fgprofile.cpp
Comment thread src/coreclr/jit/importercalls.cpp Outdated
Comment thread src/coreclr/jit/importercalls.cpp
Comment thread src/coreclr/jit/compiler.h Outdated
Comment thread src/coreclr/jit/importer.cpp
EgorBo and others added 2 commits May 9, 2026 00:40
* fgprofile.cpp: IsValueHistogramProbeCandidate now mirrors the consumer
  conditions for GT_LCLHEAP (compInitMem + non-constant op), so we don't
  waste schema slots / instrumentation overhead on locallocs whose data
  impProfileLclHeap would discard.
* importercalls.cpp: introduce a dedicated minProfitableSize constant so
  the cutoff is no longer coupled to DEFAULT_MAX_LOCALLOC_TO_LOCAL_SIZE
  (which serves an unrelated heuristic).
* importercalls.cpp: pickProfiledValue defensively initializes its output
  parameters on entry so future callers cannot observe garbage on the
  failure path.
* compiler.h: drop the default ilOffset = 0 on gtNewLclHeapNode to force
  callers to explicitly supply the IL offset (0 is a valid offset; an
  accidentally-omitted argument would silently desync the PGO schema).
* gentree.cpp: extend the GT_LCLHEAP comment in OperExceptions to spell
  out why we now model StackOverflow (required to keep if-conversion
  from collapsing the QMARK arms in impProfileLclHeap into a SELECT).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Stack overflow from localloc is process-fatal, not a catchable C# exception,
so modeling GT_LCLHEAP as GTF_EXCEPT was overstating its semantics and risked
pessimizing JIT phases that special-case GTF_EXCEPT (morph spill logic,
optExtractSideEffList, parent-flag propagation, etc.).

GTF_ORDER_SIDEEFF is the more accurate signal: it just tells the optimizer
not to reorder/fold across the node, which is exactly what we need to keep
if-conversion from collapsing the QMARK arms in impProfileLclHeap into a
SELECT/CMOV.

* gtNewLclHeapNode: set GTF_ORDER_SIDEEFF | GTF_DONT_CSE.
* OperSupportsOrderingSideEffect: add GT_LCLHEAP so OperEffects doesn't
  strip the flag.
* OperExceptions: revert the GT_LCLHEAP entry so we are back to not
  reporting any throwable exception (matches the existing intent of
  valuenum.cpp's "It is not necessary to model the StackOverflow
  exception for GT_LCLHEAP" comment).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 8, 2026 22:59