JIT: PGO value-profiling for non-constant zero-init stackalloc #127970

Draft
EgorBo wants to merge 10 commits into dotnet:main from EgorBo:lclheap-pgo-value-probing

Conversation

EgorBo (Member) commented May 8, 2026

Revive my old PR that tried to do this, now that #127959 is merged

Instrument stackallocs of variable size for PGO, and generalize value probing so we can reuse it for other future ideas.

void Test(int len)
{
    Span<byte> buffer = stackalloc byte[len];
    Consume(buffer);
}

is optimized to include an additional fast path for the most common length:

       cmp      rax, 100
       je       SHORT G_M44326_IG06
       ...
G_M44326_IG06:  ;; offset=0x004F
       test     dword ptr [rsp], esp
       sub      rsp, 112
       lea      rax, [rsp+0x20]
       vxorps   ymm0, ymm0, ymm0
       vmovdqu32 zmmword ptr [rax], zmm0
       vmovdqu32 zmmword ptr [rax+0x30], zmm0

when it is mostly called with the same size (100 in this example), so we zero the buffer with just 3 AVX-512 instructions instead of a slow loop.

Benchmark

using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Bench).Assembly).Run(args);

public class Bench
{
    [Benchmark] public void Stackalloc32()     => Consume(stackalloc byte[GetVar(32)]);
    [Benchmark] public void Stackalloc40()     => Consume(stackalloc byte[GetVar(40)]);
    [Benchmark] public void Stackalloc50()     => Consume(stackalloc byte[GetVar(50)]);
    [Benchmark] public void Stackalloc64()     => Consume(stackalloc byte[GetVar(64)]);
    [Benchmark] public void Stackalloc128()    => Consume(stackalloc byte[GetVar(128)]);
    [Benchmark] public void Stackalloc200()    => Consume(stackalloc byte[GetVar(200)]);
    [Benchmark] public void Stackalloc256()    => Consume(stackalloc byte[GetVar(256)]);
    [Benchmark] public void Stackalloc512()    => Consume(stackalloc byte[GetVar(512)]);
    [Benchmark] public void Stackalloc1024()   => Consume(stackalloc byte[GetVar(1024)]);
    [Benchmark] public void Stackalloc16384()  => Consume(stackalloc byte[GetVar(16384)]);
    [Benchmark] public void Stackalloc524288() => Consume(stackalloc byte[GetVar(524288)]);

    [MethodImpl(MethodImplOptions.NoInlining)] static int GetVar(int a) => a;
    [MethodImpl(MethodImplOptions.NoInlining)] static void Consume(Span<byte> buffer) { }
}

Results

Speedup of PR vs main (>1.00× = PR is faster). Sizes <= 32 bytes are intentionally not specialized (variable-size loop is already cheap there).

| Size    | Linux x64 (EPYC 9V45) | Apple M4 arm64 |
|--------:|----------------------:|---------------:|
| 32      | 1.00×                 | 0.92×          |
| 40      | 1.25×                 | 3.23×          |
| 50      | 1.37×                 | 2.32×          |
| 64      | 1.34×                 | 2.28×          |
| 128     | 1.77×                 | 0.87×          |
| 200     | 2.26×                 | 2.03×          |
| 256     | 2.46×                 | 0.89×          |
| 512     | 3.76×                 | 1.34×          |
| 1,024   | 2.43×                 | 1.27×          |
| 16,384  | 2.47×                 | 1.69×          |
| 524,288 | 2.54×                 | 1.24×          |

Raw runs: EgorBot/Benchmarks#197.

Wins are large on Linux x64 across the board (up to 3.76× at 512 bytes). On Apple M4 the wins are also significant for non-power-of-two sizes (40 / 50 / 64 / 200 -> 2-3×), with two minor regressions at 128 and 256 (-13% / -11%) and parity-or-better elsewhere.

When stackalloc has a non-constant size and the method has compInitMem,
codegen falls back to a slow per-stack-alignment zero-init loop. Use PGO
value probing to record the most popular size, and in Tier1 specialize:

    size == popularSize ? LCLHEAP(popularSize) : LCLHEAP(size)

The constant-size LCLHEAP fast path is unrolled by Lowering as STORE_BLK,
producing efficient SIMD zero-init.
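In C++ terms, the guarded shape described above looks roughly like the following sketch. The names are hypothetical and `__builtin_alloca` merely stands in for GT_LCLHEAP; the real transform operates on JIT IR, not source code. The point is that the taken arm has a compile-time-constant zero-init size, which a compiler can unroll into a few SIMD stores:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Consumer stand-in: touches the whole buffer so the zero-init is observable.
static int sumBuffer(const uint8_t* buf, int len)
{
    int sum = 0;
    for (int i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}

int testGuardedStackalloc(int size)
{
    constexpr int popularSize = 100; // recorded by PGO value probing (assumed)
    uint8_t* buf = static_cast<uint8_t*>(__builtin_alloca(size));
    if (size == popularSize)
    {
        memset(buf, 0, popularSize); // constant size: unrollable SIMD zero-init
    }
    else
    {
        memset(buf, 0, size);        // variable size: generic (slow) zero-init
    }
    return sumBuffer(buf, size);
}
```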

Generalized value-probe infrastructure:
* Introduce GenTreeOpWithILOffset (a GenTreeOp that carries an IL offset),
  used by GT_LCLHEAP so it can participate in value-histogram instrumentation
  alongside calls without a side hash table.
* Centralize value-probe candidate identification in fgprofile.cpp via a
  single IsValueHistogramProbeCandidate helper used by the visitor, schema
  builder, and probe inserter.
* Extract pickProfiledValue helper for sharing between specializations.
* Make GT_LCLHEAP report its potential stack-overflow exception via
  OperExceptions so GTF_EXCEPT survives gtUpdateNodeOperSideEffects and
  if-conversion does not collapse the QMARK arms into a CMOV.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
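A minimal sketch of what a pickProfiledValue-style helper has to do, with a hypothetical signature (the real helper lives in fgprofile-adjacent JIT code and reads histogram schema entries): scan the histogram, pick the most frequent value, and fail when the result is empty or its likelihood is below the gating threshold.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hedged sketch: given (value, count) histogram entries, report the most
// popular value and its likelihood in percent. Fails on an empty histogram
// or a likelihood below `threshold`.
bool pickProfiledValue(const std::vector<std::pair<int64_t, uint32_t>>& histogram,
                       uint32_t threshold, int64_t* value, uint32_t* likelihood)
{
    // Defensively initialize outputs so callers can't observe garbage
    // on the failure path (one of the review fixes in this PR).
    *value      = 0;
    *likelihood = 0;

    uint32_t total = 0;
    uint32_t best  = 0;
    for (const auto& [val, count] : histogram)
    {
        total += count;
        if (count > best)
        {
            best   = count;
            *value = val;
        }
    }
    if (total == 0)
    {
        return false; // empty histogram: nothing to specialize on
    }
    *likelihood = best * 100 / total;
    return *likelihood >= threshold;
}
```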
Copilot AI review requested due to automatic review settings May 8, 2026 19:54
github-actions bot added the area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) label May 8, 2026

EgorBo and others added 2 commits May 8, 2026 22:12
* pickProfiledValue: avoid copying likelyValues[0] before the empty-result
  check; read directly from likelyValues[0] after the early return.
* pickProfiledValue: fix %u format specifier for ssize_t value (use %zd).
* pickProfiledValue: doc comment no longer implies the helper is call-only.
* impProfileLclHeap: replace truncating ((uint32_t)profiledValue > INT_MAX)
  range check with !FitsIn<int>(profiledValue).
* GT_LCLHEAP OperExceptions: reword comment to reflect unconditional
  modeling of stack overflow.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Add ValueProfiledLclHeap, ValueProfiledMemmove, ValueProfiledSequenceEqual
  metrics so SPMI replay/diff can show how often value-profile specialization
  fires per shape.
* impProfileLclHeap: clamp profiledValue to [0, getUnrollThreshold(Memset)]
  so we only specialize when the constant-size LCLHEAP fast path is actually
  unrolled by Lowering. Outside that range the cmp/jne guard adds overhead
  with no codegen win. Also reject negative values that FitsIn<int> would
  otherwise allow.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
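The profitability window from that commit can be sketched as a simple range check. The names are hypothetical stand-ins for the real FitsIn<int> and getUnrollThreshold(Memset) checks, and the lower cutoff folds in the small-size bail-out that a later commit in this PR settles on:

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the clamp described above: only specialize when the profiled
// size is positive, small enough that Lowering would unroll the constant-size
// zero-init, and big enough that the variable-size loop isn't already cheap.
bool shouldSpecializeLclHeap(int64_t profiledSize,
                             int64_t minProfitableSize, // e.g. 32 bytes (assumed)
                             int64_t unrollThreshold)   // getUnrollThreshold(Memset) analog
{
    return (profiledSize > minProfitableSize) && (profiledSize <= unrollThreshold);
}
```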
Copilot AI review requested due to automatic review settings May 8, 2026 20:39
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

Comment thread src/coreclr/jit/importercalls.cpp
EgorBo and others added 2 commits May 8, 2026 23:14
Removes the duplicated getLikelyValues+stress-test+threshold-check block
and the unguarded likelyValues[0] read that could observe uninitialized
data when getLikelyValues returns 0. Now goes through the shared
pickProfiledValue helper which handles the empty-result case correctly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
For profiled values <= DEFAULT_MAX_LOCALLOC_TO_LOCAL_SIZE (32 bytes), use
a zero-init must-init local for the fast path arm instead of LCLHEAP(N).
On ARM64 (and to a lesser extent on x64), the contained-constant LCLHEAP
codegen carries non-trivial overhead (stack probes, outgoing-arg-area
adjustment, separate STORE_BLK) that dwarfs the cost of the slow
variable-size loop for very small allocations.

The local is zeroed at function entry by must-init regardless of which
arm runs; for sizes <= 32 bytes that's a single SIMD store at most.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 8, 2026 21:23

Both impDuplicateWithProfiledArg and impProfileLclHeap had a TODO to set
weights for the QMARK branches and were leaving the default 50/50 split.
Use the histogram likelihood (already computed and the gating threshold
for creating the QMARK in the first place) so morph's QMARK->control-flow
expansion sets accurate edge probabilities for the cond/then/else blocks.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
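The weight fix amounts to turning the already-computed histogram likelihood into then/else probabilities instead of the default 50/50 split. A hypothetical sketch (the real code sets weights on QMARK arms in JIT IR):

```cpp
#include <cassert>

// Sketch: derive QMARK arm weights from the histogram likelihood (percent).
// `thenWeight` guards the specialized (popular-size) arm.
void setQmarkWeights(unsigned likelihoodPct, double* thenWeight, double* elseWeight)
{
    *thenWeight = likelihoodPct / 100.0;
    *elseWeight = 1.0 - *thenWeight;
}
```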
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

Comment thread src/coreclr/jit/importercalls.cpp Outdated
Comment thread src/coreclr/jit/importer.cpp
Comment thread src/coreclr/jit/fgprofile.cpp
Comment thread src/coreclr/jit/gentree.h Outdated
EgorBo and others added 2 commits May 9, 2026 00:01
The promote-to-local fast path turned out to be more complexity than it
was worth: at sizes <= DEFAULT_MAX_LOCALLOC_TO_LOCAL_SIZE (32) the
variable-size LCLHEAP loop is already fast enough that the cmp/jne
guard plus any constant-size codegen does not pay off. Just bail out
in that range.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 8, 2026 22:23

EgorBo commented May 8, 2026

Note

AI-generated benchmark (Copilot CLI). Re-run with a larger size set after skipping the <= 32-byte specialization.

@EgorBot -linux_amd -osx_arm64


Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 7 comments.

Comment thread src/coreclr/jit/gentree.cpp
Comment thread src/coreclr/jit/gentree.cpp Outdated
Comment thread src/coreclr/jit/fgprofile.cpp
Comment thread src/coreclr/jit/importercalls.cpp Outdated
Comment thread src/coreclr/jit/importercalls.cpp
Comment thread src/coreclr/jit/compiler.h Outdated
Comment thread src/coreclr/jit/importer.cpp
EgorBo and others added 2 commits May 9, 2026 00:40
* fgprofile.cpp: IsValueHistogramProbeCandidate now mirrors the consumer
  conditions for GT_LCLHEAP (compInitMem + non-constant op), so we don't
  waste schema slots / instrumentation overhead on locallocs whose data
  impProfileLclHeap would discard.
* importercalls.cpp: introduce a dedicated minProfitableSize constant so
  the cutoff is no longer coupled to DEFAULT_MAX_LOCALLOC_TO_LOCAL_SIZE
  (which serves an unrelated heuristic).
* importercalls.cpp: pickProfiledValue defensively initializes its output
  parameters on entry so future callers cannot observe garbage on the
  failure path.
* compiler.h: drop the default ilOffset = 0 on gtNewLclHeapNode to force
  callers to explicitly supply the IL offset (0 is a valid offset; an
  accidentally-omitted argument would silently desync the PGO schema).
* gentree.cpp: extend the GT_LCLHEAP comment in OperExceptions to spell
  out why we now model StackOverflow (required to keep if-conversion
  from collapsing the QMARK arms in impProfileLclHeap into a SELECT).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Stack overflow from localloc is process-fatal, not a catchable C# exception,
so modeling GT_LCLHEAP as GTF_EXCEPT was overstating its semantics and risked
pessimizing JIT phases that special-case GTF_EXCEPT (morph spill logic,
optExtractSideEffList, parent-flag propagation, etc.).

GTF_ORDER_SIDEEFF is the more accurate signal: it just tells the optimizer
not to reorder/fold across the node, which is exactly what we need to keep
if-conversion from collapsing the QMARK arms in impProfileLclHeap into a
SELECT/CMOV.

* gtNewLclHeapNode: set GTF_ORDER_SIDEEFF | GTF_DONT_CSE.
* OperSupportsOrderingSideEffect: add GT_LCLHEAP so OperEffects doesn't
  strip the flag.
* OperExceptions: revert the GT_LCLHEAP entry so we are back to not
  reporting any throwable exception (matches the existing intent of
  valuenum.cpp's "It is not necessary to model the StackOverflow
  exception for GT_LCLHEAP" comment).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 8, 2026 22:59