JIT: PGO value-profiling for non-constant zero-init stackalloc #127970
Draft
EgorBo wants to merge 10 commits into dotnet:main
Conversation
When stackalloc has a non-constant size and the method has compInitMem set,
codegen falls back to a slow per-stack-alignment zero-init loop. Use PGO
value probing to record the most popular size and, in Tier1, specialize:
size == popularSize ? LCLHEAP(popularSize) : LCLHEAP(size)
The constant-size LCLHEAP fast path is unrolled by Lowering as STORE_BLK,
producing efficient SIMD zero-init.
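To make the shape concrete, here is a hedged C# sketch of how the specialized code behaves; the method name and the constant 100 are invented for illustration, and the actual transformation happens on the JIT IR (a QMARK over two LCLHEAP nodes), not in source:

```csharp
using System;

public static class LclHeapPgoSketch
{
    // Today: variable-size, zero-initialized stackalloc. With compInitMem the JIT
    // has to zero the allocation with a per-stack-alignment loop because the size
    // is not a constant.
    public static int Process(int size)
    {
        Span<byte> buffer = stackalloc byte[size];
        // ... use buffer ...
        return buffer.Length;
    }

    // Conceptual Tier1 shape once PGO value profiling reports 100 as the dominant
    // size. The constant-size arm is what Lowering can unroll into STORE_BLK /
    // SIMD zero-init; any other size still takes the variable-size zero loop.
    public static int Process_Specialized(int size)
    {
        Span<byte> buffer = size == 100
            ? stackalloc byte[100]
            : stackalloc byte[size];
        // ... use buffer ...
        return buffer.Length;
    }
}
```

The guard only pays off when the constant-size arm can actually be unrolled, which is why the profiled size is later clamped to the memset unroll threshold (see the commit notes below).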
Generalized value-probe infrastructure:
* Introduce GenTreeOpWithILOffset (a GenTreeOp that carries an IL offset),
used by GT_LCLHEAP so it can participate in value-histogram instrumentation
alongside calls, without needing a side hash table.
* Centralize value-probe candidate identification in fgprofile.cpp via a
single IsValueHistogramProbeCandidate helper used by the visitor, schema
builder, and probe inserter.
* Extract pickProfiledValue helper for sharing between specializations (see the sketch below).
* Make GT_LCLHEAP report its potential stack-overflow exception via
OperExceptions so GTF_EXCEPT survives gtUpdateNodeOperSideEffects and
if-conversion does not collapse the QMARK arms into a CMOV.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
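For readers less familiar with the JIT internals, here is a rough C# sketch of the selection logic a pickProfiledValue-style helper performs; the type, signature, and threshold below are invented for the example and do not mirror the C++ implementation:

```csharp
using System;

// Hypothetical histogram entry: a profiled value and how often it was observed.
public readonly record struct LikelyValue(long Value, uint Count);

public static class ValueProbeSketch
{
    // Pick the most frequent profiled value; report it only when its likelihood
    // clears a profitability threshold. Outputs are initialized up front so the
    // failure path never exposes garbage, and the empty-histogram case returns
    // early without touching likelyValues[0].
    public static bool TryPickProfiledValue(
        ReadOnlySpan<LikelyValue> likelyValues,
        double minLikelihood,
        out long value,
        out double likelihood)
    {
        value = 0;
        likelihood = 0.0;

        if (likelyValues.IsEmpty)
        {
            return false;
        }

        long total = 0;
        int best = 0;
        for (int i = 0; i < likelyValues.Length; i++)
        {
            total += likelyValues[i].Count;
            if (likelyValues[i].Count > likelyValues[best].Count)
            {
                best = i;
            }
        }

        value = likelyValues[best].Value;
        likelihood = total == 0 ? 0.0 : (double)likelyValues[best].Count / total;
        return likelihood >= minLikelihood;
    }
}
```

The same likelihood is also a natural input for the QMARK branch weights, which a later commit wires up instead of leaving the default 50/50 split.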
* pickProfiledValue: avoid copying likelyValues[0] before the empty-result check; read directly from likelyValues[0] after the early return.
* pickProfiledValue: fix %u format specifier for ssize_t value (use %zd).
* pickProfiledValue: doc comment no longer implies the helper is call-only.
* impProfileLclHeap: replace truncating ((uint32_t)profiledValue > INT_MAX) range check with !FitsIn<int>(profiledValue).
* GT_LCLHEAP OperExceptions: reword comment to reflect unconditional modeling of stack overflow.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Add ValueProfiledLclHeap, ValueProfiledMemmove, ValueProfiledSequenceEqual metrics so SPMI replay/diff can show how often value-profile specialization fires per shape.
* impProfileLclHeap: clamp profiledValue to [0, getUnrollThreshold(Memset)] so we only specialize when the constant-size LCLHEAP fast path is actually unrolled by Lowering. Outside that range the cmp/jne guard adds overhead with no codegen win. Also reject negative values that FitsIn<int> would otherwise allow.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Removes the duplicated getLikelyValues+stress-test+threshold-check block and the unguarded likelyValues[0] read that could observe uninitialized data when getLikelyValues returns 0. Now goes through the shared pickProfiledValue helper which handles the empty-result case correctly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

For profiled values <= DEFAULT_MAX_LOCALLOC_TO_LOCAL_SIZE (32 bytes), use a zero-init must-init local for the fast path arm instead of LCLHEAP(N). On ARM64 (and to a lesser extent on x64), the contained-constant LCLHEAP codegen carries non-trivial overhead (stack probes, outgoing-arg-area adjustment, separate STORE_BLK) that dwarfs the cost of the slow variable-size loop for very small allocations. The local is zeroed at function entry by must-init regardless of which arm runs; for sizes <= 32 bytes that's a single SIMD store at most.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Both impDuplicateWithProfiledArg and impProfileLclHeap had a TODO to set weights for the QMARK branches and were leaving the default 50/50 split. Use the histogram likelihood (already computed and the gating threshold for creating the QMARK in the first place) so morph's QMARK->control-flow expansion sets accurate edge probabilities for the cond/then/else blocks.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The promote-to-local fast path turned out to be more complexity than it was worth: at sizes <= DEFAULT_MAX_LOCALLOC_TO_LOCAL_SIZE (32) the variable-size LCLHEAP loop is already fast enough that the cmp/jne guard plus any constant-size codegen does not pay off. Just bail out in that range.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Note: AI-generated benchmark (Copilot CLI). Re-run with a larger size set after skipping the <= 32 byte specialization.

@EgorBot -linux_amd -osx_arm64

```csharp
using System;
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Bench).Assembly).Run(args);
public class Bench
{
[Benchmark] public void Stackalloc32() => Consume(stackalloc byte[GetVar(32)]);
[Benchmark] public void Stackalloc40() => Consume(stackalloc byte[GetVar(40)]);
[Benchmark] public void Stackalloc50() => Consume(stackalloc byte[GetVar(50)]);
[Benchmark] public void Stackalloc64() => Consume(stackalloc byte[GetVar(64)]);
[Benchmark] public void Stackalloc128() => Consume(stackalloc byte[GetVar(128)]);
[Benchmark] public void Stackalloc200() => Consume(stackalloc byte[GetVar(200)]);
[Benchmark] public void Stackalloc256() => Consume(stackalloc byte[GetVar(256)]);
[Benchmark] public void Stackalloc512() => Consume(stackalloc byte[GetVar(512)]);
[Benchmark] public void Stackalloc1024() => Consume(stackalloc byte[GetVar(1024)]);
[Benchmark] public void Stackalloc16384() => Consume(stackalloc byte[GetVar(16384)]);
[Benchmark] public void Stackalloc524288() => Consume(stackalloc byte[GetVar(524288)]);
[MethodImpl(MethodImplOptions.NoInlining)] static int GetVar(int a) => a;
[MethodImpl(MethodImplOptions.NoInlining)] static void Consume(Span<byte> buffer) { }
}
```
* fgprofile.cpp: IsValueHistogramProbeCandidate now mirrors the consumer conditions for GT_LCLHEAP (compInitMem + non-constant op), so we don't waste schema slots / instrumentation overhead on locallocs whose data impProfileLclHeap would discard.
* importercalls.cpp: introduce a dedicated minProfitableSize constant so the cutoff is no longer coupled to DEFAULT_MAX_LOCALLOC_TO_LOCAL_SIZE (which serves an unrelated heuristic).
* importercalls.cpp: pickProfiledValue defensively initializes its output parameters on entry so future callers cannot observe garbage on the failure path.
* compiler.h: drop the default ilOffset = 0 on gtNewLclHeapNode to force callers to explicitly supply the IL offset (0 is a valid offset; an accidentally-omitted argument would silently desync the PGO schema).
* gentree.cpp: extend the GT_LCLHEAP comment in OperExceptions to spell out why we now model StackOverflow (required to keep if-conversion from collapsing the QMARK arms in impProfileLclHeap into a SELECT).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Stack overflow from localloc is process-fatal, not a catchable C# exception, so modeling GT_LCLHEAP as GTF_EXCEPT was overstating its semantics and risked pessimizing JIT phases that special-case GTF_EXCEPT (morph spill logic, optExtractSideEffList, parent-flag propagation, etc.). GTF_ORDER_SIDEEFF is the more accurate signal: it just tells the optimizer not to reorder/fold across the node, which is exactly what we need to keep if-conversion from collapsing the QMARK arms in impProfileLclHeap into a SELECT/CMOV.

* gtNewLclHeapNode: set GTF_ORDER_SIDEEFF | GTF_DONT_CSE.
* OperSupportsOrderingSideEffect: add GT_LCLHEAP so OperEffects doesn't strip the flag.
* OperExceptions: revert the GT_LCLHEAP entry so we are back to not reporting any throwable exception (matches the existing intent of valuenum.cpp's "It is not necessary to model the StackOverflow exception for GT_LCLHEAP" comment).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Revive my old PR that tried to do this, now that #127959 is merged.

Instrument stackallocs of variable size for PGO, and generalize value probing so we can reuse it for other future ideas.

A variable-size, zero-initialized stackalloc is optimized with an additional fast path for the most common length: when we call it mostly with the same size (say, 100), we zero it with just 3 AVX512 instructions instead of a slow loop.
Benchmark
Results
Speedup of PR vs main (>1.00× = PR is faster). Sizes <= 32 bytes are intentionally not specialized (variable-size loop is already cheap there). Raw runs: EgorBot/Benchmarks#197.
Wins are large on Linux x64 across the board (up to 3.76× at 512 bytes). On Apple M4 the wins are also significant for non-power-of-two sizes (40 / 50 / 64 / 200 -> 2-3×), with two minor regressions at 128 and 256 (-13% / -11%) and parity-or-better elsewhere.