JIT: avoid store forward stall for struct params in GS frames#127487
AndyAyersMS wants to merge 1 commit into dotnet:main
If we have a struct param in a GS frame, we will spill it using narrow writes and then copy it to the shadow param with wide stores, causing a store-forward stall. Try and avoid this by forcing the copies to be int-register sized. Addresses dotnet#121248.
@EgorBo FYI: this probably needs revising, but it is the rough idea. The resulting code is ugly (we spill, then copy), but at least all the memory traffic uses same-sized chunks, so it should be faster.
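To make the stall condition concrete, here is a minimal sketch of store-to-load forwarding as a toy model. It assumes the simplified rule that a load can be forwarded only when a single pending store fully covers it; real rules vary by microarchitecture. `PendingStore`, `canForward`, and `stalls` are hypothetical names for illustration and are not part of the JIT.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical model, not JIT code: one store still sitting in the store buffer.
struct PendingStore
{
    uint32_t offset; // byte offset within the stack frame
    uint32_t size;   // store width in bytes
};

// Assumed simplified rule: a load is forwardable only when one pending store
// fully covers the byte range [loadOffset, loadOffset + loadSize).
bool canForward(const std::vector<PendingStore>& stores, uint32_t loadOffset, uint32_t loadSize)
{
    for (const PendingStore& s : stores)
    {
        if (s.offset <= loadOffset && loadOffset + loadSize <= s.offset + s.size)
        {
            return true;
        }
    }
    return false;
}

// The stall case: the load overlaps pending store data but cannot be forwarded.
bool stalls(const std::vector<PendingStore>& stores, uint32_t loadOffset, uint32_t loadSize)
{
    bool overlaps = false;
    for (const PendingStore& s : stores)
    {
        if (loadOffset < s.offset + s.size && s.offset < loadOffset + loadSize)
        {
            overlaps = true;
        }
    }
    return overlaps && !canForward(stores, loadOffset, loadSize);
}
```

In this model, two 8-byte spills followed by one 16-byte load stall, while two 8-byte loads of the same data do not, which is the effect the change is after.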
Pull request overview
This PR adjusts x86/x64 JIT block-copy lowering/LSRA/codegen to avoid generating SIMD wide-load/store sequences when copying multi-register struct arguments in GS (shadow-param) frames, preventing store-forwarding stalls caused by mismatched spill/copy store widths.
Changes:
- Detect when the source of a `GT_STORE_BLK` unrolled copy is a multi-register struct argument and avoid SIMD-based unrolled copying in that case.
- Propagate the “disable SIMD for this copy” decision through lowering (threshold selection), LSRA (internal register needs), and codegen (actual instruction selection).
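The decision described in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the JIT's actual code: `planUnrolledCopy` is a hypothetical helper, the 16-byte SIMD and 8-byte GPR widths are assumed, and the real codegen handles remainders and register allocation in more involved ways.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper, not the JIT's API: returns the chunk widths (in bytes)
// used to copy `size` bytes with an unrolled sequence of moves.
std::vector<uint32_t> planUnrolledCopy(uint32_t size, bool srcIsMultiRegArg, bool simdAvailable)
{
    const uint32_t simdSize = 16; // assumed XMM-sized chunk
    const uint32_t gprSize  = 8;  // assumed 64-bit general-purpose register
    std::vector<uint32_t> chunks;

    // Skip wide SIMD chunks when the source was spilled with register-sized
    // stores, so that the loads match the widths of the preceding stores.
    if (simdAvailable && !srcIsMultiRegArg)
    {
        while (size >= simdSize)
        {
            chunks.push_back(simdSize);
            size -= simdSize;
        }
    }

    // Copy whatever is left with progressively narrower GPR-sized moves.
    for (uint32_t width = gprSize; width >= 1; width /= 2)
    {
        while (size >= width)
        {
            chunks.push_back(width);
            size -= width;
        }
    }

    return chunks;
}
```

For a 16-byte struct, this sketch plans a single 16-byte chunk in the ordinary case and two 8-byte chunks when the source is a multi-register argument.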
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/coreclr/jit/lsraxarch.cpp | Avoids allocating SIMD temps for unrolled block stores when copying from multi-reg arg locals; ensures an int temp is available instead. |
| src/coreclr/jit/lowerxarch.cpp | Disables SIMD eligibility (and thus affects unroll heuristics) for block copies whose source is a multi-reg arg local. |
| src/coreclr/jit/codegenxarch.cpp | Skips SIMD unrolled copy emission for multi-reg arg sources, falling back to GPR-sized moves to avoid store-forward stalls. |
```cpp
bool srcIsRegArg = false;

if (src->OperIs(GT_LCL_VAR))
{
    unsigned srcLclNum = src->AsLclVar()->GetLclNum();
    srcIsRegArg = m_compiler->lvaGetDesc(srcLclNum)->lvIsMultiRegArg;
}
```
The new flag name `srcIsRegArg` is misleading here: it is set based on `lvIsMultiRegArg` (multi-register struct arg), not `lvIsRegArg` or a general "reg arg" condition. Consider renaming to something like `srcIsMultiRegArg` (or similar) to avoid confusion with the existing `lvIsRegArg` field and to better document why SIMD is being disabled.
```cpp
GenTree* src = node->Data();
bool srcIsRegArg = false;

assert(src->isContained());

if (src->OperIs(GT_LCL_VAR, GT_LCL_FLD))
{
    srcLclNum = src->AsLclVarCommon()->GetLclNum();
    srcOffset = src->AsLclVarCommon()->GetLclOffs();

    if (src->OperIs(GT_LCL_VAR))
    {
        srcIsRegArg = m_compiler->lvaGetDesc(srcLclNum)->lvIsMultiRegArg;
    }
```
The new boolean `srcIsRegArg` is derived from `lvIsMultiRegArg` (multi-register struct arg), not from `lvIsRegArg` or a generic register-argument property. Renaming it to something like `srcIsMultiRegArg` would make the intent clearer and avoid confusion with the existing `lvIsRegArg` flag on locals.
@EgorBot -arm -amd -windows_intel

```cs
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Engines;
using BenchmarkDotNet.Running;

public class Benchmarks
{
    [Benchmark]
    public long Bench_stackalloc() => ParseNonCanonical_stackalloc("11");

    [Benchmark]
    public long Bench_InlineArray() => ParseNonCanonical_InlineArray("11");

    [MethodImpl(MethodImplOptions.NoInlining)]
    int ParseNonCanonical_stackalloc(ReadOnlySpan<char> name)
    {
        Span<long> parts = stackalloc long[3];
        Consume(parts);
        return name[1];
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    int ParseNonCanonical_InlineArray(ReadOnlySpan<char> name)
    {
        Span<long> parts = [0, 0, 0];
        Consume(parts);
        return name[1];
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Consume(Span<long> parts) { }
}
```
@EgorBot -arm -amd

```cs
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Engines;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Benchmarks).Assembly).Run(args);

public class Benchmarks
{
    [Benchmark]
    public long Bench_stackalloc() => ParseNonCanonical_stackalloc("11");

    [Benchmark]
    public long Bench_InlineArray() => ParseNonCanonical_InlineArray("11");

    [MethodImpl(MethodImplOptions.NoInlining)]
    int ParseNonCanonical_stackalloc(ReadOnlySpan<char> name)
    {
        Span<long> parts = stackalloc long[3];
        Consume(parts);
        return name[1];
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    int ParseNonCanonical_InlineArray(ReadOnlySpan<char> name)
    {
        Span<long> parts = [0, 0, 0];
        Consume(parts);
        return name[1];
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Consume(Span<long> parts) { }
}
```
Oops, you beat me to it...