
JIT: avoid store forward stall for struct params in GS frames #127487

Draft
AndyAyersMS wants to merge 1 commit into dotnet:main from AndyAyersMS:GsAvoidStoreForwardStall

@AndyAyersMS
Member

If we have a struct param in a GS frame, we spill it using narrow writes and then copy it to the shadow param with wide loads and stores, causing a store-forward stall. Try to avoid this by forcing the copies to be int-register sized.

Addresses #121248.
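
To illustrate the access pattern described above: a store-forward stall typically arises when a wide load reads bytes that were just written by several narrower stores, because the load cannot be satisfied from a single pending store. The sketch below is not JIT code. `spillNarrow`, `copyWide`, and `copyMatched` are hypothetical names, and an optimizing C++ compiler is free to pick its own instruction widths, so this only models the two shapes of memory traffic (mismatched versus int-register sized).

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical model of the prolog spill: a 16-byte struct arrives in two
// 8-byte integer registers and is written to its home slot with two narrow
// (8-byte) stores.
void spillNarrow(unsigned char* home, std::uint64_t reg0, std::uint64_t reg1)
{
    std::memcpy(home, &reg0, 8);
    std::memcpy(home + 8, &reg1, 8);
}

// Copy to the shadow slot as one 16-byte chunk. If this is emitted as a
// single vector load, that load spans two pending 8-byte stores, which is
// the mismatched pattern the PR description is concerned about.
void copyWide(unsigned char* shadow, const unsigned char* home)
{
    std::memcpy(shadow, home, 16);
}

// Copy to the shadow slot in two 8-byte chunks, so each load matches the
// width of the store that produced its data.
void copyMatched(unsigned char* shadow, const unsigned char* home)
{
    std::uint64_t tmp;
    std::memcpy(&tmp, home, 8);
    std::memcpy(shadow, &tmp, 8);
    std::memcpy(&tmp, home + 8, 8);
    std::memcpy(shadow + 8, &tmp, 8);
}
```

Both copies produce identical bytes in the shadow slot; only the width of the individual loads and stores differs.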

Copilot AI review requested due to automatic review settings April 28, 2026 00:25
@github-actions github-actions Bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 28, 2026
@AndyAyersMS
Member Author

@EgorBo FYI -- probably needs revising, but this is the rough idea.

The resulting code is ugly (we spill, then copy), but at least all the memory traffic uses same-sized chunks, so it should be faster.

Contributor

Copilot AI left a comment


Pull request overview

This PR adjusts x86/x64 JIT block-copy lowering/LSRA/codegen to avoid generating SIMD wide-load/store sequences when copying multi-register struct arguments in GS (shadow-param) frames, preventing store-forwarding stalls caused by mismatched spill/copy store widths.

Changes:

  • Detect when the source of a GT_STORE_BLK unrolled copy is a multi-register struct argument and avoid SIMD-based unrolled copying in that case.
  • Propagate the “disable SIMD for this copy” decision through lowering (threshold selection), LSRA (internal register needs), and codegen (actual instruction selection).
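
The decision summarized in the list above can be sketched as a small chunk planner. This is a simplified model, not the JIT's implementation: `LocalDesc`, `planCopyChunks`, and the two size constants are hypothetical, the target is assumed to be 64-bit with 16-byte vector moves, and the real unrolling logic may handle remainders differently.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a JIT local descriptor; only the one flag
// relevant to this sketch is modeled.
struct LocalDesc
{
    bool isMultiRegArg;
};

constexpr std::size_t kIntRegSize = 8;  // assumption: 64-bit integer registers
constexpr std::size_t kSimdSize   = 16; // assumption: 16-byte vector moves

// Returns the widths, in bytes, of the moves used to copy `size` bytes.
// SIMD-sized chunks are skipped when the source is a multi-register struct
// argument, so the copy widths match the int-register sized prolog spills.
std::vector<std::size_t> planCopyChunks(std::size_t size, const LocalDesc& src)
{
    std::vector<std::size_t> chunks;
    std::size_t              remaining = size;
    const bool               useSimd   = !src.isMultiRegArg;

    if (useSimd)
    {
        while (remaining >= kSimdSize)
        {
            chunks.push_back(kSimdSize);
            remaining -= kSimdSize;
        }
    }

    // Cover whatever is left with integer moves of decreasing width.
    for (std::size_t width = kIntRegSize; width >= 1; width /= 2)
    {
        while (remaining >= width)
        {
            chunks.push_back(width);
            remaining -= width;
        }
    }

    return chunks;
}
```

In this model, a 16-byte copy from a multi-register argument is planned as two 8-byte moves, while the same copy from any other local is planned as a single 16-byte move. The same yes/no answer would need to be computed consistently wherever the copy is sized, which is why the PR touches lowering, LSRA, and codegen together.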

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/coreclr/jit/lsraxarch.cpp | Avoids allocating SIMD temps for unrolled block stores when copying from multi-reg arg locals; ensures an int temp is available instead. |
| src/coreclr/jit/lowerxarch.cpp | Disables SIMD eligibility (and thus affects unroll heuristics) for block copies whose source is a multi-reg arg local. |
| src/coreclr/jit/codegenxarch.cpp | Skips SIMD unrolled copy emission for multi-reg arg sources, falling back to GPR-sized moves to avoid store-forward stalls. |

Comment on lines +1498 to +1504
bool srcIsRegArg = false;

if (src->OperIs(GT_LCL_VAR))
{
    unsigned srcLclNum = src->AsLclVar()->GetLclNum();
    srcIsRegArg = m_compiler->lvaGetDesc(srcLclNum)->lvIsMultiRegArg;
}

Copilot AI Apr 28, 2026


The new flag name srcIsRegArg is misleading here: it is set based on lvIsMultiRegArg (multi-register struct arg), not lvIsRegArg or a general "reg arg" condition. Consider renaming to something like srcIsMultiRegArg (or similar) to avoid confusion with the existing lvIsRegArg field and to better document why SIMD is being disabled.

Comment on lines 3436 to +3449
GenTree* src = node->Data();
bool srcIsRegArg = false;

assert(src->isContained());

if (src->OperIs(GT_LCL_VAR, GT_LCL_FLD))
{
    srcLclNum = src->AsLclVarCommon()->GetLclNum();
    srcOffset = src->AsLclVarCommon()->GetLclOffs();

    if (src->OperIs(GT_LCL_VAR))
    {
        srcIsRegArg = m_compiler->lvaGetDesc(srcLclNum)->lvIsMultiRegArg;
    }

Copilot AI Apr 28, 2026


The new boolean srcIsRegArg is derived from lvIsMultiRegArg (multi-register struct arg), not from lvIsRegArg/a generic register-argument property. Renaming it to something like srcIsMultiRegArg would make the intent clearer and avoid confusion with the existing lvIsRegArg flag on locals.

@EgorBo
Member

EgorBo commented Apr 28, 2026

@EgorBot -arm -amd -windows_intel

using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Engines;
using BenchmarkDotNet.Running;

public class Benchmarks
{
    [Benchmark]
    public long Bench_stackalloc() => ParseNonCanonical_stackalloc("11");

    [Benchmark]
    public long Bench_InlineArray() => ParseNonCanonical_InlineArray("11");


    [MethodImpl(MethodImplOptions.NoInlining)]
    int ParseNonCanonical_stackalloc(ReadOnlySpan<char> name)
    {
        Span<long> parts = stackalloc long[3];
        Consume(parts);
        return name[1];
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    int ParseNonCanonical_InlineArray(ReadOnlySpan<char> name)
    {
        Span<long> parts = [0, 0, 0];
        Consume(parts);
        return name[1];
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Consume(Span<long> parts) { }
}

@AndyAyersMS
Member Author

@EgorBot -arm -amd

using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Engines;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Benchmarks).Assembly).Run(args);

public class Benchmarks
{
    [Benchmark]
    public long Bench_stackalloc() => ParseNonCanonical_stackalloc("11");

    [Benchmark]
    public long Bench_InlineArray() => ParseNonCanonical_InlineArray("11");


    [MethodImpl(MethodImplOptions.NoInlining)]
    int ParseNonCanonical_stackalloc(ReadOnlySpan<char> name)
    {
        Span<long> parts = stackalloc long[3];
        Consume(parts);
        return name[1];
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    int ParseNonCanonical_InlineArray(ReadOnlySpan<char> name)
    {
        Span<long> parts = [0, 0, 0];
        Consume(parts);
        return name[1];
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Consume(Span<long> parts) { }
}

@AndyAyersMS
Member Author

Oops, you beat me to it...


3 participants