Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow codegen when using Vector128/256<T> parameter #78883

Open
rickbrew opened this issue Nov 27, 2022 · 8 comments
Open

Slow codegen when using Vector128/256<T> parameter #78883

rickbrew opened this issue Nov 27, 2022 · 8 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone

Comments

@rickbrew
Copy link
Contributor

Description

The following method generates native code that reloads the value parameter from rdx every time it's used:

private static unsafe void BitwiseFillVector128_BadCodeGen(Vector128<byte>* startAddress, Vector128<byte> value, ulong elementCount)
{
    while (elementCount >= 4)
    {            
        Vector128.Store(value, (byte*)startAddress);
        Vector128.Store(value, (byte*)(startAddress + 1));
        Vector128.Store(value, (byte*)(startAddress + 2));
        Vector128.Store(value, (byte*)(startAddress + 3));

        elementCount -= 4;
        startAddress += 4;
    }

    while (elementCount > 0)
    {
        Vector128.Store(value, (byte*)startAddress);

        --elementCount;
        ++startAddress;
    }
}
C.BitwiseFillVector128_BadCodeGen(System.Runtime.Intrinsics.Vector128`1<Byte>*, System.Runtime.Intrinsics.Vector128`1<Byte>, UInt64)
    L0000: vzeroupper
    L0003: cmp r8, 4
    L0007: jb short L0049
    L0009: nop [rax]
    L0010: vmovupd xmm0, [rdx]
    L0015: vmovdqu [rcx], xmm0
    L001a: vmovupd xmm0, [rdx]
    L001f: vmovdqu [rcx+0x10], xmm0
    L0025: vmovupd xmm0, [rdx]
    L002a: vmovdqu [rcx+0x20], xmm0
    L0030: vmovupd xmm0, [rdx]
    L0035: vmovdqu [rcx+0x30], xmm0
    L003b: add r8, 0xfffffffffffffffc
    L003f: add rcx, 0x40
    L0043: cmp r8, 4
    L0047: jae short L0010
    L0049: test r8, r8
    L004c: je short L0064
    L004e: xchg ax, ax
    L0050: vmovupd xmm0, [rdx]
    L0054: vmovdqu [rcx], xmm0
    L0058: dec r8
    L005b: add rcx, 0x10
    L005f: test r8, r8
    L0062: jne short L0050
    L0064: ret

If you copy value to a local/temp, the codegen is fixed:

private static unsafe void BitwiseFillVector128_GoodCodeGen(Vector128<byte>* startAddress, Vector128<byte> value, ulong elementCount)
{
    Vector128<Byte> valueP = value;

    while (elementCount >= 4)
    {            
        Vector128.Store(valueP, (byte*)startAddress);
        Vector128.Store(valueP, (byte*)(startAddress + 1));
        Vector128.Store(valueP, (byte*)(startAddress + 2));
        Vector128.Store(valueP, (byte*)(startAddress + 3));

        elementCount -= 4;
        startAddress += 4;
    }

    while (elementCount > 0)
    {
        Vector128.Store(valueP, (byte*)startAddress);

        --elementCount;
        ++startAddress;
    }
}
C.BitwiseFillVector128_GoodCodeGen(System.Runtime.Intrinsics.Vector128`1<Byte>*, System.Runtime.Intrinsics.Vector128`1<Byte>, UInt64)
    L0000: vzeroupper
    L0003: vmovupd xmm0, [rdx]
    L0008: cmp r8, 4
    L000c: jb short L0033
    L000e: vmovdqu [rcx], xmm0
    L0013: vmovdqu [rcx+0x10], xmm0
    L0019: vmovdqu [rcx+0x20], xmm0
    L001f: vmovdqu [rcx+0x30], xmm0
    L0025: add r8, 0xfffffffffffffffc
    L0029: add rcx, 0x40
    L002d: cmp r8, 4
    L0031: jae short L000e
    L0033: test r8, r8
    L0036: je short L0050
    L0038: nop [rax+rax]
    L0040: vmovdqu [rcx], xmm0
    L0044: dec r8
    L0047: add rcx, 0x10
    L004b: test r8, r8
    L004e: jne short L0040
    L0050: ret

A similar method using Vector256 exhibits the same issue.

Sharplab link: https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIAYACY8gOgCUBXAOwwEt8YLAMIR8AB14AbGFADKMgG68wMXAG4a9Jq049+ggJI8ovLrmXqamgMxNSDIQwDeNBm6a3iKBgFkAFACUzq7uAL4hbhEMYiYK2Bgw2kgM3LjYAGaJXgwAQrwYAO68uDAAYlKSAGowYBjQ5KQAHAD6OdgAJiLtMADiMFx+1bX1TQA8wACeCQB8AFQMuBjYUBgAgu3tsLi4aAxDdVANjeNTMNMMcZIcMLsckhBcAOYMMNICPCLcGAFRLtTuAIYBQAFlJEn5XjB3hhPjwGNMALwMFA/f6AtxOdHoqJYvY1A5HFiyA4wPyXa67PyTBKzAKLZZrDZbXABDRo3FufYjRpEklk7BXG4MKmnWl+ekrdabVS4BgAagY5ACrJxWK5hyavOgpPJQpFNIC4qWkqZMvlDFIyrZHPc6sJxO1/MFlOpMDFEsZ0u25usVqs7I5kOhsIwDDgSJQ1ptC2NnuZ8ojUfR4WoqrcILBwqD/RhEC+8IYdFRuL+0bxww1PIdsCdFOFrtpHqlzJVqYDuLgcGzHzzPCTHLlcqbpu2/YBKfHUSiMV4cQSSRSZgyWW8eUKxTKFTtTWaPQgEE6B96/UG+O5Jxm82HXp25YJY1d511t3uTxebxzIeL7lLarPldGHJTifAVrgABQYJFdTZNMgVBaQsw/Ht80RZFv3RTFo1ggFtyrPldTAl1RTpWNmxlVsy1wrUawIoiDSNBkyO9BUlQo6MqOrHVQJgQj62IhiTRvc1LTYm0OPw7jeP1N1DWveMFV9VtsPcbtc3zcNkTHLE5LNOVE1glNYIzBCISQtS4XOItYN/MT/3tCTBSkhsSMYkcWRg9ssU7VSQy09FBx00cDKiUIARoUIgA==

Reproduction Steps

...

Expected behavior

value is not reloaded from rdx every time it is needed

Actual behavior

value is reloaded from rdx every time it's needed

Regression?

No idea

Known Workarounds

Copy the value in question to a local temp before using it -- don't use parameters directly, in other words

Configuration

.NET 7.0.0 x64

Other information

No response

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Nov 27, 2022
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Nov 27, 2022
@ghost
Copy link

ghost commented Nov 27, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

The following method generates native code that reloads the value parameter from rdx every time it's used:

private static unsafe void BitwiseFillVector128_BadCodeGen(Vector128<byte>* startAddress, Vector128<byte> value, ulong elementCount)
{
    while (elementCount >= 4)
    {            
        Vector128.Store(value, (byte*)startAddress);
        Vector128.Store(value, (byte*)(startAddress + 1));
        Vector128.Store(value, (byte*)(startAddress + 2));
        Vector128.Store(value, (byte*)(startAddress + 3));

        elementCount -= 4;
        startAddress += 4;
    }

    while (elementCount > 0)
    {
        Vector128.Store(value, (byte*)startAddress);

        --elementCount;
        ++startAddress;
    }
}
C.BitwiseFillVector128_BadCodeGen(System.Runtime.Intrinsics.Vector128`1<Byte>*, System.Runtime.Intrinsics.Vector128`1<Byte>, UInt64)
    L0000: vzeroupper
    L0003: cmp r8, 4
    L0007: jb short L0049
    L0009: nop [rax]
    L0010: vmovupd xmm0, [rdx]
    L0015: vmovdqu [rcx], xmm0
    L001a: vmovupd xmm0, [rdx]
    L001f: vmovdqu [rcx+0x10], xmm0
    L0025: vmovupd xmm0, [rdx]
    L002a: vmovdqu [rcx+0x20], xmm0
    L0030: vmovupd xmm0, [rdx]
    L0035: vmovdqu [rcx+0x30], xmm0
    L003b: add r8, 0xfffffffffffffffc
    L003f: add rcx, 0x40
    L0043: cmp r8, 4
    L0047: jae short L0010
    L0049: test r8, r8
    L004c: je short L0064
    L004e: xchg ax, ax
    L0050: vmovupd xmm0, [rdx]
    L0054: vmovdqu [rcx], xmm0
    L0058: dec r8
    L005b: add rcx, 0x10
    L005f: test r8, r8
    L0062: jne short L0050
    L0064: ret

If you copy value to a local/temp, the codegen is fixed:

private static unsafe void BitwiseFillVector128_GoodCodeGen(Vector128<byte>* startAddress, Vector128<byte> value, ulong elementCount)
{
    Vector128<Byte> valueP = value;

    while (elementCount >= 4)
    {            
        Vector128.Store(valueP, (byte*)startAddress);
        Vector128.Store(valueP, (byte*)(startAddress + 1));
        Vector128.Store(valueP, (byte*)(startAddress + 2));
        Vector128.Store(valueP, (byte*)(startAddress + 3));

        elementCount -= 4;
        startAddress += 4;
    }

    while (elementCount > 0)
    {
        Vector128.Store(valueP, (byte*)startAddress);

        --elementCount;
        ++startAddress;
    }
}
C.BitwiseFillVector128_GoodCodeGen(System.Runtime.Intrinsics.Vector128`1<Byte>*, System.Runtime.Intrinsics.Vector128`1<Byte>, UInt64)
    L0000: vzeroupper
    L0003: vmovupd xmm0, [rdx]
    L0008: cmp r8, 4
    L000c: jb short L0033
    L000e: vmovdqu [rcx], xmm0
    L0013: vmovdqu [rcx+0x10], xmm0
    L0019: vmovdqu [rcx+0x20], xmm0
    L001f: vmovdqu [rcx+0x30], xmm0
    L0025: add r8, 0xfffffffffffffffc
    L0029: add rcx, 0x40
    L002d: cmp r8, 4
    L0031: jae short L000e
    L0033: test r8, r8
    L0036: je short L0050
    L0038: nop [rax+rax]
    L0040: vmovdqu [rcx], xmm0
    L0044: dec r8
    L0047: add rcx, 0x10
    L004b: test r8, r8
    L004e: jne short L0040
    L0050: ret

A similar method using Vector256 exhibits the same issue.

Sharplab link: https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIAYACY8gOgCUBXAOwwEt8YLAMIR8AB14AbGFADKMgG68wMXAG4a9Jq049+ggJI8ovLrmXqamgMxNSDIQwDeNBm6a3iKBgFkAFACUzq7uAL4hbhEMYiYK2Bgw2kgM3LjYAGaJXgwAQrwYAO68uDAAYlKSAGowYBjQ5KQAHAD6OdgAJiLtMADiMFx+1bX1TQA8wACeCQB8AFQMuBjYUBgAgu3tsLi4aAxDdVANjeNTMNMMcZIcMLsckhBcAOYMMNICPCLcGAFRLtTuAIYBQAFlJEn5XjB3hhPjwGNMALwMFA/f6AtxOdHoqJYvY1A5HFiyA4wPyXa67PyTBKzAKLZZrDZbXABDRo3FufYjRpEklk7BXG4MKmnWl+ekrdabVS4BgAagY5ACrJxWK5hyavOgpPJQpFNIC4qWkqZMvlDFIyrZHPc6sJxO1/MFlOpMDFEsZ0u25usVqs7I5kOhsIwDDgSJQ1ptC2NnuZ8ojUfR4WoqrcILBwqD/RhEC+8IYdFRuL+0bxww1PIdsCdFOFrtpHqlzJVqYDuLgcGzHzzPCTHLlcqbpu2/YBKfHUSiMV4cQSSRSZgyWW8eUKxTKFTtTWaPQgEE6B96/UG+O5Jxm82HXp25YJY1d511t3uTxebxzIeL7lLarPldGHJTifAVrgABQYJFdTZNMgVBaQsw/Ht80RZFv3RTFo1ggFtyrPldTAl1RTpWNmxlVsy1wrUawIoiDSNBkyO9BUlQo6MqOrHVQJgQj62IhiTRvc1LTYm0OPw7jeP1N1DWveMFV9VtsPcbtc3zcNkTHLE5LNOVE1glNYIzBCISQtS4XOItYN/MT/3tCTBSkhsSMYkcWRg9ssU7VSQy09FBx00cDKiUIARoUIgA==

Reproduction Steps

...

Expected behavior

value is not reloaded from rdx every time it is needed

Actual behavior

value is reloaded from rdx every time it's needed

Regression?

No idea

Known Workarounds

Copy the value in question to a local temp before using it -- don't use parameters directly, in other words

Configuration

.NET 7.0.0 x64

Other information

No response

Author: rickbrew
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

@rickbrew
Copy link
Contributor Author

Also, the codegen is fine if the method is inlined, e.g. [MethodImpl(MethodImplOptions.AggressiveInlining)]

@JulieLeeMSFT
Copy link
Member

CC @EgorBo.

@JulieLeeMSFT
Copy link
Member

CC @tannergooding.

@JulieLeeMSFT JulieLeeMSFT added the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Jan 9, 2023
@JulieLeeMSFT JulieLeeMSFT removed untriaged New issue has not been triaged by the area owner needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration labels Jan 30, 2023
@JulieLeeMSFT JulieLeeMSFT added this to the 8.0.0 milestone Jan 30, 2023
@tannergooding tannergooding modified the milestones: 8.0.0, 9.0.0 Jul 28, 2023
@tannergooding
Copy link
Member

This isn't something we'll be able to get to for .NET 8, it is a scenario we should support however so I'm moving it to 9, rather than future.

This is probably something where @jakobbotsch or @EgorBo would be better to weigh in. We notably track arg1/V01 as a LCL_VAR throughout and while we see that its a byref arg very early, we don't ever take advantage of that knowledge.

There's also been inverse cases in the past where knowing an arg, or a sequence of args, were sequential in memory would've allowed us to optimize them better. If there were an easy way to take advantage of this info, such as in morph, then we could opt to hoist such implicit byref args to avoid repeated memory access or to take advantage of the fact they are memory accesses to combine neighboring loads.

@jakobbotsch
Copy link
Member

We can maybe do something simple for implicit byref args by marking their created indirections as invariant when we are able to tell that the arg itself is never modified (common scenario). It would probably require some VN work to avoid regressions.

The general problem is harder and touches upon having a more sophisticated aliasing model -- in this case the fact we'd like to make use of is that the ABI disallows implicit byref args from aliasing anything else. In particular the same optimization cannot be allowed if value is passed by reference or pointer, even though this looks the same at the ABI level, since in that case it could alias startAddress.

@tannergooding
Copy link
Member

Definitely understand the broader aliasing issues for explicit byrefs. The biggest issues I've seen are the implicit cases like this one and I think that will help catch some of the worst offenders.

The other case I particularly remember is Matrix4x4(float m11, ..., float m44). We have 16x by-value args, the latter 10-12 of which are passed by reference and will be sequential in memory.

Morph has a bit that will look for Vector128.Create(x, y, z, w) where x, y, z, w are sequential in memory and optimize that instead to Vector128.Load. This works great for regular locals, arrays, and some other cases. But it doesn't work for the Matrix4x4 constructor since they appear as LCL_VAR and the implicit byref is "hidden".

@tannergooding tannergooding removed their assignment Apr 18, 2024
@JulieLeeMSFT JulieLeeMSFT modified the milestones: 9.0.0, Future Apr 18, 2024
@JulieLeeMSFT
Copy link
Member

Moving to future. Codegen has always been like this, also fairly complex change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

No branches or pull requests

4 participants