Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vzeroupper fails to be emitted in some P/Invoke scenarios #82132

Closed
tannergooding opened this issue Feb 14, 2023 · 7 comments · Fixed by #98261
Closed

vzeroupper fails to be emitted in some P/Invoke scenarios #82132

tannergooding opened this issue Feb 14, 2023 · 7 comments · Fixed by #98261
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone

Comments

@tannergooding
Copy link
Member

Consider the following code:

public Vector256<int> M(Vector256<int> x, long* lpFrequency)
{
    x += x;
    QueryPerformanceFrequency(lpFrequency);
    return x;
}

public Vector256<int> M2(Vector256<int> x, long* lpFrequency)
{
    x += x;
    N(lpFrequency);
    return x;
}

[MethodImpl(MethodImplOptions.NoInlining)]
public static int N(long* lpFrequency)
{
    return QueryPerformanceFrequency(lpFrequency);
}

[DllImport("kernel32", ExactSpelling = true)]
[SuppressGCTransition]
public static extern int QueryPerformanceFrequency(long* lpFrequency);

For M, because the QPC is direct and because Vector256<int> is directly used in the function we get a vzeroupper directly before the P/Invoke is called:

; Method C:M(System.Runtime.Intrinsics.Vector256`1[int],ulong):System.Runtime.Intrinsics.Vector256`1[int]:this
G_M000_IG01:                ;; offset=0000H
       55                   push     rbp
       57                   push     rdi
       56                   push     rsi
       4883EC20             sub      rsp, 32
       C5F877               vzeroupper 
       488D6C2430           lea      rbp, [rsp+30H]
       48895518             mov      bword ptr [rbp+18H], rdx
       498BF0               mov      rsi, r8

G_M000_IG02:                ;; offset=0016H
       C5FC1006             vmovups  ymm0, ymmword ptr[rsi]
       C5FDFEC0             vpaddd   ymm0, ymm0, ymm0
       48897520             mov      bword ptr [rbp+20H], rsi
       C5FC1106             vmovups  ymmword ptr[rsi], ymm0
       498BC9               mov      rcx, r9
       48B880474922FE7F0000 mov      rax, 0x7FFE22494780

G_M000_IG03:                ;; offset=0033H
       C5F877               vzeroupper 
       FFD0                 call     rax ; C:QueryPerformanceFrequency(ulong):int
       488B7520             mov      rsi, bword ptr [rbp+20H]
       C5FC1006             vmovups  ymm0, ymmword ptr[rsi]
       488B7D18             mov      rdi, bword ptr [rbp+18H]
       C5FC1107             vmovups  ymmword ptr[rdi], ymm0
       833DB59EE15F00       cmp      dword ptr [(reloc 0x7ffcd12fd8a4)], 0
       750E                 jne      SHORT G_M000_IG06

G_M000_IG04:                ;; offset=0051H
       488BC7               mov      rax, rdi

G_M000_IG05:                ;; offset=0054H
       C5F877               vzeroupper 
       4883C420             add      rsp, 32
       5E                   pop      rsi
       5F                   pop      rdi
       5D                   pop      rbp
       C3                   ret      

G_M000_IG06:                ;; offset=005FH
       E85C79AA5F           call     CORINFO_HELP_POLL_GC
       EBEB                 jmp      SHORT G_M000_IG04
; Total bytes of code: 102

However, for M2 because the call to the P/Invoke is hidden behind the non-inlined method N and because N does not itself utilize any instructions requiring YMM, we miss out on this and do not emit an additional vzeroupper:

; Method C:M2(System.Runtime.Intrinsics.Vector256`1[int],ulong):System.Runtime.Intrinsics.Vector256`1[int]:this
G_M000_IG01:                ;; offset=0000H
       57                   push     rdi
       56                   push     rsi
       4883EC28             sub      rsp, 40
       C5F877               vzeroupper 
       488BFA               mov      rdi, rdx
       498BF0               mov      rsi, r8

G_M000_IG02:                ;; offset=000FH
       C5FC1006             vmovups  ymm0, ymmword ptr[rsi]
       C5FDFEC0             vpaddd   ymm0, ymm0, ymm0
       C5FC1106             vmovups  ymmword ptr[rsi], ymm0
       498BC9               mov      rcx, r9
       FF15FCF92300         call     [C:N(ulong):int]
       C5FC1006             vmovups  ymm0, ymmword ptr[rsi]
       C5FC1107             vmovups  ymmword ptr[rdi], ymm0
       488BC7               mov      rax, rdi

G_M000_IG03:                ;; offset=002FH
       C5F877               vzeroupper 
       4883C428             add      rsp, 40
       5E                   pop      rsi
       5F                   pop      rdi
       C3                   ret      
; Total bytes of code: 57

; Method C:N(ulong):int
G_M000_IG01:                ;; offset=0000H
       55                   push     rbp
       56                   push     rsi
       4883EC28             sub      rsp, 40
       488D6C2430           lea      rbp, [rsp+30H]
       488BF1               mov      rsi, rcx

G_M000_IG02:                ;; offset=000EH
       833DEF9EE45F00       cmp      dword ptr [(reloc 0x7ffcd12fd8a4)], 0
       7517                 jne      SHORT G_M000_IG06

G_M000_IG03:                ;; offset=0017H
       488BCE               mov      rcx, rsi
       48B880474922FE7F0000 mov      rax, 0x7FFE22494780

G_M000_IG04:                ;; offset=0024H
       FFD0                 call     rax ; C:QueryPerformanceFrequency(ulong):int
       90                   nop      

G_M000_IG05:                ;; offset=0027H
       4883C428             add      rsp, 40
       5E                   pop      rsi
       5D                   pop      rbp
       C3                   ret      

G_M000_IG06:                ;; offset=002EH
       E88D79AD5F           call     CORINFO_HELP_POLL_GC
       EBE2                 jmp      SHORT G_M000_IG03
; Total bytes of code: 53

If the P/Invoke being called uses any legacy encoded SIMD instructions, then there is a significant (up to 10x perf regression per SIMD instruction) penalty that is incurred. Native code that utilizes SIMD instructions typically ends up with the "legacy encoding" since AVX is not part of the baseline instruction set.

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 14, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Feb 14, 2023
@ghost
Copy link

ghost commented Feb 14, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.

Issue Details

Consider the following code:

public Vector256<int> M(Vector256<int> x, long* lpFrequency)
{
    x += x;
    QueryPerformanceFrequency(lpFrequency);
    return x;
}

public Vector256<int> M2(Vector256<int> x, long* lpFrequency)
{
    x += x;
    N(lpFrequency);
    return x;
}

[MethodImpl(MethodImplOptions.NoInlining)]
public static int N(long* lpFrequency)
{
    return QueryPerformanceFrequency(lpFrequency);
}

[DllImport("kernel32", ExactSpelling = true)]
[SuppressGCTransition]
public static extern int QueryPerformanceFrequency(long* lpFrequency);

For M, because the QPC is direct and because Vector256<int> is directly used in the function we get a vzeroupper directly before the P/Invoke is called:

; Method C:M(System.Runtime.Intrinsics.Vector256`1[int],ulong):System.Runtime.Intrinsics.Vector256`1[int]:this
G_M000_IG01:                ;; offset=0000H
       55                   push     rbp
       57                   push     rdi
       56                   push     rsi
       4883EC20             sub      rsp, 32
       C5F877               vzeroupper 
       488D6C2430           lea      rbp, [rsp+30H]
       48895518             mov      bword ptr [rbp+18H], rdx
       498BF0               mov      rsi, r8

G_M000_IG02:                ;; offset=0016H
       C5FC1006             vmovups  ymm0, ymmword ptr[rsi]
       C5FDFEC0             vpaddd   ymm0, ymm0, ymm0
       48897520             mov      bword ptr [rbp+20H], rsi
       C5FC1106             vmovups  ymmword ptr[rsi], ymm0
       498BC9               mov      rcx, r9
       48B880474922FE7F0000 mov      rax, 0x7FFE22494780

G_M000_IG03:                ;; offset=0033H
       C5F877               vzeroupper 
       FFD0                 call     rax ; C:QueryPerformanceFrequency(ulong):int
       488B7520             mov      rsi, bword ptr [rbp+20H]
       C5FC1006             vmovups  ymm0, ymmword ptr[rsi]
       488B7D18             mov      rdi, bword ptr [rbp+18H]
       C5FC1107             vmovups  ymmword ptr[rdi], ymm0
       833DB59EE15F00       cmp      dword ptr [(reloc 0x7ffcd12fd8a4)], 0
       750E                 jne      SHORT G_M000_IG06

G_M000_IG04:                ;; offset=0051H
       488BC7               mov      rax, rdi

G_M000_IG05:                ;; offset=0054H
       C5F877               vzeroupper 
       4883C420             add      rsp, 32
       5E                   pop      rsi
       5F                   pop      rdi
       5D                   pop      rbp
       C3                   ret      

G_M000_IG06:                ;; offset=005FH
       E85C79AA5F           call     CORINFO_HELP_POLL_GC
       EBEB                 jmp      SHORT G_M000_IG04
; Total bytes of code: 102

However, for M2 because the call to the P/Invoke is hidden behind the non-inlined method N and because N does not itself utilize any instructions requiring YMM, we miss out on this and do not emit an additional vzeroupper:

; Method C:M2(System.Runtime.Intrinsics.Vector256`1[int],ulong):System.Runtime.Intrinsics.Vector256`1[int]:this
G_M000_IG01:                ;; offset=0000H
       57                   push     rdi
       56                   push     rsi
       4883EC28             sub      rsp, 40
       C5F877               vzeroupper 
       488BFA               mov      rdi, rdx
       498BF0               mov      rsi, r8

G_M000_IG02:                ;; offset=000FH
       C5FC1006             vmovups  ymm0, ymmword ptr[rsi]
       C5FDFEC0             vpaddd   ymm0, ymm0, ymm0
       C5FC1106             vmovups  ymmword ptr[rsi], ymm0
       498BC9               mov      rcx, r9
       FF15FCF92300         call     [C:N(ulong):int]
       C5FC1006             vmovups  ymm0, ymmword ptr[rsi]
       C5FC1107             vmovups  ymmword ptr[rdi], ymm0
       488BC7               mov      rax, rdi

G_M000_IG03:                ;; offset=002FH
       C5F877               vzeroupper 
       4883C428             add      rsp, 40
       5E                   pop      rsi
       5F                   pop      rdi
       C3                   ret      
; Total bytes of code: 57

; Method C:N(ulong):int
G_M000_IG01:                ;; offset=0000H
       55                   push     rbp
       56                   push     rsi
       4883EC28             sub      rsp, 40
       488D6C2430           lea      rbp, [rsp+30H]
       488BF1               mov      rsi, rcx

G_M000_IG02:                ;; offset=000EH
       833DEF9EE45F00       cmp      dword ptr [(reloc 0x7ffcd12fd8a4)], 0
       7517                 jne      SHORT G_M000_IG06

G_M000_IG03:                ;; offset=0017H
       488BCE               mov      rcx, rsi
       48B880474922FE7F0000 mov      rax, 0x7FFE22494780

G_M000_IG04:                ;; offset=0024H
       FFD0                 call     rax ; C:QueryPerformanceFrequency(ulong):int
       90                   nop      

G_M000_IG05:                ;; offset=0027H
       4883C428             add      rsp, 40
       5E                   pop      rsi
       5D                   pop      rbp
       C3                   ret      

G_M000_IG06:                ;; offset=002EH
       E88D79AD5F           call     CORINFO_HELP_POLL_GC
       EBE2                 jmp      SHORT G_M000_IG03
; Total bytes of code: 53

If the P/Invoke being called uses any legacy encoded SIMD instructions, then there is a significant (up to 10x perf regression per SIMD instruction) penalty that is incurred. Native code that utilizes SIMD instructions typically ends up with the "legacy encoding" since AVX is not part of the baseline instruction set.

Author: tannergooding
Assignees: -
Labels:

area-CodeGen-coreclr, untriaged

Milestone: -

@tannergooding
Copy link
Member Author

CC. @AaronRobinsonMSFT, @jkoritzinsky, @anthonycanino

The simplest fix here is likely to emit vzeroupper before any P/Invoke as that is in general a "best practice".

There exists an inverse issue in that we emit vzeroupper too frequently in some other scenarios today: #11496

In general we don't need vzeroupper for managed to managed calls. We only need it before calling into some method which might be AVX unaware which includes P/Invokes and some crossgen/ready-to-run (pre-jit) methods.

For the case where SuppressGCTransition is not involved, then the call won't be directly to the P/Invoke but will instead be to a pre-code stub that resolves the native function address and dispatches to it. We should also ensure that the vzeroupper is included as part of that and not unnecessarily duplicated outside of it.

@JulieLeeMSFT
Copy link
Member

@tannergooding, is this correctness issue or code quality issue? If it is code quality issue, I will set it to Future.

@JulieLeeMSFT
Copy link
Member

@tannergooding Ping.

@tannergooding, is this correctness issue or code quality issue? If it is code quality issue, I will set it to Future.

@tannergooding
Copy link
Member Author

@JulieLeeMSFT This is a code quality issue but one that has a significant impact on perf. The impact on some hardware can be that the code in question runs 10x slower or more.

@JulieLeeMSFT
Copy link
Member

@tannergooding, thanks for clarifying it. Setting it to .net 8 for now and assigning it to you. You can push it out if you don’t have time to get to work on it

@JulieLeeMSFT JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Mar 20, 2023
@JulieLeeMSFT JulieLeeMSFT added this to the 8.0.0 milestone Mar 20, 2023
@tannergooding tannergooding modified the milestones: 8.0.0, 9.0.0 Aug 8, 2023
@tannergooding
Copy link
Member Author

This is a longstanding perf issue, but not a regression nor a correctness issue. Moving to .NET 9

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Feb 10, 2024
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Feb 13, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Mar 16, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants