Call to unmanaged function pointer always emits full transition frame #45077

Closed
Sergio0694 opened this issue Nov 22, 2020 · 8 comments
Labels: area-CodeGen-coreclr, tenet-performance, untriaged

Comments

@Sergio0694
Contributor

Description

Stumbled upon this missed optimization while analyzing some codegen from @tannergooding's TerraFX library, which relies very heavily on unmanaged function pointers. It seems that no matter what signature is used, the JIT will always emit a full transition frame when calling an unmanaged function pointer, backing up all local registers on the stack, etc.
They basically behave like P/Invoke calls did up to .NET Core 3.1.

Consider this example:

public static int Test(delegate* unmanaged<int, int> f, int x)
{
    return f(x);
}

This results in the following (using disasmo from a local CoreCLR build from release/5.0):

; Assembly listing for method Program:Test(long,int):int
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rbp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T01] (  3,  3   )    long  ->  rsi        
;  V01 arg1         [V01,T02] (  3,  3   )     int  ->  rdi        
;* V02 loc0         [V02    ] (  0,  0   )    long  ->  zero-ref   
;  V03 OutArgs      [V03    ] (  1,  1   )  lclBlk (32) [rsp+0x00]   "OutgoingArgSpace"
;  V04 tmp1         [V04,T03] (  2,  4   )     int  ->  rax         "Single return block return value"
;  V05 FramesRoot   [V05,T00] (  6,  6   )    long  ->  rbx         "Pinvoke FrameListRoot"
;  V06 PInvokeFrame [V06    ] (  7,  7   )     blk (72) [rbp-0x80]   do-not-enreg[X] addr-exposed "Pinvoke FrameVar"
;
; Lcl frame size = 104

G_M24319_IG01:
       push     rbp
       push     r15
       push     r14
       push     r13
       push     r12
       push     rdi
       push     rsi
       push     rbx
       sub      rsp, 104
       lea      rbp, [rsp+A0H]
       mov      rsi, rcx
       mov      edi, edx
						;; bbWeight=1    PerfScore 9.25
G_M24319_IG02:
       lea      rcx, [rbp-78H]
       mov      rdx, r10
       call     CORINFO_HELP_INIT_PINVOKE_FRAME
       mov      rbx, rax
       mov      rcx, rsp
       mov      qword ptr [rbp-58H], rcx
       mov      rcx, rbp
       mov      qword ptr [rbp-48H], rcx
       mov      ecx, edi
       lea      rax, G_M24319_IG04
       mov      qword ptr [rbp-50H], rax
       lea      rax, bword ptr [rbp-78H]
       mov      qword ptr [rbx+16], rax
       mov      byte  ptr [rbx+12], 0
						;; bbWeight=1    PerfScore 9.25
G_M24319_IG03:
       call     rsi
						;; bbWeight=1    PerfScore 3.00
G_M24319_IG04:
       mov      byte  ptr [rbx+12], 1
       mov      rdx, 0xD1FFAB1E
       cmp      dword ptr [rdx], 0
       je       SHORT G_M24319_IG05
       mov      rcx, 0xD1FFAB1E
       call     qword ptr [rcx]CORINFO_HELP_STOP_FOR_GC
						;; bbWeight=1    PerfScore 7.50
G_M24319_IG05:
       mov      rdx, bword ptr [rbp-70H]
       mov      qword ptr [rbx+16], rdx
						;; bbWeight=1    PerfScore 2.00
G_M24319_IG06:
       lea      rsp, [rbp-38H]
       pop      rbx
       pop      rsi
       pop      rdi
       pop      r12
       pop      r13
       pop      r14
       pop      r15
       pop      rbp
       ret      
						;; bbWeight=1    PerfScore 5.50

; Total bytes of code 141, prolog size 24, PerfScore 50.60, (MethodHash=7fcfa100) for method Program:Test(long,int):int
; ============================================================

That's... Quite a lot of codegen for just invoking a single unmanaged function pointer 🤔
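
For anyone who wants to reproduce the listing without a native library, a minimal harness might look like the sketch below. The names `Repro`, `AddOne` and `Run` are made up for illustration and are not taken from TerraFX. It assumes C# 9 on .NET 5 with `AllowUnsafeBlocks` enabled, and it gets a `delegate* unmanaged<int, int>` by taking the address of an `[UnmanagedCallersOnly]` method, which is just one convenient way to obtain such a pointer.

```csharp
using System;
using System.Runtime.InteropServices;

public static unsafe class Repro
{
    // Hypothetical callee: a managed method exposed through an unmanaged entry point.
    [UnmanagedCallersOnly]
    private static int AddOne(int x) => x + 1;

    // Same shape as the method in the issue: one call through an unmanaged function pointer.
    public static int Test(delegate* unmanaged<int, int> f, int x)
    {
        return f(x);
    }

    // Takes the address of the UnmanagedCallersOnly method and calls it through Test.
    public static int Run()
    {
        delegate* unmanaged<int, int> f = &AddOne;
        return Test(f, 41); // returns 42
    }
}
```

Calling `Repro.Run()` from a `Main` method and inspecting `Repro.Test` with disasmo should show the same kind of transition frame, although the exact registers and offsets may differ by build.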

Configuration

  • .NET 5, built from release/5.0 (commit ea56d0c)
  • Built with build -c checked clr+libs -os Windows_NT -a x64

Any chance a fix for this could be serviced in a .NET 5.x update?
Since function pointers are especially useful in high-performance scenarios, this seems like a relatively big missed optimization, especially considering standard P/Invoke methods don't have this issue anymore when running on .NET 5?

@Perksey
Member

Perksey commented Nov 22, 2020

This massively impacts Silk.NET where function pointers are used as part of 99% of all functions it exposes. Eliminating the transition frame would be a pleasant boost in performance library-wide.

Context:
GL.28747011.gen.cs

(Don't worry, a lot of the ugliness in this file is there to work around various other JIT quirks.)

@jkotas
Member

jkotas commented Nov 22, 2020

the JIT will always emit a full transition frame when calling an unmanaged function pointer, backing up all local registers on the stack, etc.

That is by design. It is required for the precise GC scanning.

We have #38134 open on exposing the SuppressGCTransition calling convention for function pointers. However, this optimization can only be used in very specific situations, as described in https://docs.microsoft.com/en-us/dotnet/api/system.runtime.interopservices.suppressgctransitionattribute?view=net-5.0
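
As a rough illustration of what the attribute looks like on a regular P/Invoke today, here is a minimal sketch. It is Windows-only, and it assumes `GetTickCount` from kernel32.dll meets the documented restrictions (trivially short, no blocking, no callbacks into the runtime, no exceptions, no locks). The class name `NativeMethods` is just a placeholder.

```csharp
using System.Runtime.InteropServices;

public static class NativeMethods
{
    // Assumption: GetTickCount is cheap and never blocks, so skipping the
    // GC transition is acceptable here. This is not a general-purpose switch.
    [DllImport("kernel32.dll")]
    [SuppressGCTransition]
    public static extern uint GetTickCount();
}
```

The attribute is applied per import, so whether a given native function qualifies has to be judged case by case against the documentation linked above.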

@jkotas jkotas closed this as completed Nov 22, 2020
@john-h-k
Contributor

That is by design. It is required for the precise GC scanning.

Out of curiosity, why is this needed for a fnptr but not a pinvoke?

@jkotas
Member

jkotas commented Nov 22, 2020

It is needed for both fptr and PInvoke.

considering standard P/Invoke methods don't have this issue anymore when running on .NET 5?

This is not a correct statement.

@Sergio0694
Contributor Author

Sergio0694 commented Nov 22, 2020

@jkotas I'll admit I'm not an expert in this area; I was just quoting @tannergooding there (from chatting on Discord).
His comments from when I first spotted this unexpected codegen while investigating TerraFX:

[...] then log a bug for the function pointers as it looks like they are missing this optimization and always emitting the full transition frame regardless of what registers are actually in use, etc.
[...] its not a bug, just not as fast as it could be
[...] they are still fast, they just incur a larger transition frame, just like P/Invokes did in 3.1 and prior

I was specifically referring to this last bit where he said P/Invoke methods got this additional optimization on .NET 5.

I'd love to understand more about what is happening here, as the resulting codegen is at the very least surprising I'd say.
Also in particular, it seems unclear as to why all non-volatile registers are backed up and pushed to the stack, even if unused?

@tannergooding
Member

@jkotas, could you elaborate why all non-volatile registers need to be saved?

Is that a hard limitation in how the GC does tracking today and is it something that would be reasonable to track adjusting in the future to make the transition lower cost?

Likewise, is there something that could be optimized here for multiple transitions in a single call? Now that function pointers and P/Invoke transitions can be somewhat inlined, it would seem that if you have:

public unsafe void M(delegate* unmanaged[Stdcall]<int, void> test) {
        test(1);
        test(2);
}

It would be beneficial to do effectively:

transition
p/invoke
p/invoke
transition

rather than the following that it looks to do:

transition
p/invoke
transition

transition
p/invoke
transition

@jkotas
Member

jkotas commented Nov 22, 2020

it seems unclear as to why all non-volatile registers are backed up and pushed to the stack, even if unused?

The non-volatile registers can contain object references and the stackwalk needs to find them if the GC runs while the PInvoke is executing.

I do not see how we can adjust it without hurting performance in other places.
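
To make the point about object references concrete, here is a small sketch of a method where a reference stays live across the unmanaged call. The names `LiveAcrossCall`, `Twice`, `CallWithLiveReference` and `Run` are hypothetical. Whether the JIT actually keeps `s` in a non-volatile register is an assumption and will depend on the method; the sketch only shows why the stackwalk has to be able to find such a reference if a GC happens during the call. It assumes C# 9 on .NET 5 with `AllowUnsafeBlocks` enabled.

```csharp
using System;
using System.Runtime.InteropServices;

public static unsafe class LiveAcrossCall
{
    // Hypothetical callee reached through an unmanaged entry point.
    [UnmanagedCallersOnly]
    private static int Twice(int x) => x * 2;

    public static int CallWithLiveReference(delegate* unmanaged<int, int> f, string s, int x)
    {
        // 's' is used after the call, so its reference must survive the call.
        // If a GC runs while 'f' executes, the runtime has to locate (and
        // possibly update) that reference wherever the JIT chose to keep it.
        int r = f(x);
        return r + s.Length;
    }

    public static int Run()
    {
        delegate* unmanaged<int, int> f = &Twice;
        return CallWithLiveReference(f, "abc", 20); // returns 43
    }
}
```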

Yes, it should be possible to optimize out some of the in-between transitions in a sequence of PInvokes with no code between them. This optimization is not specific to function pointers in any way. It would likely require changes in a number of places; for example, the debugger may need work to make stepping work in the presence of this optimization.

Now that function pointers and P/Invoke transitions can be somewhat inlined

Inlining of P/Invoke transitions has been around since .NET Framework 2.0 (it was probably in .NET Framework 1.x too; I am just not 100% sure). We have not changed anything fundamental here recently.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 22, 2020