Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sse.IsSupported prevents auto-inlining? #11595

Closed
EgorBo opened this issue Dec 2, 2018 · 5 comments
Closed

Sse.IsSupported prevents auto-inlining? #11595

EgorBo opened this issue Dec 2, 2018 · 5 comments

Comments

@EgorBo
Copy link
Member

EgorBo commented Dec 2, 2018

I have a small benchmark:

    private static float Sqrt_1(float x)  // inlineable!
    {
        x = MathF.Sqrt(x);
        return x;
    }

    private static float Sqrt_2(float x) // not inlineable :(
    {
        if (Sse41.IsSupported)
            x = MathF.Sqrt(x);
        return x;
    }

the only difference between Sqrt_1 and Sqrt_2 is that if (Sse41.IsSupported) expression which apparently prevents this method from being inlined (but is eliminated anyway) in Benchmark2()
Asm output:

; ConsoleApp88.Program.Benchmark1()
       vzeroupper
       xchg    ax,ax
       vmovss  xmm0,dword ptr [rcx+8]
       vsqrtss xmm0,xmm0,xmm0                  ;*inlined!!*
       ret
       add     byte ptr [rax],al
       add     byte ptr [rcx],bl
       add     byte ptr [rax],al
       add     byte ptr [rax],al
       add     byte ptr [rax],al
       add     byte ptr [rax],al
       add     byte ptr [rax],al
       add     byte ptr [rax],al
       add     byte ptr [rax],al
       add     byte ptr [rax],al
       add     byte ptr [rax],al
       ???
       sbb     dword ptr [rbx],esp
       in      eax,0FCh
       jg      M00_L00
M00_L00:
       add     byte ptr [rdi],cl
; ConsoleApp88.Program.Benchmark2()
       vzeroupper
       xchg    ax,ax
       vmovss  xmm0,dword ptr [rcx+8]
       mov     rax,7FFCE50D9518h
       jmp     rax
       sbb     dword ptr [rax],eax
       add     byte ptr [rax],al
       add     byte ptr [rax],al
       add     byte ptr [rdx],al
       add     byte ptr [rax],al
       add     byte ptr [rax],al
       add     byte ptr [rax],al
       add     al,ch
       xor     esp,dword ptr [rbp-4Bh]
       pop     rdi
       pop     rsi
       add     byte ptr [rdx-18h],bl
       sub     esp,dword ptr [rbp-4Bh]
       pop     rdi

; ConsoleApp88.Program.Sqrt_2(Single)
       vzeroupper
       xchg    ax,ax
       vsqrtss xmm0,xmm0,xmm0
       ret
       add     byte ptr [rcx],bl
       add     byte ptr [rax],al
       add     byte ptr [rax],al
       add     byte ptr [rax],al
       add     byte ptr [rax],al
       add     byte ptr [rax],al
       push    0FFFFFFFFFCE5231Dh
       jg      M01_L00
M01_L00:
       add     byte ptr [rdi],cl
       ???
       add     byte ptr [rax],r8b
       ret
       add     byte ptr [rax],al
       sbb     dword ptr [rax],eax

PS: Tiered JIT is disabled.

BenchmarkDotNet=v0.11.3, OS=Windows 10.0.17134.407 (1803/April2018Update/Redstone4)
Intel Core i7-8700K CPU 3.70GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
Frequency=3609372 Hz, Resolution=277.0565 ns, Timer=TSC
.NET Core SDK=3.0.100-preview-009812
  [Host]     : .NET Core 3.0.0-preview-27122-01 (CoreCLR 4.6.27121.03, CoreFX 4.7.18.57103), 64bit RyuJIT
  Job-FJESCM : .NET Core 3.0.0-preview-27122-01 (CoreCLR 4.6.27121.03, CoreFX 4.7.18.57103), 64bit RyuJIT
@EgorBo EgorBo changed the title Sse.IsSupported brakes auto-inlining? Sse.IsSupported prevents auto-inlining? Dec 2, 2018
@AndyAyersMS
Copy link
Member

The IL For Sqrt_2 is just a bit too big for it to be classified as always inline, and so the jit evaluates via its profitability heuristics, and decides this is not a profitable inline candidate:

Invoking compiler for the inlinee method X:Sqrt_2(float):float :
IL to import:
IL_0000  28 07 00 00 0a    call         0xA000007
IL_0005  2c 08             brfalse.s    8 (IL_000f)
IL_0007  02                ldarg.0     
IL_0008  28 06 00 00 0a    call         0xA000006
IL_000d  10 00             starg.s      0x0
IL_000f  02                ldarg.0     
IL_0010  2a                ret         

INLINER impTokenLookupContextHandle for X:Sqrt_2(float):float is 0x00000000D1FFAB1E.
*************** In fgFindBasicBlocks() for X:Sqrt_2(float):float
weight= 79 : state  40 [ call ]
weight= 27 : state  44 [ brfalse.s ]
weight= 10 : state   3 [ ldarg.0 ]
weight= 79 : state  40 [ call ]
weight= 21 : state  17 [ starg.s ]
weight= 10 : state   3 [ ldarg.0 ]
weight= 19 : state  42 [ ret ]

Inline candidate callsite is boring.  Multiplier increased to 1.3.
calleeNativeSizeEstimate=245
callsiteNativeSizeEstimate=85
benefit multiplier=1.3
threshold=110
Native estimate for function size exceeds threshold for inlining 24.5 > 11 (multiplier = 1.3)

INLINER: during 'fgInline' result 'failed this call site' reason 'unprofitable inline' for 'X:Benchmark2():float:this' calling 'X:Sqrt_2(float):float'

The inline heuristics we use today don't resolve tokens and so the jit does not realize that the calls in this method are intrinsics and the first will fully resolve to a constant at jit time. So the cost of the inline is over-estimated and the inline is rejected.

Not much we can do about this right now. A more detailed examination of the inline candidates would be helpful (and there are plenty of examples where better heuristics would give widespread benefits) but also would be cost a fair amount of jit time. It is something I'd like to address someday, perhaps when we are more confident about increasing the jit time for Tier1 jitting or we introduce even higher tiers.

You will need to add [MethodImpl(MethodImplOptions.AggressiveInlining)] to tell the jit to consider this method as always inline. Or you can try and streamline the IL produced, eg this inlines:

public static float Sqrt_2(float x)
{
    if (Sse41.IsSupported)
        return MathF.Sqrt(x);
    return x;
}

@EgorBo
Copy link
Member Author

EgorBo commented Dec 2, 2018

Thanks @AndyAyersMS, that's what I thought - I am just wondering if I could add [MethodImpl(MethodImplOptions.NoInline)] on top of Sse.IsSupported - would it help? 🙂 As far as I understand the method is "too big" because IsSupported was inlined into my Sqrt_2, right?

@tannergooding
Copy link
Member

tannergooding commented Dec 2, 2018

because IsSupported was inlined

I don't believe so. The problem is that, at this point, the JIT hasn't resolved what is being called, it just sees that a call is being made and gives it a "static" weight.

If we knew that Sse.IsSupported or MathF.Sqrt were being called, we could just do a if (methodTable.IsIntrinsic) check and reduce the cost from 79 to something much less.

(feel free to correct me if I got something mixed up here Andy 😄)

@EgorBo
Copy link
Member Author

EgorBo commented Dec 2, 2018

Thank you for the clarification!

@EgorBo EgorBo closed this as completed Dec 2, 2018
@AndyAyersMS
Copy link
Member

The jit does inlining top down, so by and large the inlineability of an inline's callees has no impact on whether the inline itself is viable (this is not strictly true in some generic cases but is true enough). As Tanner says all the jit sees early on is that there is call.

The state machine the jit uses to estimate code size is very crude and hasn't been updated in a long time. I made some attempts to improve on this early modelling via machine learning derived heuristics and managed to get some respectable size predictions but modelling the code quality improvements proved more elusive. And even on the code size front there were challenges in predicting cases where an inline actually reduced code size or increased it much less than one might expect given the amount of IL.

Part of the challenge in evaulating inline impact by just looking at the IL stream is that there is a wide variety of expansions for some IL constructs -- calls in particular. Thus the jit can't really know what an IL level call means without mapping the token to the right runtime information and this mapping is somewhat expensive (look at what resolveToken followed by getCallInfo has to do over in the jit interface). So, the jit can't understand the potential impact of a call that resolves to an intrinsic that in turn evaluates to a constant that feeds a compare that will eliminate a bunch of code in the inlinee without modelling all of this in some detail.

There are enough inline candidates and enough constraints on jit time that historically it has been deemed impractical to do this level of detailed analysis for every inline. Instead, the jit uses cheap but sometimes overly conservative heuristics.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 14, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants