JIT: generated code for generic static method with `Nullable<T>` can be optimized #9677

feO2x · 2018-02-06T20:01:50Z

Consider the following Benchmark .NET benchmarks (gist here):

namespace MustHaveValueBenchmark
{
    [CoreJob]
    [MemoryDiagnoser]
    [DisassemblyDiagnoser]
    public class IntBenchmarks
    {
        public readonly int? NullableWithValue = 42;

        [Benchmark(Baseline = true)]
        public int? MustHaveValueBaseVersion()
        {
            if (NullableWithValue.HasValue == false) throw new NullableHasNoValueException(nameof(NullableWithValue));
            return NullableWithValue;
        }

        [Benchmark]
        public int? MustHaveValueExtensionMethod()
        {
            return NullableWithValue.MustHaveValue(nameof(NullableWithValue));
        }
    }
    
    public static class AssertionMethods
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static T? MustHaveValue<T>(this T? parameter, string parameterName = null, string message = null) where T : struct
        {
            if (parameter.HasValue == false)
                Throw.NullableHasNoValue(parameterName, message);
            return parameter;
        }
    }
    
    public static class Throw
    {
        [MethodImpl(MethodImplOptions.NoInlining)]
        public static void NullableHasNoValue(string parameterName = null, string message = null)
        {
            throw new NullableHasNoValueException(parameterName, message ?? $"{parameterName ?? "The nullable"} must have a value, but it actually is null.");
        }
    }
       
    [Serializable]
    public class NullableHasNoValueException : ArgumentNullException
    {
        public NullableHasNoValueException(string parameterName = null, string message = null)
            : base(parameterName, message) { }
    }
}

The two benchmarks execute the same guard clause checking that a Nullable<T> has a value. The first is written in imperative style, the second one is written as a generic extension method.

The results for .NET Core 2.1.0-preview2-26130-04 are the following:

BenchmarkDotNet=v0.10.12, OS=Windows 10 Redstone 3 [1709, Fall Creators Update] (10.0.16299.214)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical cores and 4 physical cores
Frequency=3515623 Hz, Resolution=284.4446 ns, Timer=TSC
.NET Core SDK=2.2.0-preview1-007947
  [Host] : .NET Core 2.1.0-preview2-26130-04 (Framework 4.6.26130.05), 64bit RyuJIT
  Core   : .NET Core 2.1.0-preview2-26130-04 (Framework 4.6.26130.05), 64bit RyuJIT

Job=Core  Runtime=Core

Method	Mean	Error	StdDev	Scaled	ScaledSD	Allocated
MustHaveValueBaseVersion	0.7766 ns	0.0176 ns	0.0117 ns	1.00	0.00	0 B
MustHaveValueExtensionMethod	3.5325 ns	0.1015 ns	0.0996 ns	4.55	0.14	0 B

As we can see, the extension method is ca. 4.5 times slower than the imperative version.

@redknightlois and I think that the code generated by the JIT can be optimized so that the performance of the extension method version increases considerably. The imperative version returns fast:

00007ff8`6020ec10 MustHaveValueBenchmark.IntBenchmarks.MustHaveValueBaseVersion()
00007ff8`6020ec15 488d4108        lea     rax,[rcx+8]
00007ff8`6020ec19 0fb610          movzx   edx,byte ptr [rax]
00007ff8`6020ec1c 8b4004          mov     eax,dword ptr [rax+4]
00007ff8`6020ec1f 84d2            test    dl,dl
00007ff8`6020ec21 740a            je      00007ff8`6020ec2d
00007ff8`6020ec23 488b4108        mov     rax,qword ptr [rcx+8]
00007ff8`6020ec27 4883c420        add     rsp,20h
00007ff8`6020ec2b 5e              pop     rsi
00007ff8`6020ec2c c3              ret
;additional statements left out

The extension method does not return fast:

00007ff8`6021ec10 MustHaveValueBenchmark.IntBenchmarks.MustHaveValueExtensionMethod()
00007ff8`6021ec14 4883c108        add     rcx,8
00007ff8`6021ec18 0fb611          movzx   edx,byte ptr [rcx]
00007ff8`6021ec1b 88542430        mov     byte ptr [rsp+30h],dl
00007ff8`6021ec1f 8b4904          mov     ecx,dword ptr [rcx+4]
00007ff8`6021ec22 894c2434        mov     dword ptr [rsp+34h],ecx
00007ff8`6021ec26 c644242800      mov     byte ptr [rsp+28h],0
00007ff8`6021ec2b 33c9            xor     ecx,ecx
00007ff8`6021ec2d 894c242c        mov     dword ptr [rsp+2Ch],ecx
00007ff8`6021ec31 807c243000      cmp     byte ptr [rsp+30h],0
00007ff8`6021ec36 7407            je      00007ff8`6021ec3f
00007ff8`6021ec38 488b442430      mov     rax,qword ptr [rsp+30h]
00007ff8`6021ec3d eb24            jmp     00007ff8`6021ec63
00007ff8`6021ec3f 48b9a8333577c0020000 mov rcx,2C0773533A8h
00007ff8`6021ec49 488b09          mov     rcx,qword ptr [rcx]
00007ff8`6021ec4c 33d2            xor     edx,edx
00007ff8`6021ec4e e8add7ffff      call    00007ff8`6021c400
ThrowMethodInlined.Benchmarks.Throw.NullableHasNoValue(System.String, System.String)
00007ff8`6021ec53 c644242800      mov     byte ptr [rsp+28h],0
00007ff8`6021ec58 33c0            xor     eax,eax
00007ff8`6021ec5a 8944242c        mov     dword ptr [rsp+2Ch],eax
00007ff8`6021ec5e 488b442428      mov     rax,qword ptr [rsp+28h]
 
No ILOffsetMap found
ThrowMethodInlined.Benchmarks.Throw.NullableHasNoValue(System.String, System.String)

Is there a way to optimize the JIT output so that the extension method is about as fast as the imperative version?

category:cq
theme:structs
skill-level:expert
cost:large

The text was updated successfully, but these errors were encountered:

AndyAyersMS · 2018-02-06T20:27:11Z

Since nullables are value classes, there is an additional cost to passing them as an explicit by-value argument vs as an implicit ("true" this) argument -- in the explicit case a copy must be made.

The ABI may then require that this explicit by-value value class actually be passed implicitly by reference. Today the jit exposes the ABI impact very early in its processing. If the jit then inlines this call, it will try to undo the impact of the implicit by-reference arg passing and the copy, but sometimes it can't get rid of it all. So that's likely what you see here.

We have ambitions of improving the jit's ability to deal with structs in cases like this but haven't gotten around to it yet. It is a fairly large undertaking.

When I have a chance, I'll take a deeper look to see if there is something going on here that we might be able to address more easily.

AndyAyersMS · 2018-02-07T01:18:54Z

Took a deeper look and yes it's what I described above. Not sure if a jit dump will prove instructive but this is how the jit models the code after inlining:

***** BB01, stmt 1
               [000037] ------------              *  STMT      void  (IL 0x000...  ???)
               [000009] ---XG-------              |  /--*  OBJ(8)    struct
               [000008] ---XG-------              |  |  \--*  ADDR      byref 
               [000002] ---XG-------              |  |     \--*  FIELD     struct NullableWithValue
               [000001] ------------              |  |        \--*  LCL_VAR   ref    V00 this         
               [000036] -A-XG---R---              \--*  ASG       struct (copy)
               [000034] D-----------                 \--*  LCL_VAR   struct V01 tmp0         

------------ BB02 [000..001) -> BB04 (cond), preds={} succs={BB03,BB04}

***** BB02, stmt 2
               [000024] ------------              *  STMT      void  (IL 0x000...  ???)
               [000023] --C---------              \--*  JTRUE     void  
               [000021] ------------                 |  /--*  CNS_INT   int    0
               [000022] --C---------                 \--*  NE        int   
               [000042] ------------                    \--*  FIELD     bool   hasValue
               [000040] L-----------                       \--*  ADDR      byref 
               [000041] ------------                          \--*  LCL_VAR   struct V01 tmp0

The first blob says to copy the value of this to a temporary tmp0. The second says to read the HasValue field of the copy. The jit sees that tmp0's address is used and so decides that it must be kept on the stack.

One might hope that we could do a bit of lookahead when importing like we do at times for ldloca to see if the address is immediately consumed, but here production is in a caller and consumption in a callee. So it would require a bit more magic than that.

feO2x · 2018-02-07T06:37:25Z

@AndyAyersMS Thanks for your analysis. Should we close this issue for now or should we leave it open?

AndyAyersMS · 2018-02-07T16:21:18Z

It's a useful example to keep in mind as we think about future jit work., so I'm inclined to leave it open. I can't say when or if we'll work on it though.

The convenient C# syntax for nullables may lead programmers to not fully appreciate that they have some extra performance overhead. While we can't get rid of it all -- as there really is extra work to do for nullables -- we can work to minimize the differences.

Likewise, the natural looking extension method syntax gives the impression that extension methods on value types might be just as efficient as built-in methods on value types. And extension methods on reference types are in fact just as efficient as built-in methods. So again this is an area where we might work to reduce the amount of performance surprise, even if we can't get rid of it all together.

jakobbotsch · 2022-12-14T14:49:59Z

Codegen today looks much better:

; Assembly listing for method MustHaveValueBenchmark.IntBenchmarks:MustHaveValueBaseVersion():System.Nullable`1[int]:this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 2 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 this         [V00,T00] (  4,  4   )     ref  ->  rcx         this class-hnd single-def
;  V01 OutArgs      [V01    ] (  1,  1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;  V02 tmp1         [V02,T01] (  3,  0   )     ref  ->  rsi         class-hnd exact single-def "NewObj constructor temp"
;* V03 tmp2         [V03    ] (  0,  0   )   byref  ->  zero-ref    "Inlining Arg"
;  V04 tmp3         [V04,T02] (  2,  0   )     ref  ->  rdx         single-def "argument with side effect"
;
; Lcl frame size = 32

G_M45272_IG01:
       push     rsi
       sub      rsp, 32
						;; size=5 bbWeight=1    PerfScore 1.25
G_M45272_IG02:
       cmp      byte  ptr [rcx+08H], 0
       je       SHORT G_M45272_IG04
       mov      rax, qword ptr [rcx+08H]
						;; size=10 bbWeight=1    PerfScore 6.00
G_M45272_IG03:
       add      rsp, 32
       pop      rsi
       ret      
						;; size=6 bbWeight=1    PerfScore 1.75
G_M45272_IG04:
       mov      rcx, 0xD1FFAB1E      ; MustHaveValueBenchmark.NullableHasNoValueException
       call     CORINFO_HELP_NEWSFAST
       mov      rsi, rax
       mov      ecx, 1
       mov      rdx, 0xD1FFAB1E
       call     CORINFO_HELP_STRCNS
       mov      rdx, rax
       mov      rcx, rsi
       xor      r8, r8
       call     [System.ArgumentNullException:.ctor(System.String,System.String):this]
       mov      rcx, rsi
       call     CORINFO_HELP_THROW
       int3     
						;; size=62 bbWeight=0    PerfScore 0.00

; Total bytes of code 83, prolog size 5, PerfScore 17.30, instruction count 21, allocated bytes for code 83 (MethodHash=52794f27) for method MustHaveValueBenchmark.IntBenchmarks:MustHaveValueBaseVersion():System.Nullable`1[int]:this
; ============================================================

; Assembly listing for method MustHaveValueBenchmark.IntBenchmarks:MustHaveValueExtensionMethod():System.Nullable`1[int]:this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 1 single block inlinees; 1 inlinees without PGO data
; Final local variable assignments
;
;  V00 this         [V00,T00] (  3,  3   )     ref  ->  rcx         this class-hnd single-def
;  V01 OutArgs      [V01    ] (  1,  1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;  V02 tmp1         [V02    ] (  3,  6   )  struct ( 8) [rsp+20H]   do-not-enreg[S] ld-addr-op "Inlining Arg"
;  V03 tmp2         [V03,T01] (  3,  5   )    bool  ->  [rsp+20H]   do-not-enreg[] V02.hasValue(offs=0x00) P-DEP "field V02.hasValue (fldOffset=0x0)"
;  V04 tmp3         [V04,T02] (  2,  4   )     int  ->  [rsp+24H]   do-not-enreg[] V02.value(offs=0x04) P-DEP "field V02.value (fldOffset=0x4)"
;
; Lcl frame size = 40

G_M2833_IG01:
       sub      rsp, 40
						;; size=4 bbWeight=1    PerfScore 0.25
G_M2833_IG02:
       mov      rcx, qword ptr [rcx+08H]
       mov      qword ptr [rsp+20H], rcx
       cmp      byte  ptr [rsp+20H], 0
       jne      SHORT G_M2833_IG04
						;; size=16 bbWeight=1    PerfScore 6.00
G_M2833_IG03:
       mov      rcx, 0xD1FFAB1E      ; 'NullableWithValue'

       xor      rdx, rdx
       call     [MustHaveValueBenchmark.Throw:NullableHasNoValue(System.String,System.String)]
						;; size=18 bbWeight=0.50 PerfScore 1.75
G_M2833_IG04:
       mov      rax, qword ptr [rsp+20H]
						;; size=5 bbWeight=1    PerfScore 1.00
G_M2833_IG05:
       add      rsp, 40
       ret      
						;; size=5 bbWeight=1    PerfScore 1.25

; Total bytes of code 48, prolog size 4, PerfScore 15.05, instruction count 11, allocated bytes for code 48 (MethodHash=65dcf4ee) for method MustHaveValueBenchmark.IntBenchmarks:MustHaveValueExtensionMethod():System.Nullable`1[int]:this
; ============================================================

We still end up with one copy but getting rid of that does not seem possible given that NullableHasNoValue is marked NoInlining and may affect the field.
I'm not sure there's much more we can do here without fundamental work to allow the JIT to keep structs packed in registers (probably done as part of removing DNER, which is one of the stretch items on #76931). Will close this as fixed given the much improved codegen.

msftgits transferred this issue from dotnet/coreclr Jan 31, 2020

msftgits added this to the Future milestone Jan 31, 2020

BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020

BruceForstall changed the title ~~JIT - generated code for generic static method with Nullable<T> can be optimized~~ JIT: generated code for generic static method with Nullable<T> can be optimized Nov 24, 2020

BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Nov 24, 2020

jakobbotsch closed this as completed Dec 14, 2022

ghost locked as resolved and limited conversation to collaborators Jan 13, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

JIT: generated code for generic static method with `Nullable<T>` can be optimized #9677

JIT: generated code for generic static method with `Nullable<T>` can be optimized #9677

feO2x commented Feb 6, 2018 •

edited by BruceForstall

AndyAyersMS commented Feb 6, 2018

AndyAyersMS commented Feb 7, 2018

feO2x commented Feb 7, 2018

AndyAyersMS commented Feb 7, 2018

jakobbotsch commented Dec 14, 2022

JIT: generated code for generic static method with Nullable<T> can be optimized #9677

JIT: generated code for generic static method with Nullable<T> can be optimized #9677

Comments

feO2x commented Feb 6, 2018 • edited by BruceForstall

AndyAyersMS commented Feb 6, 2018

AndyAyersMS commented Feb 7, 2018

feO2x commented Feb 7, 2018

AndyAyersMS commented Feb 7, 2018

jakobbotsch commented Dec 14, 2022

JIT: generated code for generic static method with `Nullable<T>` can be optimized #9677

JIT: generated code for generic static method with `Nullable<T>` can be optimized #9677

feO2x commented Feb 6, 2018 •

edited by BruceForstall