Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stack allocate unescaped boxes #103361

Open
wants to merge 50 commits into
base: main
Choose a base branch
from

Conversation

AndyAyersMS
Copy link
Member

@AndyAyersMS AndyAyersMS commented Jun 12, 2024

Enable object stack allocation for ref classes and extend the support to include boxed value classes. Use a specialized unbox helper for stack allocated boxes, both to avoid apparent escape of the box by the helper, and to ensure all box field accesses are visible to the JIT. Update the local address visitor to rewrite trees involving address of stack allocated boxes in some cases to avoid address exposure. Disable old promotion for stack allocated boxes (since we have no field handles) and allow physical promotion to enregister the box method table and/or payload as appropriate. In OSR methods handle the fact that the stack allocation may actually have been a heap allocation by the Tier0 method.

The analysis TP cost is around 0.4-0.7% (notes below). Boxes are much less likely to escape than ref classes (roughly ~90% of boxes escape, ~99.8% of ref classes escape). Codegen impact is diminished somewhat because many of the boxes are dead and were already getting optimized away.
 
Fixes #4584, #9118, #10195, #11192, #53585, #58554, #85570

Still todo:

  • settle on the right new type for a stack allocated box (currently KeyValuePair)
  • R2R/NAOT support
  • continue trying to do a realistic assessment of impact

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jun 12, 2024
@AndyAyersMS
Copy link
Member Author

For case from #9118:

; Assembly listing for method C:Main():int (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 1 inlinees with PGO data; 3 single block inlinees; 1 inlinees without PGO data
; Final local variable assignments
;
;  V00 OutArgs      [V00    ] (  1,  1   )  struct (32) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V01 tmp1         [V01,T00] (  0,  0   )     ref  ->  zero-ref    class-hnd exact single-def "NewObj constructor temp" <ObjectEqualityComparer`1[int]>
;* V02 tmp2         [V02    ] (  0,  0   )   ubyte  ->  zero-ref    "Inline return value spill temp"
;* V03 tmp3         [V03    ] (  0,  0   )     int  ->  zero-ref    ld-addr-op "Inlining Arg"
;* V04 tmp4         [V04    ] (  0,  0   )    long  ->  zero-ref    class-hnd exact "Single-def Box Helper" <System.Int32>
;* V05 tmp5         [V05    ] (  0,  0   )   ubyte  ->  zero-ref    "Inline return value spill temp"
;* V06 tmp6         [V06    ] (  0,  0   )    long  ->  zero-ref    class-hnd exact "Inlining Arg" <System.Int32>
;* V07 tmp7         [V07    ] (  0,  0   )  struct (16) zero-ref    do-not-enreg[SF] "stack allocated boxed value class temp" <System.Collections.Generic.KeyValuePair`2[long,int]>
;* V08 tmp8         [V08    ] (  0,  0   )    long  ->  zero-ref    "V07.[000..008)"
;* V09 tmp9         [V09    ] (  0,  0   )     int  ->  zero-ref    "V07.[008..012)"
;
; Lcl frame size = 40

G_M46144_IG01:  ;; offset=0x0000
       sub      rsp, 40
                                                ;; size=4 bbWeight=1 PerfScore 0.25
G_M46144_IG02:  ;; offset=0x0004
       mov      eax, 100
                                                ;; size=5 bbWeight=1 PerfScore 0.25
G_M46144_IG03:  ;; offset=0x0009
       add      rsp, 40
       ret
                                                ;; size=5 bbWeight=1 PerfScore 1.25

; Total bytes of code 14, prolog size 4, PerfScore 1.75, instruction count 4, allocated bytes for code 14 (MethodHash=cd934bbf) for method C:Main():int (FullOpts)

@AndyAyersMS
Copy link
Member Author

Current code for #10195

; Assembly listing for method Program:Pattern[Dog](Dog) (FullOpts)
; Emitting BLENDED_CODE for X86 with AVX512 - Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 2 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )  struct ( 8) rcx         single-def <Dog>
;* V01 loc0         [V01    ] (  0,  0   )    long  ->  zero-ref    class-hnd exact <Dog>
;# V02 OutArgs      [V02    ] (  1,  1   )  struct ( 0) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V03 tmp1         [V03    ] (  0,  0   )    long  ->  zero-ref    class-hnd exact "Single-def Box Helper" <Dog>
;* V04 tmp2         [V04    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;  V05 tmp3         [V05    ] (  2,  2   )  struct (16) [rsp+0x08]  do-not-enreg[XSF] must-init addr-exposed "stack allocated boxed value class temp" <System.Collections.Generic.KeyValuePair`2[long,Dog]>
;
; Lcl frame size = 24

G_M46299_IG01:  ;; offset=0x0000
       sub      rsp, 24
       xor      eax, eax
       mov      qword ptr [rsp+0x08], rax
       mov      qword ptr [rsp+0x10], rax
						;; size=16 bbWeight=1 PerfScore 2.50
G_M46299_IG02:  ;; offset=0x0010
       mov      rax, 0x7FFEDABE4C08      ; Dog
       mov      qword ptr [rsp+0x08], rax
       mov      byte  ptr [rsp+0x10], cl
						;; size=19 bbWeight=1 PerfScore 2.25
G_M46299_IG03:  ;; offset=0x0023
       add      rsp, 24
       ret      
						;; size=5 bbWeight=1 PerfScore 1.25

; Total bytes of code 40, prolog size 16, PerfScore 6.00, instruction count 9, allocated bytes for code 40 (MethodHash=a9204b24) for method Program:Pattern[Dog](Dog) (FullOpts)
; ============================================================

No more boxing, but room for improvement...

@jakobbotsch
Copy link
Member

Current code for #10195

; Assembly listing for method Program:Pattern[Dog](Dog) (FullOpts)
; Emitting BLENDED_CODE for X86 with AVX512 - Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 2 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )  struct ( 8) rcx         single-def <Dog>
;* V01 loc0         [V01    ] (  0,  0   )    long  ->  zero-ref    class-hnd exact <Dog>
;# V02 OutArgs      [V02    ] (  1,  1   )  struct ( 0) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V03 tmp1         [V03    ] (  0,  0   )    long  ->  zero-ref    class-hnd exact "Single-def Box Helper" <Dog>
;* V04 tmp2         [V04    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;  V05 tmp3         [V05    ] (  2,  2   )  struct (16) [rsp+0x08]  do-not-enreg[XSF] must-init addr-exposed "stack allocated boxed value class temp" <System.Collections.Generic.KeyValuePair`2[long,Dog]>
;
; Lcl frame size = 24

G_M46299_IG01:  ;; offset=0x0000
       sub      rsp, 24
       xor      eax, eax
       mov      qword ptr [rsp+0x08], rax
       mov      qword ptr [rsp+0x10], rax
						;; size=16 bbWeight=1 PerfScore 2.50
G_M46299_IG02:  ;; offset=0x0010
       mov      rax, 0x7FFEDABE4C08      ; Dog
       mov      qword ptr [rsp+0x08], rax
       mov      byte  ptr [rsp+0x10], cl
						;; size=19 bbWeight=1 PerfScore 2.25
G_M46299_IG03:  ;; offset=0x0023
       add      rsp, 24
       ret      
						;; size=5 bbWeight=1 PerfScore 1.25

; Total bytes of code 40, prolog size 16, PerfScore 6.00, instruction count 9, allocated bytes for code 40 (MethodHash=a9204b24) for method Program:Pattern[Dog](Dog) (FullOpts)
; ============================================================

No more boxing, but room for improvement...

Any idea why we end up with address exposure for V05?

@AndyAyersMS
Copy link
Member Author

Any idea why we end up with address exposure for V05?

I think it's that local morph needs to deal with a nullcheck... will know shortly

TypeHandle mt(CoreLibBinder::GetElementType(ELEMENT_TYPE_I));
TypeHandle boxedFields[2] = {mt, VMClsHnd};
Instantiation boxedFieldsInst((TypeHandle*)&boxedFields, 2);
TypeHandle genericKvp = CoreLibBinder::GetClass(CLASS__KEYVALUEPAIRGENERIC);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't work well. KeyValuePair is auto-layout when the type contains references. auto-layout moves references to the front.

For example, KeyValuePair<IntPtr, string> has string at offset 0 and IntPtr at offset 8.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it probably works ok for valuetypes currently, but it is fragile.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks... we were trying to find a suitable type without needing too much from the runtime.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should introduce dedicated internal helper type for this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can also try and do this entirely within the jit (that's where I started), but it would be nice to have class and field handles and the like.

I also am still not sure how effective any of this will be, so I want to do some more analysis. Earlier studies suggest we can prove maybe 1% of boxes do not escape (static, not dynamic) but were based on SPMI which can't replay a bunch of possibly interesting cases. Given the cost of the analysis I would like to see if we can get that number a bit higher.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That 1% is a static measure, that is, 1% of the box call sites, not necessarily 1% of boxes allocated.

@AndyAyersMS
Copy link
Member Author

Yep, that was the reason...

; Assembly listing for method Program:Pattern[Dog](Dog) (FullOpts)
; Emitting BLENDED_CODE for X86 with AVX512 - Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 2 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;* V00 arg0         [V00    ] (  0,  0   )  struct ( 8) zero-ref    single-def <Dog>
;* V01 loc0         [V01    ] (  0,  0   )    long  ->  zero-ref    class-hnd exact <Dog>
;# V02 OutArgs      [V02    ] (  1,  1   )  struct ( 0) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V03 tmp1         [V03    ] (  0,  0   )    long  ->  zero-ref    class-hnd exact "Single-def Box Helper" <Dog>
;* V04 tmp2         [V04    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;* V05 tmp3         [V05    ] (  0,  0   )  struct (16) zero-ref    do-not-enreg[SF] "stack allocated boxed value class temp" <System.Collections.Generic.KeyValuePair`2[long,Dog]>
;
; Lcl frame size = 0

G_M46299_IG01:  ;; offset=0x0000
						;; size=0 bbWeight=1 PerfScore 0.00
G_M46299_IG02:  ;; offset=0x0000
       ret      
						;; size=1 bbWeight=1 PerfScore 1.00

; Total bytes of code 1, prolog size 0, PerfScore 1.00, instruction count 1, allocated bytes for code 1 (MethodHash=a9204b24) for method Program:Pattern[Dog](Dog) (FullOpts)
; ============================================================

@AndyAyersMS
Copy link
Member Author

For #85570 one box gets optimized, but the other box is passed to an unbox helper. This helper causes the box to escape if the type tests fail.

There is some preliminary helper escape logic in #103148 but it is too aggressive for unboxing. We would need some kind of partial escape analysis and a different helper pattern for this.

@AndyAyersMS
Copy link
Member Author

We stack allocate in #58554 but the local box is exposed, possibly because of a ?: -- need to review if we get conservative handling these in local morph.

So asm is not that pretty yet

; Assembly listing for method Program:Test[System.ValueTuple`2[int,System.__Canon],System.ValueTuple`2[int,System.__Canon]](System.ValueTuple`2[int,System.__Canon]):ubyte (FullOpts)
; Emitting BLENDED_CODE for X86 with AVX512 - Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;  V00 TypeCtx      [V00,T01] (  4,  4   )    long  ->  rcx         single-def
;  V01 arg0         [V01,T00] (  3,  6   )   byref  ->  rdx         single-def
;# V02 OutArgs      [V02    ] (  1,  1   )  struct ( 0) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V03 tmp1         [V03    ] (  0,  0   )    long  ->  zero-ref    class-hnd exact "Single-def Box Helper" <System.ValueTuple`2[int,System.__Canon]>
;* V04 tmp2         [V04    ] (  0,  0   )   byref  ->  zero-ref    "ISINST eval op1"
;  V05 tmp3         [V05,T02] (  2,  4   )   byref  ->  rdx         class-hnd "spilling qmarkNull" <System.ValueTuple`2[int,System.__Canon]>
;  V06 tmp4         [V06    ] (  4,  4   )  struct (24) [rsp+0x08]  do-not-enreg[XSF] must-init addr-exposed "stack allocated boxed value class temp" <System.Collections.Generic.KeyValuePair`2[long,System.ValueTuple`2[int,System.__Canon]]>
;* V07 tmp5         [V07    ] (  0,  0   )     ref  ->  zero-ref    "field V01.Item2 (fldOffset=0x0)" P-INDEP
;* V08 tmp6         [V08    ] (  0,  0   )     int  ->  zero-ref    "field V01.Item1 (fldOffset=0x8)" P-INDEP
;* V09 tmp7         [V09    ] (  0,  0   )  struct (16) zero-ref    "Promoted implicit byref" <System.ValueTuple`2[int,System.__Canon]>
;  V10 cse0         [V10,T03] (  3,  3   )    long  ->  rax         "CSE #01: aggressive"
;
; Lcl frame size = 40

G_M33081_IG01:  ;; offset=0x0000
       sub      rsp, 40
       vxorps   xmm4, xmm4, xmm4
       vmovdqu  xmmword ptr [rsp+0x08], xmm4
       xor      eax, eax
       mov      qword ptr [rsp+0x18], rax
       mov      qword ptr [rsp+0x20], rcx
						;; size=26 bbWeight=1 PerfScore 4.83
G_M33081_IG02:  ;; offset=0x001A
       mov      rax, qword ptr [rcx+0x40]
       mov      rcx, qword ptr [rax]
       mov      qword ptr [rsp+0x08], rcx
						;; size=12 bbWeight=1 PerfScore 5.00
G_M33081_IG03:  ;; offset=0x0026
       vmovdqu  xmm0, xmmword ptr [rdx]
       vmovdqu  xmmword ptr [rsp+0x10], xmm0
						;; size=10 bbWeight=1 PerfScore 5.00
G_M33081_IG04:  ;; offset=0x0030
       mov      rcx, qword ptr [rsp+0x08]
       lea      rdx, bword ptr [rsp+0x08]
       xor      r8, r8
       cmp      rcx, qword ptr [rax+0x08]
       cmovne   rdx, r8
       test     rdx, rdx
       setne    al
       movzx    rax, al
						;; size=30 bbWeight=1 PerfScore 6.50
G_M33081_IG05:  ;; offset=0x004E
       add      rsp, 40
       ret      
						;; size=5 bbWeight=1 PerfScore 1.25

; Total bytes of code 83, prolog size 26, PerfScore 22.58, instruction count 21, allocated bytes for code 83 (MethodHash=0b7f7ec6) for method Program:Test[System.ValueTuple`2[int,System.__Canon],System.ValueTuple`2[int,System.__Canon]](System.ValueTuple`2[int,System.__Canon]):ubyte (FullOpts)
; ============================================================

@AndyAyersMS
Copy link
Member Author

Will run mihu bot once we're a bit further along in testing, to get a sense of diffs.

Comment on lines 1074 to 1076
}
INDEBUG(TopValue(0).Consume());
PopValue();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has to call EscapeValue in the non-transformed case (that might be the reason for the asserts)

@jakobbotsch
Copy link
Member

jakobbotsch commented Jun 13, 2024

We stack allocate in #58554 but the local box is exposed, possibly because of a ?: -- need to review if we get conservative handling these in local morph.

So asm is not that pretty yet

; Assembly listing for method Program:Test[System.ValueTuple`2[int,System.__Canon],System.ValueTuple`2[int,System.__Canon]](System.ValueTuple`2[int,System.__Canon]):ubyte (FullOpts)
; Emitting BLENDED_CODE for X86 with AVX512 - Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;  V00 TypeCtx      [V00,T01] (  4,  4   )    long  ->  rcx         single-def
;  V01 arg0         [V01,T00] (  3,  6   )   byref  ->  rdx         single-def
;# V02 OutArgs      [V02    ] (  1,  1   )  struct ( 0) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V03 tmp1         [V03    ] (  0,  0   )    long  ->  zero-ref    class-hnd exact "Single-def Box Helper" <System.ValueTuple`2[int,System.__Canon]>
;* V04 tmp2         [V04    ] (  0,  0   )   byref  ->  zero-ref    "ISINST eval op1"
;  V05 tmp3         [V05,T02] (  2,  4   )   byref  ->  rdx         class-hnd "spilling qmarkNull" <System.ValueTuple`2[int,System.__Canon]>
;  V06 tmp4         [V06    ] (  4,  4   )  struct (24) [rsp+0x08]  do-not-enreg[XSF] must-init addr-exposed "stack allocated boxed value class temp" <System.Collections.Generic.KeyValuePair`2[long,System.ValueTuple`2[int,System.__Canon]]>
;* V07 tmp5         [V07    ] (  0,  0   )     ref  ->  zero-ref    "field V01.Item2 (fldOffset=0x0)" P-INDEP
;* V08 tmp6         [V08    ] (  0,  0   )     int  ->  zero-ref    "field V01.Item1 (fldOffset=0x8)" P-INDEP
;* V09 tmp7         [V09    ] (  0,  0   )  struct (16) zero-ref    "Promoted implicit byref" <System.ValueTuple`2[int,System.__Canon]>
;  V10 cse0         [V10,T03] (  3,  3   )    long  ->  rax         "CSE #01: aggressive"
;
; Lcl frame size = 40

G_M33081_IG01:  ;; offset=0x0000
       sub      rsp, 40
       vxorps   xmm4, xmm4, xmm4
       vmovdqu  xmmword ptr [rsp+0x08], xmm4
       xor      eax, eax
       mov      qword ptr [rsp+0x18], rax
       mov      qword ptr [rsp+0x20], rcx
						;; size=26 bbWeight=1 PerfScore 4.83
G_M33081_IG02:  ;; offset=0x001A
       mov      rax, qword ptr [rcx+0x40]
       mov      rcx, qword ptr [rax]
       mov      qword ptr [rsp+0x08], rcx
						;; size=12 bbWeight=1 PerfScore 5.00
G_M33081_IG03:  ;; offset=0x0026
       vmovdqu  xmm0, xmmword ptr [rdx]
       vmovdqu  xmmword ptr [rsp+0x10], xmm0
						;; size=10 bbWeight=1 PerfScore 5.00
G_M33081_IG04:  ;; offset=0x0030
       mov      rcx, qword ptr [rsp+0x08]
       lea      rdx, bword ptr [rsp+0x08]
       xor      r8, r8
       cmp      rcx, qword ptr [rax+0x08]
       cmovne   rdx, r8
       test     rdx, rdx
       setne    al
       movzx    rax, al
						;; size=30 bbWeight=1 PerfScore 6.50
G_M33081_IG05:  ;; offset=0x004E
       add      rsp, 40
       ret      
						;; size=5 bbWeight=1 PerfScore 1.25

; Total bytes of code 83, prolog size 26, PerfScore 22.58, instruction count 21, allocated bytes for code 83 (MethodHash=0b7f7ec6) for method Program:Test[System.ValueTuple`2[int,System.__Canon],System.ValueTuple`2[int,System.__Canon]](System.ValueTuple`2[int,System.__Canon]):ubyte (FullOpts)
; ============================================================

With this PR + #103391 the disassembly for this function becomes

; Assembly listing for method C:Test[System.ValueTuple`2[int,System.__Canon],System.ValueTuple`2[int,System.__Canon]](System.ValueTuple`2[int,System.__Canon]):ubyte (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;  V00 TypeCtx      [V00,T00] (  4,  4   )    long  ->  rcx         single-def
;* V01 arg0         [V01    ] (  0,  0   )   byref  ->  zero-ref    single-def
;# V02 OutArgs      [V02    ] (  1,  1   )  struct ( 0) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V03 tmp1         [V03    ] (  0,  0   )    long  ->  zero-ref    class-hnd exact "Single-def Box Helper" <System.ValueTuple`2[int,System.__Canon]>
;* V04 tmp2         [V04    ] (  0,  0   )   byref  ->  zero-ref    "ISINST eval op1"
;* V05 tmp3         [V05    ] (  0,  0   )     int  ->  zero-ref    "spilling qmarkNull"
;* V06 tmp4         [V06    ] (  0,  0   )  struct (24) zero-ref    do-not-enreg[SF] "stack allocated boxed value class temp" <System.Collections.Generic.KeyValuePair`2[long,System.ValueTuple`2[int,System.__Canon]]>
;* V07 tmp5         [V07    ] (  0,  0   )     ref  ->  zero-ref    "field V01.Item2 (fldOffset=0x0)" P-INDEP
;* V08 tmp6         [V08    ] (  0,  0   )     int  ->  zero-ref    "field V01.Item1 (fldOffset=0x8)" P-INDEP
;  V09 tmp7         [V09,T02] (  2,  2   )    long  ->  rcx         "V06.[000..008)"
;* V10 tmp8         [V10    ] (  0,  0   )     ref  ->  zero-ref    "V06.[008..016)"
;* V11 tmp9         [V11    ] (  0,  0   )     int  ->  zero-ref    "V06.[016..020)"
;* V12 tmp10        [V12    ] (  0,  0   )  struct (16) zero-ref    "Promoted implicit byref" <System.ValueTuple`2[int,System.__Canon]>
;  V13 cse0         [V13,T01] (  3,  3   )    long  ->  rax         "CSE #01: aggressive"
;
; Lcl frame size = 8

G_M43406_IG01:  ;; offset=0x0000
       push     rax
       mov      qword ptr [rsp], rcx
                                                ;; size=5 bbWeight=1 PerfScore 2.00
G_M43406_IG02:  ;; offset=0x0005
       mov      rax, qword ptr [rcx+0x40]
       mov      rcx, qword ptr [rax]
       cmp      rcx, qword ptr [rax+0x08]
       setne    al
       movzx    rax, al
                                                ;; size=17 bbWeight=1 PerfScore 8.25
G_M43406_IG03:  ;; offset=0x0016
       add      rsp, 8
       ret
                                                ;; size=5 bbWeight=1 PerfScore 1.25

; Total bytes of code 27, prolog size 5, PerfScore 11.50, instruction count 9, allocated bytes for code 27 (MethodHash=e76e5671) for method C:Test[System.ValueTuple`2[int,System.__Canon],System.ValueTuple`2[int,System.__Canon]](System.ValueTuple`2[int,System.__Canon]):ubyte (FullOpts)
; ============================================================

Comment on lines +966 to +967
if ((secondArgNode->OperIsLocal() || secondArgNode->OperIs(GT_LCL_ADDR)) &&
!secondArgNode->TypeIs(TYP_REF))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What guarantees that secondArgNode in the secondArgNode->OperIsLocal() case is pointing to something stack allocated?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just the rewritten type. I can validate the local num too I suppose.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually not quite this simple, I would need to track all locals that get rewritten. I think checking the type is ok.

We can safely do this transformation for all unbox helper calls, but if the box is on the heap it will just increase code size with no benefit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like doing this transformation during local morph would make its status as "avoiding address exposure" a bit more clear, but I think this is ok then.

Comment on lines 997 to 998
// flag this comma so we can find it later
comma->gtFlags |= GTF_MAKE_CSE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this about?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's to enable the rewriting of IND(COMMA(x, y)) to COMMA(x, IND(y)) just below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed in favor of more elaborate check on the IND side.

GenTreeIndir* const indir = tree->AsIndir();
GenTree* const addr = indir->Addr();

if (addr->OperIs(GT_COMMA) && ((addr->gtFlags & GTF_MAKE_CSE) != 0))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we look for the specific pattern rather than repurposing a flag like this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose we can check if the comma side effect child is the unbox type test helper.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Co-authored-by: Jan Kotas <jkotas@microsoft.com>
Co-authored-by: Jakob Botsch Nielsen <Jakob.botsch.nielsen@gmail.com>
@AndyAyersMS
Copy link
Member Author

/azp run runtime-coreclr jitstress

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@AndyAyersMS
Copy link
Member Author

/azp run runtime-coreclr jitstress

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@AndyAyersMS
Copy link
Member Author

Jitstress has possibly one novel failure, in CoreMangLib\system\span\RefStructWithSpan ... investigating.

@jakobbotsch
Copy link
Member

Jitstress has possibly one novel failure, in CoreMangLib\system\span\RefStructWithSpan ... investigating.

That one might be #103624

@AndyAyersMS
Copy link
Member Author

Jitstress has possibly one novel failure, in CoreMangLib\system\span\RefStructWithSpan ... investigating.

That one might be #103624

Seems likely, yes.

Any other feedback?

Comment on lines 829 to 839
if (lhsType != newType)
{
lhs->ChangeType(newType);
}

if (rhsType != newType)
{
rhs->ChangeType(newType);
}

parent->ChangeType(newType);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It still seems odd to me to forcefully retype an arbitrary sibling node of an ancestor, but maybe all TYP_REF JIT IR nodes really do have equivalently typed TYP_BYREF nodes.

Copy link
Member

@jakobbotsch jakobbotsch Jun 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, this will change TYP_REF LCL_VAR(local whose LclVarDsc is TYP_REF) to TYP_BYREF LCL_VAR(local whose LclVarDsc is TYP_REF)... That will hit this assert:

// Check that the LCL_VAR node has the same type as the underlying variable, save a few mismatches we allow.
assert(tree->TypeIs(varDsc->TypeGet(), genActualType(varDsc)) ||
(tree->TypeIs(TYP_BYREF) && (varDsc->TypeGet() == TYP_I_IMPL)) || // Created by inliner substitution.
(tree->TypeIs(TYP_INT) && (varDsc->TypeGet() == TYP_LONG))); // Created by "optNarrowTree".

Not sure if it will actually cause issues downstream, but it feels like we should insert some form of cast to maintain IR invariants. Potentially in quite a few places in object stack allocation. Anyway, something to think about, we don't need to block on this given that QMARKs have very restricted shapes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to reconsider ... I added this because of assertion failures but don't recall the specific case and now can't reproduce it. Not clear to me how the JIT could create a TYP_REF GT_COLON, at least not one that this code would see.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the motivating example -- from the object stack allocation tests

               [000035] DA-X-------                         *  STORE_LCL_VAR ref    V03 tmp3
               [000034] ---X-------                         \--*  QMARK     ref
               [000025] -----------    if                      +--*  EQ        int
               [000024] -----------                            |  +--*  LCL_VAR   ref    V02 tmp2
               [000023] -----------                            |  \--*  CNS_INT   ref    null
               [000033] ---X-------    if                      \--*  COLON     ref
               [000031] ---X------- else                          +--*  QMARK     ref
               [000027] ---X-------    if                         |  +--*  NE        int
               [000026] #--X-------                               |  |  +--*  IND       long
               [000022] -----------                               |  |  |  \--*  LCL_VAR   ref    V02 tmp2
               [000021] H----------                               |  |  \--*  CNS_INT(h) long   0x7ffe34d5e970 class ObjectStackAllocation.SimpleClassB
               [000030] -----------    if                         |  \--*  COLON     ref
               [000028] ----------- else                          |     +--*  LCL_VAR   ref    V02 tmp2
               [000029] ----------- then                          |     \--*  CNS_INT   ref    null
               [000032] ----------- then                          \--*  CNS_INT   ref    null

We stack allocate the ref class, and so retype tmp2 to byref. This propagates up and if we don't retype sibling [000029] we hit asserts.

I think it may be the case that the mistyped sibling must be null... I can check for that.

Copy link
Member

@jakobbotsch jakobbotsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, a comment on why the retyping feels wrong to me.

@AndyAyersMS
Copy link
Member Author

@MichalStrehovsky any feedback? Think you got added as a reviewer automatically.

@AndyAyersMS
Copy link
Member Author

LGTM, a comment on why the retyping feels wrong to me.

Turns out we only need to retype nulls, hopefully that's a bit more palatable.

@AndyAyersMS
Copy link
Member Author

CI seems stuck...

@AndyAyersMS AndyAyersMS reopened this Jun 28, 2024
@AndyAyersMS
Copy link
Member Author

@dotnet/dnceng is CI stuck? No evident progress here in 2 hours.

@mmitche mmitche closed this Jun 28, 2024
@mmitche mmitche reopened this Jun 28, 2024
@mmitche
Copy link
Member

mmitche commented Jun 28, 2024

@dotnet/dnceng is CI stuck? No evident progress here in 2 hours.

GitHub had an outage.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CLR/JIT should optimize "alloc temporary small object" to "alloc on stack" automatically
7 participants