JIT: Improve last-use copy omission for implicit byrefs #76069

jakobbotsch · 2022-09-23T13:41:43Z

Consider:

[MethodImpl(MethodImplOptions.NoInlining)]
public static int Test(Span<int> s)
{
    return s.Length + Test2(s);
}

[MethodImpl(MethodImplOptions.NoInlining)]
public static int Test2(Span<int> s)
{
    return s.Length;
}

Today we generate the following on win-x64:

; Assembly listing for method Program:Test(System.Span`1[int]):int
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  4,  8   )   byref  ->  rcx         ld-addr-op single-def
;  V01 OutArgs      [V01    ] (  1,  1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;* V02 tmp1         [V02    ] (  0,  0   )     int  ->  zero-ref    "non-inline candidate call"
;* V03 tmp2         [V03    ] (  0,  0   )   byref  ->  zero-ref    V05._reference(offs=0x00) P-INDEP "field V00._reference (fldOffset=0x0)"
;* V04 tmp3         [V04    ] (  0,  0   )     int  ->  zero-ref    V05._length(offs=0x08) P-INDEP "field V00._length (fldOffset=0x8)"
;* V05 tmp4         [V05    ] (  0,  0   )  struct (16) zero-ref    "Promoted implicit byref"
;  V06 tmp5         [V06    ] (  2,  4   )  struct (16) [rsp+20H]   do-not-enreg[XS] addr-exposed "by-value struct argument"
;
; Lcl frame size = 48

G_M62473_IG01:              ;; offset=0000H
       56                   push     rsi
       4883EC30             sub      rsp, 48
       C5F877               vzeroupper
                                                ;; size=8 bbWeight=1    PerfScore 2.25
G_M62473_IG02:              ;; offset=0008H
       8B7108               mov      esi, dword ptr [rcx+08H]
                                                ;; size=3 bbWeight=1    PerfScore 2.00
G_M62473_IG03:              ;; offset=000BH
       C5FA6F01             vmovdqu  xmm0, xmmword ptr [rcx]             ; unnecessary
       C5FA7F442420         vmovdqu  xmmword ptr [rsp+20H], xmm0         ; unnecessary
                                                ;; size=10 bbWeight=1    PerfScore 5.00
G_M62473_IG04:              ;; offset=0015H
       488D4C2420           lea      rcx, [rsp+20H]                      ; unnecessary
       FF15C86B1C00         call     [Program:Test2(System.Span`1[int]):int]
       03C6                 add      eax, esi
                                                ;; size=13 bbWeight=1    PerfScore 3.75
G_M62473_IG05:              ;; offset=0022H
       4883C430             add      rsp, 48
       5E                   pop      rsi
       C3                   ret
                                                ;; size=6 bbWeight=1    PerfScore 1.75

; Total bytes of code 40, prolog size 8, PerfScore 18.95, instruction count 12, allocated bytes for code 42 (MethodHash=e5250bf6) for method Program:Test(System.Span`1[int]):int
; ============================================================

We would probably need an early pass of (struct) liveness to get rid of some of these copies in a general fashion.

Some simple ad-hoc pattern matching during codegen (code) shows roughly 24k instances of this over libraries.crossgen and 3.2k over benchmarks.run. These are cases where we create a copy from a non-address exposed GTF_VAR_DEATH marked struct local.

We expect that this liveness pass may eventually be useful (necessary, even) for generalized promotion as well, and perhaps for forward sub too. Thus the pass will need to run early enough to be useful for these purposes. The likely best fit is right after local address visitor.

The text was updated successfully, but these errors were encountered:

jakobbotsch · 2022-09-23T13:41:50Z

cc @dotnet/jit-contrib

ghost · 2022-09-23T13:41:50Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Consider:

[MethodImpl(MethodImplOptions.NoInlining)]
public static int Test(Span<int> s)
{
    return s.Length + Test2(s);
}

[MethodImpl(MethodImplOptions.NoInlining)]
public static int Test2(Span<int> s)
{
    return s.Length;
}

Today we generate the following on win-x64:

; Assembly listing for method Program:Test(System.Span`1[int]):int
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  4,  8   )   byref  ->  rcx         ld-addr-op single-def
;  V01 OutArgs      [V01    ] (  1,  1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;* V02 tmp1         [V02    ] (  0,  0   )     int  ->  zero-ref    "non-inline candidate call"
;* V03 tmp2         [V03    ] (  0,  0   )   byref  ->  zero-ref    V05._reference(offs=0x00) P-INDEP "field V00._reference (fldOffset=0x0)"
;* V04 tmp3         [V04    ] (  0,  0   )     int  ->  zero-ref    V05._length(offs=0x08) P-INDEP "field V00._length (fldOffset=0x8)"
;* V05 tmp4         [V05    ] (  0,  0   )  struct (16) zero-ref    "Promoted implicit byref"
;  V06 tmp5         [V06    ] (  2,  4   )  struct (16) [rsp+20H]   do-not-enreg[XS] addr-exposed "by-value struct argument"
;
; Lcl frame size = 48

G_M62473_IG01:              ;; offset=0000H
       56                   push     rsi
       4883EC30             sub      rsp, 48
       C5F877               vzeroupper
                                                ;; size=8 bbWeight=1    PerfScore 2.25
G_M62473_IG02:              ;; offset=0008H
       8B7108               mov      esi, dword ptr [rcx+08H]
                                                ;; size=3 bbWeight=1    PerfScore 2.00
G_M62473_IG03:              ;; offset=000BH
       C5FA6F01             vmovdqu  xmm0, xmmword ptr [rcx]             ; unnecessary
       C5FA7F442420         vmovdqu  xmmword ptr [rsp+20H], xmm0         ; unnecessary
                                                ;; size=10 bbWeight=1    PerfScore 5.00
G_M62473_IG04:              ;; offset=0015H
       488D4C2420           lea      rcx, [rsp+20H]                      ; unnecessary
       FF15C86B1C00         call     [Program:Test2(System.Span`1[int]):int]
       03C6                 add      eax, esi
                                                ;; size=13 bbWeight=1    PerfScore 3.75
G_M62473_IG05:              ;; offset=0022H
       4883C430             add      rsp, 48
       5E                   pop      rsi
       C3                   ret
                                                ;; size=6 bbWeight=1    PerfScore 1.75

; Total bytes of code 40, prolog size 8, PerfScore 18.95, instruction count 12, allocated bytes for code 42 (MethodHash=e5250bf6) for method Program:Test(System.Span`1[int]):int
; ============================================================

We would probably need an early pass of (struct) liveness to get rid of some of these copies in a general fashion.

Some simple ad-hoc pattern matching during codegen (code) shows roughly 24k instances of this over libraries.crossgen and 3.2k over benchmarks.run. These are cases where we create a copy from a non-address exposed GTF_VAR_DEATH marked struct local.

Author:	jakobbotsch
Assignees:	-
Labels:	`area-CodeGen-coreclr`
Milestone:	-

jakobbotsch · 2022-10-12T14:43:03Z

Another possibility is to home by-value structs that are last used by a call directly in the outgoing arg area. Of course this is only possible if there are no interfering calls.

jakobbotsch · 2022-12-16T08:51:35Z

With #79346 we generate:

@@ -13,32 +13,23 @@
 ;* V03 tmp2         [V03    ] (  0,  0   )   byref  ->  zero-ref    V05._reference(offs=0x00) P-INDEP "field V00._reference (fldOffset=0x0)"
 ;* V04 tmp3         [V04    ] (  0,  0   )     int  ->  zero-ref    V05._length(offs=0x08) P-INDEP "field V00._length (fldOffset=0x8)"
 ;* V05 tmp4         [V05    ] (  0,  0   )  struct (16) zero-ref    "Promoted implicit byref"
-;  V06 tmp5         [V06    ] (  2,  4   )  struct (16) [rsp+20H]   do-not-enreg[XS] addr-exposed "by-value struct argument"
 ;
-; Lcl frame size = 48
+; Lcl frame size = 32
 
 G_M46558_IG01:
        push     rsi
-       sub      rsp, 48
-       vzeroupper 
-						;; size=8 bbWeight=1    PerfScore 2.25
+       sub      rsp, 32
+						;; size=5 bbWeight=1    PerfScore 1.25
 G_M46558_IG02:
        mov      esi, dword ptr [rcx+08H]
-						;; size=3 bbWeight=1    PerfScore 2.00
-G_M46558_IG03:
-       vmovdqu  xmm0, xmmword ptr [rcx]
-       vmovdqu  xmmword ptr [rsp+20H], xmm0
-						;; size=10 bbWeight=1    PerfScore 5.00
-G_M46558_IG04:
-       lea      rcx, [rsp+20H]
        call     [N.C:Test2(System.Span`1[int]):int]
        add      eax, esi
-						;; size=13 bbWeight=1    PerfScore 3.75
-G_M46558_IG05:
-       add      rsp, 48
+						;; size=11 bbWeight=1    PerfScore 5.25
+G_M46558_IG03:
+       add      rsp, 32
        pop      rsi
        ret      
 						;; size=6 bbWeight=1    PerfScore 1.75
 
-; Total bytes of code 40, prolog size 8, PerfScore 18.75, instruction count 12, allocated bytes for code 40 (MethodHash=63e24a21) for method N.C:Test(System.Span`1[int]):int
+; Total bytes of code 22, prolog size 5, PerfScore 10.45, instruction count 8, allocated bytes for code 22 (MethodHash=63e24a21) for method N.C:Test(System.Span`1[int]):int
 ; ============================================================

…-use copy elision for implicit byrefs (#79346) This runs a pass of liveness right after local morph and uses it for forward sub and to omit copies of structs when passed as implicit byrefs at their last use. Fix #76069 Fix #75206 Fix #65025 Fix #9839 This PR introduces the following new JIT invariants: * When optimizing, local morph will now thread all locals into a tree list accessed by Statement::LocalsTreeList. This tree list is kept valid starting from local morph and ending with forward sub. There is no memory impact of this since we reuse the GenTree::gtPrev and GenTree::gtNext fields. * Early liveness information (GTF_VAR_DEATH and the promoted struct death vars map) is kept valid (sound) starting from early liveness and ending with morph. There are asserts that the tree list is up to date when it is accessed. This is done through a new member fgNodeThreading that replaces the preexisting fgStmtListThreaded and keeps information about what the current kind of node threading is. The benefits are large, -2 MB on win-x64 collections (-0.85% on libraries.pmi that only has optimized contexts), with a number of regressions as expected when removing locals. The improvements primarily come from the omission of copies for implicit byrefs, so the benefits on platforms with fewer implicit byrefs is smaller, but the forward sub change alone is still very impactful (e.g. -300K on linux-x64). The throughput impact is around 1% in optimized contexts and below 0.1% in unoptimized contexts, the latter due to local morph needing to check if it should be threading nodes.

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 23, 2022

ghost added the untriaged New issue has not been triaged by the area owner label Sep 23, 2022

jakobbotsch added this to the 8.0.0 milestone Sep 23, 2022

jakobbotsch removed the untriaged New issue has not been triaged by the area owner label Sep 23, 2022

This was referenced Oct 12, 2022

JIT: Generalized struct promotion #76928

Closed

JIT struct work planned for .NET 8 #76931

Closed

jakobbotsch mentioned this issue Dec 11, 2022

JIT: Add a pass of early liveness and use it for forward sub and last-use copy elision for implicit byrefs #79346

Merged

10 tasks

ghost added the in-pr There is an active PR which will close this issue when it is merged label Dec 11, 2022

ghost added in-pr There is an active PR which will close this issue when it is merged and removed in-pr There is an active PR which will close this issue when it is merged labels Jan 10, 2023

jakobbotsch closed this as completed in #79346 Jan 11, 2023

ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jan 11, 2023

ghost locked as resolved and limited conversation to collaborators Feb 10, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

JIT: Improve last-use copy omission for implicit byrefs #76069

JIT: Improve last-use copy omission for implicit byrefs #76069

jakobbotsch commented Sep 23, 2022 •

edited

jakobbotsch commented Sep 23, 2022

ghost commented Sep 23, 2022

jakobbotsch commented Oct 12, 2022

jakobbotsch commented Dec 16, 2022

JIT: Improve last-use copy omission for implicit byrefs #76069

JIT: Improve last-use copy omission for implicit byrefs #76069

Comments

jakobbotsch commented Sep 23, 2022 • edited

jakobbotsch commented Sep 23, 2022

ghost commented Sep 23, 2022

jakobbotsch commented Oct 12, 2022

jakobbotsch commented Dec 16, 2022

jakobbotsch commented Sep 23, 2022 •

edited