Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: Improve last-use copy omission for implicit byrefs #76069

Closed
Tracked by #76931
jakobbotsch opened this issue Sep 23, 2022 · 4 comments · Fixed by #79346
Closed
Tracked by #76931

JIT: Improve last-use copy omission for implicit byrefs #76069

jakobbotsch opened this issue Sep 23, 2022 · 4 comments · Fixed by #79346
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone

Comments

@jakobbotsch
Copy link
Member

jakobbotsch commented Sep 23, 2022

Consider:

[MethodImpl(MethodImplOptions.NoInlining)]
public static int Test(Span<int> s)
{
    return s.Length + Test2(s);
}

[MethodImpl(MethodImplOptions.NoInlining)]
public static int Test2(Span<int> s)
{
    return s.Length;
}

Today we generate the following on win-x64:

; Assembly listing for method Program:Test(System.Span`1[int]):int
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  4,  8   )   byref  ->  rcx         ld-addr-op single-def
;  V01 OutArgs      [V01    ] (  1,  1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;* V02 tmp1         [V02    ] (  0,  0   )     int  ->  zero-ref    "non-inline candidate call"
;* V03 tmp2         [V03    ] (  0,  0   )   byref  ->  zero-ref    V05._reference(offs=0x00) P-INDEP "field V00._reference (fldOffset=0x0)"
;* V04 tmp3         [V04    ] (  0,  0   )     int  ->  zero-ref    V05._length(offs=0x08) P-INDEP "field V00._length (fldOffset=0x8)"
;* V05 tmp4         [V05    ] (  0,  0   )  struct (16) zero-ref    "Promoted implicit byref"
;  V06 tmp5         [V06    ] (  2,  4   )  struct (16) [rsp+20H]   do-not-enreg[XS] addr-exposed "by-value struct argument"
;
; Lcl frame size = 48

G_M62473_IG01:              ;; offset=0000H
       56                   push     rsi
       4883EC30             sub      rsp, 48
       C5F877               vzeroupper
                                                ;; size=8 bbWeight=1    PerfScore 2.25
G_M62473_IG02:              ;; offset=0008H
       8B7108               mov      esi, dword ptr [rcx+08H]
                                                ;; size=3 bbWeight=1    PerfScore 2.00
G_M62473_IG03:              ;; offset=000BH
       C5FA6F01             vmovdqu  xmm0, xmmword ptr [rcx]             ; unnecessary
       C5FA7F442420         vmovdqu  xmmword ptr [rsp+20H], xmm0         ; unnecessary
                                                ;; size=10 bbWeight=1    PerfScore 5.00
G_M62473_IG04:              ;; offset=0015H
       488D4C2420           lea      rcx, [rsp+20H]                      ; unnecessary
       FF15C86B1C00         call     [Program:Test2(System.Span`1[int]):int]
       03C6                 add      eax, esi
                                                ;; size=13 bbWeight=1    PerfScore 3.75
G_M62473_IG05:              ;; offset=0022H
       4883C430             add      rsp, 48
       5E                   pop      rsi
       C3                   ret
                                                ;; size=6 bbWeight=1    PerfScore 1.75

; Total bytes of code 40, prolog size 8, PerfScore 18.95, instruction count 12, allocated bytes for code 42 (MethodHash=e5250bf6) for method Program:Test(System.Span`1[int]):int
; ============================================================

We would probably need an early pass of (struct) liveness to get rid of some of these copies in a general fashion.

Some simple ad-hoc pattern matching during codegen (code) shows roughly 24k instances of this over libraries.crossgen and 3.2k over benchmarks.run. These are cases where we create a copy from a non-address exposed GTF_VAR_DEATH marked struct local.

We expect that this liveness pass may eventually be useful (necessary, even) for generalized promotion as well, and perhaps for forward sub too. Thus the pass will need to run early enough to be useful for these purposes. The likely best fit is right after local address visitor.

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 23, 2022
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Sep 23, 2022
@jakobbotsch
Copy link
Member Author

cc @dotnet/jit-contrib

@ghost
Copy link

ghost commented Sep 23, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Consider:

[MethodImpl(MethodImplOptions.NoInlining)]
public static int Test(Span<int> s)
{
    return s.Length + Test2(s);
}

[MethodImpl(MethodImplOptions.NoInlining)]
public static int Test2(Span<int> s)
{
    return s.Length;
}

Today we generate the following on win-x64:

; Assembly listing for method Program:Test(System.Span`1[int]):int
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  4,  8   )   byref  ->  rcx         ld-addr-op single-def
;  V01 OutArgs      [V01    ] (  1,  1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;* V02 tmp1         [V02    ] (  0,  0   )     int  ->  zero-ref    "non-inline candidate call"
;* V03 tmp2         [V03    ] (  0,  0   )   byref  ->  zero-ref    V05._reference(offs=0x00) P-INDEP "field V00._reference (fldOffset=0x0)"
;* V04 tmp3         [V04    ] (  0,  0   )     int  ->  zero-ref    V05._length(offs=0x08) P-INDEP "field V00._length (fldOffset=0x8)"
;* V05 tmp4         [V05    ] (  0,  0   )  struct (16) zero-ref    "Promoted implicit byref"
;  V06 tmp5         [V06    ] (  2,  4   )  struct (16) [rsp+20H]   do-not-enreg[XS] addr-exposed "by-value struct argument"
;
; Lcl frame size = 48

G_M62473_IG01:              ;; offset=0000H
       56                   push     rsi
       4883EC30             sub      rsp, 48
       C5F877               vzeroupper
                                                ;; size=8 bbWeight=1    PerfScore 2.25
G_M62473_IG02:              ;; offset=0008H
       8B7108               mov      esi, dword ptr [rcx+08H]
                                                ;; size=3 bbWeight=1    PerfScore 2.00
G_M62473_IG03:              ;; offset=000BH
       C5FA6F01             vmovdqu  xmm0, xmmword ptr [rcx]             ; unnecessary
       C5FA7F442420         vmovdqu  xmmword ptr [rsp+20H], xmm0         ; unnecessary
                                                ;; size=10 bbWeight=1    PerfScore 5.00
G_M62473_IG04:              ;; offset=0015H
       488D4C2420           lea      rcx, [rsp+20H]                      ; unnecessary
       FF15C86B1C00         call     [Program:Test2(System.Span`1[int]):int]
       03C6                 add      eax, esi
                                                ;; size=13 bbWeight=1    PerfScore 3.75
G_M62473_IG05:              ;; offset=0022H
       4883C430             add      rsp, 48
       5E                   pop      rsi
       C3                   ret
                                                ;; size=6 bbWeight=1    PerfScore 1.75

; Total bytes of code 40, prolog size 8, PerfScore 18.95, instruction count 12, allocated bytes for code 42 (MethodHash=e5250bf6) for method Program:Test(System.Span`1[int]):int
; ============================================================

We would probably need an early pass of (struct) liveness to get rid of some of these copies in a general fashion.

Some simple ad-hoc pattern matching during codegen (code) shows roughly 24k instances of this over libraries.crossgen and 3.2k over benchmarks.run. These are cases where we create a copy from a non-address exposed GTF_VAR_DEATH marked struct local.

Author: jakobbotsch
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

@jakobbotsch jakobbotsch added this to the 8.0.0 milestone Sep 23, 2022
@jakobbotsch jakobbotsch removed the untriaged New issue has not been triaged by the area owner label Sep 23, 2022
@jakobbotsch
Copy link
Member Author

Another possibility is to home by-value structs that are last used by a call directly in the outgoing arg area. Of course this is only possible if there are no interfering calls.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Dec 11, 2022
@jakobbotsch
Copy link
Member Author

With #79346 we generate:

@@ -13,32 +13,23 @@
 ;* V03 tmp2         [V03    ] (  0,  0   )   byref  ->  zero-ref    V05._reference(offs=0x00) P-INDEP "field V00._reference (fldOffset=0x0)"
 ;* V04 tmp3         [V04    ] (  0,  0   )     int  ->  zero-ref    V05._length(offs=0x08) P-INDEP "field V00._length (fldOffset=0x8)"
 ;* V05 tmp4         [V05    ] (  0,  0   )  struct (16) zero-ref    "Promoted implicit byref"
-;  V06 tmp5         [V06    ] (  2,  4   )  struct (16) [rsp+20H]   do-not-enreg[XS] addr-exposed "by-value struct argument"
 ;
-; Lcl frame size = 48
+; Lcl frame size = 32
 
 G_M46558_IG01:
        push     rsi
-       sub      rsp, 48
-       vzeroupper 
-						;; size=8 bbWeight=1    PerfScore 2.25
+       sub      rsp, 32
+						;; size=5 bbWeight=1    PerfScore 1.25
 G_M46558_IG02:
        mov      esi, dword ptr [rcx+08H]
-						;; size=3 bbWeight=1    PerfScore 2.00
-G_M46558_IG03:
-       vmovdqu  xmm0, xmmword ptr [rcx]
-       vmovdqu  xmmword ptr [rsp+20H], xmm0
-						;; size=10 bbWeight=1    PerfScore 5.00
-G_M46558_IG04:
-       lea      rcx, [rsp+20H]
        call     [N.C:Test2(System.Span`1[int]):int]
        add      eax, esi
-						;; size=13 bbWeight=1    PerfScore 3.75
-G_M46558_IG05:
-       add      rsp, 48
+						;; size=11 bbWeight=1    PerfScore 5.25
+G_M46558_IG03:
+       add      rsp, 32
        pop      rsi
        ret      
 						;; size=6 bbWeight=1    PerfScore 1.75
 
-; Total bytes of code 40, prolog size 8, PerfScore 18.75, instruction count 12, allocated bytes for code 40 (MethodHash=63e24a21) for method N.C:Test(System.Span`1[int]):int
+; Total bytes of code 22, prolog size 5, PerfScore 10.45, instruction count 8, allocated bytes for code 22 (MethodHash=63e24a21) for method N.C:Test(System.Span`1[int]):int
 ; ============================================================

@ghost ghost added in-pr There is an active PR which will close this issue when it is merged and removed in-pr There is an active PR which will close this issue when it is merged labels Jan 10, 2023
jakobbotsch added a commit that referenced this issue Jan 11, 2023
…-use copy elision for implicit byrefs (#79346)

This runs a pass of liveness right after local morph and uses it for forward sub and to omit copies of structs when passed as implicit byrefs at their last use.

Fix #76069
Fix #75206
Fix #65025
Fix #9839

This PR introduces the following new JIT invariants:

* When optimizing, local morph will now thread all locals into a tree list accessed by Statement::LocalsTreeList. This tree list is kept valid starting from local morph and ending with forward sub. There is no memory impact of this since we reuse the GenTree::gtPrev and GenTree::gtNext fields.
* Early liveness information (GTF_VAR_DEATH and the promoted struct death vars map) is kept valid (sound) starting from early liveness and ending with morph.
  There are asserts that the tree list is up to date when it is accessed. This is done through a new member fgNodeThreading that replaces the preexisting fgStmtListThreaded and keeps information about what the current kind of node threading is.

The benefits are large, -2 MB on win-x64 collections (-0.85% on libraries.pmi that only has optimized contexts), with a number of regressions as expected when removing locals.
The improvements primarily come from the omission of copies for implicit byrefs, so the benefits on platforms with fewer implicit byrefs is smaller, but the forward sub change alone is still very impactful (e.g. -300K on linux-x64).

The throughput impact is around 1% in optimized contexts and below 0.1% in unoptimized contexts, the latter due to local morph needing to check if it should be threading nodes.
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jan 11, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Feb 10, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
1 participant