JIT: Improve last-use copy omission for implicit byrefs #76069
cc @dotnet/jit-contrib
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch

Issue Details

Consider:

```csharp
[MethodImpl(MethodImplOptions.NoInlining)]
public static int Test(Span<int> s)
{
    return s.Length + Test2(s);
}

[MethodImpl(MethodImplOptions.NoInlining)]
public static int Test2(Span<int> s)
{
    return s.Length;
}
```

Today we generate the following on win-x64:

```asm
; Assembly listing for method Program:Test(System.Span`1[int]):int
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  4,  8   )   byref  ->  rcx         ld-addr-op single-def
;  V01 OutArgs      [V01    ] (  1,  1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;* V02 tmp1         [V02    ] (  0,  0   )     int  ->  zero-ref    "non-inline candidate call"
;* V03 tmp2         [V03    ] (  0,  0   )   byref  ->  zero-ref    V05._reference(offs=0x00) P-INDEP "field V00._reference (fldOffset=0x0)"
;* V04 tmp3         [V04    ] (  0,  0   )     int  ->  zero-ref    V05._length(offs=0x08) P-INDEP "field V00._length (fldOffset=0x8)"
;* V05 tmp4         [V05    ] (  0,  0   )  struct (16) zero-ref    "Promoted implicit byref"
;  V06 tmp5         [V06    ] (  2,  4   )  struct (16) [rsp+20H]   do-not-enreg[XS] addr-exposed "by-value struct argument"
;
; Lcl frame size = 48

G_M62473_IG01:              ;; offset=0000H
       56                   push     rsi
       4883EC30             sub      rsp, 48
       C5F877               vzeroupper
                            ;; size=8 bbWeight=1 PerfScore 2.25
G_M62473_IG02:              ;; offset=0008H
       8B7108               mov      esi, dword ptr [rcx+08H]
                            ;; size=3 bbWeight=1 PerfScore 2.00
G_M62473_IG03:              ;; offset=000BH
       C5FA6F01             vmovdqu  xmm0, xmmword ptr [rcx]        ; unnecessary
       C5FA7F442420         vmovdqu  xmmword ptr [rsp+20H], xmm0    ; unnecessary
                            ;; size=10 bbWeight=1 PerfScore 5.00
G_M62473_IG04:              ;; offset=0015H
       488D4C2420           lea      rcx, [rsp+20H]                 ; unnecessary
       FF15C86B1C00         call     [Program:Test2(System.Span`1[int]):int]
       03C6                 add      eax, esi
                            ;; size=13 bbWeight=1 PerfScore 3.75
G_M62473_IG05:              ;; offset=0022H
       4883C430             add      rsp, 48
       5E                   pop      rsi
       C3                   ret
                            ;; size=6 bbWeight=1 PerfScore 1.75

; Total bytes of code 40, prolog size 8, PerfScore 18.95, instruction count 12, allocated bytes for code 42 (MethodHash=e5250bf6) for method Program:Test(System.Span`1[int]):int
; ============================================================
```

We would probably need an early pass of (struct) liveness to get rid of some of these copies in a general fashion. Some simple ad-hoc pattern matching during codegen (code) shows roughly 24k instances of this over libraries.crossgen and 3.2k over benchmarks.run. These are cases where we create a copy from a non-address-exposed, `GTF_VAR_DEATH`-marked struct local.

We expect that this liveness pass may eventually be useful (necessary, even) for generalized promotion as well, and perhaps for forward sub too. Thus the pass will need to run early enough to be useful for these purposes. The likely best fit is right after the local address visitor.
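To make the last-use condition concrete, here is a hedged sketch (the method names `LastUse` and `NotLastUse` are made up for this illustration) contrasting a call at the struct's last use, where the defensive copy is in principle removable, with a use after the call, where the caller must still protect itself because the callee receives the struct by implicit byref and could write through it:

```csharp
using System;
using System.Runtime.CompilerServices;

class Program
{
    // 's' dies at the call to Test2: this is its last use, so the
    // defensive copy made for the implicit byref argument is removable.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int LastUse(Span<int> s)
    {
        return s.Length + Test2(s);
    }

    // 's' is live across the call: the caller must still make a copy
    // (or prove the callee does not mutate its argument), since Test2
    // receives the struct by implicit byref and could write through it.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int NotLastUse(Span<int> s)
    {
        int r = Test2(s);
        return r + s.Length; // use of 's' after the call
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Test2(Span<int> s) => s.Length;

    static void Main()
    {
        int[] data = new int[4];
        Console.WriteLine(LastUse(data));    // prints 8
        Console.WriteLine(NotLastUse(data)); // prints 8
    }
}
```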
Another possibility is to home by-value structs that are last used by a call directly in the outgoing arg area. Of course, this is only possible if there are no interfering calls.
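As a sketch of that interference constraint (the type and method names here are hypothetical, and `Big` is assumed to be large enough to be passed by implicit byref on win-x64):

```csharp
using System;
using System.Runtime.CompilerServices;

// 24 bytes: too large for a register, so passed via implicit byref on win-x64.
struct Big { public long A, B, C; }

class Demo
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    static int Consume(Big b) => (int)(b.A + b.B + b.C);

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Unrelated() { }

    // 'b' dies at the call and no other call intervenes, so its copy
    // could in principle be built directly in the outgoing arg area.
    static int NoInterference(Big b) => Consume(b);

    // Unrelated() uses (and may clobber) the outgoing arg area before
    // Consume is called, so 'b' cannot be homed there up front.
    static int WithInterference(Big b)
    {
        Unrelated();
        return Consume(b);
    }

    static void Main()
    {
        var b = new Big { A = 1, B = 2, C = 3 };
        Console.WriteLine(NoInterference(b));   // prints 6
        Console.WriteLine(WithInterference(b)); // prints 6
    }
}
```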
With #79346 we generate:

```diff
@@ -13,32 +13,23 @@
 ;* V03 tmp2         [V03    ] (  0,  0   )   byref  ->  zero-ref    V05._reference(offs=0x00) P-INDEP "field V00._reference (fldOffset=0x0)"
 ;* V04 tmp3         [V04    ] (  0,  0   )     int  ->  zero-ref    V05._length(offs=0x08) P-INDEP "field V00._length (fldOffset=0x8)"
 ;* V05 tmp4         [V05    ] (  0,  0   )  struct (16) zero-ref    "Promoted implicit byref"
-;  V06 tmp5         [V06    ] (  2,  4   )  struct (16) [rsp+20H]   do-not-enreg[XS] addr-exposed "by-value struct argument"
 ;
-; Lcl frame size = 48
+; Lcl frame size = 32
 G_M46558_IG01:
        push     rsi
-       sub      rsp, 48
-       vzeroupper
-       ;; size=8 bbWeight=1 PerfScore 2.25
+       sub      rsp, 32
+       ;; size=5 bbWeight=1 PerfScore 1.25
 G_M46558_IG02:
        mov      esi, dword ptr [rcx+08H]
-       ;; size=3 bbWeight=1 PerfScore 2.00
-G_M46558_IG03:
-       vmovdqu  xmm0, xmmword ptr [rcx]
-       vmovdqu  xmmword ptr [rsp+20H], xmm0
-       ;; size=10 bbWeight=1 PerfScore 5.00
-G_M46558_IG04:
-       lea      rcx, [rsp+20H]
        call     [N.C:Test2(System.Span`1[int]):int]
        add      eax, esi
-       ;; size=13 bbWeight=1 PerfScore 3.75
-G_M46558_IG05:
-       add      rsp, 48
+       ;; size=11 bbWeight=1 PerfScore 5.25
+G_M46558_IG03:
+       add      rsp, 32
        pop      rsi
        ret
        ;; size=6 bbWeight=1 PerfScore 1.75
-; Total bytes of code 40, prolog size 8, PerfScore 18.75, instruction count 12, allocated bytes for code 40 (MethodHash=63e24a21) for method N.C:Test(System.Span`1[int]):int
+; Total bytes of code 22, prolog size 5, PerfScore 10.45, instruction count 8, allocated bytes for code 22 (MethodHash=63e24a21) for method N.C:Test(System.Span`1[int]):int
 ; ============================================================
```
…-use copy elision for implicit byrefs (#79346)

This runs a pass of liveness right after local morph and uses it for forward sub and to omit copies of structs when passed as implicit byrefs at their last use.

Fix #76069
Fix #75206
Fix #65025
Fix #9839

This PR introduces the following new JIT invariants:

* When optimizing, local morph will now thread all locals into a tree list accessed by `Statement::LocalsTreeList`. This tree list is kept valid starting from local morph and ending with forward sub. There is no memory impact of this, since we reuse the `GenTree::gtPrev` and `GenTree::gtNext` fields.
* Early liveness information (`GTF_VAR_DEATH` and the promoted struct death vars map) is kept valid (sound) starting from early liveness and ending with morph.

There are asserts that the tree list is up to date when it is accessed. This is done through a new member `fgNodeThreading` that replaces the preexisting `fgStmtListThreaded` and keeps information about what the current kind of node threading is.

The benefits are large: -2 MB on win-x64 collections (-0.85% on libraries.pmi, which only has optimized contexts), with a number of regressions, as expected when removing locals. The improvements primarily come from the omission of copies for implicit byrefs, so the benefit on platforms with fewer implicit byrefs is smaller, but the forward sub change alone is still very impactful (e.g. -300K on linux-x64). The throughput impact is around 1% in optimized contexts and below 0.1% in unoptimized contexts, the latter due to local morph needing to check whether it should be threading nodes.
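The core of the early liveness idea can be illustrated with a toy model (this is only a sketch, not the JIT's implementation): scanning a straight-line sequence of local references backwards, the first occurrence of each local encountered is its last use, which is where a death flag like `GTF_VAR_DEATH` would be set, assuming the local is not address-exposed:

```csharp
using System;
using System.Collections.Generic;

class LastUseDemo
{
    // Given locals referenced in program order (identified by number),
    // return the set of indices that are each local's last use.
    // A backward scan sees each local's last use first.
    static HashSet<int> MarkLastUses(int[] localRefs)
    {
        var seen = new HashSet<int>();    // locals already seen (scanning backwards)
        var lastUse = new HashSet<int>(); // indices into localRefs that are deaths
        for (int i = localRefs.Length - 1; i >= 0; i--)
        {
            if (seen.Add(localRefs[i]))
                lastUse.Add(i); // first sighting while scanning backwards => last use
        }
        return lastUse;
    }

    static void Main()
    {
        // References in order: V0, V1, V0, V1 -- deaths at indices 2 and 3.
        var deaths = new List<int>(MarkLastUses(new[] { 0, 1, 0, 1 }));
        deaths.Sort();
        Console.WriteLine(string.Join(",", deaths)); // prints 2,3
    }
}
```

The real pass additionally handles control flow, promoted struct fields, and address-exposed locals, none of which this sketch models.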