
JIT: Add a (disabled) prototype for a generalized promotion pass #83388

Merged
merged 57 commits into dotnet:main from generalized-promotion on Apr 11, 2023

Conversation

jakobbotsch
Member

@jakobbotsch jakobbotsch commented Mar 14, 2023

Introduce a "physical" promotion pass (name suggested by @SingleAccretion; thanks!) that generalizes the existing promotion. More specifically, it does not have restrictions on field count and it can handle arbitrary recursive promotion.

The pass is physical in the sense that it does not rely on any field metadata for structs. Instead, it works in two separate passes over the IR:

  1. In the first pass we find and analyze how unpromoted struct locals are accessed. For example, for a simple program like:
public static void Main()
{
    S s = default;
    Call(s, s.C);
    Console.WriteLine(s.B + s.C);
}

[MethodImpl(MethodImplOptions.NoInlining)]
private static void Call(S s, byte b)
{
}

private struct S
{
    public byte A, B, C, D, E;
}

we see IR like:

***** BB01
STMT00000 ( 0x000[E-] ... 0x003 )
               [000003] IA---------ASG       struct (init)
               [000001] D------N---                         ├──▌  LCL_VAR   struct<Program+S, 5> V00 loc0         
               [000002] -----------                         └──▌  CNS_INT   int    0

***** BB01
STMT00001 ( 0x008[E-] ... 0x026 )
               [000008] --C-G------CALL      void   Program:Call(Program+S,ubyte)
               [000004] ----------- arg0                    ├──▌  LCL_VAR   struct<Program+S, 5> V00 loc0         
               [000007] ----------- arg1                    └──▌  LCL_FLD   ubyte  V00 loc0         [+2]

***** BB01
STMT00002 ( 0x014[E-] ... ??? )
               [000016] --C-G------CALL      void   System.Console:WriteLine(int)
               [000015] ----------- arg0                    └──▌  ADD       int   
               [000011] -----------                            ├──▌  LCL_FLD   ubyte  V00 loc0         [+1]
               [000014] -----------                            └──▌  LCL_FLD   ubyte  V00 loc0         [+2]

and the analysis produces

Accesses for V00
  [000..005)
    #:                             (2, 200)
    # assigned from:               (0, 0)
    # assigned to:                 (1, 100)
    # as call arg:                 (1, 100)
    # as implicit by-ref call arg: (1, 100)
    # as on-stack call arg:        (0, 0)
    # as retbuf:                   (0, 0)
    # as returned value:           (0, 0)

  ubyte @ 001
    #:                             (1, 100)
    # assigned from:               (0, 0)
    # assigned to:                 (0, 0)
    # as call arg:                 (0, 0)
    # as implicit by-ref call arg: (0, 0)
    # as on-stack call arg:        (0, 0)
    # as retbuf:                   (0, 0)
    # as returned value:           (0, 0)

  ubyte @ 002
    #:                             (2, 200)
    # assigned from:               (0, 0)
    # assigned to:                 (0, 0)
    # as call arg:                 (1, 100)
    # as implicit by-ref call arg: (0, 0)
    # as on-stack call arg:        (0, 0)
    # as retbuf:                   (0, 0)
    # as returned value:           (0, 0)

Here the pairs are (ref count, weighted ref count), where each reference is weighted by the profile weight of the basic block containing it.

Based on this accounting, the analysis estimates the profitability of replacing some of the accessed parts of the struct with new locals. Promotion may be costly because overlapping struct accesses (e.g. passing the whole struct as an argument) may require more expensive codegen after promotion. And of course, creating new locals introduces more register pressure. Currently the profitability analysis is very crude.
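Schematically, the comparison has roughly the following shape (an illustrative C# sketch, not the JIT's actual C++ code; the per-use cost of 6.5 and the write-back cost of 5 appear elsewhere in this PR, while the promoted per-use cost of 3.5 is inferred from the dumps below):

static bool ShouldPromote(double countWtd, double writeBacksWtd, double readBacksWtd)
{
    const double unpromotedAccessCost = 6.5; // rough per-use cost of an unpromoted local
    const double promotedAccessCost   = 3.5; // inferred per-use cost once enregistered
    const double singleWriteBackCost  = 5.0; // "Single write-back cost" in the dumps

    double costWithout = countWtd * unpromotedAccessCost;
    double costWith    = countWtd * promotedAccessCost
                       + (writeBacksWtd + readBacksWtd) * singleWriteBackCost;

    return costWith < costWithout;
}

Plugging in the ubyte @ 002 numbers (weighted count 200, 100 weighted write backs and 100 weighted read backs) gives a cost of 1700 with promotion versus 1300 without, matching the "Disqualifying replacement" decision shown below.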

In this case the logic decides that promotion is not worth it:

Evaluating access ubyte @ 001
  Single write-back cost: 5
  Write backs: 100
  Read backs: 100
  Cost with: 1350
  Cost without: 650
  Disqualifying replacement
Evaluating access ubyte @ 002
  Single write-back cost: 5
  Write backs: 100
  Read backs: 100
  Cost with: 1700
  Cost without: 1300
  Disqualifying replacement
  2. In the second pass the field accesses are replaced with new locals for the profitable cases. For overlapping accesses this currently involves writing the replacements back to the struct local first. For arguments/OSR locals, it involves reading the replacements back from the struct first.

In the above case we can override the profitability analysis with stress mode STRESS_PHYSICAL_PROMOTION_COST and we get:

Evaluating access ubyte @ 001
  Single write-back cost: 5
  Write backs: 100
  Read backs: 100
  Cost with: 1350
  Cost without: 650
  Promoting replacement due to stress

lvaGrabTemp returning 2 (V02 tmp1) (a long lifetime temp) called for V00.[001..002).
Evaluating access ubyte @ 002
  Single write-back cost: 5
  Write backs: 100
  Read backs: 100
  Cost with: 1700
  Cost without: 1300
  Promoting replacement due to stress

lvaGrabTemp returning 3 (V03 tmp2) (a long lifetime temp) called for V00.[002..003).
V00 promoted with 2 replacements
  [001..002) promoted as ubyte V02
  [002..003) promoted as ubyte V03

...

***** BB01
STMT00000 ( 0x000[E-] ... 0x003 )
               [000003] IA---------ASG       struct (init)
               [000001] D------N---                         ├──▌  LCL_VAR   struct<Program+S, 5> V00 loc0         
               [000002] -----------                         └──▌  CNS_INT   int    0

***** BB01
STMT00001 ( 0x008[E-] ... 0x026 )
               [000008] -ACXG------CALL      void   Program:Call(Program+S,ubyte)
               [000004] ----------- arg0                    ├──▌  LCL_VAR   struct<Program+S, 5> V00 loc0         
               [000022] -A--------- arg1                    └──▌  COMMA     ubyte 
               [000021] -A---------                            ├──▌  ASG       ubyte 
               [000019] D------N---                            │  ├──▌  LCL_VAR   ubyte  V03 tmp2         
               [000020] -----------                            │  └──▌  LCL_FLD   ubyte  V00 loc0         [+2]
               [000018] -----------                            └──▌  LCL_VAR   ubyte  V03 tmp2         

***** BB01
STMT00002 ( 0x014[E-] ... ??? )
               [000016] -ACXG------CALL      void   System.Console:WriteLine(int)
               [000015] -A--------- arg0                    └──▌  ADD       int   
               [000027] -A---------                            ├──▌  COMMA     ubyte 
               [000026] -A---------                            │  ├──▌  ASG       ubyte 
               [000024] D------N---                            │  │  ├──▌  LCL_VAR   ubyte  V02 tmp1         
               [000025] -----------                            │  │  └──▌  LCL_FLD   ubyte  V00 loc0         [+1]
               [000023] -----------                            │  └──▌  LCL_VAR   ubyte  V02 tmp1         
               [000028] -----------                            └──▌  LCL_VAR   ubyte  V03 tmp2         
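
In source terms, the stress-promoted IR above corresponds roughly to the following (an illustrative sketch of the transformation, not compiler output):

byte tmp2 = s.C;                 // V03: replacement read from the struct at its first use
Call(s, tmp2);                   // the whole-struct argument still reads V00 directly
byte tmp1 = s.B;                 // V02: replacement read from the struct
Console.WriteLine(tmp1 + tmp2);  // field uses replaced by the new locals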

The pass is still rudimentary and is missing many basic CQ optimizations. For example, it does not make use of any liveness information yet, and it has no decomposition support for assignments. Even so, it already shows good potential in a couple of user benchmarks (benchmark 1, benchmark 2). I have listed some follow-up improvements in #76928.

This PR adds the pass, but it is disabled by default. It can be enabled by setting DOTNET_JitStressModeNames=STRESS_PHYSICAL_PROMOTION. Two new scenarios that enable it have been added to jit-experimental for testing purposes.

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr label Mar 14, 2023
@ghost ghost assigned jakobbotsch Mar 14, 2023
@ghost

ghost commented Mar 14, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.


@jakobbotsch
Member Author

Currently I'm using some size heuristics to figure out when to do the promotion. Here is an example diff in CreateFromRotationMatrix:

 ; Final local variable assignments
 ;
 ;  V00 RetBuf       [V00,T01] (  4,  4   )   byref  ->  rcx         single-def
-;  V01 arg0         [V01,T00] ( 29, 34   )   byref  ->  rdx         single-def
-;  V02 loc0         [V02,T08] (  3,  2.50)   float  ->  mm2        
+;  V01 arg0         [V01,T00] ( 11, 22   )   byref  ->  rdx         single-def
+;  V02 loc0         [V02,T14] (  3,  2.50)   float  ->  mm9        
 ;  V03 loc1         [V03,T02] ( 13,  7   )  simd16  ->  registers   ld-addr-op
 ;  V04 loc2         [V04,T07] (  7,  3.50)   float  ->  mm0        
-;  V05 loc3         [V05,T12] (  3,  1.50)   float  ->  mm3        
-;  V06 loc4         [V06,T09] (  4,  2   )   float  ->  mm0        
-;  V07 loc5         [V07,T13] (  3,  1.50)   float  ->  mm3        
-;  V08 loc6         [V08,T10] (  4,  2   )   float  ->  mm0        
-;  V09 loc7         [V09,T14] (  3,  1.50)   float  ->  mm0        
-;  V10 loc8         [V10,T11] (  4,  2   )   float  ->  mm1        
+;  V05 loc3         [V05,T18] (  3,  1.50)   float  ->  mm0        
+;  V06 loc4         [V06,T15] (  4,  2   )   float  ->  mm4        
+;  V07 loc5         [V07,T19] (  3,  1.50)   float  ->  mm0        
+;  V08 loc6         [V08,T16] (  4,  2   )   float  ->  mm4        
+;  V09 loc7         [V09,T20] (  3,  1.50)   float  ->  mm0        
+;  V10 loc8         [V10,T17] (  4,  2   )   float  ->  mm4        
 ;# V11 OutArgs      [V11    ] (  1,  1   )  struct ( 0) [rsp+00H]   do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
-;  V12 cse0         [V12,T04] (  7,  4.50)   float  ->  mm1         "CSE - aggressive"
-;  V13 cse1         [V13,T05] (  7,  4.50)   float  ->  mm3         "CSE - aggressive"
-;  V14 cse2         [V14,T06] (  7,  4.50)   float  ->  mm0         "CSE - aggressive"
-;  V15 cse3         [V15,T03] ( 12,  6   )   float  ->  mm2         "CSE - aggressive"
+;  V12 tmp1         [V12,T04] (  7,  4.50)   float  ->  mm0         "V01.[000..004)"
+;  V13 tmp2         [V13,T08] (  5,  3   )   float  ->  mm1         "V01.[004..008)"
+;  V14 tmp3         [V14,T09] (  5,  3   )   float  ->  mm2         "V01.[008..012)"
+;  V15 tmp4         [V15,T10] (  5,  3   )   float  ->  mm3         "V01.[016..020)"
+;  V16 tmp5         [V16,T05] (  7,  4.50)   float  ->  mm4         "V01.[020..024)"
+;  V17 tmp6         [V17,T11] (  5,  3   )   float  ->  mm5         "V01.[024..028)"
+;  V18 tmp7         [V18,T12] (  5,  3   )   float  ->  mm6         "V01.[032..036)"
+;  V19 tmp8         [V19,T13] (  5,  3   )   float  ->  mm7         "V01.[036..040)"
+;  V20 tmp9         [V20,T06] (  7,  4.50)   float  ->  mm8         "V01.[040..044)"
+;  V21 cse0         [V21,T03] ( 12,  6   )   float  ->  mm9         "CSE - aggressive"
 ;
-; Lcl frame size = 0
+; Lcl frame size = 88
 
 G_M15800_IG01:
+       sub      rsp, 88
        vzeroupper 
-						;; size=3 bbWeight=1 PerfScore 1.00
+       vmovaps  xmmword ptr [rsp+40H], xmm6
+       vmovaps  xmmword ptr [rsp+30H], xmm7
+       vmovaps  xmmword ptr [rsp+20H], xmm8
+       vmovaps  xmmword ptr [rsp+10H], xmm9
+       vmovaps  xmmword ptr [rsp], xmm10
+						;; size=36 bbWeight=1 PerfScore 11.25
 G_M15800_IG02:
        vmovss   xmm0, dword ptr [rdx]
-       vmovss   xmm1, dword ptr [rdx+14H]
-       vaddss   xmm2, xmm0, xmm1
-       vmovss   xmm3, dword ptr [rdx+28H]
-       vaddss   xmm2, xmm2, xmm3
-       vxorps   xmm4, xmm4
-       vucomiss xmm2, xmm4
+       vmovss   xmm1, dword ptr [rdx+04H]
+       vmovss   xmm2, dword ptr [rdx+08H]
+       vmovss   xmm3, dword ptr [rdx+10H]
+       vmovss   xmm4, dword ptr [rdx+14H]
+       vmovss   xmm5, dword ptr [rdx+18H]
+       vmovss   xmm6, dword ptr [rdx+20H]
+       vmovss   xmm7, dword ptr [rdx+24H]
+       vmovss   xmm8, dword ptr [rdx+28H]
+       vaddss   xmm9, xmm0, xmm4
+       vaddss   xmm9, xmm9, xmm8
+       vxorps   xmm10, xmm10
+       vucomiss xmm9, xmm10
        jbe      SHORT G_M15800_IG04
-						;; size=32 bbWeight=1 PerfScore 21.33
+						;; size=65 bbWeight=1 PerfScore 45.33
 G_M15800_IG03:
-       vaddss   xmm0, xmm2, dword ptr [reloc @RWD00]
+       vaddss   xmm0, xmm9, dword ptr [reloc @RWD00]
        vsqrtss  xmm0, xmm0
-       vmovss   xmm2, dword ptr [reloc @RWD04]
-       vmulss   xmm1, xmm0, xmm2
-       vinsertps xmm1, xmm1, xmm1, 55
-       vdivss   xmm0, xmm2, xmm0
-       vmovss   xmm3, dword ptr [rdx+18H]
-       vsubss   xmm2, xmm3, dword ptr [rdx+24H]
+       vmovss   xmm9, dword ptr [reloc @RWD04]
+       vmulss   xmm4, xmm0, xmm9
+       vinsertps xmm4, xmm4, xmm4, 55
+       vdivss   xmm0, xmm9, xmm0
+       vsubss   xmm5, xmm5, xmm7
+       vmulss   xmm5, xmm5, xmm0
+       vinsertps xmm4, xmm4, xmm5, 0
+       vsubss   xmm2, xmm6, xmm2
        vmulss   xmm2, xmm2, xmm0
-       vinsertps xmm1, xmm1, xmm2, 0
-       vmovss   xmm2, dword ptr [rdx+20H]
-       vsubss   xmm2, xmm2, dword ptr [rdx+08H]
-       vmulss   xmm2, xmm2, xmm0
-       vinsertps xmm1, xmm1, xmm2, 16
-       vmovss   xmm2, dword ptr [rdx+04H]
-       vsubss   xmm2, xmm2, dword ptr [rdx+10H]
-       vmulss   xmm0, xmm2, xmm0
-       vinsertps xmm1, xmm1, xmm0, 32
+       vinsertps xmm2, xmm4, xmm2, 16
+       vsubss   xmm1, xmm1, xmm3
+       vmulss   xmm3, xmm1, xmm0
+       vinsertps xmm4, xmm2, xmm3, 32
        jmp      G_M15800_IG07
-						;; size=99 bbWeight=0.50 PerfScore 38.50
+						;; size=82 bbWeight=0.50 PerfScore 28.00
 G_M15800_IG04:
-       vucomiss xmm0, xmm1
+       vucomiss xmm0, xmm4
        jb       SHORT G_M15800_IG05
-       vucomiss xmm0, xmm3
+       vucomiss xmm0, xmm8
        jb       SHORT G_M15800_IG05
        vaddss   xmm0, xmm0, dword ptr [reloc @RWD00]
-       vsubss   xmm1, xmm0, xmm1
-       vsubss   xmm3, xmm1, xmm3
-       vsqrtss  xmm3, xmm3
-       vmovss   xmm2, dword ptr [reloc @RWD04]
-       vdivss   xmm0, xmm2, xmm3
-       vmulss   xmm2, xmm3, xmm2
-       vinsertps xmm1, xmm1, xmm2, 14
-       vmovss   xmm2, dword ptr [rdx+04H]
-       vaddss   xmm2, xmm2, dword ptr [rdx+10H]
-       vmulss   xmm2, xmm2, xmm0
-       vinsertps xmm1, xmm1, xmm2, 16
-       vmovss   xmm2, dword ptr [rdx+08H]
-       vaddss   xmm2, xmm2, dword ptr [rdx+20H]
-       vmulss   xmm2, xmm2, xmm0
-       vinsertps xmm1, xmm1, xmm2, 32
-       vmovss   xmm2, dword ptr [rdx+18H]
-       vsubss   xmm2, xmm2, dword ptr [rdx+24H]
-       vmulss   xmm0, xmm2, xmm0
-       vinsertps xmm1, xmm1, xmm0, 48
-       jmp      G_M15800_IG07
-						;; size=119 bbWeight=0.50 PerfScore 44.50
-G_M15800_IG05:
-       vucomiss xmm1, xmm3
-       jbe      SHORT G_M15800_IG06
-       vaddss   xmm1, xmm1, dword ptr [reloc @RWD00]
-       vsubss   xmm0, xmm1, xmm0
-       vsubss   xmm3, xmm0, xmm3
-       vsqrtss  xmm3, xmm3
-       vmovss   xmm2, dword ptr [reloc @RWD04]
-       vdivss   xmm0, xmm2, xmm3
-       vmovss   xmm1, dword ptr [rdx+10H]
-       vaddss   xmm1, xmm1, dword ptr [rdx+04H]
-       vmulss   xmm1, xmm1, xmm0
-       vinsertps xmm1, xmm1, xmm1, 14
-       vmulss   xmm2, xmm3, xmm2
-       vinsertps xmm1, xmm1, xmm2, 16
-       vmovss   xmm2, dword ptr [rdx+24H]
-       vaddss   xmm2, xmm2, dword ptr [rdx+18H]
-       vmulss   xmm2, xmm2, xmm0
-       vinsertps xmm1, xmm1, xmm2, 32
-       vmovss   xmm2, dword ptr [rdx+20H]
-       vsubss   xmm2, xmm2, dword ptr [rdx+08H]
-       vmulss   xmm0, xmm2, xmm0
-       vinsertps xmm1, xmm1, xmm0, 48
-       jmp      SHORT G_M15800_IG07
-						;; size=110 bbWeight=0.50 PerfScore 43.00
-G_M15800_IG06:
-       vaddss   xmm2, xmm3, dword ptr [reloc @RWD00]
-       vsubss   xmm0, xmm2, xmm0
-       vsubss   xmm0, xmm0, xmm1
+       vsubss   xmm4, xmm0, xmm4
+       vsubss   xmm0, xmm4, xmm8
        vsqrtss  xmm0, xmm0
-       vmovss   xmm2, dword ptr [reloc @RWD04]
-       vdivss   xmm1, xmm2, xmm0
-       vmovss   xmm3, dword ptr [rdx+20H]
-       vaddss   xmm3, xmm3, dword ptr [rdx+08H]
-       vmulss   xmm3, xmm3, xmm1
-       vinsertps xmm3, xmm3, xmm3, 14
-       vmovss   xmm4, dword ptr [rdx+24H]
-       vaddss   xmm4, xmm4, dword ptr [rdx+18H]
-       vmulss   xmm4, xmm4, xmm1
-       vinsertps xmm3, xmm3, xmm4, 16
-       vmulss   xmm0, xmm0, xmm2
-       vinsertps xmm0, xmm3, xmm0, 32
-       vmovss   xmm2, dword ptr [rdx+04H]
-       vsubss   xmm2, xmm2, dword ptr [rdx+10H]
-       vmulss   xmm1, xmm2, xmm1
+       vmovss   xmm9, dword ptr [reloc @RWD04]
+       vdivss   xmm4, xmm9, xmm0
+       vmulss   xmm0, xmm0, xmm9
+       vinsertps xmm0, xmm0, xmm0, 14
+       vaddss   xmm1, xmm1, xmm3
+       vmulss   xmm3, xmm1, xmm4
+       vinsertps xmm0, xmm0, xmm3, 16
+       vaddss   xmm2, xmm2, xmm6
+       vmulss   xmm1, xmm2, xmm4
+       vinsertps xmm0, xmm0, xmm1, 32
+       vsubss   xmm5, xmm5, xmm7
+       vmulss   xmm1, xmm5, xmm4
        vinsertps xmm0, xmm0, xmm1, 48
-       vmovaps  xmm1, xmm0
-						;; size=106 bbWeight=0.50 PerfScore 40.62
+       vmovaps  xmm4, xmm0
+       jmp      G_M15800_IG07
+						;; size=108 bbWeight=0.50 PerfScore 34.12
+G_M15800_IG05:
+       vucomiss xmm4, xmm8
+       jbe      SHORT G_M15800_IG06
+       vaddss   xmm4, xmm4, dword ptr [reloc @RWD00]
+       vsubss   xmm0, xmm4, xmm0
+       vsubss   xmm0, xmm0, xmm8
+       vsqrtss  xmm0, xmm0
+       vmovss   xmm9, dword ptr [reloc @RWD04]
+       vdivss   xmm4, xmm9, xmm0
+       vaddss   xmm1, xmm3, xmm1
+       vmulss   xmm3, xmm1, xmm4
+       vinsertps xmm1, xmm1, xmm3, 14
+       vmulss   xmm0, xmm0, xmm9
+       vinsertps xmm0, xmm1, xmm0, 16
+       vaddss   xmm5, xmm7, xmm5
+       vmulss   xmm1, xmm5, xmm4
+       vinsertps xmm0, xmm0, xmm1, 32
+       vsubss   xmm2, xmm6, xmm2
+       vmulss   xmm1, xmm2, xmm4
+       vinsertps xmm0, xmm0, xmm1, 48
+       vmovaps  xmm4, xmm0
+       jmp      SHORT G_M15800_IG07
+						;; size=99 bbWeight=0.50 PerfScore 32.62
+G_M15800_IG06:
+       vaddss   xmm9, xmm8, dword ptr [reloc @RWD00]
+       vsubss   xmm0, xmm9, xmm0
+       vsubss   xmm0, xmm0, xmm4
+       vsqrtss  xmm0, xmm0
+       vmovss   xmm9, dword ptr [reloc @RWD04]
+       vdivss   xmm4, xmm9, xmm0
+       vaddss   xmm2, xmm6, xmm2
+       vmulss   xmm2, xmm2, xmm4
+       vinsertps xmm2, xmm2, xmm2, 14
+       vaddss   xmm5, xmm7, xmm5
+       vmulss   xmm5, xmm5, xmm4
+       vinsertps xmm2, xmm2, xmm5, 16
+       vmulss   xmm0, xmm0, xmm9
+       vinsertps xmm0, xmm2, xmm0, 32
+       vsubss   xmm1, xmm1, xmm3
+       vmulss   xmm1, xmm1, xmm4
+       vinsertps xmm0, xmm0, xmm1, 48
+       vmovaps  xmm4, xmm0
+						;; size=89 bbWeight=0.50 PerfScore 30.12
 G_M15800_IG07:
-       vmovups  xmmword ptr [rcx], xmm1
+       vmovups  xmmword ptr [rcx], xmm4
        mov      rax, rcx
 						;; size=7 bbWeight=1 PerfScore 2.25
 G_M15800_IG08:
+       vmovaps  xmm6, xmmword ptr [rsp+40H]
+       vmovaps  xmm7, xmmword ptr [rsp+30H]
+       vmovaps  xmm8, xmmword ptr [rsp+20H]
+       vmovaps  xmm9, xmmword ptr [rsp+10H]
+       vmovaps  xmm10, xmmword ptr [rsp]
+       add      rsp, 88
        ret      
-						;; size=1 bbWeight=1 PerfScore 1.00
+						;; size=34 bbWeight=1 PerfScore 21.25
 RWD00  	dd	3F800000h		;         1
 RWD04  	dd	3F000000h		;       0.5
 
 
-; Total bytes of code 477, prolog size 3, PerfScore 239.91, instruction count 100, allocated bytes for code 477 (MethodHash=096ec247) for method System.Numerics.Quaternion:CreateFromRotationMatrix(System.Numerics.Matrix4x4):System.Numerics.Quaternion
+; Total bytes of code 520, prolog size 36, PerfScore 256.96, instruction count 108, allocated bytes for code 520 (MethodHash=096ec247) for method System.Numerics.Quaternion:CreateFromRotationMatrix(System.Numerics.Matrix4x4):System.Numerics.Quaternion

We decide to promote all the fields of the matrix because we see each of them as having multiple accesses, and we only need to load them once (since they come from a parameter):

Accesses for V01
  float @ 000
    #:                             6
    # assigned from:               0
    # assigned to:                 0
    # as call arg:                 0
    # as implicit by-ref call arg: 0
    # as on-stack call arg:        0
    # as retbuf:                   0
    # as returned value:           0

  float @ 004
    #:                             4
    # assigned from:               0
    # assigned to:                 0
    # as call arg:                 0
    # as implicit by-ref call arg: 0
    # as on-stack call arg:        0
    # as retbuf:                   0
    # as returned value:           0

  float @ 008
    #:                             4
    # assigned from:               0
    # assigned to:                 0
    # as call arg:                 0
    # as implicit by-ref call arg: 0
    # as on-stack call arg:        0
    # as retbuf:                   0
    # as returned value:           0

  float @ 016
    #:                             4
    # assigned from:               0
    # assigned to:                 0
    # as call arg:                 0
    # as implicit by-ref call arg: 0
    # as on-stack call arg:        0
    # as retbuf:                   0
    # as returned value:           0

  float @ 020
    #:                             6
    # assigned from:               0
    # assigned to:                 0
    # as call arg:                 0
    # as implicit by-ref call arg: 0
    # as on-stack call arg:        0
    # as retbuf:                   0
    # as returned value:           0

  float @ 024
    #:                             4
    # assigned from:               0
    # assigned to:                 0
    # as call arg:                 0
    # as implicit by-ref call arg: 0
    # as on-stack call arg:        0
    # as retbuf:                   0
    # as returned value:           0

  float @ 032
    #:                             4
    # assigned from:               0
    # assigned to:                 0
    # as call arg:                 0
    # as implicit by-ref call arg: 0
    # as on-stack call arg:        0
    # as retbuf:                   0
    # as returned value:           0

  float @ 036
    #:                             4
    # assigned from:               0
    # assigned to:                 0
    # as call arg:                 0
    # as implicit by-ref call arg: 0
    # as on-stack call arg:        0
    # as retbuf:                   0
    # as returned value:           0

  float @ 040
    #:                             6
    # assigned from:               0
    # assigned to:                 0
    # as call arg:                 0
    # as implicit by-ref call arg: 0
    # as on-stack call arg:        0
    # as retbuf:                   0
    # as returned value:           0

Evaluating access float @ 000
  Write-back cost: 50
  # write backs: 0
  # read backs: 1
  Cost with: 200
  Cost without: 450
  Promoting replacement

lvaGrabTemp returning 12 (V12 tmp1) (a long lifetime temp) called for V01.[000..004).
Evaluating access float @ 004
  Write-back cost: 50
  # write backs: 0
  # read backs: 1
  Cost with: 150
  Cost without: 300
  Promoting replacement

lvaGrabTemp returning 13 (V13 tmp2) (a long lifetime temp) called for V01.[004..008).
Evaluating access float @ 008
  Write-back cost: 50
  # write backs: 0
  # read backs: 1
  Cost with: 150
  Cost without: 300
  Promoting replacement

lvaGrabTemp returning 14 (V14 tmp3) (a long lifetime temp) called for V01.[008..012).
Evaluating access float @ 016
  Write-back cost: 50
  # write backs: 0
  # read backs: 1
  Cost with: 150
  Cost without: 300
  Promoting replacement

lvaGrabTemp returning 15 (V15 tmp4) (a long lifetime temp) called for V01.[016..020).
Evaluating access float @ 020
  Write-back cost: 50
  # write backs: 0
  # read backs: 1
  Cost with: 200
  Cost without: 450
  Promoting replacement

lvaGrabTemp returning 16 (V16 tmp5) (a long lifetime temp) called for V01.[020..024).
Evaluating access float @ 024
  Write-back cost: 50
  # write backs: 0
  # read backs: 1
  Cost with: 150
  Cost without: 300
  Promoting replacement

lvaGrabTemp returning 17 (V17 tmp6) (a long lifetime temp) called for V01.[024..028).
Evaluating access float @ 032
  Write-back cost: 50
  # write backs: 0
  # read backs: 1
  Cost with: 150
  Cost without: 300
  Promoting replacement

lvaGrabTemp returning 18 (V18 tmp7) (a long lifetime temp) called for V01.[032..036).
Evaluating access float @ 036
  Write-back cost: 50
  # write backs: 0
  # read backs: 1
  Cost with: 150
  Cost without: 300
  Promoting replacement

lvaGrabTemp returning 19 (V19 tmp8) (a long lifetime temp) called for V01.[036..040).
Evaluating access float @ 040
  Write-back cost: 50
  # write backs: 0
  # read backs: 1
  Cost with: 200
  Cost without: 450
  Promoting replacement

lvaGrabTemp returning 20 (V20 tmp9) (a long lifetime temp) called for V01.[040..044).
V01 promoted with 9 replacements
  [000..004) promoted as float V12
  [004..008) promoted as float V13
  [008..012) promoted as float V14
  [016..020) promoted as float V15
  [020..024) promoted as float V16
  [024..028) promoted as float V17
  [032..036) promoted as float V18
  [036..040) promoted as float V19
  [040..044) promoted as float V20

It still ends up being a size regression because we need to save a bunch of callee-saved registers. In addition, there is a large up-front cost in loading all the fields into registers, but after that we do not need any more memory accesses.
However, the function is the following:

/// <summary>Creates a quaternion from the specified rotation matrix.</summary>
/// <param name="matrix">The rotation matrix.</param>
/// <returns>The newly created quaternion.</returns>
public static Quaternion CreateFromRotationMatrix(Matrix4x4 matrix)
{
    float trace = matrix.M11 + matrix.M22 + matrix.M33;
    Quaternion q = default;
    if (trace > 0.0f)
    {
        float s = MathF.Sqrt(trace + 1.0f);
        q.W = s * 0.5f;
        s = 0.5f / s;
        q.X = (matrix.M23 - matrix.M32) * s;
        q.Y = (matrix.M31 - matrix.M13) * s;
        q.Z = (matrix.M12 - matrix.M21) * s;
    }
    else
    {
        if (matrix.M11 >= matrix.M22 && matrix.M11 >= matrix.M33)
        {
            float s = MathF.Sqrt(1.0f + matrix.M11 - matrix.M22 - matrix.M33);
            float invS = 0.5f / s;
            q.X = 0.5f * s;
            q.Y = (matrix.M12 + matrix.M21) * invS;
            q.Z = (matrix.M13 + matrix.M31) * invS;
            q.W = (matrix.M23 - matrix.M32) * invS;
        }
        else if (matrix.M22 > matrix.M33)
        {
            float s = MathF.Sqrt(1.0f + matrix.M22 - matrix.M11 - matrix.M33);
            float invS = 0.5f / s;
            q.X = (matrix.M21 + matrix.M12) * invS;
            q.Y = 0.5f * s;
            q.Z = (matrix.M32 + matrix.M23) * invS;
            q.W = (matrix.M31 - matrix.M13) * invS;
        }
        else
        {
            float s = MathF.Sqrt(1.0f + matrix.M33 - matrix.M11 - matrix.M22);
            float invS = 0.5f / s;
            q.X = (matrix.M31 + matrix.M13) * invS;
            q.Y = (matrix.M32 + matrix.M23) * invS;
            q.Z = 0.5f * s;
            q.W = (matrix.M12 - matrix.M21) * invS;
        }
    }
    return q;
}

So in reality each field is only accessed once per execution anyway, and it is best to leave the struct in memory so that each field can be loaded (and contained) right at its use:

| Method                   | Job        | EnvironmentVariables           |     Mean |     Error |    StdDev | Ratio |
|--------------------------|------------|--------------------------------|---------:|----------:|----------:|------:|
| CreateFromRotationMatrix | Job-RMJBCX | Empty                          | 4.305 ns | 0.0170 ns | 0.0150 ns |  1.00 |
| CreateFromRotationMatrix | Job-BZLCPW | DOTNET_JitNewStructPromotion=0 | 3.510 ns | 0.0158 ns | 0.0140 ns |  0.82 |

My current thinking is to base things more on PGO, and to piggyback on @AndyAyersMS's synthetic PGO in my experiments. That is, make the logic much more conservative about introducing replacements, so that we only introduce them when we are quite sure that an execution of the function accesses the fields many times.

@jakobbotsch
Member Author

jakobbotsch commented Mar 16, 2023

Reworked things to use BB weights and am now testing with DOTNET_JitSynthesizeCounts=1 (not sure how much of a workable state this is in yet, though).
Looking at benchmarks for win-x64, the impact seems to be very small and still overall a size regression:

[15:35:30] 110 contexts with diffs (29 improvements, 70 regressions, 11 same size)
[15:35:30]   -689/+6,564 bytes

PerfScore is a bit better:

[15:35:48] 104 total methods with Perf Score differences (66 improved, 38 regressed).

For aspnet we have:

[15:41:55] 524 contexts with diffs (195 improvements, 298 regressions, 31 same size)
[15:41:55]   -4,605/+9,335 bytes

[15:42:29] 500 total methods with Perf Score differences (377 improved, 123 regressed).

Accurately predicting things like new spills and register pressure when creating the new locals this early in the JIT is bound to be tough, but we should be able to do better than this.

For the example in #71510 we get the following diff in Test with the pass turned on (and a small fix in the importer):

 ; 0 inlinees with PGO data; 3 single block inlinees; 1 inlinees without PGO data
 ; Final local variable assignments
 ;
-;  V00 arg0         [V00,T08] (  3,  6   )   byref  ->  rcx         single-def
-;  V01 arg1         [V01,T10] (  3,  3   )     ref  ->  rdx         class-hnd single-def
-;  V02 loc0         [V02    ] (  7,437.00)  struct (32) [rsp+48H]   do-not-enreg[XSF] addr-exposed ld-addr-op
+;  V00 arg0         [V00,T11] (  3,  6   )   byref  ->  rcx         single-def
+;  V01 arg1         [V01,T12] (  3,  3   )     ref  ->  rdx         class-hnd single-def
+;* V02 loc0         [V02    ] (  0,  0   )  struct (32) zero-ref    do-not-enreg[SF] ld-addr-op
 ;* V03 loc1         [V03    ] (  0,  0   )  struct (32) zero-ref    do-not-enreg[S] ld-addr-op
 ;  V04 OutArgs      [V04    ] (  1,  1   )  struct (32) [rsp+00H]   do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
-;  V05 tmp1         [V05,T09] (  3,  6   )  struct (32) [rsp+28H]   do-not-enreg[SF] must-init ld-addr-op "NewObj constructor temp"
-;  V06 tmp2         [V06,T06] (  3,216.00)     ref  ->  rax         class-hnd exact "Single-def Box Helper"
-;* V07 tmp3         [V07,T07] (  0,  0   )    bool  ->  zero-ref    "Inline return value spill temp"
-;  V08 tmp4         [V08,T03] (  4,320.00)     int  ->  rcx         "Inline stloc first use temp"
-;  V09 tmp5         [V09,T00] (  3,480.00)     ref  ->  rax         class-hnd "impAppendStmt"
-;  V10 tmp6         [V10,T02] (  3,480.00)     int  ->  rcx         "Span.get_Item index"
-;  V11 tmp7         [V11,T01] (  3,480.00)   byref  ->  rax         "Span.get_Item ptrToSpan"
-;* V12 tmp8         [V12    ] (  0,  0   )   byref  ->  zero-ref    V14._reference(offs=0x00) P-INDEP "field V00._reference (fldOffset=0x0)"
-;* V13 tmp9         [V13    ] (  0,  0   )     int  ->  zero-ref    V14._length(offs=0x08) P-INDEP "field V00._length (fldOffset=0x8)"
-;* V14 tmp10        [V14    ] (  0,  0   )  struct (16) zero-ref    "Promoted implicit byref"
-;  V15 cse0         [V15,T05] (  3,240.00)     int  ->  rcx         "CSE - aggressive"
-;  V16 cse1         [V16,T04] (  3,240.00)     ref  ->  rax         "CSE - aggressive"
+;  V05 tmp1         [V05,T10] (  5, 10   )  struct (32) [rsp+28H]   do-not-enreg[SF] must-init ld-addr-op "NewObj constructor temp"
+;  V06 tmp2         [V06,T03] (  3,216.00)     ref  ->  rax         class-hnd exact "Single-def Box Helper"
+;* V07 tmp3         [V07,T04] (  0,  0   )    bool  ->  zero-ref    "Inline return value spill temp"
+;  V08 tmp4         [V08,T02] (  3,240.00)     int  ->  rdi         "Inline stloc first use temp"
+;* V09 tmp5         [V09    ] (  0,  0   )     ref  ->  zero-ref    class-hnd "impAppendStmt"
+;  V10 tmp6         [V10,T00] (  3,480.00)     int  ->  rcx         "Span.get_Item index"
+;* V11 tmp7         [V11    ] (  0,  0   )   byref  ->  zero-ref    V18._reference(offs=0x00) P-INDEP "field V00._reference (fldOffset=0x0)"
+;* V12 tmp8         [V12    ] (  0,  0   )     int  ->  zero-ref    V18._length(offs=0x08) P-INDEP "field V00._length (fldOffset=0x8)"
+;  V13 tmp9         [V13,T06] (  3, 82.00)     ref  ->  rsi         single-def "V02.[000..008)"
+;  V14 tmp10        [V14,T05] (  2,116.00)     int  ->  rdi         "V02.[008..012)"
+;  V15 tmp11        [V15,T01] (  4,241.00)     int  ->  registers   "V02.[012..016)"
+;  V16 tmp12        [V16,T07] (  2, 81.00)   byref  ->  rbx         single-def "V02.[016..024)"
+;  V17 tmp13        [V17,T08] (  2, 81.00)     int  ->  rbp         single-def "V02.[024..028)"
+;* V18 tmp14        [V18    ] (  0,  0   )  struct (16) zero-ref    "Promoted implicit byref"
+;  V19 cse0         [V19,T09] (  2, 81.00)     int  ->  r14         "CSE - aggressive"
+;  V20 cse1         [V20,T13] (  2,  2   )     int  ->  rdi         "CSE - aggressive"
 ;
-; Lcl frame size = 104
+; Lcl frame size = 72
 
 G_M27694_IG01:
-       sub      rsp, 104
+       push     r15
+       push     r14
+       push     rdi
+       push     rsi
+       push     rbp
+       push     rbx
+       sub      rsp, 72
        vzeroupper 
        xor      eax, eax
        mov      qword ptr [rsp+28H], rax
        vxorps   xmm4, xmm4
        vmovdqa  xmmword ptr [rsp+30H], xmm4
        mov      qword ptr [rsp+40H], rax
-						;; size=29 bbWeight=1 PerfScore 5.83
+						;; size=37 bbWeight=1 PerfScore 11.83
 G_M27694_IG02:
        vmovdqu  xmm0, xmmword ptr [rcx]
        vmovdqu  xmmword ptr [rsp+38H], xmm0
 						;; size=10 bbWeight=1 PerfScore 5.00
 G_M27694_IG03:
        mov      gword ptr [rsp+28H], rdx
-						;; size=5 bbWeight=1 PerfScore 1.00
+       mov      rsi, gword ptr [rsp+28H]
+       xor      edi, edi
+       mov      rbx, bword ptr [rsp+38H]
+       mov      ebp, dword ptr [rsp+40H]
+       mov      r14d, dword ptr [rsi+08H]
+       jmp      SHORT G_M27694_IG05
+						;; size=27 bbWeight=1 PerfScore 8.25
 G_M27694_IG04:
-       vmovdqu  ymm0, ymmword ptr[rsp+28H]
-       vmovdqu  ymmword ptr[rsp+48H], ymm0
-						;; size=12 bbWeight=1 PerfScore 5.00
-G_M27694_IG05:
-       jmp      SHORT G_M27694_IG07
-						;; size=2 bbWeight=1 PerfScore 2.00
-G_M27694_IG06:
        mov      rcx, 0xD1FFAB1E      ; System.Int32
        call     CORINFO_HELP_NEWSFAST
-       mov      ecx, dword ptr [rsp+50H]
-       mov      dword ptr [rax+08H], ecx
+       mov      dword ptr [rax+08H], edi
        mov      rcx, rax
        call     [System.Console:WriteLine(System.Object)]
-						;; size=31 bbWeight=36.00 PerfScore 234.00
+       mov      edi, r15d
+						;; size=30 bbWeight=36.00 PerfScore 207.00
+G_M27694_IG05:
+       cmp      r14d, edi
+       jle      SHORT G_M27694_IG07
+						;; size=5 bbWeight=80.00 PerfScore 100.00
+G_M27694_IG06:
+       lea      ecx, [rdi+01H]
+       mov      r15d, ecx
+       mov      ecx, edi
+       mov      ecx, dword ptr [rsi+4*rcx+10H]
+       cmp      ecx, ebp
+       jae      SHORT G_M27694_IG08
+       mov      edi, dword ptr [rbx+4*rcx]
+       jmp      SHORT G_M27694_IG04
+						;; size=21 bbWeight=80.00 PerfScore 660.00
 G_M27694_IG07:
-       mov      ecx, dword ptr [rsp+54H]
-       mov      rax, gword ptr [rsp+48H]
-       cmp      ecx, dword ptr [rax+08H]
-       jge      SHORT G_M27694_IG09
-						;; size=14 bbWeight=80.00 PerfScore 480.00
-G_M27694_IG08:
-       lea      edx, [rcx+01H]
-       mov      dword ptr [rsp+54H], edx
-       cmp      ecx, dword ptr [rax+08H]
-       jae      SHORT G_M27694_IG10
-       mov      ecx, ecx
-       mov      ecx, dword ptr [rax+4*rcx+10H]
-       lea      rax, bword ptr [rsp+58H]
-       cmp      ecx, dword ptr [rax+08H]
-       jae      SHORT G_M27694_IG10
-       mov      rax, bword ptr [rax]
-       mov      ecx, dword ptr [rax+4*rcx]
-       mov      dword ptr [rsp+50H], ecx
-       jmp      SHORT G_M27694_IG06
-						;; size=40 bbWeight=80.00 PerfScore 1540.00
-G_M27694_IG09:
-       vzeroupper 
-       add      rsp, 104
+       add      rsp, 72
+       pop      rbx
+       pop      rbp
+       pop      rsi
+       pop      rdi
+       pop      r14
+       pop      r15
        ret      
-						;; size=8 bbWeight=10.00 PerfScore 22.50
-G_M27694_IG10:
+						;; size=13 bbWeight=10.00 PerfScore 42.50
+G_M27694_IG08:
        call     CORINFO_HELP_RNGCHKFAIL
        int3     
 						;; size=6 bbWeight=0 PerfScore 0.00
 
-; Total bytes of code 157, prolog size 29, PerfScore 2311.03, instruction count 41, allocated bytes for code 157 (MethodHash=b4d393d1) for method FilteredSpanEnumerator`1[int]:Test(System.ReadOnlySpan`1[int],int[])
+; Total bytes of code 149, prolog size 37, PerfScore 1049.48, instruction count 48, allocated bytes for code 149 (MethodHash=b4d393d1) for method FilteredSpanEnumerator`1[int]:Test(System.ReadOnlySpan`1[int],int[])

So this looks encouraging at least.

@tannergooding
Member

I wouldn't worry too much about regressions in APIs like CreateFromRotationMatrix.

That's an API that I plan on (and need to) rewrite to use explicit SIMD code. Namely, it needs to use Matrix4x4.AsImpl() so we get 4x Vector4 fields rather than 16x float fields.

It is somewhat representative of what "might" happen for some user code. But it's still a bit different, since Quaternion is a TYP_SIMD16, so we're getting a very weird mixing of SIMD and non-SIMD code which isn't likely to be typical.

Eliminate commas early in block morphing, before the rest of the phase needs to look at them. Do this by extracting their side effects and adding them on top of the returned result instead. This allows us to handle arbitrary COMMAs on the source and destination operands in a general manner.
Otherwise we hit asserts when trying to allocate 0 bytes of memory.
@jakobbotsch jakobbotsch changed the title from "JIT: Prototype a generalized promotion pass" to "JIT: Add a (disabled) prototype for a generalized promotion pass" Apr 4, 2023
Member

@AndyAyersMS AndyAyersMS left a comment


Did a first pass ... overall this looks very nice.

#ifndef _PROMOTION_H
#define _PROMOTION_H

class Compiler;
Member


Doesn't look like we have a policy on this, but it is often nice for headers to compile on their own instead of relying on includers to provide the right prerequisites.

Member Author


Added includes instead of forward declarations.

// It may also be contained. Overall we are going to cost each use of
// an unpromoted local at 6.5 bytes.
// TODO-CQ: We can make much better guesses on what will and won't be contained.
costWithout += access.CountWtd * 6.5;
Member


This seems to be mixing units? Weighted costs reflect (relative) frequency so should be used to scale times, but bytes are size related.

You may want to form both a time score and a size score and then explicitly try and model the tradeoffs between the two.

E.g. for inlining I would generally propose: if the size delta is negative, always transform; else, if size increases, transform only if the ratio of time improvement to size increase is above some threshold.
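
A minimal sketch of that suggested policy (hypothetical names and threshold; this is not existing JIT code):

static bool ShouldTransform(double sizeDelta, double timeImprovement, double threshold)
{
    // The code got smaller (or stayed the same size): always transform.
    if (sizeDelta <= 0)
    {
        return true;
    }

    // Otherwise pay for size only when the speedup per extra byte is large enough.
    return timeImprovement / sizeDelta >= threshold;
}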

Member Author


Initially this was purely a size estimate, but then I switched it to make some use of PGO weights. I didn't really spend any time on the costing after that; it's pretty hard to make good choices when a lot of the basic stuff is still missing.
I will add a todo here -- your suggestion to eventually check both size and perf makes sense. But I found it very hard to predict the size cost (in particular due to prolog/epilog cost and potential spilling), and accurately modelling execution cost is also going to be pretty hard.

Member Author


Actually I can see there is already a TODO above for it:

// TODO-CQ: Tune the following heuristics. Currently they are based on
// x64 code size although using BB weights when available.

Member Author


I added clarification to the todo that this mixing of units does not make sense.

argSize = m_compiler->typGetObjLayout(arg.GetSignatureClassHandle())->GetSize();
}

#ifdef WINDOWS_AMD64_ABI
Member


Leverage getArgTypeForStruct here?

Member Author


I will remove these for now, they are currently unused. The plan was to move ABI information determination back so that it happened before this pass and could be used for it (I guess the implicit byref case might be simple enough here, while the "passed on stack" hint would be harder to determine).

if ((ssize_t)index < 0)
{
    index = ~index;
    if ((index > 0) && replacements[index - 1].Overlaps(offs, size))
Member


How do you know that there aren't any lower indexed entries that overlap?

Member Author


replacements contains only disjoint ranges (this is ensured by EvaluateReplacement).
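
A small sketch of the invariant that makes checking only the preceding entry sufficient (illustrative C#; the Rep type and names are hypothetical stand-ins for the JIT's replacement descriptors):

using System.Collections.Generic;

record struct Rep(int Offset, int Size)
{
    public int End => Offset + Size;
    public bool Overlaps(int offs, int size) => Offset < offs + size && offs < End;
}

static bool PrecedingEntryOverlaps(List<Rep> reps, int offs, int size)
{
    // reps is sorted by Offset and its ranges are pairwise disjoint.
    int index = reps.BinarySearch(new Rep(offs, 0),
        Comparer<Rep>.Create((a, b) => a.Offset.CompareTo(b.Offset)));
    if (index >= 0)
    {
        return true; // an existing range starts exactly at offs
    }

    index = ~index; // insertion point: the first entry with Offset > offs
    // Entries at 'index' and beyond start after offs and are checked separately.
    // Among earlier entries, disjointness means everything before reps[index - 1]
    // ends at or before reps[index - 1] starts, so only the immediate predecessor
    // can reach offs.
    return (index > 0) && reps[index - 1].Overlaps(offs, size);
}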

// lcl - The struct local
// offs - The starting offset of the range in the struct local that needs to be read back from.
// size - The size of the range
// conservative - Whether this is a potentially conservative read back
Member


Is this for the case where you may be able to prove that a reassembled struct won't be modified?

Member Author


This is to make it clear in the jitdump that this is an assignment. Eventually we want to decompose these directly (like block morphing) and avoid this read back in many cases, but the logic is not implemented yet.

jakobbotsch added a commit to jakobbotsch/runtime that referenced this pull request Apr 5, 2023
* Avoid setting GTF_GLOB_REF on GT_FIELD_ADDR nodes
* Avoid setting GTF_GLOB_REF on GT_FIELD nodes off of implicit byrefs.
  This is ok now since implicit byref morphing indiscriminately sets
  GTF_GLOB_REF for these.

These changes are necessary to avoid address exposure in the two user
benchmarks in dotnet#83388.

Fix dotnet#74563
jakobbotsch added a commit to jakobbotsch/runtime that referenced this pull request Apr 5, 2023
Fix issue seen in dotnet#83388 where tail merging ended up adding new
predecessors to the scratch block, making adding more "initialization
IR" impossible for downstream phases.
jakobbotsch added a commit that referenced this pull request Apr 5, 2023
Fix issue seen in #83388 where tail merging ended up adding new
predecessors to the scratch block, making adding more "initialization
IR" impossible for downstream phases.
jakobbotsch added a commit that referenced this pull request Apr 7, 2023
* Avoid setting GTF_GLOB_REF on GT_FIELD_ADDR nodes
* Avoid setting GTF_GLOB_REF on GT_FIELD nodes off of implicit byrefs.
  This is ok now since implicit byref morphing indiscriminately sets
  GTF_GLOB_REF for these.
* Manually clone a "pointer to span" in span intrinsic expansion when it points to a local. Unfortunately this does not fall out from the above since gtClone does not handle FIELD_ADDR, and making it handle this needs some more work.

These changes are necessary to avoid address exposure in the two user
benchmarks in #83388.

Fix #74563
Fix #856
@jakobbotsch
Copy link
Member Author

No diffs now. TP regressions are from the new phase (even with the early exit) and the change in fgNewStmtFromTree.

@AndyAyersMS Anything else you think I need to do here?

Member

@AndyAyersMS AndyAyersMS left a comment


LGTM.

@jakobbotsch jakobbotsch merged commit 388edb6 into dotnet:main Apr 11, 2023
186 checks passed
@jakobbotsch jakobbotsch deleted the generalized-promotion branch April 11, 2023 09:08
@jakobbotsch
Member Author

jakobbotsch commented Apr 25, 2023

@En3Tho, are you ok with me adding your GSeq example to the dotnet/performance micro benchmarks under the MIT license so we can track the perf of it?

@En3Tho
Contributor

En3Tho commented Apr 25, 2023

@jakobbotsch Will be glad if you do so! I think the collection size in that particular benchmark should be increased.

@dotnet dotnet locked as resolved and limited conversation to collaborators May 25, 2023