
JIT: Add a (disabled) prototype for a generalized promotion pass #83388

Merged
merged 57 commits into dotnet:main from generalized-promotion on Apr 11, 2023

Conversation

jakobbotsch
Member

@jakobbotsch jakobbotsch commented Mar 14, 2023

Introduce a "physical" promotion pass (name suggested by @SingleAccretion; thanks!) that generalizes the existing promotion. More specifically, it does not have restrictions on field count and it can handle arbitrary recursive promotion.

The pass is physical in the sense that it does not rely on any field metadata for structs. Instead, it works in two separate passes over the IR:

  1. In the first pass we find and analyze how unpromoted struct locals are accessed. For example, for a simple program like:
public static void Main()
{
    S s = default;
    Call(s, s.C);
    Console.WriteLine(s.B + s.C);
}

[MethodImpl(MethodImplOptions.NoInlining)]
private static void Call(S s, byte b)
{
}

private struct S
{
    public byte A, B, C, D, E;
}

we see IR like:

***** BB01
STMT00000 ( 0x000[E-] ... 0x003 )
               [000003] IA---------ASG       struct (init)
               [000001] D------N---                         ├──▌  LCL_VAR   struct<Program+S, 5> V00 loc0         
               [000002] -----------                         └──▌  CNS_INT   int    0

***** BB01
STMT00001 ( 0x008[E-] ... 0x026 )
               [000008] --C-G------CALL      void   Program:Call(Program+S,ubyte)
               [000004] ----------- arg0                    ├──▌  LCL_VAR   struct<Program+S, 5> V00 loc0         
               [000007] ----------- arg1                    └──▌  LCL_FLD   ubyte  V00 loc0         [+2]

***** BB01
STMT00002 ( 0x014[E-] ... ??? )
               [000016] --C-G------CALL      void   System.Console:WriteLine(int)
               [000015] ----------- arg0                    └──▌  ADD       int   
               [000011] -----------                            ├──▌  LCL_FLD   ubyte  V00 loc0         [+1]
               [000014] -----------                            └──▌  LCL_FLD   ubyte  V00 loc0         [+2]

and the analysis produces

Accesses for V00
  [000..005)
    #:                             (2, 200)
    # assigned from:               (0, 0)
    # assigned to:                 (1, 100)
    # as call arg:                 (1, 100)
    # as implicit by-ref call arg: (1, 100)
    # as on-stack call arg:        (0, 0)
    # as retbuf:                   (0, 0)
    # as returned value:           (0, 0)

  ubyte @ 001
    #:                             (1, 100)
    # assigned from:               (0, 0)
    # assigned to:                 (0, 0)
    # as call arg:                 (0, 0)
    # as implicit by-ref call arg: (0, 0)
    # as on-stack call arg:        (0, 0)
    # as retbuf:                   (0, 0)
    # as returned value:           (0, 0)

  ubyte @ 002
    #:                             (2, 200)
    # assigned from:               (0, 0)
    # assigned to:                 (0, 0)
    # as call arg:                 (1, 100)
    # as implicit by-ref call arg: (0, 0)
    # as on-stack call arg:        (0, 0)
    # as retbuf:                   (0, 0)
    # as returned value:           (0, 0)

Here the pairs are (ref count, weighted ref count), where each reference is weighted by the profile weight of the basic block containing it.

Based on this accounting, the analysis estimates the profitability of replacing some of the accessed parts of the struct with new locals. Promotion may be costly because overlapping struct accesses (e.g. passing the whole struct as an argument) may require more expensive codegen after promotion. And of course, creating new locals introduces more register pressure. Currently the profitability analysis is very crude.
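Schematically, the comparison has roughly the following shape (an illustrative C# sketch, not the JIT's actual C++ code; the per-use cost of 6.5 and the write-back cost of 5 appear elsewhere in this PR, while the promoted per-use cost of 3.5 is inferred from the dumps below):

static bool ShouldPromote(double countWtd, double writeBacksWtd, double readBacksWtd)
{
    const double unpromotedAccessCost = 6.5; // rough per-use cost of an unpromoted local
    const double promotedAccessCost   = 3.5; // inferred per-use cost once enregistered
    const double singleWriteBackCost  = 5.0; // "Single write-back cost" in the dumps

    double costWithout = countWtd * unpromotedAccessCost;
    double costWith    = countWtd * promotedAccessCost
                       + (writeBacksWtd + readBacksWtd) * singleWriteBackCost;

    return costWith < costWithout;
}

Plugging in the ubyte @ 002 numbers (weighted count 200, 100 weighted write backs and 100 weighted read backs) gives a cost of 1700 with promotion versus 1300 without, matching the "Disqualifying replacement" decision shown below.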

In this case the logic decides that promotion is not worth it:

Evaluating access ubyte @ 001
  Single write-back cost: 5
  Write backs: 100
  Read backs: 100
  Cost with: 1350
  Cost without: 650
  Disqualifying replacement
Evaluating access ubyte @ 002
  Single write-back cost: 5
  Write backs: 100
  Read backs: 100
  Cost with: 1700
  Cost without: 1300
  Disqualifying replacement
  2. In the second pass the field accesses are replaced with new locals for the profitable cases. For overlapping accesses this currently involves writing the replacements back to the struct local first. For arguments/OSR locals, it involves reading the replacements back from the struct first.

In the above case we can override the profitability analysis with stress mode STRESS_PHYSICAL_PROMOTION_COST and we get:

Evaluating access ubyte @ 001
  Single write-back cost: 5
  Write backs: 100
  Read backs: 100
  Cost with: 1350
  Cost without: 650
  Promoting replacement due to stress

lvaGrabTemp returning 2 (V02 tmp1) (a long lifetime temp) called for V00.[001..002).
Evaluating access ubyte @ 002
  Single write-back cost: 5
  Write backs: 100
  Read backs: 100
  Cost with: 1700
  Cost without: 1300
  Promoting replacement due to stress

lvaGrabTemp returning 3 (V03 tmp2) (a long lifetime temp) called for V00.[002..003).
V00 promoted with 2 replacements
  [001..002) promoted as ubyte V02
  [002..003) promoted as ubyte V03

...

***** BB01
STMT00000 ( 0x000[E-] ... 0x003 )
               [000003] IA---------ASG       struct (init)
               [000001] D------N---                         ├──▌  LCL_VAR   struct<Program+S, 5> V00 loc0         
               [000002] -----------                         └──▌  CNS_INT   int    0

***** BB01
STMT00001 ( 0x008[E-] ... 0x026 )
               [000008] -ACXG------CALL      void   Program:Call(Program+S,ubyte)
               [000004] ----------- arg0                    ├──▌  LCL_VAR   struct<Program+S, 5> V00 loc0         
               [000022] -A--------- arg1                    └──▌  COMMA     ubyte 
               [000021] -A---------                            ├──▌  ASG       ubyte 
               [000019] D------N---                            │  ├──▌  LCL_VAR   ubyte  V03 tmp2         
               [000020] -----------                            │  └──▌  LCL_FLD   ubyte  V00 loc0         [+2]
               [000018] -----------                            └──▌  LCL_VAR   ubyte  V03 tmp2         

***** BB01
STMT00002 ( 0x014[E-] ... ??? )
               [000016] -ACXG------CALL      void   System.Console:WriteLine(int)
               [000015] -A--------- arg0                    └──▌  ADD       int   
               [000027] -A---------                            ├──▌  COMMA     ubyte 
               [000026] -A---------                            │  ├──▌  ASG       ubyte 
               [000024] D------N---                            │  │  ├──▌  LCL_VAR   ubyte  V02 tmp1         
               [000025] -----------                            │  │  └──▌  LCL_FLD   ubyte  V00 loc0         [+1]
               [000023] -----------                            │  └──▌  LCL_VAR   ubyte  V02 tmp1         
               [000028] -----------                            └──▌  LCL_VAR   ubyte  V03 tmp2         
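
In source terms, the stress-promoted IR above corresponds roughly to the following (an illustrative sketch of the transformation, not compiler output):

byte tmp2 = s.C;                 // V03: replacement read from the struct at its first use
Call(s, tmp2);                   // the whole-struct argument still reads V00 directly
byte tmp1 = s.B;                 // V02: replacement read from the struct
Console.WriteLine(tmp1 + tmp2);  // field uses replaced by the new locals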

The pass is still rudimentary and is missing many basic CQ optimizations. For example, it does not make use of any liveness information yet, and it has no decomposition support for assignments. Even so, it already shows good potential in a couple of user benchmarks (benchmark 1, benchmark 2). I have listed some follow-up improvements in #76928.

This PR adds the pass, but it is disabled by default. It can be enabled by setting DOTNET_JitStressModeNames=STRESS_PHYSICAL_PROMOTION. Two new scenarios that enable it have been added to jit-experimental for testing purposes.

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr label Mar 14, 2023
@ghost ghost assigned jakobbotsch Mar 14, 2023
@ghost

ghost commented Mar 14, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.


@jakobbotsch
Member Author

Currently I'm using some size heuristics to figure out when to do the promotion. Here is an example diff in CreateFromRotationMatrix:

 ; Final local variable assignments
 ;
 ;  V00 RetBuf       [V00,T01] (  4,  4   )   byref  ->  rcx         single-def
-;  V01 arg0         [V01,T00] ( 29, 34   )   byref  ->  rdx         single-def
-;  V02 loc0         [V02,T08] (  3,  2.50)   float  ->  mm2        
+;  V01 arg0         [V01,T00] ( 11, 22   )   byref  ->  rdx         single-def
+;  V02 loc0         [V02,T14] (  3,  2.50)   float  ->  mm9        
 ;  V03 loc1         [V03,T02] ( 13,  7   )  simd16  ->  registers   ld-addr-op
 ;  V04 loc2         [V04,T07] (  7,  3.50)   float  ->  mm0        
-;  V05 loc3         [V05,T12] (  3,  1.50)   float  ->  mm3        
-;  V06 loc4         [V06,T09] (  4,  2   )   float  ->  mm0        
-;  V07 loc5         [V07,T13] (  3,  1.50)   float  ->  mm3        
-;  V08 loc6         [V08,T10] (  4,  2   )   float  ->  mm0        
-;  V09 loc7         [V09,T14] (  3,  1.50)   float  ->  mm0        
-;  V10 loc8         [V10,T11] (  4,  2   )   float  ->  mm1        
+;  V05 loc3         [V05,T18] (  3,  1.50)   float  ->  mm0        
+;  V06 loc4         [V06,T15] (  4,  2   )   float  ->  mm4        
+;  V07 loc5         [V07,T19] (  3,  1.50)   float  ->  mm0        
+;  V08 loc6         [V08,T16] (  4,  2   )   float  ->  mm4        
+;  V09 loc7         [V09,T20] (  3,  1.50)   float  ->  mm0        
+;  V10 loc8         [V10,T17] (  4,  2   )   float  ->  mm4        
 ;# V11 OutArgs      [V11    ] (  1,  1   )  struct ( 0) [rsp+00H]   do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
-;  V12 cse0         [V12,T04] (  7,  4.50)   float  ->  mm1         "CSE - aggressive"
-;  V13 cse1         [V13,T05] (  7,  4.50)   float  ->  mm3         "CSE - aggressive"
-;  V14 cse2         [V14,T06] (  7,  4.50)   float  ->  mm0         "CSE - aggressive"
-;  V15 cse3         [V15,T03] ( 12,  6   )   float  ->  mm2         "CSE - aggressive"
+;  V12 tmp1         [V12,T04] (  7,  4.50)   float  ->  mm0         "V01.[000..004)"
+;  V13 tmp2         [V13,T08] (  5,  3   )   float  ->  mm1         "V01.[004..008)"
+;  V14 tmp3         [V14,T09] (  5,  3   )   float  ->  mm2         "V01.[008..012)"
+;  V15 tmp4         [V15,T10] (  5,  3   )   float  ->  mm3         "V01.[016..020)"
+;  V16 tmp5         [V16,T05] (  7,  4.50)   float  ->  mm4         "V01.[020..024)"
+;  V17 tmp6         [V17,T11] (  5,  3   )   float  ->  mm5         "V01.[024..028)"
+;  V18 tmp7         [V18,T12] (  5,  3   )   float  ->  mm6         "V01.[032..036)"
+;  V19 tmp8         [V19,T13] (  5,  3   )   float  ->  mm7         "V01.[036..040)"
+;  V20 tmp9         [V20,T06] (  7,  4.50)   float  ->  mm8         "V01.[040..044)"
+;  V21 cse0         [V21,T03] ( 12,  6   )   float  ->  mm9         "CSE - aggressive"
 ;
-; Lcl frame size = 0
+; Lcl frame size = 88
 
 G_M15800_IG01:
+       sub      rsp, 88
        vzeroupper 
-						;; size=3 bbWeight=1 PerfScore 1.00
+       vmovaps  xmmword ptr [rsp+40H], xmm6
+       vmovaps  xmmword ptr [rsp+30H], xmm7
+       vmovaps  xmmword ptr [rsp+20H], xmm8
+       vmovaps  xmmword ptr [rsp+10H], xmm9
+       vmovaps  xmmword ptr [rsp], xmm10
+						;; size=36 bbWeight=1 PerfScore 11.25
 G_M15800_IG02:
        vmovss   xmm0, dword ptr [rdx]
-       vmovss   xmm1, dword ptr [rdx+14H]
-       vaddss   xmm2, xmm0, xmm1
-       vmovss   xmm3, dword ptr [rdx+28H]
-       vaddss   xmm2, xmm2, xmm3
-       vxorps   xmm4, xmm4
-       vucomiss xmm2, xmm4
+       vmovss   xmm1, dword ptr [rdx+04H]
+       vmovss   xmm2, dword ptr [rdx+08H]
+       vmovss   xmm3, dword ptr [rdx+10H]
+       vmovss   xmm4, dword ptr [rdx+14H]
+       vmovss   xmm5, dword ptr [rdx+18H]
+       vmovss   xmm6, dword ptr [rdx+20H]
+       vmovss   xmm7, dword ptr [rdx+24H]
+       vmovss   xmm8, dword ptr [rdx+28H]
+       vaddss   xmm9, xmm0, xmm4
+       vaddss   xmm9, xmm9, xmm8
+       vxorps   xmm10, xmm10
+       vucomiss xmm9, xmm10
        jbe      SHORT G_M15800_IG04
-						;; size=32 bbWeight=1 PerfScore 21.33
+						;; size=65 bbWeight=1 PerfScore 45.33
 G_M15800_IG03:
-       vaddss   xmm0, xmm2, dword ptr [reloc @RWD00]
+       vaddss   xmm0, xmm9, dword ptr [reloc @RWD00]
        vsqrtss  xmm0, xmm0
-       vmovss   xmm2, dword ptr [reloc @RWD04]
-       vmulss   xmm1, xmm0, xmm2
-       vinsertps xmm1, xmm1, xmm1, 55
-       vdivss   xmm0, xmm2, xmm0
-       vmovss   xmm3, dword ptr [rdx+18H]
-       vsubss   xmm2, xmm3, dword ptr [rdx+24H]
+       vmovss   xmm9, dword ptr [reloc @RWD04]
+       vmulss   xmm4, xmm0, xmm9
+       vinsertps xmm4, xmm4, xmm4, 55
+       vdivss   xmm0, xmm9, xmm0
+       vsubss   xmm5, xmm5, xmm7
+       vmulss   xmm5, xmm5, xmm0
+       vinsertps xmm4, xmm4, xmm5, 0
+       vsubss   xmm2, xmm6, xmm2
        vmulss   xmm2, xmm2, xmm0
-       vinsertps xmm1, xmm1, xmm2, 0
-       vmovss   xmm2, dword ptr [rdx+20H]
-       vsubss   xmm2, xmm2, dword ptr [rdx+08H]
-       vmulss   xmm2, xmm2, xmm0
-       vinsertps xmm1, xmm1, xmm2, 16
-       vmovss   xmm2, dword ptr [rdx+04H]
-       vsubss   xmm2, xmm2, dword ptr [rdx+10H]
-       vmulss   xmm0, xmm2, xmm0
-       vinsertps xmm1, xmm1, xmm0, 32
+       vinsertps xmm2, xmm4, xmm2, 16
+       vsubss   xmm1, xmm1, xmm3
+       vmulss   xmm3, xmm1, xmm0
+       vinsertps xmm4, xmm2, xmm3, 32
        jmp      G_M15800_IG07
-						;; size=99 bbWeight=0.50 PerfScore 38.50
+						;; size=82 bbWeight=0.50 PerfScore 28.00
 G_M15800_IG04:
-       vucomiss xmm0, xmm1
+       vucomiss xmm0, xmm4
        jb       SHORT G_M15800_IG05
-       vucomiss xmm0, xmm3
+       vucomiss xmm0, xmm8
        jb       SHORT G_M15800_IG05
        vaddss   xmm0, xmm0, dword ptr [reloc @RWD00]
-       vsubss   xmm1, xmm0, xmm1
-       vsubss   xmm3, xmm1, xmm3
-       vsqrtss  xmm3, xmm3
-       vmovss   xmm2, dword ptr [reloc @RWD04]
-       vdivss   xmm0, xmm2, xmm3
-       vmulss   xmm2, xmm3, xmm2
-       vinsertps xmm1, xmm1, xmm2, 14
-       vmovss   xmm2, dword ptr [rdx+04H]
-       vaddss   xmm2, xmm2, dword ptr [rdx+10H]
-       vmulss   xmm2, xmm2, xmm0
-       vinsertps xmm1, xmm1, xmm2, 16
-       vmovss   xmm2, dword ptr [rdx+08H]
-       vaddss   xmm2, xmm2, dword ptr [rdx+20H]
-       vmulss   xmm2, xmm2, xmm0
-       vinsertps xmm1, xmm1, xmm2, 32
-       vmovss   xmm2, dword ptr [rdx+18H]
-       vsubss   xmm2, xmm2, dword ptr [rdx+24H]
-       vmulss   xmm0, xmm2, xmm0
-       vinsertps xmm1, xmm1, xmm0, 48
-       jmp      G_M15800_IG07
-						;; size=119 bbWeight=0.50 PerfScore 44.50
-G_M15800_IG05:
-       vucomiss xmm1, xmm3
-       jbe      SHORT G_M15800_IG06
-       vaddss   xmm1, xmm1, dword ptr [reloc @RWD00]
-       vsubss   xmm0, xmm1, xmm0
-       vsubss   xmm3, xmm0, xmm3
-       vsqrtss  xmm3, xmm3
-       vmovss   xmm2, dword ptr [reloc @RWD04]
-       vdivss   xmm0, xmm2, xmm3
-       vmovss   xmm1, dword ptr [rdx+10H]
-       vaddss   xmm1, xmm1, dword ptr [rdx+04H]
-       vmulss   xmm1, xmm1, xmm0
-       vinsertps xmm1, xmm1, xmm1, 14
-       vmulss   xmm2, xmm3, xmm2
-       vinsertps xmm1, xmm1, xmm2, 16
-       vmovss   xmm2, dword ptr [rdx+24H]
-       vaddss   xmm2, xmm2, dword ptr [rdx+18H]
-       vmulss   xmm2, xmm2, xmm0
-       vinsertps xmm1, xmm1, xmm2, 32
-       vmovss   xmm2, dword ptr [rdx+20H]
-       vsubss   xmm2, xmm2, dword ptr [rdx+08H]
-       vmulss   xmm0, xmm2, xmm0
-       vinsertps xmm1, xmm1, xmm0, 48
-       jmp      SHORT G_M15800_IG07
-						;; size=110 bbWeight=0.50 PerfScore 43.00
-G_M15800_IG06:
-       vaddss   xmm2, xmm3, dword ptr [reloc @RWD00]
-       vsubss   xmm0, xmm2, xmm0
-       vsubss   xmm0, xmm0, xmm1
+       vsubss   xmm4, xmm0, xmm4
+       vsubss   xmm0, xmm4, xmm8
        vsqrtss  xmm0, xmm0
-       vmovss   xmm2, dword ptr [reloc @RWD04]
-       vdivss   xmm1, xmm2, xmm0
-       vmovss   xmm3, dword ptr [rdx+20H]
-       vaddss   xmm3, xmm3, dword ptr [rdx+08H]
-       vmulss   xmm3, xmm3, xmm1
-       vinsertps xmm3, xmm3, xmm3, 14
-       vmovss   xmm4, dword ptr [rdx+24H]
-       vaddss   xmm4, xmm4, dword ptr [rdx+18H]
-       vmulss   xmm4, xmm4, xmm1
-       vinsertps xmm3, xmm3, xmm4, 16
-       vmulss   xmm0, xmm0, xmm2
-       vinsertps xmm0, xmm3, xmm0, 32
-       vmovss   xmm2, dword ptr [rdx+04H]
-       vsubss   xmm2, xmm2, dword ptr [rdx+10H]
-       vmulss   xmm1, xmm2, xmm1
+       vmovss   xmm9, dword ptr [reloc @RWD04]
+       vdivss   xmm4, xmm9, xmm0
+       vmulss   xmm0, xmm0, xmm9
+       vinsertps xmm0, xmm0, xmm0, 14
+       vaddss   xmm1, xmm1, xmm3
+       vmulss   xmm3, xmm1, xmm4
+       vinsertps xmm0, xmm0, xmm3, 16
+       vaddss   xmm2, xmm2, xmm6
+       vmulss   xmm1, xmm2, xmm4
+       vinsertps xmm0, xmm0, xmm1, 32
+       vsubss   xmm5, xmm5, xmm7
+       vmulss   xmm1, xmm5, xmm4
        vinsertps xmm0, xmm0, xmm1, 48
-       vmovaps  xmm1, xmm0
-						;; size=106 bbWeight=0.50 PerfScore 40.62
+       vmovaps  xmm4, xmm0
+       jmp      G_M15800_IG07
+						;; size=108 bbWeight=0.50 PerfScore 34.12
+G_M15800_IG05:
+       vucomiss xmm4, xmm8
+       jbe      SHORT G_M15800_IG06
+       vaddss   xmm4, xmm4, dword ptr [reloc @RWD00]
+       vsubss   xmm0, xmm4, xmm0
+       vsubss   xmm0, xmm0, xmm8
+       vsqrtss  xmm0, xmm0
+       vmovss   xmm9, dword ptr [reloc @RWD04]
+       vdivss   xmm4, xmm9, xmm0
+       vaddss   xmm1, xmm3, xmm1
+       vmulss   xmm3, xmm1, xmm4
+       vinsertps xmm1, xmm1, xmm3, 14
+       vmulss   xmm0, xmm0, xmm9
+       vinsertps xmm0, xmm1, xmm0, 16
+       vaddss   xmm5, xmm7, xmm5
+       vmulss   xmm1, xmm5, xmm4
+       vinsertps xmm0, xmm0, xmm1, 32
+       vsubss   xmm2, xmm6, xmm2
+       vmulss   xmm1, xmm2, xmm4
+       vinsertps xmm0, xmm0, xmm1, 48
+       vmovaps  xmm4, xmm0
+       jmp      SHORT G_M15800_IG07
+						;; size=99 bbWeight=0.50 PerfScore 32.62
+G_M15800_IG06:
+       vaddss   xmm9, xmm8, dword ptr [reloc @RWD00]
+       vsubss   xmm0, xmm9, xmm0
+       vsubss   xmm0, xmm0, xmm4
+       vsqrtss  xmm0, xmm0
+       vmovss   xmm9, dword ptr [reloc @RWD04]
+       vdivss   xmm4, xmm9, xmm0
+       vaddss   xmm2, xmm6, xmm2
+       vmulss   xmm2, xmm2, xmm4
+       vinsertps xmm2, xmm2, xmm2, 14
+       vaddss   xmm5, xmm7, xmm5
+       vmulss   xmm5, xmm5, xmm4
+       vinsertps xmm2, xmm2, xmm5, 16
+       vmulss   xmm0, xmm0, xmm9
+       vinsertps xmm0, xmm2, xmm0, 32
+       vsubss   xmm1, xmm1, xmm3
+       vmulss   xmm1, xmm1, xmm4
+       vinsertps xmm0, xmm0, xmm1, 48
+       vmovaps  xmm4, xmm0
+						;; size=89 bbWeight=0.50 PerfScore 30.12
 G_M15800_IG07:
-       vmovups  xmmword ptr [rcx], xmm1
+       vmovups  xmmword ptr [rcx], xmm4
        mov      rax, rcx
 						;; size=7 bbWeight=1 PerfScore 2.25
 G_M15800_IG08:
+       vmovaps  xmm6, xmmword ptr [rsp+40H]
+       vmovaps  xmm7, xmmword ptr [rsp+30H]
+       vmovaps  xmm8, xmmword ptr [rsp+20H]
+       vmovaps  xmm9, xmmword ptr [rsp+10H]
+       vmovaps  xmm10, xmmword ptr [rsp]
+       add      rsp, 88
        ret      
-						;; size=1 bbWeight=1 PerfScore 1.00
+						;; size=34 bbWeight=1 PerfScore 21.25
 RWD00  	dd	3F800000h		;         1
 RWD04  	dd	3F000000h		;       0.5
 
 
-; Total bytes of code 477, prolog size 3, PerfScore 239.91, instruction count 100, allocated bytes for code 477 (MethodHash=096ec247) for method System.Numerics.Quaternion:CreateFromRotationMatrix(System.Numerics.Matrix4x4):System.Numerics.Quaternion
+; Total bytes of code 520, prolog size 36, PerfScore 256.96, instruction count 108, allocated bytes for code 520 (MethodHash=096ec247) for method System.Numerics.Quaternion:CreateFromRotationMatrix(System.Numerics.Matrix4x4):System.Numerics.Quaternion

We decide to promote all the fields of the matrix because we see each of them as having multiple accesses, and we only need to load them once (since they come from a parameter):

Accesses for V01
  float @ 000
    #:                             6
    # assigned from:               0
    # assigned to:                 0
    # as call arg:                 0
    # as implicit by-ref call arg: 0
    # as on-stack call arg:        0
    # as retbuf:                   0
    # as returned value:           0

  float @ 004
    #:                             4
    # assigned from:               0
    # assigned to:                 0
    # as call arg:                 0
    # as implicit by-ref call arg: 0
    # as on-stack call arg:        0
    # as retbuf:                   0
    # as returned value:           0

  float @ 008
    #:                             4
    # assigned from:               0
    # assigned to:                 0
    # as call arg:                 0
    # as implicit by-ref call arg: 0
    # as on-stack call arg:        0
    # as retbuf:                   0
    # as returned value:           0

  float @ 016
    #:                             4
    # assigned from:               0
    # assigned to:                 0
    # as call arg:                 0
    # as implicit by-ref call arg: 0
    # as on-stack call arg:        0
    # as retbuf:                   0
    # as returned value:           0

  float @ 020
    #:                             6
    # assigned from:               0
    # assigned to:                 0
    # as call arg:                 0
    # as implicit by-ref call arg: 0
    # as on-stack call arg:        0
    # as retbuf:                   0
    # as returned value:           0

  float @ 024
    #:                             4
    # assigned from:               0
    # assigned to:                 0
    # as call arg:                 0
    # as implicit by-ref call arg: 0
    # as on-stack call arg:        0
    # as retbuf:                   0
    # as returned value:           0

  float @ 032
    #:                             4
    # assigned from:               0
    # assigned to:                 0
    # as call arg:                 0
    # as implicit by-ref call arg: 0
    # as on-stack call arg:        0
    # as retbuf:                   0
    # as returned value:           0

  float @ 036
    #:                             4
    # assigned from:               0
    # assigned to:                 0
    # as call arg:                 0
    # as implicit by-ref call arg: 0
    # as on-stack call arg:        0
    # as retbuf:                   0
    # as returned value:           0

  float @ 040
    #:                             6
    # assigned from:               0
    # assigned to:                 0
    # as call arg:                 0
    # as implicit by-ref call arg: 0
    # as on-stack call arg:        0
    # as retbuf:                   0
    # as returned value:           0

Evaluating access float @ 000
  Write-back cost: 50
  # write backs: 0
  # read backs: 1
  Cost with: 200
  Cost without: 450
  Promoting replacement

lvaGrabTemp returning 12 (V12 tmp1) (a long lifetime temp) called for V01.[000..004).
Evaluating access float @ 004
  Write-back cost: 50
  # write backs: 0
  # read backs: 1
  Cost with: 150
  Cost without: 300
  Promoting replacement

lvaGrabTemp returning 13 (V13 tmp2) (a long lifetime temp) called for V01.[004..008).
Evaluating access float @ 008
  Write-back cost: 50
  # write backs: 0
  # read backs: 1
  Cost with: 150
  Cost without: 300
  Promoting replacement

lvaGrabTemp returning 14 (V14 tmp3) (a long lifetime temp) called for V01.[008..012).
Evaluating access float @ 016
  Write-back cost: 50
  # write backs: 0
  # read backs: 1
  Cost with: 150
  Cost without: 300
  Promoting replacement

lvaGrabTemp returning 15 (V15 tmp4) (a long lifetime temp) called for V01.[016..020).
Evaluating access float @ 020
  Write-back cost: 50
  # write backs: 0
  # read backs: 1
  Cost with: 200
  Cost without: 450
  Promoting replacement

lvaGrabTemp returning 16 (V16 tmp5) (a long lifetime temp) called for V01.[020..024).
Evaluating access float @ 024
  Write-back cost: 50
  # write backs: 0
  # read backs: 1
  Cost with: 150
  Cost without: 300
  Promoting replacement

lvaGrabTemp returning 17 (V17 tmp6) (a long lifetime temp) called for V01.[024..028).
Evaluating access float @ 032
  Write-back cost: 50
  # write backs: 0
  # read backs: 1
  Cost with: 150
  Cost without: 300
  Promoting replacement

lvaGrabTemp returning 18 (V18 tmp7) (a long lifetime temp) called for V01.[032..036).
Evaluating access float @ 036
  Write-back cost: 50
  # write backs: 0
  # read backs: 1
  Cost with: 150
  Cost without: 300
  Promoting replacement

lvaGrabTemp returning 19 (V19 tmp8) (a long lifetime temp) called for V01.[036..040).
Evaluating access float @ 040
  Write-back cost: 50
  # write backs: 0
  # read backs: 1
  Cost with: 200
  Cost without: 450
  Promoting replacement

lvaGrabTemp returning 20 (V20 tmp9) (a long lifetime temp) called for V01.[040..044).
V01 promoted with 9 replacements
  [000..004) promoted as float V12
  [004..008) promoted as float V13
  [008..012) promoted as float V14
  [016..020) promoted as float V15
  [020..024) promoted as float V16
  [024..028) promoted as float V17
  [032..036) promoted as float V18
  [036..040) promoted as float V19
  [040..044) promoted as float V20

It still ends up being a size regression because we need to save a bunch of callee-saved registers. In addition, there is a large up-front cost in loading all the fields into registers, but after that we do not need any more memory accesses.
However, the function is the following:

/// <summary>Creates a quaternion from the specified rotation matrix.</summary>
/// <param name="matrix">The rotation matrix.</param>
/// <returns>The newly created quaternion.</returns>
public static Quaternion CreateFromRotationMatrix(Matrix4x4 matrix)
{
    float trace = matrix.M11 + matrix.M22 + matrix.M33;
    Quaternion q = default;
    if (trace > 0.0f)
    {
        float s = MathF.Sqrt(trace + 1.0f);
        q.W = s * 0.5f;
        s = 0.5f / s;
        q.X = (matrix.M23 - matrix.M32) * s;
        q.Y = (matrix.M31 - matrix.M13) * s;
        q.Z = (matrix.M12 - matrix.M21) * s;
    }
    else
    {
        if (matrix.M11 >= matrix.M22 && matrix.M11 >= matrix.M33)
        {
            float s = MathF.Sqrt(1.0f + matrix.M11 - matrix.M22 - matrix.M33);
            float invS = 0.5f / s;
            q.X = 0.5f * s;
            q.Y = (matrix.M12 + matrix.M21) * invS;
            q.Z = (matrix.M13 + matrix.M31) * invS;
            q.W = (matrix.M23 - matrix.M32) * invS;
        }
        else if (matrix.M22 > matrix.M33)
        {
            float s = MathF.Sqrt(1.0f + matrix.M22 - matrix.M11 - matrix.M33);
            float invS = 0.5f / s;
            q.X = (matrix.M21 + matrix.M12) * invS;
            q.Y = 0.5f * s;
            q.Z = (matrix.M32 + matrix.M23) * invS;
            q.W = (matrix.M31 - matrix.M13) * invS;
        }
        else
        {
            float s = MathF.Sqrt(1.0f + matrix.M33 - matrix.M11 - matrix.M22);
            float invS = 0.5f / s;
            q.X = (matrix.M31 + matrix.M13) * invS;
            q.Y = (matrix.M32 + matrix.M23) * invS;
            q.Z = 0.5f * s;
            q.W = (matrix.M12 - matrix.M21) * invS;
        }
    }
    return q;
}

So in reality each field is only accessed once per execution anyway, and it is best to leave the struct in memory so that each field can be loaded (and contained) right at its use:

| Method                   | Job        | EnvironmentVariables           |     Mean |     Error |    StdDev | Ratio |
|--------------------------|------------|--------------------------------|---------:|----------:|----------:|------:|
| CreateFromRotationMatrix | Job-RMJBCX | Empty                          | 4.305 ns | 0.0170 ns | 0.0150 ns |  1.00 |
| CreateFromRotationMatrix | Job-BZLCPW | DOTNET_JitNewStructPromotion=0 | 3.510 ns | 0.0158 ns | 0.0140 ns |  0.82 |

My current thinking is to base things more on PGO, and to piggyback on @AndyAyersMS's synthetic PGO in my experiments. That is, make the logic much more conservative about introducing replacements, so that we only introduce them when we are quite sure that an execution of the function accesses the fields many times.

@jakobbotsch
Member Author

jakobbotsch commented Mar 16, 2023

Reworked things to use BB weights and am now testing with DOTNET_JitSynthesizeCounts=1 (not sure how much of a workable state this is in yet, though).
Looking at benchmarks for win-x64, the impact seems to be very small and still overall a size regression:

[15:35:30] 110 contexts with diffs (29 improvements, 70 regressions, 11 same size)
[15:35:30]   -689/+6,564 bytes

PerfScore is a bit better:

[15:35:48] 104 total methods with Perf Score differences (66 improved, 38 regressed).

For aspnet we have:

[15:41:55] 524 contexts with diffs (195 improvements, 298 regressions, 31 same size)
[15:41:55]   -4,605/+9,335 bytes

[15:42:29] 500 total methods with Perf Score differences (377 improved, 123 regressed).

Accurately predicting things like new spills and register pressure when creating the new locals this early in the JIT is bound to be tough, but we should be able to do better than this.

For the example in #71510 we get the following diff in Test with the pass turned on (and a small fix in the importer):

 ; 0 inlinees with PGO data; 3 single block inlinees; 1 inlinees without PGO data
 ; Final local variable assignments
 ;
-;  V00 arg0         [V00,T08] (  3,  6   )   byref  ->  rcx         single-def
-;  V01 arg1         [V01,T10] (  3,  3   )     ref  ->  rdx         class-hnd single-def
-;  V02 loc0         [V02    ] (  7,437.00)  struct (32) [rsp+48H]   do-not-enreg[XSF] addr-exposed ld-addr-op
+;  V00 arg0         [V00,T11] (  3,  6   )   byref  ->  rcx         single-def
+;  V01 arg1         [V01,T12] (  3,  3   )     ref  ->  rdx         class-hnd single-def
+;* V02 loc0         [V02    ] (  0,  0   )  struct (32) zero-ref    do-not-enreg[SF] ld-addr-op
 ;* V03 loc1         [V03    ] (  0,  0   )  struct (32) zero-ref    do-not-enreg[S] ld-addr-op
 ;  V04 OutArgs      [V04    ] (  1,  1   )  struct (32) [rsp+00H]   do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
-;  V05 tmp1         [V05,T09] (  3,  6   )  struct (32) [rsp+28H]   do-not-enreg[SF] must-init ld-addr-op "NewObj constructor temp"
-;  V06 tmp2         [V06,T06] (  3,216.00)     ref  ->  rax         class-hnd exact "Single-def Box Helper"
-;* V07 tmp3         [V07,T07] (  0,  0   )    bool  ->  zero-ref    "Inline return value spill temp"
-;  V08 tmp4         [V08,T03] (  4,320.00)     int  ->  rcx         "Inline stloc first use temp"
-;  V09 tmp5         [V09,T00] (  3,480.00)     ref  ->  rax         class-hnd "impAppendStmt"
-;  V10 tmp6         [V10,T02] (  3,480.00)     int  ->  rcx         "Span.get_Item index"
-;  V11 tmp7         [V11,T01] (  3,480.00)   byref  ->  rax         "Span.get_Item ptrToSpan"
-;* V12 tmp8         [V12    ] (  0,  0   )   byref  ->  zero-ref    V14._reference(offs=0x00) P-INDEP "field V00._reference (fldOffset=0x0)"
-;* V13 tmp9         [V13    ] (  0,  0   )     int  ->  zero-ref    V14._length(offs=0x08) P-INDEP "field V00._length (fldOffset=0x8)"
-;* V14 tmp10        [V14    ] (  0,  0   )  struct (16) zero-ref    "Promoted implicit byref"
-;  V15 cse0         [V15,T05] (  3,240.00)     int  ->  rcx         "CSE - aggressive"
-;  V16 cse1         [V16,T04] (  3,240.00)     ref  ->  rax         "CSE - aggressive"
+;  V05 tmp1         [V05,T10] (  5, 10   )  struct (32) [rsp+28H]   do-not-enreg[SF] must-init ld-addr-op "NewObj constructor temp"
+;  V06 tmp2         [V06,T03] (  3,216.00)     ref  ->  rax         class-hnd exact "Single-def Box Helper"
+;* V07 tmp3         [V07,T04] (  0,  0   )    bool  ->  zero-ref    "Inline return value spill temp"
+;  V08 tmp4         [V08,T02] (  3,240.00)     int  ->  rdi         "Inline stloc first use temp"
+;* V09 tmp5         [V09    ] (  0,  0   )     ref  ->  zero-ref    class-hnd "impAppendStmt"
+;  V10 tmp6         [V10,T00] (  3,480.00)     int  ->  rcx         "Span.get_Item index"
+;* V11 tmp7         [V11    ] (  0,  0   )   byref  ->  zero-ref    V18._reference(offs=0x00) P-INDEP "field V00._reference (fldOffset=0x0)"
+;* V12 tmp8         [V12    ] (  0,  0   )     int  ->  zero-ref    V18._length(offs=0x08) P-INDEP "field V00._length (fldOffset=0x8)"
+;  V13 tmp9         [V13,T06] (  3, 82.00)     ref  ->  rsi         single-def "V02.[000..008)"
+;  V14 tmp10        [V14,T05] (  2,116.00)     int  ->  rdi         "V02.[008..012)"
+;  V15 tmp11        [V15,T01] (  4,241.00)     int  ->  registers   "V02.[012..016)"
+;  V16 tmp12        [V16,T07] (  2, 81.00)   byref  ->  rbx         single-def "V02.[016..024)"
+;  V17 tmp13        [V17,T08] (  2, 81.00)     int  ->  rbp         single-def "V02.[024..028)"
+;* V18 tmp14        [V18    ] (  0,  0   )  struct (16) zero-ref    "Promoted implicit byref"
+;  V19 cse0         [V19,T09] (  2, 81.00)     int  ->  r14         "CSE - aggressive"
+;  V20 cse1         [V20,T13] (  2,  2   )     int  ->  rdi         "CSE - aggressive"
 ;
-; Lcl frame size = 104
+; Lcl frame size = 72
 
 G_M27694_IG01:
-       sub      rsp, 104
+       push     r15
+       push     r14
+       push     rdi
+       push     rsi
+       push     rbp
+       push     rbx
+       sub      rsp, 72
        vzeroupper 
        xor      eax, eax
        mov      qword ptr [rsp+28H], rax
        vxorps   xmm4, xmm4
        vmovdqa  xmmword ptr [rsp+30H], xmm4
        mov      qword ptr [rsp+40H], rax
-						;; size=29 bbWeight=1 PerfScore 5.83
+						;; size=37 bbWeight=1 PerfScore 11.83
 G_M27694_IG02:
        vmovdqu  xmm0, xmmword ptr [rcx]
        vmovdqu  xmmword ptr [rsp+38H], xmm0
 						;; size=10 bbWeight=1 PerfScore 5.00
 G_M27694_IG03:
        mov      gword ptr [rsp+28H], rdx
-						;; size=5 bbWeight=1 PerfScore 1.00
+       mov      rsi, gword ptr [rsp+28H]
+       xor      edi, edi
+       mov      rbx, bword ptr [rsp+38H]
+       mov      ebp, dword ptr [rsp+40H]
+       mov      r14d, dword ptr [rsi+08H]
+       jmp      SHORT G_M27694_IG05
+						;; size=27 bbWeight=1 PerfScore 8.25
 G_M27694_IG04:
-       vmovdqu  ymm0, ymmword ptr[rsp+28H]
-       vmovdqu  ymmword ptr[rsp+48H], ymm0
-						;; size=12 bbWeight=1 PerfScore 5.00
-G_M27694_IG05:
-       jmp      SHORT G_M27694_IG07
-						;; size=2 bbWeight=1 PerfScore 2.00
-G_M27694_IG06:
        mov      rcx, 0xD1FFAB1E      ; System.Int32
        call     CORINFO_HELP_NEWSFAST
-       mov      ecx, dword ptr [rsp+50H]
-       mov      dword ptr [rax+08H], ecx
+       mov      dword ptr [rax+08H], edi
        mov      rcx, rax
        call     [System.Console:WriteLine(System.Object)]
-						;; size=31 bbWeight=36.00 PerfScore 234.00
+       mov      edi, r15d
+						;; size=30 bbWeight=36.00 PerfScore 207.00
+G_M27694_IG05:
+       cmp      r14d, edi
+       jle      SHORT G_M27694_IG07
+						;; size=5 bbWeight=80.00 PerfScore 100.00
+G_M27694_IG06:
+       lea      ecx, [rdi+01H]
+       mov      r15d, ecx
+       mov      ecx, edi
+       mov      ecx, dword ptr [rsi+4*rcx+10H]
+       cmp      ecx, ebp
+       jae      SHORT G_M27694_IG08
+       mov      edi, dword ptr [rbx+4*rcx]
+       jmp      SHORT G_M27694_IG04
+						;; size=21 bbWeight=80.00 PerfScore 660.00
 G_M27694_IG07:
-       mov      ecx, dword ptr [rsp+54H]
-       mov      rax, gword ptr [rsp+48H]
-       cmp      ecx, dword ptr [rax+08H]
-       jge      SHORT G_M27694_IG09
-						;; size=14 bbWeight=80.00 PerfScore 480.00
-G_M27694_IG08:
-       lea      edx, [rcx+01H]
-       mov      dword ptr [rsp+54H], edx
-       cmp      ecx, dword ptr [rax+08H]
-       jae      SHORT G_M27694_IG10
-       mov      ecx, ecx
-       mov      ecx, dword ptr [rax+4*rcx+10H]
-       lea      rax, bword ptr [rsp+58H]
-       cmp      ecx, dword ptr [rax+08H]
-       jae      SHORT G_M27694_IG10
-       mov      rax, bword ptr [rax]
-       mov      ecx, dword ptr [rax+4*rcx]
-       mov      dword ptr [rsp+50H], ecx
-       jmp      SHORT G_M27694_IG06
-						;; size=40 bbWeight=80.00 PerfScore 1540.00
-G_M27694_IG09:
-       vzeroupper 
-       add      rsp, 104
+       add      rsp, 72
+       pop      rbx
+       pop      rbp
+       pop      rsi
+       pop      rdi
+       pop      r14
+       pop      r15
        ret      
-						;; size=8 bbWeight=10.00 PerfScore 22.50
-G_M27694_IG10:
+						;; size=13 bbWeight=10.00 PerfScore 42.50
+G_M27694_IG08:
        call     CORINFO_HELP_RNGCHKFAIL
        int3     
 						;; size=6 bbWeight=0 PerfScore 0.00
 
-; Total bytes of code 157, prolog size 29, PerfScore 2311.03, instruction count 41, allocated bytes for code 157 (MethodHash=b4d393d1) for method FilteredSpanEnumerator`1[int]:Test(System.ReadOnlySpan`1[int],int[])
+; Total bytes of code 149, prolog size 37, PerfScore 1049.48, instruction count 48, allocated bytes for code 149 (MethodHash=b4d393d1) for method FilteredSpanEnumerator`1[int]:Test(System.ReadOnlySpan`1[int],int[])

So this looks encouraging at least.

@tannergooding
Member

I wouldn't worry too much about regressions in APIs like CreateFromRotationMatrix.

That's an API that I plan on (and need to) rewrite to use explicit SIMD code. Namely, it needs to use Matrix4x4.AsImpl() so we get 4x Vector4 fields rather than 16x float fields.

It is somewhat representative of what "might" happen for some user code. But it's still a bit different, since Quaternion is a TYP_SIMD16, so we're getting a very weird mixing of SIMD and non-SIMD code which isn't likely to be typical.

Eliminate commas early in block morphing, before the rest of the phase needs to look at them. Do this by extracting their side effects and adding them on top of the returned result instead. This allows us to handle arbitrary COMMAs on the source and destination operands in a general manner.
Otherwise we hit asserts when trying to allocate 0 bytes of memory.
@jakobbotsch jakobbotsch changed the title from "JIT: Prototype a generalized promotion pass" to "JIT: Add a (disabled) prototype for a generalized promotion pass" Apr 4, 2023
Member

@AndyAyersMS AndyAyersMS left a comment


Did a first pass ... overall this looks very nice.

#ifndef _PROMOTION_H
#define _PROMOTION_H

class Compiler;
Member


Doesn't look like we have a policy on this, but it is often nice for headers to compile on their own instead of relying on includers to provide the right prerequisites.

Member Author


Added includes instead of forward declarations.

// It may also be contained. Overall we are going to cost each use of
// an unpromoted local at 6.5 bytes.
// TODO-CQ: We can make much better guesses on what will and won't be contained.
costWithout += access.CountWtd * 6.5;
Member


This seems to be mixing units? Weighted costs reflect (relative) frequency so should be used to scale times, but bytes are size related.

You may want to form both a time score and a size score and then explicitly try and model the tradeoffs between the two.

E.g. for inlining I would generally propose: if the size delta is negative, always transform; else, if size increases, transform only if the ratio of time improvement to size increase is above some threshold.
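
A minimal sketch of that suggested policy (hypothetical names and threshold; this is not existing JIT code):

static bool ShouldTransform(double sizeDelta, double timeImprovement, double threshold)
{
    // The code got smaller (or stayed the same size): always transform.
    if (sizeDelta <= 0)
    {
        return true;
    }

    // Otherwise pay for size only when the speedup per extra byte is large enough.
    return timeImprovement / sizeDelta >= threshold;
}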

Member Author


Initially this was purely a size estimate, but then I switched it to make some use of PGO weights. I didn't really spend any time on the costing after that; it's pretty hard to make good choices when a lot of the basic stuff is still missing.
I will add a todo here -- your suggestion to eventually check both size and perf makes sense. But I found it very hard to predict the size cost (in particular due to prolog/epilog cost and potential spilling), and accurately modelling execution cost is also going to be pretty hard.

Member Author


Actually I can see there is already a TODO above for it:

// TODO-CQ: Tune the following heuristics. Currently they are based on
// x64 code size although using BB weights when available.

Member Author


I added clarification to the todo that this mixing of units does not make sense.

argSize = m_compiler->typGetObjLayout(arg.GetSignatureClassHandle())->GetSize();
}

#ifdef WINDOWS_AMD64_ABI
Member


Leverage getArgTypeForStruct here?

Member Author


I will remove these for now, they are currently unused. The plan was to move ABI information determination back so that it happened before this pass and could be used for it (I guess the implicit byref case might be simple enough here, while the "passed on stack" hint would be harder to determine).

if ((ssize_t)index < 0)
{
    index = ~index;
    if ((index > 0) && replacements[index - 1].Overlaps(offs, size))
Member


How do you know that there aren't any lower indexed entries that overlap?

Member Author


replacements contains only disjoint ranges (this is ensured by EvaluateReplacement).
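
A small sketch of the invariant that makes checking only the preceding entry sufficient (illustrative C#; the Rep type and names are hypothetical stand-ins for the JIT's replacement descriptors):

using System.Collections.Generic;

record struct Rep(int Offset, int Size)
{
    public int End => Offset + Size;
    public bool Overlaps(int offs, int size) => Offset < offs + size && offs < End;
}

static bool PrecedingEntryOverlaps(List<Rep> reps, int offs, int size)
{
    // reps is sorted by Offset and its ranges are pairwise disjoint.
    int index = reps.BinarySearch(new Rep(offs, 0),
        Comparer<Rep>.Create((a, b) => a.Offset.CompareTo(b.Offset)));
    if (index >= 0)
    {
        return true; // an existing range starts exactly at offs
    }

    index = ~index; // insertion point: the first entry with Offset > offs
    // Entries at 'index' and beyond start after offs and are checked separately.
    // Among earlier entries, disjointness means everything before reps[index - 1]
    // ends at or before reps[index - 1] starts, so only the immediate predecessor
    // can reach offs.
    return (index > 0) && reps[index - 1].Overlaps(offs, size);
}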

// lcl - The struct local
// offs - The starting offset of the range in the struct local that needs to be read back from.
// size - The size of the range
// conservative - Whether this is a potentially conservative read back
Member


Is this for the case where you may be able to prove that a reassembled struct won't be modified?

Member Author


This is to make it clear in the jitdump that this is an assignment. Eventually we want to decompose these directly (like block morphing) and avoid this read back in many cases, but the logic is not implemented yet.

jakobbotsch added a commit to jakobbotsch/runtime that referenced this pull request Apr 5, 2023
* Avoid setting GTF_GLOB_REF on GT_FIELD_ADDR nodes
* Avoid setting GTF_GLOB_REF on GT_FIELD nodes off of implicit byrefs.
  This is ok now since implicit byref morphing indiscriminately sets
  GTF_GLOB_REF for these.

These changes are necessary to avoid address exposure in the two user
benchmarks in dotnet#83388.

Fix dotnet#74563
jakobbotsch added a commit to jakobbotsch/runtime that referenced this pull request Apr 5, 2023
Fix issue seen in dotnet#83388 where tail merging ended up adding new
predecessors to the scratch block, making adding more "initialization
IR" impossible for downstream phases.
jakobbotsch added a commit that referenced this pull request Apr 5, 2023
Fix issue seen in #83388 where tail merging ended up adding new
predecessors to the scratch block, making adding more "initialization
IR" impossible for downstream phases.
jakobbotsch added a commit that referenced this pull request Apr 7, 2023
* Avoid setting GTF_GLOB_REF on GT_FIELD_ADDR nodes
* Avoid setting GTF_GLOB_REF on GT_FIELD nodes off of implicit byrefs.
  This is ok now since implicit byref morphing indiscriminately sets
  GTF_GLOB_REF for these.
* Manually clone a "pointer to span" in span intrinsic expansion when it points to a local. Unfortunately this does not fall out from the above since gtClone does not handle FIELD_ADDR, and making it handle this needs some more work.

These changes are necessary to avoid address exposure in the two user
benchmarks in #83388.

Fix #74563
Fix #856
@jakobbotsch
Copy link
Member Author

No diffs now. TP regressions are from the new phase (even with the early exit) and the change in fgNewStmtFromTree.

@AndyAyersMS Anything else you think I need to do here?

Member

@AndyAyersMS AndyAyersMS left a comment


LGTM.

@jakobbotsch jakobbotsch merged commit 388edb6 into dotnet:main Apr 11, 2023
186 checks passed
@jakobbotsch jakobbotsch deleted the generalized-promotion branch April 11, 2023 09:08
@jakobbotsch
Member Author

jakobbotsch commented Apr 25, 2023

@En3Tho, are you ok with me adding your GSeq example to the dotnet/performance micro benchmarks under the MIT license so we can track the perf of it?

@En3Tho
Contributor

En3Tho commented Apr 25, 2023

@jakobbotsch Will be glad if you do so! I think the collection size in that particular benchmark should be increased.

@dotnet dotnet locked as resolved and limited conversation to collaborators May 25, 2023