Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance issues with Range type #11848

Closed
AndyAyersMS opened this issue Jan 18, 2019 · 27 comments
Closed

Performance issues with Range type #11848

AndyAyersMS opened this issue Jan 18, 2019 · 27 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions optimization
Milestone

Comments

@AndyAyersMS
Copy link
Member

Ideally the following two methods should generate similar code.

[MethodImpl(MethodImplOptions.NoInlining)]
static Span<int> TrimFirstLast_OpenCoded(Span<int> s) => s.Slice(1, s.Length - 1);
 
[MethodImpl(MethodImplOptions.NoInlining)]
static Span<int> TrimFirstLast_Range(Span<int> s) => s[Range.Create(1, new Index(1, true))];

Currently the Range-based version runs into a number of issues:

cc @stephentoub

Will fill in more details later.

Marking as 3.0 for now.

category:cq
theme:optimization
skill-level:expert
cost:large

@AndyAyersMS
Copy link
Member Author

Range promotion is blocked here as its Index fields are only 4 bytes.

https://github.com/dotnet/coreclr/blob/b2c6dd4eb427caf7cb852998076b0b1543447a28/src/jit/lclvars.cpp#L2015-L2028

Seems relatively simple to add the right alignment check and allow this case through.

Prospective fix for that and the inlining and IL issues: master...AndyAyersMS:Range

@AndyAyersMS
Copy link
Member Author

Jit-diffs from the extra promotion:

ound 7 files with textual diffs.
PMI Diffs for System.Private.CoreLib.dll, framework assemblies for x64 default jit
Summary:
(Lower is better)
Total bytes of diff: -2421 (-0.01% of base)
    diff is an improvement.
Top file regressions by size (bytes):
          35 : Microsoft.CodeAnalysis.VisualBasic.dasm (0.00% of base)
Top file improvements by size (bytes):
       -1283 : Microsoft.CodeAnalysis.dasm (-0.08% of base)
       -1054 : System.Reflection.Metadata.dasm (-0.25% of base)
         -83 : System.Private.CoreLib.dasm (0.00% of base)
         -25 : Microsoft.CodeAnalysis.CSharp.dasm (0.00% of base)
         -11 : System.Linq.Expressions.dasm (0.00% of base)
6 total files with size differences (5 improved, 1 regressed), 123 unchanged.
Top method regressions by size (bytes):
          37 ( 7.64% of base) : Microsoft.CodeAnalysis.dasm - MetadataWriter:PopulateParamTableRows():this
          36 (10.14% of base) : Microsoft.CodeAnalysis.dasm - PEModule:HasDeprecatedOrObsoleteAttribute(struct,byref):bool:this
          30 ( 3.17% of base) : System.Reflection.Metadata.dasm - ControlFlowBuilder:CopyCodeAndFixupBranches(ref,ref):this
          25 ( 5.97% of base) : Microsoft.CodeAnalysis.dasm - MetadataWriter:GetOrAddDocument(ref,ref):int:this
          22 (11.89% of base) : System.Reflection.Metadata.dasm - MetadataBuilder:AddManifestResource(int,struct,struct,int):struct:this
Top method improvements by size (bytes):
        -104 (-13.42% of base) : System.Reflection.Metadata.dasm - NamespaceCache:MergeDuplicateNamespaces(ref,byref):this
        -101 (-4.78% of base) : Microsoft.CodeAnalysis.dasm - MetadataWriter:SerializeMethodDebugInfo(ref,int,int):this
         -93 (-12.29% of base) : Microsoft.CodeAnalysis.dasm - MetadataWriter:SerializeStateMachineLocalScopes(ref,int):this
         -88 (-11.81% of base) : Microsoft.CodeAnalysis.dasm - MetadataWriter:SerializeEncMethodDebugInformation(ref,int):this
         -66 (-4.21% of base) : Microsoft.CodeAnalysis.dasm - MetadataWriter:PopulateConstantTableRows():this
Top method regressions by size (percentage):
          22 (11.89% of base) : System.Reflection.Metadata.dasm - MetadataBuilder:AddManifestResource(int,struct,struct,int):struct:this
          36 (10.14% of base) : Microsoft.CodeAnalysis.dasm - PEModule:HasDeprecatedOrObsoleteAttribute(struct,byref):bool:this
          37 ( 7.64% of base) : Microsoft.CodeAnalysis.dasm - MetadataWriter:PopulateParamTableRows():this
          15 ( 7.35% of base) : Microsoft.CodeAnalysis.dasm - PEModule:HasInterfaceTypeAttribute(struct,byref):bool:this
          15 ( 7.35% of base) : Microsoft.CodeAnalysis.dasm - PEModule:HasTypeLibTypeAttribute(struct,byref):bool:this
Top method improvements by size (percentage):
         -63 (-65.63% of base) : System.Private.CoreLib.dasm - String:EnumerateRunes():struct:this
         -59 (-32.78% of base) : System.Reflection.Metadata.dasm - MetadataBuilder:AddMemberReference(struct,struct,struct):struct:this
         -59 (-32.78% of base) : System.Reflection.Metadata.dasm - MetadataBuilder:AddCustomDebugInformation(struct,struct,struct):struct:this
         -35 (-32.11% of base) : Microsoft.CodeAnalysis.dasm - MetadataWriter:CreateConstantRow(ref,int):struct:this
         -50 (-31.25% of base) : System.Reflection.Metadata.dasm - MetadataBuilder:AddFieldDefinition(int,struct,struct):struct:this
172 total methods with size differences (138 improved, 34 regressed), 193304 unchanged.

@AndyAyersMS
Copy link
Member Author

Promotion Blockers

As noted above Range is not promotable currently, as the jit will only do very limited forms of "recursive" struct promotion (struct with struct fields). The inner structs must themselves just have one field, that field must be the same size as the inner struct, the inner struct must be pointer sized or smaller, and be suitably aligned. The jit currently skips the alignment check and only recursively promotes for pointer sized cases. The fix (as noted above) is to introduce the alignment check and relax the size restriction.

Inline Blockers

The Index Ctor has enough control flow that it hits the BB complexity limit. The Span indexer that takes a Range generates a lot of IL because a Range contains two struct Index fields and access to those was not factored.

Inlines into 06000002 X:TrimFirstLast_Range(struct):struct
  [1 IL=0003 TR=000004 0600120B] [below ALWAYS_INLINE size] Index:op_Implicit(int):struct
    [0 IL=0002 TR=000061 06001203] [FAILED: too many basic blocks] Index:.ctor(int,bool):this
  [0 IL=0010 TR=000016 06001203] [FAILED: too many basic blocks] Index:.ctor(int,bool):this
  [2 IL=0015 TR=000027 06001498] [below ALWAYS_INLINE size] Range:Create(struct,struct):struct
    [3 IL=0002 TR=000078 06001493] [below ALWAYS_INLINE size] Range:.ctor(struct,struct):this
  [0 IL=0020 TR=000036 0600153F] [FAILED: too many il bytes] Span`1:get_Item(struct):struct:this

We can force inlining here by applying AggressiveInlining but we might as well slim down the indexer by making local copies of the two Range fields.

Enregistration Blockers: Index

With the above applied we get struct promotion but not enregistration. Both are necessary to get good optimization, as the jit does only very limited modelling of locals that can't be enregistered.

For the case of Index the blocker is that the IL produced for the Index constructor places the address of the struct on the evaluation stack across branches. This encounters a limitation in the jit where if an struct address is taken and then not immediately consumed, the jit assumes the worst case and requires the struct to live on the stack.

This can be fixed (for now) by revising the C# sources to discourage CSC from producing this kind of IL -- eg using explicit if/then/else instead of : ?.

Enregistration Blockers: Range

With the above addressed (as in the provisional changes above) we now see the following assembly code. First the hand-coded version:

; Assembly listing for method X:TrimFirstLast_OpenCoded(struct):struct
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 RetBuf       [V00,T01] (  5,  5   )   byref  ->  rcx        
;  V01 arg0         [V01,T00] (  4,  8   )   byref  ->  rdx         ld-addr-op
;  V02 OutArgs      [V02    ] (  1,  1   )  lclBlk (32) [rsp+0x00]   "OutgoingArgSpace"
;  V03 tmp1         [V03,T02] (  3,  6   )     int  ->   r8         "Inlining Arg"
;* V04 tmp2         [V04    ] (  0,  0   )  struct (16) zero-ref    "NewObj constructor temp"
;* V05 tmp3         [V05    ] (  0,  0   )   byref  ->  zero-ref    "Inlining Arg"
;* V06 tmp4         [V06    ] (  0,  0   )   byref  ->  zero-ref    "Inlining Arg"
;* V07 tmp5         [V07    ] (  0,  0   )  struct ( 8) zero-ref    "NewObj constructor temp"
;  V08 tmp6         [V08,T04] (  2,  2   )   byref  ->  rax         V13._pointer(offs=0x00) P-INDEP "field V01._pointer (fldOffset=0x0)"
;  V09 tmp7         [V09,T03] (  3,  3   )     int  ->  rdx         V13._length(offs=0x08) P-INDEP "field V01._length (fldOffset=0x8)"
;  V10 tmp8         [V10,T05] (  2,  2   )   byref  ->  rax         V04._pointer(offs=0x00) P-INDEP "field V04._pointer (fldOffset=0x0)"
;  V11 tmp9         [V11,T07] (  2,  2   )     int  ->   r8         V04._length(offs=0x08) P-INDEP "field V04._length (fldOffset=0x8)"
;  V12 tmp10        [V12,T06] (  2,  2   )   byref  ->  rax         V07._value(offs=0x00) P-INDEP "field V07._value (fldOffset=0x0)"
;* V13 tmp11        [V13    ] (  0,  0   )  struct (16) zero-ref    "Promoted implicit byref"
;
; Lcl frame size = 40

G_M13778_IG01:
       sub      rsp, 40
       nop      

G_M13778_IG02:
       mov      rax, bword ptr [rdx]
       mov      edx, dword ptr [rdx+8]
       lea      r8d, [rdx-1]
       mov      r9d, r8d
       inc      r9
       mov      edx, edx
       cmp      r9, rdx
       ja       SHORT G_M13778_IG05

G_M13778_IG03:
       add      rax, 4
       mov      bword ptr [rcx], rax
       mov      dword ptr [rcx+8], r8d
       mov      rax, rcx

G_M13778_IG04:
       add      rsp, 40
       ret      

G_M13778_IG05:
       call     ThrowHelper:ThrowArgumentOutOfRangeException()
       int3     

and next the Range based version:

; Assembly listing for method X:TrimFirstLast_Range(struct):struct
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 RetBuf       [V00,T01] (  5,  5   )   byref  ->  rcx        
;  V01 arg0         [V01,T00] (  4,  8   )   byref  ->  rdx         ld-addr-op
;  V02 OutArgs      [V02    ] (  1,  1   )  lclBlk (32) [rsp+0x00]   "OutgoingArgSpace"
;* V03 tmp1         [V03    ] (  0,  0   )  struct ( 8) zero-ref    "NewObj constructor temp"
;* V04 tmp2         [V04    ] (  0,  0   )  struct ( 8) zero-ref    "impAppendStmt"
;  V05 tmp3         [V05    ] (  3,  6   )  struct ( 8) [rsp+0x30]   do-not-enreg[SF] "struct address for call/obj"
;* V06 tmp4         [V06    ] (  0,  0   )  struct ( 8) zero-ref    "NewObj constructor temp"
;* V07 tmp5         [V07    ] (  0,  0   )  struct ( 8) zero-ref    "Inlining Arg"
;* V08 tmp6         [V08    ] (  0,  0   )  struct ( 8) zero-ref    "Inlining Arg"
;  V09 tmp7         [V09    ] (  5, 10   )  struct ( 8) [rsp+0x28]   do-not-enreg[SF] "NewObj constructor temp"
;* V10 tmp8         [V10    ] (  0,  0   )  struct ( 8) zero-ref    "Inlining Arg"
;* V11 tmp9         [V11    ] (  0,  0   )  struct ( 8) zero-ref    "Inlining Arg"
;* V12 tmp10        [V12    ] (  0,  0   )  struct ( 8) zero-ref    ld-addr-op "Inlining Arg"
;* V13 tmp11        [V13    ] (  0,  0   )  struct ( 8) zero-ref    ld-addr-op "Inline stloc first use temp"
;* V14 tmp12        [V14,T26] (  0,  0   )     int  ->  zero-ref    "impAppendStmt"
;  V15 tmp13        [V15,T20] (  3,  1   )     int  ->   r8        
;  V16 tmp14        [V16,T03] (  4,  3.50)     int  ->   r8         "Inline stloc first use temp"
;* V17 tmp15        [V17    ] (  0,  0   )  struct ( 8) zero-ref    ld-addr-op "Inline stloc first use temp"
;* V18 tmp16        [V18,T27] (  0,  0   )     int  ->  zero-ref    "impAppendStmt"
;  V19 tmp17        [V19,T18] (  3,  1.50)     int  ->   r9        
;* V20 tmp18        [V20    ] (  0,  0   )     int  ->  zero-ref    "Inline stloc first use temp"
;  V21 tmp19        [V21,T22] (  2,  0.50)     int  ->   r8         "Inline return value spill temp"
;  V22 tmp20        [V22,T23] (  2,  0.50)     int  ->   r8         "Inline return value spill temp"
;  V23 tmp21        [V23,T24] (  2,  0.50)     int  ->   r9         "Inline return value spill temp"
;  V24 tmp22        [V24,T25] (  2,  0.50)     int  ->   r9         "Inline return value spill temp"
;  V25 tmp23        [V25,T02] (  3,  6   )     int  ->   r9         "Inlining Arg"
;* V26 tmp24        [V26    ] (  0,  0   )  struct (16) zero-ref    "NewObj constructor temp"
;* V27 tmp25        [V27    ] (  0,  0   )   byref  ->  zero-ref    "Inlining Arg"
;* V28 tmp26        [V28    ] (  0,  0   )   byref  ->  zero-ref    "Inlining Arg"
;* V29 tmp27        [V29    ] (  0,  0   )  struct ( 8) zero-ref    "NewObj constructor temp"
;  V30 tmp28        [V30,T06] (  2,  2   )   byref  ->  rax         V50._pointer(offs=0x00) P-INDEP "field V01._pointer (fldOffset=0x0)"
;  V31 tmp29        [V31,T04] (  4,  2.50)     int  ->  rdx         V50._length(offs=0x08) P-INDEP "field V01._length (fldOffset=0x8)"
;  V32 tmp30        [V32,T09] (  2,  2   )     int  ->   r9         V03._value(offs=0x00) P-INDEP "field V03._value (fldOffset=0x0)"
;  V33 tmp31        [V33,T10] (  2,  2   )     int  ->   r8         V04._value(offs=0x00) P-INDEP "field V04._value (fldOffset=0x0)"
;  V34 tmp32        [V34    ] (  2,  3   )     int  ->  [rsp+0x30]   do-not-enreg[] V05.<Start>k__BackingField(offs=0x00) P-DEP "field V05.<Start>k__BackingField (fldOffset=0x0)"
;  V35 tmp33        [V35    ] (  2,  3   )     int  ->  [rsp+0x34]   do-not-enreg[] V05.<End>k__BackingField(offs=0x04) P-DEP "field V05.<End>k__BackingField (fldOffset=0x4)"
;* V36 tmp34        [V36,T21] (  0,  0   )     int  ->  zero-ref    V06._value(offs=0x00) P-INDEP "field V06._value (fldOffset=0x0)"
;  V37 tmp35        [V37,T11] (  2,  2   )     int  ->   r8         V07._value(offs=0x00) P-INDEP "field V07._value (fldOffset=0x0)"
;  V38 tmp36        [V38,T12] (  2,  2   )     int  ->   r9         V08._value(offs=0x00) P-INDEP "field V08._value (fldOffset=0x0)"
;  V39 tmp37        [V39    ] (  3,  4   )     int  ->  [rsp+0x28]   do-not-enreg[] V09.<Start>k__BackingField(offs=0x00) P-DEP "field V09.<Start>k__BackingField (fldOffset=0x0)"
;  V40 tmp38        [V40    ] (  3,  4   )     int  ->  [rsp+0x2C]   do-not-enreg[] V09.<End>k__BackingField(offs=0x04) P-DEP "field V09.<End>k__BackingField (fldOffset=0x4)"
;  V41 tmp39        [V41,T13] (  2,  2   )     int  ->   r8         V10._value(offs=0x00) P-INDEP "field V10._value (fldOffset=0x0)"
;  V42 tmp40        [V42,T14] (  2,  2   )     int  ->   r9         V11._value(offs=0x00) P-INDEP "field V11._value (fldOffset=0x0)"
;  V43 tmp41        [V43,T15] (  2,  2   )     int  ->   r8         V12.<Start>k__BackingField(offs=0x00) P-INDEP "field V12.<Start>k__BackingField (fldOffset=0x0)"
;  V44 tmp42        [V44,T19] (  2,  1.50)     int  ->   r9         V12.<End>k__BackingField(offs=0x04) P-INDEP "field V12.<End>k__BackingField (fldOffset=0x4)"
;  V45 tmp43        [V45,T05] (  4,  2.50)     int  ->   r8         V13._value(offs=0x00) P-INDEP "field V13._value (fldOffset=0x0)"
;  V46 tmp44        [V46,T17] (  4,  1.50)     int  ->   r9         V17._value(offs=0x00) P-INDEP "field V17._value (fldOffset=0x0)"
;  V47 tmp45        [V47,T07] (  2,  2   )   byref  ->  rax         V26._pointer(offs=0x00) P-INDEP "field V26._pointer (fldOffset=0x0)"
;  V48 tmp46        [V48,T16] (  2,  2   )     int  ->   r9         V26._length(offs=0x08) P-INDEP "field V26._length (fldOffset=0x8)"
;  V49 tmp47        [V49,T08] (  2,  2   )   byref  ->  rax         V29._value(offs=0x00) P-INDEP "field V29._value (fldOffset=0x0)"
;* V50 tmp48        [V50    ] (  0,  0   )  struct (16) zero-ref    "Promoted implicit byref"
;
; Lcl frame size = 56

G_M36545_IG01:
       sub      rsp, 56
       nop      

G_M36545_IG02:
       mov      rax, bword ptr [rdx]
       mov      edx, dword ptr [rdx+8]
       mov      r8d, 1
       mov      r9d, -2
       xor      r10d, r10d
       mov      dword ptr [rsp+28H], r10d
       mov      dword ptr [rsp+2CH], r10d
       mov      dword ptr [rsp+28H], r8d
       mov      dword ptr [rsp+2CH], r9d
       mov      r8, qword ptr [rsp+28H]
       mov      qword ptr [rsp+30H], r8
       mov      r8d, dword ptr [rsp+30H]
       mov      r9d, dword ptr [rsp+34H]
       test     r8d, r8d
       jl       SHORT G_M36545_IG03
       jmp      SHORT G_M36545_IG04

G_M36545_IG03:
       not      r8d
       mov      r10d, edx
       sub      r10d, r8d
       mov      r8d, r10d

G_M36545_IG04:
       test     r9d, r9d
       jl       SHORT G_M36545_IG05
       jmp      SHORT G_M36545_IG06

G_M36545_IG05:
       not      r9d
       mov      r10d, edx
       sub      r10d, r9d
       mov      r9d, r10d

G_M36545_IG06:
       sub      r9d, r8d
       mov      r10d, r8d
       mov      r11d, r9d
       add      r10, r11
       mov      edx, edx
       cmp      r10, rdx
       ja       SHORT G_M36545_IG09

G_M36545_IG07:
       movsxd   rdx, r8d
       lea      rax, bword ptr [rax+4*rdx]
       mov      bword ptr [rcx], rax
       mov      dword ptr [rcx+8], r9d
       mov      rax, rcx

G_M36545_IG08:
       add      rsp, 56
       ret      

G_M36545_IG09:
       call     ThrowHelper:ThrowArgumentOutOfRangeException()
       int3     

; Total bytes of code 151, prolog size 5 for method X:TrimFirstLast_Range(struct):struct

The key here is that instances of the Range type are not enregisterable, because Ranges are returned in a single register, and we model this very early in the jit, and this interferes with the enregistration analysis.

;  V05 tmp3         [V05    ] (  3,  6   )  struct ( 8) [rsp+0x30]   do-not-enreg[SF] "struct address for call/obj"
;  V09 tmp7         [V09    ] (  5, 10   )  struct ( 8) [rsp+0x28]   do-not-enreg[SF] "NewObj constructor temp"
LocalAddressVisitor visiting statement:
               [000041] ------------              *  STMT      void  (IL   ???...  ???)
               [000121] ------------              |  /--*  LCL_FLD   long   V09 tmp7         [+0]
                                                  |  /--*    int    V09.<Start>k__BackingField (offs=0x00) -> V43 tmp41        
                                                  |  /--*    int    V09.<End>k__BackingField (offs=0x04) -> V44 tmp42        
               [000040] -AC---------              \--*  ASG       long  
               [000039] *-----------                 \--*  IND       long  
               [000038] L-----------                    \--*  ADDR      byref 
               [000037] ------------                       \--*  LCL_VAR   struct(P) V05 tmp3         
                                                           \--*    int    V05.<Start>k__BackingField (offs=0x00) -> V38 tmp36        
                                                           \--*    int    V05.<End>k__BackingField (offs=0x04) -> V39 tmp37        

Local V09 should not be enregistered because: was accessed as a local field

Still thinking about the best way to cope with this...

@mikedn
Copy link
Contributor

mikedn commented Jan 22, 2019

Local V09 should not be enregistered because: was accessed as a local field

This seems easy to fix, that assignment could be transformed into two individual field assignments. I experimented with this kind of stuff in LocalAddressVisitor in the past.

@AndyAyersMS
Copy link
Member Author

FWIW That same construct knocks out V05 later on in morph... so fixing it early would unblock both DNERs.

fgMorphTree BB20, stmt 14 (before)
               [000121] ------------              /--*  LCL_FLD   long   V09 tmp7         [+0]
                                                  /--*    int    V09.<Start>k__BackingField (offs=0x00) -> V43 tmp41        
                                                  /--*    int    V09.<End>k__BackingField (offs=0x04) -> V44 tmp42        
               [000040] -AC---------              *  ASG       long  
               [000039] *-----------              \--*  IND       long  
               [000038] L-----------                 \--*  ADDR      byref 
               [000037] ------------                    \--*  LCL_VAR   struct(P) V05 tmp3         
                                                        \--*    int    V05.<Start>k__BackingField (offs=0x00) -> V38 tmp36        
                                                        \--*    int    V05.<End>k__BackingField (offs=0x04) -> V39 tmp37        

Local V05 should not be enregistered because: was accessed as a local field

@mikedn
Copy link
Contributor

mikedn commented Jan 22, 2019

That same construct knocks out V05 later on in morph...

Yeah, the promotion story is pretty weird. Some happens in LocalAddressVisitor, some in morph and neither is 100% right. I'm 99.99% convinced that it shouldn't happen in LocalAddressVisitor at all (see #11455) and I'm relatively sure that morph isn't a good place for promotion either, because it seems to interfere with local assertion propagation.

Fixing it properly needs quite a bit of work but for now we could probably do something in LocalAddressVisitor, if it turns out that it helps.

@mikedn
Copy link
Contributor

mikedn commented Jan 22, 2019

Tried something in a hurry: mikedn/coreclr@aa2ab99#diff-bd2d23afdb1936d1982f699f760429d2

It generates this:

G_M39430_IG01:
       4883EC28             sub      rsp, 40
       90                   nop
G_M39430_IG02:
       488B02               mov      rax, bword ptr [rdx]
       8B5208               mov      edx, dword ptr [rdx+8]
       41B801000000         mov      r8d, 1
       41B9FEFFFFFF         mov      r9d, -2
       4533D2               xor      r10d, r10d
       4489542420           mov      dword ptr [rsp+20H], r10d
       4489542424           mov      dword ptr [rsp+24H], r10d
       4489442420           mov      dword ptr [rsp+20H], r8d
       44894C2424           mov      dword ptr [rsp+24H], r9d
       41B801000000         mov      r8d, 1
       41B901000000         mov      r9d, 1
       448BD2               mov      r10d, edx
       452BD1               sub      r10d, r9d
       458BCA               mov      r9d, r10d
       452BC8               sub      r9d, r8d
       458BD0               mov      r10d, r8d
       458BD9               mov      r11d, r9d
       4D03D3               add      r10, r11
       8BD2                 mov      edx, edx
       4C3BD2               cmp      r10, rdx
       7716                 ja       SHORT G_M39430_IG05

G_M39430_IG03:
       4963D0               movsxd   rdx, r8d
       488D0490             lea      rax, bword ptr [rax+4*rdx]
       488901               mov      bword ptr [rcx], rax
       44894908             mov      dword ptr [rcx+8], r9d
       488BC1               mov      rax, rcx
G_M39430_IG04:
       4883C428             add      rsp, 40
       C3                   ret
G_M39430_IG05:
       E8D797FEFF           call     ThrowHelper:ThrowArgumentOutOfRangeException()
       CC                   int3

There's still bad stuff to investigate but I'm not sure if I'll have time for this earlier than Friday.

@mikedn
Copy link
Contributor

mikedn commented Jan 23, 2019

Those stack stores are probably caused by variables still being not enregistered. I'll have to check but I'm pretty sure that fgMorphLocalField will mark variables as "do not enregister" when encountering such GT_LCL_FLD nodes. That happens in preorder so the change I'm doing in postorder is only partially useful. The way this works now is a leftover from the old local address taken code, it should be changed in the future because it doesn't make a lot of sense. Perhaps the future is now...

@mikedn
Copy link
Contributor

mikedn commented Jan 25, 2019

Yeah, fgMorphLocalField marks the variable DNER before I transform the LCL_FLD node. With that out of the way I get this:

G_M55904_IG01:
       4883EC28             sub      rsp, 40
       90                   nop
G_M55904_IG02:
       488B02               mov      rax, bword ptr [rdx]
       8B5208               mov      edx, dword ptr [rdx+8]
       41B801000000         mov      r8d, 1
       41B901000000         mov      r9d, 1
       448BD2               mov      r10d, edx
       452BD1               sub      r10d, r9d
       458BCA               mov      r9d, r10d
       452BC8               sub      r9d, r8d
       458BD0               mov      r10d, r8d
       458BD9               mov      r11d, r9d
       4D03D3               add      r10, r11
       8BD2                 mov      edx, edx
       4C3BD2               cmp      r10, rdx
       7716                 ja       SHORT G_M55904_IG05

G_M55904_IG03:
       4963D0               movsxd   rdx, r8d
       488D0490             lea      rax, bword ptr [rax+4*rdx]
       488901               mov      bword ptr [rcx], rax
       44894908             mov      dword ptr [rcx+8], r9d
       488BC1               mov      rax, rcx
G_M55904_IG04:
       4883C428             add      rsp, 40
       C3                   ret
G_M55904_IG05:
       E89A84FDFF           call     ThrowHelper:ThrowArgumentOutOfRangeException()
       CC                   int3

Hrm, something's still not right, those 1 constants are fishy...

@mikedn
Copy link
Contributor

mikedn commented Jan 26, 2019

Eh, those constants aren't propagated because there are PHIs in the way. This is going to be fun.

@mikedn
Copy link
Contributor

mikedn commented Jan 26, 2019

After hacking VN to do some kind of SCCP I get this:

G_M55904_IG01:
       4883EC28             sub      rsp, 40
       90                   nop
G_M55904_IG02:
       488B02               mov      rax, bword ptr [rdx]
       8B5208               mov      edx, dword ptr [rdx+8]
       448BC2               mov      r8d, edx
       4183E801             sub      r8d, 1
       4183E801             sub      r8d, 1
       458BC8               mov      r9d, r8d
       49FFC1               inc      r9
       8BD2                 mov      edx, edx
       4C3BCA               cmp      r9, rdx
       7713                 ja       SHORT G_M55904_IG05
G_M55904_IG03:
       4883C004             add      rax, 4
       488901               mov      bword ptr [rcx], rax
       44894108             mov      dword ptr [rcx+8], r8d
       488BC1               mov      rax, rcx
G_M55904_IG04:
       4883C428             add      rsp, 40
       C3                   ret
G_M55904_IG05:
       E86DADFEFF           call     ThrowHelper:ThrowArgumentOutOfRangeException()
       CC                   int3

Close. Except for the 2 consecutive x -= 1;. It looks to me that the 2 C# versions aren't actually equivalent. The range version does not include the last element so the non-range version should be

[MethodImpl(MethodImplOptions.NoInlining)]
static Span<int> TrimFirstLast_OpenCoded(Span<int> s) => s.Slice(1, s.Length - 2);

Well, I suppose we could also do it the other way around - change the range version to match the non-range version. But since we have this interesting x -= 1; case let's continue with it.

I don't think we have anything in the JIT today that can deal with this. I suppose morph might do it, if the 2 subtract nodes would be in the same tree. They are not. So... now we need forward substitution too.

@AndyAyersMS
Copy link
Member Author

Thanks for foraging ahead on this. Always happy to see SCCP make an appearance.

@mikedn
Copy link
Contributor

mikedn commented Jan 28, 2019

Always happy to see SCCP make an appearance.

Eh, to be clear about what I did - it should be similar in spirit to SCCP but it's a rather different implementation:

  • Added a BBF_DEAD block flag
  • When value numbering a block, check predecessors to see if the block is actually executable. Since predecessors have already been numbered (except in the case of loops) we may now discover that some conditional branches have constant arguments. If it turns out that the block is not executable then mark it with BBF_DEAD.
  • When value numbering PHIs, ignore PHI args from dead predecessors. So if we have 2 predecessors and one says that x is 1 and the other says that x is -1 but it's dead then we conclude that PHI's VN is 1.

It works for this example. It doesn't work in the general case and I've yet to figure out why. One likely cause is missing PHI args, the issue that came up in the PHI use list PR.

@mikedn
Copy link
Contributor

mikedn commented Jan 28, 2019

I don't think we have anything in the JIT today that can deal with this. I suppose morph might do it, if the 2 subtract nodes would be in the same tree. They are not. So... now we need forward substitution too.

Turns out that this is not so simple. The forward substitution stuff I'm experimenting with happens right after SSA, where the "is single use" property is up to date. At that point we have PHIs that are in the way and block the simplification of that chain of subtracts. The SUB(SUB(x, 1, 1) appears late, during assertion propagation.

Ideally we'd maintain the "is single use" property through transforms but there's a lot of work to do before this becomes practical.

So we may need to approach this differently for now. It would be better if we could avoid creating those PHIs in the first place - local assertion propagation could help with that by getting rid of constant branches. Apparently this doesn't work now because we end up with a sequence of blocks that could be compacted into one but that only happens after morph. Needs more investigation.

But for a start we should probably deal with struct promotion. We won't get very far without that LCL_FLD out of the way. While LocalAddressVisitor could deal with it I wonder why it appears so early in the first place, it looks like something that should perhaps appear during call morphing.

For completeness, #4323 shows a similar case where early introduced ABI details cause issues. In that case a struct containing a single double ends up being reinterpreted as long using LCL_FLD and makes a mess.

@AndyAyersMS
Copy link
Member Author

Thanks again -- hope to have time to get back to some of this soon.

I'd really like to see modelling of most or all the ABI details deferred. Things are currently very messy during inlining. The struct->int retyping for small struct returns sure seems like it could wait, and could be entirely avoided if we inline.

cc @CarolEidt

@tarekgh
Copy link
Member

tarekgh commented Feb 9, 2019

@tarekgh

@AndyAyersMS
Copy link
Member Author

Probably a stretch to promise any further improvements for 3.0, but am going to take another look.

@AndyAyersMS AndyAyersMS self-assigned this Feb 26, 2019
@AndyAyersMS
Copy link
Member Author

May decide to push the promotion bit forward (that is, allowing Ranges to be promoted, though as noted above, not enregistered) as this at least hits modestly often.

Updated diffs (based at fae2a56)

PMI Diffs for System.Private.CoreLib.dll, framework assemblies for x64 default jit
Summary:
(Lower is better)
Total bytes of diff: -2776 (-0.01% of base)
    diff is an improvement.
Top file regressions by size (bytes):
          39 : Microsoft.CodeAnalysis.VisualBasic.dasm (0.00% of base)
Top file improvements by size (bytes):
       -1350 : Microsoft.CodeAnalysis.dasm (-0.09% of base)
       -1112 : System.Reflection.Metadata.dasm (-0.27% of base)
        -317 : System.Private.CoreLib.dasm (-0.01% of base)
         -19 : Microsoft.CodeAnalysis.CSharp.dasm (-0.00% of base)
         -17 : System.Linq.Expressions.dasm (-0.00% of base)
6 total files with size differences (5 improved, 1 regressed), 123 unchanged.
Top method regressions by size (bytes):
          37 ( 7.66% of base) : Microsoft.CodeAnalysis.dasm - MetadataWriter:PopulateParamTableRows():this
          36 (10.14% of base) : Microsoft.CodeAnalysis.dasm - PEModule:HasDeprecatedOrObsoleteAttribute(struct,byref):bool:this
          29 ( 3.06% of base) : System.Reflection.Metadata.dasm - ControlFlowBuilder:CopyCodeAndFixupBranches(ref,ref):this
          25 ( 5.97% of base) : Microsoft.CodeAnalysis.dasm - MetadataWriter:GetOrAddDocument(ref,ref):int:this
          22 (11.76% of base) : System.Reflection.Metadata.dasm - MetadataBuilder:AddManifestResource(int,struct,struct,int):struct:this
Top method improvements by size (bytes):
        -112 (-5.28% of base) : Microsoft.CodeAnalysis.dasm - MetadataWriter:SerializeMethodDebugInfo(ref,int,int):this
        -110 (-14.19% of base) : System.Reflection.Metadata.dasm - NamespaceCache:MergeDuplicateNamespaces(ref,byref):this
         -93 (-12.33% of base) : Microsoft.CodeAnalysis.dasm - MetadataWriter:SerializeStateMachineLocalScopes(ref,int):this
         -88 (-11.84% of base) : Microsoft.CodeAnalysis.dasm - MetadataWriter:SerializeEncMethodDebugInformation(ref,int):this
         -68 (-3.41% of base) : System.Private.CoreLib.dasm - RuntimeHelpers:GetSubArray(ref,struct):ref (5 methods)
Top method regressions by size (percentage):
          22 (11.76% of base) : System.Reflection.Metadata.dasm - MetadataBuilder:AddManifestResource(int,struct,struct,int):struct:this
          36 (10.14% of base) : Microsoft.CodeAnalysis.dasm - PEModule:HasDeprecatedOrObsoleteAttribute(struct,byref):bool:this
          37 ( 7.66% of base) : Microsoft.CodeAnalysis.dasm - MetadataWriter:PopulateParamTableRows():this
          15 ( 7.35% of base) : Microsoft.CodeAnalysis.dasm - PEModule:HasInterfaceTypeAttribute(struct,byref):bool:this
          15 ( 7.35% of base) : Microsoft.CodeAnalysis.dasm - PEModule:HasTypeLibTypeAttribute(struct,byref):bool:this
Top method improvements by size (percentage):
         -63 (-65.62% of base) : System.Private.CoreLib.dasm - String:EnumerateRunes():struct:this
         -56 (-35.00% of base) : System.Reflection.Metadata.dasm - MetadataBuilder:AddFieldDefinition(int,struct,struct):struct:this
         -56 (-33.73% of base) : System.Reflection.Metadata.dasm - MetadataBuilder:AddProperty(int,struct,struct):struct:this
         -59 (-32.42% of base) : System.Reflection.Metadata.dasm - MetadataBuilder:AddMemberReference(struct,struct,struct):struct:this
         -59 (-32.42% of base) : System.Reflection.Metadata.dasm - MetadataBuilder:AddCustomDebugInformation(struct,struct,struct):struct:this
177 total methods with size differences (137 improved, 40 regressed), 193175 unchanged.

Beyond that the conditional assignments to the underlying index fields inhibit our ability to reason about the values of those fields, even if we ultimately trim away the conditions and "know" that we have a from start or from end index.

A sparse CCP like approach might help but it almost feels like we'd need to integrate VN and assertion prop together to get the real benefit, as assertion prop can allow us to improve VNs, and improved VNs may allow us more use of assertions. But since updates are stylized perhaps running them together is not really needed. No new VNs are created -- all edge pruning does is setup or break equivalences among VNs.

If we go for some sparse eval scheme in assertion prop, we have to decide if we're going to approach this from an optimistic (no blocks reachable until proven otherwise) or pessimistic (all blocks reachable until proven otherwise) viewpoint.

Pessimistic is possibly more appropriate for the jit so we can bail at any point and not worry about being incorrect. So one off-the-cuff idea here is to VN as we do now, and track PHI input and output VNs as candidates for a union-find style algorithm... initially assume all these are distinct VNs but as we are able to prune SSA edges away we will create degenerate PHIs that let us merge some inputs and outputs. Would need some level of indirection in VN access, so would keep some kind of VN rename table that maps the full set of VNs into canonical leader VNs. We'd also need to know which preds supply which PHI inputs (and keep a full roster of all pred inputs, or a moral equivalent).

AndyAyersMS referenced this issue in AndyAyersMS/coreclr Feb 27, 2019
For a while now the jit has been able to promote an outer struct A with an
inner struct field B that itself has a single non-struct field C, provided
that C occupies all of B and that C and B are pointer-sized.

For example, this comes up when supporting promotion of `Span<T>`, as a span
contains a `ByReference<T>` field that itself contains a pointer-sized field.

This change relaxes the constraints slightly, allowing B and C to be less than
pointer sized, provided C still occupies all of B, and B is suitably aligned
within A.

Doing so allows promotion of the new `Range` type, which contains two `Index`
fields that each wrap an `int`. This improves performance for uses of `Range`
for simple examples like those in #22079.
AndyAyersMS referenced this issue in dotnet/coreclr Feb 28, 2019
…lds (#22867)

For a while now the jit has been able to promote an outer struct A with an
inner struct field B that itself has a single non-struct field C, provided
that C occupies all of B and that C and B are pointer-sized.

For example, this comes up when supporting promotion of `Span<T>`, as a span
contains a `ByReference<T>` field that itself contains a pointer-sized field.

This change relaxes the constraints slightly, allowing B and C to be less than
pointer sized, provided C still occupies all of B, and B is suitably aligned
within A.

Doing so allows promotion of the new `Range` type, which contains two `Index`
fields that each wrap an `int`. This improves performance for uses of `Range`
for simple examples like those in #22079.
@AndyAyersMS
Copy link
Member Author

Made some preliminary improvements in dotnet/coreclr#22867. Remainder will have to wait.

@AndyAyersMS
Copy link
Member Author

@sandreenko wondering if your recent work on small struct enregistration has fixed the last issue here. Can you take a look?

@sandreenko
Copy link
Contributor

@sandreenko wondering if your recent work on small struct enregistration has fixed the last issue here. Can you take a look?

Will do.

@sandreenko
Copy link
Contributor

I think there was a type in the range implementation, should not it be

static Span<int> TrimFirstLast_Range(Span<int> s) => s[new Range(1, new Index(0, true))];

instead of

static Span<int> TrimFirstLast_Range(Span<int> s) => s[Range.Create(1, new Index(1, true))];

?

If yes then both methods now have ; Lcl frame size = 32 and the first has

Method code size: 133

when the second

Method code size: 135

The only difference is in this instruction:

IN0003: 000013 lea      ebx, [rcx-1]
vs
IN0003: 000013 mov      ebx, ecx
IN0004: 000015 sub      ebx, 1

because in the second method when we come to global morph and try to do SUB(x, const)->ADD(x, -const) we don't yet know that op2 is const because we have not propagated a local var there yet (it would happen only in PHASE Assertion prop).

All range structs are promoted as expected, I think the issue could be closed.

@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@BruceForstall BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Nov 25, 2020
@BruceForstall
Copy link
Member

@sandreenko says:

I think the issue could be closed.

@AndyAyersMS do you agree? If not, what is left to do? (And should this issue be used to track such things?)

@AndyAyersMS
Copy link
Member Author

Seems like the most serious problems are now fixed, but there are some fit and finish issues.

dotnet/roslyn#47629 to shows we still don't handle ranges as well as we could and includes a few tests we should explore.

@kunalspathak
Copy link
Member

The only difference is in this instruction:

IN0003: 000013 lea      ebx, [rcx-1]
vs
IN0003: 000013 mov      ebx, ecx
IN0004: 000015 sub      ebx, 1

Just double checked and yes, this is the diff between TrimFirstLast_OpenCoded and TrimFirstLast_Range. Do we want to keep the issue open or can we close it? @AndyAyersMS

@AndyAyersMS
Copy link
Member Author

Interesting, I haven't looked at this for a while, so am pleasantly surprised.

Do we want to keep the issue open or can we close it?

I think we can close it. I suspect some of the linked issues may still be relevant.

@ghost ghost locked as resolved and limited conversation to collaborators Feb 25, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions optimization
Projects
None yet
Development

No branches or pull requests

8 participants