Add basic support for folding and normalizing hwintrinsic trees in morph #103143

tannergooding · 2024-06-06T21:03:31Z

This covers some, but not all, of the same optimizations that exist for regular GenTreeOp nodes.

dotnet-policy-service · 2024-06-06T21:04:01Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

src/coreclr/jit/morph.cpp

tannergooding · 2024-06-10T14:38:20Z

CC. @dotnet/jit-contrib, much as with the floating-point PR (#103206), this simply enables constant folding of hwintrinsic nodes in morph (making it more inline with how simple integer ops are handled) and expands the support to a few other common operations, such as GetLower, GetUpper, and ToVectorXXX. This allows many more scenarios to be handled and better codegen to be produced across inlining boundaries when such constants are encountered. -- This is relevant towards moving the rest of Vector2 and Vector3 into managed code, so that we don't see regressions across inlining boundaries for constants.

tannergooding · 2024-06-13T01:18:35Z

Updated to handle the same feedback as given in #103206

EgorBo

The changes generally make sense to me, it's just unfortunate that we effectively duplicate constant folding logic here and in VN for the same ops resulting in+2000 LOC of code. Btw any idea why arm64 regresses on SPMI?

tannergooding · 2024-06-13T21:45:46Z

The changes generally make sense to me, it's just unfortunate that we effectively duplicate constant folding logic here and in VN for the same ops resulting in+2000 LOC of code

Definitely agree. It's something I noticed happens for VN vs morph for all our folding here. I'll log a low priority issue to see if there's some kind of code/logic sharing we could make happen here.

Maybe we could really just setup some kind of templated API that works with either GenTree or ValueNum to get the underlying constant and do the queries based off that, for example.

Btw any idea why arm64 regresses on SPMI?

It's mostly related to Vector256<T> (which is 2x V128 fields) where we end up using more registers which causes additional spill/restore logic in the prologue/epilogue.

One example of this is +28 (+8.05%) : 203475.dasm - JIT.HardwareIntrinsics.General._Vector256_1.VectorImmBinaryOpTest__op_RightShiftUInt6464:RunLclVarScenario_UnsafeRead():this (FullOpts) (see below) and this same +28 byte pattern is then repeated across a ton of coreclr_test since we have several hundred of them.

There's a couple other places where code has "regressed" as well, but it's typically in edge cases like this that are handling paths we don't expect to execute on Arm64. The paths we actually care about (using Vector2/3/4, Vector64/128<T>, and Vector<T>) are seeing improvements in terms of overall code quality and size.

@@ -18,35 +18,35 @@
 ;  V07 tmp4         [V07,T07] (  2,  4   )    long  ->   x0         "Inlining Arg"
 ;  V08 tmp5         [V08,T03] (  3,  6   )    long  ->   x1         "Inlining Arg"
 ;* V09 tmp6         [V09    ] (  0,  0   )  struct (32) zero-ref    HFA(simd16)  ld-addr-op "Inline ldloca(s) first use temp" <System.Runtime.Intrinsics.Vector256`1[ulong]>
-;* V10 tmp7         [V10,T15] (  0,  0   )  simd16  ->  zero-ref    HFA(simd16)  "Inlining Arg" <System.Runtime.Intrinsics.Vector128`1[ulong]>
-;* V11 tmp8         [V11,T16] (  0,  0   )  simd16  ->  zero-ref    HFA(simd16)  "Inlining Arg" <System.Runtime.Intrinsics.Vector128`1[ulong]>
+;* V10 tmp7         [V10    ] (  0,  0   )  simd16  ->  zero-ref    HFA(simd16)  "Inlining Arg" <System.Runtime.Intrinsics.Vector128`1[ulong]>
+;* V11 tmp8         [V11    ] (  0,  0   )  simd16  ->  zero-ref    HFA(simd16)  "Inlining Arg" <System.Runtime.Intrinsics.Vector128`1[ulong]>
 ;  V12 tmp9         [V12,T09] (  3,  3   )     ref  ->  x21         class-hnd exact single-def "Inline stloc first use temp" <<unknown class>>
 ;  V13 tmp10        [V13,T10] (  3,  3   )     ref  ->   x2         class-hnd exact single-def "Inline stloc first use temp" <<unknown class>>
 ;  V14 tmp11        [V14,T08] (  2,  4   )    long  ->  x20         "Inlining Arg"
 ;* V15 tmp12        [V15    ] (  0,  0   )   byref  ->  zero-ref    "Inlining Arg"
-;  V16 tmp13        [V16,T11] (  3,  3   )  simd16  ->  [fp+0x20]  HFA(simd16)  spill-single-def "field V01._lower (fldOffset=0x0)" P-INDEP
-;  V17 tmp14        [V17,T12] (  3,  3   )  simd16  ->  [fp+0x10]  HFA(simd16)  spill-single-def "field V01._upper (fldOffset=0x10)" P-INDEP
+;  V16 tmp13        [V16,T11] (  3,  3   )  simd16  ->   d8         HFA(simd16)  "field V01._lower (fldOffset=0x0)" P-INDEP
+;  V17 tmp14        [V17,T12] (  3,  3   )  simd16  ->   d9         HFA(simd16)  "field V01._upper (fldOffset=0x10)" P-INDEP
 ;* V18 tmp15        [V18    ] (  0,  0   )  simd16  ->  zero-ref    HFA(simd16)  "field V02._lower (fldOffset=0x0)" P-INDEP
 ;* V19 tmp16        [V19    ] (  0,  0   )  simd16  ->  zero-ref    HFA(simd16)  "field V02._upper (fldOffset=0x10)" P-INDEP
-;* V20 tmp17        [V20,T17] (  0,  0   )  simd16  ->  zero-ref    HFA(simd16)  "field V09._lower (fldOffset=0x0)" P-INDEP
-;* V21 tmp18        [V21,T18] (  0,  0   )  simd16  ->  zero-ref    HFA(simd16)  "field V09._upper (fldOffset=0x10)" P-INDEP
+;* V20 tmp17        [V20,T13] (  0,  0   )  simd16  ->  zero-ref    HFA(simd16)  "field V09._lower (fldOffset=0x0)" P-INDEP
+;* V21 tmp18        [V21,T14] (  0,  0   )  simd16  ->  zero-ref    HFA(simd16)  "field V09._upper (fldOffset=0x10)" P-INDEP
 ;  V22 tmp19        [V22,T04] (  3,  6   )    long  ->   x0         "BlockOp address local"
 ;  V23 tmp20        [V23,T05] (  3,  6   )    long  ->   x0         "BlockOp address local"
 ;  V24 tmp21        [V24,T02] (  3,  6   )   byref  ->   x0         single-def "BlockOp address local"
-;  V25 cse0         [V25,T13] (  2,  2   )  simd16  ->  d17         "CSE #02: aggressive"
-;  V26 cse1         [V26,T06] (  4,  4   )   byref  ->  x20         "CSE #03: aggressive"
-;  V27 cse2         [V27,T14] (  2,  2   )  simd16  ->  d16         "CSE #01: moderate"
+;  V25 cse0         [V25,T06] (  4,  4   )   byref  ->  x20         "CSE #01: aggressive"
 ;
-; Lcl frame size = 40
+; Lcl frame size = 8
 
 G_M64353_IG01:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG
             stp     fp, lr, [sp, #-0x50]!
+            stp     d8, d9, [sp, #0x18]
+            stp     d10, d11, [sp, #0x28]
             stp     x19, x20, [sp, #0x38]
             str     x21, [sp, #0x48]
             mov     fp, sp
             mov     x19, x0
             ; gcrRegs +[x19]
-						;; size=20 bbWeight=1 PerfScore 4.00
+						;; size=28 bbWeight=1 PerfScore 6.00
 G_M64353_IG02:        ; bbWeight=1, gcrefRegs=80000 {x19}, byrefRegs=0000 {}, byref
             movz    x0, #0xD1FFAB1E
             movk    x0, #0xD1FFAB1E LSL #16
@@ -84,26 +84,29 @@ G_M64353_IG02:        ; bbWeight=1, gcrefRegs=80000 {x19}, byrefRegs=0000 {}, by
             sub     x0, x0, #1
             sub     x1, x1, #1
             bic     x0, x0, x1
-            ldp     q16, q17, [x0]
-            stp     q17, q16, [fp, #0x10]	// [V17 tmp14], [V16 tmp13]
+            ldp     q8, q9, [x0]
             mov     x0, x20
             ; byrRegs +[x0]
             movz    x1, #0xD1FFAB1E      // code for <unknown method>
             movk    x1, #0xD1FFAB1E LSL #16
             movk    x1, #0xD1FFAB1E LSL #32
             ldr     x1, [x1]
+            mov     v10.d[0], v8.d[1]
+            mov     v11.d[0], v9.d[1]
             blr     x1
             ; byrRegs -[x0 x21]
-            ldr     q16, [fp, #0x20]	// [V16 tmp13]
-            str     q16, [x0]
-            ldr     q17, [fp, #0x10]	// [V17 tmp14]
-            str     q17, [x0, #0x10]
+            mov     v8.d[1], v10.d[0]
+            str     q8, [x0]
+            mov     v9.d[1], v11.d[0]
+            str     q9, [x0, #0x10]
             mov     x0, x20
             ; byrRegs +[x0]
             movz    x1, #0xD1FFAB1E      // code for <unknown method>
             movk    x1, #0xD1FFAB1E LSL #16
             movk    x1, #0xD1FFAB1E LSL #32
             ldr     x1, [x1]
+            mov     v10.d[0], v8.d[1]
+            mov     v11.d[0], v9.d[1]
             blr     x1
             ; byrRegs -[x0 x20]
             mov     x20, x0
@@ -127,10 +130,10 @@ G_M64353_IG02:        ; bbWeight=1, gcrefRegs=80000 {x19}, byrefRegs=0000 {}, by
             add     x0, x21, #16
             ; gcrRegs -[x0]
             ; byrRegs +[x0]
-            ldr     q16, [fp, #0x20]	// [V16 tmp13]
-            str     q16, [x0]
-            ldr     q17, [fp, #0x10]	// [V17 tmp14]
-            str     q17, [x0, #0x10]
+            mov     v8.d[1], v10.d[0]
+            str     q8, [x0]
+            mov     v9.d[1], v11.d[0]
+            str     q9, [x0, #0x10]
             ldp     q16, q17, [x20]
             stp     q16, q17, [x2, #0x10]
             mov     x0, x19
@@ -147,26 +150,28 @@ G_M64353_IG02:        ; bbWeight=1, gcrefRegs=80000 {x19}, byrefRegs=0000 {}, by
             ldr     x4, [x4]
             blr     x4
             ; gcrRegs -[x0-x2 x19 x21]
-						;; size=312 bbWeight=1 PerfScore 77.00
+						;; size=324 bbWeight=1 PerfScore 76.00
 G_M64353_IG03:        ; bbWeight=1, epilog, nogc, extend
             ldr     x21, [sp, #0x48]
             ldp     x19, x20, [sp, #0x38]
+            ldp     d10, d11, [sp, #0x28]
+            ldp     d8, d9, [sp, #0x18]
             ldp     fp, lr, [sp], #0x50
             ret     lr
-						;; size=16 bbWeight=1 PerfScore 5.00
+						;; size=24 bbWeight=1 PerfScore 7.00
 
-; Total bytes of code 348, prolog size 16, PerfScore 86.00, instruction count 87, allocated bytes for code 348 (MethodHash=fd89049e) for method JIT.HardwareIntrinsics.General._Vector256_1.VectorImmBinaryOpTest__op_RightShiftUInt6464:RunLclVarScenario_UnsafeRead():this (FullOpts)
+; Total bytes of code 376, prolog size 24, PerfScore 89.00, instruction count 94, allocated bytes for code 376 (MethodHash=fd89049e) for method JIT.HardwareIntrinsics.General._Vector256_1.VectorImmBinaryOpTest__op_RightShiftUInt6464:RunLclVarScenario_UnsafeRead():this (FullOpts)
 ; ============================================================
 
 Unwind Info:
   >> Start offset   : 0x000000 (not in unwind data)
   >>   End offset   : 0xd1ffab1e (not in unwind data)
-  Code Words        : 2
+  Code Words        : 3
   Epilog Count      : 1
   E bit             : 0
   X bit             : 0
   Vers              : 0
-  Function Length   : 87 (0x00057) Actual length = 348 (0x00015c)
+  Function Length   : 94 (0x0005e) Actual length = 376 (0x000178)
   ---- Epilog scopes ----
   ---- Scope 0
   Epilog Start Offset        : 3523193630 (0xd1ffab1e) Actual offset = 3523193630 (0xd1ffab1e) Offset from main function begin = 3523193630 (0xd1ffab1e)
@@ -176,7 +181,10 @@ Unwind Info:
     ---- Epilog start at index 1 ----
     D0 89       save_reg X#2 Z#9 (0x09); str x21, [sp, #72]
     C8 07       save_regp X#0 Z#7 (0x07); stp x19, x20, [sp, #56]
+    E6          save_next
+    D8 03       save_fregp X#0 Z#3 (0x03); stp d8, d9, [sp, #24]
     89          save_fplr_x #9 (0x09); stp fp, lr, [sp, #-80]!
     E4          end
     E4          end
+    E4          end

LoopedBard3 · 2024-06-18T16:31:52Z

Potential regressions:
Windows x64: dotnet/perf-autofiling-issues#36447

tannergooding · 2024-06-18T17:04:03Z

The regression are more likely #103462 and caused by the use of SinCos, which is an improvement for the real world scenario but isn't constant folded

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jun 6, 2024

dotnet-policy-service bot assigned tannergooding Jun 6, 2024

tannergooding commented Jun 6, 2024

View reviewed changes

src/coreclr/jit/morph.cpp Outdated Show resolved Hide resolved

tannergooding commented Jun 6, 2024

View reviewed changes

src/coreclr/jit/morph.cpp Show resolved Hide resolved

tannergooding force-pushed the hwintrinsic-morph branch from 1de5887 to 49088f6 Compare June 6, 2024 22:02

tannergooding closed this Jun 7, 2024

tannergooding reopened this Jun 7, 2024

tannergooding force-pushed the hwintrinsic-morph branch from 49088f6 to 471aa9a Compare June 7, 2024 21:38

This was referenced Jun 8, 2024

System.IO.Net5Compat.Tests and System.IO.Tests suddenly exiting with error 137 #100558

Open

SIGKILL (OOM?) while running LibraryImportGenerator.Tests w/o actionable log messages or artifacts dotnet/dnceng#2496

Open

tannergooding force-pushed the hwintrinsic-morph branch from 471aa9a to 03d5fc1 Compare June 8, 2024 01:31

build-analysis bot mentioned this pull request Jun 8, 2024

Dead lettering tests #101524

Closed

tannergooding force-pushed the hwintrinsic-morph branch 4 times, most recently from 938d968 to ee0b8ff Compare June 8, 2024 23:33

build-analysis bot mentioned this pull request Jun 9, 2024

Test failure: GC\\Features\\HeapExpansion\\Finalizer\\Finalizer.cmd #102706

Open

tannergooding force-pushed the hwintrinsic-morph branch 8 times, most recently from 7d71366 to d2203a3 Compare June 10, 2024 02:17

tannergooding force-pushed the hwintrinsic-morph branch from d2203a3 to 4a80f83 Compare June 10, 2024 03:38

Add basic support for folding hwintrinsic trees in morph

1cca48e

tannergooding force-pushed the hwintrinsic-morph branch from 4a80f83 to 1cca48e Compare June 10, 2024 05:42

tannergooding marked this pull request as ready for review June 10, 2024 14:31

This was referenced Jun 10, 2024

Failed test: baseservices/exceptions/unhandled/unhandledTester/unhandledTester #100495

Open

CI timeout in System.Data.Tests.DataTableReadXmlSchemaTest.XsdSchemaDeserializationIgnoresLocale #103062

Open

Reduce the amount of copying required to evaluated vector constants

d057ab9

tannergooding force-pushed the hwintrinsic-morph branch from fbe82b9 to d057ab9 Compare June 10, 2024 19:34

tannergooding force-pushed the hwintrinsic-morph branch from 39961a4 to 16d6c64 Compare June 13, 2024 01:32

tannergooding added 2 commits June 12, 2024 19:19

Merge remote-tracking branch 'dotnet/main' into hwintrinsic-morph

bb108f7

Have gtFoldExprHWIntrinsic handle side effects

1f24151

tannergooding force-pushed the hwintrinsic-morph branch from 16d6c64 to 1f24151 Compare June 13, 2024 02:39

This was referenced Jun 13, 2024

System.Net.Security.Tests.SslStreamTlsResumeTests failing #103354

Open

x64 test failure: DeserializeStoredObjects strings differ. #103367

Closed

EgorBo approved these changes Jun 13, 2024

View reviewed changes

tannergooding merged commit d7ae8c6 into dotnet:main Jun 13, 2024
112 of 114 checks passed

tannergooding deleted the hwintrinsic-morph branch June 13, 2024 21:45

AndyAyersMS mentioned this pull request Jun 20, 2024

[Perf] Linux/arm64: 5 Improvements on 6/13/2024 9:45:50 PM dotnet/perf-autofiling-issues#36684

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add basic support for folding and normalizing hwintrinsic trees in morph #103143

Add basic support for folding and normalizing hwintrinsic trees in morph #103143

tannergooding commented Jun 6, 2024

dotnet-policy-service bot commented Jun 6, 2024

tannergooding commented Jun 10, 2024

tannergooding commented Jun 13, 2024

EgorBo left a comment

tannergooding commented Jun 13, 2024

LoopedBard3 commented Jun 18, 2024

tannergooding commented Jun 18, 2024

Add basic support for folding and normalizing hwintrinsic trees in morph #103143

Add basic support for folding and normalizing hwintrinsic trees in morph #103143

Conversation

tannergooding commented Jun 6, 2024

dotnet-policy-service bot commented Jun 6, 2024

tannergooding commented Jun 10, 2024

tannergooding commented Jun 13, 2024

EgorBo left a comment

Choose a reason for hiding this comment

tannergooding commented Jun 13, 2024

LoopedBard3 commented Jun 18, 2024

tannergooding commented Jun 18, 2024