Use ldp/stp with SIMD registers on Arm64 #84135

Merged (2 commits) on Mar 31, 2023

SwapnilGaikwad (Contributor) commented Mar 30, 2023

Use pairwise load/stores for

  1. the instructions using SIMD registers
ldr     q1, [x0, #0x20]
ldr     q2, [x0, #0x30]     =>  ldp     q1, q2, [x0, #0x20]

(Fixes #83773)

  2. the instructions using base and base plus immediate offset format
ldr     w1, [x20]
ldr     w2, [x20, #0x04]    =>  ldp     w1, w2, [x20]

ldr     q1, [x0]
ldr     q2, [x0, #0x10]     =>  ldp     q1, q2, [x0]

Contributes to #35133. We still need to fix #81278 to cover all the cases.
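The conditions under which two neighbouring accesses can be folded into one pair instruction can be sketched as follows. This is only an illustrative model in Python, not the JIT's implementation: `MemOp`, `can_pair` and `pair` are hypothetical names, only ascending offsets are handled, and the checks are the ones the examples above suggest (same opcode, same base, same register kind, adjacent offsets) plus the architectural limit that `ldp`/`stp` encode a signed 7-bit immediate scaled by the access size.

```python
from dataclasses import dataclass

# Access size in bytes for each register prefix used in the examples above.
REG_SIZE = {"w": 4, "x": 8, "s": 4, "d": 8, "q": 16}


@dataclass(frozen=True)
class MemOp:
    """Hypothetical model of a single ldr/str; not a type from the JIT."""
    op: str          # "ldr" or "str"
    reg: str         # transfer register, e.g. "q1" or "w2"
    base: str        # base register, e.g. "x0"
    offset: int = 0  # immediate byte offset; 0 models the bare [base] form


def _clobbers_base(reg: str, base: str) -> bool:
    # w<N> and x<N> are views of the same general-purpose register.
    return reg[0] in "wx" and reg[1:] == base[1:]


def can_pair(a: MemOp, b: MemOp) -> bool:
    """True if `a` followed by `b` could become one ldp/stp (ascending only)."""
    if a.op != b.op or a.base != b.base or a.reg[0] != b.reg[0]:
        return False
    size = REG_SIZE[a.reg[0]]
    if b.offset != a.offset + size:
        return False
    # Signed 7-bit immediate, scaled by the access size.
    if a.offset % size != 0 or not (-64 * size <= a.offset <= 63 * size):
        return False
    if a.op == "ldr":
        # A load pair needs two distinct destinations, and the first load
        # must not overwrite the base that the second load still reads.
        if a.reg == b.reg or _clobbers_base(a.reg, a.base):
            return False
    return True


def pair(a: MemOp, b: MemOp) -> str:
    """Format the merged instruction in the style used in this PR."""
    mnemonic = "ldp" if a.op == "ldr" else "stp"
    if a.offset == 0:
        addr = f"[{a.base}]"
    else:
        sign = "-" if a.offset < 0 else ""
        addr = f"[{a.base}, #{sign}0x{abs(a.offset):02X}]"
    return f"{mnemonic} {a.reg}, {b.reg}, {addr}"


first, second = MemOp("ldr", "q1", "x0", 0x20), MemOp("ldr", "q2", "x0", 0x30)
if can_pair(first, second):
    print(pair(first, second))
```

The same check accepts the bare `[x20]` form, because an omitted immediate is modelled as offset 0.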

dotnet-issue-labeler bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) on Mar 30, 2023
ghost added the community-contribution label (indicates that the PR has been added by a community member) on Mar 30, 2023
ghost commented Mar 30, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.

SwapnilGaikwad (Contributor, Author) commented:

I am not yet sure whether this introduces GC holes, or how to confirm that.

The following SuperPMI (spmi) asmdiffs summary shows multiple matches, as expected.

Diffs are based on 1,469,735 contexts (402,470 MinOpts, 1,067,265 FullOpts).

MISSED contexts: 3 (0.00%)

Overall (-769,712 bytes)

| Collection | Base size (bytes) | Diff size (bytes) |
|---|---|---|
| benchmarks.run.linux.arm64.checked.mch | 19,307,636 | -4,632 |
| libraries_tests.pmi.linux.arm64.checked.mch | 160,979,832 | -46,408 |
| libraries.crossgen2.linux.arm64.checked.mch | 42,310,160 | -6,536 |
| libraries.pmi.linux.arm64.checked.mch | 65,290,988 | -24,572 |
| coreclr_tests.run.linux.arm64.checked.mch | 535,732,276 | -687,564 |

MinOpts (+0 bytes)

| Collection | Base size (bytes) | Diff size (bytes) |
|---|---|---|
| libraries_tests.pmi.linux.arm64.checked.mch | 5,439,544 | +0 |
| coreclr_tests.run.linux.arm64.checked.mch | 363,182,616 | +0 |

FullOpts (-769,712 bytes)

| Collection | Base size (bytes) | Diff size (bytes) |
|---|---|---|
| benchmarks.run.linux.arm64.checked.mch | 18,185,072 | -4,632 |
| libraries_tests.pmi.linux.arm64.checked.mch | 155,540,288 | -46,408 |
| libraries.crossgen2.linux.arm64.checked.mch | 42,308,524 | -6,536 |
| libraries.pmi.linux.arm64.checked.mch | 63,778,580 | -24,572 |
| coreclr_tests.run.linux.arm64.checked.mch | 172,549,660 | -687,564 |
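
As a quick sanity check on the summary, the per-collection FullOpts diffs can be summed and expressed relative to their base sizes. The snippet below is just a small sketch that copies the figures from the FullOpts table; `FULLOPTS` and `relative_change` are names made up for this illustration.

```python
# (base size, diff size) in bytes, copied from the FullOpts table above.
FULLOPTS = {
    "benchmarks.run.linux.arm64.checked.mch": (18_185_072, -4_632),
    "libraries_tests.pmi.linux.arm64.checked.mch": (155_540_288, -46_408),
    "libraries.crossgen2.linux.arm64.checked.mch": (42_308_524, -6_536),
    "libraries.pmi.linux.arm64.checked.mch": (63_778_580, -24_572),
    "coreclr_tests.run.linux.arm64.checked.mch": (172_549_660, -687_564),
}


def relative_change(base: int, diff: int) -> float:
    """Diff as a percentage of the base size."""
    return 100.0 * diff / base


# Sum of the per-collection diffs; this should match the FullOpts headline.
total = sum(diff for _, diff in FULLOPTS.values())

for name, (base, diff) in FULLOPTS.items():
    print(f"{name:45s} {diff:>9,} bytes ({relative_change(base, diff):+.3f}%)")
print(f"total: {total:,} bytes")
```

The total agrees with the reported -769,712 bytes, and since MinOpts contributes +0 bytes the Overall and FullOpts totals are the same.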
Example diffs
benchmarks.run.linux.arm64.checked.mch
-4 (-16.67%) : 2709.dasm - System.ValueTuple`2[long,System.DateTime]:.ctor(long,System.DateTime):this
@@ -20,15 +20,14 @@ G_M30325_IG01:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref,
 						;; size=8 bbWeight=1 PerfScore 1.50
 G_M30325_IG02:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0001 {x0}, byref
             ; byrRegs +[x0]
-            str     x1, [x0]
-            str     x2, [x0, #0x08]
-						;; size=8 bbWeight=1 PerfScore 2.00
+            stp     x1, x2, [x0]
+						;; size=4 bbWeight=1 PerfScore 1.00
 G_M30325_IG03:        ; bbWeight=1, epilog, nogc, extend
             ldp     fp, lr, [sp], #0x10
             ret     lr
 						;; size=8 bbWeight=1 PerfScore 2.00
 
-; Total bytes of code 24, prolog size 8, PerfScore 7.90, instruction count 6, allocated bytes for code 24 (MethodHash=b3c1898a) for method System.ValueTuple`2[long,System.DateTime]:.ctor(long,System.DateTime):this
+; Total bytes of code 20, prolog size 8, PerfScore 6.50, instruction count 5, allocated bytes for code 20 (MethodHash=b3c1898a) for method System.ValueTuple`2[long,System.DateTime]:.ctor(long,System.DateTime):this
 ; ============================================================
 
 Unwind Info:
@@ -39,7 +38,7 @@ Unwind Info:
   E bit             : 0
   X bit             : 0
   Vers              : 0
-  Function Length   : 6 (0x00006) Actual length = 24 (0x000018)
+  Function Length   : 5 (0x00005) Actual length = 20 (0x000014)
   ---- Epilog scopes ----
   ---- Scope 0
   Epilog Start Offset        : 3523193630 (0xd1ffab1e) Actual offset = 3523193630 (0xd1ffab1e) Offset from main function begin = 3523193630 (0xd1ffab1e)
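
The header of each example, such as `-4 (-16.67%)`, follows from the instruction counts in the diff, because every A64 instruction is encoded in 4 bytes. A minimal sketch of that arithmetic (`size_diff` is a hypothetical helper, not part of SuperPMI):

```python
A64_INSTR_BYTES = 4  # every A64 instruction is 4 bytes long


def size_diff(base_instrs: int, diff_instrs: int) -> tuple[int, float]:
    """Return (byte delta, percentage change) for a method's code size."""
    base_bytes = base_instrs * A64_INSTR_BYTES
    diff_bytes = diff_instrs * A64_INSTR_BYTES
    delta = diff_bytes - base_bytes
    return delta, round(100.0 * delta / base_bytes, 2)


# ValueTuple`2 .ctor above: 6 instructions before, 5 after.
print(size_diff(6, 5))
```

For the constructor above this gives 24 bytes before and 20 bytes after, matching the "Total bytes of code" and "Function Length" lines in the diff.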
-4 (-16.67%) : 925.dasm - System.Reflection.Emit.OpCode:.ctor(int,int):this
@@ -19,15 +19,14 @@ G_M55742_IG01:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref,
 						;; size=8 bbWeight=1 PerfScore 1.50
 G_M55742_IG02:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0001 {x0}, byref
             ; byrRegs +[x0]
-            str     w1, [x0]
-            str     w2, [x0, #0x04]
-						;; size=8 bbWeight=1 PerfScore 2.00
+            stp     w1, w2, [x0]
+						;; size=4 bbWeight=1 PerfScore 1.00
 G_M55742_IG03:        ; bbWeight=1, epilog, nogc, extend
             ldp     fp, lr, [sp], #0x10
             ret     lr
 						;; size=8 bbWeight=1 PerfScore 2.00
 
-; Total bytes of code 24, prolog size 8, PerfScore 7.90, instruction count 6, allocated bytes for code 24 (MethodHash=9e0a2641) for method System.Reflection.Emit.OpCode:.ctor(int,int):this
+; Total bytes of code 20, prolog size 8, PerfScore 6.50, instruction count 5, allocated bytes for code 20 (MethodHash=9e0a2641) for method System.Reflection.Emit.OpCode:.ctor(int,int):this
 ; ============================================================
 
 Unwind Info:
@@ -38,7 +37,7 @@ Unwind Info:
   E bit             : 0
   X bit             : 0
   Vers              : 0
-  Function Length   : 6 (0x00006) Actual length = 24 (0x000018)
+  Function Length   : 5 (0x00005) Actual length = 20 (0x000014)
   ---- Epilog scopes ----
   ---- Scope 0
   Epilog Start Offset        : 3523193630 (0xd1ffab1e) Actual offset = 3523193630 (0xd1ffab1e) Offset from main function begin = 3523193630 (0xd1ffab1e)
-8 (-16.67%) : 25440.dasm - System.Numerics.Tests.Perf_Matrix4x4:CreateRotationXWithCenterBenchmark():System.Numerics.Matrix4x4:this
@@ -38,11 +38,9 @@ G_M63428_IG02:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0100 {x8}, byre
             ldr     q17, [@RWD16]
             ldr     q18, [@RWD32]
             ldr     q19, [@RWD48]
-            str     q19, [x8]
-            str     q16, [x8, #0x10]
-            str     q17, [x8, #0x20]
-            str     q18, [x8, #0x30]
-						;; size=32 bbWeight=1 PerfScore 12.00
+            stp     q19, q16, [x8]
+            stp     q17, q18, [x8, #0x20]
+						;; size=24 bbWeight=1 PerfScore 10.00
 G_M63428_IG03:        ; bbWeight=1, epilog, nogc, extend
             ldp     fp, lr, [sp], #0x10
             ret     lr
@@ -53,7 +51,7 @@ RWD32  	dq	0000000000000000h, 3F80000000000000h
 RWD48  	dq	000000003F800000h, 0000000000000000h
 
 
-; Total bytes of code 48, prolog size 8, PerfScore 20.30, instruction count 12, allocated bytes for code 48 (MethodHash=0a46083b) for method System.Numerics.Tests.Perf_Matrix4x4:CreateRotationXWithCenterBenchmark():System.Numerics.Matrix4x4:this
+; Total bytes of code 40, prolog size 8, PerfScore 17.50, instruction count 10, allocated bytes for code 40 (MethodHash=0a46083b) for method System.Numerics.Tests.Perf_Matrix4x4:CreateRotationXWithCenterBenchmark():System.Numerics.Matrix4x4:this
 ; ============================================================
 
 Unwind Info:
@@ -64,7 +62,7 @@ Unwind Info:
   E bit             : 0
   X bit             : 0
   Vers              : 0
-  Function Length   : 12 (0x0000c) Actual length = 48 (0x000030)
+  Function Length   : 10 (0x0000a) Actual length = 40 (0x000028)
   ---- Epilog scopes ----
   ---- Scope 0
   Epilog Start Offset        : 3523193630 (0xd1ffab1e) Actual offset = 3523193630 (0xd1ffab1e) Offset from main function begin = 3523193630 (0xd1ffab1e)
+0 (0.00%) : 21886.dasm - System.Xml.XmlBinaryWriter:SetOutput(System.IO.Stream,System.Xml.IXmlDictionary,System.Xml.XmlBinaryWriterSession,bool):this
@@ -86,8 +86,8 @@ G_M34423_IG04:        ; bbWeight=1, gcrefRegs=780000 {x19 x20 x21 x22}, byrefReg
             strb    wzr, [x0, #0x26]
             add     x14, x0, #64
             ; byrRegs +[x14]
-            str     xzr, [x14]
-            stp     xzr, xzr, [x14, #0x08]
+            stp     xzr, xzr, [x14]
+            str     xzr, [x14, #0x10]
             movn    w14, #0
             ; byrRegs -[x14]
             str     w14, [x0, #0x38]
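
The `+0` entries such as the one above change which stores are paired without changing the instruction count. One way to read this, offered here as an assumption rather than a description of the JIT code, is that the pairing scan is greedy and previously skipped the bare `[base]` form, so the second and third stores were paired instead of the first and second. The toy model below reproduces that behaviour; `peephole` and its `pair_bare_base` flag are hypothetical.

```python
def peephole(stores, size, pair_bare_base):
    """Greedily merge adjacent stores to one base register.

    stores: list of (reg, offset) in program order.
    pair_bare_base: whether a store at offset 0 (the bare [base] form)
                    may become the first half of a pair.
    Returns tuples ("str", reg, off) or ("stp", reg1, reg2, off).
    """
    out = []
    for reg, off in stores:
        prev = out[-1] if out else None
        if (prev is not None
                and prev[0] == "str"
                and off == prev[2] + size
                and (pair_bare_base or prev[2] != 0)):
            # Fold the previous single store and this one into a pair.
            out[-1] = ("stp", prev[1], reg, prev[2])
        else:
            out.append(("str", reg, off))
    return out


# Three 8-byte zero stores, as in the SetOutput hunk above.
zeroing = [("xzr", 0x00), ("xzr", 0x08), ("xzr", 0x10)]
print(peephole(zeroing, 8, pair_bare_base=False))  # str + stp
print(peephole(zeroing, 8, pair_bare_base=True))   # stp + str
```

Both runs emit two instructions, which is consistent with the unchanged size of these methods.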
+0 (0.00%) : 20223.dasm - System.Text.Json.JsonDocument:TryGetValue(int,byref):bool:this
@@ -150,8 +150,8 @@ G_M19143_IG05:        ; bbWeight=1, gcrefRegs=80000 {x19}, byrefRegs=300000 {x20
             blr     x1
             cmp     w22, #12
             blt     G_M19143_IG15
-            ldr     w22, [x21]
-            ldp     w23, w1, [x21, #0x04]
+            ldp     w22, w23, [x21]
+            ldr     w1, [x21, #0x08]
             lsr     w1, w1, #28
             uxtb    w1, w1
             cmp     w1, #8
+0 (0.00%) : 32319.dasm - Microsoft.CodeAnalysis.CSharp.MethodCompiler:CompileSynthesizedMethods(Microsoft.CodeAnalysis.CSharp.TypeCompilationState):this
@@ -188,10 +188,10 @@ G_M26982_IG06:        ; bbWeight=4, gcVars=00000000000000400000000401000010 {V00
             add     x14, x14, #16
             add     x14, x15, x14
             ; byrRegs +[x14]
-            ldr     x20, [x14]
-            ; gcrRegs +[x20]
-            ldp     x21, x22, [x14, #0x08]
-            ; gcrRegs +[x21-x22]
+            ldp     x20, x21, [x14]
+            ; gcrRegs +[x20-x21]
+            ldr     x22, [x14, #0x10]
+            ; gcrRegs +[x22]
             add     x14, x4, #40
             mov     x15, x22
             bl      CORINFO_HELP_ASSIGN_REF
libraries_tests.pmi.linux.arm64.checked.mch
-8 (-20.00%) : 128351.dasm - Microsoft.CodeAnalysis.Checksum+HashData:FromPointer(ulong):Microsoft.CodeAnalysis.Checksum+HashData
@@ -26,19 +26,17 @@ G_M44009_IG01:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref,
 						;; size=8 bbWeight=1 PerfScore 1.50
 G_M44009_IG02:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0100 {x8}, byref
             ; byrRegs +[x8]
-            ldr     x1, [x0]
-            ldr     x2, [x0, #0x08]
+            ldp     x1, x2, [x0]
             ldr     w0, [x0, #0x10]
-            str     x1, [x8]
-            str     x2, [x8, #0x08]
+            stp     x1, x2, [x8]
             str     w0, [x8, #0x10]
-						;; size=24 bbWeight=1 PerfScore 12.00
+						;; size=16 bbWeight=1 PerfScore 9.00
 G_M44009_IG03:        ; bbWeight=1, epilog, nogc, extend
             ldp     fp, lr, [sp], #0x10
             ret     lr
 						;; size=8 bbWeight=1 PerfScore 2.00
 
-; Total bytes of code 40, prolog size 8, PerfScore 19.50, instruction count 10, allocated bytes for code 40 (MethodHash=a99b5416) for method Microsoft.CodeAnalysis.Checksum+HashData:FromPointer(ulong):Microsoft.CodeAnalysis.Checksum+HashData
+; Total bytes of code 32, prolog size 8, PerfScore 15.70, instruction count 8, allocated bytes for code 32 (MethodHash=a99b5416) for method Microsoft.CodeAnalysis.Checksum+HashData:FromPointer(ulong):Microsoft.CodeAnalysis.Checksum+HashData
 ; ============================================================
 
 Unwind Info:
@@ -49,7 +47,7 @@ Unwind Info:
   E bit             : 0
   X bit             : 0
   Vers              : 0
-  Function Length   : 10 (0x0000a) Actual length = 40 (0x000028)
+  Function Length   : 8 (0x00008) Actual length = 32 (0x000020)
   ---- Epilog scopes ----
   ---- Scope 0
   Epilog Start Offset        : 3523193630 (0xd1ffab1e) Actual offset = 3523193630 (0xd1ffab1e) Offset from main function begin = 3523193630 (0xd1ffab1e)
-4 (-16.67%) : 3265.dasm - System.Text.Json.Serialization.Tests.Point_2D_Struct_WithMultipleAttributes_OneNonPublic:.ctor(int):this
@@ -19,15 +19,14 @@ G_M61621_IG01:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref,
 						;; size=8 bbWeight=1 PerfScore 1.50
 G_M61621_IG02:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0001 {x0}, byref
             ; byrRegs +[x0]
-            str     w1, [x0]
-            str     wzr, [x0, #0x04]
-						;; size=8 bbWeight=1 PerfScore 2.00
+            stp     w1, wzr, [x0]
+						;; size=4 bbWeight=1 PerfScore 1.00
 G_M61621_IG03:        ; bbWeight=1, epilog, nogc, extend
             ldp     fp, lr, [sp], #0x10
             ret     lr
 						;; size=8 bbWeight=1 PerfScore 2.00
 
-; Total bytes of code 24, prolog size 8, PerfScore 7.90, instruction count 6, allocated bytes for code 24 (MethodHash=8aeb0f4a) for method System.Text.Json.Serialization.Tests.Point_2D_Struct_WithMultipleAttributes_OneNonPublic:.ctor(int):this
+; Total bytes of code 20, prolog size 8, PerfScore 6.50, instruction count 5, allocated bytes for code 20 (MethodHash=8aeb0f4a) for method System.Text.Json.Serialization.Tests.Point_2D_Struct_WithMultipleAttributes_OneNonPublic:.ctor(int):this
 ; ============================================================
 
 Unwind Info:
@@ -38,7 +37,7 @@ Unwind Info:
   E bit             : 0
   X bit             : 0
   Vers              : 0
-  Function Length   : 6 (0x00006) Actual length = 24 (0x000018)
+  Function Length   : 5 (0x00005) Actual length = 20 (0x000014)
   ---- Epilog scopes ----
   ---- Scope 0
   Epilog Start Offset        : 3523193630 (0xd1ffab1e) Actual offset = 3523193630 (0xd1ffab1e) Offset from main function begin = 3523193630 (0xd1ffab1e)
-4 (-16.67%) : 156801.dasm - SerializationTestTypes.KeyValue`2[long,System.Nullable`1[int]]:.ctor(long,System.Nullable`1[int]):this
@@ -19,15 +19,14 @@ G_M24332_IG01:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref,
 						;; size=8 bbWeight=1 PerfScore 1.50
 G_M24332_IG02:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0001 {x0}, byref
             ; byrRegs +[x0]
-            str     x1, [x0]
-            str     x2, [x0, #0x08]
-						;; size=8 bbWeight=1 PerfScore 2.00
+            stp     x1, x2, [x0]
+						;; size=4 bbWeight=1 PerfScore 1.00
 G_M24332_IG03:        ; bbWeight=1, epilog, nogc, extend
             ldp     fp, lr, [sp], #0x10
             ret     lr
 						;; size=8 bbWeight=1 PerfScore 2.00
 
-; Total bytes of code 24, prolog size 8, PerfScore 7.90, instruction count 6, allocated bytes for code 24 (MethodHash=e0c2a0f3) for method SerializationTestTypes.KeyValue`2[long,System.Nullable`1[int]]:.ctor(long,System.Nullable`1[int]):this
+; Total bytes of code 20, prolog size 8, PerfScore 6.50, instruction count 5, allocated bytes for code 20 (MethodHash=e0c2a0f3) for method SerializationTestTypes.KeyValue`2[long,System.Nullable`1[int]]:.ctor(long,System.Nullable`1[int]):this
 ; ============================================================
 
 Unwind Info:
@@ -38,7 +37,7 @@ Unwind Info:
   E bit             : 0
   X bit             : 0
   Vers              : 0
-  Function Length   : 6 (0x00006) Actual length = 24 (0x000018)
+  Function Length   : 5 (0x00005) Actual length = 20 (0x000014)
   ---- Epilog scopes ----
   ---- Scope 0
   Epilog Start Offset        : 3523193630 (0xd1ffab1e) Actual offset = 3523193630 (0xd1ffab1e) Offset from main function begin = 3523193630 (0xd1ffab1e)
+0 (0.00%) : 166976.dasm - Microsoft.CodeQuality.Analyzers.ApiDesignGuidelines.IdentifiersShouldHaveCorrectSuffixAnalyzer:.ctor():this
@@ -40,10 +40,10 @@ G_M60409_IG02:        ; bbWeight=1, gcrefRegs=80000 {x19}, byrefRegs=0000 {}, by
             movz    x0, #0xD1FFAB1E      // data for <unknown class>:<unknown field>
             movk    x0, #0xD1FFAB1E LSL #16
             movk    x0, #0xD1FFAB1E LSL #32
-            ldr     x20, [x0]
-            ; gcrRegs +[x20]
-            ldp     x21, x22, [x0, #0x08]
-            ; gcrRegs +[x21-x22]
+            ldp     x20, x21, [x0]
+            ; gcrRegs +[x20-x21]
+            ldr     x22, [x0, #0x10]
+            ; gcrRegs +[x22]
             movz    x0, #0xD1FFAB1E
             movk    x0, #0xD1FFAB1E LSL #16
             movk    x0, #0xD1FFAB1E LSL #32
+0 (0.00%) : 210048.dasm - System.Net.Http.Tests.StreamToStreamCopyTest+d__5:MoveNext():this
@@ -605,8 +605,8 @@ G_M59861_IG08:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=80000 {x19}, by
 G_M59861_IG09:        ; bbWeight=1.00, gcrefRegs=0000 {}, byrefRegs=80000 {x19}, byref, isz
             movn    w14, #1
             str     w14, [x19, #0x18]
-            str     xzr, [x19]
-            stp     xzr, xzr, [x19, #0x08]
+            stp     xzr, xzr, [x19]
+            str     xzr, [x19, #0x10]
             add     x14, x19, #32
             ; byrRegs +[x14]
             ldr     x15, [x14]
@@ -802,8 +802,8 @@ G_M59861_IG21:        ; bbWeight=0, gcVars=0000000000000001 {V00}, gcrefRegs=000
             ldr     x19, [fp, #0x10]	// [V00 this]
             ; byrRegs +[x19]
             str     w0, [x19, #0x18]
-            str     xzr, [x19]
-            stp     xzr, xzr, [x19, #0x08]
+            stp     xzr, xzr, [x19]
+            str     xzr, [x19, #0x10]
             add     x0, x19, #32
             ; byrRegs +[x0]
             movz    x2, #0xD1FFAB1E      // code for <unknown method>
+0 (0.00%) : 239744.dasm - System.Security.Cryptography.Pkcs.Tests.CryptographicAttributeObjectCollectionTests:CopyExceptions()
@@ -142,10 +142,10 @@ G_M45722_IG02:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref,
             movz    x0, #0xD1FFAB1E      // data for <unknown class>:<unknown field>
             movk    x0, #0xD1FFAB1E LSL #16
             movk    x0, #0xD1FFAB1E LSL #32
-            ldr     x20, [x0]
-            ; gcrRegs +[x20]
-            ldp     x21, x22, [x0, #0x08]
-            ; gcrRegs +[x21-x22]
+            ldp     x20, x21, [x0]
+            ; gcrRegs +[x20-x21]
+            ldr     x22, [x0, #0x10]
+            ; gcrRegs +[x22]
             movz    x0, #0xD1FFAB1E
             movk    x0, #0xD1FFAB1E LSL #16
             movk    x0, #0xD1FFAB1E LSL #32
libraries.crossgen2.linux.arm64.checked.mch
-8 (-25.00%) : 34883.dasm - System.Numerics.Quaternion:.ctor(float,float,float,float):this
@@ -22,17 +22,15 @@ G_M64168_IG01:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref,
 						;; size=8 bbWeight=1 PerfScore 1.50
 G_M64168_IG02:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0001 {x0}, byref
             ; byrRegs +[x0]
-            str     s0, [x0]
-            str     s1, [x0, #0x04]
-            str     s2, [x0, #0x08]
-            str     s3, [x0, #0x0C]
-						;; size=16 bbWeight=1 PerfScore 4.00
+            stp     s0, s1, [x0]
+            stp     s2, s3, [x0, #0x08]
+						;; size=8 bbWeight=1 PerfScore 2.00
 G_M64168_IG03:        ; bbWeight=1, epilog, nogc, extend
             ldp     fp, lr, [sp], #0x10
             ret     lr
 						;; size=8 bbWeight=1 PerfScore 2.00
 
-; Total bytes of code 32, prolog size 8, PerfScore 10.70, instruction count 8, allocated bytes for code 32 (MethodHash=c0090557) for method System.Numerics.Quaternion:.ctor(float,float,float,float):this
+; Total bytes of code 24, prolog size 8, PerfScore 7.90, instruction count 6, allocated bytes for code 24 (MethodHash=c0090557) for method System.Numerics.Quaternion:.ctor(float,float,float,float):this
 ; ============================================================
 
 Unwind Info:
@@ -43,7 +41,7 @@ Unwind Info:
   E bit             : 0
   X bit             : 0
   Vers              : 0
-  Function Length   : 8 (0x00008) Actual length = 32 (0x000020)
+  Function Length   : 6 (0x00006) Actual length = 24 (0x000018)
   ---- Epilog scopes ----
   ---- Scope 0
   Epilog Start Offset        : 3523193630 (0xd1ffab1e) Actual offset = 3523193630 (0xd1ffab1e) Offset from main function begin = 3523193630 (0xd1ffab1e)
-8 (-25.00%) : 169952.dasm - System.Drawing.RectangleF:.ctor(float,float,float,float):this
@@ -22,17 +22,15 @@ G_M45207_IG01:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref,
 						;; size=8 bbWeight=1 PerfScore 1.50
 G_M45207_IG02:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0001 {x0}, byref
             ; byrRegs +[x0]
-            str     s0, [x0]
-            str     s1, [x0, #0x04]
-            str     s2, [x0, #0x08]
-            str     s3, [x0, #0x0C]
-						;; size=16 bbWeight=1 PerfScore 4.00
+            stp     s0, s1, [x0]
+            stp     s2, s3, [x0, #0x08]
+						;; size=8 bbWeight=1 PerfScore 2.00
 G_M45207_IG03:        ; bbWeight=1, epilog, nogc, extend
             ldp     fp, lr, [sp], #0x10
             ret     lr
 						;; size=8 bbWeight=1 PerfScore 2.00
 
-; Total bytes of code 32, prolog size 8, PerfScore 10.70, instruction count 8, allocated bytes for code 32 (MethodHash=1f014f68) for method System.Drawing.RectangleF:.ctor(float,float,float,float):this
+; Total bytes of code 24, prolog size 8, PerfScore 7.90, instruction count 6, allocated bytes for code 24 (MethodHash=1f014f68) for method System.Drawing.RectangleF:.ctor(float,float,float,float):this
 ; ============================================================
 
 Unwind Info:
@@ -43,7 +41,7 @@ Unwind Info:
   E bit             : 0
   X bit             : 0
   Vers              : 0
-  Function Length   : 8 (0x00008) Actual length = 32 (0x000020)
+  Function Length   : 6 (0x00006) Actual length = 24 (0x000018)
   ---- Epilog scopes ----
   ---- Scope 0
   Epilog Start Offset        : 3523193630 (0xd1ffab1e) Actual offset = 3523193630 (0xd1ffab1e) Offset from main function begin = 3523193630 (0xd1ffab1e)
-8 (-25.00%) : 169953.dasm - System.Drawing.RectangleF:.ctor(System.Drawing.PointF,System.Drawing.SizeF):this
@@ -25,17 +25,15 @@ G_M36094_IG01:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref,
 						;; size=8 bbWeight=1 PerfScore 1.50
 G_M36094_IG02:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0001 {x0}, byref
             ; byrRegs +[x0]
-            str     s0, [x0]
-            str     s1, [x0, #0x04]
-            str     s2, [x0, #0x08]
-            str     s3, [x0, #0x0C]
-						;; size=16 bbWeight=1 PerfScore 4.00
+            stp     s0, s1, [x0]
+            stp     s2, s3, [x0, #0x08]
+						;; size=8 bbWeight=1 PerfScore 2.00
 G_M36094_IG03:        ; bbWeight=1, epilog, nogc, extend
             ldp     fp, lr, [sp], #0x10
             ret     lr
 						;; size=8 bbWeight=1 PerfScore 2.00
 
-; Total bytes of code 32, prolog size 8, PerfScore 10.70, instruction count 8, allocated bytes for code 32 (MethodHash=550a7301) for method System.Drawing.RectangleF:.ctor(System.Drawing.PointF,System.Drawing.SizeF):this
+; Total bytes of code 24, prolog size 8, PerfScore 7.90, instruction count 6, allocated bytes for code 24 (MethodHash=550a7301) for method System.Drawing.RectangleF:.ctor(System.Drawing.PointF,System.Drawing.SizeF):this
 ; ============================================================
 
 Unwind Info:
@@ -46,7 +44,7 @@ Unwind Info:
   E bit             : 0
   X bit             : 0
   Vers              : 0
-  Function Length   : 8 (0x00008) Actual length = 32 (0x000020)
+  Function Length   : 6 (0x00006) Actual length = 24 (0x000018)
   ---- Epilog scopes ----
   ---- Scope 0
   Epilog Start Offset        : 3523193630 (0xd1ffab1e) Actual offset = 3523193630 (0xd1ffab1e) Offset from main function begin = 3523193630 (0xd1ffab1e)
+0 (0.00%) : 65343.dasm - Microsoft.CodeAnalysis.CSharp.ForEachStatementInfo:GetHashCode():int:this
@@ -58,12 +58,12 @@ G_M41916_IG01:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref,
             ; byrRegs +[x19]
 						;; size=28 bbWeight=1 PerfScore 6.00
 G_M41916_IG02:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=80000 {x19}, byref
-            ldr     x20, [x19]
-            ; gcrRegs +[x20]
-            ldp     x21, x22, [x19, #0x08]
-            ; gcrRegs +[x21-x22]
-            ldp     x23, x24, [x19, #0x18]
-            ; gcrRegs +[x23-x24]
+            ldp     x20, x21, [x19]
+            ; gcrRegs +[x20-x21]
+            ldp     x22, x23, [x19, #0x10]
+            ; gcrRegs +[x22-x23]
+            ldr     x24, [x19, #0x20]
+            ; gcrRegs +[x24]
 						;; size=12 bbWeight=1 PerfScore 11.00
 G_M41916_IG03:        ; bbWeight=1, nogc, extend
             add     x0, x19, #24
+0 (0.00%) : 136256.dasm - System.Text.RegularExpressions.RegexParser:.ctor(System.String,int,System.Globalization.CultureInfo,System.Collections.Hashtable,int,System.Collections.Hashtable,System.Span`1[int]):this
@@ -96,9 +96,9 @@ G_M19169_IG04:        ; bbWeight=1, extend
             blr     x12
             ldr     x12, [x13], #0x08
             str     x12, [x14], #0x08
-            str     xzr, [x0]
-            stp     xzr, xzr, [x0, #0x08]
-            stp     xzr, xzr, [x0, #0x18]
+            stp     xzr, xzr, [x0]
+            stp     xzr, xzr, [x0, #0x10]
+            str     xzr, [x0, #0x20]
             str     wzr, [x0, #0x58]
             stp     wzr, wzr, [x0, #0x60]
             str     wzr, [x0, #0x68]
+0 (0.00%) : 173248.dasm - ILCompiler.Diagnostics.PerfMapWriter+PerfmapTokensForTarget:Equals(System.Object):bool:this
@@ -65,9 +65,9 @@ G_M34908_IG04:        ; bbWeight=0.25, gcrefRegs=80000 {x19}, byrefRegs=100000 {
 G_M34908_IG05:        ; bbWeight=0.50, gcrefRegs=80000 {x19}, byrefRegs=100000 {x20}, byref, isz
             add     x11, x19, #8
             ; byrRegs +[x11]
-            ldr     w19, [x11]
+            ldp     w19, w21, [x11]
             ; gcrRegs -[x19]
-            ldp     w21, w22, [x11, #0x04]
+            ldr     w22, [x11, #0x08]
             adrp    x11, [HIGH RELOC #0xD1FFAB1E]      // function address
             ; byrRegs -[x11]
             add     x11, x11, [LOW RELOC #0xD1FFAB1E]
libraries.pmi.linux.arm64.checked.mch
-8 (-25.00%) : 250607.dasm - System.Drawing.RectangleF:.ctor(float,float,float,float):this
@@ -21,17 +21,15 @@ G_M45207_IG01:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref,
 						;; size=8 bbWeight=1 PerfScore 1.50
 G_M45207_IG02:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0001 {x0}, byref
             ; byrRegs +[x0]
-            str     s0, [x0]
-            str     s1, [x0, #0x04]
-            str     s2, [x0, #0x08]
-            str     s3, [x0, #0x0C]
-						;; size=16 bbWeight=1 PerfScore 4.00
+            stp     s0, s1, [x0]
+            stp     s2, s3, [x0, #0x08]
+						;; size=8 bbWeight=1 PerfScore 2.00
 G_M45207_IG03:        ; bbWeight=1, epilog, nogc, extend
             ldp     fp, lr, [sp], #0x10
             ret     lr
 						;; size=8 bbWeight=1 PerfScore 2.00
 
-; Total bytes of code 32, prolog size 8, PerfScore 10.70, instruction count 8, allocated bytes for code 32 (MethodHash=1f014f68) for method System.Drawing.RectangleF:.ctor(float,float,float,float):this
+; Total bytes of code 24, prolog size 8, PerfScore 7.90, instruction count 6, allocated bytes for code 24 (MethodHash=1f014f68) for method System.Drawing.RectangleF:.ctor(float,float,float,float):this
 ; ============================================================
 
 Unwind Info:
@@ -42,7 +40,7 @@ Unwind Info:
   E bit             : 0
   X bit             : 0
   Vers              : 0
-  Function Length   : 8 (0x00008) Actual length = 32 (0x000020)
+  Function Length   : 6 (0x00006) Actual length = 24 (0x000018)
   ---- Epilog scopes ----
   ---- Scope 0
   Epilog Start Offset        : 3523193630 (0xd1ffab1e) Actual offset = 3523193630 (0xd1ffab1e) Offset from main function begin = 3523193630 (0xd1ffab1e)
-8 (-25.00%) : 250608.dasm - System.Drawing.RectangleF:.ctor(System.Drawing.PointF,System.Drawing.SizeF):this
@@ -24,17 +24,15 @@ G_M36094_IG01:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref,
 						;; size=8 bbWeight=1 PerfScore 1.50
 G_M36094_IG02:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0001 {x0}, byref
             ; byrRegs +[x0]
-            str     s0, [x0]
-            str     s1, [x0, #0x04]
-            str     s2, [x0, #0x08]
-            str     s3, [x0, #0x0C]
-						;; size=16 bbWeight=1 PerfScore 4.00
+            stp     s0, s1, [x0]
+            stp     s2, s3, [x0, #0x08]
+						;; size=8 bbWeight=1 PerfScore 2.00
 G_M36094_IG03:        ; bbWeight=1, epilog, nogc, extend
             ldp     fp, lr, [sp], #0x10
             ret     lr
 						;; size=8 bbWeight=1 PerfScore 2.00
 
-; Total bytes of code 32, prolog size 8, PerfScore 10.70, instruction count 8, allocated bytes for code 32 (MethodHash=550a7301) for method System.Drawing.RectangleF:.ctor(System.Drawing.PointF,System.Drawing.SizeF):this
+; Total bytes of code 24, prolog size 8, PerfScore 7.90, instruction count 6, allocated bytes for code 24 (MethodHash=550a7301) for method System.Drawing.RectangleF:.ctor(System.Drawing.PointF,System.Drawing.SizeF):this
 ; ============================================================
 
 Unwind Info:
@@ -45,7 +43,7 @@ Unwind Info:
   E bit             : 0
   X bit             : 0
   Vers              : 0
-  Function Length   : 8 (0x00008) Actual length = 32 (0x000020)
+  Function Length   : 6 (0x00006) Actual length = 24 (0x000018)
   ---- Epilog scopes ----
   ---- Scope 0
   Epilog Start Offset        : 3523193630 (0xd1ffab1e) Actual offset = 3523193630 (0xd1ffab1e) Offset from main function begin = 3523193630 (0xd1ffab1e)
-8 (-20.00%) : 246149.dasm - System.IO.Hashing.XxHash128:WriteBigEndian128(byref,System.Span`1[ubyte])
@@ -29,20 +29,18 @@ G_M11325_IG01:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref,
 						;; size=8 bbWeight=1 PerfScore 1.50
 G_M11325_IG02:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0003 {x0 x1}, byref
             ; byrRegs +[x0-x1]
-            ldr     x2, [x0]
-            ldr     x0, [x0, #0x08]
+            ldp     x2, x0, [x0]
             ; byrRegs -[x0]
             rev     x2, x2
             rev     x0, x0
-            str     x0, [x1]
-            str     x2, [x1, #0x08]
-						;; size=24 bbWeight=1 PerfScore 9.00
+            stp     x0, x2, [x1]
+						;; size=16 bbWeight=1 PerfScore 6.00
 G_M11325_IG03:        ; bbWeight=1, epilog, nogc, extend
             ldp     fp, lr, [sp], #0x10
             ret     lr
 						;; size=8 bbWeight=1 PerfScore 2.00
 
-; Total bytes of code 40, prolog size 8, PerfScore 16.50, instruction count 10, allocated bytes for code 40 (MethodHash=ae9bd3c2) for method System.IO.Hashing.XxHash128:WriteBigEndian128(byref,System.Span`1[ubyte])
+; Total bytes of code 32, prolog size 8, PerfScore 12.70, instruction count 8, allocated bytes for code 32 (MethodHash=ae9bd3c2) for method System.IO.Hashing.XxHash128:WriteBigEndian128(byref,System.Span`1[ubyte])
 ; ============================================================
 
 Unwind Info:
@@ -53,7 +51,7 @@ Unwind Info:
   E bit             : 0
   X bit             : 0
   Vers              : 0
-  Function Length   : 10 (0x0000a) Actual length = 40 (0x000028)
+  Function Length   : 8 (0x00008) Actual length = 32 (0x000020)
   ---- Epilog scopes ----
   ---- Scope 0
   Epilog Start Offset        : 3523193630 (0xd1ffab1e) Actual offset = 3523193630 (0xd1ffab1e) Offset from main function begin = 3523193630 (0xd1ffab1e)
+0 (0.00%) : 219072.dasm - Microsoft.Cci.FullMetadataWriter:CreateReferenceVisitor():Microsoft.Cci.ReferenceIndexer:this
@@ -89,10 +89,10 @@ G_M64343_IG02:        ; bbWeight=1, gcrefRegs=80000 {x19}, byrefRegs=0000 {}, by
             ; byrRegs -[x14]
             add     x0, x19, #0xD1FFAB1E
             ; byrRegs +[x0]
-            ldr     x21, [x0]
-            ; gcrRegs +[x21]
-            ldp     x23, x24, [x0, #0x08]
-            ; gcrRegs +[x23-x24]
+            ldp     x21, x23, [x0]
+            ; gcrRegs +[x21 x23]
+            ldr     x24, [x0, #0x10]
+            ; gcrRegs +[x24]
             movz    x25, #0xD1FFAB1E
             movk    x25, #0xD1FFAB1E LSL #16
             movk    x25, #0xD1FFAB1E LSL #32
+0 (0.00%) : 241792.dasm - System.Formats.Cbor.CborWriter+KeyValuePairEncodingRange:.ctor(int,int,int):this
@@ -20,8 +20,8 @@ G_M54047_IG01:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref,
 						;; size=8 bbWeight=1 PerfScore 1.50
 G_M54047_IG02:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0001 {x0}, byref
             ; byrRegs +[x0]
-            str     w1, [x0]
-            stp     w2, w3, [x0, #0x04]
+            stp     w1, w2, [x0]
+            str     w3, [x0, #0x08]
 						;; size=8 bbWeight=1 PerfScore 2.00
 G_M54047_IG03:        ; bbWeight=1, epilog, nogc, extend
             ldp     fp, lr, [sp], #0x10
+0 (0.00%) : 256640.dasm - System.IO.Pipelines.PipeAwaitable:ExtractCompletion(byref):this
@@ -39,10 +39,10 @@ G_M12398_IG01:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref,
 						;; size=24 bbWeight=1 PerfScore 5.50
 G_M12398_IG02:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0003 {x0 x1}, byref, isz
             ; byrRegs +[x0-x1]
-            ldr     x14, [x0]
-            ; gcrRegs +[x14]
-            ldp     x13, x12, [x0, #0x08]
-            ; gcrRegs +[x12-x13]
+            ldp     x14, x13, [x0]
+            ; gcrRegs +[x13-x14]
+            ldr     x12, [x0, #0x10]
+            ; gcrRegs +[x12]
             cbnz    x12, G_M12398_IG04
 						;; size=12 bbWeight=1 PerfScore 8.00
 G_M12398_IG03:        ; bbWeight=0.50, gcrefRegs=7000 {x12 x13 x14}, byrefRegs=0003 {x0 x1}, byref
@@ -71,8 +71,8 @@ G_M12398_IG07:        ; bbWeight=0.50, gcrefRegs=F000 {x12 x13 x14 x15}, byrefRe
 						;; size=4 bbWeight=0.50 PerfScore 1.50
 G_M12398_IG08:        ; bbWeight=1, gcrefRegs=1E000 {x13 x14 x15 xip0}, byrefRegs=0003 {x0 x1}, byref, isz
             ; gcrRegs -[x12]
-            str     xzr, [x0]
-            stp     xzr, xzr, [x0, #0x08]
+            stp     xzr, xzr, [x0]
+            str     xzr, [x0, #0x10]
             cbnz    x14, G_M12398_IG10
 						;; size=12 bbWeight=1 PerfScore 3.00
 G_M12398_IG09:        ; bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0002 {x1}, byref
coreclr_tests.run.linux.arm64.checked.mch
-20 (-33.33%) : 243626.dasm - testout1+VT_0_4_4:.ctor(int):this
@@ -19,23 +19,18 @@ G_M41861_IG01:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref,
 G_M41861_IG02:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0001 {x0}, byref
             ; byrRegs +[x0]
             fmov    d16, #1.0000
-            str     d16, [x0]
-            str     d16, [x0, #0x08]
-            str     d16, [x0, #0x10]
-            str     d16, [x0, #0x18]
-            str     d16, [x0, #0x20]
-            str     d16, [x0, #0x28]
-            str     d16, [x0, #0x30]
-            str     d16, [x0, #0x38]
-            str     d16, [x0, #0x40]
-            str     d16, [x0, #0x48]
-						;; size=44 bbWeight=1 PerfScore 10.50
+            stp     d16, d16, [x0]
+            stp     d16, d16, [x0, #0x10]
+            stp     d16, d16, [x0, #0x20]
+            stp     d16, d16, [x0, #0x30]
+            stp     d16, d16, [x0, #0x40]
+						;; size=24 bbWeight=1 PerfScore 5.50
 G_M41861_IG03:        ; bbWeight=1, epilog, nogc, extend
             ldp     fp, lr, [sp], #0x10
             ret     lr
 						;; size=8 bbWeight=1 PerfScore 2.00
 
-; Total bytes of code 60, prolog size 8, PerfScore 20.00, instruction count 15, allocated bytes for code 60 (MethodHash=e0945c7a) for method testout1+VT_0_4_4:.ctor(int):this
+; Total bytes of code 40, prolog size 8, PerfScore 13.00, instruction count 10, allocated bytes for code 40 (MethodHash=e0945c7a) for method testout1+VT_0_4_4:.ctor(int):this
 ; ============================================================
 
 Unwind Info:
@@ -46,7 +41,7 @@ Unwind Info:
   E bit             : 0
   X bit             : 0
   Vers              : 0
-  Function Length   : 15 (0x0000f) Actual length = 60 (0x00003c)
+  Function Length   : 10 (0x0000a) Actual length = 40 (0x000028)
   ---- Epilog scopes ----
   ---- Scope 0
   Epilog Start Offset        : 3523193630 (0xd1ffab1e) Actual offset = 3523193630 (0xd1ffab1e) Offset from main function begin = 3523193630 (0xd1ffab1e)
-16 (-30.77%) : 243598.dasm - testout1+VT_0_7_8:.ctor(int):this
@@ -19,21 +19,17 @@ G_M55818_IG01:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref,
 G_M55818_IG02:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0001 {x0}, byref
             ; byrRegs +[x0]
             fmov    d16, #1.0000
-            str     d16, [x0]
-            str     d16, [x0, #0x08]
-            str     d16, [x0, #0x10]
-            str     d16, [x0, #0x18]
-            str     d16, [x0, #0x20]
-            str     d16, [x0, #0x28]
-            str     d16, [x0, #0x30]
-            str     d16, [x0, #0x38]
-						;; size=36 bbWeight=1 PerfScore 8.50
+            stp     d16, d16, [x0]
+            stp     d16, d16, [x0, #0x10]
+            stp     d16, d16, [x0, #0x20]
+            stp     d16, d16, [x0, #0x30]
+						;; size=20 bbWeight=1 PerfScore 4.50
 G_M55818_IG03:        ; bbWeight=1, epilog, nogc, extend
             ldp     fp, lr, [sp], #0x10
             ret     lr
 						;; size=8 bbWeight=1 PerfScore 2.00
 
-; Total bytes of code 52, prolog size 8, PerfScore 17.20, instruction count 13, allocated bytes for code 52 (MethodHash=a1af25f5) for method testout1+VT_0_7_8:.ctor(int):this
+; Total bytes of code 36, prolog size 8, PerfScore 11.60, instruction count 9, allocated bytes for code 36 (MethodHash=a1af25f5) for method testout1+VT_0_7_8:.ctor(int):this
 ; ============================================================
 
 Unwind Info:
@@ -44,7 +40,7 @@ Unwind Info:
   E bit             : 0
   X bit             : 0
   Vers              : 0
-  Function Length   : 13 (0x0000d) Actual length = 52 (0x000034)
+  Function Length   : 9 (0x00009) Actual length = 36 (0x000024)
   ---- Epilog scopes ----
   ---- Scope 0
   Epilog Start Offset        : 3523193630 (0xd1ffab1e) Actual offset = 3523193630 (0xd1ffab1e) Offset from main function begin = 3523193630 (0xd1ffab1e)
-16 (-28.57%) : 243629.dasm - testout1+VT_0_4_1:.ctor(int):this
@@ -19,22 +19,18 @@ G_M56448_IG01:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref,
 G_M56448_IG02:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0001 {x0}, byref
             ; byrRegs +[x0]
             fmov    d16, #1.0000
-            str     d16, [x0]
-            str     d16, [x0, #0x08]
-            str     d16, [x0, #0x10]
-            str     d16, [x0, #0x18]
-            str     d16, [x0, #0x20]
-            str     d16, [x0, #0x28]
-            str     d16, [x0, #0x30]
-            str     d16, [x0, #0x38]
+            stp     d16, d16, [x0]
+            stp     d16, d16, [x0, #0x10]
+            stp     d16, d16, [x0, #0x20]
+            stp     d16, d16, [x0, #0x30]
             str     d16, [x0, #0x40]
-						;; size=40 bbWeight=1 PerfScore 9.50
+						;; size=24 bbWeight=1 PerfScore 5.50
 G_M56448_IG03:        ; bbWeight=1, epilog, nogc, extend
             ldp     fp, lr, [sp], #0x10
             ret     lr
 						;; size=8 bbWeight=1 PerfScore 2.00
 
-; Total bytes of code 56, prolog size 8, PerfScore 18.60, instruction count 14, allocated bytes for code 56 (MethodHash=c6a1237f) for method testout1+VT_0_4_1:.ctor(int):this
+; Total bytes of code 40, prolog size 8, PerfScore 13.00, instruction count 10, allocated bytes for code 40 (MethodHash=c6a1237f) for method testout1+VT_0_4_1:.ctor(int):this
 ; ============================================================
 
 Unwind Info:
@@ -45,7 +41,7 @@ Unwind Info:
   E bit             : 0
   X bit             : 0
   Vers              : 0
-  Function Length   : 14 (0x0000e) Actual length = 56 (0x000038)
+  Function Length   : 10 (0x0000a) Actual length = 40 (0x000028)
   ---- Epilog scopes ----
   ---- Scope 0
   Epilog Start Offset        : 3523193630 (0xd1ffab1e) Actual offset = 3523193630 (0xd1ffab1e) Offset from main function begin = 3523193630 (0xd1ffab1e)
+0 (0.00%) : 388608.dasm - JIT.HardwareIntrinsics.Arm._AdvSimd.Arm64.SimpleTernaryOpTest__FusedMultiplyAddBySelectedScalar_Vector128_Single_Vector128_Single_3:.ctor():this
@@ -350,9 +350,10 @@ G_M34739_IG03:        ; bbWeight=4, isz, extend
             cmp     w0, #3
             bls     G_M34739_IG05
             str     s0, [x21, #0x1C]
-            ldr     x21, [x20]
-            ldp     x22, x20, [x20, #0x08]
-            ; gcrRegs +[x20 x22]
+            ldp     x21, x22, [x20]
+            ; gcrRegs +[x22]
+            ldr     x20, [x20, #0x10]
+            ; gcrRegs +[x20]
             movz    x0, #0xD1FFAB1E
             movk    x0, #0xD1FFAB1E LSL #16
             movk    x0, #0xD1FFAB1E LSL #32
+0 (0.00%) : 454528.dasm - JIT.HardwareIntrinsics.Arm._AdvSimd.SimpleTernaryOpTest__MultiplyBySelectedScalarWideningUpperAndSubtract_Vector128_UInt32_Vector64_UInt32_1:.ctor():this
@@ -268,9 +268,10 @@ G_M34358_IG02:        ; bbWeight=4, gcrefRegs=80000 {x19}, byrefRegs=0000 {}, by
             cmp     w1, #1
             bls     G_M34358_IG05
             str     w0, [x21, #0x14]
-            ldr     x21, [x20]
-            ldp     x22, x20, [x20, #0x08]
-            ; gcrRegs +[x20 x22]
+            ldp     x21, x22, [x20]
+            ; gcrRegs +[x22]
+            ldr     x20, [x20, #0x10]
+            ; gcrRegs +[x20]
 						;; size=764 bbWeight=4 PerfScore 1086.00
 G_M34358_IG03:        ; bbWeight=4, extend
             movz    x0, #0xD1FFAB1E
+0 (0.00%) : 458240.dasm - JIT.HardwareIntrinsics.Arm._AdvSimd.SimpleTernaryOpTest__MultiplySubtractByScalar_Vector64_Int16:.ctor():this
@@ -350,9 +350,10 @@ G_M3199_IG03:        ; bbWeight=4, isz, extend
             cmp     w1, #3
             bls     G_M3199_IG05
             strh    w0, [x21, #0x16]
-            ldr     x21, [x20]
-            ldp     x22, x20, [x20, #0x08]
-            ; gcrRegs +[x20 x22]
+            ldp     x21, x22, [x20]
+            ; gcrRegs +[x22]
+            ldr     x20, [x20, #0x10]
+            ; gcrRegs +[x20]
             movz    x0, #0xD1FFAB1E
             movk    x0, #0xD1FFAB1E LSL #16
             movk    x0, #0xD1FFAB1E LSL #32
Details

Improvements/regressions per collection

| Collection | Contexts with diffs | Improvements | Regressions | Same size | Improvements (bytes) | Regressions (bytes) |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| benchmarks.run.linux.arm64.checked.mch | 593 | 555 | 0 | 38 | -4,632 | +0 |
| libraries_tests.pmi.linux.arm64.checked.mch | 5,064 | 4,805 | 0 | 259 | -46,408 | +0 |
| libraries.crossgen2.linux.arm64.checked.mch | 1,237 | 1,152 | 0 | 85 | -6,536 | +0 |
| libraries.pmi.linux.arm64.checked.mch | 3,963 | 3,779 | 0 | 184 | -24,572 | +0 |
| coreclr_tests.run.linux.arm64.checked.mch | 45,744 | 45,060 | 0 | 684 | -687,564 | +0 |
| Total | 56,601 | 55,351 | 0 | 1,250 | -769,712 | +0 |

Context information

| Collection | Diffed contexts | MinOpts | FullOpts | Missed, base | Missed, diff |
| --- | ---: | ---: | ---: | ---: | ---: |
| benchmarks.run.linux.arm64.checked.mch | 42,108 | 6,912 | 35,196 | 0 (0.00%) | 0 (0.00%) |
| libraries_tests.pmi.linux.arm64.checked.mch | 367,550 | 7,902 | 359,648 | 0 (0.00%) | 0 (0.00%) |
| libraries.crossgen2.linux.arm64.checked.mch | 174,775 | 15 | 174,760 | 0 (0.00%) | 0 (0.00%) |
| libraries.pmi.linux.arm64.checked.mch | 257,063 | 4,760 | 252,303 | 0 (0.00%) | 0 (0.00%) |
| coreclr_tests.run.linux.arm64.checked.mch | 628,239 | 382,881 | 245,358 | 3 (0.00%) | 3 (0.00%) |
| Total | 1,469,735 | 402,470 | 1,067,265 | 3 (0.00%) | 3 (0.00%) |

jit-analyze output

@EgorBo
Member

EgorBo commented Mar 30, 2023

@SwapnilGaikwad do you want us to kick off various jitstress/gcstress jobs?

@kunalspathak kunalspathak self-requested a review March 30, 2023 19:13
@kunalspathak
Member

/azp run runtime-coreclr gcstress0x3-gcstress0xc

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@kunalspathak
Member

/azp run runtime-coreclr jitstress

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Member

@kunalspathak left a comment

Nice diffs.

src/coreclr/jit/emitarm64.cpp (outdated)
 return eRO_none;
 }

-if (lastInsFmt != fmt)
+if (lastInsFmt != fmt && !(lastInsFmt == IF_LS_2B && fmt == IF_LS_2A) &&
Member

Wondering why we have to add additional checks for IF_LS_2B and IF_LS_2A? Are they specifically because we are adding vector register support? Why were they not needed previously?

Member

So the issue here is that we use IF_LS_2A for base (no offset) and IF_LS_2B for base + offset? Presumably imm/prevImm are correctly zero for 2A?

Would it be easier to read inverted? i.e.,

const bool compatibleFmt = (lastInsFmt == fmt) || (lastInsFmt == IF_LS_2B && fmt == IF_LS_2A) || (lastInsFmt == IF_LS_2A && fmt == IF_LS_2B);
if (!compatibleFmt) {... return eRO_none; }

?
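
For readers following the thread, the inverted check suggested above can be sketched in isolation as below. This is only an illustration under stated assumptions, not the code in `emitarm64.cpp`: the `InsFormat` enum and the `areFormatsCompatible` helper are hypothetical stand-ins, `IF_OTHER` is an invented placeholder, and only the names `IF_LS_2A` and `IF_LS_2B` come from the discussion.

```cpp
#include <cassert>

// Hypothetical stand-in for the JIT's instruction-format enum. Only the
// names IF_LS_2A and IF_LS_2B are taken from the review discussion.
enum InsFormat
{
    IF_LS_2A, // load/store addressing with a base register only (no offset)
    IF_LS_2B, // load/store addressing with base register + immediate offset
    IF_OTHER  // invented placeholder for any unrelated format
};

// Hypothetical helper: true when the formats of two consecutive ldr/str
// instructions do not, by themselves, rule out merging into ldp/stp.
// Identical formats are compatible, and so is any mix of 2A and 2B.
bool areFormatsCompatible(InsFormat lastInsFmt, InsFormat fmt)
{
    return (lastInsFmt == fmt) ||
           (lastInsFmt == IF_LS_2B && fmt == IF_LS_2A) ||
           (lastInsFmt == IF_LS_2A && fmt == IF_LS_2B);
}
```

A caller would return early (`eRO_none` in the snippet under review) when this predicate is false. Checks on registers, operand sizes, and offsets are separate and are not modelled here.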

Member

Are there any general register (non-Vector) diffs from just this change?

Member

> So the issue here is that we use IF_LS_2A for base (no offset) and IF_LS_2B for base + offset?

Sure, but we don't do it for GPR?

Contributor Author

Sorry, the PR title doesn't describe the full functionality. The explicit format check lets us catch consecutive ldr/str instructions where one uses an immediate offset and the other does not. This applies to both general-purpose and SIMD/vector registers.

> Are there any general register (non-Vector) diffs from just this change?

Yup, there are multiple such changes. e.g.,

-4 (-16.67%) : 2709.dasm - System.ValueTuple`2[long,System.DateTime]:.ctor(long,System.DateTime):this
@@ -20,15 +20,14 @@ G_M30325_IG01:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref,
 						;; size=8 bbWeight=1 PerfScore 1.50
 G_M30325_IG02:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0001 {x0}, byref
             ; byrRegs +[x0]
-            str     x1, [x0]
-            str     x2, [x0, #0x08]
-						;; size=8 bbWeight=1 PerfScore 2.00
+            stp     x1, x2, [x0]
+						;; size=4 bbWeight=1 PerfScore 1.00
 G_M30325_IG03:        ; bbWeight=1, epilog, nogc, extend
             ldp     fp, lr, [sp], #0x10
             ret     lr
 						;; size=8 bbWeight=1 PerfScore 2.00
 
-; Total bytes of code 24, prolog size 8, PerfScore 7.90, instruction count 6, allocated bytes for code 24 (MethodHash=b3c1898a) for method System.ValueTuple`2[long,System.DateTime]:.ctor(long,System.DateTime):this
+; Total bytes of code 20, prolog size 8, PerfScore 6.50, instruction count 5, allocated bytes for code 20 (MethodHash=b3c1898a) for method System.ValueTuple`2[long,System.DateTime]:.ctor(long,System.DateTime):this
 ; ============================================================
 

> Sure, but we don't do it for GPR?

Yup, I think it was missed previously.

Matching consecutive ldr/str instructions with mixed formats lets us optimise further than the previous optimisation allowed, e.g.,

Previously, the following sequence

str     s0, [x0]
str     s1, [x0, #0x04]
str     s2, [x0, #0x08]
str     s3, [x0, #0x0C]

may have been optimised to

str     s0, [x0]
stp     s1, s2, [x0, #0x04]
str     s3, [x0, #0x0C]

but now would be optimised to

stp     s0, s1, [x0]
stp     s2, s3, [x0, #0x08]
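
The before/after behaviour above can be reproduced with a small stand-alone model of a greedy pairing peephole. This is a sketch under loud assumptions rather than the JIT's implementation: `Store` and `emitStores` are hypothetical names, every store is assumed to use the same base register and operand size, and a zero offset is assumed to always mean the base-only format while a non-zero offset means the base-plus-offset format.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified model of an emitted store (hypothetical, not the JIT's
// instruction descriptor). All stores are assumed to share one base register.
struct Store
{
    std::string reg;    // source register, e.g. "s0"
    int         offset; // byte offset from the base register
    bool        paired; // true once this entry represents an stp
    std::string reg2;   // second source register when paired
};

// Greedy peephole: each incoming str is merged with the most recently
// emitted instruction when that one is still an unpaired str at the
// directly preceding address. `allowMixedFormats` models the new check:
// when false, a base-only store (offset 0) cannot pair with a
// base+offset store.
std::vector<Store> emitStores(const std::vector<Store>& input, int size, bool allowMixedFormats)
{
    std::vector<Store> out;
    for (const Store& cur : input)
    {
        if (!out.empty())
        {
            Store&     last       = out.back();
            const bool adjacent   = (cur.offset == last.offset + size);
            const bool sameFormat = ((last.offset == 0) == (cur.offset == 0));
            if (!last.paired && adjacent && (sameFormat || allowMixedFormats))
            {
                last.paired = true; // rewrite the previous str into an stp
                last.reg2   = cur.reg;
                continue;
            }
        }
        out.push_back(cur);
    }
    return out;
}
```

Feeding the four 4-byte stores at offsets 0, 4, 8 and 12 through `emitStores` with `allowMixedFormats` set to `false` yields str, stp (at offset 4), str. With it set to `true` the result is two stp instructions at offsets 0 and 8, which matches the sequences shown above.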

Contributor Author

@SwapnilGaikwad Mar 31, 2023

> Would it be easier to read inverted? i.e.,

Sure, this is more readable. Done 👍

@ghost ghost added the needs-author-action An issue or pull request that requires more info or actions from the author. label Mar 30, 2023
@kunalspathak
Member

All the gcstress failures are pre-existing ones.

@ghost ghost removed the needs-author-action An issue or pull request that requires more info or actions from the author. label Mar 31, 2023
Member

@kunalspathak left a comment

LGTM. Thanks for your contributions!

Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI community-contribution Indicates that the PR has been added by a community member
4 participants