
Implement Narrow and Widen using SIMDAsHWIntrinsic #60094

Merged: 5 commits into dotnet:main on Nov 12, 2021

Conversation

tannergooding
Member

This continues the work on #49397, which started with #53450.

In particular, this moves Narrow and Widen to be implemented using the general SIMDAsHWIntrinsic logic, and then has the new APIs in Vector64/128/256 use the same shared entry points.

There will be a few more PRs after this one covering:

  • Floating-point to Integer conversions
  • CompareAll and CompareAny for each comparison type (GT, GE, LT, LE, etc)
  • Native Integer support
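As context for the shared entry points this PR wires up, a minimal round trip through the Vector128 Widen/Narrow surface might look like the following. This is a sketch assuming .NET 7 or later, where these static methods are public:

```csharp
using System;
using System.Runtime.Intrinsics;

class WidenNarrowDemo
{
    static void Main()
    {
        // 16 bytes: 0..15.
        Vector128<byte> bytes = Vector128.Create(
            (byte)0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);

        // Widen splits one vector of N elements into two vectors of N/2
        // elements, each twice as wide (byte -> ushort here).
        (Vector128<ushort> lower, Vector128<ushort> upper) = Vector128.Widen(bytes);

        // Narrow is the inverse: it packs two wide vectors back into one.
        Vector128<byte> roundTripped = Vector128.Narrow(lower, upper);

        Console.WriteLine(bytes.Equals(roundTripped)); // True
    }
}
```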

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Oct 6, 2021
@ghost

ghost commented Oct 6, 2021

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

This continues the work on #49397, which started with #53450.

In particular, this moves Narrow and Widen to be implemented using the general SIMDAsHWIntrinsic logic, and then has the new APIs in Vector64/128/256 use the same shared entry points.

There will be a few more PRs after this one covering:

  • Floating-point to Integer conversions
  • CompareAll and CompareAny for each comparison type (GT, GE, LT, LE, etc)
  • Native Integer support
Author: tannergooding
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

@tannergooding
Member Author

Will run outerloop jobs (azp run runtime-coreclr jitstress-isas-x86, runtime-coreclr jitstress-isas-arm, runtime-coreclr outerloop) after the initial CI results come back as successful.

Also plan on collecting PMI diffs before marking this ready-for-review.

@tannergooding
Member Author

/azp run runtime-coreclr jitstress-isas-x86, runtime-coreclr jitstress-isas-arm, runtime-coreclr outerloop

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

@tannergooding
Member Author

/azp run runtime-coreclr jitstress-isas-x86, runtime-coreclr jitstress-isas-arm, runtime-coreclr outerloop

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

@tannergooding
Member Author

/azp run runtime-coreclr jitstress-isas-x86, runtime-coreclr jitstress-isas-arm, runtime-coreclr outerloop

@tannergooding tannergooding marked this pull request as ready for review October 8, 2021 23:07
@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

@tannergooding
Member Author

No diff for benchmarks.

==============================

Frameworks has the following improvements:

Top method improvements (bytes):
        -605 (-68.52% of base) : System.Private.CoreLib.dasm - Vector:Widen(Vector`1,byref,byref) (7 methods)
        -605 (-66.92% of base) : System.Private.CoreLib.dasm - Vector256:Widen(Vector256`1):ValueTuple`2 (7 methods)
        -545 (-69.25% of base) : System.Private.CoreLib.dasm - Vector128:Widen(Vector128`1):ValueTuple`2 (7 methods)
        -431 (-55.19% of base) : System.Private.CoreLib.dasm - Vector64:Widen(Vector64`1):ValueTuple`2 (7 methods)
         -24 (-20.34% of base) : System.Private.CoreLib.dasm - Latin1Utility:WidenLatin1ToUtf16_Fallback(long,long,long)

Noting that the Vector:Widen improvements are due to the code being "vectorized" now, since it forwards to the WidenLower and WidenUpper helper intrinsics. The actual interesting change is to WidenLatin1ToUtf16, which is a small improvement in codegen versus the previous:

; Latin1Utility:WidenLatin1ToUtf16_Fallback(long,long,long)
        vmovupd  ymm0, ymmword ptr[rcx+rax]
-       vpermq   ymm2, ymm0, -44
-       vxorps   ymm1, ymm1
-       vpunpcklbw ymm2, ymm1
-       vpermq   ymm0, ymm0, -24
-       vxorps   ymm1, ymm1
-       vpunpckhbw ymm0, ymm1
+       vmovaps  ymm1, ymm0
+       vpmovzxbw ymm1, ymm1
+       vextractf128 xmm0, xmm0, 1
+       vpmovzxbw ymm0, ymm0

and (no diff shown against the old software impl, just showing a vectorized example)

; Vector:Widen(Vector`1,byref,byref)
       vmovdqu  ymm0, ymmword ptr[rcx]
       vpmovzxbw ymm0, ymm0
       vmovupd  ymmword ptr[rdx], ymm0
       vmovupd  ymm0, ymmword ptr[rcx]
       vextractf128 xmm0, xmm0, 1
       vpmovzxbw ymm0, ymm0
       vmovupd  ymmword ptr[r8], ymm0

; Vector256:Widen(Vector256`1):ValueTuple`2
       vmovupd  ymm0, ymmword ptr[rdx]
       vmovaps  ymm1, ymm0
       vpmovzxbw ymm1, ymm1
       vextractf128 xmm0, xmm0, 1
       vpmovzxbw ymm0, ymm0
       vmovupd  ymmword ptr[rcx], ymm1
       vmovupd  ymmword ptr[rcx+32], ymm0

; Vector128:Widen(Vector128`1):ValueTuple`2
       vmovupd  xmm0, xmmword ptr [rdx]
       vpmovzxbw xmm1, xmm0
       vpsrldq  xmm0, xmm0, 8
       vpmovzxbw xmm0, xmm0
       vmovupd  xmmword ptr [rcx], xmm1
       vmovupd  xmmword ptr [rcx+16], xmm0
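The first disassembly above corresponds to the System.Numerics.Vector.Widen overload that writes its results through out parameters. A short caller might look like this (a sketch; the element counts depend on the hardware vector width at runtime):

```csharp
using System;
using System.Numerics;

class VectorWidenDemo
{
    static void Main()
    {
        // Vector<byte> is hardware-width: e.g. 32 bytes with AVX2,
        // 16 bytes with SSE2 or AdvSimd.
        var source = new Vector<byte>(200);

        // The byref overload widens into two out parameters, each holding
        // half of the elements at twice the width.
        Vector.Widen(source, out Vector<ushort> low, out Vector<ushort> high);

        Console.WriteLine(low[0]);  // 200
        Console.WriteLine(high[0]); // 200
    }
}
```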

==============================

Tests has the following improvements:

Top file improvements (bytes):
       -3265 : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm (-0.13% of base)
       -2284 : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm (-0.09% of base)
        -922 : JIT\HardwareIntrinsics\General\Vector256\Vector256_r\Vector256_r.dasm (-0.03% of base)
        -714 : JIT\HardwareIntrinsics\General\Vector128\Vector128_r\Vector128_r.dasm (-0.02% of base)
        -343 : JIT\HardwareIntrinsics\General\Vector64\Vector64_ro\Vector64_ro.dasm (-0.02% of base)
        -208 : JIT\SIMD\VectorConvert_r_Target_64Bit\VectorConvert_r_Target_64Bit.dasm (-0.17% of base)
        -200 : JIT\SIMD\VectorConvert_ro_Target_64Bit\VectorConvert_ro_Target_64Bit.dasm (-0.84% of base)
         -20 : JIT\Regression\JitBlue\GitHub_13568\GitHub_13568\GitHub_13568.dasm (-2.00% of base)
          -8 : JIT\Regression\JitBlue\GitHub_23159\GitHub_23159\GitHub_23159.dasm (-1.17% of base)

Top method improvements (bytes):
         -58 (-13.91% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorBinaryOpTest__NarrowDouble:RunStructFldScenario():this
         -58 (-13.91% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorBinaryOpTest__NarrowDouble:RunStructLclFldScenario():this
         -56 (-15.01% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - TestStruct:RunStructFldScenario(VectorBinaryOpTest__NarrowDouble):this
         -55 (-11.88% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - TestStruct:RunStructFldScenario(VectorBinaryOpTest__NarrowDouble):this
         -52 (-13.72% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorBinaryOpTest__NarrowDouble:RunClassFldScenario():this
         -52 (-12.75% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorBinaryOpTest__NarrowDouble:RunClassLclFldScenario():this
         -52 (-12.01% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorBinaryOpTest__NarrowDouble:RunClsVarScenario():this
         -51 (-10.87% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - VectorBinaryOpTest__NarrowDouble:RunClassFldScenario():this
         -51 (-10.24% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - VectorBinaryOpTest__NarrowDouble:RunClassLclFldScenario():this
         -51 (-9.75% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - VectorBinaryOpTest__NarrowDouble:RunClsVarScenario():this
         -50 (-11.99% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorBinaryOpTest__NarrowInt16:RunStructFldScenario():this
         -50 (-11.99% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorBinaryOpTest__NarrowInt16:RunStructLclFldScenario():this
         -50 (-11.99% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorBinaryOpTest__NarrowInt64:RunStructFldScenario():this
         -50 (-11.99% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorBinaryOpTest__NarrowInt64:RunStructLclFldScenario():this
         -50 (-11.99% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorBinaryOpTest__NarrowUInt16:RunStructFldScenario():this
         -50 (-11.99% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorBinaryOpTest__NarrowUInt16:RunStructLclFldScenario():this

Noting that most of the improvements come from the Vector128/256 Narrow/Widen calls being intrinsified now, where they weren't before. The interesting diffs are for JIT\SIMD, where Vector<T> codegen is improved over the previous logic:

-       vmovupd  ymm6, ymmword ptr[rsi]
-       vmovupd  ymm0, ymmword ptr[rdi]
-       vcvtpd2ps ymm6, ymm6
-       vcvtpd2ps ymm1, ymm0
-       vinsertf128 xmm6, xmm1, 1
-       vcvtps2pd ymm7, ymm6
-       vextractf128 xmm8, xmm6, 1
-       vcvtps2pd ymm8, ymm8
+       vcvtpd2ps ymm0, ymmword ptr[rsi]
+       vcvtpd2ps ymm1, ymmword ptr[rdi]
+       vinsertf128 xmm6, xmm0, xmm1, 1
+       vmovaps  ymm0, ymm6
+       vcvtps2pd ymm7, ymm0
+       vextractf128 xmm0, xmm6, 1
+       vcvtps2pd ymm8, ymm0

@tannergooding
Member Author

This should be ready for review now.

@tannergooding
Member Author

Rebased onto dotnet/main. This is still ready for review.

@JulieLeeMSFT JulieLeeMSFT added this to the 7.0.0 milestone Nov 11, 2021
@JulieLeeMSFT
Member

CC @echesakovMSFT PTAL.

public static unsafe (Vector64<ushort> Lower, Vector64<ushort> Upper) Widen(Vector64<byte> source)
=> (WidenLower(source), WidenUpper(source));
Contributor


@tannergooding Have you looked at the code produced for such function call? Do we pass the return values properly in two registers from a callee to the caller without copying to/from the memory?

Member Author


The codegen looked good when I had checked, let me rebuild and get a disasm example to share however.

Contributor


Thanks, I remember seeing some strange artifacts with multi reg returns while I was working on #52424.

Member Author


It's worth noting this isn't a multi-reg return; rather, it's a managed wrapper method that calls two helper intrinsics, since no platform implements this as a single instruction with a multi-reg result.
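The wrapper pattern described here, composing two single-register helper intrinsics rather than one multi-reg operation, can be observed directly. This is a sketch assuming .NET 7 or later, where WidenLower/WidenUpper are exposed publicly:

```csharp
using System;
using System.Runtime.Intrinsics;

class WidenHalvesDemo
{
    static void Main()
    {
        Vector128<byte> source = Vector128.Create(
            (byte)0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);

        // Each helper covers one half: WidenLower zero-extends elements
        // 0..7, WidenUpper elements 8..15. On x64 these map to pmovzxbw,
        // with a shift/shuffle to bring the upper half down first.
        Vector128<ushort> lower = Vector128.WidenLower(source);
        Vector128<ushort> upper = Vector128.WidenUpper(source);

        Console.WriteLine(lower.GetElement(7)); // 7
        Console.WriteLine(upper.GetElement(0)); // 8
    }
}
```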

Contributor


Sure, it's not a multi-reg intrinsic, but it's still a multi-reg call returning a value in D0 and D1 registers, so I was wondering whether the codegen for such calls (incl. the method) is optimal.

Member Author


@echesakovMSFT, it looks like yes, there are still some oddities there: it doesn't prefer q0/q1 as the target registers, nor does it directly do a mov; instead it does a store/load.

        4EA01C10          mov     v16.16b, v0.16b
        0E617A10          fcvtl   v16.2d, v16.2s
        4E617811          fcvtl2  v17.2d, v0.4s
        3D8007B0          str     q16, [fp,#16]
        3D800BB1          str     q17, [fp,#32]
        3DC007A0          ldr     q0, [fp,#16]
        3DC00BA1          ldr     q1, [fp,#32]

It looks like this gets imported initially as:

               [000014] I-CXG-------              *  CALL      void   ValueTuple`2..ctor (exactContextHnd=0x00007FFA49A115C9)
               [000013] ------------ this in x0   +--*  ADDR      byref 
               [000012] -------N----              |  \--*  LCL_VAR   struct<ValueTuple`2, 32> V02 tmp1         
               [000018] ---XG------- arg1         +--*  OBJ       simd16<Vector128`1>
               [000017] ------------              |  \--*  ADDR      byref 
               [000004] -------N----              |     \--*  HWINTRINSIC simd16 float ConvertToDouble
               [000003] ------------              |        \--*  HWINTRINSIC simd8  float GetLower
               [000002] n-----------              |           \--*  OBJ       simd16<Vector128`1>
               [000001] ------------              |              \--*  ADDR      byref 
               [000000] -------N----              |                 \--*  LCL_VAR   simd16<Vector128`1> V00 arg0         
               [000016] ---XG------- arg2         \--*  OBJ       simd16<Vector128`1>
               [000015] ------------                 \--*  ADDR      byref 
               [000008] -------N----                    \--*  HWINTRINSIC simd16 float ConvertToDoubleUpper
               [000007] n-----------                       \--*  OBJ       simd16<Vector128`1>
               [000006] ------------                          \--*  ADDR      byref 
               [000005] -------N----                             \--*  LCL_VAR   simd16<Vector128`1> V00 arg0         

However, there are locals introduced and indirections kept that never truly go away, even in the face of HVA/HFA. So before rationalization we still have:

***** BB01
STMT00000 (IL 0x000...0x016)
N003 (  5,  5) [000011] IA------R---              *  ASG       struct (init) $VN.Void
N002 (  3,  2) [000009] D------N----              +--*  LCL_VAR   struct<ValueTuple`2, 32> V02 tmp1         d:1
N001 (  1,  2) [000010] ------------              \--*  CNS_INT   int    0 $40

***** BB01
STMT00005 (IL   ???...  ???)
N007 (  9,  8) [000037] -A--G---R---              *  ASG       simd16 (copy) $200
N006 (  1,  1) [000035] D------N----              +--*  LCL_VAR   simd16<Vector128`1> V03 tmp2         d:1 $200
N005 (  9,  8) [000004] -------N----              \--*  HWINTRINSIC simd16 float ConvertToDouble $200
N004 (  8,  7) [000003] ------------                 \--*  HWINTRINSIC simd8  float GetLower $1c0
N003 (  7,  6) [000002] n-----------                    \--*  OBJ       simd16<Vector128`1> $81
N002 (  1,  2) [000001] ------------                       \--*  ADDR      byref  $140
N001 (  1,  1) [000000] -------N----                          \--*  LCL_VAR   simd16<Vector128`1> V00 arg0         u:1 $80

***** BB01
STMT00006 (IL   ???...  ???)
N006 (  8,  7) [000040] -A--G---R---              *  ASG       simd16 (copy) $84
N005 (  1,  1) [000038] D------N----              +--*  LCL_VAR   simd16<Vector128`1> V04 tmp3         d:1 $84
N004 (  8,  7) [000008] -------N----              \--*  HWINTRINSIC simd16 float ConvertToDoubleUpper $84
N003 (  7,  6) [000007] n-----------                 \--*  OBJ       simd16<Vector128`1> $83
N002 (  1,  2) [000006] ------------                    \--*  ADDR      byref  $140
N001 (  1,  1) [000005] -------N----                       \--*  LCL_VAR   simd16<Vector128`1> V00 arg0         u:1 (last use) $80

***** BB01
STMT00003 (IL   ???...  ???)
N003 (  5,  6) [000027] -A------R---              *  ASG       simd16 (copy) $200
N002 (  3,  4) [000022] U------N----              +--*  LCL_FLD   simd16 V02 tmp1         ud:1->2[+0] Fseq[Item1] $300
N001 (  1,  1) [000023] ------------              \--*  LCL_VAR   simd16<Vector128`1> V03 tmp2         u:1 (last use) $200

***** BB01
STMT00004 (IL   ???...  ???)
N003 (  5,  6) [000034] -A------R---              *  ASG       simd16 (copy) $84
N002 (  3,  4) [000029] U------N----              +--*  LCL_FLD   simd16 V02 tmp1         ud:2->3[+16] Fseq[Item2] $301
N001 (  1,  1) [000030] ------------              \--*  LCL_VAR   simd16<Vector128`1> V04 tmp3         u:1 (last use) $84

***** BB01
STMT00002 (IL 0x016...  ???)
N002 (  4,  3) [000020] ------------              *  RETURN    struct $101
N001 (  3,  2) [000019] -------N----              \--*  LCL_VAR   struct<ValueTuple`2, 32> V02 tmp1         u:3 (last use) $301

Member Author


-- Noting this is without inlining. With inlining, the JIT sometimes does the right thing and other times does not.

Contributor


Okay, thanks for checking - this is what I suspected would happen - we should work on the issue in .NET 7.

Contributor

@echesakov echesakov left a comment


Changes LGTM.

@tannergooding tannergooding merged commit 5fa6dd3 into dotnet:main Nov 12, 2021
@EgorBo
Member

EgorBo commented Nov 16, 2021

Ubuntu-x64 improvements: dotnet/perf-autofiling-issues#2339
Alpine-x64 improvements: dotnet/perf-autofiling-issues#2362
windows-x64 improvements: dotnet/perf-autofiling-issues#2346
windows-amd-x64 improvements: dotnet/perf-autofiling-issues#2374

@AndyAyersMS
Member

AndyAyersMS commented Nov 23, 2021
