Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Port SIMDIntrinsicGetItem and SIMDIntrinsicSetItem to be implemented via HWIntrinsics #52288

Merged
merged 6 commits into from
May 18, 2021

Conversation

tannergooding
Copy link
Member

@tannergooding tannergooding commented May 5, 2021

This is a continuation of past PRs (#35421 and #37882) which ported the legacy System.Numerics code to be implemented in terms of the newer System.Runtime.Intrinsics code.

In particular, this updates SIMDIntrinsicGetItem and SIMDIntrinsicSetItem to be implemented via the corresponding HWIntrinsics functionality bringing along benefits to both codepaths. Namely, the System.Numerics codepath becomes VEX aware and has better instruction selection, while the System.Runtime.Intrinsics codepath picked up an optimization for non-constant operands.

Some example diffs are:

  vextractf128 xmm0, ymm0, 1
- vpsrldq  xmm0, 8
- vmovd    edx, xmm0
+ vpextrq  edx, xmm0, 1

and:

- vmovaps  xmm4, xmm3
- vpsrldq  xmm4, 4
- vmovaps  xmm5, xmm2
- vpsrldq  xmm5, 8
+ vmovshdup xmm4, xmm3
+ vunpckhps xmm5, xmm2, xmm2
  vmulss   xmm6, xmm4, xmm5
- vmovaps  xmm7, xmm3
- vpsrldq  xmm7, 8
- vmovaps  xmm8, xmm2
- vpsrldq  xmm8, 4
+ vunpckhps xmm7, xmm3, xmm3
+ vmovshdup xmm8, xmm2
  vmulss   xmm9, xmm7, xmm8

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 5, 2021
@tannergooding
Copy link
Member Author

tannergooding commented May 5, 2021

Framework diff is:

Found 317 files with textual diffs.

Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 53448994
Total bytes of diff: 53442523
Total bytes of delta: -6471 (-0.01% of base)
    diff is an improvement.


Top file improvements (bytes):
       -6440 : System.Private.CoreLib.dasm (-0.14% of base)
         -19 : System.Drawing.Common.dasm (-0.01% of base)
         -12 : System.Text.Json.dasm (-0.00% of base)

3 total files with Code Size differences (3 improved, 0 regressed), 268 unchanged.

Top method regressions (bytes):
           4 (12.50% of base) : System.Private.CoreLib.dasm - Vector128:AsVector128(Vector2):Vector128`1

Top method improvements (bytes):
       -3830 (-99.61% of base) : System.Private.CoreLib.dasm - Sse41:Extract(Vector128`1,ubyte):float
        -409 (-12.10% of base) : System.Private.CoreLib.dasm - Vector256`1:ToString():String:this (6 methods)
        -327 (-46.12% of base) : System.Private.CoreLib.dasm - Vector128`1:<Equals>g__SoftwareFallback|12_0(byref,Vector128`1):bool (6 methods)
        -273 (-40.15% of base) : System.Private.CoreLib.dasm - Vector256`1:<Equals>g__SoftwareFallback|14_0(byref,Vector256`1):bool (6 methods)
        -253 (-7.89% of base) : System.Private.CoreLib.dasm - Vector128`1:ToString():String:this (6 methods)
        -206 (-27.43% of base) : System.Private.CoreLib.dasm - Vector128`1:GetHashCode():int:this (6 methods)
        -198 (-28.01% of base) : System.Private.CoreLib.dasm - Vector256`1:GetHashCode():int:this (6 methods)
        -104 (-9.52% of base) : System.Private.CoreLib.dasm - Matrix4x4:CreateConstrainedBillboard(Vector3,Vector3,Vector3,Vector3,Vector3):Matrix4x4
         -98 (-75.38% of base) : System.Private.CoreLib.dasm - Vector`1:get_Item(int):ubyte:this
         -98 (-74.81% of base) : System.Private.CoreLib.dasm - Vector`1:get_Item(int):short:this
         -98 (-75.97% of base) : System.Private.CoreLib.dasm - Vector`1:get_Item(int):int:this
         -98 (-73.13% of base) : System.Private.CoreLib.dasm - Vector`1:get_Item(int):double:this
         -98 (-75.38% of base) : System.Private.CoreLib.dasm - Vector`1:get_Item(int):long:this
         -72 (-4.25% of base) : System.Private.CoreLib.dasm - Matrix4x4:Decompose(Matrix4x4,byref,byref,byref):bool
         -61 (-8.32% of base) : System.Private.CoreLib.dasm - Matrix4x4:CreateLookAt(Vector3,Vector3,Vector3):Matrix4x4
         -31 (-5.24% of base) : System.Private.CoreLib.dasm - Matrix4x4:CreateBillboard(Vector3,Vector3,Vector3,Vector3):Matrix4x4
         -30 (-5.80% of base) : System.Private.CoreLib.dasm - Matrix4x4:CreateWorld(Vector3,Vector3,Vector3):Matrix4x4
         -20 (-7.30% of base) : System.Private.CoreLib.dasm - Plane:CreateFromVertices(Vector3,Vector3,Vector3):Plane
         -17 (-1.95% of base) : System.Drawing.Common.dasm - Graphics:GetContextInfo(byref,bool,byref):this
         -16 (-1.63% of base) : System.Private.CoreLib.dasm - SpanHelpers:LastIndexOf(byref,ubyte,int):int (2 methods)

Top method regressions (percentages):
           4 (12.50% of base) : System.Private.CoreLib.dasm - Vector128:AsVector128(Vector2):Vector128`1

Top method improvements (percentages):
       -3830 (-99.61% of base) : System.Private.CoreLib.dasm - Sse41:Extract(Vector128`1,ubyte):float
         -98 (-75.97% of base) : System.Private.CoreLib.dasm - Vector`1:get_Item(int):int:this
         -98 (-75.38% of base) : System.Private.CoreLib.dasm - Vector`1:get_Item(int):ubyte:this
         -98 (-75.38% of base) : System.Private.CoreLib.dasm - Vector`1:get_Item(int):long:this
         -98 (-74.81% of base) : System.Private.CoreLib.dasm - Vector`1:get_Item(int):short:this
         -98 (-73.13% of base) : System.Private.CoreLib.dasm - Vector`1:get_Item(int):double:this
        -327 (-46.12% of base) : System.Private.CoreLib.dasm - Vector128`1:<Equals>g__SoftwareFallback|12_0(byref,Vector128`1):bool (6 methods)
        -273 (-40.15% of base) : System.Private.CoreLib.dasm - Vector256`1:<Equals>g__SoftwareFallback|14_0(byref,Vector256`1):bool (6 methods)
        -198 (-28.01% of base) : System.Private.CoreLib.dasm - Vector256`1:GetHashCode():int:this (6 methods)
        -206 (-27.43% of base) : System.Private.CoreLib.dasm - Vector128`1:GetHashCode():int:this (6 methods)
        -409 (-12.10% of base) : System.Private.CoreLib.dasm - Vector256`1:ToString():String:this (6 methods)
         -12 (-10.00% of base) : System.Private.CoreLib.dasm - Plane:op_Equality(Plane,Plane):bool
        -104 (-9.52% of base) : System.Private.CoreLib.dasm - Matrix4x4:CreateConstrainedBillboard(Vector3,Vector3,Vector3,Vector3,Vector3):Matrix4x4
         -12 (-9.30% of base) : System.Private.CoreLib.dasm - Plane:op_Inequality(Plane,Plane):bool
          -6 (-8.33% of base) : System.Private.CoreLib.dasm - Plane:Dot(Plane,Vector4):float
         -61 (-8.32% of base) : System.Private.CoreLib.dasm - Matrix4x4:CreateLookAt(Vector3,Vector3,Vector3):Matrix4x4
        -253 (-7.89% of base) : System.Private.CoreLib.dasm - Vector128`1:ToString():String:this (6 methods)
         -20 (-7.30% of base) : System.Private.CoreLib.dasm - Plane:CreateFromVertices(Vector3,Vector3,Vector3):Plane
          -8 (-6.84% of base) : System.Private.CoreLib.dasm - SpanHelpers:LocateLastFoundByte(Vector`1):int
          -8 (-6.84% of base) : System.Private.CoreLib.dasm - SpanHelpers:LocateLastFoundChar(Vector`1):int

39 total methods with Code Size differences (38 improved, 1 regressed), 269971 unchanged.

The regression in Vector128:AsVector128(Vector2) is due to an additional xorps that isn't elided. HWIntrinsics need this optimization enabled for them like we do for GPR zero and limitedly for System.Numerics intrinsics

I'm working on getting the PMI diffs for tests as well, but they take a while to run.

@tannergooding
Copy link
Member Author

CC. @echesakovMSFT

@tannergooding tannergooding force-pushed the improved-numerics-vector branch 5 times, most recently from 5e62003 to 7c9380d Compare May 6, 2021 20:13
@echesakov echesakov self-requested a review May 6, 2021 22:42
@tannergooding
Copy link
Member Author

tannergooding commented May 7, 2021

Diff for benchmarks:

Found 87 files with textual diffs.

Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 507624
Total bytes of diff: 507381
Total bytes of delta: -243 (-0.05% of base)
    diff is an improvement.


Top file improvements (bytes):
        -139 : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm (-0.41% of base)
         -82 : SIMD\RayTracer\RayTracer\RayTracer.dasm (-0.33% of base)
         -20 : SIMD\SeekUnroll\SeekUnroll\SeekUnroll.dasm (-0.65% of base)
          -2 : BenchmarksGame\mandelbrot\mandelbrot-7\mandelbrot-7.dasm (-0.11% of base)

4 total files with Code Size differences (4 improved, 0 regressed), 78 unchanged.

Top method regressions (bytes):
           2 ( 0.15% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleStrictRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this
           2 ( 0.15% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleStrictRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this

Top method improvements (bytes):
         -63 (-13.91% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorHelper:ForEach(Vector`1,Action`2) (6 methods)
         -52 (-6.18% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - Camera:Create(Vector,Vector):Camera
         -20 (-9.13% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - Vector:Cross(Vector,Vector):Vector
         -14 (-0.92% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this
         -14 (-0.93% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this
         -13 (-1.51% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass4_0:<RenderMultiThreadedNoADT>b__1(int):this
         -13 (-1.50% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass5_0:<RenderMultiThreadedWithADT>b__1(int):this
          -9 (-0.41% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass5_0:<RenderMultiThreadedNoADT>b__1(int):this (3 methods)
          -6 (-0.40% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass4_0:<RenderMultiThreadedWithADT>b__1(int):this (2 methods)
          -5 (-2.18% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - <>c:<.cctor>b__3_0(Vector):Color:this
          -5 (-4.76% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - <>c:<.cctor>b__3_2(Vector):double:this
          -4 (-3.15% of base) : SIMD\SeekUnroll\SeekUnroll\SeekUnroll.dasm - SeekUnroll:FindByte(byref):int
          -4 (-2.00% of base) : SIMD\SeekUnroll\SeekUnroll\SeekUnroll.dasm - SeekUnroll:InnerLoop(byref,byref)
          -4 (-0.65% of base) : SIMD\SeekUnroll\SeekUnroll\SeekUnroll.dasm - SeekUnroll:XunitBenchmarkLoop(byref,byref)
          -4 (-0.70% of base) : SIMD\SeekUnroll\SeekUnroll\SeekUnroll.dasm - SeekUnroll:ManualTimerLoop(byref,byref)
          -4 (-0.56% of base) : SIMD\SeekUnroll\SeekUnroll\SeekUnroll.dasm - SeekUnroll:Test(int,bool):bool
          -4 (-0.30% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this
          -4 (-0.29% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this
          -3 (-0.44% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass6_0:<RenderMultiThreadedWithADT>b__1(int):this
          -1 (-0.50% of base) : BenchmarksGame\mandelbrot\mandelbrot-7\mandelbrot-7.dasm - MandelBrot_7:GetByte(long,double,int,int):ubyte

Top method regressions (percentages):
           2 ( 0.15% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleStrictRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this
           2 ( 0.15% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleStrictRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this

Top method improvements (percentages):
         -63 (-13.91% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorHelper:ForEach(Vector`1,Action`2) (6 methods)
         -20 (-9.13% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - Vector:Cross(Vector,Vector):Vector
         -52 (-6.18% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - Camera:Create(Vector,Vector):Camera
          -5 (-4.76% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - <>c:<.cctor>b__3_2(Vector):double:this
          -4 (-3.15% of base) : SIMD\SeekUnroll\SeekUnroll\SeekUnroll.dasm - SeekUnroll:FindByte(byref):int
          -5 (-2.18% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - <>c:<.cctor>b__3_0(Vector):Color:this
          -4 (-2.00% of base) : SIMD\SeekUnroll\SeekUnroll\SeekUnroll.dasm - SeekUnroll:InnerLoop(byref,byref)
         -13 (-1.51% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass4_0:<RenderMultiThreadedNoADT>b__1(int):this
         -13 (-1.50% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass5_0:<RenderMultiThreadedWithADT>b__1(int):this
         -14 (-0.93% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this
         -14 (-0.92% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this
          -4 (-0.70% of base) : SIMD\SeekUnroll\SeekUnroll\SeekUnroll.dasm - SeekUnroll:ManualTimerLoop(byref,byref)
          -4 (-0.65% of base) : SIMD\SeekUnroll\SeekUnroll\SeekUnroll.dasm - SeekUnroll:XunitBenchmarkLoop(byref,byref)
          -4 (-0.56% of base) : SIMD\SeekUnroll\SeekUnroll\SeekUnroll.dasm - SeekUnroll:Test(int,bool):bool
          -1 (-0.50% of base) : BenchmarksGame\mandelbrot\mandelbrot-7\mandelbrot-7.dasm - MandelBrot_7:GetByte(long,double,int,int):ubyte
          -3 (-0.44% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass6_0:<RenderMultiThreadedWithADT>b__1(int):this
          -9 (-0.41% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass5_0:<RenderMultiThreadedNoADT>b__1(int):this (3 methods)
          -6 (-0.40% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass4_0:<RenderMultiThreadedWithADT>b__1(int):this (2 methods)
          -1 (-0.34% of base) : BenchmarksGame\mandelbrot\mandelbrot-7\mandelbrot-7.dasm - <>c__DisplayClass4_0:<DoBench>b__0(int):this
          -4 (-0.30% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this

23 total methods with Code Size differences (21 improved, 2 regressed), 1878 unchanged.

@tannergooding tannergooding force-pushed the improved-numerics-vector branch 2 times, most recently from f8788e5 to 57576b9 Compare May 8, 2021 18:26
@tannergooding
Copy link
Member Author

CC. @echesakovMSFT, @dotnet/jit-contrib

Copy link
Contributor

@echesakov echesakov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have some questions that I would like to get answered before approving?

src/coreclr/jit/decomposelongs.cpp Show resolved Hide resolved
src/coreclr/jit/decomposelongs.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/decomposelongs.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/hwintrinsicarm64.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/gentree.cpp Show resolved Hide resolved
src/coreclr/jit/hwintrinsicxarch.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/emitxarch.cpp Show resolved Hide resolved
src/coreclr/jit/hwintrinsicxarch.cpp Show resolved Hide resolved
src/coreclr/jit/lowerarmarch.cpp Show resolved Hide resolved
src/coreclr/jit/lowerxarch.cpp Show resolved Hide resolved
Copy link
Contributor

@echesakov echesakov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
Thanks for the contribution @tannergooding !

Co-authored-by: Egor Chesakov <Egor.Chesakov@microsoft.com>
@ghost
Copy link

ghost commented May 18, 2021

Hello @tannergooding!

Because this pull request has the auto-merge label, I will be glad to assist with helping to merge this pull request once all check-in policies pass.

p.s. you can customize the way I help with merging this pull request, such as holding this pull request until a specific person approves. Simply @mention me (@msftbot) and give me an instruction to get started! Learn more here.

clamp03 added a commit to clamp03/runtime that referenced this pull request May 20, 2021
PR dotnet#52288 throws ArgumentOutOfRangeException instead of IndexOutOfRangeException.
The test is not updated. It still use IndexOutOfRangeException.
@clamp03 clamp03 mentioned this pull request May 20, 2021
@karelz karelz added this to the 6.0.0 milestone May 20, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Jun 19, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants