Add ValueNumbering support for GT_SIMD and GT_HWINTRINSIC tree nodes #31834

briansull · 2020-02-06T01:42:47Z

No description provided.

tannergooding · 2020-02-07T16:58:30Z

CC. @CarolEidt, @echesakovMSFT

tannergooding · 2020-02-07T17:02:54Z

src/coreclr/src/jit/optcse.cpp

+
+#ifdef FEATURE_HW_INTRINSICS
+        case GT_HWINTRINSIC:
+            return true; // allow Hardware Intrinsics to be CSE-ed


Are there any HWIntrinsics we don't want CSE'd or any that need special handling (such as the non-temporal load/stores)?

Excellent thought - I would say this should include:

any of the "special" loads (LoadFence, any NonTemporal

any store (i.e. HW_Category_MemoryStore and StoreFence)

any prefetch

MemoryFence

Perhaps we should consider adding a flag for this, but a reasonable proxy in the meantime would be to exclude any HW_Category_MemoryStore, HW_Category_MemoryLoad and HW_Category_Special:

There would be few, if any, loads that the JIT could identify as CSE's (local stack loads would generally not show up as actual load intrinsics).

The only HW_Category_Special that I found that this would unnecessarily exclude is CompareLessThan

briansull · 2020-02-07T22:23:22Z

One issue that I ran into while working on this feature was that the numArgs for the SIMD and HW_INTRINSICS is not always set to a useful value. Value numbering needs to know the exact number of arguments to expect and some instructions have a "-1" for numArgs or specify 2 when then can accept either 1 or 2.

For the purposes of Value Numbering it would be better to have multiple entries in the table when the intrinsic can take different numbers of arguments.
Currently I just bail out when the table specifies a numArgs of -1.

briansull · 2020-02-07T22:25:19Z

For true binary operations, it would be useful to have a "commutative" column that indicates that the operands can be safely swapped.

tannergooding · 2020-02-07T22:31:57Z

For true binary operations, it would be useful to have a "communitive" column that indicates that the operands can be safely swapped.

You can use static bool IsCommutative(NamedIntrinsic id): https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/hwintrinsic.h#L202. You would get the id via AsHWIntrinsic()->gtHWIntrinsicId

This is likewise hooked up through the standard OperIsCommutative API.

Its also worth noting that there are some intrinsics, like FusedMultiplyAdd, which are commutative but have 3 operands and require special handling to account for this (the operation is a fused a * b + c and so the a and b operands are commutative).

For the purposes of Value Numbering it would be better to have multiple entries in the table when the intrinsic can take different numbers of arguments.

This would make lookup during imporation quite a bit more expensive. We already have to check the names, I don't think we want to also have to start checking other parameters.

There are existing helper methods such as static int lookupNumArgs(const GenTreeHWIntrinsic* node): https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/hwintrinsic.h#L135 which will return the correct number of arguments for a given node and will attempt to do it as efficiently as possible.

src/coreclr/src/jit/gentree.cpp

briansull · 2020-02-07T23:53:55Z

There are existing helper methods such as static int lookupNumArgs(const GenTreeHWIntrinsic* node): https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/hwintrinsic.h#L135 which will return the correct number of arguments for a given node and will attempt to do it as efficiently as possible.

The Value number operation defined by this enum (below), requires that each enum value have a fixed value for arity and commutativity, as we just pass around the value number and then expect to be able to unpack it based upon the VNFunc.

enum VNFunc
{
    // Implicitly, elements of genTreeOps here.
    VNF_Boundary = GT_COUNT,
#define ValueNumFuncDef(nm, arity, commute, knownNonNull, sharedStatic) VNF_##nm,
#include "valuenumfuncs.h"
    VNF_COUNT
};

tannergooding · 2020-02-08T01:31:17Z

The Value number operation defined by this enum (below), requires that each enum value have a fixed value for arity and commutativity, as we just pass around the value number and then expect to be able to unpack it based upon the VNFunc

@CarolEidt, do you have any concerns about updating the hwintrinsic tables to only contain constant data (so entries that currently have a variable number of args or supporting multiple SIMD sizes would be split into multiple entries, for example)?

It would make the table larger but I think we could probably cut some of the current lookup cost and still make this manageable (like having a table per column as we've done elsewhere; or tracking the first/last entry for a given ISA).

CarolEidt · 2020-02-08T01:58:49Z

do you have any concerns about updating the hwintrinsic tables to only contain constant data (so entries that currently have a variable number of args or supporting multiple SIMD sizes would be split into multiple entries

It would be unfortunate, but I don't think it would be too prohibitive to split it out by number of args, but I don't think it's necessary to split it out by different SIMD sizes, as nodes of different sizes should never have the same value type - isn't that right @briansull?

briansull · 2020-02-08T02:28:24Z

I don't think it's necessary to split it out by different SIMD sizes, as nodes of different sizes should never have the same value type - isn't that right @briansull?

Actually the type size isn't part of the the value number, so there are a couple of nodes where Value Numbering adds an additional arg that holds a constant represent the size of the operation. We do that for GT_CAST and I also had to add this for the SIMD Init operation., As one test was CSE-ing two different sized Vector Inits: 3-way and 4-way. So it awkward but it is something that we can already handle.

            case SIMDIntrinsicInit:
            {
                // Also encode the resulting type as op2vnp
                ValueNumPair op2vnp;
                op2vnp.SetBoth(vnStore->VNForIntCon(INT32(tree->TypeGet())));

                excSetPair     = op1Xvnp;
                normalPair     = vnStore->VNPairForFunc(tree->TypeGet(), GetVNFuncForNode(tree), op1vnp, op2vnp);
                tree->gtVNPair = vnStore->VNPWithExc(normalPair, excSetPair);
                return;
            }

briansull · 2020-02-18T18:11:44Z

src/coreclr/src/jit/valuenum.cpp

+            case SIMDIntrinsicWidenLo:
+            {
+                // Also encodes the resulting type
+                encodeResultType = true;


I also encode the result type for all of the above SIMD operations.

- Zero out the bbAssertionIn values, as these can be referenced in RangeCheck::MergeAssertion and this is shared state with the CSE phase: bbCseIn

Mutate the gloabl heap when performing a HW_INTRINSIC memory store operation Printing of SIMD constants only support 0

…a SIMD LclVar in PerformCSE

Added bool OperIsMemoryLoad() to GenTreeSIMD, returns true for SIMDIntrinsicInitArray Added valuenumfuncs.h to src/coreclr/src/jit/CMakeLists.txt

…have a GT_IND node

CarolEidt

LGTM - thanks!

AndyAyersMS · 2020-02-21T19:07:35Z

I would like to see new tests added here, especially given the rather complex IR surface.

Can you also show the impact on FX codegen and/or tests? I am curious how much coverage we're likely to get organically.

How robust will this be as people add new intrinsics?

briansull · 2020-02-22T00:08:12Z

Beginning PMI PerfScore Diffs for benchstones and benchmarks game
D:\jit-diffs\runtime\x64\dasmset_98
Total PerfScoreUnits of diff: -3149.45 (-0.46% of base)
    diff is an improvement.
Top file regressions (PerfScoreUnits):
       43.10 : Span\SpanBench\SpanBench.dasm (0.22% of base)
Top file improvements (PerfScoreUnits):
    -2805.60 : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm (-3.38% of base)
     -289.60 : BenchmarksGame\mandelbrot\mandelbrot-7\mandelbrot-7.dasm (-6.44% of base)
      -97.35 : SIMD\RayTracer\RayTracer\RayTracer.dasm (-0.78% of base)
4 total files with Perf Score differences (3 improved, 1 regressed), 78 unchanged.
Top method regressions (percentages):
       43.10 (14.43% of base) : Span\SpanBench\SpanBench.dasm - SpanBench:TestSpanFill(ref,int) (7 methods)
        5.00 ( 3.73% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - Sphere:Intersect(Ray):ISect:this
Top method improvements (percentages):
      -98.60 (-9.31% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - RayTracer:GetNaturalColor(SceneObject,Vector,Vector,Vector,Scene):Color:this
     -371.33 (-8.90% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this
     -361.70 (-8.57% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this
       -3.50 (-8.44% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - ComplexVecFloat:square():ComplexVecFloat:this
       -3.50 (-8.44% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - ComplexVecDouble:square():ComplexVecDouble:this
      -80.63 (-7.49% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass5_0:<RenderMultiThreadedWithADT>b__1(int):this
      -96.50 (-7.41% of base) : BenchmarksGame\mandelbrot\mandelbrot-7\mandelbrot-7.dasm - MandelBrot_7:GetByte(long,double,int,int):ubyte
     -192.50 (-7.19% of base) : BenchmarksGame\mandelbrot\mandelbrot-7\mandelbrot-7.dasm - <>c__DisplayClass4_0:<DoBench>b__0(int):this
      -78.50 (-7.14% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass6_0:<RenderMultiThreadedWithADT>b__1(int):this
     -320.74 (-7.12% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleStrictRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this
     -318.24 (-6.98% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatStrictRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this
     -264.93 (-6.38% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this
     -139.06 (-6.21% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass4_0:<RenderMultiThreadedWithADT>b__1(int):this (2 methods)
     -254.90 (-5.75% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this
     -168.96 (-5.16% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass5_0:<RenderMultiThreadedNoADT>b__1(int):this (3 methods)
     -211.84 (-4.67% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatStrictRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this
     -211.84 (-4.39% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleStrictRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this
      -15.93 (-1.58% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass4_0:<RenderMultiThreadedNoADT>b__1(int):this
       -0.75 (-0.96% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - Sphere:Normal(Vector):Vector:this
       -2.25 (-0.49% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - Camera:Create(Vector,Vector):Camera
       -0.75 (-0.48% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - RayTracer:GetPoint(double,double,Camera):Vector:this
       -0.60 (-0.18% of base) : BenchmarksGame\mandelbrot\mandelbrot-7\mandelbrot-7.dasm - MandelBrot_7:DoBench(int,int):ref
24 total methods with Perf Score differences (22 improved, 2 regressed), 1869 unchanged.

briansull · 2020-02-22T00:45:16Z

Beginning PMI PerfScore Diffs for System.Private.CoreLib.dll
Completed PMI PerfScore Diffs for System.Private.CoreLib.dll in 148.60s
in D:\jit-diffs\runtime\x64\dasmset_96

Top method regressions (percentages):
       11.35 ( 1.83% of base) : System.Private.CoreLib.dasm - SpanHelpers:IndexOfAny(byref,ushort,ushort,ushort,int):int
       11.95 ( 1.66% of base) : System.Private.CoreLib.dasm - SpanHelpers:IndexOfAny(byref,ushort,ushort,ushort,ushort,int):int
        0.20 ( 0.14% of base) : System.Private.CoreLib.dasm - Vector:LessThanAny(Vector`1,Vector`1):bool (6 methods)
        0.20 ( 0.14% of base) : System.Private.CoreLib.dasm - Vector:GreaterThanAny(Vector`1,Vector`1):bool (6 methods)
        1.00 ( 0.12% of base) : System.Private.CoreLib.dasm - SpanHelpers:Contains(byref,ushort,int):bool (2 methods)
Top method improvements (percentages):
       -6.70 (-22.98% of base) : System.Private.CoreLib.dasm - Vector3:LengthSquared():float:this
      -12.65 (-19.69% of base) : System.Private.CoreLib.dasm - Vector3:Reflect(Vector3,Vector3):Vector3
      -12.65 (-18.67% of base) : System.Private.CoreLib.dasm - Vector3:Normalize(Vector3):Vector3
      -42.00 (-18.30% of base) : System.Private.CoreLib.dasm - Vector`1:GreaterThanOrEqual(Vector`1,Vector`1):Vector`1 (6 methods)
      -42.00 (-18.30% of base) : System.Private.CoreLib.dasm - Vector`1:LessThanOrEqual(Vector`1,Vector`1):Vector`1 (6 methods)
       -6.70 (-16.48% of base) : System.Private.CoreLib.dasm - Vector3:Length():float:this
       -8.25 (-14.44% of base) : System.Private.CoreLib.dasm - Vector4:Normalize(Vector4):Vector4
      -45.83 (-14.27% of base) : System.Private.CoreLib.dasm - Vector:LessThanOrEqual(Vector`1,Vector`1):Vector`1 (10 methods)
      -45.83 (-14.27% of base) : System.Private.CoreLib.dasm - Vector:GreaterThanOrEqual(Vector`1,Vector`1):Vector`1 (10 methods)
       -3.00 (-12.50% of base) : System.Private.CoreLib.dasm - Vector4:LengthSquared():float:this
      -29.13 (-11.50% of base) : System.Private.CoreLib.dasm - Vector:LessThanOrEqualAny(Vector`1,Vector`1):bool (6 methods)
      -29.13 (-11.50% of base) : System.Private.CoreLib.dasm - Vector:GreaterThanOrEqualAny(Vector`1,Vector`1):bool (6 methods)
      -29.33 (-10.23% of base) : System.Private.CoreLib.dasm - Vector:LessThanOrEqualAll(Vector`1,Vector`1):bool (6 methods)
      -29.33 (-10.23% of base) : System.Private.CoreLib.dasm - Vector:GreaterThanOrEqualAll(Vector`1,Vector`1):bool (6 methods)
      -31.50 (-9.40% of base) : System.Private.CoreLib.dasm - Vector:ConditionalSelect(Vector`1,Vector`1,Vector`1):Vector`1 (8 methods)
      -22.50 (-8.74% of base) : System.Private.CoreLib.dasm - Vector`1:ConditionalSelect(Vector`1,Vector`1,Vector`1):Vector`1 (6 methods)
       -3.00 (-8.45% of base) : System.Private.CoreLib.dasm - Vector4:Length():float:this
       -8.25 (-7.39% of base) : System.Private.CoreLib.dasm - Vector:Min(Vector`1,Vector`1):Vector`1 (6 methods)
       -8.25 (-7.39% of base) : System.Private.CoreLib.dasm - Vector:Max(Vector`1,Vector`1):Vector`1 (6 methods)
      -14.15 (-7.30% of base) : System.Private.CoreLib.dasm - Plane:CreateFromVertices(Vector3,Vector3,Vector3):Plane
      -29.90 (-6.41% of base) : System.Private.CoreLib.dasm - Vector`1:Min(Vector`1,Vector`1):Vector`1 (6 methods)
      -29.90 (-6.41% of base) : System.Private.CoreLib.dasm - Vector`1:Max(Vector`1,Vector`1):Vector`1 (6 methods)
      -21.60 (-5.67% of base) : System.Private.CoreLib.dasm - Matrix4x4:CreateLookAt(Vector3,Vector3,Vector3):Matrix4x4
       -1.00 (-5.00% of base) : System.Private.CoreLib.dasm - Vector2:LengthSquared():float:this
       -1.50 (-4.93% of base) : System.Private.CoreLib.dasm - VectorMath:Lerp(Vector128`1,Vector128`1,Vector128`1):Vector128`1
       -9.90 (-3.32% of base) : System.Private.CoreLib.dasm - Matrix4x4:CreateWorld(Vector3,Vector3,Vector3):Matrix4x4
       -1.00 (-3.17% of base) : System.Private.CoreLib.dasm - Vector2:Length():float:this
      -17.85 (-2.64% of base) : System.Private.CoreLib.dasm - Matrix4x4:Decompose(Matrix4x4,byref,byref,byref):bool
      -10.02 (-2.07% of base) : System.Private.CoreLib.dasm - Matrix4x4:CreateConstrainedBillboard(Vector3,Vector3,Vector3,Vector3,Vector3):Matrix4x4
       -2.17 (-1.56% of base) : System.Private.CoreLib.dasm - ASCIIUtility:GetIndexOfFirstNonAsciiChar_Default(long,long):long
       -1.64 (-1.42% of base) : System.Private.CoreLib.dasm - ASCIIUtility:GetIndexOfFirstNonAsciiByte_Default(long,long):long
       -1.50 (-0.87% of base) : System.Private.CoreLib.dasm - ASCIIUtility:NarrowUtf16ToAscii_Sse2(long,long,long):long (2 methods)
       -1.26 (-0.51% of base) : System.Private.CoreLib.dasm - ASCIIUtility:GetIndexOfFirstNonAsciiChar_Sse2(long,long):long (2 methods)
       -0.75 (-0.23% of base) : System.Private.CoreLib.dasm - Matrix4x4:CreateBillboard(Vector3,Vector3,Vector3,Vector3):Matrix4x4
       -0.85 (-0.19% of base) : System.Private.CoreLib.dasm - SpanHelpers:LastIndexOf(byref,ushort,int):int
       -0.45 (-0.08% of base) : System.Private.CoreLib.dasm - SpanHelpers:IndexOfAny(byref,ushort,ushort,int):int
42 total methods with Perf Score differences (36 improved, 6 regressed), 22012 unchanged.

briansull · 2020-02-22T00:51:11Z

The regression in Span\SpanBench\SpanBench.dasm - SpanBench:TestSpanFill(ref,int) (7 methods)
We make a simd32 into a CSE, which requires an additional two caller saved ymm registers to be spilled in the prolog. And this ymm register is live across a call, which since only half of the register is preserved we have to save and restore half of it at a callsite:

prolog

+      vmovaps  qword ptr [rsp+70H], xmm6
+      vmovaps  qword ptr [rsp+60H], xmm7

Loop:

+     vextractf128 ymm7, ymm6, 1
       call     Span`1:Fill(Vector`1):this
       inc      edi
       cmp      edi, esi
+     vinsertf128 ymm6, ymm6, ymm7, 1

CarolEidt · 2020-02-25T18:19:22Z

src/coreclr/src/jit/optcse.cpp

@@ -2461,6 +2461,28 @@ class CSE_Heuristic
                    extra_yes_cost *= 2; // full cost if we are being Conservative
                }
            }
+
+#ifdef FEATURE_SIMD
+            // SIMD types will cause a SIMD register to be spilled/restored


Should be "may cause".
I think you should have a single comment that talks about the heuristics for estimating register save/restore/spill costs. First, it is conservative to assume that each SIMD CSE will require an additional register to be saved/restored in prolog/epilog, as there may be disjoint CSE live ranges that can utilize the same register.
Second, this code presumes that the CSEs will be live arcoss a call, which may be true more often than not, but it also assumes that each use will require a restore, though it may be that multiple uses will not span a call.
While it may be fine to be conservative in these ways, I think we should accurately describe those conservative assumptions.

CarolEidt · 2020-02-25T18:20:37Z

src/coreclr/src/jit/optcse.cpp

+                {
+                    spillSimdRegInProlog++; // we likely need to spill an extra register to save the upper half
+
+                    // Increase the cse_use_cost since at use sites we have to generate code to spill/restore


again, "we have to" should be "we may have to"

This reverts commit 169c24e.

briansull · 2020-02-25T20:14:57Z

Here is an example where value number of SIMD types produced a very nice win:

runtime\src\coreclr\tests\src\JIT\Regression\JitBlue\GitHub_11816\GitHub_11816.cs:

    static void DoSomeWorkWithAStruct(ref Vector<float> source, out Vector<float> result)
    {
        StructType u;
        u.A = new Vector<float>(2) * source;
        u.B = new Vector<float>(3) * source;
        u.C = new Vector<float>(4) * source;
        u.D = new Vector<float>(5) * source;
        u.E = new Vector<float>(6) * source;
        u.F = new Vector<float>(7) * source;
        result = u.A + u.B + u.C + u.D + u.E + u.F;
    }

We create the following CSE's to hold source, u.A, u.B, u.C, u.D, U.E and u.F

;  V10 cse0         [V10,T10] (  2,  2   )  simd32  ->  mm0         "CSE - aggressive"
;  V11 cse1         [V11,T11] (  2,  2   )  simd32  ->  mm2         "CSE - aggressive"
;  V12 cse2         [V12,T12] (  2,  2   )  simd32  ->  mm3         "CSE - aggressive"
;  V13 cse3         [V13,T13] (  2,  2   )  simd32  ->  mm4         "CSE - aggressive"
;  V14 cse4         [V14,T14] (  2,  2   )  simd32  ->  mm5         "CSE - aggressive"
;  V15 cse5         [V15,T15] (  2,  2   )  simd32  ->  mm1         "CSE - aggressive"
;  V16 cse6         [V16,T03] (  7,  7   )  simd32  ->  mm1         "CSE - aggressive"

Reducing the code from:

       vmovupd  ymmword ptr[rsp+08H], ymm0
       vbroadcastss ymm0, ymmword ptr[reloc @RWD04]
       vmovupd  ymm1, ymmword ptr[rcx]
       vmulps   ymm0, ymm1
       vmovupd  ymmword ptr[rsp+28H], ymm0
       vbroadcastss ymm0, ymmword ptr[reloc @RWD08]
       vmovupd  ymm1, ymmword ptr[rcx]
       vmulps   ymm0, ymm1
       vmovupd  ymmword ptr[rsp+48H], ymm0
       vbroadcastss ymm0, ymmword ptr[reloc @RWD12]
       vmovupd  ymm1, ymmword ptr[rcx]
       vmulps   ymm0, ymm1
       vmovupd  ymmword ptr[rsp+68H], ymm0
       vbroadcastss ymm0, ymmword ptr[reloc @RWD16]
       vmovupd  ymm1, ymmword ptr[rcx]
       vmulps   ymm0, ymm1
       vmovupd  ymmword ptr[rsp+88H], ymm0
       vbroadcastss ymm0, ymmword ptr[reloc @RWD20]
       vmovupd  ymm1, ymmword ptr[rcx]
       vmulps   ymm0, ymm1
       vmovupd  ymmword ptr[rsp+A8H], ymm0
       vmovupd  ymm0, ymmword ptr[rsp+08H]
       vmovupd  ymm1, ymmword ptr[rsp+28H]
       vaddps   ymm0, ymm1
       vmovupd  ymm1, ymmword ptr[rsp+48H]
       vaddps   ymm0, ymm1
       vmovupd  ymm1, ymmword ptr[rsp+68H]
       vaddps   ymm0, ymm1
       vmovupd  ymm1, ymmword ptr[rsp+88H]
       vaddps   ymm0, ymm1
       vmovupd  ymm1, ymmword ptr[rsp+A8H]
       vaddps   ymm0, ymm1
       vmovupd  ymmword ptr[rdx], ymm0

To:

       vbroadcastss ymm2, ymmword ptr[reloc @RWD04]
       vmulps   ymm2, ymm1
       vbroadcastss ymm3, ymmword ptr[reloc @RWD08]
       vmulps   ymm3, ymm1
       vbroadcastss ymm4, ymmword ptr[reloc @RWD12]
       vmulps   ymm4, ymm1
       vbroadcastss ymm5, ymmword ptr[reloc @RWD16]
       vmulps   ymm5, ymm1
       vbroadcastss ymm6, ymmword ptr[reloc @RWD20]
       vmulps   ymm1, ymm6, ymm1
       vaddps   ymm0, ymm2
       vaddps   ymm0, ymm3
       vaddps   ymm0, ymm4
       vaddps   ymm0, ymm5
       vaddps   ymm0, ymm1
       vmovupd  ymmword ptr[rdx], ymm0

PerfScore:

OLD> ; Total bytes of code 255, prolog size 32, PerfScore 175.50, (MethodHash=2eeeff7d) for method GitHub_11816:DoSomeWorkWithAStruct(byref,byref)
NEW> ; Total bytes of code 131, prolog size 12, PerfScore 89.10, (MethodHash=2eeeff7d) for method GitHub_11816:DoSomeWorkWithAStruct(byref,byref)

briansull · 2020-02-26T00:22:53Z

We perform a nice CSE for the a[i] and b[i] operations in this test case:

src\coreclr\tests\src\JIT\SIMD>emacs BitwiseOperations.cs

        static int TestBool()
        {
            Random random = new Random(13);
            byte[] arr1 = GenerateByteArray(64, random);
            byte[] arr2 = GenerateByteArray(64, random);
            var a = new System.Numerics.Vector<byte>(arr1);
            var b = new System.Numerics.Vector<byte>(arr2);

            var xorR = a ^ b;
            var andR = a & b;
            var orR = a | b;
            int Count = System.Numerics.Vector<byte>.Count;
            for (int i = 0; i < Count; ++i)
            {
                int d = a[i] ^ b[i];
                if (xorR[i] != d)
                {
                    return 0;
                }
                d = a[i] & b[i];
                if (andR[i] != d)
                {
                    return 0;
                }
                d = a[i] | b[i];
                if (orR[i] != d)
                {
                    return 0;
                }
            }
            return 100;
        }

briansull · 2020-02-26T00:27:03Z

@AndyAyersMS I believe that we have a reasonable amount of test coverage here.

I will add a Config variable to allow us to disable this new functionality and revert to the old behavior.

briansull · 2020-02-26T01:45:14Z

We also get this CSE now in:
coreclr\tests\src\JIT\Performance\CodeQuality\SIMD\ConsoleMandel\Abstractions.cs

Real * Imaginary

        [MethodImplAttribute(MethodImplOptions.AggressiveInlining)]
        public ComplexVecFloat square()
        {
            return new ComplexVecFloat(Real * Real - Imaginary * Imaginary, Real * Imaginary + Real * Imaginary);
        }

CSE candidate #03, vn=$189 in BB01, [cost= 8, size= 7]: 
N014 (  8,  7) CSE #03 (use)[000020] ---XG-------              *  SIMD      simd32 float * <l:$24f, c:$252>
N009 (  3,  2) CSE #01 (use)[000017] *--XG-------              +--*  IND       simd32 <l:$241, c:$250>
N008 (  1,  1)              [000016] ------------              |  \--*  LCL_VAR   byref  V00 this         u:1 Zero Fseq[Real] $80
N013 (  4,  4) CSE #02 (use)[000019] *--XG-------              \--*  IND       simd32 <l:$246, c:$251>
N012 (  2,  2)              [000069] -------N----                 \--*  ADD       byref  $280
N010 (  1,  1)              [000018] ------------                    +--*  LCL_VAR   byref  V00 this         u:1 (last use) $80
N011 (  1,  1)              [000068] ------------                    \--*  CNS_INT   long   32 field offset Fseq[Imaginary] $c1

G_M53767_IG02:
       vmovupd  ymm0, ymmword ptr[rcx]
       vmulps   ymm1, ymm0, ymm0
       vmovupd  ymm2, ymmword ptr[rcx+32]
       vmulps   ymm3, ymm2, ymm2
       vsubps   ymm1, ymm3
 <     vmulps   ymm3, ymm0, ymm2
       vmulps   ymm0, ymm2
 <    vaddps   ymm0, ymm3, ymm0
 >    vaddps   ymm0, ymm0
       vmovupd  ymmword ptr[rdx], ymm1
       vmovupd  ymmword ptr[rdx+32], ymm0
       mov      rax, rdx
<					;; bbWeight=1    PerfScore 32.25
>					;; bbWeight=1    PerfScore 29.25
G_M53767_IG03:
       vzeroupper 
       ret      
						;; bbWeight=1    PerfScore 2.00

<; Total bytes of code 52, prolog size 3, PerfScore 41.45, (MethodHash=c9862df9) for method ComplexVecFloat:square():ComplexVecFloat:this
>; Total bytes of code 48, prolog size 3, PerfScore 37.95, (MethodHash=c9862df9) for method ComplexVecFloat:square():ComplexVecFloat:this

Default 0, ValueNumbering of SIMD nodes and HW Intrinsic nodes enabled If 1, then disable ValueNumbering of SIMD nodes If 2, then disable ValueNumbering of HW Intrinsic nodes If 3, disable both SIMD and HW Intrinsic nodes

briansull · 2020-02-26T21:51:29Z

Beginning PMI PerfScore Diffs for System.Private.CoreLib.dll, framework assemblies
Completed PMI PerfScore Diffs for System.Private.CoreLib.dll, framework assemblies in 685.82s
Diffs (if any) can be viewed by comparing: d:\jit-diffs\runtime\x64\dasmset_108\base d:\jit-diffs\runtime\x64\dasmset_108\diff
Analyzing PerfScore diffs...
PMI PerfScore Diffs for System.Private.CoreLib.dll, framework assemblies for x64 default jit
Summary of Perf Score diffs:
(Lower is better)
Total PerfScoreUnits of diff: -1179.71 (-0.00% of base)
    diff is an improvement.
Top file improvements (PerfScoreUnits):
     -536.77 : System.Private.CoreLib.dasm (-0.02% of base)
     -196.30 : Microsoft.Diagnostics.FastSerialization.dasm (-0.07% of base)
     -113.13 : System.Data.Common.dasm (-0.00% of base)
      -75.97 : System.Linq.Parallel.dasm (-0.01% of base)
      -47.00 : System.Text.Encodings.Web.dasm (-0.34% of base)
      -44.70 : Newtonsoft.Json.dasm (-0.00% of base)
      -34.90 : System.Collections.dasm (-0.02% of base)
      -31.99 : System.Collections.Immutable.dasm (-0.01% of base)
      -20.18 : Microsoft.Diagnostics.Tracing.TraceEvent.dasm (-0.00% of base)
      -18.77 : System.Threading.Channels.dasm (-0.03% of base)
      -13.15 : System.Private.Xml.Linq.dasm (-0.02% of base)
Top method regressions (percentages):
       11.35 ( 1.83% of base) : System.Private.CoreLib.dasm - SpanHelpers:IndexOfAny(byref,ushort,ushort,ushort,int):int
       11.95 ( 1.66% of base) : System.Private.CoreLib.dasm - SpanHelpers:IndexOfAny(byref,ushort,ushort,ushort,ushort,int):int
        1.50 ( 0.79% of base) : System.Collections.Immutable.dasm - Node:Add(Vector`1):Node:this
        1.05 ( 0.48% of base) : System.Linq.dasm - Lookup`2:GetGrouping(Vector`1,bool):Grouping`2:this
        2.03 ( 0.44% of base) : System.Collections.Immutable.dasm - Builder:set_Count(int):this (7 methods)
        0.20 ( 0.14% of base) : System.Private.CoreLib.dasm - Vector:LessThanAny(Vector`1,Vector`1):bool (6 methods)
        0.20 ( 0.14% of base) : System.Private.CoreLib.dasm - Vector:GreaterThanAny(Vector`1,Vector`1):bool (6 methods)
        1.00 ( 0.12% of base) : System.Private.CoreLib.dasm - SpanHelpers:Contains(byref,ushort,int):bool (2 methods)
        0.34 ( 0.08% of base) : Microsoft.Diagnostics.FastSerialization.dasm - GrowableArray`1:set_Count(int):this (7 methods)
        1.30 ( 0.07% of base) : System.Collections.Immutable.dasm - ImmutableSortedSet`1:LeafToRootRefill(IEnumerable`1):ImmutableSortedSet`1:this (7 methods)
Top method improvements (percentages):
       -6.70 (-22.98% of base) : System.Private.CoreLib.dasm - Vector3:LengthSquared():float:this
      -12.65 (-19.69% of base) : System.Private.CoreLib.dasm - Vector3:Reflect(Vector3,Vector3):Vector3
      -12.65 (-18.67% of base) : System.Private.CoreLib.dasm - Vector3:Normalize(Vector3):Vector3
      -42.00 (-18.30% of base) : System.Private.CoreLib.dasm - Vector`1:GreaterThanOrEqual(Vector`1,Vector`1):Vector`1 (6 methods)
      -42.00 (-18.30% of base) : System.Private.CoreLib.dasm - Vector`1:LessThanOrEqual(Vector`1,Vector`1):Vector`1 (6 methods)
      -23.00 (-17.73% of base) : System.Text.Encodings.Web.dasm - Sse2Helper:CreateEscapingMask_UnsafeRelaxedJavaScriptEncoder(Vector128`1):Vector128`1 (2 methods)
       -6.70 (-16.48% of base) : System.Private.CoreLib.dasm - Vector3:Length():float:this
      -20.10 (-14.71% of base) : System.Text.Encodings.Web.dasm - Sse2Helper:CreateEscapingMask_DefaultJavaScriptEncoderBasicLatin(Vector128`1):Vector128`1
       -8.25 (-14.44% of base) : System.Private.CoreLib.dasm - Vector4:Normalize(Vector4):Vector4
      -45.83 (-14.27% of base) : System.Private.CoreLib.dasm - Vector:LessThanOrEqual(Vector`1,Vector`1):Vector`1 (10 methods)
      -45.83 (-14.27% of base) : System.Private.CoreLib.dasm - Vector:GreaterThanOrEqual(Vector`1,Vector`1):Vector`1 (10 methods)
       -3.00 (-12.50% of base) : System.Private.CoreLib.dasm - Vector4:LengthSquared():float:this
      -29.13 (-11.50% of base) : System.Private.CoreLib.dasm - Vector:LessThanOrEqualAny(Vector`1,Vector`1):bool (6 methods)
      -29.13 (-11.50% of base) : System.Private.CoreLib.dasm - Vector:GreaterThanOrEqualAny(Vector`1,Vector`1):bool (6 methods)
      -35.20 (-10.49% of base) : System.Collections.Immutable.dasm - ImmutableInterlocked:GetOrAdd(byref,Vector`1,Func`2):long
      -29.33 (-10.23% of base) : System.Private.CoreLib.dasm - Vector:LessThanOrEqualAll(Vector`1,Vector`1):bool (6 methods)
      -29.33 (-10.23% of base) : System.Private.CoreLib.dasm - Vector:GreaterThanOrEqualAll(Vector`1,Vector`1):bool (6 methods)
       -7.50 (-9.77% of base) : System.Collections.dasm - HashSet`1:AddValue(int,int,Vector`1):this
       -5.90 (-9.56% of base) : System.DirectoryServices.AccountManagement.dasm - PrincipalValueCollection`1:Insert(int,Vector`1):this
      -31.50 (-9.40% of base) : System.Private.CoreLib.dasm - Vector:ConditionalSelect(Vector`1,Vector`1,Vector`1):Vector`1 (8 methods)
      -22.50 (-8.74% of base) : System.Private.CoreLib.dasm - Vector`1:ConditionalSelect(Vector`1,Vector`1,Vector`1):Vector`1 (6 methods)
       -3.00 (-8.45% of base) : System.Private.CoreLib.dasm - Vector4:Length():float:this
       -8.25 (-7.39% of base) : System.Private.CoreLib.dasm - Vector:Min(Vector`1,Vector`1):Vector`1 (6 methods)
       -8.25 (-7.39% of base) : System.Private.CoreLib.dasm - Vector:Max(Vector`1,Vector`1):Vector`1 (6 methods)
      -14.15 (-7.30% of base) : System.Private.CoreLib.dasm - Plane:CreateFromVertices(Vector3,Vector3,Vector3):Plane
       -3.40 (-7.11% of base) : System.Threading.Channels.dasm - Deque`1:DequeueHead():Vector`1:this
      -29.90 (-6.41% of base) : System.Private.CoreLib.dasm - Vector`1:Min(Vector`1,Vector`1):Vector`1 (6 methods)
      -29.90 (-6.41% of base) : System.Private.CoreLib.dasm - Vector`1:Max(Vector`1,Vector`1):Vector`1 (6 methods)
       -3.50 (-6.31% of base) : System.Threading.Channels.dasm - Deque`1:EnqueueTail(Vector`1):this
       -2.50 (-6.27% of base) : System.Text.Encodings.Web.dasm - Sse2Helper:CreateAsciiMask(Vector128`1):Vector128`1
       -3.10 (-5.96% of base) : System.Collections.dasm - Stack`1:PushWithResize(Vector`1):this
      -21.60 (-5.67% of base) : System.Private.CoreLib.dasm - Matrix4x4:CreateLookAt(Vector3,Vector3,Vector3):Matrix4x4
       -3.10 (-5.50% of base) : Microsoft.Diagnostics.FastSerialization.dasm - SegmentedList`1:Add(Vector`1):this

briansull · 2020-02-26T22:10:40Z

Beginning PMI PerfScore Diffs for benchstones and benchmarks game in D:\fxkit\runtime\artifacts\tests\coreclr\Windows_NT.x64.Release
Completed PMI PerfScore Diffs for benchstones and benchmarks game in D:\fxkit\runtime\artifacts\tests\coreclr\Windows_NT.x64.Release in 113.61s
Diffs (if any) can be viewed by comparing: d:\jit-diffs\runtime\x64\dasmset_109\base d:\jit-diffs\runtime\x64\dasmset_109\diff
Total PerfScoreUnits of diff: -3192.55 (-0.47% of base)
    diff is an improvement.
Top file improvements (PerfScoreUnits):
    -2805.60 : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm (-3.38% of base)
     -289.60 : BenchmarksGame\mandelbrot\mandelbrot-7\mandelbrot-7.dasm (-6.44% of base)
      -97.35 : SIMD\RayTracer\RayTracer\RayTracer.dasm (-0.78% of base)
3 total files with Perf Score differences (3 improved, 0 regressed), 79 unchanged.
Top method regressions (percentages):
        5.00 ( 3.73% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - Sphere:Intersect(Ray):ISect:this
Top method improvements (percentages):
      -98.60 (-9.31% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - RayTracer:GetNaturalColor(SceneObject,Vector,Vector,Vector,Scene):Color:this
     -371.33 (-8.90% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this
     -361.70 (-8.57% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this
       -3.50 (-8.44% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - ComplexVecFloat:square():ComplexVecFloat:this
       -3.50 (-8.44% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - ComplexVecDouble:square():ComplexVecDouble:this
      -80.63 (-7.49% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass5_0:<RenderMultiThreadedWithADT>b__1(int):this
      -96.50 (-7.41% of base) : BenchmarksGame\mandelbrot\mandelbrot-7\mandelbrot-7.dasm - MandelBrot_7:GetByte(long,double,int,int):ubyte
     -192.50 (-7.19% of base) : BenchmarksGame\mandelbrot\mandelbrot-7\mandelbrot-7.dasm - <>c__DisplayClass4_0:<DoBench>b__0(int):this
      -78.50 (-7.14% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass6_0:<RenderMultiThreadedWithADT>b__1(int):this
     -320.74 (-7.12% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleStrictRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this
     -318.24 (-6.98% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatStrictRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this
     -264.93 (-6.38% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this
     -139.06 (-6.21% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass4_0:<RenderMultiThreadedWithADT>b__1(int):this (2 methods)
     -254.90 (-5.75% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this
     -168.96 (-5.16% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass5_0:<RenderMultiThreadedNoADT>b__1(int):this (3 methods)
     -211.84 (-4.67% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatStrictRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this
     -211.84 (-4.39% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleStrictRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this

briansull · 2020-02-26T22:29:49Z

Beginning PMI PerfScore Diffs for D:\fxkit\runtime\artifacts\tests\coreclr\Windows_NT.x64.Release\JIT\SIMD
Completed PMI PerfScore Diffs for D:\fxkit\runtime\artifacts\tests\coreclr\Windows_NT.x64.Release\JIT\SIMD in 141.91s
Diffs (if any) can be viewed by comparing: d:\jit-diffs\runtime\x64\dasmset_110\base d:\jit-diffs\runtime\x64\dasmset_110\diff
Analyzing PerfScore diffs...
Total PerfScoreUnits of diff: -1789.36 (-0.16% of base)
    diff is an improvement.
Top file regressions (PerfScoreUnits):
       79.83 : VectorRelOp_ro\VectorRelOp_ro.dasm (0.68% of base)
        9.90 : Matrix4x4_ro\Matrix4x4_ro.dasm (3.23% of base)
        0.10 : VectorIntEquals_ro\VectorIntEquals_ro.dasm (0.00% of base)
Top file improvements (PerfScoreUnits):
     -770.70 : VectorMatrix_ro\VectorMatrix_ro.dasm (-0.26% of base)
     -747.30 : VectorConvert_ro\VectorConvert_ro.dasm (-4.99% of base)
     -110.30 : Vector3Interop_ro\Vector3Interop_ro.dasm (-4.96% of base)
      -67.88 : VectorGet_ro\VectorGet_ro.dasm (-1.17% of base)
      -66.40 : VectorExp_ro\VectorExp_ro.dasm (-1.42% of base)
      -65.60 : BitwiseOperations_ro\BitwiseOperations_ro.dasm (-8.52% of base)
      -24.00 : Haar-likeFeaturesGeneric_ro\Haar-likeFeaturesGeneric_ro.dasm (-1.66% of base)
       -9.30 : VectorMul_ro\VectorMul_ro.dasm (-0.17% of base)
       -6.10 : LdfldGeneric_ro\LdfldGeneric_ro.dasm (-4.35% of base)
       -5.18 : VectorInitN_ro\VectorInitN_ro.dasm (-0.21% of base)
       -3.08 : VectorSqrt_ro\VectorSqrt_ro.dasm (-0.11% of base)
       -2.60 : StoreElement_ro\StoreElement_ro.dasm (-3.76% of base)
       -0.75 : Plane_ro\Plane_ro.dasm (-0.38% of base)
Top method regressions (percentages):
        9.90 ( 3.28% of base) : Matrix4x4_ro\Matrix4x4_ro.dasm - Matrix4x4Test:Matrix4x4CreateScaleCenterTest3():int
       31.91 ( 1.95% of base) : VectorRelOp_ro\VectorRelOp_ro.dasm - VectorRelopTest`1:VectorRelOp(int,int):int
       28.11 ( 1.68% of base) : VectorRelOp_ro\VectorRelOp_ro.dasm - VectorRelopTest`1:VectorRelOp(short,short):int
       25.11 ( 1.51% of base) : VectorRelOp_ro\VectorRelOp_ro.dasm - VectorRelopTest`1:VectorRelOp(long,long):int
        0.10 ( 0.15% of base) : VectorIntEquals_ro\VectorIntEquals_ro.dasm - VectorTest:VectorIntEquals():int
Top method improvements (percentages):
      -65.60 (-23.85% of base) : BitwiseOperations_ro\BitwiseOperations_ro.dasm - Program:TestBool():int
      -65.65 (-18.22% of base) : VectorExp_ro\VectorExp_ro.dasm - VectorExpTest`1:VectorExp(Vector`1,double,double,double):int
      -42.18 (-17.87% of base) : VectorGet_ro\VectorGet_ro.dasm - VectorGetTest`1:VectorGet(long,int):int
      -20.45 (-13.18% of base) : Vector3Interop_ro\Vector3Interop_ro.dasm - RPInvokeTest:callBack_RPInvoke_Vector3Array(ref,int)
      -20.70 (-12.09% of base) : Vector3Interop_ro\Vector3Interop_ro.dasm - RPInvokeTest:callBack_RPInvoke_Vector3Arg(int,Vector3,String,Vector3)
     -116.80 (-11.82% of base) : VectorConvert_ro\VectorConvert_ro.dasm - VectorConvertTest:VectorConvertDoubleSingle(Vector`1,Vector`1):int
      -45.65 (-11.26% of base) : Vector3Interop_ro\Vector3Interop_ro.dasm - RPInvokeTest:callBack_RPInvoke_Vector3Arg_Unix2(Vector3,float,float,float,float,float,float,float,Vector3,float,float,Vector3,float)
      -25.70 (-11.21% of base) : VectorGet_ro\VectorGet_ro.dasm - VectorGetTest`1:VectorGet(double,int):int
      -92.40 (-11.13% of base) : VectorConvert_ro\VectorConvert_ro.dasm - VectorConvertTest:VectorConvertUInt16And8(Vector`1,Vector`1):int
      -90.80 (-10.97% of base) : VectorConvert_ro\VectorConvert_ro.dasm - VectorConvertTest:VectorConvertUInt32And16(Vector`1,Vector`1):int
      -89.40 (-10.34% of base) : VectorConvert_ro\VectorConvert_ro.dasm - VectorConvertTest:VectorConvertUInt64And32(Vector`1,Vector`1):int
      -93.20 (-10.30% of base) : VectorConvert_ro\VectorConvert_ro.dasm - VectorConvertTest:VectorConvertInt16And8(Vector`1,Vector`1):int
      -91.00 (-10.09% of base) : VectorConvert_ro\VectorConvert_ro.dasm - VectorConvertTest:VectorConvertInt32And16(Vector`1,Vector`1):int
      -89.40 (-9.53% of base) : VectorConvert_ro\VectorConvert_ro.dasm - VectorConvertTest:VectorConvertInt64And32(Vector`1,Vector`1):int
      -42.00 (-8.39% of base) : VectorConvert_ro\VectorConvert_ro.dasm - VectorConvertTest:VectorConvertSingleInt(Vector`1):int
      -23.50 (-8.37% of base) : Vector3Interop_ro\Vector3Interop_ro.dasm - RPInvokeTest:callBack_RPInvoke_Vector3Arg_Unix(Vector3,float,float,float,float,float,float,float,Vector3,float,float)
      -42.30 (-7.26% of base) : VectorConvert_ro\VectorConvert_ro.dasm - VectorConvertTest:VectorConvertDoubleInt64(Vector`1):int
       -6.10 (-4.84% of base) : LdfldGeneric_ro\LdfldGeneric_ro.dasm - Program:Main(ref):int
      -12.00 (-4.25% of base) : Haar-likeFeaturesGeneric_ro\Haar-likeFeaturesGeneric_ro.dasm - Program:VectorAndFilter(ref,ref):ref
      -12.00 (-4.09% of base) : Haar-likeFeaturesGeneric_ro\Haar-likeFeaturesGeneric_ro.dasm - Program:VectorFilter(ref,ref):ref

briansull · 2020-02-26T22:44:22Z

Beginning PMI PerfScore Diffs for D:\fxkit\runtime\artifacts\tests\coreclr\Windows_NT.x64.Release\JIT\Performance\CodeQuality
Completed PMI PerfScore Diffs for D:\fxkit\runtime\artifacts\tests\coreclr\Windows_NT.x64.Release\JIT\Performance\CodeQuality in 112.49s
Diffs (if any) can be viewed by comparing: d:\jit-diffs\runtime\x64\dasmset_111\base d:\jit-diffs\runtime\x64\dasmset_111\diff
Analyzing PerfScore diffs...
Total PerfScoreUnits of diff: -3537.58 (-0.50% of base)
    diff is an improvement.
Top file improvements (PerfScoreUnits):
    -2805.60 : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm (-3.38% of base)
     -345.03 : HWIntrinsic\X86\PacketTracer\PacketTracer\PacketTracer.dasm (-1.54% of base)
     -289.60 : BenchmarksGame\mandelbrot\mandelbrot-7\mandelbrot-7.dasm (-6.44% of base)
      -97.35 : SIMD\RayTracer\RayTracer\RayTracer.dasm (-0.78% of base)
4 total files with Perf Score differences (4 improved, 0 regressed), 80 unchanged.
Top method regressions (percentages):
        5.00 ( 3.73% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - Sphere:Intersect(Ray):ISect:this
        4.70 ( 1.55% of base) : HWIntrinsic\X86\PacketTracer\PacketTracer\PacketTracer.dasm - VectorMath:Log(Vector256`1):Vector256`1
        2.50 ( 0.46% of base) : HWIntrinsic\X86\PacketTracer\PacketTracer\PacketTracer.dasm - Camera:Create(VectorPacket256,VectorPacket256):Camera
Top method improvements (percentages):
      -10.25 (-32.59% of base) : HWIntrinsic\X86\PacketTracer\PacketTracer\PacketTracer.dasm - <>c:<.cctor>b__5_2(VectorPacket256):VectorPacket256:this
      -10.25 (-32.59% of base) : HWIntrinsic\X86\PacketTracer\PacketTracer\PacketTracer.dasm - <>c:<.cctor>b__5_4(VectorPacket256):VectorPacket256:this
       -9.00 (-18.69% of base) : HWIntrinsic\X86\PacketTracer\PacketTracer\PacketTracer.dasm - VectorPacket256:op_Multiply(Vector256`1,VectorPacket256):VectorPacket256
      -98.60 (-9.31% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - RayTracer:GetNaturalColor(SceneObject,Vector,Vector,Vector,Scene):Color:this
     -371.33 (-8.90% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this
     -361.70 (-8.57% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this
       -3.50 (-8.44% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - ComplexVecFloat:square():ComplexVecFloat:this
       -3.50 (-8.44% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - ComplexVecDouble:square():ComplexVecDouble:this
      -80.63 (-7.49% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass5_0:<RenderMultiThreadedWithADT>b__1(int):this
      -96.50 (-7.41% of base) : BenchmarksGame\mandelbrot\mandelbrot-7\mandelbrot-7.dasm - MandelBrot_7:GetByte(long,double,int,int):ubyte
     -192.50 (-7.19% of base) : BenchmarksGame\mandelbrot\mandelbrot-7\mandelbrot-7.dasm - <>c__DisplayClass4_0:<DoBench>b__0(int):this
      -78.50 (-7.14% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass6_0:<RenderMultiThreadedWithADT>b__1(int):this
     -320.74 (-7.12% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleStrictRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this
     -318.24 (-6.98% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatStrictRenderer:RenderSingleThreadedWithADT(float,float,float,float,float):this
      -35.00 (-6.45% of base) : HWIntrinsic\X86\PacketTracer\PacketTracer\PacketTracer.dasm - Packet256Tracer:CreateDefaultScene():Scene
     -264.93 (-6.38% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this
     -139.06 (-6.21% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass4_0:<RenderMultiThreadedWithADT>b__1(int):this (2 methods)
     -254.90 (-5.75% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this
     -168.96 (-5.16% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - <>c__DisplayClass5_0:<RenderMultiThreadedNoADT>b__1(int):this (3 methods)
     -211.84 (-4.67% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorFloatStrictRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this
     -211.84 (-4.39% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - VectorDoubleStrictRenderer:RenderSingleThreadedNoADT(float,float,float,float,float):this
      -33.93 (-4.32% of base) : HWIntrinsic\X86\PacketTracer\PacketTracer\PacketTracer.dasm - Packet256Tracer:Shade(Intersections,RayPacket256,Scene,int):VectorPacket256:this
     -239.80 (-3.82% of base) : HWIntrinsic\X86\PacketTracer\PacketTracer\PacketTracer.dasm - Packet256Tracer:GetNaturalColor(Vector256`1,VectorPacket256,VectorPacket256,VectorPacket256,Scene):VectorPacket256:this

briansull · 2020-02-26T23:21:01Z

We detect and make a new CSE in:

coreclr\tests\src\JIT\Performance\CodeQuality\BenchmarksGame\mandelbrot\mandelbrot-7.cs

The expression:
Zi = Zr * Zi + Zr * Zi + vCiby;
has the sub expression (Zr *Zi) occurring twice, so we make it into a CSE:
Saving about 10% of the loop cost.

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        private static unsafe byte GetByte(double* pCrb, double Ciby, int x, int y)
        {
            // This currently does not do anything special for 'Count > 2'

            var res = 0;

            for (var i = 0; i < 8; i += 2)
            {
                var Crbx = Unsafe.Read<Vector<double>>(pCrb + x + i);
                var Zr = Crbx;
                var vCiby = new Vector<double>(Ciby);
                var Zi = vCiby;

                var b = 0;
                var j = 49;

                do
                {
                    var nZr = Zr * Zr - Zi * Zi + Crbx;
                    Zi = Zr * Zi + Zr * Zi + vCiby;
                    Zr = nZr;

We also make the CSE when we inline the GetByte method into the lamba function :

                Parallel.For(0, size, y => {
                    var offset = y * lineLength;

                    for (var x = 0; x < lineLength; x++)
                    {
                        data[offset + x] = GetByte(_Crb, Cib[y], x * 8, y);
                    }
                });

briansull · 2020-02-26T23:28:19Z

I verified that setting
set COMPLUS_JitDisableSimdVN=3
will disable my changes.

Dotnet-GitSync-Bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 6, 2020

tannergooding reviewed Feb 7, 2020

View reviewed changes

briansull force-pushed the vn-simd branch from 5e05ec9 to 51201c1 Compare February 7, 2020 22:51

echesakov reviewed Feb 7, 2020

View reviewed changes

src/coreclr/src/jit/gentree.cpp Outdated Show resolved Hide resolved

echesakov reviewed Feb 7, 2020

View reviewed changes

src/coreclr/src/jit/gentree.cpp Show resolved Hide resolved

src/coreclr/src/jit/gentree.cpp Outdated Show resolved Hide resolved

This was referenced Feb 12, 2020

System.Net.NetworkInformation.Functional.Tests failing CI and PR #32179

Closed

Test failure: System.Net.NetworkInformation.Tests.NetworkInterfaceBasicTest/BasicTest_AccessInstanceProperties_NoExceptions_Linux #18090

Closed

briansull commented Feb 18, 2020

View reviewed changes

briansull added 14 commits February 18, 2020 10:29

Added ValueNumbering support for GT_SIMD and GT_HWINTRINSIC tree nodes

0c4b857

Allow SIMD and HW Intrinsics to be CSE candidates

367da1f

Correctness fix for optAssertionPropMain

57eadfb

- Zero out the bbAssertionIn values, as these can be referenced in RangeCheck::MergeAssertion and this is shared state with the CSE phase: bbCseIn

Improve the VNFOA_ArityMask

6d2028c

Include node type when value numbering SIMDIntrinsicInit

d8b9b53

Mutate the gloabl heap when performing a HW_INTRINSIC memory store operation Printing of SIMD constants only support 0

Disable CSE's for some special HW_INTRINSIC categories

e188b9b

Code review feedback

506a544

Record csdStructHnd; // The class handle, currently needed to create …

9c62167

…a SIMD LclVar in PerformCSE

Fix the JITDUMP messages to print the CseIndex

8e3cea8

add check for (newElemStructHnd != NO_CLASS_HANDLE)

3769197

Fix the printing of BitSets on Linux, change the printf format specifier

305a017

Added check for simdNode->OperIsMemoryLoad()) to fgValueNumberSimd

0d0e4f8

Added bool OperIsMemoryLoad() to GenTreeSIMD, returns true for SIMDIntrinsicInitArray Added valuenumfuncs.h to src/coreclr/src/jit/CMakeLists.txt

Update to use the new TARGET macros

0232044

Avoid calling gtGetStructHandleIfPresent to set csdStructHnd when we …

d9dda0a

…have a GT_IND node

briansull added 2 commits February 21, 2020 10:31

Codereview feedback and some more comments

a0e6a8c

fix typo

cbe5538

CarolEidt approved these changes Feb 21, 2020

View reviewed changes

Moved the code that sets the arg count for the three SIMD intrinsics

3e927bf

briansull added 3 commits February 21, 2020 17:07

clang-format

b2b8986

Adjust CSE for SIMD types that are live across a call

7bd9303

Proposed fix for dotnet#32085

169c24e

CarolEidt reviewed Feb 25, 2020

View reviewed changes

briansull added 2 commits February 25, 2020 11:10

Revert "Proposed fix for dotnet#32085"

8d0ce82

This reverts commit 169c24e.

Added better comments for optcse SIMD caller saved register heuristics

6ae894a

briansull added 2 commits February 25, 2020 17:55

Added CONFIG_INTEGER: JitDisableSimdVN,

b4eb406

Default 0, ValueNumbering of SIMD nodes and HW Intrinsic nodes enabled If 1, then disable ValueNumbering of SIMD nodes If 2, then disable ValueNumbering of HW Intrinsic nodes If 3, disable both SIMD and HW Intrinsic nodes

Moved JitDisableSimdVN from DEBUG to RETAIL

90bf7b3

briansull merged commit 4ef3cc6 into dotnet:master Feb 26, 2020

tannergooding mentioned this pull request Mar 22, 2020

Implement Vector{Size}<T>.AllBitsSet #33924

Merged

EgorBo mentioned this pull request Nov 1, 2020

Revise how constant SIMD vectors are defined in BCL #44115

Closed

ghost locked as resolved and limited conversation to collaborators Dec 10, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add ValueNumbering support for GT_SIMD and GT_HWINTRINSIC tree nodes #31834

Add ValueNumbering support for GT_SIMD and GT_HWINTRINSIC tree nodes #31834

briansull commented Feb 6, 2020

tannergooding commented Feb 7, 2020

tannergooding Feb 7, 2020

CarolEidt Feb 7, 2020 •

edited

Loading

briansull commented Feb 7, 2020

briansull commented Feb 7, 2020 •

edited

Loading

tannergooding commented Feb 7, 2020 •

edited

Loading

briansull commented Feb 7, 2020

tannergooding commented Feb 8, 2020 •

edited

Loading

CarolEidt commented Feb 8, 2020

briansull commented Feb 8, 2020

briansull Feb 18, 2020

CarolEidt left a comment

AndyAyersMS commented Feb 21, 2020

briansull commented Feb 22, 2020 •

edited

Loading

briansull commented Feb 22, 2020 •

edited

Loading

briansull commented Feb 22, 2020 •

edited

Loading

CarolEidt Feb 25, 2020

briansull Feb 25, 2020

CarolEidt Feb 25, 2020

briansull commented Feb 25, 2020 •

edited

Loading

briansull commented Feb 26, 2020

briansull commented Feb 26, 2020

briansull commented Feb 26, 2020 •

edited

Loading

briansull commented Feb 26, 2020

briansull commented Feb 26, 2020

briansull commented Feb 26, 2020

briansull commented Feb 26, 2020

briansull commented Feb 26, 2020

briansull commented Feb 26, 2020

Add ValueNumbering support for GT_SIMD and GT_HWINTRINSIC tree nodes #31834

Add ValueNumbering support for GT_SIMD and GT_HWINTRINSIC tree nodes #31834

Conversation

briansull commented Feb 6, 2020

tannergooding commented Feb 7, 2020

tannergooding Feb 7, 2020

Choose a reason for hiding this comment

CarolEidt Feb 7, 2020 • edited Loading

Choose a reason for hiding this comment

briansull commented Feb 7, 2020

briansull commented Feb 7, 2020 • edited Loading

tannergooding commented Feb 7, 2020 • edited Loading

briansull commented Feb 7, 2020

tannergooding commented Feb 8, 2020 • edited Loading

CarolEidt commented Feb 8, 2020

briansull commented Feb 8, 2020

briansull Feb 18, 2020

Choose a reason for hiding this comment

CarolEidt left a comment

Choose a reason for hiding this comment

AndyAyersMS commented Feb 21, 2020

briansull commented Feb 22, 2020 • edited Loading

briansull commented Feb 22, 2020 • edited Loading

briansull commented Feb 22, 2020 • edited Loading

CarolEidt Feb 25, 2020

Choose a reason for hiding this comment

briansull Feb 25, 2020

Choose a reason for hiding this comment

CarolEidt Feb 25, 2020

Choose a reason for hiding this comment

briansull commented Feb 25, 2020 • edited Loading

briansull commented Feb 26, 2020

briansull commented Feb 26, 2020

briansull commented Feb 26, 2020 • edited Loading

briansull commented Feb 26, 2020

briansull commented Feb 26, 2020

briansull commented Feb 26, 2020

briansull commented Feb 26, 2020

briansull commented Feb 26, 2020

briansull commented Feb 26, 2020

CarolEidt Feb 7, 2020 •

edited

Loading

briansull commented Feb 7, 2020 •

edited

Loading

tannergooding commented Feb 7, 2020 •

edited

Loading

tannergooding commented Feb 8, 2020 •

edited

Loading

briansull commented Feb 22, 2020 •

edited

Loading

briansull commented Feb 22, 2020 •

edited

Loading

briansull commented Feb 22, 2020 •

edited

Loading

briansull commented Feb 25, 2020 •

edited

Loading

briansull commented Feb 26, 2020 •

edited

Loading