JIT: Recognize FMA patterns (x*y+z) #25856

EgorBo · 2019-07-24T13:11:15Z

I know such features certainly require design and discussion, so this is just a do-not-merge PR to show how it could be done (this is how I learn how RyuJIT actually works 🙂 thanks to your feedback/comments).
The PR teaches the JIT to recognize a * b + c patterns (see https://github.com/dotnet/coreclr/issues/17541) and replace them with, basically, Fma.MultiplyAddScalar intrinsics (depending on signs and types).

Benchmark (Coffee Lake i7 8700K):

| Method | Mean      | Ratio |
|--------|-----------|-------|
| Old    | 129.41 ns | 1.00  |
| New    | 64.95 ns  | 0.50  |

So it morphs:

fgMorphTree BB01, stmt 1 (before)
    [000005] ------------              *  RETURN    float 
    [000004] ------------              \--*  ADD       float 
    [000002] ------------                 +--*  MUL       float 
    [000000] ------------                 |  +--*  LCL_VAR   float  V01 arg1     
    [000001] ------------                 |  \--*  LCL_VAR   float  V02 arg2      
    [000003] ------------                 \--*  LCL_VAR   float  V03 arg3

into

fgMorphTree BB01, stmt 1 (after)
    [000005] -----+------              *  RETURN    float 
    [000014] -----+------              \--*  HWIntrinsic float  float ToScalar
    [000013] -----+------                 \--*  HWIntrinsic simd16 float MultiplyAddScalar
    [000012] -----+------                    \--*  LIST      void  
    [000007] -----+------                       +--*  HWIntrinsic simd16 float CreateScalarUnsafe
    [000000] -----+------                       |  \--*  LCL_VAR   float  V01 arg1         
    [000011] ------------                       \--*  LIST      void  
    [000008] -----+------                          +--*  HWIntrinsic simd16 float CreateScalarUnsafe
    [000001] -----+------                          |  \--*  LCL_VAR   float  V02 arg2         
    [000010] ------------                          \--*  LIST      void  
    [000009] -----+------                             \--*  HWIntrinsic simd16 float CreateScalarUnsafe
    [000003] -----+------                                \--*  LCL_VAR   float  V03 arg3

(Math.FusedMultiplyAdd() generates the same IR tree)

However, I suspect this transformation should be done in lower.cpp instead (I tried but it was too complicated to figure out how to do that)

Issues

static double NoMadd(double a, double b, double c, out double z)
{
    z = a * b;         // a * b
    return a * b + c;  // z + c
}

^ currently generates mul and fmadd here instead of mul and add because, I suspect, CSE happens after morphing (moving this transformation to lowering will help).

static float Madd(float a)
{
    return a * a + a;
}

^ generates redundant movs (it could be just vfmadd231ss xmm0, xmm0, xmm0). jit-diff shows some size regressions because of that.

Also, if an FMADD candidate is prejitted (R2R'd) and we later re-compile it with FMA, it might return different values for the same input (however, this already happens in .NET Core: https://github.com/dotnet/coreclr/issues/25857)
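
To make the "different values for the same input" concern concrete, here is a minimal C++ sketch (not the JIT's code) that compares a separately rounded multiply-add with std::fma. The function names are assumptions made up for illustration, and the volatile temporary is only there to keep a C++ compiler from contracting the expression on its own.

```cpp
#include <cassert>
#include <cmath>

// Unfused: the product is rounded to double before the addition.
// The volatile temporary keeps the compiler from fusing the two operations.
double MulAddUnfused(double a, double b, double c) {
    volatile double product = a * b;
    return product + c;
}

// Fused: a * b + c is evaluated with a single rounding at the end.
double MulAddFused(double a, double b, double c) {
    return std::fma(a, b, c);
}
```

With a = b = 1 + 2^-30 and c = -(1 + 2^-29), the exact product is 1 + 2^-29 + 2^-60. The unfused version rounds the 2^-60 term away and returns 0, while the fused version returns 2^-60, so the same method can produce different results depending on whether it was compiled with FMA.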

PS: mono supports it thanks to LLVM (if -fp-contract=fast is set) see https://twitter.com/EgorBo/status/1063468884257316865/photo/1

benaadams · 2019-07-24T13:32:52Z

An issue for you 😉 https://github.com/dotnet/coreclr/issues/17541

RussKeldorph · 2019-07-24T22:05:29Z

@dotnet/jit-contrib

mikedn · 2019-07-25T04:02:03Z

> ^ currently generates mul and fmadd here instead of mul and add because, I suspect, CSE happens after morphing (moving this transformation to lowering will help).

Yes, CSE runs after morph. But it's not clear that getting mul and fmadd is a bad thing in this case. In the mul+add case the add has to wait for the mul to complete, so the total latency would be 8 cycles (on a Skylake). In the mul+fmadd case both instructions might be able to execute at the same time, so both results would be available after only 4 cycles. Whether this is useful depends on the surrounding code: the latency may be unimportant and/or the surrounding code may have other ILP opportunities.
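
The two code shapes being compared can be sketched in C++ as follows. This is only an illustration of the data dependencies, with hypothetical function names; whether the two operations in the second shape really overlap depends on the hardware and on what the compiler emits.

```cpp
#include <cassert>
#include <cmath>

// Shape 1 (mul + add): the add consumes z, so it must wait for the multiply.
// Using the latencies quoted above, the result is ready after 4 + 4 = 8 cycles.
double MulThenAdd(double a, double b, double c, double& z) {
    z = a * b;
    return z + c;
}

// Shape 2 (mul + fmadd): the fused operation reads a, b and c directly and
// does not depend on z, so both operations may be in flight at the same time.
double MulAndFma(double a, double b, double c, double& z) {
    z = a * b;
    return std::fma(a, b, c);
}
```

The price of the second shape is that a * b is computed twice, and the returned value is rounded once rather than twice, so it is not guaranteed to equal z + c bit for bit.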

Moving to lowering would help in other ways though: if you have two a * b + c expressions and you morph them to FMA, then CSE won't be able to pick up the redundancy because intrinsics are not recognized by CSE (well, by the value numbering that CSE depends on).

> However, I suspect this transformation should be done in lower.cpp instead (I tried but it was too complicated to figure out how to do that)

What exactly was complicated? Should be relatively similar, the main difference would be that you need to manually insert and remove the old and new nodes from the linear order.
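
As a rough picture of what "manually insert and remove nodes from the linear order" means, here is a toy C++ model. Everything in it (Op, Node, LinearOrder, Append, CountUses, FuseMulAdd) is hypothetical and is not RyuJIT's GenTree or LIR API; it only shows the shape of the rewrite, including a single-use check so that a multiply with another consumer is left alone.

```cpp
#include <algorithm>
#include <cassert>
#include <list>
#include <utility>
#include <vector>

// Toy model only: these types are hypothetical, not the JIT's real IR.
enum class Op { Local, Mul, Add, Fma, Return };

struct Node {
    Op op;
    std::vector<Node*> operands; // operands appear earlier in the linear order
};

using LinearOrder = std::list<Node>; // std::list keeps node addresses stable

// Append a node to the end of the sequence and return its address.
Node* Append(LinearOrder& order, Op op, std::vector<Node*> operands) {
    order.push_back(Node{op, std::move(operands)});
    return &order.back();
}

// Count how many nodes in the sequence consume `target` as an operand.
int CountUses(const LinearOrder& order, const Node* target) {
    int uses = 0;
    for (const Node& n : order)
        for (const Node* operand : n.operands)
            if (operand == target) ++uses;
    return uses;
}

// Rewrite ADD(MUL(a, b), c) into FMA(a, b, c) when the MUL has no other user.
// Only the MUL-on-the-left form is handled. Returns the number of rewrites.
int FuseMulAdd(LinearOrder& order) {
    int fused = 0;
    for (Node& node : order) {
        if (node.op != Op::Add || node.operands.size() != 2) continue;
        Node* mul = node.operands[0];
        if (mul->op != Op::Mul || CountUses(order, mul) != 1) continue;

        // Reuse the ADD node as the FMA so that its users need no update.
        Node* addend = node.operands[1];
        node.op = Op::Fma;
        node.operands = {mul->operands[0], mul->operands[1], addend};

        // The MUL is now dead: unlink it from the linear order by hand.
        auto dead = std::find_if(order.begin(), order.end(),
                                 [mul](const Node& n) { return &n == mul; });
        order.erase(dead);
        ++fused;
    }
    return fused;
}
```

For the sequence a, b, c, MUL, ADD, RETURN this turns the ADD into an FMA and drops the MUL, leaving five nodes. If another node also used the MUL (as in the NoMadd example, where a * b is stored to z), the single-use check would skip the rewrite.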

> mono supports it thanks to LLVM (if -fp-contract=fast is set) see

The fact that auto-generating FMA is normally done only if certain compiler options are set is probably the biggest roadblock to actually doing this in the JIT now. There's currently no way for developers to provide such options to the JIT (well, except perhaps by using COMPLUS environment variables, but that's probably impractical).

tannergooding · 2019-07-25T18:52:43Z

> There's currently no way for developers to provide such options to the JIT

Definitely right. I believe https://github.com/dotnet/coreclr/issues/24784 is currently the closest thing we have to a tracking issue. @CarolEidt and I have discussed this a couple of times in the past, and it likely needs a good bit of design work and discussion to determine how it all works (especially with regard to crossing various boundaries).

maryamariyan · 2019-11-06T21:04:30Z

Thank you for your contribution. As announced in dotnet/coreclr#27549 this repository will be moving to dotnet/runtime on November 13. If you would like to continue working on this PR after this date, the easiest way to move the change to dotnet/runtime is:

1. In your coreclr repository clone, create patch by running git format-patch origin
2. In your runtime repository clone, apply the patch by running git apply --directory src/coreclr <path to the patch created in step 1>

maryamariyan · 2019-12-02T19:37:56Z

Thank you for your contribution. As announced in #27549 the dotnet/runtime repository will be used going forward for changes to this code base. Closing this PR as no more changes will be accepted into master for this repository. If you’d like to continue working on this change please move it to dotnet/runtime.

EgorBo added 6 commits July 24, 2019 01:20

7bf763b  Insert FMA
efb00c3  fix build errors
49720e9  Introduce COMPlus_JitInsertFma
fbf006c  drop comment
fd136c3  morph only in fgGlobalMorph phase
5ce02d8  add comments

sandreenko added the area-CodeGen label Sep 23, 2019

BruceForstall added optimization post-consolidation PRs which will be hand ported to dotnet/runtime labels Nov 7, 2019

maryamariyan closed this Dec 2, 2019
