Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

JIT: Recognize FMA patterns (x*y+z) #25856

Closed
wants to merge 6 commits into from
Closed

Conversation

EgorBo
Copy link
Member

@EgorBo EgorBo commented Jul 24, 2019

I know, such features certainly require design and discussions, so it's just a do-not-merge PR to show how it could be done (this is how I learn how the RuyJIT actually works 🙂 thanks to your feedback/comments).
So the PR teaches JIT to recognize a * b + c patterns (see https://github.com/dotnet/coreclr/issues/17541) and replace them with, basically, Fma.MultiplyAddScalar intrinsics (depending on signs and types):

image

Benchmark: (Coffee Lake i7 8700K)

Method Mean Ratio
Old 129.41 ns 1.00
New 64.95 ns 0.50

So it morphs:

fgMorphTree BB01, stmt 1 (before)
    [000005] ------------              *  RETURN    float 
    [000004] ------------              \--*  ADD       float 
    [000002] ------------                 +--*  MUL       float 
    [000000] ------------                 |  +--*  LCL_VAR   float  V01 arg1     
    [000001] ------------                 |  \--*  LCL_VAR   float  V02 arg2      
    [000003] ------------                 \--*  LCL_VAR   float  V03 arg3       

into

fgMorphTree BB01, stmt 1 (after)
    [000005] -----+------              *  RETURN    float 
    [000014] -----+------              \--*  HWIntrinsic float  float ToScalar
    [000013] -----+------                 \--*  HWIntrinsic simd16 float MultiplyAddScalar
    [000012] -----+------                    \--*  LIST      void  
    [000007] -----+------                       +--*  HWIntrinsic simd16 float CreateScalarUnsafe
    [000000] -----+------                       |  \--*  LCL_VAR   float  V01 arg1         
    [000011] ------------                       \--*  LIST      void  
    [000008] -----+------                          +--*  HWIntrinsic simd16 float CreateScalarUnsafe
    [000001] -----+------                          |  \--*  LCL_VAR   float  V02 arg2         
    [000010] ------------                          \--*  LIST      void  
    [000009] -----+------                             \--*  HWIntrinsic simd16 float CreateScalarUnsafe
    [000003] -----+------                                \--*  LCL_VAR   float  V03 arg3         

(Math.FusedMultiplyAdd() generates the same IR tree)

However, I suspect this transformation should be done in lower.cpp instead (I tried but it was too complicated to figure out how to do that)

Issues

static double NoMadd(double a, double b, double c, out double z)
{
    z = a * b;         // a * b
    return a * b + c;  // z + c
}

^ currently generates mul and fmadd here instead of mul and add because, I suspect, CSE happens after morphing (moving this transformation to lowering will help).


static float Madd(float a)
{
    return a * a + a;
}

^ generates redundant movs. (while could be just vfmadd231ss xmm0, xmm0, xmm0) - jit-diff shows some size regressions because of that.

Also, If an FMADD candidate is prejitted (R2R'd) then if we re-compile it with FMA it might return different values for the same input (however, it already happens in .NET Core: https://github.com/dotnet/coreclr/issues/25857)

PS: mono supports it thanks to LLVM (if -fp-contract=fast is set) see https://twitter.com/EgorBo/status/1063468884257316865/photo/1

@benaadams
Copy link
Member

An issue for you 😉 https://github.com/dotnet/coreclr/issues/17541

@RussKeldorph
Copy link

@dotnet/jit-contrib

@mikedn
Copy link

mikedn commented Jul 25, 2019

^ currently generates mul and fmadd here instead of mul and add because, I suspect, CSE happens after morphing (moving this transformation to lowering will help).

Yes, CSE runs after morph. But it's not clear if getting mul and fmadd is a bad thing in this case. In the mul+add case the add has to wait for the mul to complete so the total latency would be 8 cycles (on a Skylake). In the mul+fmadd case both instructions might be able to execute at the same time so both results would be available after only 4 cycles. But it depends on the surrounding code if this is useful or not, the latency may be unimportant and/or the surrounding code may have other ILP opportunities.

Moving to lowering would help in other ways though: if you have two a * b + c and you morph to FMA then CSE won't be able to pick up the redundancy because intrinsics are not recognized by CSE (well by value numbering that CSE depends on).

However, I suspect this transformation should be done in lower.cpp instead (I tried but it was too complicated to figure out how to do that)

What exactly was complicated? Should be relatively similar, the main difference would be that you need to manually insert and remove the old and new nodes from the linear order.

mono supports it thanks to LLVM (if -fp-contract=fast is set) see

The fact that auto generating FMA is normally done only if certain compiler options are set is probably the biggest roadblock to actually doing this in the JIT now. There's currently no way to developers to provide such options to the JIT (well, except perhaps by using COMPLUS environment variables but that's probably impractical).

@tannergooding
Copy link
Member

There's currently no way to developers to provide such options to the JIT

Definitely right. I believe https://github.com/dotnet/coreclr/issues/24784 is currently the closest thing we have to a tracking issue. @CarolEidt and I had discussed this a couple times in the past and it likely needs a good bit of design work and discussion to determine how it all works (especially with records to crossing various boundaries).

@maryamariyan
Copy link
Member

Thank you for your contribution. As announced in dotnet/coreclr#27549 this repository will be moving to dotnet/runtime on November 13. If you would like to continue working on this PR after this date, the easiest way to move the change to dotnet/runtime is:

  1. In your coreclr repository clone, create patch by running git format-patch origin
  2. In your runtime repository clone, apply the patch by running git apply --directory src/coreclr <path to the patch created in step 1>

@BruceForstall BruceForstall added optimization post-consolidation PRs which will be hand ported to dotnet/runtime labels Nov 7, 2019
@maryamariyan
Copy link
Member

Thank you for your contribution. As announced in #27549 the dotnet/runtime repository will be used going forward for changes to this code base. Closing this PR as no more changes will be accepted into master for this repository. If you’d like to continue working on this change please move it to dotnet/runtime.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen optimization post-consolidation PRs which will be hand ported to dotnet/runtime
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants