
JIT: Recognize FMA patterns (x*y+z) #25856

Open · wants to merge 6 commits into base: master
Conversation

@EgorBo (Contributor) commented Jul 24, 2019

I know such features certainly require design and discussion, so this is just a do-not-merge PR to show how it could be done (this is how I learn how RyuJIT actually works 🙂 thanks to your feedback/comments).
The PR teaches the JIT to recognize `a * b + c` patterns (see #17541) and replace them with, essentially, `Fma.MultiplyAddScalar` intrinsics (depending on signs and types):

(screenshot omitted)

Benchmark (Coffee Lake i7-8700K):

| Method | Mean      | Ratio |
|--------|-----------|-------|
| Old    | 129.41 ns | 1.00  |
| New    |  64.95 ns | 0.50  |

So it morphs:

fgMorphTree BB01, stmt 1 (before)
    [000005] ------------              *  RETURN    float 
    [000004] ------------              \--*  ADD       float 
    [000002] ------------                 +--*  MUL       float 
    [000000] ------------                 |  +--*  LCL_VAR   float  V01 arg1     
    [000001] ------------                 |  \--*  LCL_VAR   float  V02 arg2      
    [000003] ------------                 \--*  LCL_VAR   float  V03 arg3       

into

fgMorphTree BB01, stmt 1 (after)
    [000005] -----+------              *  RETURN    float 
    [000014] -----+------              \--*  HWIntrinsic float  float ToScalar
    [000013] -----+------                 \--*  HWIntrinsic simd16 float MultiplyAddScalar
    [000012] -----+------                    \--*  LIST      void  
    [000007] -----+------                       +--*  HWIntrinsic simd16 float CreateScalarUnsafe
    [000000] -----+------                       |  \--*  LCL_VAR   float  V01 arg1         
    [000011] ------------                       \--*  LIST      void  
    [000008] -----+------                          +--*  HWIntrinsic simd16 float CreateScalarUnsafe
    [000001] -----+------                          |  \--*  LCL_VAR   float  V02 arg2         
    [000010] ------------                          \--*  LIST      void  
    [000009] -----+------                             \--*  HWIntrinsic simd16 float CreateScalarUnsafe
    [000003] -----+------                                \--*  LCL_VAR   float  V03 arg3         

(Math.FusedMultiplyAdd() generates the same IR tree)

However, I suspect this transformation should be done in lower.cpp instead (I tried, but it was too complicated to figure out how to do it there).

Issues

```csharp
static double NoMadd(double a, double b, double c, out double z)
{
    z = a * b;         // a * b
    return a * b + c;  // z + c
}
```

^ this currently generates mul and fmadd instead of mul and add because, I suspect, CSE happens after morphing (moving this transformation to lowering would help).


```csharp
static float Madd(float a)
{
    return a * a + a;
}
```

^ generates redundant movs (while it could be just vfmadd231ss xmm0, xmm0, xmm0); jit-diff shows some size regressions because of that.

Also, if an FMADD candidate is prejitted (R2R'd) and we later re-compile it with FMA, it might return different values for the same input (however, this already happens in .NET Core: #25857)

PS: Mono supports it thanks to LLVM (if -fp-contract=fast is set); see https://twitter.com/EgorBo/status/1063468884257316865/photo/1

@benaadams (Collaborator) commented Jul 24, 2019

An issue for you 😉 #17541

@RussKeldorph (Member) commented Jul 24, 2019 (comment minimized)

@mikedn (Contributor) commented Jul 25, 2019

> ^ currently generates mul and fmadd here instead of mul and add because, I suspect, CSE happens after morphing (moving this transformation to lowering will help).

Yes, CSE runs after morph. But it's not clear that getting mul and fmadd is a bad thing in this case. In the mul+add case the add has to wait for the mul to complete, so the total latency would be 8 cycles (on a Skylake). In the mul+fmadd case both instructions might be able to execute at the same time, so both results would be available after only 4 cycles. But whether this is useful depends on the surrounding code: the latency may be unimportant and/or the surrounding code may have other ILP opportunities.

Moving to lowering would help in other ways, though: if you have two `a * b + c` expressions and you morph them to FMA, then CSE won't be able to pick up the redundancy, because intrinsics are not recognized by CSE (well, by value numbering, which CSE depends on).

> However, I suspect this transformation should be done in lower.cpp instead (I tried but it was too complicated to figure out how to do that)

What exactly was complicated? It should be relatively similar; the main difference is that you need to manually insert the new nodes into, and remove the old nodes from, the linear order.

> mono supports it thanks to LLVM (if -fp-contract=fast is set) see

The fact that auto-generating FMA is normally done only when certain compiler options are set is probably the biggest roadblock to actually doing this in the JIT now. There's currently no way for developers to provide such options to the JIT (well, except perhaps by using COMPlus environment variables, but that's probably impractical).

@tannergooding (Member) commented Jul 25, 2019

> There's currently no way for developers to provide such options to the JIT

Definitely right. I believe #24784 is currently the closest thing we have to a tracking issue. @CarolEidt and I have discussed this a couple of times in the past, and it likely needs a good bit of design work and discussion to determine how it all works (especially with regards to crossing various boundaries).
