An issue for you 😉 https://github.com/dotnet/coreclr/issues/17541
@dotnet/jit-contrib
Yes, CSE runs after morph. But it's not clear that getting mul and fmadd is a bad thing in this case. In the mul+add case the add has to wait for the mul to complete, so the total latency would be 8 cycles (on Skylake). In the mul+fmadd case both instructions might be able to execute at the same time, so both results would be available after only 4 cycles. Whether this is useful depends on the surrounding code: the latency may be unimportant and/or the surrounding code may have other ILP opportunities. Moving to lowering would help in other ways though: if you have two
What exactly was complicated? It should be relatively similar; the main difference is that you need to manually insert the new nodes into, and remove the old nodes from, the linear order.
The fact that auto-generating FMA is normally done only if certain compiler options are set is probably the biggest roadblock to actually doing this in the JIT now. There's currently no way for developers to provide such options to the JIT (well, except perhaps by using COMPLUS environment variables, but that's probably impractical).
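To make the latency argument above concrete, here is a minimal sketch of a critical-path calculation. It assumes the 4-cycle mul/add/fmadd latencies quoted for Skylake and an idealised machine with unlimited execution ports; `finish_times` and the instruction tuples are hypothetical names used only for illustration.

```python
# Toy critical-path model. The 4-cycle latencies are the Skylake figures
# quoted in the comment above; unlimited execution ports are an assumption.
LATENCY = {"mul": 4, "add": 4, "fmadd": 4}

def finish_times(instrs):
    """instrs is a list of (result_name, op, dependencies).
    Returns the cycle at which each result becomes available."""
    ready = {}
    for name, op, deps in instrs:
        # An instruction can start once all of its inputs are ready.
        start = max((ready[d] for d in deps), default=0)
        ready[name] = start + LATENCY[op]
    return ready

# x = a * b; y = x + c        -> the add has to wait for the mul
mul_add = finish_times([("x", "mul", []), ("y", "add", ["x"])])

# x = a * b; y = fmadd(a,b,c) -> the two instructions are independent
mul_fmadd = finish_times([("x", "mul", []), ("y", "fmadd", [])])

print(max(mul_add.values()))    # 8
print(max(mul_fmadd.values()))  # 4
```

The model ignores throughput limits and the surrounding code, which is exactly the caveat raised above: whether the shorter dependency chain matters depends on what else is in flight.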
Definitely right. I believe https://github.com/dotnet/coreclr/issues/24784 is currently the closest thing we have to a tracking issue. @CarolEidt and I had discussed this a couple of times in the past, and it likely needs a good bit of design work and discussion to determine how it all works (especially with regards to crossing various boundaries).
Thank you for your contribution. As announced in dotnet/coreclr#27549 this repository will be moving to dotnet/runtime on November 13. If you would like to continue working on this PR after this date, the easiest way to move the change to dotnet/runtime is:
Thank you for your contribution. As announced in #27549 the dotnet/runtime repository will be used going forward for changes to this code base. Closing this PR as no more changes will be accepted into master for this repository. If you’d like to continue working on this change please move it to dotnet/runtime.
I know such features certainly require design and discussion, so this is just a do-not-merge PR to show how it could be done (this is how I learn how RyuJIT actually works 🙂, thanks to your feedback/comments).
So the PR teaches the JIT to recognize `a * b + c` patterns (see https://github.com/dotnet/coreclr/issues/17541) and replace them with, basically, `Fma.MultiplyAddScalar` intrinsics (depending on signs and types):

Benchmark: (Coffee Lake i7 8700K)
So it morphs:

into

(`Math.FusedMultiplyAdd()` generates the same IR tree)

However, I suspect this transformation should be done in `lower.cpp` instead (I tried, but it was too complicated to figure out how to do that).

Issues
^ currently generates `mul` and `fmadd` here instead of `mul` and `add` because, I suspect, CSE happens after morphing (moving this transformation to lowering will help).

^ generates redundant movs (while it could be just `vfmadd231ss xmm0, xmm0, xmm0`); jit-diff shows some size regressions because of that.

Also, if an FMADD candidate is prejitted (R2R'd) and we then re-compile it with FMA, it might return different values for the same input (however, this already happens in .NET Core: https://github.com/dotnet/coreclr/issues/25857).
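The "different values for the same input" concern comes from rounding: a separate multiply and add round twice, while a fused multiply-add rounds once. The sketch below illustrates this in Python. It does not call a hardware FMA; `fused_multiply_add` is a hypothetical helper that emulates the single rounding with exact rational arithmetic, and the inputs were picked by hand so that the two results differ.

```python
from fractions import Fraction

def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulated FMA (an assumption, not a hardware call): compute a*b + c
    exactly as a rational number, then round to double once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

a = 1.0 + 2.0**-27
b = 1.0 - 2.0**-27
c = -1.0

# Exact product is 1 - 2**-54, which is not representable as a double.
unfused = a * b + c                    # product rounds to 1.0, then 1.0 - 1.0
fused = fused_multiply_add(a, b, c)    # keeps the -2**-54 term

print(unfused)              # 0.0
print(fused == -2.0**-54)   # True
```

So code that was precompiled without FMA and later re-jitted with FMA could observe `0.0` in one version and a tiny non-zero value in the other for inputs like these.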
PS: `mono` supports it thanks to `LLVM` (if `-fp-contract=fast` is set), see https://twitter.com/EgorBo/status/1063468884257316865/photo/1
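As a rough illustration of the kind of rewrite this PR describes (recognizing `a * b + c` and picking an FMA intrinsic depending on signs), here is a toy expression-tree matcher. It is not the JIT's code: the node classes and `try_fuse` are hypothetical, types are ignored, only a product on the left-hand side is handled, and the sign-to-intrinsic mapping is an assumption based on the scalar method names of `System.Runtime.Intrinsics.X86.Fma`.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Neg:
    operand: object

@dataclass(frozen=True)
class BinOp:
    op: str        # "+", "-" or "*"
    left: object
    right: object

@dataclass(frozen=True)
class Fma:
    kind: str      # name of the selected Fma.*Scalar intrinsic
    a: object
    b: object
    c: object

# Assumed mapping from (outer operator, product negated?) to intrinsic name.
KINDS = {
    ("+", False): "MultiplyAddScalar",              #  (a * b) + c
    ("-", False): "MultiplySubtractScalar",         #  (a * b) - c
    ("+", True):  "MultiplyAddNegatedScalar",       # -(a * b) + c
    ("-", True):  "MultiplySubtractNegatedScalar",  # -(a * b) - c
}

def try_fuse(node):
    """Return an Fma node for (+/-)(a * b) +/- c, else the node unchanged."""
    if not (isinstance(node, BinOp) and node.op in ("+", "-")):
        return node
    product, negated = node.left, False
    if isinstance(product, Neg):
        product, negated = product.operand, True
    if not (isinstance(product, BinOp) and product.op == "*"):
        return node
    return Fma(KINDS[(node.op, negated)], product.left, product.right, node.right)

a, b, c = Var("a"), Var("b"), Var("c")
print(try_fuse(BinOp("+", BinOp("*", a, b), c)).kind)       # MultiplyAddScalar
print(try_fuse(BinOp("-", Neg(BinOp("*", a, b)), c)).kind)  # MultiplySubtractNegatedScalar
```

A real implementation would also have to check operand types, hardware support, and whether the product is used elsewhere, which is where the CSE interaction mentioned under Issues comes in.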