HWIntrinsics: FMA suboptimal codegen #12212

saucecontrol · 2019-03-07T23:31:33Z

This code:

av1 = Fma.MultiplyAdd(iv1, Sse.LoadVector128(mp + 4), av1);

currently compiles to:

lea         rbx,[rdi+10h]  
vfmadd132ps xmm4,xmm1,xmmword ptr [rbx]  
vmovaps     xmm1,xmm4

Assuming dotnet/coreclr#22944 would eliminate the extra lea there, I believe this should be generating:

vfmadd231ps xmm1,xmm4,xmmword ptr [rdi+10h]

It looks like the logic in genFMAIntrinsic is missing the fact the two non-contained arguments could be swapped here.

cc @tannergooding

category:cq
theme:hardware-intrinsics
skill-level:expert
cost:medium

tannergooding · 2019-03-08T00:32:20Z

> It looks like the logic in genFMAIntrinsic is missing the fact the two non-contained arguments could be swapped here.

This shouldn't be the case. We have a check here: https://github.com/dotnet/coreclr/blob/master/src/jit/hwintrinsiccodegenxarch.cpp#L2354

Could you share a minimal repro I could look at?

saucecontrol · 2019-03-08T00:40:43Z

Yeah, this one matches the second condition in the if above (op2 is contained) and it doesn't set the isCommutative value. I couldn't tell why the first and last branches set it but the middle two don't.

tannergooding · 2019-03-08T01:10:51Z

I'm not sure why that is missing either, let me test out a fix real quick.

saucecontrol · 2019-03-08T02:06:21Z

Cool, I can make up a repro if that would help. I can confirm that swapping the first two arguments in my example causes vfmadd231ps to be emitted.

tannergooding · 2019-03-08T02:10:47Z

A repro that I can directly test would be helpful 😄. Otherwise, I am left just testing the existing tests and local code.

tannergooding · 2019-03-08T02:55:15Z

> I can confirm that swapping the first two arguments in my example causes vfmadd231ps to be emitted.

Ah, right. I remember why this is hard now.

The code right now is choosing:

lea         rbx,[rdi+10h]  
vfmadd132ps xmm4,xmm1,xmmword ptr [rbx]  
vmovaps     xmm1,xmm4

After the lea is optimized this will be:

vfmadd132ps xmm4,xmm1,xmmword ptr [rdi+10h]  
vmovaps     xmm1,xmm4

We start off with (x * y) + z which will choose 213 ((op2 * op1) + op3 where op1 = x; op2 = y; op3 = z). This form allows op1 and op2 to be swapped and op3 to be contained. In this case, op3 didn't need to be (or couldn't be) contained so we checked op2 instead and determined it could be contained, thus selecting 132 ((op1 * op3) + op2 where op1 = x; op2 = y; op3 = z) instead.

This means we now have (x * z) + y, which is wrong. We fix this up by swapping op2 and op3 (see https://github.com/dotnet/coreclr/blob/master/src/jit/hwintrinsiccodegenxarch.cpp#L2331), which gives us (x * y) + z again ((op1 * op3) + op2 where op1 = x; op2 = z; op3 = y).

Given that op3 is the operand that is contained and the commutative operands are the ones being multiplied, you can't swap op1 and op3.

saucecontrol · 2019-03-08T02:55:21Z

Here's a quick one. fmaTest1 shows the vfmadd132 variant with the movaps right after. fmaTest2 shows the desired vfmadd231. Couple of other codegen issues here as well... this is on preview3, so I'm not sure if they've been fixed since.

using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

struct vec
{
    public float f1;
    public float f2;
    public float f3;
    public float f4;
}

class Program
{

    static unsafe float fmaTest1()
    {
        vec b;
        var a = Vector128.Create(1f);
        var c = Vector128.Create(2f);
        var d = Vector128.Create(3f);

        c = Fma.MultiplyAdd(a, Sse.LoadVector128((float*)&b), c);

        return Sse.Add(c, d).ToScalar();
    }

    static unsafe float fmaTest2()
    {
        vec b;
        var a = Vector128.Create(1f);
        var c = Vector128.Create(2f);
        var d = Vector128.Create(3f);

        c = Fma.MultiplyAdd(Sse.LoadVector128((float*)&b), a, c);

        return Sse.Add(c, d).ToScalar();
    }

    static void Main(string[] args)
    {
        Console.WriteLine(fmaTest1());
        Console.WriteLine(fmaTest2());
    }
}

fmaTest1

vmovss      xmm0,dword ptr[7FFCBB577598h]
vbroadcastss xmm0,xmm0  
vmovss      xmm1,dword ptr[7FFCBB57759Ch]
vbroadcastss xmm1,xmm1  
vmovss      xmm2,dword ptr[7FFCBB5775A0h]
vbroadcastss xmm2,xmm2  
lea         rax,[rsp+18h]  
vfmadd132ps xmm0,xmm1,xmmword ptr[rax]
vmovaps     xmm1,xmm0  
vaddps      xmm0,xmm1,xmm2  
vmovapd     xmmword ptr[rsp], xmm0
vmovss      xmm0,dword ptr[rsp]

fmaTest2

vmovss      xmm0,dword ptr[7FFCBB5794B8h]
vbroadcastss xmm0,xmm0  
vmovss      xmm1,dword ptr[7FFCBB5794BCh]
vbroadcastss xmm1,xmm1  
vmovss      xmm2,dword ptr[7FFCBB5794C0h]
vbroadcastss xmm2,xmm2  
lea         rax,[rsp+18h]  
vfmadd231ps xmm1,xmm0,xmmword ptr[rax]
vaddps      xmm0,xmm1,xmm2  
vmovapd     xmmword ptr[rsp], xmm0
vmovss      xmm0,dword ptr[rsp]

tannergooding · 2019-03-08T03:01:07Z

(Continued from https://github.com/dotnet/coreclr/issues/23115#issuecomment-470785697)

Right now we choose the instruction in lowering, based on which operand is contained. It would likely benefit if we could delay that choice until codegen.

In the register allocator, we are already not setting a tgtPref when both operands are commutative, but we do set a tgtPref in the case the operands aren't (for example, we set it to op1 for the 132 case: https://github.com/dotnet/coreclr/blob/master/src/jit/lsraxarch.cpp#L2610).

I think the problem is that we aren't really able to tell the register allocator that any of op1, op2, or op3 can be contained; and we would ideally just let the allocator decide what is best and then go off that. Part of this (being able to say more than one node can be regOptional) is tracked here: https://github.com/dotnet/coreclr/issues/6361

But I'm not sure we have something tracking the former (that is, being able to say any node can be contained, while ultimately deciding on just one of them based on the register allocator's choices).

Maybe @CarolEidt has some ideas here (but I would guess this is unlikely for 3.0).

saucecontrol · 2019-03-08T03:11:58Z

Ah, so it looks like this is a dupe of https://github.com/dotnet/coreclr/issues/20480. Not sure why I didn't see that one when I searched.

AndyAyersMS · 2019-03-12T17:24:00Z

Marking this as future...

saucecontrol · 2022-05-21T00:46:58Z

This was resolved by #58196

msftgits transferred this issue from dotnet/coreclr Jan 31, 2020

msftgits added this to the Future milestone Jan 31, 2020

saucecontrol mentioned this issue May 24, 2021: Jpeg encoding code optimization SixLabors/ImageSharp#1632 (Merged)

tannergooding mentioned this issue Jul 7, 2021: Improve preferencing and code generation for FMA #12984 (Closed)

weilinwa mentioned this issue Sep 3, 2021: Optimize FMA codegen base on the overwritten #58196 (Merged)

saucecontrol closed this as completed May 21, 2022

ghost locked as resolved and limited conversation to collaborators Jun 20, 2022
