Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HWIntrinsics: FMA suboptimal codegen #12212

Closed
saucecontrol opened this issue Mar 7, 2019 · 11 comments
Closed

HWIntrinsics: FMA suboptimal codegen #12212

saucecontrol opened this issue Mar 7, 2019 · 11 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions
Milestone

Comments

@saucecontrol
Copy link
Member

saucecontrol commented Mar 7, 2019

This code:

av1 = Fma.MultiplyAdd(iv1, Sse.LoadVector128(mp + 4), av1);

currently compiles to:

lea         rbx,[rdi+10h]  
vfmadd132ps xmm4,xmm1,xmmword ptr [rbx]  
vmovaps     xmm1,xmm4  

Assuming dotnet/coreclr#22944 would eliminate the extra lea there, I believe this should be generating:

vfmadd231ps xmm1,xmm4,xmmword ptr [rdi+10h]  

It looks like the logic in genFMAIntrinsic is missing the fact the two non-contained arguments could be swapped here.

cc @tannergooding

category:cq
theme:hardware-intrinsics
skill-level:expert
cost:medium

@tannergooding
Copy link
Member

It looks like the logic in genFMAIntrinsic is missing the fact the two non-contained arguments could be swapped here.

This shouldn't be the case. We have a check here: https://github.com/dotnet/coreclr/blob/master/src/jit/hwintrinsiccodegenxarch.cpp#L2354

Could you share a minimal repro I could look at?

@saucecontrol
Copy link
Member Author

saucecontrol commented Mar 8, 2019

Yeah, this one matches the second condition in the if above (op2 is contained) and it doesn't set the isCommutative value. I couldn't tell why the first and last branches set it but the middle two don't.

@tannergooding
Copy link
Member

I'm not sure why that is missing either, let me test out a fix real quick.

@saucecontrol
Copy link
Member Author

Cool, I can make up a repro if that would help. I can confirm that swapping the first two arguments in my example causes vfmadd231ps to be emitted.

@tannergooding
Copy link
Member

A repro that I can directly test would be helpful 😄. Otherwise, I am left just testing the existing tests and local code.

@tannergooding
Copy link
Member

I can confirm that swapping the first two arguments in my example causes vfmadd231ps to be emitted.

Ah, right. I remember why this is hard now.

The code right now is choosing:

lea         rbx,[rdi+10h]  
vfmadd132ps xmm4,xmm1,xmmword ptr [rbx]  
vmovaps     xmm1,xmm4  

After the lea is optimized this will be:

vfmadd132ps xmm4,xmm1,xmmword ptr [rdi+10h]  
vmovaps     xmm1,xmm4  

We start off with (x * y) + z which will choose 213 ((op2 * op1) + op3 where op1 = x; op2 = y; op3 = z). This form allows op1 and op2 to be swapped and op3 to be contained. In this case, op3 didn't need to be (or couldn't be) contained so we checked op2 instead and determined it could be contained, thus selecting 132 ((op1 * op3) + op2 where op1 = x; op2 = y; op3 = z) instead.

This means we now have (x * z) + y, which is wrong. We fix this up by swapping op2 and op3 https://github.com/dotnet/coreclr/blob/master/src/jit/hwintrinsiccodegenxarch.cpp#L2331, thus giving us (x * y) + z again ((op1 * op3) + op2 where op1 = x; op2 = z; op3 = y).

Given that op3 is the thing that is contained and and the commutative bits are the things being multiplied, you can't swap op1 and op3.

@saucecontrol
Copy link
Member Author

Here's a quick one. fmaTest1 shows the vfmadd132 variant with the movaps right after. fmaTest2 shows the desired vfmadd231. Couple of other codegen issues here as well... this is on preview3, so I'm not sure if they've been fixed since.

using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

struct vec
{
    public float f1;
    public float f2;
    public float f3;
    public float f4;
}

class Program
{

    static unsafe float fmaTest1()
    {
        vec b;
        var a = Vector128.Create(1f);
        var c = Vector128.Create(2f);
        var d = Vector128.Create(3f);

        c = Fma.MultiplyAdd(a, Sse.LoadVector128((float*)&b), c);

        return Sse.Add(c, d).ToScalar();
    }

    static unsafe float fmaTest2()
    {
        vec b;
        var a = Vector128.Create(1f);
        var c = Vector128.Create(2f);
        var d = Vector128.Create(3f);

        c = Fma.MultiplyAdd(Sse.LoadVector128((float*)&b), a, c);

        return Sse.Add(c, d).ToScalar();
    }

    static void Main(string[] args)
    {
        Console.WriteLine(fmaTest1());
        Console.WriteLine(fmaTest2());
    }
}

fmaTest1

vmovss      xmm0,dword ptr[7FFCBB577598h]
vbroadcastss xmm0,xmm0  
vmovss      xmm1,dword ptr[7FFCBB57759Ch]
vbroadcastss xmm1,xmm1  
vmovss      xmm2,dword ptr[7FFCBB5775A0h]
vbroadcastss xmm2,xmm2  
lea         rax,[rsp+18h]  
vfmadd132ps xmm0,xmm1,xmmword ptr[rax]
vmovaps     xmm1,xmm0  
vaddps      xmm0,xmm1,xmm2  
vmovapd     xmmword ptr[rsp], xmm0
vmovss      xmm0,dword ptr[rsp]

fmaTest2

vmovss      xmm0,dword ptr[7FFCBB5794B8h]
vbroadcastss xmm0,xmm0  
vmovss      xmm1,dword ptr[7FFCBB5794BCh]
vbroadcastss xmm1,xmm1  
vmovss      xmm2,dword ptr[7FFCBB5794C0h]
vbroadcastss xmm2,xmm2  
lea         rax,[rsp+18h]  
vfmadd231ps xmm1,xmm0,xmmword ptr[rax]
vaddps      xmm0,xmm1,xmm2  
vmovapd     xmmword ptr[rsp], xmm0
vmovss      xmm0,dword ptr[rsp]

@tannergooding
Copy link
Member

tannergooding commented Mar 8, 2019

(Continued from https://github.com/dotnet/coreclr/issues/23115#issuecomment-470785697)

Right now we are waiting on choosing the instruction until codegen and do that based on which operand is contained. I'm not sure we have an easier way to do that and I'm not sure that it would help even if we could. sorry, we choose this in lowering and it would likely benefit if we could delay it until codegen, is what I meant to say.

In the register allocator, we are already not setting a tgtPref when both operands are commutative, but we do set a tgtPref in the case the operands aren't (for example, we set it to op1 for the 132 case: https://github.com/dotnet/coreclr/blob/master/src/jit/lsraxarch.cpp#L2610).

I think the problem is that we aren't really able to tell the register allocator that any of op1, op2, or op3 can be contained; and we would ideally just let the allocator decide what is best and then go off that. Part of this (being able to say more than one node can be regOptional) is tracked here: https://github.com/dotnet/coreclr/issues/6361

But, I'm not sure we have something tracking the former (that is, being able to say any node can be contained; but we need to ultimately decide on just 1 of them based on the register allocators choices).

Maybe @CarolEidt has some ideas here (but I would guess this is unlikely for 3.0).

@saucecontrol
Copy link
Member Author

Ah, so it looks like this is a dupe of https://github.com/dotnet/coreclr/issues/20480. Not sure why I didn't see that one when I searched.

@AndyAyersMS
Copy link
Member

Marking this as future...

@saucecontrol
Copy link
Member Author

This was resolved by #58196

@ghost ghost locked as resolved and limited conversation to collaborators Jun 20, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions
Projects
None yet
Development

No branches or pull requests

4 participants