
Codegen for Hardware Intrinsics arithmetic operations memory operands is poor #10923

Closed
4creators opened this issue Aug 18, 2018 · 9 comments
Assignee: CarolEidt
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), enhancement (Product code improvement that does NOT require public API changes/additions), optimization
Milestone: 3.0

@4creators (Contributor)

The majority of hardware intrinsic arithmetic operations support using a memory address as one of their operands. This allows more efficient code that bypasses memory bottlenecks. Unfortunately, the JIT does not fold memory loads into one of the arithmetic operation's operands and instead generates code with separate loads or stores.

The following example illustrates the problem (the expression was specifically written to hint to the JIT that the second subtraction operand should not be loaded separately but folded into a memory operand):

StoreScalar(rf + 2, Subtract(iVec, LoadAlignedVector128(((double*)(items + j)) + 2))); 
;StoreScalar(rf + 2, Subtract(iVec, LoadAlignedVector128(((double*)(items + j)) + 2)));                  
00007ffd`19ab4180 4983c310        add     r11, 10h  
00007ffd`19ab4184 c4c1792813      vmovapd xmm2, xmmword ptr [r11]   
00007ffd`19ab4189 c4e1715cd2      vsubpd  xmm2, xmm1, xmm2  
00007ffd`19ab418e 4983c210        add     r10,10h   
00007ffd`19ab4192 c4c17b1112      vmovsd  qword ptr [r10], xmm2

This code has two problems: (i) inefficient memory address calculation, and (ii) the load is not folded into one of the vsubpd operands. Two optimizations are possible.

  1. Fold the load into the last vsubpd operand as a memory operand:
;StoreScalar(rf + 2, Subtract(iVec, LoadAlignedVector128(((double*)(items + j)) + 2)));
00007ffd`19ab4180 4983c310        add     r11, 10h    
00007ffd`19ab4189 c4e1715cd2      vsubpd  xmm2, xmm1, xmmword ptr [r11]  
00007ffd`19ab418e 4983c210        add     r10,10h   
00007ffd`19ab4192 c4c17b1112      vmovsd  qword ptr [r10], xmm2
  2. Improve the addressing of the operands (this particular problem is tracked by #10915, "Codegen for LoadVector128 for a field of a struct is 'poor'"):
;StoreScalar(rf + 2, Subtract(iVec, LoadAlignedVector128(((double*)(items + j)) + 2)));                    
00007ffd`19ab4189 c4e1715cd2      vsubpd  xmm2, xmm1, xmmword ptr [r11 + 10h]  
00007ffd`19ab4192 c4c17b1112      vmovsd  qword ptr [r10 + 10h], xmm2

By applying these optimizations, the above code should be roughly 2.5x faster.

There are several possible solutions to the memory-operand handling.

The simplest is to give developers control by providing overloads that accept memory pointers in addition to Vector128<T> or Vector256<T>:

Vector128<double> Sse2.Add(Vector128<double> left, Vector128<double> right);
Vector128<double> Sse2.Add(Vector128<double> left, double* right);

Vector128<double> Sse2.Divide(Vector128<double> left, Vector128<double> right);
Vector128<double> Sse2.Divide(Vector128<double> left, double* right);

Vector128<double> Sse2.Multiply(Vector128<double> left, Vector128<double> right);
Vector128<double> Sse2.Multiply(Vector128<double> left, double* right);

Vector128<double> Sse2.Subtract(Vector128<double> left, Vector128<double> right);
Vector128<double> Sse2.Subtract(Vector128<double> left, double* right);
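
For illustration, here is how such a pointer overload might be used next to the explicit-load form it would replace (the double* overload is only proposed, not part of the current API, and pItems stands for a hypothetical, suitably aligned double* into the data):

// Explicit load that the JIT must recognize and fold:
Vector128<double> viaExplicitLoad = Sse2.Subtract(iVec, Sse2.LoadAlignedVector128(pItems + 2));
// Proposed pointer overload (does not exist today); the memory operand is stated directly:
Vector128<double> viaPointerOverload = Sse2.Subtract(iVec, pItems + 2);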

With Vector128<T> treated as a blittable type, it should be possible to have even better self-documenting overloads (provided C# supported pointers to generic blittable types):

Vector128<double> Sse2.Add(Vector128<double> left, Vector128<double> right);
Vector128<double> Sse2.Add(Vector128<double> left, Vector128<double>* right);

A more complex option, and a much less convenient one from a developer's perspective, is to provide JIT support for folding explicit loads and Unsafe.Read<T> reads into memory operands. Unfortunately, the burden of writing more code would make intrinsics even harder to use, and some developers would not even know that support exists without digging into the docs.
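
For illustration only (not from the original comment), this is roughly the load pattern the JIT would have to recognize and fold into a memory operand under that option; pItems is a hypothetical double* into the data:

// Pattern the JIT would need to fold into the vsubpd memory operand:
Vector128<double> right = Unsafe.Read<Vector128<double>>(pItems + 2);
Vector128<double> result = Sse2.Subtract(iVec, right);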

IMHO the best solution would be to expand the API surface, as this would be a self-documenting enhancement. Furthermore, in my experience, managing data flow through memory to avoid the memory wall is one of the most difficult parts of coding with HW intrinsics.

cc @AndyAyersMS @CarolEidt @eerhardt @fiigii @tannergooding

@tannergooding (Member)

@4creators, point 1 is not possible because

vsubpd  xmm2, xmm1, xmmword ptr [r11]

is not equivalent to

vmovapd xmm2, xmmword ptr [r11]
vsubpd  xmm2, xmm1, xmm2

Under the VEX encoding, the majority of SIMD instructions accept unaligned memory operands rather than requiring aligned memory operands. Folding the LoadAlignedVector128 when using the VEX encoding means the code will no longer throw if an unaligned address is passed in, so your code would silently have different behavior.

For machines that support the VEX encoding (Avx.IsSupported), you must use LoadVector128 to get folding. Otherwise (for machines using the "legacy encoding"), you must use LoadAlignedVector128 to get folding. I believe there are a couple of exceptions to this rule, but I don't recall which they are off the top of my head.
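
As a minimal sketch of that guidance (not part of the original comment; it assumes p is a 16-byte-aligned double*, so both branches are valid):

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static unsafe Vector128<double> SubtractFromMemory(Vector128<double> left, double* p)
{
    if (Avx.IsSupported)
    {
        // VEX encoding: the JIT can fold the unaligned load into the memory operand.
        return Sse2.Subtract(left, Sse2.LoadVector128(p));
    }

    // Legacy encoding: only the aligned load is a candidate for folding.
    return Sse2.Subtract(left, Sse2.LoadAlignedVector128(p));
}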

@tannergooding (Member)

For Vector128 marked as blittable type it should be possible to have even better self documenting overloads (providing C# would support pointers to generic blittable types).

Taking the address of a generic struct (Vector128<double>*) is not possible today. The unmanaged constraint does exist, but it lets you take the address of T, not of the container. dotnet/csharplang#1744 tracks adding language support for taking the address of the generic struct itself (provided it doesn't contain any GC-tracked types).
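
A small illustration (not from the original comment) of the distinction as the language stood at the time:

// The unmanaged constraint allows a pointer to T itself:
unsafe void ReadElement<T>(T* element) where T : unmanaged { /* ... */ }

// But a pointer to the constructed generic struct (the container) was rejected by the
// compiler; dotnet/csharplang#1744 tracks lifting that restriction:
// unsafe void ReadVector(Vector128<double>* vector) { }   // did not compile at the time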

@tannergooding (Member)

I don't think adding additional overloads, such as Vector128<double> Sse2.Divide(Vector128<double> left, double* right);, is a good idea due to the different semantic behavior between machines that support the VEX encoding and machines that do not.

Instead, having documentation that covers things like best practices and easy-to-hit bugs, and/or having an analyzer catch and report these differences, is likely the more desirable approach.

@fiigii (Contributor) commented Aug 18, 2018

Agree with @tannergooding.

The codegen issue is caused by the IR rather than by the API design. The correct solution is to add more sophisticated optimizations (such as forward substitution).

@4creators (Contributor, Author)

Taking the address of a generic struct (Vector128*) is not possible today.

@tannergooding Do you know what the prospects are of having dotnet/csharplang#1744 available by the time .NET Core 3.0 ships?

@CarolEidt (Contributor)

I'm going to tentatively label this as 3.0, but I'm not certain it will make it.

CarolEidt self-assigned this Feb 4, 2019
CarolEidt referenced this issue in CarolEidt/coreclr (Feb 6 – Mar 20, 2019) and in dotnet/coreclr (Mar 26, 2019):

* Handle addressing modes for HW intrinsics

Also, eliminate some places where the code size estimates were over-estimating.

Contribute to #19550
Fix #19521
@CarolEidt (Contributor)

I checked in a test case for this in dotnet/coreclr#22944. Here are the relevant diffs:
For this:

Sse2.StoreScalar(p + alignmentOffset + 2, Sse2.Subtract(v, Sse2.LoadAlignedVector128(p + offset + alignmentOffset + 4)));
...
v2 = Sse41.LoadAlignedVector128NonTemporal((byte*)(p + alignmentOffset)).AsSingle();

Before:

       shl      rbx, 2
       lea      r14, [rdi+rbx]
       lea      rcx, [r14+8]
       movsxd   r15, r8d
       shl      r15, 2
       add      rdi, r15
       add      rdi, rbx
       lea      rax, [rdi+16]
       vmovaps  xmm0, xmmword ptr [rax]
       vmovupd  xmm1, xmmword ptr [rsi]
       vsubps   xmm0, xmm1, xmm0
       vmovss   xmmword ptr [rcx], xmm0
       vmovntdqa xmm6, xmmword ptr [r14]

After:

       lea      r14, [rdi+4*rbx]
       movsxd   rcx, r8d
       lea      r15, [rdi+4*rcx]
       vmovaps  xmm0, xmmword ptr [r15+4*rbx+16]
       vmovupd  xmm1, xmmword ptr [rsi]
       vsubps   xmm0, xmm1, xmm0
       vmovss   xmmword ptr [r14+8], xmm0
       vmovntdqa xmm6, xmmword ptr [rdi+4*rbx]

And for this:

Sse2.Store(p + alignmentOffset + 1, Sse2.Subtract(v, Sse2.LoadVector128(p + offset + alignmentOffset + 1)));
v2 = Sse2.LoadVector128(p + alignmentOffset + 1);

Before:

       add      r14, 4
       add      rdi, 4
       vmovupd  xmm0, xmmword ptr [rsi]
       vsubps   xmm0, xmm0, xmmword ptr [rdi]
       vmovups  xmmword ptr [r14], xmm0
       vmovups  xmm6, xmmword ptr [r14]

After:

       vmovupd  xmm0, xmmword ptr [rsi]
       vsubps   xmm0, xmm0, xmmword ptr [r15+4*rbx+4]
       vmovups  xmmword ptr [r14+4], xmm0
       vmovups  xmm6, xmmword ptr [rdi+4*rbx+4]

Finally, for this:

Avx.Store(p + 1, Avx.Subtract(v, Avx.LoadVector256(p + offset + 1)));
Vector256<float> v2 = Avx.LoadVector256(p + 1);

Before:

       lea      rdx, [rcx+4*rdx+4]
       vmovupd  ymm0, ymmword ptr[rsi]
       vsubps   ymm0, ymm0, ymmword ptr[rdx]
       add      rcx, 4
       vmovups  ymmword ptr[rcx], ymm0
       vmovups  ymm6, ymmword ptr[rcx]

After:

       vmovupd  ymm0, ymmword ptr[rsi]
       vsubps   ymm0, ymm0, ymmword ptr[rcx+4*rdx+4]
       vmovups  ymmword ptr[rcx+4], ymm0
       vmovups  ymm6, ymmword ptr[rcx+4]

@CarolEidt (Contributor)

@4creators - there's still room for improvement, but I think this is in a reasonable place, and most of the remaining opportunity lies in the more general code that recognizes addressing modes.
I'm going to close this as fixed with dotnet/coreclr#22944, but feel free to open another issue with specific cases you'd still like to see addressed.

@4creators (Contributor, Author)

@CarolEidt Thanks for the work on improving this. I am going to open a more general issue tracking the optimization of addressing modes where more than two arguments are used to calculate the address, which should be folded down to at most two.

msftgits transferred this issue from dotnet/coreclr on Jan 31, 2020
msftgits added this to the 3.0 milestone on Jan 31, 2020
ghost locked as resolved and limited conversation to collaborators on Dec 15, 2020