
Codegen for Hardware Intrinsics arithmetic operations memory operands is poor #10923

Closed
4creators opened this issue Aug 18, 2018 · 9 comments
Assignee: CarolEidt
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), enhancement (Product code improvement that does NOT require public API changes/additions), optimization
Milestone: 3.0

@4creators (Contributor)

The majority of hardware intrinsic arithmetic operations support using a memory address as one of their operands. This allows more efficient code that bypasses memory bottlenecks. Unfortunately, the JIT does not fold memory loads into one of the arithmetic operation's operands and instead generates code with separate loads or stores.

The following example illustrates the problem (the expression was specifically written to hint to the JIT that the second subtraction operand should not be loaded separately but folded into a memory operand):

StoreScalar(rf + 2, Subtract(iVec, LoadAlignedVector128(((double*)(items + j)) + 2))); 
;StoreScalar(rf + 2, Subtract(iVec, LoadAlignedVector128(((double*)(items + j)) + 2)));                  
00007ffd`19ab4180 4983c310        add     r11, 10h  
00007ffd`19ab4184 c4c1792813      vmovapd xmm2, xmmword ptr [r11]   
00007ffd`19ab4189 c4e1715cd2      vsubpd  xmm2, xmm1, xmm2  
00007ffd`19ab418e 4983c210        add     r10,10h   
00007ffd`19ab4192 c4c17b1112      vmovsd  qword ptr [r10], xmm2

This code has two problems: (i) inefficient memory address calculation, and (ii) the load is not folded into one of the vsubpd operands. Two optimizations are possible.

  1. Fold the load into the last vsubpd operand as a memory operand:
;StoreScalar(rf + 2, Subtract(iVec, LoadAlignedVector128(((double*)(items + j)) + 2)));
00007ffd`19ab4180 4983c310        add     r11, 10h    
00007ffd`19ab4189 c4e1715cd2      vsubpd  xmm2, xmm1, xmmword ptr [r11]  
00007ffd`19ab418e 4983c210        add     r10,10h   
00007ffd`19ab4192 c4c17b1112      vmovsd  qword ptr [r10], xmm2
  2. Improve the addressing of the operands (this particular problem is tracked by #10915, "Codegen for LoadVector128 for a field of a struct is 'poor'"):
;StoreScalar(rf + 2, Subtract(iVec, LoadAlignedVector128(((double*)(items + j)) + 2)));                    
00007ffd`19ab4189 c4e1715cd2      vsubpd  xmm2, xmm1, xmmword ptr [r11 + 10h]  
00007ffd`19ab4192 c4c17b1112      vmovsd  qword ptr [r10 + 10h], xmm2

By applying these optimizations, the above code should be roughly 2.5x faster.

There are several possible solutions to the memory-operand handling.

The simplest is to give developers control by providing overloads that accept memory pointers in addition to Vector128<T> or Vector256<T>:

Vector128<double> Sse2.Add(Vector128<double> left, Vector128<double> right);
Vector128<double> Sse2.Add(Vector128<double> left, double* right);

Vector128<double> Sse2.Divide(Vector128<double> left, Vector128<double> right);
Vector128<double> Sse2.Divide(Vector128<double> left, double* right);

Vector128<double> Sse2.Multiply(Vector128<double> left, Vector128<double> right);
Vector128<double> Sse2.Multiply(Vector128<double> left, double* right);

Vector128<double> Sse2.Subtract(Vector128<double> left, Vector128<double> right);
Vector128<double> Sse2.Subtract(Vector128<double> left, double* right);
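
For illustration, here is how such a pointer overload might be used next to the explicit-load form it would replace (the double* overload is only proposed, not part of the current API, and pItems stands for a hypothetical, suitably aligned double* into the data):

// Explicit load that the JIT must recognize and fold:
Vector128<double> viaExplicitLoad = Sse2.Subtract(iVec, Sse2.LoadAlignedVector128(pItems + 2));
// Proposed pointer overload (does not exist today); the memory operand is stated directly:
Vector128<double> viaPointerOverload = Sse2.Subtract(iVec, pItems + 2);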

With Vector128<T> treated as a blittable type, it should be possible to have even better self-documenting overloads (provided C# supported pointers to generic blittable types):

Vector128<double> Sse2.Add(Vector128<double> left, Vector128<double> right);
Vector128<double> Sse2.Add(Vector128<double> left, Vector128<double>* right);

A more complex option, and a much less convenient one from a developer's perspective, is to provide JIT support for folding explicit loads and Unsafe.Read<T> reads into memory operands. Unfortunately, the burden of writing more code would make intrinsics even harder to use, and some developers would not even know that support exists without digging into the docs.
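
For illustration only (not from the original comment), this is roughly the load pattern the JIT would have to recognize and fold into a memory operand under that option; pItems is a hypothetical double* into the data:

// Pattern the JIT would need to fold into the vsubpd memory operand:
Vector128<double> right = Unsafe.Read<Vector128<double>>(pItems + 2);
Vector128<double> result = Sse2.Subtract(iVec, right);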

IMHO the best solution would be to expand the API surface, as this would be a self-documenting enhancement. Furthermore, in my experience, managing data flow through memory to avoid the memory wall is one of the most difficult parts of coding with HW intrinsics.

cc @AndyAyersMS @CarolEidt @eerhardt @fiigii @tannergooding

@tannergooding (Member)

@4creators, point 1 is not possible because

vsubpd  xmm2, xmm1, xmmword ptr [r11]

is not equivalent to

vmovapd xmm2, xmmword ptr [r11]
vsubpd  xmm2, xmm1, xmm2

Under the VEX encoding, the majority of SIMD instructions accept unaligned memory operands rather than requiring aligned memory operands. Folding the LoadAlignedVector128 when using the VEX encoding means the code will no longer throw if an unaligned address is passed in, so your code would silently have different behavior.

For machines that support the VEX encoding (Avx.IsSupported), you must use LoadVector128 to get folding. Otherwise (for machines using the "legacy encoding"), you must use LoadAlignedVector128 to get folding. I believe there are a couple of exceptions to this rule, but I don't recall which they are off the top of my head.
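
As a minimal sketch of that guidance (not part of the original comment; it assumes p is a 16-byte-aligned double*, so both branches are valid):

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static unsafe Vector128<double> SubtractFromMemory(Vector128<double> left, double* p)
{
    if (Avx.IsSupported)
    {
        // VEX encoding: the JIT can fold the unaligned load into the memory operand.
        return Sse2.Subtract(left, Sse2.LoadVector128(p));
    }

    // Legacy encoding: only the aligned load is a candidate for folding.
    return Sse2.Subtract(left, Sse2.LoadAlignedVector128(p));
}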

@tannergooding (Member)

For Vector128 marked as blittable type it should be possible to have even better self documenting overloads (providing C# would support pointers to generic blittable types).

Taking the address of a generic struct (Vector128<double>*) is not possible today. The unmanaged constraint does exist, but it lets you take the address of T, not of the container. dotnet/csharplang#1744 tracks adding language support for taking the address of the generic struct itself (provided it doesn't contain any GC-tracked types).
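
A small illustration (not from the original comment) of the distinction as the language stood at the time:

// The unmanaged constraint allows a pointer to T itself:
unsafe void ReadElement<T>(T* element) where T : unmanaged { /* ... */ }

// But a pointer to the constructed generic struct (the container) was rejected by the
// compiler; dotnet/csharplang#1744 tracks lifting that restriction:
// unsafe void ReadVector(Vector128<double>* vector) { }   // did not compile at the time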

@tannergooding (Member)

I don't think adding additional overloads, such as Vector128<double> Sse2.Divide(Vector128<double> left, double* right);, is a good idea due to the different semantic behavior between machines that support the VEX encoding and machines that do not.

Instead, having documentation that covers things like best practices and easy-to-hit bugs, and/or having an analyzer catch and report these differences, is likely the more desirable approach.

@fiigii (Contributor) commented Aug 18, 2018

Agree with @tannergooding.

The codegen issue is caused by the IR rather than by the API design. The correct solution is to add more sophisticated optimizations (such as forward substitution).

@4creators (Contributor, Author)

Taking the address of a generic struct (Vector128*) is not possible today.

@tannergooding Do you know what the prospects are of having dotnet/csharplang#1744 available by the time .NET Core 3.0 ships?

@CarolEidt (Contributor)

I'm going to tentatively label this as 3.0, but I'm not certain it will make it.

CarolEidt self-assigned this Feb 4, 2019
CarolEidt referenced this issue in CarolEidt/coreclr (Feb 6 – Mar 20, 2019) and in dotnet/coreclr (Mar 26, 2019):

* Handle addressing modes for HW intrinsics

Also, eliminate some places where the code size estimates were over-estimating.

Contribute to #19550
Fix #19521
@CarolEidt (Contributor)

I checked in a test case for this in dotnet/coreclr#22944. Here are the relevant diffs:
For this:

Sse2.StoreScalar(p + alignmentOffset + 2, Sse2.Subtract(v, Sse2.LoadAlignedVector128(p + offset + alignmentOffset + 4)));
...
v2 = Sse41.LoadAlignedVector128NonTemporal((byte*)(p + alignmentOffset)).AsSingle();

Before:

       shl      rbx, 2
       lea      r14, [rdi+rbx]
       lea      rcx, [r14+8]
       movsxd   r15, r8d
       shl      r15, 2
       add      rdi, r15
       add      rdi, rbx
       lea      rax, [rdi+16]
       vmovaps  xmm0, xmmword ptr [rax]
       vmovupd  xmm1, xmmword ptr [rsi]
       vsubps   xmm0, xmm1, xmm0
       vmovss   xmmword ptr [rcx], xmm0
       vmovntdqa xmm6, xmmword ptr [r14]

After:

       lea      r14, [rdi+4*rbx]
       movsxd   rcx, r8d
       lea      r15, [rdi+4*rcx]
       vmovaps  xmm0, xmmword ptr [r15+4*rbx+16]
       vmovupd  xmm1, xmmword ptr [rsi]
       vsubps   xmm0, xmm1, xmm0
       vmovss   xmmword ptr [r14+8], xmm0
       vmovntdqa xmm6, xmmword ptr [rdi+4*rbx]

And for this:

Sse2.Store(p + alignmentOffset + 1, Sse2.Subtract(v, Sse2.LoadVector128(p + offset + alignmentOffset + 1)));
v2 = Sse2.LoadVector128(p + alignmentOffset + 1);

Before:

       add      r14, 4
       add      rdi, 4
       vmovupd  xmm0, xmmword ptr [rsi]
       vsubps   xmm0, xmm0, xmmword ptr [rdi]
       vmovups  xmmword ptr [r14], xmm0
       vmovups  xmm6, xmmword ptr [r14]

After:

       vmovupd  xmm0, xmmword ptr [rsi]
       vsubps   xmm0, xmm0, xmmword ptr [r15+4*rbx+4]
       vmovups  xmmword ptr [r14+4], xmm0
       vmovups  xmm6, xmmword ptr [rdi+4*rbx+4]

Finally, for this:

Avx.Store(p + 1, Avx.Subtract(v, Avx.LoadVector256(p + offset + 1)));
Vector256<float> v2 = Avx.LoadVector256(p + 1);

Before:

       lea      rdx, [rcx+4*rdx+4]
       vmovupd  ymm0, ymmword ptr[rsi]
       vsubps   ymm0, ymm0, ymmword ptr[rdx]
       add      rcx, 4
       vmovups  ymmword ptr[rcx], ymm0
       vmovups  ymm6, ymmword ptr[rcx]

After:

       vmovupd  ymm0, ymmword ptr[rsi]
       vsubps   ymm0, ymm0, ymmword ptr[rcx+4*rdx+4]
       vmovups  ymmword ptr[rcx+4], ymm0
       vmovups  ymm6, ymmword ptr[rcx+4]

@CarolEidt (Contributor)

@4creators - there's still room for improvement, but I think this is in a reasonable place, and most of the remaining opportunity lies in the more general code that recognizes addressing modes.
I'm going to close this as fixed with dotnet/coreclr#22944, but feel free to open another issue with specific cases you'd still like to see addressed.

@4creators (Contributor, Author)

@CarolEidt Thanks for the work on improving this. I am going to open a more general issue tracking the optimization of addressing modes where more than two arguments are used to calculate the address, which should be folded down to at most two.

msftgits transferred this issue from dotnet/coreclr on Jan 31, 2020
msftgits added this to the 3.0 milestone on Jan 31, 2020
ghost locked as resolved and limited conversation to collaborators on Dec 15, 2020