Codegen for Hardware Intrinsics arithmetic operations memory operands is poor #10923
@4creators, point 1 is not possible because `vsubpd xmm2, xmm1, xmmword ptr [r11]` is not equivalent to `vmovapd xmm2, xmmword ptr [r11]` followed by `vsubpd xmm2, xmm1, xmm2`. Under the VEX encoding, the majority of SIMD instructions accept unaligned memory operands rather than requiring aligned memory operands, so on machines that support VEX, folding the load into the `vsubpd` memory operand changes the alignment requirements of the access.
Taking the address of a generic struct (e.g. `Vector128<T>`) is not currently possible in C#.
I don't think adding additional overloads, such as the proposed pointer-taking ones, is the right fix here. Instead, having some documentation that covers things like best practices, easy-to-hit bugs, etc., and/or having an analyzer catch and report these differences is likely desirable.
Agree with @tannergooding. The codegen issue is caused by the IR shape rather than by the API design. The correct solution would be adding more sophisticated optimizations (like forward substitution).
@tannergooding Do you know what the prospects are for having dotnet/csharplang#1744 available when .NET Core 3.0 ships?
I'm going to tentatively label this as 3.0, but I'm not certain it will make it. |
* Handle addressing modes for HW intrinsics. Also, eliminate some places where the code size estimates were over-estimating. Contributes to #19550. Fixes #19521.
I checked in a test case for this in dotnet/coreclr#22944. Here are the relevant diffs:
Before:
After:
And for this:
Before:
After:
Finally, for this:
Before:
After:
@4creators - there's still room for improvement, but I think this is in a reasonable place, and most of the remaining opportunity lies in the more general code to recognize addressing modes.
@CarolEidt Thanks for the work on improving this issue. I am going to open a more general issue tracking the optimization of addressing modes where more than two arguments are used to calculate an address, which should be folded down to at most two.
The majority of hardware intrinsic arithmetic operations support using a memory address as one of their operands. This allows writing more efficient code that bypasses memory bottlenecks. Unfortunately, the JIT does not fold memory loads into one of the arithmetic operation's operands and instead generates code for separate loads or stores.
The following example illustrates the problem (the expression was written specifically to hint to the JIT that the second subtraction operand should not be loaded separately but folded into a memory operand):
This code has two problems: (i) inefficient memory address calculation, and (ii) memory operands not folded into one of the `vsubpd` operands. There are some possible optimizations:

1. Fold the `LoadVector128` load into a `vsubpd` memory operand.
2. Fold the address calculation for the `LoadVector128` operand into a memory address (see "`LoadVector128` for a field of a struct is 'poor'" #10915).

By applying these optimizations the above code should be roughly 2.5x faster.
There are several possible solutions to the memory operand handling.

The simplest one is to give developers control and provide overloads that allow passing memory pointers in addition to `Vector128<T>` or `Vector256<T>`. With `Vector128<T>` marked as a blittable type, it should be possible to have even better self-documenting overloads (provided C# supported pointers to generic blittable types).

A more complex option, and a very inefficient one from the developer's perspective, is to provide JIT support for folding loads and `Unsafe.Read<T>` reads into memory operands. Unfortunately, the burden of writing more code would make intrinsics even harder to use, and some developers would not even know how to use that support without digging into the docs.

IMHO the best solution would be to expand the API surface, as this would be a self-documenting enhancement. Furthermore, from my experience, managing data flow through memory to avoid the memory wall while using HW intrinsics is one of the most difficult parts of coding with them.
cc @AndyAyersMS @CarolEidt @eerhardt @fiigii @tannergooding