HWIntrinsics: Load folding to immediate address? #12308
Comments
dotnet/coreclr#22944 may help.
@CarolEidt @tannergooding @dotnet/jit-contrib
@Zhentar llvm uses special
@Zhentar - I believe this may have been addressed with dotnet/coreclr#22944. Can you verify?
I'm not entirely clear on what effect was expected from that PR, but it doesn't appear to have affected this scenario at all. Testing without the
vmovupd ymm3,[rsp+60]
vmovaps ymm2,ymm3
vmovupd ymm5,[r15+360]
vpaddd ymm2,ymm5,ymm2
vpshufd ymm3,ymm2,31
vpmuludq ymm2,ymm2,ymm3
vpaddq ymm1,ymm5,ymm1
vpaddq ymm1,ymm2,ymm1
With the
lea rsi,[r10+68]
vmovupd ymm4,[r11+340]
vpaddd ymm5,ymm4,[rsi]
vpshufd ymm6,ymm5,31
vpmuludq ymm5,ymm5,ymm6
vpaddq ymm0,ymm4,ymm0
vpaddq ymm0,ymm5,ymm0
On the bright side, I do see it's doing a much better job of leveraging available registers to save off some of the reads or lea calculations outside of the loop.
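For anyone reading the listings above without the source at hand, here is a rough scalar C model of what the vpaddd / vpshufd / vpmuludq / vpaddq sequence appears to compute per 64-bit lane. It is only a sketch: the function names stripe_lane and stripe_piece are made up for illustration (they are not from xxHash3.NET), and it assumes the shuffle immediate printed as 31 is hexadecimal (0x31), which moves the high 32 bits of each 64-bit lane into the low position.

```c
#include <stdint.h>

/* Hypothetical scalar model of ONE 64-bit lane of the instruction sequence.
   Assumption: the vpshufd immediate "31" is 0x31. */
static uint64_t stripe_lane(uint64_t acc, uint64_t data, uint64_t key)
{
    /* vpaddd: the two 32-bit halves are added independently,
       so a carry out of the low half does not reach the high half. */
    uint32_t lo = (uint32_t)data + (uint32_t)key;
    uint32_t hi = (uint32_t)(data >> 32) + (uint32_t)(key >> 32);

    /* vpshufd 0x31 brings the high half down to the low position;
       vpmuludq then multiplies the low 32 bits of both operands
       into a full 64-bit product. */
    uint64_t product = (uint64_t)lo * (uint64_t)hi;

    /* Two vpaddq: accumulate the raw data, then the product. */
    acc += data;
    acc += product;
    return acc;
}

/* A 256-bit ymm register holds four such lanes. */
static void stripe_piece(uint64_t acc[4], const uint64_t data[4],
                         const uint64_t key[4])
{
    for (int i = 0; i < 4; i++) {
        acc[i] = stripe_lane(acc[i], data[i], key[i]);
    }
}
```

The only difference between the two listings is where the key operand of the vpaddd comes from (a register loaded earlier versus a memory operand addressed through rsi); the arithmetic modeled here is the same in both.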
Just before forward sub we have a tree shape like the following:
This gets fixed up in global morph to be more like:
It persists this way all the way through LSRA, just getting a few pieces of additional metadata:
In the above, we have this case of a store to a local from an indirection, followed by an immediate use which is also notably the last use. In such a scenario, we should really support taking that indirection directly, but we miss out on it today. We miss out on the containment check because it's a
I think this might be an opportunity that forward sub is missing out on and it namely looks to be because the
Can we minimally have forward sub look past other
I've been taking a go at porting the XXH3 hash algorithm including SSE & AVX versions. My current AVX2 code for the hot loop is here: https://github.com/Zhentar/xxHash3.NET/blob/ee6a626e87f2a829ec786690d4dfa560d876dda7/xxHash3/xxHash3_AVX2.cs#L103
So far I've gotten it up to 36 GB/s, against the clang-compiled native version's ~40 GB/s.
One sub-piece as compiled by clang looks like this:
While my version looks like this:
Or, if I arrange the code such that folding kicks in (uncommenting the in modifier for the ProcessStripePiece_AVX2 key argument), this:
However, the folded version performs worse, because the lea competes with the add/shuf/mul instructions for an integer ALU port instead of a load port.
Is there any way to get an immediate address folded into the vpaddd instead of a displacement calculated at execution time? I've tried a static readonly field, but that still resulted in an lea displacement calculation.
category:cq
theme:hardware-intrinsics
skill-level:expert
cost:medium