
HWIntrinsics: Load folding to immediate address? #12308

Open
Zhentar opened this issue Mar 21, 2019 · 6 comments
Labels
area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), optimization, tenet-performance (Performance related issue)
Milestone
Future

Comments

@Zhentar

Zhentar commented Mar 21, 2019

I've had a go at porting the XXH3 hash algorithm, including the SSE & AVX versions. My current AVX2 code for the hot loop is here: https://github.com/Zhentar/xxHash3.NET/blob/ee6a626e87f2a829ec786690d4dfa560d876dda7/xxHash3/xxHash3_AVX2.cs#L103

So far I've gotten it up to 36 GB/s, against the clang-compiled native version's ~40 GB/s.

One sub-piece, as compiled by clang, looks like this:

vmovdqu ymm3, ymmword ptr [rax-360h]
vpaddd  ymm4, ymm3, cs:ymmword_40BDC0
vpshufd ymm6, ymm4, 31h
vpmuludq ymm4, ymm6, ymm4
vpaddq  ymm3, ymm5, ymm3
vpaddq  ymm0, ymm0, ymm3
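Semantically, that sequence computes the following per 64-bit accumulator lane (a hypothetical scalar sketch of my understanding, not the upstream XXH3 source; the function name is mine):

```c
#include <stdint.h>

/* Hypothetical scalar model of one 64-bit lane of the sequence above:
   vpaddd adds the key per 32-bit lane; vpshufd 0x31 moves each high
   dword alongside its low dword so vpmuludq can do a 32x32->64
   multiply of the two halves; the vpaddq pair then folds the data and
   the product into the accumulator. */
uint64_t accumulate_lane(uint64_t acc, uint64_t data, uint64_t key)
{
    uint32_t keyed_lo = (uint32_t)data + (uint32_t)key;                 /* vpaddd */
    uint32_t keyed_hi = (uint32_t)(data >> 32) + (uint32_t)(key >> 32);
    uint64_t prod = (uint64_t)keyed_lo * keyed_hi;   /* vpshufd + vpmuludq */
    return acc + data + prod;                        /* the two vpaddq */
}
```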

While my version looks like this:

vmovupd ymm8,ymmword ptr [r10+88h]
vmovupd ymm9,ymmword ptr [r11+360h]
vpaddd  ymm8,ymm9,ymm8
vpshufd ymm10,ymm8,31h
vpmuludq ymm8,ymm8,ymm10
vpaddq  ymm1,ymm9,ymm1
vpaddq  ymm1,ymm8,ymm1

Or, if I arrange the code such that load folding kicks in (uncommenting the in modifier on the ProcessStripePiece_AVX2 key argument), it looks like this:

lea     r14,[rax+20h]
vmovupd ymm4,ymmword ptr [rbp+100h]
vpaddd  ymm5,ymm4,ymmword ptr [r14]
vpshufd ymm6,ymm5,31h
vpmuludq ymm5,ymm5,ymm6
vpaddq  ymm0,ymm4,ymm0
vpaddq  ymm0,ymm5,ymm0

However, the folded version performs worse: the lea competes with the add/shuf/mul instructions for an integer ALU port, whereas an address folded into the load operand would be resolved on a load port.

Is there any way to get a constant (immediate) address folded into the vpaddd, instead of a displacement calculated at execution time? I've tried a static readonly field, but that still resulted in an lea displacement calculation.
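For comparison, a hypothetical C sketch of why clang gets this for free: a static const array has a link-time-constant address, so a native compiler can encode the key load directly as a RIP-relative memory operand (that's what cs:ymmword_40BDC0 is in the clang listing above), while a key reached through a pointer needs a register base, and materializing a displacement off that base is where an lea can come from. The names and values here are illustrative only:

```c
#include <stdint.h>

/* static_key's address is a link-time constant, so a native compiler
   can fold static_key[i] into the consuming instruction as a
   RIP-relative memory operand with no address arithmetic. */
static const uint32_t static_key[8] = {1, 2, 3, 4, 5, 6, 7, 8};

uint32_t add_from_static(uint32_t x, int i)
{
    return x + static_key[i];   /* load folds as [rip+disp] */
}

uint32_t add_from_pointer(uint32_t x, const uint32_t *key, int i)
{
    return x + key[i];          /* needs a register base: [base+index*4] */
}
```

In C#, a static readonly field is presumably reached through a statics base pointer loaded at runtime, which would explain why it still produced an address calculation rather than a RIP-relative operand.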

category:cq
theme:hardware-intrinsics
skill-level:expert
cost:medium

@fiigii
Contributor

fiigii commented Mar 21, 2019

dotnet/coreclr#22944 may help.

@RussKeldorph
Contributor

@CarolEidt @tannergooding @dotnet/jit-contrib

@4creators
Contributor

@Zhentar LLVM uses a special lea optimization pass that works over the whole function frame. This allows much better lea addressing optimization across the frame, since it can eliminate unnecessary offset calculations. RyuJIT does not have such functionality at this time. Nevertheless, as @fiigii pointed out, dotnet/coreclr#22944 may help to some extent.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@CarolEidt
Contributor

@Zhentar - I believe this may have been addressed with dotnet/coreclr#22944. Can you verify?

@Zhentar
Author

Zhentar commented Nov 8, 2020

I'm not entirely clear on what effect was expected from that PR, but it doesn't appear to have affected this scenario at all. Testing without the in modifier, I'm unfortunately getting worse codegen than when I filed this issue; I get the same code plus a completely unnecessary extra mov:

       vmovupd   ymm3,[rsp+60]
       vmovaps   ymm2,ymm3
       vmovupd   ymm5,[r15+360]
       vpaddd    ymm2,ymm5,ymm2
       vpshufd   ymm3,ymm2,31
       vpmuludq  ymm2,ymm2,ymm3
       vpaddq    ymm1,ymm5,ymm1
       vpaddq    ymm1,ymm2,ymm1

With the in modifier, the code is basically identical to before:

       lea       rsi,[r10+68]
       vmovupd   ymm4,[r11+340]
       vpaddd    ymm5,ymm4,[rsi]
       vpshufd   ymm6,ymm5,31
       vpmuludq  ymm5,ymm5,ymm6
       vpaddq    ymm0,ymm4,ymm0
       vpaddq    ymm0,ymm5,ymm0

On the bright side, I do see it doing a much better job of leveraging the available registers to hoist some of the reads and lea calculations out of the loop.

@tannergooding
Member

Just before forward sub we have a tree shape like the following:

***** BB01 [0000]
STMT00006 ( 0x000[E-] ... ??? )
               [000033] DA-XG------                         *  STORE_LCL_VAR simd32<System.Runtime.Intrinsics.Vector256`1> V06 tmp2         
               [000003] n--XG------                         \--*  IND       simd32
               [000002] ---X-------                            \--*  FIELD_ADDR byref  xxHash3.xxHash3+Vec256Pair`1[uint]:A
               [000001] ---X-------                               \--*  FIELD_ADDR byref  xxHash3.xxHash3+Keys_AVX2:K00
               [000000] -----------                                  \--*  LCL_VAR   byref  V03 arg2          (last use)

***** BB01 [0000]
STMT00007 ( 0x000[E-] ... ??? )
               [000034] DA-XG------                         *  STORE_LCL_VAR simd32<System.Runtime.Intrinsics.Vector256`1> V05 tmp1         
               [000008] n--XG------                         \--*  IND       simd32
               [000007] ---X-------                            \--*  FIELD_ADDR byref  xxHash3.xxHash3+Vec256Pair`1[uint]:A
               [000006] ---X-------                               \--*  FIELD_ADDR byref  xxHash3.xxHash3+StripeBlock_AVX2:S00
               [000005] -----------                                  \--*  LCL_VAR   byref  V01 arg0          (last use)

***** BB01 [0000]
STMT00003 ( INL01 @ 0x000[E-] ... ??? ) <- INLRT @ 0x000[E-]
               [000018] DA---------                         *  STORE_LCL_VAR simd32 V07 tmp3         
               [000017] -----------                         \--*  HWINTRINSIC simd32 uint Add
               [000015] -----------                            +--*  LCL_VAR   simd32<System.Runtime.Intrinsics.Vector256`1> V05 tmp1         
               [000016] -----------                            \--*  LCL_VAR   simd32<System.Runtime.Intrinsics.Vector256`1> V06 tmp2          (last use)

This gets fixed up in global morph to be more like:

***** BB01 [0000]
STMT00006 ( 0x000[E-] ... ??? )
               [000033] DA-XG+-----                         *  STORE_LCL_VAR simd32<System.Runtime.Intrinsics.Vector256`1> V06 tmp2         
               [000003] ---XG+-----                         \--*  IND       simd32
               [000000] -----+-----                            \--*  LCL_VAR   byref  V03 arg2          (last use)

***** BB01 [0000]
STMT00007 ( 0x000[E-] ... ??? )
               [000034] DA-XG+-----                         *  STORE_LCL_VAR simd32<System.Runtime.Intrinsics.Vector256`1> V05 tmp1         
               [000008] ---XG+-----                         \--*  IND       simd32
               [000005] -----+-----                            \--*  LCL_VAR   byref  V01 arg0          (last use)

***** BB01 [0000]
STMT00003 ( INL01 @ 0x000[E-] ... ??? ) <- INLRT @ 0x000[E-]
               [000018] DA---+-----                         *  STORE_LCL_VAR simd32 V07 tmp3         
               [000017] -----+-----                         \--*  HWINTRINSIC simd32 uint Add
               [000015] -----+-----                            +--*  LCL_VAR   simd32<System.Runtime.Intrinsics.Vector256`1> V05 tmp1         
               [000016] -----+-----                            \--*  LCL_VAR   simd32<System.Runtime.Intrinsics.Vector256`1> V06 tmp2          (last use)

It persists this way all the way through LSRA, just getting a few pieces of additional metadata:

***** BB01 [0000]
STMT00006 ( 0x000[E-] ... ??? )
N003 (  3,  3) [000033] DA-XG------                         *  STORE_LCL_VAR simd32<System.Runtime.Intrinsics.Vector256`1> V06 tmp2         d:1 $1c1
N002 (  3,  2) [000003] ---XG------                         \--*  IND       simd32 <l:$200, c:$201>
N001 (  1,  1) [000000] -----------                            \--*  LCL_VAR   byref  V03 arg2         u:1 (last use) $83

***** BB01 [0000]
STMT00007 ( 0x000[E-] ... ??? )
N003 (  3,  3) [000034] DA-XG------                         *  STORE_LCL_VAR simd32<System.Runtime.Intrinsics.Vector256`1> V05 tmp1         d:1 $1c3
N002 (  3,  2) [000008] ---XG------                         \--*  IND       simd32 <l:$202, c:$203>
N001 (  1,  1) [000005] -----------                            \--*  LCL_VAR   byref  V01 arg0         u:1 (last use) $81

***** BB01 [0000]
STMT00003 ( INL01 @ 0x000[E-] ... ??? ) <- INLRT @ 0x000[E-]
N004 (  3,  3) [000018] DA---------                         *  STORE_LCL_VAR simd32 V07 tmp3         d:1 $VN.Void
N003 (  3,  3) [000017] -----------                         \--*  HWINTRINSIC simd32 uint Add <l:$102, c:$103>
N001 (  1,  1) [000015] -----------                            +--*  LCL_VAR   simd32<System.Runtime.Intrinsics.Vector256`1> V05 tmp1         u:1 <l:$101, c:$141>
N002 (  1,  1) [000016] -----------                            \--*  LCL_VAR   simd32<System.Runtime.Intrinsics.Vector256`1> V06 tmp2         u:1 (last use) <l:$100, c:$140>

In the above, we have a store to a local from an indirection, followed by an immediate use of that local which is also, notably, its last use.

In such a scenario, we should really be able to consume that indirection directly, but we miss out on it today. The containment check fails because the operand is a LCL_VAR whose lvDoNotEnregister is false. We do then mark it regOptional, but it ultimately doesn't get picked.

I think this is an opportunity that forward sub is missing, seemingly because the STORE_LCL_VAR for tmp1 is in the way. If we explicitly change the parameter order to ensure that key is the final parameter, then it works as expected.

Can we minimally have forward sub look past other STORE_LCL_VARs that aren't side-effecting? CC @AndyAyersMS
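In C-like terms (an illustrative sketch of the statements above, not RyuJIT code; scalars stand in for the simd32 values), the forward substitution being asked for would be:

```c
#include <stdint.h>

/* Before forward sub (mirrors STMT00006/STMT00007/STMT00003 above):
   tmp2's definition is an indirection, and its only use is the add,
   but the non-side-effecting store of tmp1 sits between them. */
static uint64_t add_before(const uint64_t *arg0, const uint64_t *arg2)
{
    uint64_t tmp2 = *arg2;   /* STMT00006: IND of the key */
    uint64_t tmp1 = *arg0;   /* STMT00007: IND of the stripe */
    return tmp1 + tmp2;      /* STMT00003: immediate last use of tmp2 */
}

/* Desired after forward sub: the indirection is substituted past the
   unrelated store, so codegen can contain it as a memory operand
   (e.g. vpaddd ymmX, ymmY, [arg2] in the vector case). */
static uint64_t add_after(const uint64_t *arg0, const uint64_t *arg2)
{
    uint64_t tmp1 = *arg0;
    return tmp1 + *arg2;     /* indirection forwarded into the add */
}
```

The two forms compute the same value; the reordering is only legal because the intervening store has no side effects, which is exactly the condition forward sub would need to check.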

7 participants