JIT: Improve codegen for xarch vector byte multiply (#126348)
saucecontrol wants to merge 3 commits into dotnet:main
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Pull request overview
Improves xarch JIT codegen for SIMD byte multiplication by using a more efficient two-multiply “odd/even byte” strategy when widening to the next vector size isn’t possible, reducing unnecessary widen/narrow work compared to the prior fallback.
Changes:
- Adds a fast-path that widens to the next vector size (AVX2 for SIMD16, AVX512 for SIMD32) to perform a single multiply and then narrow.
- Replaces the previous fallback (split/widen/mul/narrow twice) with an odd/even byte approach that uses two 16-bit multiplies and recombines bytes with masks/shifts.
The current codegen has an invalid operand for the instruction? Should be [...], and there is a similar issue for [...].
Ha, I didn't notice that. Looks like that's just a bug in the JIT disasm. Running the code bytes through another disassembler shows it correctly.
cc @dotnet/jit-contrib
Resolves #109775
In cases where we can widen to the next vector size up and multiply once, the current codegen is already optimal. When that's not possible, the current codegen falls back to a version that splits into two vectors and runs the same basic algorithm, which is not optimal.
This implements the suggestion made by @MineCake147E on #109775, which still requires two multiplications, but avoids double widening and narrowing. The result is a ~2x perf improvement.
Typical diff:
```diff
       vmovups      zmm0, zmmword ptr [r8]
       vpbroadcastb zmm1, byte ptr [rcx]
       vpsubb       zmm0, zmm0, zmm1
-      vmovaps      zmm1, zmm0
-      vpmovzxbw    zmm1, zmm1
-      vpmullw      zmm1, zmm1, zmm1
-      vpmovwb      zmm1, zmm1
-      vextracti32x8 ymm0, zmm0, 1
-      vpmovzxbw    zmm0, zmm0
+      vpsrlw       zmm1, zmm0, 8
+      vpandd       zmm2, zmm0, dword ptr [reloc @RWD00] {1to16}
+      vpmullw      zmm1, zmm2, zmm1
       vpmullw      zmm0, zmm0, zmm0
-      vpmovwb      zmm0, zmm0
-      vinserti32x8 zmm0, zmm1, ymm0, 1
+      vpternlogd   zmm0, zmm1, dword ptr [reloc @RWD04] {1to16}, -20
       vmovups      zmmword ptr [rdx], zmm0
       mov          rax, rdx
       vzeroupper
       ret
+RWD00  dd  FF00FF00h
+RWD04  dd  00FF00FFh

-; Total bytes of code 87
+; Total bytes of code 71
```
NB: codegen could actually be better, but currently the JIT imports (and morphs) `AND_NOT(x, y)` as `AND(x, NOT(y))`, which, in the case of a constant `y` that is re-used, creates two constants where one would have sufficed.