Add Load-And-Splat instructions #82
Conversation
You might want to expand the proposal with the reasons that make it impossible (or sufficiently hard) for WASM machine code generators to generate a single "load-and-splat" CPU instruction on targets that support it from WASM instruction sequences that perform a scalar load followed by a splat. I kind of expect a machine code generator to pattern match those two instructions and treat them as one, lowering to a single CPU instruction when available. |
|
We've generally tried to avoid forcing engines to do any optimization work that could have been done in the toolchain, including any pattern matching on instructions. I'm not saying it would be hard to do such pattern matching, but the bar for including an instruction should not assume there will be pattern matching. |
|
I don't see a way in which a machine code generator could generate optimal code here without pattern matching, and this operation is super common, so these instructions make a lot of sense to me. For example, vector math libraries often support binary operations between scalars and vectors ( |
|
The issue is where the matching would happen - in the compiler backend (requiring a new instruction) or in the WASM runtime. So far runtimes have been reluctant to adopt instruction selection optimizations that work across WASM instructions. |
|
Agree that we should avoid expecting optimizations from engines. These operations look useful with good hardware support. I am in favor of adding these. |
|
@gnzlbg Good point about motivation for new instruction vs pattern-matching a pair of instructions. Updated PR with explanation. |
dtig
left a comment
Thanks for the detailed PR. Strongly agree that eliminating the expensive move here is useful. We have an issue open to prototype this in V8, but given that this has cross architecture support, this should be good to merge. Could you add the new instructions to ImplementationStatus.md as well so that it doesn't get out of sync?
|
@dtig Please take a look |
dtig
left a comment
Looks good, thanks!
|
It seems like these instructions should be called |
|
Sorry, late to the game, here to point out that VPBROADCASTB/W requires AVX512VL/AVX512BW, not AVX2. |
|
@ngzhian You are looking at EVEX-encoded |
|
You're very right, why are those in two different tables??? Thanks :) |
|
For ARM |
|
Right, I missed the lack of |
|
VBROADCASTSD seems to be supported only for 256-bit registers, https://www.felixcloutier.com/x86/vbroadcast |
|
@ngzhian |
|
Updated 64-bit load-and-splat lowering in PR description |
|
Perfect, thank you! |
Introduction
Loading a single element into all lanes of a SIMD vector is a common operation in signal processing, and both ARM NEON and recent x86 SIMD extensions can do it in a single instruction. In the current WebAssembly SIMD proposal this operation can be emulated via a combination of a scalar load and a splat instruction that replicates the loaded scalar value into all lanes of a SIMD register. Unlike LLVM, streaming compilers in WebAssembly engines generate code under tight latency constraints, and cannot afford the pattern matching needed to generate an optimal machine instruction for the combination of load and splat WAsm instructions. This PR introduces combined Load-and-Splat instructions, which offer two improvements over the above two-instruction combination.
Mapping to Common Instruction Sets
This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.
x86/x86-64 processors with AVX2 instruction set
- `v = i8x16.load_splat(mem)` maps to `VPBROADCASTB xmm_v, [mem]`
- `v = i16x8.load_splat(mem)` maps to `VPBROADCASTW xmm_v, [mem]`
- `v = i32x4.load_splat(mem)` is lowered like in AVX instruction set
- `v = i64x2.load_splat(mem)` is lowered like in AVX instruction set

x86/x86-64 processors with AVX instruction set
- `v = i8x16.load_splat(mem)` maps to `VPINSRB xmm_v, [mem], 0 + VPXOR xmm_t, xmm_t, xmm_t + VPSHUFB xmm_v, xmm_v, xmm_t`
- `v = i16x8.load_splat(mem)` maps to `VPINSRW xmm_v, [mem], 0 + VPSHUFLW xmm_v, xmm_v, 0 + VPUNPCKLQDQ xmm_v, xmm_v, xmm_v`
- `v = i32x4.load_splat(mem)` maps to `VBROADCASTSS xmm_v, [mem]`
- `v = i64x2.load_splat(mem)` maps to `VMOVDDUP xmm_v, [mem]`

x86/x86-64 processors with SSE4.1 instruction set
- `v = i8x16.load_splat(mem)` maps to `PINSRB xmm_v, [mem], 0 + PXOR xmm_t, xmm_t + PSHUFB xmm_v, xmm_t`
- `v = i16x8.load_splat(mem)` is lowered like in SSE2 instruction set
- `v = i32x4.load_splat(mem)` is lowered like in SSE2 instruction set
- `v = i64x2.load_splat(mem)` is lowered like in SSE3 instruction set

x86/x86-64 processors with SSE3 instruction set
- `v = i8x16.load_splat(mem)` is lowered like in SSE2 instruction set
- `v = i16x8.load_splat(mem)` is lowered like in SSE2 instruction set
- `v = i32x4.load_splat(mem)` is lowered like in SSE2 instruction set
- `v = i64x2.load_splat(mem)` maps to `MOVDDUP xmm_v, [mem]`

x86/x86-64 processors with SSE2 instruction set
- `v = i8x16.load_splat(mem)` maps to `MOVZX r32_t, byte [mem] + IMUL r32_t, r32_t, 0x01010101 + MOVD xmm_v, r32_t + PSHUFD xmm_v, xmm_v, 0`
- `v = i16x8.load_splat(mem)` maps to `PINSRW xmm_v, [mem], 0 + PSHUFLW xmm_v, xmm_v, 0 + PUNPCKLQDQ xmm_v, xmm_v`
- `v = i32x4.load_splat(mem)` maps to `MOVSS xmm_v, [mem] + SHUFPS xmm_v, xmm_v, 0`
- `v = i64x2.load_splat(mem)` maps to `MOVSD xmm_v, [mem] + UNPCKLPD xmm_v, xmm_v`

ARM64 processors
- `v = i8x16.load_splat(mem)` maps to `LD1R {Vv.16B}, [Rmem]`
- `v = i16x8.load_splat(mem)` maps to `LD1R {Vv.8H}, [Rmem]`
- `v = i32x4.load_splat(mem)` maps to `LD1R {Vv.4S}, [Rmem]`
- `v = i64x2.load_splat(mem)` maps to `LD1R {Vv.2D}, [Rmem]`

ARMv7 processors with NEON instruction set
- `v = i8x16.load_splat(mem)` maps to `VLD1.8 {d_v[], d_v+1[]}, [Rmem]`
- `v = i16x8.load_splat(mem)` maps to `VLD1.16 {d_v[], d_v+1[]}, [Rmem]`
- `v = i32x4.load_splat(mem)` maps to `VLD1.32 {d_v[], d_v+1[]}, [Rmem]`
- `v = i64x2.load_splat(mem)` maps to `VLD1.64 {d_v[]}, [Rmem] + VLD1.64 {d_v+1[]}, [Rmem]`