
Conversation

@Maratyszcza (Contributor) commented Jun 24, 2019

Introduction

Loading a single element into all lanes of a SIMD vector is a common operation in signal processing, and both ARM NEON and recent x86 SIMD extensions can do it in a single instruction. In the current WebAssembly SIMD proposal this operation can only be emulated via a combination of a scalar load and a splat instruction that replicates the loaded scalar value into all lanes of a SIMD register. Unlike LLVM, streaming compilers in WebAssembly engines generate code under tight latency constraints and cannot afford the pattern matching needed to produce the optimal machine instruction for the combination of load and splat WAsm instructions. This PR introduces combined load-and-splat instructions, which offer two improvements over the above two-instruction combination (a short WebAssembly text sketch contrasting the two forms follows the list below):

  1. The value from memory is loaded directly into a SIMD register rather than into a general-purpose register (which stores scalar integer values) as in the current two-instruction scheme. Loading the value directly into a SIMD register eliminates the expensive transfer from a general-purpose register to a SIMD register required by the current two-instruction scheme.
  2. These instructions enable WebAssembly implementations to leverage the specialized load-and-splat instructions that exist in both ARM and x86 SIMD extensions.
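For concreteness, a minimal sketch in WebAssembly text format contrasting the two forms for the 32-bit case; the instruction name i32x4.load_splat is the one proposed in this PR, and the function and local names are illustrative only:

```wasm
;; Current scheme: scalar load into a general-purpose register,
;; followed by a splat that replicates the value into all lanes.
(func $splat_via_scalar (param $ptr i32) (result v128)
  (i32x4.splat (i32.load (local.get $ptr))))

;; Proposed scheme: a single combined instruction that loads the
;; value from memory directly into all lanes of a SIMD register.
(func $splat_via_load_splat (param $ptr i32) (result v128)
  (i32x4.load_splat (local.get $ptr)))
```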

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. These patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.

x86/x86-64 processors with AVX2 instruction set

  • v = i8x16.load_splat(mem) maps to VPBROADCASTB xmm_v, [mem]
  • v = i16x8.load_splat(mem) maps to VPBROADCASTW xmm_v, [mem]
  • v = i32x4.load_splat(mem) is lowered as in the AVX instruction set
  • v = i64x2.load_splat(mem) is lowered as in the AVX instruction set

x86/x86-64 processors with AVX instruction set

  • v = i8x16.load_splat(mem) maps to VPINSRB xmm_v, [mem], 0 + VPXOR xmm_t, xmm_t, xmm_t + VPSHUFB xmm_v, xmm_v, xmm_t
  • v = i16x8.load_splat(mem) maps to VPINSRW xmm_v, [mem], 0 + VPSHUFLW xmm_v, xmm_v, 0 + VPUNPCKLQDQ xmm_v, xmm_v, xmm_v
  • v = i32x4.load_splat(mem) maps to VBROADCASTSS xmm_v, [mem]
  • v = i64x2.load_splat(mem) maps to VMOVDDUP xmm_v, [mem]

x86/x86-64 processors with SSE4.1 instruction set

  • v = i8x16.load_splat(mem) maps to PINSRB xmm_v, [mem], 0 + PXOR xmm_t, xmm_t + PSHUFB xmm_v, xmm_t
  • v = i16x8.load_splat(mem) is lowered as in the SSE2 instruction set
  • v = i32x4.load_splat(mem) is lowered as in the SSE2 instruction set
  • v = i64x2.load_splat(mem) is lowered as in the SSE3 instruction set

x86/x86-64 processors with SSE3 instruction set

  • v = i8x16.load_splat(mem) is lowered as in the SSE2 instruction set
  • v = i16x8.load_splat(mem) is lowered as in the SSE2 instruction set
  • v = i32x4.load_splat(mem) is lowered as in the SSE2 instruction set
  • v = i64x2.load_splat(mem) maps to MOVDDUP xmm_v, [mem]

x86/x86-64 processors with SSE2 instruction set

  • v = i8x16.load_splat(mem) maps to MOVZX r32_t, byte [mem] + IMUL r32_t, r32_t, 0x01010101 + MOVD xmm_v, r32_t + PSHUFD xmm_v, xmm_v, 0
  • v = i16x8.load_splat(mem) maps to PINSRW xmm_v, [mem], 0 + PSHUFLW xmm_v, xmm_v, 0 + PUNPCKLQDQ xmm_v, xmm_v
  • v = i32x4.load_splat(mem) maps to MOVSS xmm_v, [mem] + SHUFPS xmm_v, xmm_v, 0
  • v = i64x2.load_splat(mem) maps to MOVSD xmm_v, [mem] + UNPCKLPD xmm_v, xmm_v

ARM64 processors

  • v = i8x16.load_splat(mem) maps to LD1R {Vv.16B}, [Rmem]
  • v = i16x8.load_splat(mem) maps to LD1R {Vv.8H}, [Rmem]
  • v = i32x4.load_splat(mem) maps to LD1R {Vv.4S}, [Rmem]
  • v = i64x2.load_splat(mem) maps to LD1R {Vv.2D}, [Rmem]

ARMv7 processors with NEON instruction set

  • v = i8x16.load_splat(mem) maps to VLD1.8 {d_v[], d_v+1[]}, [Rmem]
  • v = i16x8.load_splat(mem) maps to VLD1.16 {d_v[], d_v+1[]}, [Rmem]
  • v = i32x4.load_splat(mem) maps to VLD1.32 {d_v[], d_v+1[]}, [Rmem]
  • v = i64x2.load_splat(mem) maps to VLD1.64 {d_v[]}, [Rmem] + VLD1.64 {d_v+1[]}, [Rmem]

@gnzlbg (Contributor) commented Jun 25, 2019

The value from memory is loaded directly into a SIMD register rather than into a general-purpose register (which stores scalar integer values) as in the current two-instruction scheme. Loading the value directly into a SIMD register eliminates the expensive transfer from a general-purpose register to a SIMD register required by the current two-instruction scheme.

You might want to expand the proposal with the reasons that make it impossible (or sufficiently hard) for WASM machine code generators to emit a single "load-and-splat" CPU instruction, on targets that support it, from WASM instruction sequences that perform a scalar load followed by a splat.

I kind of expect a machine code generator to pattern match those two instructions and treat them as one, lowering to a single CPU instruction when available.

@tlively (Member) commented Jun 25, 2019

We've generally tried to avoid forcing engines to do any optimization work that could have been done in the toolchain, including any pattern matching on instructions. I'm not saying it would be hard to do such pattern matching, but the bar for including an instruction should not assume there will be pattern matching.

@gnzlbg (Contributor) commented Jun 25, 2019

I don't see a way in which a machine code generator could generate optimal code here without pattern matching, and this operation is super common, so these instructions make a lot of sense to me.

For example, vector math libraries often support binary operations between scalars and vectors (bOp(scalar, vec), e.g., scalar * vec), and these are very often implemented as bOp(splat(scalar), vec) because the CPU instructions for these operate only on two vectors. That splat is the splat from memory being proposed here.
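For illustration, a minimal sketch of that pattern in WebAssembly text format, assuming the i32x4.load_splat name proposed here; the function and parameter names are hypothetical, and the i32x4-named load works for the f32 case because v128 lanes are just untyped bits:

```wasm
;; scalar * vec implemented as bOp(splat(scalar), vec): the scalar
;; coefficient is loaded from memory, replicated into all lanes,
;; and then multiplied lane-wise with the vector operand.
(func $scale (param $coef_ptr i32) (param $vec v128) (result v128)
  (f32x4.mul
    (i32x4.load_splat (local.get $coef_ptr))  ;; splat the in-memory scalar
    (local.get $vec)))
```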

@penzn (Contributor) commented Jun 25, 2019

The issue is where the matching would happen: in the compiler backend (requiring a new instruction) or in the WASM runtime. So far runtimes have been reluctant to adopt instruction selection optimizations that work across WASM instructions.

@arunetm (Collaborator) commented Jun 25, 2019

Agree that we should avoid expecting optimizations from engines. These operations look useful and have good hardware support. I am in favor of adding these.

@Maratyszcza (Contributor Author)

@gnzlbg Good point about motivation for new instruction vs pattern-matching a pair of instructions. Updated PR with explanation.

@dtig (Member) left a comment

Thanks for the detailed PR. Strongly agree that eliminating the expensive move here is useful. We have an issue open to prototype this in V8, but given that this has cross-architecture support, this should be good to merge. Could you add the new instructions to ImplementationStatus.md as well so that it doesn't get out of sync?

@Maratyszcza (Contributor Author)

@dtig Please take a look

@dtig (Member) left a comment

Looks good, thanks!

@dtig merged commit fa8af5f into WebAssembly:master Jul 18, 2019
@AndrewScheidecker (Contributor)

It seems like these instructions should be called vNxM.load_splat instead of iNxM.load_splat, since they don't need to interpret the bits of the scalars they work with, and can be used equally well for FP scalars.

@ngzhian (Member) commented Oct 17, 2019

Sorry, late to the game, here to point out that VPBROADCASTB/W requires AVX512VL/AVX512BW, not AVX2.

@Maratyszcza (Contributor Author)

@ngzhian You are looking at EVEX-encoded VPBROADCASTB/W. VEX-encoded VPBROADCASTB/W exists since AVX2.

@ngzhian (Member) commented Oct 18, 2019

You're very right, why are those in two different tables??? Thanks :)

Honry pushed a commit to Honry/simd that referenced this pull request Oct 19, 2019
@ngzhian (Member) commented Nov 20, 2019

For the ARM lowering v = i64x2.load_splat(mem) maps to VLD1.64 {d_v[], d_v+1[]}, [Rmem], I don't see VLD1.64 in the architecture manual.
Specifically, I'm looking at A8.8.323 VLD1 (single element to all lanes), where size == '11' seems to be an undefined operation.

@Maratyszcza (Contributor Author)

Right, I missed the lack of VLD1.64 with broadcast. This instruction would need to be lowered through either a combination of two VLD1.64 instructions without broadcast, or a combination of VLD1.64 + VMOV from one d register within the q register to its neighbor.

@ngzhian (Member) commented Nov 21, 2019

VBROADCASTSD seems to be supported only for 256-bit registers, https://www.felixcloutier.com/x86/vbroadcast:
"VBROADCASTSD and VBROADCASTF128,F32x4 and F64x2 are only supported as 256-bit and 512-bit wide versions and up"
We don't have 256-bit register support in V8, so I don't think we can do this for now.

@Maratyszcza (Contributor Author)

@ngzhian VMOVDDUP xmm, m64 would do the same

@Maratyszcza (Contributor Author)

Updated 64-bit load-and-splat lowering in PR description

@ngzhian (Member) commented Nov 21, 2019

Perfect, thank you!
