
Conversation

@Maratyszcza (Contributor) commented Jun 24, 2019

Introduction

Loading a single element into all lanes of a SIMD vector is a common operation in signal processing, and both ARM NEON and recent x86 SIMD extensions can do it in a single instruction. In the current WebAssembly SIMD proposal this operation can only be emulated via a combination of a scalar load and a splat instruction that replicates the loaded scalar value into all lanes of a SIMD register. Unlike LLVM, streaming compilers in WebAssembly engines generate code under tight latency constraints and cannot afford the pattern matching needed to produce the optimal machine instruction for the combination of load and splat WAsm instructions. This PR introduces combined load-and-splat instructions, which offer two improvements over the above two-instruction combination (a short WebAssembly text sketch contrasting the two forms follows the list below):

  1. The value from memory is loaded directly into a SIMD register rather than into a general-purpose register (which stores scalar integer values) as in the current two-instruction scheme. Loading the value directly into a SIMD register eliminates the expensive transfer from a general-purpose register to a SIMD register required by the current two-instruction scheme.
  2. These instructions enable WebAssembly implementations to leverage the specialized load-and-splat instructions that exist in both ARM and x86 SIMD extensions.
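For concreteness, a minimal sketch in WebAssembly text format contrasting the two forms for the 32-bit case; the instruction name i32x4.load_splat is the one proposed in this PR, and the function and local names are illustrative only:

```wasm
;; Current scheme: scalar load into a general-purpose register,
;; followed by a splat that replicates the value into all lanes.
(func $splat_via_scalar (param $ptr i32) (result v128)
  (i32x4.splat (i32.load (local.get $ptr))))

;; Proposed scheme: a single combined instruction that loads the
;; value from memory directly into all lanes of a SIMD register.
(func $splat_via_load_splat (param $ptr i32) (result v128)
  (i32x4.load_splat (local.get $ptr)))
```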

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. These patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.

x86/x86-64 processors with AVX2 instruction set

  • v = i8x16.load_splat(mem) maps to VPBROADCASTB xmm_v, [mem]
  • v = i16x8.load_splat(mem) maps to VPBROADCASTW xmm_v, [mem]
  • v = i32x4.load_splat(mem) is lowered as in the AVX instruction set
  • v = i64x2.load_splat(mem) is lowered as in the AVX instruction set

x86/x86-64 processors with AVX instruction set

  • v = i8x16.load_splat(mem) maps to VPINSRB xmm_v, [mem], 0 + VPXOR xmm_t, xmm_t, xmm_t + VPSHUFB xmm_v, xmm_v, xmm_t
  • v = i16x8.load_splat(mem) maps to VPINSRW xmm_v, [mem], 0 + VPSHUFLW xmm_v, xmm_v, 0 + VPUNPCKLQDQ xmm_v, xmm_v, xmm_v
  • v = i32x4.load_splat(mem) maps to VBROADCASTSS xmm_v, [mem]
  • v = i64x2.load_splat(mem) maps to VMOVDDUP xmm_v, [mem]

x86/x86-64 processors with SSE4.1 instruction set

  • v = i8x16.load_splat(mem) maps to PINSRB xmm_v, [mem], 0 + PXOR xmm_t, xmm_t + PSHUFB xmm_v, xmm_t
  • v = i16x8.load_splat(mem) is lowered as in the SSE2 instruction set
  • v = i32x4.load_splat(mem) is lowered as in the SSE2 instruction set
  • v = i64x2.load_splat(mem) is lowered as in the SSE3 instruction set

x86/x86-64 processors with SSE3 instruction set

  • v = i8x16.load_splat(mem) is lowered as in the SSE2 instruction set
  • v = i16x8.load_splat(mem) is lowered as in the SSE2 instruction set
  • v = i32x4.load_splat(mem) is lowered as in the SSE2 instruction set
  • v = i64x2.load_splat(mem) maps to MOVDDUP xmm_v, [mem]

x86/x86-64 processors with SSE2 instruction set

  • v = i8x16.load_splat(mem) maps to MOVZX r32_t, byte [mem] + IMUL r32_t, r32_t, 0x01010101 + MOVD xmm_v, r32_t + PSHUFD xmm_v, xmm_v, 0
  • v = i16x8.load_splat(mem) maps to PINSRW xmm_v, [mem], 0 + PSHUFLW xmm_v, xmm_v, 0 + PUNPCKLQDQ xmm_v, xmm_v
  • v = i32x4.load_splat(mem) maps to MOVSS xmm_v, [mem] + SHUFPS xmm_v, xmm_v, 0
  • v = i64x2.load_splat(mem) maps to MOVSD xmm_v, [mem] + UNPCKLPD xmm_v, xmm_v

ARM64 processors

  • v = i8x16.load_splat(mem) maps to LD1R {Vv.16B}, [Rmem]
  • v = i16x8.load_splat(mem) maps to LD1R {Vv.8H}, [Rmem]
  • v = i32x4.load_splat(mem) maps to LD1R {Vv.4S}, [Rmem]
  • v = i64x2.load_splat(mem) maps to LD1R {Vv.2D}, [Rmem]

ARMv7 processors with NEON instruction set

  • v = i8x16.load_splat(mem) maps to VLD1.8 {d_v[], d_v+1[]}, [Rmem]
  • v = i16x8.load_splat(mem) maps to VLD1.16 {d_v[], d_v+1[]}, [Rmem]
  • v = i32x4.load_splat(mem) maps to VLD1.32 {d_v[], d_v+1[]}, [Rmem]
  • v = i64x2.load_splat(mem) maps to VLD1.64 {d_v[]}, [Rmem] + VLD1.64 {d_v+1[]}, [Rmem]

@gnzlbg (Contributor) commented Jun 25, 2019

The value from memory is loaded directly into a SIMD register rather than into a general-purpose register (which stores scalar integer values) as in the current two-instruction scheme. Loading the value directly into a SIMD register eliminates the expensive transfer from a general-purpose register to a SIMD register required by the current two-instruction scheme.

You might want to expand the proposal with the reasons that make it impossible (or sufficiently hard) for WASM machine code generators to emit a single "load-and-splat" CPU instruction, on targets that support it, from WASM instruction sequences that perform a scalar load followed by a splat.

I kind of expect a machine code generator to pattern match those two instructions and treat them as one, lowering to a single CPU instruction when available.

@tlively (Member) commented Jun 25, 2019

We've generally tried to avoid forcing engines to do any optimization work that could have been done in the toolchain, including any pattern matching on instructions. I'm not saying it would be hard to do such pattern matching, but the bar for including an instruction should not assume there will be pattern matching.

@gnzlbg (Contributor) commented Jun 25, 2019

I don't see a way in which a machine code generator could generate optimal code here without pattern matching, and this operation is super common, so these instructions make a lot of sense to me.

For example, vector math libraries often support binary operations between scalars and vectors (bOp(scalar, vec), e.g., scalar * vec), and these are very often implemented as bOp(splat(scalar), vec) because the CPU instructions for these operate only on two vectors. That splat is the splat from memory being proposed here.
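For illustration, a minimal sketch of that pattern in WebAssembly text format, assuming the i32x4.load_splat name proposed here; the function and parameter names are hypothetical, and the i32x4-named load works for the f32 case because v128 lanes are just untyped bits:

```wasm
;; scalar * vec implemented as bOp(splat(scalar), vec): the scalar
;; coefficient is loaded from memory, replicated into all lanes,
;; and then multiplied lane-wise with the vector operand.
(func $scale (param $coef_ptr i32) (param $vec v128) (result v128)
  (f32x4.mul
    (i32x4.load_splat (local.get $coef_ptr))  ;; splat the in-memory scalar
    (local.get $vec)))
```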

@penzn (Contributor) commented Jun 25, 2019

The issue is where the matching would happen: in the compiler backend (requiring a new instruction) or in the WASM runtime. So far runtimes have been reluctant to adopt instruction selection optimizations that work across WASM instructions.

@arunetm (Collaborator) commented Jun 25, 2019

Agree that we should avoid expecting optimizations from engines. These operations look useful and have good hardware support. I am in favor of adding these.

@Maratyszcza (Contributor Author)

@gnzlbg Good point about motivation for new instruction vs pattern-matching a pair of instructions. Updated PR with explanation.

@dtig (Member) left a comment

Thanks for the detailed PR. Strongly agree that eliminating the expensive move here is useful. We have an issue open to prototype this in V8, but given that this has cross-architecture support, this should be good to merge. Could you add the new instructions to ImplementationStatus.md as well so that it doesn't get out of sync?

@Maratyszcza (Contributor Author)

@dtig Please take a look

@dtig (Member) left a comment

Looks good, thanks!

@dtig merged commit fa8af5f into WebAssembly:master Jul 18, 2019
@AndrewScheidecker (Contributor)

It seems like these instructions should be called vNxM.load_splat instead of iNxM.load_splat, since they don't need to interpret the bits of the scalars they work with, and can be used equally well for FP scalars.

@ngzhian (Member) commented Oct 17, 2019

Sorry, late to the game, here to point out that VPBROADCASTB/W requires AVX512VL/AVX512BW, not AVX2.

@Maratyszcza (Contributor Author)

@ngzhian You are looking at EVEX-encoded VPBROADCASTB/W. VEX-encoded VPBROADCASTB/W exists since AVX2.

@ngzhian (Member) commented Oct 18, 2019

You're very right, why are those in two different tables??? Thanks :)

Honry pushed a commit to Honry/simd that referenced this pull request Oct 19, 2019
@ngzhian (Member) commented Nov 20, 2019

For the ARM lowering v = i64x2.load_splat(mem) maps to VLD1.64 {d_v[], d_v+1[]}, [Rmem], I don't see VLD1.64 in the architecture manual.
Specifically, I'm looking at A8.8.323 VLD1 (single element to all lanes), where size == '11' seems to be an undefined operation.

@Maratyszcza (Contributor Author)

Right, I missed the lack of VLD1.64 with broadcast. This instruction would need to be lowered through either a combination of two VLD1.64 instructions without broadcast, or a combination of VLD1.64 + VMOV from one d register within the q register to its neighbor.

@ngzhian (Member) commented Nov 21, 2019

VBROADCASTSD seems to be supported only for 256-bit registers, https://www.felixcloutier.com/x86/vbroadcast:
"VBROADCASTSD and VBROADCASTF128,F32x4 and F64x2 are only supported as 256-bit and 512-bit wide versions and up"
We don't have 256-bit register support in V8, so I don't think we can do this for now.

@Maratyszcza (Contributor Author)

@ngzhian VMOVDDUP xmm, m64 would do the same

@Maratyszcza (Contributor Author)

Updated 64-bit load-and-splat lowering in PR description

@ngzhian (Member) commented Nov 21, 2019

Perfect, thank you!
