
i64x2.min_u and i64x2.max_u instructions #418


Conversation

@Maratyszcza (Contributor) commented Dec 30, 2020

Introduction

This is a proposal to add 64-bit variants of the existing min_u and max_u instructions. Only x86 processors with AVX512 support these instructions natively.
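
For reference, each instruction treats the two 64-bit lanes of its operands as unsigned integers and computes the lane-wise minimum or maximum. A minimal scalar sketch of the intended semantics (types and function names here are illustrative, not part of the proposal):

```c
#include <stdint.h>

/* Illustrative scalar model of a v128 value viewed as i64x2. */
typedef struct { uint64_t lanes[2]; } i64x2_t;

/* y.lanes[i] = min(a.lanes[i], b.lanes[i]), comparing as unsigned. */
static i64x2_t i64x2_min_u(i64x2_t a, i64x2_t b) {
    i64x2_t y;
    for (int i = 0; i < 2; i++)
        y.lanes[i] = a.lanes[i] < b.lanes[i] ? a.lanes[i] : b.lanes[i];
    return y;
}

/* y.lanes[i] = max(a.lanes[i], b.lanes[i]), comparing as unsigned. */
static i64x2_t i64x2_max_u(i64x2_t a, i64x2_t b) {
    i64x2_t y;
    for (int i = 0; i < 2; i++)
        y.lanes[i] = a.lanes[i] > b.lanes[i] ? a.lanes[i] : b.lanes[i];
    return y;
}
```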

Applications

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.

x86/x86-64 processors with AVX512F and AVX512VL instruction sets

  • i64x2.min_u
    • y = i64x2.min_u(a, b) is lowered to VPMINUQ xmm_y, xmm_a, xmm_b
  • i64x2.max_u
    • y = i64x2.max_u(a, b) is lowered to VPMAXUQ xmm_y, xmm_a, xmm_b
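
In C terms the AVX512 lowering is a single intrinsic per instruction; a minimal sketch, assuming a compiler targeting AVX512F+AVX512VL (function names are illustrative):

```c
#include <immintrin.h>

/* Compile with e.g. -mavx512f -mavx512vl; _mm_min_epu64/_mm_max_epu64
   map directly to VPMINUQ/VPMAXUQ on 128-bit vectors. */
__m128i i64x2_min_u_avx512(__m128i a, __m128i b) {
    return _mm_min_epu64(a, b);
}

__m128i i64x2_max_u_avx512(__m128i a, __m128i b) {
    return _mm_max_epu64(a, b);
}
```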

x86/x86-64 processors with XOP instruction set

  • i64x2.min_u
    • y = i64x2.min_u(a, b) (y is not a and y is not b) is lowered to:
      • VPCOMGTUQ xmm_y, xmm_a, xmm_b
      • VPBLENDVB xmm_y, xmm_a, xmm_b, xmm_y
  • i64x2.max_u
    • y = i64x2.max_u(a, b) (y is not a and y is not b) is lowered to:
      • VPCOMGTUQ xmm_y, xmm_a, xmm_b
      • VPBLENDVB xmm_y, xmm_b, xmm_a, xmm_y
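
A sketch of the same lowering with XOP intrinsics, assuming a compiler that provides them (e.g. GCC/Clang with -mxop; function names are illustrative):

```c
#include <x86intrin.h>

/* _mm_comgt_epu64 maps to VPCOMGTUQ (unsigned 64-bit greater-than mask),
   _mm_blendv_epi8 maps to (V)PBLENDVB. */
__m128i i64x2_min_u_xop(__m128i a, __m128i b) {
    const __m128i gt = _mm_comgt_epu64(a, b);  /* per lane: a > b ? ~0 : 0 */
    return _mm_blendv_epi8(a, b, gt);          /* take b where a > b */
}

__m128i i64x2_max_u_xop(__m128i a, __m128i b) {
    const __m128i gt = _mm_comgt_epu64(a, b);
    return _mm_blendv_epi8(b, a, gt);          /* take a where a > b */
}
```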

x86/x86-64 processors with AVX instruction set

  • i64x2.min_u
    • y = i64x2.min_u(a, b) (y is not a and y is not b) is lowered to:
      • VMOVDQA xmm_tmp, [wasm_i64x2_splat(0x8000000000000000)]
      • VPXOR xmm_y, xmm_tmp, xmm_a
      • VPXOR xmm_tmp, xmm_tmp, xmm_b
      • VPCMPGTQ xmm_y, xmm_y, xmm_tmp
      • VPBLENDVB xmm_y, xmm_a, xmm_b, xmm_y
  • i64x2.max_u
    • y = i64x2.max_u(a, b) (y is not a and y is not b) is lowered to:
      • VMOVDQA xmm_tmp, [wasm_i64x2_splat(0x8000000000000000)]
      • VPXOR xmm_y, xmm_tmp, xmm_a
      • VPXOR xmm_tmp, xmm_tmp, xmm_b
      • VPCMPGTQ xmm_y, xmm_y, xmm_tmp
      • VPBLENDVB xmm_y, xmm_b, xmm_a, xmm_y

x86/x86-64 processors with SSE4.2 instruction set

  • i64x2.min_u
    • y = i64x2.min_u(a, b) (y is not a and y is not b and a/b/y is not in xmm0) is lowered to:
      • MOVDQA xmm_y, [wasm_i64x2_splat(0x8000000000000000)]
      • MOVDQA xmm0, xmm_a
      • PXOR xmm0, xmm_y
      • PXOR xmm_y, xmm_b
      • PCMPGTQ xmm0, xmm_y
      • MOVDQA xmm_y, xmm_a
      • PBLENDVB xmm_y, xmm_b
  • i64x2.max_u
    • y = i64x2.max_u(a, b) (y is not a and y is not b and a/b/y is not in xmm0) is lowered to:
      • MOVDQA xmm_y, [wasm_i64x2_splat(0x8000000000000000)]
      • MOVDQA xmm0, xmm_a
      • PXOR xmm0, xmm_y
      • PXOR xmm_y, xmm_b
      • PCMPGTQ xmm0, xmm_y
      • MOVDQA xmm_y, xmm_b
      • PBLENDVB xmm_y, xmm_a
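
The AVX and SSE4.2 sequences above rely on the same trick: PCMPGTQ only performs a signed comparison, but XOR-ing both operands with 2^63 flips the sign bit of every lane, mapping unsigned order onto signed order. A minimal sketch with SSE4.2 intrinsics (the same source compiles to the VEX-encoded AVX form under -mavx; function names are illustrative):

```c
#include <immintrin.h>

/* Requires SSE4.2 for _mm_cmpgt_epi64 and SSE4.1 for _mm_blendv_epi8. */
__m128i i64x2_min_u_sse42(__m128i a, __m128i b) {
    const __m128i bias = _mm_set1_epi64x((long long) 0x8000000000000000ull);
    const __m128i gt = _mm_cmpgt_epi64(_mm_xor_si128(a, bias),
                                       _mm_xor_si128(b, bias)); /* a > b unsigned */
    return _mm_blendv_epi8(a, b, gt);  /* take b where a > b */
}

__m128i i64x2_max_u_sse42(__m128i a, __m128i b) {
    const __m128i bias = _mm_set1_epi64x((long long) 0x8000000000000000ull);
    const __m128i gt = _mm_cmpgt_epi64(_mm_xor_si128(a, bias),
                                       _mm_xor_si128(b, bias));
    return _mm_blendv_epi8(b, a, gt);  /* take a where a > b */
}
```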

x86/x86-64 processors with SSE4.1 instruction set

Based on this answer by user aqrit on Stack Overflow

  • i64x2.min_u
    • y = i64x2.min_u(a, b) (y is not a and y is not b and a/b/y is not in xmm0) is lowered to:
      • MOVDQA xmm_y, xmm_b
      • MOVDQA xmm0, xmm_b
      • PSUBQ xmm_y, xmm_a
      • PXOR xmm0, xmm_a
      • PANDN xmm0, xmm_y
      • MOVDQA xmm_y, xmm_b
      • PANDN xmm_y, xmm_a
      • POR xmm0, xmm_y
      • PSRAD xmm0, 31
      • MOVDQA xmm_y, xmm_a
      • PSHUFD xmm0, xmm0, 0xF5
      • PBLENDVB xmm_y, xmm_b
  • i64x2.max_u
    • y = i64x2.max_u(a, b) (y is not a and y is not b and a/b/y is not in xmm0) is lowered to:
      • MOVDQA xmm_y, xmm_b
      • MOVDQA xmm0, xmm_b
      • PSUBQ xmm_y, xmm_a
      • PXOR xmm0, xmm_a
      • PANDN xmm0, xmm_y
      • MOVDQA xmm_y, xmm_b
      • PANDN xmm_y, xmm_a
      • POR xmm0, xmm_y
      • PSRAD xmm0, 31
      • MOVDQA xmm_y, xmm_b
      • PSHUFD xmm0, xmm0, 0xF5
      • PBLENDVB xmm_y, xmm_a

x86/x86-64 processors with SSE2 instruction set

Based on this answer by user aqrit on Stack Overflow

  • i64x2.min_u
    • y = i64x2.min_u(a, b) (y is not a and y is not b) is lowered to:
      • MOVDQA xmm_tmp, xmm_b
      • MOVDQA xmm_y, xmm_b
      • PSUBQ xmm_tmp, xmm_a
      • PXOR xmm_y, xmm_a
      • PANDN xmm_y, xmm_tmp
      • MOVDQA xmm_tmp, xmm_b
      • PANDN xmm_tmp, xmm_a
      • POR xmm_y, xmm_tmp
      • PSRAD xmm_y, 31
      • MOVDQA xmm_tmp, xmm_b
      • PSHUFD xmm_y, xmm_y, 0xF5
      • PAND xmm_tmp, xmm_y
      • PANDN xmm_y, xmm_a
      • POR xmm_y, xmm_tmp
  • i64x2.max_u
    • y = i64x2.max_u(a, b) (y is not a and y is not b) is lowered to:
      • MOVDQA xmm_tmp, xmm_b
      • MOVDQA xmm_y, xmm_b
      • PSUBQ xmm_tmp, xmm_a
      • PXOR xmm_y, xmm_a
      • PANDN xmm_y, xmm_tmp
      • MOVDQA xmm_tmp, xmm_b
      • PANDN xmm_tmp, xmm_a
      • POR xmm_y, xmm_tmp
      • PSRAD xmm_y, 31
      • MOVDQA xmm_tmp, xmm_a
      • PSHUFD xmm_y, xmm_y, 0xF5
      • PAND xmm_tmp, xmm_y
      • PANDN xmm_y, xmm_b
      • POR xmm_y, xmm_tmp
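
The SSE4.1 and SSE2 sequences above avoid PCMPGTQ entirely: in each 64-bit lane, the sign bit of (a & ~b) | (~(a ^ b) & (b - a)) is set exactly when a > b as unsigned (the first term handles differing sign bits, the second defers to the borrow of b - a when the sign bits match), and PSRAD + PSHUFD smear that bit into a full-lane mask. A minimal SSE2-only sketch (function names are illustrative):

```c
#include <emmintrin.h>

/* Unsigned 64-bit greater-than mask without PCMPGTQ, per aqrit's trick. */
static __m128i i64x2_gt_u_mask(__m128i a, __m128i b) {
    __m128i m = _mm_or_si128(
        _mm_andnot_si128(b, a),                    /* a & ~b */
        _mm_andnot_si128(_mm_xor_si128(a, b),      /* ~(a ^ b) & ... */
                         _mm_sub_epi64(b, a)));    /* ... (b - a) */
    m = _mm_srai_epi32(m, 31);                     /* sign of each dword */
    return _mm_shuffle_epi32(m, _MM_SHUFFLE(3, 3, 1, 1)); /* 0xF5: smear */
}

__m128i i64x2_min_u_sse2(__m128i a, __m128i b) {
    const __m128i gt = i64x2_gt_u_mask(a, b);
    return _mm_or_si128(_mm_and_si128(gt, b),      /* b where a > b */
                        _mm_andnot_si128(gt, a));  /* a elsewhere */
}

__m128i i64x2_max_u_sse2(__m128i a, __m128i b) {
    const __m128i gt = i64x2_gt_u_mask(a, b);
    return _mm_or_si128(_mm_and_si128(gt, a),      /* a where a > b */
                        _mm_andnot_si128(gt, b));  /* b elsewhere */
}
```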

ARM64 processors

  • i64x2.min_u
    • y = i64x2.min_u(a, b) (y is not a and y is not b) is lowered to:
      • CMHI Vy.2D, Va.2D, Vb.2D
      • BSL Vy.16B, Vb.16B, Va.16B
  • i64x2.max_u
    • y = i64x2.max_u(a, b) (y is not a and y is not b) is lowered to:
      • CMHI Vy.2D, Va.2D, Vb.2D
      • BSL Vy.16B, Va.16B, Vb.16B
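
With AArch64 NEON intrinsics this is a two-instruction sequence; a minimal sketch (function names are illustrative):

```c
#include <arm_neon.h>

/* vcgtq_u64 maps to CMHI (unsigned greater-than mask),
   vbslq_u64 maps to BSL (bitwise select). */
uint64x2_t i64x2_min_u_aarch64(uint64x2_t a, uint64x2_t b) {
    const uint64x2_t gt = vcgtq_u64(a, b);  /* per lane: a > b ? ~0 : 0 */
    return vbslq_u64(gt, b, a);             /* take b where a > b */
}

uint64x2_t i64x2_max_u_aarch64(uint64x2_t a, uint64x2_t b) {
    const uint64x2_t gt = vcgtq_u64(a, b);
    return vbslq_u64(gt, a, b);             /* take a where a > b */
}
```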

ARMv7 processors with NEON instruction set

  • i64x2.min_u
    • y = i64x2.min_u(a, b) (y is not a and y is not b) is lowered to:
      • VQSUB.U64 Qy, Qa, Qb
      • VSUB.I64 Qy, Qa, Qy
  • i64x2.max_u
    • y = i64x2.max_u(a, b) (y is not a and y is not b) is lowered to:
      • VQSUB.U64 Qy, Qa, Qb
      • VADD.I64 Qy, Qb, Qy
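
The ARMv7 sequence needs no comparison at all: unsigned saturating subtraction gives VQSUB.U64(a, b) = a - b when a > b and 0 otherwise, so a - VQSUB.U64(a, b) is the minimum and b + VQSUB.U64(a, b) is the maximum. A minimal sketch with NEON intrinsics (function names are illustrative):

```c
#include <arm_neon.h>

/* vqsubq_u64 maps to VQSUB.U64: (a > b) ? a - b : 0. */
uint64x2_t i64x2_min_u_neon(uint64x2_t a, uint64x2_t b) {
    return vsubq_u64(a, vqsubq_u64(a, b));  /* a - max(a - b, 0) */
}

uint64x2_t i64x2_max_u_neon(uint64x2_t a, uint64x2_t b) {
    return vaddq_u64(b, vqsubq_u64(a, b));  /* b + max(a - b, 0) */
}
```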

@ngzhian (Member) commented Jan 6, 2021

We will probably discuss this in the next meeting; I would like to put down my thoughts ahead of time:

  • horrible codegen (we can argue that this is okay since there is precedent, but do we really want more of these performance cliffs?)
  • XOP is not a thing in V8 yet, and I don't have data to tell me what % of users will benefit
  • AVX512 is also not a thing in V8 yet, its reach is limited, and IIRC new generations of processors won't support it
  • limited use cases: 3 of the 5 cited are SIMD libraries, so we really have 2

@abrown (Contributor) commented Jan 11, 2021

Code-sequence-wise, these are rather unfortunate on a large number of x86 machines. As with #417, it might make sense to wait on these until more machines have something like VP[MIN|MAX]UQ.

@penzn (Contributor) commented Jan 13, 2021

Same comment as @abrown. These sequences seem to be a bit worse than the ones in #417.

@dtig (Member) commented Jan 25, 2021

Adding a preliminary vote below for the inclusion of i64x2 unsigned min/max operations in the SIMD proposal. Please vote with:

👍 For including i64x2 unsigned min/max operations
👎 Against including i64x2 unsigned min/max operations

@ngzhian added the 2021-01-29 Agenda for sync meeting 1/29/21 label Jan 26, 2021
@dtig removed the 2021-01-29 Agenda for sync meeting 1/29/21 label Feb 2, 2021
@Maratyszcza (Contributor, Author) commented:

The community group unanimously decided against including these instructions in the 1/29/21 meeting (#429).

@Maratyszcza closed this Feb 4, 2021