
i64x2.min_u and i64x2.max_u instructions #418


Conversation

@Maratyszcza (Contributor) commented Dec 30, 2020

Introduction

This is a proposal to add 64-bit variants of the existing min_u and max_u instructions. Only x86 processors with AVX512 support these instructions natively.
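
For reference, each instruction treats the two 64-bit lanes of its operands as unsigned integers and computes the lane-wise minimum or maximum. A minimal scalar sketch of the intended semantics (types and function names here are illustrative, not part of the proposal):

```c
#include <stdint.h>

/* Illustrative scalar model of a v128 value viewed as i64x2. */
typedef struct { uint64_t lanes[2]; } i64x2_t;

/* y.lanes[i] = min(a.lanes[i], b.lanes[i]), comparing as unsigned. */
static i64x2_t i64x2_min_u(i64x2_t a, i64x2_t b) {
    i64x2_t y;
    for (int i = 0; i < 2; i++)
        y.lanes[i] = a.lanes[i] < b.lanes[i] ? a.lanes[i] : b.lanes[i];
    return y;
}

/* y.lanes[i] = max(a.lanes[i], b.lanes[i]), comparing as unsigned. */
static i64x2_t i64x2_max_u(i64x2_t a, i64x2_t b) {
    i64x2_t y;
    for (int i = 0; i < 2; i++)
        y.lanes[i] = a.lanes[i] > b.lanes[i] ? a.lanes[i] : b.lanes[i];
    return y;
}
```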

Applications

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.

x86/x86-64 processors with AVX512F and AVX512VL instruction sets

  • i64x2.min_u
    • y = i64x2.min_u(a, b) is lowered to VPMINUQ xmm_y, xmm_a, xmm_b
  • i64x2.max_u
    • y = i64x2.max_u(a, b) is lowered to VPMAXUQ xmm_y, xmm_a, xmm_b
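
In C terms the AVX512 lowering is a single intrinsic per instruction; a minimal sketch, assuming a compiler targeting AVX512F+AVX512VL (function names are illustrative):

```c
#include <immintrin.h>

/* Compile with e.g. -mavx512f -mavx512vl; _mm_min_epu64/_mm_max_epu64
   map directly to VPMINUQ/VPMAXUQ on 128-bit vectors. */
__m128i i64x2_min_u_avx512(__m128i a, __m128i b) {
    return _mm_min_epu64(a, b);
}

__m128i i64x2_max_u_avx512(__m128i a, __m128i b) {
    return _mm_max_epu64(a, b);
}
```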

x86/x86-64 processors with XOP instruction set

  • i64x2.min_u
    • y = i64x2.min_u(a, b) (y is not a and y is not b) is lowered to:
      • VPCOMGTUQ xmm_y, xmm_a, xmm_b
      • VPBLENDVB xmm_y, xmm_a, xmm_b, xmm_y
  • i64x2.max_u
    • y = i64x2.max_u(a, b) (y is not a and y is not b) is lowered to:
      • VPCOMGTUQ xmm_y, xmm_a, xmm_b
      • VPBLENDVB xmm_y, xmm_b, xmm_a, xmm_y
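
A sketch of the same lowering with XOP intrinsics, assuming a compiler that provides them (e.g. GCC/Clang with -mxop; function names are illustrative):

```c
#include <x86intrin.h>

/* _mm_comgt_epu64 maps to VPCOMGTUQ (unsigned 64-bit greater-than mask),
   _mm_blendv_epi8 maps to (V)PBLENDVB. */
__m128i i64x2_min_u_xop(__m128i a, __m128i b) {
    const __m128i gt = _mm_comgt_epu64(a, b);  /* per lane: a > b ? ~0 : 0 */
    return _mm_blendv_epi8(a, b, gt);          /* take b where a > b */
}

__m128i i64x2_max_u_xop(__m128i a, __m128i b) {
    const __m128i gt = _mm_comgt_epu64(a, b);
    return _mm_blendv_epi8(b, a, gt);          /* take a where a > b */
}
```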

x86/x86-64 processors with AVX instruction set

  • i64x2.min_u
    • y = i64x2.min_u(a, b) (y is not a and y is not b) is lowered to:
      • VMOVDQA xmm_tmp, [wasm_i64x2_splat(0x8000000000000000)]
      • VPXOR xmm_y, xmm_tmp, xmm_a
      • VPXOR xmm_tmp, xmm_tmp, xmm_b
      • VPCMPGTQ xmm_y, xmm_y, xmm_tmp
      • VPBLENDVB xmm_y, xmm_a, xmm_b, xmm_y
  • i64x2.max_u
    • y = i64x2.max_u(a, b) (y is not a and y is not b) is lowered to:
      • VMOVDQA xmm_tmp, [wasm_i64x2_splat(0x8000000000000000)]
      • VPXOR xmm_y, xmm_tmp, xmm_a
      • VPXOR xmm_tmp, xmm_tmp, xmm_b
      • VPCMPGTQ xmm_y, xmm_y, xmm_tmp
      • VPBLENDVB xmm_y, xmm_b, xmm_a, xmm_y

x86/x86-64 processors with SSE4.2 instruction set

  • i64x2.min_u
    • y = i64x2.min_u(a, b) (y is not a and y is not b and a/b/y is not in xmm0) is lowered to:
      • MOVDQA xmm_y, [wasm_i64x2_splat(0x8000000000000000)]
      • MOVDQA xmm0, xmm_a
      • PXOR xmm0, xmm_y
      • PXOR xmm_y, xmm_b
      • PCMPGTQ xmm0, xmm_y
      • MOVDQA xmm_y, xmm_a
      • PBLENDVB xmm_y, xmm_b
  • i64x2.max_u
    • y = i64x2.max_u(a, b) (y is not a and y is not b and a/b/y is not in xmm0) is lowered to:
      • MOVDQA xmm_y, [wasm_i64x2_splat(0x8000000000000000)]
      • MOVDQA xmm0, xmm_a
      • PXOR xmm0, xmm_y
      • PXOR xmm_y, xmm_b
      • PCMPGTQ xmm0, xmm_y
      • MOVDQA xmm_y, xmm_b
      • PBLENDVB xmm_y, xmm_a
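
The AVX and SSE4.2 sequences above rely on the same trick: PCMPGTQ only performs a signed comparison, but XOR-ing both operands with 2^63 flips the sign bit of every lane, mapping unsigned order onto signed order. A minimal sketch with SSE4.2 intrinsics (the same source compiles to the VEX-encoded AVX form under -mavx; function names are illustrative):

```c
#include <immintrin.h>

/* Requires SSE4.2 for _mm_cmpgt_epi64 and SSE4.1 for _mm_blendv_epi8. */
__m128i i64x2_min_u_sse42(__m128i a, __m128i b) {
    const __m128i bias = _mm_set1_epi64x((long long) 0x8000000000000000ull);
    const __m128i gt = _mm_cmpgt_epi64(_mm_xor_si128(a, bias),
                                       _mm_xor_si128(b, bias)); /* a > b unsigned */
    return _mm_blendv_epi8(a, b, gt);  /* take b where a > b */
}

__m128i i64x2_max_u_sse42(__m128i a, __m128i b) {
    const __m128i bias = _mm_set1_epi64x((long long) 0x8000000000000000ull);
    const __m128i gt = _mm_cmpgt_epi64(_mm_xor_si128(a, bias),
                                       _mm_xor_si128(b, bias));
    return _mm_blendv_epi8(b, a, gt);  /* take a where a > b */
}
```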

x86/x86-64 processors with SSE4.1 instruction set

Based on this answer by user aqrit on Stack Overflow

  • i64x2.min_u
    • y = i64x2.min_u(a, b) (y is not a and y is not b and a/b/y is not in xmm0) is lowered to:
      • MOVDQA xmm_y, xmm_b
      • MOVDQA xmm0, xmm_b
      • PSUBQ xmm_y, xmm_a
      • PXOR xmm0, xmm_a
      • PANDN xmm0, xmm_y
      • MOVDQA xmm_y, xmm_b
      • PANDN xmm_y, xmm_a
      • POR xmm0, xmm_y
      • PSRAD xmm0, 31
      • MOVDQA xmm_y, xmm_a
      • PSHUFD xmm0, xmm0, 0xF5
      • PBLENDVB xmm_y, xmm_b
  • i64x2.max_u
    • y = i64x2.max_u(a, b) (y is not a and y is not b and a/b/y is not in xmm0) is lowered to:
      • MOVDQA xmm_y, xmm_b
      • MOVDQA xmm0, xmm_b
      • PSUBQ xmm_y, xmm_a
      • PXOR xmm0, xmm_a
      • PANDN xmm0, xmm_y
      • MOVDQA xmm_y, xmm_b
      • PANDN xmm_y, xmm_a
      • POR xmm0, xmm_y
      • PSRAD xmm0, 31
      • MOVDQA xmm_y, xmm_b
      • PSHUFD xmm0, xmm0, 0xF5
      • PBLENDVB xmm_y, xmm_a

x86/x86-64 processors with SSE2 instruction set

Based on this answer by user aqrit on Stack Overflow

  • i64x2.min_u
    • y = i64x2.min_u(a, b) (y is not a and y is not b) is lowered to:
      • MOVDQA xmm_tmp, xmm_b
      • MOVDQA xmm_y, xmm_b
      • PSUBQ xmm_tmp, xmm_a
      • PXOR xmm_y, xmm_a
      • PANDN xmm_y, xmm_tmp
      • MOVDQA xmm_tmp, xmm_b
      • PANDN xmm_tmp, xmm_a
      • POR xmm_y, xmm_tmp
      • PSRAD xmm_y, 31
      • MOVDQA xmm_tmp, xmm_b
      • PSHUFD xmm_y, xmm_y, 0xF5
      • PAND xmm_tmp, xmm_y
      • PANDN xmm_y, xmm_a
      • POR xmm_y, xmm_tmp
  • i64x2.max_u
    • y = i64x2.max_u(a, b) (y is not a and y is not b) is lowered to:
      • MOVDQA xmm_tmp, xmm_b
      • MOVDQA xmm_y, xmm_b
      • PSUBQ xmm_tmp, xmm_a
      • PXOR xmm_y, xmm_a
      • PANDN xmm_y, xmm_tmp
      • MOVDQA xmm_tmp, xmm_b
      • PANDN xmm_tmp, xmm_a
      • POR xmm_y, xmm_tmp
      • PSRAD xmm_y, 31
      • MOVDQA xmm_tmp, xmm_a
      • PSHUFD xmm_y, xmm_y, 0xF5
      • PAND xmm_tmp, xmm_y
      • PANDN xmm_y, xmm_b
      • POR xmm_y, xmm_tmp
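
The SSE4.1 and SSE2 sequences above avoid PCMPGTQ entirely: in each 64-bit lane, the sign bit of (a & ~b) | (~(a ^ b) & (b - a)) is set exactly when a > b as unsigned (the first term handles differing sign bits, the second defers to the borrow of b - a when the sign bits match), and PSRAD + PSHUFD smear that bit into a full-lane mask. A minimal SSE2-only sketch (function names are illustrative):

```c
#include <emmintrin.h>

/* Unsigned 64-bit greater-than mask without PCMPGTQ, per aqrit's trick. */
static __m128i i64x2_gt_u_mask(__m128i a, __m128i b) {
    __m128i m = _mm_or_si128(
        _mm_andnot_si128(b, a),                    /* a & ~b */
        _mm_andnot_si128(_mm_xor_si128(a, b),      /* ~(a ^ b) & ... */
                         _mm_sub_epi64(b, a)));    /* ... (b - a) */
    m = _mm_srai_epi32(m, 31);                     /* sign of each dword */
    return _mm_shuffle_epi32(m, _MM_SHUFFLE(3, 3, 1, 1)); /* 0xF5: smear */
}

__m128i i64x2_min_u_sse2(__m128i a, __m128i b) {
    const __m128i gt = i64x2_gt_u_mask(a, b);
    return _mm_or_si128(_mm_and_si128(gt, b),      /* b where a > b */
                        _mm_andnot_si128(gt, a));  /* a elsewhere */
}

__m128i i64x2_max_u_sse2(__m128i a, __m128i b) {
    const __m128i gt = i64x2_gt_u_mask(a, b);
    return _mm_or_si128(_mm_and_si128(gt, a),      /* a where a > b */
                        _mm_andnot_si128(gt, b));  /* b elsewhere */
}
```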

ARM64 processors

  • i64x2.min_u
    • y = i64x2.min_u(a, b) (y is not a and y is not b) is lowered to:
      • CMHI Vy.2D, Va.2D, Vb.2D
      • BSL Vy.16B, Vb.16B, Va.16B
  • i64x2.max_u
    • y = i64x2.max_u(a, b) (y is not a and y is not b) is lowered to:
      • CMHI Vy.2D, Va.2D, Vb.2D
      • BSL Vy.16B, Va.16B, Vb.16B
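
With AArch64 NEON intrinsics this is a two-instruction sequence; a minimal sketch (function names are illustrative):

```c
#include <arm_neon.h>

/* vcgtq_u64 maps to CMHI (unsigned greater-than mask),
   vbslq_u64 maps to BSL (bitwise select). */
uint64x2_t i64x2_min_u_aarch64(uint64x2_t a, uint64x2_t b) {
    const uint64x2_t gt = vcgtq_u64(a, b);  /* per lane: a > b ? ~0 : 0 */
    return vbslq_u64(gt, b, a);             /* take b where a > b */
}

uint64x2_t i64x2_max_u_aarch64(uint64x2_t a, uint64x2_t b) {
    const uint64x2_t gt = vcgtq_u64(a, b);
    return vbslq_u64(gt, a, b);             /* take a where a > b */
}
```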

ARMv7 processors with NEON instruction set

  • i64x2.min_u
    • y = i64x2.min_u(a, b) (y is not a and y is not b) is lowered to:
      • VQSUB.U64 Qy, Qa, Qb
      • VSUB.I64 Qy, Qa, Qy
  • i64x2.max_u
    • y = i64x2.max_u(a, b) (y is not a and y is not b) is lowered to:
      • VQSUB.U64 Qy, Qa, Qb
      • VADD.I64 Qy, Qb, Qy
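
The ARMv7 sequence needs no comparison at all: unsigned saturating subtraction gives VQSUB.U64(a, b) = a - b when a > b and 0 otherwise, so a - VQSUB.U64(a, b) is the minimum and b + VQSUB.U64(a, b) is the maximum. A minimal sketch with NEON intrinsics (function names are illustrative):

```c
#include <arm_neon.h>

/* vqsubq_u64 maps to VQSUB.U64: (a > b) ? a - b : 0. */
uint64x2_t i64x2_min_u_neon(uint64x2_t a, uint64x2_t b) {
    return vsubq_u64(a, vqsubq_u64(a, b));  /* a - max(a - b, 0) */
}

uint64x2_t i64x2_max_u_neon(uint64x2_t a, uint64x2_t b) {
    return vaddq_u64(b, vqsubq_u64(a, b));  /* b + max(a - b, 0) */
}
```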

@ngzhian (Member) commented Jan 6, 2021

We will probably discuss this in the next meeting; I would like to put down my thoughts ahead of time:

  • horrible codegen (we can argue that this is okay since there is precedent, but do we really want more of these performance cliffs?)
  • XOP is not a thing in V8 yet, and I don't have data to tell me what % of users will benefit
  • AVX512 is also not a thing in V8 yet, its reach is limited, and IIRC new generations of processors won't support it
  • limited use cases: 3 of the 5 cited are SIMD libraries, so we really have 2

@abrown (Contributor) commented Jan 11, 2021

Code-sequence-wise, these are rather unfortunate on a large number of x86 machines. As with #417, it might make sense to wait on these until more machines have something like VP[MIN|MAX]UQ.

@penzn (Contributor) commented Jan 13, 2021

Same comment as @abrown. These sequences seem to be a bit worse than the ones in #417.

@dtig (Member) commented Jan 25, 2021

Adding a preliminary vote below for the inclusion of i64x2 unsigned min/max operations in the SIMD proposal. Please vote with:

👍 For including i64x2 unsigned min/max operations
👎 Against including i64x2 unsigned min/max operations

@ngzhian added the 2021-01-29 Agenda for sync meeting 1/29/21 label Jan 26, 2021
@dtig removed the 2021-01-29 Agenda for sync meeting 1/29/21 label Feb 2, 2021
@Maratyszcza (Contributor, Author) commented:

The community group unanimously decided against including these instructions in the 1/29/21 meeting (#429).

@Maratyszcza closed this Feb 4, 2021