This repository was archived by the owner on Dec 22, 2021. It is now read-only.

@Maratyszcza
Contributor

Introduction

ANDNOT is a widely supported SIMD operation which computes a & ~b. ANDNOT is involved in a common idiom of zeroing out elements which satisfy a condition, i.e.

float x = ...;
const bool cond = ...;
if (cond) x = 0;

In the present SIMD instruction set, a vectorized version of this snippet would require two WAsm SIMD instructions, v128.and and v128.not:

v128 x = ...;
const v128 cond = ...;
x = v128.and(x, v128.not(cond));

Representing ANDNOT as two instructions is inefficient on all architectures:

  • On ARM, ARM64, and PowerPC it requires two machine instructions even though ANDNOT has an exact equivalent in their SIMD instruction sets.
  • On x86 and x86-64 it requires three machine instructions because x86 SIMD extensions (until AVX512) do not include a SIMD NOT operation. Thus, a WAsm engine would typically generate three instructions: two to emulate v128.not (PCMPEQD tmp_ones, tmp_ones to fill a temporary register with all ones, and PANDN b, tmp_ones (!) to invert b), and an extra PAND.

This PR introduces a combined ANDNOT instruction to enable WebAssembly engines to directly leverage architecture-specific ANDNOT instructions without doing complicated and expensive analysis of the instruction stream.
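As a rough illustration of the equivalence described above, the following portable C sketch models a 128-bit vector as four 32-bit lanes. The `v128_model` type and the helper names are hypothetical and exist only for this example; they are not part of the proposal or of any engine.

```c
#include <stdint.h>

/* Hypothetical scalar model of a 128-bit vector: four 32-bit lanes. */
typedef struct { uint32_t lane[4]; } v128_model;

/* Model of v128.and: lane-wise bitwise AND. */
static inline v128_model v128_and(v128_model a, v128_model b) {
    v128_model r;
    for (int i = 0; i < 4; i++) r.lane[i] = a.lane[i] & b.lane[i];
    return r;
}

/* Model of v128.not: lane-wise bitwise complement. */
static inline v128_model v128_not(v128_model a) {
    v128_model r;
    for (int i = 0; i < 4; i++) r.lane[i] = ~a.lane[i];
    return r;
}

/* Model of the proposed v128.andnot: a & ~b in a single operation. */
static inline v128_model v128_andnot(v128_model a, v128_model b) {
    v128_model r;
    for (int i = 0; i < 4; i++) r.lane[i] = a.lane[i] & ~b.lane[i];
    return r;
}
```

With a mask whose lanes are either all ones or all zeros, `v128_andnot(x, cond)` clears exactly the lanes of `x` where the mask is set, and it always produces the same bits as `v128_and(x, v128_not(cond))`.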

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instruction can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.

x86/x86-64 processors with AVX instruction set

  • c = v128.andnot(a, b) maps to VANDNPS xmm_c, xmm_b, xmm_a (note the inverted order of operands)

x86/x86-64 processors with SSE instruction set

  • c = v128.andnot(a, b) maps to MOVAPS xmm_c, xmm_b + ANDNPS xmm_c, xmm_a (note the inverted order of operands)
  • b = v128.andnot(a, b) maps to ANDNPS xmm_b, xmm_a (note the inverted order of operands)

ARMv7+ processors with NEON instruction set

  • c = v128.andnot(a, b) maps to VBIC Qc, Qa, Qb

ARM64 processors

  • c = v128.andnot(a, b) maps to BIC Vc.16B, Va.16B, Vb.16B

POWER processors with Vector facility (VMX)

  • c = v128.andnot(a, b) maps to VANDC Vc, Va, Vb
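The operand-order notes above can be checked with a small scalar model. This is only a sketch on a single 32-bit lane; the function names are made up for the example and do not correspond to any engine's code generator.

```c
#include <stdint.h>

/* x86 ANDNPS / PANDN semantics on one lane: the FIRST (destination)
   operand is the one that gets inverted: dst = ~dst & src. */
static inline uint32_t x86_andn(uint32_t dst, uint32_t src) {
    return ~dst & src;
}

/* ARM BIC / POWER VANDC semantics on one lane: the SECOND operand
   is the one that gets inverted: result = a & ~b. */
static inline uint32_t arm_bic(uint32_t a, uint32_t b) {
    return a & ~b;
}

/* v128.andnot(a, b) = a & ~b, lowered through the x86 form:
   the operands must be swapped. */
static inline uint32_t wasm_andnot_via_x86(uint32_t a, uint32_t b) {
    return x86_andn(b, a);
}

/* v128.andnot(a, b) lowered through the ARM / POWER form:
   the operands keep their order. */
static inline uint32_t wasm_andnot_via_bic(uint32_t a, uint32_t b) {
    return arm_bic(a, b);
}
```

Both lowerings produce `a & ~b`. Calling `x86_andn(a, b)` without swapping would compute `~a & b` instead, which is why the x86 mappings list `xmm_b` before `xmm_a`.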

@dtig
Member

dtig commented Sep 13, 2019

The NOT operation being suboptimal on x86 and x86-64 stood out during implementation, and having a more optimal ANDNOT operation when supported across architectures would make sense to me. After preliminary research, it doesn't look like any of the hardware instructions these operations map to have performance issues. That said, I'd like to make sure that this operation is widely used. Could you add examples where this is already in use? Anything that makes use of the equivalent XMM/Neon intrinsics would be helpful.

@Maratyszcza
Contributor Author

Maratyszcza commented Sep 14, 2019

GitHub search shows 127K+ files using _mm_andnot_ps (x86 intrinsic for ANDNOT) and 40K+ C++ files using vbicq_u32 (ARM NEON intrinsic for ANDNOT). From my experience, ANDNOT is useful when implementing vector mathematical functions, e.g. expf must return +0.0f when input is below -0x1.9FE368p+6f:

y = _mm_andnot_ps(_mm_cmplt_ps(x, _mm_set1_ps(-0x1.9FE368p+6f)), y);

Note: inverting the comparison and using _mm_and_ps wouldn't work, because it would change the result for NaN values.
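The NaN point can be seen in a scalar sketch that mimics the mask-based idiom on a single float. The helper names below are hypothetical; the masks stand in for what a SIMD comparison would produce for one lane (all ones when the comparison is true, all zeros otherwise).

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a float as its 32-bit pattern and back. */
static inline uint32_t float_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
static inline float bits_float(uint32_t u) {
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Lane masks as a SIMD compare would produce them.
   Any ordered comparison with NaN is false, so both masks are 0 for NaN. */
static inline uint32_t mask_lt(float x, float t) { return x < t ? 0xFFFFFFFFu : 0u; }
static inline uint32_t mask_ge(float x, float t) { return x >= t ? 0xFFFFFFFFu : 0u; }

/* ANDNOT form: y is cleared only when x < t; for NaN input y passes through. */
static inline float flush_andnot(float x, float y, float t) {
    return bits_float(float_bits(y) & ~mask_lt(x, t));
}

/* Inverted comparison + AND: y is kept only when x >= t,
   so a NaN input clears y as well. */
static inline float flush_and(float x, float y, float t) {
    return bits_float(float_bits(y) & mask_ge(x, t));
}
```

For ordinary inputs the two forms agree, but when `x` is NaN the ANDNOT form leaves `y` untouched (so a NaN result propagates), while the AND form replaces it with +0.0f.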

@dtig
Member

dtig commented Sep 19, 2019

Thanks @Maratyszcza! I'm in favor of merging this because it does not introduce a new class of operations, but rather a more efficient combination of operations that are already supported by this proposal. It has an exact match on the relevant architectures and is widely used, as pointed out above. Are there any objections to including this in the proposal?

Member

@dtig dtig left a comment

It doesn't look like there are any objections to the inclusion of these operations, so change lgtm with minor nits. Could you add an entry to ImplementationStatus.md as well?

| `v128.load` | `0x00`| m:memarg |
| `v128.store` | `0x01`| m:memarg |
| `v128.const` | `0x02`| i:ImmByte[16] |
| `v128.andnot` | `0x03`| - |
Member

Could you move this down to the end of the opcode space instead of here? This belongs with the other logical operations, but there isn't a good spot for it right now. It's what we've tried to do with other new opcodes when it doesn't fit into the current opcode space.

@Maratyszcza
Contributor Author

Done: added an entry in ImplementationStatus.md and moved to the end of the opcode list.

@Maratyszcza Maratyszcza requested a review from dtig September 24, 2019 11:19
@Maratyszcza Maratyszcza changed the title from "Add ANDNOT operation" to "ANDNOT operation" Sep 24, 2019
tlively added a commit to tlively/binaryen that referenced this pull request Sep 24, 2019
As specified at WebAssembly/simd#102.

Also fixes bugs in the JS API for other SIMD bitwise operators.
@arunetm arunetm merged commit f31b325 into WebAssembly:master Sep 24, 2019
tlively added a commit to WebAssembly/binaryen that referenced this pull request Sep 24, 2019
As specified at WebAssembly/simd#102.

Also fixes bugs in the JS API for other SIMD bitwise operators.