This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

Saturating Rounding Q-format Multiplication #365

Merged 1 commit into WebAssembly:master on Jan 11, 2021

Conversation

Maratyszcza
Contributor

@Maratyszcza commented Sep 25, 2020

Introduction

Fixed-point algorithms often represent fractional numbers in Q format, where a fixed number of bits is allocated to the fractional part. Addition and subtraction in Q format are no different from addition and subtraction on integers, but multiplication is substantially different. First, to produce a multiplication result in the same Q format as the factors, we need to compute the full product of the factors (i.e. with twice the number of bits of the factors), and then arithmetic-shift it right by Q bits. Secondly, as the right shift loses some of the bits of the product, it is typically done with rounding to the nearest Q-format number to minimize the rounding error. Thus, for the Q15 format (i.e. with 15 fractional bits) the multiplication result is computed as (a * b + 0x4000) >> 15, where a and b are the integer representations of the input factors.
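
In scalar C, the rounded Q15 multiply described above can be sketched as follows (an illustrative model with a hypothetical function name, not part of the proposal text):

```c
#include <stdint.h>

/* Rounded Q15 multiplication: take the full 32-bit product, add
   0x4000 (half of 1 << 15) to round to the nearest Q15 value, then
   arithmetic-shift right by 15.  This plain version wraps on the
   single overflowing input pair (INT16_MIN, INT16_MIN); saturation
   is discussed separately. */
int16_t q15_mul_round(int16_t a, int16_t b) {
    int32_t product = (int32_t)a * (int32_t)b;
    return (int16_t)((product + 0x4000) >> 15);
}
```

For example, 0.5 × 0.5 in Q15 is q15_mul_round(16384, 16384), which yields 8192, i.e. 0.25.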

Despite its involved low-level definition, Q-format multiplication is commonly used in fixed-point algorithms and widely supported in hardware. x86 supports multiplication of 16-bit numbers in Q15 format since SSSE3, and ARM NEON supports multiplication of 16-bit numbers in Q15 format and 32-bit numbers in Q31 format. This proposal suggests adding 16-bit Q15-format multiplication to the WebAssembly SIMD instruction set. The native x86 and ARM variants of this instruction differ in how they handle overflow: the x86 version wraps around while the ARM version saturates. The overflow happens only when both inputs are INT16_MIN, and can be corrected with a couple of instructions. For the purpose of bitwise compatibility this proposal standardizes on the saturating variant as the more mathematically meaningful.
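
A scalar sketch of the saturating semantics (hypothetical helper name; the clamp is only ever hit for the one overflowing input pair):

```c
#include <stdint.h>

/* Saturating, rounding Q15 multiplication.  The rounded product
   exceeds int16_t range for exactly one input pair: a == b == INT16_MIN,
   where the mathematical result is +1.0 (32768).  ARM's SQRDMULH clamps
   this lane to INT16_MAX, while x86's PMULHRSW wraps it to INT16_MIN. */
int16_t q15mulr_sat(int16_t a, int16_t b) {
    int32_t rounded = ((int32_t)a * (int32_t)b + 0x4000) >> 15;
    if (rounded > INT16_MAX) rounded = INT16_MAX;  /* only for (-32768, -32768) */
    return (int16_t)rounded;
}
```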

The Q-format multiplication instruction is particularly important for fixed-point neural network inference, and was previously requested by @bjacob in #221.

Applications

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.

x86/x86-64 processors with AVX instruction set

  • y = i16x8.q15mulr_sat_s(a, b) is lowered to
    • VPMULHRSW xmm_y, xmm_a, xmm_b
    • VPCMPEQW xmm_tmp, xmm_y, wasm_i16x8_splat(0x8000)
    • VPXOR xmm_y, xmm_y, xmm_tmp
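
The compare-and-XOR pair exists because PMULHRSW wraps the single overflowing lane. A scalar C model of this lowering (illustrative only; real engines operate on whole 128-bit registers, and the function names here are mine):

```c
#include <stdint.h>

/* One lane of PMULHRSW: the rounded Q15 product, wrapping on overflow. */
int16_t pmulhrsw_lane(int16_t a, int16_t b) {
    return (int16_t)(((int32_t)a * (int32_t)b + 0x4000) >> 15);
}

/* The fixup: PMULHRSW can output 0x8000 only for the overflowing inputs
   (INT16_MIN, INT16_MIN); every in-range product rounds to -32767 at
   worst.  So XOR-ing lanes equal to 0x8000 with an all-ones compare
   mask rewrites exactly the wrapped lane from -32768 to +32767. */
int16_t q15mulr_sat_via_x86(int16_t a, int16_t b) {
    uint16_t y = (uint16_t)pmulhrsw_lane(a, b);       /* VPMULHRSW */
    uint16_t mask = (y == 0x8000) ? 0xFFFF : 0x0000;  /* VPCMPEQW  */
    return (int16_t)(y ^ mask);                       /* VPXOR     */
}
```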

x86/x86-64 processors with SSSE3 instruction set

  • y = i16x8.q15mulr_sat_s(a, b) is lowered to
    • MOVDQA xmm_y, xmm_a
    • MOVDQA xmm_tmp, wasm_i16x8_splat(0x8000)
    • PMULHRSW xmm_y, xmm_b
    • PCMPEQW xmm_tmp, xmm_y
    • PXOR xmm_y, xmm_tmp

x86/x86-64 processors with SSE2 instruction set

  • y = i16x8.q15mulr_sat_s(a, b) (y is NOT a and y is NOT b) is lowered to
    • MOVDQA xmm_y, xmm_a
    • MOVDQA xmm_tmp, xmm_a
    • PMULLW xmm_y, xmm_b
    • PMULHW xmm_tmp, xmm_b
    • PSRLW xmm_y, 14
    • PADDW xmm_tmp, xmm_tmp
    • PAVGW xmm_y, wasm_i16x8_splat(0)
    • PADDW xmm_y, xmm_tmp
    • MOVDQA xmm_tmp, wasm_i16x8_splat(0x8000)
    • PCMPEQW xmm_tmp, xmm_y
    • PXOR xmm_y, xmm_tmp
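
Without PMULHRSW, the SSE2 sequence rebuilds the rounded shift from the two halves of the product: for unsigned low half lo and signed high half hi, (p + 0x4000) >> 15 equals 2*hi + ((lo + 0x4000) >> 15), and the PSRLW-by-14 plus PAVGW-with-zero pair computes the second term, carry included, since PAVGW rounds up. A scalar model of the lane-wise arithmetic (illustrative only, not engine code):

```c
#include <stdint.h>

int16_t q15mulr_sat_via_sse2(int16_t a, int16_t b) {
    int32_t p = (int32_t)a * (int32_t)b;
    uint16_t lo = (uint16_t)p;                  /* PMULLW: low 16 bits   */
    int16_t  hi = (int16_t)(p >> 16);           /* PMULHW: high 16 bits  */
    uint16_t y  = lo >> 14;                     /* PSRLW y, 14           */
    uint16_t t  = (uint16_t)((int32_t)hi * 2);  /* PADDW tmp, tmp (wraps) */
    y = (uint16_t)((y + 0 + 1) >> 1);           /* PAVGW y, 0: rounding  */
    y = (uint16_t)(y + t);                      /* PADDW y, tmp          */
    uint16_t mask = (y == 0x8000) ? 0xFFFF : 0; /* PCMPEQW w/ 0x8000     */
    return (int16_t)(y ^ mask);                 /* PXOR: 0x8000 -> 0x7FFF */
}
```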

ARM64 processors

  • y = i16x8.q15mulr_sat_s(a, b) is lowered to
    • SQRDMULH Vy.8H, Va.8H, Vb.8H

ARMv7 processors with NEON instruction set

  • y = i16x8.q15mulr_sat_s(a, b) is lowered to
    • VQRDMULH.S16 Qy, Qa, Qb

@ngzhian
Member

ngzhian commented Sep 29, 2020

What is the alternative without this instruction? The asymmetry doesn't look great here (5 vs. 1 instructions).

@Maratyszcza
Contributor Author

Maratyszcza commented Sep 29, 2020

@ngzhian With AVX it is just 3 instructions. Emulation in the current WebAssembly SIMD specification is way more expensive:

#include <wasm_simd128.h>

v128_t wasm_i16x8_q15mulr_sat(v128_t a, v128_t b) {
  v128_t lo = wasm_i32x4_mul(wasm_i32x4_widen_low_i16x8(a), wasm_i32x4_widen_low_i16x8(b));
  v128_t hi = wasm_i32x4_mul(wasm_i32x4_widen_high_i16x8(a), wasm_i32x4_widen_high_i16x8(b));
  const v128_t inc = wasm_i32x4_splat(0x4000);
  lo = wasm_i32x4_add(lo, inc);
  hi = wasm_i32x4_add(hi, inc);
  lo = wasm_i32x4_shr(lo, 15);
  hi = wasm_i32x4_shr(hi, 15);
  return wasm_i16x8_narrow_i32x4(lo, hi);
}

@ngzhian
Member

ngzhian commented Sep 29, 2020

Sure, but I'm looking at V8's SIMD baseline, which is SSE4.1, so that's 5.
Yeah, agreed, the emulation looks bad.

I'm not familiar with this Q format, do you expect other instructions from this "class" to be useful? This suggestion is different from the rest in that it is just a single instruction.

I think this is worth prototyping just to see the improvements; it also has a variety of use cases (across ML and codecs). I'll get started on this.

@Maratyszcza
Contributor Author

Other instructions are just regular integer saturated addition/subtraction. Q31 multiplication is useful, but it is natively supported only on ARM, and emulation on x86 would be very expensive.

@penzn
Contributor

penzn commented Sep 30, 2020

The asymmetry is likely to be worse than 5 vs 1, as the wasm_i16x8_splat(0x8000) values would need to be emitted as well.

@Maratyszcza
Contributor Author

@penzn wasm_i16x8_splat(0x8000) is just an in-memory constant literal. V8 currently doesn't use in-memory constants for v128 literal values, but it is a temporary limitation, and thus shouldn't be a factor when considering instructions.

@penzn
Contributor

penzn commented Oct 1, 2020

I am not sure this would be worth it even if the functionality did exist. In this case the mov would be from memory, which is not exactly cheap, and would lead to perf drops when the value is not in the cache.

As for the current state of things, I don't think any engines maintain lists of v128 constants (@lars-t-hansen correct me if I am wrong), which realistically means that for the time being the instruction count on x86 would be much worse than listed here.

I think we need to keep implementation reality in mind, at the very least because we depend on implementations in order to move the proposal forward.

@lars-t-hansen
Contributor

As for the current state of things, I don't think any engines maintain lists of v128 constants (@lars-t-hansen correct me if I am wrong), which realistically means that for the time being the instruction count on x86 would be much worse than listed here.

Not 100% sure what you're thinking about here, but our optimizing compiler will constant-fold wasm_i16x8_splat(0x8000) into a loadable constant, and may merge multiple occurrences of it to produce a reusable constant in a register under suitable conditions. I confess I've not put a lot of effort into such low-level optimizations of SIMD code yet, and it's unlikely that we'll get to the point where we're trying to determine whether it's best to put the constant in a register or to load it from memory for each use; the placement decision will likely be driven by register pressure and ad-hoc heuristics in the instruction selector (given current architecture).

@ngzhian
Member

ngzhian commented Oct 6, 2020

Prototyped on arm64 as of https://chromium-review.googlesource.com/c/v8/v8/+/2438990, should see it in canary tomorrow.

tlively added a commit to llvm/llvm-project that referenced this pull request Oct 9, 2020
This saturating, rounding, Q-format multiplication instruction is proposed in
WebAssembly/simd#365.

Differential Revision: https://reviews.llvm.org/D88968
@penzn
Contributor

penzn commented Oct 10, 2020

Not to pick on this PR in particular, as this is a common trend lately, but how many of these projects are porting or have been ported to WebAssembly:

(this is a copy of the list in the PR description)

@ngzhian
Member

ngzhian commented Oct 10, 2020

Skia has a Wasm port (https://skia.org/user/modules/canvaskit) that seems to be actively maintained (the most serious one).

The rest of the projects turn up some experimental ports (such as https://github.com/GoogleChromeLabs/webm-wasm), but I'm not sure if anything serious (happy to be corrected).

Good question though Petr, we should focus on use cases that are being worked on.

@tlively
Member

tlively commented Oct 10, 2020

Various folks are building libvpx to Wasm, and it looks like it is part of some official WebRTC code.

The dav1d AV1 codec builds to WebAssembly.

Here's a blog post about someone using gemmlowp with Emscripten to make a shower timer, of all things 🚿

Didn't find anything in particular for OpenVINO, but this is still more than I expected to find.

@Maratyszcza
Contributor Author

IMO the list of applications is not to show which apps will use the new instruction as soon as it is available, but rather to show that the instruction is useful for different applications. Practically speaking, Emscripten includes emulation of SSE and NEON intrinsics, so most codebases written with these SIMD intrinsics can be ported with little effort.

@penzn
Contributor

penzn commented Oct 14, 2020

@ngzhian, @tlively, good list, thank you! Especially the shower timer 🙄 I am (predictably) curious whether any of those use cases can be used to test Wasm SIMD vs Wasm in the context of this PR. Seems like Skia could be a good example.

The dav1d AV1 codec builds to WebAssembly.

That seems to be a different AV1 codec though than the one in the description.

IMO the list of applications is not to show which apps will use the new instruction as soon as it is available, but rather to show that the instruction is useful for different applications.

I don't think this is completely fair - if we are to delay the proposal longer, we need to understand how code in the wild would benefit from the proposed operations. In a sense, consider this a re-phrasing of @binji's question in #343 (comment)

Practically speaking, Emscripten includes emulation of SSE and NEON intrinsics, so most codebases written with these SIMD intrinsics can be ported with little effort.

Only if we consider compiling that code as the end goal. Hitting a particular operation does not guarantee performance, due to various architectural quirks (see the discussion on hadd, for example). I really think further changes to the proposal should be tested, because (a) the proposal has been available to users for some time and (b) we are venturing into either controversial territory (hadd, popcount) or starting to rework things that have been stable for some time (ne, eq).

Again, this is not to pick on this particular operation, but to illustrate the point of having real-world tests.

tlively added a commit to tlively/binaryen that referenced this pull request Oct 27, 2020
Including saturating, rounding Q15 multiplication as proposed in
WebAssembly/simd#365 and extending multiplications as
proposed in WebAssembly/simd#376. Since these are just
prototypes, skips adding them to the C or JS APIs and the fuzzer, as well as
implementing them in the interpreter.
tlively added a commit to WebAssembly/binaryen that referenced this pull request Oct 28, 2020
Including saturating, rounding Q15 multiplication as proposed in
WebAssembly/simd#365 and extending multiplications as
proposed in WebAssembly/simd#376. Since these are just
prototypes, skips adding them to the C or JS APIs and the fuzzer, as well as
implementing them in the interpreter.
@tlively
Member

tlively commented Oct 28, 2020

This instruction has landed in both LLVM and Binaryen and should be ready to benchmark in tip-of-tree Emscripten in a few hours. The builtin function to use is __builtin_wasm_q15mulr_saturate_s_i8x16.

@bjacob

bjacob commented Oct 28, 2020

"i8x16" ? or "i16x8" ?

@tlively
Member

tlively commented Oct 28, 2020

Oh haha I committed it with the wrong name. I'll push a quick fix right now. Thanks for the catch!

@tlively
Member

tlively commented Oct 28, 2020

The name has now been fixed to __builtin_wasm_q15mulr_saturate_s_i16x8.

@sbc100
Member

sbc100 commented Oct 28, 2020

The name has now been fixed to __builtin_wasm_q15mulr_saturate_s_i16x8.

We are starting to get almost java-like in the length of our identifiers here!

@bjacob

bjacob commented Oct 29, 2020

nah, java would be AbstractBuiltinWasmQ15MulrSaturateSI16x8ProducerBuilderFactory

@Maratyszcza
Contributor Author

I evaluated the performance impact of the proposed instruction by porting the fixed-point headers in gemmlowp to WebAssembly SIMD and benchmarking the fixed-point (16-bit) sigmoid implementation from gemmlowp (this implementation is used in TensorFlow Lite). Benchmark results are presented below:

| Processor (Device) | WAsm SIMD + i16x8.q15mulr_sat_s | WAsm SIMD (baseline) | Speedup |
| --- | --- | --- | --- |
| Snapdragon 855 (LG G8 ThinQ) | 893 MB/s | 244 MB/s | 3.7X |
| Snapdragon 670 (Pixel 3a) | 482 MB/s | 140 MB/s | 3.4X |
| Exynos 8895 (Galaxy S8) | 681 MB/s | 171 MB/s | 4.0X |

@ngzhian
Member

ngzhian commented Dec 15, 2020

How large is the benchmark? Is it a small benchmark exercising just the sigmoid implementation (single function), or is it some end-to-end inference benchmark in TF Lite?

@Maratyszcza
Contributor Author

The benchmark is for a single operator (fixed-point Sigmoid). Here's the function being benchmarked:

void Sigmoid(const int16_t* input_ptr, int16_t* output_ptr, size_t elements) {
  assert(elements % 16 == 0);

  // F0 uses 0 integer bits, range [-1, 1].
  // This is the return type of math functions such as tanh, logistic,
  // whose range is in [-1, 1].
  using F0 = gemmlowp::FixedPoint<gemmlowp::int16x8_v128_t, 0>;
  // F3 uses 3 integer bits, range [-8, 8], the input range expected here.
  using F3 = gemmlowp::FixedPoint<gemmlowp::int16x8_v128_t, 3>;

  do {
    F3 input0 =
        F3::FromRaw(gemmlowp::to_int16x8_v128_t(wasm_v128_load(input_ptr)));
    F3 input1 =
        F3::FromRaw(gemmlowp::to_int16x8_v128_t(wasm_v128_load(input_ptr + 8)));
    F0 output0 = gemmlowp::logistic(input0);
    F0 output1 = gemmlowp::logistic(input1);
    wasm_v128_store(output_ptr, output0.raw().v);
    wasm_v128_store(output_ptr + 8, output1.raw().v);

    elements -= 16;
    input_ptr += 16;
    output_ptr += 16;
  } while (elements != 0);
}

@Maratyszcza
Contributor Author

Maratyszcza commented Dec 18, 2020

The gemmlowp code changes are in the google/gemmlowp#202 PR.

@penzn
Contributor

penzn commented Dec 19, 2020

Would this change affect xnnpack as well? What would be the effect there?

The gemmlowp code changes are in the google/gemmlowp#202 PR.

How does the comparison work then? Looks like there is just one implementation.

Is this the implementation of logistic:

// Returns logistic(x) = 1 / (1 + exp(-x)) for any x.
template <typename tRawType, int tIntegerBits>
FixedPoint<tRawType, 0> logistic(FixedPoint<tRawType, tIntegerBits> a) {
  typedef FixedPoint<tRawType, tIntegerBits> InputF;
  typedef FixedPoint<tRawType, 0> ResultF;
  tRawType mask_if_positive = MaskIfGreaterThan(a, InputF::Zero());
  tRawType mask_if_zero = MaskIfZero(a);
  InputF abs_input = SelectUsingMask(mask_if_positive, a, -a);
  ResultF result_if_positive = logistic_on_positive_values(abs_input);
  ResultF result_if_negative = ResultF::One() - result_if_positive;
  const ResultF one_half =
      GEMMLOWP_CHECKED_FIXEDPOINT_CONSTANT(ResultF, 1 << 30, 0.5);
  return SelectUsingMask(mask_if_zero, one_half,
                         SelectUsingMask(mask_if_positive, result_if_positive,
                                         result_if_negative));
}

@Maratyszcza
Contributor Author

Would this change affect xnnpack as well? What would be the effect there?

No, XNNPACK is unaffected. It might be in the future when it implements the same operator.

How does the comparison work then? Looks like there is just one implementation.

Replace this line with #if 1 to enable the i16x8.q15mulr_sat_s instruction.

Is this the implementation of logistic:

This is the entry point. Note that it calls other inline functions, e.g. logistic_on_positive_values.

@omnisip

omnisip commented Dec 22, 2020

Discussed in #402 (12/22/2020 Sync Meeting) -- https://docs.google.com/document/d/1Tnf-fvRcCVj_vv8CjAN_qUvj4Wa57JnPa8ghGi4_C-s/edit# -- Currently awaiting x64 prototypes (@ngzhian) to do x64 benchmarks.

@Maratyszcza
Contributor Author

Here are the additional results on x86-64 systems:

| Processor | WAsm SIMD + i16x8.q15mulr_sat_s | WAsm SIMD (baseline) | Speedup |
| --- | --- | --- | --- |
| Intel Xeon W-2135 | 1218 MB/s | 430 MB/s | 2.8X |
| Intel Celeron N3060 | 276 MB/s | 108 MB/s | 2.6X |
| AMD PRO A10-8700B | 679 MB/s | 268 MB/s | 2.5X |

@@ -108,6 +108,7 @@
| i8x16.max_u | 0x79 | i16x8.max_u | 0x99 | i32x4.max_u | 0xb9 | ---- | 0xd9 |
| ---------------- | 0x7a | ---------------- | 0x9a | i32x4.dot_i16x8_s | 0xba | ---- | 0xda |
| i8x16.avgr_u | 0x7b | i16x8.avgr_u | 0x9b | ---- avgr_u ---- | 0xbb | ---- | 0xdb |
| ---- | 0x7c | i16x8.q15mulr_sat_s | 0x9c | ---- | 0xbc | ---- | 0xdc |
Member

Can you add TBD here, too, so that the two opcode files are consistent?

Contributor Author

Reverted the modification of this file

@dtig dtig merged commit df999c8 into WebAssembly:master Jan 11, 2021
def S.q15mulr_sat_s(a, b):
    def subq15mulr(x, y):
        return S.SignedSaturate((x * y + 0x4000) >> 15)
    return S.lanewise_binary(subsat, S.AsSigned(a), S.AsSigned(b))

Hi, should this subsat be subq15mulr?

Contributor Author

Right

Member

Oops, didn't catch that during review - fixed in #424.

ngzhian added a commit to ngzhian/simd that referenced this pull request Feb 11, 2021
ngzhian added a commit that referenced this pull request Feb 17, 2021
* [interpreter] Implement i16x8.qmulr_sat_s

This was merged in #365.

* Update interpreter/exec/int.ml

Co-authored-by: Andreas Rossberg <rossberg@mpi-sws.org>

Co-authored-by: Andreas Rossberg <rossberg@mpi-sws.org>
arichardson pushed a commit to arichardson/llvm-project that referenced this pull request Mar 24, 2021
This saturating, rounding, Q-format multiplication instruction is proposed in
WebAssembly/simd#365.

Differential Revision: https://reviews.llvm.org/D88968