Saturating Rounding Q-format Multiplication #365

Maratyszcza · 2020-09-25T18:57:14Z

Introduction

Fixed-point algorithms often represent fractional numbers in Q format, where a fixed number of bits is allocated to the fractional part. Addition and subtraction in Q format are no different from addition and subtraction on integers, but multiplication is substantially different. First, to produce the multiplication result in the same Q format as the factors, we need to compute the full product of the factors (i.e. with twice as many bits as the factors), and then do an arithmetic shift right by Q bits. Second, as the right shift loses some of the bits of the product, it is typically done with rounding to the nearest Q-format number to minimize the rounding error. Thus, for the Q15 format (i.e. with 15 fractional bits) the multiplication result is computed as (a * b + 0x4000) >> 15, where a and b are the integer representations of the input factors.
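As a rough illustration of the formula above, here is a minimal scalar sketch in C. The names `q15_mul_round` and `q15_mul_round_examples` are made up for this example and are not part of the proposal; the result is kept in 32 bits so that no overflow handling is needed yet.

```c
#include <assert.h>
#include <stdint.h>

/* Rounding Q15 multiply with no overflow handling. The result is returned
 * as int32_t because INT16_MIN * INT16_MIN rounds to 32768, which does not
 * fit in 16 bits. Assumes >> on a negative int32_t is an arithmetic shift
 * (implementation-defined in C, but true on mainstream compilers). */
static int32_t q15_mul_round(int16_t a, int16_t b) {
  int32_t product = (int32_t)a * (int32_t)b; /* full product, 30 fractional bits */
  return (product + 0x4000) >> 15;           /* add half an output step, drop 15 bits */
}

/* Worked examples; a Q15 raw value is the real value times 32768. */
static void q15_mul_round_examples(void) {
  assert(q15_mul_round(16384, 16384) == 8192);  /* 0.5 * 0.5 = 0.25 */
  assert(q15_mul_round(16384, -8192) == -4096); /* 0.5 * -0.25 = -0.125 */
  assert(q15_mul_round(3, 16384) == 2);         /* 1.5 output steps round up to 2 */
}
```

Calling `q15_mul_round_examples()` once (for instance from `main`) exercises the three cases.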

Despite its somewhat involved low-level definition, Q-format multiplication is commonly used in fixed-point algorithms and is widely supported in hardware. x86 has supported multiplication of 16-bit numbers in Q15 format since SSSE3, and ARM NEON supports multiplication of 16-bit numbers in Q15 format and 32-bit numbers in Q31 format. This proposal suggests adding 16-bit Q15-format multiplication to the WebAssembly SIMD instruction set. The native x86 and ARM variants of this instruction differ in how they handle overflow: the x86 version wraps around while the ARM version saturates. The overflow happens only when both inputs are INT16_MIN, and can be corrected with a couple of instructions. For the purpose of bitwise compatibility, this proposal standardizes on the saturating variant as the more mathematically meaningful one.
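To make the difference between the two overflow behaviors concrete, the following scalar sketch models one lane of each variant. It is only an illustration under stated assumptions: the function names are hypothetical, and the code models the behavior described above rather than quoting any vendor pseudocode.

```c
#include <assert.h>
#include <stdint.h>

/* Rounded Q15 product before narrowing; can be 32768, which overflows int16_t.
 * Assumes >> on a negative int32_t is an arithmetic shift. */
static int32_t q15_rounded_wide(int16_t a, int16_t b) {
  return ((int32_t)a * (int32_t)b + 0x4000) >> 15;
}

/* Wrapping variant: keep the low 16 bits, as the x86 instruction does. */
static int16_t q15_mulr_wrap(int16_t a, int16_t b) {
  uint16_t bits = (uint16_t)q15_rounded_wide(a, b);
  return (int16_t)(bits < 0x8000 ? (int32_t)bits : (int32_t)bits - 0x10000);
}

/* Saturating variant: clamp to INT16_MAX, as the ARM instruction and the
 * proposed i16x8.q15mulr_sat_s do. No lower clamp is needed because the
 * smallest rounded product is -32767. */
static int16_t q15_mulr_sat(int16_t a, int16_t b) {
  int32_t wide = q15_rounded_wide(a, b);
  return (int16_t)(wide > INT16_MAX ? INT16_MAX : wide);
}

static void q15_overflow_examples(void) {
  assert(q15_mulr_wrap(INT16_MIN, INT16_MIN) == INT16_MIN); /* -1 * -1 wraps to -1 */
  assert(q15_mulr_sat(INT16_MIN, INT16_MIN) == INT16_MAX);  /* clamps just below +1 */
  assert(q15_mulr_wrap(INT16_MIN, INT16_MAX) == q15_mulr_sat(INT16_MIN, INT16_MAX));
}
```

For every input pair other than (INT16_MIN, INT16_MIN) the two functions agree, which is why the x86 result can be patched up cheaply.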

Q-format multiplication instruction is particularly important for fixed-point neural network inference, and was previously requested by @bjacob in #221.

Applications

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.

x86/x86-64 processors with AVX instruction set

y = i16x8.q15mulr_sat_s(a, b) is lowered to
- VPMULHRSW xmm_y, xmm_a, xmm_b
- VPCMPEQW xmm_tmp, xmm_y, wasm_i16x8_splat(0x8000)
- VPXOR xmm_y, xmm_y, xmm_tmp

x86/x86-64 processors with SSSE3 instruction set

y = i16x8.q15mulr_sat_s(a, b) is lowered to
- MOVDQA xmm_y, xmm_a
- MOVDQA xmm_tmp, wasm_i16x8_splat(0x8000)
- PMULHRSW xmm_y, xmm_b
- PCMPEQW xmm_tmp, xmm_y
- PXOR xmm_y, xmm_tmp
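The compare-and-XOR tail shared by the AVX and SSSE3 sequences works because the wrapped result equals 0x8000 only in the overflow case (the smallest in-range rounded product is -32767). A scalar model of one lane can be used to check this; the sketch below is an illustration with hypothetical names (`model_pmulhrsw`, `model_fixup`, `reference_sat`, `count_mismatches`), not engine code.

```c
#include <assert.h>
#include <stdint.h>

/* Interpret 16 lane bits as a signed value without relying on
 * implementation-defined narrowing conversions. */
static int32_t lane_as_signed(uint16_t bits) {
  return bits < 0x8000 ? (int32_t)bits : (int32_t)bits - 0x10000;
}

/* Scalar model of one PMULHRSW lane: rounded Q15 product, low 16 bits kept.
 * Assumes >> on a negative int32_t is an arithmetic shift. */
static uint16_t model_pmulhrsw(uint16_t a, uint16_t b) {
  int32_t product = lane_as_signed(a) * lane_as_signed(b);
  return (uint16_t)((product + 0x4000) >> 15);
}

/* Scalar model of the PCMPEQW + PXOR fix-up. */
static uint16_t model_fixup(uint16_t y) {
  uint16_t mask = (y == 0x8000) ? 0xFFFF : 0x0000; /* PCMPEQW against 0x8000 */
  return (uint16_t)(y ^ mask);                     /* PXOR: 0x8000 becomes 0x7FFF */
}

/* Reference: saturating result computed directly. */
static uint16_t reference_sat(uint16_t a, uint16_t b) {
  int32_t wide = (lane_as_signed(a) * lane_as_signed(b) + 0x4000) >> 15;
  if (wide > 32767) wide = 32767;
  return (uint16_t)wide;
}

/* Count second operands for which the modeled sequence disagrees with the
 * reference, for a fixed first operand. */
static unsigned count_mismatches(uint16_t a) {
  unsigned mismatches = 0;
  for (uint32_t b = 0; b <= 0xFFFF; ++b) {
    if (model_fixup(model_pmulhrsw(a, (uint16_t)b)) != reference_sat(a, (uint16_t)b)) {
      ++mismatches;
    }
  }
  return mismatches;
}

static void fixup_examples(void) {
  assert(model_pmulhrsw(0x8000, 0x8000) == 0x8000);              /* wrapped */
  assert(model_fixup(model_pmulhrsw(0x8000, 0x8000)) == 0x7FFF); /* saturated */
  assert(count_mismatches(0x8000) == 0);
}
```

Looping `count_mismatches` over all 65536 values of `a` would cover the full input space; the example checks only the row that contains the overflow case.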

x86/x86-64 processors with SSE2 instruction set

y = i16x8.q15mulr_sat_s(a, b) (y is NOT a and y is NOT b) is lowered to
- MOVDQA xmm_y, xmm_a
- MOVDQA xmm_tmp, xmm_a
- PMULLW xmm_y, xmm_b
- PMULHW xmm_tmp, xmm_b
- PSRLW xmm_y, 14
- PADDW xmm_tmp, xmm_tmp
- PAVGW xmm_y, wasm_i16x8_splat(0)
- PADDW xmm_y, xmm_tmp
- MOVDQA xmm_tmp, wasm_i16x8_splat(0x8000)
- PCMPEQW xmm_tmp, xmm_y
- PXOR xmm_y, xmm_tmp
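The SSE2 sequence rebuilds the rounded product from the two 16-bit halves: doubling the high half gives the product's bits from position 15 upward with the lowest bit still missing, and averaging bits 15:14 of the low half with zero supplies that bit plus the rounding carry. The scalar sketch below mirrors the steps for a single lane so the decomposition can be checked; all names are hypothetical and the code is a model, not generated output.

```c
#include <assert.h>
#include <stdint.h>

/* Interpret 16 lane bits as a signed value (avoids implementation-defined casts). */
static int32_t sse2_lane_signed(uint16_t bits) {
  return bits < 0x8000 ? (int32_t)bits : (int32_t)bits - 0x10000;
}

/* Scalar model of one lane of the SSE2 sequence listed above. */
static uint16_t sse2_q15mulr_sat_lane(uint16_t a, uint16_t b) {
  uint32_t product = (uint32_t)(sse2_lane_signed(a) * sse2_lane_signed(b));
  uint16_t y = (uint16_t)product;           /* PMULLW: low half of the product */
  uint16_t tmp = (uint16_t)(product >> 16); /* PMULHW: high half of the product */
  y = (uint16_t)(y >> 14);                  /* PSRLW y, 14: keep product bits 15:14 */
  tmp = (uint16_t)(tmp + tmp);              /* PADDW: high half shifted left by 1 */
  y = (uint16_t)((y + 0 + 1) >> 1);         /* PAVGW with zero: (y + 1) >> 1 */
  y = (uint16_t)(y + tmp);                  /* PADDW: wrapped rounded product */
  uint16_t mask = (y == 0x8000) ? 0xFFFF : 0x0000; /* PCMPEQW against 0x8000 */
  return (uint16_t)(y ^ mask);              /* PXOR: 0x8000 becomes 0x7FFF */
}

/* Reference: (a * b + 0x4000) >> 15 with saturation.
 * Assumes >> on a negative int32_t is an arithmetic shift. */
static uint16_t sse2_reference(uint16_t a, uint16_t b) {
  int32_t wide = (sse2_lane_signed(a) * sse2_lane_signed(b) + 0x4000) >> 15;
  if (wide > 32767) wide = 32767;
  return (uint16_t)wide;
}

/* Count second operands for which the model and the reference disagree. */
static unsigned sse2_count_mismatches(uint16_t a) {
  unsigned mismatches = 0;
  for (uint32_t b = 0; b <= 0xFFFF; ++b) {
    if (sse2_q15mulr_sat_lane(a, (uint16_t)b) != sse2_reference(a, (uint16_t)b)) {
      ++mismatches;
    }
  }
  return mismatches;
}

static void sse2_examples(void) {
  assert(sse2_q15mulr_sat_lane(0x4000, 0x4000) == 0x2000); /* 0.5 * 0.5 = 0.25 */
  assert(sse2_q15mulr_sat_lane(0xC000, 0x4000) == 0xE000); /* -0.5 * 0.5 = -0.25 */
  assert(sse2_q15mulr_sat_lane(0x8000, 0x8000) == 0x7FFF); /* overflow saturates */
}
```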

ARM64 processors

y = i16x8.q15mulr_sat_s(a, b) is lowered to
- SQRDMULH Vy.8H, Va.8H, Vb.8H

ARMv7 processors with NEON instruction set

y = i16x8.q15mulr_sat_s(a, b) is lowered to
- VQRDMULH.S16 Qy, Qa, Qb

ngzhian · 2020-09-29T20:15:46Z

What is the alternative without this instruction? The asymmetry doesn't look great here (5 vs. 1 instructions).

Maratyszcza · 2020-09-29T20:20:57Z

@ngzhian With AVX it is just 3 instructions. Emulation in the current WebAssembly SIMD specification is way more expensive:

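v128_t wasm_i16x8_q15mulr_sat(v128_t a, v128_t b) {
  // Widen each half to 32-bit lanes and compute the full products.
  v128_t lo = wasm_i32x4_mul(wasm_i32x4_widen_low_i16x8(a), wasm_i32x4_widen_low_i16x8(b));
  v128_t hi = wasm_i32x4_mul(wasm_i32x4_widen_high_i16x8(a), wasm_i32x4_widen_high_i16x8(b));
  // Add the rounding constant (half of the range discarded by the shift).
  const v128_t inc = wasm_i32x4_splat(0x4000);
  lo = wasm_i32x4_add(lo, inc);
  hi = wasm_i32x4_add(hi, inc);
  // Arithmetic shift right by the 15 fractional bits.
  lo = wasm_i32x4_shr(lo, 15);
  hi = wasm_i32x4_shr(hi, 15);
  // Narrow back to 16-bit lanes with signed saturation.
  return wasm_i16x8_narrow_i32x4(lo, hi);
}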

ngzhian · 2020-09-29T20:23:48Z

Sure, I'm looking at V8's SIMD baseline though, which is SSE4.1, so that's 5.
Yea agree the emulation looks bad.

I'm not familiar with this Q format, do you expect other instructions from this "class" to be useful? This suggestion is different from the rest in that it is just a single instruction.

Think this is worth prototyping just to see the improvements; it also has a variety of use cases (across ML, codecs). I'll get started on this.

Maratyszcza · 2020-09-29T20:29:15Z

Other instructions are just regular integer saturated addition/subtraction. Q31 multiplication is useful, but it is natively supported only on ARM, and emulation on x86 would be very expensive.
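For reference, the Q31 counterpart follows the same pattern with a 64-bit intermediate. The sketch below is a scalar illustration only, with a hypothetical function name; it follows the rounding-and-saturating behavior described in this thread rather than any vendor pseudocode.

```c
#include <assert.h>
#include <stdint.h>

/* Saturating, rounding Q31 multiply for a single lane.
 * Assumes >> on a negative int64_t is an arithmetic shift. */
static int32_t q31_mulr_sat(int32_t a, int32_t b) {
  int64_t product = (int64_t)a * (int64_t)b;              /* 62 fractional bits */
  int64_t rounded = (product + (INT64_C(1) << 30)) >> 31; /* round, drop 31 bits */
  /* Only INT32_MIN * INT32_MIN exceeds the range; it rounds to 2^31. */
  return (int32_t)(rounded > INT32_MAX ? INT32_MAX : rounded);
}

static void q31_examples(void) {
  assert(q31_mulr_sat(1 << 30, 1 << 30) == (1 << 29));    /* 0.5 * 0.5 = 0.25 */
  assert(q31_mulr_sat(INT32_MIN, 1 << 30) == -(1 << 30)); /* -1 * 0.5 = -0.5 */
  assert(q31_mulr_sat(INT32_MIN, INT32_MIN) == INT32_MAX);/* overflow saturates */
}
```

The 64-bit product is what makes this costly on x86 SIMD, where there is no single instruction that keeps the rounded high half of a 32x32 signed product.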

penzn · 2020-09-30T23:11:12Z

The asymmetry is likely to be worse than 5 vs 1, as the wasm_i16x8_splat(0x8000) values would need to be emitted as well.

Maratyszcza · 2020-10-01T21:52:00Z

@penzn wasm_i16x8_splat(0x8000) is just an in-memory constant literal. V8 currently doesn't use in-memory constants for v128 literal values, but it is a temporary limitation, and thus shouldn't be a factor when considering instructions.

penzn · 2020-10-01T22:37:12Z

I am not sure if this would be worth it even if the functionality did exist. In this case the mov would be from memory, which is not exactly cheap, and would lead to perf drops when the value is not in the cache.

As for the current state of things, I don't think any engines maintain lists of v128 constants (@lars-t-hansen correct me if I am wrong), which realistically means that for the time being the instruction count on x86 would be much worse than listed here.

I think we need to keep implementation reality in mind, at the very least because we depend on implementations in order to move the proposal forward.

lars-t-hansen · 2020-10-02T13:27:43Z

> As for the current state of things, I don't think any engines maintain lists of v128 constants (@lars-t-hansen correct me if I am wrong), which realistically means that for the time being the instruction count on x86 would be much worse than listed here.

Not 100% sure what you're thinking about here but our optimizing compiler will constant-fold wasm_i16x8_splat(0x8000) into a loadable constant, and may merge multiple occurrences of it to produce a reusable constant in a register under suitable conditions. I confess I've not put a lot of effort into such low-level optimizations of SIMD code yet, and it's unlikely that we'll get to the point where we're trying to determine whether it's best to put the constant in a register or to load it from memory for each use; the placement decision will likely be driven by register pressure and ad-hoc heuristics in the instruction selector (given current architecture).

ngzhian · 2020-10-06T20:34:27Z

Prototyped on arm64 as of https://chromium-review.googlesource.com/c/v8/v8/+/2438990, should see it in canary tomorrow.

This saturating, rounding, Q-format multiplication instruction is proposed in WebAssembly/simd#365. Differential Revision: https://reviews.llvm.org/D88968

penzn · 2020-10-10T00:03:03Z

Not to pick on this PR in particular, as this is a common trend lately, but how many of these projects are porting or have been ported to WebAssembly:

(this is a copy of the list in the PR description)

ngzhian · 2020-10-10T00:21:50Z

Skia has a Wasm port (https://skia.org/user/modules/canvaskit) that seems to be actively maintained (the most serious one.)

The rest of the projects turn up some experimental ports (such as https://github.com/GoogleChromeLabs/webm-wasm), not sure if anything serious (happy to be corrected.)

Good question though Petr, we should focus on use cases that are being worked on.

tlively · 2020-10-10T00:30:53Z

Various folks are building libvpx to Wasm and it looks like it is part of some official WebRTC code

The dav1d AV1 codec builds to WebAssembly.

Here's a blog post about someone using gemmlowp with Emscripten to make a shower timer, of all things 🚿

Didn't find anything in particular for OpenVINO, but this is still more than I expected to find.

Maratyszcza · 2020-10-10T00:41:37Z

IMO the list of applications is not to show which apps will use the new instruction as soon as it is available, but rather to show that the instruction is useful for different applications. Practically speaking, Emscripten includes emulation of SSE and NEON intrinsics, so most codebases written with these SIMD intrinsics can be ported with little effort.

penzn · 2020-10-14T00:49:31Z

@ngzhian, @tlively, good list, thank you! Especially the shower timer 🙄 I am (predictably) curious if any of those use cases can be used to test Wasm SIMD vs Wasm in context of this PR. Seems like Skia can be a good example.

> The dav1d AV1 codec builds to WebAssembly.

That seems to be a different AV1 codec though than the one in the description.

> IMO the list of applications is not to show which apps will use the new instruction as soon as it is available, but rather to show that the instruction is useful for different applications.

I don't think this is completely fair - if we are to delay the proposal longer, we need to understand how code in the wild would benefit from proposed operations. In a sense, consider this a re-phrasing of @binji's question in #343 (comment)

> Practically speaking, Emscripten includes emulation of SSE and NEON intrinsics, so most codebases written with these SIMD intrinsics can be ported with little effort.

Only if we consider compiling those codes as the end goal. Hitting a particular operation does not guarantee performance, due to various architectural quirks (see discussion on hadd for example). I really think further changes to the proposal should be tested, because (a) it has been available to users for some time and (b) we are venturing into either controversial territory (hadd, popcount), or starting to re-work things in stable state for some time (ne, eq).

Again, this is not to pick on this particular operation, but to illustrate the point of having real-world tests.

Including saturating, rounding Q15 multiplication as proposed in WebAssembly/simd#365 and extending multiplications as proposed in WebAssembly/simd#376. Since these are just prototypes, skips adding them to the C or JS APIs and the fuzzer, as well as implementing them in the interpreter.

tlively · 2020-10-28T16:14:41Z

This instruction has landed in both LLVM and Binaryen and should be ready to benchmark in tip-of-tree Emscripten in a few hours. The builtin function to use is __builtin_wasm_q15mulr_saturate_s_i8x16.

bjacob · 2020-10-28T16:21:11Z

"i8x16" ? or "i16x8" ?

tlively · 2020-10-28T16:49:08Z

Oh haha I committed it with the wrong name. I'll push a quick fix right now. Thanks for the catch!

tlively · 2020-10-28T18:07:28Z

The name has now been fixed to __builtin_wasm_q15mulr_saturate_s_i16x8.

sbc100 · 2020-10-28T20:24:23Z

> The name has now been fixed to __builtin_wasm_q15mulr_saturate_s_i16x8.

We are starting to get almost java-like in the length of our identifiers here!

bjacob · 2020-10-29T01:03:44Z

nah, java would be AbstractBuiltinWasmQ15MulrSaturateSI16x8ProducerBuilderFactory

Maratyszcza · 2020-12-12T01:55:03Z

I evaluated the performance impact of the proposed instruction by porting the fixed-point headers in gemmlowp to WebAssembly SIMD and benchmarking the fixed-point (16-bit) sigmoid implementation from gemmlowp (this implementation is being used in TensorFlow Lite). Benchmark results are presented below:

| Processor (Device) | Performance with WAsm SIMD + `i16x8.q15mulr_sat_s` | Performance with WAsm SIMD (baseline) | Speedup |
| --- | --- | --- | --- |
| Snapdragon 855 (LG G8 ThinQ) | 893 MB/s | 244 MB/s | 3.7X |
| Snapdragon 670 (Pixel 3a) | 482 MB/s | 140 MB/s | 3.4X |
| Exynos 8895 (Galaxy S8) | 681 MB/s | 171 MB/s | 4.0X |

ngzhian · 2020-12-15T01:12:18Z

How large is the benchmark? Is it a small benchmark exercising just the sigmoid implementation (single function), or is it some end-to-end inference benchmark in TF Lite?

Maratyszcza · 2020-12-18T19:44:13Z

The benchmark is for a single operator (fixed-point Sigmoid). Here's the function being benchmarked:

void Sigmoid(const int16_t* input_ptr, int16_t* output_ptr, size_t elements) {
  assert(elements % 16 == 0);

  // F0 uses 0 integer bits, range [-1, 1].
  // This is the return type of math functions such as tanh, logistic,
  // whose range is in [-1, 1].
  using F0 = gemmlowp::FixedPoint<gemmlowp::int16x8_v128_t, 0>;
  // F3 uses 3 integer bits, range [-8, 8], the input range expected here.
  using F3 = gemmlowp::FixedPoint<gemmlowp::int16x8_v128_t, 3>;

  do {
    F3 input0 =
        F3::FromRaw(gemmlowp::to_int16x8_v128_t(wasm_v128_load(input_ptr)));
    F3 input1 =
        F3::FromRaw(gemmlowp::to_int16x8_v128_t(wasm_v128_load(input_ptr + 8)));
    F0 output0 = gemmlowp::logistic(input0);
    F0 output1 = gemmlowp::logistic(input1);
    wasm_v128_store(output_ptr, output0.raw().v);
    wasm_v128_store(output_ptr + 8, output1.raw().v);

    elements -= 16;
    input_ptr += 16;
    output_ptr += 16;
  } while (elements != 0);
}
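The F0 and F3 formats in the comments above differ only in where the binary point sits within the 16-bit raw value. A small scalar sketch of that interpretation, assuming (as I understand gemmlowp's FixedPoint) one sign bit, then the integer bits, then fractional bits; the helper name `fixed16_to_double` is hypothetical and not part of gemmlowp:

```c
#include <assert.h>
#include <stdint.h>

/* Real value of a 16-bit fixed-point raw value that has `integer_bits`
 * integer bits, one sign bit, and the remaining bits as fraction. */
static double fixed16_to_double(int16_t raw, int integer_bits) {
  int fractional_bits = 15 - integer_bits;
  return (double)raw / (double)(1 << fractional_bits);
}

static void fixed16_examples(void) {
  assert(fixed16_to_double(16384, 0) == 0.5);      /* F0: 15 fractional bits */
  assert(fixed16_to_double(4096, 3) == 1.0);       /* F3: 12 fractional bits */
  assert(fixed16_to_double(INT16_MIN, 3) == -8.0); /* lower end of the F3 range */
  assert(fixed16_to_double(INT16_MAX, 0) < 1.0);   /* F0 cannot represent +1 exactly */
}
```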

Maratyszcza · 2020-12-18T19:57:59Z

The gemmlowp code changes are in the google/gemmlowp#202 PR.

penzn · 2020-12-19T02:01:42Z

Would this change affect xnnpack as well? What would be the effect there?

> The gemmlowp code changes are in the google/gemmlowp#202 PR.

How does the comparison work then? Looks like there is just one implementation.

Is this the implementation of logistic:

// Returns logistic(x) = 1 / (1 + exp(-x)) for any x.
template <typename tRawType, int tIntegerBits>
FixedPoint<tRawType, 0> logistic(FixedPoint<tRawType, tIntegerBits> a) {
  typedef FixedPoint<tRawType, tIntegerBits> InputF;
  typedef FixedPoint<tRawType, 0> ResultF;
  tRawType mask_if_positive = MaskIfGreaterThan(a, InputF::Zero());
  tRawType mask_if_zero = MaskIfZero(a);
  InputF abs_input = SelectUsingMask(mask_if_positive, a, -a);
  ResultF result_if_positive = logistic_on_positive_values(abs_input);
  ResultF result_if_negative = ResultF::One() - result_if_positive;
  const ResultF one_half =
      GEMMLOWP_CHECKED_FIXEDPOINT_CONSTANT(ResultF, 1 << 30, 0.5);
  return SelectUsingMask(mask_if_zero, one_half,
                         SelectUsingMask(mask_if_positive, result_if_positive,
                                         result_if_negative));
}

Maratyszcza · 2020-12-20T07:46:34Z

> Would this change affect xnnpack as well? What would be the effect there?

No, XNNPACK is unaffected. It might be in the future when it implements the same operator.

> How does the comparison work then? Looks like there is just one implementation.

Replace this line with #if 1 to enable the i16x8.q15mulr_sat_s instruction.

> Is this the implementation of logistic:

This is the entry point. Note that it calls other inline functions, e.g. logistic_on_positive_values.

omnisip · 2020-12-22T21:11:20Z

Discussed in #402 (12/22/2020 Sync Meeting) -- https://docs.google.com/document/d/1Tnf-fvRcCVj_vv8CjAN_qUvj4Wa57JnPa8ghGi4_C-s/edit# -- Currently awaiting x64 prototypes (@ngzhian) to do x64 benchmarks.

Maratyszcza · 2021-01-01T22:10:23Z

Here are the additional results on x86-64 systems:

| Processor | Performance with WAsm SIMD + `i16x8.q15mulr_sat_s` | Performance with WAsm SIMD (baseline) | Speedup |
| --- | --- | --- | --- |
| Intel Xeon W-2135 | 1218 MB/s | 430 MB/s | 2.8X |
| Intel Celeron N3060 | 276 MB/s | 108 MB/s | 2.6X |
| AMD PRO A10-8700B | 679 MB/s | 268 MB/s | 2.5X |

tlively · 2021-01-11T18:19:05Z

proposals/simd/NewOpcodes.md

@@ -108,6 +108,7 @@
 | i8x16.max_u          | 0x79   | i16x8.max_u              | 0x99   | i32x4.max_u              | 0xb9   | ----        | 0xd9   |
 | ----------------     | 0x7a   | ----------------         | 0x9a   | i32x4.dot_i16x8_s        | 0xba   | ----        | 0xda   |
 | i8x16.avgr_u         | 0x7b   | i16x8.avgr_u             | 0x9b   | ---- avgr_u ----         | 0xbb   | ----        | 0xdb   |
+| ----                 | 0x7c   | i16x8.q15mulr_sat_s      | 0x9c   | ----                     | 0xbc   | ----        | 0xdc   |


Can you add TBD here, too, so that the two opcode files are consistent?

Reverted the modification of this file

zjiaz · 2021-01-12T11:21:53Z

proposals/simd/SIMD.md

+def S.q15mulr_sat_s(a, b):
+    def subq15mulr(x, y):
+        return S.SignedSaturate((x * y + 0x4000) >> 15)
+    return S.lanewise_binary(subsat, S.AsSigned(a), S.AsSigned(b))


Hi, should this subsat be subq15mulr?

Oops, didn't catch that during review - fixed in #424.

* [interpreter] Implement i16x8.qmulr_sat_s This was merged in #365. * Update interpreter/exec/int.ml Co-authored-by: Andreas Rossberg <rossberg@mpi-sws.org> Co-authored-by: Andreas Rossberg <rossberg@mpi-sws.org>

tlively mentioned this pull request Oct 14, 2020

We need persistent floating-point rounding mode control WebAssembly/design#1384

Open

tlively mentioned this pull request Oct 27, 2020

Prototype new SIMD multiplications WebAssembly/binaryen#3291

Merged

Maratyszcza force-pushed the rqmul branch from 77d4b53 to c165542 Compare December 18, 2020 09:26

tlively mentioned this pull request Dec 22, 2020

Agenda for sync meeting 1/8/2021 #410

Closed

tlively mentioned this pull request Jan 8, 2021

Tracking instructions with unassigned opcodes #421

Closed

Maratyszcza force-pushed the rqmul branch from c165542 to fe17145 Compare January 11, 2021 18:11

tlively reviewed Jan 11, 2021

i16x8.q15rmul_sat_s instruction

04798ef

Maratyszcza force-pushed the rqmul branch from fe17145 to 04798ef Compare January 11, 2021 18:26

tlively approved these changes Jan 11, 2021

dtig approved these changes Jan 11, 2021

dtig merged commit df999c8 into WebAssembly:master Jan 11, 2021

zjiaz reviewed Jan 12, 2021

This was referenced Jan 14, 2021

Proposal to add mul 32x32=64 #175

Closed

Proposal to add fixed-point multiplication instructions #221

Closed

ngzhian added a commit to ngzhian/simd that referenced this pull request Feb 11, 2021

[interpreter] Implement i16x8.qmulr_sat_s

6db9199

This was merged in WebAssembly#365.

ngzhian added a commit to ngzhian/simd that referenced this pull request Feb 11, 2021

[interpreter] Implement i16x8.qmulr_sat_s

e4c911d

This was merged in WebAssembly#365.

ngzhian mentioned this pull request Feb 11, 2021

[interpreter] Implement i16x8.qmulr_sat_s #463

Merged

ngzhian added a commit to ngzhian/simd that referenced this pull request Feb 11, 2021

[spectext] Add i16x8.qmulr_sat_s

c261c77

This was merged in WebAssembly#365.

ngzhian added a commit to ngzhian/simd that referenced this pull request Feb 11, 2021

[spectext] Add i16x8.qmulr_sat_s

df3f8e0

This was merged in WebAssembly#365.

ngzhian mentioned this pull request Feb 11, 2021

[spectext] Add i16x8.qmulr_sat_s #464

Merged

ngzhian added a commit that referenced this pull request Feb 17, 2021

[spectext] Add i16x8.qmulr_sat_s

2d191fe

This was merged in #365.

akirilov-arm mentioned this pull request Jun 28, 2021

Enable the simd_i16x8_q15mulr_sat_s test on AArch64 bytecodealliance/wasmtime#3035

Merged

Maratyszcza mentioned this pull request Oct 1, 2021

Relaxed Rounding Q-format Multiplication WebAssembly/relaxed-simd#40

Open
