Saturating Rounding Q-format Multiplication #365
Conversation
What is the alternative without this instruction? The asymmetry doesn't look great here (5 vs. 1 instruction).
@ngzhian With AVX it is just 3 instructions. Emulation in the current WebAssembly SIMD specification is way more expensive:
v128_t wasm_i16x8_q15mulr_sat(v128_t a, v128_t b) {
v128_t lo = wasm_i32x4_mul(wasm_i32x4_widen_low_i16x8(a), wasm_i32x4_widen_low_i16x8(b));
v128_t hi = wasm_i32x4_mul(wasm_i32x4_widen_high_i16x8(a), wasm_i32x4_widen_high_i16x8(b));
const v128_t inc = wasm_i32x4_splat(0x4000);
lo = wasm_i32x4_add(lo, inc);
hi = wasm_i32x4_add(hi, inc);
lo = wasm_i32x4_shr(lo, 15);
hi = wasm_i32x4_shr(hi, 15);
return wasm_i16x8_narrow_i32x4(lo, hi);
}
Sure, I'm looking at V8's SIMD baseline though, which is SSE4.1, so that's 5. I'm not familiar with this Q format; do you expect other instructions from this "class" to be useful? This suggestion is different from the rest in that it is just a single instruction. Think this is worth prototyping just to see the improvements; it also has a variety of use cases (across ML, codecs). I'll get started on this.
Other instructions are just regular integer saturated addition/subtraction. Q31 multiplication is useful, but it is natively supported only on ARM, and emulation on x86 would be very expensive.
The asymmetry is likely to be worse than 5 vs 1, as the
@penzn |
I am not sure if this would be worth it even if the functionality did exist. In this case As for the current state of things, I don't think any engines maintain lists of I think we need to keep implementation reality in mind, at the very least because we depend on implementations in order to move the proposal forward.
Not 100% sure what you're thinking about here, but our optimizing compiler will constant-fold
Prototyped on arm64 as of https://chromium-review.googlesource.com/c/v8/v8/+/2438990, should see it in canary tomorrow. |
This saturating, rounding, Q-format multiplication instruction is proposed in WebAssembly/simd#365. Differential Revision: https://reviews.llvm.org/D88968
Not to pick on this PR in particular, as this is a common trend lately, but how many of these projects are porting or have been ported to WebAssembly:
(this is a copy of the list in the PR description)
Skia has a Wasm port (https://skia.org/user/modules/canvaskit) that seems to be actively maintained (the most serious one). The rest of the projects turn up some experimental ports (such as https://github.com/GoogleChromeLabs/webm-wasm), not sure if anything serious (happy to be corrected). Good question though, Petr; we should focus on use cases that are being worked on.
Various folks are building libvpx to Wasm, and it looks like it is part of some official WebRTC code. The dav1d AV1 codec builds to WebAssembly. Here's a blog post about someone using gemmlowp with Emscripten to make a shower timer, of all things 🚿 Didn't find anything in particular for OpenVINO, but this is still more than I expected to find.
IMO the list of applications is not to show which apps will use the new instruction as soon as it is available, but rather to show that the instruction is useful across different applications. Practically speaking, Emscripten includes emulation of SSE and NEON intrinsics, so most codebases written with these SIMD intrinsics can be ported with little effort.
@ngzhian, @tlively, good list, thank you! Especially the shower timer 🙄 I am (predictably) curious whether any of those use cases can be used to test Wasm SIMD vs Wasm in the context of this PR. Seems like Skia could be a good example.
That seems to be a different AV1 codec though than the one in the description.
I don't think this is completely fair - if we are to delay the proposal longer, we need to understand how code in the wild would benefit from proposed operations. In a sense, consider this a re-phrasing of @binji's question in #343 (comment)
Only if we consider compiling those codes as the end goal. Hitting a particular operation does not guarantee performance, due to various architectural quirks (see discussion on Again, this is not to pick on this particular operation, but to illustrate the point of having real-world tests.
Including saturating, rounding Q15 multiplication as proposed in WebAssembly/simd#365 and extending multiplications as proposed in WebAssembly/simd#376. Since these are just prototypes, skips adding them to the C or JS APIs and the fuzzer, as well as implementing them in the interpreter.
This instruction has landed in both LLVM and Binaryen and should be ready to benchmark in tip-of-tree Emscripten in a few hours. The builtin function to use is
"i8x16"? or "i16x8"?
Oh haha, I committed it with the wrong name. I'll push a quick fix right now. Thanks for the catch!
The name has now been fixed to
We are starting to get almost java-like in the length of our identifiers here! |
nah, java would be AbstractBuiltinWasmQ15MulrSaturateSI16x8ProducerBuilderFactory |
I evaluated the performance impact of the proposed instruction by porting the fixed-point headers in gemmlowp to WebAssembly SIMD and benchmarking the fixed-point (16-bit) sigmoid implementation from gemmlowp (this implementation is used in TensorFlow Lite). Benchmark results are presented below:
How large is the benchmark? Is it a small benchmark exercising just the sigmoid implementation (single function), or is it some end-to-end inference benchmark in TF Lite?
The benchmark is for a single operator (fixed-point Sigmoid). Here's the function being benchmarked:
void Sigmoid(const int16_t* input_ptr, int16_t* output_ptr, size_t elements) {
assert(elements % 16 == 0);
// F0 uses 0 integer bits, range [-1, 1].
// This is the return type of math functions such as tanh, logistic,
// whose range is in [-1, 1].
using F0 = gemmlowp::FixedPoint<gemmlowp::int16x8_v128_t, 0>;
// F3 uses 3 integer bits, range [-8, 8], the input range expected here.
using F3 = gemmlowp::FixedPoint<gemmlowp::int16x8_v128_t, 3>;
do {
F3 input0 =
F3::FromRaw(gemmlowp::to_int16x8_v128_t(wasm_v128_load(input_ptr)));
F3 input1 =
F3::FromRaw(gemmlowp::to_int16x8_v128_t(wasm_v128_load(input_ptr + 8)));
F0 output0 = gemmlowp::logistic(input0);
F0 output1 = gemmlowp::logistic(input1);
wasm_v128_store(output_ptr, output0.raw().v);
wasm_v128_store(output_ptr + 8, output1.raw().v);
elements -= 16;
input_ptr += 16;
output_ptr += 16;
} while (elements != 0);
}
The gemmlowp code changes are in the google/gemmlowp#202 PR.
Would this change affect xnnpack as well? What would be the effect there?
How does the comparison work then? Looks like there is just one implementation. Is this the implementation on
// Returns logistic(x) = 1 / (1 + exp(-x)) for any x.
template <typename tRawType, int tIntegerBits>
FixedPoint<tRawType, 0> logistic(FixedPoint<tRawType, tIntegerBits> a) {
typedef FixedPoint<tRawType, tIntegerBits> InputF;
typedef FixedPoint<tRawType, 0> ResultF;
tRawType mask_if_positive = MaskIfGreaterThan(a, InputF::Zero());
tRawType mask_if_zero = MaskIfZero(a);
InputF abs_input = SelectUsingMask(mask_if_positive, a, -a);
ResultF result_if_positive = logistic_on_positive_values(abs_input);
ResultF result_if_negative = ResultF::One() - result_if_positive;
const ResultF one_half =
GEMMLOWP_CHECKED_FIXEDPOINT_CONSTANT(ResultF, 1 << 30, 0.5);
return SelectUsingMask(mask_if_zero, one_half,
SelectUsingMask(mask_if_positive, result_if_positive,
result_if_negative));
}
No, XNNPACK is unaffected. It might be affected in the future, once it implements the same operator.
Replace this line with
This is the entry point. Note that it calls other inline functions, e.g.
Discussed in #402 (12/22/2020 Sync Meeting) -- https://docs.google.com/document/d/1Tnf-fvRcCVj_vv8CjAN_qUvj4Wa57JnPa8ghGi4_C-s/edit# -- Currently awaiting x64 prototypes (@ngzhian) to do x64 benchmarks.
Here are the additional results on x86-64 systems:
proposals/simd/NewOpcodes.md
Outdated
@@ -108,6 +108,7 @@
| i8x16.max_u | 0x79 | i16x8.max_u | 0x99 | i32x4.max_u | 0xb9 | ---- | 0xd9 |
| ---------------- | 0x7a | ---------------- | 0x9a | i32x4.dot_i16x8_s | 0xba | ---- | 0xda |
| i8x16.avgr_u | 0x7b | i16x8.avgr_u | 0x9b | ---- avgr_u ---- | 0xbb | ---- | 0xdb |
| ---- | 0x7c | i16x8.q15mulr_sat_s | 0x9c | ---- | 0xbc | ---- | 0xdc | |
Can you add TBD here, too, so that the two opcode files are consistent?
Reverted the modification of this file
def S.q15mulr_sat_s(a, b):
  def subq15mulr(x, y):
    return S.SignedSaturate((x * y + 0x4000) >> 15)
  return S.lanewise_binary(subsat, S.AsSigned(a), S.AsSigned(b))
Hi, should this subsat be subq15mulr?
Right
Oops, didn't catch that during review - fixed in #424.
This was merged in WebAssembly#365.
* [interpreter] Implement i16x8.qmulr_sat_s This was merged in #365. * Update interpreter/exec/int.ml Co-authored-by: Andreas Rossberg <rossberg@mpi-sws.org> Co-authored-by: Andreas Rossberg <rossberg@mpi-sws.org>
Introduction
Fixed-point algorithms often represent fractional numbers in Q format, where a fixed number of bits is allocated to the fractional part. Addition and subtraction in Q format are no different from addition and subtraction on integers, but multiplication is substantially different. First, to produce a multiplication result in the same Q format as the factors, we need to compute the full product of the factors (i.e. with twice the number of bits of the factors) and then do an arithmetic shift right by Q bits. Secondly, as the right shift loses some of the bits of the product, it is typically done with rounding to the nearest Q-format number to minimize the rounding error. Thus, for the Q15 format (i.e. with 15 fractional bits), the multiplication result is computed as
(a * b + 0x4000) >> 15
where a and b are the integer representations of the input factors.

Despite its sophisticated low-level definition, Q-format multiplication is commonly used in fixed-point algorithms and widely supported in hardware. x86 supports multiplication of 16-bit numbers in Q15 format since SSSE3, and ARM NEON supports multiplication of 16-bit numbers in Q15 format and 32-bit numbers in Q31 format. This proposal suggests adding 16-bit Q15-format multiplication to the WebAssembly SIMD instruction set. The native x86 and ARM variants of this instruction differ in how they handle overflow: the x86 version wraps around while the ARM version saturates. The overflow happens only when both inputs are INT16_MIN, and can be corrected with a couple of instructions. For the purpose of bitwise compatibility, this proposal standardizes on the saturating variant as the more mathematically meaningful one.

The Q-format multiplication instruction is particularly important for fixed-point neural network inference, and was previously requested by @bjacob in #221.
Applications
Mapping to Common Instruction Sets
This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.
x86/x86-64 processors with AVX instruction set
VPMULHRSW xmm_y, xmm_a, xmm_b
VPCMPEQW xmm_tmp, xmm_y, wasm_i16x8_splat(0x8000)
VPXOR xmm_y, xmm_y, xmm_tmp
x86/x86-64 processors with SSSE3 instruction set
MOVDQA xmm_y, xmm_a
MOVDQA xmm_tmp, wasm_i16x8_splat(0x8000)
PMULHRSW xmm_y, xmm_b
PCMPEQW xmm_tmp, xmm_y
PXOR xmm_y, xmm_tmp
x86/x86-64 processors with SSE2 instruction set
y = i16x8.q15mulr_sat_s(a, b) (y is NOT a and y is NOT b) is lowered to:
MOVDQA xmm_y, xmm_a
MOVDQA xmm_tmp, xmm_a
PMULLW xmm_y, xmm_b
PMULHW xmm_tmp, xmm_b
PSRLW xmm_y, 14
PADDW xmm_tmp, xmm_tmp
PAVGW xmm_y, wasm_i16x8_splat(0)
PADDW xmm_y, xmm_tmp
MOVDQA xmm_tmp, wasm_i16x8_splat(0x8000)
PCMPEQW xmm_tmp, xmm_y
PXOR xmm_y, xmm_tmp
ARM64 processors
SQRDMULH Vy.8H, Va.8H, Vb.8H
ARMv7 processors with NEON instruction set
VQRDMULH.S16 Qy, Qa, Qb