
8bit*8bit 4-D dot-product accumulating to 32bit, similar to ARM SDOT and x86 VNNI #9

bjacob opened this issue Aug 26, 2020 · 14 comments



bjacob commented Aug 26, 2020

This issue is a placeholder for future discussion about supporting 4-dimensional-reducing dot-product instructions taking 8bit inputs and accumulating into 32bit, i.e.

int32_accumulator += int8_lhs_0 * int8_rhs_0 + ... + int8_lhs_3 * int8_rhs_3

This would be similar to recent instructions: ARM SDOT/UDOT (and the even more recent USDOT/SUDOT, which support mixed signedness) and the x86 AVX-VNNI instructions.
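
For concreteness, here is a minimal scalar sketch of the per-lane semantics being discussed, assuming a 128-bit vector split into four 32-bit accumulator lanes (the function name and the signed-by-signed choice are illustrative only, not part of any proposal):

#include <stdint.h>

/* Each 32-bit lane accumulates the dot product of its four 8-bit lhs/rhs elements. */
void i8x16_dot4_accum_i32x4(const int8_t lhs[16], const int8_t rhs[16], int32_t acc[4]) {
    for (int lane = 0; lane < 4; ++lane) {
        int32_t sum = 0;
        for (int k = 0; k < 4; ++k) {
            sum += (int32_t)lhs[4 * lane + k] * (int32_t)rhs[4 * lane + k];
        }
        acc[lane] += sum;
    }
}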

The motivation for filing this issue now is that I had created some confusion by commenting on that topic on PR WebAssembly/simd#127 which is actually about something different.

@kpu let's take discussion here.


kpu commented Aug 26, 2020

I'm after 8-bit signed GEMM too for my project @browsermt, ultimately for quantized neural networks, which appears to be @bjacob's use case as well.

As you mentioned in WebAssembly/simd#224, GEMM routines want to use all the registers with a larger tile to minimize memory bandwidth requirements. That implies not only the ability to query the register count, but also some knowledge of how many extra registers the implementation requires. For example, a pre-VNNI x86 machine typically does vpmaddubsw to get a 16-bit result, followed by vpmaddwd with 1s to upcast and horizontally add to 32-bit, and those 1s take up a register. Users could address this problem in their code by shipping various tile sizes and autotuning.
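
One way that pre-VNNI sequence could look with SSE intrinsics (a sketch assuming the unsigned-times-signed operand convention of pmaddubsw; the helper name is illustrative):

#include <tmmintrin.h>  /* SSSE3 */

/* acc += 4-way dot product of unsigned 8-bit a with signed 8-bit b, per 32-bit lane.
   The all-ones vector of 16-bit lanes ties up a register, as noted above. */
static inline __m128i dot_u8s8_accum(__m128i acc, __m128i a_u8, __m128i b_s8, __m128i ones16) {
    __m128i prod16 = _mm_maddubs_epi16(a_u8, b_s8);  /* pmaddubsw: pairwise u8*s8 -> s16, saturating */
    __m128i prod32 = _mm_madd_epi16(prod16, ones16); /* pmaddwd with 1s: pairwise s16 -> s32 */
    return _mm_add_epi32(acc, prod32);               /* paddd: accumulate */
}

(ones16 would be _mm_set1_epi16(1), hoisted out of the inner loop.)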

The instruction sets are slightly different. ARM has signed * signed in SDOT, with broader hardware availability. x86 has unsigned * signed only, which is annoying and requires arithmetic hacks to make one of the arguments unsigned (adding 128, then subtracting it back out via a bias term). We could target USDOT/SUDOT on ARM, but then you get the unnecessary extra arithmetic and require a more recent processor just to be compatible with Intel. Another option is to emulate signed * signed on x86 via sign-bit manipulation.
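
(For reference, the add-128 trick works because, writing u_k = a_k + 128 reinterpreted as an unsigned byte, sum_k a_k * b_k = sum_k u_k * b_k - 128 * sum_k b_k, so the correction is a per-column bias 128 * sum_k b_k precomputed from the rhs and subtracted from the accumulator.)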

In practice a GEMM will want to use the longest register length it can get away with.

So while WebAssembly should have an 8-bit dot product instruction, I wonder if browsers should just support more general matrix multiplication. They already have limited support in DOMMatrix and WebGL.

Paging @mlopatka @XapaJIaMnu

@mlopatka

Perhaps @lars-t-hansen can provide some thoughts on whether supporting such operations is interesting/feasible from Mozilla's perspective given our current implementation of WASM and the roadmap for this year.


kpu commented Aug 28, 2020

The WebNN people are proposing to add GEMM to the browser, including 8-bit: https://webmachinelearning.github.io/webnn/#api-neuralnetworkcontext-gemm .


bjacob commented Aug 28, 2020

We had taken a look at WebAsm SIMD for NN inference here. The relevance to the present issue is that as there are multiple other issues preventing the WebAsm SIMD proposal from approaching native performance levels, it is difficult to advocate a big investment in the present feature, which is about recent ISA extensions, until those other issues are resolved. The deepest is Issue WebAssembly/simd#225, which is a general issue about the entire 'intrinsics' programming model, for which native code has a toolbox of work-arounds and mitigations that are not practical in WebAsm SIMD. As resolving Issue WebAssembly/simd#225 seems to be out of immediate scope for WebAsm SIMD, I would like to see alternatives to WebAsm SIMD emerge where tackling this hard issue would be part of the initial design. There, adding new instructions to follow recent ISA extensions, like discussed in the present issue, would be far easier to justify.

Adding @Maratyszcza.


penzn commented Aug 28, 2020

Matching AVX-VNNI most likely would not be feasible for this proposal unless it can be efficiently emulated in SSE; there is no intended AVX support here at all. There is a separate flexible vectors proposal, which is planned to support AVX.


kpu commented Aug 28, 2020

The operation can be emulated with SSSE3 and even SSE2 if necessary (but I think WebAssembly already assumes SSSE3).

Usually, 8-bit GEMM is implemented on pre-VNNI Intel as vpmaddubsw, vpmaddwd, and vpaddd. That's how efficient 8-bit GEMMs do it. But pedantically it violates this statement from WebAssembly/simd#225 (comment):

No, exposing underlying architectural details would introduce platform-specific behavior and violate WebAssembly's determinism.

because vpmaddubsw saturates to a signed 16-bit int. For example, -128 * 255 + -128 * 255 = -65280, but it saturates to -32768 to fit in 16 bits, which would be platform-specific saturation.

Since we're one bit short, another strategy is sign bit manipulation before calling vpmaddubsw. But -128 is weird. And -128*-128 + -128 * -128 = 32768 would still saturate to 32767. In a neural network, I don't care about this amount of saturation.

It's also possible to emulate this in SSE2 by widening to 16-bit, then doing vpmaddwd (16-bit signed multiply-add) twice and vpaddd twice. This would be deterministic and match VNNI, albeit slowly.
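
One way such a deterministic SSE2 emulation could look (a sketch, not a tuned kernel; it uses shuffles to restore the VNNI lane layout):

#include <emmintrin.h>  /* SSE2 */

/* acc += 4-way signed 8-bit dot product per 32-bit lane, exactly (no 16-bit saturation). */
static inline __m128i dot_s8s8_accum_sse2(__m128i acc, __m128i a, __m128i b) {
    /* Sign-extend bytes to 16 bits: duplicate each byte into a word, then arithmetic-shift right by 8. */
    __m128i a_lo = _mm_srai_epi16(_mm_unpacklo_epi8(a, a), 8);  /* bytes 0..7  */
    __m128i a_hi = _mm_srai_epi16(_mm_unpackhi_epi8(a, a), 8);  /* bytes 8..15 */
    __m128i b_lo = _mm_srai_epi16(_mm_unpacklo_epi8(b, b), 8);
    __m128i b_hi = _mm_srai_epi16(_mm_unpackhi_epi8(b, b), 8);
    /* pmaddwd: pairwise 16-bit products summed into 32-bit lanes. */
    __m128i lo = _mm_madd_epi16(a_lo, b_lo);  /* [p0+p1,  p2+p3,   p4+p5,   p6+p7]   */
    __m128i hi = _mm_madd_epi16(a_hi, b_hi);  /* [p8+p9,  p10+p11, p12+p13, p14+p15] */
    /* Regroup so lane j holds the sum over bytes 4j..4j+3, matching the VNNI layout. */
    __m128i even = _mm_castps_si128(_mm_shuffle_ps(_mm_castsi128_ps(lo), _mm_castsi128_ps(hi), _MM_SHUFFLE(2, 0, 2, 0)));
    __m128i odd  = _mm_castps_si128(_mm_shuffle_ps(_mm_castsi128_ps(lo), _mm_castsi128_ps(hi), _MM_SHUFFLE(3, 1, 3, 1)));
    return _mm_add_epi32(acc, _mm_add_epi32(even, odd));
}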


penzn commented Aug 29, 2020

We should probably look into expected instruction sequences within the ISA limits that the proposal has, though I can't promise it is going to make it into the MVP.

@lars-t-hansen

but I think WebAssembly already assumes SSSE3

@kpu, citation for this?

(SpiderMonkey currently will disable Wasm SIMD for < SSE4.1 and I think V8 does the same, but the spec has so far allowed scalarization of every operation and I'm not aware of any assumption of available technology short of IEEE math.)


kpu commented Aug 31, 2020

@lars-t-hansen Sorry, what I meant is that WebAssembly already has instructions that map to SSSE3 and later on Intel; SSSE3 is the highest version required by vpmaddubsw, vpmaddwd, and vpaddd (perhaps without the v prefix). Of course, the proposed instruction can be emulated, serially or in 16-bit, on hardware older than SSSE3 were it to be supported.

In particular, I am stressing to @penzn that _mm_dpbusds_epi32 is a 128-bit VNNI form available on new processors, and there are ways to semi-efficiently implement it on older x86 all the way back to SSSE3. There is no requirement for wider SIMD to get this instruction (though that would be super-useful for GEMM).
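
For comparison, the VNNI form collapses the whole pre-VNNI sequence into a single instruction (a sketch; requires AVX-VNNI, or AVX512-VNNI with AVX512VL, and the helper name is illustrative):

#include <immintrin.h>

/* acc += 4-way dot product of unsigned 8-bit a with signed 8-bit b, per 32-bit lane, with signed saturation. */
static inline __m128i dot_u8s8_accum_vnni(__m128i acc, __m128i a_u8, __m128i b_s8) {
    return _mm_dpbusds_epi32(acc, a_u8, b_s8);  /* vpdpbusds */
}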

IBM z/Architecture has signed * signed and unsigned * unsigned 8-bit instructions with addition into a 16-bit accumulator. They don't do a horizontal add, so there are separate instructions to get results for even-indexed and odd-indexed positions: VMAE and VMAO (signed * signed) or VMALE and VMALO (unsigned * unsigned). Widen-and-sum operations then complete a dot product into a 32-bit result.


ngzhian commented Mar 19, 2021

At today's SIMD sync meeting, we discussed considering this for relaxed-simd, and agreed that we can carry on further discussions, hence transferring the issue there.


penzn commented Apr 16, 2021

I filed WebAssembly/flexible-vectors#15 for this a while back, though given what flexible vectors is about, the operation would need to be consistent across platforms to be part of it.


kpu commented Jan 13, 2022

To add some specific numbers: https://bugzilla.mozilla.org/show_bug.cgi?id=1746631

The test is a machine translation application compiled to WebAssembly (https://github.com/browsermt/bergamot-translator). Speed is measured in words translated per second (wps). It is a heavy user of 8-bit integer matrix multiplication.

Pure WASM SIMD: 95 wps.
Add pmaddubs to WASM: 390 wps (+310% over pure WASM SIMD).
Add a native 8-bit matrix multiply on SSSE3 as an intrinsic: 490 wps (+25% over pmaddubs, +415% over pure WASM SIMD).
Add a native 8-bit matrix multiply on AVX2 as an intrinsic: 560 wps (+43% over pmaddubs, +489% over pure WASM SIMD).

(The rest of the app is compiled to WebAssembly).

@Maratyszcza

@kpu Thanks for sharing! What is Pure WAsm? Is it WebAssembly MVP or WebAssembly SIMD?


kpu commented Jan 13, 2022

@Maratyszcza WebAssembly SIMD 128-bit. I've updated my comment.
