This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

Proposal to add fixed-point multiplication instructions #221

Closed

bjacob opened this issue May 5, 2020 · 5 comments

Comments

@bjacob
bjacob commented May 5, 2020

(Branched from Issue #175)

It would be useful to have fixed-point multiplication instructions, e.g. 32x32=32 and 16x16=16, similar to ARM SQRDMULH.

Some may think that the availability of a 32x32=64 integer multiplication (Issue #175) would remove the need for that, but that would be sub-optimal: staying within 32-bit lanes means doing 4 scalar operations per 128-bit vector operation. Moreover, most applications want the rounding flavor (SQRDMULH, not SQDMULH), which would require a few more instructions to emulate if the instruction is missing; in practice, this would push applications into compromises between accuracy and performance.

(This is critical to integer-quantized neural network applications as performed by TensorFlow Lite using the ruy matrix multiplication library, see e.g. the usage of these instructions here, https://github.com/google/ruy/blob/57e64b4c8f32e813ce46eb495d13ee301826e498/ruy/kernel_arm64.cc#L517 )
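
For readers unfamiliar with the operation, here is a scalar model of what SQRDMULH computes per lane, written for the 32-bit case (a sketch based on the ARM pseudocode; the function name is ours):

```c
#include <stdint.h>

/* Scalar model of ARM SQRDMULH on a 32-bit lane: saturating rounding
 * doubling multiply, returning the high half of the result. */
int32_t sqrdmulh_s32(int32_t a, int32_t b) {
    /* The only input pair whose doubled product overflows; saturate it. */
    if (a == INT32_MIN && b == INT32_MIN) return INT32_MAX;
    int64_t p = (int64_t)a * (int64_t)b;        /* exact 64-bit product */
    /* (2*p + 2^31) >> 32, rewritten as (p + 2^30) >> 31 so the
     * intermediate stays within signed 64-bit range. */
    return (int32_t)((p + (1LL << 30)) >> 31);
}
```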

@dtig
Member

dtig commented May 8, 2020

Thanks for filing this issue. As we're at Phase 3 of the current SIMD proposal, we've put a soft freeze on the addition of new operations, as discussed in #203. Some interesting questions to answer: the issue description highlights the usage of these instructions in neural network applications, but are there other applications that would benefit from this addition? And what would the corresponding codegen look like for the 32x32=32 case on Intel platforms?

@nfrechette

One use case is to use it with SSE4 (or is it AVX?) to perform a logical shift per SIMD lane. With SSE, logical shifts either take an immediate value or take the shift amount as a u64 value; in both cases, all lanes are shifted by the same amount, so it is not possible to shift each lane by a different amount. One way to achieve this is with integer multiplication, but it is only worth it when the intrinsic is available: it avoids the need to swizzle each lane, shift, and reconstruct. I'm not sure whether the multiplication is faster, but it uses far fewer instructions and registers, and it inlines better.

I intend to use this trick in my decompression code path, where a vector3 is packed in a variable number of bits (each lane having the same number of bits). Due to bit alignment, the value needs to be shifted once it is loaded. Using the bit offset, a lookup table can provide a shift value per lane.
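
Here is one concrete form of the trick, assuming SSE4.1's _mm_mullo_epi32 and a left shift (a sketch; the function name and table layout are illustrative, not nfrechette's actual code):

```c
#include <smmintrin.h>  /* SSE4.1 */

/* Shift each 32-bit lane left by its own amount: shifting lane i left
 * by k[i] is the same as multiplying lane i by 1 << k[i], so a small
 * lookup table of powers of two yields a per-lane shift without
 * extracting, shifting, and reinserting each lane. */
static inline __m128i shift_left_per_lane(__m128i v, __m128i pow2_of_shift) {
    /* pow2_of_shift holds (1 << k0, 1 << k1, 1 << k2, 1 << k3). */
    return _mm_mullo_epi32(v, pow2_of_shift);
}
```

For example, pow2_of_shift could be loaded from a table indexed by the bit offset, as described above.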

@bjacob
Author

bjacob commented May 11, 2020

> Thanks for filing this issue, as we're at Phase 3 of the current SIMD proposal

Real-world application developers like me are only going to start looking at the proposal once it's far enough into implementation. To say that the instruction set is soft-frozen at this stage is to say that it will only be marginally informed by real-world usage.

> are there other applications that would benefit from this addition?

Some are listed here:
https://en.wikipedia.org/wiki/Fixed-point_arithmetic#Software_application_examples

Multiple CPU architectures support it:

- ARM: SQRDMULH
- MIPS: MULR_Q
- x86: has an instruction, pmulhrsw / _mm_mulhrs_epi16, that only supports the 16-bit case and rounds ties toward positive infinity. It is used in the 16-bit neon_2_sse code linked below; a scalar model follows this list.
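
As a rough scalar model of what pmulhrsw computes per 16-bit lane (a sketch following Intel's documented formula; the function name is ours):

```c
#include <stdint.h>

/* Scalar model of PMULHRSW on a 16-bit lane. Unlike SQRDMULH it does
 * not saturate: for a == b == INT16_MIN the result wraps to INT16_MIN
 * where SQRDMULH would return INT16_MAX. */
int16_t pmulhrsw_scalar(int16_t a, int16_t b) {
    int32_t p = (int32_t)a * (int32_t)b;      /* exact 32-bit product  */
    return (int16_t)(((p >> 14) + 1) >> 1);   /* round, keep high half */
}
```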

> What would the corresponding codegen look like for the 32x32=32 case on Intel platforms?

Intel's neon_2_sse header implements it as:

- 16-bit: https://github.com/intel/ARM_NEON_2_x86_SSE/blob/0f4e857421c964826def1820bbfe0707f73ebda4/NEON_2_SSE.h#L4306-L4314
- 32-bit: https://github.com/intel/ARM_NEON_2_x86_SSE/blob/0f4e857421c964826def1820bbfe0707f73ebda4/NEON_2_SSE.h#L4287-L4304
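
To give a concrete sense of the cost, here is a sketch of a 32-bit SQRDMULH emulation assuming SSE4.1 (our own illustration, taking a similar approach to the NEON_2_SSE code linked above, not that code verbatim):

```c
#include <smmintrin.h>  /* SSE4.1 */
#include <stdint.h>

__m128i sqrdmulh_epi32_sse41(__m128i a, __m128i b) {
    const __m128i bias = _mm_set1_epi64x(1LL << 30);
    /* Widening 32x32=64 products: lanes 0,2 then lanes 1,3. */
    __m128i even = _mm_mul_epi32(a, b);
    __m128i odd  = _mm_mul_epi32(_mm_srli_epi64(a, 32),
                                 _mm_srli_epi64(b, 32));
    /* (p + 2^30) << 1 leaves the rounded high half, (2p + 2^31) >> 32,
     * in the upper 32 bits of each 64-bit lane. */
    even = _mm_slli_epi64(_mm_add_epi64(even, bias), 1);
    odd  = _mm_slli_epi64(_mm_add_epi64(odd,  bias), 1);
    /* Gather those upper halves back into one vector. */
    __m128i lo  = _mm_shuffle_epi32(even, _MM_SHUFFLE(3, 3, 1, 1));
    __m128i hi  = _mm_shuffle_epi32(odd,  _MM_SHUFFLE(3, 3, 1, 1));
    __m128i res = _mm_blend_epi16(lo, hi, 0xCC);
    /* Fix the single saturating case INT32_MIN * INT32_MIN, which the
     * arithmetic above wraps to INT32_MIN instead of INT32_MAX. */
    __m128i min = _mm_set1_epi32(INT32_MIN);
    __m128i sat = _mm_and_si128(_mm_cmpeq_epi32(a, min),
                                _mm_cmpeq_epi32(b, min));
    return _mm_xor_si128(res, sat);
}
```

That is roughly a dozen instructions per 128-bit vector, versus a single SQRDMULH on ARM, which illustrates the performance gap described in the issue.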

@dtig
Member

dtig commented May 20, 2020

> > Thanks for filing this issue, as we're at Phase 3 of the current SIMD proposal
>
> Real-world application developers like me are only going to start looking at the proposal once it's far enough into implementation. To say that the instruction set is soft-frozen at this stage is to say that it will only be marginally informed by real-world usage.

While we've only progressed in the official phases, the Chrome implementation has been tracking the latest version of this proposal for more than a year now. Moving to the implementation phase for this proposal specifically required that it be useful and performant for multiple real-world use cases. So I would argue that this proposal is informed by real-world usage, though I sympathize that it may not be optimal for your particular use case. The reason for a soft freeze on the opcodes is to give implementations and tools some room to catch up to the current proposal. Also, given the nature of the SIMD proposal, there can be a long tail of operations, so unfortunately we do have to draw a line in the sand in the interest of forward progress.

That said, we did discuss in #203 that if there are very compelling reasons to consider adding new operations (that were not already filed at the time), we should evaluate them on a case-by-case basis. If this is something that you would like to push for, please submit a PR with the proposed semantics.


@Maratyszcza
Contributor

This is covered by #365

ngzhian closed this as completed Mar 17, 2021