Skip to content
This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

Proposal to add mul 32x32=64 #175

Closed
jan-wassenberg opened this issue Jan 14, 2020 · 12 comments · Fixed by #376
Closed

Proposal to add mul 32x32=64 #175

jan-wassenberg opened this issue Jan 14, 2020 · 12 comments · Fixed by #376
Labels
needs discussion Proposal with an unclear resolution

Comments

@jan-wassenberg
Copy link

I haven't seen _mm_mul_epu32 = PMULUDQ mentioned here. This is important for crypto/hashing, and a quick search on Github turned up uses in poly1305, QNNPACK, skia, HighwayHash.

The proposed semantics are: u64x2 r = mul_even(u32x4 a, u32x4 b), where

r[i] = u64(a[2*i]) * b[2*i]

MULE on PPC has the same semantics as PMULUDQ on x86. ARM can vmull_u32 after two vuzp1q_u32.

Does this seem like a useful operation to add?

@ngzhian
Copy link
Member

ngzhian commented Feb 6, 2020

I am supportive of adding this. (V8 uses this in a bunch of places, e.g. in x64 we use pmuludq for i64x2mul, so this proposal will expose the existing instruction.)

@jan-wassenberg
Copy link
Author

For completeness, here's a proposed ARM implementation:

// Multiplies even lanes (0, 2 ..) and places the double-wide result into
// even and the upper half into its odd neighbor lane.
Vec128<int64_t> MulEven(const Vec128<int32_t> a,
                        const Vec128<int32_t> b) {
  int32x4_t a_packed = vuzp1q_s32(a.raw, a.raw);
  int32x4_t b_packed = vuzp1q_s32(b.raw, b.raw);
  return Vec128<int64_t>(
      vmull_s32(vget_low_s32(a_packed), vget_low_s32(b_packed)));
}

@dtig
Copy link
Member

dtig commented Feb 20, 2020

Thanks @jan-wassenberg for adding mappings to x64, and Aarch64 instructions. This is possibly a useful operation, but the ARM sequence in this case is somewhat unfortunate, concretely on Aarch64, this would map to 5 instructions - 2 UZP1 Vd.4S,Vn.4S,Vm.4S instructions for the vuzp1q_s32 intrinsic, 2 DUP Vd.1D,Vn.D[0] instructions for the vget_low_s32 intrinsic, and an
SMULL Vd.2D,Vn.2S,Vm.2S for the vmull_s32 intrinsic.

For ARMv7, the code sequence is somewhat fuzzy because though the DUP, and the SMULL instructions are supported on v7, the UZP1 ASFAIK instruction doesn't have a direct mapping. Do you happen to know what the sequence of this would be on 32-bit ARM architectures?

@rrwinterton
Copy link

So @jan-wassenberg @dtig what you would like to have is a register instruction as opposed to a load extend instruction proposed in Introduce Load and Extend issue #98 ? It depends on what you are doing I guess? If you have to load anyway you could keep it in a register. It depends on how many you want and if you start to spill? I probably am missing something.

@jan-wassenberg
Copy link
Author

@dtig You're welcome! I believe the DUPs can be elided? https://gcc.godbolt.org/z/zYJGPJ

@rrwinterton The proposal is aimed at multiplication, it can be independent of load-extend. We're trying to expose pmuludq, because IIRC only AVX3 has full 64x64 bit multiplication.

@tlively tlively added the needs discussion Proposal with an unclear resolution label Mar 28, 2020
@dtig
Copy link
Member

dtig commented Apr 1, 2020

@jan-wassenberg Not that I can see for pre-ARMv8, am I missing something?

I would like to hear more opinions about this, and perhaps some links to code in the wild if possible? As with general positive interest, would not be opposed to prototyping this and benchmarking as well considering most of the concerns about performance are from me.

@jan-wassenberg
Copy link
Author

@dtig
Copy link
Member

dtig commented Apr 2, 2020

Thanks @jan-wassenberg - originally we planned to optimize the subset of operations just for Neon support but as more engines/applications have come up it does look like they care about performance on some subset of older ARM devices especially on mobile, as most of the operations till now have codegen that's fairly reasonable so far this hasn't been an issue, but with some of the newer proposed operations it's less clear how much we should index on this particular aspect. To be consistent across architectures, though it would make sense to bias this towards it being future facing. I'll take an AI to get a more accurate sense for this and open an issue, or respond here.

@bjacob
Copy link

bjacob commented May 5, 2020

As a tangent, I would like to mention that for fixed-point calculations, it would be useful to also have fixed-point multiplication instructions, e.g. 32x32=32 and 16x16=16, similar to ARM SQRDMULH. I mention this here because some may think that the availability of a 32x32=64 integer multiplication would remove the need for that, but that would be sub-optimal: staying within 32bit means doing 4 scalar operations per 128-bit vector operation, and most applications want to use the rounding flavor (SQRDMULH not SQDMULH) which would require a few more instructions to emulate if the instruction is missing, which in practice would result in applications making compromises between accuracy and performance.

(This is critical to integer-quantized neural network applications as performed by TensorFlow Lite using the ruy matrix multiplication library, see e.g. the usage of these instructions here, https://github.com/google/ruy/blob/57e64b4c8f32e813ce46eb495d13ee301826e498/ruy/kernel_arm64.cc#L517 )

@bjacob
Copy link

bjacob commented May 5, 2020

Branched into Issue #221.

@Maratyszcza
Copy link
Contributor

32x32->64 multiplication was added in #376, related fixed-point multiplication instruction (Q15 format) was added in #365

@ngzhian
Copy link
Member

ngzhian commented Mar 17, 2021

Closing since we have 32x32=64, as Marat pointed out.

@ngzhian ngzhian closed this as completed Mar 17, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
needs discussion Proposal with an unclear resolution
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants