Proposal to add mul 32x32=64 #175

jan-wassenberg · 2020-01-14T09:20:01Z

I haven't seen _mm_mul_epu32 = PMULUDQ mentioned here. This is important for crypto/hashing, and a quick search on Github turned up uses in poly1305, QNNPACK, skia, HighwayHash.

The proposed semantics are: u64x2 r = mul_even(u32x4 a, u32x4 b), where

r[i] = u64(a[2*i]) * b[2*i]

MULE on PPC has the same semantics as PMULUDQ on x86. ARM can vmull_u32 after two vuzp1q_u32.

Does this seem like a useful operation to add?

The text was updated successfully, but these errors were encountered:

ngzhian · 2020-02-06T18:52:15Z

I am supportive of adding this. (V8 uses this in a bunch of places, e.g. in x64 we use pmuludq for i64x2mul, so this proposal will expose the existing instruction.)

jan-wassenberg · 2020-02-07T16:59:51Z

For completeness, here's a proposed ARM implementation:

// Multiplies even lanes (0, 2 ..) and places the double-wide result into
// even and the upper half into its odd neighbor lane.
Vec128<int64_t> MulEven(const Vec128<int32_t> a,
                        const Vec128<int32_t> b) {
  int32x4_t a_packed = vuzp1q_s32(a.raw, a.raw);
  int32x4_t b_packed = vuzp1q_s32(b.raw, b.raw);
  return Vec128<int64_t>(
      vmull_s32(vget_low_s32(a_packed), vget_low_s32(b_packed)));
}

dtig · 2020-02-20T06:49:03Z

Thanks @jan-wassenberg for adding mappings to x64, and Aarch64 instructions. This is possibly a useful operation, but the ARM sequence in this case is somewhat unfortunate, concretely on Aarch64, this would map to 5 instructions - 2 UZP1 Vd.4S,Vn.4S,Vm.4S instructions for the vuzp1q_s32 intrinsic, 2 DUP Vd.1D,Vn.D[0] instructions for the vget_low_s32 intrinsic, and an
SMULL Vd.2D,Vn.2S,Vm.2S for the vmull_s32 intrinsic.

For ARMv7, the code sequence is somewhat fuzzy because though the DUP, and the SMULL instructions are supported on v7, the UZP1 ASFAIK instruction doesn't have a direct mapping. Do you happen to know what the sequence of this would be on 32-bit ARM architectures?

rrwinterton · 2020-02-20T23:28:17Z

So @jan-wassenberg @dtig what you would like to have is a register instruction as opposed to a load extend instruction proposed in Introduce Load and Extend issue #98 ? It depends on what you are doing I guess? If you have to load anyway you could keep it in a register. It depends on how many you want and if you start to spill? I probably am missing something.

jan-wassenberg · 2020-02-21T10:14:58Z

@dtig You're welcome! I believe the DUPs can be elided? https://gcc.godbolt.org/z/zYJGPJ

@rrwinterton The proposal is aimed at multiplication, it can be independent of load-extend. We're trying to expose pmuludq, because IIRC only AVX3 has full 64x64 bit multiplication.

dtig · 2020-04-01T17:32:57Z

@jan-wassenberg Not that I can see for pre-ARMv8, am I missing something?

I would like to hear more opinions about this, and perhaps some links to code in the wild if possible? As with general positive interest, would not be opposed to prototyping this and benchmarking as well considering most of the concerns about performance are from me.

jan-wassenberg · 2020-04-01T18:04:42Z

@dtig Ah yes, looks like those are A64 instructions - I am not very familiar with v7.
Here are some users of the VMULL/MULUDQ instruction:
https://github.com/google/skia/blob/master/include/private/SkNx_neon.h
https://github.com/google/highwayhash/blob/master/highwayhash/hh_neon.h
https://github.com/floodyberry/poly1305-opt/blob/master/app/extensions/poly1305/poly1305_sse2-64.inc

dtig · 2020-04-02T18:05:37Z

Thanks @jan-wassenberg - originally we planned to optimize the subset of operations just for Neon support but as more engines/applications have come up it does look like they care about performance on some subset of older ARM devices especially on mobile, as most of the operations till now have codegen that's fairly reasonable so far this hasn't been an issue, but with some of the newer proposed operations it's less clear how much we should index on this particular aspect. To be consistent across architectures, though it would make sense to bias this towards it being future facing. I'll take an AI to get a more accurate sense for this and open an issue, or respond here.

bjacob · 2020-05-05T19:31:27Z

As a tangent, I would like to mention that for fixed-point calculations, it would be useful to also have fixed-point multiplication instructions, e.g. 32x32=32 and 16x16=16, similar to ARM SQRDMULH. I mention this here because some may think that the availability of a 32x32=64 integer multiplication would remove the need for that, but that would be sub-optimal: staying within 32bit means doing 4 scalar operations per 128-bit vector operation, and most applications want to use the rounding flavor (SQRDMULH not SQDMULH) which would require a few more instructions to emulate if the instruction is missing, which in practice would result in applications making compromises between accuracy and performance.

(This is critical to integer-quantized neural network applications as performed by TensorFlow Lite using the ruy matrix multiplication library, see e.g. the usage of these instructions here, https://github.com/google/ruy/blob/57e64b4c8f32e813ce46eb495d13ee301826e498/ruy/kernel_arm64.cc#L517 )

bjacob · 2020-05-05T19:48:01Z

Branched into Issue #221.

Maratyszcza · 2021-01-14T05:50:29Z

32x32->64 multiplication was added in #376, related fixed-point multiplication instruction (Q15 format) was added in #365

ngzhian · 2021-03-17T22:01:00Z

Closing since we have 32x32=64, as Marat pointed out.

tlively added the needs discussion Proposal with an unclear resolution label Mar 28, 2020

This was referenced May 5, 2020

Proposal to add fixed-point multiplication instructions #221

Closed

Consistently support widening/long variants of integer instructions #226

Open

Maratyszcza mentioned this issue Oct 7, 2020

Extended multiplication instructions #376

Merged

dtig linked a pull request Jan 8, 2021 that will close this issue

Extended multiplication instructions #376

Merged

ngzhian closed this as completed Mar 17, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal to add mul 32x32=64 #175

Proposal to add mul 32x32=64 #175

jan-wassenberg commented Jan 14, 2020

ngzhian commented Feb 6, 2020

jan-wassenberg commented Feb 7, 2020

dtig commented Feb 20, 2020 •

edited

Loading

rrwinterton commented Feb 20, 2020

jan-wassenberg commented Feb 21, 2020

dtig commented Apr 1, 2020

jan-wassenberg commented Apr 1, 2020

dtig commented Apr 2, 2020

bjacob commented May 5, 2020 •

edited

Loading

bjacob commented May 5, 2020 •

edited

Loading

Maratyszcza commented Jan 14, 2021

ngzhian commented Mar 17, 2021

Proposal to add mul 32x32=64 #175

Proposal to add mul 32x32=64 #175

Comments

jan-wassenberg commented Jan 14, 2020

ngzhian commented Feb 6, 2020

jan-wassenberg commented Feb 7, 2020

dtig commented Feb 20, 2020 • edited Loading

rrwinterton commented Feb 20, 2020

jan-wassenberg commented Feb 21, 2020

dtig commented Apr 1, 2020

jan-wassenberg commented Apr 1, 2020

dtig commented Apr 2, 2020

bjacob commented May 5, 2020 • edited Loading

bjacob commented May 5, 2020 • edited Loading

Maratyszcza commented Jan 14, 2021

ngzhian commented Mar 17, 2021

dtig commented Feb 20, 2020 •

edited

Loading

bjacob commented May 5, 2020 •

edited

Loading

bjacob commented May 5, 2020 •

edited

Loading