This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

Consider adding Horizontal Add #20

Open
dtig opened this issue May 25, 2017 · 33 comments

Comments

@dtig
Member

dtig commented May 25, 2017

Packed horizontal arithmetic is reasonably performant on SSE3+ and NEON. These would be useful for complex multiplications; in the absence of the opcodes below, they would have to be emulated with a combination of shifts and adds.

f32x4.addHoriz(x: v128, y: v128) -> v128
i32x4.addHoriz(x: v128, y: v128) -> v128
i16x8.addHoriz(x: v128, y: v128) -> v128

Thoughts on whether horizontal add instructions would be useful to include in the current SIMD spec?

@billbudge

I'm in favor of this. We've implemented it in the V8 prototype.

@stoklund
Contributor

The f64x2 version could be included too. ARMv7 would need to do two scalar additions anyway, and ARMv8/SSE3 have the instruction.

MIPS and POWER do not have these instructions.

@gnzlbg
Contributor

gnzlbg commented Apr 6, 2018

@billbudge What semantics does the horizontal add have w.r.t. floating-point arithmetic? Is it ordered (and if so, how did you implement it)? Or does it perform a tree reduction?

@billbudge

These are pairwise additions, so for a 4 lane vector type, two source operands would form a single destination vector like this:

[ src0[0] + src0[1], src0[2] + src0[3], src1[0] + src1[1], src1[2] + src1[3] ]

The semantics are those of the vpadd instruction on ARM, and of haddps, phaddw, and phaddd on Intel SSE.
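
For illustration, a minimal C sketch of these pairwise semantics using the SSE3 intrinsic for haddps (the lane values are made up):

#include <pmmintrin.h>  /* SSE3 */
#include <stdio.h>

int main(void) {
    __m128 src0 = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
    __m128 src1 = _mm_setr_ps(5.0f, 6.0f, 7.0f, 8.0f);
    __m128 dst  = _mm_hadd_ps(src0, src1);   /* [1+2, 3+4, 5+6, 7+8] */

    float out[4];
    _mm_storeu_ps(out, dst);
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);   /* 3 7 11 15 */
    return 0;
}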

@gnzlbg
Contributor

gnzlbg commented Apr 10, 2018

I see. I was expecting the intent of these operations to be to help with full horizontal reductions (e.g. f32x4.addHoriz(x: v128) -> f32), but yes, those are useful as well.

@billbudge

These can be composed to do the full reductions. The advantage of keeping them primitive (pairwise) is that a compiler will have more opportunity to schedule the instructions. If we expose full reduction opcodes, then a WASM compiler has to generate a sequence of multiple pairwise reductions that stall, since WASM compilers don't necessarily do scheduling and other optimization passes. For example, the anyTrue and allTrue boolean vector opcodes have this problem on some platforms (ARM).

@gnzlbg
Contributor

gnzlbg commented Apr 10, 2018

These can be composed to do the full reductions.

How? A full reduction performs src[0] + src[1] + src[2] + src[3]. I don't see a way to emulate this with pairwise reductions.

@billbudge

Starting with a vector:
[x0, x1, x2, x3]

pairwise reduction with itself:
[x0 + x1, x2 + x3, x0 + x1, x2 + x3]
another pairwise reduction with itself:
[x0 + x1 + x2 + x3, x0 + x1 + x2 + x3, x0 + x1 + x2 + x3, x0 + x1 + x2 + x3] // 4 copies of reduction.

If instead you use a zeroed register as the second source, two pairwise reductions give:

[x0 + x1, x2 + x3, 0, 0]
[x0 + x1 + x2 + x3, 0, 0, 0]

Does that make sense?
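
A sketch of this composition using the SSE3 intrinsics (not the proposal's opcodes): two pairwise adds of a vector with itself leave the full sum in every lane.

#include <pmmintrin.h>  /* SSE3 */

static float hsum_via_hadd(__m128 v) {
    __m128 t = _mm_hadd_ps(v, v);   /* [v0+v1, v2+v3, v0+v1, v2+v3] */
    t = _mm_hadd_ps(t, t);          /* every lane: (v0+v1) + (v2+v3) */
    return _mm_cvtss_f32(t);
}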

@gnzlbg
Contributor

gnzlbg commented Apr 10, 2018

Does that make sense?

That assumes that floating-point math is associative, but it isn't; that is, (x0 + x1) + (x2 + x3) != x0 + x1 + x2 + x3 in general.

If the intent of the user is to perform a horizontal ordered reduction of the vector elements "as if they were an array", these operations don't help at all.
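
A concrete single-precision example of this point (values chosen to make the difference visible; evaluated with default IEEE semantics, no fast-math):

#include <stdio.h>

int main(void) {
    float x0 = 1e8f, x1 = 3.0f, x2 = 3.0f, x3 = 3.0f;
    float tree    = (x0 + x1) + (x2 + x3);   /* 1e8 + 6 rounds up  -> 100000008 */
    float ordered = ((x0 + x1) + x2) + x3;   /* each +3 rounds away -> 100000000 */
    printf("tree = %.1f, ordered = %.1f\n", tree, ordered);
    return 0;
}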

@gnzlbg
Contributor

gnzlbg commented Apr 10, 2018

FWIW I am not saying that these operations are not useful; I was just wondering what their semantics were, because depending on that they enable some uses and not others.

@billbudge

OK, I see what you're saying. You're correct that floating point operations are not associative. The general intent of SIMD is to give performance improvements for vector operations. If you care about exact math, you'd have to shuffle the data or extract lanes and use scalar operations.

@gnzlbg
Contributor

gnzlbg commented Apr 10, 2018

FYI Rust's horizontal SIMD reductions currently specify that they perform tree-reductions, and I was trying to see how to map them to WASM. I think LLVM should be able to map them to the horizontal reductions proposed without issues :)

For exact math we might be adding horizontal reductions of the form f32x4.addHoriz(acc: f32, x: v128) -> f32 in the future, since some ISAs have instructions for them (e.g. fadda on ARM), but as you mention these will need to fall back to scalar code on other targets.
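
A minimal scalar-fallback sketch of that accumulator-style ordered form (the function name and the lanes[] stand-in for the v128 operand are illustrative, not part of any spec); on hardware without an fadda-like instruction this is roughly the code shape a code generator would have to emit:

/* hypothetical signature; accumulates strictly one lane at a time, in lane order */
static float f32x4_add_horiz_ordered(float acc, const float lanes[4]) {
    for (int i = 0; i < 4; ++i)
        acc += lanes[i];
    return acc;
}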

@lemaitre

lemaitre commented Apr 18, 2018

The main problem I see is that floating point arithmetic is not associative.
So the result will depend on the implementation.

We could have the following:

  • f32x4.addHorizOrdered(x: v128) -> f32 which computes x0 + (x1 + (x2 + x3))
  • f32x4.addHorizTree(x: v128) -> f32 which computes (x0 + x1) + (x2 + x3)
  • f32x4.addHorizUnordered(x: v128) -> f32 where the order of the computation is not specified: implementation defined

It would be conceivable to have all three instructions, each getting converted to the best implementation.
Keep in mind that the VHADDPS instruction in SSE/AVX is actually slower for computing the full reduction than regular adds and shuffles.

But actually, it would also be tempting not to provide those at all, and to let the compiler apply adds and shuffles to get the right result.
That way it will always give the correct result, as add is fully specified.

I would say it is fine to have instructions that can give slightly different results on different hardware if they are documented as such, and easy alternatives with fully specified results are also provided.

Also, other reductions might be interesting: min, max, bit_and, bit_or, bit_xor...
If you don't provide any instructions for those, they will be implemented with a bunch of shuffles.

@gnzlbg
Contributor

gnzlbg commented Apr 18, 2018

Whatever the intrinsics added do, they should specify exactly what they do.

The proposed ones are available in most hardware, and it is rare that hardware has intrinsics to do something else. I think the proposed ones are fine for WebAssembly. If users/compilers want to offer other kinds of reduction intrinsics, they should build them on top of these.

For example, Rust offers portable tree reduction operations, and they could be implemented on top of the intrinsics specified here without issues in a reasonably efficient way.

IMO we should only add more intrinsics for these operations if they become widely available in the hardware that WebAssembly programs run on.

@lemaitre

To be fair, I think we should either put all of them (the tree, the ordered and the unspecified), or none.
Those are easy to write without any specific instruction.

I don't know if this could be easily detected by the VM, but it would also be possible to detect the pattern with shuffles and convert it to the best assembly code possible for the hardware.

@gnzlbg
Contributor

gnzlbg commented Apr 18, 2018

WebAssembly is an instruction set architecture; I don't agree that adding instructions with unspecified behavior to it is a good idea.

The operations proposed are available on many many architectures.

Which architectures do support built in tree reductions in a single operation?

Which architectures do support ordered reductions with the semantics you proposed?

@lemaitre

First, I also proposed not to add any instructions for this, as it is easy to implement without any special instruction. Actually, it would be very nice if the VM were able to detect this pattern and use the best hardware instructions for it, without creating a specialized instruction.

The unspecified order could come later, but if you provide the ordered reduction, you need to provide the tree reduction, and vice-versa: both are useful.

Now to answer your question:
SSE, AVX, AVX512, Neon, Altivec/VSX have none
ARM-SVE has both.

AVX512 intrinsics provide _mm512_reduce_add_ps, but this is not a single instruction.
According to the documentation, it does an ordered reduction, but actually, compilers produce a tree reduction.

@gnzlbg
Contributor

gnzlbg commented Apr 18, 2018

AFAIK the intrinsics proposed here are supported by all SIMD architectures (at least SSE, AVX, Altivec, VSX, NEON, ASIMD, MSA, ...). The ordered and tree reductions are supported only by ARM SVE, and the ordered reduction has a different API there, taking an accumulator, and this is for a good reason.

Currently there aren't even that many ARM SVE chips available, compiler support for ARM SVE is incomplete (at least in LLVM), and for all we currently know ARM SVE might be a "one generation ISA", with future ARM hardware pursuing something else.

Given that adding intrinsics for tree and ordered reductions to WASM forces all WASM compilers to add logic to lower them to something else on all architectures except on ARM SVE, I don't think that we should initially add them to WASM. If ARM SVE survives the first generation and if these instructions become widely available on other architectures, one can always add them to WASM in the future.

@lemaitre

I have the impression we basically agree with each other: no reduction instruction is required for WASM currently.

I just don't get what you mean by:

AFAIK the intrinsics proposed here are supported by all SIMD architectures

Does this include any form of addHoriz? A simplified one? Or just that adds and shuffles are enough and supported by all architectures?

@gnzlbg
Contributor

gnzlbg commented Apr 19, 2018

Does this include any form of addHoriz? A simplified one? Or just that adds and shuffles are enough and supported by all architectures?

The three addHoriz intrinsics proposed in this issue: #20 (comment). There are assembly instructions that map 1:1 to them on pretty much every architecture with SIMD instructions.

@lemaitre

What do they do?
Are they equivalent to the HADDPS on SSE/AVX (pairwise additions)?
If so, then I think it is a bad idea as this instruction is slow if you want the full reduction.
I don't think a partial reduction is worth it.

I would much prefer to have no specialized instruction whatsoever, and to write the reduction only with adds and shuffles (which would be faster on x86, anyway).
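
For reference, the usual "adds and shuffles" full reduction on x86 looks something like this sketch (plain SSE, no horizontal-add instruction involved):

#include <xmmintrin.h>  /* SSE */

static float hsum_shuffle(__m128 v) {
    __m128 hi = _mm_movehl_ps(v, v);            /* [v2, v3, v2, v3]       */
    __m128 s  = _mm_add_ps(v, hi);              /* [v0+v2, v1+v3, .., ..] */
    __m128 t  = _mm_shuffle_ps(s, s, 0x55);     /* broadcast lane 1 of s  */
    return _mm_cvtss_f32(_mm_add_ss(s, t));     /* (v0+v2) + (v1+v3)      */
}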

@gnzlbg
Contributor

gnzlbg commented Apr 19, 2018

What do they do?

  • On x86 and x86_64:
    • SSE3: haddps, haddpd
    • SSSE3: phaddw, phaddd
  • On A32's NEON and A64's ASIMD:
  • VPADD (A32), ADDP / FADDP (A64)

If so, then I think it is a bad idea as this instruction is slow if you want the full reduction.

Sure, but these instructions don't perform a full reduction.

I don't think a partial reduction is worth it.

Yeah, I just went through the MIPS ISA, the PowerPC Altivec and VSX ISAs, and the RISC-V ISA, and couldn't find any way to implement these reductions efficiently.

A tree reduction can be implemented efficiently almost everywhere. For example, MIPS and A64 DOTPROD have a vector dot-product instruction that, at least on MIPS, does a pairwise multiply-add. That is, dotprod(v, v{1,1,1,1}) performs a tree reduction, IIUC how these work. ARM SVE has an intrinsic exclusively for this, and the RISC-V vector extension might allow this with "matrix shapes" as well. AFAICT on x86 and older ARM one needs a couple of instructions to do a tree reduction, but with the hadds mentioned above (or using shuffles, as you mention) one can do this relatively efficiently.

So maybe a full vector tree reduction might be both more useful, and more easily implementable across the board, than just horizontal pairwise adds. About the ordered reductions, I don't personally think they are worth it yet.
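
As an illustration of the dot-product trick, here is a sketch using x86 SSE4.1's dpps rather than MIPS or A64 DOTPROD (the internal summation order of dpps is not an ordered left-to-right reduction, so this only stands in for a tree/unordered full sum):

#include <smmintrin.h>  /* SSE4.1 */

static float hsum_dot(__m128 v) {
    __m128 ones = _mm_set1_ps(1.0f);
    /* 0xF1: use all four products, write the sum to lane 0 */
    return _mm_cvtss_f32(_mm_dp_ps(v, ones, 0xF1));
}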

@lemaitre

If so, then I think it is a bad idea as this instruction is slow if you want the full reduction.

Sure, but these instructions don't perform a full reduction.

If there is no use case apart from the full reduction, then what's the point anyway?
If some people have a use case for these, fine enough, but do not call this function a horizontal add: it is not.

Maybe this: f32x4.addPair(x: v128, y: v128) -> v128

So maybe a full vector tree reduction might be both more useful, and more easily implementable across the board, than just horizontal pairwise adds.

I agree with you, but in that case I think it is also worth providing them for other common operations like min, max, and some binary operations.
And also the ordered reduction (only for floating point, as integer arithmetic is associative).

If we want reduction instructions, then I think we need all these:
Sum:

  • i8x16.addHoriz(x: v128) -> i8: Sum(xi) (output might be bigger to avoid overflow?)
  • i16x8.addHoriz(x: v128) -> i16: Sum(xi)
  • i32x4.addHoriz(x: v128) -> i32: Sum(xi)
  • i64x2.addHoriz(x: v128) -> i64: Sum(xi)
  • f32x4.addHoriz(x: v128) -> f32: (x0 + x1) + (x2 + x3)
  • f32x4.addHorizOrdered(x: v128) -> f32: x0 + (x1 + (x2 + x3))
  • f64x2.addHoriz(x: v128) -> f64: x0 + x1

Saturating Sum:

  • i8x16.addHoriz_saturate_s(x: v128) -> i8: SaturatedSum(xi)
  • i16x8.addHoriz_saturate_s(x: v128) -> i16: SaturatedSum(xi)
  • i32x4.addHoriz_saturate_s(x: v128) -> i32: SaturatedSum(xi)
  • i64x2.addHoriz_saturate_s(x: v128) -> i64: SaturatedSum(xi)
  • i8x16.addHoriz_saturate_u(x: v128) -> i8: SaturatedSum(unsigned(xi))
  • i16x8.addHoriz_saturate_u(x: v128) -> i16: SaturatedSum(unsigned(xi))
  • i32x4.addHoriz_saturate_u(x: v128) -> i32: SaturatedSum(unsigned(xi))
  • i64x2.addHoriz_saturate_u(x: v128) -> i64: SaturatedSum(unsigned(xi))

Min:

  • i8x16.minHoriz(x: v128) -> i8: Min(xi)
  • i16x8.minHoriz(x: v128) -> i16: Min(xi)
  • i32x4.minHoriz(x: v128) -> i32: Min(xi)
  • i64x2.minHoriz(x: v128) -> i64: Min(xi)
  • i8x16.minHoriz_u(x: v128) -> i8: Min(unsigned(xi))
  • i16x8.minHoriz_u(x: v128) -> i16: Min(unsigned(xi))
  • i32x4.minHoriz_u(x: v128) -> i32: Min(unsigned(xi))
  • i64x2.minHoriz_u(x: v128) -> i64: Min(unsigned(xi))
  • f32x4.minHoriz(x: v128) -> f32: Min(xi)
  • f64x2.minHoriz(x: v128) -> f64: Min(xi)

Max:

  • i8x16.maxHoriz(x: v128) -> i8: Max(xi)
  • i16x8.maxHoriz(x: v128) -> i16: Max(xi)
  • i32x4.maxHoriz(x: v128) -> i32: Max(xi)
  • i64x2.maxHoriz(x: v128) -> i64: Max(xi)
  • i8x16.maxHoriz_u(x: v128) -> i8: Max(unsigned(xi))
  • i16x8.maxHoriz_u(x: v128) -> i16: Max(unsigned(xi))
  • i32x4.maxHoriz_u(x: v128) -> i32: Max(unsigned(xi))
  • i64x2.maxHoriz_u(x: v128) -> i64: Max(unsigned(xi))
  • f32x4.maxHoriz(x: v128) -> f32: Max(xi)
  • f64x2.maxHoriz(x: v128) -> f64: Max(xi)

binary and:

  • i8x16.andHoriz(x: v128) -> i8: and(xi)
  • i16x8.andHoriz(x: v128) -> i16: and(xi)
  • i32x4.andHoriz(x: v128) -> i32: and(xi)
  • i64x2.andHoriz(x: v128) -> i64: and(xi)

binary or:

  • i8x16.orHoriz(x: v128) -> i8: or(xi)
  • i16x8.orHoriz(x: v128) -> i16: or(xi)
  • i32x4.orHoriz(x: v128) -> i32: or(xi)
  • i64x2.orHoriz(x: v128) -> i64: or(xi)

binary xor: (not sure if useful)

  • i8x16.xorHoriz(x: v128) -> i8: xor(xi)
  • i16x8.xorHoriz(x: v128) -> i16: xor(xi)
  • i32x4.xorHoriz(x: v128) -> i32: xor(xi)
  • i64x2.xorHoriz(x: v128) -> i64: xor(xi)

Because this is many instructions, I have the feeling that not providing anything might be a good solution. And let the compiler (WASM generation) do the job with shuffles and ops.
If this approach is not performant enough on some platforms, the VM can probably detect some patterns and use better instructions.
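
For an integer lane type, "shuffles and ops" would look roughly like this sketch (i32x4 full sum with plain SSE2; the helper name is made up):

#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

static int32_t hsum_i32x4(__m128i v) {
    __m128i hi = _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2));  /* swap 64-bit halves  */
    __m128i s  = _mm_add_epi32(v, hi);
    __m128i t  = _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1));  /* swap adjacent lanes */
    return _mm_cvtsi128_si32(_mm_add_epi32(s, t));
}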

@gnzlbg
Contributor

gnzlbg commented Apr 19, 2018

If we want reduction instructions, then I think we need all these:

Programming languages can provide these (Rust provides all of them, and many more), but WASM does not need to have one instruction for each of them to be a useful Rust target.

It just has to provide enough instructions for programming languages to be able to expose these reductions in such a way that the generated WASM can be lowered down to efficient machine instructions by WASM code generators on most targets.

(output might be bigger to avoid overflow?)

AFAIK most hardware has modulo 2^n behavior here.

f32x4.addHorizOrdered(x: v128) -> f32: x0 + (x1 + (x2 + x3))

This one still does not make sense to me. There is only one barely-used, widely-unsupported piece of hardware that can do this efficiently, and any code generator would need to generate scalar code for it everywhere else. Also, on the particular piece of hardware that supports it, that signature doesn't allow you to use the instruction for what it was intended for, which is to reduce a large array (larger than a vector) in an ordered way; you would need an accumulator for that:

`f32x4.addHorizOrdered(acc: f32, x: v128) -> f32: acc + (x0 + (x1 + (x2 + x3)))`

And yet it would still be useless to have that on 99% of the hardware, where one would probably be better off performing the reduction without going through vector registers.

And let the compiler (WASM generation) do the job with shuffles and ops.

That's one possibility, but then that's exactly the machine code we are going to get. WASM->machine code generators are not optimizing compilers.

@lemaitre

If we want reduction instructions, then I think we need all these:

Programming languages can provide these (Rust provides all of them, and many more), but WASM does not need to have one instruction for each of them to be a useful Rust target.

It just has to provide enough instructions for programming languages to be able to expose these reductions in such a way that the generated WASM can be lowered down to efficient machine instructions by WASM code generators on most targets.

I agree with you, and that's exactly why I think we don't need any of those.
But I think if we provide some of them, we should provide all of them, just to be consistent. I would prefer to avoid making the same mistakes as SSE and AVX, which have many inconsistencies.

AFAIK most hardware has modulo 2^n behavior here.

True, but my point here is that it will almost always overflow, which would make this function quite useless, no?

f32x4.addHorizOrdered(acc: f32, x: v128) -> f32: acc + (x0 + (x1 + (x2 + x3)))

This one might make more sense, you're right. Like I said earlier, there is nothing worse than an inconsistent ISA; that's mainly why I want this one. And it might be used by compilers for vectorising sums without fast-math (or not).
Actually, the double-precision variant will also be needed.

And let the compiler (WASM generation) do the job with shuffles and ops.

That's one possibility, but then that's exactly the machine code we are going to get.

Which is fine as it will be efficient on the vast majority of targets.

WASM->machine code generators are not optimizing compilers.

Yes they are, but they are only doing very quick optimizations and I have the impression that detecting such a pattern is quick enough to be embedded in this machinery (but I might be wrong on this assumption).

@gnzlbg
Contributor

gnzlbg commented Apr 19, 2018

Yes they are, but they are only doing very quick optimizations and I have the impression that detecting such a pattern is quick enough to be embedded in this machinery (but I might be wrong on this assumption).

If we don't provide the intrinsics, and X->WASM compilers lower this to scalar code, you would need an auto-vectorizing WASM->machine code generator. Native languages like C++ and Rust have pretty good auto-vectorizers, and yet they still expose all of these intrinsics because auto-vectorizers often get it wrong.

Which machine code generators for WASM perform auto-vectorization? AFAIK Cretonne performs no optimizations whatsoever because performing optimizations is the job of the X->WASM compiler. I'd expect others to perform minor optimizations while lowering WASM to machine code ("clever" instruction selection at most), but nothing close to LLVM.

@lemaitre

lemaitre commented Apr 19, 2018

I never said anything about scalar code. I don't expect any vectorizing WASM->ASM compiler for a long time, if ever.

I said that compilers generate some shuffles + vectorized adds, like this:

get_local 0                   ;; x = [x0, x1, x2, x3]
get_local 0
v32x4.shuffle 0b01001110      ;; [x2, x3, x0, x1] (swap the 64-bit halves)
f32x4.add                     ;; s = [x0+x2, x1+x3, x2+x0, x3+x1]
tee_local 1                   ;; keep a copy of s in a scratch local
get_local 1
v32x4.shuffle 0b10110001      ;; [s1, s0, s3, s2] (swap adjacent lanes)
f32x4.add                     ;; every lane now holds (x0+x2) + (x1+x3)
f32x4.extract_lane 0          ;; the full (tree-order) sum

And the VM detects this pattern and converts it into faster assembly, if available.

PS: I don't think this shuffle syntax exists, but that's only for simplicity

@gnzlbg
Contributor

gnzlbg commented Apr 19, 2018

That makes sense, I wasn't getting your point. Adding shuffles seems like a better way to pursue this.

@lemaitre

I just created a merge request about shuffling: #30
It mentions at the end how to use shuffling operations to compute a reduction.

@baryluk

baryluk commented Dec 28, 2018

Horizontal reductions (add, mul, min, and max) are really something I wish were in WASM. They can be emulated with trivial code if the hardware does not support them. It would make it more semantically clear what is being done, make the code cleaner, and reduce the need to access lanes directly.

+1

@gnzlbg
Contributor

gnzlbg commented Mar 2, 2019

I agree with @dtig that we don't need to add horizontal reductions to the MVP, and as long as PR #30 makes it into the MVP that isn't really a big deal, since we can emulate them with it.

Emulating reductions with shuffles would be a temporary situation that has some downsides: increased binary size, and sub-optimal performance if the WASM->target machine code generator doesn't pattern-match the sequence, which is hard because many shuffle sequences can be used to express the same type of reduction, etc. OTOH it has the advantage that WASM->target machine code generators wouldn't need to polyfill these on architectures that do not support them.

We could mitigate the performance downsides by writing down a document outside the spec containing one "recommended" way to express each reduction with the SIMD MVP, which X->WASM compilers are "encouraged" to always use and WASM->X compilers are "encouraged" to pattern-match.

@MaxGraey

That's a really necessary operation. I tried different approaches to emulate this instruction, but all are far from optimal: https://godbolt.org/z/Sw3yzi

Pairwise or tree summation will be enough, because pairwise summation is in the general case more precise than an ordered (but unsorted) summation.

@omnisip

omnisip commented Oct 5, 2020

There's a lot going on in this thread, but I could really use something that's the equivalent of pmaddubsw. That's a horizontal multiply-and-add that leaves its results in 16-bit integers. Likewise, a standard horizontal add. For anyone who thinks we should split up the multiply and add instructions, I would strongly discourage doing so. Aside from the comments about associativity, which doesn't hold for floating-point numbers, compilers have a long-standing ban against reordering arithmetic operations. Imagine the result of n*22/7 if 22/7 were evaluated first; it would be seriously problematic. With respect to that, horizontal adds, subtractions, multiply-and-adds, and so forth should all be done as single ops, separate from one another -- otherwise the behavior cannot be expected to be accurate.
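
For readers unfamiliar with it, a scalar model of what pmaddubsw (SSSE3's _mm_maddubs_epi16) computes per 16-bit output lane: one unsigned byte of a times the corresponding signed byte of b, plus the adjacent product, with signed saturation (the helper names here are made up):

#include <stdint.h>

static int16_t sat16(int32_t v) {
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

static void maddubsw_model(const uint8_t a[16], const int8_t b[16], int16_t out[8]) {
    for (int i = 0; i < 8; ++i) {
        int32_t p0 = (int32_t)a[2 * i]     * b[2 * i];      /* unsigned * signed */
        int32_t p1 = (int32_t)a[2 * i + 1] * b[2 * i + 1];
        out[i] = sat16(p0 + p1);                            /* adjacent pair, saturated */
    }
}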
