Consider adding Horizontal Add #20
I'm in favor of this. We've implemented it in the V8 prototype.
The f64x2 version could be included too. ARMv7 would need to do two scalar additions anyway, and ARMv8/SSE3 have the instruction. MIPS and POWER do not have these instructions.
@billbudge What semantics does the horizontal add have w.r.t. floating-point arithmetic? Is it ordered (if so, how did you implement it)? Or does it perform a tree reduction?
These are pairwise additions, so for a 4-lane vector type, two source operands would form a single destination vector like this: [ src0[0] + src0[1], src0[2] + src0[3], src1[0] + src1[1], src1[2] + src1[3] ]. The semantics are those of the vpadd instruction on ARM, and of haddps, phaddw, and phaddd on Intel SSE.
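The pairwise semantics described here can be modeled in plain scalar code. This is only a sketch; `add_horiz` is a hypothetical stand-in for the proposed addHoriz opcode, not part of any spec:

```rust
// Scalar model of a pairwise (horizontal) add over two 4-lane vectors,
// mirroring the vpadd/haddps semantics described above.
fn add_horiz(src0: [f32; 4], src1: [f32; 4]) -> [f32; 4] {
    [
        src0[0] + src0[1],
        src0[2] + src0[3],
        src1[0] + src1[1],
        src1[2] + src1[3],
    ]
}

fn main() {
    let r = add_horiz([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]);
    assert_eq!(r, [3.0, 7.0, 11.0, 15.0]);
}
```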
I see. I was expecting the intent of these operations to be to help with full horizontal reductions (e.g.
These can be composed to do the full reductions. The advantage of keeping them primitive (pairwise) is that a compiler will have more opportunity to schedule the instructions. If we expose full reduction opcodes, then a WASM compiler has to generate a sequence of multiple pairwise reductions that stall, since WASM compilers don't necessarily do scheduling and other optimization passes. For example, the anyTrue and allTrue boolean vector opcodes have this problem on some platforms (ARM).
How? The full reduction performs:
Starting with a vector [x0, x1, x2, x3], a pairwise reduction with itself gives [x0 + x1, x2 + x3, x0 + x1, x2 + x3]. If instead you use a zeroed register as the second source, one pairwise reduction gives [x0 + x1, x2 + x3, 0, 0], and a second completes the sum: [x0 + x1 + x2 + x3, 0, 0, 0]. Does that make sense?
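That composition can be checked with a scalar model of the pairwise opcode. A sketch, with `add_horiz` as a hypothetical stand-in:

```rust
// Pairwise add, as described earlier in the thread.
fn add_horiz(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    [a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3]]
}

fn main() {
    let v = [1.0f32, 2.0, 3.0, 4.0];
    let zero = [0.0f32; 4];
    // One pairwise reduction against a zeroed register:
    let step1 = add_horiz(v, zero); // [x0+x1, x2+x3, 0, 0]
    // A second one completes the full sum in lane 0:
    let step2 = add_horiz(step1, zero); // [x0+x1+x2+x3, 0, 0, 0]
    assert_eq!(step1, [3.0, 7.0, 0.0, 0.0]);
    assert_eq!(step2, [10.0, 0.0, 0.0, 0.0]);
}
```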
That assumes that floating-point math is associative, but it isn't. If the intent of the user is to perform a horizontal ordered reduction of the vector elements "as if it were an array", these operations don't help at all.
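A concrete f32 example of the discrepancy: the exact sum of the vector below is 2.0, but the ordered and tree reduction orders produce different (and both inexact) results.

```rust
fn main() {
    let v: [f32; 4] = [1.0, 1e8, -1e8, 1.0];
    // Ordered (left-to-right) reduction: ((v0 + v1) + v2) + v3.
    // The leading 1.0 is absorbed by 1e8, so only the trailing 1.0 survives.
    let ordered = ((v[0] + v[1]) + v[2]) + v[3];
    // Tree reduction: (v0 + v1) + (v2 + v3).
    // Both halves round to +/-1e8 and cancel exactly.
    let tree = (v[0] + v[1]) + (v[2] + v[3]);
    assert_eq!(ordered, 1.0);
    assert_eq!(tree, 0.0);
}
```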
FWIW I am not saying that these operations are not useful, I was just wondering what their semantics were, because depending on that they allow some operations or others. |
OK, I see what you're saying. You're correct that floating point operations are not associative. The general intent of SIMD is to give performance improvements for vector operations. If you care about exact math, you'd have to shuffle the data or extract lanes and use scalar operations. |
FYI Rust's horizontal SIMD reductions currently specify that they perform tree-reductions, and I was trying to see how to map them to WASM. I think LLVM should be able to map them to the horizontal reductions proposed without issues :) For exact math we might be adding horizontal reductions of the form |
The main problem I see is that floating point arithmetic is not associative. We could have the following:
It would be conceivable to have all 3 instructions get converted to the right implementation. But actually, it would also be tempting to not provide those, and let the compiler apply adds and shuffles to get the right result. I would say it is fine to have instructions that can give slightly different results on different hardware if they are documented as such, and easy alternatives with fully specified results are also provided. Also, other reductions might be interesting: min, max, bit_and, bit_or, bit_xor...
Whatever the intrinsics added do, they should specify exactly what they do. The proposed ones are available in most hardware, and it is rare that hardware has intrinsics to do something else. I think the proposed ones are fine for WebAssembly. If users/compilers want to offer other kinds of reduction intrinsics, they should build them on top of these. For example, Rust offers portable tree reduction operations, and they could be implemented on top of the intrinsics specified here without issues, in a reasonably efficient way. IMO we should only add more intrinsics for these operations if those become widely available in the hardware that WebAssembly programs run on.
To be fair, I think we should either put all of them (the tree, the ordered and the unspecified), or none. I don't know if this could be easily detected by the VM, but it would also be possible to detect the pattern with shuffles and convert it to the best assembly code possible for the hardware. |
WebAssembly is an Instruction Set Architecture; I don't agree that adding assembly instructions to it with unspecified behavior is a good idea. The operations proposed are available on many, many architectures. Which architectures support built-in tree reductions in a single operation? Which architectures support ordered reductions with the semantics you proposed?
First, I also proposed not to add any instructions for that, as it is easy to implement without any special instructions. Actually, it would be very nice if the VM were able to detect this pattern and use better hardware instructions to do it, without creating a specialized instruction for that. The unspecified order could come later, but if you provide the ordered reduction, you need to provide the tree reduction, and vice versa: both are useful. Now to answer your question: AVX512 intrinsics provide
AFAIK the intrinsics proposed here are supported by all SIMD architectures (at least SSE, AVX, Altivec, VSX, NEON, ASIMD, MSA, ...). The ordered and tree reductions are only supported by ARM SVE, and the ordered reductions have a different API there, taking an accumulator, and this is for a good reason. Currently there aren't that many ARM SVE chips available, compiler support for ARM SVE is incomplete (at least in LLVM), and for all we currently know ARM SVE might be a "one generation ISA", with future ARM hardware pursuing something else. Given that adding intrinsics for tree and ordered reductions to WASM forces all WASM compilers to add logic to lower them to something else on all architectures except ARM SVE, I don't think that we should initially add them to WASM. If ARM SVE survives the first generation and these instructions become widely available on other architectures, one can always add them to WASM in the future.
I have the impression we basically agree with each other: no reduction instruction is required for WASM currently. I just don't get what you mean by:
Does this include any form of |
The three
What do they do? I would much prefer have no specialized instruction whatsoever, and write the reduction only with adds and shuffles (which would be faster on x86, anyway). |
Sure, but these instructions don't perform a full reduction.
Yeah, I just went through the MIPS ISA, the PowerPC Altivec and VSX ISAs, and the RISC-V ISA, and couldn't find any way to implement these reductions efficiently. A tree reduction can be implemented efficiently almost everywhere. For example, MIPS and A64 DOTPROD have a vector dot-product instruction that at least on MIPS does a pairwise multiply-add. That is, dotprod(v, {1,1,1,1}) performs a tree reduction, IIUC how these work. ARM SVE has an intrinsic exclusively for this, and the RISC-V vector extensions might allow this with "matrix shapes" as well. AFAICT on x86 and older ARM one needs a couple of instructions to do a tree reduction, but with the hadds mentioned above (or using shuffles, as you mention) one can do this relatively efficiently. So maybe a full vector tree reduction might be both more useful and more easily implementable across the board than just horizontal pairwise adds. About the ordered reductions, I don't personally think they are worth it yet.
Maybe this:
If we want reduction instructions, then I think we need all these:
- Saturating sum:
- Min:
- Max:
- Binary and:
- Binary or:
- Binary xor: (not sure if useful)
Because that is a lot of instructions, I have the feeling that not providing anything might be a good solution, and letting the compiler (WASM generation) do the job with shuffles and ops.
Programming languages can provide these (Rust provides all of them, and many more), but WASM does not need to have one instruction for each one of them to be a useful Rust target. It just has to provide enough instructions for programming languages to be able to expose these reductions in such a way that the generated WASM can be lowered down to efficient machine instructions by WASM code generators on most targets.
AFAIK most hardware has modulo 2^n behavior here.
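The modulo 2^n behavior is the same wrap-around that Rust's `wrapping_add` exposes. This sketch only illustrates the wrap for one i16 lane; it is not a proposed opcode:

```rust
fn main() {
    // Two i16 lane values whose exact sum (60000) exceeds i16::MAX (32767).
    let a: i16 = 30000;
    let b: i16 = 30000;
    // Modulo 2^16 behavior: 60000 - 65536 = -5536.
    assert_eq!(a.wrapping_add(b), -5536);
    // A saturating add would instead clamp to i16::MAX.
    assert_eq!(a.saturating_add(b), 32767);
}
```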
This one still does not make sense to me. There is only one barely used, widely unsupported piece of hardware that can do this efficiently, and any code generator would need to generate scalar code for it anywhere else. Also, on the particular piece of hardware that supports it, that signature doesn't allow you to use it for what that particular instruction was intended for, which is to reduce a large array (larger than a vector) in an ordered way; you would need an accumulator for that:
And yet it would still be useless to have that in 99% of the hardware where one would probably be better off performing the reduction without going through vector registers.
That's one possibility, but then that's exactly the machine code we are going to get. WASM->machine code generators are not optimizing compilers.
I agree with you, and that's exactly why I think we don't need any of those.
True, but my point here is that it will almost always overflow; wouldn't that make this function quite useless?
This one might make more sense, you're right. Like I said earlier, there is nothing worse than an inconsistent ISA. That's mainly why I want this one. And it might be used by compilers for vectorising sums without fast-math (or not).
Which is fine as it will be efficient on the vast majority of targets.
Yes they are, but they only do very quick optimizations, and I have the impression that detecting such a pattern is cheap enough to be embedded in this machinery (but I might be wrong about that).
If we don't provide the intrinsics, and X -> WASM compilers lower this to scalar code, you would need an auto-vectorizing WASM->Machine code-generator. Native languages, like C++ and Rust, have pretty good auto-vectorizers, and yet they still expose all of these intrinsics because auto-vectorizers often get it wrong. Which machine code generators for WASM perform auto-vectorization? AFAIK Cretonne performs no optimizations whatsoever because performing optimizations is the job of the X->WASM compiler. I'd expect others to perform minor optimizations while lowering WASM to machine code ("clever" instruction selection at most), but nothing close to LLVM. |
I never said anything about scalar. I don't expect any vectorizing WASM->ASM for a long time, if ever. I said that compilers generate some shuffles + vectorized add like this:
And the VM detects this pattern and converts it into faster assembly, if any. PS: I don't think this shuffle syntax exists, but that's only for simplicity.
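The kind of shuffles + vector adds sequence being described could be modeled in scalar code like this. A sketch only: `shuffle` and `add` stand in for the WASM shuffle and lane-wise add operations, and all names are hypothetical:

```rust
// Scalar sketch of the "shuffles + adds" sequence an X->WASM compiler
// might emit for a full f32x4 sum.
fn shuffle(v: [f32; 4], idx: [usize; 4]) -> [f32; 4] {
    [v[idx[0]], v[idx[1]], v[idx[2]], v[idx[3]]]
}

fn add(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    [a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]]
}

fn sum_f32x4(v: [f32; 4]) -> f32 {
    // Fold the upper half onto the lower half...
    let t = add(v, shuffle(v, [2, 3, 0, 1]));
    // ...then fold the two remaining partial sums.
    let t = add(t, shuffle(t, [1, 0, 3, 2]));
    t[0] // lane 0 now holds (v0 + v2) + (v1 + v3)
}

fn main() {
    assert_eq!(sum_f32x4([1.0, 2.0, 3.0, 4.0]), 10.0);
}
```

A VM that recognizes this two-shuffle/two-add pattern could lower it to a native reduction where one exists.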
That makes sense, I wasn't getting your point. Adding shuffles seems like a better way to pursue this.
I just created a merge request about shuffling: #30
Horizontal reductions (add, mul, min and max) are really something I wish were in WASM. Emulate them with trivial code if the hardware doesn't support them. It would make it semantically clearer what is being done, make code cleaner, and require accessing lanes directly less often. +1
I agree with @dtig that we don't need to add horizontal reductions to the MVP, and as long as PR #30 makes it into the MVP, that wouldn't really be a big deal, since we can emulate them with it. Emulating reductions with shuffles would be a temporary situation that has some downsides: increased binary size, and sub-optimal performance if the WASM->target machine code generator doesn't pattern-match the sequence, which is hard because many shuffle sequences can be used to express the same type of reduction. OTOH it has the advantage that WASM->target machine code generators wouldn't need to polyfill these on architectures that do not support them. We could mitigate the performance downsides by writing down a document outside the spec containing one "recommended" way to express each reduction with the SIMD MVP, that X->WASM compilers are "encouraged" to always use, and WASM->X compilers are "encouraged" to pattern-match.
That's a really necessary operation. I tried different approaches to emulate this instruction, but all are far from optimal: https://godbolt.org/z/Sw3yzi Pairwise or tree summation will be enough, because pairwise summation is in the general case more precise than an ordered (but non-sorted) summation.
There's a lot going on in this thread, but I could really use something that's the equivalent of pmaddubsw. That's a horizontal multiply and add that leaves the results in 16-bit integers. Likewise, a standard horizontal add. For anyone who thinks we should split up the multiply and add instructions, I would strongly discourage them from doing so. Aside from the comments about associativity, which doesn't hold for floating-point numbers, compilers have a long-standing ban against reordering arithmetic operations. Imagine the result of n*22/7 if 22/7 is evaluated first. It would be seriously problematic. With respect to that, horizontal adds, subtractions, multiplies and adds, and so forth, should all be done as single ops separate from one another -- otherwise, the behavior cannot be expected to be accurate.
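For reference, the per-lane behavior of pmaddubsw (SSE's `_mm_maddubs_epi16`) can be sketched in scalar code. This is a model of one output lane, not a proposed WASM signature; `maddubs_lane` is a hypothetical helper name:

```rust
// Scalar model of one 16-bit output lane of pmaddubsw: multiply unsigned
// bytes from `a` with the corresponding signed bytes from `b`, add the two
// adjacent products, and saturate the sum to the i16 range.
fn maddubs_lane(a: [u8; 2], b: [i8; 2]) -> i16 {
    let p0 = a[0] as i32 * b[0] as i32;
    let p1 = a[1] as i32 * b[1] as i32;
    (p0 + p1).clamp(i16::MIN as i32, i16::MAX as i32) as i16
}

fn main() {
    // 2*4 + 3*5 = 23; no saturation needed.
    assert_eq!(maddubs_lane([2, 3], [4, 5]), 23);
    // 255*127 + 255*127 = 64770; saturated to i16::MAX.
    assert_eq!(maddubs_lane([255, 255], [127, 127]), 32767);
}
```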
Packed horizontal arithmetic is reasonably performant on SSE3+ and Neon. These would be useful for complex multiplications, and in the absence of the opcodes below, these would need to be a combination of shifts and adds.
f32x4.addHoriz(x: v128, y:v128) -> v128
i32x4.addHoriz(x: v128, y:v128) -> v128
i16x8.addHoriz(x: v128, y:v128) -> v128
Thoughts on whether horizontal add instructions would be useful to include in the current SIMD spec?