This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

LLVM's "optimization" of shuffles penalizes x64 codegen #196

Closed
zeux opened this issue Feb 17, 2020 · 22 comments

Comments

@zeux
Contributor

zeux commented Feb 17, 2020

So we've discussed this previously at some point on the WASM call, but I keep hitting this issue and I feel like something must be done.

Briefly, the issue is as follows:

  • WASM SIMD shuffle with constant indices is a general byte shuffle
  • v8 codegen for the shuffle tries to pattern-match a set of common SSE shuffles and, failing that, emits a really terrible code sequence
  • LLVM often "optimizes" shuffles that do form a recognizable pattern. This optimization is harmful: it breaks pattern matching in v8 - and, presumably, in other engines.

Here's a recent example I hit. Given this code, which tries to extract a 16-bit component from each of the four 64-bit halves of two vectors:

const v128_t zmask = wasm_i32x4_splat(0x7fff);

v128_t z4 = wasmx_shuffle_v32x4(n4_0, n4_1, 1, 3, 1, 3);
v128_t zf = wasm_v128_and(z4, zmask);

With wasmx_shuffle_v32x4 defined as follows to simulate SSE2 shufps:

#define wasmx_shuffle_v32x4(v, w, i, j, k, l) \
    wasm_v8x16_shuffle(v, w, \
        4 * i, 4 * i + 1, 4 * i + 2, 4 * i + 3, \
        4 * j, 4 * j + 1, 4 * j + 2, 4 * j + 3, \
        16 + 4 * k, 16 + 4 * k + 1, 16 + 4 * k + 2, 16 + 4 * k + 3, \
        16 + 4 * l, 16 + 4 * l + 1, 16 + 4 * l + 2, 16 + 4 * l + 3)

LLVM notices that two bytes in each 32-bit lane of the shuffle result are unused and "optimizes" the shuffle to:

v8x16.shuffle 0x00000504 0x00000d0c 0x00001514 0x00001d1c

V8 doesn't recognize this as any shuffle pattern it knows and generates this:

000000E5ED349504   204  4989e1         REX.W movq r9,rsp
000000E5ED349507   207  4883e4f0       REX.W andq rsp,0xf0
000000E5ED34950B   20b  c4417810f8     vmovups xmm15,xmm8
000000E5ED349510   210  49ba8080000080800000 REX.W movq r10,0000808000008080
000000E5ED34951A   21a  4152           push r10
000000E5ED34951C   21c  49ba040500000c0d0000 REX.W movq r10,00000D0C00000504
000000E5ED349526   226  4152           push r10
000000E5ED349528   228  c46201003c24   vpshufb xmm15,xmm15,[rsp]
000000E5ED34952E   22e  450f10e1       movups xmm12,xmm9
000000E5ED349532   232  49ba040580800c0d8080 REX.W movq r10,80800D0C80800504
000000E5ED34953C   23c  4152           push r10
000000E5ED34953E   23e  49ba8080808080808080 REX.W movq r10,8080808080808080
000000E5ED349548   248  4152           push r10
000000E5ED34954A   24a  c46219002424   vpshufb xmm12,xmm12,[rsp]
000000E5ED349550   250  c44119ebe7     vpor xmm12,xmm12,xmm15
000000E5ED349555   255  498be1         REX.W movq rsp,r9

This code sequence is catastrophically bad. To put it in perspective: without this problem the code runs at 3.2 GB/s, and with it at 1.1 GB/s - even though the loop without this code sequence is already pretty large, ~70 SSE/AVX instructions plus some loop/branch scaffolding that v8 emits. This shuffle is merely a small piece of a larger transform, and it alone wreaks havoc on the performance.

Now, clearly the instruction sequence doesn't have to be this bad. However, note that this is actually a shuffle with two distinct arguments - so even if v8 could pre-compute the shuffle masks, this would still result in something like:

vpshufb reg, reg, [rip + off1]
vpshufb reg1, reg1, [rip + off2]
vpor reg, reg, reg1

So LLVM actively pessimizes the code that the programmer writes. I've hit this issue so many times and every time it takes time to figure out how to apply a fragile workaround - in this case I had to mark zmask as volatile, spending an extra stack memory load just to make sure LLVM doesn't do anything stupid.
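
For reference, here is a minimal sketch of that workaround, reusing the variable names from the snippet above (it assumes the same surrounding code; this is not code quoted from the issue):

// Workaround sketch: declaring the mask volatile forces a real load and keeps
// LLVM from folding the constant into the shuffle mask. n4_0, n4_1 and
// wasmx_shuffle_v32x4 are the names used in the snippet above.
volatile v128_t zmask = wasm_i32x4_splat(0x7fff);

v128_t z4 = wasmx_shuffle_v32x4(n4_0, n4_1, 1, 3, 1, 3);
v128_t zf = wasm_v128_and(z4, zmask);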

I know we discussed that adding target-specific logic to LLVM transforms seems problematic, since that's a lot of code to maintain. Have we considered not optimizing shuffles at all? I've yet to see evidence that LLVM combining shuffles without being aware of the target platform can produce beneficial results.

@jan-wassenberg

Ouch, that's a major pessimization. This is the kind of performance cliff that we are all trying to avoid.
I would agree that merging any kind of shuffle is dubious. If it's helping the autovectorizer, would it be feasible to not optimize shuffles that came from __builtin_shufflevector?

@penzn
Contributor

penzn commented Feb 18, 2020

This might be the same issue as in #118. This looks like the same operation, @zeux please correct me if I am wrong.

Back in #118 (comment) we realized that it might be necessary to add some of the common operations, like unpack, or at least that we should consider adding them. Such operations are common among platforms, and code depends on those particular patterns being faster than the generic shuffle operation.

@tlively
Member

tlively commented Feb 18, 2020

Thanks for the writeup, @zeux! Having real numbers is helpful for this conversation. The pessimization here looks serious enough that I believe disabling shuffle merging in the LLVM backend is a prudent short-term and medium-term solution. In the long term, I would like to get inline assembly working so we can solve this in user code rather than in the compiler. Does that plan sound reasonable to everyone?

@sunfishcode
Member

Inline asm as a long term solution would imply that it's ok for developers to use a cost model for shuffles. If we're going to say that's ok, would we be ok for LLVM to have a cost model too? If so, LLVM could use it to avoid merging shuffles in cases where the cost model advises against it, and then we wouldn't need to depend on inline asm :-).

@penzn
Contributor

penzn commented Feb 18, 2020

LLVM only optimizes shuffles if it can prove that the combined shuffle is better than the original (one pshufb is better than two, but much worse than a few unpacks). It has a cost model; it's just that for Wasm every shuffle costs the same (since that is what the spec says ;)

There are options:

  1. Explicit cost model for the existing shuffle operation - stating in the spec that masks are not equal
  2. Separate shuffle instructions to match the existing fast shuffles (like unpack)
  3. Disable optimizations of shuffles altogether

Options 1 and 2 both allow the toolchain to have a cost model, meaning that optimizations should have a positive impact. Option 3 would fix this case, but would disable combining multiple slow shuffles into one, which would come with a performance penalty at some point.

In general I think wasm should have a cost model for shuffles, because all platforms we care about have one and fast shuffle instructions are pretty similar across the board. In my opinion the best way to do it is via new instructions - that would keep the spec for the generic shuffle simple and would make instruction choices more explicit.

The hardware-accelerated shuffle ops that come to mind:

  • Shuffle with wider lanes
  • Pack/Unpack (ABCDABCD -> AABBCCDD, reverse and the like)
  • Byte shift (shift a 128-bit value by a number of bytes, with or without wraparound) - see the sketch after this list
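
As a hypothetical illustration of the byte-shift case (this is not code from the thread): expressed through the generic shuffle, a zero-filling right shift by 4 bytes is a pattern an engine could match to psrldq:

#include <wasm_simd128.h>

// Sketch only: byte indices 16-31 select from the second operand, so pairing the
// input with a zero vector yields a zero-filling right shift by 4 bytes (psrldq-like).
static inline v128_t bytes_shr_4(v128_t v) {
  const v128_t zero = wasm_i32x4_splat(0);
  return wasm_v8x16_shuffle(v, zero,
      4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19);
}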

@zeux can you think of anything else?

@zeux
Contributor Author

zeux commented Feb 19, 2020

@penzn I'd add a few to the list:

  • Blends between two vectors with 32/16/8-bit lane masks; equivalent to bitselect with a constant mask, but bitselect is slower (constant load + 3 instructions) - see the sketch right after this list
  • Restricted shuffle with the first two components coming from the first vector and the second two from the second vector (SSE2)
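
To make the comparison in the first bullet concrete, here is a sketch (hypothetical code, not from the thread) of the bitselect formulation that a constant-mask blend would replace; with a constant mask this typically lowers to a constant load plus and/andnot/or, whereas the equivalent blend can map to a single pblendw:

#include <wasm_simd128.h>

// Select the top 16-bit lane of each 64-bit half (lanes 3 and 7) from b, the rest
// from a. wasm_v128_bitselect(x, y, m) computes (x & m) | (y & ~m).
static inline v128_t blend_via_bitselect(v128_t a, v128_t b) {
  const v128_t mask = wasm_i64x2_splat(0xffff000000000000);
  return wasm_v128_bitselect(b, a, mask);
}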

I am not actually sure I agree that new instructions are the optimal way to handle this though. It adds a lot of spec complexity. Granted, this will come with code size / codegen time benefits, but I'd still be curious to know if there are real-world scenarios where shuffle merging is at all beneficial.

@zeux
Contributor Author

zeux commented Feb 19, 2020

@tlively By "inline assembly", do you mean using inline wasm code that LLVM can't reason about instead of intrinsics for shuffles?

@tlively
Member

tlively commented Feb 19, 2020

@tlively By "inline assembly", do you mean using inline wasm code that LLVM can't reason about instead of intrinsics for shuffles?

Yes, precisely. We could use it in the intrinsics header to create a shuffle that is opaque to the optimizer.

@zeux
Contributor Author

zeux commented Feb 20, 2020

One possible alternative (if I understand this right) is that we could add a wasm-specific shuffle instruction that is opaque to the optimizer, and keep the current generic shuffle.

That way we can, in user code, arbitrate between those. E.g. maybe we expose a few pseudo-intrinsics we discussed here that mirror the hardware capabilities through a wasm-specific shuffle instruction and keep the general one using the optimizable instruction, or something to that extent.

I don't know enough about how inline assembly works in LLVM, so maybe these are equivalent, but a custom opaque intrinsic might still leave the optimizer free to move the instruction or eliminate it entirely, just without replacing it with another instruction.

@tlively
Member

tlively commented Feb 20, 2020

Yes, a custom intrinsic would work nicely. I actually already have a patch implementing this solution ready to go if that sounds appropriate to everyone.

@penzn
Contributor

penzn commented Feb 20, 2020

@zeux, shuffle merging is used extensively in the CPU backends. There must be a concrete motivation for it; in general, byte-shuffle operations are costly (as a general rule, but more on some platforms and less on others). I haven't found an exact workload yet, but will post if I do.

One possible alternative (if I understand this right) is that we could add a wasm-specific shuffle instruction that is opaque to the optimizer, and keep the current generic shuffle.

I doubt it would be optimal from LLVM's point of view, as all of the shuffle transformations are tied to its single IR shuffle instruction. The change would need to justify why wasm is different from other platforms, especially since we are trying to do this to mimic other platforms' behavior (that shuffle masks are not equivalent).

I think we should make such assumptions explicit in the description of shuffle instruction or the spec in general.

@tlively
Member

tlively commented Feb 20, 2020

@zeux, shuffle merging is used extensively in the CPU backends. There must be a concrete motivation for it; in general, byte-shuffle operations are costly (as a general rule, but more on some platforms and less on others). I haven't found an exact workload yet, but will post if I do.

A clear-cut case where shuffle merging is beneficial is when a user writes multiple calls to __builtin_shufflevector that don't map to any particular hardware instruction. It's better to have one terrible shuffle than two terrible shuffles. This optimization is not useful to users who have done their due diligence on their platform and only use shuffles that are already fast.
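
For illustration (hypothetical code, not from the thread), neither of the chained shuffles below matches a cheap SSE pattern, so folding them into one equivalent shuffle removes a pshufb-class operation without making the remaining shuffle any harder to lower:

#include <wasm_simd128.h>

// Two arbitrary byte shuffles that LLVM can legitimately merge into one:
// one "terrible" shuffle instead of two.
static inline v128_t two_arbitrary_shuffles(v128_t a) {
  v128_t t = wasm_v8x16_shuffle(a, a, 3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3);
  return wasm_v8x16_shuffle(t, t, 2, 7, 1, 8, 2, 8, 1, 8, 2, 8, 4, 5, 9, 0, 4, 5);
}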

I doubt it would be optimal from LLVM's point of view, as all of the shuffle transformations are tied to its single IR shuffle instruction. The change would need to justify why wasm is different from other platforms, especially since we are trying to do this to mimic other platforms' behavior (that shuffle masks are not equivalent).

I think we should make such assumptions explicit in the description of shuffle instruction or the spec in general.

I disagree. The spec does not describe a cost model for WebAssembly because engines should not need to have any particular cost model to be spec-compliant. But that does not mean that toolchains should not do work to optimize the real-world performance of WebAssembly applications. What we have in this conversation that we did not have before is hard data showing a large real-world slowdown due to shuffle merging, and that is justification enough. As a pragmatic matter, we have to provide a workaround in the tools to support our users even if there are no spec changes.

@sunfishcode
Member

The spec does not describe a cost model for WebAssembly because engines should not need to have any particular cost model to be spec-compliant.

This is true of the normative text of the spec, but the spec can also provide non-normative performance guidance. For precedent, see the align field on load and store instructions, which has no normative effect, but is nonetheless both in the spec and useful.

Defining new instructions that perform special cases of shuffles and are intended to be "fast", or adding non-normative notes (or perhaps even a non-normative appendix) to the spec describing which special cases of shuffles are expected to be "fast", are both plausible approaches.

@zeux
Contributor Author

zeux commented Feb 22, 2020

Just to have more than one example... I ran into this again today :)

While working on the store part of a SIMD kernel, I wanted to optimize some shuffling/mixing of values needed to pack multiple 16-bit values into the output vector.

I initially had code like this:

v128_t y0r = wasm_v128_and(yr, wasm_i32x4_splat(0xffff));

v128_t res_0 = wasmx_unpacklo_v16x8(xzr, y0r);
v128_t res_1 = wasmx_unpackhi_v16x8(xzr, y0r);

res_0 = wasm_v128_or(res_0, wasm_v128_and(n4_0, wasm_i64x2_splat(0xffff000000000000)));
res_1 = wasm_v128_or(res_1, wasm_v128_and(n4_1, wasm_i64x2_splat(0xffff000000000000)));
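
wasmx_unpacklo_v16x8 / wasmx_unpackhi_v16x8 are not defined in the snippet; plausible definitions in the same style as the wasmx_shuffle_v32x4 macro above (mirroring SSE2 punpcklwd/punpckhwd, and consistent with the shuffle masks v8 dumps below) would be:

// Interleave the low (or high) 16-bit lanes of v and w.
#define wasmx_unpacklo_v16x8(v, w) \
    wasm_v8x16_shuffle(v, w, 0, 1, 16, 17, 2, 3, 18, 19, 4, 5, 20, 21, 6, 7, 22, 23)
#define wasmx_unpackhi_v16x8(v, w) \
    wasm_v8x16_shuffle(v, w, 8, 9, 24, 25, 10, 11, 26, 27, 12, 13, 28, 29, 14, 15, 30, 31)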

In an attempt to optimize this by removing redundant ands/masks, I decided to use shuffles that should map directly to the SSE4.1 pblendw instruction, like so:

v128_t res_0 = wasmx_unpacklo_v16x8(xzr, yr);
v128_t res_1 = wasmx_unpackhi_v16x8(xzr, yr);

res_0 = wasmx_blend_v16x8(res_0, n4_0, 0, 0, 0, 1, 0, 0, 0, 1);
res_1 = wasmx_blend_v16x8(res_1, n4_1, 0, 0, 0, 1, 0, 0, 0, 1);

This replaces 3 ands and 2 ors with 2 shuffles that should map directly to pblendw - clearly, a win!
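
wasmx_blend_v16x8 is likewise not defined in the snippet; a plausible definition, in the same style as the other wasmx_* macros and consistent with the shuffle masks shown below, would be:

// Per-16-bit-lane blend: lane i comes from w when bi == 1 and from v when
// bi == 0 (mirrors SSE4.1 pblendw with an immediate mask).
#define wasmx_blend_v16x8(v, w, b0, b1, b2, b3, b4, b5, b6, b7) \
    wasm_v8x16_shuffle(v, w, \
        16 * (b0) + 0, 16 * (b0) + 1, 16 * (b1) + 2, 16 * (b1) + 3, \
        16 * (b2) + 4, 16 * (b2) + 5, 16 * (b3) + 6, 16 * (b3) + 7, \
        16 * (b4) + 8, 16 * (b4) + 9, 16 * (b5) + 10, 16 * (b5) + 11, \
        16 * (b6) + 12, 16 * (b6) + 13, 16 * (b7) + 14, 16 * (b7) + 15)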

Of course, LLVM thought otherwise - it looked at the shuffles used in the blends, understood that res_0 and res_1 have components that are not used (because they are masked out by the following shuffles), and replaced those two shuffles with ones that don't pattern-match:

res_0: v8x16.shuffle 0x11100100 0x00000302 0x15140504 0x00000706 // note 0x0000 in 2nd and 4th components
res_0: v8x16.shuffle 0x03020100 0x17160504 0x0b0a0908 0x1f1e0d0c
res_1: v8x16.shuffle 0x19180908 0x00000b0a 0x1d1c0d0c 0x00000f0e // note 0x0000 in 2nd and 4th components
res_1: v8x16.shuffle 0x03020100 0x17160504 0x0b0a0908 0x1f1e0d0c

As a result, a 4 GB/s kernel started running at 1.4 GB/s.

edit: noteworthy is the fact that in both examples, the optimization LLVM makes leaves things strictly worse or no better - it doesn't merge any instructions in either of the cases I mentioned, it just changes one of the shuffle masks involved, which is very unlikely to replace a "bad" shuffle with a "good" one and much more likely to make a good shuffle bad.

@penzn
Contributor

penzn commented Feb 26, 2020

Ouch. I have heard that this has come up with native shuffles at some point and was fixed by distinguishing between shuffle masks, which would work for this case. @topperc, do you have any more context?

@sunfishcode - thanks for pointing out alignment. I think we should add that to the spec. I would prefer separate instructions, mainly because those masks represent separate instructions in hardware and correspond to separate intrinsics in existing code (also it would make instruction descriptions more self-contained). However, I don't feel strongly one way or the other. I don't think we should disable all shuffle transformations to preserve masks - that would mean we are encouraging users to write shuffle masks matching "instruction X on platform Y", which would bake exact hardware knowledge into the code and would not necessarily be cross-platform.

@tlively
Member

tlively commented May 1, 2020

We discussed this at the sync meeting today and reached tentative consensus on a path forward. To check that we have wider community consensus, please consider this proposed solution and vote with a 👍 if you support it, 👀 if you're neutral, and 👎 if you oppose it.

The proposed solution is to:

  1. Immediately disable all WebAssembly shuffle merging and unused lane zeroing in LLVM.

  2. Not make any normative changes to the spec relating to this issue. Keep the current shuffle instruction unchanged.

  3. Leave the door open for future optimizations of shuffles in LLVM where they can be shown to be beneficial across platforms.

  4. Leave the door open for documenting recommended shuffle lowerings and their relative importance in non-normative documentation.

@sunfishcode
Member

Would it be fair to characterize this position as:

  • There will be an official cost model, in which some shuffles are cheaper than others.
  • At this time, we don't have a full description of it, nor an implementation in LLVM.
  • LLVM should be conservative until we have both of those ready.

?

@tlively
Member

tlively commented May 1, 2020

Yes, that might be fair depending on your expectation of how official the cost model will be. We don't expect there to be a cost model in any normative part of the spec, but we would be happy for a cost model based on empirical, cross-platform data to be implemented in the tools and documented non-normatively. I would expect such documentation to be about as official as the tool-conventions repo.

@penzn
Contributor

penzn commented May 2, 2020

I think being conservative would be better than having an undefined or, worse, a wrong cost model (which is likely what is happening now).

I think we need to keep the door open not only for documenting lowerings, but also for breaking up the shuffle operation, since this one may prove to be too generic. Though I do know having a single shuffle operation was intentional, and I am personally at peace with this work not happening, at least in the current simd128 proposal.

@zeux
Contributor Author

zeux commented May 2, 2020

It's worth noting that there are some other cases where a cost model may be appropriate for LLVM. I'm not sure what the full extent of SIMD transformations looks like, but one basic example I've run into is LLVM replacing i32_gt with i32_ne for inputs that were known to be non-negative, and getting slower codegen as a result. So maybe, as a rule of thumb, it would be nice to augment the LLVM optimizer with a cost model for Wasm instructions where the cost doesn't depend (much) on the underlying ISA - if instruction A is expected to be faster than, or at least on par with, instruction B, the cost model could indicate as much to the optimizer.
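
A hypothetical illustration of that comparison rewrite (not code from the thread): once LLVM can prove the operand is non-negative, it may canonicalize the signed greater-than test into an inequality test, even if the original form would lower better on the target:

// After the mask, count is provably non-negative, so LLVM may rewrite the
// loop condition `count > 0` (i32 signed greater-than) as `count != 0`.
int sum_first(const int* data, int count) {
    count &= 0x7fffffff; // LLVM now knows count >= 0
    int sum = 0;
    while (count > 0) {  // may become an i32 "not equal to zero" test
        sum += data[--count];
    }
    return sum;
}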

@tlively
Member

tlively commented May 2, 2020

@zeux It would be great if you could start a new tracking issue for small codegen issues like that. One of the downsides of the division between the tool and engine work is that I don't have great visibility into that sort of issue, so this is very helpful feedback.

tlively added a commit to llvm/llvm-project that referenced this issue May 11, 2020
Summary:

Although using `__builtin_shufflevector` and the `shufflevector`
instruction works fine, they are not opaque to the optimizer. As a
result, DAGCombine can potentially reduce the number of shuffles and
change the shuffle masks. This is unexpected behavior for users of the
WebAssembly SIMD intrinsics who have crafted their shuffles to
optimize the code generated by engines. This patch solves the problem
by adding a new shuffle intrinsic that is opaque to the optimizers in
line with the decision of the WebAssembly SIMD contributors at
WebAssembly/simd#196 (comment). In
the future we may implement custom DAG combines to properly optimize
shuffles and replace this solution.

Reviewers: aheejin, dschuff

Subscribers: sbc100, jgravelle-google, hiraditya, sunfish, cfe-commits, llvm-commits

Tags: #clang, #llvm

Differential Revision: https://reviews.llvm.org/D66983
arichardson pushed a commit to arichardson/llvm-project that referenced this issue Jul 2, 2020
@zeux
Contributor Author

zeux commented Dec 15, 2020

I believe this can now be closed, as LLVM now uses a shuffle intrinsic that's opaque to the optimizer.

tlively closed this as completed Dec 15, 2020