Toolchain optimization of shuffles #118
I think that part of the confusion is that Wasm shuffle lowering depends on the mask and can lead to more efficient instructions in one case and less efficient ones in another, based on hardware and runtime. From the proposal's point of view, we can try to define the semantics better (very tricky and still up to the runtimes) or split the cases we think would be more efficient into instructions of their own, giving the toolchain the ability to emit either the faster variants or the (potentially slower) general shuffle.
I don't want to repeat the discussion in #8 where we settled on the current shuffle solution as far as engines and the spec are concerned. Instead I would like feedback on whether the toolchain should be punching through the spec's shuffle abstraction and reasoning about underlying platform details. My position is that it should not, particularly because we have decided that the spec should not be doing so and the toolchain should target the spec, not underlying platforms.
I haven't finished reading #8 yet, sorry. I think that from the Wasm point of view your suggestion for the toolchain is reasonable, but it may lead to clashes with the current semantics of hardware shuffles. I am personally in favor of adding supplementary instructions if that would make code generation less ambiguous.
Interestingly, #8 (comment) describes exactly the situation the reporter of this bug ran into, and it seems that the solution there was to leave the shuffles alone as well. On the other hand, this bug shows an example where a native unpack is turned into a Wasm shuffle, which then gets inlined and combined with other shuffles. The reason this does not happen for native targets is the use of a separate instruction, so the toolchain can decide not to optimize, say, adjacent unpack ops, which are expected to be fast, while still optimizing shuffles (which are expected to be slow anyway). Note that a native shuffle can do what unpack can do, but is slower. Disabling shuffle optimizations in LLVM would leave adjacent shuffles in the output; in situations where those cannot be optimized during runtime code generation, the runtime would emit multiple hardware shuffle instructions, which has its own drawbacks. (Sorry for the big edit, I did not want to add a third comment.)
I don't think you should be tuning the toolchain for specific runtimes or CPUs, but how can you possibly generate good code for a target with unspecified performance? It seems like LLVM should be targeting a composite runtime+native ISA that does some pattern matching on shuffle masks.

Regarding this specific code pattern: they are loading 16 interleaved RGB pixels and deinterleaving them into 16 Rs, 16 Gs, and 16 Bs. This is an instance of a common pattern, and the way to do it optimally is very ISA-specific: the OpenCV code they were porting has not just SSE 2 and Neon variants, but also SSE 3 and SSE 4.1 variants! The Neon code path maps directly to Neon's 3-way interleaved load instruction (see the sketch below). I think it's worth considering whether it's worthwhile to add interleaved load/store instructions to WASM SIMD.
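A minimal sketch of that Neon code path, assuming the standard <arm_neon.h> intrinsics (the function name and buffer layout are illustrative, not OpenCV's actual code):

```c
#include <arm_neon.h>
#include <stdint.h>

/* Deinterleave 16 RGB pixels (48 bytes) into separate R, G, B planes. */
void deinterleave_rgb16(const uint8_t *rgb, uint8_t *r, uint8_t *g, uint8_t *b) {
    /* vld3q_u8 performs the 3-way interleaved load: one instruction yields
       three 16-byte registers holding all the Rs, all the Gs, and all the Bs. */
    uint8x16x3_t px = vld3q_u8(rgb);
    vst1q_u8(r, px.val[0]);
    vst1q_u8(g, px.val[1]);
    vst1q_u8(b, px.val[2]);
}
```

On Wasm SIMD the same result has to be expressed as plain v128 loads followed by generic i8x16 shuffles, which is exactly where the mask-dependent lowering quality matters.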
I'm just lurking and I'm a bit late to the discussion. Would it be possible to support instructions friendly to SSE and instructions friendly to NEON (a superset of both)? It would then be possible to provide two WASM versions, one per architecture, that a web server could serve. Both versions would work everywhere but would only yield optimal assembly on the target device.

At the end of the day, if you write high performance code, you have to know the target architecture. Writing for something like WASM that abstracts it away will only lead to these sorts of problems. For example, I write and maintain the Realtime Math library and I already support SSE, AVX, armv7, and ARM64. Adding flavors of those for various WASM targets would be trivial as long as WASM has intrinsics that match the desired target. This is akin to compiling C++ while specifying the CPU it will run on: the code will run on other CPUs but it is tuned for the one you specify. Abstraction has a cost and it cannot always be hidden. I intend to port the Animation Compression Library to WASM through emscripten next year (along with RTM). For me, I would not mind writing SSE- and NEON-flavored WASM.

Worth noting that Intel ISPC takes a similar approach. You can compile for SSE2, AVX, and AVX512 in the same object file, and at runtime, through a CPU flag check, the optimal path is used. One binary, optimal for all processors. In the case of WASM, that might be an option as well, where the LLVM backend can discard the flavors of targets it doesn't care about. This would yield a larger executable but everything would be bundled together. 2 cents.
Although this would be technically possible, it sounds like we agree that this would violate the spirit and intent of WebAssembly. Target-specific optimizations are best left to engines, and it's on them to squeeze the last bit of performance out after the toolchain has done all the optimizations that are clearly beneficial on all platforms.
Thanks for chiming in! I think we're all late to this discussion, which dates back to the SIMD proposal for asm.js.
This approach contradicts the goal of specifying a portable subset of SIMD operations for the initial proposal, but may be on the table for future SIMD extensions. Due to all the work and thought that went into #8, I consider it a settled matter that the current SIMD proposal will not have differentiated shuffles.
Having multiple code paths in a single executable is what the feature detection / conditional compilation proposal is all about. It allows bundling SIMD and non-SIMD code together and will allow for bundling different future SIMD extensions together as well. That's outside the scope of the current MVP SIMD proposal, though.
Not really. "WASM" is your programming language, and just like with ISPC, you write only one version of the code and your compiler can generate code for multiple platforms: your WASM -> native code compiler can generate code for SSE, AVX, etc.

For the issue at hand, the real problem is that V8 is generating bad machine code for a single WASM shuffle instruction. Whether that WASM code comes from LLVM or not doesn't really matter. The same user that reported this issue could have just used a single shuffle, just like LLVM is doing, and still be filing the same bug saying that V8 generates bad code for it. I don't think that we need to do anything about this here. Whether LLVM wants to work around V8's bug or not is kind of up to them, and there is nothing we can do here that would prevent LLVM from adding such a workaround.

Still, even if LLVM were to generate V8-friendly code, nothing prevents V8 from merging the two shuffle instructions as an optimization and continuing to emit bad code... so I think it's kind of pointless to try to solve this at the LLVM level.

Is there a V8 bug report open for this issue? It would be interesting to know if there are any fundamental issues preventing this from being fixed in V8. AFAICT, V8 could just inspect the shuffle arguments and generate better code for the cases in which they are known.
I thought the user used adjacent Wasm shuffles with "interleave" in the mask, which they were hoping would turn into corresponding x86 ops, but after the shuffles got combined by LLVM they ended up with a combined mask that V8 lowers to a slow pshufb. The issue with LLVM is that it runs all shuffle operations through the generic shuffle IR instruction, and two of them can be combined depending on the mask. #8 (comment) has a description of how this works.
We don't have special-purpose shuffle instructions, which is why the workaround used for preserving the behavior of intrinsics would not work. We can either introduce differentiation between masks by adding language to the spec that certain masks are supposed to be fast, as proposed by @billbudge in #8 (comment), or we would have to disable shuffle optimizations for Wasm altogether. The downside of the latter is that it would inevitably slow down situations where there are two adjacent slow shuffles. I am personally somewhat against disabling shuffle optimizations altogether, and in favor of differentiating (ideally by adding special-purpose ops like interleave, though I know there is consensus against that). A sketch of how two adjacent shuffles get folded into one is shown below. @PeterJensen do you have any thoughts, as you were part of the discussion in #8?
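To make that folding concrete, here is a hypothetical illustration using the wasm_simd128.h intrinsics (the masks are made up for the example and are not the exact ones from the bug report):

```c
#include <wasm_simd128.h>

/* Two adjacent shuffles as a user might write them, each with a mask that
   engines are expected to recognize and accelerate. */
v128_t two_step_shuffle(v128_t a, v128_t b) {
    /* Byte-wise interleave of the low halves of a and b -- the shape that
       maps to punpcklbw on x86 or zip1 on AArch64. */
    v128_t lo = wasm_i8x16_shuffle(a, b,
        0, 16, 1, 17, 2, 18, 3, 19, 4, 20, 5, 21, 6, 22, 7, 23);
    /* A second shuffle of that result. Because both go through LLVM's
       generic shufflevector, the optimizer may compose the two masks into a
       single i8x16.shuffle whose combined mask no longer matches any fast
       pattern in the engine, leaving a generic (e.g. pshufb-based) lowering. */
    return wasm_i8x16_shuffle(lo, b,
        0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30);
}
```

(Compiled with clang/emscripten and -msimd128.)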
This really seemed like a good philosophy to me, and clearly motivates inhibiting merges of generic shuffles in the absence of specialized shuffle instructions.
I don't have a strong opinion either way at this point, but I do want to make one observation. The philosophy you describe works because users who write intrinsics often have an idea of which hardware instructions they want. However, if we're compiling to wasm, and we don't want the compiler to know that some masks are cheaper than others, then users knowing that some masks are cheaper than others has the same problem.
That's a good point. I actually would prefer users not try to target underlying platforms, and I don't think the original reporter should be expecting to observe any particular instruction sequence emitted by the engine. But the difference between users and the toolchain is that I can control the assumptions the toolchain makes ;)
This is a bit of a catch-22: if users should not expect specific shuffle instructions, then they cannot reliably port code that depends on them. The reporter of the original issue would only be able to use Wasm SIMD in OpenCV if they can expect its kernels to run faster than scalar thanks to hardware support for pack/unpack instructions (more on availability below). If the LLVM logic is changed and the correct shuffle masks reach V8, the port would again rely on V8 emitting faster shuffles for those masks; otherwise there is no point in doing it. The performance penalties for not using accelerated shuffles are real: the code in the original report ran slower than scalar, despite Emscripten reducing the number of shuffle instructions - huningxin/opencv#301. Accelerated shuffle instructions are present in every SIMD ISA and do roughly the same thing across platforms (interleave, shift/rotate, etc.); see another example in #110. And there is lots of code that depends on them for performance. It would be unfortunate if the same code can be written using Neon and SSE, but not Wasm SIMD.
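As a concrete illustration of the "present in every SIMD ISA" point, the byte-wise interleave such kernels rely on is a single dedicated instruction on both major ISAs (a sketch; the function name is made up):

```c
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>
/* Interleave the low 8 bytes of a and b: lowers to punpcklbw. */
static inline __m128i interleave_low_u8(__m128i a, __m128i b) {
    return _mm_unpacklo_epi8(a, b);
}
#elif defined(__aarch64__)
#include <arm_neon.h>
/* The same interleave on AArch64: lowers to zip1.16b. */
static inline uint8x16_t interleave_low_u8(uint8x16_t a, uint8x16_t b) {
    return vzip1q_u8(a, b);
}
#endif
```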
This is correct, that's what the user hoped would happen, and that's problematic, because those expectations are unrealistic.

To make an analogy: when I compile code to x86 with clang, I expect clang to generate optimal x86 code for that target. Does clang have a bug if, when that x86 code is run on ARM hardware under QEMU, QEMU is not able to lower it to optimal ARM instructions because "clang optimized the program too much"? I don't think clang has a bug. I think the expectation that clang would prefer to optimize for "x86 running on ARM on top of QEMU" instead of just real x86 hardware is unrealistic and incorrect, because that's not what clang says it does. One can definitely build a toolchain that does that (or add a -mcpu=wasm-v8-x86 to emscripten!), but that is just not what modern toolchains target by default when one just says "wasm" or "x86".

IMO, the better question to ask here is: "Why did this user have the wrong expectations?" Maybe the clang / emscripten intrinsics are documented as "this lowers exactly to this instruction". If so, either the user is right, and the bug is that emscripten does not use "inline assembly" to implement these intrinsics, or the user has discovered a documentation bug, and these intrinsics should guarantee the semantics of the operations they perform, maybe using the instruction as an "example" of what might be generated, but without guaranteeing that this precise instruction is always generated.

IMO, the spirit of the WASM C intrinsics is that we guarantee the semantics of the operation and not a particular instruction, because otherwise optimizations like constant folding would be illegal, since when those optimizations are performed the promised instruction would not be generated. I think most users want these optimizations and that V8 has a performance bug. If a user really wants a specific WASM instruction to be generated, most toolchains offer a feature for that. For example, with emscripten and clang, you can use "inline assembly" to achieve that, and this might be the only way this particular user has to work around the V8 performance bug for the time being.
That is a fair point, the spec does not say that some shuffle masks are faster than others. And I am not proposing to change that; maybe even the opposite - add a note that all shuffle masks are equivalent.
Just to be clear, the user was not looking for a drastically different instruction pattern per se; they were expecting a certain shuffle mask, which they knew V8 would accelerate for them, since it intentionally emits fast shuffles when the mask is right. The currently proposed fix would disable shuffle transformations in the toolchain, which would allow users to keep writing "fast" shuffle masks without the compiler getting in their way - I feel that is the opposite of what should happen if the shuffle masks are truly equal (though it is definitely better than the situation where a combined shuffle performs much worse than the sum of its components). I am worried that the fix would calcify developers' attitude that some masks are fast, and that the only way to discourage relying on mask patterns is to provide separate instructions for those situations (there are other minor benefits, like code size reduction). I think this is important given the amount of code in the wild using those operations.
The single shuffle with one mask is semantically equivalent to the two shuffles with different masks (otherwise the LLVM optimization that gets performed would be unsound, and this would just be an LLVM bug). The problem here is that V8 does not have a pattern match for this mask, so it generates poor machine code for it. Maybe that's a performance bug in V8, or maybe that's by design because V8 prefers to trade run-time performance for code generation time (although adding a new case to that table wouldn't increase code generation time by any measurable amount), but since nobody has filed a V8 bug, we can't know which it is. I don't think it would be a good idea for LLVM to de-optimize this code to make it faster on V8 by default, since that would increase binary size and translation time, and even run-time for other use cases.
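For reference, here is a minimal sketch (not V8's actual code) of the kind of mask pattern matching being discussed: recognize one cheap shape and fall back to a generic lowering otherwise.

```c
#include <stdbool.h>
#include <stdint.h>

/* Does the 16-lane mask describe "interleave the low 8 bytes of a and b"?
   That shape can be lowered to a single punpcklbw / zip1 instruction. */
static bool is_interleave_low_bytes(const uint8_t mask[16]) {
    for (int i = 0; i < 8; ++i) {
        if (mask[2 * i] != i || mask[2 * i + 1] != 16 + i) return false;
    }
    return true;
}

typedef enum { LOWER_PUNPCKLBW, LOWER_GENERIC_PSHUFB } lowering_t;

/* Hypothetical lowering choice for an x86-64 backend. */
lowering_t select_lowering(const uint8_t mask[16]) {
    if (is_interleave_low_bytes(mask)) return LOWER_PUNPCKLBW;
    /* Unrecognized masks pay for a constant load plus pshufb (or two
       pshufbs and an OR when the mask selects from both inputs). */
    return LOWER_GENERIC_PSHUFB;
}
```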
For shuffles specifically, I think that runtimes should generally prefer consuming fused shuffles to unfused shuffles, and the shuffle masks should be canonicalized in any way possible. It's easier for a runtime to turn a fused shuffle into multiple specialized shuffles than it is to turn multiple unfused general-purpose shuffles into a single general-purpose shuffle. This seems more like a case where V8 (and probably others) is missing a shuffle mask pattern, rather than a case where LLVM should not be fusing shuffles.
It is likely that those "combined" masks would be along the lines of "byte-wise interleave on one half of the vector"; I am not sure how to recover the original "8-byte unpack on high half followed by byte-wise interleave on low half" from that, and it may not be realistic to implement, given that there are many potential combinations.
This is a solved problem: LLVM fuses the shuffles at the IR level, yet generates optimal code for x86, ARM, etc. There are many trade-offs that a WASM runtime needs to make, and it is very fair for a run-time to balance other things against generating optimal code; that just means the run-time does not always generate optimal code even though it has all the information required to do so, and that's ok. Such run-times not generating optimal code for some SIMD shuffles is just a consequence of that, and I don't think there is anything we can do about it. If a user wants to generate WASM optimized for a particular run-time, they can always use "inline assembly" here to generate the exact sequence of opcodes that such run-times are able to pattern match.
I agree with @gnzlbg that it's easier for the compiler to merge shuffles than it is for the WASM runtime to split them. Introducing shuffles on wider elements would help the runtime detect common patterns, as it would have fewer possible shuffle masks to explore. But there are cases where the user doesn't want to shuffle the whole vector, only a part of it, and doesn't care about the values of the other elements. Such an instruction could be used for other things than shuffles; for example, it could be used to generate a half-width load without requiring any specialized instruction for it. I believe this approach is easier for the runtime than having extra values for shuffle masks, because if the runtime is not smart enough, it will have a default fully specified mask to work with and does not have to recreate one from an underspecified shuffle mask. This approach can easily be built right now, even with no current support for it, because ignoring this special instruction is a valid implementation of it.
You only need to match sequences of specialized shuffles that are faster than the generalized shuffle would be, so the combinatorics are not that bad.
@lemaitre LLVM generates optimal x86 machine code for this mask, and I don't think it pattern matches 2^80 masks. I suppose it just uses an algorithm to generate the optimal instruction sequence instead, but I haven't taken a look.
How do you know in the general case that the dynamic shuffle is the best option?
I assume that LLVM uses a combination of pattern matching and some heuristics to know if a sequence can be faster than the general shuffle. I don't think it is reasonable for the WASM runtime to have the same complexity for generating efficient shuffles as LLVM (at least for now). Plus, this doesn't solve the cases where some lanes are ignored and could be translated into different shuffles depending on the target architecture.
FWIW what LLVM does is here: https://github.com/llvm-mirror/llvm/blob/master/lib/Target/X86/X86ISelLowering.cpp (search for "Vector shuffle lowering" in the file, GitHub doesn't allow linking the line).
I think I've come around to agreeing that we should have uniform expectations for both the toolchain and its users and that we should not even implicitly suggest that users can expect any particular code to be emitted by engines, so allowing the toolchain to merge shuffles should be ok. That means that there will always be corner cases where the final code is slower than if users had full control, but engines can and should optimize common cases, so the total cost should be negligible.
We have four options, each with its own pros and cons. In order of effort, they are:

(1) Do nothing in the toolchain: keep optimizing shuffles as usual and leave it to engines to lower the merged shuffles efficiently.
(2) Disable shuffle optimizations in the WebAssembly LLVM backend so that user-written shuffles are passed through unchanged.
(3) Bake engine- and platform-specific knowledge of which shuffle masks are fast into the LLVM backend.
(4) Add special-purpose shuffle instructions (e.g. interleave) to the spec.
(1) has the downside of requiring a high performance cost or engine complexity cost for common patterns, but it is nice because it does not violate any abstractions and aligns with the discussion in #8. (2) has the downside of not being consistent with how other LLVM targets work. (3) has the downside of violating the WebAssembly abstraction and incorrectly reasoning about underlying implementations. (4) has the downside that supporting all common shuffle patterns would require many instructions and does not allow engines to gradually improve optimizations over time as discussed in #8. It is nice because it avoids both the performance cost and the abstraction violation.

The benefit of (2) and (3) is that they support users who have a deep understanding of particular engines and want to get particular native instruction sequences. I personally don't think we should be supporting anyone expecting to get particular instruction sequences, so I am not a fan of (2) or (3). My impression is that (1) would have an unacceptably high performance impact (do we have numbers?), so I would support looking more at (4), with an extremely high bar for what patterns are important enough to get their own instructions. I would really like to see numbers on just how bad (1) is for real workloads, though.
Notice that (1) just means that it would be up to the engines to lower the shuffles efficiently. Doing (2) or (3) on the LLVM side is "meaningless" if a later tool in the pipeline re-merges the shuffles anyway.

I'm not opposed to (4), but the first and hardest step is to identify commonly used shuffle patterns. It would be worth it to document these, along with the machine code they should lower to, in this repo. That would already be enough to allow WASM developers to at least try to implement efficient lowering for their shuffle intrinsics. If there are some patterns that are too hard to lower efficiently, maybe we could discuss them individually and see if it makes sense to add a new instruction to represent them, but hopefully good documentation for important patterns, and maybe a test suite, are enough.
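As a sketch of what such documentation could look like (the entries and expected lowerings here are illustrative, not an agreed list):

```c
#include <stdint.h>

typedef struct {
    const char *name;
    uint8_t mask[16];      /* i8x16.shuffle lane indices */
    const char *x64;       /* expected SSE-family lowering */
    const char *aarch64;   /* expected NEON lowering */
} shuffle_pattern_t;

static const shuffle_pattern_t common_patterns[] = {
    {"interleave low bytes",
     {0,16,1,17,2,18,3,19,4,20,5,21,6,22,7,23},       "punpcklbw", "zip1.16b"},
    {"interleave high bytes",
     {8,24,9,25,10,26,11,27,12,28,13,29,14,30,15,31}, "punpckhbw", "zip2.16b"},
    {"concatenate and shift by 3 bytes",
     {3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18},      "palignr",   "ext.16b"},
};
```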
#196 brings real numbers to this conversation. In #196 (comment) I proposed that we go with (2) for now. I previously said that we should not even implicitly suggest that users can expect any particular code to be emitted by engines.
I stand by that as an ideal, but it is clearly not how our early users are interacting with the proposal in practice, and I think we need to accept that. I'll close this issue and we can continue the discussion in #196, where we have real data and users to make the problem more concrete.
We received a bug report on emscripten (emscripten-core/emscripten#9340) because LLVM was combining shuffles a user had written as intrinsics, and V8 was therefore producing a slow pshufb instead of the pair of fast shuffle instructions the user had expected.
This raises the question of how the toolchain should reason about WebAssembly shuffles. The reporting user simply wanted the toolchain not to mess with their shuffle intrinsics, but in https://reviews.llvm.org/D66983 I received pushback from members of the LLVM community saying that shuffles should be optimized by the toolchain no matter what users write. The suggested fix is to bake knowledge of which WebAssembly shuffles will be fast into LLVM. The problem with this solution is that it puts engine- and platform-specific information into the WebAssembly backend, which will necessarily favor some platforms over others and generally breaks the WebAssembly abstraction layer. That is not a precedent I would like to set.
In #8 (comment) it is suggested that LLVM prefers not to mess with a user's shuffles unless it knows for sure that it can make them better. This is not inconsistent with the statement from Craig Topper on my LLVM patch that "x86 is aggressive about optimizing shuffles no matter where they came from" because the x86 backend can know for sure that many optimizations would be helpful.
The WebAssembly backend cannot know that a shuffle optimization will be helpful on all possible platforms, so I propose that in the WebAssembly LLVM backend we neither perform optimizations on shuffles that we do not know for sure will be useful nor bake in target-specific information that punches through the WebAssembly abstraction and favors some platforms over others. This is consistent with the philosophy behind x86's optimization approach but looks very different because it makes our shuffles essentially opaque, so I would like to hear the community's thoughts.
cc @sunfishcode because this is a continuation of a conversation you were in on the LLVM patch.