This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

Toolchain optimization of shuffles #118

Closed
tlively opened this issue Oct 16, 2019 · 28 comments

Comments

@tlively
Member

tlively commented Oct 16, 2019

We received a bug report on emscripten (emscripten-core/emscripten#9340) because LLVM was combining shuffles a user had written as intrinsics, and V8 was therefore producing a slow pshufb instead of the pair of fast shuffle instructions the user had expected.
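For concreteness, here is a minimal C sketch (not the reporter's actual code, and the masks are illustrative assumptions) of the pattern in question, written with clang's generic `__builtin_shufflevector`, which lowers to the LLVM IR shuffle that gets combined:

```c
#include <stdint.h>

typedef uint8_t u8x16 __attribute__((vector_size(16)));

/* Two explicit shuffles in the spirit of the original report: the first uses
   an unpack-style (punpcklbw-like) mask the user expected to stay cheap.
   LLVM's generic shuffle combining may fold the second shuffle into the
   first, so the engine sees a single v8x16.shuffle whose composed mask no
   longer matches a cheap native pattern. */
static inline u8x16 unpack_then_pick(u8x16 a, u8x16 b) {
  /* Byte-wise interleave of the low halves of a and b. */
  u8x16 lo = __builtin_shufflevector(a, b, 0, 16, 1, 17, 2, 18, 3, 19,
                                           4, 20, 5, 21, 6, 22, 7, 23);
  /* A second shuffle that consumes only the first result; LLVM is free to
     merge the two masks before the WebAssembly backend sees them. */
  return __builtin_shufflevector(lo, lo, 0, 2, 4, 6, 8, 10, 12, 14,
                                         1, 3, 5, 7, 9, 11, 13, 15);
}
```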

This raises the question of how the toolchain should reason about WebAssembly shuffles. The reporting user simply wanted the toolchain to not mess with their shuffle intrinsics, but in https://reviews.llvm.org/D66983 I received pushback from members of the LLVM community saying that shuffles should be optimized by the toolchain no matter what users write. The suggested fix is to bake knowledge of which WebAssembly shuffles will be fast into LLVM. The problem with this solution is that it depends on engine- and platform-specific information in the WebAssembly backend, which will necessarily favor some platforms over others and generally breaks the WebAssembly abstraction layer. That is not a precedent I would like to set.

In #8 (comment) it is suggested that LLVM prefers not to mess with a user's shuffles unless it knows for sure that it can make them better. This is not inconsistent with the statement from Craig Topper on my LLVM patch that "x86 is aggressive about optimizing shuffles no matter where they came from" because the x86 backend can know for sure that many optimizations would be helpful.

The WebAssembly backend cannot know that a shuffle optimization will be helpful on all possible platforms, so I propose that in the WebAssembly LLVM backend we neither perform optimizations on shuffles that we do not know for sure will be useful nor bake in target-specific information that punches through the WebAssembly abstraction and favors some platforms over others. This is consistent with the philosophy behind x86's optimization approach but looks very different because it makes our shuffles essentially opaque, so I would like to hear the community's thoughts.

cc @sunfishcode because this is a continuation of a conversation you were in on the LLVM patch.

@penzn
Contributor

penzn commented Oct 16, 2019

I think that part of the confusion is that Wasm shuffle lowering depends on the mask and can lead to more efficient instruction(s) in one case and less efficient ones in another, based on hardware and runtime. From the proposal's point of view, we can try to define the semantics better (very tricky and still up to the runtimes) or split the cases we think would be more efficient into instructions of their own, giving the toolchain the ability to emit either the faster variants or the (potentially slower) general shuffle.

@tlively
Member Author

tlively commented Oct 16, 2019

I don't want to repeat the discussion in #8 where we settled on the current shuffle solution as far as engines and the spec are concerned. Instead I would like feedback on whether the toolchain should be punching through the spec's shuffle abstraction and reasoning about underlying platform details. My position is that it should not, particularly because we have decided that the spec should not be doing so and the toolchain should target the spec, not underlying platforms.

@penzn
Contributor

penzn commented Oct 16, 2019

I haven't finished reading #8 yet, sorry. I think that from the Wasm point of view your suggestion for the toolchain is reasonable, but it may lead to clashes with the current semantics of hardware shuffles. I am personally in favor of adding supplementary instructions if that would make code generation less ambiguous.

@penzn
Contributor

penzn commented Oct 16, 2019

Interestingly, #8 (comment) describes exactly the situation the reporter of this bug ran into, and it seems the solution there was to leave the shuffles alone as well.

On the other hand, the bug shows an example where a native unpack is turned into a Wasm shuffle, which then gets inlined and combined with other shuffles. The reason this does not happen for the native target is the use of a separate instruction, so the toolchain can decide not to optimize, say, adjacent unpack ops (which are expected to be fast), while still doing so for shuffles (which are expected to be slow anyway). Note that a native shuffle can do what unpack can do, but is slower.

Disabling shuffle optimizations in LLVM would leave adjacent shuffles in the output; in situations where the runtime's code generator cannot optimize those into something better, it would emit multiple hardware shuffle instructions, which has its own drawbacks.

(sorry big edit, did not want to add a third comment)

@AndrewScheidecker
Contributor

My position is that it should not, particularly because we have decided that the spec should not be doing so and the toolchain should target the spec, not underlying platforms.

I don't think you should be tuning the toolchain for specific runtimes or CPUs, but how can you possibly generate good code for a target with unspecified performance?

It seems like LLVM should be targeting a composite runtime+native ISA that does some pattern matching on v8x16.shuffle immediates to match some specialized native shuffle instructions, but doesn't make heroic optimization efforts.
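As a rough illustration of that kind of immediate pattern matching (a hypothetical sketch, not V8's actual code), an engine could recognize a handful of special masks and fall back to a generic shuffle otherwise:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical check for the byte-wise interleave-low mask
   (0,16,1,17,...,7,23), which maps to a single punpcklbw on x86 or
   zip1 on AArch64; anything unrecognized would take the generic
   pshufb/tbl path. */
static bool is_interleave_low_bytes(const uint8_t mask[16]) {
  for (int i = 0; i < 8; ++i) {
    if (mask[2 * i] != (uint8_t)i || mask[2 * i + 1] != (uint8_t)(16 + i))
      return false;
  }
  return true;
}
```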

Regarding this specific code pattern: they are loading 16 interleaved RGB pixels and deinterleaving them into 16 Rs, 16 Gs, and 16 Bs. This is an instance of a common pattern, and the way to do it optimally is very ISA-specific: the OpenCV code they were porting has not just SSE2 and Neon variants, but also SSE3 and SSE4.1 variants! The Neon code path maps directly to Neon's 3-way interleaved load instruction: vld3.8.
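For reference, the scalar form of that deinterleave looks roughly like this (a sketch, not the OpenCV code):

```c
#include <stdint.h>

/* 16 interleaved RGB pixels (48 bytes) split into 16 Rs, 16 Gs, and 16 Bs.
   The SIMD versions express this with ISA-specific shuffles/unpacks, or
   with a single vld3.8 on Neon. */
static void deinterleave_rgb16(const uint8_t src[48],
                               uint8_t r[16], uint8_t g[16], uint8_t b[16]) {
  for (int i = 0; i < 16; ++i) {
    r[i] = src[3 * i + 0];
    g[i] = src[3 * i + 1];
    b[i] = src[3 * i + 2];
  }
}
```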

I think it's worth considering adding interleaved load/store instructions to WASM SIMD.

@nfrechette

I'm just lurking and I'm a bit late to the discussion. Would it be possible to support instructions friendly to SSE and friendly to NEON (a superset of both)? A web server could then serve two WASM versions, one per architecture. Both versions would work everywhere but would only yield optimal assembly on the target device.

At the end of the day, if you write high performance code, you have to know the target architecture. Writing for something like WASM that abstracts that will only lead to these sorts of problems.

For example, I write and maintain the Realtime Math library and I already support SSE, AVX, armv7, and ARM64. Adding flavors of those for various WASM targets would be trivial as long as WASM has intrinsics that match the desired target.

This is akin to compiling C++ while specifying the CPU it will run on. The code will run on other CPUs but it is tuned for the one you specify. Abstraction has a cost and it cannot always be hidden.

I intend to port the Animation Compression Library to WASM through emscripten next year (along with RTM). I would not mind writing SSE- and NEON-flavored WASM myself.

Worth noting that Intel ISPC takes a similar approach. You can compile for SSE2, AVX, and AVX512 in the same object file, and at runtime the optimal path is selected through a CPU flag check. One binary, optimal for all processors. In the case of WASM, that might be an option as well, where the LLVM backend can discard the flavor of targets it doesn't care about. This would yield a larger executable but everything would be bundled together.

2cents

@tlively
Member Author

tlively commented Oct 16, 2019

I don't think you should be tuning the toolchain for specific runtimes or CPUs, but how can you possibly generate good code for a target with unspecified performance?

It seems like LLVM should be targeting a composite runtime+native ISA that does some pattern matching on v8x16.shuffle immediates to match some specialized native shuffle instructions, but doesn't make heroic optimization efforts.

Although this would be technically possible, it sounds like we agree that this would violate the spirit and intent of WebAssembly. Target-specific optimizations are best left to engines, and it's on them to squeeze the last bit of performance out after the toolchain has done all the optimizations that are clearly beneficial on all platforms.


I'm just lurking and I'm a bit late to the discussion.

Thanks for chiming in! I think we're all late to this discussion, which dates back to the SIMD proposal for asm.js.

Would it be possible to support instructions friendly to SSE and friendly to NEON (a superset of both). It would be possible to provide two WASM versions for both architectures that a web server could serve. Both versions would work everywhere but would only yield optimal assembly on the target device.

This approach contradicts the goal of specifying a portable subset of SIMD operations for the initial proposal, but may be on the table for future SIMD extensions. Due to all the work and thought that went into #8, I consider it a settled matter that the current SIMD proposal will not have differentiated shuffles.

Worth noting that Intel ISPC takes a similar approach. You can compile for SSE2 and AVX and AVX512 in the same object file and at runtime through a cpu flag check, the optimal path is used. One binary, optimal for all processors. In the case of WASM, that might be an option as well where the llvm backend can discard the flavor of targets it doesn't care about. This would yield a larger executable but everything would be bundled together.

Having multiple code paths in a single executable is what the feature detection / conditional compilation proposal is all about. It allows bundling SIMD and non-SIMD code together and will allow for bundling different future SIMD extensions together as well. That's outside the scope of the current MVP SIMD proposal, though.

@gnzlbg
Contributor

gnzlbg commented Oct 17, 2019

Worth noting that Intel ISPC takes a similar approach.

Not really. "WASM" is your programming language, and just like with ISPC, you write only one version of the code, and your compiler can generate code for multiple platforms: your WASM -> native code compiler can generate code for SSE, AVX, etc.

For the issue at hand, the real problem is that V8 is generating bad machine code for a single WASM shuffle instruction. Whether that WASM code comes from LLVM or not doesn't really matter. The same user who reported this issue could have just used a single shuffle, just like LLVM is doing, and still be filing the same bug report saying that V8 generates bad code for it.

I don't think that we need to do anything about this here.

Whether LLVM wants to work around V8's bug or not is kind of up to them. There is nothing we can do here that would prevent LLVM from adding a -mcpu=v8-x86 flag that uses different instruction costs for that target. For this particular issue, LLVM is assuming that the WASM shuffle instruction has the same cost for all inputs. That's true for WASM, but it isn't true for the native code that will be generated from that WASM.

Still, even if LLVM were to generate V8-friendly code, nothing prevents V8 from merging the two shuffle instructions as an optimization and continuing to emit bad code... so I think it's kind of pointless to try to solve this at the LLVM level.

Is there a V8 bug report open for this issue? It would be interesting to know if there are any fundamental issues preventing this from being fixed in V8. AFAICT, V8 could just inspect the shuffle arguments and generate better code for the cases in which they are known.

@penzn
Contributor

penzn commented Oct 17, 2019

For the issue at hand, the real problem is that V8 is generating bad machine code for a single WASM shuffle instruction.

I thought the user used adjacent Wasm shuffles with "interleave" in the mask, which they were hoping would turn into corresponding x86 ops, but after the shuffles got combined by LLVM they ended up with pshufb instead. For V8 to crack this, it would have to reconstruct the original masks and emit multiple instructions.

The issue with LLVM is that it runs all shuffle operations through a generic shuffle IR instruction, and two of them can be combined depending on the mask. #8 (comment) has a description of how this works:

LLVM's x86 backend seems to have received more shuffle-related work recently (and more science). The code is complicated, but the general approach is to prefer pshufb unless the shuffle mask can be implemented in one or two special-purpose instructions. Part of the complication is that LLVM tries to honor the original shuffle masks from the source code unless it can replace it with something that is definitely better. This is to avoid breaking code that was written using <xmmintrin.h> by people who know what they are doing.

We don't have special-purpose shuffle instructions, which is why the workaround used for preserving the behavior of intrinsics would not work. We can either introduce differentiation between masks by adding language to the spec saying that certain masks are supposed to be fast, as proposed by @billbudge in #8 (comment), or we would have to disable shuffle optimizations for Wasm altogether. The downside of the latter is that it would inevitably slow down situations where there are two adjacent slow shuffles.

I am personally somewhat against disabling shuffle optimizations altogether, and in favor of differentiating (ideally by adding special-purpose ops like interleave, though I know there is consensus against that).

@PeterJensen do you have any thoughts, as you were part of the discussion in #8?

@tlively
Member Author

tlively commented Oct 17, 2019

LLVM tries to honor the original shuffle masks from the source code unless it can replace it with something that is definitely better.

This really seemed like a good philosophy to me, and clearly motivates inhibiting merges of generic shuffles in the absence of specialized shuffle instructions.

@sunfishcode
Member

This really seemed like a good philosophy to me, and clearly motivates inhibiting merges of generic shuffles in the absence of specialized shuffle instructions.

I don't have a strong opinion either way at this point, but I do want to make one observation.

The philosophy you describe works because users who write intrinsics often have an idea of which hardware instructions they want. However, if we're compiling to wasm, and we don't want the compiler to know that some masks are cheaper than others, then users knowing that some masks are cheaper than others has the same problem.

@tlively
Member Author

tlively commented Oct 17, 2019

That's a good point. I actually would prefer users not try to target underlying platforms, and I don't think the original reporter should be expecting to observe any particular instruction sequence emitted by the engine. But the difference between users and the toolchain is that I can control the assumptions the toolchain makes ;)

@penzn
Contributor

penzn commented Oct 17, 2019

I actually would prefer users not try to target underlying platforms, and I don't think the original reporter should be expecting to observe any particular instruction sequence emitted by the engine.

This is a bit of a catch-22: if users should not expect specific shuffle instructions, then they cannot reliably port code that depends on those. The reporter from the original issue would only be able to use Wasm SIMD in OpenCV if they can expect its kernels to run faster than scalar code due to hardware support for pack/unpack instructions (more on availability below). Even once the LLVM logic is changed and the correct shuffle masks get to V8, the port would again be relying on V8 emitting faster shuffles for those masks; otherwise there is no point in doing it.

The performance penalties for not using accelerated shuffles are real: the code in the original report ran slower than scalar, despite Emscripten reducing the number of shuffle instructions - huningxin/opencv#301.

Accelerated shuffle instructions are present in every SIMD ISA and do roughly the same thing across platforms (interleave, shift/rotate, etc.); see another example in #110. And there is lots of code that depends on them for performance. It would be unfortunate if the same code can be written using Neon and SSE, but not Wasm SIMD.

@gnzlbg
Contributor

gnzlbg commented Oct 17, 2019

I thought the user used adjacent Wasm shuffles with "interleave" in the mask, which they were hoping would turn into corresponding x86 ops,

This is correct, that's what the user hoped would happen, and that's problematic, because those expectations are unrealistic.

To make an analogy: when I compile code to x86 with clang, I expect clang to generate optimal x86 code for that target. Does clang have a bug if, when that x86 code is run on ARM hardware under QEMU, QEMU is not able to lower it to optimal ARM instructions because "clang optimized the program too much"?

I don't think clang has a bug. I think that the expectation that clang would prefer to optimize for "x86 for running on ARM on top of QEMU" instead of just for real "x86" hardware is unrealistic and incorrect, because that's not what clang says it does.

One can definitely build a toolchain that does that (or add a -mcpu=wasm-v8-x86 to emscripten!), but those are just not what modern toolchains target by default when one just says "wasm" or "x86".

IMO, the better question to ask here is: "Why did this user have the wrong expectations?" Maybe the clang / emscripten intrinsics are documented as "this lowers exactly to this instruction". If so, either the user is right, and the bug is that emscripten does not use "inline assembly" to implement these intrinsics, or the user has discovered a documentation bug, and these intrinsics should guarantee the semantics of the operations they perform, maybe using the instruction as an "example" of what might be generated, but without guaranteeing that this precise instruction is always generated.

IMO, the spirit of the WASM C intrinsics is that we guarantee the semantics of the operation and not a particular instruction, because otherwise, optimizations like "constant folding" would be illegal, since when these optimizations are performed the instruction that we would promise would not be generated.

I think most users want these optimizations and that V8 has a performance bug. If a user really wants a specific WASM instruction to be generated, most toolchains offer a feature for that. For example, with emscripten and clang, you can use "inline assembly" to achieve that, and this might be the only way this particular user has to workaround the V8 performance bug for the time being.

@penzn
Contributor

penzn commented Oct 17, 2019

That is a fair point, the spec does not say that some shuffle masks are faster than others. And I am not proposing to change that, maybe even the opposite - add a note that all shuffle masks are equivalent.

I think most users want these optimizations and that V8 has a performance bug. If a user really wants a specific WASM instruction to be generated, most toolchains offer a feature for that.

Just to be clear, the user was not looking for a drastically different instruction pattern per se; they were expecting a certain shuffle mask, which they knew V8 would accelerate for them, as it intentionally emits fast shuffles if the mask is right. The currently proposed fix would disable shuffle transformations in the toolchain, which would allow users to still write "fast" shuffle masks without the compiler getting in their way - I feel that is the opposite of what should happen if the shuffle masks are truly equal (though it is definitely better than the situation where the combined shuffle performs much worse than the sum of its components).

I am worried that the fix would calcify developers' attitude that some masks are fast, and the only way to discourage relying on mask patterns is to provide separate instructions for those situations (there are other minor benefits, like code size reduction). I think it is important given the amount of code in the wild using those operations.

@gnzlbg
Contributor

gnzlbg commented Oct 18, 2019

I feel that is the opposite of what should happen if the shuffle masks are truly equal

The single shuffle with one mask is semantically equivalent to the two shuffles with different masks (otherwise the LLVM optimization that gets performed would be unsound, and this would just be an LLVM bug).
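To make the equivalence concrete, here is a minimal sketch of the mask composition for the simple case where the second shuffle reads only the result of the first (all of its indices in 0..15):

```c
#include <stdint.h>

/* If r = shuffle(a, b, first) and the final result is shuffle(r, r, second)
   with second[i] in 0..15, then lane i of the result is lane first[second[i]]
   of the concatenation (a, b), so a single shuffle with the composed mask is
   semantically identical to the pair. */
static void compose_masks(const uint8_t first[16], const uint8_t second[16],
                          uint8_t composed[16]) {
  for (int i = 0; i < 16; ++i)
    composed[i] = first[second[i]];
}
```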

The problem here is that V8 does not have a pattern match for this mask, so they generate poor machine code for it.

Maybe that's a performance bug in V8, or maybe that's by design because V8 prefers to trade run-time performance for code generation time (although adding a new case to that table wouldn't really increase code generation time by any measurable amount), but since nobody has filed a V8 bug, we can't know what's what.

I don't think it would be a good idea for LLVM to de-optimize this code to make it faster on V8 by default since that would increase binary size and translation time and even run-time for other use cases.

@AndrewScheidecker
Contributor

For shuffles specifically, I think that runtimes should generally prefer consuming fused shuffles to unfused shuffles, and for the shuffle masks to be canonicalized in any way possible. It's easier for a runtime to turn a fused shuffle into multiple specialized shuffles than it is to turn multiple unfused general-purpose shuffles into a single general-purpose shuffle.

This seems more like a case where v8 (and probably others) are missing a shuffle mask pattern instead of a case where LLVM should not be fusing shuffles.

@penzn
Contributor

penzn commented Oct 19, 2019

It is likely that those "combined" masks would be along the lines of "byte-wise interleave on one half of the vector". I am not sure how to recover the original "8-byte unpack on high half followed by byte-wise interleave on low half" from that, and it may not be realistic to implement, given that there are many potential combinations.

@gnzlbg
Contributor

gnzlbg commented Oct 19, 2019

It is likely that those "combined" masks would be along the lines of "byte-wise interleave on one half of the vector". I am not sure how to recover the original "8-byte unpack on high half followed by byte-wise interleave on low half" from that, and it may not be realistic to implement, given that there are many potential combinations.

This is a solved problem. LLVM fuses the shuffle at the IR level, yet generates optimal code for x86, arm, etc.

There are many trade-offs that a WASM runtime needs to make, and it is very fair for a WASM runtime to balance other concerns against generating optimal code. That just means such a runtime does not always generate optimal code even though it has all the information required to do so, and that's ok.

Such run-times not generating optimal code for some SIMD shuffles would just be a consequence of that. I don't think there is anything we can do about it.

If a user wants to generate WASM optimized for a particular run-time, they can always use "inline assembly" here to generate the exact sequence of opcodes that such run-times are able to pattern match.

@lemaitre

I agree with @gnzlbg that it's easier for the compiler to merge shuffles than it is for the WASM runtime.
So I think the guideline should be that the compiler should not assume any fast mask for shuffling and should merge shuffles as much as possible. Then, the WASM runtime would detect common mask patterns and generate optimal instructions for those.
The problem here is that it would require too many cases to explore (2^80 possible masks).
We cannot reasonably expect perfect codegen for all possible masks.
So the runtime has to limit itself to common shuffling rules (shuffling larger elements is a common pattern and should be recognized).

Introducing shuffles on wider elements would help the runtime to detect common patterns as it will have less possible shuffling masks to explore.

But there are cases where the user doesn't want to shuffle the whole vector, but only a part, and doesn't care about the value of the other elements.
A smart runtime could, in theory, use this information to select a better native shuffle that depends on the actual architecture.
In order to do that, I think we could have a special WASM instruction to tell which elements we care about, and which ones we don't.
That way, if we want to shuffle only the lower half, the compiler can generate a full-width shuffle followed by this special instruction to "select" only the lower half.
If the runtime is smart enough, it will be able to merge those 2 instructions and know the real intent of the user. And if the runtime is simpler, it can simply ignore the "select" instruction and generate the full-width shuffle with no further optimization.
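Incidentally, the toolchain already has a related notion of don't-care lanes: clang's `__builtin_shufflevector` accepts -1 for result lanes whose value does not matter (an undef lane in LLVM IR). A sketch of expressing "shuffle only the lower half" that way (the wasm-level "select" hint itself is hypothetical):

```c
#include <stdint.h>

typedef uint8_t u8x16 __attribute__((vector_size(16)));

/* Only the low 8 result lanes matter; the rest are explicitly don't-care,
   which a sufficiently smart backend could exploit when picking a native
   shuffle. */
static inline u8x16 reverse_low_half(u8x16 a) {
  return __builtin_shufflevector(a, a, 7, 6, 5, 4, 3, 2, 1, 0,
                                       -1, -1, -1, -1, -1, -1, -1, -1);
}
```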

Such an instruction could be used for things other than shuffles. It could be used to generate half-width loads without requiring any specialized instructions for it, or to specify scalar operations on vectors.

I believe this approach is easier for the runtime than having extra values for shuffling masks because if the runtime is not smart enough, it will have a default fully specified mask to work with and does not have to recreate it from the underspecified shuffling mask.

This approach can easily be adopted right now even if there is no current support for it, because ignoring this special instruction is a valid implementation of it.

@AndrewScheidecker
Contributor

The problem here is that it would require too many cases to explore (2^80 possible masks).
We cannot reasonably expect perfect codegen for all possible masks.

You only need to match sequences of specialized shuffles that are faster than the generalized shuffle would be, so the combinatorics are not that bad.

@gnzlbg
Contributor

gnzlbg commented Oct 20, 2019

The problem here is that it would require too many cases to explore (2^80 possible masks).

@lemaitre LLVM generates optimal x86 machine code for this mask and I don't think it pattern matches 2^80 masks. I suppose it just uses an algorithm for generating the optimal instruction sequence instead, but I haven't taken a look.

@lemaitre

You only need to match sequences of specialized shuffles that are faster than the generalized shuffle would be, so the combinatorics are not that bad.

How do you know in the general case that the dynamic shuffle is the best option?
The main problem is not generating a couple of instructions to make the shuffle correct, but finding the most efficient way to do it, and ensuring it is actually more efficient than the default shuffle.

I suppose it just uses an algorithm for generating the optimal instruction sequence instead, but I haven't taken a look.

I assume that LLVM uses a combination of pattern matching and some heuristics to know if it can be faster than the general shuffle.
And the complexity of this will increase with time and the number of specialized instructions on newer architectures.

I don't think it is reasonable for the WASM runtime to have the same complexity for generating efficient shuffles as LLVM (at least for now).

Plus, this doesn't solve the cases where some lanes are ignored and could be translated into different shuffles according to the target architecture.

@gnzlbg
Contributor

gnzlbg commented Oct 20, 2019

FWIW what LLVM does is here: https://github.com/llvm-mirror/llvm/blob/master/lib/Target/X86/X86ISelLowering.cpp (search for "Vector shuffle lowering" in the file, github doesn't allow linking the line).

@tlively
Member Author

tlively commented Oct 21, 2019

I think I've come around to agreeing that we should have uniform expectations for both the toolchain and its users and that we should not even implicitly suggest that users can expect any particular code to be emitted by engines, so allowing the toolchain to merge shuffles should be ok.

That means that there will always be corner cases where the final code is slower than if users had full control, but engines can and should optimize common cases, so the total cost should be negligible.

@tlively
Member Author

tlively commented Oct 30, 2019

We have four options, each with its own pros and cons. In order of effort, they are:

  1. Do nothing and allow LLVM to continue merging any shuffles.
  2. Prevent LLVM from merging shuffles (possibly controlled by a flag).
  3. Prevent LLVM from merging shuffles that are "known good" according to engine- and platform-specific heuristics baked into the tools.
  4. Add specialized instructions for common shuffle patterns.

(1) has the downside of imposing either a high performance cost or a high engine complexity cost for common patterns, but it is nice because it does not violate any abstractions and aligns with the discussion in #8.

(2) has the downside of not being consistent with how other LLVM targets work.

(3) has the downside of violating the WebAssembly abstraction and incorrectly reasoning about underlying implementations.

(4) has the downside that supporting all common shuffle patterns would require many instructions and does not allow engines to gradually improve optimizations over time as discussed in #8. It is nice because it avoids both performance cost and abstraction violation.

The benefit of (2) and (3) is that they support users who have deep understanding of particular engines and want to get particular native instruction sequences.

I personally don't think we should be supporting anyone expecting to get particular instruction sequences, so I am not a fan of (2) or (3). My impression is that (1) would have an unacceptably high performance impact (do we have numbers?), so I would support looking more at (4), with an extremely high bar for what patterns are important enough to get their own instructions. I would really like to see numbers on just how bad (1) is for real workloads, though.

@gnzlbg
Contributor

gnzlbg commented Jan 2, 2020

My impression is that (1) would have an unacceptably high performance impact (do we have numbers?), so I would support looking more at (4), with an extremely high bar for what patterns are important enough to get their own instructions. I would really like to see numbers on just how bad (1) is for real workloads, though.

Notice that (1) just means that it would be up to the engines to lower the shuffles efficiently. Doing (2) or (3) on the LLVM side is "meaningless" if a tool like wasm-opt is then run on the generated WASM and ends up producing the same "optimized" code as (1).

I'm not opposed to (4), but the first and hardest step is to identify commonly used shuffle patterns. It would be worth it to document these along with the machine code they should lower to in this repo. That would already be enough to allow WASM developers to at least try to implement efficient lowering for their shuffle intrinsics. If there are some patterns that are too hard to lower efficiently, maybe we could discuss them individually and see if it makes sense to add a new instruction to represent them, but hopefully good documentation for important patterns, and maybe a test suite, are enough.

@tlively
Member Author

tlively commented Feb 18, 2020

#196 brings real numbers to this conversation. In #196 (comment) I proposed that we go with (2) for now. I previously said,

I personally don't think we should be supporting anyone expecting to get particular instruction sequences, so I am not a fan of (2) or (3).

I stand by that as an ideal, but it is clearly not how our early users are interacting with the proposal in practice and I think we need to accept that. I'll close this issue and we can continue the discussion in #196 where we have real data and users to make the problem more concrete.

@tlively tlively closed this as completed Feb 18, 2020