ARROW-6718: [DRAFT] [Rust] Remove packed_simd #7037
Conversation
Removing packed_simd would also make this bug obsolete: https://issues.apache.org/jira/browse/ARROW-8598
Thanks @nevi-me. I re-created everything locally and I don't see a reason to keep it. Also, the future of packed_simd is unclear, and it's one less dependency on nightly; now we just need specialization on stable...
Hi,
I attempted to change this to use
to the
and
this led to a significant performance improvement:
@nevi-me @paddyhoran any thoughts on how to improve this further?
This definitely looks great from the code-reduction point of view :D, but yes, it would be better if we can keep the perf loss to a minimum.
I'm not seeing light at the end of this tunnel yet - maybe we could try to replace it with
Another way is to rewrite the code (I think it is mostly used in the Parquet encoder/decoder and Arrow array builders?). It is tedious work, but I think it should be doable.
Hi @yordan-pavlov, we want to remove packed_simd due to the uncertainty around it being stabilised soon. So far we have found that if we optimise some non-SIMD code, we don't lose a lot of performance relative to the explicit SIMD code. It'd be great if we could see where we could further improve the non-SIMD functions.
Hi @nevi-me, I would love to have arrow use stable Rust (but I also want high performance). I have been working on some benchmarks which compare the performance of filtering in-memory data implemented in different ways: for-loop, iterator, arrow without SIMD, arrow with SIMD, etc. I have also done pretty much the same benchmarks for .NET Core and Rust. I hope to find some time to publish those benchmarks in the next few days, but it does look like the Rust compiler is very good at auto-vectorization.

So far the best performance I have managed to produce with arrow and SIMD (357.53 us) is only about twice as fast as filtering using an iterator without arrow (654.22 us), but it is about four times faster than filtering with arrow and loops (1.2704 ms), and many times faster than filtering with arrow using compute::no_simd_compare_op (8.4308 ms). Yes, SIMD is a lot of work, but at the moment it gives the best performance. I wonder if the SIMD features could be moved to another library, separate from the core arrow library.
That's interesting. How are the other benchmarks affected, or are you only focusing on filter? To the extent that there's overlap with
@nevi-me I have only focused on filtering so far; I will probably implement benchmarks for other operations once I have fully explored filtering. |
@yordan-pavlov when you say SIMD you mean the "simd" feature (packed_simd), right? |
@paddyhoran yes that's what I meant, apologies if it was unclear |
Hi @nevi-me, I hope it will be useful in benchmarking and improving arrow performance.
So, I'm not sure now about removing SIMD altogether. It took me some time to add all the SIMD features, but I don't pretend to be an expert on it. I'd hate to remove all the SIMD code only to realize that someone smarter than I am could have improved it (like @yordan-pavlov). I think we shouldn't use it by default because it requires nightly; #7057 changes it to be opt-in. I would love to remove some of the code if it's not required, though. @yordan-pavlov I think the best way to move this forward is for you to contribute your benchmarks to the main arrow repo. Then we can look at removing the SIMD code piece by piece to ensure we don't have performance regressions. Thoughts?
I agree with @paddyhoran - if the goal is just to enable the use of arrow with stable Rust, it would be reasonable to just not enable the SIMD feature by default, but still keep it so it is available as a choice for those users who need the best performance possible. A lot of work has gone into the SIMD feature already and it would be a shame to remove it prematurely, without doing enough benchmarking. Furthermore, I think Rust could have a great future in the big data space and I think this project could play an important part. But SIMD is important in big data. So we should be looking to have SIMD stabilized (in Rust) rather than remove it. If SIMD is removed from arrow, what killer feature would motivate its stabilization in Rust? For convenience here are the results from my filtering benchmarks:
In the table above we can see that SIMD filtering (against scalar values) is 49% faster than a loop, and 76% faster than an iterator implementation. This could mean the difference between waiting 12h or 7h for a job to complete. So I think more benchmarking, profiling and performance improvement work has to be done before it can be decided with confidence whether to remove SIMD. The source code for the benchmarks used to produce the results above is here: I am happy to contribute benchmarks; I just have to figure out how / if they would fit in the main arrow repo.
I'm happy with turning SIMD off by default |
This removes the dependency on packed_simd. I initially thought that boolean kernels were slower than with explicit SIMD, but this was a false alarm as the benchmarks weren't comparing SIMD vs non-SIMD.
While doing this, I noticed that the `divide` kernel appears to be unsound, as it checks if a null is 0 (which can be true when the default data behind the bitmask is 0).

Below is the performance comparison:
From 0.15.0 to 0.16.0
0.16.0 included the change that I made to autovectorize some compute kernels. This mainly resulted in non-SIMD kernels having a smaller performance gap (from 50-60% slower to 10-20% slower).
From 0.16.0 to no `packed_simd`
Are the perf drops worth it?
I suppose it'll boil down to whether getting closer to stable Rust (without feature flags) is worth the slight performance drop.
Outstanding work to do