New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-10040: [Rust] Add slice that realigns Buffer #8223
Conversation
Has the consequence of removing the alignment limit on bool kernels. It however comes at the cost of slower buffer manipulation.
@jhorstmann @paddyhoran this is related to the alignment fixes made recently. I noticed while reviewing another PR that we had a limitation on boolean kernels if offsets weren't a multiple of 8. So I've implemented a slice on I expect there to be a minor perf impact, and I only use the above slice method when necessary. It's very likely that my implementation can be improved upon, but I'm at my current limits on bit manipulation. Please see https://godbolt.org/z/bMc5qd for some of the assembly generated. Not happy with the instruction count, and would appreciate some guidance or improvements. Feel free to push any changes directly on this branch. |
Hi @nevi-me I guess I have to thank you for pushing this topic. I played around today with a different approach today that uses an iterator over the validity mask and returns bits in chunks of u8/u16/u32/u64. This can then be used together with a chunked iterator over the typed data of buffer to efficiently implement kernels. So far it's more of a proof of concept because it doesn't handle the remainder bits of this chunking yet. It's also largely untested. Generated code for a kernel looks very very nice, the inner loop gets unrolled and it even uses avx512 mask registers if available: https://rust.godbolt.org/z/qGcrW8 This approach could also be used with explicit vector instructions in the inner loop, and a scalar loop for the remainder. Ensuring no out of bounds reads if the chunk size matches the vector register size. Big todos are the remainder part and a lot of tests. |
My proof of concept is available at https://github.com/jhorstmann/bititer-poc/blob/master/src/lib.rs but I'm not sure when I will have time to integrate it with arrow. |
@jhorstmann can I close this PR, and rely on your implementation when ready? Also, do you think we'd be able to use your implementation in |
@nevi-me can you point me to the part of the parquet code that you have in mind? I found the |
Hey @jhorstmann, I haven't had time to look, but maybe I'm confused. What I recall is that I needed a way of converting a Buffer to a I'm closing this PR |
…itrary offsets @nevi-me this is the chunked iterator based approach i mentioned in #8223 I'm not fully satisfied with the solution yet: - I'd prefer to move all the bit-based functions into `Bitmap`, but creating a `Bitmap` from a `&Buffer` would involve cloning an `Arc`. - I need to do some benchmarking about how much the `packed_simd` implementation actually helps. If it's not a big difference I'd propose to remove it to simplify the code. Closes #8262 from jhorstmann/ARROW-10040-unaligned-bit-buffers Authored-by: Jörn Horstmann <joern.horstmann@signavio.com> Signed-off-by: Neville Dipale <nevilledips@gmail.com>
…itrary offsets @nevi-me this is the chunked iterator based approach i mentioned in apache#8223 I'm not fully satisfied with the solution yet: - I'd prefer to move all the bit-based functions into `Bitmap`, but creating a `Bitmap` from a `&Buffer` would involve cloning an `Arc`. - I need to do some benchmarking about how much the `packed_simd` implementation actually helps. If it's not a big difference I'd propose to remove it to simplify the code. Closes apache#8262 from jhorstmann/ARROW-10040-unaligned-bit-buffers Authored-by: Jörn Horstmann <joern.horstmann@signavio.com> Signed-off-by: Neville Dipale <nevilledips@gmail.com>
Has the consequence of removing the alignment limit on bool kernels.
It however comes at the cost of slower buffer manipulation.