Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-10040: [Rust] Add slice that realigns Buffer #8223

Closed
wants to merge 2 commits into from

Conversation

nevi-me
Copy link
Contributor

@nevi-me nevi-me commented Sep 19, 2020

Has the consequence of removing the alignment limit on bool kernels.
It however comes at the cost of slower buffer manipulation.

Has the consequence of removing the alignment limit on bool kernels.
It however comes at the cost of slower buffer manipulation.
@nevi-me
Copy link
Contributor Author

nevi-me commented Sep 19, 2020

@jhorstmann @paddyhoran this is related to the alignment fixes made recently. I noticed while reviewing another PR that we had a limitation on boolean kernels if offsets weren't a multiple of 8. So I've implemented a slice on Buffer that copies the buffer and removes the offsets.

I expect there to be a minor perf impact, and I only use the above slice method when necessary.

It's very likely that my implementation can be improved upon, but I'm at my current limits on bit manipulation. Please see https://godbolt.org/z/bMc5qd for some of the assembly generated. Not happy with the instruction count, and would appreciate some guidance or improvements. Feel free to push any changes directly on this branch.

@github-actions
Copy link

@jhorstmann
Copy link
Contributor

Hi @nevi-me I guess I have to thank you for pushing this topic. I played around today with a different approach today that uses an iterator over the validity mask and returns bits in chunks of u8/u16/u32/u64. This can then be used together with a chunked iterator over the typed data of buffer to efficiently implement kernels. So far it's more of a proof of concept because it doesn't handle the remainder bits of this chunking yet. It's also largely untested. Generated code for a kernel looks very very nice, the inner loop gets unrolled and it even uses avx512 mask registers if available:

https://rust.godbolt.org/z/qGcrW8

This approach could also be used with explicit vector instructions in the inner loop, and a scalar loop for the remainder. Ensuring no out of bounds reads if the chunk size matches the vector register size.

Big todos are the remainder part and a lot of tests.

@jhorstmann
Copy link
Contributor

My proof of concept is available at https://github.com/jhorstmann/bititer-poc/blob/master/src/lib.rs but I'm not sure when I will have time to integrate it with arrow.

@nevi-me
Copy link
Contributor Author

nevi-me commented Sep 22, 2020

@jhorstmann can I close this PR, and rely on your implementation when ready? Also, do you think we'd be able to use your implementation in parquet, as we might need that for iterating through Boolean arrays when converting from Arrow to Parquet.

@jhorstmann
Copy link
Contributor

@nevi-me can you point me to the part of the parquet code that you have in mind? I found the BitReader used by bit packed encoding but that seems to solve a more general problem of reading arbitrary bit widths and also seems to work on a different parquet::util::memory::Buffer implementation instead of arrow::Buffer.

@nevi-me
Copy link
Contributor Author

nevi-me commented Sep 25, 2020

Hey @jhorstmann, I haven't had time to look, but maybe I'm confused. What I recall is that I needed a way of converting a Buffer to a Vec<bool> when I write to Parquet. It might even be that I don't need to do this though, but I'll see when I get time to work on the Arrow Parquet writer some more.

I'm closing this PR

@nevi-me nevi-me closed this Sep 25, 2020
nevi-me pushed a commit that referenced this pull request Oct 8, 2020
…itrary offsets

@nevi-me this is the chunked iterator based approach i mentioned in #8223

I'm not fully satisfied with the solution yet:

 - I'd prefer to move all the bit-based functions into `Bitmap`, but creating a `Bitmap` from a `&Buffer` would involve cloning an `Arc`.
 - I need to do some benchmarking about how much the `packed_simd` implementation actually helps. If it's not a big difference I'd propose to remove it to simplify the code.

Closes #8262 from jhorstmann/ARROW-10040-unaligned-bit-buffers

Authored-by: Jörn Horstmann <joern.horstmann@signavio.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
@nevi-me nevi-me deleted the ARROW-10040 branch October 17, 2020 19:21
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
…itrary offsets

@nevi-me this is the chunked iterator based approach i mentioned in apache#8223

I'm not fully satisfied with the solution yet:

 - I'd prefer to move all the bit-based functions into `Bitmap`, but creating a `Bitmap` from a `&Buffer` would involve cloning an `Arc`.
 - I need to do some benchmarking about how much the `packed_simd` implementation actually helps. If it's not a big difference I'd propose to remove it to simplify the code.

Closes apache#8262 from jhorstmann/ARROW-10040-unaligned-bit-buffers

Authored-by: Jörn Horstmann <joern.horstmann@signavio.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants