Add view buffer for parquet reader #5970
Thanks @XiangpengHao -- this looks pretty exciting. I have a suggestion on how we can potentially avoid adding so much duplicated code. Let me know what you think
I also merged this PR up from main to get #5968
Looks almost perfect to me -- I had a note about `unsafe` that I think is worth looking into
Now that we pulled `create_view_unchecked` into its own function, we should also run the benchmarks. I have some other benchmarks running now, but can run them on this PR when that is done
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
I synced the commits to #5972; once we merge this, we can proceed to that one
Looks good to me -- I am running benchmarks and then plan to merge
Specifically to ensure the refactoring in the builder doesn't slow down other things accidentally
Initial run:

group                                                                        master                           parquet-string-view
-----                                                                        ------                           -------------------
arrow_array_reader/BinaryViewArray/dictionary encoded, mandatory, no NULLs   1.00  908.5±2.26µs    ? ?/sec    1.08  984.2±3.57µs    ? ?/sec
arrow_array_reader/BinaryViewArray/dictionary encoded, optional, half NULLs  1.01  1613.4±16.09µs  ? ?/sec    1.00  1597.7±3.75µs   ? ?/sec
arrow_array_reader/BinaryViewArray/dictionary encoded, optional, no NULLs    1.00  907.9±4.59µs    ? ?/sec    1.09  987.9±3.75µs    ? ?/sec
arrow_array_reader/BinaryViewArray/plain encoded, mandatory, no NULLs        1.00  1037.1±5.51µs   ? ?/sec    1.10  1137.4±132.17µs ? ?/sec
arrow_array_reader/BinaryViewArray/plain encoded, optional, half NULLs       1.01  1725.4±7.13µs   ? ?/sec    1.00  1703.4±6.10µs   ? ?/sec
arrow_array_reader/BinaryViewArray/plain encoded, optional, no NULLs         1.00  1047.8±6.97µs   ? ?/sec    1.09  1144.8±134.91µs ? ?/sec
arrow_array_reader/StringViewArray/dictionary encoded, mandatory, no NULLs   1.00  1992.5±5.36µs   ? ?/sec    1.03  2.0±0.01ms      ? ?/sec
arrow_array_reader/StringViewArray/dictionary encoded, optional, half NULLs  1.02  2.8±0.01ms      ? ?/sec    1.00  2.7±0.01ms      ? ?/sec
arrow_array_reader/StringViewArray/dictionary encoded, optional, no NULLs    1.00  2.0±0.01ms      ? ?/sec    1.02  2.1±0.01ms      ? ?/sec
arrow_array_reader/StringViewArray/plain encoded, mandatory, no NULLs        1.00  2.1±0.01ms      ? ?/sec    1.04  2.2±0.01ms      ? ?/sec
arrow_array_reader/StringViewArray/plain encoded, optional, half NULLs       1.02  2.8±0.01ms      ? ?/sec    1.00  2.8±0.01ms      ? ?/sec
arrow_array_reader/StringViewArray/plain encoded, optional, no NULLs         1.00  2.1±0.01ms      ? ?/sec    1.04  2.2±0.01ms      ? ?/sec

Seems like some of the benchmarks have gotten slower 🤔
Maybe it is time to throw a
Also I wonder if we should change the name from
(I am rerunning those numbers to see if they are reproducible)
Ok, now this claims this branch is faster than main for some reason (which is surprising as this code isn't used). Anyhow, let's keep the code going
🚀
Thanks again @XiangpengHao
Which issue does this PR close?
Part of #5904, sequel to #5968
Rationale for this change
This PR is not ready to review until #5968 is merged.
Currently we build an OffsetBuffer from the parquet decoder, which is not ideal for performance. Instead, we should directly build the view buffer that can be used to build a StringViewArray.
This PR is largely inspired by the excellent work of @ariesdevil ❤️ from #5557
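To make the rationale concrete, here is a minimal sketch of what "building the view directly" means at the byte level. It follows the Arrow view layout (a 16-byte little-endian value holding the length, then either up to 12 inline bytes or a 4-byte prefix plus buffer index and offset). The function name `make_view_sketch` and its signature are assumptions made for illustration; they are not the code in this PR.

```rust
/// Hypothetical helper (not the PR's API): encode one value as a
/// 16-byte Arrow view. `buffer_index` and `offset` say where a long
/// value lives in the data buffers; they are ignored for short values.
fn make_view_sketch(data: &[u8], buffer_index: u32, offset: u32) -> u128 {
    let len = data.len() as u32;
    let mut bytes = [0u8; 16];
    // Bytes 0..4: value length.
    bytes[0..4].copy_from_slice(&len.to_le_bytes());
    if len <= 12 {
        // Short value: stored inline, zero padded.
        bytes[4..4 + data.len()].copy_from_slice(data);
    } else {
        // Long value: 4-byte prefix, then buffer index and offset.
        bytes[4..8].copy_from_slice(&data[0..4]);
        bytes[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        bytes[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    u128::from_le_bytes(bytes)
}
```

Because each value becomes a fixed-size `u128`, a decoder can append views as it reads values, without first accumulating an offsets array and converting afterwards.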
What changes are included in this PR?
Added a view buffer. It should be straightforward, and much of the functionality is duplicated from GenericByteViewBuilder. We can't use GenericByteViewBuilder directly because we need pad_null, which is somewhat unique to parquet reading scenarios.
I did not include how to use this view buffer in this PR, for ease of review. Once we get this in, I'll file a new PR that uses the view buffer to read BinaryViewArray.
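As a rough illustration of why null padding is specific to parquet reading: the decoder produces only the non-null values, packed densely, and they must then be spread out to their final row positions according to the validity information. The sketch below is an assumption-laden stand-in, not the PR's code: `ViewBufferSketch`, the method signature, and the use of `&[bool]` for validity (instead of a packed bitmask) are all hypothetical.

```rust
/// Hypothetical stand-in for the view buffer; only the views are modeled.
struct ViewBufferSketch {
    views: Vec<u128>,
}

impl ViewBufferSketch {
    /// Spread `values_read` densely packed views, starting at
    /// `read_offset`, over `levels_read` row slots. `valid[i]` says
    /// whether row `read_offset + i` is non-null and must contain
    /// exactly `values_read` true entries.
    fn pad_nulls(
        &mut self,
        read_offset: usize,
        values_read: usize,
        levels_read: usize,
        valid: &[bool],
    ) {
        // Grow to the final row count; new slots hold the empty view (0).
        self.views.resize(read_offset + levels_read, 0);
        // One past the last densely packed value.
        let mut src = read_offset + values_read;
        // Walk rows back to front so a value is never overwritten
        // before it has been moved.
        for level in (0..levels_read).rev() {
            if !valid[level] {
                continue;
            }
            src -= 1;
            let dst = read_offset + level;
            if dst == src {
                // All remaining values are already in place.
                break;
            }
            self.views[dst] = self.views[src];
            // Leave an empty view in the vacated slot.
            self.views[src] = 0;
        }
    }
}
```

Zeroing the vacated slot is a choice made here to keep null rows deterministic; a real implementation could leave stale views there, since the validity mask hides them.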
Are there any user-facing changes?