Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add view buffer for parquet reader #5970

Merged
merged 11 commits into from
Jun 28, 2024

Conversation

XiangpengHao
Copy link
Contributor

Which issue does this PR close?

Part of #5904 , sequel to #5968

Rationale for this change

This PR is not ready to review until we merged #5968 .

Currently we build OffsetBuffer from the parquet decoder, this is not ideal for optimal performance. Instead, we should directly build the view buffer that can be used to build StringViewArray.

This PR is largely inspired by the excellent work of @ariesdevil ❤️ from #5557

What changes are included in this PR?

Added a view buffer. It should be straightforward and much of the functionality is duplicated from GenericByteViewBuilder. But we can't directly use it because we need to pad_null, which is somewhat unique to parquet reading scenarios.

I did not include how to use this view buffer in this PR for ease of review. Once we get this in, I'll file a new PR to use View buffer to read BinaryViewArray

Are there any user-facing changes?

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jun 27, 2024
@alamb alamb marked this pull request as ready for review June 28, 2024 11:09
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @XiangpengHao -- this looks pretty exciting. I have a suggestion on how we can potentially avoid adding so much duplicated code. Let me know what you think

parquet/src/arrow/buffer/view_buffer.rs Show resolved Hide resolved
@alamb
Copy link
Contributor

alamb commented Jun 28, 2024

I also merged this PR up from main to get #5968

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jun 28, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks almost perfect to me -- I had a note about unsafe that I think is worth looking into

Now that we pulled 'create_view_unchecked` into its own function, we should also run the benchmarks. I have some other benchmarks running now, but can run them on this PR when that is done

arrow-array/src/builder/generic_bytes_view_builder.rs Outdated Show resolved Hide resolved
parquet/src/arrow/buffer/view_buffer.rs Show resolved Hide resolved
XiangpengHao and others added 2 commits June 28, 2024 15:48
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@XiangpengHao
Copy link
Contributor Author

I synced the commits to #5972, once we merge this, we can proceed to that one

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me -- I am running benchmarks and then plan to merge

Specifically to ensure the refactoring in the builder doesn't slow down other things accidentally

@alamb
Copy link
Contributor

alamb commented Jun 28, 2024

Initial run:

group                                                                          master                                 parquet-string-view
-----                                                                          ------                                 -------------------
arrow_array_reader/BinaryViewArray/dictionary encoded, mandatory, no NULLs     1.00    908.5±2.26µs        ? ?/sec    1.08    984.2±3.57µs        ? ?/sec
arrow_array_reader/BinaryViewArray/dictionary encoded, optional, half NULLs    1.01  1613.4±16.09µs        ? ?/sec    1.00   1597.7±3.75µs        ? ?/sec
arrow_array_reader/BinaryViewArray/dictionary encoded, optional, no NULLs      1.00    907.9±4.59µs        ? ?/sec    1.09    987.9±3.75µs        ? ?/sec
arrow_array_reader/BinaryViewArray/plain encoded, mandatory, no NULLs          1.00   1037.1±5.51µs        ? ?/sec    1.10  1137.4±132.17µs        ? ?/sec
arrow_array_reader/BinaryViewArray/plain encoded, optional, half NULLs         1.01   1725.4±7.13µs        ? ?/sec    1.00   1703.4±6.10µs        ? ?/sec
arrow_array_reader/BinaryViewArray/plain encoded, optional, no NULLs           1.00   1047.8±6.97µs        ? ?/sec    1.09  1144.8±134.91µs        ? ?/sec
arrow_array_reader/StringViewArray/dictionary encoded, mandatory, no NULLs     1.00   1992.5±5.36µs        ? ?/sec    1.03      2.0±0.01ms        ? ?/sec
arrow_array_reader/StringViewArray/dictionary encoded, optional, half NULLs    1.02      2.8±0.01ms        ? ?/sec    1.00      2.7±0.01ms        ? ?/sec
arrow_array_reader/StringViewArray/dictionary encoded, optional, no NULLs      1.00      2.0±0.01ms        ? ?/sec    1.02      2.1±0.01ms        ? ?/sec
arrow_array_reader/StringViewArray/plain encoded, mandatory, no NULLs          1.00      2.1±0.01ms        ? ?/sec    1.04      2.2±0.01ms        ? ?/sec
arrow_array_reader/StringViewArray/plain encoded, optional, half NULLs         1.02      2.8±0.01ms        ? ?/sec    1.00      2.8±0.01ms        ? ?/sec
arrow_array_reader/StringViewArray/plain encoded, optional, no NULLs           1.00      2.1±0.01ms        ? ?/sec    1.04      2.2±0.01ms        ? ?/sec

Seems like some of the benchmarks have gotten slower 🤔 Maybe it is time to throw a #[inline(always)] into make_view_unchecked 🤔

Also I wonder if we should change the name from make_view_unchecked to make_view

@alamb
Copy link
Contributor

alamb commented Jun 28, 2024

(I am rerunning those numbers to see if they are reproducable)

@alamb
Copy link
Contributor

alamb commented Jun 28, 2024

Ok, now this claims this branch is faster than main for some reason (which is surprising as this code isn't used). Anyhow, let's keep the code going

++ critcmp master parquet-string-view
group                                                                          master                                 parquet-string-view
-----                                                                          ------                                 -------------------
arrow_array_reader/BinaryViewArray/dictionary encoded, mandatory, no NULLs     1.07    900.4±3.90µs        ? ?/sec    1.00    842.8±4.40µs        ? ?/sec
arrow_array_reader/BinaryViewArray/dictionary encoded, optional, half NULLs    1.01   1610.0±3.86µs        ? ?/sec    1.00   1594.4±5.16µs        ? ?/sec
arrow_array_reader/BinaryViewArray/dictionary encoded, optional, no NULLs      1.07    909.2±2.63µs        ? ?/sec    1.00    848.7±5.53µs        ? ?/sec
arrow_array_reader/BinaryViewArray/plain encoded, mandatory, no NULLs          1.04   1031.8±9.37µs        ? ?/sec    1.00    992.9±6.79µs        ? ?/sec
arrow_array_reader/BinaryViewArray/plain encoded, optional, half NULLs         1.01   1726.0±6.87µs        ? ?/sec    1.00   1703.8±6.85µs        ? ?/sec
arrow_array_reader/BinaryViewArray/plain encoded, optional, no NULLs           1.03   1041.6±5.61µs        ? ?/sec    1.00  1009.9±15.33µs        ? ?/sec
arrow_array_reader/StringViewArray/dictionary encoded, mandatory, no NULLs     1.02   1998.0±3.86µs        ? ?/sec    1.00   1966.9±6.34µs        ? ?/sec
arrow_array_reader/StringViewArray/dictionary encoded, optional, half NULLs    1.00      2.8±0.01ms        ? ?/sec    1.00      2.7±0.01ms        ? ?/sec
arrow_array_reader/StringViewArray/dictionary encoded, optional, no NULLs      1.02      2.0±0.01ms        ? ?/sec    1.00  1973.9±19.89µs        ? ?/sec
arrow_array_reader/StringViewArray/plain encoded, mandatory, no NULLs          1.01      2.1±0.01ms        ? ?/sec    1.00      2.1±0.01ms        ? ?/sec
arrow_array_reader/StringViewArray/plain encoded, optional, half NULLs         1.00      2.8±0.01ms        ? ?/sec    1.00      2.8±0.01ms        ? ?/sec
arrow_array_reader/StringViewArray/plain encoded, optional, no NULLs           1.01      2.1±0.01ms        ? ?/sec    1.00      2.1±0.01ms        ? ?/sec

🚀

@alamb alamb merged commit a7b4a3b into apache:master Jun 28, 2024
27 checks passed
@alamb
Copy link
Contributor

alamb commented Jun 28, 2024

Thanks again @XiangpengHao

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants