Avoid copy/allocation when read view types from parquet #5877

Merged
merged 2 commits into from
Jun 13, 2024

XiangpengHao
Contributor

Which issue does this PR close?

Part of #5530.

An alternative implementation (maybe subset) of #5557

Rationale for this change

This change is very simple (only 8 lines) but yields large performance improvements (roughly 10-80x):

arrow_array_reader/StringViewArray/plain encoded, mandatory, no NULLs
                        time:   [274.71 µs 275.03 µs 275.35 µs]
                        change: [-98.842% -98.837% -98.829%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe
arrow_array_reader/StringViewArray/plain encoded, optional, no NULLs
                        time:   [275.04 µs 275.41 µs 275.75 µs]
                        change: [-98.848% -98.843% -98.839%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 27 outliers among 100 measurements (27.00%)
  12 (12.00%) low severe
  6 (6.00%) low mild
  6 (6.00%) high mild
  3 (3.00%) high severe
arrow_array_reader/StringViewArray/plain encoded, optional, half NULLs
                        time:   [319.11 µs 319.48 µs 319.98 µs]
                        change: [-94.940% -94.934% -94.926%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  8 (8.00%) high mild
  1 (1.00%) high severe
arrow_array_reader/StringViewArray/dictionary encoded, mandatory, no NULLs
                        time:   [259.89 µs 260.22 µs 260.65 µs]
                        change: [-97.848% -97.846% -97.842%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe
arrow_array_reader/StringViewArray/dictionary encoded, optional, no NULLs
                        time:   [267.99 µs 268.26 µs 268.56 µs]
                        change: [-97.762% -97.760% -97.757%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe
arrow_array_reader/StringViewArray/dictionary encoded, optional, half NULLs
                        time:   [301.73 µs 302.40 µs 303.20 µs]
                        change: [-90.950% -90.895% -90.852%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe
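
To relate the `change` percentages above to the quoted speedup range, here is a small sketch. It assumes that criterion's `change` is the relative difference in run time, (new - old) / old; the helper name `speedup_from_change` is made up for illustration.

```rust
/// Convert a relative change in run time (in percent, negative = faster)
/// into a speedup factor. Assumes change = (new - old) / old.
fn speedup_from_change(change_percent: f64) -> f64 {
    // Fraction of the original run time that remains after the change.
    let remaining = 1.0 + change_percent / 100.0;
    1.0 / remaining
}

fn main() {
    // Median `change` values copied from two of the benchmark runs above.
    for (name, change) in [
        ("plain encoded, mandatory, no NULLs", -98.837),
        ("dictionary encoded, optional, half NULLs", -90.895),
    ] {
        println!("{name}: about {:.0}x faster", speedup_from_change(change));
    }
}
```

Under that assumption, the largest and smallest improvements listed correspond to roughly 86x and 11x respectively.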

To reproduce:

cd parquet
cargo bench --bench arrow_reader --features="arrow test_common experimental" "arrow_array_reader/StringViewArray"

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jun 12, 2024
@ariesdevil
Contributor

FYI: We discussed in #5557, using an independent view_buffer to replace offset_buffer, see #5557 (comment) for details.

@XiangpengHao
Contributor Author

> FYI: We discussed in #5557, using an independent view_buffer to replace offset_buffer, see #5557 (comment) for details.

try_append_view had not been implemented when that discussion took place. Now that we have an API to construct a view directly, without calling append_value and therefore without allocating memory, I think this approach is simpler.
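
To illustrate why building a view directly avoids the copy, here is a minimal std-only sketch. It is a toy model, not the arrow-rs builder: `ToyViewBuilder` and its method signatures are assumptions that only mirror the names `append_block` and `try_append_view` used in this PR. The 16-byte view layout (values of at most 12 bytes stored inline, longer ones stored as length, 4-byte prefix, buffer index, and offset) follows the Arrow format.

```rust
/// Hypothetical toy builder, NOT the arrow-rs `StringViewBuilder`.
struct ToyViewBuilder {
    blocks: Vec<Vec<u8>>, // data buffers, taken by value (no byte copy)
    views: Vec<u128>,     // one 16-byte view per value
}

impl ToyViewBuilder {
    fn new() -> Self {
        Self { blocks: Vec::new(), views: Vec::new() }
    }

    /// Take ownership of an already-decoded buffer and return its index.
    fn append_block(&mut self, block: Vec<u8>) -> u32 {
        self.blocks.push(block);
        (self.blocks.len() - 1) as u32
    }

    /// Append a view of `len` bytes starting at `offset` in block `block`.
    /// Only the 16-byte view is written; the string bytes stay in the block.
    fn try_append_view(&mut self, block: u32, offset: u32, len: u32) -> Result<(), String> {
        let buf = self
            .blocks
            .get(block as usize)
            .ok_or_else(|| format!("block {block} does not exist"))?;
        let start = offset as usize;
        let end = start
            .checked_add(len as usize)
            .ok_or_else(|| "offset + len overflows".to_string())?;
        let bytes = buf
            .get(start..end)
            .ok_or_else(|| format!("range {start}..{end} is outside block {block}"))?;
        let view = if len <= 12 {
            // Short value: length followed by the bytes, stored inline.
            let mut raw = [0u8; 16];
            raw[..4].copy_from_slice(&len.to_le_bytes());
            raw[4..4 + bytes.len()].copy_from_slice(bytes);
            u128::from_le_bytes(raw)
        } else {
            // Long value: length, 4-byte prefix, buffer index, offset.
            let prefix = u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]);
            (len as u128)
                | ((prefix as u128) << 32)
                | ((block as u128) << 64)
                | ((offset as u128) << 96)
        };
        self.views.push(view);
        Ok(())
    }

    /// Resolve view `i` back to its bytes (copies here only for easy comparison).
    fn value(&self, i: usize) -> Vec<u8> {
        let view = self.views[i];
        let len = (view as u32) as usize;
        if len <= 12 {
            view.to_le_bytes()[4..4 + len].to_vec()
        } else {
            let block = (view >> 64) as u32 as usize;
            let offset = (view >> 96) as u32 as usize;
            self.blocks[block][offset..offset + len].to_vec()
        }
    }
}

fn main() {
    let mut builder = ToyViewBuilder::new();
    // One decoded buffer holding two values back to back.
    let block = builder.append_block(b"hia considerably longer string".to_vec());
    builder.try_append_view(block, 0, 2).unwrap();
    builder.try_append_view(block, 2, 28).unwrap();
    println!("{}", String::from_utf8_lossy(&builder.value(1)));
}
```

In this model the index passed to `try_append_view` comes from `append_block` rather than being hard-coded, so it stays correct once more than one block has been appended.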


if len != 0 {
    builder
        .try_append_view(0, start.as_usize() as u32, len as u32)

Replace the hard-coded 0 with the offset returned by append_block.

Contributor

@alamb alamb left a comment

Thank you @XiangpengHao and @ariesdevil

Contributor

@alamb alamb left a comment

Looks great to me -- thank you @XiangpengHao

I ran the benchmarks and it looks 💯

++ critcmp master parquet-view
group                                                                          master                                 parquet-view
-----                                                                          ------                                 ------------
arrow_array_reader/StringViewArray/dictionary encoded, mandatory, no NULLs     38.83    21.2±0.04ms        ? ?/sec    1.00    546.7±1.85µs        ? ?/sec
arrow_array_reader/StringViewArray/dictionary encoded, optional, half NULLs    7.61      5.6±0.01ms        ? ?/sec    1.00    736.2±1.89µs        ? ?/sec
arrow_array_reader/StringViewArray/dictionary encoded, optional, no NULLs      38.75    21.2±0.04ms        ? ?/sec    1.00    548.1±1.03µs        ? ?/sec
arrow_array_reader/StringViewArray/plain encoded, mandatory, no NULLs          69.37    42.3±0.06ms        ? ?/sec    1.00    609.7±1.34µs        ? ?/sec
arrow_array_reader/StringViewArray/plain encoded, optional, half NULLs         14.44    11.2±0.03ms        ? ?/sec    1.00    774.7±1.57µs        ? ?/sec
arrow_array_reader/StringViewArray/plain encoded, optional, no NULLs           69.40    42.4±0.06ms        ? ?/sec    1.00    610.4±1.59µs        ? ?/sec

I wonder if we can also use try_append_unchecked and make it even faster?

@alamb
Contributor

alamb commented Jun 13, 2024

> I wonder if we can also use try_append_unchecked and make it even faster?

I am merging this in to keep things moving. Maybe we can consider using try_append_unchecked in a follow-on PR.

cc @ariesdevil

@alamb alamb merged commit c6359bf into apache:master Jun 13, 2024
16 checks passed
@XiangpengHao XiangpengHao deleted the parquet-view branch June 13, 2024 15:46