Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specialize Thrift Decoding (~40% Faster) (#4891) #4892

Merged
merged 2 commits into from
Oct 10, 2023

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented Oct 4, 2023

Which issue does this PR close?

Closes #4891

Rationale for this change

We aren't using TCompactSliceInputProtocol and yet are already seeing non-trivial performance benefits from where we are

open(default)           time:   [23.783 µs 23.801 µs 23.822 µs]
                        change: [-10.316% -10.211% -10.106%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

open(page index)        time:   [705.79 µs 705.98 µs 706.18 µs]
                        change: [-33.072% -33.014% -32.960%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the parquet Changes to the parquet crate label Oct 4, 2023
@tustvold
Copy link
Contributor Author

tustvold commented Oct 4, 2023

With the latest changes that switch to TCompactSliceInputProtocol everywhere apart from reading page headers.

open(default)           time:   [15.697 µs 15.705 µs 15.714 µs]
                        change: [-40.766% -40.682% -40.602%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) high mild
  5 (5.00%) high severe

open(page index)        time:   [691.76 µs 692.23 µs 692.72 µs]
                        change: [-34.457% -34.394% -34.334%] (p = 0.00 < 0.05)
                        Performance has improved.

@tustvold tustvold added the api-change Changes to the arrow API label Oct 4, 2023

let reader = remaining_metadata.as_ref().chain(&suffix[..suffix_len - 8]);
(read_metadata(reader)?, None)
let meta = fetch.fetch(metadata_start..file_size - 8).await?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does mean we re-read data, however, in almost all cases this will be irrelevant in the grand scheme of things.

Another option would be to concatenate the data here, but this might be less efficient, e.g. the underlying AsyncFileReader is just slicing a Bytes from memory.

@tustvold tustvold changed the title Specialize thrift (#4891) Specialize Thrift Decoding (~40% Faster) (#4891) Oct 4, 2023
@tustvold tustvold marked this pull request as ready for review October 4, 2023 19:50
@alamb
Copy link
Contributor

alamb commented Oct 4, 2023

I think it is important to note that this is related to making Parquet reading faster - as the connection between TCompactSliceInputProtocol and reading parquet metadata may not be obvious

@@ -65,7 +65,7 @@ To compile and view in the browser, run `cargo doc --no-deps --open`.
To generate the parquet format (thrift definitions) code run from the repository root run

```
$ docker run -v $(pwd):/thrift/src -it archlinux pacman -Sy --noconfirm thrift && wget https://raw.githubusercontent.com/apache/parquet-format/apache-parquet-format-2.9.0/src/main/thrift/parquet.thrift -O /tmp/parquet.thrift && thrift --gen rs /tmp/parquet.thrift && sed -i '/use thrift::server::TProcessor;/d' parquet.rs && mv parquet.rs parquet/src/format.rs
$ docker run -v $(pwd):/thrift/src -it archlinux pacman -Sy --noconfirm thrift && wget https://raw.githubusercontent.com/apache/parquet-format/master/src/main/thrift/parquet.thrift -O /tmp/parquet.thrift && thrift --gen rs /tmp/parquet.thrift && sed -i '/use thrift::server::TProcessor;/d' parquet.rs && sed -i 's/impl TSerializable for/impl crate::thrift::TSerializable for/g' parquet.rs && sed -i 's/fn write_to_out_protocol(&self, o_prot: &mut dyn TOutputProtocol)/fn write_to_out_protocol<T: TOutputProtocol>(\&self, o_prot: \&mut T)/g' parquet.rs && sed -i 's/fn read_from_in_protocol(i_prot: &mut dyn TInputProtocol)/fn read_from_in_protocol<T: TInputProtocol>(i_prot: \&mut T)/g' parquet.rs && mv parquet.rs parquet/src/format.rs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe place this into a script file?

Also I think the master commit should be pinned here, so that the output is reproducible.

@@ -65,7 +65,7 @@ To compile and view in the browser, run `cargo doc --no-deps --open`.
To generate the parquet format (thrift definitions) code run from the repository root run

```
$ docker run -v $(pwd):/thrift/src -it archlinux pacman -Sy --noconfirm thrift && wget https://raw.githubusercontent.com/apache/parquet-format/apache-parquet-format-2.9.0/src/main/thrift/parquet.thrift -O /tmp/parquet.thrift && thrift --gen rs /tmp/parquet.thrift && sed -i '/use thrift::server::TProcessor;/d' parquet.rs && mv parquet.rs parquet/src/format.rs
$ docker run -v $(pwd):/thrift/src -it archlinux pacman -Sy --noconfirm thrift && wget https://raw.githubusercontent.com/apache/parquet-format/master/src/main/thrift/parquet.thrift -O /tmp/parquet.thrift && thrift --gen rs /tmp/parquet.thrift && sed -i '/use thrift::server::TProcessor;/d' parquet.rs && sed -i 's/impl TSerializable for/impl crate::thrift::TSerializable for/g' parquet.rs && sed -i 's/fn write_to_out_protocol(&self, o_prot: &mut dyn TOutputProtocol)/fn write_to_out_protocol<T: TOutputProtocol>(\&self, o_prot: \&mut T)/g' parquet.rs && sed -i 's/fn read_from_in_protocol(i_prot: &mut dyn TInputProtocol)/fn read_from_in_protocol<T: TInputProtocol>(i_prot: \&mut T)/g' parquet.rs && mv parquet.rs parquet/src/format.rs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somewhere we should encode the exact version of the thrift file that the currently checked in metadata came from, I think. By referring to master it will "depend" on when the script got run

@tustvold tustvold merged commit 538a7bf into apache:master Oct 10, 2023
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rework Thrift Encoding / Decoding of Parquet Metadata
3 participants