Specialize Thrift Decoding (~40% Faster) (#4891) #4892
With the latest changes that switch to
- let reader = remaining_metadata.as_ref().chain(&suffix[..suffix_len - 8]);
- (read_metadata(reader)?, None)
+ let meta = fetch.fetch(metadata_start..file_size - 8).await?;
This does mean we re-read some data; however, in almost all cases this will be irrelevant in the grand scheme of things.
Another option would be to concatenate the data here, but this might be less efficient, e.g. when the underlying `AsyncFileReader` is just slicing a `Bytes` from memory.
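The trade-off between the two options can be sketched as follows. This is only an illustration under stated assumptions: `SharedBuf` is a hypothetical, std-only stand-in for a reference-counted buffer such as `Bytes`, and `refetch` and `concatenate` are made-up helpers, not functions from this PR.

```rust
use std::ops::Range;
use std::sync::Arc;

/// Hypothetical stand-in for a reference-counted byte buffer such as `Bytes`.
#[derive(Clone)]
struct SharedBuf {
    data: Arc<[u8]>,
    range: Range<usize>,
}

impl SharedBuf {
    fn new(data: Vec<u8>) -> Self {
        let len = data.len();
        Self { data: data.into(), range: 0..len }
    }

    /// Returns a view of `range` (relative to this view) without copying bytes.
    fn slice(&self, range: Range<usize>) -> Self {
        assert!(range.start <= range.end && range.end <= self.range.len());
        let start = self.range.start + range.start;
        Self {
            data: Arc::clone(&self.data),
            range: start..start + range.len(),
        }
    }

    fn as_slice(&self) -> &[u8] {
        &self.data[self.range.clone()]
    }
}

/// Option A: ask for the whole metadata range again. Some bytes are
/// requested twice, but an in-memory source only bumps a reference count.
fn refetch(file: &SharedBuf, metadata_start: usize, file_size: usize) -> SharedBuf {
    file.slice(metadata_start..file_size - 8)
}

/// Option B: stitch the two pieces together. Nothing is re-read, but both
/// pieces are copied into a fresh allocation.
fn concatenate(remaining: &[u8], suffix: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(remaining.len() + suffix.len());
    out.extend_from_slice(remaining);
    out.extend_from_slice(suffix);
    out
}
```

For an in-memory source, option A shares the original allocation, while option B always pays for a copy. For a remote source the balance may differ, since the re-read bytes have to be transferred again.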
I think it is important to note that this is related to making Parquet reading faster, as the connection between `TCompactSliceInputProtocol` and reading Parquet metadata may not be obvious.
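To make that connection a little more concrete: Parquet metadata is Thrift-encoded with the compact protocol, which stores integers as variable-length (LEB128) values with zigzag encoding for signed numbers. The sketch below decodes such values directly from a byte slice. It assumes, as the name suggests, that a slice-based protocol works on in-memory bytes; the function names are hypothetical and this is not the code from the PR.

```rust
/// Reads one unsigned LEB128 varint from the front of `buf` and advances
/// `buf` past it. Returns `None` on truncated or over-long input.
fn read_varint(buf: &mut &[u8]) -> Option<u64> {
    let mut result = 0u64;
    let mut shift = 0u32;
    loop {
        // Copy the inner slice out so the remainder can be written back.
        let slice: &[u8] = *buf;
        let (&byte, rest) = slice.split_first()?;
        *buf = rest;
        if shift >= 64 {
            return None; // more continuation bytes than fit in a u64
        }
        result |= u64::from(byte & 0x7F) << shift;
        if byte & 0x80 == 0 {
            return Some(result);
        }
        shift += 7;
    }
}

/// Undoes zigzag encoding: 0, 1, 2, 3 map to 0, -1, 1, -2.
fn zigzag_decode(n: u64) -> i64 {
    ((n >> 1) as i64) ^ -((n & 1) as i64)
}

/// Reads a zigzag-encoded signed integer.
fn read_i64(buf: &mut &[u8]) -> Option<i64> {
    read_varint(buf).map(zigzag_decode)
}
```

Because the input is a plain slice, each read is a bounds check and a pointer bump, with no intermediate `Read` implementation in between.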
parquet/CONTRIBUTING.md
@@ -65,7 +65,7 @@ To compile and view in the browser, run `cargo doc --no-deps --open`.
To generate the parquet format (thrift definitions) code run from the repository root run

```
- $ docker run -v $(pwd):/thrift/src -it archlinux pacman -Sy --noconfirm thrift && wget https://raw.githubusercontent.com/apache/parquet-format/apache-parquet-format-2.9.0/src/main/thrift/parquet.thrift -O /tmp/parquet.thrift && thrift --gen rs /tmp/parquet.thrift && sed -i '/use thrift::server::TProcessor;/d' parquet.rs && mv parquet.rs parquet/src/format.rs
+ $ docker run -v $(pwd):/thrift/src -it archlinux pacman -Sy --noconfirm thrift && wget https://raw.githubusercontent.com/apache/parquet-format/master/src/main/thrift/parquet.thrift -O /tmp/parquet.thrift && thrift --gen rs /tmp/parquet.thrift && sed -i '/use thrift::server::TProcessor;/d' parquet.rs && sed -i 's/impl TSerializable for/impl crate::thrift::TSerializable for/g' parquet.rs && sed -i 's/fn write_to_out_protocol(&self, o_prot: &mut dyn TOutputProtocol)/fn write_to_out_protocol<T: TOutputProtocol>(\&self, o_prot: \&mut T)/g' parquet.rs && sed -i 's/fn read_from_in_protocol(i_prot: &mut dyn TInputProtocol)/fn read_from_in_protocol<T: TInputProtocol>(i_prot: \&mut T)/g' parquet.rs && mv parquet.rs parquet/src/format.rs
```
Maybe place this into a script file?
Also, I think the commit on master should be pinned here, so that the output is reproducible.
I think we should encode, somewhere, the exact version of the thrift file that the currently checked-in metadata code came from. By referring to master, the output will depend on when the script was run.
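One way to record the version, sketched here purely as an illustration: keep the git ref in a single constant and derive the download URL from it. The constant and function names are hypothetical; the tag value is the one used by the previous version of the command.

```rust
/// Git ref of apache/parquet-format that the checked-in code was generated
/// from. Hypothetical constant; the value is the tag from the old command.
const PARQUET_FORMAT_REF: &str = "apache-parquet-format-2.9.0";

/// Builds the raw download URL of `parquet.thrift` for a given git ref
/// (a tag or a full commit SHA, rather than a moving branch like `master`).
fn parquet_thrift_url(git_ref: &str) -> String {
    format!(
        "https://raw.githubusercontent.com/apache/parquet-format/{}/src/main/thrift/parquet.thrift",
        git_ref
    )
}
```

Calling `parquet_thrift_url(PARQUET_FORMAT_REF)` yields the same URL as the earlier command, so regenerating the code gives the same input no matter when it is run.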
Which issue does this PR close?
Closes #4891
Rationale for this change
We aren't using `TCompactSliceInputProtocol` and yet are already seeing non-trivial performance benefits from where we are.
What changes are included in this PR?
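As a rough illustration of the kind of change the `sed` rewrites in the updated CONTRIBUTING.md command make (replacing `&mut dyn TInputProtocol` with a generic `T: TInputProtocol`), the sketch below contrasts dynamic and static dispatch. The trait and types are simplified, hypothetical stand-ins, not the thrift crate's actual definitions.

```rust
/// Simplified, hypothetical stand-in for an input protocol trait.
trait InputProtocol {
    fn read_byte(&mut self) -> Option<u8>;
}

/// A protocol that reads from an in-memory slice.
struct SliceProtocol<'a> {
    buf: &'a [u8],
}

impl InputProtocol for SliceProtocol<'_> {
    fn read_byte(&mut self) -> Option<u8> {
        let slice: &[u8] = self.buf;
        let (&byte, rest) = slice.split_first()?;
        self.buf = rest;
        Some(byte)
    }
}

/// Dynamic dispatch: every `read_byte` call goes through a vtable, so the
/// compiler generally cannot inline it into the loop.
fn sum_dyn(prot: &mut dyn InputProtocol) -> u64 {
    let mut total = 0u64;
    while let Some(b) = prot.read_byte() {
        total += u64::from(b);
    }
    total
}

/// Static dispatch: the function is compiled once per concrete `T`, which
/// lets the compiler inline `read_byte` for that type.
fn sum_generic<T: InputProtocol>(prot: &mut T) -> u64 {
    let mut total = 0u64;
    while let Some(b) = prot.read_byte() {
        total += u64::from(b);
    }
    total
}
```

Both functions return the same result; the difference is only in how the call to `read_byte` is resolved, which matters in a decoder that makes many small reads.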
Are there any user-facing changes?