Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose bloom filter offset in ColumnChunkMetadata #1309

Merged
merged 6 commits into from
Feb 15, 2022

Conversation

shanisolomon
Copy link
Contributor

Closes #1308.

Exposing the bloom filter offset in in the column chunk metadata so parquet engines could optimize their reads.

Test file data_index_bloom.parquet is added to parquet-testing via apache/parquet-testing#22.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Feb 14, 2022
@alamb
Copy link
Contributor

alamb commented Feb 14, 2022

Thank you @shanisolomon for helping to push the rust reader implementation along 🎉 -- I think the tests are failing here because the reference to the new example data in apache/parquet-testing#22 hasn't been incorporated yet. Once we get the new example data in parquet-testing, we can then update the submodule reference on this PR.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the code contribution @shanisolomon -- the code looks great to me 🏅 -- a very nice first contribution.

Once we sort out the getting the appropriate test file in parquet-data I think we'll be good to go.

If you don't mind my asking, do you have larger plans for supporting index pages and bloom filters? I think there would be substantial interest in the larger Rust/arrow community for Bloom filter support and I would be willing to find time to help design / review it if that would be helpful

Thanks again

Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@alamb
Copy link
Contributor

alamb commented Feb 15, 2022

I am now preparing a PR to update the arrowrs to apache/parquet-testing#22

@alamb
Copy link
Contributor

alamb commented Feb 15, 2022

#1311

@shanisolomon
Copy link
Contributor Author

Thanks @alamb.

Regarding further bloom filters support - yes, we would eventually be interested in reading bloom filters (and other optional metadata) in order to optimize our reads. Currently we're just interested in knowing whether or not such bloom filters exists, hence this PR. I'll be sure to reach out to you when and if we choose to do so :). Thanks!

@alamb
Copy link
Contributor

alamb commented Feb 15, 2022

I verified that this branch passes all the tests when I cherry-picked 1bf62e3 locally

So the next steps should be:

  1. I'll merge Update parquest-testing pin #1311 when it passes CI
  2. You then update the PR to pull in the changes (git merge apache/master or equivalent)
  3. We'll merge it in!

@codecov-commenter
Copy link

Codecov Report

Merging #1309 (f63b12b) into master (9209e55) will decrease coverage by 0.00%.
The diff coverage is 52.63%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1309      +/-   ##
==========================================
- Coverage   83.04%   83.03%   -0.01%     
==========================================
  Files         180      180              
  Lines       52846    52865      +19     
==========================================
+ Hits        43884    43895      +11     
- Misses       8962     8970       +8     
Impacted Files Coverage Δ
parquet/src/schema/printer.rs 71.54% <0.00%> (-0.79%) ⬇️
parquet/src/file/metadata.rs 91.39% <44.44%> (-1.57%) ⬇️
parquet/src/file/serialized_reader.rs 94.47% <100.00%> (+0.09%) ⬆️
arrow/src/datatypes/datatype.rs 66.40% <0.00%> (-0.40%) ⬇️
parquet/src/encodings/encoding.rs 93.71% <0.00%> (+0.19%) ⬆️
parquet_derive/src/parquet_field.rs 65.98% <0.00%> (+0.22%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9209e55...f63b12b. Read the comment docs.

@alamb alamb merged commit 1ab95d5 into apache:master Feb 15, 2022
@alamb
Copy link
Contributor

alamb commented Feb 15, 2022

Thanks again @shanisolomon

@alamb alamb added the enhancement Any new improvement worthy of a entry in the changelog label Feb 16, 2022
@alamb alamb changed the title Expose has bloom offset Expose bloom filter offset in ColumnChunkMetadata Feb 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Expose bloom filter metadata offset
4 participants