Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support parquet bloom filter length #4885

Merged
merged 2 commits into from
Oct 2, 2023

Conversation

letian-jiang
Copy link
Contributor

Which issue does this PR close?

Closes #4398.

Rationale for this change

What changes are included in this PR?

If bloom_filter_length is available in ColumnMetaData, use a single IO to fetch both header and bitset.

Are there any user-facing changes?

@github-actions github-actions bot added the parquet Changes to the parquet crate label Oct 1, 2023
Signed-off-by: Letian Jiang <letian.jiang@outlook.com>
impl Default for Statistics {
fn default() -> Self {
Statistics{
max: Some(Vec::new()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It occurs to me that this is technically a breaking change

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually this is replaced by #[derive(Default)] at line 641

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if I didn't misunderstand it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But that will default the fields to None

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. It is a breaking change.

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some minor nits, thank you for this

Comment on lines 272 to 282
column_chunk
.meta_data
.as_mut()
.expect("can't have bloom filter without column metadata")
.bloom_filter_offset = Some(start_offset as i64);
column_chunk
.meta_data
.as_mut()
.expect("can't have bloom filter without column metadata")
.bloom_filter_length =
Some((end_offset - start_offset) as i32)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
column_chunk
.meta_data
.as_mut()
.expect("can't have bloom filter without column metadata")
.bloom_filter_offset = Some(start_offset as i64);
column_chunk
.meta_data
.as_mut()
.expect("can't have bloom filter without column metadata")
.bloom_filter_length =
Some((end_offset - start_offset) as i32)
let meta = column_chunk.meta_data.as_mut().expect("can't have bloom filter without column metadata");
meta.bloom_filter_offset = Some(start_offset as i64);
meta.bloom_filter_length = Some((end_offset - start_offset) as i32);

Or something to that effect

Comment on lines -293 to -295
let length: usize = header.num_bytes.try_into().map_err(|_| {
ParquetError::General("Bloom filter length is invalid".to_string())
})?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We appear to have lost this check

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Signed-off-by: Letian Jiang <letian.jiang@outlook.com>
@letian-jiang
Copy link
Contributor Author

@tustvold Hi, I have made some updates. Could you please take another look?

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, will merge once CI is green

@tustvold tustvold merged commit d941ff1 into apache:master Oct 2, 2023
16 checks passed
@mapleFU
Copy link
Member

mapleFU commented Oct 2, 2023

Since 2.10 is not released, can bloom filter length be used now? 🤔

@tustvold
Copy link
Contributor

tustvold commented Oct 2, 2023

I hadn't realised it wasn't released, but it's a fairly innocuous change so I'm inclined to think this is fine

@mapleFU
Copy link
Member

mapleFU commented Oct 2, 2023

Nice! This is helpful but I think currently parquet 2.10 is not released, so user might not able to write this...

@letian-jiang
Copy link
Contributor Author

🤔 I think this doesn't harm as long as old readers can read new files (w. bloom filter length) correctly.

@letian-jiang letian-jiang deleted the bloom-filter-length branch October 2, 2023 14:54
@tustvold
Copy link
Contributor

tustvold commented Oct 2, 2023

as long as old readers can read new files (w. bloom filter length) correctly.

Thrift supports this form of forwards compatibility natively

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support bloom_filter_length
3 participants