parquet bloom filter part II: read sbbf bitset from row group reader, update API, and add cli demo #3102
Conversation
74a191c to c66d7a0
Thank you @jimexist -- I have this PR on my review list and hopefully will get to it today. The backlog in DataFusion is substantial, however, so it might not be until tomorrow
I've left some suggestions, but taking a step back I wonder if we could do the following.
The bloom filter specification states:
The Bloom filter data can be stored before the page indexes after all row groups
Or it can be stored between row groups
Once we have read the file metadata we know the byte ranges of the column chunks, and page indexes, as well as the offsets of the bloom filter data for each column chunk. It should therefore be possible to get a fairly accurate overestimate of the length of each bloom filter, simply by process of elimination.
Not only would this remove the need to do iterative reads, it would also give a clear path to supporting reading this information from object storage, where we need to know the byte ranges ahead of time. Effectively we could make Sbbf::read_from_column_chunk take an &[u8].
What do you think?
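For illustration, here is a minimal sketch of the "process of elimination" length estimate described above; all names are illustrative, not the crate's actual API:

```rust
/// Estimate the length of a bloom filter from its known start offset by
/// finding the next known structure in the file (another column chunk,
/// another bloom filter, the page index, the footer, ...) that starts after
/// it. Over-estimating is harmless: decoding the thrift header later yields
/// the exact bitset length.
fn estimate_bloom_filter_len(
    bloom_filter_offset: u64,
    other_known_offsets: &[u64],
    file_len: u64,
) -> u64 {
    let next_offset = other_known_offsets
        .iter()
        .copied()
        .filter(|&offset| offset > bloom_filter_offset)
        .min()
        .unwrap_or(file_len);
    next_offset - bloom_filter_offset
}
```

A caller could then fetch exactly that byte range (from a file or an object store) in one request and hand the bytes to a slice-based Sbbf::read_from_column_chunk(&[u8]).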
//! cargo run --features=cli --bin parquet-show-bloom-filter -- --file-name XYZ.parquet --column id --values a
//! ```

extern crate parquet;
Suggested change: remove extern crate parquet;
parquet/src/bloom_filter/mod.rs (outdated)
    let bitset_offset = offset + length - buffer.remaining();
    return Ok((h, bitset_offset));
} else {
    // continue to try by reading another batch
Corrupt data, I think, will potentially cause this to iterate indefinitely, which would be bad...
parquet/src/bloom_filter/mod.rs (outdated)
// This size should not be too large, so that we do not hit a short read too early (although
// that is unlikely), but also not too small, to keep reads cache-efficient. It is essentially
// a "guess" at the header size. In the demo test the header size is 15 bytes.
const STEP_SIZE: usize = 16;
Based on the thrift compact protocol encoding - https://raw.githubusercontent.com/apache/thrift/master/doc/specs/thrift-compact-protocol.md.
The BloomFilterHeader consists of 1 int32 and 3 enums. Enumerations are encoded as int32. Each int32 is encoded as 1 to 5 bytes.
Therefore the maximum header size is 20 bytes. One possibility might be to read 20 bytes, and bail on error?
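A minimal sketch of that "read a fixed chunk and bail" idea, assuming a thrift-generated BloomFilterHeader (the parquet::format path) and the thrift crate's TCompactInputProtocol; exact paths and signatures may differ from what was merged:

```rust
use std::io::{Cursor, Read};

use parquet::format::BloomFilterHeader; // assumed path for the generated struct
use thrift::protocol::TCompactInputProtocol;

/// Upper bound on the compact-encoded header: 1 int32 + 3 enums, each
/// encoded as an int32 taking 1 to 5 bytes, i.e. at most 20 bytes.
const SBBF_HEADER_SIZE_ESTIMATE: usize = 20;

/// Read one fixed-size chunk at the bloom filter offset and decode the header
/// in a single pass, returning the header and the number of bytes it occupied
/// (so the caller knows where the bitset starts). Any decode failure is
/// returned immediately instead of looping to read more bytes.
fn read_sbbf_header<R: Read>(
    reader: R,
) -> Result<(BloomFilterHeader, u64), Box<dyn std::error::Error>> {
    let mut buf = Vec::with_capacity(SBBF_HEADER_SIZE_ESTIMATE);
    reader
        .take(SBBF_HEADER_SIZE_ESTIMATE as u64)
        .read_to_end(&mut buf)?;

    let mut cursor = Cursor::new(buf);
    let header = {
        let mut protocol = TCompactInputProtocol::new(&mut cursor);
        BloomFilterHeader::read_from_in_protocol(&mut protocol)?
    };
    Ok((header, cursor.position()))
}
```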
that's a good idea, let me change it to a 20-byte read and bail, before switching to the more complicated "guessing" from the page index gaps
Thanks for the suggestion. I wonder if that is future proof, e.g. if more data structures are added later besides the sbbf, page index, etc., would that be a problem? Thinking out loud... this would just balloon the over-estimate and/or increase the likelihood of needing to look at both locations before we can correctly locate which one was right when the parquet file was written.
I think it unlikely that a subsequent addition would omit sufficient information to properly identify its location in the file, but as you say, the result would simply be overestimating the length of the final bloom filter, which is relatively harmless
@@ -143,6 +145,10 @@ pub trait RowGroupReader: Send + Sync {
        Ok(col_reader)
    }

    #[cfg(feature = "bloom")]
What do you think about handling this in the same eager way that we handle page indexes, namely add an option to ReadOptions to enable reading bloom filters, and read this data in SerializedFileReader?
@tustvold are you also suggesting dropping the feature gate altogether and enabling it by default? I added the feature gate to reduce binary size, but if the feature is very likely to be used, there's no need for the gate any more.
I'm suggesting that rather than providing a lazy API to read the bloom filter on demand, we provide an API that makes SerializedReader load bloom filters as part of ParquetMetadata if the corresponding feature and ReadOption is enabled. Similar to how we handle the page index.
This is necessary to be able to support object stores, and is generally a good idea to avoid lots of small IO reads.
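To make the eager direction concrete, here is a sketch of how the existing page index option is wired up today, with comments marking where a hypothetical bloom filter option (the name .with_bloom_filter() is invented here, not existing API) would slot in:

```rust
use std::fs::File;

use parquet::file::reader::FileReader;
use parquet::file::serialized_reader::{ReadOptionsBuilder, SerializedFileReader};

fn open_with_eager_metadata(file: File) -> parquet::errors::Result<()> {
    // Today's eager option: ask SerializedFileReader to load the page index
    // while reading the footer. The suggestion is a sibling option, e.g. a
    // hypothetical `.with_bloom_filter()`, that would likewise load every
    // column chunk's sbbf into ParquetMetadata up front, avoiding many small
    // follow-up reads (which matters for object stores).
    let options = ReadOptionsBuilder::new().with_page_index().build();
    let reader = SerializedFileReader::new_with_options(file, options)?;

    // With the eager design, bloom filters would then be exposed from the
    // metadata (per row group, per column) rather than fetched lazily here.
    let _metadata = reader.metadata();
    Ok(())
}
```

The actual option name and where the loaded filters end up living would be decided in the follow-up work.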
also thanks for the suggestion, I agree with this direction, however I have:
coming up, so I'd like to maybe merge this as is and quickly follow up on this after it is merged. do you think that works?
Thank you for this PR @jimexist -- it is very cool to see this taking shape in the implementation
When writing bloom filters, I wonder if we could have the rust writer write them all contiguously to avoid multiple potential random access reads.
//! cargo run --features=cli --bin parquet-show-bloom-filter -- --file-name XYZ.parquet --column id --values a
//! ```

use clap::Parser;
Maybe we could add the ability to dump bloom filters to https://github.com/apache/arrow-rs/blob/master/parquet/src/bin/parquet-schema.rs rather than making a new executable
I don't feel strongly however
I am modeling this on https://github.com/apache/parquet-mr/blob/master/parquet-cli/src/main/java/org/apache/parquet/cli/commands/ShowBloomFilterCommand.java, so I'd like to keep it this way for now.
@@ -125,11 +162,8 @@ impl Sbbf {
        let length: usize = header.num_bytes.try_into().map_err(|_| {
it is unfortunate that we need to do more than one read to potentially read a bloom filter (read the bloom header and then read the length). I think @tustvold noted this as a limitation in the parquet file format itself (that the file metadata only has the bloom filter starting offset, but not its length)
Perhaps the reader abstraction can hide most/all of this nonsense from us
you can take a look at https://github.com/apache/arrow-rs/pull/3119/files#diff-3b307348aabe465890fa39973e9fda0243bd2344cb7cb9cdf02ac2d39521d7caR232-R236 which should show how it works - similar to writing column offsets and indices
I'm happy for this to be merged as is, there is plenty of time before the next release to get the APIs to a happy place, and the feature is still experimental anyway
Benchmark runs are scheduled for baseline = c95eb4c and contender = 73d66d8. 73d66d8 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
… update API, and add cli demo (#3102)

* add feature flag
* add api
* fix reading with chunk reader
* refactor
* add a binary to demo
* add bin
* remove unused
* fix clippy
* adjust byte size
* update read method
* parquet-show-bloom-filter with bloom feature required
* remove extern crate
* get rid of loop read
* refactor to test
* rework api
* remove unused trait
* update help
Which issue does this PR close?
Rationale for this change
parquet bloom filter part II:
What changes are included in this PR?
data generation:
Are there any user-facing changes?