
parquet bloom filter part II: read sbbf bitset from row group reader, update API, and add cli demo #3102

Merged
merged 17 commits into master from add-bloom-filter-2
Nov 16, 2022

Conversation

@jimexist (Member) commented Nov 13, 2022

Which issue does this PR close?

Rationale for this change

parquet bloom filter part II:

  • read sbbf bitset from row group reader
  • update API
  • add cli demo

What changes are included in this PR?

data generation:

In [1]: import pyspark.sql

In [2]: spark = pyspark.sql.SparkSession.builder.getOrCreate()

In [4]: spark.conf.set("parquet.bloom.filter.max.bytes", 32)

In [5]: spark.conf.set("parquet.bloom.filter.expected.ndv", 10)

In [6]: spark.conf.set("parquet.bloom.filter.enabled", True)

In [8]: data=[('a'+str(i%10),) for i in range(100)]

In [9]: df = spark.createDataFrame(data, ["id"]).repartition(1)

In [10]: df.write.parquet("bla.parquet", mode = "overwrite")
❯ cargo run --features=cli --bin parquet-show-bloom-filter -- --file-name bla.parquet/part-00000-3da010f4-d863-497e-a525-3519b58d00ae-c000.snappy.parquet --column id -v a0 -v a1 -v aa
    Finished dev [unoptimized + debuginfo] target(s) in 0.09s
     Running `target/debug/parquet-show-bloom-filter --file-name bla.parquet/part-00000-3da010f4-d863-497e-a525-3519b58d00ae-c000.snappy.parquet --column id -v a0 -v a1 -v aa`
Row group #0
================================================================================
Value a0 is present in bloom filter
Value a1 is present in bloom filter
Value aa is absent in bloom filter
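As a rough illustration of what the CLI does with the new API (a minimal sketch; names such as get_column_bloom_filter and Sbbf::check follow this PR's direction but are assumptions and may differ from the merged code):

use parquet::file::reader::{FileReader, SerializedFileReader};
use std::fs::File;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open(
        "bla.parquet/part-00000-3da010f4-d863-497e-a525-3519b58d00ae-c000.snappy.parquet",
    )?;
    let reader = SerializedFileReader::new(file)?;
    let row_group = reader.get_row_group(0)?;
    // Column 0 is "id"; the accessor name and return type are assumed here.
    if let Some(sbbf) = row_group.get_column_bloom_filter(0)? {
        for value in ["a0", "a1", "aa"] {
            let word = if sbbf.check(value) { "present" } else { "absent" };
            println!("Value {value} is {word} in bloom filter");
        }
    }
    Ok(())
}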

Are there any user-facing changes?

@github-actions github-actions bot added the parquet Changes to the parquet crate label Nov 13, 2022
@jimexist jimexist changed the title WIP Add bloom filter part II parquet bloom filter part II: read sbbf bitset from row group reader, update API, and add cli demo Nov 14, 2022
@jimexist jimexist requested review from alamb and tustvold and removed request for alamb November 14, 2022 12:22
@alamb (Contributor) commented Nov 14, 2022

Thank you @jimexist -- I have this PR on my review list and hopefully will get to it today. The backlog in DataFusion is substantial, however, so it might not be until tomorrow.

@tustvold (Contributor) left a comment

I've left some suggestions, but taking a step back I wonder if we could do the following.

The bloom filter specification states:

The Bloom filter data can be stored before the page indexes after all row groups
Or it can be stored between row groups

Once we have read the file metadata we know the byte ranges of the column chunks, and page indexes, as well as the offsets of the bloom filter data for each column chunk. It should therefore be possible to get a fairly accurate overestimate of the length of each bloom filter, simply by process of elimination.

Not only would this remove the need to do iterative reads, but would also have a clear path to supporting reading this information from object storage, where we need to know the byte ranges ahead of time. Effectively we could make Sbbf::read_from_column_chunk take an &[u8].

What do you think?
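To make the process of elimination concrete, a standalone sketch (not the crate's API): collect the starting offsets of every known structure from the file metadata, then overestimate a bloom filter's length as the gap to the next offset after it.

// Standalone sketch: overestimate a bloom filter's length as the distance
// from its offset to the next known structure (column chunk, page index,
// or footer) in the file.
fn estimate_bloom_filter_len(bloom_offset: u64, other_offsets: &[u64]) -> Option<u64> {
    other_offsets
        .iter()
        .copied()
        .filter(|&off| off > bloom_offset)
        .min()
        .map(|next| next - bloom_offset)
}

This never underestimates as long as every structure that can follow the bloom filter is included in other_offsets, which is exactly the property needed for a single ranged read from object storage.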

//! cargo run --features=cli --bin parquet-show-bloom-filter -- --file-name XYZ.parquet --column id --values a
//! ```

extern crate parquet;
Contributor:

Suggested change (remove this line):
extern crate parquet;

let bitset_offset = offset + length - buffer.remaining();
return Ok((h, bitset_offset));
} else {
// continue to try by reading another batch
Contributor:

I think corrupt data could potentially cause this to iterate indefinitely, which would be bad...

// This size should not be too large, so as not to hit a short read too early (although unlikely),
// but also not too small, for cache efficiency; it is essentially a "guess" of the header
// size. In the demo test the header size is 15 bytes.
const STEP_SIZE: usize = 16;
Contributor:

Based on the thrift compact protocol encoding - https://raw.githubusercontent.com/apache/thrift/master/doc/specs/thrift-compact-protocol.md.

The BloomFilterHeader consists of 1 int32 and 3 enums. Enumerations are encoded as int32. Each int32 is encoded as 1 to 5 bytes.

Therefore the maximum header size is 20 bytes. One possibility might be to read 20 bytes, and bail on error?
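Spelling that arithmetic out as constants (a sketch only, following the comment's count of the four varint-encoded values):

// BloomFilterHeader carries 4 int32 values (numBytes plus three enums,
// which the thrift compact protocol also encodes as int32). Each int32
// is a varint of 1 to 5 bytes, so the values take at most 4 * 5 = 20 bytes.
const HEADER_INT32_COUNT: usize = 4;
const MAX_VARINT_I32_BYTES: usize = 5;
const MAX_BLOOM_HEADER_BYTES: usize = HEADER_INT32_COUNT * MAX_VARINT_I32_BYTES; // 20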

Member Author (@jimexist):

that's a good idea, let me change it to a 20-byte read and bail, before switching to the more complicated "guessing" from the gaps between page indices

@jimexist (Member Author):

Once we have read the file metadata we know the byte ranges of the column chunks, and page indexes, as well as the offsets of the bloom filter data for each column chunk. It should therefore be possible to get a fairly accurate overestimate of the length of each bloom filter, simply by process of elimination.

Thanks for the suggestion. I wonder if that is future proof, e.g. if more data structures are added later besides the sbbf, page index, etc., would that be a problem? Thinking out loud... this would just balloon the overestimate and/or increase the likelihood of needing to look at both locations before it can correctly determine which one was used when the parquet file was written.

@tustvold (Contributor):

I think it unlikely that a subsequent addition would omit sufficient information to properly identify its location in the file, but as you say, the result would simply be overestimating the length of the final bloom filter, which is relatively harmless.

tustvold previously approved these changes Nov 15, 2022
@tustvold tustvold dismissed their stale review November 15, 2022 19:50

Changed mind

@@ -143,6 +145,10 @@ pub trait RowGroupReader: Send + Sync {
Ok(col_reader)
}

#[cfg(feature = "bloom")]
Contributor:

What do you think about handling this in the same eager way that we handle page indexes, namely add an option to ReadOptions to enable reading bloom filters, and read this data in SerializedFileReader?

@jimexist (Member Author), Nov 16, 2022:

@tustvold are you also suggesting dropping the feature gate altogether and enabling it by default? I added the feature gate to try to reduce binary size, but if the feature is very likely to be used there's no need for the gate any more.

@tustvold (Contributor), Nov 16, 2022:

I'm suggesting that rather than providing a lazy API to read the bloom filter on demand, we provide an API to make SerializedReader load bloom filters as part of ParquetMetadata if the corresponding feature and ReadOption are enabled, similar to how we handle the page index.

This is necessary to be able to support object stores, and is generally a good idea to avoid lots of small IO reads.
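A sketch of the shape being proposed, by analogy with how the page index is enabled through ReadOptionsBuilder (the bloom-specific flag below is hypothetical, not a confirmed API):

use parquet::file::serialized_reader::{ReadOptionsBuilder, SerializedFileReader};
use std::fs::File;

fn open_with_bloom_filters(path: &str) -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open(path)?;
    // with_bloom_filter(true) is a hypothetical option mirroring the
    // existing page index flag on ReadOptionsBuilder.
    let options = ReadOptionsBuilder::new().with_bloom_filter(true).build();
    let _reader = SerializedFileReader::new_with_options(file, options)?;
    // Bloom filters would now be loaded eagerly alongside the metadata,
    // turning many small lazy reads into a few up-front ranged reads.
    Ok(())
}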

@jimexist (Member Author), Nov 16, 2022:

also thanks for the suggestion, I agree with this direction, however I have:

coming up, so I'd like to maybe merge this as is and quickly follow up on this after it is merged. do you think that works?

@alamb (Contributor) left a comment

Thank you for this PR @jimexist -- it is very cool to see this in the implementation

When writing bloom filters, I wonder if we could have the rust writer write them all contiguously to avoid multiple potential random access reads.

//! cargo run --features=cli --bin parquet-show-bloom-filter -- --file-name XYZ.parquet --column id --values a
//! ```

use clap::Parser;
Contributor:

Maybe we could add the ability to dump bloom filters to https://github.com/apache/arrow-rs/blob/master/parquet/src/bin/parquet-schema.rs rather than make a new executable

I don't feel strongly however


@@ -125,11 +162,8 @@ impl Sbbf {
let length: usize = header.num_bytes.try_into().map_err(|_| {
Contributor:

it is unfortunate that we need more than one read to fetch a bloom filter (one read for the bloom filter header, then another for the bitset using the length from the header). I think @tustvold noted this as a limitation in the parquet file format itself (the file metadata only has the bloom filter's starting offset, not its length)

Perhaps the reader abstraction can hide most/all of this nonsense from us

Member Author (@jimexist):

you can take a look at https://github.com/apache/arrow-rs/pull/3119/files#diff-3b307348aabe465890fa39973e9fda0243bd2344cb7cb9cdf02ac2d39521d7caR232-R236 which should show how it works - similar to writing column offsets and indices

@tustvold (Contributor) left a comment

I'm happy for this to be merged as is; there is plenty of time before the next release to get the APIs to a happy place, and the feature is still experimental anyway

@tustvold tustvold merged commit 73d66d8 into master Nov 16, 2022
@ursabot commented Nov 16, 2022

Benchmark runs are scheduled for baseline = c95eb4c and contender = 73d66d8. 73d66d8 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@jimexist jimexist deleted the add-bloom-filter-2 branch November 16, 2022 03:54
jimexist added a commit that referenced this pull request Nov 16, 2022
… update API, and add cli demo (#3102)

* add feature flag

* add api

* fix reading with chunk reader

* refactor

* add a binary to demo

* add bin

* remove unused

* fix clippy

* adjust byte size

* update read method

* parquet-show-bloom-filter with bloom feature required

* remove extern crate

* get rid of loop read

* refactor to test

* rework api

* remove unused trait

* update help
@alamb (Contributor) commented Nov 16, 2022

🚄 love to see this -- thank you @tustvold and @jimexist
