add bloom filter implementation based on split block (sbbf) spec #3057
Conversation
Turns out I had to make the sbbf module public to make clippy happy, but either way I am going to make use of it soon, and then maybe change it to crate-private? One thing that has been bothering me is that down the line I'd probably need to set up a separate trait for this purpose...
We may have to do something similar to what we do for the footer: guess what the length is, and then read additional data if necessary.
Thanks, do you happen to know the link to that example?
I intend to merge this one first before moving on to parsing the bitset from the parquet file.
I will try and review this PR later today. Thank you @jimexist
I started (but did not finish) reviewing this PR, thank you. I need to find some dedicated time to study the bloom filter spec in more detail. So far, my analysis is that twox-hash seems reasonable -- it is widely used: https://crates.io/crates/twox-hash/reverse_dependencies
Looks great @jimexist -- thank you
I went over the code in detail while reading https://github.com/apache/parquet-format/blob/master/BloomFilter.md#technical-approach. The implementation seems to conform well to the spec.
My only substantial suggestions are about testing. I left a link to another test from @jorgecarleitao 's parquet2 (and its comments) that would be good to add.
Also cc @zeevm and @shanisolomon, who previously showed some interest in this work when exposing the metadata:
https://github.com/apache/arrow-rs/pulls?q=is%3Apr+bloom+is%3Aclosed
All in all 🚀 very nice and THANK YOU
fn mask(x: u32) -> Block {
    let mut result = [0_u32; 8];
    for i in 0..8 {
        // wrapping instead of checking for overflow
I don't know the implications of using wrapping mul here.
Basically it's very likely to wrap given the salt is numerically large, but the idea of salting is to make the distribution pseudo-random, so wrapping is a good idea.
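For readers following along, the complete mask function from the parquet-format SBBF spec can be sketched in plain std Rust. The eight salt constants come from the spec; this standalone version returns `[u32; 8]` directly instead of the crate's `Block` alias, so treat it as an illustrative sketch rather than the PR's exact code:

```rust
/// The eight per-word salts from the parquet-format SBBF spec.
const SALT: [u32; 8] = [
    0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d,
    0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31,
];

/// Build the 8-word mask for the low 32 bits of a hash: each word gets
/// exactly one bit set, selected by the top 5 bits of x * SALT[i].
fn mask(x: u32) -> [u32; 8] {
    let mut result = [0_u32; 8];
    for i in 0..8 {
        // wrapping instead of checking for overflow: the product is
        // expected to wrap, since only the top 5 bits pick a bit position
        result[i] = 1 << (x.wrapping_mul(SALT[i]) >> 27);
    }
    result
}

fn main() {
    // every word of any mask has exactly one bit set
    assert!(mask(0xdead_beef).iter().all(|w| w.count_ones() == 1));
    // mask(0) multiplies to 0 everywhere, so every word is 1 << 0
    assert!(mask(0).iter().all(|&w| w == 1));
}
```

This also shows why wrapping is harmless here: the multiply only needs to scramble bits, and the shift by 27 keeps just the top five of them.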
    Self(data)
}

pub fn read_from_column_chunk<R: Read + Seek>(
Is there any way to write a test for this function? Maybe we can do so eventually using the data in https://github.com/apache/parquet-testing/tree/master/data
Yes, eventually, but I think it can be done later after being used in the reader, likely RowGroupReader.
@@ -57,6 +57,7 @@ seq-macro = { version = "0.3", default-features = false }
 futures = { version = "0.3", default-features = false, features = ["std"], optional = true }
 tokio = { version = "1.0", optional = true, default-features = false, features = ["macros", "rt", "io-util"] }
 hashbrown = { version = "0.13", default-features = false }
+twox-hash = { version = "1.6", optional = true }
👍
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
37a024d to b08f97c
👨🍳 👌 Looks very good to me
        200, 1, 80, 20, 64, 68, 8, 109, 6, 37, 4, 67, 144, 80, 96, 32, 8, 132, 43,
        33, 0, 5, 99, 65, 2, 0, 224, 44, 64, 78, 96, 4,
    ];
    let sbbf = Sbbf::new(bitset);
What do you think about adding the second test from https://github.com/jorgecarleitao/parquet2/blob/main/src/bloom_filter/mod.rs#L14-L69 ? We could definitely do it as a follow on PR
I think it is essentially the same as the one added; I'd prefer to implement the whole thing and then test using real bloom filter test data files.
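To make the "whole thing" concrete, a minimal insert/check round-trip can be sketched in self-contained std Rust. `TinySbbf` and its method names are hypothetical stand-ins (the PR's `Sbbf` is constructed from a bitset and hashes values with xxHash, which is elided here); the salts and the multiply-shift block selection follow the parquet-format spec:

```rust
const SALT: [u32; 8] = [
    0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d,
    0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31,
];

/// One bit per word, chosen by the low 32 bits of the hash.
fn mask(x: u32) -> [u32; 8] {
    let mut result = [0_u32; 8];
    for i in 0..8 {
        result[i] = 1 << (x.wrapping_mul(SALT[i]) >> 27);
    }
    result
}

/// A toy split block bloom filter: a vector of 256-bit blocks.
struct TinySbbf(Vec<[u32; 8]>);

impl TinySbbf {
    fn new(num_blocks: usize) -> Self {
        TinySbbf(vec![[0_u32; 8]; num_blocks])
    }

    /// Multiply-shift block selection from the top 32 bits of the hash.
    fn block_index(&self, hash: u64) -> usize {
        (((hash >> 32) * self.0.len() as u64) >> 32) as usize
    }

    fn insert(&mut self, hash: u64) {
        let i = self.block_index(hash);
        let m = mask(hash as u32);
        for j in 0..8 {
            self.0[i][j] |= m[j];
        }
    }

    fn check(&self, hash: u64) -> bool {
        let i = self.block_index(hash);
        let m = mask(hash as u32);
        (0..8).all(|j| self.0[i][j] & m[j] == m[j])
    }
}

fn main() {
    let mut filter = TinySbbf::new(32);
    let hashes = [0x1234_5678_9abc_def0_u64, 42, u64::MAX];
    for &h in &hashes {
        filter.insert(h);
    }
    // bloom filters must never produce false negatives
    assert!(hashes.iter().all(|&h| filter.check(h)));
}
```

A real cross-implementation test would instead assert against known bitset bytes, as the linked parquet2 tests do.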
pub fn hash_bytes<A: AsRef<[u8]>>(value: A) -> u64 {
    let mut hasher = XxHash64::with_seed(SEED);
    hasher.write(value.as_ref());
    hasher.finish()
}
Is the parameter hash in the above functions generated by this function? Perhaps add a doc comment for the public function.
Tbh I only added pub to silence clippy; otherwise it'll warn about an unused function.
I plan to do more around this later, but getting this pull request merged is a first step.
The alternative, of course, is to just code the whole thing, but that would make code review harder.
Thank you for breaking it up into small pieces!
fn hash_to_block_index(&self, hash: u64) -> usize {
    // unchecked_mul is unstable, but in reality this is safe: we'd just use saturating mul,
    // but it will not saturate
    (((hash >> 32).saturating_mul(self.0.len() as u64)) >> 32) as usize
Is this guaranteed to be in the range of block index?
Yes, this is per the spec:
The filter_insert operation first uses the most significant 32 bits of its argument to select a block to operate on. Call the argument "h", and recall the use of "z" to mean the number of blocks. Then a block number i between 0 and z-1 (inclusive) to operate on is chosen as follows:
unsigned int64 h_top_bits = h >> 32;
unsigned int64 z_as_64_bit = z;
unsigned int32 i = (h_top_bits * z_as_64_bit) >> 32;
The first line extracts the most significant 32 bits from h and assigns them to a 64-bit unsigned integer. The second line is simpler: it just sets an unsigned 64-bit value to the same value as the 32-bit unsigned value z. The purpose of having both h_top_bits and z_as_64_bit be 64-bit values is so that their product is a 64-bit value. That product is taken in the third line, and then the most significant 32 bits are extracted into the value i, which is the index of the block that will be operated on.
After this process to select i, filter_insert uses the least significant 32 bits of h as the argument to block_insert called on block i.
The technique for converting the most significant 32 bits to an integer between 0 and z-1 (inclusive) avoids using the modulo operation, which is often very slow. This trick can be found in Kenneth A. Ross's 2006 IBM research report, "Efficient Hash Probes on Modern Processors"
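The three quoted spec lines translate almost directly into Rust. A standalone sketch (the free-function signature is illustrative, and it assumes the block count fits in 32 bits so the 64-bit product cannot overflow):

```rust
/// Pick a block index in [0, num_blocks) from the top 32 bits of the hash,
/// using the multiply-shift trick instead of a (slower) modulo.
fn hash_to_block_index(num_blocks: usize, hash: u64) -> usize {
    let h_top_bits = hash >> 32;         // most significant 32 bits of h
    let z_as_64_bit = num_blocks as u64; // z widened so the product is 64-bit
    ((h_top_bits * z_as_64_bit) >> 32) as usize
}

fn main() {
    // the index stays in range even at the extremes of the hash space
    for &h in &[0_u64, 1 << 32, u64::MAX] {
        assert!(hash_to_block_index(1024, h) < 1024);
    }
    // a filter with a single block always selects block 0
    assert_eq!(hash_to_block_index(1, u64::MAX), 0);
}
```

Because h_top_bits < 2^32, the product is strictly less than num_blocks * 2^32, so shifting right by 32 always yields a value below num_blocks, answering the in-range question above.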
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
@@ -84,6 +84,8 @@ pub mod arrow;
 pub mod column;
 experimental!(mod compression);
 experimental!(mod encodings);
+#[cfg(feature = "bloom")]
It occurs to me that you probably need to add bloom to the list of features in CI to get this tested:

arrow-rs/.github/workflows/arrow.yml
Lines 77 to 78 in 3084ee2

- name: Test arrow with all features apart from simd
  run: cargo test -p arrow --features=force_validate,prettyprint,ipc_compression,ffi,dyn_cmp_dict,dyn_arith_dict,chrono-tz
Also perhaps it is worth a mention (as experimental) in https://github.com/apache/arrow-rs/tree/master/arrow#feature-flags
I'll try to add that after the feature is completed?
On a second look, @alamb, I think the bloom filter is added in parquet, not in arrow, so I don't think your comment is applicable per se.
Sorry, I think the feature flag should be added to the parquet docs: https://github.com/apache/arrow-rs/tree/master/parquet#feature-flags
I double-checked the parquet CI and it looks like this feature will be enabled and thus covered:

arrow-rs/.github/workflows/parquet.yml
Line 92 in 46da606

run: cargo check -p parquet --all-features
Let me try to merge this and address the issues in a subsequent pull request.
Benchmark runs are scheduled for baseline = 3084ee2 and contender = b7af85c. b7af85c is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Which issue does this PR close?
Part I of:
Next up:
This implementation is partly based on https://github.com/jorgecarleitao/parquet2/tree/main/src/bloom_filter for the test data.
Rationale for this change
Add a bloom filter implementation based on the split block bloom filter (SBBF) spec.
What changes are included in this PR?
SBBF implementation per spec
Are there any user-facing changes?