-
Notifications
You must be signed in to change notification settings - Fork 432
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-41: add bloom filter #99
Conversation
src/main/thrift/parquet.thrift
Outdated
@@ -658,6 +658,74 @@ struct ColumnMetaData { | |||
* This information can be used to determine if all data pages are | |||
* dictionary encoded for example **/ | |||
13: optional list<PageEncodingStats> encoding_stats; | |||
|
|||
/** Byte offset from beginning of file to bloom filter data. The bloom filters | |||
* data of columns together is stored before the start of row group wihch describe.**/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry, I'm having trouble understanding this sentence. Could you describe it in more detail and then we can work on a re-wording?
Also, "Bloom" is the last name of the inventor, so generally we should prefer to capitalize it, here and below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first sentence should be same as data_page_offset above. The second sentence says the Bloom filter data will be stored at the start position of row group.
src/main/thrift/parquet.thrift
Outdated
14: optional i64 bloom_filter_offset; | ||
} | ||
|
||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some formatting nits: this file appears to usually format single-line prose comments without surrounding comment lines.
/** Like this. */
/**
* Not like this.
*/
Also, empty structs have both opening and closing brace on one line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
src/main/thrift/parquet.thrift
Outdated
|
||
/** | ||
* Definition of bloom filter algorithm. | ||
* In order for farward compatibility, we use union to replace enum. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Paragraphs in this file are denoted by blank lines. Adjacent lines without a blank line between them are wrapped. So:
* This is a sentence. This is another sentence
or
* This is a sentence.
*
* This is another sentence.
but never
* This is a sentence.
* This is another sentence.
Please check this here and below.
Also, "forward", not "farward".
src/main/thrift/parquet.thrift
Outdated
} | ||
|
||
/** | ||
* Block based algorithm type annotation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here and below, when using as an adjective, hyphenate "block-based":
https://www.grammarbook.com/punctuation/hyphens.asp
Generally, hyphenate two or more words when they come before a noun they modify and act as a single idea. This is called a compound adjective.
src/main/thrift/parquet.thrift
Outdated
* The bloom filter bitset is separated into tiny bucket as tiny bloom | ||
* filter, the high 32 bits hash value is used to select bucket, and | ||
* lower 32 bits hash values are used to set bits in tiny bloom filter. | ||
* See “Cache-, Hash- and Space-Efficient Bloom Filters”. Specifically, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You used smart quotes here, not plain double-quotes.
src/main/thrift/parquet.thrift
Outdated
/** | ||
* Block based algorithm type annotation. | ||
*/ | ||
struct BlockAlgorithm { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a combination of the "blocked Bloom filters", described in "Cache-, Hash- and Space-Efficient Bloom Filters", and a variant mentioned in, though not invented in, "Network Applications of Bloom Filters: A Survey". The latter are sometimes called "split Bloom filters".
Maybe struct SplitBlockBloomAlgorithm {}
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
src/main/thrift/parquet.thrift
Outdated
* Bloom filter header is stored at beginning of bloom filter data of each column | ||
* and followed by its bitset. | ||
*/ | ||
struct BloomFilterHeader { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this also need a mention in PageType and PageHeader?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any thoughts on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks to remind again.
It depends on how we treat Bloom filter data, it is natural if we abstract Bloom filter data as a specific page. Previously, I didn't read or write it as page since the PageReader/PageWriter APIs in parquet-mr looks specific to data and dictionary page. While I think it is easy to integrate if we want to do this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
src/main/thrift/parquet.thrift
Outdated
} | ||
|
||
/** | ||
* Definition for hash function used to compute hash of column value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would reword this as "The hash function used in the Bloom filter. This function takes the hash of a column value using the plain encoding."
You can leave off the bit about union vs. enum, both here and above.
src/main/thrift/parquet.thrift
Outdated
/** | ||
* Hash strategy type annotation. | ||
*/ | ||
struct Murmur3 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want to be specific, as below, you can say "MurmurHash3_x64_128 from the original SMHasher repo by Austin Appleby" and then remove the detailed comment below explaining more. The restriction of that output by first taking some 64 bits and then using the low bits of that to index into the blocks need not be specified here, I suppose.
src/main/thrift/parquet.thrift
Outdated
* and followed by its bitset. | ||
*/ | ||
struct BloomFilterHeader { | ||
/** The size of bitset in bytes, must be a power of 2**/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That it must be a power of 2 might depend on the algorithm, so I'd omit this.
2c17e6d
to
cc8985b
Compare
66c9e9f
to
4948425
Compare
src/main/thrift/parquet.thrift
Outdated
/** The algorithm used in Bloom filter. **/ | ||
union BloomFilterAlgorithm { | ||
/** Block-based Bloom filter. **/ | ||
1: SplitBlockAlgorithm BLOCK; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
identation -1
src/main/thrift/parquet.thrift
Outdated
} | ||
|
||
/** Hash strategy type annotation. It uses Murmur3Hash_x64_128 from the original SMHasher | ||
* repo by Austin Appleby. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should depend on a external codebase for this implementation (as it is subject to change). May be document the actual implementation details here instead of citing the reference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, we are citing a kind of well-known hash which is widely used in hive, orc, guava and etc.. It just like codec definition above we don't have implementation detail document. In the project hive, orc, and guava, I don't see they have implementation document.
@zivanfi , Could you please help to create a feature branch for this? |
Sure, I have created parquet-format/bloom-filter and parquet-mr/bloom-filter for you. Please open new pull requests on these branches instead of the existing ones on the master branches. Thanks! |
Thanks a lot, @zivanfi |
move to #112 and close this one. |
This is rebased bloom filter PR for #62. The original PR contains a lot of rebasing commit message may be confused.