-
Notifications
You must be signed in to change notification settings - Fork 423
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-1630: Update Bloom filter format #146
Conversation
Hi @jbapple, Could you please help to take a look firstly? |
In the interest of expediency, I have added my own PR, here: Would you be willing to limit this patch to the section about the file format only? |
Thanks @jbapple, the algorithm description looks much clear:) Will update this to just format section |
Done! @jbapple please take a look this PR as well. |
BloomFilter.md
Outdated
filter data offset is stored in column chunk metadata. Here are Bloom filter definitions in | ||
thrift: | ||
Each multi-block Bloom filter is required to work for only one column chunk. The data of a multi-block Bloom | ||
filter contains a header of Bloom filter, which must includes the size of the filter in bytes, the algorithm, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when you say "a header of Bloom filter", do you mean "the header of one Bloom filter"?
Also, "include the size", not "includes the size".
BloomFilter.md
Outdated
@@ -181,6 +182,9 @@ struct ColumnMetaData { | |||
|
|||
``` | |||
|
|||
The Bloom filter data is stored right after pages indexes, the file layout is look like: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Bloom filter data is stored right after pages indexes, the file layout is look like: | |
The Bloom filter data is stored right after the page indexes, and the file layout looks like: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still incorrect. Can you look at my suggestion more carefully and identify what you missed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops. I think I forgot to use git add
before git commit --amend
.
BloomFilter.md
Outdated
filter contains the header of one Bloom filter, which must include the size of the filter in bytes, the algorithm, | ||
the hash function, and the Bloom filter bitset. The offset in column chunk metadata points to the start of | ||
the Bloom filter header. | ||
Here are Bloom filter definitions in thrift: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here are Bloom filter definitions in thrift: | |
Here are the Bloom filter definitions in thrift: |
LGTM |
@jbapple From the mail thread we are planning to add compression into bloom filter file format, so how about adding a field to store compression algorithm? |
I think this PR (and other PRs should) do one thing each. I said in the email thread "I'll send a PR for that after #147 is checked in to avoid rebase racing." |
@rdblue, could you please take a look at this as well. This doesn't include compression thing, Jim will submit a separate PR for that as mentioned. |
ea57431
to
cbfcbc1
Compare
cbfcbc1
to
6e7622a
Compare
@chenjunjiedada, this is getting close. Please have a look at my latest comments. |
+1 |
No description provided.