PARQUET-41: Add bloom filters to parquet statistics #215
base: master
@@ -57,6 +58,7 @@
private ValuesWriter dataColumn;
private int valueCount;
private int valueCountForNextSizeCheck;
private BloomFilterOpts opts;
Could we rename 'opts' to 'bloomFilterOpts' to improve readability throughout the code?
It's looking good, Ferd. Here are some questions I have.
org.apache.parquet.column.statistics.Statistics statistics) {
  if (!(statistics instanceof BloomFilterStatistics)) {
    return;
  }
I don't follow why we need this check, or why the method signature is defined this way.
Can't we just convert from one type to the other?
During conversion, it constructs the statistics object used in parquet-format and populates it with the data retrieved from the parquet-mr statistics.
Hi @spena
At this point, I haven't added support for fallback. If it's really useful, I think we could do it in a follow-up ticket.
The bloom filter is used to filter an entire row group in the same way as min/max statistics. I am not very familiar with dictionary encoding in Parquet, but I think it should be used before dictionary encoding.
I tried to think about a way to calculate it but didn't come up with a good idea. But I think nobody understands the data better than the person who uses it.
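To illustrate the row-group filtering idea mentioned above, here is a minimal sketch in Java. It is not the code from this patch: the class name `RowGroupBloomFilterSketch`, the FNV-1a hash, and the double-hashing scheme are all assumptions chosen to keep the example self-contained. It only shows the general principle that a reader may skip a row group when the filter reports a value as definitely absent.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Hypothetical illustration only; not the BloomFilterStatistics class from this PR.
class RowGroupBloomFilterSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    RowGroupBloomFilterSketch(int numBits, int numHashes) {
        if (numBits <= 0 || numHashes <= 0) {
            throw new IllegalArgumentException("numBits and numHashes must be positive");
        }
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // 64-bit FNV-1a hash, used here only because it needs no external library.
    private static long fnv1a(byte[] data) {
        long h = 0xcbf29ce484222325L;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        return h;
    }

    // Derive the i-th bit position from the two 32-bit halves of one 64-bit hash.
    private int index(long hash, int i) {
        int h1 = (int) hash;
        int h2 = (int) (hash >>> 32);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    // Called by the writer for every value in the column chunk.
    void add(String value) {
        long h = fnv1a(value.getBytes(StandardCharsets.UTF_8));
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(h, i));
        }
    }

    // False means the value is definitely absent; true means it may be present.
    boolean mightContain(String value) {
        long h = fnv1a(value.getBytes(StandardCharsets.UTF_8));
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(h, i))) {
                return false;
            }
        }
        return true;
    }

    // Reader side: an equality predicate can skip the row group only on a definite miss.
    boolean canSkipRowGroup(String searchedValue) {
        return !mightContain(searchedValue);
    }
}
```

Unlike min/max statistics, which can only rule out values that fall outside a range, a filter like this can also rule out values that lie inside the range but were never written. The trade-off is that it may return false positives, so a "might contain" answer still requires reading the row group.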
PARQUET-41: Update patch addressing comments
Parquet-41: Adding other data types support and enable Unit tests
Change the bitset from arraylist to array
Add statistics option and enable tests for bloom filter
Fix failed unit tests
Remove the page level bloom filter bit set
Rebase code
This is the PR for the parquet-mr part.