Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-41: Add bloom filters to parquet statistics #215

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

winningsix
Copy link

It's the PR in mr part.

@winningsix
Copy link
Author

@spena @rdblue Could you help me review this patch? Thank you!

@@ -57,6 +58,7 @@
private ValuesWriter dataColumn;
private int valueCount;
private int valueCountForNextSizeCheck;
private BloomFilterOpts opts;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we rename 'opts' to 'bloomFilterOpts' to increase readability along the code?

@spena
Copy link

spena commented Jun 17, 2015

It's looking good Ferd.

Here are some questions I have.

  • Should we use fall back for bloom filters in case the bloom is not good for the row group? Dictionary encoding does this.
  • If a value is found on the dictionary, is there a way to skip the bloom hashing for better write perf? And add the values to the bloom in case they are fallen back?
  • Is there a way to calculate the # of expected entries instead of asking the user to pass a value?

org.apache.parquet.column.statistics.Statistics statistics) {
if (!(statistics instanceof BloomFilterStatistics)) {
return;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't follow why we need this test and define the signature of this method this way.
can't we just convert from one type to the other?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the stage of converting, it constructs a statistics used in parquet-format and updates the data retrieving from the statistics from parquet-mr.

@winningsix
Copy link
Author

Hi @spena
Please see my inline comments below. Thank you! (Sorry for some delays since I am taking a holiday :<)

Should we use fall back for bloom filters in case the bloom is not good for the row group? Dictionary encoding does this.

At this point, I didn't add the support for fall back. If it's really useful, I think we could do it in a follow-up ticket.

If a value is found on the dictionary, is there a way to skip the bloom hashing for better write perf? And add the values to the bloom in case they are fallen back?

The bloom filter is used to filter a entire row group in the same way as min/max statistics. I am not very familiar with dictionary encoding in parquet. But I think it should be used before dictionary encoding.

Is there a way to calculate the # of expected entries instead of asking the user to pass a value?

I tried to think about a way to calculate it but didn't come up with a good idea. But I think nobody understands the data better than the person who uses it.

Ferdinand Xu added 2 commits August 24, 2016 08:59
PARQUET-41: Update patch addressing comments

Parquet-41: Adding other data types support and enable Unit tests

Change the bitset from arraylist to array

Add statistics option and enable tests for bloom filter

Fix failed unit tests

Remove the page level bloom filter bit set

Rebase code
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants