Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-1660: align Bloom filter implementation with format #686

Merged
merged 4 commits into from
Jan 7, 2020

Conversation

chenjunjiedada
Copy link
Contributor

Make sure you have checked all steps below.

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

int bucketIndex = (int)(hash >> 32) & (bitset.length / BYTES_PER_BLOCK - 1);
long numBlocks = bitset.length / BYTES_PER_BLOCK;
long lowHash = hash >>> 32;
int blockIndex = (int)(lowHash * numBlocks >> 32);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if this product overflows? How does that behavior compare to this line operating on unsigned values in C++, which cannot overflow on multiplication?

Copy link
Contributor Author

@chenjunjiedada chenjunjiedada Oct 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The number of blocks right shift 5 bits at first, so its value should be less than 1<<27 and the overflow should not happen here.

@jbapple
Copy link

jbapple commented Dec 16, 2019

LGTM.

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, do you already use this internally?

@@ -221,6 +231,12 @@ public boolean getPageWriteChecksumEnabled() {
return bloomFilterExpectedDistinctNumbers;
}

public Set<String> getBloomFilterColumns() {return bloomFilterColumns;}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we put the return on the next line, similar to getMaxBloomFilterBytes

@chenjunjiedada
Copy link
Contributor Author

@Fokko, @gszadovszky, could you help to have another look? Is it close to merging?

@chenjunjiedada
Copy link
Contributor Author

@Fokko, forgot your last question. Yes, we already use it internally in some cases.

@chenjunjiedada
Copy link
Contributor Author

@gszadovszky, I updated the code, would you please take another look?

@chenjunjiedada
Copy link
Contributor Author

@gszadovszky , thanks for your review. I'd like to rebase this to master before merging. Maybe it needs your another look again. Thanks in advance!

@chenjunjiedada chenjunjiedada force-pushed the bloom-filter branch 2 times, most recently from 81dd09b to 039ffdc Compare January 7, 2020 09:15
@chenjunjiedada
Copy link
Contributor Author

Looks like I need to merge instead of rebasing. Let me fix this. Sorry for confusing.

@chenjunjiedada
Copy link
Contributor Author

@gszadovszky @Fokko, I just realized that you may need squashing for this PR, so it would be better to submit a separated PR for merging master. So please help to merge this in your convenience.

@Fokko
Copy link
Contributor

Fokko commented Jan 7, 2020

So you'll create a new PR from bloom-filter to master? I'm fine with that. @gszadovszky WDYT?

@chenjunjiedada
Copy link
Contributor Author

@Fokko, I may put in the wrong way, but it is a PR to merge master to bloom-filter branch. I have done that job local machine and can submit that if you prefer to squash one more merging PR.

@gszadovszky
Copy link
Contributor

Let's squash+merge this PR to the feature branch first. Then, check the merge PR and push it to the feature branch as well.
If everything is ready, create a PR from the feature branch to master. There shall be no conflicts. After the final review/tests succeed squash+merge the whole to master.

@chenjunjiedada
Copy link
Contributor Author

SGTM

@gszadovszky gszadovszky merged commit ba28686 into apache:bloom-filter Jan 7, 2020
gszadovszky pushed a commit that referenced this pull request Feb 26, 2020
* PARQUET-1328: Add Bloom filter reader and writer (#587)
* PARQUET-1516: Store Bloom filters near to footer (#608)
* PARQUET-1391: Integrate Bloom filter logic (#619)
* PARQUET-1660: align Bloom filter implementation with format (#686)
shangxinli pushed a commit to shangxinli/parquet-mr that referenced this pull request Mar 1, 2020
* PARQUET-1328: Add Bloom filter reader and writer (apache#587)
* PARQUET-1516: Store Bloom filters near to footer (apache#608)
* PARQUET-1391: Integrate Bloom filter logic (apache#619)
* PARQUET-1660: align Bloom filter implementation with format (apache#686)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants