New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-2157: add bloom filter fpp config #975
Conversation
import java.util.concurrent.Callable; | ||
|
||
import net.openhft.hashing.LongHashFunction; | ||
import org.apache.commons.lang3.RandomStringUtils; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid CI failure, please add this as a test dependency to parquet-hadoop/pom.xml
.
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
<version>3.9</version>
<scope>test</scope>
</dependency>
The CI passed. Thanks a lot @dongjoon-hyun |
@@ -282,6 +286,63 @@ public void testParquetFileWithBloomFilter() throws IOException { | |||
} | |||
} | |||
|
|||
@Test | |||
public void testParquetFileWithBloomFilterWithFpp() throws IOException { | |||
final int totalCount = 100000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Why do we need final
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
.withConf(conf) | ||
.withDictionaryEncoding(false) | ||
.withBloomFilterEnabled("name", true) | ||
.withBloomFilterNDV("name", 100000l) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Can we use TotalCount
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM overall, just some minor nits.
} | ||
} | ||
// The exist should be less than totalCount * fpp. Add 10% here for error space. | ||
assertTrue(exist < totalCount * (testFpp[i] * 1.1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just curious if totalCount
is sufficient; how often exist
> 0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two related questions:
- what should be the
totalCount
to reliably ensure that a)exist > 0
b)exist < totalCount * (testFpp[i] * 1.1)
? Depending on thefpp
value, we can get a random assert exception iftotalCount
is too low (also,exist
could be just 0 then). IftotalCount
is high, the unitest could take a very long time. - how long does this unitest run on your laptop? (with the current
totalCount
of 100000).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically exist
> 0 is false positive. which happens when any given hash value that was never inserted into the bloom filter causes the check to return true. I don't think there is a simple closed-form calculation of this probability, but setting totalCount
to be 100000
seems to be a pretty safe number for the test to pass.
I am thinking we probably should disallow the Bloom filter's size to be unreasonably small. We currently only have the
maximum bytes of the Bloom filter. Shall we also have the minimum bytes of the Bloom filter? What do you think? @chenjunjiedada
The test takes about 2300 milli seconds on my laptop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The size of the bloom filter is computed with ndv
and fpp
. So even the size is "unreasonable" small it should be enough to handle the given situation. Right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Agree.
// The exist counts the number of times FindHash returns true. | ||
int exist = 0; | ||
while (distinctStrings.size() < totalCount) { | ||
String str = RandomStringUtils.randomAlphabetic(10); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the original values are 12 char long. To make sure that finding a different length string among them is always false, can you change it to originalLength - 2
, instead of hard coding 10?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed. Thanks!
@@ -471,6 +484,12 @@ public Builder withBloomFilterNDV(String columnPath, long ndv) { | |||
return this; | |||
} | |||
|
|||
public Builder withBloomFilterFPP(String columnPath, double fpp) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what happens if this value is set, but the BF is not enabled? (general / per-column)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This value will be silently ignored.
Ok, this is reasonable. If this time is sufficient for reliably testing the upper limit of FPPs, it should be good enough to also check the lower limit, eg |
Thanks for the suggestion! I can't find a reliable number for the lower limit. I put |
LGTM |
Thank you all very much! @chenjunjiedada @dongjoon-hyun @ggershinsky @shangxinli |
Could you resolve JIRA please? I realized that the JIRA is still open although this is delivered. |
Make sure you have checked all steps below.
Jira
Tests
Commits
Documentation