PARQUET-2157: add bloom filter fpp config #975

Merged
merged 10 commits into from Jun 18, 2022

Contributor

@huaxingao huaxingao commented Jun 12, 2022

Make sure you have checked all steps below.

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

@huaxingao huaxingao changed the title add bloom filter fpp config PARQUET-2157: add bloom filter fpp config Jun 12, 2022
import java.util.concurrent.Callable;

import net.openhft.hashing.LongHashFunction;
import org.apache.commons.lang3.RandomStringUtils;
Member

To avoid CI failure, please add this as a test dependency to parquet-hadoop/pom.xml.

    <dependency>
      <groupId>org.apache.commons</groupId>
      <artifactId>commons-lang3</artifactId>
      <version>3.9</version>
      <scope>test</scope>
    </dependency>

@huaxingao
Contributor Author

The CI passed. Thanks a lot @dongjoon-hyun

@huaxingao
Contributor Author

@@ -282,6 +286,63 @@ public void testParquetFileWithBloomFilter() throws IOException {
}
}

@Test
public void testParquetFileWithBloomFilterWithFpp() throws IOException {
final int totalCount = 100000;
Contributor

Nit: Why do we need final?

Contributor Author

Removed.

.withConf(conf)
.withDictionaryEncoding(false)
.withBloomFilterEnabled("name", true)
.withBloomFilterNDV("name", 100000l)
Contributor

Nit: Can we use totalCount?

Contributor Author

Fixed. Thanks!

Contributor

@chenjunjiedada chenjunjiedada left a comment

LGTM overall, just some minor nits.

}
}
// The exist should be less than totalCount * fpp. Add 10% here for error space.
assertTrue(exist < totalCount * (testFpp[i] * 1.1));
Contributor

Just curious whether totalCount is sufficient; how often is exist > 0?

Contributor

Two related questions:

  • What should totalCount be to reliably ensure that a) exist > 0 and b) exist < totalCount * (testFpp[i] * 1.1)? Depending on the fpp value, we can get a random assertion failure if totalCount is too low (also, exist could be just 0 then). If totalCount is high, the unit test could take a very long time.
  • How long does this unit test run on your laptop (with the current totalCount of 100000)?

Contributor Author

Basically, exist > 0 means there was a false positive, which happens when a hash value that was never inserted into the Bloom filter causes the check to return true. I don't think there is a simple closed-form calculation of this probability, but setting totalCount to 100000 seems to be a pretty safe number for the test to pass.

I am thinking we should probably disallow the Bloom filter's size from being unreasonably small. We currently only have the maximum bytes of the Bloom filter. Shall we also have the minimum bytes of the Bloom filter? What do you think? @chenjunjiedada

The test takes about 2300 milliseconds on my laptop.

Contributor

The size of the Bloom filter is computed from ndv and fpp. So even if the size is "unreasonably" small, it should be enough to handle the given situation, right?

Contributor Author

Yes. Agree.
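
The point that the filter size follows from ndv and fpp can be illustrated with the textbook sizing formula, m = -n * ln(p) / (ln 2)^2 bits. This is only a sketch under stated assumptions: the class and method names are hypothetical, and the formula is the classic one for a standard Bloom filter, which may differ from what the Parquet implementation actually computes.

```java
// Textbook Bloom filter sizing. Hypothetical helper, NOT the parquet-mr code.
final class BloomSizing {

    /** Optimal bit count for ndv distinct values at target false positive probability fpp. */
    static long optimalBits(long ndv, double fpp) {
        if (ndv <= 0 || fpp <= 0.0 || fpp >= 1.0) {
            throw new IllegalArgumentException("ndv must be > 0 and fpp must be in (0, 1)");
        }
        double ln2 = Math.log(2);
        return (long) Math.ceil(-ndv * Math.log(fpp) / (ln2 * ln2));
    }

    public static void main(String[] args) {
        long ndv = 100_000L; // same order of magnitude as totalCount in the test
        // Illustrative fpp values; the values in testFpp are not shown in this thread.
        for (double fpp : new double[] {0.01, 0.05, 0.10}) {
            long bits = optimalBits(ndv, fpp);
            System.out.printf("fpp=%.2f -> %d bits (~%d bytes)%n", fpp, bits, (bits + 7) / 8);
        }
    }
}
```

A lower fpp yields a larger filter for the same ndv, so a "small" filter is simply the consequence of a small ndv or a permissive fpp.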

// The exist counts the number of times FindHash returns true.
int exist = 0;
while (distinctStrings.size() < totalCount) {
String str = RandomStringUtils.randomAlphabetic(10);
Contributor

The original values are 12 characters long. To guarantee that a probe string is never actually among them (so any hit is a false positive), can you change it to originalLength - 2 instead of hard-coding 10?

Contributor Author

Changed. Thanks!
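
The length trick discussed above can be sketched without any test dependency. The helper names below are hypothetical, and java.util.Random stands in for RandomStringUtils; the idea is only that probes of length originalLength - 2 can never equal an inserted value of length originalLength.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical helper illustrating probe generation; not the code from the PR.
final class ProbeStrings {
    private static final String LETTERS =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    /** Random alphabetic string of the given length. */
    static String randomAlphabetic(Random rnd, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(LETTERS.charAt(rnd.nextInt(LETTERS.length())));
        }
        return sb.toString();
    }

    /** Keeps generating until the set holds exactly count distinct strings. */
    static Set<String> distinctStrings(Random rnd, int count, int length) {
        Set<String> out = new HashSet<>();
        while (out.size() < count) {
            out.add(randomAlphabetic(rnd, length));
        }
        return out;
    }

    public static void main(String[] args) {
        Random rnd = new Random();
        int originalLength = 12;
        Set<String> inserted = distinctStrings(rnd, 1000, originalLength);
        Set<String> probes = distinctStrings(rnd, 1000, originalLength - 2);
        // Different lengths, so no probe can be a genuine member of the inserted set.
        boolean overlap = probes.stream().anyMatch(inserted::contains);
        System.out.println("any probe actually inserted? " + overlap);
    }
}
```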

@@ -471,6 +484,12 @@ public Builder withBloomFilterNDV(String columnPath, long ndv) {
return this;
}

public Builder withBloomFilterFPP(String columnPath, double fpp) {
Contributor

What happens if this value is set but the Bloom filter is not enabled (globally or per column)?

Contributor Author

This value will be silently ignored.
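
A toy model of that "silently ignored" behavior might look like the following. Everything here is hypothetical (class name, effectiveFpp, and the assumed default of 0.01); it only mirrors the builder-style method names seen in the diff and is not the library's implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalDouble;

// Toy per-column config, for illustration only; not the parquet-mr builder.
final class ColumnBloomConfig {
    static final double DEFAULT_FPP = 0.01; // assumed default for this sketch

    private final Map<String, Boolean> enabled = new HashMap<>();
    private final Map<String, Double> fpp = new HashMap<>();

    ColumnBloomConfig withBloomFilterEnabled(String columnPath, boolean on) {
        enabled.put(columnPath, on);
        return this;
    }

    ColumnBloomConfig withBloomFilterFPP(String columnPath, double value) {
        if (value <= 0.0 || value >= 1.0) {
            throw new IllegalArgumentException("fpp must be in (0, 1)");
        }
        fpp.put(columnPath, value); // stored even if the filter is disabled
        return this;
    }

    /** FPP that would apply to the column, or empty when no filter is written. */
    OptionalDouble effectiveFpp(String columnPath) {
        if (!enabled.getOrDefault(columnPath, false)) {
            return OptionalDouble.empty(); // the configured fpp is never consulted
        }
        return OptionalDouble.of(fpp.getOrDefault(columnPath, DEFAULT_FPP));
    }

    public static void main(String[] args) {
        ColumnBloomConfig conf = new ColumnBloomConfig().withBloomFilterFPP("name", 0.05);
        System.out.println(conf.effectiveFpp("name")); // filter not enabled
        conf.withBloomFilterEnabled("name", true);
        System.out.println(conf.effectiveFpp("name")); // filter enabled
    }
}
```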

@ggershinsky
Contributor

> The test takes about 2300 milliseconds on my laptop.

Ok, this is reasonable. If this time is sufficient for reliably testing the upper limit of FPPs, it should be good enough to also check the lower limit, e.g. exist > totalCount * (testFpp[i] * 0.9), or exist > totalCount * (testFpp[i] * 0.5), or even exist > 0. What do you think? This way, we'll be certain the test doesn't pass merely because exist is 0.

@huaxingao
Contributor Author

> it should be good enough to also check the lower limit, e.g. exist > totalCount * (testFpp[i] * 0.9), or exist > totalCount * (testFpp[i] * 0.5), or even exist > 0. What do you think? This way, we'll be certain the test doesn't pass merely because exist is 0.

Thanks for the suggestion! I can't find a reliable number for the lower limit. I put exist > 0.
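
As a rough way to reason about these bounds, one can model each probe as an independent false positive with probability fpp, so the hit count is binomial with mean N * p and standard deviation sqrt(N * p * (1 - p)). The sketch below uses hypothetical names and illustrative fpp values (the contents of testFpp are not shown in this thread). It also assumes the filter's real rate equals the configured fpp; if the real rate is lower, a tight lower bound would fail, which is consistent with settling on exist > 0.

```java
// Back-of-the-envelope model for the test bounds; hypothetical helper names.
final class FppBounds {

    /** Expected number of false positives among the probes. */
    static double expectedHits(int probes, double fpp) {
        return probes * fpp;
    }

    /** Standard deviation of a Binomial(probes, fpp) hit count. */
    static double stdDev(int probes, double fpp) {
        return Math.sqrt(probes * fpp * (1.0 - fpp));
    }

    /** Mirrors the bounds discussed here: 0 < exist < totalCount * (fpp * 1.1). */
    static boolean withinBounds(int exist, int totalCount, double fpp) {
        return exist > 0 && exist < totalCount * (fpp * 1.1);
    }

    public static void main(String[] args) {
        int totalCount = 100_000;
        for (double fpp : new double[] {0.01, 0.05, 0.10}) { // illustrative values
            double mean = expectedHits(totalCount, fpp);
            double sd = stdDev(totalCount, fpp);
            double margin = 0.1 * mean; // the 10% error space from the test
            System.out.printf("fpp=%.2f mean=%.0f sd=%.1f margin=%.1f sd%n",
                fpp, mean, sd, margin / sd);
        }
    }
}
```

Under this model, the 10% margin at totalCount = 100000 is several standard deviations wide, and it widens (in standard deviations) as fpp grows.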

@shangxinli
Contributor

LGTM

@shangxinli shangxinli merged commit e063844 into apache:master Jun 18, 2022
@huaxingao
Contributor Author

Thank you all very much! @chenjunjiedada @dongjoon-hyun @ggershinsky @shangxinli

@huaxingao huaxingao deleted the fpp branch June 18, 2022 03:53
@dongjoon-hyun
Member

Could you resolve the JIRA, please? I noticed that it is still open although this has been delivered.
