PARQUET-2157: add bloom filter fpp config #975

huaxingao · 2022-06-12T23:42:49Z

Make sure you have checked all steps below.

Jira

My PR addresses the following Parquet Jira issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR"
- https://issues.apache.org/jira/browse/PARQUET-2157
- In case you are adding a dependency, check if the license complies with the ASF 3rd Party License Policy.

Tests

My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters (not including Jira issue reference)
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"

Documentation

In case of new functionality, my PR adds documentation that describes how to use it.
- All the public functions and the classes in the PR contain Javadoc that explain what it does

dongjoon-hyun · 2022-06-13T17:07:23Z

parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetWriter.java

 import java.util.concurrent.Callable;

 import net.openhft.hashing.LongHashFunction;
+import org.apache.commons.lang3.RandomStringUtils;


To avoid CI failure, please add this as a test dependency to parquet-hadoop/pom.xml.

<dependency>
  <groupId>org.apache.commons</groupId>
  <artifactId>commons-lang3</artifactId>
  <version>3.9</version>
  <scope>test</scope>
</dependency>

huaxingao · 2022-06-13T18:38:16Z

The CI passed. Thanks a lot @dongjoon-hyun

huaxingao · 2022-06-13T18:39:49Z

cc @chenjunjiedada @ggershinsky @shangxinli

chenjunjiedada · 2022-06-14T01:03:56Z

parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetWriter.java

@@ -282,6 +286,63 @@ public void testParquetFileWithBloomFilter() throws IOException {
    }
  }

+  @Test
+  public void testParquetFileWithBloomFilterWithFpp() throws IOException {
+    final int totalCount = 100000;


Nit: Why do we need final?

chenjunjiedada · 2022-06-14T01:04:30Z

parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetWriter.java

+        .withConf(conf)
+        .withDictionaryEncoding(false)
+        .withBloomFilterEnabled("name", true)
+        .withBloomFilterNDV("name", 100000l)


Nit: Can we use totalCount?

Fixed. Thanks!

chenjunjiedada

LGTM overall, just some minor nits.

ggershinsky · 2022-06-15T12:02:27Z

parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetWriter.java

+          }
+        }
+        // The exist should be less than totalCount * fpp. Add 10% here for error space.
+        assertTrue(exist < totalCount * (testFpp[i] * 1.1));


Just curious whether totalCount is sufficient; how often is exist > 0?

Two related questions:

What should totalCount be to reliably ensure that a) exist > 0 and b) exist < totalCount * (testFpp[i] * 1.1)? Depending on the fpp value, we can get a random assertion failure if totalCount is too low (exist could also be just 0 then). If totalCount is high, the unit test could take a very long time.

How long does this unit test run on your laptop (with the current totalCount of 100000)?

Basically, exist > 0 means a false positive occurred, which happens when a hash value that was never inserted into the bloom filter causes the check to return true. I don't think there is a simple closed-form calculation of this probability, but setting totalCount to 100000 seems to be a pretty safe number for the test to pass.

I am thinking we probably should disallow the Bloom filter's size from being unreasonably small. We currently only have the maximum bytes of the Bloom filter. Shall we also have the minimum bytes of the Bloom filter? What do you think? @chenjunjiedada

The test takes about 2300 milliseconds on my laptop.

The size of the bloom filter is computed from ndv and fpp. So even if the size is "unreasonably" small, it should be enough to handle the given situation. Right?

Yes. Agree.
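
To illustrate the point that the filter size follows from ndv and fpp, here is a minimal sketch using the textbook Bloom filter sizing formula m = -k * n / ln(1 - p^(1/k)). The class name `BloomSizeSketch`, the method `estimateNumBits`, and the choice of k = 8 probes per value are assumptions made for illustration; this is not the code in the PR, and it omits any rounding or clamping to minimum/maximum byte limits that a real implementation would apply.

```java
/** Sketch: deriving a Bloom filter's bit count from NDV and target FPP. */
final class BloomSizeSketch {
    // Assumption for this sketch: 8 bit probes per inserted value.
    private static final int PROBES = 8;

    /** Estimated number of bits needed for ndv distinct values at the target fpp. */
    static long estimateNumBits(long ndv, double fpp) {
        if (ndv <= 0) {
            throw new IllegalArgumentException("ndv must be positive");
        }
        if (!(fpp > 0.0 && fpp < 1.0)) {
            throw new IllegalArgumentException("fpp must be in (0, 1)");
        }
        // Textbook formula: m = -k * n / ln(1 - p^(1/k)), with k = PROBES.
        double bits = -PROBES * ndv / Math.log(1 - Math.pow(fpp, 1.0 / PROBES));
        return (long) Math.ceil(bits);
    }

    public static void main(String[] args) {
        long ndv = 100_000L;
        // Hypothetical fpp values, chosen only to show the trend.
        for (double fpp : new double[] {0.1, 0.05, 0.01}) {
            System.out.printf("fpp=%.3f -> ~%d bytes%n", fpp, estimateNumBits(ndv, fpp) / 8);
        }
    }
}
```

Under this model a lower fpp or a higher ndv always yields a larger filter, so a small filter is simply the consequence of a small ndv or a loose fpp rather than a misconfiguration.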

ggershinsky · 2022-06-15T12:04:08Z

parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetWriter.java

+        // The exist counts the number of times FindHash returns true.
+        int exist = 0;
+        while (distinctStrings.size() < totalCount) {
+          String str = RandomStringUtils.randomAlphabetic(10);


The original values are 12 chars long. To make sure that a string of a different length is never actually among them, can you change it to originalLength - 2 instead of hard-coding 10?

Changed. Thanks!
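
The idea behind this suggestion can be sketched as follows: probe strings are generated with a different length than the inserted values, so any lookup that returns true must be a false positive. This is a hedged illustration, not the test code from the PR; `ProbeStringsSketch` and its methods are hypothetical names, and `java.util.Random` stands in for `RandomStringUtils` so that the snippet runs without commons-lang3.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

/** Sketch: building probe strings that cannot collide with the inserted values. */
final class ProbeStringsSketch {
    private static final String LETTERS =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    /** Stand-in for RandomStringUtils.randomAlphabetic(length). */
    static String randomAlphabetic(Random rnd, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(LETTERS.charAt(rnd.nextInt(LETTERS.length())));
        }
        return sb.toString();
    }

    /** Generates `count` distinct random strings, each of exactly `length` characters. */
    static Set<String> distinctStrings(Random rnd, int count, int length) {
        Set<String> out = new HashSet<>();
        while (out.size() < count) {
            out.add(randomAlphabetic(rnd, length));
        }
        return out;
    }

    public static void main(String[] args) {
        Random rnd = new Random();
        int originalLength = 12;
        Set<String> inserted = distinctStrings(rnd, 1000, originalLength);
        // Probes are two characters shorter, so none of them was ever inserted.
        Set<String> probes = distinctStrings(rnd, 1000, originalLength - 2);
        probes.retainAll(inserted);
        System.out.println("overlap size: " + probes.size());
    }
}
```

Deriving the probe length from `originalLength` keeps the guarantee intact if the length of the inserted values is ever changed.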

ggershinsky · 2022-06-15T12:07:51Z

parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java

@@ -471,6 +484,12 @@ public Builder withBloomFilterNDV(String columnPath, long ndv) {
      return this;
    }

+    public Builder withBloomFilterFPP(String columnPath, double fpp) {


what happens if this value is set, but the BF is not enabled? (general / per-column)

This value will be silently ignored.

ggershinsky · 2022-06-16T11:59:45Z

The test takes about 2300 milliseconds on my laptop.

Ok, this is reasonable. If this time is sufficient for reliably testing the upper limit of FPPs, it should be good enough to also check the lower limit, e.g. exist > totalCount * (testFpp[i] * 0.9), or exist > totalCount * (testFpp[i] * 0.5), or even exist > 0. What do you think? This way, we'll be certain the test passes not because exist is just 0.

huaxingao · 2022-06-16T15:00:47Z

it should be good enough to also check the lower limit, e.g. exist > totalCount * (testFpp[i] * 0.9), or exist > totalCount * (testFpp[i] * 0.5), or even exist > 0. What do you think? This way, we'll be certain the test passes not because exist is just 0.

Thanks for the suggestion! I can't find a reliable number for the lower limit. I put exist > 0.
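
One way to reason about the upper and lower limits discussed here is a simple binomial model. The sketch below assumes, as an idealization, that each lookup of an absent value is an independent trial that returns true with probability exactly equal to the configured fpp; a real filter's false-positive rate only approximates this. The class and method names are hypothetical, and the fpp values in `main` are examples rather than the values used in the test.

```java
/** Sketch: binomial estimates for the number of false positives in a test run. */
final class FalsePositiveBoundsSketch {
    /** Expected number of false positives among `probes` lookups of absent values. */
    static double expectedFalsePositives(int probes, double fpp) {
        return probes * fpp;
    }

    /** Standard deviation of that count under the binomial model. */
    static double stdDev(int probes, double fpp) {
        return Math.sqrt(probes * fpp * (1 - fpp));
    }

    /** Probability that not a single lookup is a false positive: (1 - fpp)^probes. */
    static double probabilityOfZero(int probes, double fpp) {
        return Math.pow(1 - fpp, probes);
    }

    public static void main(String[] args) {
        int totalCount = 100_000;
        // Example fpp values, assumed for illustration only.
        for (double fpp : new double[] {0.1, 0.05, 0.01}) {
            System.out.printf("fpp=%.2f expected=%.0f stdDev=%.1f P(exist == 0)=%.3g%n",
                fpp,
                expectedFalsePositives(totalCount, fpp),
                stdDev(totalCount, fpp),
                probabilityOfZero(totalCount, fpp));
        }
    }
}
```

Under this model, with totalCount = 100000 and fpp = 0.01, the expected count is about 1000 with a standard deviation of roughly 31, so the 10% upper margin sits about three standard deviations above the mean, and the chance of exist being 0 is negligible. That is consistent with exist > 0 being a safe lower check, while a tight multiplicative lower bound would depend on how closely the real filter matches its configured fpp.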

shangxinli · 2022-06-18T03:09:32Z

LGTM

huaxingao · 2022-06-18T03:53:54Z

Thank you all very much! @chenjunjiedada @dongjoon-hyun @ggershinsky @shangxinli

dongjoon-hyun · 2022-12-30T05:31:10Z

Could you resolve JIRA please? I realized that the JIRA is still open although this is delivered.

https://issues.apache.org/jira/browse/PARQUET-2157

add bloom filter fpp config

2ecd05a

huaxingao changed the title ~~add bloom filter fpp config~~ PARQUET-2157: add bloom filter fpp config Jun 12, 2022

Trigger Build

795312f

dongjoon-hyun reviewed Jun 13, 2022

add commons-lang dependecy in hadoop test

1651d6b

chenjunjiedada reviewed Jun 14, 2022

chenjunjiedada approved these changes Jun 14, 2022

huaxingao added 4 commits June 13, 2022 18:40

address comments

ac39269

update doc

054d096

fix doc format

5407f05

add one more space to break the line in md file

3437ce9

ggershinsky reviewed Jun 15, 2022

address comments

7e9ab68

address comments

e380893

remove fpp 0.005 from the test

b3f3c5e

ggershinsky approved these changes Jun 17, 2022

shangxinli merged commit e063844 into apache:master Jun 18, 2022

huaxingao deleted the fpp branch June 18, 2022 03:53
