
PARQUET-41: Add bloom filter #757

Merged 18 commits into apache:master on Feb 26, 2020
Conversation


@chenjunjiedada (Contributor) commented Feb 12, 2020

This pull request contains the complete set of Bloom filter patches, listed as follows:

  • PARQUET-1328: Add Bloom filter reader and writer
  • PARQUET-1391: Integrate Bloom filter logic
  • PARQUET-1516: Store Bloom filters near to footer
  • PARQUET-1660: Align Bloom filter implementation with the format

chenjunjiedada and others added 8 commits January 21, 2019 11:47
* PARQUET-1516: Store Bloom filters near to footer
Conflicts:
	parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java
	parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java
	pom.xml
Conflicts:
	parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java
	parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java
	parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java
	parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetFileWriter.java
@gszadovszky (Contributor)

I think it would be better to name this PR after the parent jira (PARQUET-41) so that the proper jira is listed in the CHANGES.
Also, please follow the naming convention of PARQUET-XXXX: headline.

@chenjunjiedada changed the title from "[PARQUET-1795] Merge bloom filter back to master" to "PARQUET-41: Add bloom filter" on Feb 12, 2020
@chenjunjiedada (Contributor, Author)

@gszadovszky, sure, I was waiting for CI to pass. I have to switch between multiple proxies to run the Maven build and tests. Some packages, such as those related to parquet-cascading and parquet-generator, cannot be downloaded from the Apache repo, and the parquet-tools dependencies cannot be downloaded no matter which proxy I use.

@gszadovszky (Contributor) left a comment

I have a couple of findings in the code.

Also, please add more tests.

  • Some tests around the low level bloom filters with different properties (NDV, max bytes).
  • I would expect tests covering the different scenarios at filtering level (missing columns, null pages etc.). See TestColumnIndexFilter as an example.
  • It would also be great to have a higher level test to write generated data to files and verify that the bloom filter does not drop any row groups where requested data exist. See TestColumnIndexFiltering for an example.

@chenjunjiedada (Contributor, Author)

@gszadovszky, thanks for the great comments. I will take some time to address them and ping you when ready.

@gszadovszky (Contributor) left a comment

I've added some comments. Also, some of my previous comments are not resolved yet.
Open points:

  • More tests are required (see here for details)
  • Synchronization in BloomFilterReader (see here for details)
  • Configuration for filtering based on bloom filters (see here for details)
  • Shading zero-allocation-hashing

@chenjunjiedada (Contributor, Author)

@gszadovszky, thanks for the summary. I was working on another round of updates when you added the latest comments. I will address all comments in one commit next.

@gszadovszky (Contributor) left a comment

I have only one additional comment for now.

@chenjunjiedada (Contributor, Author)

@gszadovszky, could you please take another look? Since the bloom filter is a row-group filter, it should have no effect on missing columns and should not be affected by null pages, so I didn't add unit tests for missing columns or null pages. What do you think?

@gszadovszky (Contributor) left a comment

I added some more comments, including on places I had already reviewed where I just discovered something.

One more thing: we should somehow ensure that this implementation is exactly the one specified. That is required so that we are compatible with other potential implementations (e.g. parquet-cpp). So we need to ensure that the hashes generated by LongHashFunction are correct. Could you gather/calculate the hashes for some values of different types (Binary, int, long, float, double) so we can add a unit test validating the hash calculations?
Also, we need to ensure that a bloom filter generated from these hashes properly contains all of them; in other words, no false negative results shall occur in any case. To approximate this requirement we should generate random values that we know would not match any of the values in the dataset and check them with the bloom filter. Keep in mind that if we use a random seed for the random generator, we shall log this seed so that potential failures are reproducible.

@chenjunjiedada (Contributor, Author)

@gszadovszky, thanks for the comments. For hash value correctness, I think I can get some reference values from the test suite of the original xxHash project. I will add the unit tests once I have them.

@gszadovszky (Contributor) left a comment

I've added some more notes.

I would also expect a test that exhaustively verifies the bloom filter for potential false negatives. Of course, you cannot cover every possible value, but you can build up a bloom filter from a huge number of random values and verify that all of them are contained by the filter. You can even use a random seed for the random generator so that every execution covers different values; in that case please log the seed so a potential failure is reproducible. (See TestColumnIndexes for an example.)
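The test structure described here can be sketched as follows. This is a minimal illustration, not parquet-mr's actual `BlockSplitBloomFilter` or its test suite: the `SimpleBloomFilter` class, its sizes, and the hashing scheme are all stand-ins chosen for the example; only the shape of the test (log the seed, insert random values, then assert every one is reported present) mirrors the suggestion.

```java
import java.util.BitSet;
import java.util.Random;

public class BloomFilterNoFalseNegativeTest {

  // Toy Bloom filter used only to illustrate the test structure; this is NOT
  // parquet-mr's BlockSplitBloomFilter.
  static final class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SimpleBloomFilter(int numBits, int numHashes) {
      this.bits = new BitSet(numBits);
      this.numBits = numBits;
      this.numHashes = numHashes;
    }

    // Derive the k bit positions from one 64-bit hash via double hashing.
    private int index(long hash, int i) {
      int h1 = (int) hash;
      int h2 = (int) (hash >>> 32) | 1; // force odd to avoid degenerate stepping
      return Math.floorMod(h1 + i * h2, numBits);
    }

    void insert(long hash) {
      for (int i = 0; i < numHashes; i++) {
        bits.set(index(hash, i));
      }
    }

    boolean mightContain(long hash) {
      for (int i = 0; i < numHashes; i++) {
        if (!bits.get(index(hash, i))) {
          return false;
        }
      }
      return true;
    }
  }

  public static void main(String[] args) {
    // Log the seed so a failing run is reproducible, as the review suggests.
    long seed = System.nanoTime();
    System.out.println("random seed = " + seed);
    Random random = new Random(seed);

    SimpleBloomFilter filter = new SimpleBloomFilter(1 << 20, 8);
    long[] hashes = new long[100_000];
    for (int i = 0; i < hashes.length; i++) {
      hashes[i] = random.nextLong();
      filter.insert(hashes[i]);
    }

    // Every inserted hash must be reported present: false negatives are forbidden.
    for (long h : hashes) {
      if (!filter.mightContain(h)) {
        throw new AssertionError("false negative (seed=" + seed + ")");
      }
    }
    System.out.println("no false negatives among " + hashes.length + " values");
  }
}
```

Because membership checks probe exactly the bits that insertion set, any false negative would indicate a bug in the index derivation, which is precisely what this style of test is meant to catch.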

@@ -53,6 +58,9 @@
private long rowsWrittenSoFar = 0;
private int pageRowCount;

private BloomFilterWriter bloomFilterWriter;
private BloomFilter bloomFilter;
Contributor:

Having these variables final would help JIT to optimize out the bloomFilter != null parts.

Contributor Author:

A final variable needs to be initialized, but we still have some constructors that do not initialize these variables.

Contributor:

As far as I can see, bloomFilter is either initialized in a constructor (lines 100 and 104) or null, and similarly for bloomFilterWriter. So you can declare them final; you only need to initialize them to null in the constructors and code paths where no non-null value is assigned.

Contributor Author:

Done.

@chenjunjiedada (Contributor, Author)

@gszadovszky, I'm not sure I understood the test you mentioned correctly; the added unit test covers correctness for the supported hash types and contains one million random values for each type.

@gszadovszky (Contributor) left a comment

The test you've written addresses my concerns. Thanks a lot.
Added a couple of comments. I'll approve after they are fixed.

@gszadovszky (Contributor) left a comment

One nit-pick and a still-open discussion.

@gszadovszky (Contributor) left a comment

Please, fix the compilation failure.

@@ -56,6 +60,7 @@
public static final int DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH = 64;
public static final int DEFAULT_STATISTICS_TRUNCATE_LENGTH = Integer.MAX_VALUE;
public static final int DEFAULT_PAGE_ROW_COUNT_LIMIT = 20_000;
public static final int DEFAULT_MAX_BLOOM_FILTER_BYTES = 1024 * 1024;
Contributor:

@chenjunjiedada
I'm curious about the maximum default value. Could you please explain why you choose 1 MB?

Contributor Author:

Assume we have a row group with only one column of UUIDs (36 bytes each); according to the formula, with FPP = 0.01 we would need about 4 MB. I expect we will have more columns in real scenarios.
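The arithmetic behind this estimate can be reproduced with the classic Bloom filter sizing formula, m = -n · ln(p) / (ln 2)², where n is the number of distinct values and p the target false-positive probability. The 128 MB row group size below is an illustrative assumption, not a value taken from this PR:

```java
public class BloomFilterSizing {

  // Classic sizing formula: m = -n * ln(p) / (ln 2)^2 bits for n distinct
  // values at false-positive probability p.
  static long optimalNumOfBits(long ndv, double fpp) {
    return (long) Math.ceil(-ndv * Math.log(fpp) / (Math.log(2) * Math.log(2)));
  }

  public static void main(String[] args) {
    // Assumed 128 MB row group holding a single column of 36-byte UUID strings;
    // the row-group size is an assumption for illustration only.
    long rowGroupBytes = 128L * 1024 * 1024;
    long ndv = rowGroupBytes / 36; // ~3.7 million distinct values
    long bits = optimalNumOfBits(ndv, 0.01);
    System.out.printf("ndv = %d, filter size ~ %.1f MB%n",
        ndv, bits / 8.0 / (1024 * 1024)); // roughly 4 MB, matching the comment
  }
}
```

At FPP = 0.01 the formula works out to roughly 9.6 bits per distinct value, so a few million UUIDs do indeed land in the 4 MB range, which makes the 1 MB default a deliberately conservative cap.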

Contributor:

@chenjunjiedada Thanks for the clarification!

@gszadovszky (Contributor)

@chenjunjiedada, your branch was conflicting because my change (#754) went in. So, I went ahead and resolved the conflicts. Please check if it is fine for you.

@chenjunjiedada (Contributor, Author)

@gszadovszky, thanks, it looks good to me.

@gszadovszky gszadovszky merged commit 806037c into apache:master Feb 26, 2020
@chenjunjiedada (Contributor, Author)

@gszadovszky, thanks a lot for the reviews!

shangxinli pushed a commit to shangxinli/parquet-mr that referenced this pull request Mar 1, 2020
* PARQUET-1328: Add Bloom filter reader and writer (apache#587)
* PARQUET-1516: Store Bloom filters near to footer (apache#608)
* PARQUET-1391: Integrate Bloom filter logic (apache#619)
* PARQUET-1660: align Bloom filter implementation with format (apache#686)
@shannonwells

shannonwells commented Apr 16, 2021

@chenjunjiedada I'm interested in how you arrived at the formula for the optimal number of bits. Can you please elaborate on this? After reading the referenced paper on it (http://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf) I'm unclear as to which equation you used from that paper or if you used another one. We're attempting to implement this algorithm in a different language. Thank you.

@jbapple

jbapple commented Apr 18, 2021

@shannonwells If you use equation 3 and fix the block size as 256 bits and the number of inner hash functions as 8, you'll be able to generate something akin to figure 1. You can then compare the FPP you calculated with the minimum FPP for static filters.
