PARQUET-41: Add bloom filter #757
Conversation
* PARQUET-1516: Store Bloom filters near to footer

Conflicts:
- parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java
- parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java
- pom.xml

Conflicts:
- parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java
- parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java
- parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java
- parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetFileWriter.java
I think it would be better to name this PR after the parent jira (PARQUET-41) so that the proper jira is listed in the CHANGES.
@gszadovszky, sure, I was waiting for CI to pass. I have to switch between multiple proxies to run the maven build and tests. Some packages, such as those related to parquet-cascading and parquet-generator, cannot be downloaded from the apache repo, and the parquet-tools dependencies cannot be downloaded no matter which proxy I switch to.
I have a couple of findings in the code.
Also, please add more tests.
- Some tests around the low level bloom filters with different properties (NDV, max bytes).
- I would expect tests covering the different scenarios at filtering level (missing columns, null pages etc.). See TestColumnIndexFilter as an example.
- It would also be great to have a higher level test to write generated data to files and verify that the bloom filter does not drop any row groups where requested data exist. See TestColumnIndexFiltering for an example.
Review threads:
- parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriteStoreV1.java
- parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterBase.java
- parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java
- parquet-hadoop/src/main/java/org/apache/parquet/filter2/compat/RowGroupFilter.java
- parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriteStoreBase.java
@gszadovszky, thanks for the great comments. I will take some time to address them and ping you when ready.
I've added some comments. Also, some of my previous comments are not resolved yet.
Open points:
- ...column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java
- parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
- parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java
- parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java
@gszadovszky, thanks for the summary. I was doing another round of updates when you added the latest comments. I will address all the comments together in the next commit.
I have only one additional comment for now.
- parquet-hadoop/src/main/java/org/apache/parquet/ParquetReadOptions.java
Force-pushed from 3bacd12 to 1519967.
@gszadovszky, could you please take another look? Since the bloom filter is a row-group filter, it has no effect on missing columns and is not affected by null pages, so I didn't add unit tests for missing columns or null pages. What do you think?
Added some more comments, including in places I had already reviewed where I just discovered something.
One more thing: we shall somehow ensure that this implementation is exactly the one specified. That is required so that we stay compatible with other potential implementations (e.g. parquet-cpp). So we need to verify that the hashes generated by LongHashFunction are correct. Are you able to gather/calculate the hashes for some values of different types (Binary, int, long, float, double) so we can add a unit test validating the hash calculations?
Also, we need to ensure that a bloom filter generated using these hashes contains all of them properly. In other words, no false negative results shall occur in any case. To approximate this requirement we should generate random values that we know would not match any of the values in the dataset and check them with the bloom filter. Keep in mind that if we use a random seed for the random generator we shall log this seed so that potential failures are reproducible.
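The seed-logging pattern asked for here can be sketched in isolation (the class and method names below are hypothetical illustrations, not from the PR): pick a fresh seed, print it, and drive the test's random generator from it, so any failure can be replayed by hard-coding the printed seed.

```java
import java.util.Random;

public class SeedLoggingSketch {
    // Pick a fresh seed, log it, and build the test's Random from it so that
    // a failing run can be reproduced by hard-coding the printed seed.
    public static Random reproducibleRandom() {
        long seed = new Random().nextLong();
        System.out.println("bloom filter test seed = " + seed);
        return new Random(seed);
    }

    public static void main(String[] args) {
        // Same seed -> identical sequence; this is what makes failures replayable.
        Random r1 = new Random(42L);
        Random r2 = new Random(42L);
        for (int i = 0; i < 5; i++) {
            if (r1.nextLong() != r2.nextLong()) {
                throw new AssertionError("Random is not reproducible");
            }
        }
        System.out.println("same seed reproduces the same sequence");
    }
}
```

Each run then exercises a different slice of the value space while staying fully reproducible from the log.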
Review threads:
- parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java
- ...column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java
- parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
- parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestBloomFiltering.java
@gszadovszky Thanks for the comments. For hash value correctness, I think we can get some values from the test suite of the original xxHash website. I will add the unit tests once I have them.
Force-pushed from b153fb2 to b912587, then from b912587 to 6a8e641.
I've added some more notes.
I would also expect a test that exhaustively verifies the bloom filter for potential false negative cases. Of course, you cannot cover every value, but you can build up a bloom filter from a huge number of random values and verify that all of them are contained by the bloom filter. You can even use a random seed for the random generator so that every execution covers more values. In this case please log the seed so a potential failure is reproducible. (See TestColumnIndexes for an example.)
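Such a test can be sketched against a toy split-block filter. To be clear, this is an illustrative stand-in rather than the PR's BlockSplitBloomFilter: the 256-bit blocks and the eight salt constants follow my reading of the parquet-format proposal and should be treated as assumptions here, but the no-false-negative property under test holds for any deterministic insert/check pair.

```java
import java.util.Random;

// Toy split-block bloom filter (NOT the PR's implementation). Each block is
// eight 32-bit words; one bit is set per word, chosen by multiplying the low
// 32 bits of the hash with a per-word salt and taking the top 5 bits.
public class TinyBlockSplitBloomFilter {
    private static final int[] SALT = {
        0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d,
        0x705495c7, 0x2df1424b, 0x9efc4947, 0x5ec5b2cd};
    private final int[] words; // numBlocks * 8 words

    public TinyBlockSplitBloomFilter(int numBlocks) {
        this.words = new int[numBlocks * 8];
    }

    private int blockIndex(long hash) {
        // Map the upper 32 bits of the hash onto [0, numBlocks).
        return (int) (((hash >>> 32) * (words.length / 8)) >>> 32);
    }

    public void insert(long hash) {
        int base = blockIndex(hash) * 8;
        int key = (int) hash;
        for (int i = 0; i < 8; i++) {
            int bit = (key * SALT[i]) >>> 27; // top 5 bits -> 0..31
            words[base + i] |= 1 << bit;
        }
    }

    public boolean mightContain(long hash) {
        int base = blockIndex(hash) * 8;
        int key = (int) hash;
        for (int i = 0; i < 8; i++) {
            int bit = (key * SALT[i]) >>> 27;
            if ((words[base + i] & (1 << bit)) == 0) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        long seed = new Random().nextLong();
        System.out.println("test seed = " + seed); // log for reproducibility
        Random random = new Random(seed);
        TinyBlockSplitBloomFilter filter = new TinyBlockSplitBloomFilter(1024);
        long[] hashes = new long[100_000];
        for (int i = 0; i < hashes.length; i++) {
            hashes[i] = random.nextLong();
            filter.insert(hashes[i]);
        }
        // No false negatives: every inserted hash must be reported present.
        for (long h : hashes) {
            if (!filter.mightContain(h)) {
                throw new AssertionError("false negative, seed=" + seed);
            }
        }
        System.out.println("no false negatives");
    }
}
```

Because insertion and lookup are deterministic, the check never fails; a failure would indicate a bug in the bit-setting logic rather than ordinary bloom filter behavior.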
@@ -53,6 +58,9 @@
    private long rowsWrittenSoFar = 0;
    private int pageRowCount;

    private BloomFilterWriter bloomFilterWriter;
    private BloomFilter bloomFilter;
Having these variables final would help the JIT to optimize out the bloomFilter != null checks.
The final variables need to be initialized, but we still have some constructors that do not initialize them.
As far as I can see, bloomFilter is either initialized in a constructor (line 100 and 104) or null. Similarly for bloomFilterWriter. So you can declare them final; you only need to initialize them to null explicitly in the constructors and code paths where no non-null value is assigned.
Done.
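The pattern that was agreed on can be sketched as follows (a hypothetical, stripped-down stand-in for ColumnWriterBase, with Object in place of the real field types): every constructor path assigns the final fields exactly once, using null on the path without bloom filter support.

```java
public class WriterSketch {
    public final Object bloomFilterWriter; // stands in for BloomFilterWriter
    public final Object bloomFilter;       // stands in for BloomFilter

    // Constructor path without bloom filter support: assign null explicitly
    // so the fields can still be declared final.
    public WriterSketch() {
        this.bloomFilterWriter = null;
        this.bloomFilter = null;
    }

    // Constructor path that enables bloom filters.
    public WriterSketch(Object writer, Object filter) {
        this.bloomFilterWriter = writer;
        this.bloomFilter = filter;
    }

    public boolean hasBloomFilter() {
        // With final fields the JIT can treat this null check as a constant
        // after inlining, which is the optimization discussed above.
        return bloomFilter != null;
    }

    public static void main(String[] args) {
        System.out.println(new WriterSketch().hasBloomFilter());                           // false
        System.out.println(new WriterSketch(new Object(), new Object()).hasBloomFilter()); // true
    }
}
```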
Review threads:
- parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterBase.java
- ...column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java
- parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BloomFilter.java
@gszadovszky, I'm not sure I understand the test you mentioned correctly; the added unit test covers correctness for the supported hash types and contains one million random values for each type.
The test you've written addresses my concerns. Thanks a lot.
Added a couple of comments. I'll approve after they are fixed.
Review threads:
- ...column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java
- ...mn/src/test/java/org/apache/parquet/column/values/bloomfilter/TestBlockSplitBloomFilter.java
One nit-pick and a still-open discussion.
- ...mn/src/test/java/org/apache/parquet/column/values/bloomfilter/TestBlockSplitBloomFilter.java
Force-pushed from 8b07f7d to 8f04bb8.
Please, fix the compilation failure.
- parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterBase.java
@@ -56,6 +60,7 @@
    public static final int DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH = 64;
    public static final int DEFAULT_STATISTICS_TRUNCATE_LENGTH = Integer.MAX_VALUE;
    public static final int DEFAULT_PAGE_ROW_COUNT_LIMIT = 20_000;
    public static final int DEFAULT_MAX_BLOOM_FILTER_BYTES = 1024 * 1024;
@chenjunjiedada I'm curious about the default maximum value. Could you please explain why you chose 1 MB?
Assume we have a row group with only one column of UUIDs (36 bytes each); according to the formula, with FPP = 0.01 we would need about 4 MB. I expect we will have more columns in a real scenario.
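The arithmetic above can be checked with the standard optimal-size formula m = -n·ln(p)/(ln 2)² (an assumption here; the PR may size filters differently): a 128 MB row group of 36-byte UUID strings holds roughly 3.7 million distinct values, which at FPP = 0.01 needs a bit over 4 MB of filter bits.

```java
public class BloomSizeEstimate {
    // Optimal bloom filter size in bits: m = -n * ln(p) / (ln 2)^2
    public static long optimalBits(long ndv, double fpp) {
        return (long) Math.ceil(-ndv * Math.log(fpp) / (Math.log(2) * Math.log(2)));
    }

    public static void main(String[] args) {
        long rowGroupBytes = 128L * 1024 * 1024; // a common row-group size
        long ndv = rowGroupBytes / 36;           // ~3.7M 36-byte UUID strings
        double mb = optimalBits(ndv, 0.01) / 8.0 / (1024 * 1024);
        System.out.printf("ndv=%d -> ~%.1f MB of filter%n", ndv, mb);
    }
}
```

This is why a 1 MB default maximum, expecting several columns per row group, is a size cap rather than an always-optimal size.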
@chenjunjiedada Thanks for the clarification!
@chenjunjiedada, your branch was conflicting because my change (#754) went in, so I went ahead and resolved the conflicts. Please check that it is fine for you.
@gszadovszky, thanks, it looks good to me.
@gszadovszky, thanks a lot for the reviews!
* PARQUET-1328: Add Bloom filter reader and writer (apache#587)
* PARQUET-1516: Store Bloom filters near to footer (apache#608)
* PARQUET-1391: Integrate Bloom filter logic (apache#619)
* PARQUET-1660: align Bloom filter implementation with format (apache#686)
@chenjunjiedada I'm interested in how you arrived at the formula for the optimal number of bits. Can you please elaborate on this? After reading the referenced paper (http://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf) I'm unclear which equation from that paper you used, or whether you used another one. We're attempting to implement this algorithm in a different language. Thank you.
@shannonwells If you use equation 3 and fix the block size at 256 bits and the number of inner hash functions at 8, you'll be able to generate something akin to figure 1. You can then compare the FPP you calculated with the minimum FPP for static filters.
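For the comparison step mentioned here, the "minimum FPP for static filters" baseline is the classic (non-blocked) bloom filter result: with m/n bits per entry and the optimal hash count k = (m/n)·ln 2, the minimum FPP is (1/2)^k. The sketch below computes only this baseline, not the paper's equation 3; a blocked filter of the same size should come out somewhat above it.

```java
public class StaticFilterFpp {
    // Minimum false-positive probability of a classic bloom filter with
    // bitsPerEntry = m/n, attained at k = bitsPerEntry * ln(2) hash functions.
    public static double minFpp(double bitsPerEntry) {
        return Math.pow(0.5, bitsPerEntry * Math.log(2));
    }

    public static void main(String[] args) {
        // ~9.585 bits per entry is the classic requirement for 1% FPP.
        System.out.printf("minFpp(9.585) ~= %.4f%n", minFpp(9.585));
    }
}
```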
(This pull request contains the set of bloom filter patches listed above.)