
PARQUET-41: Add bloom filter #757

Merged 18 commits into apache:master on Feb 26, 2020
Conversation


@chenjunjiedada (Contributor) commented Feb 12, 2020

This pull request contains the complete set of Bloom filter patches, listed as follows:

  • PARQUET-1328: Add Bloom filter reader and writer
  • PARQUET-1391: Integrate Bloom filter logic
  • PARQUET-1516: Store Bloom filters near to footer
  • PARQUET-1660: Align Bloom filter implementation with the format

chenjunjiedada and others added 8 commits January 21, 2019 11:47
* PARQUET-1516: Store Bloom filters near to footer
Conflicts:
	parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java
	parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java
	pom.xml
Conflicts:
	parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java
	parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java
	parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java
	parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetFileWriter.java
@gszadovszky (Contributor)

I think it would be better to name this PR after the parent jira (PARQUET-41) so that the proper jira is listed in the CHANGES.
Also, please follow the naming convention of PARQUET-XXXX: headline.

@chenjunjiedada changed the title from "[PARQUET-1795] Merge bloom filter back to master" to "PARQUET-41: Add bloom filter" on Feb 12, 2020
@chenjunjiedada (Contributor, Author)

@gszadovszky, sure, I was waiting for CI to pass. I have to switch between multiple proxies to run the Maven build and tests. Some packages, such as those related to parquet-cascading and parquet-generator, cannot be downloaded from the Apache repo, and the parquet-tools dependencies cannot be downloaded no matter which proxy I use.

@gszadovszky (Contributor) left a comment

I have a couple of findings in the code.

Also, please add more tests.

  • Some tests around the low level bloom filters with different properties (NDV, max bytes).
  • I would expect tests covering the different scenarios at filtering level (missing columns, null pages etc.). See TestColumnIndexFilter as an example.
  • It would also be great to have a higher level test to write generated data to files and verify that the bloom filter does not drop any row groups where requested data exist. See TestColumnIndexFiltering for an example.

@chenjunjiedada (Contributor, Author)

@gszadovszky, thanks for the great comments. I will take some time to address them and ping you when ready.

@gszadovszky (Contributor) left a comment

I've added some comments. Also, some of my previous comments are not resolved yet.
Open points:

  • More tests are required (see here for details)
  • Synchronization in BloomFilterReader (see here for details)
  • Configuration for filtering based on bloom filters (see here for details)
  • Shading zero-allocation-hashing

@chenjunjiedada (Contributor, Author)

@gszadovszky, thanks for the summary. I was working on another round of updates when you added the latest comments. I will address all comments in one commit next.

@gszadovszky (Contributor) left a comment

I have only one additional comment for now.

@chenjunjiedada (Contributor, Author)

@gszadovszky, could you please take another look? Since the bloom filter is a row-group filter, it should have no effect on missing columns and should not be affected by null pages, so I didn't add unit tests for missing columns or null pages. What do you think?

@gszadovszky (Contributor) left a comment

I added some more comments, including on places I had already reviewed where I just discovered something.

One more thing: we should somehow ensure that this implementation is exactly the one specified. That is required so that we are compatible with other potential implementations (e.g. parquet-cpp). So we need to ensure that the hashes generated by LongHashFunction are correct. Could you gather/calculate the hashes for some values of different types (Binary, int, long, float, double) so we can add a unit test validating the hash calculations?
Also, we need to ensure that a bloom filter generated from these hashes properly contains all of them; in other words, no false negative results shall occur in any case. To approximate this requirement we should generate random values that we know would not match any of the values in the dataset and check them with the bloom filter. Keep in mind that if we use a random seed for the random generator, we shall log this seed so that potential failures are reproducible.

@chenjunjiedada (Contributor, Author)

@gszadovszky, thanks for the comments. For hash value correctness, I think I can get some reference values from the test suite of the original xxHash project. I will add the unit tests once I have them.

@gszadovszky (Contributor) left a comment

I've added some more notes.

I would also expect a test that exhaustively verifies the bloom filter for potential false negatives. Of course, you cannot cover every possible value, but you can build up a bloom filter from a huge number of random values and verify that all of them are contained by the filter. You can even use a random seed for the random generator so that every execution covers different values; in that case please log the seed so a potential failure is reproducible. (See TestColumnIndexes for an example.)
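The test structure described here can be sketched as follows. This is a minimal illustration, not parquet-mr's actual `BlockSplitBloomFilter` or its test suite: the `SimpleBloomFilter` class, its sizes, and the hashing scheme are all stand-ins chosen for the example; only the shape of the test (log the seed, insert random values, then assert every one is reported present) mirrors the suggestion.

```java
import java.util.BitSet;
import java.util.Random;

public class BloomFilterNoFalseNegativeTest {

  // Toy Bloom filter used only to illustrate the test structure; this is NOT
  // parquet-mr's BlockSplitBloomFilter.
  static final class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SimpleBloomFilter(int numBits, int numHashes) {
      this.bits = new BitSet(numBits);
      this.numBits = numBits;
      this.numHashes = numHashes;
    }

    // Derive the k bit positions from one 64-bit hash via double hashing.
    private int index(long hash, int i) {
      int h1 = (int) hash;
      int h2 = (int) (hash >>> 32) | 1; // force odd to avoid degenerate stepping
      return Math.floorMod(h1 + i * h2, numBits);
    }

    void insert(long hash) {
      for (int i = 0; i < numHashes; i++) {
        bits.set(index(hash, i));
      }
    }

    boolean mightContain(long hash) {
      for (int i = 0; i < numHashes; i++) {
        if (!bits.get(index(hash, i))) {
          return false;
        }
      }
      return true;
    }
  }

  public static void main(String[] args) {
    // Log the seed so a failing run is reproducible, as the review suggests.
    long seed = System.nanoTime();
    System.out.println("random seed = " + seed);
    Random random = new Random(seed);

    SimpleBloomFilter filter = new SimpleBloomFilter(1 << 20, 8);
    long[] hashes = new long[100_000];
    for (int i = 0; i < hashes.length; i++) {
      hashes[i] = random.nextLong();
      filter.insert(hashes[i]);
    }

    // Every inserted hash must be reported present: false negatives are forbidden.
    for (long h : hashes) {
      if (!filter.mightContain(h)) {
        throw new AssertionError("false negative (seed=" + seed + ")");
      }
    }
    System.out.println("no false negatives among " + hashes.length + " values");
  }
}
```

Because membership checks probe exactly the bits that insertion set, any false negative would indicate a bug in the index derivation, which is precisely what this style of test is meant to catch.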

@@ -53,6 +58,9 @@
private long rowsWrittenSoFar = 0;
private int pageRowCount;

private BloomFilterWriter bloomFilterWriter;
private BloomFilter bloomFilter;
Contributor:

Having these variables final would help JIT to optimize out the bloomFilter != null parts.

Contributor Author:

A final variable needs to be initialized, but we still have some constructors that do not initialize these variables.

Contributor:

As far as I can see, bloomFilter is either initialized in a constructor (lines 100 and 104) or null, and similarly for bloomFilterWriter. So you can declare them final; you only need to initialize them to null in the constructors and code paths where no non-null value is assigned.

Contributor Author:

Done.

@chenjunjiedada (Contributor, Author)

@gszadovszky, I'm not sure I understood the test you mentioned correctly; the added unit test covers correctness for the supported hash types and contains one million random values for each type.

@gszadovszky (Contributor) left a comment

The test you've written addresses my concerns. Thanks a lot.
Added a couple of comments. I'll approve after they are fixed.

@gszadovszky (Contributor) left a comment

One nit-pick and a still-open discussion.

@gszadovszky (Contributor) left a comment

Please, fix the compilation failure.

@@ -56,6 +60,7 @@
public static final int DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH = 64;
public static final int DEFAULT_STATISTICS_TRUNCATE_LENGTH = Integer.MAX_VALUE;
public static final int DEFAULT_PAGE_ROW_COUNT_LIMIT = 20_000;
public static final int DEFAULT_MAX_BLOOM_FILTER_BYTES = 1024 * 1024;
Contributor:

@chenjunjiedada
I'm curious about the maximum default value. Could you please explain why you choose 1 MB?

Contributor Author:

Assume we have a row group with only one column of UUIDs (36 bytes each); according to the formula, with FPP = 0.01 we would need about 4 MB. I expect we will have more columns in real scenarios.
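The arithmetic behind this estimate can be reproduced with the classic Bloom filter sizing formula, m = -n · ln(p) / (ln 2)², where n is the number of distinct values and p the target false-positive probability. The 128 MB row group size below is an illustrative assumption, not a value taken from this PR:

```java
public class BloomFilterSizing {

  // Classic sizing formula: m = -n * ln(p) / (ln 2)^2 bits for n distinct
  // values at false-positive probability p.
  static long optimalNumOfBits(long ndv, double fpp) {
    return (long) Math.ceil(-ndv * Math.log(fpp) / (Math.log(2) * Math.log(2)));
  }

  public static void main(String[] args) {
    // Assumed 128 MB row group holding a single column of 36-byte UUID strings;
    // the row-group size is an assumption for illustration only.
    long rowGroupBytes = 128L * 1024 * 1024;
    long ndv = rowGroupBytes / 36; // ~3.7 million distinct values
    long bits = optimalNumOfBits(ndv, 0.01);
    System.out.printf("ndv = %d, filter size ~ %.1f MB%n",
        ndv, bits / 8.0 / (1024 * 1024)); // roughly 4 MB, matching the comment
  }
}
```

At FPP = 0.01 the formula works out to roughly 9.6 bits per distinct value, so a few million UUIDs do indeed land in the 4 MB range, which makes the 1 MB default a deliberately conservative cap.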

Contributor:

@chenjunjiedada Thanks for the clarification!

@gszadovszky (Contributor)

@chenjunjiedada, your branch was conflicting because my change (#754) went in. So, I went ahead and resolved the conflicts. Please check if it is fine for you.

@chenjunjiedada (Contributor, Author)

@gszadovszky, thanks, it looks good to me.

@gszadovszky gszadovszky merged commit 806037c into apache:master Feb 26, 2020
@chenjunjiedada (Contributor, Author)

@gszadovszky, thanks a lot for the reviews!

shangxinli pushed a commit to shangxinli/parquet-mr that referenced this pull request Mar 1, 2020
* PARQUET-1328: Add Bloom filter reader and writer (apache#587)
* PARQUET-1516: Store Bloom filters near to footer (apache#608)
* PARQUET-1391: Integrate Bloom filter logic (apache#619)
* PARQUET-1660: align Bloom filter implementation with format (apache#686)
@shannonwells

shannonwells commented Apr 16, 2021

@chenjunjiedada I'm interested in how you arrived at the formula for the optimal number of bits. Can you please elaborate on this? After reading the referenced paper on it (http://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf) I'm unclear as to which equation you used from that paper or if you used another one. We're attempting to implement this algorithm in a different language. Thank you.

@jbapple

jbapple commented Apr 18, 2021

@shannonwells If you use equation 3 and fix the block size as 256 bits and the number of inner hash functions as 8, you'll be able to generate something akin to figure 1. You can then compare the FPP you calculated with the minimum FPP for static filters.
