PARQUET-41: Add bloom filters to parquet statistics #215

winningsix · 2015-06-17T05:51:01Z

This is the PR for the parquet-mr part of the change.

winningsix · 2015-06-17T05:51:51Z

@spena @rdblue Could you help me review this patch? Thank you!

spena · 2015-06-17T17:03:25Z

parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterV1.java

@@ -57,6 +58,7 @@
  private ValuesWriter dataColumn;
  private int valueCount;
  private int valueCountForNextSizeCheck;
+  private BloomFilterOpts opts;


Could we rename 'opts' to 'bloomFilterOpts' to improve readability throughout the code?

spena · 2015-06-17T17:24:09Z

It's looking good, Ferd.

Here are some questions I have.

Should we use fall back for bloom filters in case the bloom is not good for the row group? Dictionary encoding does this.
If a value is found on the dictionary, is there a way to skip the bloom hashing for better write perf? And add the values to the bloom in case they are fallen back?
Is there a way to calculate the # of expected entries instead of asking the user to pass a value?

julienledem · 2015-06-17T17:26:46Z

parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java

+      org.apache.parquet.column.statistics.Statistics statistics) {
+    if (!(statistics instanceof BloomFilterStatistics)) {
+      return;
+    }


I don't follow why we need this check, or why the method signature is defined this way.
Can't we just convert from one type to the other?

During conversion, the method constructs the statistics object used by parquet-format and populates it with the data retrieved from the parquet-mr statistics.

winningsix · 2015-06-23T07:05:23Z

Hi @spena
Please see my inline comments below. Thank you! (Sorry for the delay; I am on holiday :<)

Should we use fall back for bloom filters in case the bloom is not good for the row group? Dictionary encoding does this.

I haven't added support for fallback at this point. If it's really useful, I think we could do it in a follow-up ticket.
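
For illustration only (not code from this patch): one way a fallback check could work is to estimate the filter's false positive rate from the fraction of bits that are set, and drop the filter for the row group once that estimate exceeds a limit. The class name, method names, and the `maxFalsePositiveRate` parameter below are hypothetical.

```java
import java.util.BitSet;

/** Hypothetical fallback policy; assumes the filter's bits live in a java.util.BitSet. */
final class BloomFallbackPolicy {
    private BloomFallbackPolicy() {}

    /**
     * Estimates the false positive rate as (fraction of set bits) ^ k,
     * where k is the number of hash functions.
     */
    static double estimatedFalsePositiveRate(BitSet bits, int numBits, int numHashFunctions) {
        if (numBits <= 0 || numHashFunctions <= 0) {
            throw new IllegalArgumentException("numBits and numHashFunctions must be positive");
        }
        double fillRatio = (double) bits.cardinality() / numBits;
        return Math.pow(fillRatio, numHashFunctions);
    }

    /** Returns true when the filter is too saturated to be worth writing. */
    static boolean shouldFallBack(BitSet bits, int numBits, int numHashFunctions,
                                  double maxFalsePositiveRate) {
        return estimatedFalsePositiveRate(bits, numBits, numHashFunctions) > maxFalsePositiveRate;
    }
}
```

A writer could run such a check when a row group is flushed and simply omit the bloom filter from the statistics if it returns true, much as dictionary encoding falls back when the dictionary stops paying off.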

If a value is found on the dictionary, is there a way to skip the bloom hashing for better write perf? And add the values to the bloom in case they are fallen back?

The bloom filter is used to filter out an entire row group, in the same way as min/max statistics. I am not very familiar with dictionary encoding in Parquet, but I think the bloom filter should be applied before dictionary encoding.
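
To make the row-group filtering idea concrete, here is a minimal sketch of a bloom filter over `long` values and a skip check built on it. It is a simplified, hypothetical class, not the `BloomFilterStatistics` implementation from this patch; the hashing scheme (one 64-bit mix split into two halves) is an assumption chosen for brevity.

```java
import java.util.BitSet;

/** Hypothetical, simplified bloom filter; not the class added by this patch. */
final class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashFunctions;

    SimpleBloomFilter(int numBits, int numHashFunctions) {
        if (numBits <= 0 || numHashFunctions <= 0) {
            throw new IllegalArgumentException("numBits and numHashFunctions must be positive");
        }
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashFunctions = numHashFunctions;
    }

    /** Sets one bit per hash function for the given value. */
    void add(long value) {
        long hash = mix(value);
        int h1 = (int) hash;
        int h2 = (int) (hash >>> 32);
        for (int i = 0; i < numHashFunctions; i++) {
            bits.set(Math.floorMod(h1 + i * h2, numBits));
        }
    }

    /** False means the value was definitely never added; true means it may have been. */
    boolean mightContain(long value) {
        long hash = mix(value);
        int h1 = (int) hash;
        int h2 = (int) (hash >>> 32);
        for (int i = 0; i < numHashFunctions; i++) {
            if (!bits.get(Math.floorMod(h1 + i * h2, numBits))) {
                return false;
            }
        }
        return true;
    }

    /** A row group can be skipped for an equality predicate only on a definite miss. */
    static boolean canSkipRowGroup(SimpleBloomFilter rowGroupFilter, long searchedValue) {
        return !rowGroupFilter.mightContain(searchedValue);
    }

    /** 64-bit mixing step using the MurmurHash3 finalizer constants. */
    private static long mix(long x) {
        x ^= x >>> 33;
        x *= 0xff51afd7ed558ccdL;
        x ^= x >>> 33;
        x *= 0xc4ceb9fe1a85ec53L;
        x ^= x >>> 33;
        return x;
    }
}
```

Unlike min/max statistics, the filter can rule out a value that lies inside the row group's min/max range, but it only helps equality-style predicates and can never prove that a value is present.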

Is there a way to calculate the # of expected entries instead of asking the user to pass a value?

I tried to think of a way to calculate it but didn't come up with a good one. In any case, I think nobody understands the data better than the person who uses it.
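
For reference, once the expected number of entries n and a target false positive rate p are known, the standard bloom filter sizing formulas give the bit count m = -n ln(p) / (ln 2)^2 and the hash function count k = (m / n) ln 2. The sketch below applies them; the class and method names are hypothetical and are not part of `BloomFilterOpts`.

```java
/** Hypothetical sizing helper based on the standard bloom filter formulas. */
final class BloomFilterSizing {
    private BloomFilterSizing() {}

    /** Optimal number of bits: m = ceil(-n * ln(p) / (ln 2)^2). */
    static long optimalNumBits(long expectedEntries, double falsePositiveRate) {
        if (expectedEntries <= 0) {
            throw new IllegalArgumentException("expectedEntries must be positive");
        }
        if (!(falsePositiveRate > 0.0 && falsePositiveRate < 1.0)) {
            throw new IllegalArgumentException("falsePositiveRate must be in (0, 1)");
        }
        double ln2 = Math.log(2.0);
        return (long) Math.ceil(-expectedEntries * Math.log(falsePositiveRate) / (ln2 * ln2));
    }

    /** Optimal number of hash functions: k = round((m / n) * ln 2), at least 1. */
    static int optimalNumHashFunctions(long expectedEntries, long numBits) {
        if (expectedEntries <= 0 || numBits <= 0) {
            throw new IllegalArgumentException("expectedEntries and numBits must be positive");
        }
        double k = (double) numBits / expectedEntries * Math.log(2.0);
        return Math.max(1, (int) Math.round(k));
    }
}
```

For example, 1,000 expected entries at a 1% false positive rate work out to roughly 9,600 bits and 7 hash functions. The formulas still need n as an input, so they do not remove the need for a user-supplied or estimated entry count.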

PARQUET-41: Update patch addressing comments
Parquet-41: Adding other data types support and enable Unit tests
Change the bitset from arraylist to array
Add statistics option and enable tests for bloom filter
Fix failed unit tests
Remove the page level bloom filter bit set
Rebase code

spena reviewed Jun 17, 2015

julienledem reviewed Jun 17, 2015

winningsix force-pushed the PARQUET-41 branch from 1720bac to 0399e4d on June 23, 2015 06:52

winningsix force-pushed the PARQUET-41 branch from 0399e4d to d93e243 on June 23, 2015 07:20

winningsix force-pushed the PARQUET-41 branch 2 times, most recently from a347660 to 4ba507c on November 12, 2015 08:42

winningsix mentioned this pull request Nov 12, 2015

PARQUET-319: Define the parquet bloom filter statistics in parquet format apache/parquet-format#28

Closed

winningsix force-pushed the PARQUET-41 branch from 5c220da to ac15839 on February 15, 2016 07:22

winningsix force-pushed the PARQUET-41 branch from ac15839 to 284b1b7 on March 28, 2016 08:05

Ferdinand Xu added 2 commits August 24, 2016 08:59

Fix some issues

8d3d4e4

winningsix force-pushed the PARQUET-41 branch from 284b1b7 to 8d3d4e4 on August 30, 2016 08:54

Enable BF in read path

4cef1a5

asfimport mentioned this pull request Jun 23, 2024

Add bloom filters to parquet statistics #1468

Closed

17 tasks
