PARQUET-41: add bloom filter #99

chenjunjiedada · 2018-06-20T00:08:13Z

This is the rebased Bloom filter PR for #62. The original PR contains many rebase commit messages, which may be confusing.

jbapple-cloudera · 2018-06-25T03:11:17Z

src/main/thrift/parquet.thrift

@@ -658,6 +658,74 @@ struct ColumnMetaData {
   * This information can be used to determine if all data pages are
   * dictionary encoded for example **/
  13: optional list<PageEncodingStats> encoding_stats;
+
+  /** Byte offset from beginning of file to bloom filter data. The bloom filters
+   * data of columns together is stored before the start of row group wihch describe.**/


I'm sorry, I'm having trouble understanding this sentence. Could you describe it in more detail and then we can work on a re-wording?

Also, "Bloom" is the last name of the inventor, so generally we should prefer to capitalize it, here and below.

The first sentence should read the same as the one for data_page_offset above. The second sentence says that the Bloom filter data is stored at the start of the row group.

jbapple-cloudera · 2018-06-25T03:13:43Z

src/main/thrift/parquet.thrift

+  14: optional i64 bloom_filter_offset;
+}
+
+/**


Some formatting nits: this file appears to usually format single-line prose comments without surrounding comment lines.

/** Like this. */

/**
 * Not like this.
 */

Also, empty structs have both opening and closing brace on one line.

jbapple-cloudera · 2018-06-25T03:20:15Z

src/main/thrift/parquet.thrift

+
+/**
+  * Definition of bloom filter algorithm.
+  * In order for farward compatibility, we use union to replace enum.


Paragraphs in this file are denoted by blank lines. Adjacent lines without a blank line between them are wrapped. So:

* This is a sentence. This is another sentence

or

* This is a sentence.
*
* This is another sentence.

but never

* This is a sentence.
* This is another sentence.

Please check this here and below.

Also, "forward", not "farward".

jbapple-cloudera · 2018-06-25T03:20:54Z

src/main/thrift/parquet.thrift

+}
+
+/**
+  * Block based algorithm type annotation.


Here and below, when using as an adjective, hyphenate "block-based":

https://www.grammarbook.com/punctuation/hyphens.asp

Generally, hyphenate two or more words when they come before a noun they modify and act as a single idea. This is called a compound adjective.

jbapple-cloudera · 2018-06-25T03:30:02Z

src/main/thrift/parquet.thrift

+   * The bloom filter bitset is separated into tiny bucket as tiny bloom
+   * filter, the high 32 bits hash value is used to select bucket, and
+   * lower 32 bits hash values are used to set bits in tiny bloom filter.
+   * See “Cache-, Hash- and Space-Efficient Bloom Filters”. Specifically,


You used smart quotes here, not plain double-quotes.

jbapple-cloudera · 2018-06-25T03:40:40Z

src/main/thrift/parquet.thrift

+/**
+  * Block based algorithm type annotation.
+  */
+struct BlockAlgorithm {


This is a combination of the "blocked Bloom filters", described in "Cache-, Hash- and Space-Efficient Bloom Filters", and a variant mentioned in, though not invented in, "Network Applications of Bloom Filters: A Survey". The latter are sometimes called "split Bloom filters".

Maybe struct SplitBlockBloomAlgorithm {}?
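To make the "split block" idea concrete, here is a minimal Python sketch of the scheme the comment in the diff describes: the high 32 bits of a 64-bit hash select a block, and the low 32 bits set one bit in each word of that block. The block size (eight 32-bit words), the salt constants, the modulo-based block selection, and the use of BLAKE2b as a stand-in 64-bit hash are all assumptions made for illustration; none of them is taken from this PR.

```python
import hashlib

# Assumption: each block is eight 32-bit words (256 bits).
WORDS_PER_BLOCK = 8

# Assumption: illustrative odd 32-bit multipliers, one per word in a block.
SALTS = (
    0x47B6137B, 0x44974D91, 0x8824AD5B, 0xA2B7289D,
    0x705495C7, 0x2DF1424B, 0x9EFC4947, 0x5C6BFB31,
)


def hash64(value: bytes) -> int:
    """Stand-in 64-bit hash (BLAKE2b truncated to 8 bytes), not Murmur3."""
    return int.from_bytes(hashlib.blake2b(value, digest_size=8).digest(), "little")


class SplitBlockBloomFilter:
    """Hypothetical split block Bloom filter, for illustration only."""

    def __init__(self, num_blocks: int) -> None:
        if num_blocks <= 0:
            raise ValueError("num_blocks must be positive")
        self.num_blocks = num_blocks
        self.blocks = [[0] * WORDS_PER_BLOCK for _ in range(num_blocks)]

    def _block_index(self, h: int) -> int:
        # The high 32 bits of the hash select the block.
        return (h >> 32) % self.num_blocks

    def _mask(self, h: int) -> list:
        # The low 32 bits pick exactly one bit in each word of the block:
        # multiply by a salt, keep 32 bits, and use the top 5 bits as the
        # bit position (0..31).
        low = h & 0xFFFFFFFF
        return [1 << (((low * salt) & 0xFFFFFFFF) >> 27) for salt in SALTS]

    def insert(self, value: bytes) -> None:
        h = hash64(value)
        block = self.blocks[self._block_index(h)]
        for i, bit in enumerate(self._mask(h)):
            block[i] |= bit

    def might_contain(self, value: bytes) -> bool:
        h = hash64(value)
        block = self.blocks[self._block_index(h)]
        return all(block[i] & bit for i, bit in enumerate(self._mask(h)))
```

Because each lookup touches a single block, a membership test reads only one small, contiguous region of the bitset. As with any Bloom filter, `might_contain` can return false positives but never false negatives.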

jbapple-cloudera · 2018-06-25T04:07:13Z

src/main/thrift/parquet.thrift

+  * Bloom filter header is stored at beginning of bloom filter data of each column 
+  * and followed by its bitset.
+  */
+struct BloomFilterHeader {


Does this also need a mention in PageType and PageHeader?

Any thoughts on this?

Thanks for the reminder.

It depends on how we treat the Bloom filter data. It is natural if we abstract the Bloom filter data as a specific page type. Previously, I didn't read or write it as a page, since the PageReader/PageWriter APIs in parquet-mr look specific to data and dictionary pages. That said, I think it would be easy to integrate if we want to do this.
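As a rough illustration of the layout under discussion (a header at the start of each column's Bloom filter data, followed by the bitset, located through a byte offset like bloom_filter_offset), here is a simplified Python sketch. It is not the Thrift encoding: the header is reduced to a single 4-byte little-endian length, and the function names are hypothetical.

```python
import io
import struct


def write_bloom_filter(out, bitset: bytes) -> int:
    """Write a simplified header (bitset size) followed by the bitset.

    Returns the byte offset at which the header starts, which is the kind
    of value a field like bloom_filter_offset would record.
    """
    offset = out.tell()
    # Assumption: a 4-byte little-endian length stands in for the real header.
    out.write(struct.pack("<i", len(bitset)))
    out.write(bitset)
    return offset


def read_bloom_filter(src, offset: int) -> bytes:
    """Seek to the recorded offset, read the header, then read the bitset."""
    src.seek(offset)
    (num_bytes,) = struct.unpack("<i", src.read(4))
    bitset = src.read(num_bytes)
    if len(bitset) != num_bytes:
        raise ValueError("truncated Bloom filter bitset")
    return bitset
```

A reader that treats this region as its own page type would follow the same pattern: locate the header by offset, decode it, then read exactly as many bitset bytes as the header announces.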

jbapple-cloudera · 2018-06-25T04:09:08Z

src/main/thrift/parquet.thrift

+}
+
+/** 
+ * Definition for hash function used to compute hash of column value.


I would reword this as "The hash function used in the Bloom filter. This function takes the hash of a column value using the plain encoding."

You can leave off the bit about union vs. enum, both here and above.

jbapple-cloudera · 2018-06-25T04:12:36Z

src/main/thrift/parquet.thrift

+/**
+  * Hash strategy type annotation.
+  */
+struct Murmur3 {


If you want to be specific, as below, you can say "MurmurHash3_x64_128 from the original SMHasher repo by Austin Appleby" and then remove the detailed comment below explaining more. The restriction of that output by first taking some 64 bits and then using the low bits of that to index into the blocks need not be specified here, I suppose.

jbapple-cloudera · 2018-06-25T04:14:12Z

src/main/thrift/parquet.thrift

+  * and followed by its bitset.
+  */
+struct BloomFilterHeader {
+  /** The size of bitset in bytes, must be a power of 2**/


That it must be a power of 2 might depend on the algorithm, so I'd omit this.
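To show why a power-of-two requirement is a property of one algorithm and not of the header, here is a small Python sketch. A power-of-two block count lets an implementation pick a block with a bit mask, while any other positive count works with a modulo. The function names are hypothetical, and neither strategy is claimed to be what this PR implements.

```python
def is_power_of_two(n: int) -> bool:
    """True when n is a positive integer with exactly one bit set."""
    return n > 0 and (n & (n - 1)) == 0


def block_index(hash_high: int, num_blocks: int) -> int:
    """Map the high hash bits to a block index for any positive block count."""
    if num_blocks <= 0:
        raise ValueError("num_blocks must be positive")
    if is_power_of_two(num_blocks):
        # Masking is equivalent to the modulo when num_blocks is a power of two.
        return hash_high & (num_blocks - 1)
    return hash_high % num_blocks
```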

aniket486 · 2018-07-20T03:15:14Z

src/main/thrift/parquet.thrift

+/** The algorithm used in Bloom filter. **/
+union BloomFilterAlgorithm {
+  /** Block-based Bloom filter. **/
+   1: SplitBlockAlgorithm BLOCK;


indentation -1

aniket486 · 2018-07-20T03:16:45Z

src/main/thrift/parquet.thrift

+}
+
+/** Hash strategy type annotation. It uses Murmur3Hash_x64_128 from the original SMHasher
+ * repo by Austin Appleby.


I don't think we should depend on an external codebase for this implementation (as it is subject to change). Maybe document the actual implementation details here instead of citing the reference.

Thanks. We are citing a well-known hash that is widely used in Hive, ORC, Guava, and others. It is just like the codec definitions above, for which we don't have an implementation detail document either. I also don't see an implementation document for it in the Hive, ORC, or Guava projects.

chenjunjiedada · 2018-09-27T11:36:29Z

@zivanfi, could you please help create a feature branch for this?

zivanfi · 2018-09-27T12:03:42Z

Sure, I have created parquet-format/bloom-filter and parquet-mr/bloom-filter for you.

Please open new pull requests on these branches instead of the existing ones on the master branches. Thanks!

chenjunjiedada · 2018-09-27T15:15:20Z

Thanks a lot, @zivanfi

chenjunjiedada · 2018-10-03T07:44:05Z

Moved to #112; closing this one.

jbapple-cloudera suggested changes Jun 25, 2018

Chen, Junjie added 2 commits July 9, 2018 20:28

PARQUET-41: add bloom filter

691cb88

PARQUET-41: update comments

cc8985b

chenjunjiedada force-pushed the parquet-41-rebase branch from 2c17e6d to cc8985b Compare July 9, 2018 12:29

PARQUET-41: Add Bloom filter to PageType and PageHeader

4948425

chenjunjiedada force-pushed the parquet-41-rebase branch from 66c9e9f to 4948425 Compare July 12, 2018 09:40

aniket486 reviewed Jul 20, 2018

PARQUET-41: fix indentation

71d5b93

chenjunjiedada closed this Oct 3, 2018

asfimport mentioned this pull request Jun 23, 2024

Add bloom filters to parquet statistics apache/parquet-java#1468

Closed

17 tasks
