Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-41: add bloom filter #99

Closed

Conversation

chenjunjiedada
Copy link
Contributor

This is rebased bloom filter PR for #62. The original PR contains a lot of rebasing commit message may be confused.

@@ -658,6 +658,74 @@ struct ColumnMetaData {
* This information can be used to determine if all data pages are
* dictionary encoded for example **/
13: optional list<PageEncodingStats> encoding_stats;

/** Byte offset from beginning of file to bloom filter data. The bloom filters
* data of columns together is stored before the start of row group wihch describe.**/

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry, I'm having trouble understanding this sentence. Could you describe it in more detail and then we can work on a re-wording?

Also, "Bloom" is the last name of the inventor, so generally we should prefer to capitalize it, here and below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first sentence should be same as data_page_offset above. The second sentence says the Bloom filter data will be stored at the start position of row group.

14: optional i64 bloom_filter_offset;
}

/**

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some formatting nits: this file appears to usually format single-line prose comments without surrounding comment lines.

/** Like this. */

/**
 * Not like this.
 */ 

Also, empty structs have both opening and closing brace on one line.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


/**
* Definition of bloom filter algorithm.
* In order for farward compatibility, we use union to replace enum.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Paragraphs in this file are denoted by blank lines. Adjacent lines without a blank line between them are wrapped. So:

 * This is a sentence. This is another sentence

or

 * This is a sentence.
 *
 * This is another sentence.

but never

 * This is a sentence.
 * This is another sentence.

Please check this here and below.

Also, "forward", not "farward".

}

/**
* Block based algorithm type annotation.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and below, when using as an adjective, hyphenate "block-based":

https://www.grammarbook.com/punctuation/hyphens.asp

Generally, hyphenate two or more words when they come before a noun they modify and act as a single idea. This is called a compound adjective.

* The bloom filter bitset is separated into tiny bucket as tiny bloom
* filter, the high 32 bits hash value is used to select bucket, and
* lower 32 bits hash values are used to set bits in tiny bloom filter.
* See “Cache-, Hash- and Space-Efficient Bloom Filters”. Specifically,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You used smart quotes here, not plain double-quotes.

/**
* Block based algorithm type annotation.
*/
struct BlockAlgorithm {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a combination of the "blocked Bloom filters", described in "Cache-, Hash- and Space-Efficient Bloom Filters", and a variant mentioned in, though not invented in, "Network Applications of Bloom Filters: A Survey". The latter are sometimes called "split Bloom filters".

Maybe struct SplitBlockBloomAlgorithm {}?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

* Bloom filter header is stored at beginning of bloom filter data of each column
* and followed by its bitset.
*/
struct BloomFilterHeader {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this also need a mention in PageType and PageHeader?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any thoughts on this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks to remind again.

It depends on how we treat Bloom filter data, it is natural if we abstract Bloom filter data as a specific page. Previously, I didn't read or write it as page since the PageReader/PageWriter APIs in parquet-mr looks specific to data and dictionary page. While I think it is easy to integrate if we want to do this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}

/**
* Definition for hash function used to compute hash of column value.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would reword this as "The hash function used in the Bloom filter. This function takes the hash of a column value using the plain encoding."

You can leave off the bit about union vs. enum, both here and above.

/**
* Hash strategy type annotation.
*/
struct Murmur3 {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to be specific, as below, you can say "MurmurHash3_x64_128 from the original SMHasher repo by Austin Appleby" and then remove the detailed comment below explaining more. The restriction of that output by first taking some 64 bits and then using the low bits of that to index into the blocks need not be specified here, I suppose.

* and followed by its bitset.
*/
struct BloomFilterHeader {
/** The size of bitset in bytes, must be a power of 2**/

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That it must be a power of 2 might depend on the algorithm, so I'd omit this.

/** The algorithm used in Bloom filter. **/
union BloomFilterAlgorithm {
/** Block-based Bloom filter. **/
1: SplitBlockAlgorithm BLOCK;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

identation -1

}

/** Hash strategy type annotation. It uses Murmur3Hash_x64_128 from the original SMHasher
* repo by Austin Appleby.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should depend on a external codebase for this implementation (as it is subject to change). May be document the actual implementation details here instead of citing the reference.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, we are citing a kind of well-known hash which is widely used in hive, orc, guava and etc.. It just like codec definition above we don't have implementation detail document. In the project hive, orc, and guava, I don't see they have implementation document.

@chenjunjiedada
Copy link
Contributor Author

@zivanfi , Could you please help to create a feature branch for this?

@zivanfi
Copy link
Contributor

zivanfi commented Sep 27, 2018

Sure, I have created parquet-format/bloom-filter and parquet-mr/bloom-filter for you.

Please open new pull requests on these branches instead of the existing ones on the master branches. Thanks!

@chenjunjiedada
Copy link
Contributor Author

Thanks a lot, @zivanfi

@chenjunjiedada
Copy link
Contributor Author

move to #112 and close this one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants