PARQUET-41: Add Bloom filter (#112)
* PARQUET-41: Add Bloom filter

* Grammar and structure tweaking for Bloom filter prose.
Chen, Junjie authored and majetideepak committed Oct 12, 2018
1 parent b475182 commit 54839ad
Showing 2 changed files with 157 additions and 0 deletions.
120 changes: 120 additions & 0 deletions BloomFilter.md
@@ -0,0 +1,120 @@
<!--
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements. See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership. The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied. See the License for the
- specific language governing permissions and limitations
- under the License.
-->

Parquet Bloom Filter
===
### Problem statement
In their current format, column statistics and dictionaries can be used for predicate
pushdown. Statistics include minimum and maximum value, which can be used to filter out
values not in the range. Dictionaries are more specific, and readers can filter out values
that are between min and max but not in the dictionary. However, when there are too many
distinct values, writers sometimes choose not to add dictionaries because of the extra
space they occupy. This leaves columns with large cardinalities and widely separated min
and max without support for predicate pushdown.

A Bloom filter[1] is a compact data structure that overapproximates a set. It can respond
to membership queries with either "definitely no" or "probably yes", where the probability
of false positives is configured when the filter is initialized. Bloom filters do not have
false negatives.

Because Bloom filters are small compared to dictionaries, they can be used for predicate
pushdown even in columns with high cardinality and when space is at a premium.

### Goal
* Enable predicate pushdown for high-cardinality columns while using less space than
dictionaries.

* Induce no additional I/O overhead when executing queries on columns without Bloom
filters attached or when executing non-selective queries.

### Technical Approach
The initial Bloom filter algorithm in Parquet is implemented using a combination of two
Bloom filter techniques.

First, the block Bloom filter algorithm from Putze et al.'s "Cache-, Hash- and
Space-Efficient Bloom filters"[2] is used. This divides a filter into many tiny Bloom
filters, each one of which is called a "block". In Parquet's initial implementation, each
block is 256 bits. When inserting or finding a value, part of the hash of that value is
used to index into the array of blocks and pick a single one. This single block is then
used for the remaining part of the operation.

Second, within each block, this implementation uses the folklore split Bloom filter
technique, as described in section 2.1 of "Network Applications of Bloom Filters: A
Survey"[5]. This divides the 256 bits in each block up into eight contiguous 32-bit lanes
and sets or checks one bit in each lane.

#### Algorithm
In the initial algorithm, the most significant 32 bits of the hash value are used as the
index to select a block from the bitset. The lower 32 bits of the hash value, along with
eight constant salt values, are used to compute the bit to set in each lane of the
block. The salt and the lower 32 bits are combined using the multiply-shift[3] hash
function:

```c
// Eight salt values used to compute the bit pattern.
static const uint32_t SALT[8] = {0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU,
                                 0x705495c7U, 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U};

// key:  the lower 32 bits of the hash result
// mask: the output bit pattern for a tiny (256-bit) Bloom filter block
void Mask(uint32_t key, uint32_t mask[8]) {
  // Multiply the key by a different odd constant for each lane.
  for (int i = 0; i < 8; ++i) {
    mask[i] = key * SALT[i];
  }
  // Keep the top 5 bits of each product, yielding a bit index in [0, 32).
  for (int i = 0; i < 8; ++i) {
    mask[i] = mask[i] >> 27;
  }
  // Turn each index into a one-hot mask for its 32-bit lane.
  for (int i = 0; i < 8; ++i) {
    mask[i] = UINT32_C(1) << mask[i];
  }
}

```
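To make the two-level scheme concrete, the following is a hedged sketch (the `Block`, `BlockInsert`, and `BlockCheck` names are illustrative, not part of the format) of how a 64-bit hash drives an insert and a membership check. The multiply-shift mask logic from `Mask` above is repeated inline so the sketch is self-contained, and modulo is shown for clarity; implementations typically size the filter as a power of two so the modulo reduces to a bit mask.

```c
#include <stdint.h>

// Salt values as defined in the Mask() example above.
static const uint32_t SALT[8] = {0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU,
                                 0x705495c7U, 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U};

typedef struct { uint32_t lanes[8]; } Block;  // one 256-bit block

// The upper 32 bits of the hash select a block; the lower 32 bits set one
// bit in each of the eight 32-bit lanes of that block.
static void BlockInsert(Block *blocks, uint32_t num_blocks, uint64_t hash) {
  Block *b = &blocks[(uint32_t)(hash >> 32) % num_blocks];
  uint32_t key = (uint32_t)hash;
  for (int i = 0; i < 8; ++i) {
    b->lanes[i] |= UINT32_C(1) << ((key * SALT[i]) >> 27);
  }
}

// Returns 0 ("definitely not present") if any probed bit is clear,
// otherwise 1 ("probably present").
static int BlockCheck(const Block *blocks, uint32_t num_blocks, uint64_t hash) {
  const Block *b = &blocks[(uint32_t)(hash >> 32) % num_blocks];
  uint32_t key = (uint32_t)hash;
  for (int i = 0; i < 8; ++i) {
    if ((b->lanes[i] & (UINT32_C(1) << ((key * SALT[i]) >> 27))) == 0) return 0;
  }
  return 1;
}
```

Because only one 256-bit block is touched per operation, each lookup costs at most one cache line, which is the point of the blocked design.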

#### Hash Function
The function used to hash values in the initial implementation is MurmurHash3[4], using
the least-significant 64 bits of the 128-bit version of the function on the x86-64
platform. Note that the function produces different values on different architectures,
so implementors must be careful to use the version specific to x86-64. That version can
be emulated on other platforms without difficulty.

toddlipcon (Apr 1, 2019):

Why choose MurmurHash3? Based on some of the benchmarks on http://fastcompression.blogspot.com/2019_03_10_archive.html it seems like it's not a particularly good choice for short string hashing. xxh3 seems a decent choice, or potentially crc32c (with hardware support on x86) even better, though it requires a bit of "mixing" to avoid bias in the LSB.

chenjunjiedada, Contributor (Apr 2, 2019):

Some hash functions may perform better than MurmurHash3 in certain cases, which is why BloomFilterHash is designed as an enumeration for extensibility: a new hash strategy can be added by defining a new value in BloomFilterHash and implementing it in the parquet project.

#### Building a Bloom filter
The fact that exactly eight bits are checked during each lookup means that these filters
are most space efficient when used with an expected false positive rate of about
0.5%. This is achieved when there are about 11.54 bits for every distinct value inserted
into the filter.

To calculate how large the filter should be for a different false positive rate `p`, use
the following formula; its output is in bits per distinct element:

```c
-8 / log(1 - pow(p, 1.0 / 8));
```

#### File Format
The Bloom filter data of a column is stored at the beginning of its column chunk in the
row group. The column chunk metadata contains the Bloom filter offset. The Bloom filter
is stored with a header containing the size of the filter in bytes, the algorithm, and
the hash function.

toddlipcon (Apr 1, 2019):

Is this position mandated? Storing it closer to the row group footer means it's more
likely to be readable in a single I/O along with the footer, making pruning more
efficient (vs. doing a second random read for each column to fetch the Bloom filter).

chenjunjiedada, Contributor (Apr 2, 2019):

I have an implementation in the bloom-filter branch that stores the Bloom filter binary
at the end of the Parquet file; see https://github.com/apache/parquet-mr/tree/bloom-filter.
This format document will be updated a bit when that gets merged.

toddlipcon (Apr 1, 2019):

Should we mandate that the Bloom filter be sized as a power of two so that the modulo
can be implemented by masking?

chenjunjiedada, Contributor (Apr 2, 2019):

Yes, I agree, and the implementation in that branch does force the size to a power of
two. Here we only describe the basic definition of a Bloom filter; it would be better
to add some advice for reference, and I will add some in the next update.

### References
1. [Bloom filter on Wikipedia](https://en.wikipedia.org/wiki/Bloom_filter)
2. [Cache-, Hash- and Space-Efficient Bloom Filters](http://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf)
3. [A Reliable Randomized Algorithm for the Closest-Pair Problem](http://www.diku.dk/~jyrki/Paper/CP-11.4.1997.ps)
4. [MurmurHash on Wikipedia](https://en.wikipedia.org/wiki/MurmurHash)
5. [Network Applications of Bloom Filters: A Survey](https://www.eecs.harvard.edu/~michaelm/postscripts/im2005b.pdf)
37 changes: 37 additions & 0 deletions src/main/thrift/parquet.thrift
@@ -475,6 +475,7 @@ enum PageType {
INDEX_PAGE = 1;
DICTIONARY_PAGE = 2;
DATA_PAGE_V2 = 3;
BLOOM_FILTER_PAGE = 4;
}

/**
@@ -554,6 +555,38 @@ struct DataPageHeaderV2 {
8: optional Statistics statistics;
}

/** Block-based algorithm type annotation. **/
struct SplitBlockAlgorithm {}
/** The algorithm used in the Bloom filter. **/
union BloomFilterAlgorithm {
/** Block-based Bloom filter. **/
1: SplitBlockAlgorithm BLOCK;
}
/** Hash strategy type annotation. It uses MurmurHash3_x64_128 from the original
 * SMHasher repo by Austin Appleby.
 **/
struct Murmur3 {}

zivanfi, Contributor (Oct 29, 2018):

I think this struct name is very cryptic and does not follow the naming convention we
typically use for structs that are used in unions. Could this be renamed to Murmur3Hash
or Murmur3HashStrategy before it gets released? Thanks!

chenjunjiedada, Contributor (Oct 29, 2018):

No problem.

/**
 * The hash function used in the Bloom filter. The hash is computed over the
 * plain encoding of the column value.
 **/
union BloomFilterHash {
/** Murmur3 Hash Strategy. **/
1: Murmur3 MURMUR3;
}
/**
 * The Bloom filter header is stored at the beginning of each column's Bloom
 * filter data and is followed by its bitset.
 **/
struct BloomFilterPageHeader {
/** The size of the bitset in bytes. **/
1: required i32 numBytes;
/** The algorithm for setting bits. **/
2: required BloomFilterAlgorithm algorithm;
/** The hash function used for Bloom filter. **/
3: required BloomFilterHash hash;
}

struct PageHeader {
/** the type of the page: indicates which of the *_header fields is set **/
1: required PageType type
@@ -574,6 +607,7 @@
6: optional IndexPageHeader index_page_header;
7: optional DictionaryPageHeader dictionary_page_header;
8: optional DataPageHeaderV2 data_page_header_v2;
9: optional BloomFilterPageHeader bloom_filter_page_header;
}

/**
@@ -660,6 +694,9 @@ struct ColumnMetaData {
* This information can be used to determine if all data pages are
* dictionary encoded for example **/
13: optional list<PageEncodingStats> encoding_stats;

/** Byte offset from beginning of file to Bloom filter data. **/
14: optional i64 bloom_filter_offset;
}

struct ColumnChunk {
