-
Notifications
You must be signed in to change notification settings - Fork 423
Commit
* PARQUET-41: Add Bloom filter * Grammar and structure tweaking for Bloom filter prose.
- Loading branch information
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,120 @@ | ||
<!-- | ||
- Licensed to the Apache Software Foundation (ASF) under one | ||
- or more contributor license agreements. See the NOTICE file | ||
- distributed with this work for additional information | ||
- regarding copyright ownership. The ASF licenses this file | ||
- to you under the Apache License, Version 2.0 (the | ||
- "License"); you may not use this file except in compliance | ||
- with the License. You may obtain a copy of the License at | ||
- | ||
- http://www.apache.org/licenses/LICENSE-2.0 | ||
- | ||
- Unless required by applicable law or agreed to in writing, | ||
- software distributed under the License is distributed on an | ||
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
- KIND, either express or implied. See the License for the | ||
- specific language governing permissions and limitations | ||
- under the License. | ||
--> | ||
|
||
Parquet Bloom Filter | ||
=== | ||
### Problem statement | ||
In their current format, column statistics and dictionaries can be used for predicate | ||
pushdown. Statistics include minimum and maximum value, which can be used to filter out | ||
values not in the range. Dictionaries are more specific, and readers can filter out values | ||
that are between min and max but not in the dictionary. However, when there are too many | ||
distinct values, writers sometimes choose not to add dictionaries because of the extra | ||
space they occupy. This leaves columns with large cardinalities and widely separated min | ||
and max without support for predicate pushdown. | ||
|
||
A Bloom filter[1] is a compact data structure that overapproximates a set. It can respond | ||
to membership queries with either "definitely no" or "probably yes", where the probability | ||
of false positives is configured when the filter is initialized. Bloom filters do not have | ||
false negatives. | ||
|
||
Because Bloom filters are small compared to dictionaries, they can be used for predicate | ||
pushdown even in columns with high cardinality and when space is at a premium. | ||
|
||
### Goal | ||
* Enable predicate pushdown for high-cardinality columns while using less space than | ||
dictionaries. | ||
|
||
* Induce no additional I/O overhead when executing queries on columns without Bloom | ||
filters attached or when executing non-selective queries. | ||
|
||
### Technical Approach | ||
The initial Bloom filter algorithm in Parquet is implemented using a combination of two | ||
Bloom filter techniques. | ||
|
||
First, the block Bloom filter algorithm from Putze et al.'s "Cache-, Hash- and | ||
Space-Efficient Bloom filters"[2] is used. This divides a filter into many tiny Bloom | ||
filters, each one of which is called a "block". In Parquet's initial implementation, each | ||
block is 256 bits. When inserting or finding a value, part of the hash of that value is | ||
used to index into the array of blocks and pick a single one. This single block is then | ||
used for the remaining part of the operation. | ||
|
||
Second, within each block, this implementation uses the folklore split Bloom filter | ||
technique, as described in section 2.1 of "Network Applications of Bloom Filters: A | ||
Survey"[5]. This divides the 256 bits in each block up into eight contiguous 32-bit lanes | ||
and sets or checks one bit in each lane. | ||
|
||
#### Algorithm | ||
In the initial algorithm, the most significant 32 bits from the hash value are used as the | ||
index to select a block from bitset. The lower 32 bits of the hash value, along with eight | ||
constant salt values, are used to compute the bit to set in each lane of the block. The | ||
salt and lower 32 bits are combined using the multiply-shift[3] hash function: | ||
|
||
```c | ||
// 8 SALT values used to compute bit pattern | ||
static const uint32_t SALT[8] = {0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU, | ||
0x705495c7U, 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U}; | ||
|
||
// key: the lower 32 bits of hash result | ||
// mask: the output bit pattern for a tiny Bloom filter | ||
void Mask(uint32_t key, uint32_t mask[8]) { | ||
for (int i = 0; i < 8; ++i) { | ||
mask[i] = key * SALT[i]; | ||
} | ||
for (int i = 0; i < 8; ++i) { | ||
mask[i] = mask[i] >> 27; | ||
} | ||
for (int i = 0; i < 8; ++i) { | ||
mask[i] = UINT32_C(1) << mask[i]; | ||
} | ||
} | ||
|
||
``` | ||
|
||
#### Hash Function | ||
The function used to hash values in the initial implementation is MurmurHash3[4], using | ||
This comment has been minimized.
Sorry, something went wrong.
This comment has been minimized.
Sorry, something went wrong.
chenjunjiedada
Contributor
|
||
the least-significant 64 bits of the 128-bit version of the function on the x86-64 | ||
platform. Note that the function produces different values on different architectures, so | ||
implementors must be careful to use the version specific to x86-64. That function can be | ||
emulated on different platforms without difficulty. | ||
|
||
#### Build a Bloom filter | ||
The fact that exactly eight bits are checked during each lookup means that these filters | ||
are most space efficient when used with an expected false positive rate of about | ||
0.5%. This is achieved when there are about 11.54 bits for every distinct value inserted | ||
into the filter. | ||
|
||
To calculate the size the filter should be for another false positive rate `p`, use the | ||
following formula. The output is in bits per distinct element: | ||
|
||
```c | ||
-8 / log(1 - pow(p, 1.0 / 8)); | ||
``` | ||
|
||
#### File Format | ||
The Bloom filter data of a column is stored at the beginning of its column chunk in the | ||
This comment has been minimized.
Sorry, something went wrong.
toddlipcon
|
||
row group. The column chunk metadata contains the Bloom filter offset. The Bloom filter is | ||
stored with a header containing the size of the filter in bytes, the algorithm, and the | ||
This comment has been minimized.
Sorry, something went wrong.
toddlipcon
|
||
hash function. | ||
|
||
### Reference | ||
1. [Bloom filter introduction at Wiki](https://en.wikipedia.org/wiki/Bloom_filter) | ||
2. [Cache-, Hash- and Space-Efficient Bloom Filters](http://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf) | ||
3. [A Reliable Randomized Algorithm for the Closest-Pair Problem](http://www.diku.dk/~jyrki/Paper/CP-11.4.1997.ps) | ||
4. [Murmur Hash at Wiki](https://en.wikipedia.org/wiki/MurmurHash) | ||
5. [Network Applications of Bloom Filters: A Survey](https://www.eecs.harvard.edu/~michaelm/postscripts/im2005b.pdf) |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -475,6 +475,7 @@ enum PageType { | |
INDEX_PAGE = 1; | ||
DICTIONARY_PAGE = 2; | ||
DATA_PAGE_V2 = 3; | ||
BLOOM_FILTER_PAGE = 4; | ||
} | ||
|
||
/** | ||
|
@@ -554,6 +555,38 @@ struct DataPageHeaderV2 { | |
8: optional Statistics statistics; | ||
} | ||
|
||
/** Block-based algorithm type annotation. **/ | ||
struct SplitBlockAlgorithm {} | ||
/** The algorithm used in Bloom filter. **/ | ||
union BloomFilterAlgorithm { | ||
/** Block-based Bloom filter. **/ | ||
1: SplitBlockAlgorithm BLOCK; | ||
} | ||
/** Hash strategy type annotation. It uses Murmur3Hash_x64_128 from the original SMHasher | ||
* repo by Austin Appleby. | ||
**/ | ||
struct Murmur3 {} | ||
This comment has been minimized.
Sorry, something went wrong.
zivanfi
Contributor
|
||
/** | ||
* The hash function used in Bloom filter. This function takes the hash of a column value | ||
* using plain encoding. | ||
**/ | ||
union BloomFilterHash { | ||
/** Murmur3 Hash Strategy. **/ | ||
1: Murmur3 MURMUR3; | ||
} | ||
/** | ||
* Bloom filter header is stored at beginning of Bloom filter data of each column | ||
* and followed by its bitset. | ||
**/ | ||
struct BloomFilterPageHeader { | ||
/** The size of bitset in bytes **/ | ||
1: required i32 numBytes; | ||
/** The algorithm for setting bits. **/ | ||
2: required BloomFilterAlgorithm algorithm; | ||
/** The hash function used for Bloom filter. **/ | ||
3: required BloomFilterHash hash; | ||
} | ||
|
||
struct PageHeader { | ||
/** the type of the page: indicates which of the *_header fields is set **/ | ||
1: required PageType type | ||
|
@@ -574,6 +607,7 @@ struct PageHeader { | |
6: optional IndexPageHeader index_page_header; | ||
7: optional DictionaryPageHeader dictionary_page_header; | ||
8: optional DataPageHeaderV2 data_page_header_v2; | ||
9: optional BloomFilterPageHeader bloom_filter_page_header; | ||
} | ||
|
||
/** | ||
|
@@ -660,6 +694,9 @@ struct ColumnMetaData { | |
* This information can be used to determine if all data pages are | ||
* dictionary encoded for example **/ | ||
13: optional list<PageEncodingStats> encoding_stats; | ||
|
||
/** Byte offset from beginning of file to Bloom filter data. **/ | ||
14: optional i64 bloom_filter_offset; | ||
} | ||
|
||
struct ColumnChunk { | ||
|
Why choose MurmurHash3? Based on some of the benchmarks on http://fastcompression.blogspot.com/2019_03_10_archive.html it seems like it's not a particularly good choice for short string hashing. xxh3 seems a decent choice, or potentially crc32c (with hardware support on x86) even better, though it requires a bit of "mixing" to avoid bias in LSB.