PARQUET-1630: Loosen size restrictions on Bloom filters
This patch uses a range reduction trick to produce a pseudorandom
number within an index without using the modulo operator '%', which is
often very slow.

The oldest reference I know to this trick is Kenneth A. Ross's IBM
research report from 2006, "Efficient Hash Probes on Modern
Processors", available at
https://domino.research.ibm.com/library/cyberdig.nsf/papers/DF54E3545C82E8A585257222006FD9A2/$File/rc24100.pdf
jbapple committed Aug 10, 2019
1 parent fa53342 commit 214bf18
Showing 1 changed file with 21 additions and 10 deletions.
BloomFilter.md
@@ -147,21 +147,30 @@ This closes the definition of a block and the operations on it.

Now that a block is defined, we can describe Parquet's split block
Bloom filters. A split block Bloom filter (henceforth "SBBF") is
-composed of `z` blocks, where `z` is a power of two greater than or
-equal to one and less than 2 to the 27th power. When an SBBF is
-initialized, each block in it is initialized, which means each bit in
-each word in each block in the SBBF is unset.
+composed of `z` blocks, where `z` is greater than or equal to one and
+less than 2 to the 31st power. When an SBBF is initialized, each block
+in it is initialized, which means each bit in each word in each block
+in the SBBF is unset.

In addition to initialization, an SBBF supports an operation called
`filter_insert` and one called `filter_check`. Each takes as an
argument a 64-bit unsigned integer; `filter_check` returns a boolean
and `filter_insert` does not return a value, but does modify the SBBF.

The `filter_insert` operation first uses the most significant 32 bits
-of its argument, modulo the number of blocks, to select a block to
-operate on. It then uses the least significant 32 bits of the argument
-to `filter_insert` as an argument to `block_insert` called on that
-block.
+of its argument to select a block to operate on. If the number of
+blocks is `z`, the most significant 32 bits of the `filter_insert`
+argument are multiplied by `z`; the most significant 32 bits of this
+product are the index of the block to operate on. The `filter_insert`
+function then uses the least significant 32 bits of the argument to
+`filter_insert` as an argument to `block_insert` called on that block.
+
+The technique for converting the most significant 32 bits to an
+integer between 0 and z-1 (inclusive) avoids using the modulo
+operation, which is often very slow. The oldest reference I know to
+this trick is [Kenneth A. Ross's IBM research report from 2006,
+"Efficient Hash Probes on Modern Processors"](
+https://domino.research.ibm.com/library/cyberdig.nsf/papers/DF54E3545C82E8A585257222006FD9A2/$File/rc24100.pdf)
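
The range reduction described above can be sketched as follows. This is a minimal illustration in Python rather than the pseudocode notation used in this file, and the name `block_index` is an assumption made for the example, not part of the specification. For a 32-bit value `h32` and a block count `z`, the product `h32 * z` is less than `z * 2**32`, so its upper 32 bits always fall between 0 and z-1, whether or not `z` is a power of two.

```python
def block_index(h32: int, z: int) -> int:
    """Map a 32-bit value h32 to an index in [0, z) without using %.

    Hypothetical helper for illustration only.
    """
    if not 0 <= h32 < 2**32:
        raise ValueError("h32 must fit in 32 unsigned bits")
    if not 1 <= z < 2**31:
        raise ValueError("z must be at least 1 and less than 2**31")
    # The product is below z * 2**32, so shifting right by 32 gives a value below z.
    return (h32 * z) >> 32


# The block count does not need to be a power of two.
for h32 in (0x00000000, 0x80000000, 0xFFFFFFFF):
    print(hex(h32), "->", block_index(h32, 10))
```

With `z = 10`, the smallest, middle, and largest 32-bit inputs map to blocks 0, 5, and 9 respectively, so the whole index range is reachable.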

The `filter_check` operation uses the same method as `filter_insert`
to select a block to operate on, then uses the least significant 32
@@ -178,14 +187,16 @@ significant 32 bits.

```
void filter_insert(SBBF filter, unsigned int64 x) {
-block b = filter.getBlock((x >> 32) % filter.numberOfBlocks());
+unsigned int64 i = ((x >> 32) * filter.numberOfBlocks()) >> 32;
+block b = filter.getBlock(i);
block_insert(b, (unsigned int32)x)
}
```

```
boolean filter_check(SBBF filter, unsigned int64 x) {
-block b = filter.getBlock((x >> 32) % filter.numberOfBlocks());
+unsigned int64 i = ((x >> 32) * filter.numberOfBlocks()) >> 32;
+block b = filter.getBlock(i);
return block_check(b, (unsigned int32)x)
}
```
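
To show how the two operations share the same block selection, here is a rough runnable sketch in Python. It is not the Parquet block implementation: the class `ToySBBF` is hypothetical, and each block is replaced by a plain Python set of 32-bit values standing in for `block_insert` and `block_check`, which are defined elsewhere in this file. Only the dispatch logic (upper 32 bits pick the block, lower 32 bits go to the block) follows the pseudocode above.

```python
MASK32 = 0xFFFFFFFF


class ToySBBF:
    """Illustrative stand-in: blocks are sets, not real split-block words."""

    def __init__(self, number_of_blocks: int):
        if not 1 <= number_of_blocks < 2**31:
            raise ValueError("number of blocks must be in [1, 2**31)")
        self.blocks = [set() for _ in range(number_of_blocks)]

    def _block_for(self, x: int) -> set:
        # Upper 32 bits of x, scaled by the block count, then shifted down.
        high = (x >> 32) & MASK32
        i = (high * len(self.blocks)) >> 32
        return self.blocks[i]

    def filter_insert(self, x: int) -> None:
        # Lower 32 bits are handed to the (stand-in) block insert.
        self._block_for(x).add(x & MASK32)

    def filter_check(self, x: int) -> bool:
        # Same block selection as filter_insert, then the (stand-in) block check.
        return (x & MASK32) in self._block_for(x)


f = ToySBBF(3)  # three blocks: not a power of two
f.filter_insert(0xDEADBEEF00000001)
print(f.filter_check(0xDEADBEEF00000001))
print(f.filter_check(0x0000000000000001))
```

The second check shares its lower 32 bits with the inserted value but has different upper bits, so it is routed to a different block. A real block would be a small Bloom filter and could return false positives; the set used here cannot, which is one way this sketch differs from the specification.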