PARQUET-1630: Loosen size restrictions on Bloom filters

This patch uses a range reduction trick to map a pseudorandom number to a block index without using the modulo operator '%', which is often very slow. The oldest reference I know of for this trick is Kenneth A. Ross's IBM research report from 2006, "Efficient Hash Probes on Modern Processors", available at https://domino.research.ibm.com/library/cyberdig.nsf/papers/DF54E3545C82E8A585257222006FD9A2/$File/rc24100.pdf
apache · Aug 10, 2019 · 214bf18
1 parent fa53342
commit 214bf18
Showing 1 changed file with 21 additions and 10 deletions.
diff --git a/BloomFilter.md b/BloomFilter.md
@@ -147,21 +147,30 @@ This closes the definition of a block and the operations on it.
 
 Now that a block is defined, we can describe Parquet's split block
 Bloom filters. A split block Bloom filter (henceforth "SBBF") is
-composed of `z` blocks, where `z` is a power of two greater than or
-equal to one and less than 2 to the 27th power. When an SBBF is
-initialized, each block in it is initialized, which means each bit in
-each word in each block in the SBBF is unset.
+composed of `z` blocks, where `z` is greater than or equal to one and
+less than 2 to the 31st power. When an SBBF is initialized, each block
+in it is initialized, which means each bit in each word in each block
+in the SBBF is unset.
 
 In addition to initialization, an SBBF supports an operation called
 `filter_insert` and one called `filter_check`. Each takes as an
 argument a 64-bit unsigned integer; `filter_check` returns a boolean
 and `filter_insert` does not return a value, but does modify the SBBF.
 
 The `filter_insert` operation first uses the most significant 32 bits
-of its argument, modulo the number of blocks, to select a block to
-operate on. It then uses the least significant 32 bits of the argument
-to `filter_insert` as an argument to `block_insert` called on that
-block.
+of its argument to select a block to operate on. If the number of
+blocks is `z`, the most significant 32 bits of the argument are
+multiplied by `z`, and the most significant 32 bits of the resulting
+64-bit product are the index of the block to operate on. The
+`filter_insert` operation then passes the least significant 32 bits
+of its argument to `block_insert`, called on that block.
+
+The technique for converting the most significant 32 bits to an
+integer between 0 and `z`-1 (inclusive) avoids using the modulo
+operation, which is often very slow. The oldest reference I know of
+for this trick is [Kenneth A. Ross's IBM research report from 2006,
+"Efficient Hash Probes on Modern Processors"](
+https://domino.research.ibm.com/library/cyberdig.nsf/papers/DF54E3545C82E8A585257222006FD9A2/$File/rc24100.pdf).
 
 The `filter_check` operation uses the same method as `filter_insert`
 to select a block to operate on, then uses the least significant 32
@@ -178,14 +187,16 @@ significant 32 bits.
 
 ```
 void filter_insert(SBBF filter, unsigned int64 x) {
-  block b = filter.getBlock((x >> 32) % filter.numberOfBlocks());
+  unsigned int64 i = ((x >> 32) * filter.numberOfBlocks()) >> 32;
+  block b = filter.getBlock(i);
   block_insert(b, (unsigned int32)x)
 }
 ```
 
 ```
 boolean filter_check(SBBF filter, unsigned int64 x) {
-  block b = filter.getBlock((x >> 32) % filter.numberOfBlocks());
+  unsigned int64 i = ((x >> 32) * filter.numberOfBlocks()) >> 32;
+  block b = filter.getBlock(i);
   return block_check(b, (unsigned int32)x)
 }
 ```
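
The block selection introduced by this patch can be sketched in runnable form as follows. This is a minimal illustration in Python, not the specification's pseudocode: the names `block_index` and `block_argument` are hypothetical, and the explicit masks stand in for the fixed-width unsigned arithmetic that the pseudocode assumes. Because the upper half of `x` is below 2^32 and `z` is below 2^31, the product stays below 2^63 and so fits in an unsigned 64-bit integer.

```python
MASK32 = 0xFFFFFFFF  # keeps the low 32 bits of a value


def block_index(x: int, z: int) -> int:
    """Map the most significant 32 bits of the 64-bit hash x to [0, z).

    Hypothetical helper: mirrors ((x >> 32) * z) >> 32 from the patch.
    """
    if not 1 <= z < 2**31:
        raise ValueError("z must satisfy 1 <= z < 2**31")
    hi = (x >> 32) & MASK32   # most significant 32 bits of x
    return (hi * z) >> 32     # upper 32 bits of the 64-bit product


def block_argument(x: int) -> int:
    """Least significant 32 bits of x, the value passed to the block."""
    return x & MASK32


if __name__ == "__main__":
    z = 3  # no longer required to be a power of two
    for x in (0x0000000000000000, 0x8000000000000000, 0xFFFFFFFF00000000):
        print(hex(x), block_index(x, z), block_argument(x))
```

With `z = 3`, the three sample hashes select blocks 0, 1 and 2. Unlike the modulo version, the index grows monotonically with the upper 32 bits of the hash, so the two schemes generally select different blocks for the same input.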