PARQUET-1630: Loosen size restrictions on Bloom filters #150

jbapple · 2019-08-10T21:28:03Z

This patch uses a range reduction trick to produce a pseudorandom
number within an index without using the modulo operator '%', which is
often very slow.

The oldest reference I know to this trick is Kenneth A. Ross's IBM
research report from 2006, "Efficient Hash Probes on Modern
Processors", available at
https://domino.research.ibm.com/library/cyberdig.nsf/papers/DF54E3545C82E8A585257222006FD9A2/$File/rc24100.pdf

jbapple · 2019-08-13T00:59:57Z

@rdblue review requested

rdblue · 2019-08-13T15:53:43Z

BloomFilter.md

+
+The technique for converting the most significant 32 bits to an
+integer between 0 and z-1 (inclusive) avoids using the modulo
+operation, which is often very slow.  The oldest reference I know to


Nit: first person statements in a spec seem strange. Can we rephrase this to something like "this trick was found via Kennith A. Ross's report ..."?

rdblue · 2019-08-13T15:55:31Z

BloomFilter.md

+argument are multiplied by `z`; the most significant 32 bits of this
+product are the index of the block to operate on. The `filter_insert`
+function then uses the least significant 32 bits of the argument to
+`filter_insert` as an argument to `block_insert` called on that block.


This refers to "the argument" or "its argument" quite a bit, but then references arguments to other functions, like "as an argument to". It's a bit hard to follow, although the code makes it clear. I would prefer to be a bit more clear, maybe by breaking up the use of the most significant and least significant bits into separate paragraphs, or calling the 64-bit argument a hash value?

To be honest, I think this is a function of trying to write in prose what is much clearer in code. I'll take a stab at clarifying it, but I suspect we're in the same hole as mathematics was before algebraic notation

https://www.academia.edu/7114581/A_Brief_History_of_Algebraic_Notation

If the instance be, "ten and thing to be multiplied by thing less ten," then this is the same as if it were said thing and ten by thing less ten. You say, therefore, thing multiplied by thing is a square positive; and ten by thing is ten things positive ...

My guess is the prose is twice as long as the equivalent C code, and half as clear.

rdblue · 2019-08-13T15:57:00Z

BloomFilter.md

+of its argument to select a block to operate on. If the number of
+blocks is `z`, the most significant 32 bits of the `filter_insert`
+argument are multiplied by `z`; the most significant 32 bits of this
+product are the index of the block to operate on. The `filter_insert`


the most significant 32 bits of this product are the index of the block to operate on

I found this hard to read at first because it doesn't talk about the shift. I think it should be a little more clear and state the operation to produce i including the final shift, then note that i is guaranteed to be a number between 0 and z.

I'm having trouble following, but I've made an attempt. If this isn't what you were thinking, could you be more explicit? Maybe use Github's "Insert a suggestion" feature?

rdblue · 2019-08-13T15:57:24Z

Looks close. Just a couple of minor requests to rephrase for clarity.

This patch uses a range reduction trick to produce a pseudorandom number within an index without using the modulo operator '%', which is often very slow. The oldest reference I know to this trick is Kenneth A. Ross's IBM research report from 2006, "Efficient Hash Probes on Modern Processors", available at https://domino.research.ibm.com/library/cyberdig.nsf/papers/DF54E3545C82E8A585257222006FD9A2/$File/rc24100.pdf

rdblue · 2019-08-21T16:47:51Z

Looks good to me. Merging this.

rdblue reviewed Aug 13, 2019

View reviewed changes

jbapple force-pushed the bloom-size branch from 214bf18 to 84d3dbe Compare August 14, 2019 04:10

rdblue merged commit cd08b7f into apache:master Aug 21, 2019

asfimport mentioned this pull request Jun 23, 2024

Resolve Bloom filter spec concerns #376

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PARQUET-1630: Loosen size restrictions on Bloom filters #150

PARQUET-1630: Loosen size restrictions on Bloom filters #150

jbapple commented Aug 10, 2019

jbapple commented Aug 13, 2019

rdblue Aug 13, 2019

jbapple Aug 14, 2019

rdblue Aug 13, 2019

jbapple Aug 14, 2019

rdblue Aug 13, 2019

jbapple Aug 14, 2019

rdblue commented Aug 13, 2019

rdblue commented Aug 21, 2019

PARQUET-1630: Loosen size restrictions on Bloom filters #150

PARQUET-1630: Loosen size restrictions on Bloom filters #150

Conversation

jbapple commented Aug 10, 2019

jbapple commented Aug 13, 2019

rdblue Aug 13, 2019

Choose a reason for hiding this comment

jbapple Aug 14, 2019

Choose a reason for hiding this comment

rdblue Aug 13, 2019

Choose a reason for hiding this comment

jbapple Aug 14, 2019

Choose a reason for hiding this comment

rdblue Aug 13, 2019

Choose a reason for hiding this comment

jbapple Aug 14, 2019

Choose a reason for hiding this comment

rdblue commented Aug 13, 2019

rdblue commented Aug 21, 2019