Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-1630: Loosen size restrictions on Bloom filters #150

Merged
merged 1 commit into from
Aug 21, 2019

Conversation

jbapple
Copy link
Contributor

@jbapple jbapple commented Aug 10, 2019

This patch uses a range reduction trick to produce a pseudorandom
number within an index without using the modulo operator '%', which is
often very slow.

The oldest reference I know to this trick is Kenneth A. Ross's IBM
research report from 2006, "Efficient Hash Probes on Modern
Processors", available at
https://domino.research.ibm.com/library/cyberdig.nsf/papers/DF54E3545C82E8A585257222006FD9A2/$File/rc24100.pdf

@jbapple
Copy link
Contributor Author

jbapple commented Aug 13, 2019

@rdblue review requested

BloomFilter.md Outdated

The technique for converting the most significant 32 bits to an
integer between 0 and z-1 (inclusive) avoids using the modulo
operation, which is often very slow. The oldest reference I know to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: first person statements in a spec seem strange. Can we rephrase this to something like "this trick was found via Kennith A. Ross's report ..."?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

BloomFilter.md Outdated
argument are multiplied by `z`; the most significant 32 bits of this
product are the index of the block to operate on. The `filter_insert`
function then uses the least significant 32 bits of the argument to
`filter_insert` as an argument to `block_insert` called on that block.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This refers to "the argument" or "its argument" quite a bit, but then references arguments to other functions, like "as an argument to". It's a bit hard to follow, although the code makes it clear. I would prefer to be a bit more clear, maybe by breaking up the use of the most significant and least significant bits into separate paragraphs, or calling the 64-bit argument a hash value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, I think this is a function of trying to write in prose what is much clearer in code. I'll take a stab at clarifying it, but I suspect we're in the same hole as mathematics was before algebraic notation

https://www.academia.edu/7114581/A_Brief_History_of_Algebraic_Notation

If the instance be, "ten and thing to be multiplied by thing less ten," then this is the same as if it were said thing and ten by thing less ten. You say, therefore, thing multiplied by thing is a square positive; and ten by thing is ten things positive ...

My guess is the prose is twice as long as the equivalent C code, and half as clear.

BloomFilter.md Outdated
of its argument to select a block to operate on. If the number of
blocks is `z`, the most significant 32 bits of the `filter_insert`
argument are multiplied by `z`; the most significant 32 bits of this
product are the index of the block to operate on. The `filter_insert`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the most significant 32 bits of this product are the index of the block to operate on

I found this hard to read at first because it doesn't talk about the shift. I think it should be a little more clear and state the operation to produce i including the final shift, then note that i is guaranteed to be a number between 0 and z.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having trouble following, but I've made an attempt. If this isn't what you were thinking, could you be more explicit? Maybe use Github's "Insert a suggestion" feature?

@rdblue
Copy link
Contributor

rdblue commented Aug 13, 2019

Looks close. Just a couple of minor requests to rephrase for clarity.

This patch uses a range reduction trick to produce a pseudorandom
number within an index without using the modulo operator '%', which is
often very slow.

The oldest reference I know to this trick is Kenneth A. Ross's IBM
research report from 2006, "Efficient Hash Probes on Modern
Processors", available at
https://domino.research.ibm.com/library/cyberdig.nsf/papers/DF54E3545C82E8A585257222006FD9A2/$File/rc24100.pdf
@rdblue rdblue merged commit cd08b7f into apache:master Aug 21, 2019
@rdblue
Copy link
Contributor

rdblue commented Aug 21, 2019

Looks good to me. Merging this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants