
PARQUET-1630: Clarify the Bloom filter algorithm #147

Merged 1 commit on Aug 9, 2019

Conversation

jbapple (Contributor) commented Aug 5, 2019

The specific questions answered from
https://lists.apache.org/thread.html/82d2e50f8c1007720564c5dc64aeae7947e949f3954a83436dc36760@%3Cdev.parquet.apache.org%3E
are

How is the Bloom filter block selected from the 32 most-significant
bits of the hash function? These details must be in the spec and
not in papers linked from the spec.

How is the number of blocks determined? From the overall filter size?

I think that the exact procedure for a lookup in each block should be
covered in a section, followed by a section on how to perform a lookup
in the multi-block filter. The wording also needs to be cleaned up
so that it is always clear whether the filter being referenced is a
block or the multi-block filter.

The spec should give more detail on how to choose the number of blocks
and on false positive rates. The sentence with “11.54 bits for each
distinct value inserted into the filter” is vague: is this the
multi-block filter? Why is a 1% false-positive rate “recommended”?

I think it is okay to use 0.5% as each block’s false-positive rate,
but then this should state how to achieve an overall false-positive
rate as a function of the number of distinct values.

@jbapple jbapple changed the title Answer some mailing list questions about Bloom filters PARQUET-1630: Clarify the Bloom filter algorithm Aug 5, 2019
jbapple (Contributor, Author) commented Aug 5, 2019

@rdblue @chenjunjiedada your review is requested.

for (int i = 0; i < 8; ++i) {
mask[i] = key * SALT[i];
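The excerpt above shows only the first line of the mask computation. A self-contained sketch of the whole step might look like the following; the eight SALT constants are the ones commonly used by split-block Bloom filter implementations and should be treated as illustrative here rather than normative:

```c
#include <stdint.h>

/* Illustrative SALT constants for a split-block Bloom filter. */
static const uint32_t SALT[8] = {
    0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU,
    0x705495c7U, 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U};

/* For each of the eight 32-bit words in a 256-bit block, pick exactly
 * one bit to set: a multiply-shift hash keeps the top 5 bits of
 * key * SALT[i] as a bit index in [0, 32). */
void make_mask(uint32_t key, uint32_t mask[8]) {
  for (int i = 0; i < 8; ++i) {
    mask[i] = key * SALT[i];          /* multiply-shift hash per word */
    mask[i] = mask[i] >> 27;          /* top 5 bits: bit index in [0, 32) */
    mask[i] = (uint32_t)1 << mask[i]; /* exactly one set bit per word */
  }
}
```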

The section describes split block Bloom filters, which is the first
Contributor:

Do we need a header for the section here?

Contributor (Author):

I don't think so, since this is the only BF algorithm in the spec right now.


In the real scenario, the size of the Bloom filter and the false positive rate may vary from
Contributor:

Do we need to keep the recommendation here? The recommendation was asked about in a previous code review.

Contributor (Author):

I do not think we do. I think that was probably overkill in the first place. While it's possible to explain what "optimal" means for some definition of "optimal", that adds unnecessary complexity.

chenjunjiedada (Contributor) left a comment:

LGTM

BloomFilter.md Outdated
conversion of an unsigned 64-bit integer to an unsigned 32-bit integer
containing only the most significant 32 bits, and C's cast operator
"`(unsigned int32)`" is used to denote the conversion of an unsigned
64-bit integer to an unsigned 32-bit integer containing only the lest
Contributor:

Suggested change:
- 64-bit integer to an unsigned 32-bit integer containing only the lest
+ 64-bit integer to an unsigned 32-bit integer containing only the least

BloomFilter.md Outdated
was never inserted into the SBBF. These are called "false
positives". There is not a simple closed-form calculation of this
probability, but empirically, an SBBF that uses 6 bits of space (or
6/256 blocks) per unique item inserted will have a false positive
Contributor:

Why is this 6? Each value is set in a block and each block requires setting 8 bits, one in each 32-bit word. So this isn't possible in this spec, right?

Contributor:

I guess this could be the average across all blocks, not the actual number of bits set. That's what the next calculation seems to imply. Although we are setting 8 bits per key, the average is around 6 bits per value if you insert 43691 values.

Contributor (Author):

Exactly. While eight bits are set each time an item is inserted, that doesn't mean that the fraction bits_of_space / number_of_distinct_items_inserted is eight. I attempted to convey this by adding the modifier "of space".
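The derived ratio described above can be checked directly; this is a sketch (the function name is illustrative), relating total filter bits (blocks * 256) to distinct values inserted:

```c
#include <stdint.h>

/* Bits of space per distinct value is a derived ratio: total filter
 * bits (num_blocks * 256) divided by the number of distinct values
 * inserted. It is not the number of bits set per insertion, which is
 * always 8 for a split-block filter. */
double bits_of_space_per_value(uint64_t num_blocks, uint64_t num_values) {
  return (double)(num_blocks * 256) / (double)num_values;
}
```

For 1024 blocks and 43691 distinct values this comes out to roughly 6 bits of space per value, matching the figure discussed above.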

Contributor:

I think it would be good to have a note about these calculations. I had a lot of trouble understanding what was intended by the 6/256. I recommend showing calculations that relate the total number of bits (blocks * 256) to bits per distinct value and distinct values. That's easier to understand.

Contributor (Author):

After a face-to-face discussion, Ryan explained a bit more how this was confusing. I've changed the wording as a result of that conversation to note that bits-per-insert is a derived value.

BloomFilter.md Outdated
```
c = -8 / log(1 - pow(p, 1.0 / 8))
```
For example, consider an SBBF that contains 1024 blocks and has been
loaded with (256/6) * 1024 = 43691 unique keys. Running `filter_check`
Contributor:

I think it would be more clear to add units to this example and group the 1024 blocks * 256 bits/block to get the total number of bits first, then divide by bits per key.

It sounds like the way to size the filter, given an estimate of the expected number of distinct values per row group (say 10,000), is:

  • Choose the desired false-positive rate, like 1%, and multiply the corresponding bits per value by the number of values to get the total required bits: 10.5 bits * 10,000 values = 105,000 bits.
  • Divide the total bits by 256 to get the required number of blocks: 105,000 bits / 256 bits per block = 410 blocks.
  • Find the first power of 2 higher than the number of blocks: 2^8 = 256 < 410, 2^9 = 512 > 410, so use 512 blocks.

Is this correct?

Contributor (Author):

Yes, correct.

BloomFilter.md Outdated
on 100 items that were never `filter_insert`ed will yield about 10
`true` responses in expectation.

Using more space will reduce the false positive probability:
Contributor:

In the calculations for previous proposals, we found that over-filling a bloom filter by 2x caused the false-positive rate to be 10x the desired rate. In other words, over-filling is expensive. But, under-filling is also expensive because it takes much more space.

It would be helpful to have a rough calculation here to get the false-positive probability from a number of blocks and an estimate of the number of unique keys, although above it seems to imply that there isn't one. At least stating that it is better to under-fill than over-fill would be good?

Contributor (Author):

Yes, there isn't one. I'd hesitate to say it is "better" to overfill than underfill. They're just different tradeoffs.

Now that a block is defined, we can describe Parquet's split block
Bloom filters. A split block Bloom filter (henceforth "SBBF") is
composed of `z` blocks, where `z` is a power of two greater than or
equal to one and less than 2 to the 27th power. When an SBBF is
Contributor:

Why does this need to be a power of 2? Looks like this might be so that we can use a mask instead of a mod operation to get the block.

Does this affect bloom filter false-positive rate in some way?

The problem is that under- or over-filling has a large effect on either size or false-positive rate. Requiring the number of blocks to be a power of 2 seems to complicate choosing a size.

Contributor (Author):

Yes, it's to use a mask. It doesn't influence the false positive probability in any way.

I agree that it makes sense to remove this restriction. I'll send that in a follow-on patch once this one is checked in, so that each patch does one thing: this one clarifies, that one will change the spec.
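The mask trick discussed above can be sketched like this (a sketch of the power-of-two scheme, assuming block selection from the most significant 32 bits of the hash):

```c
#include <stdint.h>

/* With the number of blocks z a power of two, the block index can be
 * taken from the most significant 32 bits of the 64-bit hash using a
 * bitwise AND instead of a modulo operation. */
uint64_t block_index(uint64_t hash, uint64_t z) {
  uint32_t top = (uint32_t)(hash >> 32); /* most significant 32 bits */
  return (uint64_t)top & (z - 1);        /* same as top % z when z = 2^k */
}
```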

significant 32 bits.

```
void filter_insert(SBBF filter, unsigned int64 x) {
```
Contributor:

Thanks for the clear and thorough write-up of how to implement this. I think that it really helps make the spec clear.

Do you think it may be worth noting where these can be sped up using masks and integer operations?

Contributor (Author):

I think shorter is better here. I'm not averse to this coming in a follow-on patch, maybe even after the release, since such a note would be advisory or commentary, not normative.

4 participants