
PARQUET-1630: Clarify the Bloom filter algorithm #147

Merged 1 commit on Aug 9, 2019

Conversation

jbapple (Contributor) commented Aug 5, 2019

The specific questions answered from
https://lists.apache.org/thread.html/82d2e50f8c1007720564c5dc64aeae7947e949f3954a83436dc36760@%3Cdev.parquet.apache.org%3E
are

How is the Bloom filter block selected from the 32 most-significant
bits of the hash function? These details must be in the spec and
not in papers linked from the spec.

How is the number of blocks determined? From the overall filter size?

I think that the exact procedure for a lookup in each block should be
covered in a section, followed by a section on how to perform a lookup
in the multi-block filter. The wording also needs to be cleaned up
so that it is always clear whether the filter being referenced is a
block or the multi-block filter.

The spec should give more detail on how to choose the number of blocks
and on false positive rates. The sentence with “11.54 bits for each
distinct value inserted into the filter” is vague: is this the
multi-block filter? Why is a 1% false-positive rate “recommended”?

I think it is okay to use 0.5% as each block’s false-positive rate,
but then this should state how to achieve an overall false-positive
rate as a function of the number of distinct values.

@jbapple jbapple changed the title Answer some mailing list questions about Bloom filters PARQUET-1630: Clarify the Bloom filter algorithm Aug 5, 2019
jbapple (Contributor, Author) commented Aug 5, 2019

@rdblue @chenjunjiedada your review is requested.

for (int i = 0; i < 8; ++i) {
mask[i] = key * SALT[i];
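The excerpt above shows only the first line of the mask computation. A self-contained sketch of the whole step might look like the following; the eight SALT constants are the ones commonly used by split-block Bloom filter implementations and should be treated as illustrative here rather than normative:

```c
#include <stdint.h>

/* Illustrative SALT constants for a split-block Bloom filter. */
static const uint32_t SALT[8] = {
    0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU,
    0x705495c7U, 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U};

/* For each of the eight 32-bit words in a 256-bit block, pick exactly
 * one bit to set: a multiply-shift hash keeps the top 5 bits of
 * key * SALT[i] as a bit index in [0, 32). */
void make_mask(uint32_t key, uint32_t mask[8]) {
  for (int i = 0; i < 8; ++i) {
    mask[i] = key * SALT[i];          /* multiply-shift hash per word */
    mask[i] = mask[i] >> 27;          /* top 5 bits: bit index in [0, 32) */
    mask[i] = (uint32_t)1 << mask[i]; /* exactly one set bit per word */
  }
}
```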

The section describes split block Bloom filters, which is the first
Contributor:

Do we need a header for the section here?

Contributor (Author):

I don't think so, since this is the only BF algorithm in the spec right now.


In the real scenario, the size of the Bloom filter and the false positive rate may vary from
Contributor:

Do we need to keep the recommendation here? The recommendation was asked about in a previous code review.

Contributor (Author):

I do not think we do. I think that was probably overkill in the first place. While it's possible to explain what "optimal" means for some definition of "optimal", that adds unnecessary complexity.

chenjunjiedada (Contributor) left a comment:

LGTM

BloomFilter.md Outdated
conversion of an unsigned 64-bit integer to an unsigned 32-bit integer
containing only the most significant 32 bits, and C's cast operator
"`(unsigned int32)`" is used to denote the conversion of an unsigned
64-bit integer to an unsigned 32-bit integer containing only the lest
Contributor:

Suggested change:
- 64-bit integer to an unsigned 32-bit integer containing only the lest
+ 64-bit integer to an unsigned 32-bit integer containing only the least

BloomFilter.md Outdated
was never inserted into the SBBF. These are called "false
positives". There is not a simple closed-form calculation of this
probability, but empirically, an SBBF that uses 6 bits of space (or
6/256 blocks) per unique item inserted will have a false positive
Contributor:

Why is this 6? Each value is set in a block and each block requires setting 8 bits, one in each 32-bit word. So this isn't possible in this spec, right?

Contributor:

I guess this could be the average across all blocks, not the actual number of bits set. That's what the next calculation seems to imply. Although we are setting 8 bits per key, the average is around 6 bits per value if you insert 43691 values.

Contributor (Author):

Exactly. While eight bits are set each time an item is inserted, that doesn't mean that the fraction bits_of_space / number_of_distinct_items_inserted is eight. I attempted to convey this by adding the modifier "of space".
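The derived ratio described above can be checked directly; this is a sketch (the function name is illustrative), relating total filter bits (blocks * 256) to distinct values inserted:

```c
#include <stdint.h>

/* Bits of space per distinct value is a derived ratio: total filter
 * bits (num_blocks * 256) divided by the number of distinct values
 * inserted. It is not the number of bits set per insertion, which is
 * always 8 for a split-block filter. */
double bits_of_space_per_value(uint64_t num_blocks, uint64_t num_values) {
  return (double)(num_blocks * 256) / (double)num_values;
}
```

For 1024 blocks and 43691 distinct values this comes out to roughly 6 bits of space per value, matching the figure discussed above.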

Contributor:

I think it would be good to have a note about these calculations. I had a lot of trouble understanding what was intended by the 6/256. I recommend showing calculations that relate the total number of bits (blocks * 256) to bits per distinct value and distinct values. That's easier to understand.

Contributor (Author):

After a face-to-face discussion, Ryan explained a bit more how this was confusing. I've changed the wording as a result of that conversation to note that bits-per-insert is a derived value.

BloomFilter.md Outdated
```
c = -8 / log(1 - pow(p, 1.0 / 8))
```
For example, consider an SBBF that contains 1024 blocks and has been
loaded with (256/6) * 1024 = 43691 unique keys. Running `filter_check`
Contributor:

I think it would be more clear to add units to this example and group the 1024 blocks * 256 bits/block to get the total number of bits first, then divide by bits per key.

It sounds like the way to size the filter, given an estimate of the expected number of distinct values per row group (say 10,000), is:

  • Choose the desired false-positive rate, like 1%, and multiply the corresponding bits per value by the number of values to get the total required bits: 10.5 bits * 10,000 values = 105,000 bits.
  • Divide the total bits by 256 to get the required number of blocks: 105,000 bits / 256 bits per block = 410 blocks.
  • Find the first power of 2 higher than the number of blocks: 2^8 = 256 < 410, 2^9 = 512 > 410, so use 512 blocks.

Is this correct?

Contributor (Author):

Yes, correct.

BloomFilter.md Outdated
on 100 items that were never `filter_insert`ed will yield about 10
`true` responses in expectation.

Using more space will reduce the false positive probability:
Contributor:

In the calculations for previous proposals, we found that over-filling a bloom filter by 2x caused the false-positive rate to be 10x the desired rate. In other words, over-filling is expensive. But, under-filling is also expensive because it takes much more space.

It would be helpful to have a rough calculation here to get the false-positive probability from a number of blocks and an estimate of the number of unique keys, although above it seems to imply that there isn't one. At least stating that it is better to under-fill than over-fill would be good?

Contributor (Author):

Yes, there isn't one. I'd hesitate to say it is "better" to overfill than underfill. They're just different tradeoffs.

Now that a block is defined, we can describe Parquet's split block
Bloom filters. A split block Bloom filter (henceforth "SBBF") is
composed of `z` blocks, where `z` is a power of two greater than or
equal to one and less than 2 to the 27th power. When an SBBF is
Contributor:

Why does this need to be a power of 2? Looks like this might be so that we can use a mask instead of a mod operation to get the block.

Does this affect bloom filter false-positive rate in some way?

The problem is that under- or over-filling has a large effect on either size or false-positive rate. Requiring the number of blocks to be a power of 2 seems to complicate choosing a size.

Contributor (Author):

Yes, it's to use a mask. It doesn't influence the false positive probability in any way.

I agree that it makes sense to remove this restriction. I'll send that in a follow-on patch once this one is checked in, so that each patch does one thing: this one clarifies, that one will change the spec.
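The mask trick discussed above can be sketched like this (a sketch of the power-of-two scheme, assuming block selection from the most significant 32 bits of the hash):

```c
#include <stdint.h>

/* With the number of blocks z a power of two, the block index can be
 * taken from the most significant 32 bits of the 64-bit hash using a
 * bitwise AND instead of a modulo operation. */
uint64_t block_index(uint64_t hash, uint64_t z) {
  uint32_t top = (uint32_t)(hash >> 32); /* most significant 32 bits */
  return (uint64_t)top & (z - 1);        /* same as top % z when z = 2^k */
}
```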

significant 32 bits.

```
void filter_insert(SBBF filter, unsigned int64 x) {
```
Contributor:

Thanks for the clear and thorough write-up of how to implement this. I think that it really helps make the spec clear.

Do you think it may be worth noting where these can be sped up using masks and integer operations?

Contributor (Author):

I think shorter is better here. I'm not averse to this coming in a follow-on patch, maybe even after the release, since such a note would be advisory or commentary, not normative.

4 participants