Reduce bloom filter size by using the optimal count for hash functions. #11900
Conversation
Nice, thanks for improving this. I left some minor comments.
- public static final int VERSION_CURRENT = 2;
+ public static final int VERSION_MURMUR2 = 2;
+ private static final int VERSION_MULTI_HASH = 3;
+ public static final int VERSION_CURRENT = VERSION_MULTI_HASH;
You can drop versions 2 and 3: non-default codecs do not have to remain backward compatible.
Awesome, breaking backward compatibility will allow lots of cleaning up!
// body
for (int i = 0; i < nblocks; i++) {
  long k = getLittleEndianLong(data, offset);
You can use BitUtil.VH_LE_LONG instead.
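For reference, Lucene's BitUtil.VH_LE_LONG is a VarHandle that views a byte[] as little-endian longs. A self-contained sketch of the equivalent using only the JDK (class and method names here are illustrative, not from the patch):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

public class LittleEndianLongDemo {
  // Equivalent of Lucene's BitUtil.VH_LE_LONG: a VarHandle viewing a
  // byte[] as little-endian longs (byte-array views tolerate unaligned offsets).
  static final VarHandle VH_LE_LONG =
      MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);

  static long getLittleEndianLong(byte[] data, int offset) {
    return (long) VH_LE_LONG.get(data, offset);
  }

  public static void main(String[] args) {
    byte[] data = {0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08};
    // Little-endian: least significant byte first -> 0x0807060504030201
    System.out.println(Long.toHexString(getLittleEndianLong(data, 0)));
  }
}
```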
Thanks, I will do it.
@jfboeuf I took a stab at removing the versioning logic to simplify the change; I plan on merging it soon if this works for you.
Thank you very much for taking care of this, and sorry for the late reply: I was away from home on a long vacation for the past few months.
BloomFilteringPostingsFormat currently relies on a bloom filter with one hash function (k=1). For a target false positive probability of 10%, 1 is never the optimal value for k. Using the optimal value for k would either shrink the filter at the same false positive probability, or improve the false positive probability at the same size.
From my tests, a target false positive probability of about 10% seems to be a good trade-off. I slightly raised this value (to 0.1023f) so that newly allocated bloom filters are always half the size they used to be. The effective false positive probability ranges from significantly better in most cases to slightly worse in rare cases. This graph compares both the size and the effective false positive probability of the current and proposed implementations. Overall performance remains comparable (slightly, but not significantly, better); the reduced size and the improved false positive probability compensate for the cost of computing additional hashes. You can find the BloomBench class I used to check performance in the bloomPerfBench branch.
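The halving claim can be checked with the standard bloom filter sizing formulas: a k=1 filter needs m/n = -1/ln(1-p) bits per entry, while the optimal configuration needs m/n = -ln(p)/(ln 2)^2 bits per entry with k = (m/n)·ln 2 hash functions. A minimal sketch (class and method names are mine, not the patch's API):

```java
public class BloomSizing {
  // Bits per entry for a single-hash (k=1) filter at false positive
  // probability p: p = 1 - exp(-n/m)  =>  m/n = -1 / ln(1 - p)
  static double bitsPerEntrySingleHash(double p) {
    return -1.0 / Math.log(1.0 - p);
  }

  // Bits per entry with the optimal hash count: m/n = -ln(p) / (ln 2)^2
  static double bitsPerEntryOptimal(double p) {
    return -Math.log(p) / (Math.log(2) * Math.log(2));
  }

  // Optimal number of hash functions: k = (m/n) * ln 2
  static long optimalK(double p) {
    return Math.round(bitsPerEntryOptimal(p) * Math.log(2));
  }

  public static void main(String[] args) {
    double p = 0.1023; // the target used in the patch description
    System.out.printf("k=1:     %.2f bits/entry%n", bitsPerEntrySingleHash(p));
    System.out.printf("optimal: %.2f bits/entry with k=%d%n",
        bitsPerEntryOptimal(p), optimalK(p));
  }
}
```

At p = 0.1023 the optimal layout needs about 4.75 bits per entry with k=3, versus about 9.27 bits per entry for k=1, which is why the new filters come out at roughly half the size.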
In addition, the implementation of the bitset is based on a long array, so picking a size lower than 64 bits is pointless.
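Since the backing store is a long[] (64 bits per word), any requested size is effectively rounded up to a multiple of 64. A small illustrative sketch (the helper name is mine, not the patch's):

```java
public class BitsetSizing {
  // A long[]-backed bitset allocates whole 64-bit words, so asking for
  // fewer than 64 bits still costs one full word; round up accordingly.
  static int roundUpToMultipleOf64(int numBits) {
    return (numBits + 63) & ~63;
  }

  public static void main(String[] args) {
    System.out.println(roundUpToMultipleOf64(1));   // 64
    System.out.println(roundUpToMultipleOf64(64));  // 64
    System.out.println(roundUpToMultipleOf64(65));  // 128
  }
}
```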
API change:
The proposed implementation remains compatible with existing/persisted bloom filters.