
Reduce bloom filter size by using the optimal count for hash functions. #11900

Merged
merged 9 commits into apache:main
Feb 1, 2023

Conversation

jfboeuf
Contributor

@jfboeuf jfboeuf commented Nov 6, 2022

BloomFilteringPostingsFormat currently relies on a Bloom filter with a single hash function (k=1). For a target false positive probability (fpp) of 10%, k=1 is never the optimal value. Using the optimal value for k would either:

  • achieve a much better fpp with the same bitset size, or
  • achieve the same fpp with a reduced size (half of what is currently used).
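For reference, the textbook formulas behind this claim: for a target false positive probability p, the optimal number of hash functions is k = -log2(p), and the corresponding space cost is -ln(p)/(ln 2)² bits per entry. A small standalone sketch (not Lucene code; the class and method names are illustrative):

```java
public class BloomMath {
  // Optimal number of hash functions for target false positive probability p.
  static int optimalK(double p) {
    return Math.max(1, (int) Math.round(-Math.log(p) / Math.log(2)));
  }

  // Bits per stored entry needed to reach p when using the optimal k.
  static double bitsPerEntry(double p) {
    return -Math.log(p) / (Math.log(2) * Math.log(2));
  }

  public static void main(String[] args) {
    System.out.println(optimalK(0.1));      // 3: k=1 is never optimal for p = 0.1
    System.out.println(bitsPerEntry(0.1));  // ~4.79 bits per entry
  }
}
```

For p = 0.1 the optimum is k = 3, which is why a single hash function wastes either space or accuracy.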

From tests:

  • targeting a better false positive probability doesn't bring enough of a performance gain to justify the increased size;
  • targeting a smaller size by degrading the false positive probability comes with a significant performance hit.

As a consequence, a target false positive probability of about 10% seems to be a good trade-off. I slightly raised this value (to 0.1023f) so that newly allocated Bloom filters are always half the size of what they used to be. The effective false positive probability ranges from significantly better in most cases to slightly worse in rare ones. This graph compares the size and effective false positive probability of the current and proposed implementations. Overall performance remains comparable (slightly but not significantly better): the reduced size and the improved false positive probability compensate for the cost of computing additional hashes. The class BloomBench in the bloomPerfBench branch is what I used to benchmark performance.

In addition, the implementation of the bitset is based on a long array, so picking a size lower than 64 bits is pointless.
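Since the backing array stores 64 bits per long, a requested bit count can be rounded up to the next multiple of 64 at no extra cost. A hypothetical helper illustrating the rounding (not the PR's actual code):

```java
public class BitsetSize {
  // Round a requested bit count up to a multiple of 64 (one long),
  // with a floor of 64 since a smaller bitset cannot exist anyway.
  static int roundUpToLongBoundary(int numBits) {
    return Math.max(64, (numBits + 63) & ~63);
  }

  public static void main(String[] args) {
    System.out.println(roundUpToLongBoundary(1));    // 64
    System.out.println(roundUpToLongBoundary(100));  // 128
  }
}
```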

API change:

  • HashFunction.hash(BytesRef) now returns a long: a 64-bit hash is more accurate and makes it possible to derive additional hashes from the original one.

The proposed implementation remains compatible with existing/persisted bloom filters.
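One common way to derive k bit indices from a single 64-bit hash is Kirsch–Mitzenmacher double hashing, which combines the two 32-bit halves of the hash. The PR's exact derivation may differ; this is only an illustrative sketch:

```java
public class MultiHash {
  // Derive k bit indices from one 64-bit hash: g_i = h1 + i * h2,
  // where h1 and h2 are the low and high 32-bit halves of the hash.
  static int[] indices(long hash64, int k, int numBits) {
    int h1 = (int) hash64;          // low 32 bits
    int h2 = (int) (hash64 >>> 32); // high 32 bits
    int[] result = new int[k];
    for (int i = 0; i < k; i++) {
      int combined = h1 + i * h2;
      if (combined < 0) combined = ~combined; // keep the index non-negative
      result[i] = combined % numBits;
    }
    return result;
  }

  public static void main(String[] args) {
    // Prints three indices in [0, 1024) derived from a single 64-bit hash.
    for (int idx : indices(0x9E3779B97F4A7C15L, 3, 1024)) {
      System.out.println(idx);
    }
  }
}
```

This keeps the cost of k > 1 low: only one full hash of the term bytes is computed, and the remaining indices are cheap arithmetic.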

Contributor

@jpountz jpountz left a comment


Nice, thanks for improving this. I left some minor comments.

- public static final int VERSION_CURRENT = 2;
+ public static final int VERSION_MURMUR2 = 2;
+ private static final int VERSION_MULTI_HASH = 3;
+ public static final int VERSION_CURRENT = VERSION_MULTI_HASH;
Contributor


You can drop versions 2 and 3: non-default codecs do not have to remain backward compatible.

Contributor Author


Awesome, breaking backward compatibility will allow lots of cleaning up!

// body
for (int i = 0; i < nblocks; i++) {

long k = getLittleEndianLong(data, offset);
Contributor


You can use BitUtil.VH_LE_LONG instead.
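For illustration, BitUtil.VH_LE_LONG is a VarHandle view over byte[] with little-endian byte order; a standalone equivalent using only the JDK (class and method names here are hypothetical) looks like:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

public class LeLong {
  // A little-endian long view over a byte array, like Lucene's
  // BitUtil.VH_LE_LONG; it replaces a hand-rolled getLittleEndianLong.
  static final VarHandle VH_LE_LONG =
      MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);

  static long readLong(byte[] data, int offset) {
    return (long) VH_LE_LONG.get(data, offset);
  }

  public static void main(String[] args) {
    byte[] data = {0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08};
    // Little-endian: the first byte is the least significant.
    System.out.printf("0x%x%n", readLong(data, 0)); // 0x807060504030201
  }
}
```

The JIT compiles such VarHandle reads down to a single unaligned load on common platforms, which is why it is preferred over byte-by-byte assembly.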

Contributor Author


Thanks, I will do it.

@jpountz
Contributor

jpountz commented Jan 31, 2023

@jfboeuf I took a stab at removing the versioning logic to simplify the change, I plan on merging it soon if this works for you.

@jpountz jpountz merged commit 5acca82 into apache:main Feb 1, 2023
@jfboeuf
Contributor Author

jfboeuf commented Feb 23, 2023

Thank you very much for taking care of this, and sorry for the late reply: I have been away from home on a long vacation for the past few months.
