Reduce bloom filter size by using the optimal count for hash functions. #11900
Conversation
Nice, thanks for improving this. I left some minor comments.
- public static final int VERSION_CURRENT = 2;
+ public static final int VERSION_MURMUR2 = 2;
+ private static final int VERSION_MULTI_HASH = 3;
+ public static final int VERSION_CURRENT = VERSION_MULTI_HASH;
You can drop versions 2 and 3: non-default codecs do not have to remain backward compatible.
Awesome, breaking backward compatibility will allow lots of cleaning up!
// body
for (int i = 0; i < nblocks; i++) {
  long k = getLittleEndianLong(data, offset);
You can use BitUtil.VH_LE_LONG instead.
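For reference, Lucene's BitUtil.VH_LE_LONG is a VarHandle that views a byte[] as little-endian longs. A self-contained sketch of the equivalent using only the JDK (class and method names here are illustrative, not from the patch):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

public class LittleEndianLongDemo {
  // Equivalent of Lucene's BitUtil.VH_LE_LONG: a VarHandle viewing a
  // byte[] as little-endian longs (byte-array views tolerate unaligned offsets).
  static final VarHandle VH_LE_LONG =
      MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);

  static long getLittleEndianLong(byte[] data, int offset) {
    return (long) VH_LE_LONG.get(data, offset);
  }

  public static void main(String[] args) {
    byte[] data = {0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08};
    // Little-endian: least significant byte first -> 0x0807060504030201
    System.out.println(Long.toHexString(getLittleEndianLong(data, 0)));
  }
}
```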
Thanks, I will do it.
@jfboeuf I took a stab at removing the versioning logic to simplify the change; I plan on merging it soon if this works for you.
Thank you very much for taking care of this, and sorry for the late reply: I was away from home on a long vacation for the past few months.
BloomFilteringPostingsFormat currently relies on a bloom filter with one hash function (k=1). For a target false positive probability of 10%, 1 is never the optimal value for k. Using the optimal value for k would either shrink the filter at the same false positive probability, or improve the false positive probability at the same size.
From my tests, a target false positive probability of about 10% seems to be a good trade-off. I slightly raised this value (to 0.1023f) so that newly allocated bloom filters are always half the size they used to be. The effective false positive probability ranges from significantly better in most cases to slightly worse in rare cases. This graph compares both the size and the effective false positive probability of the current and proposed implementations. Overall performance remains comparable (slightly, but not significantly, better); the reduced size and the improved false positive probability compensate for the cost of computing additional hashes. You can find the BloomBench class I used to check performance in the bloomPerfBench branch.
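The halving claim can be checked with the standard bloom filter sizing formulas: a k=1 filter needs m/n = -1/ln(1-p) bits per entry, while the optimal configuration needs m/n = -ln(p)/(ln 2)^2 bits per entry with k = (m/n)·ln 2 hash functions. A minimal sketch (class and method names are mine, not the patch's API):

```java
public class BloomSizing {
  // Bits per entry for a single-hash (k=1) filter at false positive
  // probability p: p = 1 - exp(-n/m)  =>  m/n = -1 / ln(1 - p)
  static double bitsPerEntrySingleHash(double p) {
    return -1.0 / Math.log(1.0 - p);
  }

  // Bits per entry with the optimal hash count: m/n = -ln(p) / (ln 2)^2
  static double bitsPerEntryOptimal(double p) {
    return -Math.log(p) / (Math.log(2) * Math.log(2));
  }

  // Optimal number of hash functions: k = (m/n) * ln 2
  static long optimalK(double p) {
    return Math.round(bitsPerEntryOptimal(p) * Math.log(2));
  }

  public static void main(String[] args) {
    double p = 0.1023; // the target used in the patch description
    System.out.printf("k=1:     %.2f bits/entry%n", bitsPerEntrySingleHash(p));
    System.out.printf("optimal: %.2f bits/entry with k=%d%n",
        bitsPerEntryOptimal(p), optimalK(p));
  }
}
```

At p = 0.1023 the optimal layout needs about 4.75 bits per entry with k=3, versus about 9.27 bits per entry for k=1, which is why the new filters come out at roughly half the size.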
In addition, the implementation of the bitset is based on a long array, so picking a size lower than 64 bits is pointless.
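Since the backing store is a long[] (64 bits per word), any requested size is effectively rounded up to a multiple of 64. A small illustrative sketch (the helper name is mine, not the patch's):

```java
public class BitsetSizing {
  // A long[]-backed bitset allocates whole 64-bit words, so asking for
  // fewer than 64 bits still costs one full word; round up accordingly.
  static int roundUpToMultipleOf64(int numBits) {
    return (numBits + 63) & ~63;
  }

  public static void main(String[] args) {
    System.out.println(roundUpToMultipleOf64(1));   // 64
    System.out.println(roundUpToMultipleOf64(64));  // 64
    System.out.println(roundUpToMultipleOf64(65));  // 128
  }
}
```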
API change:
The proposed implementation remains compatible with existing/persisted bloom filters.