-
Notifications
You must be signed in to change notification settings - Fork 4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add single value cache to BloomFilter hash calculation #73103
Conversation
Although this isn't horribly expensive, I noticed that we typically call BloomFilter.ProbablyContains many times with the same input string. This string is then hashed a number of times (about 13 times if the string is present, less if not). Instead, we can cache what these hashes are as almost all created bloom filters will calculate the same hashes. Testing yielded around a 99% hit rate. Should have a *slight* positive effect for find references and other scenarios using the bloom filters.
|
||
if (cachedHash == null | ||
|| cachedHash._isCaseSensitive != filter._isCaseSensitive | ||
|| cachedHash._hashes.Length < filter._hashFunctionCount |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this a <
and not a !=
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because a longer hash of arrays is fine. The caller may not use all the values in the array we hand back, but that's not a big deal.
/// <summary> | ||
/// Provides mechanism to efficiently obtain bloom filter hash for a value. Backed by a single element cache. | ||
/// </summary> | ||
internal class BloomFilterHash |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sealed
|
||
for (var i = 0; i < filter._hashFunctionCount; i++) | ||
hashBuilder.Add(filter.GetBitArrayIndex(value, i)); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems wrong (though it is early). The result of GetBitArrayIndex depends on internal state of the filter it doesn't seem like it could be used across filters.
} | ||
|
||
[Fact] | ||
public void TestCacheAfterCalls() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need tests with vastly different Bloom filters, demonstrating we get expected probability results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a general request, or is this something you think needs testing in light of the code now using a cache?
I think there's a subtle, but very problematic bug. If I'm remembering how this works properly. |
…mod length to get the BitArrayIndex
323b692
to
f47186e
Compare
f47186e
to
ede3bb2
Compare
public bool ProbablyContains(string value) | ||
{ | ||
var hashes = BloomFilterHash.GetOrCreateHashArray(value, _isCaseSensitive, _hashFunctionCount); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. this is subtle, and needs docs. mention explicitly that hashes may contain more hash values than the _hashFunctionCount passed in. But that's ok as the first _hashFunctionCount
hashes are guaranteed to be the same due to how ComputeHash works.
{ | ||
var cachedHash = s_cachedHash; | ||
|
||
// Not an equivalency check on the hashFunctionCount as a longer array is ok. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'd prefer a longer explanation of why it isok. specifically that when getting the hashes that you always get hte same prefix of hashes regardless of how many hash function counts you ask for. In other words, the hash[0] is always hte same across all the arrays, as long as you have the same value and case sensitivity, and same for hash[1], and s on.
/// we put those values into a simple cache and see if it can be used before calculating. | ||
/// Local testing has put the hit rate of this at around 99%. | ||
/// </summary> | ||
public static ImmutableArray<int> GetOrCreateHashArray(string value, bool isCaseSensitive, int hashFunctionCount) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should mention you can get a larger array, but should only use up to the first hashFunctionCount entries in it.
1975bb8
to
7117760
Compare
/azp run |
Azure Pipelines successfully started running 3 pipeline(s). |
Although this isn't horribly expensive, I noticed that we typically call BloomFilter.ProbablyContains many times with the same input string. This string is then hashed a number of times (about 13 times if the string is present, less if not). Instead, we can cache what these hashes are as almost all created bloom filters will calculate the same hashes.
Testing yielded around a 99% hit rate. Should have a slight positive effect for find references and other scenarios using the bloom filters.