Add single value cache to BloomFilter hash calculation #73103

Merged (7 commits) on Apr 20, 2024

ToddGrun
Contributor

Although this isn't horribly expensive, I noticed that we typically call BloomFilter.ProbablyContains many times with the same input string. That string is then hashed repeatedly (about 13 times if the string is present, fewer if not). Instead, we can cache these hashes, as almost all created bloom filters will calculate the same hashes.

Testing yielded around a 99% hit rate. Should have a slight positive effect for find references and other scenarios using the bloom filters.

@ToddGrun ToddGrun requested a review from a team as a code owner April 19, 2024 14:12
@dotnet-issue-labeler dotnet-issue-labeler bot added Area-IDE untriaged Issues and PRs which have not yet been triaged by a lead labels Apr 19, 2024

if (cachedHash == null
    || cachedHash._isCaseSensitive != filter._isCaseSensitive
    || cachedHash._hashes.Length < filter._hashFunctionCount
Member

Why is this a < and not a !=?

Contributor Author

Because a longer array of hashes is fine. The caller may not use all the values in the array we hand back, but that's not a big deal.

/// <summary>
/// Provides mechanism to efficiently obtain bloom filter hash for a value. Backed by a single element cache.
/// </summary>
internal class BloomFilterHash
Member

Make this class sealed.


for (var i = 0; i < filter._hashFunctionCount; i++)
    hashBuilder.Add(filter.GetBitArrayIndex(value, i));

Member

This seems wrong (though it is early). The result of GetBitArrayIndex depends on internal state of the filter, so it doesn't seem like it could be used across filters.
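To make the concern above concrete, here is a minimal sketch of one way a hash can be split into a filter-independent part (safe to cache across filters) and a filter-dependent part (not safe to cache). It assumes the only filter-specific step is reducing the hash modulo the bit array length; that assumption, the FNV-1a style mixing, and the names `HashSplitSketch`, `ComputeRawHash`, and `ToBitIndex` are all hypothetical and not taken from the Roslyn source.

```csharp
using System;

// Hypothetical sketch, not the Roslyn implementation.
public static class HashSplitSketch
{
    // Filter-independent: depends only on the value, the case sensitivity,
    // and the hash function index (used here as a seed).
    public static int ComputeRawHash(string value, bool isCaseSensitive, int seed)
    {
        unchecked
        {
            // FNV-1a style mixing, chosen only for illustration.
            uint hash = 2166136261u ^ (uint)seed;
            foreach (var ch in value)
            {
                var c = isCaseSensitive ? ch : char.ToLowerInvariant(ch);
                hash = (hash ^ c) * 16777619u;
            }

            // Clear the sign bit so the later modulo never yields a negative index.
            return (int)(hash & 0x7FFFFFFF);
        }
    }

    // Filter-dependent: the same raw hash can map to different bits
    // in filters with different bit array lengths.
    public static int ToBitIndex(int rawHash, int bitArrayLength)
        => rawHash % bitArrayLength;
}
```

Under this split, only the result of `ComputeRawHash` would be a candidate for sharing between filters; each filter would still apply its own `ToBitIndex` step.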

}

[Fact]
public void TestCacheAfterCalls()
Member

We need tests with vastly different Bloom filters, demonstrating we get expected probability results.

Contributor Author

Is this a general request, or is this something you think needs testing in light of the code now using a cache?

@CyrusNajmabadi
Member

I think there's a subtle but very problematic bug, if I'm remembering how this works properly.

public bool ProbablyContains(string value)
{
    var hashes = BloomFilterHash.GetOrCreateHashArray(value, _isCaseSensitive, _hashFunctionCount);
Member

OK. This is subtle and needs docs. Mention explicitly that hashes may contain more hash values than the _hashFunctionCount passed in. But that's OK, as the first _hashFunctionCount hashes are guaranteed to be the same due to how ComputeHash works.

{
    var cachedHash = s_cachedHash;

    // Not an equivalency check on the hashFunctionCount as a longer array is ok.
Member

I'd prefer a longer explanation of why it is OK. Specifically, that when getting the hashes you always get the same prefix of hashes regardless of how many hash functions you ask for. In other words, hash[0] is always the same across all the arrays, as long as you have the same value and case sensitivity, and the same goes for hash[1], and so on.
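The prefix property described above can be sketched as a small single-element cache. This is an illustration under stated assumptions, not the code merged in this PR: the class name `HashCacheSketch`, the mixing inside `ComputeHash`, and the ordinal string comparison are all hypothetical. The point it shows is that hash `i` depends only on the value, the case sensitivity, and `i`, so a cached array that is longer than requested still has the right first `hashFunctionCount` entries.

```csharp
#nullable enable
using System;
using System.Collections.Immutable;

// Hypothetical sketch of a single-element hash cache; not the code merged in this PR.
public sealed class HashCacheSketch
{
    private static HashCacheSketch? s_cached;

    private readonly string _value;
    private readonly bool _isCaseSensitive;
    private readonly ImmutableArray<int> _hashes;

    private HashCacheSketch(string value, bool isCaseSensitive, ImmutableArray<int> hashes)
    {
        _value = value;
        _isCaseSensitive = isCaseSensitive;
        _hashes = hashes;
    }

    // Hash i depends only on (value, isCaseSensitive, i), never on how many
    // hashes were requested. The mixing itself is illustrative only.
    public static int ComputeHash(string value, bool isCaseSensitive, int i)
    {
        unchecked
        {
            uint hash = 2166136261u ^ (uint)i;
            foreach (var ch in value)
            {
                var c = isCaseSensitive ? ch : char.ToLowerInvariant(ch);
                hash = (hash ^ c) * 16777619u;
            }
            return (int)hash;
        }
    }

    // May return MORE than hashFunctionCount entries; callers should read
    // only the first hashFunctionCount of them.
    public static ImmutableArray<int> GetOrCreateHashArray(
        string value, bool isCaseSensitive, int hashFunctionCount)
    {
        // Read the static field once so every check below sees the same entry.
        var cached = s_cached;

        if (cached is null
            || cached._isCaseSensitive != isCaseSensitive
            || cached._hashes.Length < hashFunctionCount // '<', not '!=': a longer array is fine
            || cached._value != value)
        {
            var builder = ImmutableArray.CreateBuilder<int>(hashFunctionCount);
            for (var i = 0; i < hashFunctionCount; i++)
                builder.Add(ComputeHash(value, isCaseSensitive, i));

            cached = new HashCacheSketch(value, isCaseSensitive, builder.MoveToImmutable());
            s_cached = cached;
        }

        return cached._hashes;
    }
}
```

For example, asking for 5 hashes of a value and then 3 hashes of the same value would return the cached 5-element array on the second call; the caller reads entries 0 through 2, which match what a fresh 3-hash computation would have produced.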

/// we put those values into a simple cache and see if it can be used before calculating.
/// Local testing has put the hit rate of this at around 99%.
/// </summary>
public static ImmutableArray<int> GetOrCreateHashArray(string value, bool isCaseSensitive, int hashFunctionCount)
Member

this should mention you can get a larger array, but should only use up to the first hashFunctionCount entries in it.

@ToddGrun
Contributor Author

/azp run


Azure Pipelines successfully started running 3 pipeline(s).

@ToddGrun ToddGrun merged commit 5134c93 into dotnet:main Apr 20, 2024
25 checks passed
@dotnet-policy-service dotnet-policy-service bot added this to the Next milestone Apr 20, 2024
@ToddGrun ToddGrun deleted the BloomFilterHashCache branch April 21, 2024 15:28
@dibarbet dibarbet modified the milestones: Next, 17.11 P1 Apr 29, 2024
3 participants