Add single value cache to BloomFilter hash calculation #73103

ToddGrun · 2024-04-19T14:12:50Z

Although this isn't horribly expensive, I noticed that we typically call BloomFilter.ProbablyContains many times with the same input string. This string is then hashed a number of times (about 13 times if the string is present, fewer if not). Instead, we can cache these hashes, since almost all created bloom filters will calculate the same ones.

Testing yielded around a 99% hit rate. This should have a slight positive effect for find references and other scenarios using the bloom filters.

CyrusNajmabadi · 2024-04-19T15:33:53Z

src/Workspaces/Core/Portable/Shared/Utilities/BloomFilter.cs

+
+            if (cachedHash == null
+                || cachedHash._isCaseSensitive != filter._isCaseSensitive
+                || cachedHash._hashes.Length < filter._hashFunctionCount


Why is this a < and not a !=?

Because a longer array of hashes is fine. The caller may not use all the values in the array we hand back, but that's not a big deal.
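
To make the `<` check concrete, here is a minimal sketch of a single-element cache that reuses an entry whenever it holds at least as many hashes as requested. The member names mirror the diff, but the class name `BloomFilterHashSketch`, the `Misses` counter, and the FNV-1a style `ComputeHash` body are assumptions for illustration, not the code in this PR.

```csharp
using System;
using System.Collections.Immutable;

public sealed class BloomFilterHashSketch
{
    // Single-element cache shared by all callers.
    private static BloomFilterHashSketch s_cachedHash;

    // Counts cache misses, purely for demonstration.
    public static int Misses;

    private readonly string _value;
    private readonly bool _isCaseSensitive;
    private readonly ImmutableArray<int> _hashes;

    private BloomFilterHashSketch(string value, bool isCaseSensitive, int hashFunctionCount)
    {
        _value = value;
        _isCaseSensitive = isCaseSensitive;

        var builder = ImmutableArray.CreateBuilder<int>(hashFunctionCount);
        for (var i = 0; i < hashFunctionCount; i++)
            builder.Add(ComputeHash(value, i, isCaseSensitive));
        _hashes = builder.MoveToImmutable();
    }

    // Hypothetical seeded hash: entry i depends only on (value, i, case sensitivity).
    private static int ComputeHash(string value, int seed, bool isCaseSensitive)
    {
        unchecked
        {
            var hash = 2166136261u ^ (uint)seed;
            foreach (var ch in value)
            {
                hash ^= isCaseSensitive ? ch : char.ToLowerInvariant(ch);
                hash *= 16777619u;
            }
            return (int)hash;
        }
    }

    public static ImmutableArray<int> GetOrCreateHashArray(string value, bool isCaseSensitive, int hashFunctionCount)
    {
        // Read the static once so the checks below all look at the same entry.
        var cachedHash = s_cachedHash;

        // "<" rather than "!=": an entry with more hashes than requested is still usable.
        if (cachedHash == null
            || cachedHash._isCaseSensitive != isCaseSensitive
            || cachedHash._hashes.Length < hashFunctionCount
            || cachedHash._value != value)
        {
            Misses++;
            cachedHash = new BloomFilterHashSketch(value, isCaseSensitive, hashFunctionCount);
            s_cachedHash = cachedHash;
        }

        return cachedHash._hashes;
    }

    public static void Main()
    {
        GetOrCreateHashArray("Goo", isCaseSensitive: true, hashFunctionCount: 13); // miss: cache is empty
        var second = GetOrCreateHashArray("Goo", isCaseSensitive: true, hashFunctionCount: 5); // hit: longer entry reused
        Console.WriteLine($"{second.Length} entries returned, {Misses} miss(es)");
    }
}
```

A caller that asked for 5 hashes may receive 13 here and should read only the first 5.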

CyrusNajmabadi · 2024-04-19T15:36:49Z

src/Workspaces/Core/Portable/Shared/Utilities/BloomFilter.cs

+    /// <summary>
+    /// Provides mechanism to efficiently obtain bloom filter hash for a value. Backed by a single element cache.
+    /// </summary>
+    internal class BloomFilterHash


CyrusNajmabadi · 2024-04-19T15:39:56Z

src/Workspaces/Core/Portable/Shared/Utilities/BloomFilter.cs

+
+            for (var i = 0; i < filter._hashFunctionCount; i++)
+                hashBuilder.Add(filter.GetBitArrayIndex(value, i));
+


This seems wrong (though it is early). The result of GetBitArrayIndex depends on internal state of the filter, so it doesn't seem like it could be used across filters.
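
A small sketch of why a per-filter index cannot be shared, and why a later commit in this PR mentions still needing the mod by length: the raw hash can be filter-independent, but the bit index depends on the length of each filter's bit array. The class name, the sample hash values, and the bit array lengths below are all made up for illustration; only the name `GetBitArrayIndex` comes from the diff, and its body here is an assumption.

```csharp
using System;

public static class BitIndexDemo
{
    // Maps a raw, filter-independent hash onto one specific filter's bit array.
    // Assumes rawHash is non-negative so the remainder is a valid index.
    public static int GetBitArrayIndex(int rawHash, int bitArrayLength) => rawHash % bitArrayLength;

    public static void Main()
    {
        // Pretend these raw hashes came from a cache shared across filters.
        int[] cachedRawHashes = { 1000, 4242, 77 };

        foreach (var rawHash in cachedRawHashes)
        {
            var small = GetBitArrayIndex(rawHash, 64);   // a filter with a 64-bit array
            var large = GetBitArrayIndex(rawHash, 1000); // a filter with a 1000-bit array
            Console.WriteLine($"{rawHash}: small={small}, large={large}");
        }
    }
}
```

Caching the raw hashes and applying the modulo inside each filter keeps the cache valid for filters of any size.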

CyrusNajmabadi · 2024-04-19T15:43:22Z

src/EditorFeatures/Test/Utilities/BloomFilterTests.cs

+        }
+
+        [Fact]
+        public void TestCacheAfterCalls()


We need tests with vastly different Bloom filters, demonstrating we get expected probability results.

Is this a general request, or is this something you think needs testing in light of the code now using a cache?

CyrusNajmabadi · 2024-04-19T15:43:50Z

I think there's a subtle but very problematic bug, if I'm remembering how this works properly.

CyrusNajmabadi · 2024-04-19T18:03:12Z

src/Workspaces/Core/Portable/Shared/Utilities/BloomFilter.cs

    public bool ProbablyContains(string value)
    {
+        var hashes = BloomFilterHash.GetOrCreateHashArray(value, _isCaseSensitive, _hashFunctionCount);


ok. this is subtle, and needs docs. mention explicitly that hashes may contain more hash values than the _hashFunctionCount passed in. But that's ok as the first _hashFunctionCount hashes are guaranteed to be the same due to how ComputeHash works.

CyrusNajmabadi · 2024-04-19T18:04:22Z

src/Workspaces/Core/Portable/Shared/Utilities/BloomFilter.cs

+        {
+            var cachedHash = s_cachedHash;
+
+            // Not an equivalency check on the hashFunctionCount as a longer array is ok.


i'd prefer a longer explanation of why it is ok. specifically that when getting the hashes, you always get the same prefix of hashes regardless of how many hash functions you ask for. In other words, hash[0] is always the same across all the arrays, as long as you have the same value and case sensitivity, and same for hash[1], and so on.
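
The prefix property described above can be sketched as follows, assuming the i-th hash is a function of the value and the index i only. The `ComputeHash` body here is a hypothetical seeded FNV-1a style hash and `GetHashes` is a made-up helper; neither is taken from BloomFilter.cs.

```csharp
using System;
using System.Linq;

public static class HashPrefixDemo
{
    // Hypothetical seeded hash: the result depends only on (value, seed),
    // never on how many hashes the caller asked for in total.
    public static int ComputeHash(string value, int seed)
    {
        unchecked
        {
            var hash = 2166136261u ^ (uint)seed;
            foreach (var ch in value)
            {
                hash ^= ch;
                hash *= 16777619u;
            }
            return (int)hash;
        }
    }

    // Builds the first hashFunctionCount hashes for a value.
    public static int[] GetHashes(string value, int hashFunctionCount) =>
        Enumerable.Range(0, hashFunctionCount).Select(i => ComputeHash(value, i)).ToArray();

    public static void Main()
    {
        var shorter = GetHashes("Goo", 5);
        var longer = GetHashes("Goo", 13);

        // The longer array starts with exactly the entries of the shorter one.
        Console.WriteLine(longer.Take(shorter.Length).SequenceEqual(shorter));
    }
}
```

Under that assumption, a filter that needs only 5 hashes can read the first 5 entries of a 13-entry array and get the same result as computing them itself.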

CyrusNajmabadi · 2024-04-19T18:04:42Z

src/Workspaces/Core/Portable/Shared/Utilities/BloomFilter.cs

+        /// we put those values into a simple cache and see if it can be used before calculating.
+        /// Local testing has put the hit rate of this at around 99%.
+        /// </summary>
+        public static ImmutableArray<int> GetOrCreateHashArray(string value, bool isCaseSensitive, int hashFunctionCount)


this should mention you can get a larger array, but should only use up to the first hashFunctionCount entries in it.

ToddGrun · 2024-04-20T02:52:04Z

/azp run

azure-pipelines · 2024-04-20T02:52:22Z

Azure Pipelines successfully started running 3 pipeline(s).

ToddGrun requested a review from a team as a code owner April 19, 2024 14:12

dotnet-issue-labeler bot added Area-IDE untriaged Issues and PRs which have not yet been triaged by a lead labels Apr 19, 2024

CyrusNajmabadi reviewed Apr 19, 2024

CyrusNajmabadi requested changes Apr 19, 2024

ToddGrun added 2 commits April 19, 2024 09:47

Fix subtle bug Cyrus pointed out in PR where we still need to do the …

347b400

…mod length to get the BitArrayIndex

be consistent in conventions across both ProbablyContains calls

dac1cf2

CyrusNajmabadi force-pushed the BloomFilterHashCache branch from 323b692 to f47186e Compare April 19, 2024 17:50

Add test

ede3bb2

CyrusNajmabadi force-pushed the BloomFilterHashCache branch from f47186e to ede3bb2 Compare April 19, 2024 17:51

CyrusNajmabadi reviewed Apr 19, 2024

Test

7117760

CyrusNajmabadi force-pushed the BloomFilterHashCache branch from 1975bb8 to 7117760 Compare April 19, 2024 18:17

Update comments

c492002

Merge branch 'dotnet:main' into BloomFilterHashCache

8ef6a7d

CyrusNajmabadi approved these changes Apr 20, 2024

ToddGrun merged commit 5134c93 into dotnet:main Apr 20, 2024
25 checks passed

dotnet-policy-service bot added this to the Next milestone Apr 20, 2024

ToddGrun deleted the BloomFilterHashCache branch April 21, 2024 15:28

This was referenced Apr 23, 2024

[Automated] PRs inserted in VS build main-34822.156 #73187

Closed

[Automated] PRs inserted in VS build feature.debugger.main-34823.102 #73204

Closed

dibarbet modified the milestones: Next, 17.11 P1 Apr 29, 2024
