Lots of tweaks to FrozenDictionary/Set analysis and hashing #86293

Merged
merged 1 commit into dotnet:main on May 22, 2023

Conversation

stephentoub
Member

Contributes to #77891

  • Some minor internal refactorings, e.g. changed the internal KeyAnalyzer.Analyze from being void returning with an out results to instead just return those results, and did some internal renaming.
  • We were previously computing the min/max string length in KeyAnalyzer.Analyze twice, once in UseSubstring and then again after that in Analyze. We were also computing it in the length-buckets analyzer. Changed the code to compute it just once at the beginning and then pass that information to everywhere that needs it (see the length-bounds sketch after this list).
  • Once that information is passed in, AnalysisResults can then be made readonly, rather than being mutated to store the min/max after it was constructed.
  • If the min length is 0, there's no point calling TryUseSubstring as it'll never find one, so skip the call to it entirely if there's an empty string in the mix.
  • In TryUseSubstring, for a given substring length first check all the left justifications and then check all the right justifications, rather than intermingling them. Left justifications are a bit cheaper to look up, plus we can avoid creating any objects related to checking right justification if we end up finding a left justification for a given substring length first.
  • We can also avoid doing any right justification checks if all of the input strings are of the same length, as at that point there's nothing right justification could find that left justification couldn't.
  • When constructing the HashSets used to evaluate uniqueness, on more recent .NET versions we can presize them to avoid repeated expansion as the items are added (see the HashSet sketch after this list).
  • Importantly, set a limit on the maximum length of substring we'll consider. This significantly curtails the worst-case analysis time for very large inputs that don't yield any unique substrings. While it's certainly possible to construct cases where this will fail to find a substring it otherwise could have, such cases are much rarer, and the longer the substring we'd end up using, the less we'd be saving on hashing costs, which is the only thing the substring optimization avoids.
  • For hashing, create dedicated branchless hash checks for each of lengths 0 to 4 (sketched after this list).
  • When we're doing the ASCII check to see whether we can use an ASCII-optimized comparer, if OrdinalIgnoreCase was requested we can also check whether the substrings contain any ASCII letters; if they contain only ASCII non-letters, we can switch to being case-sensitive, as casing cannot affect the comparisons (see the ASCII sketch after this list).
  • The ASCII check downlevel was erroneously including 0x7f as being non-ASCII. Fixed it to be 0x80 instead of 0x7f.
  • Changed GetHashCodeOrdinalIgnoreCase to delegate to GetHashCodeOrdinal after doing its non-ASCII casing work.
  • On .NET 6+, we can use GetValueRefOrAddDefault to avoid some dictionary lookups (see the usage sketch after this list).
  • For the length-bucketing implementation, we can do a quick up-front check that rules out many inputs: just from the number of input strings and the min/max lengths we can tell whether some bucket would be forced to be too big (see the pre-check sketch after this list).
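
As a rough illustration of the min/max point above (a hedged sketch only; the helper name is made up and this is not the actual KeyAnalyzer code): one pass over the keys computes both bounds, which are then passed to every stage that needs them instead of being recomputed.

```csharp
using System;

internal static class LengthBounds
{
    // Compute the minimum and maximum key length in a single pass so that
    // substring analysis and the length-buckets analyzer can both reuse them.
    public static (int MinLength, int MaxLength) Compute(ReadOnlySpan<string> keys)
    {
        int min = int.MaxValue, max = 0;
        foreach (string key in keys)
        {
            if (key.Length < min) min = key.Length;
            if (key.Length > max) max = key.Length;
        }
        return (min, max);
    }
}
```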

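For the HashSet presizing item, a minimal sketch under the assumption that the target framework check is expressed with a preprocessor symbol such as NET6_0_OR_GREATER (the helper name and exact condition are illustrative, not the PR's actual code):

```csharp
using System.Collections.Generic;

internal static class UniquenessSet
{
    public static HashSet<string> Create(int capacity, IEqualityComparer<string> comparer)
    {
#if NET6_0_OR_GREATER
        // Presize so adding every candidate substring never forces a resize/rehash.
        return new HashSet<string>(capacity, comparer);
#else
        // Downlevel targets without the (capacity, comparer) constructor grow as items are added.
        return new HashSet<string>(comparer);
#endif
    }
}
```
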
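For the dedicated per-length hashing item, a hedged sketch of the shape of the idea (the function names and mixing constants are illustrative, not the comparers the PR adds): each tiny substring length gets its own routine that reads a fixed number of characters with no loop and no length-dependent branching; length 0 would simply be a constant.

```csharp
using System;

internal static class SmallLengthHashes
{
    // Fixed-length hash routines for substrings of length 1 through 4.
    public static int Hash1(ReadOnlySpan<char> s) => s[0];
    public static int Hash2(ReadOnlySpan<char> s) => s[0] | (s[1] << 16);
    public static int Hash3(ReadOnlySpan<char> s) => s[0] ^ (s[1] << 10) ^ (s[2] << 20);
    public static int Hash4(ReadOnlySpan<char> s) =>
        (s[0] | (s[1] << 16)) ^ (int)((uint)(s[2] | (s[3] << 16)) * 0x9E3779B1u);
}
```
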
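For the two ASCII-related items, a small sketch of the checks being described (hedged; the helper names are made up, and in the real code these run over the candidate substrings):

```csharp
using System;

internal static class AsciiChecks
{
    // 0x7F (DEL) is still ASCII; only values of 0x80 and above are non-ASCII.
    // This is the boundary the downlevel check had wrong.
    public static bool IsAllAscii(ReadOnlySpan<char> s)
    {
        foreach (char c in s)
        {
            if (c >= 0x80)
            {
                return false;
            }
        }
        return true;
    }

    // If OrdinalIgnoreCase was requested but the (all-ASCII) substrings contain no
    // ASCII letters, casing can't change any comparison of those substrings, so a
    // case-sensitive ordinal comparer can be substituted.
    public static bool ContainsAsciiLetter(ReadOnlySpan<char> s)
    {
        foreach (char c in s)
        {
            if ((uint)((c | 0x20) - 'a') <= 'z' - 'a')
            {
                return true;
            }
        }
        return false;
    }
}
```
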
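The GetValueRefOrAddDefault item refers to System.Runtime.InteropServices.CollectionsMarshal. A minimal usage sketch (not the PR's actual call site; the grouping-by-length scenario is just an example) showing how it replaces a TryGetValue-plus-set pair with a single lookup:

```csharp
#nullable enable
using System.Collections.Generic;
using System.Runtime.InteropServices;

internal static class BucketHelper
{
    // One hash/probe instead of two: the returned ref points at the value slot,
    // which is default (null) when the key was just added.
    public static void AddByLength(Dictionary<int, List<string>> buckets, string s)
    {
        ref List<string>? bucket = ref CollectionsMarshal.GetValueRefOrAddDefault(buckets, s.Length, out _);
        bucket ??= new List<string>();
        bucket.Add(s);
    }
}
```
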
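And for the last item, a hedged sketch of the kind of up-front pigeonhole check described (the per-bucket limit shown is hypothetical, not necessarily the value the implementation uses):

```csharp
internal static class LengthBucketPreCheck
{
    private const int MaxPerLengthBucket = 5; // hypothetical limit on entries sharing a length

    // With `count` keys spread over at most (maxLength - minLength + 1) distinct lengths,
    // if even a perfectly even spread would exceed the per-bucket limit, some bucket is
    // guaranteed to be too big, so length bucketing can be rejected without building anything.
    public static bool MightFitLengthBuckets(int count, int minLength, int maxLength)
    {
        long distinctLengths = (long)maxLength - minLength + 1;
        return count <= distinctLengths * MaxPerLengthBucket;
    }
}
```
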
@ghost

ghost commented May 16, 2023

Tagging subscribers to this area: @dotnet/area-system-collections
See info in area-owners.md if you want to be subscribed.

@stephentoub
Member Author

Importantly, set a limit on the maximum length of substring we'll consider.

Here's a terrible, don't-do-this-at-home example of the "worst case" behavior this avoids. Essentially, today the ordinal optimization is significantly influenced by the length of the strings involved, having an O(N^2) component in the length N. By limiting the maximum substring length considered to a small constant (right now in the PR, 8), that O(N^2) component drops to O(N); a small sketch of the capping idea follows the results table below. You can see this with a bad example like:

```csharp
private KeyValuePair<string, string>[] _pairs =
    Enumerable.Range(0, 1000)
              .Select(i => new StringBuilder().Append('a', i).Append('b').Append('a', 1000 - i - 1).ToString())
              .Select(s => KeyValuePair.Create(s, s))
              .ToArray();

[Benchmark]
public FrozenDictionary<string, string> BadCtorCase() => FrozenDictionary.ToFrozenDictionary(_pairs, optimizeForReading: true);
```

This populates the dictionary with 1000 strings, each of length 1000, for which the analysis won't find a good substring to use.

| Method      | Toolchain          | Mean         | Ratio | Allocated | Alloc Ratio |
|-------------|--------------------|-------------:|------:|----------:|------------:|
| BadCtorCase | \main\corerun.exe  | 102,736.7 ms | 1.000 | 351.57 KB | 1.00        |
| BadCtorCase | \pr\corerun.exe    | 150.5 ms     | 0.001 | 196.82 KB | 0.56        |
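
As a rough sketch of the cap itself (hedged; the constant and helper names are made up, though the value 8 matches what the comment above says the PR currently uses): bounding the candidate substring lengths by a small constant rather than by the key length is what turns the O(N^2)-in-key-length component into O(N).

```csharp
using System;
using System.Collections.Generic;

internal static class CapSketch
{
    private const int MaxSubstringLengthLimit = 8; // per the comment above, the PR's current limit

    // Candidate substring lengths are bounded by both the shortest key and the cap,
    // so 1000-character keys yield at most 8 candidate lengths rather than up to 1000.
    public static IEnumerable<int> CandidateLengths(int minKeyLength)
    {
        int max = Math.Min(minKeyLength, MaxSubstringLengthLimit);
        for (int length = 1; length <= max; length++)
        {
            yield return length;
        }
    }
}
```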

@stephentoub stephentoub merged commit cb35f65 into dotnet:main May 22, 2023
105 checks passed
@stephentoub stephentoub deleted the frozentweaks branch May 22, 2023 14:34
@dotnet dotnet locked as resolved and limited conversation to collaborators Jun 21, 2023