Lots of tweaks to FrozenDictionary/Set analysis and hashing #86293

Merged
merged 1 commit into dotnet:main on May 22, 2023

Conversation

stephentoub
Member

Contributes to #77891

  • Some minor internal refactorings, e.g. changed the internal KeyAnalyzer.Analyze from being void returning with an out results to instead just return those results, and did some internal renaming.
  • We were previously computing the min/max string length in KeyAnalyzer.Analyze twice, once in UseSubstring and then again after that in Analyze. We were also computing it in the length-buckets analyzer. Changed the code to compute it just once at the beginning and then pass that information to everywhere that needs it (see the length-bounds sketch after this list).
  • Once that information is passed in, AnalysisResults can then be made readonly, rather than being mutated to store the min/max after it was constructed.
  • If the min length is 0, there's no point calling TryUseSubstring as it'll never find one, so skip the call to it entirely if there's an empty string in the mix.
  • In TryUseSubstring, for a given substring length first check all the left justifications and then check all the right justifications, rather than intermingling them. Left justifications are a bit cheaper to look up, plus we can avoid creating any objects related to checking right justification if we end up finding a left justification for a given substring length first.
  • We can also avoid doing any right justification checks if all of the input strings are of the same length, as at that point there's nothing right justification could find that left justification couldn't.
  • When constructing the HashSets used to evaluate uniqueness, on more recent .NET versions we can presize them to avoid repeated expansion as the items are added (see the HashSet sketch after this list).
  • Importantly, set a limit on the maximum length of substring we'll consider. This significantly curtails the worst-case analysis time for very large inputs that don't yield any unique substrings. While it's certainly possible to construct cases where this will fail to find a substring it otherwise could have, such cases are much rarer, and the longer the substring we'd end up using, the less we'd be saving on hashing costs, which is the only thing the substring optimization avoids.
  • For hashing, create dedicated branchless hash checks for each of lengths 0 to 4 (sketched after this list).
  • When we're doing the ASCII check to see whether we can use an ASCII-optimized comparer, if OrdinalIgnoreCase was requested we can also check whether the substrings contain any ASCII letters; if they contain only ASCII non-letters, we can switch to being case-sensitive, as casing cannot affect the comparisons (see the ASCII sketch after this list).
  • The ASCII check downlevel was erroneously including 0x7f as being non-ASCII. Fixed it to be 0x80 instead of 0x7f.
  • Changed GetHashCodeOrdinalIgnoreCase to delegate to GetHashCodeOrdinal after doing its non-ASCII casing work.
  • On .NET 6+, we can use GetValueRefOrAddDefault to avoid some dictionary lookups (see the usage sketch after this list).
  • For the length-bucketing implementation, we can do a quick up-front check that rules out many inputs: just from the number of input strings and the min/max lengths we can tell whether some bucket would be forced to be too big (see the pre-check sketch after this list).
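
As a rough illustration of the min/max point above (a hedged sketch only; the helper name is made up and this is not the actual KeyAnalyzer code): one pass over the keys computes both bounds, which are then passed to every stage that needs them instead of being recomputed.

```csharp
using System;

internal static class LengthBounds
{
    // Compute the minimum and maximum key length in a single pass so that
    // substring analysis and the length-buckets analyzer can both reuse them.
    public static (int MinLength, int MaxLength) Compute(ReadOnlySpan<string> keys)
    {
        int min = int.MaxValue, max = 0;
        foreach (string key in keys)
        {
            if (key.Length < min) min = key.Length;
            if (key.Length > max) max = key.Length;
        }
        return (min, max);
    }
}
```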

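For the HashSet presizing item, a minimal sketch under the assumption that the target framework check is expressed with a preprocessor symbol such as NET6_0_OR_GREATER (the helper name and exact condition are illustrative, not the PR's actual code):

```csharp
using System.Collections.Generic;

internal static class UniquenessSet
{
    public static HashSet<string> Create(int capacity, IEqualityComparer<string> comparer)
    {
#if NET6_0_OR_GREATER
        // Presize so adding every candidate substring never forces a resize/rehash.
        return new HashSet<string>(capacity, comparer);
#else
        // Downlevel targets without the (capacity, comparer) constructor grow as items are added.
        return new HashSet<string>(comparer);
#endif
    }
}
```
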
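For the dedicated per-length hashing item, a hedged sketch of the shape of the idea (the function names and mixing constants are illustrative, not the comparers the PR adds): each tiny substring length gets its own routine that reads a fixed number of characters with no loop and no length-dependent branching; length 0 would simply be a constant.

```csharp
using System;

internal static class SmallLengthHashes
{
    // Fixed-length hash routines for substrings of length 1 through 4.
    public static int Hash1(ReadOnlySpan<char> s) => s[0];
    public static int Hash2(ReadOnlySpan<char> s) => s[0] | (s[1] << 16);
    public static int Hash3(ReadOnlySpan<char> s) => s[0] ^ (s[1] << 10) ^ (s[2] << 20);
    public static int Hash4(ReadOnlySpan<char> s) =>
        (s[0] | (s[1] << 16)) ^ (int)((uint)(s[2] | (s[3] << 16)) * 0x9E3779B1u);
}
```
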
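For the two ASCII-related items, a small sketch of the checks being described (hedged; the helper names are made up, and in the real code these run over the candidate substrings):

```csharp
using System;

internal static class AsciiChecks
{
    // 0x7F (DEL) is still ASCII; only values of 0x80 and above are non-ASCII.
    // This is the boundary the downlevel check had wrong.
    public static bool IsAllAscii(ReadOnlySpan<char> s)
    {
        foreach (char c in s)
        {
            if (c >= 0x80)
            {
                return false;
            }
        }
        return true;
    }

    // If OrdinalIgnoreCase was requested but the (all-ASCII) substrings contain no
    // ASCII letters, casing can't change any comparison of those substrings, so a
    // case-sensitive ordinal comparer can be substituted.
    public static bool ContainsAsciiLetter(ReadOnlySpan<char> s)
    {
        foreach (char c in s)
        {
            if ((uint)((c | 0x20) - 'a') <= 'z' - 'a')
            {
                return true;
            }
        }
        return false;
    }
}
```
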
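The GetValueRefOrAddDefault item refers to System.Runtime.InteropServices.CollectionsMarshal. A minimal usage sketch (not the PR's actual call site; the grouping-by-length scenario is just an example) showing how it replaces a TryGetValue-plus-set pair with a single lookup:

```csharp
#nullable enable
using System.Collections.Generic;
using System.Runtime.InteropServices;

internal static class BucketHelper
{
    // One hash/probe instead of two: the returned ref points at the value slot,
    // which is default (null) when the key was just added.
    public static void AddByLength(Dictionary<int, List<string>> buckets, string s)
    {
        ref List<string>? bucket = ref CollectionsMarshal.GetValueRefOrAddDefault(buckets, s.Length, out _);
        bucket ??= new List<string>();
        bucket.Add(s);
    }
}
```
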
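And for the last item, a hedged sketch of the kind of up-front pigeonhole check described (the per-bucket limit shown is hypothetical, not necessarily the value the implementation uses):

```csharp
internal static class LengthBucketPreCheck
{
    private const int MaxPerLengthBucket = 5; // hypothetical limit on entries sharing a length

    // With `count` keys spread over at most (maxLength - minLength + 1) distinct lengths,
    // if even a perfectly even spread would exceed the per-bucket limit, some bucket is
    // guaranteed to be too big, so length bucketing can be rejected without building anything.
    public static bool MightFitLengthBuckets(int count, int minLength, int maxLength)
    {
        long distinctLengths = (long)maxLength - minLength + 1;
        return count <= distinctLengths * MaxPerLengthBucket;
    }
}
```
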
@ghost

ghost commented May 16, 2023

Tagging subscribers to this area: @dotnet/area-system-collections
See info in area-owners.md if you want to be subscribed.

@stephentoub
Member Author

Importantly, set a limit on the maximum length of substring we'll consider.

Here's a terrible, don't-do-this-at-home example of the "worst case" behavior this avoids. Essentially, today the ordinal optimization is significantly influenced by the length of the strings involved, having an O(N^2) component in the length N. By limiting the maximum substring length considered to a small constant (right now in the PR, 8), that O(N^2) component drops to O(N); a small sketch of the capping idea follows the results table below. You can see this with a bad example like:

```csharp
private KeyValuePair<string, string>[] _pairs =
    Enumerable.Range(0, 1000)
              .Select(i => new StringBuilder().Append('a', i).Append('b').Append('a', 1000 - i - 1).ToString())
              .Select(s => KeyValuePair.Create(s, s))
              .ToArray();

[Benchmark]
public FrozenDictionary<string, string> BadCtorCase() => FrozenDictionary.ToFrozenDictionary(_pairs, optimizeForReading: true);
```

This populates the dictionary with 1000 strings, each of length 1000, for which the analysis won't find a good substring to use.

| Method      | Toolchain          | Mean         | Ratio | Allocated | Alloc Ratio |
|-------------|--------------------|-------------:|------:|----------:|------------:|
| BadCtorCase | \main\corerun.exe  | 102,736.7 ms | 1.000 | 351.57 KB | 1.00        |
| BadCtorCase | \pr\corerun.exe    | 150.5 ms     | 0.001 | 196.82 KB | 0.56        |
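
As a rough sketch of the cap itself (hedged; the constant and helper names are made up, though the value 8 matches what the comment above says the PR currently uses): bounding the candidate substring lengths by a small constant rather than by the key length is what turns the O(N^2)-in-key-length component into O(N).

```csharp
using System;
using System.Collections.Generic;

internal static class CapSketch
{
    private const int MaxSubstringLengthLimit = 8; // per the comment above, the PR's current limit

    // Candidate substring lengths are bounded by both the shortest key and the cap,
    // so 1000-character keys yield at most 8 candidate lengths rather than up to 1000.
    public static IEnumerable<int> CandidateLengths(int minKeyLength)
    {
        int max = Math.Min(minKeyLength, MaxSubstringLengthLimit);
        for (int length = 1; length <= max; length++)
        {
            yield return length;
        }
    }
}
```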

@stephentoub stephentoub merged commit cb35f65 into dotnet:main May 22, 2023
105 checks passed
@stephentoub stephentoub deleted the frozentweaks branch May 22, 2023 14:34
@dotnet dotnet locked as resolved and limited conversation to collaborators Jun 21, 2023