Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Frozen collection follow-ups #77891

Open
4 of 6 tasks
Tracked by #77391
stephentoub opened this issue Nov 4, 2022 · 53 comments
Open
4 of 6 tasks
Tracked by #77391

Frozen collection follow-ups #77891

stephentoub opened this issue Nov 4, 2022 · 53 comments

Comments

@stephentoub
Copy link
Member

stephentoub commented Nov 4, 2022

#77799 adds new FrozenDictionary and FrozenSet types. These are in decent shape but not yet ready to ship. Several follow-ups are required:

  • Frozen collection construction performance #87964
    The initial use cases for these types was for long-lived processes, where overall it's a better use of resources to spend more time at construction time to optimize the data structures and algorithms used for later data access. However, the current construction time is one or more orders of magnitude more than for the standard dictionary/set types. We need to a) invest in driving down those construction costs, and b) consider making the default performance of ToFrozenDictionary/Set as close that of Dictionary/Set as possible via an additional optional parameter to these methods, e.g.
public static FrozenSet<T> ToFrozenSet<T>(this IEnumerable<T> source, IEqualityComparer<T>? comparer = null,
+    double constructionOptimization = default // pick a better name
);

When 0.0, this would effectively default to creating a FrozenSet/Dictionary that just wrapped a HashSet/Dictionary, providing the same FrozenSet/Dictionary semantics but primarily just as a wrapper for a HashSet/Dictionary, such that the only additional costs were cloning of the data (assuming the Items/Keys/Values properties remain as they are today, they could be lazily-created on first need). On a scale from 0.0 to 1.0, the larger the double value, the more optimization would be performed at construction time, enabling the caller to choose how much time is spent at construction vs how fast subsequent data access can be made, with 1.0 meaning construction would take the longest in exchange for fastest data access. Having the default be fast construction would make this the least surprising for casual use and would avoid falling into pits of failure in the average case; consumers willing to spend more time at construction because they know their processes will be long-lived can then opt-in to longer startup times.

  • Frozen Collection Follow-up: Items/Keys/Values properties #86719 blocking-release
    The current design has Items/Keys/Values properties that return ImmutableArray<>. This is to enable access to the data as a ReadOnlySpan<>, indexibility, via refs, and with very fast enumeration. This does, however, place a requirement on implementations that they either store the keys/values/items separately and in contiguous memory or that implementations maintain a second copy of the data that does so. If we'd be willing to give up on the ReadOnlySpan<> and indexing capabilities, these types should instead be changed to some FrozenEnumerable<>-like type that would allow different implementations to vary how the data was stored.
  • Small collections
    There's currently an optimized implementation for empty collections, but not for other small sizes, e.g. <= 8 elements. We should investigate adding simple implementations that don't do any hashing or other complicated storage for small sizes. This includes looking at various code bases / sources of information to determine the most common sizes to be optimized for, e.g. is it worth having a dedicated implementation for collections of just 1 or 2 elements.
  • Other data structures and algorithms
    Would a perfect hashing implementation yield any benefits? Would tries be useful in some situations for strings? Should we consider doing reflection emit code gen in some circumstances?
  • Trimming friendliness
    The factory nature here means that every implementation needs to be kept by a trimmer regardless of whether it's actually used (though a trimmer should, even if not today, be able to get rid of implementations unique to Ts where it can see those Ts are never used). We thus need to be thoughtful about how many custom implementations there are, and potentially have one or more feature switches to exclude one or more that add only incremental value, potentially opting them out by default.
  • Source generation
    If/when Roslyn source generators support some kind of mechanism that would allow a source generator to look at constant data in the C# file and augment ToFrozenSet/Dictionary call sites in some manner, it would be beneficial if the optimizations could be performed at compile-time, feeding something into an optional parameter to ToFrozenSet/Dictionary to enable them to be fast, with all the work supplied at build time. This would likely also require more work on the API to make the existing types extensible, which is something we've actively avoided today.
@ghost
Copy link

ghost commented Nov 4, 2022

Tagging subscribers to this area: @dotnet/area-system-collections
See info in area-owners.md if you want to be subscribed.

Issue Details

#77799 adds new FrozenDictionary and FrozenSet types. These are in decent shape but not yet ready to ship. Several follow-ups are required:

  • Construction performance. The initial use cases for these types was for long-lived processes, where overall it's a better use of resources to spend more time at construction time to optimize the data structures and algorithms used for later data access. However, the current construction time is one or more orders of magnitude more than for the standard dictionary/set types. We need to a) invest in driving down those construction costs, and b) consider making the default performance of ToFrozenDictionary/Set as close that of Dictionary/Set as possible via an additional optional parameter to these methods, e.g.
public static FrozenSet<T> ToFrozenSet<T>(this IEnumerable<T> source, IEqualityComparer<T>? comparer = null,
+    double constructionOptimization = default // pick a better name
);

When 0.0, this would effectively default to creating a FrozenSet/Dictionary that just wrapped a HashSet/Dictionary, providing the same FrozenSet/Dictionary semantics but primarily just as a wrapper for a HashSet/Dictionary, such that the only additional costs were cloning of the data (assuming the Items/Keys/Values properties remain as they are today, they could be lazily-created on first need). On a scale from 0.0 to 1.0, the larger the double value, the more optimization would be performed at construction time, enabling the caller to choose how much time is spent at construction vs how fast subsequent data access can be made, with 1.0 meaning construction would take the longest in exchange for fastest data access. Having the default be fast construction would make this the least surprising for casual use and would avoid falling into pits of failure in the average case; consumers willing to spend more time at construction because they know their processes will be long-lived can then opt-in to longer startup times.

  • Items/Keys/Values properties. The current design has Items/Keys/Values properties that return ImmutableArray<>. This is to enable access to the data as a ReadOnlySpan<>, indexibility, via refs, and with very fast enumeration. This does, however, place a requirement on implementations that they either store the keys/values/items separately and in contiguous memory or that implementations maintain a second copy of the data that does so. If we'd be willing to give up on the ReadOnlySpan<> and indexing capabilities, these types should instead be changed to some FrozenEnumerable<>-like type that would allow different implementations to vary how the data was stored.
  • Small collections. There's currently an optimized implementation for empty collections, but not for other small sizes, e.g. <= 8 elements. We should investigate adding simple implementations that don't do any hashing or other complicated storage for small sizes. This includes looking at various code bases / sources of information to determine the most common sizes to be optimized for, e.g. is it worth having a dedicated implementation for collections of just 1 or 2 elements.
  • Other data structures and algorithms. Would a perfect hashing implementation yield any benefits? Would tries be useful in some situations for strings? Should we consider doing reflection emit code gen in some circumstances?
  • Source generation. If/when Roslyn source generators support some kind of mechanism that would allow a source generator to look at constant data in the C# file, it would be beneficial if the optimizations could be performed at compile-time, feeding something into an optional parameter to ToFrozenSet/Dictionary to enable them to be fast, with all the work supplied at build time. This would likely also require more work on the API to make the existing types extensible, which is something we've actively avoided today.
Author: stephentoub
Assignees: -
Labels:

area-System.Collections

Milestone: 8.0.0

@stephentoub
Copy link
Member Author

cc: @geeknoid

@geeknoid
Copy link
Member

geeknoid commented Nov 4, 2022

Perfect hashing ends up being not a great choice. Perfect hashing usually involves a more complex hash function, and you'll end up with lower perf than what we have here. Their main benefit would potentially be in smaller data by not having giant sparse lookup tables.

Another area worth looking into is to further optimize string-based key (the overwhelming favorite kind of key). There are a number of comparers in the current implementation to try and fine-tune how comparison works, but more are possible. For example, comparers that know about 2 character sequences explicitly. The motivation here is to reduce the cost of hashing incoming strings to a minimum.

Perhaps a broader problem that just for the frozen collections, but there is very poor interop between spans and collections. In particular, for string-based dictionaries, it would be terrific to be able to do a lookup using a ReadOnlySpan as a key in addition to a string. This would avoid needing to materialize strings just to do a dictionary lookup. A perfect use for this is a string interning table for incoming REST/gRPC calls. You could go sraight from the incoming buffer, and do a lookup in the dictionary to get the final string object.

@stephentoub
Copy link
Member Author

it would be terrific to be able to do a lookup using a ReadOnlySpan as a key in addition to a string

Yes, this has long been desired, and we currently lack a great answer. This is tracked by #27229.

@geeknoid
Copy link
Member

geeknoid commented Nov 4, 2022

Frozen collection state serialization is another interesting possibility.

The idea is to make it possible to serialize the state of a frozen collection to a portable format. A source generator can then generate such serialized blobs to be sucked in by a frozen collection at startup, providing cheap initiatilziation for well-knwon data.

Additionally, said serialized data could actually be passed around in config blobs. So that even state that is config-drivem can have the benefit of low startup time. You serialize the state offline, and send it through the config system. This could in fact provide opportunities for even more aggressive optimizations since startup time would not be an issue.

@stephentoub
Copy link
Member Author

Frozen collection state serialization is another interesting possibility.

I intended the "Source generation" part of what I wrote above to cover this. But we'd also need to be really careful here. It'd be fairly easy to accidentally create a vulnerability.

@iSazonov
Copy link
Contributor

iSazonov commented Nov 4, 2022

Frozen collection state serialization is another interesting possibility.

I intended the "Source generation" part of what I wrote above to cover this. But we'd also need to be really careful here. It'd be fairly easy to accidentally create a vulnerability.

Does this exclude using perfect hash functions at all?

@stephentoub
Copy link
Member Author

Frozen collection state serialization is another interesting possibility.

I intended the "Source generation" part of what I wrote above to cover this. But we'd also need to be really careful here. It'd be fairly easy to accidentally create a vulnerability.

Does this exclude using perfect hash functions at all?

That statement was unrelated to perfect hashing.

@iSazonov
Copy link
Contributor

iSazonov commented Nov 4, 2022

I mean GetHash using seeding today but perfect hash no. So question can we use it or no.

@iSazonov
Copy link
Contributor

iSazonov commented Nov 4, 2022

double constructionOptimization = default

It could be an enum like CompressionLevel. It is more user friendly in public API.
Or it could be a dynamic to follow tiered compilation conception - fast startup with follow optimization in background.

@stephentoub
Copy link
Member Author

perfect hash no. So question can we use it or no.

Technically could we? Sure. Should we? That would depend on a variety of factors, not the least of which would be demonstrating it was a perf win. And Martin already commented on his experience with their perf in #77891 (comment).

@iSazonov
Copy link
Contributor

iSazonov commented Nov 4, 2022

That would depend on a variety of factors, not the least of which would be demonstrating it was a perf win. And Martin already commented on his experience with their perf in #77891 (comment).

I couldn't agree more. A simple example, if we have 256 strings all with different length - perfect hash function can be just the length of the string (and byte lookup table). Is it possible to create something faster?

Complex perfect functions are rather aimed at creating very compact data buffers. If we are not going to store them in the .Net Runtime, we don't need to strive to make them extremely small and hence we can use simpler and faster functions.
We could learn the gperf algorithms, if the license allows it. And create an SG in the first step if that is needed of course.

@stephentoub
Copy link
Member Author

A simple example, if we have 256 strings all with different length

The implementation already does that.

@omariom
Copy link
Contributor

omariom commented Nov 4, 2022

Would be very helpful if the dictionary had such indexer

public ref readonly KeyValuePair<TKey, TValue> this[int entryIndex]

// for the set
public ref readonly T this[int entryIndex] 

It would allow to store references to entries as integers and later quickly retrieve them.
If returning KeyValuePair imposes too much on the implementation, then may be separate GetValue and GetKey?

The idea is to be able to get indexes of the entries and use them to retrieve keys and values.

@stephentoub
Copy link
Member Author

Would be very helpful if the dictionary had such indexer

What's the scenario for that such that the indexer returning a ref to the value is insufficient?

@omariom
Copy link
Contributor

omariom commented Nov 4, 2022

You mean this indexer?
public ref readonly TValue this[TKey key]

I have a couple of scenarios.

  1. Reference data that can be looked up. For example, a list of countries or a list of currencies.
    The referencing data structure would keep entries' indexes, instead of the keys ("UK" or "GBP").
    It would also be more efficient in terms of memory and more GC friendly.

    I implemented it for myself by just copying the existing HashSet and Dictionary and adding the required members.

  2. Another scenario (imaginable in this case) is a network protocol, that at the start of a session sends the initial snapshot of
    ref data, and later refers to the entries by their indexes instead of the values in the hashset or the keys of the dictionary.

@geeknoid
Copy link
Member

geeknoid commented Nov 5, 2022

@omariom Does this make sense in the context of a frozen dictionary? Seems like you're using a dictionary to build up unique name/value pairs, and then when you're done you ditch the name part and access your values by index. Do you have scenarios that routinely require having both key-based and index-based access to the same values on an on-going basis?

@geeknoid
Copy link
Member

geeknoid commented Nov 5, 2022

@iSazonov For those cases where the input is known statically, a source generator that has a large suite of tricks to generate custom code seems like a great idea. This is not so much focused on the need for perfect hashing, it could be anything that makes the code fast.

In the scenarios where I've used frozen collections so far, the set of items in the collections have always come from config or some other dynamic source, so I haven't had the need/desire for a build-time generator. But for parsers which have a well-known set of tokens, there could be value in a highly optimizing source generator. It all depends on the actual gains you could achieve relative to what you already get with the hard-coded frozen collection implementations.

@geeknoid
Copy link
Member

geeknoid commented Nov 5, 2022

@stephentoub I have not experimented with purely length-based "hashing", which I think is a fascinating idea.

If there is sufficient variance in the length of the keys in the frozen collection, it should be possible to just use the string length as the hash code, completely eliminating the cost of hashing. It'd be trivial to write a new comparer implementation that just uses string length. I bet that could be a considerable win in many cases.

@stephentoub
Copy link
Member Author

stephentoub commented Nov 5, 2022

I have not experimented with purely length-based "hashing", which I think is a fascinating idea.

I spent some time on it, which is where https://github.com/dotnet/runtime/blob/main/src/libraries/System.Collections.Immutable/src/System/Collections/Frozen/LengthBucketsFrozenDictionary.cs came from. There's of course more opportunity to experiment.

@geeknoid
Copy link
Member

geeknoid commented Nov 5, 2022

I have not experimented with purely length-based "hashing", which I think is a fascinating idea.

I spent some time on it, which is where https://github.com/dotnet/runtime/blob/main/src/libraries/System.Collections.Immutable/src/System/Collections/Frozen/LengthBucketsFrozenDictionary.cs came from. There's of course more opportunity to experiment.

Ah, you're one step ahead. I'm surprised this required a new dictionary type, rather than just a string comparer implementation.

@stephentoub
Copy link
Member Author

stephentoub commented Nov 5, 2022

It didn't require it, but the comparer itself added overhead. It might still be useful to have the comparer with the frozen hashtable as a fallback if it doesn't meet the criteria for this impl.

@geeknoid
Copy link
Member

Here's another idea:

If an integer set is created that contains a constrained range of values, it would be more efficient to just have an array of bits to hold the set's state.

The idea is to leverage this for sets of enums. FrozenSet can then be implemented using simple array lookups rather than hashing.

I just saw an enum-based set in some production code...

@AlexRadch
Copy link
Contributor

I suggested providing optimized read-only collections for SortedSet and SortedDictionary. I made a separate topic for discussing the API and other issues related to the implementation of these collections, so as not to interfere with all discussions on different topics in this one topic.

@Timovzl
Copy link

Timovzl commented Dec 21, 2022

The Adaptive Radix Tree could be suitable for (ordinal) string keys, especially to implement a sorted set.

A trade-off is retaining the actual string keys vs. keeping only their compacted bytes in the tree.

@Timovzl
Copy link

Timovzl commented Dec 21, 2022

The Adaptive Radix Tree

Come to think of it, the adaptive part is unneccessary for frozen collections. Any radix tree would do.

@theodorzoulias
Copy link
Contributor

theodorzoulias commented Dec 23, 2022

So to support IReadonlyList<> interface and this[int] indexer it should be separated classes.

@AlexRadch isn't this problematic for a FrozenDictionary<TKey, TValue> with the Tkey being of type int? I would expect compile-time errors because of the ambiguity between

public TValue this[TKey key]

and

public KeyValuePair<TKey, TValue> this[int index]

Maybe the derived classes could implement the IReadOnlyList<KeyValuePair<TKey, TValue>> interface, so casting from FrozenDictionary<> to IReadOnlyList<> would succeed, and the this[int] indexer would be available through the interface. But what is the scenario that the this[int] indexer would be useful? Isn't ordered enumeration enough?

@AlexRadch
Copy link
Contributor

@AlexRadch isn't this problematic for a FrozenDictionary<TKey, TValue> with the TKey being of type int? I would expect compile-time errors because of the ambiguity between
Maybe the derived classes could implement the IReadOnlyList<KeyValuePair<TKey, TValue>> interface, so casting from FrozenDictionary<> to IReadOnlyList<> would succeed, and the this[int] indexer would be available through the interface. But what is the scenario that the this[int] indexer would be useful? Isn't ordered enumeration enough?

Yes, you are right, so I think the IList<> interface will be enough. I think Frozen OrderedDictionary should have public ref readonly TValue GetValueRefByIndex(int index) method to get the reference to a value by index.

@theodorzoulias
Copy link
Contributor

@AlexRadch personally I've never used an OrderedDictionary. I don't think that it's a useful data structure. I guess it can be useful in case you want to paginate a dictionary, and fetch efficiently the items 9,901-10,000. That's the only usage scenario that I can think of.

@danmoseley
Copy link
Member

danmoseley commented Feb 13, 2023

Is the scenario of frozen keys, mutable values supported by the new collections? Eg., in this scenario (on Kestrel hot path) a dictionary lookup is basically hand generated (these days maybe Roslyn would generate similar code from a switch statement) and the value may change. https://github.com/dotnet/aspnetcore/blob/41627fe50a7a71e395f2fdfccce9f8cbff10a49a/src/Shared/HttpSys/RequestProcessing/RequestHeaders.Generated.cs#L968

Edit: never mind, it uses fields for the values. But I'm interested in the answer anyway.

@stephentoub
Copy link
Member Author

Is the scenario of frozen keys, mutable values supported by the new collections?

You can't mutate what value the dictionary/set stores for a key. If that value is a reference type, though, or if it contains a reference type, you can of course mutate the referenced instance. In other words, the immutability here is shallow.

@danmoseley
Copy link
Member

Yeah in this case it's a struct. OK thanks.

@Timovzl
Copy link

Timovzl commented Feb 14, 2023

@danmoseley I have definitely seen someone use GetValueRefOrNullRef() and mutate a struct value in a frozen dictionary somewhere…

@stephentoub
Copy link
Member Author

I have definitely seen someone use GetValueRefOrNullRef() and mutate a struct value in a frozen dictionary somewhere…

That returns a ref readonly. Using a ref readonly to erroneously mutate the collection requires unsafe code. With unsafe code you can violate almost any kind of constraint, including immutability.

@Timovzl
Copy link

Timovzl commented Feb 14, 2023

Using a ref readonly to erroneously mutate the collection requires unsafe code.

Ahhh, so that was the trick. Indeed, they used a method from the Unsafe class to do so, I now recall. So that was definitely not by design.

@mjsabby
Copy link
Contributor

mjsabby commented Feb 16, 2023

There is absolutely no requirement for Perfect Hashing to be complex. Take a look at https://gist.github.com/mjsabby/d7baf64d524176ee35f69ffa2c278179 and it out performs current .NET 7 dictionary.

Implements the Hash Displace algorithm.

I would implement Frozen Sets and Dictionaries with perfect hashing.

        [MethodImpl(MethodImplOptions.AggressiveOptimization)]
        public static int Lookup(ReadOnlySpan<byte> key, ReadOnlySpan<int> values, ReadOnlySpan<int> seeds)
        {
            ulong hash;

            unsafe
            {
                fixed (byte* ptr = &MemoryMarshal.GetReference(key))
                {
                    hash = Hash64(ptr, key.Length);
                }
            }

            var size = (ulong)values.Length;
            var index = hash & (size - 1);
            var seed = seeds[(int)index];

            if (seed < 0)
            {
                return values[0 - seed - 1];
            }

            var x = (ulong)seed + hash;
            x ^= x >> 12;
            x ^= x << 25;
            x ^= x >> 27;

            index = (x * 0x2545F4914F6CDD1D) & (size - 1);
            return values[(int)index];
        }

@YoniFeng
Copy link

YoniFeng commented Mar 12, 2023

Perfect hashing ends up being not a great choice. Perfect hashing usually involves a more complex hash function, and you'll end up with lower perf than what we have here. Their main benefit would potentially be in smaller data by not having giant sparse lookup tables.

@geeknoid
I haven't taken the time to look at the Frozen collection's code (would love TL;DR/gist, but I'll have to take a good look regardless), but a PHF(perfect hash function) definitely has value.
I'm surprised you seem to dismiss it for poor key lookup performance (did I infer wrong?), I'd think the only reason to avoid a PHF during runtime is the construction time can be very slow.

For those cases where the input is known statically, a source generator that has a large suite of tricks to generate custom code seems like a great idea. This is not so much focused on the need for perfect hashing, it could be anything that makes the code fast.

A source generator for static sets could be fantastic. It can make top-of-the-line lookup perf easily accessible.
It would essentially be like an integrated gperf. An example (similar to gperf in concept, not implementation) for an integrated gperf/"source generator" is rust-phf.
Its downsides are that it doesn't allow customizing the underlying hash function and comparer, is minimal, and Rust's default hasher is unnecessarily slow because it provides cryptographic guarantees that aren't needed for static sets.

Being able to customize the underlying hash function and key comparer opens the door for super-efficient implementations that make use of the static set's properties and/or assumptions.

About minimal: I've yet to do thorough benchmarking of the PHF area (planning to), but my hunch is having a minimal PHF results in worse performance.
It optimizes for space efficiency (binary size / table stored in data segment) over time efficiency (lookup speed), because a non-keyword hitting an empty slot can be rejected faster (in the case of strings - faster or no string comparison).
Granted, with a fast hash function (or even something ad-hoc) the distribution probably isn't great - increasing the table size probably tends to 0 for lookup performance improvement.

Is the scenario of frozen keys, mutable values supported by the new collections? Eg., in this scenario (on Kestrel hot path) a dictionary lookup is basically hand generated (these days maybe Roslyn would generate similar code from a switch statement) and the value may change. https://github.com/dotnet/aspnetcore/blob/41627fe50a7a71e395f2fdfccce9f8cbff10a49a/src/Shared/HttpSys/RequestProcessing/RequestHeaders.Generated.cs#L968

Edit: never mind, it uses fields for the values. But I'm interested in the answer anyway.

@danmoseley
Can you give a small elaboration on HTTP request keyword/verb matching is used in Kestrel? Why does the value need to be mutated? (Like above - I need to find the time to dig in to the code myself..).
I was looking for this the other day, because I wanted to see how Kestrel/ASP.NET match HTTP keywords and/or common request headers.

Edit: Looked at the code. Like Stephen said for the Frozen collections, a source-generated PHF-based solution would also allow for shallow immutability. If the value can be an index in an array - that's probably best.

This the sort of use case I'd expect could see great benefit from a source-generated PHF set/dictionary.
Assessing whether a string is a reserved keyword (e.g. in compilers) is exactly the use-case where a compile-time PHF-based structure can shine.

Come to think of it - how does Roslyn do keyword matching?

@danmoseley
Copy link
Member

@YoniFeng I’m not familiar with that code. Of course any contributions or analysis you may have are welcome.

@rampaa
Copy link

rampaa commented Jun 28, 2023

Source generation
If/when Roslyn source generators support some kind of mechanism that would allow a source generator to look at constant data in the C# file and augment ToFrozenSet/Dictionary call sites in some manner, it would be beneficial if the optimizations could be performed at compile-time, feeding something into an optional parameter to ToFrozenSet/Dictionary to enable them to be fast, with all the work supplied at build time. This would likely also require more work on the API to make the existing types extensible, which is something we've actively avoided today.

Is this on the radar for .NET 8?

@stephentoub
Copy link
Member Author

Is this on the radar for .NET 8?

No. It requires exposing the details of the abstraction, and we want some bake time on that before we do. It's something we'll consider for .NET 9.

@eiriktsarpalis
Copy link
Member

eiriktsarpalis commented Jul 19, 2023

@stephentoub are there any work items tracked in this issue that we're planning to deliver in .NET 8? If not we should update the milestone.

@stephentoub
Copy link
Member Author

Everything for 8 is done

@stephentoub stephentoub modified the milestones: 8.0.0, Future Jul 19, 2023
@andrewjsaid
Copy link
Contributor

Another way to deliver on the premise of improving lookup time at a potential one-time cost of construction would be some kind of mechanism which tracks usage patterns. This could be provided either at construction time (e.g. an array of key lookups) or by enabling some kind of sampling to be run to track lookup keys.

The frozen collection would then have a method to re-optimize to speed up lookups based on the usage pattern.

Here are some metrics which would be great to know ahead of time for stringly keyed FrozenDictionaries in order to create more optimal and targeted lookups:

  1. Are most lookups for the same key or small subset of keys? - e.g. a FrozenDictionary where the key is country code and most of the time it's the same country code being looked up. Maybe there could be two dictionaries - a faster one with fewer items and a regular one with the entire set.
  2. Changing the number of buckets to ensure that frequently accessed keys are not in the same bucket.
  3. Rotating the order of keys within buckets to sort by most frequently accessed
  4. Not using max length checks if we know them to be ineffective with that particular data set.
  5. Are we ignoring case unnecessarily most of the time? If so maybe there could be two dictionaries here too.

I'm sure that some of these ideas aren't suitable or are slow to compute, but equally I'm sure there are others I haven't thought of which could be perfect optimization.

I guess the problem with this suggestion is that I'm not entirely sure how FrozenDictionaries are used in the wild, so maybe it's better to have this conversation once some common scenarios exist.

@andrewjsaid
Copy link
Contributor

Would tries be useful in some situations for strings? Should we consider doing reflection emit code gen in some circumstances?

After I saw that routing in ASP.NET Core uses reflection emitted Trie, I also spent some time attempting to implement an alternative to stringly-keyed ASCII FrozenDictionary in a similar way. After lots of attempts, I no longer think it's possible for Tree structures to be faster than Hash based data structures in general purpose use-cases. However the IL-Emit Trie did indeed come out faster than the FrozenDictionary in my small experiments.

https://github.com/andrewjsaid/trie

@rampaa
Copy link

rampaa commented Apr 13, 2024

It's something we'll consider for .NET 9.

Is it being considered for .NET 9?

@stephentoub
Copy link
Member Author

It's something we'll consider for .NET 9.

Is it being considered for .NET 9?

Unlikely at this point.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests