Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[API Proposal]: CollectionsMarshal method that enumerates dictionary by reference to avoid unnecessary hash calculation. #58333

Open
acaly opened this issue Aug 29, 2021 · 17 comments
Labels
api-needs-work API needs work before it is approved, it is NOT ready for implementation area-System.Collections tenet-performance Performance related issue
Milestone

Comments

@acaly
Copy link

acaly commented Aug 29, 2021

Background and motivation

CollectionsMarshal class has been introduced to expose the collection class's underlying data by reference in order to reduce unnecessary copy of large struct values when accessing only a small part of the value. For Dictionary<TKey, TValue>, there are methods that manipulate individual key-value pairs. However, these methods mainly requires accessing from know keys. When the user wants to do batch actions to many elements in the dictionary, enumerating keys and then doing a hash table look up would not be the most efficient way.

Consider a very common use case of Dictionary<,> class: the user wants to use it like an in-memory database table, to store large struct values, each containing the data of one row, and all values are indexed by a primary key, which is itself part of the row data but immutable. When reading and writing data from one row with known primary key, existing methods like GetValueRefOrNullRef is sufficient. But if the user wants to read all rows, doing some work to each of them (or to some of them that meets specific condition), it would be better if there is a way to enumerate the dictionary returning ref to each value. This issue mainly happens to Dictionary<,> because List<T> already exposes its Span<T>, which can be used to enumerate by a for (int i = 0; ... loop to get the references.

Another common use pattern of dictionary is to remove specific items while enumerating. For List<T>, this can be done by a sequential enumeration. Dictionary<,> class already allows removing items while enumerating, but to find the removed item, the dictionary will need additional hash table look-ups. This can be avoided if we provide an enumerator that itself supports removing.

In summary, using a new enumerator, we should able to achieve:

  • Get the ref TValue without additional hash calculation and look-ups.
  • Remove the enumerated item without additional hash calculation and look-ups.

To maximize performance, these two objectives do not necessarily share the same implementation. It's possibly better to address the separately, as mentioned in the discussion below. However, here I will still provide an alternatives that aims at solving both together, because the two objectives can actually be seen as avoiding unnecessary hash calculation and look-ups during enumeration.

API Proposal

Method 1: sequentially enumerate each entry, returning reference to the entry struct.

This will maximize performance, but will not allow removing while enumerating.

namespace System.Runtime.InteropServices
{
    public static class CollectionsMarshal
    {
        public static DictionaryReferenceEnumerable<TKey, TValue> EnumerateByReference<TKey, TValue>(Dictionary<TKey, TValue> dictionary) => new(dictionary);

        public readonly struct DictionaryReferenceEnumerable<TKey, TValue>
        {
            private readonly Dictionary<TKey, TValue> _dictionary;
            internal DictionaryReferenceEnumerable(Dictionary<TKey, TValue> dictionary) => _dictionary = dictionary;
            public Enumerator GetEnumerator() => new(_dictionary);

            public struct Enumerator
            {
                public Enumerator(Dictionary<TKey, TValue> dictionary) { ... }
                public bool MoveNext() { ... }

                public ref Entry Current
                {
                    get => ref Unsafe.As<Entry, Dictionary<TKey, TValue>.Entry>(ref _dictionary._entries[...]);
                }
            }

            public struct Entry
            {
                private readonly uint _hashCode;
                private readonly int _next;
                public readonly TKey Key;
                public TValue Value;
            }
        }
    }
}

Method 2: enumerate each bucket, returning a struct that holds the internal state of enumerating (the current bucket and the last entry index).

This will sacrifice some performance, but it allows a Remove method to be added.

namespace System.Runtime.InteropServices
{
    public static class CollectionsMarshal
    {
        public static DictionaryReferenceEnumerable<TKey, TValue> EnumerateByReference<TKey, TValue>(Dictionary<TKey, TValue> dictionary) => new(dictionary);

        public readonly struct DictionaryReferenceEnumerable<TKey, TValue>
        {
            private readonly Dictionary<TKey, TValue> _dictionary;
            internal DictionaryReferenceEnumerable(Dictionary<TKey, TValue> dictionary) { ... }
            public Enumerator GetEnumerator() => new(_dictionary);

            public struct Enumerator
            {
                private EnumeratorData _data;
                internal Enumerator(Dictionary<TKey, TValue> dictionary) => _data = new(dictionary);
                public bool MoveNext() { ... }

                public EnumeratorData Current
                {
                    get => _data;
                }
            }

            public struct EnumeratorData
            {
                //internal state of enumerator: bucket index, last node index, etc.

                internal EnumeratorData(Dictionary<TKey, TValue> dictionary) { ... }
                internal bool MoveNext() { ... }

                public TKey Key => ...;
                public ref TValue Value => ...;
                public void Remove() { ... }
            }
        }
    }
}

API Usage

Method 1:

struct Data
{
    public int SomeValue;
    //Other fields.
}
Dictionary<int, Data> dictionary = new();

//...

foreach (ref var entry : CollectionsMarshal.DictionaryReferenceEnumerable(dictionary))
{
    if (entry.Value.SomeValue != -1)
    {
        entry.Value.SomeValue += 1;
    }
}

Method 2:

foreach (var entry : CollectionsMarshal.DictionaryReferenceEnumerable(dictionary))
{
    if (entry.Value.SomeValue != -1)
    {
        entry.Value.SomeValue += 1;
    }
    else
    {
        entry.Remove(); //Method 2 allows removing.
    }
}

Possible implementations and performances

I have done some preliminary investigation on the implementation and the performance. The conclusion:

Enumerating:

  • Baseline: enumerate dictionary.Keys followed by CollectionsMarshal.GetValueRefOrNullRef.
  • Method 1 provides maximal speed (3x for int keys, ~10x for (int, int, int) keys, 6x for string keys).
  • Method 2 is much slower than Method 1 (Method 1 is 2-3x as fast as Method 2).

Deleting while enumerating:

  • Baseline: enumerate dictionary.Keys followed by CollectionsMarshal.GetValueRefOrNullRef. Use a List<TKey> to record keys to be removed, and after enumeration of the dictionary finishes, enumerate this list to remove them.
  • Method 1 does not support deleting.
  • Method 2 is usually faster than the baseline, except for int keys (1x for int keys, 3x for (int, int, int) keys, and 2x for string keys).
  • The benchamrk removes 50% of the items from the dictionary.

Since I am not an expert in this area, I think there should still be much space for optimization.

Benchmark results
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1165 (21H1/May2021Update)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.100-preview.7.21379.14
  [Host]     : .NET 6.0.0 (6.0.21.37719), X64 RyuJIT  [AttachedDebugger]
  DefaultJob : .NET 6.0.0 (6.0.21.37719), X64 RyuJIT

int->int, 10 items

Method Mean Error StdDev
Enum_ByKey 112.26 ns 1.893 ns 1.678 ns
Enum_ByRefSequential 35.51 ns 0.250 ns 0.221 ns
Enum_ByRefBucket 83.95 ns 0.797 ns 0.707 ns
Remove_ByKey 52.89 ns 0.464 ns 0.434 ns
Remove_ByRefBucket 57.43 ns 1.166 ns 1.145 ns

int->int, 1000 items

Method Mean Error StdDev
Enum_ByKey 9.787 us 0.1115 us 0.1043 us
Enum_ByRefSequential 3.468 us 0.0381 us 0.0356 us
Enum_ByRefBucket 6.456 us 0.0440 us 0.0411 us
Remove_ByKey 4.910 us 0.0613 us 0.0543 us
Remove_ByRefBucket 5.261 us 0.0621 us 0.0551 us

int->big struct, 10 items

Method Mean Error StdDev
Enum_ByKey 119.59 ns 1.249 ns 1.043 ns
Enum_ByRefSequential 37.79 ns 0.254 ns 0.212 ns
Enum_ByRefBucket 85.65 ns 0.838 ns 0.743 ns
Remove_ByKey 58.28 ns 1.190 ns 1.323 ns
Remove_ByRefBucket 58.50 ns 1.026 ns 1.098 ns

int->big struct, 1000 items

Method Mean Error StdDev
Enum_ByKey 10.412 us 0.1571 us 0.1470 us
Enum_ByRefSequential 3.587 us 0.0370 us 0.0346 us
Enum_ByRefBucket 6.827 us 0.0470 us 0.0367 us
Remove_ByKey 5.386 us 0.0798 us 0.0747 us
Remove_ByRefBucket 5.646 us 0.1064 us 0.0888 us

(int, int, int)->int, 10 items

Method Mean Error StdDev
Enum_ByKey 339.03 ns 4.299 ns 4.021 ns
Enum_ByRefSequential 35.28 ns 0.431 ns 0.382 ns
Enum_ByRefBucket 84.11 ns 1.497 ns 1.400 ns
Remove_ByKey 191.94 ns 2.186 ns 1.938 ns
Remove_ByRefBucket 60.92 ns 0.480 ns 0.688 ns

(int, int, int)->int, 1000 items

Method Mean Error StdDev
Enum_ByKey 38.774 us 0.3048 us 0.2380 us
Enum_ByRefSequential 3.369 us 0.0655 us 0.0613 us
Enum_ByRefBucket 9.389 us 0.0894 us 0.0792 us
Remove_ByKey 16.413 us 0.1372 us 0.1216 us
Remove_ByRefBucket 5.799 us 0.0806 us 0.0754 us

(int, int, int)->big struct, 10 items

Method Mean Error StdDev
Enum_ByKey 395.39 ns 3.212 ns 3.004 ns
Enum_ByRefSequential 34.97 ns 0.622 ns 0.581 ns
Enum_ByRefBucket 93.08 ns 1.364 ns 1.276 ns
Remove_ByKey 186.41 ns 2.713 ns 2.538 ns
Remove_ByRefBucket 61.45 ns 0.555 ns 0.492 ns

(int, int, int)->big struct, 1000 items

Method Mean Error StdDev
Enum_ByKey 35.810 us 0.4707 us 0.4403 us
Enum_ByRefSequential 3.352 us 0.0360 us 0.0337 us
Enum_ByRefBucket 12.960 us 0.1137 us 0.1063 us
Remove_ByKey 16.516 us 0.1926 us 0.1802 us
Remove_ByRefBucket 6.410 us 0.0557 us 0.0521 us

string->int, 10 items

Method Mean Error StdDev
Enum_ByKey 245.04 ns 4.859 ns 5.596 ns
Enum_ByRefSequential 33.07 ns 0.697 ns 1.145 ns
Enum_ByRefBucket 70.59 ns 0.833 ns 0.779 ns
Remove_ByKey 108.80 ns 2.205 ns 2.451 ns
Remove_ByRefBucket 58.70 ns 0.812 ns 0.759 ns

string->int, 1000 items

Method Mean Error StdDev
Enum_ByKey 22.106 us 0.1307 us 0.1223 us
Enum_ByRefSequential 3.564 us 0.0194 us 0.0162 us
Enum_ByRefBucket 7.655 us 0.0743 us 0.0621 us
Remove_ByKey 11.218 us 0.1740 us 0.1627 us
Remove_ByRefBucket 5.861 us 0.0703 us 0.0623 us

string->big struct, 10 items

Method Mean Error StdDev
Enum_ByKey 239.33 ns 3.121 ns 2.767 ns
Enum_ByRefSequential 36.20 ns 0.258 ns 0.242 ns
Enum_ByRefBucket 72.04 ns 0.826 ns 0.773 ns
Remove_ByKey 108.35 ns 1.004 ns 0.939 ns
Remove_ByRefBucket 60.00 ns 0.820 ns 0.767 ns

string->big struct, 1000 items

Method Mean Error StdDev
Enum_ByKey 23.894 us 0.4075 us 0.3613 us
Enum_ByRefSequential 3.586 us 0.0638 us 0.0597 us
Enum_ByRefBucket 8.318 us 0.1107 us 0.1036 us
Remove_ByKey 11.708 us 0.2322 us 0.2385 us
Remove_ByRefBucket 6.301 us 0.1231 us 0.1317 us
Implementation of Method 1 used in the test
public readonly struct DictionaryReferenceEnumerable<TKey, TValue> where TKey : notnull
{
    private readonly Dictionary<TKey, TValue> _dictionary;

    public DictionaryReferenceEnumerable(Dictionary<TKey, TValue> dictionary)
    {
        _dictionary = dictionary;
    }

    public Enumerator GetEnumerator() => new(_dictionary);

    public ref struct Enumerator
    {
        private readonly Dictionary_<TKey, TValue> _dictionary;
        private int _index;
        private int _currentIndex;

        public Enumerator(Dictionary<TKey, TValue> dictionary)
        {
            _dictionary = Unsafe.As<Dictionary_<TKey, TValue>>(dictionary);
            _index = 0;
            _currentIndex = default;
        }

        public bool MoveNext()
        {
            while ((uint)_index < (uint)_dictionary._count)
            {
                ref var entry = ref _dictionary._entries![_index++];

                if (entry.next >= -1)
                {
                    _currentIndex = _index - 1;
                    return true;
                }
            }

            _index = _dictionary._count + 1;
            _currentIndex = default;
            return false;
        }

        public ref Entry Current
        {
            [MethodImpl(MethodImplOptions.AggressiveInlining)]
            get => ref Unsafe.As<Dictionary_<TKey, TValue>.Entry, Entry>(ref _dictionary._entries![_currentIndex]);
        }
    }

    public struct Entry
    {
        private readonly uint hashCode;
        private readonly int next;
        public readonly TKey Key;
        public TValue Value;
    }
}
Implementation of Method 2 used in the test
public readonly struct DictionaryReferenceEnumerable<TKey, TValue> where TKey : notnull
{
    private readonly Dictionary<TKey, TValue> _dictionary;

    public DictionaryReferenceEnumerable(Dictionary<TKey, TValue> dictionary)
    {
        _dictionary = dictionary;
    }

    public Enumerator GetEnumerator() => new(_dictionary);

    public struct EnumeratorData
    {
        internal readonly Dictionary_<TKey, TValue> _dictionary;
        internal int _bucketIndex;
        internal int _lastIndex;
        internal int _currentIndex;

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        internal EnumeratorData(Dictionary<TKey, TValue> dictionary)
        {
            _dictionary = Unsafe.As<Dictionary_<TKey, TValue>>(dictionary);
            _bucketIndex = -1;
            _lastIndex = -1;
            _currentIndex = -1;
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        internal bool MoveNext()
        {
            _lastIndex = _currentIndex;
            if (_currentIndex < 0 || (_currentIndex = _dictionary._entries![_currentIndex].next) < 0)
            {
                int bucketHead = 0;
                while (bucketHead == 0 && ++_bucketIndex < _dictionary._buckets!.Length)
                {
                    bucketHead = _dictionary._buckets![_bucketIndex];
                }
                if (bucketHead == 0)
                {
                    return false;
                }
                _currentIndex = bucketHead - 1;
                _lastIndex = -1;
            }
            return true;
        }

        public TKey Key
        {
            [MethodImpl(MethodImplOptions.AggressiveInlining)]
            get => _dictionary._entries![_currentIndex].Key;
        }

        public ref TValue Value
        {
            [MethodImpl(MethodImplOptions.AggressiveInlining)]
            get => ref _dictionary._entries![_currentIndex].Value;
        }

        public void Remove()
        {
            int i = _currentIndex;
            var d = _dictionary;
            ref var entry = ref d._entries![i];
            var next = entry.next;
            _currentIndex = next;

            if (_lastIndex < 0)
            {
                d._buckets![_bucketIndex] = 1 + next;
            }
            else
            {
                d._entries![_lastIndex].next = next;
            }

            entry.next = (-3) - d._freeList;
            if (RuntimeHelpers.IsReferenceOrContainsReferences<TKey>())
            {
                Unsafe.AsRef(in entry.Key) = default!;
            }
            if (RuntimeHelpers.IsReferenceOrContainsReferences<TValue>())
            {
                entry.Value = default!;
            }

            d._freeList = i;
            d._freeCount++;
        }
    }

    public struct Enumerator
    {
        private EnumeratorData _data;

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public Enumerator(Dictionary<TKey, TValue> dictionary)
        {
            _data = new(dictionary);
        }

        public bool MoveNext()
        {
            return _data.MoveNext();
        }

        public EnumeratorData Current => _data;
    }
}

Alternative designs

Linq-like extension method:

As mentioned below, the same effect can be achieved by providing a RemoveWhere method that passes TKey and ref TValue to a delegate:

public bool RemoveDictionaryEntryPredicate<TKey, TValue>(TKey key, ref TValue value);
public static void RemoveWhere<TKey, TValue>(this Dictionary<TKey, TValue> dictionary, RemoveDictionaryEntryPredicate predicate);

such that the delegate will be able to access the value by reference and specify whether each item should be removed.

Advantages: 1) seems more natrual to C#; 2) the dictionary has chance to do rehashing after finishes enumerating.

Disadvantages: 1) the runtime is currently unable to fully inline delegate calls; 2) if the delegate need to access other variables, creating the closure requires additional allocation at each call.

Risks

Some design considerations:

  • Same safety concerns shared by other CollectionsMarshal methods, but this is by design. User should make sure the use is valid, e.g., by guaranteeing no modification is made to the collection during enumerating.
  • The returned struct enumerable can only be enumerated directly, and cannot be converted to the IEnumerable<T> interface, because the T here is a by-ref type. This should be fine since usually converting to interfaces causes boxing and GC allocation, which is against high-performance requirements.
@acaly acaly added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Aug 29, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added area-System.Collections untriaged New issue has not been triaged by the area owner labels Aug 29, 2021
@ghost
Copy link

ghost commented Aug 29, 2021

Tagging subscribers to this area: @eiriktsarpalis
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and motivation

CollectionsMarshal class has been introduced to expose the collection class's underlying data by reference in order to reduce unnecessary copy of large struct values when accessing only a small part of the value. For Dictionary<TKey, TValue>, there are methods that manipulate individual key-value pairs. However, these methods mainly requires accessing from know keys. When the user wants to do batch actions to many elements in the dictionary, enumerating keys and then doing a hash table look up would not be the most efficient way.

Consider a very common use case of Dictionary<,> class: the user wants to use it like an in-memory database table, to store large struct values, each containing the data of one row, and all values are indexed by a primary key, which is itself part of the row data but immutable. When reading and writing data from one row with known primary key, existing methods like GetValueRefOrNullRef is sufficient. But if the user wants to read all rows, doing some work to each of them (or to some of them that meets specific condition), it would be better if there is a way to enumerate the dictionary returning ref to each value.

This issue mainly happens to Dictionary<,> because List<T> already exposes its Span<T>, which can be used to enumerate by a for (int i = 0; ... loop to get the references.

API Proposal

namespace System.Runtime.InteropServices
{
    public static class CollectionsMarshal
    {
        public static DictionaryValueRefEnumerable<TKey, TValue> EnumerateValuesByRef<TKey, TValue>(Dictionary<TKey, TValue> dictionary) => new(dictionary);

        public struct DictionaryValueRefEnumerable<TKey, TValue>
        {
            internal DictionaryValueRefEnumerable(Dictionary<TKey, TValue> dictionary) { ... }

            DictionaryValueRefEnumerable<TKey, TValue> GetEnumerable() => this;

            public bool MoveNext() { ... }

            public ref TValue Current
            {
                get { ... }
            }
        }
    }
}

API Usage

struct Data
{
    public int SomeValue;
    //Other fields.
}
Dictionary<int, Data> dictionary = new();

//...

foreach (ref var data : CollectionsMarshal.EnumerateValuesByRef(dictionary))
{
    if (data.SomeValue != -1)
    {
        data.SomeValue += 1;
    }
}

Risks

Some design considerations:

  • Same safety concerns shared by other CollectionsMarshal methods, but this is by design. User should make sure the use is valid, e.g., by guaranteeing no modification is made to the collection during enumerating.
  • The returned struct enumerable can only be enumerated directly, and cannot be converted to the IEnumerable<T> interface, because the T here is a by-ref type. This should be fine since usually converting to interfaces causes boxing and GC allocation, which is against high-performance requirements.
  • Only values can be enumerated, and the key for each item cannot be easily obtained. An alternative way is to return reference to a new struct type of
public struct ImmutableKeyKeyValuePair<TKey, TValue>
{
    public TKey Key { get; }
    public TValue Value;
}

by converting from the corresponding part of the Dictionary<TKey, TValue>.Entry struct with Unsafe.As.

Author: acaly
Assignees: -
Labels:

api-suggestion, area-System.Collections, untriaged

Milestone: -

@GrabYourPitchforks
Copy link
Member

GrabYourPitchforks commented Aug 29, 2021

This would probably benefit from ref fields being a first-class citizen of the runtime & language (see dotnet/csharplang#1147). It's certainly possible to have an implementation without this, but it's likely to have suboptimal performance.

public readonly ref struct KeyValuePairByRef<TKey, TValue>
{
    private readonly ref Entry _entry;

    public ref readonly TKey Key => ref _entry.Key;
    public ref TValue Value => ref _entry.Value;
}

public struct TheEnumerator
{
    private readonly Entry[] _entries;
    private int _index;

    public bool MoveNext() { /* checks mimicking current enumerator checks */ }
    public KeyValuePairByRef<TKey, TValue> Current => new (ref _entries[_index]); // + whatever other checks need take place
}

@acaly
Copy link
Author

acaly commented Aug 29, 2021

I have another proposal on the use of TValue ref that is not quite related to the proposal but I think might also be good to mention here. Do you think we can remove a key-value pair from the dictionary given the reference to TValue?

I looked at the current implementation, it seems that's completely possible. The index can be calculated from Unsafe.ByteOffset. This might even be possible while enumerating. As long as the dictionary isn't rehashing, the references should remain valid.

EDIT

Not quite possible, at least directly given the ref. It would need to change the single linked list to double linked list, which will make huge regression.

But it would really be helpful if we could remove while enumerating. I think if the enumerator records more states, it will be possible. For example, it can record 2 pointers, one for scanning for linked list heads (through _buckets), the other for enumerating the current linked list. I will think of it and update the proposal.

@acaly
Copy link
Author

acaly commented Aug 29, 2021

This would probably benefit from ref fields being a first-class citizen of the runtime & language (see dotnet/csharplang#1147). It's certainly possible to have an implementation without this, but it's likely to have suboptimal performance.

public readonly ref struct KeyValuePairByRef<TKey, TValue>
{
    private readonly ref Entry _entry;

    public ref readonly TKey Key => ref _entry.Key;
    public ref TValue Value => ref _entry.Value;
}

public struct TheEnumerator
{
    private readonly Entry[] _entries;
    private int _index;

    public bool MoveNext() { /* checks mimicking current enumerator checks */ }
    public KeyValuePairByRef<TKey, TValue> Current => new (ref _entries[_index]); // + whatever other checks need take place
}

Since it's inside the corelib, maybe ByReference<T> is enough?

@GrabYourPitchforks
Copy link
Member

Since it's inside the corelib, maybe ByReference<T> is enough?

Perhaps, but we've historically been hesitant to use it for scenarios beyond simply Span<T>. It's partly that we don't want to risk propagating a type throughout the runtime when that type has really only been stress tested in one specific scenario, but also that we don't want to make it a habit of relying on runtime tricks for something that should be a first-class feature available to all. (Solving the latter problem would also necessarily imply solving the former, since shipping such a feature would involve properly testing it.)

@acaly
Copy link
Author

acaly commented Aug 30, 2021

I have update the proposal with some benchmark results. Both methods have some improvements in enumerating (the ratio strongly depends on how fast the hash can be calculated). The second method, when used with removing, is also better than the "standard workaround" of creating another list recording keys, because it's not only faster, but also avoids allocation of the list.

@acaly acaly changed the title [API Proposal]: CollectionsMarshal method enumerating Dictionary<TKey, TValue>'s Values by reference. [API Proposal]: CollectionsMarshal method that enumerates dictionary by reference and allows removing. Aug 30, 2021
@GrabYourPitchforks
Copy link
Member

The issue was changed from "allow iterating by references" to "allow removal while iterating", which significantly changes the scope & direction of this work.

I think you'll have more success proposing an API which tries to split the two concerns. For example, you could consider a RemoveWhere method which takes a predicate, which doesn't require any CollectionsMarshal or ref field goop. A predicate-based approach would also have a natural "end" state (the RemoveWhere method returning), which could allow the dictionary to rebucket its internal entries.

@stephentoub
Copy link
Member

stephentoub commented Aug 30, 2021

Dictionary<> was updated circa .NET Core 2.1 such that you can remove while iterating; removal no longer invalidates enumerators.
dotnet/coreclr#18854

@acaly
Copy link
Author

acaly commented Aug 30, 2021

Dictionary<> was updated circa .NET Core 2.1 such that you can remove while iterating; removal does not invalidate enumerators.

Yes, but it will need another hash table look up, as I mentioned in the background section.

@stephentoub
Copy link
Member

stephentoub commented Aug 30, 2021

as I mentioned in the background section

I'm saying, the background section states "or Dictionary<,>, modifying the content will invalidates the enumerator, and users are forced to temporarily record keys to remove elsewhere, and remove them after enumeration finishes" which is not correct. This is valid code:

var d = new Dictionary<int, int> { { 1, 1 }, { 2, 2 }, { 3, 3 } };
foreach (var pair in d)
{
    d.Remove(pair.Key);
}

@acaly
Copy link
Author

acaly commented Aug 30, 2021

The issue was changed from "allow iterating by references" to "allow removal while iterating", which significantly changes the scope & direction of this work.

I think you'll have more success proposing an API which tries to split the two concerns. For example, you could consider a RemoveWhere method which takes a predicate, which doesn't require any CollectionsMarshal or ref field goop. A predicate-based approach would also have a natural "end" state (the RemoveWhere method returning), which could allow the dictionary to rebucket its internal entries.

I understand. The reason I think it's probably better to mention it here is because if one method can cover both, it will be easier for user to decide which one to use. If there are two proposals, each add one enumeration method, the API will look much more complcated.

I think this issue might be an initial attempt that illustrates the problem. If you think it's better having an enumerator first, without allowing the removal, I (or anyone else) can open a separate issue to track it. In any case, I still think both proposals are about avoiding repeated hash calculation and look-up while enumerating, and can be seen as one unified problem and solved together.

@acaly
Copy link
Author

acaly commented Aug 30, 2021

as I mentioned in the background section

I'm saying, the background section states "or Dictionary<,>, modifying the content will invalidates the enumerator, and users are forced to temporarily record keys to remove elsewhere, and remove them after enumeration finishes" which is not correct. This is valid code:

var d = new Dictionary<int, int> { { 1, 1 }, { 2, 2 }, { 3, 3 } };
foreach (var pair in d)
{
    d.Remove(pair.Key);
}

You are right. I will update the wording and probably redo some benchmarks.

@acaly acaly changed the title [API Proposal]: CollectionsMarshal method that enumerates dictionary by reference and allows removing. [API Proposal]: CollectionsMarshal method that enumerates dictionary by reference to avoid unnecessary hash calculation. Aug 30, 2021
@eiriktsarpalis eiriktsarpalis added api-needs-work API needs work before it is approved, it is NOT ready for implementation tenet-performance Performance related issue and removed untriaged New issue has not been triaged by the area owner labels Sep 14, 2021
@eiriktsarpalis eiriktsarpalis added this to the Future milestone Sep 14, 2021
@krwq
Copy link
Member

krwq commented Sep 27, 2021

@acaly can you please update the issue to add example scenarios which this could solve? I personally never needed such functionality but YMMV

@acaly
Copy link
Author

acaly commented Sep 30, 2021

@krwq Thanks for the reply. I wrote a very brief description on the scenario in the proposal.

Consider a very common use case of Dictionary<,> class: the user wants to use it like an in-memory database table, to store large struct values, each containing the data of one row, and all values are indexed by a primary key, which is itself part of the row data but immutable. When reading and writing data from one row with known primary key, existing methods like GetValueRefOrNullRef is sufficient. But if the user wants to read all rows, doing some work to each of them (or to some of them that meets specific condition), it would be better if there is a way to enumerate the dictionary returning ref to each value.

I encountered this problem when I was implementing a physics engine in C#. I stored the active interaction data between entities in a dictionary, with (entity1, entity2, interactionType) as key and the data (as a value type) as value. This brings some benefits: 1) the collision detection module can use the dictionary to check whether there is already an active interaction between specific entities; 2) using dictionaries avoids unnecessary copy when adding/removing interactions. An alternative approach would be to use reference type instead, but it requires manually pooling instances to make an allocation-free implementation, and it may also sacrifice locality when enumerating interactions.

@eiriktsarpalis eiriktsarpalis removed the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Nov 16, 2021
@lakecherry
Copy link

I have a similar high performance scenario where the dictionary values (int) all have to be divided by 2 (some kind of decay operation).

How about changing dictionarys' ValueCollection enumerator from public TValue Current => _currentValue!; to public ref TValue Current => ref _currentValue!;

Existing code would still be valid, and ref assignments would give a change option.

@KalleOlaviNiemitalo
Copy link

Existing code would still be valid, and ref assignments would give a change option.

Existing source code yes -- but not existing binaries.

@lakecherry
Copy link

That's true ... but thanks for the quick response!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-needs-work API needs work before it is approved, it is NOT ready for implementation area-System.Collections tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

7 participants