Add System.Collections.Concurrent.ConcurrentHashSet<T> #39919

LeaFrock · 2020-07-25T06:38:21Z

Background and Motivation

We already have ConcurrentDictionary which takes great advantages at concurrent environment comparing to Dictionary. However, HashSet doesn't have a Concurrent version.
Sometimes(maybe very often in fact), we hope to use ConcurrentHashSet easily which its elements are never duplicated, with an O(1) operation for contains or remove methods at concurrent environment. It's quite different from ConcurrentBag and provides another choice for suitable usage.
You can also see details at issue #16443 . Almost 5 years passes, nobody gives an API proposal, so I'm here to finish it.

Proposed API

    public partial class ConcurrentHashSet<T> : System.Collections.Generic.ICollection<T>, System.Collections.Generic.IEnumerable<T>, System.Collections.Generic.IReadOnlyCollection<T>, System.Collections.ICollection,System.Collections.IEnumerable where T : notnull
    {
        public ConcurrentHashSet() { }
        public ConcurrentHashSet(System.Collections.Generic.IEnumerable<T> collection) { }
        public ConcurrentHashSet(System.Collections.Generic.IEnumerable<T> collection, System.Collections.Generic.IEqualityComparer<T>? comparer) { }
        public ConcurrentHashSet(System.Collections.Generic.IEqualityComparer<T>? comparer) { }
        public ConcurrentHashSet(int concurrencyLevel, System.Collections.Generic.IEnumerable<T> collection, System.Collections.Generic.IEqualityComparer<T>? comparer) { }
        public ConcurrentHashSet(int concurrencyLevel, int capacity) { }
        public ConcurrentHashSet(int concurrencyLevel, int capacity, System.Collections.Generic.IEqualityComparer<T>? comparer) { }

        public int Count { get { throw null; } }
        public bool IsEmpty { get { throw null; } }
        bool System.Collections.Generic.ICollection<T>.IsReadOnly { get { throw null; } }
        bool System.Collections.ICollection.IsSynchronized { get { throw null; } }
        object System.Collections.ICollection.SyncRoot { get { throw null; } }

        public void Clear() { }
        public bool Contains(T item) { throw null; }
        public System.Collections.Generic.IEnumerator<T> GetEnumerator() { throw null; }
        public T GetOrAdd(T item) { throw null; }
        void System.Collections.Generic.ICollection<T>.Add(T item) { }
        bool System.Collections.Generic.ICollection<T>.Contains(T item) { throw null; }
        void System.Collections.Generic.ICollection<T>.CopyTo(T[] array, int index) { }
        bool System.Collections.Generic.ICollection<T>.Remove(T item) { throw null; }
        void System.Collections.ICollection.CopyTo(System.Array array, int index) { }
        System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator() { throw null; }
        public T[] ToArray() { throw null; }
        public bool TryAdd(T item) { throw null; }
        public bool TryRemove(T item) { throw null; }

        public System.Collections.Generic.HashSet<T> ToHashSet() { } //optional
    }

Usage Examples

The main goal of ConcurrentHashSet is to improve performance and reduce cost, taking replace of HashSet-With-Lock pattern for concurrent environment. If we have it, the code can be written like below,

public class Sample()
{
     // private readonly object _syncRoot = new object();

     private readonly ConcurrentHashSet<Guid> _concurrentGuidSet = new ConcurrentHashSet<Guid>();
     // private readonly HashSet<Guid> _guidSet = new Hash<Guid>();

     public async Task Generate()
     {
          for(int i = 0; i < 100; i++)
          {
               Task.Run( () =>
               {
                    // The goal is to improve the performance, and ease the way of programming here.
                    _concurrentGuidSet.TryAdd(Guid.NewGuid());
                    //lock(_syncRoot)
                    //{
                    //     _guidSet.Add(Guid.NewGuid());
                    //}
               })
          }
     }
}

The difference could be very similiar to the one between Dictionary and ConcurrentDictionary.

Alternative Designs

Many developers tell that they use ConcurrentDictionary<T, byte> as a replace when meeting the requirement. They don't want the Value actually which is meaningless and totally a waste of memory. However, someones suggest that the first version of ConcurrentHashSet can be implemented by underlying ConcurrentDictionary<T, byte>, in order to add the APIs as soon as possible. Individually, I prefer to an independent implementation which is necessary to be discussed between excellent engineers of Microsoft & our community.
Do we need it to inherit ISet<T>/IReadOnlySet<T>? Though it's the concurrent version of HashSet, I'm not sure whether each API of HashSet needs a copy. In my opinion, not all APIs make sense while running in a concurrent environment. Of course, the discussion is required.
Do we need this API public HashSet<T> ToHashSet() to get a synchronous instance for easy usage?

Risks

It might not own concurrent version of the whole APIs of HashSet. Because it's a new collection, no break changes will occur. It just needs a pretty design :).

The text was updated successfully, but these errors were encountered:

LeaFrock · 2020-07-25T07:25:00Z

By the way, this issue is not duplicated of #15326 , though we have almost the same requirement.

ghost · 2020-07-25T09:36:08Z

Tagging subscribers to this area: @eiriktsarpalis
See info in area-owners.md if you want to be subscribed.

danmoseley · 2020-07-26T22:34:09Z

Thanks for doing this. Could you please update the proposal to be complete? No implementation is needed, but everything that would be public, eg.,

all constructors
all members fulfiling the interface contracts - whether explicit or implicit. Eg., there's no getter for the generic enumerator, etc.
to your question in Get core-setup building in the consolidated repo. #2, does it need ISet and IReadOnlySet-- it sounds like your use case is for storage and retrieval, not set operations like intersect and superset. Better to mimimally satisfy demonstrated needs, the interfaces can always be added later without breaking code, if use case emerges.
CopyTo probably should be hidden in explicit interface implementation, as for ConcurrentDictionary
Remove should probably be TryRemove
do we need XXAdd overloads with a factory delegate, as for ConcurrentDictionary?
do we need AddOrUpdate?
do we need TryUpdate?
does it need IsEmpty? this is syntactic sugar that with present JIT may be less efficient than checking Count == 0. (Fix CA1836 rule violations excluding Memory/Span #39890 (comment))

LeaFrock · 2020-07-27T02:30:36Z

@danmosemsft Updated, thanks!

all constructors

Just take reference of ConcurrentDictionary<TKey, TValue>.

all members fulfiling the interface contracts - whether explicit or implicit. Eg., there's no getter for the generic enumerator, etc.

CopyTo probably should be hidden in explicit interface implementation, as for ConcurrentDictionary

Sorry, I'm confused whether the whole members of interface contract should be listed above? I deleted overloads of CopyTo and added getter for the generic enumerator.

to your question in Get core-setup building in the consolidated repo. #2, does it need ISet and IReadOnlySet-- it sounds like your use case is for storage and retrieval, not set operations like intersect and superset. Better to mimimally satisfy demonstrated needs, the interfaces can always be added later without breaking code, if use case emerges.

Agreed. I moved the related content to 'Alternative Designs'.

Remove should probably be TryRemove

Agreed & done.

do we need XXAdd overloads with a factory delegate, as for ConcurrentDictionary?

do we need AddOrUpdate?

do we need TryUpdate?

I think NO, because it's a simpler collection than ConcurrentDictionary. For ConcurrentDictionary, we usually use AddOrUpdate/TryUpdate or factory delegate overloads to update Value only. ConcurrentHashSet owns the Key part only, so I don't see any point in them. Any comments welcomed.

does it need IsEmpty? this is syntactic sugar that with present JIT may be less efficient than checking Count == 0.

I think we can add it. The following is the reasons from the perspective of my view,

ConcurrentDictionary owns it
it really simplify codes
the performance gap is acceptable and I believe JIT performance here will be optimized sooner or later

danmoseley · 2020-07-27T05:47:51Z

Thanks

Sorry, I'm confused whether the whole members of interface contract should be listed above?

We're looking for the same format as in this file
https://github.com/dotnet/runtime/blob/master/src/libraries/System.Collections.Concurrent/ref/System.Collections.Concurrent.cs#L75

LeaFrock · 2020-07-27T06:09:09Z

@danmosemsft Done.

danmoseley · 2020-07-29T15:35:14Z

Ok thanks. The owners of this area will take a look. It might be a little while since we are focused on shipping 5.0 right now.

eiriktsarpalis · 2020-10-30T15:04:39Z

Many developers tell that they use ConcurrentDictionary<T, byte> as a replace when meeting the requirement. They don't want the Value actually which is meaningless and totally a waste of memory.

This is a compromise that has more or less worked fine, at least for my own requirements in the past. It would be interesting if we could quantify any performance improvements that come out of a dedicated ConcurrentHashSet implementation.

bennidhamma · 2020-12-13T05:59:17Z

Even if you just used a ConcurrentDictionary<T, byte> initially, under the covers, people could start using a cleaner interface.

eiriktsarpalis · 2020-12-14T11:08:02Z

Even if you just used a ConcurrentDictionary<T, byte> initially, under the covers, people could start using a cleaner interface.

Adding new classes always comes at a cost, even if they're just facades.

LeaFrock · 2020-12-15T01:46:08Z

It would be interesting if we could quantify any performance improvements that come out of a dedicated ConcurrentHashSet implementation.

Well, so the situation becomes that we're all waiting for some kind of implementation by someone now, in order to compare it with the current solution... that doesn't sound good =.=

danmoseley · 2020-12-15T02:39:21Z

It will be more space efficient, even if it's no faster, which seems potentially worthwhile to me.

I noticed this (didn't look at it): https://github.com/i3arnon/ConcurrentHashSet by @i3arnon

eiriktsarpalis · 2020-12-15T10:15:43Z

In practice though we'd be looking at maybe shaving one byte per Node instance which I'm not sure would even register in heap allocations. That being said, there might be opportunities for other optimizations as well, happy to be surprised :-)

danmoseley · 2020-12-15T18:01:55Z

That is true.

stephentoub · 2020-12-16T11:03:11Z

In practice though we'd be looking at maybe shaving one byte per Node instance

On 64-bit it generally won't even save that.

i3arnon · 2020-12-16T11:25:51Z

On 64-bit it won't even save that.

It depends on the type being used, doesn't it?
I did some naive tests with benchmark-net compared with ConcurrentHashSet and I can see some difference in allocations when Tkey is int, but not with reference types (on 64-bit) or any other sized types (long, short, etc.)
I assume that's because of how Node's members are being aligned. It consists of a reference to the next node (8 bytes), 2 int values (the key & hash-code, 8 bytes together) and the single byte value which requires another 8 byte block to keep alignment.

stephentoub · 2020-12-16T11:39:12Z

It depends on the type being used, doesn't it?

Yes, but for almost anything other than key==int, it won't increase the size.

The node stores the key, the value, a next node reference, and an int hash code, so whether the value increases the size of the object depends on whether the value can fit in the padding for the hashcode or whether it's consumed by the key. If the value is byte, then if the key is any reference type, for example, the byte value will just be in what is otherwise padding space from the hash code. The original example motivating this issue was for Guid as the T. In that case, the byte value would similarly just fit into the padding for the int, which is what I was referring to (but wasn't clear).

eiriktsarpalis · 2021-10-29T14:06:57Z

Closing since the benefits over using the ConcurrentDictionary<T, byte> workaround or a third-party package don't seem to outweigh the cost of authoring a new collection type. We might revisit in the future if more evidence comes up.

LeaFrock added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Jul 25, 2020

LeaFrock mentioned this issue Jul 25, 2020

What about adding a ConcurrentSet<T>? #16443

Closed

scalablecory added the area-System.Collections label Jul 25, 2020

scalablecory added this to the Future milestone Jul 25, 2020

danmoseley added the untriaged New issue has not been triaged by the area owner label Jul 29, 2020

layomia removed the untriaged New issue has not been triaged by the area owner label Aug 28, 2020

eiriktsarpalis closed this as completed Oct 29, 2021

ghost locked as resolved and limited conversation to collaborators Nov 28, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add System.Collections.Concurrent.ConcurrentHashSet<T> #39919

Add System.Collections.Concurrent.ConcurrentHashSet<T> #39919

LeaFrock commented Jul 25, 2020 •

edited

LeaFrock commented Jul 25, 2020 •

edited

ghost commented Jul 25, 2020

danmoseley commented Jul 26, 2020 •

edited

LeaFrock commented Jul 27, 2020

danmoseley commented Jul 27, 2020

LeaFrock commented Jul 27, 2020 •

edited

danmoseley commented Jul 29, 2020

eiriktsarpalis commented Oct 30, 2020

bennidhamma commented Dec 13, 2020

eiriktsarpalis commented Dec 14, 2020

LeaFrock commented Dec 15, 2020

danmoseley commented Dec 15, 2020

eiriktsarpalis commented Dec 15, 2020

danmoseley commented Dec 15, 2020

stephentoub commented Dec 16, 2020 •

edited

i3arnon commented Dec 16, 2020 •

edited

stephentoub commented Dec 16, 2020 •

edited

eiriktsarpalis commented Oct 29, 2021

Add System.Collections.Concurrent.ConcurrentHashSet<T> #39919

Add System.Collections.Concurrent.ConcurrentHashSet<T> #39919

Comments

LeaFrock commented Jul 25, 2020 • edited

Background and Motivation

Proposed API

Usage Examples

Alternative Designs

Risks

LeaFrock commented Jul 25, 2020 • edited

ghost commented Jul 25, 2020

danmoseley commented Jul 26, 2020 • edited

LeaFrock commented Jul 27, 2020

danmoseley commented Jul 27, 2020

LeaFrock commented Jul 27, 2020 • edited

danmoseley commented Jul 29, 2020

eiriktsarpalis commented Oct 30, 2020

bennidhamma commented Dec 13, 2020

eiriktsarpalis commented Dec 14, 2020

LeaFrock commented Dec 15, 2020

danmoseley commented Dec 15, 2020

eiriktsarpalis commented Dec 15, 2020

danmoseley commented Dec 15, 2020

stephentoub commented Dec 16, 2020 • edited

i3arnon commented Dec 16, 2020 • edited

stephentoub commented Dec 16, 2020 • edited

eiriktsarpalis commented Oct 29, 2021

LeaFrock commented Jul 25, 2020 •

edited

LeaFrock commented Jul 25, 2020 •

edited

danmoseley commented Jul 26, 2020 •

edited

LeaFrock commented Jul 27, 2020 •

edited

stephentoub commented Dec 16, 2020 •

edited

i3arnon commented Dec 16, 2020 •

edited

stephentoub commented Dec 16, 2020 •

edited