Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add System.Collections.Concurrent.ConcurrentHashSet<T> #39919

Closed
LeaFrock opened this issue Jul 25, 2020 · 18 comments
Closed

Add System.Collections.Concurrent.ConcurrentHashSet<T> #39919

LeaFrock opened this issue Jul 25, 2020 · 18 comments
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Collections
Milestone

Comments

@LeaFrock
Copy link
Contributor

LeaFrock commented Jul 25, 2020

Background and Motivation

We already have ConcurrentDictionary which takes great advantages at concurrent environment comparing to Dictionary. However, HashSet doesn't have a Concurrent version.
Sometimes(maybe very often in fact), we hope to use ConcurrentHashSet easily which its elements are never duplicated, with an O(1) operation for contains or remove methods at concurrent environment. It's quite different from ConcurrentBag and provides another choice for suitable usage.
You can also see details at issue #16443 . Almost 5 years passes, nobody gives an API proposal, so I'm here to finish it.

Proposed API

    public partial class ConcurrentHashSet<T> : System.Collections.Generic.ICollection<T>, System.Collections.Generic.IEnumerable<T>, System.Collections.Generic.IReadOnlyCollection<T>, System.Collections.ICollection,System.Collections.IEnumerable where T : notnull
    {
        public ConcurrentHashSet() { }
        public ConcurrentHashSet(System.Collections.Generic.IEnumerable<T> collection) { }
        public ConcurrentHashSet(System.Collections.Generic.IEnumerable<T> collection, System.Collections.Generic.IEqualityComparer<T>? comparer) { }
        public ConcurrentHashSet(System.Collections.Generic.IEqualityComparer<T>? comparer) { }
        public ConcurrentHashSet(int concurrencyLevel, System.Collections.Generic.IEnumerable<T> collection, System.Collections.Generic.IEqualityComparer<T>? comparer) { }
        public ConcurrentHashSet(int concurrencyLevel, int capacity) { }
        public ConcurrentHashSet(int concurrencyLevel, int capacity, System.Collections.Generic.IEqualityComparer<T>? comparer) { }

        public int Count { get { throw null; } }
        public bool IsEmpty { get { throw null; } }
        bool System.Collections.Generic.ICollection<T>.IsReadOnly { get { throw null; } }
        bool System.Collections.ICollection.IsSynchronized { get { throw null; } }
        object System.Collections.ICollection.SyncRoot { get { throw null; } }

        public void Clear() { }
        public bool Contains(T item) { throw null; }
        public System.Collections.Generic.IEnumerator<T> GetEnumerator() { throw null; }
        public T GetOrAdd(T item) { throw null; }
        void System.Collections.Generic.ICollection<T>.Add(T item) { }
        bool System.Collections.Generic.ICollection<T>.Contains(T item) { throw null; }
        void System.Collections.Generic.ICollection<T>.CopyTo(T[] array, int index) { }
        bool System.Collections.Generic.ICollection<T>.Remove(T item) { throw null; }
        void System.Collections.ICollection.CopyTo(System.Array array, int index) { }
        System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator() { throw null; }
        public T[] ToArray() { throw null; }
        public bool TryAdd(T item) { throw null; }
        public bool TryRemove(T item) { throw null; }

        public System.Collections.Generic.HashSet<T> ToHashSet() { } //optional
    }

Usage Examples

The main goal of ConcurrentHashSet is to improve performance and reduce cost, taking replace of HashSet-With-Lock pattern for concurrent environment. If we have it, the code can be written like below,

public class Sample()
{
     // private readonly object _syncRoot = new object();

     private readonly ConcurrentHashSet<Guid> _concurrentGuidSet = new ConcurrentHashSet<Guid>();
     // private readonly HashSet<Guid> _guidSet = new Hash<Guid>();

     public async Task Generate()
     {
          for(int i = 0; i < 100; i++)
          {
               Task.Run( () =>
               {
                    // The goal is to improve the performance, and ease the way of programming here.
                    _concurrentGuidSet.TryAdd(Guid.NewGuid());
                    //lock(_syncRoot)
                    //{
                    //     _guidSet.Add(Guid.NewGuid());
                    //}
               })
          }
     }
}

The difference could be very similiar to the one between Dictionary and ConcurrentDictionary.

Alternative Designs

  1. Many developers tell that they use ConcurrentDictionary<T, byte> as a replace when meeting the requirement. They don't want the Value actually which is meaningless and totally a waste of memory. However, someones suggest that the first version of ConcurrentHashSet can be implemented by underlying ConcurrentDictionary<T, byte>, in order to add the APIs as soon as possible. Individually, I prefer to an independent implementation which is necessary to be discussed between excellent engineers of Microsoft & our community.

  2. Do we need it to inherit ISet<T>/IReadOnlySet<T>? Though it's the concurrent version of HashSet, I'm not sure whether each API of HashSet needs a copy. In my opinion, not all APIs make sense while running in a concurrent environment. Of course, the discussion is required.

  3. Do we need this API public HashSet<T> ToHashSet() to get a synchronous instance for easy usage?

Risks

It might not own concurrent version of the whole APIs of HashSet. Because it's a new collection, no break changes will occur. It just needs a pretty design :).

@LeaFrock LeaFrock added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Jul 25, 2020
@LeaFrock
Copy link
Contributor Author

LeaFrock commented Jul 25, 2020

By the way, this issue is not duplicated of #15326 , though we have almost the same requirement.

@ghost
Copy link

ghost commented Jul 25, 2020

Tagging subscribers to this area: @eiriktsarpalis
See info in area-owners.md if you want to be subscribed.

@scalablecory scalablecory added this to the Future milestone Jul 25, 2020
@danmoseley
Copy link
Member

danmoseley commented Jul 26, 2020

Thanks for doing this. Could you please update the proposal to be complete? No implementation is needed, but everything that would be public, eg.,

  • all constructors
  • all members fulfiling the interface contracts - whether explicit or implicit. Eg., there's no getter for the generic enumerator, etc.
  • to your question in Get core-setup building in the consolidated repo. #2, does it need ISet and IReadOnlySet-- it sounds like your use case is for storage and retrieval, not set operations like intersect and superset. Better to mimimally satisfy demonstrated needs, the interfaces can always be added later without breaking code, if use case emerges.
  • CopyTo probably should be hidden in explicit interface implementation, as for ConcurrentDictionary
  • Remove should probably be TryRemove
  • do we need XXAdd overloads with a factory delegate, as for ConcurrentDictionary?
  • do we need AddOrUpdate?
  • do we need TryUpdate?
  • does it need IsEmpty? this is syntactic sugar that with present JIT may be less efficient than checking Count == 0. (Fix CA1836 rule violations excluding Memory/Span #39890 (comment))

@LeaFrock
Copy link
Contributor Author

@danmosemsft Updated, thanks!

  • all constructors

Just take reference of ConcurrentDictionary<TKey, TValue>.

  • all members fulfiling the interface contracts - whether explicit or implicit. Eg., there's no getter for the generic enumerator, etc.
  • CopyTo probably should be hidden in explicit interface implementation, as for ConcurrentDictionary

Sorry, I'm confused whether the whole members of interface contract should be listed above? I deleted overloads of CopyTo and added getter for the generic enumerator.

  • to your question in Get core-setup building in the consolidated repo. #2, does it need ISet and IReadOnlySet-- it sounds like your use case is for storage and retrieval, not set operations like intersect and superset. Better to mimimally satisfy demonstrated needs, the interfaces can always be added later without breaking code, if use case emerges.

Agreed. I moved the related content to 'Alternative Designs'.

  • Remove should probably be TryRemove

Agreed & done.

  • do we need XXAdd overloads with a factory delegate, as for ConcurrentDictionary?
  • do we need AddOrUpdate?
  • do we need TryUpdate?

I think NO, because it's a simpler collection than ConcurrentDictionary. For ConcurrentDictionary, we usually use AddOrUpdate/TryUpdate or factory delegate overloads to update Value only. ConcurrentHashSet owns the Key part only, so I don't see any point in them. Any comments welcomed.

  • does it need IsEmpty? this is syntactic sugar that with present JIT may be less efficient than checking Count == 0.

I think we can add it. The following is the reasons from the perspective of my view,

  • ConcurrentDictionary owns it
  • it really simplify codes
  • the performance gap is acceptable and I believe JIT performance here will be optimized sooner or later

@danmoseley
Copy link
Member

Thanks

Sorry, I'm confused whether the whole members of interface contract should be listed above?

We're looking for the same format as in this file
https://github.com/dotnet/runtime/blob/master/src/libraries/System.Collections.Concurrent/ref/System.Collections.Concurrent.cs#L75

@LeaFrock
Copy link
Contributor Author

LeaFrock commented Jul 27, 2020

@danmosemsft Done.

@danmoseley
Copy link
Member

Ok thanks. The owners of this area will take a look. It might be a little while since we are focused on shipping 5.0 right now.

@danmoseley danmoseley added the untriaged New issue has not been triaged by the area owner label Jul 29, 2020
@layomia layomia removed the untriaged New issue has not been triaged by the area owner label Aug 28, 2020
@eiriktsarpalis
Copy link
Member

Many developers tell that they use ConcurrentDictionary<T, byte> as a replace when meeting the requirement. They don't want the Value actually which is meaningless and totally a waste of memory.

This is a compromise that has more or less worked fine, at least for my own requirements in the past. It would be interesting if we could quantify any performance improvements that come out of a dedicated ConcurrentHashSet implementation.

@bennidhamma
Copy link

Even if you just used a ConcurrentDictionary<T, byte> initially, under the covers, people could start using a cleaner interface.

@eiriktsarpalis
Copy link
Member

Even if you just used a ConcurrentDictionary<T, byte> initially, under the covers, people could start using a cleaner interface.

Adding new classes always comes at a cost, even if they're just facades.

@LeaFrock
Copy link
Contributor Author

It would be interesting if we could quantify any performance improvements that come out of a dedicated ConcurrentHashSet implementation.

Well, so the situation becomes that we're all waiting for some kind of implementation by someone now, in order to compare it with the current solution... that doesn't sound good =.=

@danmoseley
Copy link
Member

It will be more space efficient, even if it's no faster, which seems potentially worthwhile to me.

I noticed this (didn't look at it): https://github.com/i3arnon/ConcurrentHashSet by @i3arnon

@eiriktsarpalis
Copy link
Member

In practice though we'd be looking at maybe shaving one byte per Node instance which I'm not sure would even register in heap allocations. That being said, there might be opportunities for other optimizations as well, happy to be surprised :-)

@danmoseley
Copy link
Member

That is true.

@stephentoub
Copy link
Member

stephentoub commented Dec 16, 2020

In practice though we'd be looking at maybe shaving one byte per Node instance

On 64-bit it generally won't even save that.

@i3arnon
Copy link
Contributor

i3arnon commented Dec 16, 2020

On 64-bit it won't even save that.

It depends on the type being used, doesn't it?
I did some naive tests with benchmark-net compared with ConcurrentHashSet and I can see some difference in allocations when Tkey is int, but not with reference types (on 64-bit) or any other sized types (long, short, etc.)
I assume that's because of how Node's members are being aligned. It consists of a reference to the next node (8 bytes), 2 int values (the key & hash-code, 8 bytes together) and the single byte value which requires another 8 byte block to keep alignment.

@stephentoub
Copy link
Member

stephentoub commented Dec 16, 2020

It depends on the type being used, doesn't it?

Yes, but for almost anything other than key==int, it won't increase the size.

The node stores the key, the value, a next node reference, and an int hash code, so whether the value increases the size of the object depends on whether the value can fit in the padding for the hashcode or whether it's consumed by the key. If the value is byte, then if the key is any reference type, for example, the byte value will just be in what is otherwise padding space from the hash code. The original example motivating this issue was for Guid as the T. In that case, the byte value would similarly just fit into the padding for the int, which is what I was referring to (but wasn't clear).

@eiriktsarpalis
Copy link
Member

Closing since the benefits over using the ConcurrentDictionary<T, byte> workaround or a third-party package don't seem to outweigh the cost of authoring a new collection type. We might revisit in the future if more evidence comes up.

@ghost ghost locked as resolved and limited conversation to collaborators Nov 28, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Collections
Projects
None yet
Development

No branches or pull requests

8 participants