Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve the default hash code for structs #4414

Open
GSPP opened this issue Aug 4, 2015 · 12 comments
Open

Improve the default hash code for structs #4414

GSPP opened this issue Aug 4, 2015 · 12 comments
Labels
area-System.Runtime enhancement Product code improvement that does NOT require public API changes/additions tenet-performance Performance related issue
Milestone

Comments

@GSPP
Copy link

GSPP commented Aug 4, 2015

Apparently, the default GetHashCode implementation for structs is really bad: http://stackoverflow.com/a/5927853/ (see FastGetValueTypeHashCodeHelper). This answer lists a lot of problems. For example, if the struct contains a reference type field the hash code is based on one field only. This can cause excessive hash collisions.

I'd like to propose that the default hash code and equality be changed to a more useful form. The hash code and equality comparison should always be based on calling GetHashCode and Equals on all fields. The hash code could be based in the common pattern accumulator * 37 + nextHashCode. Resharper uses that for its "generate GetHashCode" feature and it's quite reasonable.

Ideally, the code to do this could be added to each struct type at runtime so that the JIT can generate super fast code for this on demand. In many cases this would obviate the need for custom overrides of these two functions.

In order to mitigate compatibility risk this should be tied to some form of compatibility switch. The BinaryCompatibility class infrastructure seems to be made to address exactly this problem.

@omariom
Copy link
Contributor

omariom commented Aug 14, 2015

To use a struct efficiently as key in hash based collections you already have to override GetHashCode and implement IEquatable. Otherwise you will get a lot of boxing.

@omariom
Copy link
Contributor

omariom commented Aug 14, 2015

Here is an example:

const int Iters = 10 * 1000000;
var dict = new Dictionary<Key, int>(Iters); // prealocate all the required storage

for (int  i = 0; i < Iters; i++)
{
    dict.Add(new Key(i), 0);
}

...

public struct Key
{
    int i;
    public Key(int i)
    {
        this.i = i;
    }
}

It takes 775 ms and allocates 24 bytes (on average) for each Add call.

If we implement GetHashCode:

public struct Key
{
    int i;
    public Key(int i)
    {
        this.i = i;
    }
    public override int GetHashCode()
    {
        return i;
    }
}

it takes 200 ms with no allocations.

Let's make the generated hash code not so perfect. It will cause Equals to be called.

    public override int GetHashCode()
    {
        return i % (1 << 20);
    }

It will cost us 2.5 seconds and 205 bytes per call.

Finally implement IEquatable:

public struct Key: IEquatable<Key>
{
    int i;

    public Key(int i)
    {
       this.i = i;
    }

    public override int GetHashCode()
    {
        return i % (1 << 20);
    }

    public bool Equals(Key other)
    {
        return i == other.i;
    }
 }

Now even though the GetHashCode implementation causes hash collisions
time drops to 775 ms with zero bytes allocated.

@GSPP
Copy link
Author

GSPP commented Aug 14, 2015

@omariom that is a good point. This seems like a JIT limitation. Maybe we can fix the JIT in this regard as well. The JIT certainly doesn't have to box a struct in order to call default equals and GetHashCode methods. The dictionaries code is being specialized for this exact struct type. This means that when jitting the exact type is known.

@Eyas
Copy link

Eyas commented Aug 14, 2015

@omariom Is implementing IEquatable relevant here per se, or is it just that overriding Equals() alone would prevent boxing? I don't have access to test that right now.

@omariom
Copy link
Contributor

omariom commented Aug 14, 2015

Overriding Object.Equals will reduce boxing but to get rid of it completely IEquatable is necessary.

@mikedn
Copy link
Contributor

mikedn commented Aug 16, 2015

Maybe we can fix the JIT in this regard as well. The JIT certainly doesn't have to box a struct in order to call default equals and GetHashCode methods.

The default Equals takes an object so boxing isn't easy to avoid in this case. Equals either needs to be inlined but that's unlikely given the current size of the ValueType.Equals(object) or escape analysis needs to be performed on Equals and in turn that may allow the boxed object to be allocated on the stack. Not exactly trivial and unlikely to succeed due to the use of reflection.

What might work is to treat ValueType.Equals as some sort of JIT intrinsic and generate specialized code for each value type. That will avoid the need for reflection and might as well avoid the need to box the Equals argument if that specialized code uses ref T instead of object.

That said, I think implementing IEquatable<T> is the proper way to deal with this as it clearly communicates the intent. I think it's a bit unfortunate that methods such as Object.Equals(object) and especially Object.GetHashCode() exist. Not every type needs to be used as hashtable key, quite the contrary.

@GSPP
Copy link
Author

GSPP commented Aug 16, 2015

Maybe all that's needed in the JIT is:

if (receiverType.IsValueType && methodName == "Equals")
 CallIntrinsicFunction(receiverExpr.GetPointer());

It would be even nicer if the C# and/or the CLR had the necessary meta-programming facilities to push all equality concerns (including IEquatable) of a type into an attribute or "template". That is likely not on the table, though.

@DemiMarie
Copy link

@mikedn That is the answer: generate specialized code for each and every value type.

@mikedn
Copy link
Contributor

mikedn commented Aug 11, 2016

@DemiMarie I'm not sure what your comment adds to this thread. Of course that specialized code is faster but we don't know when that code will be generated, how it will be called etc.

@vancem
Copy link
Contributor

vancem commented Aug 11, 2016

This is an interesting problem, but perhaps the most important issue associated with it is how much effort(cost) should be devoted to it. which is unclear.

We could do absolutely nothing (after it always 'works' and even works OK many times, and arguably users who care about perf can always fix it (by overriding the methods needed). Also the need for this is moderately rare (you need a user defined type that is a key in a dictionary).

The other extreme is quite a bit of JIT magic to make it work well (but probably not optimally) in all cases. From what I can tell this is moderately expensive (when you synthesize methods out of 'nothing' (No IL), things tend to be at least moderately expensive.

My guess as to the right code-benefit, is to see if you can simply fix RegularGetValueTypeHashCode to not ever produce horrendous hashs. This should be very easy (modifying one method), and avoid the worst problem (because users that care about perf should be implementing the methods because that will always give the best perf).

Ideally we have some idea of the scenarios that are performing poorly today. (Examples of REAL WORLD examples where the hash is very poor, along with some idea of how much better it would be if the hash was fixed). If people have this information it should be added to the discussion.

If anyone wants to go do that work in RegularGetValueTypeHashCode (it is straightforward enough), I would be supportive of getting it checked in. (low cost/risk, reasonable value).

If anyone wishes to pursue more aggressive tactics, you should talk about them here before proceeding. We have to be sure of the cost/benefit before you spend time on it.

@morganbr
Copy link
Contributor

Now that the HashCode type has been merged, we should consider making it available via ValueType.GetHashCode. It should have much higher quality hashes.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 30, 2020
@msftgits msftgits added this to the Future milestone Jan 30, 2020
@maryamariyan maryamariyan added the untriaged New issue has not been triaged by the area owner label Feb 26, 2020
@sjb-sjb
Copy link

sjb-sjb commented Mar 7, 2020

I agree this is an area that needs fixing. Defining Equals and GetHashCode over and over again is a pain, particularly if you want a decent hash function.
@DemiMarie's comment was valuable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-System.Runtime enhancement Product code improvement that does NOT require public API changes/additions tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests