New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve the default hash code for structs #1341

Open
GSPP opened this Issue Aug 4, 2015 · 11 comments

Comments

Projects
None yet
@GSPP

GSPP commented Aug 4, 2015

Apparently, the default GetHashCode implementation for structs is really bad: http://stackoverflow.com/a/5927853/ (see FastGetValueTypeHashCodeHelper). This answer lists a lot of problems. For example, if the struct contains a reference type field the hash code is based on one field only. This can cause excessive hash collisions.

I'd like to propose that the default hash code and equality be changed to a more useful form. The hash code and equality comparison should always be based on calling GetHashCode and Equals on all fields. The hash code could be based in the common pattern accumulator * 37 + nextHashCode. Resharper uses that for its "generate GetHashCode" feature and it's quite reasonable.

Ideally, the code to do this could be added to each struct type at runtime so that the JIT can generate super fast code for this on demand. In many cases this would obviate the need for custom overrides of these two functions.

In order to mitigate compatibility risk this should be tied to some form of compatibility switch. The BinaryCompatibility class infrastructure seems to be made to address exactly this problem.

@omariom

This comment has been minimized.

Show comment
Hide comment
@omariom

omariom Aug 14, 2015

Contributor

To use a struct efficiently as key in hash based collections you already have to override GetHashCode and implement IEquatable. Otherwise you will get a lot of boxing.

Contributor

omariom commented Aug 14, 2015

To use a struct efficiently as key in hash based collections you already have to override GetHashCode and implement IEquatable. Otherwise you will get a lot of boxing.

@omariom

This comment has been minimized.

Show comment
Hide comment
@omariom

omariom Aug 14, 2015

Contributor

Here is an example:

const int Iters = 10 * 1000000;
var dict = new Dictionary<Key, int>(Iters); // prealocate all the required storage

for (int  i = 0; i < Iters; i++)
{
    dict.Add(new Key(i), 0);
}

...

public struct Key
{
    int i;
    public Key(int i)
    {
        this.i = i;
    }
}

It takes 775 ms and allocates 24 bytes (on average) for each Add call.

If we implement GetHashCode:

public struct Key
{
    int i;
    public Key(int i)
    {
        this.i = i;
    }
    public override int GetHashCode()
    {
        return i;
    }
}

it takes 200 ms with no allocations.

Let's make the generated hash code not so perfect. It will cause Equals to be called.

    public override int GetHashCode()
    {
        return i % (1 << 20);
    }

It will cost us 2.5 seconds and 205 bytes per call.

Finally implement IEquatable:

public struct Key: IEquatable<Key>
{
    int i;

    public Key(int i)
    {
       this.i = i;
    }

    public override int GetHashCode()
    {
        return i % (1 << 20);
    }

    public bool Equals(Key other)
    {
        return i == other.i;
    }
 }

Now even though the GetHashCode implementation causes hash collisions
time drops to 775 ms with zero bytes allocated.

Contributor

omariom commented Aug 14, 2015

Here is an example:

const int Iters = 10 * 1000000;
var dict = new Dictionary<Key, int>(Iters); // prealocate all the required storage

for (int  i = 0; i < Iters; i++)
{
    dict.Add(new Key(i), 0);
}

...

public struct Key
{
    int i;
    public Key(int i)
    {
        this.i = i;
    }
}

It takes 775 ms and allocates 24 bytes (on average) for each Add call.

If we implement GetHashCode:

public struct Key
{
    int i;
    public Key(int i)
    {
        this.i = i;
    }
    public override int GetHashCode()
    {
        return i;
    }
}

it takes 200 ms with no allocations.

Let's make the generated hash code not so perfect. It will cause Equals to be called.

    public override int GetHashCode()
    {
        return i % (1 << 20);
    }

It will cost us 2.5 seconds and 205 bytes per call.

Finally implement IEquatable:

public struct Key: IEquatable<Key>
{
    int i;

    public Key(int i)
    {
       this.i = i;
    }

    public override int GetHashCode()
    {
        return i % (1 << 20);
    }

    public bool Equals(Key other)
    {
        return i == other.i;
    }
 }

Now even though the GetHashCode implementation causes hash collisions
time drops to 775 ms with zero bytes allocated.

@GSPP

This comment has been minimized.

Show comment
Hide comment
@GSPP

GSPP Aug 14, 2015

@omariom that is a good point. This seems like a JIT limitation. Maybe we can fix the JIT in this regard as well. The JIT certainly doesn't have to box a struct in order to call default equals and GetHashCode methods. The dictionaries code is being specialized for this exact struct type. This means that when jitting the exact type is known.

GSPP commented Aug 14, 2015

@omariom that is a good point. This seems like a JIT limitation. Maybe we can fix the JIT in this regard as well. The JIT certainly doesn't have to box a struct in order to call default equals and GetHashCode methods. The dictionaries code is being specialized for this exact struct type. This means that when jitting the exact type is known.

@Eyas

This comment has been minimized.

Show comment
Hide comment
@Eyas

Eyas Aug 14, 2015

@omariom Is implementing IEquatable relevant here per se, or is it just that overriding Equals() alone would prevent boxing? I don't have access to test that right now.

Eyas commented Aug 14, 2015

@omariom Is implementing IEquatable relevant here per se, or is it just that overriding Equals() alone would prevent boxing? I don't have access to test that right now.

@omariom

This comment has been minimized.

Show comment
Hide comment
@omariom

omariom Aug 14, 2015

Contributor

Overriding Object.Equals will reduce boxing but to get rid of it completely IEquatable is necessary.

Contributor

omariom commented Aug 14, 2015

Overriding Object.Equals will reduce boxing but to get rid of it completely IEquatable is necessary.

@mikedn

This comment has been minimized.

Show comment
Hide comment
@mikedn

mikedn Aug 16, 2015

Contributor

Maybe we can fix the JIT in this regard as well. The JIT certainly doesn't have to box a struct in order to call default equals and GetHashCode methods.

The default Equals takes an object so boxing isn't easy to avoid in this case. Equals either needs to be inlined but that's unlikely given the current size of the ValueType.Equals(object) or escape analysis needs to be performed on Equals and in turn that may allow the boxed object to be allocated on the stack. Not exactly trivial and unlikely to succeed due to the use of reflection.

What might work is to treat ValueType.Equals as some sort of JIT intrinsic and generate specialized code for each value type. That will avoid the need for reflection and might as well avoid the need to box the Equals argument if that specialized code uses ref T instead of object.

That said, I think implementing IEquatable<T> is the proper way to deal with this as it clearly communicates the intent. I think it's a bit unfortunate that methods such as Object.Equals(object) and especially Object.GetHashCode() exist. Not every type needs to be used as hashtable key, quite the contrary.

Contributor

mikedn commented Aug 16, 2015

Maybe we can fix the JIT in this regard as well. The JIT certainly doesn't have to box a struct in order to call default equals and GetHashCode methods.

The default Equals takes an object so boxing isn't easy to avoid in this case. Equals either needs to be inlined but that's unlikely given the current size of the ValueType.Equals(object) or escape analysis needs to be performed on Equals and in turn that may allow the boxed object to be allocated on the stack. Not exactly trivial and unlikely to succeed due to the use of reflection.

What might work is to treat ValueType.Equals as some sort of JIT intrinsic and generate specialized code for each value type. That will avoid the need for reflection and might as well avoid the need to box the Equals argument if that specialized code uses ref T instead of object.

That said, I think implementing IEquatable<T> is the proper way to deal with this as it clearly communicates the intent. I think it's a bit unfortunate that methods such as Object.Equals(object) and especially Object.GetHashCode() exist. Not every type needs to be used as hashtable key, quite the contrary.

@GSPP

This comment has been minimized.

Show comment
Hide comment
@GSPP

GSPP Aug 16, 2015

Maybe all that's needed in the JIT is:

if (receiverType.IsValueType && methodName == "Equals")
 CallIntrinsicFunction(receiverExpr.GetPointer());

It would be even nicer if the C# and/or the CLR had the necessary meta-programming facilities to push all equality concerns (including IEquatable) of a type into an attribute or "template". That is likely not on the table, though.

GSPP commented Aug 16, 2015

Maybe all that's needed in the JIT is:

if (receiverType.IsValueType && methodName == "Equals")
 CallIntrinsicFunction(receiverExpr.GetPointer());

It would be even nicer if the C# and/or the CLR had the necessary meta-programming facilities to push all equality concerns (including IEquatable) of a type into an attribute or "template". That is likely not on the table, though.

@DemiMarie

This comment has been minimized.

Show comment
Hide comment
@DemiMarie

DemiMarie Aug 11, 2016

@mikedn That is the answer: generate specialized code for each and every value type.

DemiMarie commented Aug 11, 2016

@mikedn That is the answer: generate specialized code for each and every value type.

@mikedn

This comment has been minimized.

Show comment
Hide comment
@mikedn

mikedn Aug 11, 2016

Contributor

@DemiMarie I'm not sure what your comment adds to this thread. Of course that specialized code is faster but we don't know when that code will be generated, how it will be called etc.

Contributor

mikedn commented Aug 11, 2016

@DemiMarie I'm not sure what your comment adds to this thread. Of course that specialized code is faster but we don't know when that code will be generated, how it will be called etc.

@vancem

This comment has been minimized.

Show comment
Hide comment
@vancem

vancem Aug 11, 2016

Member

This is an interesting problem, but perhaps the most important issue associated with it is how much effort(cost) should be devoted to it. which is unclear.

We could do absolutely nothing (after it always 'works' and even works OK many times, and arguably users who care about perf can always fix it (by overriding the methods needed). Also the need for this is moderately rare (you need a user defined type that is a key in a dictionary).

The other extreme is quite a bit of JIT magic to make it work well (but probably not optimally) in all cases. From what I can tell this is moderately expensive (when you synthesize methods out of 'nothing' (No IL), things tend to be at least moderately expensive.

My guess as to the right code-benefit, is to see if you can simply fix RegularGetValueTypeHashCode to not ever produce horrendous hashs. This should be very easy (modifying one method), and avoid the worst problem (because users that care about perf should be implementing the methods because that will always give the best perf).

Ideally we have some idea of the scenarios that are performing poorly today. (Examples of REAL WORLD examples where the hash is very poor, along with some idea of how much better it would be if the hash was fixed). If people have this information it should be added to the discussion.

If anyone wants to go do that work in RegularGetValueTypeHashCode (it is straightforward enough), I would be supportive of getting it checked in. (low cost/risk, reasonable value).

If anyone wishes to pursue more aggressive tactics, you should talk about them here before proceeding. We have to be sure of the cost/benefit before you spend time on it.

Member

vancem commented Aug 11, 2016

This is an interesting problem, but perhaps the most important issue associated with it is how much effort(cost) should be devoted to it. which is unclear.

We could do absolutely nothing (after it always 'works' and even works OK many times, and arguably users who care about perf can always fix it (by overriding the methods needed). Also the need for this is moderately rare (you need a user defined type that is a key in a dictionary).

The other extreme is quite a bit of JIT magic to make it work well (but probably not optimally) in all cases. From what I can tell this is moderately expensive (when you synthesize methods out of 'nothing' (No IL), things tend to be at least moderately expensive.

My guess as to the right code-benefit, is to see if you can simply fix RegularGetValueTypeHashCode to not ever produce horrendous hashs. This should be very easy (modifying one method), and avoid the worst problem (because users that care about perf should be implementing the methods because that will always give the best perf).

Ideally we have some idea of the scenarios that are performing poorly today. (Examples of REAL WORLD examples where the hash is very poor, along with some idea of how much better it would be if the hash was fixed). If people have this information it should be added to the discussion.

If anyone wants to go do that work in RegularGetValueTypeHashCode (it is straightforward enough), I would be supportive of getting it checked in. (low cost/risk, reasonable value).

If anyone wishes to pursue more aggressive tactics, you should talk about them here before proceeding. We have to be sure of the cost/benefit before you spend time on it.

@morganbr

This comment has been minimized.

Show comment
Hide comment
@morganbr

morganbr Feb 24, 2018

Contributor

Now that the HashCode type has been merged, we should consider making it available via ValueType.GetHashCode. It should have much higher quality hashes.

Contributor

morganbr commented Feb 24, 2018

Now that the HashCode type has been merged, we should consider making it available via ValueType.GetHashCode. It should have much higher quality hashes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment