Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ValueTuple with more than 8 fields calculates its hash code incorrectly #13087

Open
dioptryk opened this issue Jul 16, 2019 · 10 comments
Open

ValueTuple with more than 8 fields calculates its hash code incorrectly #13087

dioptryk opened this issue Jul 16, 2019 · 10 comments

Comments

@dioptryk
Copy link

Consider the following code:

Console.WriteLine((  1, 2, 3, 4, 5, 6, 7, 8).GetHashCode());
Console.WriteLine((777, 2, 3, 4, 5, 6, 7, 8).GetHashCode()); // different result
		
Console.WriteLine((  1, 2, 3, 4, 5, 6, 7, 8, 9, 10).GetHashCode());
Console.WriteLine((777, 2, 3, 4, 5, 6, 7, 8, 9, 10).GetHashCode()); // same result

The ValueTuple ignores first N-8 fields when calculating hash code. It even says so in a comment in ValueTuple.cs. The offending code is:

if (size >= 8) { return rest.GetHashCode(); }

To be honest, this seems like a bug to me, since this behaviour is described nowhere but the source. We were using tuple syntax to calculate hash codes of several class members and just noticed, that even with some fields changed, the hashes were still equal.. I understand this will have performance implications, but still I find the GetHashCode() unusable in its current state for "big" tuples. This is wrong for both .NET Framework and Core.

@danmoseley
Copy link
Member

This has come up before I think. It is a tradeoff between cost of calculating hash code and cost of collisions for structs with sufficient size.

@dioptryk
Copy link
Author

Understandable, but still..

"If it doesn’t work, it doesn’t matter how fast it doesn’t work."
— Mich Ravera

;-)
It should be (at the very least) clarified in the docs, IMO. Who knows how many people are unknowingly affected by this. Besides that, decision about being the subject to such an inconsistent optimization should be left to the programmer using the API, I think.

@stephentoub
Copy link
Member

"If it doesn’t work, it doesn’t matter how fast it doesn’t work"

What about it "doesn't work"? There is nothing that requires a GetHashCode to factor in all data from a type (and in fact frequently implementations don't). The only guarantee it need make is that two values that should be considered equal return the same hash code (in that light a GetHashCode that always returns 0 is correct, but obviously will likely lead to subsequent performance deficiencies). This is about a perf trade-off, not correctness.

@dioptryk
Copy link
Author

@stephentoub Sorry, but I can't agree with that. It's like having print() method, which prints only 8 arguments (for performance reasons), or a String.Join() which joins only 8 elements of the collection.. The examples may be a bit ;-) exaggerated, but you get the idea. It's about consistency in the API. Besides that, you don't know anything about Our Business Case. Perhaps we have a hundred tuples like that (which is nothing), so the performance issue does not exist, yet the hash collision problem remains.. Do you really think that changing the code, because the class has suddenly more than 8 fields, is reasonable?

@stephentoub
Copy link
Member

stephentoub commented Jul 18, 2019

It's like having print() method, which prints only 8 arguments (for performance reasons), or a String.Join() which joins only 8 elements of the collection

It is in no way the same as those.

Besides that, you don't know anything about Our Business Case.

You're right, I don't. If your business case demands such explicit control over exactly how a GetHashCode works, you will need to implement it yourself on your own type, and not use the built-in types for this.

@benaadams
Copy link
Member

Might want to use System.HashCode rather than ValueTuple to generate your hash code

@danmoseley
Copy link
Member

@dioptryk assuming you accept it's a performance issue, not a correctness issue, consider that hashing all the fields might hurt performance more than the reduced collisions helps. Clearly, there has to be a cut-over, and it's apparently 8 in this case. If it hashed 1000 fields, it would be worse in every respect.

@benaadams
Copy link
Member

benaadams commented Jul 18, 2019

If it hashed 1000 fields, it would be worse in every respect.

Using a 1000 field struct/ValueTuple as a key in a dictionary will have be terrible performance to begin with; hashcode performance would likely to be the bottleneck?

@GSPP
Copy link

GSPP commented Aug 31, 2019

Yes, this is a correct GetHashCode implementation, but...

I had practical cases in which leaving out parts of the data for hashing purposes lead to catastrophic hash table behavior. In my opinion, leaving out fields entirely is a trap that is not appropriate given the usual reliability and quality standards of the .NET class libraries.

Bad hash code functions usually fail more gracefully because at least they still distribute the data over many buckets. This function amounts to int GetHashCode() => 0; for all but the last 8 arguments. That can get catastrophic rather easily. And catastrophic means O(N^2) instead of O(N).

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@maryamariyan maryamariyan added the untriaged New issue has not been triaged by the area owner label Feb 26, 2020
@joperezr joperezr removed the untriaged New issue has not been triaged by the area owner label Jul 2, 2020
@EnCey
Copy link

EnCey commented Jul 29, 2021

I am currently evaluating whether to use large, generated ValueTuples to determine whether 2 sets of values are equal and thus also stumbled upon this.

In my opinion, the ValueTuple documentation should state that not all items of the tuple are factored into the hash code calculation. That would save us developers some time, since apparently there are some that are interested in such functionality.

However, I don't understand the criticism here. No GetHashCode implementation is obliged to generate unique results; a List<T> never computed a hash code based on all its items, and plenty non-collection types only use a fraction of their fields for theirs. The most obvious case are types that contain an "ID" and return a hash code derived solely from that, ignoring all other fields.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants