Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

THRIFT-4857: Made Java TField class hash code consistent with equals. #1792

Merged
merged 2 commits into from May 13, 2019
Merged

THRIFT-4857: Made Java TField class hash code consistent with equals. #1792

merged 2 commits into from May 13, 2019

Conversation

garretwilson
Copy link
Contributor

Client: java

This change fixes a bug in which the TField class was broken with respect to the Object API requirements for hashCode() and equals() being consistent, as explained in THRIFT-4857.

As this class had no unit tests, and I'm not yet ready to install the entire build infrastructure on my Windows machine, I did not add any unit tests.

I have my own opinions about approaches to generating hash codes, as well comments about the approach to checking for class equivalence, but those issues are separate from the central hash code bug. As this is my first contribution, I therefore made my change as surgical as possible.

@jeking3
Copy link
Contributor

jeking3 commented May 3, 2019

It would be ideal to have a test that ensures this does not break in the future.
Otherwise PR looks good. If the builds fail it's probably CI and not your code. I have some work to do to stabilize CI.

@garretwilson
Copy link
Contributor Author

It would be ideal to have a test that ensures this does not break in the future.

I completely agree! Unfortunately I'm way too busy with other work to add a test right now. As there were no tests for this class anyway, it figured it would be best to at least fix the bug for now.

I only mention this because I wasn't clear from your comment whether you were waiting on me for that before accepting the pull request. Cheers!

Copy link
Contributor

@jeking3 jeking3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concerned about the integrity of the hash after removing the name.

@@ -47,7 +47,6 @@ public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + id;
result = prime * result + ((name == null) ? 0 : name.hashCode());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This hashCode still looks pretty weak. I'm not convinced when looking at the logic that there couldn't be a combination of id and type that equal another combination of id and type.

If id and type are truly the only discriminating values then I would suggest something like (0x02TTDDDD) where T is the type (byte) and D is the ID (word), and 0x02 is the version of the hash (unlikely to ever change, but why not version it while you're at it...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% sure of what issue you're trying to address:

  • Are you unsure of what should constitute "equals" for TField?
  • Are you complaining that the hash code would not be consistent with "equals"?
  • Are you afraid that the hash code algorithm produces collisions?
  • Are you afraid that the hash code algorithm is not uniformly distributed?

These issue are distinct.

This hashCode still looks pretty weak. I'm not convinced when looking at the logic that there couldn't be a combination of id and type that equal another combination of id and type.

You seem to be saying that you're worried that the hash code algorithm produces collisions. But the contract for hashCode() allows collisions! The hashCode() contract does not require that each unique object produce a unique hash code (because for many objects that would be impossible). It only requires (among other things) that two objects that are considered "equal" return the same hash code.

We prefer of course that the hash code be uniformly distributed. And if we can find a way to avoid collisions altogether, so much the better. But hashCode() would still work if every TField instance returned the arbitrary number 123. It would be inefficient, but it would work. I was pointing out that it was broken because it did not consistently return the same hash code for two objects that are equal.

If id and type are truly the only discriminating values then I would suggest something like (0x02TTDDDD) where T is the type (byte) and D is the ID (word), …

If you are saying that you have an idea to avoid collisions altogether, great, but that's a different issue and in my opinion should be discussed in a separate ticket. You're trying to make the algorithm more efficient. I explicitly avoided messing with the hashing algorithm, because the ticket I filed was that the class was broken. I've fixed the brokenness. You're wanting to make it more efficient by making it more uniformly distributed and/or even avoiding collisions altogether, which is noble goal but a separate issue.

… and 0x02 is the version of the hash (unlikely to ever change, but why not version it while you're at it...)

Versioning the hash? Whatever for? This is not and should not ever be something that goes across the wire. It is not (as per its contract) reproducible across JVM invocations. It is only to be used locally within a single JVM for things like hash tables. It could use a different "version" on each release and no one would care, as long as it was consistent with equals (and met the other parts of its contract). (Perhaps you are thinking of "consistent hashing" used in distributed databases? That's a different animal.)

In summary, the class was broken. I un-broke it. You want to make something about the class better. Great, there's a thousand ways to make it better—and I even mentioned a couple in the ticket—but in my opinion the highest priority should be to un-break it as soon as possible and then make it better as a separate effort.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced when looking at the logic that there couldn't be a combination of id and type that equal another combination of id and type.

Here's another thing to consider: is your "new and improved" hash code algorithm uniformly distributed? You may say, "of course it is, because I can prove that it produces a unique hash code for each non-equal TField." But whether there are hash collision is a completely different thing than whether the hash codes are uniformly distributed.

Think about it this way: there are approximately 2^64 possible hash codes. Do you think a hash table uses 2^64 buckets? Most likely not. Instead they group the hash codes into buckets, using for example the modulus operator. (I'm just summarizing Data Structures 101 here.) Therefore:

  1. You're probably going to get bucket collisions anyway, even if you try to prevent hash collisions, because there are fewer buckets than hash codes.

  2. If your new and improved hash algorithm is not uniformly distributed, you're going to group all the items in the e.g. lower buckets while all the e.g. upper buckets are empty. (And glancing at your proposal, I'd say that exactly what would happen!) So by attempting to avoid hash collisions (when the contract doesn't even preclude hash collisions), you came up with an algorithm that prevents hash collisions but makes bucket collisions even more likely, defeating the purpose.

People write entire research papers over hashing algorithms. The one used in TField is not new; it's probably (I'd have to check) something from Bloch or Knuth. So many people have made the mistake of trying to create a home-grown hash algorithm and made things worse. That's why I tell my students just to use the built-in hash functions in Java (e.g. Arrays.hash() or Objects.hash()) and not try to get fancy. If I made any change to the hashing function, that's what I would do: I'd take out the manual algorithm and delegate to an existing method that has stood the test of time.

But again, all this algorithm discussion is about attempting to "solve" an efficiency problem for which there is no empirical evidence that it even exists. The far, far larger problem is that as it is, the contract is broken, i.e. it doesn't even work at all! That's what I fixed.

@jeking3
Copy link
Contributor

jeking3 commented May 13, 2019

I agree with your fix, I was just suggesting there could be more collisions now that there are only two numeric inputs which would lead to incorrect behavior when using the value as a key in a container. I was not suggesting your changes were bad; sorry if that was the impression. I will rebase your PR after I add a unit test and get it building again, so no further action is needed on your side.

@garretwilson
Copy link
Contributor Author

I was not suggesting your changes were bad; sorry if that was the impression.

I understood that, no worries. I was just saying that "fixing something completely broken" and "twiddling with a conjectured inefficiency that we don't even know exists" are two separate things and should be done in two different tickets, in my opinion.

Have a great day!

@jeking3
Copy link
Contributor

jeking3 commented May 13, 2019

Yeah, I tend to combine things for efficiency when I am in a general area of code. I get way more done that way. Thanks.

@jeking3 jeking3 merged commit b261f3c into apache:master May 13, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants