-
Notifications
You must be signed in to change notification settings - Fork 28k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-25122][SQL] Deduplication of supports equals code #22110
Conversation
@@ -115,6 +115,8 @@ protected[sql] abstract class AtomicType extends DataType { | |||
private[sql] type InternalType | |||
private[sql] val tag: TypeTag[InternalType] | |||
private[sql] val ordering: Ordering[InternalType] | |||
|
|||
private[spark] override def supportsEquals: Boolean = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this should be a property of the data type. It's specific to the OpenHashSet
. How about we add this method to object OpenHashSet
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not all of the expressions utilize OpenHashSet
or OpenHashMap
. What about TypeUtils
that contains methods like getInterpretedOrdering
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
Test build #94795 has finished for PR 22110 at commit
|
Test build #94800 has finished for PR 22110 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM apart from one comment
@@ -73,4 +73,14 @@ object TypeUtils { | |||
} | |||
x.length - y.length | |||
} | |||
|
|||
/** | |||
* Returns true if elements of the data type could be used as items of a hash set or as keys |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this comment is not very coherent with the method name. I think we can rephrase to something like:
Returns true if the equal method of the elements of the data type is implemented properly.
This also means that they can be safely used in collections relying on the equals method,
as sets or maps.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm open to any changes :) But if you want to explicitly mention the equals
method, I would also mention hashCode
generally needed for usage in "hash" collections. But then this not 100% true for Spark's specialized OpenHashSets
and OpenHashMaps
since they calculate hash by themselves. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is a common pattern for every well formed class to have equals and hashCode defined in a coherent way. I think what we are doing here is saying: this class has a meaningful equals method, so we can rely on it or this class has a meaningless equals method, like the default one comparing the pointers, so we cannot rely on it. I am open too to any suggestion, I'd only like to have a description which is coherent with the method name, otherwise I feel that either one or the other has to be changed in order to properly reflect what the method does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about changing the name of the method to typeCanBeHashed
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really like this proposal...
Test build #94824 has finished for PR 22110 at commit
|
* Returns true if elements of the data type could be used as items of a hash set or as keys | ||
* of a hash map. | ||
*/ | ||
def typeCanBeHashed(dataType: DataType): Boolean = dataType match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hey, this is a weird name, byte[]
can also be hashed. I'd rather call it typeWithProperEquals
, and document it as @mgaido91 proposed. I don't think we need to consider hashCode
here, it's a rule in java world that equals and hashCode should be defined in a coherent way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will change it :)
Just one question to hashCode
. If case classes
are used, equals
and hashCode
are generated by compiler. But if we define equals
manually, shouldn't also hold a.equals(b) == true
=> a.hashCode == b.hashCode
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should, and I believe Spark has enforced it with style checker: when you override equals
, you must override hashCode
too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, thanks!
Test build #94847 has finished for PR 22110 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
* This also means that they can be safely used in collections relying on the equals method, | ||
* as sets or maps. | ||
*/ | ||
def typeWithProperEquals(dataType: DataType): Boolean = dataType match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO typeSupportsEquals
sounded simpler. I'd not have changed. But I am fine with this too if others agree on it.
thanks, merging to master! |
What changes were proposed in this pull request?
The method
*supportEquals
determining whether elements of a data type could be used as items in a hash set or as keys in a hash map is duplicated across multiple collection and higher-order functions.This PR suggests to deduplicate the method.
How was this patch tested?
Run tests in: