[SPARK-25122][SQL] Deduplication of supports equals code #22110

Closed
wants to merge 4 commits

Conversation

mn-mikke (Contributor)

What changes were proposed in this pull request?

The method *supportEquals, which determines whether elements of a data type can be used as items in a hash set or as keys in a hash map, is duplicated across multiple collection and higher-order functions.

This PR proposes to deduplicate the method.
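
For illustration, the kind of check being consolidated looks roughly like the sketch below; the names (SupportsEqualsSketch, supportsEquals) and the exact match arms are assumptions based on the discussion in this thread, not code copied from the PR.

package org.apache.spark.sql.catalyst.util

import org.apache.spark.sql.types.{AtomicType, BinaryType, DataType}

// Hypothetical sketch (not code from this PR) of the per-expression predicate that
// several collection and higher-order expressions each defined for their element or
// key type. It is placed in a subpackage of org.apache.spark.sql only because
// AtomicType is protected[sql] (see the AtomicType diff hunk further down this thread).
object SupportsEqualsSketch {
  def supportsEquals(dataType: DataType): Boolean = dataType match {
    case BinaryType => false    // byte[] compares by reference, so it cannot key a hash set/map
    case _: AtomicType => true  // other atomic types (numeric, string, boolean, ...) compare by value
    case _ => false             // assumed default for complex types
  }
}

Consolidating this predicate in one shared place is what the diff excerpts below converge on.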

How was this patch tested?

Run tests in:

  • DataFrameFunctionsSuite
  • CollectionExpressionsSuite
  • HigherOrderExpressionsSuite

mn-mikke (author)

cc @ueshin @cloud-fan

@@ -115,6 +115,8 @@ protected[sql] abstract class AtomicType extends DataType {
private[sql] type InternalType
private[sql] val tag: TypeTag[InternalType]
private[sql] val ordering: Ordering[InternalType]

private[spark] override def supportsEquals: Boolean = true

Contributor

I don't think this should be a property of the data type. It's specific to the OpenHashSet. How about we add this method to object OpenHashSet?

mn-mikke (author)

Not all of the expressions use OpenHashSet or OpenHashMap. What about TypeUtils, which already contains methods like getInterpretedOrdering?

Contributor

SGTM

Contributor

+1 too

Member

+1

SparkQA commented Aug 15, 2018

Test build #94795 has finished for PR 22110 at commit dd292e8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Aug 15, 2018

Test build #94800 has finished for PR 22110 at commit 9ed65cf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

mgaido91 (Contributor) left a comment

LGTM apart from one comment

@@ -73,4 +73,14 @@ object TypeUtils {
}
x.length - y.length
}

/**
* Returns true if elements of the data type could be used as items of a hash set or as keys

Contributor

I think this comment is not very coherent with the method name. I think we can rephrase to something like:

Returns true if the equals method of the elements of the data type is implemented properly.
This also means that they can be safely used in collections relying on the equals method,
as sets or maps.

What do you think?

mn-mikke (author)

I'm open to any changes :) But if you want to explicitly mention the equals method, I would also mention hashCode, which is generally needed for usage in "hash" collections. Then again, this is not 100% true for Spark's specialized OpenHashSets and OpenHashMaps, since they calculate hashes themselves. WDYT?

Contributor

I think it is a common pattern for every well-formed class to have equals and hashCode defined in a coherent way. What we are doing here is saying either "this class has a meaningful equals method, so we can rely on it" or "this class has a meaningless equals method, like the default one comparing pointers, so we cannot rely on it". I am open to any suggestion too; I'd only like the description to be coherent with the method name, otherwise I feel that one or the other has to be changed to properly reflect what the method does.
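
To make the distinction concrete, here is a small standalone Scala illustration (not from the PR) of a default, reference-comparing equals versus a compiler-generated structural one:

// A class that does not override equals inherits AnyRef.equals, which compares references;
// a case class gets compiler-generated equals/hashCode that compare the constructor fields.
class RefPoint(val x: Int, val y: Int)
case class ValuePoint(x: Int, y: Int)

object EqualsDemo extends App {
  println(new RefPoint(1, 2) == new RefPoint(1, 2)) // false: "meaningless" identity-based equals
  println(ValuePoint(1, 2) == ValuePoint(1, 2))     // true: meaningful, value-based equals

  // Only the value-based variant behaves sensibly as a set element or map key.
  println(Set(ValuePoint(1, 2), ValuePoint(1, 2)).size)     // 1
  println(Set(new RefPoint(1, 2), new RefPoint(1, 2)).size) // 2
}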

mn-mikke (author)

What about changing the name of the method to typeCanBeHashed?

Contributor

I don't really like this proposal...

SparkQA commented Aug 16, 2018

Test build #94824 has finished for PR 22110 at commit d0b0552.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* Returns true if elements of the data type could be used as items of a hash set or as keys
* of a hash map.
*/
def typeCanBeHashed(dataType: DataType): Boolean = dataType match {

Contributor

Hey, this is a weird name; byte[] can also be hashed. I'd rather call it typeWithProperEquals, and document it as @mgaido91 proposed. I don't think we need to consider hashCode here; it's a rule in the Java world that equals and hashCode should be defined in a coherent way.
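
As a standalone illustration of this point (not from the PR): byte arrays are perfectly hashable on the JVM, but their default equals and hashCode are identity-based, which is why "proper equals" is the relevant criterion.

object ByteArrayEqualsDemo extends App {
  val a = Array[Byte](1, 2, 3)
  val b = Array[Byte](1, 2, 3)

  // Hashing by content is possible...
  println(java.util.Arrays.hashCode(a) == java.util.Arrays.hashCode(b)) // true
  // ...but the default equals/hashCode are identity-based, so hash collections misbehave:
  println(a == b)                        // false: Array equals compares references
  println(a.hashCode == b.hashCode)      // false (in general): identity hash codes differ
  println(a.sameElements(b))             // true: content comparison must be requested explicitly
  println(java.util.Arrays.equals(a, b)) // true: same, via java.util.Arrays
}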

mn-mikke (author)

I will change it :)

Just one question about hashCode. If case classes are used, equals and hashCode are generated by the compiler. But if we define equals manually, shouldn't a.equals(b) == true => a.hashCode == b.hashCode also hold?

Contributor

We should, and I believe Spark enforces it with a style checker: when you override equals, you must override hashCode too.

mn-mikke (author)

Ok, thanks!
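
For reference, a minimal standalone example (not from the PR) of that contract: when equals is written by hand, hashCode has to be derived from the same fields, so that a == b implies a.hashCode == b.hashCode.

// Hand-written equals paired with a consistent hashCode, as the style rule requires.
final class Interval(val start: Int, val end: Int) {
  override def equals(other: Any): Boolean = other match {
    case that: Interval => start == that.start && end == that.end
    case _              => false
  }
  // Derived from the same fields as equals, so equal intervals hash identically.
  override def hashCode(): Int = 31 * start + end
}

object ContractDemo extends App {
  val a = new Interval(1, 5)
  val b = new Interval(1, 5)
  println(a == b)                   // true
  println(a.hashCode == b.hashCode) // true: required for HashSet/HashMap correctness
  println(Set(a, b).size)           // 1: the set treats them as one element
}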

SparkQA commented Aug 16, 2018

Test build #94847 has finished for PR 22110 at commit 7d395d2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

mgaido91 (Contributor) left a comment

LGTM

* This also means that they can be safely used in collections relying on the equals method,
* as sets or maps.
*/
def typeWithProperEquals(dataType: DataType): Boolean = dataType match {

Contributor

IMHO typeSupportsEquals sounded simpler; I wouldn't have changed it. But I am fine with this too if others agree on it.
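
For completeness, a hedged sketch of how a caller could use the merged helper; the caller object and the expected results are inferred from the discussion above, not taken from the PR itself.

import org.apache.spark.sql.catalyst.util.TypeUtils
import org.apache.spark.sql.types.{ArrayType, BinaryType, IntegerType, StringType}

// Hypothetical caller: after the deduplication, an expression consults the single
// shared helper instead of carrying its own copy of the check.
object ProperEqualsUsageSketch extends App {
  println(TypeUtils.typeWithProperEquals(IntegerType))            // expected: true
  println(TypeUtils.typeWithProperEquals(StringType))             // expected: true
  println(TypeUtils.typeWithProperEquals(BinaryType))             // expected: false (byte[] equals is reference-based)
  println(TypeUtils.typeWithProperEquals(ArrayType(IntegerType))) // expected: false (not an atomic type)
}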

cloud-fan (Contributor)

thanks, merging to master!

asfgit closed this in 8af61fb on Aug 17, 2018