[SPARK-25122][SQL] Deduplication of supports equals code #22110

Closed
wants to merge 4 commits

Conversation

mn-mikke (Contributor)

What changes were proposed in this pull request?

The method *supportEquals, which determines whether elements of a data type can be used as items in a hash set or as keys in a hash map, is duplicated across multiple collection and higher-order functions.

This PR proposes to deduplicate the method.
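
For illustration, the kind of check being consolidated looks roughly like the sketch below; the names (SupportsEqualsSketch, supportsEquals) and the exact match arms are assumptions based on the discussion in this thread, not code copied from the PR.

package org.apache.spark.sql.catalyst.util

import org.apache.spark.sql.types.{AtomicType, BinaryType, DataType}

// Hypothetical sketch (not code from this PR) of the per-expression predicate that
// several collection and higher-order expressions each defined for their element or
// key type. It is placed in a subpackage of org.apache.spark.sql only because
// AtomicType is protected[sql] (see the AtomicType diff hunk further down this thread).
object SupportsEqualsSketch {
  def supportsEquals(dataType: DataType): Boolean = dataType match {
    case BinaryType => false    // byte[] compares by reference, so it cannot key a hash set/map
    case _: AtomicType => true  // other atomic types (numeric, string, boolean, ...) compare by value
    case _ => false             // assumed default for complex types
  }
}

Consolidating this predicate in one shared place is what the diff excerpts below converge on.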

How was this patch tested?

Run tests in:

  • DataFrameFunctionsSuite
  • CollectionExpressionsSuite
  • HigherOrderExpressionsSuite

mn-mikke (author)

cc @ueshin @cloud-fan

@@ -115,6 +115,8 @@ protected[sql] abstract class AtomicType extends DataType {
private[sql] type InternalType
private[sql] val tag: TypeTag[InternalType]
private[sql] val ordering: Ordering[InternalType]

private[spark] override def supportsEquals: Boolean = true

Contributor

I don't think this should be a property of the data type. It's specific to the OpenHashSet. How about we add this method to object OpenHashSet?

mn-mikke (author)

Not all of the expressions use OpenHashSet or OpenHashMap. What about TypeUtils, which already contains methods like getInterpretedOrdering?

Contributor

SGTM

Contributor

+1 too

Member

+1

SparkQA commented Aug 15, 2018

Test build #94795 has finished for PR 22110 at commit dd292e8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Aug 15, 2018

Test build #94800 has finished for PR 22110 at commit 9ed65cf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

mgaido91 (Contributor) left a comment

LGTM apart from one comment

@@ -73,4 +73,14 @@ object TypeUtils {
}
x.length - y.length
}

/**
* Returns true if elements of the data type could be used as items of a hash set or as keys

Contributor

I think this comment is not very coherent with the method name. I think we can rephrase to something like:

Returns true if the equals method of the elements of the data type is implemented properly.
This also means that they can be safely used in collections relying on the equals method,
as sets or maps.

What do you think?

mn-mikke (author)

I'm open to any changes :) But if you want to explicitly mention the equals method, I would also mention hashCode, which is generally needed for usage in "hash" collections. Then again, this is not 100% true for Spark's specialized OpenHashSets and OpenHashMaps, since they calculate hashes themselves. WDYT?

Contributor

I think it is a common pattern for every well-formed class to have equals and hashCode defined in a coherent way. What we are doing here is saying either "this class has a meaningful equals method, so we can rely on it" or "this class has a meaningless equals method, like the default one comparing pointers, so we cannot rely on it". I am open to any suggestion too; I'd only like the description to be coherent with the method name, otherwise I feel that one or the other has to be changed to properly reflect what the method does.
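
To make the distinction concrete, here is a small standalone Scala illustration (not from the PR) of a default, reference-comparing equals versus a compiler-generated structural one:

// A class that does not override equals inherits AnyRef.equals, which compares references;
// a case class gets compiler-generated equals/hashCode that compare the constructor fields.
class RefPoint(val x: Int, val y: Int)
case class ValuePoint(x: Int, y: Int)

object EqualsDemo extends App {
  println(new RefPoint(1, 2) == new RefPoint(1, 2)) // false: "meaningless" identity-based equals
  println(ValuePoint(1, 2) == ValuePoint(1, 2))     // true: meaningful, value-based equals

  // Only the value-based variant behaves sensibly as a set element or map key.
  println(Set(ValuePoint(1, 2), ValuePoint(1, 2)).size)     // 1
  println(Set(new RefPoint(1, 2), new RefPoint(1, 2)).size) // 2
}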

mn-mikke (author)

What about changing the name of the method to typeCanBeHashed?

Contributor

I don't really like this proposal...

SparkQA commented Aug 16, 2018

Test build #94824 has finished for PR 22110 at commit d0b0552.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* Returns true if elements of the data type could be used as items of a hash set or as keys
* of a hash map.
*/
def typeCanBeHashed(dataType: DataType): Boolean = dataType match {

Contributor

Hey, this is a weird name; byte[] can also be hashed. I'd rather call it typeWithProperEquals, and document it as @mgaido91 proposed. I don't think we need to consider hashCode here; it's a rule in the Java world that equals and hashCode should be defined in a coherent way.
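
As a standalone illustration of this point (not from the PR): byte arrays are perfectly hashable on the JVM, but their default equals and hashCode are identity-based, which is why "proper equals" is the relevant criterion.

object ByteArrayEqualsDemo extends App {
  val a = Array[Byte](1, 2, 3)
  val b = Array[Byte](1, 2, 3)

  // Hashing by content is possible...
  println(java.util.Arrays.hashCode(a) == java.util.Arrays.hashCode(b)) // true
  // ...but the default equals/hashCode are identity-based, so hash collections misbehave:
  println(a == b)                        // false: Array equals compares references
  println(a.hashCode == b.hashCode)      // false (in general): identity hash codes differ
  println(a.sameElements(b))             // true: content comparison must be requested explicitly
  println(java.util.Arrays.equals(a, b)) // true: same, via java.util.Arrays
}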

mn-mikke (author)

I will change it :)

Just one question about hashCode. If case classes are used, equals and hashCode are generated by the compiler. But if we define equals manually, shouldn't a.equals(b) == true => a.hashCode == b.hashCode also hold?

Contributor

We should, and I believe Spark enforces it with a style checker: when you override equals, you must override hashCode too.

mn-mikke (author)

Ok, thanks!
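
For reference, a minimal standalone example (not from the PR) of that contract: when equals is written by hand, hashCode has to be derived from the same fields, so that a == b implies a.hashCode == b.hashCode.

// Hand-written equals paired with a consistent hashCode, as the style rule requires.
final class Interval(val start: Int, val end: Int) {
  override def equals(other: Any): Boolean = other match {
    case that: Interval => start == that.start && end == that.end
    case _              => false
  }
  // Derived from the same fields as equals, so equal intervals hash identically.
  override def hashCode(): Int = 31 * start + end
}

object ContractDemo extends App {
  val a = new Interval(1, 5)
  val b = new Interval(1, 5)
  println(a == b)                   // true
  println(a.hashCode == b.hashCode) // true: required for HashSet/HashMap correctness
  println(Set(a, b).size)           // 1: the set treats them as one element
}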

SparkQA commented Aug 16, 2018

Test build #94847 has finished for PR 22110 at commit 7d395d2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

mgaido91 (Contributor) left a comment

LGTM

* This also means that they can be safely used in collections relying on the equals method,
* as sets or maps.
*/
def typeWithProperEquals(dataType: DataType): Boolean = dataType match {

Contributor

IMHO typeSupportsEquals sounded simpler; I wouldn't have changed it. But I am fine with this too if others agree on it.
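
For completeness, a hedged sketch of how a caller could use the merged helper; the caller object and the expected results are inferred from the discussion above, not taken from the PR itself.

import org.apache.spark.sql.catalyst.util.TypeUtils
import org.apache.spark.sql.types.{ArrayType, BinaryType, IntegerType, StringType}

// Hypothetical caller: after the deduplication, an expression consults the single
// shared helper instead of carrying its own copy of the check.
object ProperEqualsUsageSketch extends App {
  println(TypeUtils.typeWithProperEquals(IntegerType))            // expected: true
  println(TypeUtils.typeWithProperEquals(StringType))             // expected: true
  println(TypeUtils.typeWithProperEquals(BinaryType))             // expected: false (byte[] equals is reference-based)
  println(TypeUtils.typeWithProperEquals(ArrayType(IntegerType))) // expected: false (not an atomic type)
}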

cloud-fan (Contributor)

thanks, merging to master!

asfgit closed this in 8af61fb on Aug 17, 2018