[SPARK-40315][SQL] Add equals() and hashCode() to ArrayBasedMapData #37771

Closed
wants to merge 3 commits into from

Contributor

@c27kwan c27kwan commented Sep 2, 2022

What changes were proposed in this pull request?

There is no explicit hashCode() override for the ArrayBasedMapData class, so it inherits the default hashCode(). As a result, there is a non-deterministic error where the hashCode() computed for ArrayBasedMapData can be different for two equal objects (objects with equal keys and values).

In this PR, we override hashCode() so that objects with equal keys and values produce the same hash code. We also add an explicit equals() for consistency with how Literal checks equality of ArrayBasedMapData.

Why are the changes needed?

This is a bug fix for a non-deterministic error. It is also more consistent with the rest of Spark if we implement the hashCode and equals methods instead of relying on defaults.
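
To illustrate the problem described above, here is a minimal sketch in plain Scala with no Spark dependency. The classes `EqualsOnly` and `EqualsAndHash` are hypothetical stand-ins, not Spark classes, and `Seq` stands in for the key and value arrays. The sketch assumes equality is element-wise; it shows that without a matching hashCode() override, hash-based lookups of equal objects become unreliable.

```scala
import scala.collection.mutable

object HashContractSketch {
  // Hypothetical stand-in: equals is element-wise, hashCode is NOT overridden,
  // so it falls back to the identity-based default from java.lang.Object.
  final class EqualsOnly(val keys: Seq[Int], val values: Seq[String]) {
    override def equals(obj: Any): Boolean = obj match {
      case o: EqualsOnly => keys == o.keys && values == o.values
      case _ => false
    }
  }

  // Same shape, but with a hashCode that is consistent with equals.
  final class EqualsAndHash(val keys: Seq[Int], val values: Seq[String]) {
    override def equals(obj: Any): Boolean = obj match {
      case o: EqualsAndHash => keys == o.keys && values == o.values
      case _ => false
    }
    override def hashCode(): Int = 31 * keys.hashCode() + values.hashCode()
  }

  def main(args: Array[String]): Unit = {
    val a = new EqualsOnly(Seq(1, 2), Seq("a", "b"))
    val b = new EqualsOnly(Seq(1, 2), Seq("a", "b"))
    println(a == b)                          // true: element-wise equals
    println(a.hashCode == b.hashCode)        // not guaranteed; depends on identity hashes
    println(mutable.HashSet(a).contains(b))  // unreliable for the same reason

    val c = new EqualsAndHash(Seq(1, 2), Seq("a", "b"))
    val d = new EqualsAndHash(Seq(1, 2), Seq("a", "b"))
    println(c.hashCode == d.hashCode)        // true: hash derived from contents
    println(mutable.HashSet(c).contains(d))  // true
  }
}
```

The first pair is the situation the PR describes: the objects compare equal, but any structure that buckets by hash code may or may not find one when given the other.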

Does this PR introduce any user-facing change?

No

How was this patch tested?

A simple unit test was added.

@github-actions github-actions bot added the SQL label Sep 2, 2022
@c27kwan c27kwan changed the title [SPARK-40315] Add equals() and hashCode() to ArrayBasedMapData [SPARK-40315][SQL] Add equals() and hashCode() to ArrayBasedMapData Sep 2, 2022
Comment on lines +57 to +62
override def hashCode(): Int = {
  val seed = MurmurHash3.productSeed
  val keyHash = scala.util.hashing.MurmurHash3.mix(seed, keyArray.hashCode())
  val valueHash = scala.util.hashing.MurmurHash3.mix(keyHash, valueArray.hashCode())
  scala.util.hashing.MurmurHash3.finalizeHash(valueHash, 2)
}
Contributor Author

Hi @cloud-fan , can you take a look? Do you think we need to also add an explicit hashCode() override to ArrayData (used by keyArray and valueArray)?
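
The question above matters because the hashCode() in the diff is only content-based if `keyArray.hashCode()` and `valueArray.hashCode()` are. As a rough sketch under the assumption that the underlying storage is a plain JVM array (which uses an identity hash by default), the same mix/finalize pattern can be fed a content hash from `MurmurHash3.arrayHash`. The object and method names here are hypothetical, and `Array[Any]` is only a stand-in for ArrayData.

```scala
import scala.util.hashing.MurmurHash3

object ArrayHashSketch {
  // Content-based hash of two parallel arrays, following the same
  // seed -> mix -> mix -> finalizeHash sequence as the diff above.
  def contentHash(keys: Array[Any], values: Array[Any]): Int = {
    val seed = MurmurHash3.productSeed
    // arrayHash hashes the elements, unlike Array#hashCode, which is identity-based.
    val keyHash = MurmurHash3.mix(seed, MurmurHash3.arrayHash(keys))
    val valueHash = MurmurHash3.mix(keyHash, MurmurHash3.arrayHash(values))
    MurmurHash3.finalizeHash(valueHash, 2)
  }

  def main(args: Array[String]): Unit = {
    val h1 = contentHash(Array[Any](1, 2), Array[Any]("a", "b"))
    val h2 = contentHash(Array[Any](1, 2), Array[Any]("a", "b"))
    println(h1 == h2) // true: distinct array instances, same contents
  }
}
```

Whether an equivalent override is needed on ArrayData itself depends on how its subclasses implement hashCode(), which this sketch does not attempt to answer.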

Comment on lines 41 to 53
override def equals(obj: Any): Boolean = {
  if (obj == null && this == null) {
    return true
  }

  if (obj == null || !obj.isInstanceOf[ArrayBasedMapData]) {
    return false
  }

  val other = obj.asInstanceOf[ArrayBasedMapData]

  keyArray.equals(other.keyArray) && valueArray.equals(other.valueArray)
}
Contributor Author

I believe this part is also causing the "inequality test for MapData" test case in ComplexDataSuite to fail, where we use the triple inequality check for object equality. Should that test be calling equalsTo instead? This PR's implementation of equals() follows what Literal expects. @cloud-fan, can you comment on the expected behaviour of equals()? Is it supposed to be object equality or element equality?

Contributor

Ideally, equals should do element equality.

Member

Not sure how we do it in other code, but there is Arrays.deepEquals for element-wise equality

Contributor Author

The ArrayBasedMapData equality check in Literal uses == to compare the keyArray and the valueArray of the two maps: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala#L377-L390

To be consistent, I'll change it to use ==
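
For readers less familiar with the options mentioned in this thread, the following sketch contrasts them on plain JVM arrays. In Scala, `==` on references is a null-safe call to equals(), so whether it is element-wise depends on the class being compared. Plain arrays do not override equals(), so they behave as shown; this says nothing definitive about ArrayData, whose equals() is not shown in this PR.

```scala
import java.util.Arrays

object ArrayEqualitySketch {
  // Two distinct array instances with the same contents.
  val a: Array[AnyRef] = Array[AnyRef]("k1", Integer.valueOf(1))
  val b: Array[AnyRef] = Array[AnyRef]("k1", Integer.valueOf(1))

  // == delegates to equals(); JVM arrays inherit reference equality.
  val sameReference: Boolean = a == b
  // Element-wise comparison from the Scala collections API.
  val elementWise: Boolean = a.sameElements(b)
  // Element-wise comparison that also recurses into nested arrays.
  val deep: Boolean = Arrays.deepEquals(a, b)

  def main(args: Array[String]): Unit = {
    println(sameReference) // false
    println(elementWise)   // true
    println(deep)          // true
  }
}
```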

@@ -35,6 +37,29 @@ class ArrayBasedMapData(val keyArray: ArrayData, val valueArray: ArrayData) exte
  override def toString: String = {
    s"keys: $keyArray, values: $valueArray"
  }

  override def equals(obj: Any): Boolean = {
    if (obj == null && this == null) {
Contributor

how can this be null?

Contributor

I think IDEA can generate equals and hashCode implementations pretty well.

Contributor Author

You're right, this can't be null. I'll remove this check.

The IDEA defaults are not very helpful in this case because they just defer to the parent class via super.equals(obj).
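
One way to write the equals() without the `this == null` branch is a type pattern match, sketched below. A type pattern never matches null, so the explicit null checks become unnecessary. `MapLike` is a hypothetical simplified class and `Seq` stands in for ArrayData; this is not the code that was proposed or merged.

```scala
object EqualsPatternSketch {
  // Hypothetical simplified class with element-wise equality.
  final class MapLike(val keyArray: Seq[Any], val valueArray: Seq[Any]) {
    override def equals(obj: Any): Boolean = obj match {
      case other: MapLike =>
        keyArray == other.keyArray && valueArray == other.valueArray
      case _ => false // covers null and unrelated types
    }

    // Tuple hashing is content-based, so this stays consistent with equals.
    override def hashCode(): Int = (keyArray, valueArray).hashCode()
  }

  def main(args: Array[String]): Unit = {
    val m1 = new MapLike(Seq(1, 2), Seq("a", "b"))
    val m2 = new MapLike(Seq(1, 2), Seq("a", "b"))
    println(m1 == m2)        // true
    println(m1.equals(null)) // false
  }
}
```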

@AmplabJenkins

Can one of the admins verify this patch?

@c27kwan
Contributor Author

c27kwan commented Sep 6, 2022

I've been going through old PRs and see that the equals and hashCode methods were actually purposely removed from this class: #13847

We've also chosen to honour this removal by adding the checks in other places like Literal: #30246

I also saw a newer PR (not merged) attempting to add these methods back in: #32552

Is this going to break an invariant if we now provide an equals and hashCode function only for ArrayBasedMapData? I'm going to make another variant of this PR and fix the hashCode only for Literal of ArrayBasedMapData, since that's the only place that's inconsistent right now between the equals and the hashCode.

@c27kwan c27kwan marked this pull request as draft September 6, 2022 09:52
@c27kwan
Contributor Author

c27kwan commented Sep 6, 2022

I made a new PR that I think works better because it has a narrower area of impact: #3780

I'm going to close this PR because the proposed solution in this PR isn't as good.

@c27kwan c27kwan closed this Sep 6, 2022
@cloud-fan
Contributor

There are 2 things:

  1. SQL semantics: the map type is not orderable and can't be used as join/group keys.
  2. MapData needs to implement equals/hashCode because we need to look up plans (e.g. in CacheManager) and plans may contain map literals.

In the long term, we need to support equality checks on map values, so that we can use the map type as join/group keys. In the short term, we need to correctly implement QueryPlan.semanticEquals, which requires correct hashCode/equals for map literals.

I'm +1 to this idea: "I'm going to make another variant of this PR and fix the hashCode only for Literal of ArrayBasedMapData"
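
The narrower idea endorsed above can be sketched roughly as follows: leave the map data class with its default equals/hashCode and make only the literal wrapper compare and hash by contents. Everything here is a hypothetical stand-in (`MapValue`, `Lit`, and `Seq` in place of ArrayData); it is an illustration of the approach, not the implementation in the follow-up PR.

```scala
import scala.util.hashing.MurmurHash3

object LiteralHashSketch {
  // Hypothetical map value that keeps the default (reference) equals/hashCode.
  final class MapValue(val keys: Seq[Any], val values: Seq[Any])

  // Hypothetical literal wrapper: equals and hashCode both special-case map values.
  final class Lit(val value: Any) {
    override def equals(obj: Any): Boolean = obj match {
      case o: Lit => (value, o.value) match {
        case (a: MapValue, b: MapValue) =>
          a.keys == b.keys && a.values == b.values
        case (a, b) => a == b
      }
      case _ => false
    }

    override def hashCode(): Int = value match {
      case m: MapValue =>
        // Hash the contents so that it agrees with the equals above.
        val keyHash = MurmurHash3.mix(MurmurHash3.productSeed, m.keys.hashCode())
        val valueHash = MurmurHash3.mix(keyHash, m.values.hashCode())
        MurmurHash3.finalizeHash(valueHash, 2)
      case null => 0
      case other => other.hashCode()
    }
  }

  def main(args: Array[String]): Unit = {
    val l1 = new Lit(new MapValue(Seq(1), Seq("a")))
    val l2 = new Lit(new MapValue(Seq(1), Seq("a")))
    println(l1 == l2)                   // true
    println(l1.hashCode == l2.hashCode) // true
  }
}
```

With this shape, a lookup keyed on the wrapper behaves consistently even though the wrapped map values themselves still compare by reference.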

@cloud-fan
Contributor

@c27kwan did you post the wrong link for your new PR?

@c27kwan
Contributor Author

c27kwan commented Sep 6, 2022

Oops, here's the correct PR link: #37807

@c27kwan c27kwan deleted the SPARK-40315 branch February 12, 2024 18:49
4 participants