[SPARK-45599][CORE][3.5] Use object equality in OpenHashSet#45296
Closed
nchammas wants to merge 2 commits intoapache:branch-3.5from
Closed
[SPARK-45599][CORE][3.5] Use object equality in OpenHashSet#45296nchammas wants to merge 2 commits intoapache:branch-3.5from
nchammas wants to merge 2 commits intoapache:branch-3.5from
Conversation
### What changes were proposed in this pull request? Change `OpenHashSet` to use object equality instead of cooperative equality when looking up keys. ### Why are the changes needed? This brings the behavior of `OpenHashSet` more in line with the semantics of `java.util.HashSet`, and fixes its behavior when comparing values for which `equals` and `==` return different results, like 0.0/-0.0 and NaN/NaN. For example, in certain cases where both 0.0 and -0.0 are provided as keys to the set, lookups of one or the other key may return the [wrong position][wrong] in the set. This leads to the bug described in SPARK-45599 and summarized in [this comment][1]. [wrong]: https://github.com/apache/spark/pull/45036/files#diff-894198a5fea34e5b7f07d0a4641eb09995315d5de3e0fded3743c15a3c8af405R277-R283 [1]: https://issues.apache.org/jira/browse/SPARK-45599?focusedCommentId=17806954&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17806954 ### Does this PR introduce _any_ user-facing change? Yes, it resolves the bug described in SPARK-45599. `OpenHashSet` is used widely under the hood, including in: - `OpenHashMap`, which itself backs: - `TypedImperativeAggregate` - aggregate functions like `percentile` and `mode` - many algorithms in ML and MLlib - `SQLOpenHashSet`, which backs array functions like `array_union` and `array_distinct` However, the user-facing changes should be limited to the kind of edge case described in SPARK-45599. ### How was this patch tested? New and existing unit tests. Of the new tests added in this PR, some simply validate that we have not changed existing SQL semantics, while others confirm that we have fixed the specific bug reported in SPARK-45599 along with any related incorrect behavior. New tests failing on `master` but passing on this branch: - [Handling 0.0 and -0.0 in `OpenHashSet`](https://github.com/apache/spark/pull/45036/files#diff-894198a5fea34e5b7f07d0a4641eb09995315d5de3e0fded3743c15a3c8af405R273) - [Handling NaN in `OpenHashSet`](https://github.com/apache/spark/pull/45036/files#diff-894198a5fea34e5b7f07d0a4641eb09995315d5de3e0fded3743c15a3c8af405R302) - [Handling 0.0 and -0.0 in `OpenHashMap`](https://github.com/apache/spark/pull/45036/files#diff-09400ec633b1f1322c5f7b39dc4e87073b0b0435b60b9cff93388053be5083b6R253) - [Handling 0.0 and -0.0 when computing percentile](https://github.com/apache/spark/pull/45036/files#diff-bd3d5c79ede5675f4bf10d2efb313db893d57443d6d6d67b1f8766e6ce741271R1092) New tests passing both on `master` and this branch: - [Handling 0.0, -0.0, and NaN in `array_union`](https://github.com/apache/spark/pull/45036/files#diff-9e18a5ccf83ac94321e3a0ee8c5acf104c45734f3b35f1a0d4c15c4daa315ad5R793) - [Handling 0.0, -0.0, and NaN in `array_distinct`](https://github.com/apache/spark/pull/45036/files#diff-9e18a5ccf83ac94321e3a0ee8c5acf104c45734f3b35f1a0d4c15c4daa315ad5R801) - [Handling 0.0, -0.0, and NaN in `GROUP BY`](https://github.com/apache/spark/pull/45036/files#diff-496edb8b03201f078c3772ca81f7c7f80002acc11dff00b1d06d288b87855264R1107) - [Normalizing -0 and -0.0](https://github.com/apache/spark/pull/45036/files#diff-4bdd04d06a2d88049dd5c8a67715c5566dd68a1c4ebffc689dc74b6b2e0b3b04R782) ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#45036 from nchammas/SPARK-45599-plus-and-minus-zero. Authored-by: Nicholas Chammas <nicholas.chammas@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
dongjoon-hyun
approved these changes
Feb 28, 2024
dongjoon-hyun
pushed a commit
that referenced
this pull request
Feb 28, 2024
### What changes were proposed in this pull request? This is a backport of fcc5dbc / #45036 with a tweak so that it works on Scala 2.12. ### Why are the changes needed? This is a correctness bug fix. The original fix against `master` suppresses a warning category that doesn't exist on certain versions of Scala 2.13 and 2.12, and the exact versions are [not documented anywhere][1]. To be safe, this backport simply suppresses all warnings instead of just `other-non-cooperative-equals`. It would be interesting to see if `-Wconf:nowarn` complains, since the warning about non-cooperative equals itself is also not thrown on all versions of Scala, but I don't think that's a priority. [1]: scala/scala#8120 (comment) ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? CI + added tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45296 from nchammas/SPARK-45599-OpenHashSet. Authored-by: Nicholas Chammas <nicholas.chammas@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Member
|
Merged to branch-3.5. Thank you, @nchammas . |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changes were proposed in this pull request?
This is a backport of fcc5dbc / #45036 with a tweak so that it works on Scala 2.12.
Why are the changes needed?
This is a correctness bug fix.
The original fix against
mastersuppresses a warning category that doesn't exist on certain versions of Scala 2.13 and 2.12, and the exact versions are not documented anywhere.To be safe, this backport simply suppresses all warnings instead of just
other-non-cooperative-equals. It would be interesting to see if-Wconf:nowarncomplains, since the warning about non-cooperative equals itself is also not thrown on all versions of Scala, but I don't think that's a priority.Does this PR introduce any user-facing change?
Yes.
How was this patch tested?
CI + added tests.
Was this patch authored or co-authored using generative AI tooling?
No.