Skip to content

[SPARK-45599][CORE][3.5] Use object equality in OpenHashSet#45296

Closed
nchammas wants to merge 2 commits intoapache:branch-3.5from
nchammas:SPARK-45599-OpenHashSet
Closed

[SPARK-45599][CORE][3.5] Use object equality in OpenHashSet#45296
nchammas wants to merge 2 commits intoapache:branch-3.5from
nchammas:SPARK-45599-OpenHashSet

Conversation

@nchammas
Copy link
Contributor

@nchammas nchammas commented Feb 27, 2024

What changes were proposed in this pull request?

This is a backport of fcc5dbc / #45036 with a tweak so that it works on Scala 2.12.

Why are the changes needed?

This is a correctness bug fix.

The original fix against master suppresses a warning category that doesn't exist on certain versions of Scala 2.13 and 2.12, and the exact versions are not documented anywhere.

To be safe, this backport simply suppresses all warnings instead of just other-non-cooperative-equals. It would be interesting to see if -Wconf:nowarn complains, since the warning about non-cooperative equals itself is also not thrown on all versions of Scala, but I don't think that's a priority.

Does this PR introduce any user-facing change?

Yes.

How was this patch tested?

CI + added tests.

Was this patch authored or co-authored using generative AI tooling?

No.

### What changes were proposed in this pull request?

Change `OpenHashSet` to use object equality instead of cooperative equality when looking up keys.

### Why are the changes needed?

This brings the behavior of `OpenHashSet` more in line with the semantics of `java.util.HashSet`, and fixes its behavior when comparing values for which `equals` and `==` return different results, like 0.0/-0.0 and NaN/NaN.

For example, in certain cases where both 0.0 and -0.0 are provided as keys to the set, lookups of one or the other key may return the [wrong position][wrong] in the set. This leads to the bug described in SPARK-45599 and summarized in [this comment][1].

[wrong]: https://github.com/apache/spark/pull/45036/files#diff-894198a5fea34e5b7f07d0a4641eb09995315d5de3e0fded3743c15a3c8af405R277-R283
[1]: https://issues.apache.org/jira/browse/SPARK-45599?focusedCommentId=17806954&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17806954

### Does this PR introduce _any_ user-facing change?

Yes, it resolves the bug described in SPARK-45599.

`OpenHashSet` is used widely under the hood, including in:
- `OpenHashMap`, which itself backs:
    - `TypedImperativeAggregate`
    - aggregate functions like `percentile` and `mode`
    - many algorithms in ML and MLlib
- `SQLOpenHashSet`, which backs array functions like `array_union` and `array_distinct`

However, the user-facing changes should be limited to the kind of edge case described in SPARK-45599.

### How was this patch tested?

New and existing unit tests. Of the new tests added in this PR, some simply validate that we have not changed existing SQL semantics, while others confirm that we have fixed the specific bug reported in SPARK-45599 along with any related incorrect behavior.

New tests failing on `master` but passing on this branch:
- [Handling 0.0 and -0.0 in `OpenHashSet`](https://github.com/apache/spark/pull/45036/files#diff-894198a5fea34e5b7f07d0a4641eb09995315d5de3e0fded3743c15a3c8af405R273)
- [Handling NaN in `OpenHashSet`](https://github.com/apache/spark/pull/45036/files#diff-894198a5fea34e5b7f07d0a4641eb09995315d5de3e0fded3743c15a3c8af405R302)
- [Handling 0.0 and -0.0 in `OpenHashMap`](https://github.com/apache/spark/pull/45036/files#diff-09400ec633b1f1322c5f7b39dc4e87073b0b0435b60b9cff93388053be5083b6R253)
- [Handling 0.0 and -0.0 when computing percentile](https://github.com/apache/spark/pull/45036/files#diff-bd3d5c79ede5675f4bf10d2efb313db893d57443d6d6d67b1f8766e6ce741271R1092)

New tests passing both on `master` and this branch:
- [Handling 0.0, -0.0, and NaN in `array_union`](https://github.com/apache/spark/pull/45036/files#diff-9e18a5ccf83ac94321e3a0ee8c5acf104c45734f3b35f1a0d4c15c4daa315ad5R793)
- [Handling 0.0, -0.0, and NaN in `array_distinct`](https://github.com/apache/spark/pull/45036/files#diff-9e18a5ccf83ac94321e3a0ee8c5acf104c45734f3b35f1a0d4c15c4daa315ad5R801)
- [Handling 0.0, -0.0, and NaN in `GROUP BY`](https://github.com/apache/spark/pull/45036/files#diff-496edb8b03201f078c3772ca81f7c7f80002acc11dff00b1d06d288b87855264R1107)
- [Normalizing -0 and -0.0](https://github.com/apache/spark/pull/45036/files#diff-4bdd04d06a2d88049dd5c8a67715c5566dd68a1c4ebffc689dc74b6b2e0b3b04R782)

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#45036 from nchammas/SPARK-45599-plus-and-minus-zero.

Authored-by: Nicholas Chammas <nicholas.chammas@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@dongjoon-hyun dongjoon-hyun changed the title [SPARK-45599][CORE] Use object equality in OpenHashSet [SPARK-45599][CORE][3.5] Use object equality in OpenHashSet Feb 27, 2024
dongjoon-hyun pushed a commit that referenced this pull request Feb 28, 2024
### What changes were proposed in this pull request?

This is a backport of fcc5dbc / #45036 with a tweak so that it works on Scala 2.12.

### Why are the changes needed?

This is a correctness bug fix.

The original fix against `master` suppresses a warning category that doesn't exist on certain versions of Scala 2.13 and 2.12, and the exact versions are [not documented anywhere][1].

To be safe, this backport simply suppresses all warnings instead of just `other-non-cooperative-equals`. It would be interesting to see if `-Wconf:nowarn` complains, since the warning about non-cooperative equals itself is also not thrown on all versions of Scala, but I don't think that's a priority.

[1]: scala/scala#8120 (comment)

### Does this PR introduce _any_ user-facing change?

Yes.

### How was this patch tested?

CI + added tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45296 from nchammas/SPARK-45599-OpenHashSet.

Authored-by: Nicholas Chammas <nicholas.chammas@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@dongjoon-hyun
Copy link
Member

Merged to branch-3.5. Thank you, @nchammas .

@nchammas nchammas deleted the SPARK-45599-OpenHashSet branch February 28, 2024 16:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants