-
Notifications
You must be signed in to change notification settings - Fork 29.2k
[SPARK-45599][CORE] Use object equality in OpenHashSet #45036
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
905e174
0d5c2ce
2bfc605
0290455
21cff3b
16d9aef
0c47d4c
760cb49
7ac5e3e
5bbde83
3764705
3b5118d
6b4f091
9fe698b
60f698a
15a091c
a76629c
3a0b194
f0764c4
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -249,4 +249,34 @@ class OpenHashMapSuite extends SparkFunSuite with Matchers { | |
| map(null) = null | ||
| assert(map.get(null) === Some(null)) | ||
| } | ||
|
|
||
| test("SPARK-45599: 0.0 and -0.0 should count distinctly; NaNs should count together") { | ||
| // Exactly these elements provided in roughly this order trigger a condition where lookups of | ||
| // 0.0 and -0.0 in the bitset happen to collide, causing their counts to be merged incorrectly | ||
| // and inconsistently if `==` is used to check for key equality. | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. shall we mention the NaN behavior as well? All NaN values are all the same.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I tweaked the test name. Is that what you had in mind? This comment explains why we need exactly the following elements to trigger the 0.0/-0.0 miscount. It doesn't always happen (which is part of what kept this bug hidden for so long). |
||
| val spark45599Repro = Seq( | ||
| Double.NaN, | ||
| 2.0, | ||
| 168.0, | ||
| Double.NaN, | ||
| Double.NaN, | ||
| -0.0, | ||
| 153.0, | ||
| 0.0 | ||
| ) | ||
|
|
||
| val map1 = new OpenHashMap[Double, Int]() | ||
| spark45599Repro.foreach(map1.changeValue(_, 1, {_ + 1})) | ||
| assert(map1(0.0) == 1) | ||
| assert(map1(-0.0) == 1) | ||
| assert(map1(Double.NaN) == 3) | ||
|
|
||
| val map2 = new OpenHashMap[Double, Int]() | ||
| // Simply changing the order in which the elements are added to the map should not change the | ||
| // counts for 0.0 and -0.0. | ||
| spark45599Repro.reverse.foreach(map2.changeValue(_, 1, {_ + 1})) | ||
| assert(map2(0.0) == 1) | ||
| assert(map2(-0.0) == 1) | ||
|
nchammas marked this conversation as resolved.
|
||
| assert(map2(Double.NaN) == 3) | ||
| } | ||
| } | ||
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -269,4 +269,43 @@ class OpenHashSetSuite extends SparkFunSuite with Matchers { | |||||
| assert(pos1 == pos2) | ||||||
| } | ||||||
| } | ||||||
|
|
||||||
| test("SPARK-45599: 0.0 and -0.0 are equal but not the same") { | ||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is a bit tricky and it's better if we can find a reference system that defines this semantic. In Spark,
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, probably make sense if we just fix this particular issue as
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
scala> import java.util.HashSet
import java.util.HashSet
scala> val h = new HashSet[Double]()
val h: java.util.HashSet[Double] = []
scala> h.add(0.0)
val res0: Boolean = true
scala> h.add(-0.0)
val res1: Boolean = true
scala> h.size()
val res2: Int = 2The doc for HashSet.add states:
In other words, So this PR brings
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
This PR does not change this behavior. I noticed, however, that we do not have any tests currently to check that -0.0 is normalized and grouped as you describe, so I went ahead and added such a test in 2bfc605. Does this address your concern? Or are you suggesting that we should normalize -0.0 to 0.0 across the board?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Consider another interesting case where scala> val h = new HashSet[Double]()
val h: java.util.HashSet[Double] = []
scala> h.add(Double.NaN)
val res9: Boolean = true
scala> h.add(Double.NaN)
val res10: Boolean = false
scala> h.contains(Double.NaN)
val res11: Boolean = true
scala> h.size()
val res12: Int = 1On val set = new OpenHashSet[Double]()
set.add(Double.NaN)
set.add(Double.NaN)
set.size // returns 2
set.contains(Double.NaN) // returns falseThis could possibly lead to a bug like the one reported in SPARK-45599 but in reverse, where a new NaN row is added rather than dropped. I will see if I can construct such a scenario as a demonstration. But regardless, I think this behavior is incorrect by itself.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Note also that the docstring for spark/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala Lines 31 to 32 in 0d5c2ce
If that's true, then we should perhaps add property based tests to ensure alignment between the two implementations, but I'll leave that as a potential future improvement.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Spark's
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Whether or not This comment explains the root cause. Basically, it is a mistake to combine hash code-based lookups with cooperative equality, at least in the way we are doing it in But I understand what you are saying. Fixing bugs in
I've updated the PR description with a summary of what uses As a side note, I believe that if we accept the change proposed here, we should be able to eliminate
I've updated the PR description with a diff of what tests pass or fail on |
||||||
| // Therefore, 0.0 and -0.0 should get separate entries in the hash set. | ||||||
| // | ||||||
| // Exactly these elements provided in roughly this order will trigger the following scenario: | ||||||
| // When probing the bitset in `getPos(-0.0)`, the loop will happen upon the entry for 0.0. | ||||||
| // In the old logic pre-SPARK-45599, the loop will find that the bit is set and, because | ||||||
| // -0.0 == 0.0, it will think that's the position of -0.0. But in reality this is the position | ||||||
| // of 0.0. So -0.0 and 0.0 will be stored at different positions, but `getPos()` will return | ||||||
| // the same position for them. This can cause users of OpenHashSet, like OpenHashMap, to | ||||||
| // return the wrong value for a key based on whether or not this bitset lookup collision | ||||||
| // happens. | ||||||
|
nchammas marked this conversation as resolved.
|
||||||
| val spark45599Repro = Seq( | ||||||
| Double.NaN, | ||||||
| 2.0, | ||||||
| 168.0, | ||||||
| Double.NaN, | ||||||
| Double.NaN, | ||||||
| -0.0, | ||||||
| 153.0, | ||||||
| 0.0 | ||||||
| ) | ||||||
| val set = new OpenHashSet[Double]() | ||||||
| spark45599Repro.foreach(set.add) | ||||||
| assert(set.size == 6) | ||||||
| val zeroPos = set.getPos(0.0) | ||||||
| val negZeroPos = set.getPos(-0.0) | ||||||
| assert(zeroPos != negZeroPos) | ||||||
| } | ||||||
|
|
||||||
| test("SPARK-45599: NaN and NaN are the same but not equal") { | ||||||
| // Any mathematical comparison to NaN will return false, but when we place it in | ||||||
| // a hash set we want the lookup to work like a "normal" value. | ||||||
| val set = new OpenHashSet[Double]() | ||||||
| set.add(Double.NaN) | ||||||
| set.add(Double.NaN) | ||||||
| assert(set.contains(Double.NaN)) | ||||||
| assert(set.size == 1) | ||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. do we actually waste space in
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes sir. On // ...OpenHashSet$mcD$sp@21b327e6 did not contain NaN
assert(set.contains(Double.NaN))
// ...OpenHashSet$mcD$sp@1f09db1e had size 2 instead of expected size 1
assert(set.size == 1)Every NaN will get its own entry in |
||||||
| } | ||||||
| } | ||||||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was told that in scala
==is the same asequals, buteqis a different operator. I need to refresh my knowledge now :)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the differences are subtle:
There is a long discussion on the Scala forums from 2017 about this difference and some of the problems it causes:
Can we get rid of cooperative equality?