[SPARK-32764][SQL] -0.0 should be equal to 0.0 #29647
Conversation
However that's not true in other contexts, like sorting things, where you do want a total ordering on doubles. I think we had to confront that a while ago in dealing with Scala 2.13 or something. If this doesn't affect sorts then I think that's fine (and it doesn't seem to).
@srowen I just verified on PostgreSQL/Oracle/MySQL and the results of
package org.apache.spark.sql.catalyst.util

object OrderingUtil {
Nit: how about renaming this as `SQLOrderingUtil`, and renaming `compareDoublesSQL` as `compareDoubles`?
+1
sounds good.
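For readers following along, here is a minimal sketch (not the exact code from this PR) of what such a SQL-aware compare utility could look like after the rename discussed above; the method bodies are an assumption, not a quote of the patch:

package org.apache.spark.sql.catalyst.util

// Sketch only; assumes the renamed object/method names agreed on above.
object SQLOrderingUtil {
  // SQL semantics: 0.0 == -0.0, NaN == NaN, and NaN sorts above any non-NaN value.
  def compareDoubles(x: Double, y: Double): Int = {
    // `==` already treats 0.0 and -0.0 as equal (and NaN as unequal to itself),
    // so we only fall back to java.lang.Double.compare when the values differ;
    // Double.compare then handles NaN vs NaN (0) and NaN vs non-NaN (positive).
    if (x == y) 0 else java.lang.Double.compare(x, y)
  }

  def compareFloats(x: Float, y: Float): Int = {
    if (x == y) 0 else java.lang.Float.compare(x, y)
  }
}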
@gengliangwang wait, so
@srowen it's false.
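The underlying Scala/Java behavior makes the corner case easy to see: IEEE comparison operators already treat 0.0 and -0.0 as equal, while java.lang.Double.compare (the source of the regression per the PR description) orders them differently:

scala> 0.0 == -0.0
res0: Boolean = true

scala> 0.0 > -0.0
res1: Boolean = false

scala> java.lang.Double.compare(0.0, -0.0)
res2: Int = 1

scala> java.lang.Double.compare(Double.NaN, Double.NaN)
res3: Int = 0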
Test build #128291 has finished for PR 29647 at commit
Thank you, @cloud-fan!
select double('NaN') = double('NaN'), double('NaN') >= double('NaN'), double('NaN') <= double('NaN');
select double('NaN') > double("+Infinity"), double("+Infinity") < double('NaN'), double('NaN') > double('NaN');
select 0.0 = -0.0, 0.0 >= -0.0, 0.0 <= -0.0;
select 0.0 > -0.0, 0.0 < -0.0;
This doesn't add new test coverage because it has the same result as Spark 3.0.0:
spark-sql> select version();
3.0.0 3fdfce3120f307147244e5eaf46d61419a723d50
spark-sql> select double('NaN') = double('NaN'), double('NaN') >= double('NaN'), double('NaN') <= double('NaN');
true true true
spark-sql> select double('NaN') > double("+Infinity"), double("+Infinity") < double('NaN'), double('NaN') > double('NaN');
true true false
spark-sql> select 0.0 = -0.0, 0.0 >= -0.0, 0.0 <= -0.0;
true true true
spark-sql> select 0.0 > -0.0, 0.0 < -0.0;
false false
We can remove these if we cannot reproduce the regression in the SQL layer. Otherwise, this may be misleading.
literals are OK. I'll update the tests to use columns.
I'm removing them. I can't reproduce the bug with a temp view as there are optimizations around literals. I'll add an end-to-end test in DataFrameSuite.
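A hypothetical shape for such an end-to-end check (illustrative only; the column names and exact assertions are assumptions, not the test actually added):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("negative-zero").getOrCreate()
import spark.implicits._

// Put 0.0 and -0.0 in columns so the comparison runs on real data instead of
// being constant-folded away, as happens for literal-only queries.
val df = Seq((0.0, -0.0)).toDF("a", "b")
assert(df.selectExpr("a = b").head().getBoolean(0))   // SQL semantics: 0.0 = -0.0 is true
assert(!df.selectExpr("a > b").head().getBoolean(0))  // and 0.0 > -0.0 is false
spark.stop()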
Thanks!
shouldMatchDefaultOrder(0f, 1f)
shouldMatchDefaultOrder(-1f, 1f)
shouldMatchDefaultOrder(Float.MinValue, Float.MaxValue)
assert(OrderingUtil.compareDoublesSQL(Float.NaN, Float.NaN) === 0)
`NaN` is not a unique value. Since there are multiple NaN values, could you add test coverage for different NaN values too?
scala> java.nio.ByteBuffer.allocate(4).putFloat(Float.NaN).array
res0: Array[Byte] = Array(127, -64, 0, 0)
scala> val x = java.lang.Float.intBitsToFloat(-6966608)
x: Float = NaN
scala> java.nio.ByteBuffer.allocate(4).putFloat(x).array
res1: Array[Byte] = Array(-1, -107, -78, -80)
Thank you, @cloud-fan. I added two comments:
- Adding new test coverage to compare different NaN values.
- Reconsidering `operators.sql`, because it doesn't fail on 3.0.0 either.
/**
 * A special version of float comparison that follows SQL semantic:
 * 1. NaN == NaN
 * 2. NaN is greater than any non-NaN double
NaN is greater than any non-NaN double -> NaN is greater than any non-NaN float
Test build #128359 has finished for PR 29647 at commit
assert(JDouble.doubleToRawLongBits(Double.NaN) != JDouble.doubleToRawLongBits(specialNaN))

assert(SQLOrderingUtil.compareDoubles(Double.NaN, Double.NaN) === 0)
assert(SQLOrderingUtil.compareDoubles(Double.NaN, specialNaN) === 0)
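For readers without the full diff: `specialNaN` above is presumably a NaN with a non-canonical bit pattern. One way to construct such a value (the exact constant here is illustrative, not necessarily the one used in the test):

import java.lang.{Double => JDouble}

// Any double with an all-ones exponent and a non-zero mantissa is a NaN, but its
// raw bits can differ from the canonical Double.NaN (0x7ff8000000000000L).
val specialNaN = JDouble.longBitsToDouble(0x7ff1234512345678L)
assert(specialNaN.isNaN)
assert(JDouble.doubleToRawLongBits(Double.NaN) != JDouble.doubleToRawLongBits(specialNaN))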
Thanks!
+1, LGTM. Thank you, @cloud-fan and all!
Merged to master/3.0.
This is a Spark 3.0 regression introduced by #26761. We missed a corner case that `java.lang.Double.compare` treats 0.0 and -0.0 as different, which breaks SQL semantics. This PR adds back the `OrderingUtil`, to provide custom compare methods that take care of 0.0 vs -0.0. Fix a correctness bug. Yes, now `SELECT 0.0 > -0.0` returns false correctly, as in Spark 2.x. New tests.

Closes #29647 from cloud-fan/float.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 4144b6d)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
What changes were proposed in this pull request?
This is a Spark 3.0 regression introduced by #26761. We missed a corner case that `java.lang.Double.compare` treats 0.0 and -0.0 as different, which breaks SQL semantics. This PR adds back the `OrderingUtil`, to provide custom compare methods that take care of 0.0 vs -0.0.

Why are the changes needed?
Fix a correctness bug.

Does this PR introduce any user-facing change?
Yes, now `SELECT 0.0 > -0.0` returns false correctly, as in Spark 2.x.

How was this patch tested?
New tests.