
[SPARK-32764][SQL] -0.0 should be equal to 0.0 #29647

Closed. Wants to merge 1 commit into apache:master from cloud-fan:float.

Conversation

cloud-fan (Contributor)

What changes were proposed in this pull request?

This is a Spark 3.0 regression introduced by #26761. We missed a corner case: `java.lang.Double.compare` treats 0.0 and -0.0 as different values, which breaks SQL semantics.

This PR adds back `OrderingUtil` to provide custom compare methods that handle 0.0 vs. -0.0.
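
The idea behind the fix, as a minimal sketch (the exact code in the PR may differ): short-circuit on primitive `==`, which already treats 0.0 and -0.0 as equal per IEEE 754, then fall back to `java.lang.Double.compare`, which treats NaN as equal to itself and greater than everything else.

object SQLOrderingUtilSketch {
  // Primitive == returns true for 0.0 == -0.0 and false for NaN == NaN,
  // so the short-circuit fixes exactly the corner case where
  // java.lang.Double.compare disagrees with SQL semantics.
  def compareDoubles(x: Double, y: Double): Int = {
    if (x == y) 0 else java.lang.Double.compare(x, y)
  }
}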

Why are the changes needed?

Fix a correctness bug.

Does this PR introduce any user-facing change?

Yes. `SELECT 0.0 > -0.0` now correctly returns false, as in Spark 2.x.

How was this patch tested?

New tests.

cloud-fan (Contributor, Author)

cc @srowen @maropu @viirya

srowen (Member) commented Sep 4, 2020

`0.0 > -0.0` should be false in SQL? Then I agree. That's the answer in the JVM languages too.

However, that's not true in other contexts, like sorting, where you do want a total ordering on doubles. I think we had to confront that a while ago when dealing with Scala 2.13 or something. If this doesn't affect sorts then I think that's fine (and it doesn't seem to).
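
(For a concrete illustration of that split, not from the thread: the JVM's primitive comparison operators and its total ordering disagree on exactly these corner cases.)

scala> 0.0 > -0.0
res0: Boolean = false

scala> java.lang.Double.compare(0.0, -0.0)
res1: Int = 1

scala> Double.NaN == Double.NaN
res2: Boolean = false

scala> java.lang.Double.compare(Double.NaN, Double.NaN)
res3: Int = 0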

gengliangwang (Member)

@srowen I just verified on PostgreSQL/Oracle/MySQL, and the result of 0.0 = -0.0 is true in all of them. It's a real corner case, but still a correctness bug.


package org.apache.spark.sql.catalyst.util

object OrderingUtil {
Member:

Nit: how about renaming this to `SQLOrderingUtil` and renaming `compareDoublesSQL` to `compareDoubles`?

Member:

+1

Member:

sounds good.

srowen (Member) commented Sep 4, 2020

@gengliangwang wait, so 0.0 > -0.0 is true in those systems?

gengliangwang (Member)

@srowen it's false.
I tried with

create table foo(i int);
insert into foo values(1);
select * from foo where 0.0>-0.0;

in http://sqlfiddle.com/

SparkQA commented Sep 4, 2020

Test build #128291 has finished for PR 29647 at commit 77b1e44.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

dongjoon-hyun (Member)

Thank you, @cloud-fan!

select double('NaN') = double('NaN'), double('NaN') >= double('NaN'), double('NaN') <= double('NaN');
select double('NaN') > double("+Infinity"), double("+Infinity") < double('NaN'), double('NaN') > double('NaN');
select 0.0 = -0.0, 0.0 >= -0.0, 0.0 <= -0.0;
select 0.0 > -0.0, 0.0 < -0.0;
Member:

This doesn't add new test coverage, because it gives the same result as Spark 3.0.0:

spark-sql> select version();
3.0.0 3fdfce3120f307147244e5eaf46d61419a723d50

spark-sql> select double('NaN') = double('NaN'), double('NaN') >= double('NaN'), double('NaN') <= double('NaN');
true	true	true

spark-sql> select double('NaN') > double("+Infinity"), double("+Infinity") < double('NaN'), double('NaN') > double('NaN');
true	true	false

spark-sql> select 0.0 = -0.0, 0.0 >= -0.0, 0.0 <= -0.0;
true	true	true

spark-sql> select 0.0 > -0.0, 0.0 < -0.0;
false	false

Member:

We can remove these if we cannot reproduce the regression in the SQL layer. Otherwise, this may be misleading.

cloud-fan (Contributor, Author):

Literals are OK. I'll update the tests to use columns.

cloud-fan (Contributor, Author):

I'm removing them. I can't reproduce the bug with a temp view, as there are optimizations around literals. I'll add an end-to-end test in DataFrameSuite.
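
(A sketch of what such an end-to-end test could look like, assuming the usual QueryTest/SharedSparkSession helpers available in DataFrameSuite; the actual test in the PR may differ.)

test("SPARK-32764: -0.0 should be equal to 0.0") {
  // Assumes `import testImplicits._` for toDF and $"...".
  // Put the values in columns so the comparison survives into execution
  // instead of being constant-folded away by the optimizer.
  val df = Seq(0.0 -> -0.0).toDF("pos", "neg")
  checkAnswer(df.select($"pos" > $"neg", $"pos" === $"neg"), Row(false, true))
}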

Member:

Thanks!

shouldMatchDefaultOrder(0f, 1f)
shouldMatchDefaultOrder(-1f, 1f)
shouldMatchDefaultOrder(Float.MinValue, Float.MaxValue)
assert(OrderingUtil.compareDoublesSQL(Float.NaN, Float.NaN) === 0)
Member:

NaN is not a unique value. Since there are multiple NaN bit patterns, could you add test coverage for different NaN values too?

scala> java.nio.ByteBuffer.allocate(4).putFloat(Float.NaN).array
res0: Array[Byte] = Array(127, -64, 0, 0)

scala> val x = java.lang.Float.intBitsToFloat(-6966608)
x: Float = NaN

scala> java.nio.ByteBuffer.allocate(4).putFloat(x).array
res1: Array[Byte] = Array(-1, -107, -78, -80)
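
For reference, `java.lang.Double.doubleToLongBits` canonicalizes every NaN bit pattern to a single value, so a comparison built on `Double.compare` already treats all NaNs as equal. A quick check along those lines (the bit pattern below is arbitrary, not from the PR):

import java.lang.{Double => JDouble}

// Build a NaN from a non-canonical bit pattern.
val specialNaN = JDouble.longBitsToDouble(0x7ff1234512345678L)
assert(specialNaN.isNaN)

// The raw bits differ from the canonical NaN...
assert(JDouble.doubleToRawLongBits(Double.NaN) != JDouble.doubleToRawLongBits(specialNaN))

// ...but doubleToLongBits collapses all NaNs to one canonical pattern,
// which is why Double.compare sees every NaN as equal.
assert(JDouble.doubleToLongBits(Double.NaN) == JDouble.doubleToLongBits(specialNaN))
assert(JDouble.compare(Double.NaN, specialNaN) == 0)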

dongjoon-hyun (Member) left a comment:

Thank you, @cloud-fan. I added two comments:

  1. Adding new test coverage comparing different NaN values.
  2. Reconsidering operators.sql, because it doesn't fail on 3.0.0 either.

/**
* A special version of float comparison that follows SQL semantic:
* 1. NaN == NaN
* 2. NaN is greater than any non-NaN double
Member:

NaN is greater than any non-NaN double -> NaN is greater than any non-NaN float

SparkQA commented Sep 7, 2020

Test build #128359 has finished for PR 29647 at commit 869856c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

assert(JDouble.doubleToRawLongBits(Double.NaN) != JDouble.doubleToRawLongBits(specialNaN))

assert(SQLOrderingUtil.compareDoubles(Double.NaN, Double.NaN) === 0)
assert(SQLOrderingUtil.compareDoubles(Double.NaN, specialNaN) === 0)
Member:

Thanks!

dongjoon-hyun (Member) left a comment:

+1, LGTM. Thank you, @cloud-fan and all!
Merged to master/3.0.

dongjoon-hyun pushed a commit that referenced this pull request Sep 8, 2020
Closes #29647 from cloud-fan/float.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 4144b6d)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
holdenk pushed a commit to holdenk/spark that referenced this pull request Oct 27, 2020