
[SPARK-32764][SQL] -0.0 should be equal to 0.0 #29647

Closed. Wants to merge 1 commit into apache:master from cloud-fan:float.

Conversation

cloud-fan (Contributor)

What changes were proposed in this pull request?

This is a Spark 3.0 regression introduced by #26761. We missed a corner case: `java.lang.Double.compare` treats 0.0 and -0.0 as different values, which breaks SQL semantics.

This PR adds back `OrderingUtil` to provide custom compare methods that handle 0.0 vs. -0.0.
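
The idea behind the fix, as a minimal sketch (the exact code in the PR may differ): short-circuit on primitive `==`, which already treats 0.0 and -0.0 as equal per IEEE 754, then fall back to `java.lang.Double.compare`, which treats NaN as equal to itself and greater than everything else.

object SQLOrderingUtilSketch {
  // Primitive == returns true for 0.0 == -0.0 and false for NaN == NaN,
  // so the short-circuit fixes exactly the corner case where
  // java.lang.Double.compare disagrees with SQL semantics.
  def compareDoubles(x: Double, y: Double): Int = {
    if (x == y) 0 else java.lang.Double.compare(x, y)
  }
}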

Why are the changes needed?

Fix a correctness bug.

Does this PR introduce any user-facing change?

Yes. `SELECT 0.0 > -0.0` now correctly returns false, as in Spark 2.x.

How was this patch tested?

New tests.

cloud-fan (Contributor, Author)

cc @srowen @maropu @viirya

srowen (Member) commented Sep 4, 2020

`0.0 > -0.0` should be false in SQL? Then I agree. That's the answer in the JVM languages too.

However, that's not true in other contexts, like sorting, where you do want a total ordering on doubles. I think we had to confront that a while ago when dealing with Scala 2.13 or something. If this doesn't affect sorts then I think that's fine (and it doesn't seem to).
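
(For a concrete illustration of that split, not from the thread: the JVM's primitive comparison operators and its total ordering disagree on exactly these corner cases.)

scala> 0.0 > -0.0
res0: Boolean = false

scala> java.lang.Double.compare(0.0, -0.0)
res1: Int = 1

scala> Double.NaN == Double.NaN
res2: Boolean = false

scala> java.lang.Double.compare(Double.NaN, Double.NaN)
res3: Int = 0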

gengliangwang (Member)

@srowen I just verified on PostgreSQL/Oracle/MySQL, and the result of 0.0 = -0.0 is true in all of them. It's a real corner case, but still a correctness bug.


package org.apache.spark.sql.catalyst.util

object OrderingUtil {
Member:

Nit: how about renaming this to `SQLOrderingUtil` and renaming `compareDoublesSQL` to `compareDoubles`?

Member:

+1

Member:

sounds good.

srowen (Member) commented Sep 4, 2020

@gengliangwang wait, so 0.0 > -0.0 is true in those systems?

gengliangwang (Member)

@srowen it's false.
I tried with

create table foo(i int);
insert into foo values(1);
select * from foo where 0.0>-0.0;

in http://sqlfiddle.com/

SparkQA commented Sep 4, 2020

Test build #128291 has finished for PR 29647 at commit 77b1e44.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

dongjoon-hyun (Member)

Thank you, @cloud-fan!

select double('NaN') = double('NaN'), double('NaN') >= double('NaN'), double('NaN') <= double('NaN');
select double('NaN') > double("+Infinity"), double("+Infinity") < double('NaN'), double('NaN') > double('NaN');
select 0.0 = -0.0, 0.0 >= -0.0, 0.0 <= -0.0;
select 0.0 > -0.0, 0.0 < -0.0;
Member:

This doesn't add new test coverage, because it gives the same result as Spark 3.0.0:

spark-sql> select version();
3.0.0 3fdfce3120f307147244e5eaf46d61419a723d50

spark-sql> select double('NaN') = double('NaN'), double('NaN') >= double('NaN'), double('NaN') <= double('NaN');
true	true	true

spark-sql> select double('NaN') > double("+Infinity"), double("+Infinity") < double('NaN'), double('NaN') > double('NaN');
true	true	false

spark-sql> select 0.0 = -0.0, 0.0 >= -0.0, 0.0 <= -0.0;
true	true	true

spark-sql> select 0.0 > -0.0, 0.0 < -0.0;
false	false

Member:

We can remove these if we cannot reproduce the regression in the SQL layer. Otherwise, this may be misleading.

cloud-fan (Contributor, Author):

Literals are OK. I'll update the tests to use columns.

cloud-fan (Contributor, Author):

I'm removing them. I can't reproduce the bug with a temp view, as there are optimizations around literals. I'll add an end-to-end test in DataFrameSuite.
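
(A sketch of what such an end-to-end test could look like, assuming the usual QueryTest/SharedSparkSession helpers available in DataFrameSuite; the actual test in the PR may differ.)

test("SPARK-32764: -0.0 should be equal to 0.0") {
  // Assumes `import testImplicits._` for toDF and $"...".
  // Put the values in columns so the comparison survives into execution
  // instead of being constant-folded away by the optimizer.
  val df = Seq(0.0 -> -0.0).toDF("pos", "neg")
  checkAnswer(df.select($"pos" > $"neg", $"pos" === $"neg"), Row(false, true))
}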

Member:

Thanks!

shouldMatchDefaultOrder(0f, 1f)
shouldMatchDefaultOrder(-1f, 1f)
shouldMatchDefaultOrder(Float.MinValue, Float.MaxValue)
assert(OrderingUtil.compareDoublesSQL(Float.NaN, Float.NaN) === 0)
Member:

NaN is not a unique value. Since there are multiple NaN bit patterns, could you add test coverage for different NaN values too?

scala> java.nio.ByteBuffer.allocate(4).putFloat(Float.NaN).array
res0: Array[Byte] = Array(127, -64, 0, 0)

scala> val x = java.lang.Float.intBitsToFloat(-6966608)
x: Float = NaN

scala> java.nio.ByteBuffer.allocate(4).putFloat(x).array
res1: Array[Byte] = Array(-1, -107, -78, -80)
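
For reference, `java.lang.Double.doubleToLongBits` canonicalizes every NaN bit pattern to a single value, so a comparison built on `Double.compare` already treats all NaNs as equal. A quick check along those lines (the bit pattern below is arbitrary, not from the PR):

import java.lang.{Double => JDouble}

// Build a NaN from a non-canonical bit pattern.
val specialNaN = JDouble.longBitsToDouble(0x7ff1234512345678L)
assert(specialNaN.isNaN)

// The raw bits differ from the canonical NaN...
assert(JDouble.doubleToRawLongBits(Double.NaN) != JDouble.doubleToRawLongBits(specialNaN))

// ...but doubleToLongBits collapses all NaNs to one canonical pattern,
// which is why Double.compare sees every NaN as equal.
assert(JDouble.doubleToLongBits(Double.NaN) == JDouble.doubleToLongBits(specialNaN))
assert(JDouble.compare(Double.NaN, specialNaN) == 0)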

dongjoon-hyun (Member) left a comment:

Thank you, @cloud-fan. I added two comments:

  1. Adding new test coverage comparing different NaN values.
  2. Reconsidering operators.sql, because it doesn't fail on 3.0.0 either.

/**
* A special version of float comparison that follows SQL semantic:
* 1. NaN == NaN
* 2. NaN is greater than any non-NaN double
Member:

NaN is greater than any non-NaN double -> NaN is greater than any non-NaN float

SparkQA commented Sep 7, 2020

Test build #128359 has finished for PR 29647 at commit 869856c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

assert(JDouble.doubleToRawLongBits(Double.NaN) != JDouble.doubleToRawLongBits(specialNaN))

assert(SQLOrderingUtil.compareDoubles(Double.NaN, Double.NaN) === 0)
assert(SQLOrderingUtil.compareDoubles(Double.NaN, specialNaN) === 0)
Member:

Thanks!

dongjoon-hyun (Member) left a comment:

+1, LGTM. Thank you, @cloud-fan and all!
Merged to master/3.0.

dongjoon-hyun pushed a commit that referenced this pull request Sep 8, 2020
Closes #29647 from cloud-fan/float.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 4144b6d)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
holdenk pushed a commit to holdenk/spark that referenced this pull request Oct 27, 2020