Conversation

@dongjoon-hyun
Member

What changes were proposed in this pull request?

Since SPARK-9079 and SPARK-9145, NaN = NaN returns true and works well. The only remaining exception is a direct comparison between Row(Float.NaN) and Row(Double.NaN). The example below shows the issue: for consistency, the last two expressions should evaluate to true and List([NaN]).

scala> Seq((1d,1f),(Double.NaN,Float.NaN)).toDF("a","b").registerTempTable("tmp")
scala> sql("select a,b,a=b from tmp").collect()
res1: Array[org.apache.spark.sql.Row] = Array([1.0,1.0,true], [NaN,NaN,true])
scala> val row_a = sql("select a from tmp").collect()
row_a: Array[org.apache.spark.sql.Row] = Array([1.0], [NaN])
scala> val row_b = sql("select b from tmp").collect()
row_b: Array[org.apache.spark.sql.Row] = Array([1.0], [NaN])
scala> row_a(0) == row_b(0)
res2: Boolean = true
scala> List(row_a(0),row_b(0)).distinct
res3: List[org.apache.spark.sql.Row] = List([1.0])
scala> row_a(1) == row_b(1)
res4: Boolean = false
scala> List(row_a(1),row_b(1)).distinct
res5: List[org.apache.spark.sql.Row] = List([NaN], [NaN])

Please note the following background truths as of today (before this PR):

  • Double.NaN != Double.NaN (Scala/Java/IEEE Standard)
  • Float.NaN != Float.NaN (Scala/Java/IEEE Standard)
  • Double.NaN != Float.NaN (Scala/Java/IEEE Standard)
  • Row(Double.NaN) == Row(Double.NaN)
  • Row(Float.NaN) == Row(Float.NaN)
  • Row(Double.NaN) != Row(Float.NaN) <== The problem of this PR
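The first three bullets follow from IEEE 754 and can be checked in plain Scala with no Spark dependency (a minimal sketch; the Row-level bullets need org.apache.spark.sql.Row and are shown in the REPL session above):

```scala
object NaNFacts extends App {
  // IEEE 754 / Java / Scala: a primitive NaN is never equal to anything, itself included.
  assert(!(Double.NaN == Double.NaN))
  assert(!(Float.NaN == Float.NaN))
  assert(!(Double.NaN == Float.NaN))

  // isNaN is the reliable way to detect NaN, since `x == Double.NaN` is always false.
  assert(java.lang.Double.isNaN(Double.NaN))
  assert(java.lang.Float.isNaN(Float.NaN))

  println("all primitive NaN facts hold")
}
```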

How was this patch tested?

Pass the Jenkins tests including new testcases.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-14051] Implement Double.NaN==Float.NaN in row.equals for consistency. [SPARK-14051][SQL] Implement Double.NaN==Float.NaN in row.equals for consistency. Mar 21, 2016
@SparkQA

SparkQA commented Mar 22, 2016

Test build #53726 has finished for PR 11868 at commit 6ff7ade.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Mar 22, 2016

What do we do for hash code?

@dongjoon-hyun
Member Author

Oh, thank you for pointing that out. I missed that part. Let me check it again. I guess we can change Row[NaN].hashCode together in this PR.
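The hash-code concern is real: if Row(Double.NaN) and Row(Float.NaN) are made equal, their hash codes must agree as well, yet the boxed NaN hash codes differ by type because they are derived from bit patterns of different widths (a plain-Scala sketch, independent of Spark's actual Row.hashCode):

```scala
object NaNHashes extends App {
  // Boxed hash codes come from the bit patterns, which differ by width:
  val floatNaNHash  = java.lang.Float.valueOf(Float.NaN).hashCode   // floatToIntBits
  val doubleNaNHash = java.lang.Double.valueOf(Double.NaN).hashCode // folded doubleToLongBits

  // Rows made equal by the new definition would hash differently unless
  // hashCode is normalized alongside equals.
  assert(floatNaNHash != doubleNaNHash)
  println(s"Float.NaN hash = $floatNaNHash, Double.NaN hash = $doubleNaNHash")
}
```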

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-14051][SQL] Implement Double.NaN==Float.NaN in row.equals for consistency. [SPARK-14051][SQL] Implement Double.NaN==Float.NaN for consistency. Mar 22, 2016
@dongjoon-hyun
Member Author

Hi, @rxin . Thank you again!

I made a big mistake in this PR and have now fixed it thanks to your advice. Now the following are true:

scala> row_a(1) == row_b(1)
res4: Boolean = true
scala> List(row_a(1),row_b(1)).distinct
res5: List[org.apache.spark.sql.Row] = List([NaN])

I also applied this to BaseGenericInternalRow.

@rxin
Contributor

rxin commented Mar 22, 2016

I spent a bit of time on this -- I'm actually not sure we want to change this anymore, because Scala itself doesn't do this, and users can still trip up if they compare fields themselves.

scala> Float.NaN == Double.NaN
res1: Boolean = false

@dongjoon-hyun
Member Author

That's because Scala follows the standard Java and IEEE 754 floating-point semantics. I also know that NaN compares false even against another NaN in Java/Scala.
However, this is about Row, i.e. SQL. In the SQL and database world, NaN == NaN.
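A minimal sketch of the SQL-style field equality this PR argues for (a hypothetical helper, not Spark's actual code): NaN compares equal to NaN regardless of whether it is stored as a Float or a Double.

```scala
object SqlStyleEquality extends App {
  // Hypothetical helper: SQL-style equality where NaN = NaN, even across Float/Double.
  def sqlEquals(o1: Any, o2: Any): Boolean = (o1, o2) match {
    case (d1: Double, d2: Double) => (d1.isNaN && d2.isNaN) || d1 == d2
    case (f1: Float,  f2: Float)  => (f1.isNaN && f2.isNaN) || f1 == f2
    case (d: Double,  f: Float)   => (d.isNaN && f.isNaN)   || d == f
    case (f: Float,   d: Double)  => (f.isNaN && d.isNaN)   || f == d
    case _                        => o1 == o2
  }

  assert(sqlEquals(Double.NaN, Double.NaN))
  assert(sqlEquals(Float.NaN, Float.NaN))
  assert(sqlEquals(Double.NaN, Float.NaN)) // the cross-type case this PR adds
  assert(!sqlEquals(1.0, Double.NaN))
  println("SQL-style NaN equality holds")
}
```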

@dongjoon-hyun
Member Author

For example, Oracle orders NaN greatest with respect to all other values, and evaluates NaN equal to NaN.

@rxin
Contributor

rxin commented Mar 22, 2016

Yea but if they do

row1.getFloat(1) == row2.getDouble(2), it'd ...

@dongjoon-hyun
Member Author

IBM DB2 also says "From an SQL perspective, infinity = infinity, NaN = NaN, and sNaN = sNaN."

@dongjoon-hyun
Member Author

Oh, I see what the point is here now. @rxin, may I explain a little bit more?

Mathematically, NaN equality is defined to be false. The following are all false in Scala (and Java):

  • row_a(1).getDouble(0) == row_a(1).getDouble(0)
  • row_b(1).getFloat(0) == row_b(1).getFloat(0)
  • row_a(1).getDouble(0) == row_b(1).getFloat(0)
  • Double.NaN == Double.NaN
  • Float.NaN == Float.NaN
  • Double.NaN == Float.NaN

However, Spark Row on the master branch already returns true for the following:

  • Row(Double.NaN) == Row(Double.NaN)
  • Row(Float.NaN) == Row(Float.NaN)

As you can easily guess, the following are still false:

  • Row(Float.NaN).getFloat(0) == Row(Float.NaN).getFloat(0)
  • Row(Double.NaN).getDouble(0) == Row(Double.NaN).getDouble(0)
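One way to see where this asymmetry comes from: Java's boxed equals compares bit patterns, so a boxed NaN equals a boxed NaN within one type but never across types. This is an illustrative analogy, not necessarily Spark's exact implementation (the diff under review in this thread shows Row.equals special-casing NaN explicitly):

```scala
object BoxedNaNEquality extends App {
  // Primitive ==: NaN is never equal to NaN.
  assert(!(Double.NaN == Double.NaN))

  // Boxed java.lang.Double.equals compares doubleToLongBits, so NaN equals NaN...
  assert(java.lang.Double.valueOf(Double.NaN).equals(java.lang.Double.valueOf(Double.NaN)))

  // ...but never across boxed types, mirroring Row(Double.NaN) != Row(Float.NaN).
  assert(!java.lang.Double.valueOf(Double.NaN).equals(java.lang.Float.valueOf(Float.NaN)))

  println("boxed NaN equality illustrated")
}
```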

@dongjoon-hyun
Member Author

I think this PR makes Spark less confusing for users by completing the missing NaN handling in Spark SQL (Row).

@SparkQA

SparkQA commented Mar 22, 2016

Test build #53799 has finished for PR 11868 at commit 5e0a19c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 24, 2016

Test build #54055 has finished for PR 11868 at commit a0e4ebe.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 25, 2016

Test build #54124 has finished for PR 11868 at commit bafb5ad.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  }
  case f1: Float if java.lang.Float.isNaN(f1) =>
-   if (!o2.isInstanceOf[Float] || !java.lang.Float.isNaN(o2.asInstanceOf[Float])) {
+   if (!(o2.isInstanceOf[Float] && java.lang.Float.isNaN(o2.asInstanceOf[Float]) ||
Member

Hm, in general NaN never equals NaN. There might be some reason to treat it differently here. On the one hand I tend to agree with this change anyway, on the grounds that it implements something like automatic promotion in Scala/Java. But clearly we're already not implementing the language's semantics, and trying to achieve something more like bitwise-equal semantics. In that case this wouldn't quite be right. Ask the author of the original change why it was made?
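The "bitwise-equal semantics" mentioned here can be made concrete in plain Scala: comparing raw bit patterns makes a NaN equal to itself (for the same bit pattern), and it also distinguishes +0.0 from -0.0, unlike ==.

```scala
object BitwiseSemantics extends App {
  import java.lang.Double.doubleToLongBits

  // Bitwise view: the canonical NaN bit pattern equals itself...
  assert(doubleToLongBits(Double.NaN) == doubleToLongBits(Double.NaN))

  // ...while +0.0 and -0.0, which == treats as equal, have different bits.
  assert(0.0 == -0.0)
  assert(doubleToLongBits(0.0) != doubleToLongBits(-0.0))

  println("bitwise semantics illustrated")
}
```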

Member Author

Thank you for your attention, @srowen. I agree with you and @rxin from the current Spark status viewpoint.

Today, while reading the Spark SQL, DataFrames and Datasets Guide: NaN Semantics, I suddenly wanted to update this PR.

NaN Semantics
There is special handling for not-a-number (NaN) when dealing with float or double types that do not exactly match standard floating point semantics. Specifically:

  • NaN = NaN returns true.

I'm still digging to find some useful cases for this PR outside the SQL layers.

@SparkQA

SparkQA commented Apr 11, 2016

Test build #55494 has finished for PR 11868 at commit 1609112.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Hi, @JoshRosen .
Could you take a look at this PR about Row(Double.NaN) == Row(Float.NaN) when you have some time today? I'm going to close this PR and the JIRA (as Won't Fix) since it's stale and Spark works fine without it. However, before closing it, I would really appreciate your opinion on what the original intention was.

According to the initial commit in RowSuite.scala, the Row class is designed (and tested) to return true for the 'NaN' equality test, since 'NaN' is treated as a normal value. So I thought Double.NaN also needed to be equal to Float.NaN and made this PR. It's not an interesting feature, just an attempt to complete Row-level NaN operations.

test("float NaN == NaN") {
    val r1 = Row(Float.NaN)
    val r2 = Row(Float.NaN)
    assert(r1 === r2)
}

test("double NaN == NaN") {
    val r1 = Row(Double.NaN)
    val r2 = Row(Double.NaN)
    assert(r1 === r2)
}

@rxin
Contributor

rxin commented Apr 19, 2016

Yea unfortunately the behavior is already pretty weird, and I'm not sure adding this would actually make it less weird, so I'm in favor of just not doing anything here.

@dongjoon-hyun
Member Author

Sure. As I mentioned today, I'm going to close this PR since Spark doesn't want this.
I'm still wondering what the intention of those testcases was.
Don't worry, @rxin. I know that any -1 rejects a PR.
I think this PR and JIRA will at least prevent others from the same misunderstanding in the future.

@dongjoon-hyun
Member Author

It's just for the record.
If there is no comment from @JoshRosen, I'll close this PR tonight.
Silence can mean many things, and it seems enough for this PR.

@dongjoon-hyun
Member Author

Thank you for your kind reviews during the last month, @rxin and @srowen.
I'm happily closing this PR. My apologies for taking so long to close it.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-14051 branch July 20, 2016 07:33