Conversation

@dongjoon-hyun
Member

What changes were proposed in this pull request?

Since SPARK-9079 and SPARK-9145, NaN = NaN returns true and works well. The only remaining exception is a direct comparison between Row(Float.NaN) and Row(Double.NaN). The example below shows the issue: for consistency, the last two expressions should evaluate to true and List([NaN]).

scala> Seq((1d,1f),(Double.NaN,Float.NaN)).toDF("a","b").registerTempTable("tmp")
scala> sql("select a,b,a=b from tmp").collect()
res1: Array[org.apache.spark.sql.Row] = Array([1.0,1.0,true], [NaN,NaN,true])
scala> val row_a = sql("select a from tmp").collect()
row_a: Array[org.apache.spark.sql.Row] = Array([1.0], [NaN])
scala> val row_b = sql("select b from tmp").collect()
row_b: Array[org.apache.spark.sql.Row] = Array([1.0], [NaN])
scala> row_a(0) == row_b(0)
res2: Boolean = true
scala> List(row_a(0),row_b(0)).distinct
res3: List[org.apache.spark.sql.Row] = List([1.0])
scala> row_a(1) == row_b(1)
res4: Boolean = false
scala> List(row_a(1),row_b(1)).distinct
res5: List[org.apache.spark.sql.Row] = List([NaN], [NaN])

Please note the following background truths as of today (before this PR):

  • Double.NaN != Double.NaN (Scala/Java/IEEE Standard)
  • Float.NaN != Float.NaN (Scala/Java/IEEE Standard)
  • Double.NaN != Float.NaN (Scala/Java/IEEE Standard)
  • Row(Double.NaN) == Row(Double.NaN)
  • Row(Float.NaN) == Row(Float.NaN)
  • Row(Double.NaN) != Row(Float.NaN) <== The problem of this PR
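The first three bullets follow from IEEE 754 and can be checked in plain Scala with no Spark dependency (a minimal sketch; the Row-level bullets need org.apache.spark.sql.Row and are shown in the REPL session above):

```scala
object NaNFacts extends App {
  // IEEE 754 / Java / Scala: a primitive NaN is never equal to anything, itself included.
  assert(!(Double.NaN == Double.NaN))
  assert(!(Float.NaN == Float.NaN))
  assert(!(Double.NaN == Float.NaN))

  // isNaN is the reliable way to detect NaN, since `x == Double.NaN` is always false.
  assert(java.lang.Double.isNaN(Double.NaN))
  assert(java.lang.Float.isNaN(Float.NaN))

  println("all primitive NaN facts hold")
}
```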

How was this patch tested?

Pass the Jenkins tests including new testcases.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-14051] Implement Double.NaN==Float.NaN in row.equals for consistency. [SPARK-14051][SQL] Implement Double.NaN==Float.NaN in row.equals for consistency. Mar 21, 2016
@SparkQA

SparkQA commented Mar 22, 2016

Test build #53726 has finished for PR 11868 at commit 6ff7ade.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Mar 22, 2016

What do we do for hash code?

@dongjoon-hyun
Member Author

Oh, thank you for pointing that out. I missed that part. Let me check it again. I guess we can change Row[NaN].hashCode together in this PR.
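The hash-code concern is real: if Row(Double.NaN) and Row(Float.NaN) are made equal, their hash codes must agree as well, yet the boxed NaN hash codes differ by type because they are derived from bit patterns of different widths (a plain-Scala sketch, independent of Spark's actual Row.hashCode):

```scala
object NaNHashes extends App {
  // Boxed hash codes come from the bit patterns, which differ by width:
  val floatNaNHash  = java.lang.Float.valueOf(Float.NaN).hashCode   // floatToIntBits
  val doubleNaNHash = java.lang.Double.valueOf(Double.NaN).hashCode // folded doubleToLongBits

  // Rows made equal by the new definition would hash differently unless
  // hashCode is normalized alongside equals.
  assert(floatNaNHash != doubleNaNHash)
  println(s"Float.NaN hash = $floatNaNHash, Double.NaN hash = $doubleNaNHash")
}
```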

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-14051][SQL] Implement Double.NaN==Float.NaN in row.equals for consistency. [SPARK-14051][SQL] Implement Double.NaN==Float.NaN for consistency. Mar 22, 2016
@dongjoon-hyun
Member Author

Hi, @rxin . Thank you again!

I made a big mistake in this PR and have now fixed it thanks to your advice. Now the following are true:

scala> row_a(1) == row_b(1)
res4: Boolean = true
scala> List(row_a(1),row_b(1)).distinct
res5: List[org.apache.spark.sql.Row] = List([NaN])

I also applied this to BaseGenericInternalRow.

@rxin
Contributor

rxin commented Mar 22, 2016

I spent a bit of time on this -- I'm actually not sure we want to change this anymore, because Scala itself doesn't do this, and users can still trip up if they compare fields themselves.

scala> Float.NaN == Double.NaN
res1: Boolean = false

@dongjoon-hyun
Member Author

That's because Scala follows the standard Java and IEEE 754 floating-point semantics. I also know that NaN compares false even against another NaN in Java/Scala.
However, this is about Row, i.e. SQL. In the SQL and database world, NaN == NaN.
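A minimal sketch of the SQL-style field equality this PR argues for (a hypothetical helper, not Spark's actual code): NaN compares equal to NaN regardless of whether it is stored as a Float or a Double.

```scala
object SqlStyleEquality extends App {
  // Hypothetical helper: SQL-style equality where NaN = NaN, even across Float/Double.
  def sqlEquals(o1: Any, o2: Any): Boolean = (o1, o2) match {
    case (d1: Double, d2: Double) => (d1.isNaN && d2.isNaN) || d1 == d2
    case (f1: Float,  f2: Float)  => (f1.isNaN && f2.isNaN) || f1 == f2
    case (d: Double,  f: Float)   => (d.isNaN && f.isNaN)   || d == f
    case (f: Float,   d: Double)  => (f.isNaN && d.isNaN)   || f == d
    case _                        => o1 == o2
  }

  assert(sqlEquals(Double.NaN, Double.NaN))
  assert(sqlEquals(Float.NaN, Float.NaN))
  assert(sqlEquals(Double.NaN, Float.NaN)) // the cross-type case this PR adds
  assert(!sqlEquals(1.0, Double.NaN))
  println("SQL-style NaN equality holds")
}
```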

@dongjoon-hyun
Member Author

For example, Oracle orders NaN greatest with respect to all other values, and evaluates NaN equal to NaN.

@rxin
Contributor

rxin commented Mar 22, 2016

Yea but if they do

row1.getFloat(1) == row2.getDouble(2), it'd ...

@dongjoon-hyun
Member Author

IBM DB2 also says "From an SQL perspective, infinity = infinity, NaN = NaN, and sNaN = sNaN."

@dongjoon-hyun
Member Author

Oh, I see what the point is here now. @rxin, may I explain a little bit more?

Mathematically, NaN equality is defined to be false. The following are all false in Scala (and Java):

  • row_a(1).getDouble(0) == row_a(1).getDouble(0)
  • row_b(1).getFloat(0) == row_b(1).getFloat(0)
  • row_a(1).getDouble(0) == row_b(1).getFloat(0)
  • Double.NaN == Double.NaN
  • Float.NaN == Float.NaN
  • Double.NaN == Float.NaN

However, Spark Row on the master branch already returns true for the following:

  • Row(Double.NaN) == Row(Double.NaN)
  • Row(Float.NaN) == Row(Float.NaN)

As you can easily guess, the following are still false:

  • Row(Float.NaN).getFloat(0) == Row(Float.NaN).getFloat(0)
  • Row(Double.NaN).getDouble(0) == Row(Double.NaN).getDouble(0)
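One way to see where this asymmetry comes from: Java's boxed equals compares bit patterns, so a boxed NaN equals a boxed NaN within one type but never across types. This is an illustrative analogy, not necessarily Spark's exact implementation (the diff under review in this thread shows Row.equals special-casing NaN explicitly):

```scala
object BoxedNaNEquality extends App {
  // Primitive ==: NaN is never equal to NaN.
  assert(!(Double.NaN == Double.NaN))

  // Boxed java.lang.Double.equals compares doubleToLongBits, so NaN equals NaN...
  assert(java.lang.Double.valueOf(Double.NaN).equals(java.lang.Double.valueOf(Double.NaN)))

  // ...but never across boxed types, mirroring Row(Double.NaN) != Row(Float.NaN).
  assert(!java.lang.Double.valueOf(Double.NaN).equals(java.lang.Float.valueOf(Float.NaN)))

  println("boxed NaN equality illustrated")
}
```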

@dongjoon-hyun
Member Author

I think this PR makes Spark less confusing for users by completing the missing NaN handling in Spark SQL (Row).

@SparkQA

SparkQA commented Mar 22, 2016

Test build #53799 has finished for PR 11868 at commit 5e0a19c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 24, 2016

Test build #54055 has finished for PR 11868 at commit a0e4ebe.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 25, 2016

Test build #54124 has finished for PR 11868 at commit bafb5ad.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  }
  case f1: Float if java.lang.Float.isNaN(f1) =>
-   if (!o2.isInstanceOf[Float] || !java.lang.Float.isNaN(o2.asInstanceOf[Float])) {
+   if (!(o2.isInstanceOf[Float] && java.lang.Float.isNaN(o2.asInstanceOf[Float]) ||
Member

Hm, in general NaN never equals NaN. There might be some reason to treat it differently here. On the one hand I tend to agree with this change anyway, on the grounds that it implements something like automatic promotion in Scala/Java. But clearly we're already not implementing the language's semantics, and trying to achieve something more like bitwise-equal semantics. In that case this wouldn't quite be right. Ask the author of the original change why it was made?
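The "bitwise-equal semantics" mentioned here can be made concrete in plain Scala: comparing raw bit patterns makes a NaN equal to itself (for the same bit pattern), and it also distinguishes +0.0 from -0.0, unlike ==.

```scala
object BitwiseSemantics extends App {
  import java.lang.Double.doubleToLongBits

  // Bitwise view: the canonical NaN bit pattern equals itself...
  assert(doubleToLongBits(Double.NaN) == doubleToLongBits(Double.NaN))

  // ...while +0.0 and -0.0, which == treats as equal, have different bits.
  assert(0.0 == -0.0)
  assert(doubleToLongBits(0.0) != doubleToLongBits(-0.0))

  println("bitwise semantics illustrated")
}
```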

Member Author

Thank you for your attention, @srowen. I agree with you and @rxin from the current Spark status viewpoint.

Today, while reading the Spark SQL, DataFrames and Datasets Guide: NaN Semantics, I suddenly wanted to update this PR.

NaN Semantics
There is special handling for not-a-number (NaN) when dealing with float or double types that do not exactly match standard floating point semantics. Specifically:

  • NaN = NaN returns true.

I'm still digging to find some useful cases for this PR outside the SQL layers.

@SparkQA

SparkQA commented Apr 11, 2016

Test build #55494 has finished for PR 11868 at commit 1609112.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Hi, @JoshRosen .
Could you take a look at this PR about Row(Double.NaN) == Row(Float.NaN) when you have some time today? I'm going to close this PR and the JIRA (as Won't Fix) since it's stale and Spark works fine without it. However, before closing it, I would really appreciate your opinion on what the original intention was.

According to the initial commit in RowSuite.scala, the Row class is designed (and tested) to return true for the 'NaN' equality test, since 'NaN' is treated as a normal value. So I thought Double.NaN also needed to be equal to Float.NaN and made this PR. It's not an interesting feature, just an attempt to complete Row-level NaN operations.

test("float NaN == NaN") {
    val r1 = Row(Float.NaN)
    val r2 = Row(Float.NaN)
    assert(r1 === r2)
}

test("double NaN == NaN") {
    val r1 = Row(Double.NaN)
    val r2 = Row(Double.NaN)
    assert(r1 === r2)
}

@rxin
Contributor

rxin commented Apr 19, 2016

Yea unfortunately the behavior is already pretty weird, and I'm not sure adding this would actually make it less weird, so I'm in favor of just not doing anything here.

@dongjoon-hyun
Member Author

Sure. As I mentioned today, I'm going to close this PR since Spark doesn't want this.
I'm still wondering what the intention of those testcases was.
Don't worry, @rxin. I know that any -1 rejects a PR.
I think this PR and JIRA will at least prevent others from the same misunderstanding in the future.

@dongjoon-hyun
Member Author

It's just for the record.
If there is no comment from @JoshRosen, I'll close this PR tonight.
Silence can mean many things, and it seems enough for this PR.

@dongjoon-hyun
Member Author

Thank you for your kind reviews during the last month, @rxin and @srowen.
I'm happily closing this PR. My apologies for taking so long to close it.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-14051 branch July 20, 2016 07:33