[SPARK-26021][SQL][followup] add test for special floating point values #23141

Closed · wants to merge 2 commits
6 changes: 4 additions & 2 deletions docs/sql-migration-guide-upgrade.md
@@ -17,14 +17,16 @@ displayTitle: Spark SQL Upgrading Guide

- Since Spark 3.0, the `from_json` function supports two modes - `PERMISSIVE` and `FAILFAST`. The modes can be set via the `mode` option. The default mode became `PERMISSIVE`. In previous versions, the behavior of `from_json` did not conform to either `PERMISSIVE` or `FAILFAST`, especially when processing malformed JSON records. For example, the JSON string `{"a" 1}` with the schema `a INT` is converted to `null` by previous versions, but Spark 3.0 converts it to `Row(null)`. (A short sketch of both `from_json` notes appears below the review comments in this excerpt.)

- In Spark version 2.4 and earlier, the `from_json` function produces `null`s for JSON strings, and the JSON datasource skips them, in both cases independently of the mode, if there is no valid root JSON token in the input (` ` for example). Since Spark 3.0, such input is treated as a bad record and handled according to the specified mode. For example, in `PERMISSIVE` mode the ` ` input is converted to `Row(null, null)` if the specified schema is `key STRING, value INT`.

- The `ADD JAR` command previously returned a result set with the single value 0. It now returns an empty result set.

- In Spark version 2.4 and earlier, users can create map values with a map-type key via built-in functions like `CreateMap`, `MapFromArrays`, etc. Since Spark 3.0, it's no longer allowed to create map values with a map-type key using these built-in functions. Users can still read map values with a map-type key from data sources or Java/Scala collections, though they are not very useful.

- In Spark version 2.4 and earlier, `Dataset.groupByKey` results in a grouped dataset whose key attribute is wrongly named "value" if the key is of a non-struct type, e.g. int, string, array, etc. This is counterintuitive and makes the schema of aggregation queries weird. For example, the schema of `ds.groupByKey(...).count()` is `(value, count)`. Since Spark 3.0, the grouping attribute is named "key". The old behaviour is preserved under a newly added configuration `spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue` with a default value of `false`. (A short sketch appears below the review comments in this excerpt.)

- In Spark version 2.4 and earlier, float/double -0.0 is semantically equal to 0.0, but users can still distinguish them via `Dataset.show`, `Dataset.collect`, etc. Since Spark 3.0, float/double -0.0 is replaced by 0.0 internally, and users can no longer distinguish them. (See the group-by sketch after the review comments below.)
Contributor Author commented:
I checked Presto and Postgres; the behaviors are the same. Hive distinguishes -0.0 and 0.0, but it has the group-by bug.

Another reviewer asked:
What version of Hive did you test?
It was fixed in https://issues.apache.org/jira/browse/HIVE-11174

Member replied:
I ran a few simple queries on Hive 2.1.

Simple comparison seems ok:

hive> select 1 where 0.0=-0.0;
OK
1
Time taken: 0.047 seconds, Fetched: 1 row(s)
hive> select 1 where -0.0<0.0;
OK
Time taken: 0.053 seconds

But the group-by behavior seems incorrect:

hive> select * from test;
OK
0.0
-0.0
0.0
Time taken: 0.11 seconds, Fetched: 3 row(s)
hive> select a, count(*) from test group by a;
-0.0	3
Time taken: 1.308 seconds, Fetched: 1 row(s)
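
For contrast, here is a minimal Spark sketch (added for illustration, not part of the original discussion) of the Spark 3.0 behavior stated in the migration note above: with -0.0 replaced by 0.0 internally, the same three values are expected to collapse into a single group. The local `SparkSession` setup and column name are assumptions for the example.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Same data as the Hive example above: 0.0, -0.0, 0.0
val df = Seq(0.0d, -0.0d, 0.0d).toDF("a")

// Since -0.0 is normalized to 0.0, this should produce a single group: (0.0, 3)
df.groupBy("a").count().show()
```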

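Also for illustration (not from the PR), a minimal sketch of the two `from_json` notes near the top of this file's diff. The local `SparkSession`, column names, and `show()` calls are assumptions; the expected results are the ones described in the migration guide.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Malformed record `{"a" 1}` (missing the colon) with schema `a INT`:
// under PERMISSIVE (the default since 3.0) it should parse to Row(null) rather than null.
val aSchema = new StructType().add("a", IntegerType)
Seq("""{"a" 1}""").toDF("json")
  .select(from_json($"json", aSchema, Map("mode" -> "PERMISSIVE")).as("parsed"))
  .show()

// The same record under FAILFAST is expected to fail the query instead:
// Seq("""{"a" 1}""").toDF("json")
//   .select(from_json($"json", aSchema, Map("mode" -> "FAILFAST"))).show()

// The ` ` input (no root JSON token) with schema `key STRING, value INT`:
// since 3.0 it is treated as a bad record, so PERMISSIVE should yield Row(null, null).
val kvSchema = new StructType().add("key", StringType).add("value", IntegerType)
Seq(" ").toDF("json")
  .select(from_json($"json", kvSchema, Map("mode" -> "PERMISSIVE")).as("parsed"))
  .show()
```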

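And a sketch of the `Dataset.groupByKey` naming change noted above (again illustrative; the `SparkSession` setup is an assumption):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val ds = Seq("a", "b", "a").toDS()

// Spark 2.4: the non-struct grouping key is exposed as a column named "value".
// Spark 3.0: the grouping attribute is named "key" instead (unless
// spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue is set to true).
ds.groupByKey(v => v).count().printSchema()
```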
## Upgrading From Spark SQL 2.3 to 2.4

- In Spark version 2.3 and earlier, the second parameter of the `array_contains` function is implicitly promoted to the element type of the first, array-type parameter. This type promotion can be lossy and may cause `array_contains` to return a wrong result. This problem has been addressed in 2.4 by employing a safer type promotion mechanism. This can cause some changes in behavior, which are illustrated in the table below.
@@ -393,4 +393,33 @@ class DatasetPrimitiveSuite extends QueryTest with SharedSQLContext {
    val ds = spark.createDataset(data)
    checkDataset(ds, data: _*)
  }

  test("special floating point values") {
    import org.scalatest.exceptions.TestFailedException

    // Spark normalizes -0.0 to 0.0, so expecting -0.0 back from a Dataset must fail
    // now that checkDataset distinguishes -0.0 from 0.0 (see the QueryTest change below).
    intercept[TestFailedException] {
      checkDataset(Seq(-0.0d).toDS(), -0.0d)
    }
    intercept[TestFailedException] {
      checkDataset(Seq(-0.0f).toDS(), -0.0f)
    }
    intercept[TestFailedException] {
      checkDataset(Seq(Tuple1(-0.0)).toDS(), Tuple1(-0.0))
    }

    val floats = Seq[Float](-0.0f, 0.0f, Float.NaN).toDS()
    checkDataset(floats, 0.0f, 0.0f, Float.NaN)

    val doubles = Seq[Double](-0.0d, 0.0d, Double.NaN).toDS()
    checkDataset(doubles, 0.0, 0.0, Double.NaN)

    checkDataset(Seq(Tuple1(Float.NaN)).toDS(), Tuple1(Float.NaN))
    checkDataset(Seq(Tuple1(-0.0f)).toDS(), Tuple1(0.0f))
    checkDataset(Seq(Tuple1(Double.NaN)).toDS(), Tuple1(Double.NaN))
    checkDataset(Seq(Tuple1(-0.0)).toDS(), Tuple1(0.0))

    val complex = Map(Array(Seq(Tuple1(Double.NaN))) -> Map(Tuple2(Float.NaN, null)))
    checkDataset(Seq(complex).toDS(), complex)
  }
}
7 changes: 7 additions & 0 deletions sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala
@@ -132,6 +132,13 @@ abstract class QueryTest extends PlanTest {
      a.length == b.length && a.zip(b).forall { case (l, r) => compare(l, r)}
    case (a: Iterable[_], b: Iterable[_]) =>
      a.size == b.size && a.zip(b).forall { case (l, r) => compare(l, r)}
    case (a: Product, b: Product) =>
      compare(a.productIterator.toSeq, b.productIterator.toSeq)
    // == considers 0.0 and -0.0 equal, so compare the raw bit patterns of
    // floats/doubles instead, to distinguish 0.0 from -0.0.
    case (a: Double, b: Double) =>
      java.lang.Double.doubleToRawLongBits(a) == java.lang.Double.doubleToRawLongBits(b)
    case (a: Float, b: Float) =>
      java.lang.Float.floatToRawIntBits(a) == java.lang.Float.floatToRawIntBits(b)
    case (a, b) => a == b
  }

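A small standalone Scala sketch (added for illustration, not part of the PR) of why `compare` switches to raw bit patterns for floats and doubles: `==` cannot tell -0.0 from 0.0 and never matches NaN with itself, while raw bits distinguish -0.0 from 0.0 and let the canonical NaN constants compare equal.

```scala
object RawBitsCompareSketch extends App {
  // IEEE 754 `==` treats -0.0 and 0.0 as equal...
  assert(-0.0d == 0.0d)
  // ...but their bit patterns differ, so a raw-bits comparison can tell them apart.
  assert(java.lang.Double.doubleToRawLongBits(-0.0d) !=
    java.lang.Double.doubleToRawLongBits(0.0d))

  // `==` never matches NaN, even against itself...
  assert(Double.NaN != Double.NaN)
  // ...while the canonical Double.NaN constant has a single bit pattern, so an
  // expected NaN and an actual NaN compare equal under raw bits.
  assert(java.lang.Double.doubleToRawLongBits(Double.NaN) ==
    java.lang.Double.doubleToRawLongBits(Double.NaN))

  println("raw bits distinguish -0.0 from 0.0 and match canonical NaN with itself")
}
```

One caveat: NaNs with non-canonical payloads would have different raw bits, but the values asserted in the test above all use the `Float.NaN`/`Double.NaN` constants.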