[SPARK-26021][SQL][FOLLOWUP] add test for special floating point values
## What changes were proposed in this pull request?

A followup of #23043. Add a test to show the minor behavior change introduced by #23043, and add a migration guide entry.
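For context, a hypothetical sketch (not part of this commit) of the behavior change the new test pins down: after #23043, a `-0.0` value written through Spark's unsafe writers comes back as `0.0`, which is observable from `Dataset.collect`. It assumes a local SparkSession bound to `spark`, as in `spark-shell`.

```scala
// Hypothetical spark-shell sketch, not part of this commit.
import spark.implicits._

val ds = Seq(-0.0d).toDS()

// Spark 2.4 and earlier could surface the stored -0.0 through collect()/show();
// since #23043 the value is written back as 0.0, so the sign bit is gone.
ds.collect().foreach { d =>
  println(java.lang.Long.toHexString(java.lang.Double.doubleToRawLongBits(d)))
  // prints "0" rather than "8000000000000000"
}
```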

## How was this patch tested?

a new test

Closes #23141 from cloud-fan/follow.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan committed Nov 28, 2018
1 parent 8c68718 commit 09a91d9
Showing 5 changed files with 48 additions and 12 deletions.
@@ -165,10 +165,14 @@ public void writeMinusZeroIsReplacedWithZero() {
byte[] floatBytes = new byte[Float.BYTES];
Platform.putDouble(doubleBytes, Platform.BYTE_ARRAY_OFFSET, -0.0d);
Platform.putFloat(floatBytes, Platform.BYTE_ARRAY_OFFSET, -0.0f);
double doubleFromPlatform = Platform.getDouble(doubleBytes, Platform.BYTE_ARRAY_OFFSET);
float floatFromPlatform = Platform.getFloat(floatBytes, Platform.BYTE_ARRAY_OFFSET);

Assert.assertEquals(Double.doubleToLongBits(0.0d), Double.doubleToLongBits(doubleFromPlatform));
Assert.assertEquals(Float.floatToIntBits(0.0f), Float.floatToIntBits(floatFromPlatform));
byte[] doubleBytes2 = new byte[Double.BYTES];
byte[] floatBytes2 = new byte[Float.BYTES];
Platform.putDouble(doubleBytes2, Platform.BYTE_ARRAY_OFFSET, 0.0d);
Platform.putFloat(floatBytes2, Platform.BYTE_ARRAY_OFFSET, 0.0f);

// Make sure the bytes we write from 0.0 and -0.0 are the same.
Assert.assertArrayEquals(doubleBytes, doubleBytes2);
Assert.assertArrayEquals(floatBytes, floatBytes2);
}
}
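As an aside (editor's sketch, not from the diff): the byte-array assertion above works because `0.0` and `-0.0` compare equal as numbers but differ in their IEEE 754 encoding, so comparing the written bytes is what actually detects the sign bit.

```scala
// Standalone sketch, not from the commit: 0.0 and -0.0 are == but their
// encodings differ in the sign bit, which a byte-level comparison catches.
object MinusZeroBytes {
  def main(args: Array[String]): Unit = {
    def bytesOf(d: Double): Array[Byte] =
      java.nio.ByteBuffer.allocate(java.lang.Double.BYTES).putDouble(d).array()

    println(0.0d == -0.0d)                              // true
    println(bytesOf(0.0d).sameElements(bytesOf(-0.0d))) // false: sign bit differs
  }
}
```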
6 changes: 4 additions & 2 deletions docs/sql-migration-guide-upgrade.md
@@ -17,14 +17,16 @@ displayTitle: Spark SQL Upgrading Guide

- Since Spark 3.0, the `from_json` function supports two modes - `PERMISSIVE` and `FAILFAST`. The modes can be set via the `mode` option. The default mode became `PERMISSIVE`. In previous versions, the behavior of `from_json` did not conform to either `PERMISSIVE` or `FAILFAST`, especially in the processing of malformed JSON records. For example, the JSON string `{"a" 1}` with the schema `a INT` is converted to `null` by previous versions, but Spark 3.0 converts it to `Row(null)`.

- In Spark version 2.4 and earlier, the `from_json` function produces `null`s and the JSON datasource skips the input, independently of the configured mode, if there is no valid root JSON token in the input (` ` for example). Since Spark 3.0, such input is treated as a bad record and handled according to the specified mode. For example, in the `PERMISSIVE` mode the ` ` input is converted to `Row(null, null)` if the specified schema is `key STRING, value INT`.

- The `ADD JAR` command previously returned a result set with the single value 0. It now returns an empty result set.

- In Spark version 2.4 and earlier, users can create map values with a map type key via built-in functions like `CreateMap`, `MapFromArrays`, etc. Since Spark 3.0, creating map values with a map type key is not allowed with these built-in functions. Users can still read map values with a map type key from data sources or Java/Scala collections, though they are not very useful.

- In Spark version 2.4 and earlier, `Dataset.groupByKey` results in a grouped dataset with the key attribute wrongly named "value" if the key is of a non-struct type, e.g. int, string, array, etc. This is counterintuitive and makes the schema of aggregation queries weird. For example, the schema of `ds.groupByKey(...).count()` is `(value, count)`. Since Spark 3.0, we name the grouping attribute "key". The old behaviour is preserved under a newly added configuration `spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue` with a default value of `false`.

- In Spark version 2.4 and earlier, float/double -0.0 is semantically equal to 0.0, but users can still distinguish them via `Dataset.show`, `Dataset.collect` etc. Since Spark 3.0, float/double -0.0 is replaced by 0.0 internally, and users can't distinguish them any more.

## Upgrading From Spark SQL 2.3 to 2.4

- In Spark version 2.3 and earlier, the second parameter of the `array_contains` function is implicitly promoted to the element type of the first array type parameter. This type promotion can be lossy and may cause the `array_contains` function to return a wrong result. This problem has been addressed in 2.4 by employing a safer type promotion mechanism. This can cause some changes in behavior, which are illustrated in the table below.
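As a side note on the `Dataset.groupByKey` entry in the migration-guide hunk above, a hedged sketch (not part of this commit) of how the schema change shows up; it assumes a SparkSession bound to `spark`.

```scala
// Hypothetical sketch, not part of this commit.
import spark.implicits._

val counts = Seq("a", "b", "a").toDS().groupByKey(identity).count()

// Spark 2.4 and earlier name the grouping column "value":
//   |-- value: string
// Since Spark 3.0 it is named "key", unless
// spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue is set to true.
counts.printSchema()
```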
@@ -334,17 +334,11 @@ public void setLong(int ordinal, long value) {
}

public void setFloat(int ordinal, float value) {
if (Float.isNaN(value)) {
value = Float.NaN;
}
assertIndexIsValid(ordinal);
Platform.putFloat(baseObject, getElementOffset(ordinal, 4), value);
}

public void setDouble(int ordinal, double value) {
if (Double.isNaN(value)) {
value = Double.NaN;
}
assertIndexIsValid(ordinal);
Platform.putDouble(baseObject, getElementOffset(ordinal, 8), value);
}
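For background (editor's note, not part of the diff): the branches removed above collapsed every incoming NaN to the canonical `Float.NaN` / `Double.NaN` bit pattern before writing. The sketch below shows why such normalization matters when values are later compared or hashed by their raw bits: NaNs with different payloads are all NaN, yet their encodings differ.

```scala
// Standalone sketch, not from the commit: two NaNs can carry different raw
// bit patterns even though both are NaN.
object NanBits {
  def main(args: Array[String]): Unit = {
    val canonical = java.lang.Float.floatToRawIntBits(Float.NaN) // 0x7fc00000
    val other = java.lang.Float.intBitsToFloat(0x7fc00001)       // NaN with another payload

    println(other.isNaN)                                           // true
    println(java.lang.Float.floatToRawIntBits(other) == canonical) // false
  }
}
```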
@@ -393,4 +393,33 @@ class DatasetPrimitiveSuite extends QueryTest with SharedSQLContext {
val ds = spark.createDataset(data)
checkDataset(ds, data: _*)
}

test("special floating point values") {
import org.scalatest.exceptions.TestFailedException

// Spark treats -0.0 as 0.0
intercept[TestFailedException] {
checkDataset(Seq(-0.0d).toDS(), -0.0d)
}
intercept[TestFailedException] {
checkDataset(Seq(-0.0f).toDS(), -0.0f)
}
intercept[TestFailedException] {
checkDataset(Seq(Tuple1(-0.0)).toDS(), Tuple1(-0.0))
}

val floats = Seq[Float](-0.0f, 0.0f, Float.NaN).toDS()
checkDataset(floats, 0.0f, 0.0f, Float.NaN)

val doubles = Seq[Double](-0.0d, 0.0d, Double.NaN).toDS()
checkDataset(doubles, 0.0, 0.0, Double.NaN)

checkDataset(Seq(Tuple1(Float.NaN)).toDS(), Tuple1(Float.NaN))
checkDataset(Seq(Tuple1(-0.0f)).toDS(), Tuple1(0.0f))
checkDataset(Seq(Tuple1(Double.NaN)).toDS(), Tuple1(Double.NaN))
checkDataset(Seq(Tuple1(-0.0)).toDS(), Tuple1(0.0))

val complex = Map(Array(Seq(Tuple1(Double.NaN))) -> Map(Tuple2(Float.NaN, null)))
checkDataset(Seq(complex).toDS(), complex)
}
}
7 changes: 7 additions & 0 deletions sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala
@@ -132,6 +132,13 @@ abstract class QueryTest extends PlanTest {
a.length == b.length && a.zip(b).forall { case (l, r) => compare(l, r)}
case (a: Iterable[_], b: Iterable[_]) =>
a.size == b.size && a.zip(b).forall { case (l, r) => compare(l, r)}
case (a: Product, b: Product) =>
compare(a.productIterator.toSeq, b.productIterator.toSeq)
// 0.0 == -0.0, turn float/double to binary before comparison, to distinguish 0.0 and -0.0.
case (a: Double, b: Double) =>
java.lang.Double.doubleToRawLongBits(a) == java.lang.Double.doubleToRawLongBits(b)
case (a: Float, b: Float) =>
java.lang.Float.floatToRawIntBits(a) == java.lang.Float.floatToRawIntBits(b)
case (a, b) => a == b
}

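A closing note on the `QueryTest` change (editor's sketch, not from the commit): plain `==` cannot drive these checks, because `-0.0 == 0.0` is true while `Double.NaN == Double.NaN` is false; comparing raw bits gives the opposite, desired result in both cases.

```scala
// Standalone sketch, not from the commit: why the comparison uses raw bits.
object CompareSemantics {
  def main(args: Array[String]): Unit = {
    import java.lang.Double.doubleToRawLongBits

    println(-0.0d == 0.0d)                                           // true
    println(doubleToRawLongBits(-0.0d) == doubleToRawLongBits(0.0d)) // false

    println(Double.NaN == Double.NaN)                                // false
    println(doubleToRawLongBits(Double.NaN) ==
      doubleToRawLongBits(Double.NaN))                               // true
  }
}
```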
