
[SPARK-28503][SQL] Return null result on cast an out-of-range value to a integral type #25300

Closed
wants to merge 9 commits

Conversation

gengliangwang (Member)

What changes were proposed in this pull request?

Currently, when we convert an out-of-range value to a numeric type, the result is unexpected:

scala> spark.sql("select cast(1234567890 as short)").show()
+----------------------------+
|CAST(1234567890 AS SMALLINT)|
+----------------------------+
|                         722|
+----------------------------+

The result is actually 1234567890.toShort, i.e. the low-order 16 bits of the value (1234567890 & 0xffff = 722).
The issue exists in all the integral types: Byte/Short/Int/Long. In the current implementation of Cast, if the value is too big to fit in an integral type, only the low-order bits are returned.
For the Float/Double types, the value is converted to PositiveInfinity or NegativeInfinity on overflow, so we can keep the current behavior there.
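
For illustration, the same narrowing conversions can be reproduced directly in the Scala REPL (example values chosen here for demonstration):

scala> 1234567890.toShort          // low-order 16 bits: 1234567890 & 0xffff
res0: Short = 722

scala> 1234567890.toByte           // low-order 8 bits, reinterpreted as signed
res1: Byte = -46

scala> 4294967297L.toInt           // 2^32 + 1: only the low-order 32 bits survive
res2: Int = 1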

This PR makes casting an out-of-range value to an integral type return null.
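
Conceptually, the new behavior amounts to a range check before the narrowing conversion. A minimal sketch (hypothetical helper, not the PR's exact code), shown for the Long-to-Short case:

// return the narrowed value when it fits in the target range, null otherwise
def castToShortSafe(l: Long): Any =
  if (l >= Short.MinValue && l <= Short.MaxValue) l.toShort else null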

How was this patch tested?

Unit test

@SparkQA commented Jul 30, 2019

Test build #108397 has finished for PR 25300 at commit 83190e5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 30, 2019

Test build #108399 has finished for PR 25300 at commit f140561.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-28503] Return null result on cast an out-of-range value to a integral type [SPARK-28503][SQL] Return null result on cast an out-of-range value to a integral type Jul 30, 2019
b => x.numeric.asInstanceOf[Numeric[Any]].toInt(b)
buildCast[Long](_, t => {
  val longValue = timestampToLong(t)
  if (longValue == longValue.toInt) {
@maropu (Member) Jul 31, 2019

Is there any chance it could have the same value accidentally?

@gengliangwang (Member, Author)

Do you mean: is it possible that longValue == longValue.toInt when longValue can't fit into an Int?
longValue is of Long type, so it is impossible: toInt truncates to the low-order 32 bits, and the result is widened back to Long for the comparison, so the two sides can only be equal when the value fits in an Int.
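
An illustrative REPL check (example value assumed):

scala> val longValue = 4294967297L    // 2^32 + 1, does not fit in an Int
scala> longValue == longValue.toInt   // toInt truncates to 1, then widens back to Long
res0: Boolean = false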

@maropu (Member) Jul 31, 2019

oh, right. I misunderstood it. Thanks!

case x: FloatType =>
  buildCast[Float](_, f =>
    if (f <= Int.MaxValue && f >= Int.MinValue) {
      f.toInt
(Member)

This question might be out of scope for this PR, though... don't we need rounding here?

postgres=# select CAST(3.9 AS INT);
 int4 
------
    4
(1 row)

@gengliangwang (Member, Author) Jul 31, 2019

Actually, it is tricky to compare a float and an int when the float value is around Int.MaxValue or Int.MinValue.

scala> BigDecimal((Int.MaxValue + 1L).toString).toFloat <= Int.MaxValue
res1: Boolean = true

This is because a float is also 32 bits long but uses 8 of those bits for the exponent field, while an int uses all 32 bits for the value, so the comparison is not accurate near the boundaries. We don't have to worry about rounding for Float -> Int.
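
The inaccuracy can be seen directly in the REPL: Int.MaxValue is promoted to Float for the comparison, and the nearest representable Float is exactly 2^31, so both sides of the comparison above end up as the same value:

scala> Int.MaxValue.toFloat    // rounds up to 2^31; Float has only a 24-bit significand
res0: Float = 2.14748365E9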

(Member)

Ur, I see. That behaviour looks interesting...
But what I'm worried about is that the queries below produce different output:

// this pr
scala> sql("SELECT CAST(float(3.5) AS INT)").show
+-------------------------------+
|CAST(CAST(3.5 AS FLOAT) AS INT)|
+-------------------------------+
|                              3|
+-------------------------------+

// postgresql
postgres=# SELECT CAST(float4 '3.5' AS INT);
 int4 
------
    4
(1 row)

// mysql (no float literal in mysql?)
mysql> SELECT CAST(3.5 AS SIGNED INT);
+-------------------------+
| CAST(3.5 AS SIGNED INT) |
+-------------------------+
|                       4 |
+-------------------------+
1 row in set (0.00 sec)

You mean that, since the comparison is inaccurate, this output difference is OK?

@gengliangwang (Member, Author) Jul 31, 2019

Oh I thought you were talking about the corner case.
As per the SQL standard, section 9.2:

> it is implementation-defined whether the approximation is obtained by rounding or by truncation

Spark always uses truncation, so I think we can simply follow the previous behavior here. It is out of the scope of this PR.
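
For example, assuming the pre-existing truncation behavior described above:

scala> spark.sql("SELECT CAST(3.9 AS INT)").show()
// expected: 3 under Spark's truncation, where PostgreSQL would return 4 (rounding)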

(Member)

yea, ok. thx for your check!

@@ -180,13 +180,14 @@ object Cast {

case (FloatType | DoubleType, TimestampType) => true
case (TimestampType, DateType) => false
case (TimestampType, _: IntegralType) if to != LongType => true
(Member)

Is this related to this PR?

@gengliangwang (Member, Author)

Yes, converting a Timestamp to the Byte/Short/Int types can now produce null.

@maropu (Member) commented Jul 31, 2019

I've left the same comment in #25239 (comment), but does Cast itself need to support this kind of upcast behaviour? As another option, we could check value ranges with IF expressions: master...maropu:SPARK-28503
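
A hedged sketch of that alternative (illustrative only, not the actual code in master...maropu:SPARK-28503; the column v and table t are hypothetical): the range check is expressed as an IF around an unchanged Cast:

// null on overflow via IF, leaving Cast itself untouched
spark.sql("""
  SELECT IF(v BETWEEN -32768 AND 32767, CAST(v AS SHORT), NULL) AS s
  FROM t
""")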

@SparkQA commented Jul 31, 2019

Test build #108439 has finished for PR 25300 at commit b504f2e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 31, 2019

Test build #108440 has finished for PR 25300 at commit e62551a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 31, 2019

Test build #108446 has finished for PR 25300 at commit a19535d.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang (Member, Author)

> I've left the same comment in #25239 (comment), but does Cast itself need to support this kind of upcast behaviour? As another option, we could check value ranges with IF expressions: master...maropu:SPARK-28503

@maropu I checked the code in master...maropu:SPARK-28503. It looks clean and simple; I'm not sure whether there is a performance difference. The fixes in this PR only concern casting to the Byte/Short/Int/Long types, so I think the straightforward solution in this PR is also fine.

@SparkQA commented Jul 31, 2019

Test build #108450 has finished for PR 25300 at commit e087b1a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang (Member, Author)

retest this please.

@SparkQA commented Jul 31, 2019

Test build #108462 has finished for PR 25300 at commit e087b1a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 1, 2019

Test build #108523 has finished for PR 25300 at commit 9a262c0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 1, 2019

Test build #108528 has finished for PR 25300 at commit a429869.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -1086,7 +1090,6 @@ class HiveCompatibilitySuite extends HiveQueryFileTest with BeforeAndAfter {
"udf_sum",
"udf_tan",
"udf_tinyint",
"udf_to_byte",

@gengliangwang (Member, Author)

This is within expectation. There are two cases in udf_to_byte that fail after this PR:

SELECT CAST(-129 AS TINYINT) FROM src tablesample (1 rows);
SELECT CAST(CAST(-1025 AS BIGINT) AS TINYINT) FROM src tablesample (1 rows);

The results will be null in Spark, while Hive's results are not null.
I will update the migration guide if this looks OK to you.
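
For reference, the non-null results of the old behavior (and presumably Hive's) are the JVM narrowing values, reproducible in the REPL:

scala> (-129).toByte    // wraps around: -129 + 256 = 127
res0: Byte = 127

scala> (-1025L).toByte  // low-order 8 bits are 0xFF, i.e. -1
res1: Byte = -1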

  checkEvaluation(cast(Literal(value * MICROS_PER_SECOND, TimestampType), IntegerType), value)
  checkEvaluation(cast(Literal(value * 1.0, DoubleType), IntegerType), value)
}
checkEvaluation(cast(2147483647.4f, IntegerType), 2147483647)
@maropu (Member) Aug 2, 2019

In this test, did you intend to check this case? https://github.com/apache/spark/pull/25300/files#r309043580
If so, how about describing what's tested in a comment?
Also, can you add boundary tests for null, e.g., checkEvaluation(cast(214748364?.?f, IntegerType), null)?

@@ -285,7 +285,7 @@ class ResolveGroupingAnalyticsSuite extends AnalysisTest {
GroupingSets(Seq(Seq(), Seq(unresolved_a), Seq(unresolved_a, unresolved_b)),
Seq(unresolved_a, unresolved_b), r1, Seq(unresolved_a, unresolved_b)))
val expected = Project(Seq(a, b), Sort(
-  Seq(SortOrder('aggOrder.byte.withNullability(false), Ascending)), true,
+  Seq(SortOrder('aggOrder.byte.withNullability(true), Ascending)), true,
(Member)

Why did you change this? Did you hit some test failures?

@gengliangwang (Member, Author)

In line 38:

lazy val grouping_a = Cast(ShiftRight(gid, 1) & 1, ByteType, Option(TimeZone.getDefault().getID))

The grouping expression casts an integer to a byte, so the nullability is now true.

case ShortType =>
  b => b.asInstanceOf[Short].toLong
case IntegerType =>
  b => b.asInstanceOf[Int].toLong
(Member)

Do we need the three entries above? Isn't the case x: NumericType you removed in this PR enough for those cases?

@gengliangwang (Member, Author) Aug 2, 2019

I'm not sure whether there is a performance difference here. The motivation is that there are only three types here, and handling them explicitly might slightly improve performance. @rednaxelafx @kiszk

(Member)

Ur, I see. Good idea to ask the JVM guys, hahaha

@kiszk (Member) Aug 2, 2019

In summary, I think that this slightly improves performance.

My guess (I have not seen the code generated by HotSpot) is that this cast (e.g. b.asInstanceOf[Int]) removes the table lookup for the invocation of toLong, so we can expect the code of toLong to be inlined into the caller. On the other hand, the original code (asInstanceOf[Numeric[Any]].toLong(b)) needs a table lookup to invoke toLong, so inlining is not expected to apply.
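
An illustration of the two call shapes being compared (simplified, not Spark's actual code):

// virtual dispatch through the Numeric type class: an interface call
// that needs a table lookup before toLong can be invoked
def toLongGeneric(n: Numeric[Any], b: Any): Long = n.toLong(b)

// direct primitive conversion: a checkcast plus a widening conversion
// that HotSpot can trivially inline
def toLongDirect(b: Any): Long = b.asInstanceOf[Int].toLong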

@gengliangwang (Member, Author)

cc @cloud-fan

@SparkQA commented Aug 2, 2019

Test build #108553 has finished for PR 25300 at commit 63a6e62.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

)
case FloatType =>
  buildCast[Float](_, f =>
    if (f <= Int.MaxValue && f >= Int.MinValue) {
(Member)

Why don't we use a condition similar to the others, like if (f < Int.MaxValue + 1L && f > Int.MinValue - 1L)?

@gengliangwang (Member, Author)

It is quite tricky here. I ran some corner-case tests and decided to do it this way.

scala> 2147483647.5f < Int.MaxValue + 1L
res2: Boolean = false

scala> 2147483647.5f <= Int.MaxValue 
res3: Boolean = true
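
Both results follow from Float's 24-bit significand: the literal 2147483647.5f itself rounds to 2^31 when parsed, and Int.MaxValue promoted to Float rounds to the same value, so < Int.MaxValue + 1L compares 2^31 with 2^31:

scala> 2147483647.5f                 // nearest representable Float is 2^31
res0: Float = 2.14748365E9

scala> (Int.MaxValue + 1L).toFloat   // also exactly 2^31
res1: Float = 2.14748365E9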

@@ -1150,11 +1338,45 @@ case class Cast(child: Expression, dataType: DataType, timeZoneId: Option[String
case DateType =>
(c, evPrim, evNull) => code"$evNull = true;"
case TimestampType =>
- (c, evPrim, evNull) => code"$evPrim = (byte) ${timestampToIntegerCode(c)};"
+ val longValue = ctx.freshName("longValue")
(Member)

Among castToByteCode, castToShortCode, and castToIntCode, can we refactor by introducing a helper function that takes the type name as a String together with the MaxValue and MinValue? That way we can reduce the duplicated code.
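
A hedged sketch of such a helper (hypothetical name and signature, not the PR's code), emitting the same guarded-cast shape for all three integral types:

// generates: cast when the value is within [min, max], else set the null flag
def castIntegralCode(c: String, evPrim: String, evNull: String,
    javaType: String, min: Long, max: Long): String =
  s"""
     |if ($c >= ${min}L && $c <= ${max}L) {
     |  $evPrim = ($javaType) $c;
     |} else {
     |  $evNull = true;
     |}
   """.stripMargin

// e.g. castIntegralCode(longValue, ev, isNull, "byte", Byte.MinValue, Byte.MaxValue)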

  checkEvaluation(cast(Literal(value * MICROS_PER_SECOND, TimestampType), IntegerType), value)
  checkEvaluation(cast(Literal(value * 1.0, DoubleType), IntegerType), value)
}
checkEvaluation(cast(2147483647.4f, IntegerType), 2147483647)
(Member)

Would it be better to use ????.9 as the fraction value, to show the truncation (cutoff) semantics?

@gengliangwang (Member, Author)

After consideration, I have decided to close this one and open #25461. The current behavior is actually compatible with Hive, the changes in this PR might break existing queries, and there is no similar behavior in other DBMSs.
If users care about overflow, they can enable the configuration proposed in #25461.
