
[SPARK-28503][SQL] Return null result on cast an out-of-range value to a integral type #25300

Closed
wants to merge 9 commits

Conversation

gengliangwang (Member)

What changes were proposed in this pull request?

Currently, when we convert an out-of-range value to a numeric type, the result is unexpected:

scala> spark.sql("select cast(1234567890 as short)").show()
+----------------------------+
|CAST(1234567890 AS SMALLINT)|
+----------------------------+
|                         722|
+----------------------------+

The result is actually 1234567890.toShort, i.e. the low-order 16 bits of the value (1234567890 & 0xffff = 722).
The issue exists in all the integral types: Byte/Short/Int/Long. In the current implementation of Cast, if the value is too big to fit in an integral type, only the low-order bits are returned.
For the Float/Double types, the value is converted to PositiveInfinity or NegativeInfinity on overflow, so we can keep the current behavior there.
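
For illustration, the same narrowing conversions can be reproduced directly in the Scala REPL (example values chosen here for demonstration):

scala> 1234567890.toShort          // low-order 16 bits: 1234567890 & 0xffff
res0: Short = 722

scala> 1234567890.toByte           // low-order 8 bits, reinterpreted as signed
res1: Byte = -46

scala> 4294967297L.toInt           // 2^32 + 1: only the low-order 32 bits survive
res2: Int = 1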

This PR makes casting an out-of-range value to an integral type return null.
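
Conceptually, the new behavior amounts to a range check before the narrowing conversion. A minimal sketch (hypothetical helper, not the PR's exact code), shown for the Long-to-Short case:

// return the narrowed value when it fits in the target range, null otherwise
def castToShortSafe(l: Long): Any =
  if (l >= Short.MinValue && l <= Short.MaxValue) l.toShort else null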

How was this patch tested?

Unit test

@SparkQA commented Jul 30, 2019

Test build #108397 has finished for PR 25300 at commit 83190e5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 30, 2019

Test build #108399 has finished for PR 25300 at commit f140561.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-28503] Return null result on cast an out-of-range value to a integral type [SPARK-28503][SQL] Return null result on cast an out-of-range value to a integral type Jul 30, 2019
b => x.numeric.asInstanceOf[Numeric[Any]].toInt(b)
buildCast[Long](_, t => {
  val longValue = timestampToLong(t)
  if (longValue == longValue.toInt) {
@maropu (Member) Jul 31, 2019

Is there any chance it could have the same value accidentally?

@gengliangwang (Member, Author)

Do you mean: is it possible that longValue == longValue.toInt when longValue can't fit into an Int?
longValue is of Long type, so it is impossible: toInt truncates to the low-order 32 bits, and the result is widened back to Long for the comparison, so the two sides can only be equal when the value fits in an Int.
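
An illustrative REPL check (example value assumed):

scala> val longValue = 4294967297L    // 2^32 + 1, does not fit in an Int
scala> longValue == longValue.toInt   // toInt truncates to 1, then widens back to Long
res0: Boolean = false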

@maropu (Member) Jul 31, 2019

oh, right. I misunderstood it. Thanks!

case x: FloatType =>
  buildCast[Float](_, f =>
    if (f <= Int.MaxValue && f >= Int.MinValue) {
      f.toInt
(Member)

This question might be out of scope for this PR, though... don't we need rounding here?

postgres=# select CAST(3.9 AS INT);
 int4 
------
    4
(1 row)

@gengliangwang (Member, Author) Jul 31, 2019

Actually, it is tricky to compare a float and an int when the float value is around Int.MaxValue or Int.MinValue.

scala> BigDecimal((Int.MaxValue + 1L).toString).toFloat <= Int.MaxValue
res1: Boolean = true

This is because a float is also 32 bits long but uses 8 of those bits for the exponent field, while an int uses all 32 bits for the value, so the comparison is not accurate near the boundaries. We don't have to worry about rounding for Float -> Int.
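
The inaccuracy can be seen directly in the REPL: Int.MaxValue is promoted to Float for the comparison, and the nearest representable Float is exactly 2^31, so both sides of the comparison above end up as the same value:

scala> Int.MaxValue.toFloat    // rounds up to 2^31; Float has only a 24-bit significand
res0: Float = 2.14748365E9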

(Member)

Ur, I see. That behaviour looks interesting...
But what I'm worried about is that the queries below produce different output:

// this pr
scala> sql("SELECT CAST(float(3.5) AS INT)").show
+-------------------------------+
|CAST(CAST(3.5 AS FLOAT) AS INT)|
+-------------------------------+
|                              3|
+-------------------------------+

// postgresql
postgres=# SELECT CAST(float4 '3.5' AS INT);
 int4 
------
    4
(1 row)

// mysql (no float literal in mysql?)
mysql> SELECT CAST(3.5 AS SIGNED INT);
+-------------------------+
| CAST(3.5 AS SIGNED INT) |
+-------------------------+
|                       4 |
+-------------------------+
1 row in set (0.00 sec)

You mean that, since the comparison is inaccurate, this output difference is OK?

@gengliangwang (Member, Author) Jul 31, 2019

Oh I thought you were talking about the corner case.
As per the SQL standard, section 9.2:

> it is implementation-defined whether the approximation is obtained by rounding or by truncation

Spark always uses truncation, so I think we can simply follow the previous behavior here. It is out of the scope of this PR.
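
For example, assuming the pre-existing truncation behavior described above:

scala> spark.sql("SELECT CAST(3.9 AS INT)").show()
// expected: 3 under Spark's truncation, where PostgreSQL would return 4 (rounding)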

(Member)

yea, ok. thx for your check!

@@ -180,13 +180,14 @@ object Cast {

case (FloatType | DoubleType, TimestampType) => true
case (TimestampType, DateType) => false
case (TimestampType, _: IntegralType) if to != LongType => true
(Member)

Is this related to this PR?

@gengliangwang (Member, Author)

Yes, converting a Timestamp to the Byte/Short/Int types can now produce null.

@maropu (Member) commented Jul 31, 2019

I've left the same comment in #25239 (comment), but does Cast itself need to support this kind of upcast behaviour? As another option, we could check value ranges with IF expressions: master...maropu:SPARK-28503
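
A hedged sketch of that alternative (illustrative only, not the actual code in master...maropu:SPARK-28503; the column v and table t are hypothetical): the range check is expressed as an IF around an unchanged Cast:

// null on overflow via IF, leaving Cast itself untouched
spark.sql("""
  SELECT IF(v BETWEEN -32768 AND 32767, CAST(v AS SHORT), NULL) AS s
  FROM t
""")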

@SparkQA commented Jul 31, 2019

Test build #108439 has finished for PR 25300 at commit b504f2e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 31, 2019

Test build #108440 has finished for PR 25300 at commit e62551a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 31, 2019

Test build #108446 has finished for PR 25300 at commit a19535d.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang (Member, Author)

> I've left the same comment in #25239 (comment), but does Cast itself need to support this kind of upcast behaviour? As another option, we could check value ranges with IF expressions: master...maropu:SPARK-28503

@maropu I checked the code in master...maropu:SPARK-28503. It looks clean and simple; I'm not sure whether there is a performance difference. The fixes in this PR only concern casting to the Byte/Short/Int/Long types, so I think the straightforward solution in this PR is also fine.

@SparkQA commented Jul 31, 2019

Test build #108450 has finished for PR 25300 at commit e087b1a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang (Member, Author)

retest this please.

@SparkQA commented Jul 31, 2019

Test build #108462 has finished for PR 25300 at commit e087b1a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 1, 2019

Test build #108523 has finished for PR 25300 at commit 9a262c0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 1, 2019

Test build #108528 has finished for PR 25300 at commit a429869.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -1086,7 +1090,6 @@ class HiveCompatibilitySuite extends HiveQueryFileTest with BeforeAndAfter {
"udf_sum",
"udf_tan",
"udf_tinyint",
"udf_to_byte",

@gengliangwang (Member, Author)

This is within expectation. There are two cases in udf_to_byte that fail after this PR:

SELECT CAST(-129 AS TINYINT) FROM src tablesample (1 rows);
SELECT CAST(CAST(-1025 AS BIGINT) AS TINYINT) FROM src tablesample (1 rows);

The results will be null in Spark, while Hive's results are not null.
I will update the migration guide if this looks OK to you.
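
For reference, the non-null results of the old behavior (and presumably Hive's) are the JVM narrowing values, reproducible in the REPL:

scala> (-129).toByte    // wraps around: -129 + 256 = 127
res0: Byte = 127

scala> (-1025L).toByte  // low-order 8 bits are 0xFF, i.e. -1
res1: Byte = -1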

  checkEvaluation(cast(Literal(value * MICROS_PER_SECOND, TimestampType), IntegerType), value)
  checkEvaluation(cast(Literal(value * 1.0, DoubleType), IntegerType), value)
}
checkEvaluation(cast(2147483647.4f, IntegerType), 2147483647)
@maropu (Member) Aug 2, 2019

In this test, did you intend to check this case? https://github.com/apache/spark/pull/25300/files#r309043580
If so, how about describing what's tested in a comment?
Also, can you add boundary tests for null, e.g., checkEvaluation(cast(214748364?.?f, IntegerType), null)?

@@ -285,7 +285,7 @@ class ResolveGroupingAnalyticsSuite extends AnalysisTest {
GroupingSets(Seq(Seq(), Seq(unresolved_a), Seq(unresolved_a, unresolved_b)),
Seq(unresolved_a, unresolved_b), r1, Seq(unresolved_a, unresolved_b)))
val expected = Project(Seq(a, b), Sort(
-  Seq(SortOrder('aggOrder.byte.withNullability(false), Ascending)), true,
+  Seq(SortOrder('aggOrder.byte.withNullability(true), Ascending)), true,
(Member)

Why did you change this? Did you hit some test failures?

@gengliangwang (Member, Author)

In line 38:

lazy val grouping_a = Cast(ShiftRight(gid, 1) & 1, ByteType, Option(TimeZone.getDefault().getID))

The grouping expression casts an integer to a byte, so the nullability is now true.

case ShortType =>
  b => b.asInstanceOf[Short].toLong
case IntegerType =>
  b => b.asInstanceOf[Int].toLong
(Member)

Do we need the three entries above? Isn't the case x: NumericType you removed in this PR enough for those cases?

@gengliangwang (Member, Author) Aug 2, 2019

I'm not sure whether there is a performance difference here. The motivation is that there are only three types here, and handling them explicitly might slightly improve performance. @rednaxelafx @kiszk

(Member)

Ur, I see. Good idea to ask the JVM guys, hahaha

@kiszk (Member) Aug 2, 2019

In summary, I think that this slightly improves performance.

My guess (I have not seen the code generated by HotSpot) is that this cast (e.g. b.asInstanceOf[Int]) removes the table lookup for the invocation of toLong, so we can expect the code of toLong to be inlined into the caller. On the other hand, the original code (asInstanceOf[Numeric[Any]].toLong(b)) needs a table lookup to invoke toLong, so inlining is not expected to apply.
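
An illustration of the two call shapes being compared (simplified, not Spark's actual code):

// virtual dispatch through the Numeric type class: an interface call
// that needs a table lookup before toLong can be invoked
def toLongGeneric(n: Numeric[Any], b: Any): Long = n.toLong(b)

// direct primitive conversion: a checkcast plus a widening conversion
// that HotSpot can trivially inline
def toLongDirect(b: Any): Long = b.asInstanceOf[Int].toLong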

@gengliangwang (Member, Author)

cc @cloud-fan

@SparkQA commented Aug 2, 2019

Test build #108553 has finished for PR 25300 at commit 63a6e62.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

)
case FloatType =>
  buildCast[Float](_, f =>
    if (f <= Int.MaxValue && f >= Int.MinValue) {
(Member)

Why don't we use a condition similar to the others, like if (f < Int.MaxValue + 1L && f > Int.MinValue - 1L)?

@gengliangwang (Member, Author)

It is quite tricky here. I ran some corner-case tests and decided to do it this way.

scala> 2147483647.5f < Int.MaxValue + 1L
res2: Boolean = false

scala> 2147483647.5f <= Int.MaxValue 
res3: Boolean = true
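
Both results follow from Float's 24-bit significand: the literal 2147483647.5f itself rounds to 2^31 when parsed, and Int.MaxValue promoted to Float rounds to the same value, so < Int.MaxValue + 1L compares 2^31 with 2^31:

scala> 2147483647.5f                 // nearest representable Float is 2^31
res0: Float = 2.14748365E9

scala> (Int.MaxValue + 1L).toFloat   // also exactly 2^31
res1: Float = 2.14748365E9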

@@ -1150,11 +1338,45 @@ case class Cast(child: Expression, dataType: DataType, timeZoneId: Option[String
case DateType =>
(c, evPrim, evNull) => code"$evNull = true;"
case TimestampType =>
- (c, evPrim, evNull) => code"$evPrim = (byte) ${timestampToIntegerCode(c)};"
+ val longValue = ctx.freshName("longValue")
(Member)

Among castToByteCode, castToShortCode, and castToIntCode, can we refactor by introducing a helper function that takes the type name as a String together with the MaxValue and MinValue? That way we can reduce the duplicated code.
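
A hedged sketch of such a helper (hypothetical name and signature, not the PR's code), emitting the same guarded-cast shape for all three integral types:

// generates: cast when the value is within [min, max], else set the null flag
def castIntegralCode(c: String, evPrim: String, evNull: String,
    javaType: String, min: Long, max: Long): String =
  s"""
     |if ($c >= ${min}L && $c <= ${max}L) {
     |  $evPrim = ($javaType) $c;
     |} else {
     |  $evNull = true;
     |}
   """.stripMargin

// e.g. castIntegralCode(longValue, ev, isNull, "byte", Byte.MinValue, Byte.MaxValue)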

  checkEvaluation(cast(Literal(value * MICROS_PER_SECOND, TimestampType), IntegerType), value)
  checkEvaluation(cast(Literal(value * 1.0, DoubleType), IntegerType), value)
}
checkEvaluation(cast(2147483647.4f, IntegerType), 2147483647)
(Member)

Would it be better to use ????.9 as the fraction value, to show the truncation (cutoff) semantics?

@gengliangwang (Member, Author)

After consideration, I have decided to close this one and open #25461. The current behavior is actually compatible with Hive, the changes in this PR might break existing queries, and there is no similar behavior in other DBMSs.
If users care about overflow, they can enable the configuration proposed in #25461.
