[SPARK-28503][SQL] Return null result on cast an out-of-range value to an integral type #25300
Conversation
Test build #108397 has finished for PR 25300 at commit
Test build #108399 has finished for PR 25300 at commit
b => x.numeric.asInstanceOf[Numeric[Any]].toInt(b)
buildCast[Long](_, t => {
  val longValue = timestampToLong(t)
  if (longValue == longValue.toInt) {
Is there no chance they have the same value accidentally?
Do you mean: is it possible that longValue == longValue.toInt when longValue can't fit into an Int? longValue is of Long type, so it is impossible.
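A minimal sketch of the round-trip check being discussed (plain Scala, not the Spark source): converting to Int and back only preserves the value when it fits in the Int range.
// Values are arbitrary examples chosen for illustration.
val inRange = 42L
val tooLarge = Int.MaxValue + 1L
println(inRange == inRange.toInt)   // true  -> the value fits, safe to cast
println(tooLarge == tooLarge.toInt) // false -> toInt wrapped around, so the cast would overflow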
oh, right. I misunderstood it. Thanks!
case x: FloatType =>
  buildCast[Float](_, f =>
    if (f <= Int.MaxValue && f >= Int.MinValue) {
      f.toInt
This question might be out of scope for this PR, though... don't we need to round up here?
postgres=# select CAST(3.9 AS INT);
int4
------
4
(1 row)
Actually, it is tricky to compare a float and an int if the float value is around Int.MaxValue or Int.MinValue.
scala> BigDecimal((Int.MaxValue + 1L).toString).toFloat <= Int.MaxValue
res1: Boolean = true
This is because a float is also only 32 bits long and uses 8 of those bits for the exponent field, while an int uses all 32 bits for the value, so the comparison won't be accurate. We don't have to worry about rounding for Float -> Int.
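To illustrate the comparison issue (plain Scala, not Spark code): the float significand is too narrow to represent Int.MaxValue exactly, so it rounds up to 2^31.
println(Int.MaxValue.toFloat)                                 // 2.14748365E9, i.e. 2147483648.0f
println(Int.MaxValue.toFloat == (Int.MaxValue + 1L).toFloat)  // true: both round to 2^31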
Ur, I see. That behaviour looks interesting...
But what I'm worried about is that the query below has different output:
// this pr
scala> sql("SELECT CAST(float(3.5) AS INT)").show
+-------------------------------+
|CAST(CAST(3.5 AS FLOAT) AS INT)|
+-------------------------------+
| 3|
+-------------------------------+
// postgresql
postgres=# SELECT CAST(float4 '3.5' AS INT);
int4
------
4
(1 row)
// mysql (no float literal in mysql?)
mysql> SELECT CAST(3.5 AS SIGNED INT);
+-------------------------+
| CAST(3.5 AS SIGNED INT) |
+-------------------------+
| 4 |
+-------------------------+
1 row in set (0.00 sec)
You mean that, since that comparison is inaccurate, this output difference is ok?
Oh, I thought you were talking about the corner case.
As per the SQL standard, section 9.2: "it is implementation-defined whether the approximation is obtained by rounding or by truncation".
Spark always uses truncation, so I think we can simply follow the previous behavior here. It is out of the scope of this PR.
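A small illustration of the two choices (plain Scala, not Spark code); Spark's cast truncates toward zero, whereas the PostgreSQL/MySQL outputs above reflect rounding.
val f = 3.5f
println(f.toInt)        // 3 -- truncation toward zero (what Spark does)
println(math.round(f))  // 4 -- rounding (matches the PostgreSQL/MySQL output above)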
yea, ok. thx for your check!
@@ -180,13 +180,14 @@ object Cast {
  case (FloatType | DoubleType, TimestampType) => true
  case (TimestampType, DateType) => false
+ case (TimestampType, _: IntegralType) if to != LongType => true
Is this related to this pr?
Yes, converting Timestamp to the Byte/Short/Int types can now produce null.
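For example (plain Scala with an assumed timestamp value, not the Spark source): the seconds-since-epoch value of a far-future timestamp does not fit in an Int, so the cast result can be null.
val epochSeconds = 4102444800L        // 2100-01-01 00:00:00 UTC as seconds since the epoch (example value)
println(epochSeconds > Int.MaxValue)  // true -> casting such a timestamp to INT would be null under this PR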
I've left the same comment in #25239 (comment), though.
Test build #108439 has finished for PR 25300 at commit
Test build #108440 has finished for PR 25300 at commit
Test build #108446 has finished for PR 25300 at commit
@maropu I checked the code in master...maropu:SPARK-28503. It seems clean and simple, and I am not sure if there is a difference in performance. The fixes in this PR only concern casting to the byte/short/int/long types, so I think the straightforward solution in this PR is also fine.
Test build #108450 has finished for PR 25300 at commit
retest this please.
Test build #108462 has finished for PR 25300 at commit
Test build #108523 has finished for PR 25300 at commit
Test build #108528 has finished for PR 25300 at commit
@@ -1086,7 +1090,6 @@ class HiveCompatibilitySuite extends HiveQueryFileTest with BeforeAndAfter {
  "udf_sum",
  "udf_tan",
  "udf_tinyint",
- "udf_to_byte",
Can we change this behaviour? cc: @gatorsmile @dongjoon-hyun @wangyum
If we do so, we need to update the guide, too:
https://github.com/apache/spark/blob/master/docs/sql-migration-guide-hive-compatibility.md
https://github.com/apache/spark/blob/master/docs/sql-migration-guide.md
This is within expectation. There are two cases in udf_to_byte that failed after this PR:
SELECT CAST(-129 AS TINYINT) FROM src tablesample (1 rows);
SELECT CAST(CAST(-1025 AS BIGINT) AS TINYINT) FROM src tablesample (1 rows);
The results will be null for Spark, while the results in Hive are not null.
I will update the migration guide if this looks OK to you.
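As a plain-Scala illustration of the first case (not the Spark or Hive code itself): a signed byte ranges over -128..127, so -129 wraps around under the old behavior.
println((-129).toByte)  // 127 -- wrap-around to the low-order 8 bits (the pre-PR, Hive-style result)
// Under this PR, CAST(-129 AS TINYINT) returns null instead.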
  checkEvaluation(cast(Literal(value * MICROS_PER_SECOND, TimestampType), IntegerType), value)
  checkEvaluation(cast(Literal(value * 1.0, DoubleType), IntegerType), value)
}
checkEvaluation(cast(2147483647.4f, IntegerType), 2147483647)
In this test, did you try to check this case? https://github.com/apache/spark/pull/25300/files#r309043580
If so, how about describing what's tested in comments?
Also, can you add boundary tests for null, e.g., checkEvaluation(cast(214748364?.?f, IntegerType), null)?
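For context (plain Scala, not the Spark test helpers): the literal 2147483647.4f cannot be represented exactly and is stored as 2147483648.0f, and the JVM saturates float-to-int conversion, which is consistent with the expected value of 2147483647 in the test above.
println(2147483647.4f)        // 2.14748365E9 -- actually stored as 2147483648.0f
println(2147483647.4f.toInt)  // 2147483647  -- float-to-int conversion clamps at Int.MaxValue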
@@ -285,7 +285,7 @@ class ResolveGroupingAnalyticsSuite extends AnalysisTest {
  GroupingSets(Seq(Seq(), Seq(unresolved_a), Seq(unresolved_a, unresolved_b)),
    Seq(unresolved_a, unresolved_b), r1, Seq(unresolved_a, unresolved_b)))
  val expected = Project(Seq(a, b), Sort(
-   Seq(SortOrder('aggOrder.byte.withNullability(false), Ascending)), true,
+   Seq(SortOrder('aggOrder.byte.withNullability(true), Ascending)), true,
Why did you change this? Did you hit some test failures?
In line 38:
lazy val grouping_a = Cast(ShiftRight(gid, 1) & 1, ByteType, Option(TimeZone.getDefault().getID))
The grouping expression casts an integer to a byte, so the nullability is now true.
case ShortType =>
  b => b.asInstanceOf[Short].toLong
case IntegerType =>
  b => b.asInstanceOf[Int].toLong
Do we need the three entries above? Isn't the case x: NumericType you removed in this PR enough for those cases?
Not sure if there is a performance difference here. The motivation is that there are only three types here, and handling them explicitly might slightly improve performance. @rednaxelafx @kiszk
Ur, I see. Good idea to ask the JVM guys, hahaha
In summary, I think that this slightly improves performance.
My guess (I have not seen the code generated by HotSpot) is that this cast (e.g. b.asInstanceOf[Int]) removes the table lookup for the invocation of toLong. As a result, we expect the code in toLong to be inlined into the caller. On the other hand, the original code (asInstanceOf[Numeric[Any]].toLong(b)) needs a table lookup to invoke toLong, so inlining is not expected to apply there.
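A minimal, self-contained sketch of the two dispatch styles being compared (the function names are made up for illustration; this is not the Spark source):
import scala.math.Numeric

// Direct primitive conversion: the call target is statically known, so the JIT can inline it.
def toLongDirect(b: Any): Long = b.asInstanceOf[Int].toLong

// Dispatch through Numeric[Any]: toLong is a virtual call resolved via a table lookup.
def toLongViaNumeric(b: Any, n: Numeric[Any]): Long = n.toLong(b)

println(toLongDirect(42))                                                          // 42
println(toLongViaNumeric(42, implicitly[Numeric[Int]].asInstanceOf[Numeric[Any]])) // 42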
cc @cloud-fan
Test build #108553 has finished for PR 25300 at commit
)
case FloatType =>
  buildCast[Float](_, f =>
    if (f <= Int.MaxValue && f >= Int.MinValue) {
Why don't we use a condition similar to the others, like if (f < Int.MaxValue + 1L && f > Int.MinValue - 1L)?
It is quite tricky here. I did some corner-case tests and decided to do it this way.
scala> 2147483647.5f < Int.MaxValue + 1L
res2: Boolean = false
scala> 2147483647.5f <= Int.MaxValue
res3: Boolean = true
@@ -1150,11 +1338,45 @@ case class Cast(child: Expression, dataType: DataType, timeZoneId: Option[String
  case DateType =>
    (c, evPrim, evNull) => code"$evNull = true;"
  case TimestampType =>
-   (c, evPrim, evNull) => code"$evPrim = (byte) ${timestampToIntegerCode(c)};"
+   val longValue = ctx.freshName("longValue")
Among castToByteCode, castToShortCode, and castToIntCode, can we refactor by introducing a helper function that takes the type name as a String plus MaxValue and MinValue as arguments? That way we can reduce the duplicated code.
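A rough, self-contained sketch of the suggested helper (plain Scala with made-up names; the real code would use Spark's CodegenContext and code interpolator rather than raw strings):
// One helper generates the range-checked cast snippet shared by the byte/short/int code paths.
def castIntegralCode(javaType: String, longValue: String, evPrim: String, evNull: String,
                     minValue: Long, maxValue: Long): String =
  s"""if ($longValue >= ${minValue}L && $longValue <= ${maxValue}L) {
     |  $evPrim = ($javaType) $longValue;
     |} else {
     |  $evNull = true;
     |}""".stripMargin

// e.g. the byte variant:
println(castIntegralCode("byte", "longValue", "result", "isNull", Byte.MinValue, Byte.MaxValue))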
  checkEvaluation(cast(Literal(value * MICROS_PER_SECOND, TimestampType), IntegerType), value)
  checkEvaluation(cast(Literal(value * 1.0, DoubleType), IntegerType), value)
}
checkEvaluation(cast(2147483647.4f, IntegerType), 2147483647)
Is it better to use ????.9 for the fraction value, to show the cutoff (truncation) semantics?
After consideration, I have decided to close this one and open #25461. The current behavior is actually compatible with Hive. The changes in this PR might break existing queries, and there is no similar behavior in other DBMSs.
What changes were proposed in this pull request?
Currently, when we cast an out-of-range value to an integral type, the resulting value is unexpected. For example, casting 1234567890 to ShortType actually returns 1234567890.toShort, i.e. (1234567890 & 0xffff). The issue exists for all the integral types: Byte/Short/Int/Long. In the current implementation of Cast, if the value is too big to fit in the target integral type, only the low-order bits are returned. For the Float/Double types, the value is converted to PositiveInfinity or NegativeInfinity on overflow, and we can keep that behavior. This PR makes casts of out-of-range values to integral types return null.
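A plain-Scala illustration of the wrap-around described above (not Spark code; the final comment describes the proposed behavior, not an actual query output):
println(1234567890.toShort)   // 722 -- only the low-order 16 bits survive
println(1234567890 & 0xffff)  // 722 -- the same low-order bits
// Under this PR, CAST(1234567890 AS SHORT) would return null instead of 722.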
How was this patch tested?
Unit test