
[SPARK-30471][SQL]Fix issue when comparing String and IntegerType #34862

Closed
wants to merge 1 commit

Conversation

beryllw
Contributor

@beryllw beryllw commented Dec 10, 2021

What changes were proposed in this pull request?

When comparing StringType and IntegerType, if the string value exceeds Int.MaxValue, an exception should be thrown; otherwise the result is null.
For example, the SQL `select cast('2147483648' as int);` returns null.
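The behavior described above can be sketched outside Spark. The following is a minimal Python simulation of the non-ANSI vs ANSI cast semantics discussed in this PR; the function name and bounds are illustrative, and this is not Spark's actual implementation:

```python
INT_MIN, INT_MAX = -(2**31), 2**31 - 1

def cast_string_to_int(s, ansi=False):
    """Simulate casting a string to a 32-bit int as described in the PR.

    Non-ANSI mode returns None (SQL null) on overflow or bad input;
    ANSI mode raises instead. This mirrors the described semantics,
    not Spark's real code.
    """
    try:
        v = int(s)
    except ValueError:
        if ansi:
            raise
        return None
    if v < INT_MIN or v > INT_MAX:
        if ansi:
            raise ArithmeticError(f"casting '{s}' to int causes overflow")
        return None
    return v
```

Here `cast_string_to_int('2147483648')` silently yields None (SQL null), while passing `ansi=True` raises, which is the behavior this PR argues for.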

Why are the changes needed?

Users need to know when an overflow occurs.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added a unit test.

@github-actions github-actions bot added the SQL label Dec 10, 2021
@AmplabJenkins

Can one of the admins verify this patch?

@beryllw
Contributor Author

beryllw commented Dec 10, 2021

cc @wangyum @cloud-fan

@wangyum
Member

wangyum commented Dec 10, 2021

@Kwafoor Please enable spark.sql.ansi.enabled if you prefer an exception to be thrown.

-buildCast[UTF8String](_, s => if (s.toInt(result)) result.value else null)
+buildCast[UTF8String](_, s => {
+  if (s.toInt(result)) {
+    result.value}
Member

Brace is in the wrong place.
Do we have any reference for what the behavior should be in this case? I'm not sure whether null or an exception is correct.

Contributor Author

@beryllw beryllw Dec 12, 2021

This case arises when users run HiveSQL queries in SparkSQL: when a string value exceeding Int.MaxValue is cast to an integer, SparkSQL returns null. This causes serious trouble, because the user gets no exception yet the result is wrong. I think SparkSQL should at least throw an exception.

Member

I'm still not clear what you are saying. What does Hive do, for example?

Contributor Author

@beryllw beryllw Dec 13, 2021

In SparkSQL:
[screenshot]
and
[screenshot]

In Hive:
[screenshot]

When comparing String and IntegerType in SparkSQL with spark.sql.ansi.enabled=false (which is the default), the user gets no warning but a wrong result.
When a user runs a big, complex SQL query, it is very hard to find where the result went wrong, so the faulty SQL is hard to fix.
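To make the failure mode concrete, here is a hedged Python simulation (not Spark code) of how the silent null from an overflowing string-to-int cast makes an equality filter match nothing instead of erroring; the column values and the cast rule are illustrative:

```python
INT_MIN, INT_MAX = -(2**31), 2**31 - 1

def to_int_or_null(s: str):
    # Non-ANSI behavior per the discussion: 32-bit overflow yields
    # None (SQL null) rather than an error.
    v = int(s)
    return v if INT_MIN <= v <= INT_MAX else None

rows = ["123", "2147483648"]  # hypothetical string column values
# Analogue of `WHERE CAST(col AS INT) = 2147483648`: the overflowing
# row casts to null, and null = anything is never true in SQL, so the
# filter silently matches nothing.
matches = [s for s in rows if to_int_or_null(s) == 2147483648]
```

The query returns an empty result with no warning, which is exactly the "wrong result, no reminder" situation the author describes.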

Actually, I think there may be a problem with casting String to int without any check when comparing String and an AtomicType.
[screenshot]

Also, with spark.sql.ansi.enabled=true, SparkSQL disallows comparing String and DecimalType, and users moving from Hive to SparkSQL are unwilling to change their code.
[screenshot]
[screenshot]
[screenshot]

So I think SparkSQL should tell users where their SQL is not supported even without ANSI mode, because this case is exceptional and users should be warned.

@beryllw
Contributor Author

beryllw commented Dec 12, 2021

@Kwafoor Please enable spark.sql.ansi.enabled if you prefer an exception to be thrown.

spark.sql.ansi.enabled is the right fix for this case, but users often don't enable ANSI because they want to run HiveSQL in SparkSQL without changing their SQL.
I just want to warn users that their SQL has something wrong.

@HyukjinKwon
Member

spark.sql.ansi.enabled is the right fix for this case, but users often don't enable ANSI because they want to run HiveSQL in SparkSQL without changing their SQL.
I just want to warn users that their SQL has something wrong.

The problem here is that it will break existing users' apps when they migrate to newer versions of Spark, e.g. a query that worked before now fails. So it's better to cover this with the ANSI configuration, which users can explicitly enable.

cc @cloud-fan, @gengliangwang and @MaxGekk FYI

@HyukjinKwon
Member

@Kwafoor, before we go further, I think it would be best to double-check how ANSI defines the behaviour on overflow in string <> other-type comparisons.

@@ -652,7 +652,13 @@ abstract class CastBase extends UnaryExpression with TimeZoneAwareExpression wit
       buildCast[UTF8String](_, UTF8StringUtils.toIntExact)
     case StringType =>
       val result = new IntWrapper()
-      buildCast[UTF8String](_, s => if (s.toInt(result)) result.value else null)
+      buildCast[UTF8String](_, s => {
Member

How about other types like short, etc?

Contributor Author

I found this case when users migrated their HiveSQL to SparkSQL; users often filter a column by comparing String and IntegerType. With ANSI enabled this case behaves correctly and SparkSQL throws an exception, but users are unwilling to change their code, so ANSI is often off. In that case the query actually returns a wrong result, and I think SparkSQL should tell users where their SQL is wrong. If a user runs a big and complex query, without a thrown exception it is hard to find where the SQL went wrong.

@gengliangwang
Member

@Kwafoor the default behavior has existed for many years. It is hard to change it now.
That's also why we introduced the ANSI SQL mode as an alternative option. If users want to detect integer overflow, they can enable spark.sql.ansi.enabled.

@beryllw
Contributor Author

beryllw commented Dec 13, 2021

@Kwafoor, before we go further, I think it would be best to double-check how ANSI defines the behaviour on overflow in string <> other-type comparisons.

@Kwafoor Please enable spark.sql.ansi.enabled if you prefer an exception to be thrown.

With spark.sql.ansi.enabled enabled the behavior is correct, but ANSI is too strict about casts, so users often don't enable it.
Without a thrown exception, when overflow happens SparkSQL returns a wrong result (null) and the user gets no warning, which I think is a problem.
Without a thrown exception it is also too hard to find and fix the faulty SQL.

@beryllw
Contributor Author

beryllw commented Dec 13, 2021

@Kwafoor the default behavior has existed for many years. It is hard to change it now. That's also why we introduced the ANSI SQL mode as an alternative option. If users want to detect integer overflow, they can enable spark.sql.ansi.enabled.

I agree with you, but I think SparkSQL should warn users that their SQL is wrong rather than return null.
Users may not check their results, which can lead to huge trouble.
Making ANSI default to true is also not a good idea; it is too strict about casts.

@srowen
Member

srowen commented Dec 13, 2021

If the value is expected to be large, is it not better to explicitly cast to a larger integer type?
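The suggestion to use a wider type can be illustrated with a similar sketch: the 64-bit BIGINT range comfortably holds values that overflow 32-bit INT. Names and range checks here are illustrative, not Spark's code:

```python
INT_MAX = 2**31 - 1
LONG_MIN, LONG_MAX = -(2**63), 2**63 - 1

def cast_string_to_bigint(s: str):
    # 64-bit range check, a rough analogue of `CAST(col AS BIGINT)`.
    v = int(s)
    return v if LONG_MIN <= v <= LONG_MAX else None

# Exceeds INT_MAX, but fits easily within the BIGINT range.
value = cast_string_to_bigint("2147483648")
```

With an explicit wider cast, the value survives intact instead of degrading to null, which sidesteps the overflow debate entirely for values known to be large.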

@gengliangwang
Member

Making ANSI default to true is also not a good idea; it is too strict about casts.

I am actually trying to make the ANSI mode more lenient. Can you provide some scenarios that you think are too strict?

@cloud-fan
Contributor

cloud-fan commented Dec 13, 2021

when overflow happens SparkSQL returns a wrong result (null) and the user gets no warning, which I think is a problem.

The problem has been there for many years, and many operations/functions are affected (math operations, cast, etc.). I don't think it's justified to change string and integer comparison only (and introduce inconsistency to the system). ANSI mode is the way to go. If ANSI mode is too strict, we can relax it a bit to make it fit the real-world use cases better.

@beryllw
Contributor Author

beryllw commented Dec 13, 2021

is it not better to explicitly cast to a larger integer type?

Explicitly casting the string to a larger integer type is the right approach, but users may rely on implicit casts and see no exception (this is possible in a big query, and I have encountered it).
If they don't check the results, it leads to huge trouble. And even if they check, it is very hard to find such a SQL error (users may ignore it; for example, Hive handles this cast correctly).
So I just want to warn users when ANSI is not enabled and an overflow happens, pointing out where their SQL is wrong.
Returning null rather than an exception is too implicit, and users tend to miss it (even more so in a big query with many filters).

@beryllw
Contributor Author

beryllw commented Dec 13, 2021

when overflow happens SparkSQL returns a wrong result (null) and the user gets no warning, which I think is a problem.

The problem has been there for many years, and many operations/functions are affected (math operations, cast, etc.). I don't think it's justified to change string and integer comparison only (and introduce inconsistency to the system). ANSI mode is the way to go. If ANSI mode is too strict, we can relax it a bit to make it fit the real-world use cases better.

I agree with the ANSI approach, but in the current situation the user gets no warning at all, and that is a problem.
I hadn't considered that this could introduce inconsistency into the system.
Getting a null result instead of an exception can truly cause confusion; users may miss the SQL error rather than check and find the problem.

@cloud-fan
Contributor

I agree with the ANSI approach, but in the current situation the user gets no warning at all, and that is a problem.

We can turn on ANSI by default someday, which fixes all of these problems. I don't agree with singling out string-integer comparison and making it ANSI. What about string vs bigint/float/double/decimal comparisons? What about string +-*/ numbers? What about casting string to int/bigint/float/double/decimal/date/timestamp?

@beryllw
Contributor Author

beryllw commented Dec 13, 2021

Making ANSI default to true is also not a good idea; it is too strict about casts.

I am actually trying to make the ANSI mode more lenient. Can you provide some scenarios that you think are too strict?

I have tested ANSI on some simple cases, and I will think about it more later. If I find some scenarios, I would like to share them with you.

@beryllw
Contributor Author

beryllw commented Dec 13, 2021

I don't agree with singling out string-integer comparison and making it ANSI.

ANSI is a solution that fixes all these problems, but I still think SparkSQL should tell users where their SQL is wrong. Users coming from other platforms need time to change their SQL to accept ANSI, and they don't want to change it. So a warning that points out the faulty SQL, rather than a null result, is needed.
Or we have to suggest that users cast explicitly and change their code to accept ANSI.
But returning null is truly unobvious; it's hard to find the error.

@cloud-fan
Contributor

cloud-fan commented Dec 13, 2021

But I still think SparkSQL should tell users where their SQL is wrong.

This doesn't answer the question of why you only fix string integer comparison here.

Or we have to suggest user better to use cast explicitly

This works for me. We should discourage relying on implicit cast.

@beryllw
Contributor Author

beryllw commented Dec 13, 2021

But I still think SparkSQL should tell users where their SQL is wrong.

This doesn't answer the question of why you only fix string integer comparison here.

I think string-integer comparison is common and easy to run into; I haven't met the other cases. I have just tried them and also hit problems (long, short, etc.).
And I agree with you that it's better not to introduce inconsistency into the system.
As for math (+-*/), it behaves correctly for string and integer (SparkSQL casts the string to double type).

Maybe the title is confusing. I found the issue when comparing String and IntegerType, and the root cause is overflow in the string-to-int cast.
So I fixed the issue by throwing an exception in the cast code.
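On the arithmetic point (strings promoted to double in +-*/), here is a plain Python illustration of why that path avoids 32-bit overflow, along with the precision trade-off that doubles bring. This only mirrors the claim made in the comment; it does not verify Spark's actual behavior:

```python
# '2147483648' promoted to a double, then added: no 32-bit overflow,
# because the double range vastly exceeds Int.MaxValue.
result = float("2147483648") + 1

# The trade-off: doubles lose integer exactness beyond 2**53, so very
# large integer-valued strings round to the nearest representable double.
imprecise = float(str(2**53 + 1))  # rounds to 2.0**53
```

This is consistent with the observation that arithmetic between strings and integers does not exhibit the null-on-overflow problem that the comparison path does.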

@beryllw beryllw closed this Dec 16, 2021