
[SPARK-30471][SQL]Fix issue when comparing String and IntegerType #34862

Closed
wants to merge 1 commit

Conversation

beryllw
Contributor

@beryllw beryllw commented Dec 10, 2021

What changes were proposed in this pull request?

When comparing StringType and IntegerType, if the string value exceeds Int.MaxValue, an exception should be thrown; otherwise the result is null.
For example, the SQL `select cast('2147483648' as int);` returns null.
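The behavior described above can be sketched outside Spark. The following is a minimal Python simulation of the non-ANSI vs ANSI cast semantics discussed in this PR; the function name and bounds are illustrative, and this is not Spark's actual implementation:

```python
INT_MIN, INT_MAX = -(2**31), 2**31 - 1

def cast_string_to_int(s, ansi=False):
    """Simulate casting a string to a 32-bit int as described in the PR.

    Non-ANSI mode returns None (SQL null) on overflow or bad input;
    ANSI mode raises instead. This mirrors the described semantics,
    not Spark's real code.
    """
    try:
        v = int(s)
    except ValueError:
        if ansi:
            raise
        return None
    if v < INT_MIN or v > INT_MAX:
        if ansi:
            raise ArithmeticError(f"casting '{s}' to int causes overflow")
        return None
    return v
```

Here `cast_string_to_int('2147483648')` silently yields None (SQL null), while passing `ansi=True` raises, which is the behavior this PR argues for.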

Why are the changes needed?

Users need to know when an overflow occurs.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added a unit test.

@github-actions github-actions bot added the SQL label Dec 10, 2021
@AmplabJenkins

Can one of the admins verify this patch?

@beryllw
Contributor Author

beryllw commented Dec 10, 2021

cc @wangyum @cloud-fan

@wangyum
Member

wangyum commented Dec 10, 2021

@Kwafoor Please enable spark.sql.ansi.enabled if you prefer an exception to be thrown.

-buildCast[UTF8String](_, s => if (s.toInt(result)) result.value else null)
+buildCast[UTF8String](_, s => {
+  if (s.toInt(result)) {
+    result.value}
Member

Brace is in the wrong place.
Do we have any reference for what the behavior should be in this case? I'm not sure whether null or an exception is correct.

Contributor Author

@beryllw beryllw Dec 12, 2021

This case arises when users run HiveSQL queries in SparkSQL: when a string value exceeding Int.MaxValue is cast to an integer, SparkSQL returns null. This causes serious trouble, because the user gets no exception yet the result is wrong. I think SparkSQL should at least throw an exception.

Member

I'm still not clear what you are saying. What does Hive do, for example?

Contributor Author

@beryllw beryllw Dec 13, 2021

In SparkSQL:
[screenshot]
and
[screenshot]

In Hive:
[screenshot]

When comparing String and IntegerType in SparkSQL with spark.sql.ansi.enabled=false (which is the default), the user gets no warning but a wrong result.
When a user runs a big, complex SQL query, it is very hard to find where the result went wrong, so the faulty SQL is hard to fix.
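To make the failure mode concrete, here is a hedged Python simulation (not Spark code) of how the silent null from an overflowing string-to-int cast makes an equality filter match nothing instead of erroring; the column values and the cast rule are illustrative:

```python
INT_MIN, INT_MAX = -(2**31), 2**31 - 1

def to_int_or_null(s: str):
    # Non-ANSI behavior per the discussion: 32-bit overflow yields
    # None (SQL null) rather than an error.
    v = int(s)
    return v if INT_MIN <= v <= INT_MAX else None

rows = ["123", "2147483648"]  # hypothetical string column values
# Analogue of `WHERE CAST(col AS INT) = 2147483648`: the overflowing
# row casts to null, and null = anything is never true in SQL, so the
# filter silently matches nothing.
matches = [s for s in rows if to_int_or_null(s) == 2147483648]
```

The query returns an empty result with no warning, which is exactly the "wrong result, no reminder" situation the author describes.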

Actually, I think there may be a problem with casting String to int without any check when comparing String and an AtomicType.
[screenshot]

Also, with spark.sql.ansi.enabled=true, SparkSQL disallows comparing String and DecimalType, and users moving from Hive to SparkSQL are unwilling to change their code.
[screenshot]
[screenshot]
[screenshot]

So I think SparkSQL should tell users where their SQL is not supported even without ANSI mode, because this case is exceptional and users should be warned.

@beryllw
Contributor Author

beryllw commented Dec 12, 2021

@Kwafoor Please enable spark.sql.ansi.enabled if you prefer an exception to be thrown.

spark.sql.ansi.enabled is the right fix for this case, but users often don't enable ANSI because they want to run HiveSQL in SparkSQL without changing their SQL.
I just want to warn users that their SQL has something wrong.

@HyukjinKwon
Member

spark.sql.ansi.enabled is the right fix for this case, but users often don't enable ANSI because they want to run HiveSQL in SparkSQL without changing their SQL.
I just want to warn users that their SQL has something wrong.

The problem here is that it will break existing users' apps when they migrate to newer versions of Spark, e.g. a query that worked before now fails. So it's better to cover this with the ANSI configuration, which users can explicitly enable.

cc @cloud-fan, @gengliangwang and @MaxGekk FYI

@HyukjinKwon
Member

@Kwafoor, before we go further, I think it would be best to double-check how ANSI defines the behaviour on overflow in string <> other-type comparisons.

@@ -652,7 +652,13 @@ abstract class CastBase extends UnaryExpression with TimeZoneAwareExpression wit
       buildCast[UTF8String](_, UTF8StringUtils.toIntExact)
     case StringType =>
       val result = new IntWrapper()
-      buildCast[UTF8String](_, s => if (s.toInt(result)) result.value else null)
+      buildCast[UTF8String](_, s => {
Member

How about other types like short, etc?

Contributor Author

I found this case when users migrated their HiveSQL to SparkSQL; users often filter a column by comparing String and IntegerType. With ANSI enabled this case behaves correctly and SparkSQL throws an exception, but users are unwilling to change their code, so ANSI is often off. In that case the query actually returns a wrong result, and I think SparkSQL should tell users where their SQL is wrong. If a user runs a big and complex query, without a thrown exception it is hard to find where the SQL went wrong.

@gengliangwang
Member

@Kwafoor the default behavior has existed for many years. It is hard to change it now.
That's also why we introduced the ANSI SQL mode as an alternative option. If users want to detect integer overflow, they can enable spark.sql.ansi.enabled.

@beryllw
Contributor Author

beryllw commented Dec 13, 2021

@Kwafoor, before we go further, I think it would be best to double-check how ANSI defines the behaviour on overflow in string <> other-type comparisons.

@Kwafoor Please enable spark.sql.ansi.enabled if you prefer an exception to be thrown.

With spark.sql.ansi.enabled enabled the behavior is correct, but ANSI is too strict about casts, so users often don't enable it.
Without a thrown exception, when overflow happens SparkSQL returns a wrong result (null) and the user gets no warning, which I think is a problem.
Without a thrown exception it is also too hard to find and fix the faulty SQL.

@beryllw
Contributor Author

beryllw commented Dec 13, 2021

@Kwafoor the default behavior has existed for many years. It is hard to change it now. That's also why we introduced the ANSI SQL mode as an alternative option. If users want to detect integer overflow, they can enable spark.sql.ansi.enabled.

I agree with you, but I think SparkSQL should warn users that their SQL is wrong rather than return null.
Users may not check their results, which can lead to huge trouble.
Making ANSI default to true is also not a good idea; it is too strict about casts.

@srowen
Member

srowen commented Dec 13, 2021

If the value is expected to be large, is it not better to explicitly cast to a larger integer type?
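The suggestion to use a wider type can be illustrated with a similar sketch: the 64-bit BIGINT range comfortably holds values that overflow 32-bit INT. Names and range checks here are illustrative, not Spark's code:

```python
INT_MAX = 2**31 - 1
LONG_MIN, LONG_MAX = -(2**63), 2**63 - 1

def cast_string_to_bigint(s: str):
    # 64-bit range check, a rough analogue of `CAST(col AS BIGINT)`.
    v = int(s)
    return v if LONG_MIN <= v <= LONG_MAX else None

# Exceeds INT_MAX, but fits easily within the BIGINT range.
value = cast_string_to_bigint("2147483648")
```

With an explicit wider cast, the value survives intact instead of degrading to null, which sidesteps the overflow debate entirely for values known to be large.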

@gengliangwang
Member

Making ANSI default to true is also not a good idea; it is too strict about casts.

I am actually trying to make the ANSI mode more lenient. Can you provide some scenarios that you think are too strict?

@cloud-fan
Contributor

cloud-fan commented Dec 13, 2021

when overflow happens SparkSQL returns a wrong result (null) and the user gets no warning, which I think is a problem.

The problem has been there for many years, and many operations/functions are affected (math operations, cast, etc.). I don't think it's justified to change string and integer comparison only (and introduce inconsistency to the system). ANSI mode is the way to go. If ANSI mode is too strict, we can relax it a bit to make it fit the real-world use cases better.

@beryllw
Contributor Author

beryllw commented Dec 13, 2021

is it not better to explicitly cast to a larger integer type?

Explicitly casting the string to a larger integer type is the right approach, but users may rely on implicit casts and see no exception (this is possible in a big query, and I have encountered it).
If they don't check the results, it leads to huge trouble. And even if they check, it is very hard to find such a SQL error (users may ignore it; for example, Hive handles this cast correctly).
So I just want to warn users when ANSI is not enabled and an overflow happens, pointing out where their SQL is wrong.
Returning null rather than an exception is too implicit, and users tend to miss it (even more so in a big query with many filters).

@beryllw
Contributor Author

beryllw commented Dec 13, 2021

when overflow happens SparkSQL returns a wrong result (null) and the user gets no warning, which I think is a problem.

The problem has been there for many years, and many operations/functions are affected (math operations, cast, etc.). I don't think it's justified to change string and integer comparison only (and introduce inconsistency to the system). ANSI mode is the way to go. If ANSI mode is too strict, we can relax it a bit to make it fit the real-world use cases better.

I agree with the ANSI approach, but in the current situation the user gets no warning at all, and that is a problem.
I hadn't considered that this could introduce inconsistency into the system.
Getting a null result instead of an exception can truly cause confusion; users may miss the SQL error rather than check and find the problem.

@cloud-fan
Contributor

I agree with the ANSI approach, but in the current situation the user gets no warning at all, and that is a problem.

We can turn on ANSI by default someday, which fixes all of these problems. I don't agree with singling out string-integer comparison and making it ANSI. What about string vs bigint/float/double/decimal comparisons? What about string +-*/ numbers? What about casting string to int/bigint/float/double/decimal/date/timestamp?

@beryllw
Contributor Author

beryllw commented Dec 13, 2021

Making ANSI default to true is also not a good idea; it is too strict about casts.

I am actually trying to make the ANSI mode more lenient. Can you provide some scenarios that you think are too strict?

I have tested ANSI on some simple cases, and I will think about it more later. If I find some scenarios, I would like to share them with you.

@beryllw
Contributor Author

beryllw commented Dec 13, 2021

I don't agree with singling out string-integer comparison and making it ANSI.

ANSI is a solution that fixes all these problems, but I still think SparkSQL should tell users where their SQL is wrong. Users coming from other platforms need time to change their SQL to accept ANSI, and they don't want to change it. So a warning that points out the faulty SQL, rather than a null result, is needed.
Or we have to suggest that users cast explicitly and change their code to accept ANSI.
But returning null is truly unobvious; it's hard to find the error.

@cloud-fan
Contributor

cloud-fan commented Dec 13, 2021

But I still think SparkSQL should tell users where their SQL is wrong.

This doesn't answer the question of why you only fix string integer comparison here.

Or we have to suggest user better to use cast explicitly

This works for me. We should discourage relying on implicit cast.

@beryllw
Contributor Author

beryllw commented Dec 13, 2021

But I still think SparkSQL should tell users where their SQL is wrong.

This doesn't answer the question of why you only fix string integer comparison here.

I think string-integer comparison is common and easy to run into; I haven't met the other cases. I have just tried them and also hit problems (long, short, etc.).
And I agree with you that it's better not to introduce inconsistency into the system.
As for math (+-*/), it behaves correctly for string and integer (SparkSQL casts the string to double type).

Maybe the title is confusing. I found the issue when comparing String and IntegerType, and the root cause is overflow in the string-to-int cast.
So I fixed the issue by throwing an exception in the cast code.
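On the arithmetic point (strings promoted to double in +-*/), here is a plain Python illustration of why that path avoids 32-bit overflow, along with the precision trade-off that doubles bring. This only mirrors the claim made in the comment; it does not verify Spark's actual behavior:

```python
# '2147483648' promoted to a double, then added: no 32-bit overflow,
# because the double range vastly exceeds Int.MaxValue.
result = float("2147483648") + 1

# The trade-off: doubles lose integer exactness beyond 2**53, so very
# large integer-valued strings round to the nearest representable double.
imprecise = float(str(2**53 + 1))  # rounds to 2.0**53
```

This is consistent with the observation that arithmetic between strings and integers does not exhibit the null-on-overflow problem that the comparison path does.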

@beryllw beryllw closed this Dec 16, 2021