-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-32115][SQL] Fix SUBSTRING to handle integer overflows #28937
Conversation
cc @cloud-fan |
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
Outdated
Show resolved
Hide resolved
@@ -341,8 +341,17 @@ public UTF8String substringSQL(int pos, int length) { | |||
// to the -ith element before the end of the sequence. If a start index i is 0, it | |||
// refers to the first element. | |||
int len = numChars(); | |||
// `len + pos` does not overflow as `len >= 0`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if len = 10
and pos = Integer.MIN_VALUE
. I guess that start
would have an incorrect value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The negative pos
here refers to the -ith element before the end of the sequence
, so if pos = Integer.MIN_VALUE, then the start should be pos + len
. The final result of EMPTY_UTF8
will be returned by substring
when its param start and until are both negative. I also added a UT in 4dcfe81.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made misunderstaning. Thank you for clarification and adding a test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for reviewing!
Test build #124593 has finished for PR 28937 at commit
|
Test build #124596 has finished for PR 28937 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM. Thank you, @xuanyuanking , @maropu , @kiszk .
Merged to master/3.0/2.4.
### What changes were proposed in this pull request? Bug fix for overflow case in `UTF8String.substringSQL`. ### Why are the changes needed? SQL query `SELECT SUBSTRING("abc", -1207959552, -1207959552)` incorrectly returns` "abc"` against expected output of `""`. For query `SUBSTRING("abc", -100, -100)`, we'll get the right output of `""`. ### Does this PR introduce _any_ user-facing change? Yes, bug fix for the overflow case. ### How was this patch tested? New UT. Closes #28937 from xuanyuanking/SPARK-32115. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 6484c14) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
Bug fix for overflow case in `UTF8String.substringSQL`. SQL query `SELECT SUBSTRING("abc", -1207959552, -1207959552)` incorrectly returns` "abc"` against expected output of `""`. For query `SUBSTRING("abc", -100, -100)`, we'll get the right output of `""`. Yes, bug fix for the overflow case. New UT. Closes #28937 from xuanyuanking/SPARK-32115. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 6484c14) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
cc @gatorsmile , too. |
cc @dbtsai since this is another correctness issue for all Spark releases. |
Thank you for reviewing! |
What changes were proposed in this pull request?
Bug fix for overflow case in
UTF8String.substringSQL
.Why are the changes needed?
SQL query
SELECT SUBSTRING("abc", -1207959552, -1207959552)
incorrectly returns"abc"
against expected output of""
. For querySUBSTRING("abc", -100, -100)
, we'll get the right output of""
.Does this PR introduce any user-facing change?
Yes, bug fix for the overflow case.
How was this patch tested?
New UT.