[SPARK-32115][SQL] Fix SUBSTRING to handle integer overflows #28937

Closed
wants to merge 2 commits into from


xuanyuanking
Member

What changes were proposed in this pull request?

Bug fix for overflow case in UTF8String.substringSQL.

Why are the changes needed?

The SQL query SELECT SUBSTRING("abc", -1207959552, -1207959552) incorrectly returns "abc"; the expected output is "". By contrast, SUBSTRING("abc", -100, -100) correctly returns "".
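
To make the overflow concrete, here is a minimal sketch of the arithmetic, assuming the end index was computed as `start + length` in 32-bit `int` arithmetic (the thread does not show that line). The class and method names are hypothetical and use plain ints rather than `UTF8String`.

```java
public final class SubstringOverflowDemo {
    // A negative pos counts back from the end; this mirrors the `start`
    // rule described in the diff comments (assumed, not copied from Spark).
    static int start(int len, int pos) {
        return (pos > 0) ? pos - 1 : ((pos < 0) ? len + pos : 0);
    }

    // Plain int addition wraps around silently on overflow.
    static int wrappedEnd(int start, int length) {
        return start + length;
    }

    // Widening to long before adding exposes the true sum.
    static long exactEnd(int start, int length) {
        return (long) start + length;
    }

    public static void main(String[] args) {
        int len = 3; // number of characters in "abc"
        int pos = -1207959552;
        int length = -1207959552;
        int start = start(len, pos);
        System.out.println("start       = " + start);                     // -1207959549
        System.out.println("wrapped end = " + wrappedEnd(start, length)); // 1879048195
        System.out.println("exact end   = " + exactEnd(start, length));   // -2415919101
    }
}
```

Because the wrapped end is a large positive number, a range check of the form "end <= start" no longer triggers, and the whole string comes back. With -100 and -100 the sum stays in range, so the end is negative and the result is empty.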

Does this PR introduce any user-facing change?

Yes, bug fix for the overflow case.

How was this patch tested?

New UT.

@xuanyuanking
Member Author

cc @cloud-fan

@@ -341,8 +341,17 @@ public UTF8String substringSQL(int pos, int length) {
// to the -ith element before the end of the sequence. If a start index i is 0, it
// refers to the first element.
int len = numChars();
// `len + pos` does not overflow as `len >= 0`.
Member

What happens if len = 10 and pos = Integer.MIN_VALUE? I guess that start would have an incorrect value.

Member Author

A negative pos here refers to the -ith element before the end of the sequence, so if pos = Integer.MIN_VALUE, start should be pos + len, which cannot overflow because len >= 0. substring then returns EMPTY_UTF8 when its start and until parameters are both negative. I also added a UT in 4dcfe81.
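
The behavior described in this exchange can be sketched as follows. This is an illustration only, not the patch itself: it assumes the end index is computed in `long` and clamped to the `int` range, and it works on `java.lang.String` char indices instead of `UTF8String`. The names `SubstringSqlSketch`, `clampedEnd`, and `substringSql` are hypothetical.

```java
public final class SubstringSqlSketch {
    // Add in long, then clamp to the int range so the sum cannot wrap.
    static int clampedEnd(int start, int length) {
        long end = (long) start + length;
        if (end > Integer.MAX_VALUE) return Integer.MAX_VALUE;
        if (end < Integer.MIN_VALUE) return Integer.MIN_VALUE;
        return (int) end;
    }

    // One-based pos; a negative pos counts back from the end; 0 means the first element.
    static String substringSql(String s, int pos, int length) {
        int len = s.length();
        // `len + pos` does not overflow because len >= 0 and pos < 0 on that branch.
        int start = (pos > 0) ? pos - 1 : ((pos < 0) ? len + pos : 0);
        int end = clampedEnd(start, length);
        // Empty when the range is inverted, starts past the end, or ends before index 0.
        if (end <= start || start >= len || end <= 0) return "";
        return s.substring(Math.max(start, 0), Math.min(end, len));
    }

    public static void main(String[] args) {
        System.out.println("[" + substringSql("abc", -1207959552, -1207959552) + "]"); // []
        System.out.println("[" + substringSql("0123456789", Integer.MIN_VALUE, 5) + "]"); // []
        System.out.println("[" + substringSql("abc", -2, 2) + "]"); // [bc]
    }
}
```

With len = 10 and pos = Integer.MIN_VALUE, start is a large negative number rather than a wrapped positive one, so both bounds fall before index 0 and the result is empty.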

Member

I misunderstood. Thank you for the clarification and for adding a test.

Member Author

Thanks for reviewing!

@SparkQA

SparkQA commented Jun 28, 2020

Test build #124593 has finished for PR 28937 at commit 5f109a8.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 28, 2020

Test build #124596 has finished for PR 28937 at commit 4dcfe81.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Thank you, @xuanyuanking , @maropu , @kiszk .
Merged to master/3.0/2.4.

@dongjoon-hyun dongjoon-hyun changed the title from "[SPARK-32115][SQL] Incorrect results for SUBSTRING when overflow" to "[SPARK-32115][SQL] Fix SUBSTRING to handle integer overflows" on Jun 28, 2020
dongjoon-hyun pushed a commit that referenced this pull request Jun 28, 2020
### What changes were proposed in this pull request?
Bug fix for overflow case in `UTF8String.substringSQL`.

### Why are the changes needed?
SQL query `SELECT SUBSTRING("abc", -1207959552, -1207959552)` incorrectly returns `"abc"` against expected output of `""`. For query `SUBSTRING("abc", -100, -100)`, we'll get the right output of `""`.

### Does this PR introduce _any_ user-facing change?
Yes, bug fix for the overflow case.

### How was this patch tested?
New UT.

Closes #28937 from xuanyuanking/SPARK-32115.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 6484c14)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
dongjoon-hyun pushed a commit that referenced this pull request Jun 28, 2020
@dongjoon-hyun
Member

cc @gatorsmile , too.

@dongjoon-hyun
Member

cc @dbtsai since this is another correctness issue for all Spark releases.

@xuanyuanking xuanyuanking deleted the SPARK-32115 branch June 30, 2020 07:55
@xuanyuanking
Member Author

Thank you for reviewing!


5 participants