[SPARK-36081][SPARK-36066][SQL] Update the document about the behavior change of trimming characters for cast #33287
Conversation
Kubernetes integration test starting
Kubernetes integration test status success
Test build #140876 has finished for PR 33287 at commit
Kubernetes integration test starting
Kubernetes integration test status success
Force-pushed from 19dfc60 to 2330826.
Test build #140878 has finished for PR 33287 at commit
Kubernetes integration test unable to build dist. exiting with code: 1
Test build #140879 has finished for PR 33287 at commit
```diff
@@ -574,14 +574,14 @@ public UTF8String trim() {
 public UTF8String trimAll() {
   int s = 0;
   // skip all of the whitespaces (<=0x20) in the left side
-  while (s < this.numBytes && Character.isWhitespace(getByte(s))) s++;
+  while (s < this.numBytes && getByte(s) <= 0x20) s++;
```
Is either behavior more 'correct'? I'm not sure what this is trying to match. It's possible `.isWhitespace` is a better behavior, in which case the docs should change. By default I'd not change behavior unless there's a reason it's not intended.
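To make the disagreement between the two conditions concrete, here is a small standalone sketch (not Spark code; it simply applies each condition to a raw byte the way `trimAll` would). The class name is illustrative only.

```java
public class TrimConditionDemo {
    public static void main(String[] args) {
        // '\t' (0x09): both conditions agree it is trimmable whitespace
        byte tab = 0x09;
        System.out.println(Character.isWhitespace(tab) + " " + (tab <= 0x20)); // true true

        // '\b' (0x08, backspace): a control character below ASCII 32,
        // but NOT whitespace according to Character.isWhitespace
        byte backspace = 0x08;
        System.out.println(Character.isWhitespace(backspace) + " " + (backspace <= 0x20)); // false true

        // 0x81 (a UTF-8 continuation byte): negative as a signed Java byte,
        // so a bare <= 0x20 comparison wrongly classifies it as trimmable
        byte continuation = (byte) 0x81;
        System.out.println(Character.isWhitespace(continuation) + " " + (continuation <= 0x20)); // false true
    }
}
```

The backspace case is exactly where the documented behavior (`<= ASCII 32`) and the `isWhitespace` behavior diverge for user-visible casts.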
If we comply with the comment for `trimAll`, the `<= 0x20` one seems correct. Here is the implementation of `String.trim`:
https://github.com/openjdk/jdk/blob/da75f3c4ad5bdf25167a3ed80e51f567ab3dbd01/src/java.base/share/classes/java/lang/StringLatin1.java#L531-L542
https://github.com/openjdk/jdk/blob/da75f3c4ad5bdf25167a3ed80e51f567ab3dbd01/src/java.base/share/classes/java/lang/StringUTF16.java#L847-L860
Also, I noticed that characters <= 0x20 were originally trimmed, but #29375 changed the behavior. That change seems to break compatibility. `sql-migration-guide.md` says the following:
```
In Spark 3.0, when casting string value to integral types(tinyint, smallint, int and bigint),
datetime types(date, timestamp and interval) and boolean type,
the leading and trailing whitespaces (<= ASCII 32) will be trimmed before converted to these type values,
for example, `cast(' 1\t' as int)` results `1`, `cast(' 1\t' as boolean)` results `true`,
`cast('2019-10-10\t as date)` results the date value `2019-10-10`.
In Spark version 2.4 and below, when casting string to integrals and booleans,
it does not trim the whitespaces from both ends; the foregoing results is `null`,
while to datetimes, only the trailing spaces (= ASCII 32) are removed.
```
In fact, `select cast('2019-10-10\b' as date);` returns `2019-10-10` in Spark 3.0.0. But after 3.0.1, the query returns `NULL`.
Looking at #29375, it seems like the change was at least partly on purpose to catch 'whitespace' that isn't ASCII 32 or less. @WangGuangxin is this change of behavior necessary? Do we need to check for `.isWhitespace` or `<= 0x20`?
> Looking at #29375, it seems like the change was at least partly on purpose to catch 'whitespace' that isn't ASCII 32 or less

I think the purpose of that change was to handle code points which are >= 0x80 (non-ASCII). For example, `あ` is `E3 81 82` in hex in UTF-8. `getByte` returns `-127` for `0x81`, so only checking `<= 0x20` is not enough. I think this is the problem #29375 originally aimed to resolve. But it should have checked whether a byte is in the range of `0` to `0x20`, to avoid breaking compatibility.
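A quick standalone check (an illustrative demo class, not Spark code) reproduces the signed-byte pitfall described above:

```java
import java.nio.charset.StandardCharsets;

public class SignedByteDemo {
    public static void main(String[] args) {
        // "あ" encodes to E3 81 82 in UTF-8; as signed Java bytes all three are negative
        byte[] utf8 = "あ".getBytes(StandardCharsets.UTF_8);
        for (byte b : utf8) {
            // every byte passes the bare <= 0x20 test, because negative < 0x20
            System.out.printf("0x%02X as signed byte = %d, (<= 0x20) = %b%n",
                              b & 0xFF, b, b <= 0x20);
        }
        // Adding a lower bound keeps only genuine control/space bytes:
        byte b = (byte) 0x81;
        System.out.println(0 <= b && b <= 0x20); // false: continuation byte is not trimmed
    }
}
```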
cc: @cloud-fan and @yaooqinn who were involved in #29375 and #26622.
Test build #140882 has finished for PR 33287 at commit
Kubernetes integration test starting
Kubernetes integration test status success
Kubernetes integration test unable to build dist. exiting with code: 1
Test build #140883 has finished for PR 33287 at commit
I don't know if I have the full context to evaluate this, but it seems reasonable to me.
I think this is a doc issue. The intention is to trim whitespace, but the condition (<= ASCII 32) was wrong and #29375 fixed it. Shall we fix the doc/comment instead?
```diff
   // skip all of the whitespaces (<=0x20) in the left side
-  while (s < this.numBytes && Character.isWhitespace(getByte(s))) s++;
+  while (s < this.numBytes && 0 <= (currentByte = getByte(s)) && currentByte <= 0x20) s++;
```
Seems like this would be a good time to pull this logic out into a shared method? The if-condition is getting a bit more complicated and hard to read with the temporary variable.
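Had the code change been kept, the extraction might have looked something like the sketch below. This is hypothetical (the class and method names are invented, and Spark ultimately left the code untouched), shown only to illustrate the reviewer's suggestion:

```java
public class TrimHelperSketch {
    // Matches the documented rule: trim bytes in [0x00, 0x20], i.e. ASCII control
    // characters and space, without touching negative (non-ASCII) signed bytes.
    static boolean isTrimmableAsciiWhitespace(byte b) {
        return 0 <= b && b <= 0x20;
    }

    public static void main(String[] args) {
        // The trimAll loop would then read:
        //   while (s < numBytes && isTrimmableAsciiWhitespace(getByte(s))) s++;
        System.out.println(isTrimmableAsciiWhitespace((byte) ' '));  // true
        System.out.println(isTrimmableAsciiWhitespace((byte) '\t')); // true
        System.out.println(isTrimmableAsciiWhitespace((byte) 0x81)); // false
    }
}
```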
Thank you for your suggestion, but this is now being treated as a doc issue, so I've decided not to change the code.
Force-pushed from f24e6a4 to 168f3c8.
@cloud-fan I understand that trimming characters (<= ASCII 32) was not intended behavior.
Kubernetes integration test starting
Kubernetes integration test status success
Test build #140940 has finished for PR 33287 at commit
retest this please.
Kubernetes integration test unable to build dist. exiting with code: 1
Kubernetes integration test starting
Kubernetes integration test status success
Test build #140945 has finished for PR 33287 at commit
[SPARK-36081][SPARK-36066][SQL] Update the document about the behavior change of trimming characters for cast

### What changes were proposed in this pull request?

This PR modifies the comment for `UTF8String.trimAll` and `sql-migration-guide.md`. The comment for `UTF8String.trimAll` says the following:

```
Trims whitespaces ({literal <=} ASCII 32) from both ends of this string.
```

Similarly, `sql-migration-guide.md` describes the behavior of `cast` as follows:

```
In Spark 3.0, when casting string value to integral types(tinyint, smallint, int and bigint),
datetime types(date, timestamp and interval) and boolean type,
the leading and trailing whitespaces (<= ASCII 32) will be trimmed before converted to these type values,
for example, `cast(' 1\t' as int)` results `1`, `cast(' 1\t' as boolean)` results `true`,
`cast('2019-10-10\t as date)` results the date value `2019-10-10`.
In Spark version 2.4 and below, when casting string to integrals and booleans,
it does not trim the whitespaces from both ends; the foregoing results is `null`,
while to datetimes, only the trailing spaces (= ASCII 32) are removed.
```

But SPARK-32559 (#29375) changed the behavior, and only whitespace ASCII characters have been trimmed since Spark 3.0.1.

### Why are the changes needed?

To follow the previous change.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Confirmed the document built with the following command:

```
SKIP_API=1 bundle exec jekyll build
```

Closes #33287 from sarutak/fix-utf8string-trim-issue.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 57a4f31)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
thanks, merging to master/3.2/3.1/3.0