[SPARK-36081][SPARK-36066][SQL] Update the document about the behavior change of trimming characters for cast #33287
Conversation
Kubernetes integration test starting
Kubernetes integration test status success
Test build #140876 has finished for PR 33287 at commit
Kubernetes integration test starting
Kubernetes integration test status success
Force-pushed from 19dfc60 to 2330826.
Test build #140878 has finished for PR 33287 at commit
Kubernetes integration test unable to build dist. exiting with code: 1
Test build #140879 has finished for PR 33287 at commit
```diff
@@ -574,14 +574,14 @@ public UTF8String trim() {
 public UTF8String trimAll() {
   int s = 0;
   // skip all of the whitespaces (<=0x20) in the left side
-  while (s < this.numBytes && Character.isWhitespace(getByte(s))) s++;
+  while (s < this.numBytes && getByte(s) <= 0x20) s++;
```
Is either behavior more 'correct'? I'm not sure what this is trying to match. It's possible `.isWhitespace` is a better behavior, in which case the docs should change. By default I'd not change behavior unless there's a reason it's not intended.
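To make the disagreement between the two conditions concrete, here is a small standalone sketch (not Spark code; it simply applies each condition to a raw byte the way `trimAll` would). The class name is illustrative only.

```java
public class TrimConditionDemo {
    public static void main(String[] args) {
        // '\t' (0x09): both conditions agree it is trimmable whitespace
        byte tab = 0x09;
        System.out.println(Character.isWhitespace(tab) + " " + (tab <= 0x20)); // true true

        // '\b' (0x08, backspace): a control character below ASCII 32,
        // but NOT whitespace according to Character.isWhitespace
        byte backspace = 0x08;
        System.out.println(Character.isWhitespace(backspace) + " " + (backspace <= 0x20)); // false true

        // 0x81 (a UTF-8 continuation byte): negative as a signed Java byte,
        // so a bare <= 0x20 comparison wrongly classifies it as trimmable
        byte continuation = (byte) 0x81;
        System.out.println(Character.isWhitespace(continuation) + " " + (continuation <= 0x20)); // false true
    }
}
```

The backspace case is exactly where the documented behavior (`<= ASCII 32`) and the `isWhitespace` behavior diverge for user-visible casts.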
If we comply with the comment for `trimAll`, the `<= 0x20` one seems correct. Here is the implementation of `String.trim`:
https://github.com/openjdk/jdk/blob/da75f3c4ad5bdf25167a3ed80e51f567ab3dbd01/src/java.base/share/classes/java/lang/StringLatin1.java#L531-L542
https://github.com/openjdk/jdk/blob/da75f3c4ad5bdf25167a3ed80e51f567ab3dbd01/src/java.base/share/classes/java/lang/StringUTF16.java#L847-L860
Also, I noticed that characters <= 0x20 were originally trimmed, but #29375 changed the behavior. That change seems to break compatibility. `sql-migration-guide.md` says the following:
```
In Spark 3.0, when casting string value to integral types(tinyint, smallint, int and bigint),
datetime types(date, timestamp and interval) and boolean type,
the leading and trailing whitespaces (<= ASCII 32) will be trimmed before converted to these type values,
for example, `cast(' 1\t' as int)` results `1`, `cast(' 1\t' as boolean)` results `true`,
`cast('2019-10-10\t as date)` results the date value `2019-10-10`.
In Spark version 2.4 and below, when casting string to integrals and booleans,
it does not trim the whitespaces from both ends; the foregoing results is `null`,
while to datetimes, only the trailing spaces (= ASCII 32) are removed.
```
In fact, `select cast('2019-10-10\b' as date);` returns `2019-10-10` in Spark 3.0.0. But after 3.0.1, the query returns `NULL`.
Looking at #29375, it seems like the change was at least partly on purpose to catch 'whitespace' that isn't ASCII 32 or less. @WangGuangxin is this change of behavior necessary? Do we need to check for `.isWhitespace` or `<= 0x20`?
> Looking at #29375, it seems like the change was at least partly on purpose to catch 'whitespace' that isn't ASCII 32 or less

I think the purpose of that change was to handle code points which are >= 0x80 (non-ASCII). For example, `あ` is `E3 81 82` in hex in UTF-8. `getByte` returns `-127` for `0x81`, so only checking `<= 0x20` is not enough. I think this is the problem #29375 originally aimed to resolve. But it should have checked whether a byte is in the range of `0` to `0x20`, to avoid breaking compatibility.
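A quick standalone check (an illustrative demo class, not Spark code) reproduces the signed-byte pitfall described above:

```java
import java.nio.charset.StandardCharsets;

public class SignedByteDemo {
    public static void main(String[] args) {
        // "あ" encodes to E3 81 82 in UTF-8; as signed Java bytes all three are negative
        byte[] utf8 = "あ".getBytes(StandardCharsets.UTF_8);
        for (byte b : utf8) {
            // every byte passes the bare <= 0x20 test, because negative < 0x20
            System.out.printf("0x%02X as signed byte = %d, (<= 0x20) = %b%n",
                              b & 0xFF, b, b <= 0x20);
        }
        // Adding a lower bound keeps only genuine control/space bytes:
        byte b = (byte) 0x81;
        System.out.println(0 <= b && b <= 0x20); // false: continuation byte is not trimmed
    }
}
```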
cc: @cloud-fan and @yaooqinn who were involved in #29375 and #26622.
Test build #140882 has finished for PR 33287 at commit
Kubernetes integration test starting
Kubernetes integration test status success
Kubernetes integration test unable to build dist. exiting with code: 1
Test build #140883 has finished for PR 33287 at commit
I don't know if I have the full context to evaluate this, but it seems reasonable to me.
I think this is a doc issue. The intention is to trim whitespace, but the condition (<= ASCII 32) was wrong and #29375 fixed it. Shall we fix the doc/comment instead?
```diff
   // skip all of the whitespaces (<=0x20) in the left side
-  while (s < this.numBytes && Character.isWhitespace(getByte(s))) s++;
+  while (s < this.numBytes && 0 <= (currentByte = getByte(s)) && currentByte <= 0x20) s++;
```
Seems like this would be a good time to pull this logic out into a shared method? The if-condition is getting a bit more complicated and hard to read with the temporary variable.
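Had the code change been kept, the extraction might have looked something like the sketch below. This is hypothetical (the class and method names are invented, and Spark ultimately left the code untouched), shown only to illustrate the reviewer's suggestion:

```java
public class TrimHelperSketch {
    // Matches the documented rule: trim bytes in [0x00, 0x20], i.e. ASCII control
    // characters and space, without touching negative (non-ASCII) signed bytes.
    static boolean isTrimmableAsciiWhitespace(byte b) {
        return 0 <= b && b <= 0x20;
    }

    public static void main(String[] args) {
        // The trimAll loop would then read:
        //   while (s < numBytes && isTrimmableAsciiWhitespace(getByte(s))) s++;
        System.out.println(isTrimmableAsciiWhitespace((byte) ' '));  // true
        System.out.println(isTrimmableAsciiWhitespace((byte) '\t')); // true
        System.out.println(isTrimmableAsciiWhitespace((byte) 0x81)); // false
    }
}
```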
Thank you for your suggestion, but this is now being treated as a doc issue, so I've decided not to change the code.
Force-pushed from f24e6a4 to 168f3c8.
@cloud-fan I understand that trimming characters (<= ASCII 32) was not intended behavior.
Kubernetes integration test starting
Kubernetes integration test status success
Test build #140940 has finished for PR 33287 at commit
retest this please.
Kubernetes integration test unable to build dist. exiting with code: 1
Kubernetes integration test starting
Kubernetes integration test status success
Test build #140945 has finished for PR 33287 at commit
[SPARK-36081][SPARK-36066][SQL] Update the document about the behavior change of trimming characters for cast

### What changes were proposed in this pull request?

This PR modifies the comment for `UTF8String.trimAll` and `sql-migration-guide.md`. The comment for `UTF8String.trimAll` says the following:

```
Trims whitespaces ({literal <=} ASCII 32) from both ends of this string.
```

Similarly, `sql-migration-guide.md` describes the behavior of `cast` as follows:

```
In Spark 3.0, when casting string value to integral types(tinyint, smallint, int and bigint),
datetime types(date, timestamp and interval) and boolean type,
the leading and trailing whitespaces (<= ASCII 32) will be trimmed before converted to these type values,
for example, `cast(' 1\t' as int)` results `1`, `cast(' 1\t' as boolean)` results `true`,
`cast('2019-10-10\t as date)` results the date value `2019-10-10`.
In Spark version 2.4 and below, when casting string to integrals and booleans,
it does not trim the whitespaces from both ends; the foregoing results is `null`,
while to datetimes, only the trailing spaces (= ASCII 32) are removed.
```

But SPARK-32559 (#29375) changed the behavior, and only whitespace ASCII characters have been trimmed since Spark 3.0.1.

### Why are the changes needed?

To follow the previous change.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Confirmed the document built with the following command:

```
SKIP_API=1 bundle exec jekyll build
```

Closes #33287 from sarutak/fix-utf8string-trim-issue.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 57a4f31)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
thanks, merging to master/3.2/3.1/3.0