[SPARK-36081][SPARK-36066][SQL] Update the document about the behavior change of trimming characters for cast #33287

Closed

Member

@sarutak sarutak commented Jul 10, 2021

What changes were proposed in this pull request?

This PR modifies the comment for UTF8String.trimAll and sql-migration-guide.md.
The comment for UTF8String.trimAll currently reads as follows.

Trims whitespaces ({@literal <=} ASCII 32) from both ends of this string.

Similarly, sql-migration-guide.md describes the behavior of cast as follows.

In Spark 3.0, when casting string value to integral types(tinyint, smallint, int and bigint),
datetime types(date, timestamp and interval) and boolean type,
the leading and trailing whitespaces (<= ASCII 32) will be trimmed before converted to these type values,
for example, `cast(' 1\t' as int)` results `1`, `cast(' 1\t' as boolean)` results `true`,
`cast('2019-10-10\t as date)` results the date value `2019-10-10`.
In Spark version 2.4 and below, when casting string to integrals and booleans,
it does not trim the whitespaces from both ends; the foregoing results is `null`,
while to datetimes, only the trailing spaces (= ASCII 32) are removed.

However, SPARK-32559 (#29375) changed this behavior: since Spark 3.0.1, only ASCII whitespace characters are trimmed.
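The gap between the two rules can be made concrete with a short, self-contained sketch. It does not use Spark, and the class and method names are hypothetical. It lists the values in the range 0 to 0x20 that the documented `<= ASCII 32` rule would trim but that `Character.isWhitespace` does not treat as whitespace.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Spark: compares the documented rule
// (<= ASCII 32) with Character.isWhitespace over the range 0..0x20.
class AsciiTrimSets {

  // Values in 0..0x20 that "<= 0x20" would trim but Character.isWhitespace rejects.
  static List<Integer> trimmedOnlyByDocumentedRule() {
    List<Integer> result = new ArrayList<>();
    for (int b = 0; b <= 0x20; b++) {
      if (!Character.isWhitespace(b)) {
        result.add(b);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Print each affected value in hex, one per line.
    for (int b : trimmedOnlyByDocumentedRule()) {
      System.out.printf("0x%02X%n", b);
    }
  }
}
```

With the JDK definition of `Character.isWhitespace`, the list includes backspace (0x08) but not tab (0x09) or space (0x20).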

Why are the changes needed?

To bring the documentation in line with the behavior change made by SPARK-32559 (#29375).

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Confirmed that the documentation builds with the following command.

SKIP_API=1 bundle exec jekyll build

@github-actions github-actions bot added the SQL label Jul 10, 2021
SparkQA commented Jul 10, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45387/

SparkQA commented Jul 10, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45387/

SparkQA commented Jul 10, 2021

Test build #140876 has finished for PR 33287 at commit 5f62cde.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 10, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45389/

SparkQA commented Jul 10, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45389/

SparkQA commented Jul 10, 2021

Test build #140878 has finished for PR 33287 at commit 19dfc60.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 10, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45390/

SparkQA commented Jul 10, 2021

Test build #140879 has finished for PR 33287 at commit 2330826.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -574,14 +574,14 @@ public UTF8String trim() {
   public UTF8String trimAll() {
     int s = 0;
     // skip all of the whitespaces (<=0x20) in the left side
-    while (s < this.numBytes && Character.isWhitespace(getByte(s))) s++;
+    while (s < this.numBytes && getByte(s) <= 0x20) s++;
Member

Is either behavior more 'correct'? I'm not sure what this is trying to match. It's possible .isWhitespace is a better behavior, in which case the docs should change. By default I'd not change behavior unless there's a reason it's not intended.

Member Author

@sarutak sarutak Jul 10, 2021

If we follow the comment on trimAll, the <= 0x20 check seems correct. For reference, here is the implementation of String.trim:
https://github.com/openjdk/jdk/blob/da75f3c4ad5bdf25167a3ed80e51f567ab3dbd01/src/java.base/share/classes/java/lang/StringLatin1.java#L531-L542
https://github.com/openjdk/jdk/blob/da75f3c4ad5bdf25167a3ed80e51f567ab3dbd01/src/java.base/share/classes/java/lang/StringUTF16.java#L847-L860

Also, I noticed that characters <= 0x20 were originally trimmed, but #29375 changed the behavior.
That change seems to break compatibility.
sql-migration-guide.md says the following.

In Spark 3.0, when casting string value to integral types(tinyint, smallint, int and bigint),
datetime types(date, timestamp and interval) and boolean type,
the leading and trailing whitespaces (<= ASCII 32) will be trimmed before converted to these type values,
for example, `cast(' 1\t' as int)` results `1`, `cast(' 1\t' as boolean)` results `true`,
`cast('2019-10-10\t as date)` results the date value `2019-10-10`.
In Spark version 2.4 and below, when casting string to integrals and booleans,
it does not trim the whitespaces from both ends; the foregoing results is `null`,
while to datetimes, only the trailing spaces (= ASCII 32) are removed.

In fact, select cast('2019-10-10\b' as date); returns 2019-10-10 in Spark 3.0.0.
From 3.0.1 onward, however, the same query returns NULL.
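A minimal sketch of why the result differs, assuming the trimming step behaves like the two loop conditions discussed in this thread. `TrimSketch` and its members are hypothetical stand-ins, not Spark's `UTF8String`; the sketch only shows which bytes each rule strips from `'2019-10-10\b'`.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.IntPredicate;

// Hypothetical stand-in for the trimming step; not Spark's UTF8String.
class TrimSketch {

  // Rule described in the migration guide: trim every byte value in 0..0x20.
  static final IntPredicate DOCUMENTED_RULE = b -> 0 <= b && b <= 0x20;

  // Rule based on Character.isWhitespace, as in the loop condition quoted in this thread.
  static final IntPredicate WHITESPACE_RULE = b -> Character.isWhitespace(b);

  // Strips bytes matching the rule from both ends of the UTF-8 encoding of s.
  static String trimAll(String s, IntPredicate shouldTrim) {
    byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
    int start = 0;
    while (start < bytes.length && shouldTrim.test(bytes[start])) start++;
    int end = bytes.length;
    while (end > start && shouldTrim.test(bytes[end - 1])) end--;
    return new String(bytes, start, end - start, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    String input = "2019-10-10\b";
    // The documented rule removes the trailing backspace (0x08).
    System.out.println(trimAll(input, DOCUMENTED_RULE).length()); // prints 10
    // The whitespace rule keeps it, so the string still ends with \b.
    System.out.println(trimAll(input, WHITESPACE_RULE).length()); // prints 11
  }
}
```

Under the whitespace rule the trailing backspace survives trimming, which is consistent with the date cast returning NULL from 3.0.1 onward.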

Member

Looking at #29375 , it seems like the change was at least partly on purpose to catch 'whitespace' that isn't ASCII 32 or less. @WangGuangxin is this change of behavior necessary? Do we need to check for .isWhitespace or <= 0x20?

Member Author

@sarutak sarutak Jul 11, 2021

Looking at #29375 , it seems like the change was at least partly on purpose to catch 'whitespace' that isn't ASCII 32 or less

I think the purpose of that change was to handle code points that are >= 0x80 (non-ASCII).
For example, is 00 81 82 in hex in UTF-8.
getByte returns -127 for 0x81 so only checking <= 0x20 is not enough.
I think this is the problem #29375 originally aimed to resolve.

But it should have checked whether each byte is in the range 0 to 0x20 to avoid breaking compatibility.
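The signed-byte pitfall described here can be reproduced in plain Java. This is an illustrative sketch with hypothetical names. The character U+3042 is an assumed example, chosen because its UTF-8 encoding (E3 81 82) contains the byte 0x81; it is not taken from the thread.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: shows how Java's signed bytes interact with each check.
class SignedByteCheck {

  // Compares the signed byte directly, so bytes >= 0x80 (negative) also match.
  static boolean lessOrEqual0x20(byte b) {
    return b <= 0x20;
  }

  // Restricts the match to byte values in the range 0..0x20.
  static boolean inRange0To0x20(byte b) {
    return 0 <= b && b <= 0x20;
  }

  // The byte widens to int; negative values are not whitespace code points.
  static boolean isWhitespace(byte b) {
    return Character.isWhitespace(b);
  }

  public static void main(String[] args) {
    byte[] utf8 = "\u3042".getBytes(StandardCharsets.UTF_8); // E3 81 82
    byte b = utf8[1];                                        // 0x81
    System.out.println(b);                   // prints -127
    System.out.println(lessOrEqual0x20(b));  // prints true
    System.out.println(inRange0To0x20(b));   // prints false
    System.out.println(isWhitespace(b));     // prints false
  }
}
```

The plain `<= 0x20` comparison would strip the continuation byte of a multi-byte character, while both the range check and `Character.isWhitespace` leave it alone. The two differ only for control characters such as backspace (0x08), which the range check matches and `Character.isWhitespace` does not.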

Member Author

cc: @cloud-fan and @yaooqinn who were involved in #29375 and #26622.

@sarutak sarutak changed the title from "[SPARK-36066][CORE] UTF8String.trimAll doesn't comply with its specification" to "[SPARK-36081][SPARK-36066][SQL] Fix the compatibility breaking issue related to cast and UTF8String" Jul 10, 2021
SparkQA commented Jul 10, 2021

Test build #140882 has finished for PR 33287 at commit b017b34.

  • This patch fails Java style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 10, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45393/

SparkQA commented Jul 10, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45393/

SparkQA commented Jul 10, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45394/

SparkQA commented Jul 10, 2021

Test build #140883 has finished for PR 33287 at commit f24e6a4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

srowen commented Jul 12, 2021

I don't know if I have the full context to evaluate this, but it seems reasonable to me.

@cloud-fan
Contributor

\b means backspace, which is a control character that moves the cursor one character back in the console but doesn't delete it. I don't think we should trim it as whitespace.

I think this is a doc issue. The intention is to trim white spaces, but the condition (<= ASCII 32) was wrong and #29375 fixed it. Shall we fix the doc/comment instead?

     // skip all of the whitespaces (<=0x20) in the left side
-    while (s < this.numBytes && Character.isWhitespace(getByte(s))) s++;
+    while (s < this.numBytes && 0 <= (currentByte = getByte(s)) && currentByte <= 0x20) s++;
Contributor

Seems like this would be a good time to pull this logic out into a shared method? The if-condition is getting a bit more complicated and hard to read with the temporary variable.
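One way the suggested refactoring could look, as a sketch only: the range check moves into a small helper that both the left-side and right-side loops call. The class, the helper name `isAsciiControlOrSpace`, and the byte-array signature are hypothetical and are not taken from Spark's `UTF8String`.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch of the suggested refactoring; not Spark's actual code.
class SharedTrimCheck {

  // Shared predicate: true for byte values in the range 0..0x20.
  private static boolean isAsciiControlOrSpace(byte b) {
    return 0 <= b && b <= 0x20;
  }

  // Returns a copy of bytes with matching bytes removed from both ends.
  static byte[] trimAll(byte[] bytes) {
    int s = 0;
    // skip matching bytes on the left side
    while (s < bytes.length && isAsciiControlOrSpace(bytes[s])) s++;
    int e = bytes.length - 1;
    // skip matching bytes on the right side
    while (e > s && isAsciiControlOrSpace(bytes[e])) e--;
    return Arrays.copyOfRange(bytes, s, e + 1);
  }

  public static void main(String[] args) {
    byte[] trimmed = trimAll(" 1\t".getBytes(StandardCharsets.UTF_8));
    System.out.println(new String(trimmed, StandardCharsets.UTF_8)); // prints 1
  }
}
```

Keeping the comparison in one place means the loop conditions stay short and the temporary variable disappears from them.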

Member Author

Thank you for the suggestion, but this is now regarded as a documentation issue, so I've decided not to change the code.

@github-actions github-actions bot added the DOCS label Jul 12, 2021
@sarutak sarutak changed the title from "[SPARK-36081][SPARK-36066][SQL] Fix the compatibility breaking issue related to cast and UTF8String" to "[SPARK-36081][SPARK-36066][SQL] Update the document about the behavior change of trimming characters for cast" Jul 12, 2021
Member Author

sarutak commented Jul 12, 2021

@cloud-fan I understand that trimming characters (<= ASCII 32) was not the intended behavior.
I've updated the migration guide and the comment for UTF8String.trimAll.

SparkQA commented Jul 12, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45452/

SparkQA commented Jul 12, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45452/

SparkQA commented Jul 13, 2021

Test build #140940 has finished for PR 33287 at commit 168f3c8.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member Author

sarutak commented Jul 13, 2021

retest this please.

SparkQA commented Jul 13, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45457/

SparkQA commented Jul 13, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45458/

SparkQA commented Jul 13, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45458/

SparkQA commented Jul 13, 2021

Test build #140945 has finished for PR 33287 at commit 168f3c8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan cloud-fan closed this in 57a4f31 Jul 13, 2021
cloud-fan pushed a commit that referenced this pull request Jul 13, 2021
…r change of trimming characters for cast

Closes #33287 from sarutak/fix-utf8string-trim-issue.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 57a4f31)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan
Contributor

thanks, merging to master/3.2/3.1/3.0

flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021
…r change of trimming characters for cast
