[SPARK-35359][SQL] Insert data with char/varchar datatype will fail when data length exceed length limitation #32501

fhygh · 2021-05-11T09:16:56Z

What changes were proposed in this pull request?

This PR fixes the following bug:

set spark.sql.legacy.charVarcharAsString=true;
create table chartb01(a char(3));
insert into chartb01 select 'aaaaa';

Here we expect table chartb01 to contain 'aaa', but the insert fails.

Why are the changes needed?

To improve backward compatibility. The failure looks like this:

spark-sql>
         > create table tchar01(col char(2)) using parquet;
Time taken: 0.767 seconds
spark-sql>
         > insert into tchar01 select 'aaa';
ERROR | Executor task launch worker for task 0.0 in stage 0.0 (TID 0) | Aborting task | org.apache.spark.util.Utils.logError(Logging.scala:94)
java.lang.RuntimeException: Exceeds char/varchar type length limitation: 2
        at org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils.trimTrailingSpaces(CharVarcharCodegenUtils.java:31)
        at org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils.charTypeWriteSideCheck(CharVarcharCodegenUtils.java:44)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_doConsume_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:279)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1500)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:288)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:212)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1466)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
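
The top two frames of the trace (`charTypeWriteSideCheck` calling `trimTrailingSpaces`) suggest how the error arises. The following is a minimal sketch of that behaviour, not Spark's code: it works on `java.lang.String` instead of Spark's internal string type, the class name `CharWriteSideCheckSketch` is hypothetical, and the padding of short values is an assumption about CHAR semantics. Only the two method names and the error message are taken from the log above.

```java
final class CharWriteSideCheckSketch {

    // Simplified model of a CHAR(limit) write-side check.
    // Lengths are counted with String.length(), which is a simplification.
    static String charTypeWriteSideCheck(String input, int limit) {
        int length = input.length();
        if (length == limit) {
            return input;
        }
        if (length < limit) {
            // Assumption: short values are right-padded with spaces to the declared length.
            StringBuilder padded = new StringBuilder(input);
            while (padded.length() < limit) {
                padded.append(' ');
            }
            return padded.toString();
        }
        return trimTrailingSpaces(input, limit);
    }

    // Drops trailing spaces beyond the limit; any other overflow is an error.
    static String trimTrailingSpaces(String input, int limit) {
        int end = input.length();
        while (end > limit && input.charAt(end - 1) == ' ') {
            end--;
        }
        if (end > limit) {
            throw new RuntimeException(
                "Exceeds char/varchar type length limitation: " + limit);
        }
        return input.substring(0, end);
    }

    public static void main(String[] args) {
        // Over-length only because of trailing spaces: accepted after trimming.
        System.out.println("[" + charTypeWriteSideCheck("a   ", 2) + "]");
        // Over-length with real characters, like 'aaa' into char(2): rejected.
        try {
            charTypeWriteSideCheck("aaa", 2);
        } catch (RuntimeException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this model, inserting 'aaa' into a char(2) column can only succeed if the check is never run, which is why the fix has to act before this code is reached.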

Does this PR introduce any user-facing change?

No (the legacy config is false by default).

How was this patch tested?

Added unit tests.

AmplabJenkins · 2021-05-11T09:18:21Z

Can one of the admins verify this patch?

HyukjinKwon · 2021-05-11T11:35:51Z

cc @yaooqinn @cloud-fan FYI

cloud-fan · 2021-05-11T12:11:13Z

We should update TableOutputResolver.checkField to not add the string length check expression if the legacy config is true.

yaooqinn · 2021-05-11T12:38:17Z

> We should update TableOutputResolver.checkField to not add the string length check expression if the legacy config is true.

+1
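
The suggestion above can be sketched as a planning-time decision: when the legacy flag is set, the length check is never attached to the output column, so no check runs at write time. This is a hedged illustration only. `TableOutputResolver.checkField` is Scala code in Spark and is not reproduced here; the class `LengthCheckPlanningSketch`, the methods `lengthCheck` and `resolveOutputColumn`, and the use of `UnaryOperator<String>` as a stand-in for an expression are all hypothetical. The check itself is simplified and does not trim trailing spaces.

```java
import java.util.function.UnaryOperator;

final class LengthCheckPlanningSketch {

    // Hypothetical stand-in for the expression that enforces the declared length.
    static UnaryOperator<String> lengthCheck(int limit) {
        return value -> {
            if (value.length() > limit) {
                throw new RuntimeException(
                    "Exceeds char/varchar type length limitation: " + limit);
            }
            return value;
        };
    }

    // Hypothetical model of the suggested change: in legacy mode the
    // column expression is returned as-is, with no length check attached.
    static UnaryOperator<String> resolveOutputColumn(int limit, boolean legacyCharVarcharAsString) {
        if (legacyCharVarcharAsString) {
            return UnaryOperator.identity();
        }
        return lengthCheck(limit);
    }

    public static void main(String[] args) {
        // Legacy mode: the value reaches the writer without being checked.
        System.out.println(resolveOutputColumn(3, true).apply("aaaaa"));
        // Default mode: the attached check rejects the over-length value.
        try {
            resolveOutputColumn(3, false).apply("aaaaa");
        } catch (RuntimeException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The sketch only models whether the check runs. It does not model what the underlying table format then does with an over-length value.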

cloud-fan · 2021-05-17T16:13:38Z

thanks, merging to master/3.1!

Closes #32501 from fhygh/master.

Authored-by: fhygh <283452027@qq.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 3a3f8ca)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

What changes were proposed in this pull request?

This is a followup of #32501. It moves a general char/varchar test from the file source suite to the base char/varchar suite, so that it will be verified in all table formats, including v2.

Why are the changes needed?

Improve test coverage.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

N/A

Closes #37152 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>

[SPARK-35359][SQL]Insert data with char/varchar datatype will fail when data length exceed length limitation

2168c3c

github-actions bot added the SQL label May 11, 2021

HyukjinKwon changed the title ~~[SPARK-35359][SQL]Insert data with char/varchar datatype will fail when data length exceed length limitation~~ [SPARK-35359][SQL] Insert data with char/varchar datatype will fail when data length exceed length limitation May 11, 2021

HyukjinKwon reviewed May 11, 2021

sql/core/src/test/scala/org/apache/spark/sql/CharVarcharTestSuite.scala (outdated)

fhygh added 2 commits May 12, 2021 16:27

add JIRA prefix for ut.

cef576c

update TableOutputResolver.checkField to not add the string length check if legacy config is true

ff66f5d

cloud-fan reviewed May 13, 2021

sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/util/CharVarcharCodegenUtils.java (outdated)

fhygh added 2 commits May 14, 2021 09:15

fix partition table with char/varchar datatype partition column

aee92db

remove CharVarcharCodegenUtils check

4773c51

cloud-fan reviewed May 14, 2021

sql/catalyst/src/main/scala/org/apache/spark/sql/util/PartitioningUtils.scala (outdated)

update PartitioningUtils

eaba37d

yaooqinn approved these changes May 14, 2021

cloud-fan closed this in 3a3f8ca May 17, 2021

cloud-fan mentioned this pull request Jul 11, 2022

[SQL][MINOR] Move general char/varchar test to the base test suite #37152

Closed
