
[SPARK-48241][SQL][3.5] CSV parsing failure with char/varchar type columns #46565

Closed

Conversation

@liujiayi771 (Contributor) commented on May 14, 2024

What changes were proposed in this pull request?

Selecting from a CSV table that contains char or varchar columns fails with the following error:

```
spark-sql (default)> show create table test_csv;
CREATE TABLE default.test_csv (
  id INT,
  name CHAR(10))
USING csv
```
```
java.lang.IllegalArgumentException: requirement failed: requiredSchema (struct<id:int,name:string>) should be the subset of dataSchema (struct<id:int,name:string>).
    at scala.Predef$.require(Predef.scala:281)
    at org.apache.spark.sql.catalyst.csv.UnivocityParser.<init>(UnivocityParser.scala:56)
    at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.$anonfun$buildReader$2(CSVFileFormat.scala:127)
    at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:155)
    at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:140)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
```
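For reference, a minimal way to reproduce this from spark-shell (a sketch; the table name and inserted values are illustrative):

```
// Sketch of a reproduction in spark-shell; table name and values are illustrative.
spark.sql("CREATE TABLE test_csv (id INT, name CHAR(10)) USING csv")
spark.sql("INSERT INTO test_csv VALUES (1, 'spark')")
// Without this fix, the scan below fails in UnivocityParser with the
// IllegalArgumentException shown above.
spark.sql("SELECT * FROM test_csv").show()
```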

Why are the changes needed?

For char and varchar types, Spark converts them to `StringType` in `CharVarcharUtils.replaceCharVarcharWithStringInSchema` and records the original type as `__CHAR_VARCHAR_TYPE_STRING` in the column metadata.
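To illustrate, a sketch against Spark's internal `CharVarcharUtils` (an internal API, so the exact signature may differ between versions):

```
import org.apache.spark.sql.catalyst.util.CharVarcharUtils
import org.apache.spark.sql.types._

// char(10) is rewritten to StringType; the original type is kept in the field metadata.
val raw = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", CharType(10))))
val replaced = CharVarcharUtils.replaceCharVarcharWithStringInSchema(raw)

replaced("name").dataType // StringType
replaced("name").metadata // contains "__CHAR_VARCHAR_TYPE_STRING" -> "char(10)"
```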

The above error occurs because the `StringType` columns in the `dataSchema` and `requiredSchema` of `UnivocityParser` are not consistent: the `StringType` in the `dataSchema` carries this metadata, while the metadata in the `requiredSchema` is empty. We need to retain the metadata when resolving the schema.
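`UnivocityParser` requires `requiredSchema` to be a subset of `dataSchema`, and that check compares whole `StructField`s. Since `StructField` is a case class, its equality includes the metadata, which is why the two otherwise identical string columns compare unequal. A small sketch of the mismatch (the field name is illustrative):

```
import org.apache.spark.sql.types._

val withMeta = StructField("name", StringType, nullable = true,
  new MetadataBuilder().putString("__CHAR_VARCHAR_TYPE_STRING", "char(10)").build())
val withoutMeta = StructField("name", StringType)

// Equality of the case class covers the metadata field, so:
withMeta == withoutMeta // false -> the requiredSchema-subset check fails
```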

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added a new test case in `CSVSuite`.
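A sketch of the kind of test this adds (assuming the usual `QueryTest` helpers available in `CSVSuite`; the actual test name and assertions in the PR may differ):

```
test("SPARK-48241: CSV read with char/varchar columns") {
  withTable("test_csv") {
    sql("CREATE TABLE test_csv(id INT, name CHAR(10)) USING csv")
    sql("INSERT INTO test_csv VALUES (1, 'spark')")
    // CHAR(10) values are right-padded with spaces on read.
    checkAnswer(sql("SELECT * FROM test_csv"), Row(1, "spark" + " " * 5))
  }
}
```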

Was this patch authored or co-authored using generative AI tooling?

No.

@yaooqinn (Member)

Can we port the PR description here too?

@cloud-fan (Contributor)

thanks, merging to 3.5!

cloud-fan pushed a commit that referenced this pull request on May 14, 2024:

[SPARK-48241][SQL][3.5] CSV parsing failure with char/varchar type columns

### What changes were proposed in this pull request?
Selecting from a CSV table that contains char or varchar columns fails with the following error:
```
spark-sql (default)> show create table test_csv;
CREATE TABLE default.test_csv (
  id INT,
  name CHAR(10))
USING csv
```
```
java.lang.IllegalArgumentException: requirement failed: requiredSchema (struct<id:int,name:string>) should be the subset of dataSchema (struct<id:int,name:string>).
    at scala.Predef$.require(Predef.scala:281)
    at org.apache.spark.sql.catalyst.csv.UnivocityParser.<init>(UnivocityParser.scala:56)
    at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.$anonfun$buildReader$2(CSVFileFormat.scala:127)
    at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:155)
    at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:140)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
```

### Why are the changes needed?
For char and varchar types, Spark will convert them to `StringType` in `CharVarcharUtils.replaceCharVarcharWithStringInSchema` and record `__CHAR_VARCHAR_TYPE_STRING` in the metadata.

The reason for the above error is that the `StringType` columns in the `dataSchema` and `requiredSchema` of `UnivocityParser` are not consistent. The `StringType` in the `dataSchema` has metadata, while the metadata in the `requiredSchema` is empty. We need to retain the metadata when resolving the schema.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Add a new test case in `CSVSuite`.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46565 from liujiayi771/branch-3.5-SPARK-48241.

Authored-by: joey.ljy <joey.ljy@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan closed this on May 14, 2024
@liujiayi771 deleted the branch-3.5-SPARK-48241 branch on May 14, 2024, 06:26