
[SPARK-48241][SQL][3.5] CSV parsing failure with char/varchar type columns #46565

Closed

Conversation

@liujiayi771 (Contributor) commented on May 14, 2024

What changes were proposed in this pull request?

Selecting from a CSV table that contains char or varchar columns fails with the following error:

```
spark-sql (default)> show create table test_csv;
CREATE TABLE default.test_csv (
  id INT,
  name CHAR(10))
USING csv
```
```
java.lang.IllegalArgumentException: requirement failed: requiredSchema (struct<id:int,name:string>) should be the subset of dataSchema (struct<id:int,name:string>).
    at scala.Predef$.require(Predef.scala:281)
    at org.apache.spark.sql.catalyst.csv.UnivocityParser.<init>(UnivocityParser.scala:56)
    at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.$anonfun$buildReader$2(CSVFileFormat.scala:127)
    at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:155)
    at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:140)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
```
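For reference, a minimal way to reproduce this from spark-shell (a sketch; the table name and inserted values are illustrative):

```
// Sketch of a reproduction in spark-shell; table name and values are illustrative.
spark.sql("CREATE TABLE test_csv (id INT, name CHAR(10)) USING csv")
spark.sql("INSERT INTO test_csv VALUES (1, 'spark')")
// Without this fix, the scan below fails in UnivocityParser with the
// IllegalArgumentException shown above.
spark.sql("SELECT * FROM test_csv").show()
```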

Why are the changes needed?

For char and varchar types, Spark converts them to `StringType` in `CharVarcharUtils.replaceCharVarcharWithStringInSchema` and records the original type as `__CHAR_VARCHAR_TYPE_STRING` in the column metadata.
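To illustrate, a sketch against Spark's internal `CharVarcharUtils` (an internal API, so the exact signature may differ between versions):

```
import org.apache.spark.sql.catalyst.util.CharVarcharUtils
import org.apache.spark.sql.types._

// char(10) is rewritten to StringType; the original type is kept in the field metadata.
val raw = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", CharType(10))))
val replaced = CharVarcharUtils.replaceCharVarcharWithStringInSchema(raw)

replaced("name").dataType // StringType
replaced("name").metadata // contains "__CHAR_VARCHAR_TYPE_STRING" -> "char(10)"
```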

The above error occurs because the `StringType` columns in the `dataSchema` and `requiredSchema` of `UnivocityParser` are not consistent: the `StringType` in the `dataSchema` carries this metadata, while the metadata in the `requiredSchema` is empty. We need to retain the metadata when resolving the schema.
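`UnivocityParser` requires `requiredSchema` to be a subset of `dataSchema`, and that check compares whole `StructField`s. Since `StructField` is a case class, its equality includes the metadata, which is why the two otherwise identical string columns compare unequal. A small sketch of the mismatch (the field name is illustrative):

```
import org.apache.spark.sql.types._

val withMeta = StructField("name", StringType, nullable = true,
  new MetadataBuilder().putString("__CHAR_VARCHAR_TYPE_STRING", "char(10)").build())
val withoutMeta = StructField("name", StringType)

// Equality of the case class covers the metadata field, so:
withMeta == withoutMeta // false -> the requiredSchema-subset check fails
```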

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added a new test case in `CSVSuite`.
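A sketch of the kind of test this adds (assuming the usual `QueryTest` helpers available in `CSVSuite`; the actual test name and assertions in the PR may differ):

```
test("SPARK-48241: CSV read with char/varchar columns") {
  withTable("test_csv") {
    sql("CREATE TABLE test_csv(id INT, name CHAR(10)) USING csv")
    sql("INSERT INTO test_csv VALUES (1, 'spark')")
    // CHAR(10) values are right-padded with spaces on read.
    checkAnswer(sql("SELECT * FROM test_csv"), Row(1, "spark" + " " * 5))
  }
}
```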

Was this patch authored or co-authored using generative AI tooling?

No.

@yaooqinn (Member)

Can we port the PR description here too?

@cloud-fan (Contributor)

thanks, merging to 3.5!

cloud-fan pushed a commit that referenced this pull request on May 14, 2024:

[SPARK-48241][SQL][3.5] CSV parsing failure with char/varchar type columns

### What changes were proposed in this pull request?
Selecting from a CSV table that contains char or varchar columns fails with the following error:
```
spark-sql (default)> show create table test_csv;
CREATE TABLE default.test_csv (
  id INT,
  name CHAR(10))
USING csv
```
```
java.lang.IllegalArgumentException: requirement failed: requiredSchema (struct<id:int,name:string>) should be the subset of dataSchema (struct<id:int,name:string>).
    at scala.Predef$.require(Predef.scala:281)
    at org.apache.spark.sql.catalyst.csv.UnivocityParser.<init>(UnivocityParser.scala:56)
    at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.$anonfun$buildReader$2(CSVFileFormat.scala:127)
    at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:155)
    at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:140)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
```

### Why are the changes needed?
For char and varchar types, Spark will convert them to `StringType` in `CharVarcharUtils.replaceCharVarcharWithStringInSchema` and record `__CHAR_VARCHAR_TYPE_STRING` in the metadata.

The reason for the above error is that the `StringType` columns in the `dataSchema` and `requiredSchema` of `UnivocityParser` are not consistent. The `StringType` in the `dataSchema` has metadata, while the metadata in the `requiredSchema` is empty. We need to retain the metadata when resolving the schema.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Add a new test case in `CSVSuite`.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46565 from liujiayi771/branch-3.5-SPARK-48241.

Authored-by: joey.ljy <joey.ljy@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan closed this on May 14, 2024
@liujiayi771 deleted the branch-3.5-SPARK-48241 branch on May 14, 2024, 06:26