[SPARK-46862][SQL][FOLLOWUP] Fix column pruning without schema enforcing in V1 CSV datasource #44910

Closed
MaxGekk wants to merge 1 commit into apache:master from MaxGekk:check-header-column-pruning

MaxGekk (Member) commented Jan 27, 2024

What changes were proposed in this pull request?

In this PR, I propose to invoke CSVOptions.isColumnPruningEnabled, introduced by #44872, when matching the CSV header against a schema in the V1 CSV datasource.

Why are the changes needed?

To fix the failure when column pruning happens and a schema is not enforced:

scala> spark.read.
     | option("multiLine", true).
     | option("header", true).
     | option("escape", "\"").
     | option("enforceSchema", false).
     | csv("/Users/maximgekk/tmp/es-939111-data.csv").
     | count()
24/01/27 12:43:14 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
java.lang.IllegalArgumentException: Number of column in CSV header is not equal to number of fields in the schema:
 Header length: 4, schema size: 0
CSV file: file:///Users/maximgekk/tmp/es-939111-data.csv
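
The mismatch in the message above (header length 4, schema size 0) can be illustrated with a small model. The sketch below is not Spark code: `CsvOpts`, `schemaForHeaderCheck`, and `checkHeader` are hypothetical names, and the rule that multi-line mode turns pruning off is an assumption about what #44872 does. It only shows why the header check should use the same effective pruning decision as the parser rather than the raw flag.

```scala
// Simplified, hypothetical model of the header check; these are NOT Spark's classes.
final case class CsvOpts(columnPruning: Boolean, multiLine: Boolean, enforceSchema: Boolean) {
  // Assumption: pruning is effectively disabled in multi-line mode.
  def isColumnPruningEnabled: Boolean = columnPruning && !multiLine
}

object HeaderCheckSketch {
  // Picks the schema that the header is compared against.
  def schemaForHeaderCheck(
      dataSchema: Seq[String],
      requiredSchema: Seq[String],
      pruned: Boolean): Seq[String] =
    if (pruned) requiredSchema else dataSchema

  // Compares sizes only; the comparison is skipped when the schema is enforced.
  def checkHeader(header: Seq[String], schema: Seq[String], opts: CsvOpts): Either[String, Unit] =
    if (opts.enforceSchema || header.length == schema.length) Right(())
    else Left(s"Header length: ${header.length}, schema size: ${schema.length}")

  val opts: CsvOpts = CsvOpts(columnPruning = true, multiLine = true, enforceSchema = false)
  val dataSchema: Seq[String] = Seq("a", "b", "c", "d")
  val requiredSchema: Seq[String] = Seq.empty // count() needs no columns
  val header: Seq[String] = dataSchema        // full rows are read when pruning is off

  // Raw flag: the header is matched against the empty required schema.
  val before: Either[String, Unit] =
    checkHeader(header, schemaForHeaderCheck(dataSchema, requiredSchema, opts.columnPruning), opts)

  // Effective flag: the header is matched against the full data schema.
  val after: Either[String, Unit] =
    checkHeader(
      header,
      schemaForHeaderCheck(dataSchema, requiredSchema, opts.isColumnPruningEnabled),
      opts)

  def main(args: Array[String]): Unit = {
    println(before) // Left(Header length: 4, schema size: 0)
    println(after)  // Right(())
  }
}
```

In this model the check fails only when the schema is chosen by the raw flag, which matches the shape of the reported error.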

Does this PR introduce any user-facing change?

No.

How was this patch tested?

By running the affected test suites:

$ build/sbt "test:testOnly *CSVv1Suite"
$ build/sbt "test:testOnly *CSVv2Suite"
$ build/sbt "test:testOnly *CSVLegacyTimeParserSuite"
$ build/sbt "testOnly *.CsvFunctionsSuite"

Was this patch authored or co-authored using generative AI tooling?

No.

github-actions bot added the SQL label Jan 27, 2024
MaxGekk changed the title from "[WIP][SPARK-46862][SQL][FOLLOWUP] Fix column pruning without schema enforcing in V1 CSV datasource" to "[SPARK-46862][SQL][FOLLOWUP] Fix column pruning without schema enforcing in V1 CSV datasource" Jan 27, 2024
MaxGekk marked this pull request as ready for review January 27, 2024 14:05
MaxGekk requested a review from LuciferYang January 27, 2024 14:06
LuciferYang (Contributor) left a comment

+1, LGTM

MaxGekk (Member, Author) commented Jan 27, 2024

Merging to master/3.5/3.4. Thank you, @LuciferYang and @HyukjinKwon, for the review.

MaxGekk closed this in bc51c9f Jan 27, 2024
MaxGekk added a commit that referenced this pull request Jan 27, 2024
…ing in V1 CSV datasource

### What changes were proposed in this pull request?
In this PR, I propose to invoke `CSVOptions.isColumnPruningEnabled`, introduced by #44872, when matching the CSV header against a schema in the V1 CSV datasource.

### Why are the changes needed?
To fix the failure when column pruning happens and a schema is not enforced:
```scala
scala> spark.read.
     | option("multiLine", true).
     | option("header", true).
     | option("escape", "\"").
     | option("enforceSchema", false).
     | csv("/Users/maximgekk/tmp/es-939111-data.csv").
     | count()
24/01/27 12:43:14 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
java.lang.IllegalArgumentException: Number of column in CSV header is not equal to number of fields in the schema:
 Header length: 4, schema size: 0
CSV file: file:///Users/maximgekk/tmp/es-939111-data.csv
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By running the affected test suites:
```
$ build/sbt "test:testOnly *CSVv1Suite"
$ build/sbt "test:testOnly *CSVv2Suite"
$ build/sbt "test:testOnly *CSVLegacyTimeParserSuite"
$ build/sbt "testOnly *.CsvFunctionsSuite"
```

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44910 from MaxGekk/check-header-column-pruning.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit bc51c9f)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
MaxGekk added a commit that referenced this pull request Jan 27, 2024
dongjoon-hyun (Member) left a comment

+1, late LGTM.

szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Feb 7, 2024