[SPARK-46862][SQL][FOLLOWUP] Fix column pruning without schema enforcing in V1 CSV datasource by MaxGekk · Pull Request #44910 · apache/spark

MaxGekk · 2024-01-27T09:39:38Z

What changes were proposed in this pull request?

In the PR, I propose to invoke CSVOptions.isColumnPruningEnabled, introduced by #44872, when matching the CSV header against a schema in the V1 CSV datasource.

Why are the changes needed?

To fix the failure when column pruning happens and a schema is not enforced:

scala> spark.read.
     | option("multiLine", true).
     | option("header", true).
     | option("escape", "\"").
     | option("enforceSchema", false).
     | csv("/Users/maximgekk/tmp/es-939111-data.csv").
     | count()
24/01/27 12:43:14 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
java.lang.IllegalArgumentException: Number of column in CSV header is not equal to number of fields in the schema:
 Header length: 4, schema size: 0
CSV file: file:///Users/maximgekk/tmp/es-939111-data.csv
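
The mismatch above (a 4-column header checked against an empty schema) can be modeled with a small standalone sketch. This is not Spark's code: `CsvOpts`, `HeaderCheckSketch`, `schemaForHeaderCheck`, `checkHeader`, and the column names are hypothetical, and the rule "pruning is off in multi-line mode" is an assumption taken from the title of #44872. The idea is that the header should be compared against the full data schema whenever pruning is effectively disabled, even if the query (here `count()`) requires no columns.

```scala
// Hypothetical, simplified model; these are NOT Spark's real classes.
final case class CsvOpts(multiLine: Boolean, columnPruning: Boolean) {
  // Assumption (from the title of #44872): pruning is disabled in multi-line mode.
  def isColumnPruningEnabled: Boolean = !multiLine && columnPruning
}

object HeaderCheckSketch {
  // Choose the schema that the CSV header is compared against.
  def schemaForHeaderCheck(
      dataSchema: Seq[String],
      requiredSchema: Seq[String],
      opts: CsvOpts): Seq[String] =
    if (opts.isColumnPruningEnabled) requiredSchema else dataSchema

  // Mimics the length check behind the error quoted above.
  // `require` throws IllegalArgumentException when the lengths differ.
  def checkHeader(header: Seq[String], schema: Seq[String]): Unit =
    require(
      header.length == schema.length,
      s"Header length: ${header.length}, schema size: ${schema.length}")

  def main(args: Array[String]): Unit = {
    val header = Seq("a", "b", "c", "d")   // made-up column names
    val dataSchema = header                 // full schema of the file
    val requiredSchema = Seq.empty[String]  // count() needs no columns
    val opts = CsvOpts(multiLine = true, columnPruning = true)

    // Uses the data schema because pruning is effectively off, so the check passes.
    checkHeader(header, schemaForHeaderCheck(dataSchema, requiredSchema, opts))
    println("header matches")
  }
}
```

In this model, choosing the schema from the raw `columnPruning` flag alone would hand `checkHeader` the empty required schema and reproduce the "Header length: 4, schema size: 0" failure.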

Does this PR introduce any user-facing change?

No.

How was this patch tested?

By running the affected test suites:

$ build/sbt "test:testOnly *CSVv1Suite"
$ build/sbt "test:testOnly *CSVv2Suite"
$ build/sbt "test:testOnly *CSVLegacyTimeParserSuite"
$ build/sbt "testOnly *.CsvFunctionsSuite"

Was this patch authored or co-authored using generative AI tooling?

No.

LuciferYang

+1, LGTM

MaxGekk · 2024-01-27T16:22:21Z

Merging to master/3.5/3.4. Thank you, @LuciferYang and @HyukjinKwon, for the review.

…ing in V1 CSV datasource

Closes #44910 from MaxGekk/check-header-column-pruning.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit bc51c9f)
Signed-off-by: Max Gekk <max.gekk@gmail.com>

dongjoon-hyun

+1, late LGTM.

Fix column pruning without schema enforcing in V1 CSV datasource

7436f4f

github-actions bot added the SQL label Jan 27, 2024

MaxGekk changed the title ~~[WIP][SPARK-46862][SQL][FOLLOWUP] Fix column pruning without schema enforcing in V1 CSV datasource~~ [SPARK-46862][SQL][FOLLOWUP] Fix column pruning without schema enforcing in V1 CSV datasource Jan 27, 2024

MaxGekk marked this pull request as ready for review January 27, 2024 14:05

MaxGekk requested a review from LuciferYang January 27, 2024 14:06

MaxGekk mentioned this pull request Jan 27, 2024

[SPARK-46862][SQL] Disable CSV column pruning in the multi-line mode #44872

Closed

HyukjinKwon approved these changes Jan 27, 2024

LuciferYang approved these changes Jan 27, 2024

MaxGekk closed this in bc51c9f Jan 27, 2024

dongjoon-hyun reviewed Jan 27, 2024

MaxGekk wants to merge 1 commit into apache:master from MaxGekk:check-header-column-pruning


4 participants
