[SPARK-46890][SQL] Fix CSV parsing bug with existence default values and column pruning #44939
Conversation
fetch from master
@MaxGekk here is the fix!
@dtenedor Thanks for the ping. I will review it today.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala
@dtenedor Could you explain, please, how your test passed for the CSV V2 datasource if you haven't fixed it? I haven't found any changes at lines 61 to 64 in 031df8f:

  val schema = if (options.isColumnPruningEnabled) actualReadDataSchema else actualDataSchema
  val isStartOfFile = file.start == 0
  val headerChecker = new CSVHeaderChecker(
    schema, options, source = s"CSV file: ${file.urlEncodedPath}", isStartOfFile)
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
respond to code review comments
@MaxGekk I looked into this. To help answer your question: apparently the DSV2 scan sets the required columns to scan differently. For example, with the following test [1], I find the physical plan requests different columns.
Force-pushed from 6ab90e6 to 40148d3
@dtenedor Could you write or modify your new test to check the V2 DS implementation? I put a breakpoint in CSVPartitionReaderFactory, and your test didn't hit the breakpoint.
@MaxGekk Good question. I reproduced it now by running this new test for both the V1 and V2 cases; both pass. I am able to hit that breakpoint on the latter but not the former. Interestingly, I copied the unit test to a new PR (without this fix) and it fails for both CSV V1 and CSV V2. So this PR fixes it for both versions, but only V2 hits the breakpoint you suggested.
@dtenedor Could you remove the first part of the test:

  test("SPARK-46890: CSV fails on a column with default and without enforcing schema") {
    withTable("Products") {
      spark.sql(
        s"""
           |CREATE TABLE IF NOT EXISTS Products (
           |  product_id INT,
           |  name STRING,
           |  price FLOAT default 0.0,
           |  quantity INT default 0
           |)
           |USING CSV
           |OPTIONS (
           |  header 'true',
           |  inferSchema 'false',
           |  enforceSchema 'false',
           |  path "${testFile(productsFile)}"
           |)
         """.stripMargin)
      checkAnswer(
        sql("SELECT price FROM Products"),
        Seq(
          Row(0.50),
          Row(0.25),
          Row(0.75)))
    }
  }

Set a breakpoint inside of the main constructor of CSVPartitionReaderFactory.
respond to code review comments
@MaxGekk I tried several different ways of testing this bug with DSV2 CSV scans, but was unable to both use a schema with column defaults and hit that breakpoint in the DSV2 code path. At any rate, I updated the test.
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala
Thanks @MaxGekk for your thorough reviews!!
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala
LGTM except for minor comments.
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
Thanks again @MaxGekk for your reviews!
+1, LGTM. Merging to master.
@dtenedor Should we backport it to …?
What changes were proposed in this pull request?
This PR fixes a CSV parsing bug with existence default values and column pruning (https://issues.apache.org/jira/browse/SPARK-46890).
The bug fix includes disabling column pruning specifically when checking the CSV header schema against the required schema expected by Catalyst. This makes the expected schema match what the CSV parser provides, since later we also instruct the CSV parser to disable column pruning and instead read each entire row, in order to correctly assign the default value(s) during execution.
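As a toy model of the schema choice described above (a sketch only, not the actual Spark patch; the object and field names here are made up for illustration): the header checker must validate against whatever schema the parser will actually produce, and once any column carries an existence default, the parser reads full rows, so the header check must use the full schema too.

```scala
// Hypothetical sketch of the schema-selection logic, NOT Spark's real code.
object HeaderSchemaChoice {
  // Simplified stand-in for a struct field; `hasExistenceDefault` marks
  // columns declared with a DEFAULT clause.
  case class Field(name: String, hasExistenceDefault: Boolean)

  // Returns the schema the CSV header should be validated against.
  // When any column has an existence default, pruning is effectively
  // disabled for parsing, so the header check must see the full schema.
  def schemaForHeaderCheck(
      fullSchema: Seq[Field],
      requiredSchema: Seq[Field],
      columnPruningEnabled: Boolean): Seq[Field] = {
    val hasDefaults = fullSchema.exists(_.hasExistenceDefault)
    if (columnPruningEnabled && !hasDefaults) requiredSchema else fullSchema
  }

  def main(args: Array[String]): Unit = {
    val full = Seq(
      Field("product_id", false), Field("name", false),
      Field("price", true), Field("quantity", true))
    val required = Seq(Field("price", true))
    // Defaults present: the full schema wins even with pruning enabled.
    println(schemaForHeaderCheck(full, required, columnPruningEnabled = true).map(_.name))
  }
}
```

The point of the sketch is only the invariant: the header-check schema and the parser's output schema must be chosen by the same rule, which is what the fix restores.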
Why are the changes needed?
Before this change, querying a subset of the columns of a CSV table whose CREATE TABLE statement contained default values would return an internal exception. For example, given the CSV file products.csv:
The query fails:
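The file listing and failing query did not survive extraction. As a toy model of why such a query failed (an assumed simplification, not Spark internals; the schema and values are taken from the test added in this PR): with pruning on, the parser emits only the requested columns' tokens, while existence-default assignment looks values up by full-schema position.

```scala
// Hypothetical toy model of the failure mode, NOT Spark's actual code.
object PruningMismatchDemo {
  val fullSchema = Seq("product_id", "name", "price", "quantity")

  // Parse one CSV line, optionally pruning to the required columns only.
  def parse(line: String, required: Seq[String], prune: Boolean): Array[String] = {
    val tokens = line.split(",")
    if (prune) required.map(c => tokens(fullSchema.indexOf(c))).toArray else tokens
  }

  // Default-value handling indexes tokens by position in the FULL schema.
  def priceByFullPosition(tokens: Array[String]): String =
    tokens(fullSchema.indexOf("price"))

  def main(args: Array[String]): Unit = {
    val line = "1,Apple,0.50,100"  // illustrative row for the Products table
    // Bug: pruned tokens combined with a full-schema index go out of bounds,
    // surfacing as an internal error.
    val pruned = parse(line, Seq("price"), prune = true)
    println(scala.util.Try(priceByFullPosition(pruned)).isFailure)  // true
    // Fix: disable pruning so token positions line up with the full schema.
    val full = parse(line, Seq("price"), prune = false)
    println(priceByFullPosition(full))  // 0.50
  }
}
```

After the actual fix, the same `SELECT price FROM Products` query returns the expected rows (0.50, 0.25, 0.75 in the PR's test) instead of an internal exception.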
Does this PR introduce any user-facing change?
No.
How was this patch tested?
This PR adds test coverage.
Was this patch authored or co-authored using generative AI tooling?
No.