[SPARK-49016][SQL] Restore the behavior that queries from raw CSV files are disallowed when only include corrupt record column and assign name to _LEGACY_ERROR_TEMP_1285
#47506
Conversation
The problem is that this was a behaviour change IIRC, so we couldn't change CSV.
This was fine because the JSON one had this behaviour in the first place.
@HyukjinKwon Thanks for your explanation. Do you think we should change the documentation and remove CSV, keeping only JSON, to avoid misunderstandings?
CSV about queries are disallowed when the referenced columns only include the internal corrupt record column in sql-migration-guide.md
docs/sql-migration-guide.md
Just to make sure, mind double-checking whether this was mistakenly removed somewhere in the past commits, or whether it is just a mistake in the documentation?
After investigating the change history of CSVFileFormat, I found that there was indeed a relevant PR (#19199) for CSV, but the check was removed in PR #35817. That PR seems to address a filter push-down problem, and I don't know why the code related to requiredSchema was removed along with it.
I also confirmed that, with the current code, if you select only columnNameOfCorruptRecord, the results are all null, which is inappropriate.

So I think we'd better restore the previous check and the code that throws the relevant exception (I don't know whether I've missed anything). WDYT? @HyukjinKwon
@MaxGekk do you remember why we removed this below?
-
- if (requiredSchema.length == 1 &&
- requiredSchema.head.name == parsedOptions.columnNameOfCorruptRecord) {
- throw QueryCompilationErrors.queryFromRawFilesIncludeCorruptRecordColumnError()
- }
+ val columnPruning = sparkSession.sessionState.conf.csvColumnPruning
+ // Don't push any filter which refers to the "virtual" column which cannot present in the input.
+ // Such filters will be applied later on the upper layer.
+ val actualFilters =
+ filters.filterNot(_.references.contains(parsedOptions.columnNameOfCorruptRecord))
Gentle ping @MaxGekk
At the moment I don't remember the details. I think it makes sense to restore that behaviour with the error.
@wayneguow Since you are here, could you assign a proper name to the error condition?
@MaxGekk Of course. I plan to use UNSUPPORTED_FEATURE.QUERY_ONLY_INCLUDE_CORRUPT_RECORD_COLUMN as the error class name; do you think it's suitable?
I think you can omit _INCLUDE_, just UNSUPPORTED_FEATURE.QUERY_ONLY_CORRUPT_RECORD_COLUMN
Sounds good, I added a new commit.
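
For context, a rough sketch (not the actual patch) of how the reinstated guard could look in CSVFileFormat. It assumes the code runs where requiredSchema and parsedOptions are in scope, as in the removed snippet quoted above; the error helper would now be backed by the named condition UNSUPPORTED_FEATURE.QUERY_ONLY_CORRUPT_RECORD_COLUMN instead of _LEGACY_ERROR_TEMP_1285.

```scala
import org.apache.spark.sql.errors.QueryCompilationErrors

// Sketch of the restored check: reject scans that reference only the
// "virtual" corrupt record column, since the parser cannot populate it
// when no other columns are read.
if (requiredSchema.length == 1 &&
    requiredSchema.head.name == parsedOptions.columnNameOfCorruptRecord) {
  throw QueryCompilationErrors.queryFromRawFilesIncludeCorruptRecordColumnError()
}
```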
_LEGACY_ERROR_TEMP_1285
MaxGekk left a comment
LGTM except for one comment.
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Updated it.
+1, LGTM. Merging to master.
checkError(
  exception = intercept[AnalysisException] {
    spark.read.schema(schema).csv(testFile(valueMalformedFile))
      .select("_corrupt_record").collect()
What was the behavior before this PR?
@wayneguow The query itself makes sense, but the NULL results are wrong. Blocking it looks incorrect to me.
Could you revert it?
@gatorsmile Sorry, I may not understand what you mean. You said the NULL results are wrong, but that was the behavior before this PR, so why do we need to revert it?
The definition of "corrupted record" has been unclear for a while, as it depends on the columns being read. This change itself is a breaking change, as it introduces a new error.
Hmm, from this point of view, this PR can be reverted. I respect your advice, and you have more experience with this. If it's convenient for you, you can help revert it. Thank you.

What changes were proposed in this pull request?
From the SQL migration guide: https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-22-to-23

But the behavior for CSV is inconsistent with the description in the document: the related code was removed in PR #35817.
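
For illustration, a minimal user-facing sketch of the documented behavior (the schema and file path below are made up for this example): selecting only the internal corrupt record column from a raw CSV read is disallowed, and the migration guide's workaround is to cache or save the parsed results first and then issue the same query.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Hypothetical schema and file path, for illustration only.
val schema = new StructType()
  .add("a", IntegerType)
  .add("b", StringType)
  .add("_corrupt_record", StringType)

val df = spark.read.schema(schema).csv("/tmp/malformed.csv")

// Disallowed per the migration guide: the query references only the internal
// corrupt record column, so (with the restored check) it raises an AnalysisException.
// df.select("_corrupt_record").collect()

// Documented workaround: cache or save the parsed results first, then the
// corrupt record column can be queried on its own.
val cached = df.cache()
cached.select("_corrupt_record").show()
```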
Why are the changes needed?
Maintain documentation and code consistency to avoid misunderstandings.
Does this PR introduce any user-facing change?
Yes, but it corrects the behavior and keeps it consistent with the docs.
How was this patch tested?
Passes GA and adds a test case.
Was this patch authored or co-authored using generative AI tooling?
No.