
Conversation

@wayneguow (Contributor) commented Jul 26, 2024

What changes were proposed in this pull request?

From the SQL migration guide: https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-22-to-23
[Screenshot of the migration guide note: queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column.]

However, the CSV behavior is inconsistent with this description: the related check was removed in PR #35817.
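
For context, here is a minimal reproduction sketch (the file path, schema, and sample rows are hypothetical, not taken from the PR):

```scala
import org.apache.spark.sql.SparkSession

object CorruptRecordOnlyRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // Hypothetical CSV file with one malformed row, e.g.:
    //   0,not-a-timestamp
    //   1,1983-08-04 00:00:00
    val df = spark.read
      .schema("a INT, b TIMESTAMP, _corrupt_record STRING")
      .csv("/tmp/value-malformed.csv")

    // Per the migration guide, this query should be rejected because it
    // references only the internal corrupt record column; with the check
    // removed in #35817, it returns all-NULL rows instead.
    df.select("_corrupt_record").show()
  }
}
```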

Why are the changes needed?

Maintain documentation and code consistency to avoid misunderstandings.

Does this PR introduce any user-facing change?

Yes. This restores the documented behavior, so the results match the docs.

How was this patch tested?

Passed GA and added a test case.

Was this patch authored or co-authored using generative AI tooling?

No.

Member:

The problem is that this was a behaviour change, IIRC, so we couldn't change CSV.

Member:

This was fine because the JSON source had this behaviour in the first place.

Contributor Author:

@HyukjinKwon Thanks for your explanation. Do you think we should change the documentation, removing the CSV part and keeping only JSON, to avoid misunderstandings?

github-actions bot added DOCS and removed SQL labels Jul 29, 2024
@wayneguow changed the title from "[SPARK-49016][SQL] Queries from raw CSV files are disallowed when the referenced columns only include the internal corrupt record column" to "[SPARK-49016][SQL] Remove CSV about queries are disallowed when the referenced columns only include the internal corrupt record column in sql-migration-guide.md" Jul 29, 2024
@wayneguow changed the title from "[SPARK-49016][SQL] Remove CSV about queries are disallowed when the referenced columns only include the internal corrupt record column in sql-migration-guide.md" to "[SPARK-49016][SQL][DOCS] Remove CSV about queries are disallowed when the referenced columns only include the internal corrupt record column in sql-migration-guide.md" Jul 29, 2024
Member:

Just to make sure, mind double-checking whether this was mistakenly removed in a past commit, or whether it was just a mistake in the documentation?

Contributor Author:

After investigating the change history of CSVFileFormat, I found that there was indeed a relevant PR (#19199) for CSV, but the check was removed in PR #35817. That PR seems to address a filter push-down problem, but I don't know why the previous code related to requiredSchema was removed.

I also confirmed that, with the current code, if you select only columnNameOfCorruptRecord, the results are all null, which is inappropriate.
[Screenshot: selecting only the corrupt record column returns all-NULL rows.]

So I think we'd better restore the previous detection code that throws the relevant exception. (I may have missed something.) WDYT? @HyukjinKwon

Member:

@MaxGekk do you remember why we removed this below?

- 
-     if (requiredSchema.length == 1 &&
-       requiredSchema.head.name == parsedOptions.columnNameOfCorruptRecord) {
-       throw QueryCompilationErrors.queryFromRawFilesIncludeCorruptRecordColumnError()
-     }
+     val columnPruning = sparkSession.sessionState.conf.csvColumnPruning
+     // Don't push any filter which refers to the "virtual" column which cannot present in the input.
+     // Such filters will be applied later on the upper layer.
+     val actualFilters =
+       filters.filterNot(_.references.contains(parsedOptions.columnNameOfCorruptRecord))
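
After this change, both pieces coexist in CSVFileFormat. A sketch assembled from the hunks above (names such as requiredSchema, parsedOptions, and filters come from the surrounding reader-building method; this is not a verbatim excerpt):

```scala
// Restored guard: reject a query that references only the corrupt record column.
if (requiredSchema.length == 1 &&
    requiredSchema.head.name == parsedOptions.columnNameOfCorruptRecord) {
  throw QueryCompilationErrors.queryFromRawFilesIncludeCorruptRecordColumnError()
}

// Kept from #35817: don't push down any filter that refers to the "virtual"
// corrupt record column, which cannot be present in the input; such filters
// are applied later on the upper layer.
val actualFilters =
  filters.filterNot(_.references.contains(parsedOptions.columnNameOfCorruptRecord))
```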

Contributor Author:

Gentle ping @MaxGekk

Member:

At the moment I don't remember the details. I think it makes sense to restore such behaviour with the error.

github-actions bot added SQL and removed DOCS labels Aug 2, 2024
@wayneguow changed the title from "[SPARK-49016][SQL][DOCS] Remove CSV about queries are disallowed when the referenced columns only include the internal corrupt record column in sql-migration-guide.md" to "[SPARK-49016][SQL] Queries from raw CSV files are disallowed when the referenced columns only include the internal corrupt record column" Aug 2, 2024
Member:

@wayneguow Since you are here, could you assign a proper name to the error condition?

Contributor Author:

@MaxGekk Of course. I plan to use UNSUPPORTED_FEATURE.QUERY_ONLY_INCLUDE_CORRUPT_RECORD_COLUMN as the error class name; do you think it's suitable?

Member:

I think you can omit _INCLUDE_: just UNSUPPORTED_FEATURE.QUERY_ONLY_CORRUPT_RECORD_COLUMN.

Contributor Author:

Sounds good; I added a new commit.
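
For reference, a sketch of how the renamed condition could be raised from QueryCompilationErrors (the empty message parameters are an assumption, not taken from the PR):

```scala
def queryFromRawFilesIncludeCorruptRecordColumnError(): Throwable = {
  new AnalysisException(
    errorClass = "UNSUPPORTED_FEATURE.QUERY_ONLY_CORRUPT_RECORD_COLUMN",
    messageParameters = Map.empty)
}
```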

@wayneguow changed the title from "[SPARK-49016][SQL] Queries from raw CSV files are disallowed when the referenced columns only include the internal corrupt record column" to "[SPARK-49016][SQL] Restore the behavior that queries from raw CSV files are disallowed when only include corrupt record column and assign name to _LEGACY_ERROR_TEMP_1285" Aug 14, 2024
@MaxGekk (Member) left a comment:

LGTM except for one comment.

Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
@wayneguow requested a review from MaxGekk August 19, 2024 01:26
@wayneguow (Contributor Author):

> LGTM except for one comment.

Updated it.

@MaxGekk (Member) commented Aug 19, 2024

+1, LGTM. Merging to master.
Thank you, @wayneguow and @HyukjinKwon, for the review.

@MaxGekk closed this in dd259b0 Aug 19, 2024
checkError(
  exception = intercept[AnalysisException] {
    spark.read.schema(schema).csv(testFile(valueMalformedFile))
      .select("_corrupt_record").collect()
  },
  // Closing arguments reconstructed for readability; the review hunk was truncated here.
  errorClass = "UNSUPPORTED_FEATURE.QUERY_ONLY_CORRUPT_RECORD_COLUMN")
Contributor:

What was the behavior before this PR?

Contributor Author:

The previous behavior looked like this:
[Screenshot: before this PR, the query returned rows with NULL in _corrupt_record instead of failing.]

Member:

@wayneguow the query itself makes sense, but the results "NULL" are wrong. Blocking this looks incorrect to me.

Could you revert it ?

Contributor Author:

@gatorsmile Sorry, I may not understand what you mean. You said the "NULL" results are wrong, but that was the behavior before this PR, so why do we need to revert it?

Contributor:

The definition of "corrupted record" has been unclear for a while, as it depends on the column being read. This change itself is a breaking change as it introduced a new error.
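
A hypothetical illustration of that ambiguity: with CSV column pruning (spark.sql.csv.parser.columnPruning.enabled, true by default), whether a row is treated as corrupt can depend on which columns the query actually reads. Reusing the df from the reproduction sketch above:

```scala
// The row "0,not-a-timestamp" parses fine when only column a is needed...
df.select("a").show()
// ...but is flagged as malformed once the timestamp column b must be parsed.
df.select("b", "_corrupt_record").show()
```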

Contributor Author:

Hmm, from this point of view, this PR can be reverted. I respect your advice, as you have more experience with this. If it's convenient for you, could you help revert it? Thank you.

@wayneguow deleted the SPARK-49016 branch February 11, 2025 04:25