
Conversation

@wayneguow (Contributor) commented Jul 26, 2024

What changes were proposed in this pull request?

From the SQL migration guide: https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-22-to-23
[Screenshot of the migration guide note: queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column.]

However, the CSV behavior is inconsistent with this description: the related check was removed in PR #35817.
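
For context, here is a minimal reproduction sketch (the file path, schema, and sample rows are hypothetical, not taken from the PR):

```scala
import org.apache.spark.sql.SparkSession

object CorruptRecordOnlyRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // Hypothetical CSV file with one malformed row, e.g.:
    //   0,not-a-timestamp
    //   1,1983-08-04 00:00:00
    val df = spark.read
      .schema("a INT, b TIMESTAMP, _corrupt_record STRING")
      .csv("/tmp/value-malformed.csv")

    // Per the migration guide, this query should be rejected because it
    // references only the internal corrupt record column; with the check
    // removed in #35817, it returns all-NULL rows instead.
    df.select("_corrupt_record").show()
  }
}
```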

Why are the changes needed?

Maintain documentation and code consistency to avoid misunderstandings.

Does this PR introduce any user-facing change?

Yes. This restores the documented behavior, so the results match the docs.

How was this patch tested?

Passed GA and added a test case.

Was this patch authored or co-authored using generative AI tooling?

No.

Member:

The problem is that this was a behaviour change, IIRC, so we couldn't change CSV.

Member:

This was fine because the JSON source had this behaviour in the first place.

Contributor Author:

@HyukjinKwon Thanks for your explanation. Do you think we should change the documentation, removing the CSV part and keeping only JSON, to avoid misunderstandings?

github-actions bot added DOCS and removed SQL labels Jul 29, 2024
@wayneguow changed the title from "[SPARK-49016][SQL] Queries from raw CSV files are disallowed when the referenced columns only include the internal corrupt record column" to "[SPARK-49016][SQL] Remove CSV about queries are disallowed when the referenced columns only include the internal corrupt record column in sql-migration-guide.md" Jul 29, 2024
@wayneguow changed the title from "[SPARK-49016][SQL] Remove CSV about queries are disallowed when the referenced columns only include the internal corrupt record column in sql-migration-guide.md" to "[SPARK-49016][SQL][DOCS] Remove CSV about queries are disallowed when the referenced columns only include the internal corrupt record column in sql-migration-guide.md" Jul 29, 2024
Member:

Just to make sure, mind double-checking whether this was mistakenly removed in a past commit, or whether it was just a mistake in the documentation?

Contributor Author:

After investigating the change history of CSVFileFormat, I found that there was indeed a relevant PR (#19199) for CSV, but the check was removed in PR #35817. That PR seems to address a filter push-down problem, but I don't know why the previous code related to requiredSchema was removed.

I also confirmed that, with the current code, if you select only columnNameOfCorruptRecord, the results are all null, which is inappropriate.
[Screenshot: selecting only the corrupt record column returns all-NULL rows.]

So I think we'd better restore the previous detection code that throws the relevant exception. (I may have missed something.) WDYT? @HyukjinKwon

Member:

@MaxGekk do you remember why we removed this below?

- 
-     if (requiredSchema.length == 1 &&
-       requiredSchema.head.name == parsedOptions.columnNameOfCorruptRecord) {
-       throw QueryCompilationErrors.queryFromRawFilesIncludeCorruptRecordColumnError()
-     }
+     val columnPruning = sparkSession.sessionState.conf.csvColumnPruning
+     // Don't push any filter which refers to the "virtual" column which cannot present in the input.
+     // Such filters will be applied later on the upper layer.
+     val actualFilters =
+       filters.filterNot(_.references.contains(parsedOptions.columnNameOfCorruptRecord))
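
After this change, both pieces coexist in CSVFileFormat. A sketch assembled from the hunks above (names such as requiredSchema, parsedOptions, and filters come from the surrounding reader-building method; this is not a verbatim excerpt):

```scala
// Restored guard: reject a query that references only the corrupt record column.
if (requiredSchema.length == 1 &&
    requiredSchema.head.name == parsedOptions.columnNameOfCorruptRecord) {
  throw QueryCompilationErrors.queryFromRawFilesIncludeCorruptRecordColumnError()
}

// Kept from #35817: don't push down any filter that refers to the "virtual"
// corrupt record column, which cannot be present in the input; such filters
// are applied later on the upper layer.
val actualFilters =
  filters.filterNot(_.references.contains(parsedOptions.columnNameOfCorruptRecord))
```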

Contributor Author:

Gentle ping @MaxGekk

Member:

At the moment I don't remember the details. I think it makes sense to restore such behaviour with the error.

github-actions bot added SQL and removed DOCS labels Aug 2, 2024
@wayneguow changed the title from "[SPARK-49016][SQL][DOCS] Remove CSV about queries are disallowed when the referenced columns only include the internal corrupt record column in sql-migration-guide.md" to "[SPARK-49016][SQL] Queries from raw CSV files are disallowed when the referenced columns only include the internal corrupt record column" Aug 2, 2024
Member:

@wayneguow Since you are here, could you assign a proper name to the error condition?

Contributor Author:

@MaxGekk Of course. I plan to use UNSUPPORTED_FEATURE.QUERY_ONLY_INCLUDE_CORRUPT_RECORD_COLUMN as the error class name; do you think it's suitable?

Member:

I think you can omit _INCLUDE_: just UNSUPPORTED_FEATURE.QUERY_ONLY_CORRUPT_RECORD_COLUMN.

Contributor Author:

Sounds good; I added a new commit.
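
For reference, a sketch of how the renamed condition could be raised from QueryCompilationErrors (the empty message parameters are an assumption, not taken from the PR):

```scala
def queryFromRawFilesIncludeCorruptRecordColumnError(): Throwable = {
  new AnalysisException(
    errorClass = "UNSUPPORTED_FEATURE.QUERY_ONLY_CORRUPT_RECORD_COLUMN",
    messageParameters = Map.empty)
}
```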

@wayneguow changed the title from "[SPARK-49016][SQL] Queries from raw CSV files are disallowed when the referenced columns only include the internal corrupt record column" to "[SPARK-49016][SQL] Restore the behavior that queries from raw CSV files are disallowed when only include corrupt record column and assign name to _LEGACY_ERROR_TEMP_1285" Aug 14, 2024
@MaxGekk (Member) left a comment:

LGTM except for one comment.

Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
@wayneguow requested a review from MaxGekk August 19, 2024 01:26
@wayneguow (Contributor Author):

> LGTM except for one comment.

Updated it.

@MaxGekk (Member) commented Aug 19, 2024

+1, LGTM. Merging to master.
Thank you, @wayneguow and @HyukjinKwon, for the review.

@MaxGekk closed this in dd259b0 Aug 19, 2024
checkError(
  exception = intercept[AnalysisException] {
    spark.read.schema(schema).csv(testFile(valueMalformedFile))
      .select("_corrupt_record").collect()
  },
  // Closing arguments reconstructed for readability; the review hunk was truncated here.
  errorClass = "UNSUPPORTED_FEATURE.QUERY_ONLY_CORRUPT_RECORD_COLUMN")
Contributor:

What was the behavior before this PR?

Contributor Author:

The previous behavior looked like this:
[Screenshot: before this PR, the query returned rows with NULL in _corrupt_record instead of failing.]

Member:

@wayneguow the query itself makes sense, but the results "NULL" are wrong. Blocking this looks incorrect to me.

Could you revert it ?

Contributor Author:

@gatorsmile Sorry, I may not understand what you mean. You said the "NULL" results are wrong, but that was the behavior before this PR, so why do we need to revert it?

Contributor:

The definition of "corrupted record" has been unclear for a while, as it depends on the column being read. This change itself is a breaking change as it introduced a new error.
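
A hypothetical illustration of that ambiguity: with CSV column pruning (spark.sql.csv.parser.columnPruning.enabled, true by default), whether a row is treated as corrupt can depend on which columns the query actually reads. Reusing the df from the reproduction sketch above:

```scala
// The row "0,not-a-timestamp" parses fine when only column a is needed...
df.select("a").show()
// ...but is flagged as malformed once the timestamp column b must be parsed.
df.select("b", "_corrupt_record").show()
```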

Contributor Author:

Hmm, from this point of view, this PR can be reverted. I respect your advice, as you have more experience with this. If it's convenient for you, could you help revert it? Thank you.

@wayneguow deleted the SPARK-49016 branch February 11, 2025 04:25