
[SPARK-38651][SQL] Add spark.sql.legacy.allowEmptySchemaWrite #35969

Closed
wants to merge 3 commits

Conversation

thejdeep
Contributor

@thejdeep thejdeep commented Mar 25, 2022

What changes were proposed in this pull request?

Add SQL configuration `spark.sql.legacy.allowEmptySchemaWrite` to allow support for writing out empty schemas to certain file-based datasources that support it.
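
For illustration, a minimal sketch of the gating logic this config implies (assumed shape only — the real validation lives in Spark's `DataSource` write path; the names below are hypothetical stand-ins):

```scala
// Hypothetical stand-in for the write-path schema validation, showing how a
// legacy flag can bypass the empty-schema check introduced in Spark 2.4.
final case class Field(name: String, dataType: String)

def validateWriteSchema(fields: Seq[Field], allowEmptySchemaWrite: Boolean): Unit = {
  if (fields.isEmpty && !allowEmptySchemaWrite) {
    throw new IllegalArgumentException(
      "Datasource does not support writing empty or nested empty schemas.")
  }
}

// Default behavior (flag off): an empty schema is rejected.
val rejected =
  try { validateWriteSchema(Nil, allowEmptySchemaWrite = false); false }
  catch { case _: IllegalArgumentException => true }

// With the legacy flag on, validation is skipped and the write can proceed.
validateWriteSchema(Nil, allowEmptySchemaWrite = true)
assert(rejected)
```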

Why are the changes needed?

Without this change, a backward incompatibility is introduced when applications migrate past Spark 2.3, since Spark 2.4 made a breaking change that disallows empty schemas. Because some file formats, such as ORC, support empty schemas, we should honor that by not rejecting them during validation.

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

Added a unit test to test this behavior

@thejdeep
Contributor Author

cc: @cloud-fan @mridulm @gatorsmile Thanks

@AmplabJenkins

Can one of the admins verify this patch?

```scala
s"using $format when ${emptySchemaValidationConf} is enabled") {
  withSQLConf(emptySchemaValidationConf -> "true") {
    withTempPath { outputPath =>
      spark.emptyDataFrame.write.format(format).save(outputPath.toString)
```
Contributor

Is it possible to validate the file content for this test case?

Contributor Author

Reading files in certain formats, such as ORC, requires the schema to be specified when it cannot be inferred. Hence, I did not go down the route of validating the contents by reading the written path back.

Contributor

@xkrogen xkrogen left a comment


An approach like this seems like it will work, but it would be nice to do validation within the subclasses of FileFormat, so that each one can declare whether or not it supports empty schemas. This would be similar to the supportDataType() method that already exists on FileFormat. As-is, it seems somewhat wrong that we allow empty schemas even for formats that we know don't support them.

Alternatively, we could consider this behavior purely a mechanism to skip this validation for legacy purposes, rather than a new "feature" to maintain moving forward. I think this is a valid viewpoint and could make the current approach reasonable. In that case, we should put "legacy" in the config name to make it clear that this functionality exists purely for backwards-compatibility purposes.
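
As a sketch of the per-format alternative (a hypothetical API loosely mirroring the existing `FileFormat.supportDataType` hook — not actual Spark code; the names and the simplified write check are assumptions):

```scala
// Each file format declares whether it can write a zero-field schema,
// analogous to how supportDataType lets a format veto individual types.
trait FileFormatLike {
  def shortName: String
  def supportsEmptySchema: Boolean
}

object OrcLike extends FileFormatLike {
  val shortName = "orc"
  val supportsEmptySchema = true   // ORC can represent an empty schema
}

object ParquetLike extends FileFormatLike {
  val shortName = "parquet"
  val supportsEmptySchema = false  // Parquet cannot
}

// The write path would then consult the format instead of a global config.
def canWrite(format: FileFormatLike, fieldCount: Int): Boolean =
  fieldCount > 0 || format.supportsEmptySchema

assert(canWrite(OrcLike, 0))
assert(!canWrite(ParquetLike, 0))
assert(canWrite(ParquetLike, 3))
```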

@thejdeep
Contributor Author

@dongjoon-hyun @cloud-fan Can you please take a look at this PR?

@thejdeep
Contributor Author

@xkrogen @robreeves Thanks for your feedback, addressed the comments.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Sep 25, 2022
@cloud-fan cloud-fan removed the Stale label Sep 26, 2022
@cloud-fan
Contributor

The regression was caused by 5c9eaa6

I think the major question is: is it a valid use case to write data with an empty schema to file sources? If it's valid, I think we should let each file source do the schema checking. If not, we can add a legacy config to skip the schema check and restore the old behavior.

@thejdeep
Contributor Author

thejdeep commented Nov 1, 2022

@cloud-fan We have had users writing data with empty schemas in production, and changing the schema of a non-trivial number of rows seems like a big change. Spark allows creating empty schemas and supports reading them, so it seems fitting to also support writing them out. This PR adds a legacy configuration so that users can choose to skip the validation check and restore the old behavior. What are your thoughts on this? Thanks!

Looks like this was also brought up earlier as part of the original change's PR discussion: #20579 (comment)

@xkrogen
Contributor

xkrogen commented Nov 18, 2022

@cloud-fan , any more concerns on this approach based on what @thejdeep shared?

@thejdeep
Contributor Author

thejdeep commented Dec 2, 2022

@cloud-fan Are there any further changes you suggest need to be done? Thanks for taking a look.

@mridulm
Contributor

mridulm commented Jan 10, 2023

For our migration use cases, this is currently an issue; it would be great to include it in 3.4.
Any thoughts on this PR @dongjoon-hyun, @cloud-fan? Thanks.

@mridulm
Contributor

mridulm commented Jan 14, 2023

Can you look at the test failures @thejdeep? Can you try updating to the latest? That might be sufficient to fix it.

…hemas in supported filebased datasources

 ### What changes were proposed in this pull request?
 Add SQL configuration `spark.sql.sources.file.allowEmptySchemaWrite` to allow support for writing out empty schemas to certain file based datasources that support it.

 ### Why are the changes needed?
 Without this change, a backward incompatibility is introduced when applications migrate past Spark 2.3, since Spark 2.4 made a breaking change that disallows empty schemas. Because some file formats, such as ORC, support empty schemas, we should honor that by not rejecting them during validation.

 ### Does this PR introduce _any_ user-facing change?
 Yes

 ### How was this patch tested?
 Added a unit test to test this behavior
Member

@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM.

@dongjoon-hyun
Member

Please re-trigger the tests to make sure. I believe we can have this patch in Apache Spark 3.4.0.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-38651][SQL] Add configuration to support writing out empty schemas in supported filebased datasources [SPARK-38651][SQL] Add spark.sql.legacy.allowEmptySchemaWrite Jan 14, 2023
@thejdeep
Contributor Author

Thanks for reviewing @dongjoon-hyun. I have updated the branch and addressed the comments. Will wait for the build to run.

@dongjoon-hyun dongjoon-hyun self-assigned this Jan 14, 2023
@mridulm
Contributor

mridulm commented Jan 14, 2023

Thanks for the review @dongjoon-hyun !
Once the tests pass we can merge it.

@dongjoon-hyun
Member

I converted the configuration from public to internal and adjusted the indentation.

```
SPARK-34454: configs from the legacy namespace should be internal *** FAILED *** (6 milliseconds)
```

@mridulm
Contributor

mridulm commented Jan 14, 2023

Ah! I was looking at the latest PR and was not sure why it was complaining, since I saw it as internal — I did not realize you had updated it @dongjoon-hyun :-) Thanks!

@dongjoon-hyun
Member

Merged to master for Apache Spark 3.4.0.
Thank you, @thejdeep , @mridulm , @cloud-fan , @xkrogen

@mridulm
Contributor

mridulm commented Jan 14, 2023

Thanks @dongjoon-hyun !

@cloud-fan
Contributor

Sorry for the late review. If this is a valid use case, shall we just allow it in certain file formats like ORC? We can pass the entire query schema to FileFormat.supportsDataType and only fail on an empty StructType if we are sure the format doesn't support it (Parquet is one of them).
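
A rough sketch of the recursive check this implies (assumed shape — simplified stand-in types rather than Spark's actual `DataType` hierarchy):

```scala
// Walk the query schema and flag any empty StructType, so a format that is
// known not to support them (e.g. Parquet) can reject the write up front.
sealed trait DT
final case class StructT(fields: List[(String, DT)]) extends DT
case object IntT extends DT

def containsEmptyStruct(dt: DT): Boolean = dt match {
  case StructT(Nil)    => true
  case StructT(fields) => fields.exists { case (_, t) => containsEmptyStruct(t) }
  case _               => false
}

assert(containsEmptyStruct(StructT(Nil)))                        // top-level empty schema
assert(containsEmptyStruct(StructT(List("a" -> StructT(Nil)))))  // nested empty struct
assert(!containsEmptyStruct(StructT(List("a" -> IntT))))         // non-empty schema is fine
```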

@xkrogen
Contributor

xkrogen commented Apr 19, 2023

I am supportive of @cloud-fan's proposal above and think it simplifies things from the user perspective as well (no need to worry about setting a config). WDYT @thejdeep @mridulm @dongjoon-hyun?

@dongjoon-hyun dongjoon-hyun removed their assignment Apr 14, 2024