
[SPARK-38651][SQL] Add spark.sql.legacy.allowEmptySchemaWrite #35969

Closed
wants to merge 3 commits

Conversation

thejdeep
Contributor

@thejdeep thejdeep commented Mar 25, 2022

What changes were proposed in this pull request?

Add SQL configuration `spark.sql.legacy.allowEmptySchemaWrite` to allow support for writing out empty schemas to certain file-based datasources that support it.
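
For illustration, a minimal sketch of the gating logic this config implies (assumed shape only — the real validation lives in Spark's `DataSource` write path; the names below are hypothetical stand-ins):

```scala
// Hypothetical stand-in for the write-path schema validation, showing how a
// legacy flag can bypass the empty-schema check introduced in Spark 2.4.
final case class Field(name: String, dataType: String)

def validateWriteSchema(fields: Seq[Field], allowEmptySchemaWrite: Boolean): Unit = {
  if (fields.isEmpty && !allowEmptySchemaWrite) {
    throw new IllegalArgumentException(
      "Datasource does not support writing empty or nested empty schemas.")
  }
}

// Default behavior (flag off): an empty schema is rejected.
val rejected =
  try { validateWriteSchema(Nil, allowEmptySchemaWrite = false); false }
  catch { case _: IllegalArgumentException => true }

// With the legacy flag on, validation is skipped and the write can proceed.
validateWriteSchema(Nil, allowEmptySchemaWrite = true)
assert(rejected)
```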

Why are the changes needed?

Without this change, a backward incompatibility is introduced when applications migrate past Spark 2.3, since Spark 2.4 made a breaking change that disallows empty schemas. Because some file formats, such as ORC, support empty schemas, we should honor that by not rejecting them during validation.

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

Added a unit test to test this behavior

@thejdeep
Contributor Author

cc: @cloud-fan @mridulm @gatorsmile Thanks

@AmplabJenkins

Can one of the admins verify this patch?

```scala
s"using $format when ${emptySchemaValidationConf} is enabled") {
  withSQLConf(emptySchemaValidationConf -> "true") {
    withTempPath { outputPath =>
      spark.emptyDataFrame.write.format(format).save(outputPath.toString)
```
Contributor

Is it possible to validate the file content for this test case?

Contributor Author

Reading files in certain formats, such as ORC, requires the schema to be specified when it cannot be inferred. Hence, I did not go down the route of validating the contents by reading the written path back.

Contributor

@xkrogen xkrogen left a comment


An approach like this seems like it will work, but it would be nice to do validation within the subclasses of FileFormat, so that each one can declare whether or not it supports empty schemas. This would be similar to the supportDataType() method that already exists on FileFormat. As-is, it seems somewhat wrong that we allow empty schemas even for formats that we know don't support them.

Alternatively, we could consider this behavior purely a mechanism to skip this validation for legacy purposes, rather than a new "feature" to maintain moving forward. I think this is a valid viewpoint and could make the current approach reasonable. In that case, we should put "legacy" in the config name to make it clear that this functionality exists purely for backwards-compatibility purposes.
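
As a sketch of the per-format alternative (a hypothetical API loosely mirroring the existing `FileFormat.supportDataType` hook — not actual Spark code; the names and the simplified write check are assumptions):

```scala
// Each file format declares whether it can write a zero-field schema,
// analogous to how supportDataType lets a format veto individual types.
trait FileFormatLike {
  def shortName: String
  def supportsEmptySchema: Boolean
}

object OrcLike extends FileFormatLike {
  val shortName = "orc"
  val supportsEmptySchema = true   // ORC can represent an empty schema
}

object ParquetLike extends FileFormatLike {
  val shortName = "parquet"
  val supportsEmptySchema = false  // Parquet cannot
}

// The write path would then consult the format instead of a global config.
def canWrite(format: FileFormatLike, fieldCount: Int): Boolean =
  fieldCount > 0 || format.supportsEmptySchema

assert(canWrite(OrcLike, 0))
assert(!canWrite(ParquetLike, 0))
assert(canWrite(ParquetLike, 3))
```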

@thejdeep
Contributor Author

@dongjoon-hyun @cloud-fan Can you please take a look at this PR?

@thejdeep
Contributor Author

@xkrogen @robreeves Thanks for your feedback, addressed the comments.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Sep 25, 2022
@cloud-fan cloud-fan removed the Stale label Sep 26, 2022
@cloud-fan
Contributor

The regression was caused by 5c9eaa6

I think the major question is: is it a valid use case to write data with an empty schema to file sources? If it's valid, I think we should let each file source do the schema checking. If not, we can add a legacy config to skip the schema check and restore the old behavior.

@thejdeep
Contributor Author

thejdeep commented Nov 1, 2022

@cloud-fan We have had users writing data with empty schemas in production, and changing the schema of a non-trivial number of rows seems like a big change. Spark allows creating empty schemas and supports reading them, so it seems fitting to also support writing them out. This PR adds a legacy configuration so that users can choose to skip the validation check and restore the old behavior. What are your thoughts on this? Thanks!

Looks like this was also brought up earlier as part of the original change's PR discussion: #20579 (comment)

@xkrogen
Contributor

xkrogen commented Nov 18, 2022

@cloud-fan , any more concerns on this approach based on what @thejdeep shared?

@thejdeep
Contributor Author

thejdeep commented Dec 2, 2022

@cloud-fan Are there any further changes you suggest need to be done? Thanks for taking a look.

@mridulm
Contributor

mridulm commented Jan 10, 2023

For our migration use cases, this is currently an issue; it would be great to include it in 3.4.
Any thoughts on this PR @dongjoon-hyun, @cloud-fan? Thanks.

@mridulm
Contributor

mridulm commented Jan 14, 2023

Can you look at the test failures @thejdeep? Can you try updating to the latest? That might be sufficient to fix it.

…hemas in supported filebased datasources

 ### What changes were proposed in this pull request?
 Add SQL configuration `spark.sql.sources.file.allowEmptySchemaWrite` to allow support for writing out empty schemas to certain file based datasources that support it.

 ### Why are the changes needed?
 Without this change, a backward incompatibility is introduced when applications migrate past Spark 2.3, since Spark 2.4 made a breaking change that disallows empty schemas. Because some file formats, such as ORC, support empty schemas, we should honor that by not rejecting them during validation.

 ### Does this PR introduce _any_ user-facing change?
 Yes

 ### How was this patch tested?
 Added a unit test to test this behavior
Member

@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM.

@dongjoon-hyun
Member

Please re-trigger the tests to make sure. I believe we can have this patch in Apache Spark 3.4.0.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-38651][SQL] Add configuration to support writing out empty schemas in supported filebased datasources [SPARK-38651][SQL] Add spark.sql.legacy.allowEmptySchemaWrite Jan 14, 2023
@thejdeep
Contributor Author

Thanks for reviewing @dongjoon-hyun. I have updated the branch and addressed the comments. Will wait for the build to run.

@dongjoon-hyun dongjoon-hyun self-assigned this Jan 14, 2023
@mridulm
Contributor

mridulm commented Jan 14, 2023

Thanks for the review @dongjoon-hyun !
Once the tests pass we can merge it.

@dongjoon-hyun
Member

I converted the configuration from public to internal and adjusted the indentation.

```
SPARK-34454: configs from the legacy namespace should be internal *** FAILED *** (6 milliseconds)
```

@mridulm
Contributor

mridulm commented Jan 14, 2023

Ah! I was looking at the latest PR and was not sure why it was complaining, since I saw it as internal — I did not realize you had updated it @dongjoon-hyun :-) Thanks!

@dongjoon-hyun
Member

Merged to master for Apache Spark 3.4.0.
Thank you, @thejdeep , @mridulm , @cloud-fan , @xkrogen

@mridulm
Contributor

mridulm commented Jan 14, 2023

Thanks @dongjoon-hyun !

@cloud-fan
Contributor

Sorry for the late review. If this is a valid use case, shall we just allow it in certain file formats like ORC? We can pass the entire query schema to FileFormat.supportsDataType and only fail on an empty StructType if we are sure the format doesn't support it (Parquet is one of them).
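
A rough sketch of the recursive check this implies (assumed shape — simplified stand-in types rather than Spark's actual `DataType` hierarchy):

```scala
// Walk the query schema and flag any empty StructType, so a format that is
// known not to support them (e.g. Parquet) can reject the write up front.
sealed trait DT
final case class StructT(fields: List[(String, DT)]) extends DT
case object IntT extends DT

def containsEmptyStruct(dt: DT): Boolean = dt match {
  case StructT(Nil)    => true
  case StructT(fields) => fields.exists { case (_, t) => containsEmptyStruct(t) }
  case _               => false
}

assert(containsEmptyStruct(StructT(Nil)))                        // top-level empty schema
assert(containsEmptyStruct(StructT(List("a" -> StructT(Nil)))))  // nested empty struct
assert(!containsEmptyStruct(StructT(List("a" -> IntT))))         // non-empty schema is fine
```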

@xkrogen
Contributor

xkrogen commented Apr 19, 2023

I am supportive of @cloud-fan's proposal above and think it simplifies things from the user perspective as well (no need to worry about setting a config). WDYT @thejdeep @mridulm @dongjoon-hyun?

@dongjoon-hyun dongjoon-hyun removed their assignment Apr 14, 2024