[SPARK-38651][SQL] Add spark.sql.legacy.allowEmptySchemaWrite
#35969
Conversation
cc: @cloud-fan @mridulm @gatorsmile Thanks
Can one of the admins verify this patch?
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
s"using $format when ${emptySchemaValidationConf} is enabled") {
  withSQLConf(emptySchemaValidationConf -> "true") {
    withTempPath { outputPath =>
      spark.emptyDataFrame.write.format(format).save(outputPath.toString)
Is it possible to validate the file content for this test case?
Reading back files in certain formats like ORC requires the schema to be specified when it cannot be inferred. Hence, I did not go down the route of validating the contents by reading and loading the written path again.
An approach like this seems like it will work, but it would be nice to do the validation within the subclasses of FileFormat, so that each one can declare whether or not it supports empty schemas. This would be similar to the supportDataType() method that already exists on FileFormat. As-is, it seems somewhat wrong that we allow empty schemas even on formats that we know don't support them.
Alternatively, we could consider this behavior purely a mechanism to skip this validation for legacy purposes, rather than a new "feature" to maintain moving forward. I think this is a valid viewpoint and could make the current approach reasonable. In that case, we should put "legacy" in the config name to make it clear that this functionality exists purely for backwards-compatibility purposes.
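The per-format suggestion above can be sketched in plain Scala. This is a hypothetical illustration of the reviewer's idea, not the actual Spark API: the trait and object names (`FileFormatCapability`, `OrcLike`, `CsvLike`) are invented for the example, and the real `FileFormat` trait lives in `org.apache.spark.sql.execution.datasources`.

```scala
// Hypothetical sketch, not Spark source: each file format declares whether it
// can write an empty (zero-column) schema, analogous to the existing
// supportDataType() hook on FileFormat. All names here are illustrative.
trait FileFormatCapability {
  // Conservative default: formats must explicitly opt in to empty-schema writes.
  def supportsEmptySchema: Boolean = false
}

object OrcLike extends FileFormatCapability {
  override def supportsEmptySchema: Boolean = true // ORC can store zero columns
}

object CsvLike extends FileFormatCapability // keeps the default: unsupported
```

With a hook like this, the central write path could ask the format itself instead of applying one global validation rule, which is the asymmetry the comment above points out.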
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala
@dongjoon-hyun @cloud-fan Can you please take a look at this PR?
@xkrogen @robreeves Thanks for your feedback, addressed the comments.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
The regression was caused by 5c9eaa6. I think the major question is: is it a valid use case to write data with an empty schema to file sources? If it's valid, I think we should let each file source do the schema checking. If not, we can add a legacy config to skip the schema check and restore the old behavior.
@cloud-fan We have had users writing data with empty schemas in production, and forcing a schema change on a non-trivial number of jobs seems like a big ask. Spark allows creating empty schemas and supports reading them, so it seems fitting to also support writing them out. This PR adds a legacy configuration so that users can choose to skip the validation check and restore the old behavior. What are your thoughts on this? Thanks! Looks like this was also brought up earlier as part of the discussion on the original change: #20579 (comment)
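The legacy-config approach being debated here can be sketched without Spark at all. This is a minimal, assumed-name illustration (`SimpleSchema` and `validateSchemaForWrite` are invented for the example) of how a boolean flag gates the empty-schema validation: when the flag is on, the check is skipped and the pre-2.4 behavior is restored.

```scala
// Minimal sketch (assumed names, not Spark source) of a legacy flag gating
// an empty-schema validation check on the write path.
final case class SimpleSchema(fieldNames: Seq[String])

def validateSchemaForWrite(schema: SimpleSchema, allowEmptySchemaWrite: Boolean): Unit = {
  // When the legacy flag is set, skip validation entirely (old behavior).
  if (!allowEmptySchemaWrite && schema.fieldNames.isEmpty) {
    throw new IllegalArgumentException(
      "Datasource does not support writing empty or nested empty schemas.")
  }
}
```

The trade-off discussed in this thread is exactly whether this one global switch is enough, or whether the decision should live inside each file format.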
@cloud-fan, any more concerns on this approach based on what @thejdeep shared?
@cloud-fan Any further changes that you suggest need to be done? Thanks for taking a look.
For our migration use cases, this is currently an issue - it would be great to include it in 3.4.
Can you look at the test failures, @thejdeep? Can you try updating to the latest? That might be sufficient to fix it.
…hemas in supported filebased datasources

### What changes were proposed in this pull request?
Add SQL configuration `spark.sql.sources.file.allowEmptySchemaWrite` to allow writing out empty schemas to certain file-based datasources that support it.

### Why are the changes needed?
Without this change, there is a backward incompatibility introduced when applications are migrated past Spark 2.3, since Spark 2.4 introduced a breaking change that disallows empty schemas. Since some file formats like ORC support empty schemas, we should honor this by not validating for empty schemas.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Added a unit test to test this behavior
+1, LGTM.
Please re-trigger the tests to make sure. I believe we can have this patch in Apache Spark 3.4.0.
spark.sql.legacy.allowEmptySchemaWrite
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala
Thanks for reviewing @dongjoon-hyun. I have updated the branch and addressed the comments. Will wait for the build to run.
Thanks for the review @dongjoon-hyun!
I converted the configuration from public to internal and adjusted the indentation.
Ah! I was looking at the latest PR and was not sure why it was complaining, since I saw it as
Merged to master for Apache Spark 3.4.0.
Thanks @dongjoon-hyun!
Sorry for the late review. If this is a valid use case, shall we just allow it in certain file formats like ORC? We can pass the entire query schema to
I am supportive of @cloud-fan's proposal above and think this simplifies things from the user perspective as well (no need to worry about setting a config). WDYT @thejdeep @mridulm @dongjoon-hyun?
What changes were proposed in this pull request?
Add SQL configuration `spark.sql.legacy.allowEmptySchemaWrite` to allow writing out empty schemas to certain file-based datasources that support it.

Why are the changes needed?
Without this change, there is a backward incompatibility introduced when applications are migrated past Spark 2.3, since Spark 2.4 introduced a breaking change that disallows empty schemas. Since some file formats like ORC support empty schemas, we should honor this by not validating for empty schemas.

Does this PR introduce any user-facing change?
Yes

How was this patch tested?
Added a unit test to test this behavior
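As a usage sketch: a migrating job would opt back into the pre-2.4 behavior by setting the config before writing. This assumes a running `SparkSession` named `spark` and a format that tolerates zero columns (ORC is the example discussed in this thread); the output path is illustrative.

```scala
// Opt back into the legacy behavior for this session, then write a
// zero-column DataFrame; without the flag this write fails validation
// with "Datasource does not support writing empty or nested empty schemas".
spark.conf.set("spark.sql.legacy.allowEmptySchemaWrite", "true")
spark.emptyDataFrame.write.format("orc").save("/tmp/empty-orc-output")
```

The same flag can equally be passed at submit time via `--conf spark.sql.legacy.allowEmptySchemaWrite=true`, which avoids a code change in the migrating job.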