[SPARK-30098][SQL] Add a configuration to use default datasource as provider for CREATE TABLE command #30554
Conversation
cc @rdblue
Do you think we need a vote for this, @cloud-fan and @gatorsmile?
Test build #132003 has finished for PR 30554 at commit
docs/sql-migration-guide.md (outdated)

- In Spark 3.1, creating or altering a view will capture runtime SQL configs and store them as view properties. These configs will be applied during the parsing and analysis phases of the view resolution. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.useCurrentConfigsForView` to `true`.
- In Spark 3.1, `CREATE TABLE` without a specific table provider uses the value of `spark.sql.sources.default` as its table provider. In Spark version 3.0 and below, it was Hive. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.createHiveTableByDefault.enabled` to `true`.
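The rule in the migration note above can be sketched in plain Python. This is an illustrative model only, not Spark source; the function name and parameters are hypothetical:

```python
# Hypothetical sketch of the provider-resolution rule described in the
# migration note above. Not Spark source code.

def resolve_table_provider(explicit_provider, legacy_create_hive_table,
                           default_source="parquet"):
    """Pick the table provider for a CREATE TABLE statement.

    - A provider named in the statement (USING ...) always wins.
    - Otherwise, Spark 3.1 falls back to spark.sql.sources.default,
      unless the legacy flag restores the pre-3.1 Hive default.
    """
    if explicit_provider is not None:
        return explicit_provider
    if legacy_create_hive_table:
        return "hive"
    return default_source
```

For example, `resolve_table_provider(None, False)` models the new Spark 3.1 default (Parquet), while `resolve_table_provider(None, True)` models the restored pre-3.1 Hive behavior.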
I don't think that the default behavior of CREATE TABLE should change in a point release. Why is this considered a "safe" change to make?

This could easily break existing workflows and should be done in a major release.

I think a vote should be required before committing this if the intent is to change the default behavior of CREATE TABLE. I'm not sure that a vote would even be appropriate, given that breaking behavior changes are generally not included in point releases.
+1 for @rdblue's opinion.
I don't think this is something banned, as long as we have a way to restore the previous behaviour. More importantly, in Apache Spark, all items in the migration guide are behaviour changes, since we don't usually list bug fixes in the migration guide. I can provide a bunch of similar examples if you doubt it. So I believe the point we should discuss/focus on here is that this is a significant behaviour change, since …
I'd agree to change it even in a minor release if we had discussed this for enough time in public, and put some effort into figuring out the impacts to end users and providing guidance beforehand (like a roadmap). I don't think we did anything I mentioned. The last discussion we had before Spark 3.0.0 was more a concern about the change being made without proper discussion, and we reverted it.

Unifying the create table syntax fixes the long-standing issue of two confusing CREATE TABLE syntaxes, but that's it; it's not a rationale for changing the default provider. Changing the default provider for create table is a totally different story. My experience of the Spark community says that we're mostly reluctant to make a backward-incompatible change (even when we did, it was for a major release), and sometimes we keep the old behavior by default even when new functionality is available.

I'm surprised the PR description doesn't mention anything about impacts. I agree this requires enough discussion in public before going further. In that discussion we should make clear the benefits of changing this, AND all the possible impacts of changing this.
I agree with the point about describing things and explaining the rationale. I would have expected a detailed PR description as well. Enough discussion should of course be held for a significant change, and I don't believe this PR was intended to be pushed through without it. Also, I don't believe such discussions have to happen in one specific place like the mailing list; they can happen in JIRA, the PR, etc. One alternative is just to turn this switch off by default, issue warnings, and turn it on later in another release in the future. Just to make it extra clear: what this PR changes looks correct and straightforward to me. Spark should create a Spark table when users …
(force-pushed from 043a11b to 66c6495)
I've updated the PR description to include more details. I think this is not a big breaking change that needs a formal vote (we don't vote for every breaking change). It's mostly about Hive compatibility, not behavior changes. A normal PR review process should work here. But if someone has a different opinion, I'm OK to do a vote as well.
(force-pushed from 66c6495 to ddfa0e8)
From reading this thread, I think it is safe to say that @HeartSaVioR, @dongjoon-hyun, and I have all expressed the opinion that this requires a wider discussion on the dev list, followed by a vote if there isn't already clear consensus. I think that discussion also needs to address how this will affect users with jobs running in Spark 3.0, and why you consider it safe enough for a point release.
I'll turn off the config once all tests pass, so that it's optional. I believe this is at least a useful feature for Spark vendors. In the meantime, I'll send an email to discuss it. BTW, just to make it clear: breaking changes are allowed in point releases. You can read the migration guide for each point release to see examples.
sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala
Can you explain this reasoning more clearly? How is changing the file format by default a corner case?

I don't think this is accurate. What about the file format change above? Is that not a behavior difference? Hive tables support partitions that use different formats and serdes, but Spark tables do not. That is a table behavior difference as well.
I don't think there is any objection to having a config so you can change it in your environment. It is just that there should be more discussion at a minimum about breaking changes like this. I also don't think that Spark's versioning policy allows it.
Can you point to where this is allowed by the versioning policy?
I was saying that accessing the data files of Spark tables directly is a corner case. Users create a table because they want to access the data by table name, not by directory.

This is a good point. It's not possible to do it via Spark DDL commands, but people can create a table with Spark, then add partitions with different serdes with Hive. I'll put it in the …
Quoted below:
We try to avoid breaking changes, but we don't completely forbid it. Breaking changes are allowed when it's justified. We definitely don't allow arbitrary breaking changes.
@cloud-fan. If this is a new configuration PR which is disabled by default, could you adjust the PR title and description and remove …
All tests pass; I'm changing it now.
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala
buildConf("spark.sql.legacy.createHiveTableByDefault")
  .internal()
  .doc("When set to true, CREATE TABLE syntax without a table provider will use hive " +
    s"instead of the value of ${DEFAULT_DATA_SOURCE_NAME.key} as the table provider.")
If this is a "legacy" key, mention how long folks can depend on it.
There are already a lot of legacy configurations, and there's no plan for how long we'll keep them. That's what I know as far as I have followed the community. It would be a separate issue to decide the lifetime of legacy configurations, but ideally they will be removed at the next major release bump, I guess. I remember I discussed this with Sean somewhere as well. cc @srowen FYI
If we have a general policy for legacy configs that means this stays the same until Spark 4 by default, I guess there is no need to document it here (I don't remember that conversation, but I was out for a few months last/this year).
Test build #132034 has finished for PR 30554 at commit
retest this please
Test build #132043 has finished for PR 30554 at commit
retest this please
I think that you're misinterpreting it. That means that Spark avoids breaking APIs and behavior even at major releases. It is more strict, not more lax. It does not mean that the rules are flexible and Spark will make breaking behavior changes. This expectation is set in the first paragraph, where it states that Spark will follow semver, except for a small set of multi-module issues.
// 1. `LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT` is false, or
// 2. It's a CTAS and `conf.convertCTAS` is true.
val createHiveTableByDefault = conf.getConf(SQLConf.LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT)
if (!createHiveTableByDefault || (ctas && conf.convertCTAS)) {
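The condition in the quoted Scala can be restated as a tiny Python sketch (names are illustrative; this mirrors the quoted logic, not Spark internals):

```python
# Mirrors the quoted condition:
#   if (!createHiveTableByDefault || (ctas && conf.convertCTAS))
# The command creates a data source (native) table when the legacy flag is
# off, or when it is a CTAS and spark.sql.hive.convertCTAS is enabled.

def creates_datasource_table(create_hive_table_by_default: bool,
                             is_ctas: bool,
                             convert_ctas: bool) -> bool:
    return (not create_hive_table_by_default) or (is_ctas and convert_ctas)
```

Note that with the legacy flag on, a plain CREATE TABLE still produces a Hive table; only a CTAS with `convertCTAS` enabled is converted.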
Should this mark convertCTAS as deprecated, since it is superseded by the new config?
yea, I can do it in a follow-up.
Now that this doesn't change the behavior for external tables and is not enabled by default, I think this is ready when tests are passing.
Test build #132058 has finished for PR 30554 at commit
+1, LGTM. Thank you, @cloud-fan and all.
- The 3 Jenkins UT failures are flaky ones; I saw them recently on the master branch.
- The 1 GitHub Actions UT failure looks weird:
CliSuite.SPARK-29022 Commands using SerDe provided in ADD JAR sql
So, I checked it manually.
$ build/sbt "hive-thriftserver/testOnly *.CliSuite -- -z SPARK-29022" -Phive-thriftserver
[info] CliSuite:
[info] - SPARK-29022: Commands using SerDe provided in --hive.aux.jars.path (16 seconds, 860 milliseconds)
[info] - SPARK-29022 Commands using SerDe provided in ADD JAR sql (14 seconds, 268 milliseconds)
...
[info] All tests passed.
[info] Passed: Total 2, Failed 0, Errors 0, Passed 2
retest this please
Test build #132123 has finished for PR 30554 at commit
Kubernetes integration test starting
retest this please
Kubernetes integration test status success
GA passed, merging to master, thanks for the review!
Test build #132131 has finished for PR 30554 at commit
### What changes were proposed in this pull request?

This is a followup of #30554. Now that we have a new config for converting CREATE TABLE, we don't need the old config that only works for CTAS.

### Why are the changes needed?

It's confusing to have two configs when one completely covers the other.

### Does this PR introduce _any_ user-facing change?

No, it's deprecating, not removing.

### How was this patch tested?

N/A

Closes #30651 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
… `false` by default

### What changes were proposed in this pull request?

This PR aims to switch `spark.sql.legacy.createHiveTableByDefault` to `false` by default, in order to move away from this legacy behavior from Apache Spark 4.0.0, while the legacy functionality will be preserved during the Apache Spark 4.x period by setting `spark.sql.legacy.createHiveTableByDefault=true`.

### Why are the changes needed?

Historically, this behavior change was merged during the Apache Spark 3.0.0 activity in SPARK-30098 and reverted officially during the 3.0.0 RC period.

- 2019-12-06: #26736 (58be82a)
- 2019-12-06: https://lists.apache.org/thread/g90dz1og1zt4rr5h091rn1zqo50y759j
- 2020-05-16: #28517

At Apache Spark 3.1.0, we had another discussion and defined it as legacy behavior via a new configuration, reusing the JIRA ID SPARK-30098.

- 2020-12-01: https://lists.apache.org/thread/8c8k1jk61pzlcosz3mxo4rkj5l23r204
- 2020-12-03: #30554

Last year, this was proposed again twice, and Apache Spark 4.0.0 is a good time to make a decision on Apache Spark's future direction.

- SPARK-42603 on 2023-02-27, as an independent idea.
- SPARK-46122 on 2023-11-27, as part of the Apache Spark 4.0.0 idea.

### Does this PR introduce _any_ user-facing change?

Yes, the migration document is updated.

### How was this patch tested?

Pass the CIs with the adjusted test cases.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46207 from dongjoon-hyun/SPARK-46122.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
What changes were proposed in this pull request?

For the CREATE TABLE [AS SELECT] command, create a native Parquet table if neither USING nor STORED AS is specified and `spark.sql.legacy.createHiveTableByDefault` is false. This is a retry after we unified the CREATE TABLE syntax. It partially reverts d2bec5e

This PR allows `CREATE EXTERNAL TABLE` when `LOCATION` is present. This was not allowed for data source tables before, which was an unnecessary behavior difference from Hive tables.

Why are the changes needed?

Changing from a Hive text table to a native Parquet table has many benefits:
- consistency with `DataFrameWriter.saveAsTable`.
- correct handling of complex types (`insert into t values struct(null)` actually inserts a null value, not `struct(null)`, if `t` is a Hive text table, which leads to wrong results).
is a Hive text table, which leads to wrong result)Does this PR introduce any user-facing change?
No by default. If the config is set, the behavior change is described below:
Behavior-wise, the change is very small as the native Parquet table is also Hive-compatible. All the Spark DDL commands that works for hive tables also works for native Parquet tables, with two exceptions:
ALTER TABLE SET [SERDE | SERDEPROPERTIES]
andLOAD DATA
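As a rough sketch of the compatibility claim above (illustrative Python, not Spark source; the helper and the command strings are hypothetical), the two listed commands are the only Hive-table DDL not carried over:

```python
# Hypothetical helper illustrating the two DDL exceptions called out above.
HIVE_ONLY_DDL = {
    "ALTER TABLE SET SERDE",
    "ALTER TABLE SET SERDEPROPERTIES",
    "LOAD DATA",
}

def supported_on_native_parquet_table(ddl_command: str) -> bool:
    # Every Hive-table DDL command except the serde-related ones and
    # LOAD DATA also works on a native Parquet table.
    return ddl_command not in HIVE_ONLY_DDL
```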
char/varchar behavior has been taken care of by #30412, and there is no behavior difference between data source and Hive tables.

One potential issue is `CREATE TABLE ... LOCATION ...` when users want to directly access the files later. It's more of a corner case, and the legacy config should be good enough.

Another potential issue is that users may use Spark to create the table and then use Hive to add partitions with a different serde. This is not allowed for Spark native tables.
How was this patch tested?
Re-enable the tests