
[SPARK-30098][SQL] Add a configuration to use default datasource as provider for CREATE TABLE command #30554

Closed
wants to merge 4 commits into master from cloud-fan:create-table

Conversation

@cloud-fan (Contributor) commented Nov 30, 2020

What changes were proposed in this pull request?

For the CREATE TABLE [AS SELECT] command, create a native Parquet table if neither USING nor STORED AS is specified and spark.sql.legacy.createHiveTableByDefault is false.

This is a retry after we unified the CREATE TABLE syntax. It partially reverts d2bec5e.

This PR allows CREATE EXTERNAL TABLE when LOCATION is present. This was not allowed for data source tables before, which was an unnecessary behavior difference from Hive tables.
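
For illustration, here is a hedged sketch of the resulting behavior; the table names and paths are hypothetical:

```sql
-- Assuming spark.sql.legacy.createHiveTableByDefault=false and
-- spark.sql.sources.default=parquet; table names and paths are hypothetical.
CREATE TABLE t1 (id INT, data STRING);                -- native Parquet table (no USING / STORED AS)
CREATE TABLE t2 (id INT) USING orc;                   -- explicit provider: unchanged
CREATE TABLE t3 (id INT) STORED AS TEXTFILE;          -- explicit Hive format: still a Hive table
CREATE EXTERNAL TABLE t4 (id INT) LOCATION '/tmp/t4'; -- now also allowed for data source tables
```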

Why are the changes needed?

Changing from Hive text tables to native Parquet tables has several benefits:

  1. It is consistent with DataFrameWriter.saveAsTable.
  2. Better performance.
  3. Better support for nested types: Hive text tables don't work well with nested types. For example, INSERT INTO t VALUES (struct(null)) actually inserts a NULL value, not struct(null), if t is a Hive text table, which leads to wrong results (see the sketch after this list).
  4. Better interoperability, as Parquet is a more popular open file format.
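
A hedged sketch of the nested-type issue from item 3 (table names are hypothetical):

```sql
-- Hive text table: the nested value degrades to a top-level NULL.
CREATE TABLE t (col STRUCT<a: INT>) STORED AS TEXTFILE;
INSERT INTO t VALUES (struct(null));
SELECT col FROM t;   -- NULL, not {"a":null}

-- Native Parquet table: the struct is preserved.
CREATE TABLE t2 (col STRUCT<a: INT>) USING parquet;
INSERT INTO t2 VALUES (struct(null));
SELECT col FROM t2;  -- {"a":null}
```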

Does this PR introduce any user-facing change?

No by default. If the config is set, the behavior change is described below:

Behavior-wise, the change is very small, as the native Parquet table is also Hive-compatible. All the Spark DDL commands that work for Hive tables also work for native Parquet tables, with two exceptions: ALTER TABLE SET [SERDE | SERDEPROPERTIES] and LOAD DATA.
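
For reference, the two exceptions look roughly like this; the serde class and path below are illustrative:

```sql
-- Hive-specific commands that keep working on Hive tables but are rejected for
-- native Parquet tables; the serde class and path below are illustrative.
ALTER TABLE t SET SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe';
LOAD DATA INPATH '/tmp/input' INTO TABLE t;
```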

char/varchar behavior has been taken care of by #30412, and there is no behavior difference between data source and Hive tables.

One potential issue is CREATE TABLE ... LOCATION ... when users want to directly access the files later. It's more of a corner case, and the legacy config should be good enough.
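
A minimal sketch of that escape hatch; the path is hypothetical:

```sql
-- Restore the old default for a session; the path is hypothetical.
SET spark.sql.legacy.createHiveTableByDefault=true;
CREATE TABLE t (id INT) LOCATION '/data/warehouse/t'; -- a Hive text table again
```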

Another potential issue is that users may use Spark to create the table and then use Hive to add partitions with a different serde. This is not allowed for Spark native tables.

How was this patch tested?

Re-enabled the tests.

@dongjoon-hyun (Member)

cc @rdblue

@dongjoon-hyun (Member)

Do you think we need a vote for this, @cloud-fan and @gatorsmile?

@SparkQA commented Nov 30, 2020

Test build #132003 has finished for PR 30554 at commit a2c4680.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -54,6 +54,8 @@ license: |

- In Spark 3.1, creating or altering a view will capture runtime SQL configs and store them as view properties. These configs will be applied during the parsing and analysis phases of the view resolution. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.useCurrentConfigsForView` to `true`.

- In Spark 3.1, `CREATE TABLE` without a specific table provider uses the value of `spark.sql.sources.default` as its table provider. In Spark version 3.0 and below, it was Hive. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.createHiveTableByDefault.enabled` to `true`.
A Contributor commented on this diff:

I don't think that the default behavior of CREATE TABLE should change in a point release. Why is this considered a "safe" change to make?

This could easily break existing workflows and should be done in a major release.

@rdblue (Contributor) commented Nov 30, 2020

Do you think we need a vote for this, @cloud-fan and @gatorsmile?

I think a vote should be required before committing this if the intent is to change the default behavior of CREATE TABLE. I'm not sure that a vote would even be appropriate given that breaking behavior changes are generally not included in point releases.

@dongjoon-hyun (Member)

+1 for @rdblue's opinion.

@HyukjinKwon (Member)

I don't think this kind of change is banned, as long as we have a way to restore the previous behaviour. More importantly, in Apache Spark, CREATE TABLE should create a Spark table, not a Hive table.

All items in the migration guide are behaviour changes, since we don't usually list bug fixes in the migration guide. I can provide a bunch of similar examples if you doubt this.

So I believe the point we should discuss/focus on here is that this is a significant behaviour change, since CREATE TABLE is the very entry point for users. Personally, a vote sounds fine to me given that two committers think it might be needed.

@HeartSaVioR (Contributor) commented Dec 1, 2020

I'd agree to changing this even in a minor release if we had discussed it in public for long enough, and had put some effort into figuring out the impact on end users and guiding them in advance (e.g. via a roadmap).

I don't think we did any of that. The last discussion we had, before Spark 3.0.0, was mostly concern about making the change without proper discussion, and we reverted it. Unifying the CREATE TABLE syntax fixes the long-standing issue of two confusing CREATE TABLE syntaxes, but that's it; it's not a rationale for changing the default provider.

Changing the default provider for CREATE TABLE is a totally different story. My experience with the Spark community is that we're generally reluctant to make backward-incompatible changes (even in major releases), and sometimes we keep the old behavior as the default even when the new functionality is available. I'm surprised the PR description doesn't mention anything about impacts.

I agree this requires enough public discussion before going further. In that discussion we should make clear the benefits of changing this, AND all the possible impacts of changing it.

@HyukjinKwon (Member) commented Dec 1, 2020

I agree with the point about describing things and explaining the rationale. I would have expected a detailed PR description as well.

Enough discussion should of course happen for a significant change. I don't believe this PR was intended to be pushed through without it. Also, I don't believe such discussions have to happen in one specific place like the mailing list. They can happen in JIRA, the PR, etc.

One alternative is to just turn this switch off by default, issue warnings, and turn it on in a later release.

Just to make it extra clear: what this PR changes looks correct and straightforward to me. Spark should create a Spark table when users run CREATE TABLE.

@cloud-fan force-pushed the create-table branch 2 times, most recently from 043a11b to 66c6495 on December 1, 2020 09:10
@cloud-fan (Contributor, Author) commented Dec 1, 2020

I've updated the PR description to include more details. I don't think this is a big breaking change that needs a formal vote (we don't vote for every breaking change). It's mostly about Hive compatibility, not behavior changes. A normal PR review process should work here. But if someone has a different opinion, I'm OK with doing a vote as well.

@rdblue (Contributor) commented Dec 1, 2020

But if someone has a different opinion, I'm OK with doing a vote as well.

From reading this thread, I think it is safe to say that @HeartSaVioR, @dongjoon-hyun, and I have all expressed an opinion that this requires a wider discussion on the dev list, followed by a vote if there isn't already clear consensus.

I think that discussion also needs to address how this will affect users with jobs running in Spark 3.0, and why you consider it safe enough for a point release.

@cloud-fan (Contributor, Author)

I'll turn off the config once all tests pass, so that the new behavior is opt-in. I believe this is at least a useful feature for Spark vendors. In the meantime, I'll send an email to discuss it.

BTW just to make it clear: breaking changes are allowed in point releases. You can read the migration guide for each point release to see examples.

@rdblue (Contributor) commented Dec 1, 2020

One potential issue is CREATE TABLE ... LOCATION ... when users want to directly access the files later. It's more of a corner case, and the legacy config should be good enough.

Can you explain this reasoning more clearly? How is changing the file format by default a corner case?

there is no behavior difference between data source and Hive tables.

I don't think this is accurate. What about the file format change above? Is that not a behavior difference?

Hive tables support partitions that use different formats and serde, but Spark tables do not. That is a table behavior difference as well.

@rdblue (Contributor) commented Dec 1, 2020

I'll turn off the config once all tests pass, so that the new behavior is opt-in. I believe this is at least a useful feature for Spark vendors.

I don't think there is any objection to having a config so you can change it in your environment. It's just that there should be more discussion, at a minimum, about breaking changes like this. I also don't think that Spark's versioning policy allows it.

BTW just to make it clear: breaking changes are allowed in point releases. You can read the migration guide for each point release to see examples.

Can you point to where this is allowed by the versioning policy?

@cloud-fan (Contributor, Author) commented Dec 1, 2020

How is changing the file format by default a corner case?

I was saying that accessing the data files of Spark tables directly is a corner case. Users create a table because they want to access the data by table name, not by directory.

Hive tables support partitions that use different formats and serde, but Spark tables do not.

This is a good point. It's not possible to do this via Spark DDL commands, but people can create a table with Spark and then add partitions with a different serde using Hive. I'll put it in the "Does this PR introduce any user-facing change?" section.
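
For example, something along these lines on the Hive side; the table name and serde class are hypothetical:

```sql
-- Run in Hive, not Spark; the table name and serde class are hypothetical.
-- This works when t is a Hive table, but such a per-partition serde is not
-- supported for Spark native tables.
ALTER TABLE t ADD PARTITION (p = 1);
ALTER TABLE t PARTITION (p = 1)
  SET SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe';
```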

Can you point to where this is allowed by the versioning policy?

Quoted below:

The Spark project strives to avoid breaking APIs or silently changing behavior, even at major versions. While this is not always possible, the balance of the following factors should be considered before choosing to break an API.

We try to avoid breaking changes, but we don't completely forbid them. Breaking changes are allowed when they're justified; we definitely don't allow arbitrary breaking changes.

@dongjoon-hyun (Member)

@cloud-fan, if this is now a new-configuration PR that is disabled by default, could you adjust the PR title and description and remove docs/sql-migration-guide.md from this PR?

I'll turn off the config once all tests pass, so that the new behavior is opt-in.

@cloud-fan (Contributor, Author)

All tests pass, I'm changing it now.

@cloud-fan changed the title from [SPARK-30098][SQL] Use default datasource as provider for CREATE TABLE command to [SPARK-30098][SQL] Use default datasource as provider for CREATE TABLE command if the config is set on Dec 1, 2020
@HyukjinKwon changed the title from [SPARK-30098][SQL] Use default datasource as provider for CREATE TABLE command if the config is set to [SPARK-30098][SQL] Add a configuration to use default datasource as provider for CREATE TABLE command on Dec 2, 2020
buildConf("spark.sql.legacy.createHiveTableByDefault")
  .internal()
  .doc("When set to true, CREATE TABLE syntax without a table provider will use hive " +
    s"instead of the value of ${DEFAULT_DATA_SOURCE_NAME.key} as the table provider.")
  .version("3.1.0")
  .booleanConf
  .createWithDefault(true) // true: the new behavior is opt-in, per the discussion above
A Contributor commented:

If this is a "legacy" key, mention how long folks can depend on it.

A Member replied:

There are already a lot of legacy configurations, and there's no plan for how long we'll keep them. That's what I know from following the community. Deciding the lifetime of legacy configurations would be a separate issue, but ideally they will be removed at the next major release bump, I guess. I remember I discussed this with Sean as well somewhere. cc @srowen FYI

A Contributor replied:

If we have a general policy that legacy configs keep their defaults until Spark 4, then I guess there is no need to document it here (I don't remember that conversation, but I was out for a few months last/this year).

@SparkQA commented Dec 2, 2020

Test build #132034 has finished for PR 30554 at commit bff924d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

retest this please

@SparkQA commented Dec 2, 2020

Test build #132043 has finished for PR 30554 at commit bff924d.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor, Author)

retest this please

@rdblue (Contributor) commented Dec 2, 2020

Can you point to where this is allowed by the versioning policy?

Quoted below:

The Spark project strives to avoid breaking APIs or silently changing behavior, even at major versions. While this is not always possible, the balance of the following factors should be considered before choosing to break an API.

We try to avoid breaking changes, but we don't completely forbid them. Breaking changes are allowed when they're justified; we definitely don't allow arbitrary breaking changes.

I think you're misinterpreting it. It means that Spark avoids breaking APIs and behavior even at major releases. It is more strict, not more lax.

It does not mean that the rules are flexible and Spark will make breaking behavior changes. This expectation is set in the first paragraph, where it states that Spark will follow semver, except for a small set of multi-module issues.

// Create a native data source table when either:
// 1. `LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT` is false, or
// 2. It's a CTAS and `conf.convertCTAS` is true.
val createHiveTableByDefault = conf.getConf(SQLConf.LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT)
if (!createHiveTableByDefault || (ctas && conf.convertCTAS)) {
A Contributor commented:

Should this mark convertCTAS as deprecated since it is superseded by the new config?

@cloud-fan (Contributor, Author) replied:

Yeah, I can do it in a follow-up.

@rdblue (Contributor) left a review:

Now that this doesn't change the behavior for EXTERNAL tables and does not enable this by default, I think this is ready once tests are passing.

@SparkQA commented Dec 2, 2020

Test build #132058 has finished for PR 30554 at commit bff924d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) left a review:

+1, LGTM. Thank you, @cloud-fan and all.

  • The 3 Jenkins UT failures are flaky ones; I saw them recently on the master branch.
  • The 1 GitHub Actions UT failure looks weird:
CliSuite.SPARK-29022 Commands using SerDe provided in ADD JAR sql

So, I checked it manually.

$ build/sbt "hive-thriftserver/testOnly *.CliSuite -- -z SPARK-29022" -Phive-thriftserver
[info] CliSuite:
[info] - SPARK-29022: Commands using SerDe provided in --hive.aux.jars.path (16 seconds, 860 milliseconds)
[info] - SPARK-29022 Commands using SerDe provided in ADD JAR sql (14 seconds, 268 milliseconds)
...
[info] All tests passed.
[info] Passed: Total 2, Failed 0, Errors 0, Passed 2

@cloud-fan (Contributor, Author)

retest this please

@SparkQA commented Dec 3, 2020

Test build #132123 has finished for PR 30554 at commit bff924d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 3, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36724/

@cloud-fan (Contributor, Author)

retest this please

@SparkQA commented Dec 3, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36724/

@cloud-fan (Contributor, Author)

GA passed. Merging to master. Thanks for the review!

@cloud-fan closed this in 0706e64 on Dec 3, 2020
@SparkQA commented Dec 3, 2020

Test build #132131 has finished for PR 30554 at commit bff924d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

dongjoon-hyun pushed a commit that referenced this pull request Dec 7, 2020
### What changes were proposed in this pull request?

This is a followup of #30554. Now that we have a new config for converting CREATE TABLE, we don't need the old config that only works for CTAS.

### Why are the changes needed?

It's confusing to have two configs when one can completely cover the other.

### Does this PR introduce _any_ user-facing change?

No, it's deprecating, not removing.

### How was this patch tested?

N/A

Closes #30651 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
dongjoon-hyun pushed the same followup commit, cherry picked from commit 6aff215, on Dec 7, 2020.
dongjoon-hyun added a commit that referenced this pull request Apr 30, 2024
… `false` by default

### What changes were proposed in this pull request?

This PR aims to switch `spark.sql.legacy.createHiveTableByDefault` to `false` by default in order to move away from this legacy behavior starting with `Apache Spark 4.0.0`, while the legacy functionality is preserved during the Apache Spark 4.x period by setting `spark.sql.legacy.createHiveTableByDefault=true`.

### Why are the changes needed?

Historically, this behavior change was merged during `Apache Spark 3.0.0` development in SPARK-30098 and was officially reverted during the `3.0.0 RC` period.

- 2019-12-06: #26736 (58be82a)
- 2019-12-06: https://lists.apache.org/thread/g90dz1og1zt4rr5h091rn1zqo50y759j
- 2020-05-16: #28517

At `Apache Spark 3.1.0`, we had another discussion and defined it as `Legacy` behavior via a new configuration, reusing the JIRA ID SPARK-30098.
- 2020-12-01: https://lists.apache.org/thread/8c8k1jk61pzlcosz3mxo4rkj5l23r204
- 2020-12-03: #30554

Last year, this was proposed again twice, and `Apache Spark 4.0.0` is a good time to make a decision for Apache Spark's future direction.
- SPARK-42603 on 2023-02-27 as an independent idea.
- SPARK-46122 on 2023-11-27 as a part of Apache Spark 4.0.0 idea

### Does this PR introduce _any_ user-facing change?

Yes, the migration document is updated.

### How was this patch tested?

Pass the CIs with the adjusted test cases.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46207 from dongjoon-hyun/SPARK-46122.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
JacobZheng0927 pushed the same commit to JacobZheng0927/spark on May 11, 2024.