
[SPARK-30098][SQL] Add a configuration to use default datasource as provider for CREATE TABLE command #30554

Closed
wants to merge 4 commits into master from cloud-fan:create-table

Conversation

@cloud-fan (Contributor) commented Nov 30, 2020

What changes were proposed in this pull request?

For the CREATE TABLE [AS SELECT] command, create a native Parquet table if neither USING nor STORED AS is specified and spark.sql.legacy.createHiveTableByDefault is false.

This is a retry after we unified the CREATE TABLE syntax. It partially reverts d2bec5e.

This PR allows CREATE EXTERNAL TABLE when LOCATION is present. This was not allowed for data source tables before, which was an unnecessary behavior difference from Hive tables.
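
For illustration, here is a hedged sketch of the resulting behavior; the table names and paths are hypothetical:

```sql
-- Assuming spark.sql.legacy.createHiveTableByDefault=false and
-- spark.sql.sources.default=parquet; table names and paths are hypothetical.
CREATE TABLE t1 (id INT, data STRING);                -- native Parquet table (no USING / STORED AS)
CREATE TABLE t2 (id INT) USING orc;                   -- explicit provider: unchanged
CREATE TABLE t3 (id INT) STORED AS TEXTFILE;          -- explicit Hive format: still a Hive table
CREATE EXTERNAL TABLE t4 (id INT) LOCATION '/tmp/t4'; -- now also allowed for data source tables
```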

Why are the changes needed?

Changing from Hive text tables to native Parquet tables has several benefits:

  1. It is consistent with DataFrameWriter.saveAsTable.
  2. Better performance.
  3. Better support for nested types: Hive text tables don't work well with nested types. For example, INSERT INTO t VALUES (struct(null)) actually inserts a NULL value, not struct(null), if t is a Hive text table, which leads to wrong results (see the sketch after this list).
  4. Better interoperability, as Parquet is a more popular open file format.
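
A hedged sketch of the nested-type issue from item 3 (table names are hypothetical):

```sql
-- Hive text table: the nested value degrades to a top-level NULL.
CREATE TABLE t (col STRUCT<a: INT>) STORED AS TEXTFILE;
INSERT INTO t VALUES (struct(null));
SELECT col FROM t;   -- NULL, not {"a":null}

-- Native Parquet table: the struct is preserved.
CREATE TABLE t2 (col STRUCT<a: INT>) USING parquet;
INSERT INTO t2 VALUES (struct(null));
SELECT col FROM t2;  -- {"a":null}
```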

Does this PR introduce any user-facing change?

No by default. If the config is set, the behavior change is described below:

Behavior-wise, the change is very small, as the native Parquet table is also Hive-compatible. All the Spark DDL commands that work for Hive tables also work for native Parquet tables, with two exceptions: ALTER TABLE SET [SERDE | SERDEPROPERTIES] and LOAD DATA.
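
For reference, the two exceptions look roughly like this; the serde class and path below are illustrative:

```sql
-- Hive-specific commands that keep working on Hive tables but are rejected for
-- native Parquet tables; the serde class and path below are illustrative.
ALTER TABLE t SET SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe';
LOAD DATA INPATH '/tmp/input' INTO TABLE t;
```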

char/varchar behavior has been taken care of by #30412, and there is no behavior difference between data source and Hive tables.

One potential issue is CREATE TABLE ... LOCATION ... when users want to directly access the files later. It's more of a corner case, and the legacy config should be good enough.
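
A minimal sketch of that escape hatch; the path is hypothetical:

```sql
-- Restore the old default for a session; the path is hypothetical.
SET spark.sql.legacy.createHiveTableByDefault=true;
CREATE TABLE t (id INT) LOCATION '/data/warehouse/t'; -- a Hive text table again
```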

Another potential issue is that users may use Spark to create the table and then use Hive to add partitions with a different serde. This is not allowed for Spark native tables.

How was this patch tested?

Re-enabled the tests.

@dongjoon-hyun (Member)

cc @rdblue

@dongjoon-hyun (Member)

Do you think we need a vote for this, @cloud-fan and @gatorsmile?

@SparkQA commented Nov 30, 2020

Test build #132003 has finished for PR 30554 at commit a2c4680.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -54,6 +54,8 @@ license: |

- In Spark 3.1, creating or altering a view will capture runtime SQL configs and store them as view properties. These configs will be applied during the parsing and analysis phases of the view resolution. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.useCurrentConfigsForView` to `true`.

- In Spark 3.1, `CREATE TABLE` without a specific table provider uses the value of `spark.sql.sources.default` as its table provider. In Spark version 3.0 and below, it was Hive. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.createHiveTableByDefault.enabled` to `true`.
A Contributor commented on this diff:

I don't think that the default behavior of CREATE TABLE should change in a point release. Why is this considered a "safe" change to make?

This could easily break existing workflows and should be done in a major release.

@rdblue (Contributor) commented Nov 30, 2020

Do you think we need a vote for this, @cloud-fan and @gatorsmile?

I think a vote should be required before committing this if the intent is to change the default behavior of CREATE TABLE. I'm not sure that a vote would even be appropriate given that breaking behavior changes are generally not included in point releases.

@dongjoon-hyun (Member)

+1 for @rdblue's opinion.

@HyukjinKwon (Member)

I don't think this kind of change is banned, as long as we have a way to restore the previous behaviour. More importantly, in Apache Spark, CREATE TABLE should create a Spark table, not a Hive table.

All items in the migration guide are behaviour changes, since we don't usually list bug fixes in the migration guide. I can provide a bunch of similar examples if you doubt this.

So I believe the point we should discuss/focus on here is that this is a significant behaviour change, since CREATE TABLE is the very entry point for users. Personally, a vote sounds fine to me given that two committers think it might be needed.

@HeartSaVioR (Contributor) commented Dec 1, 2020

I'd agree to changing this even in a minor release if we had discussed it in public for long enough, and had put some effort into figuring out the impact on end users and guiding them in advance (e.g. via a roadmap).

I don't think we did any of that. The last discussion we had, before Spark 3.0.0, was mostly concern about making the change without proper discussion, and we reverted it. Unifying the CREATE TABLE syntax fixes the long-standing issue of two confusing CREATE TABLE syntaxes, but that's it; it's not a rationale for changing the default provider.

Changing the default provider for CREATE TABLE is a totally different story. My experience with the Spark community is that we're generally reluctant to make backward-incompatible changes (even in major releases), and sometimes we keep the old behavior as the default even when the new functionality is available. I'm surprised the PR description doesn't mention anything about impacts.

I agree this requires enough public discussion before going further. In that discussion we should make clear the benefits of changing this, AND all the possible impacts of changing it.

@HyukjinKwon (Member) commented Dec 1, 2020

I agree with the point about describing things and explaining the rationale. I would have expected a detailed PR description as well.

Enough discussion should of course happen for a significant change. I don't believe this PR was intended to be pushed through without it. Also, I don't believe such discussions have to happen in one specific place like the mailing list. They can happen in JIRA, the PR, etc.

One alternative is to just turn this switch off by default, issue warnings, and turn it on in a later release.

Just to make it extra clear: what this PR changes looks correct and straightforward to me. Spark should create a Spark table when users run CREATE TABLE.

@cloud-fan force-pushed the create-table branch 2 times, most recently from 043a11b to 66c6495 on December 1, 2020 09:10
@cloud-fan (Contributor, Author) commented Dec 1, 2020

I've updated the PR description to include more details. I don't think this is a big breaking change that needs a formal vote (we don't vote for every breaking change). It's mostly about Hive compatibility, not behavior changes. A normal PR review process should work here. But if someone has a different opinion, I'm OK with doing a vote as well.

@rdblue (Contributor) commented Dec 1, 2020

But if someone has a different opinion, I'm OK with doing a vote as well.

From reading this thread, I think it is safe to say that @HeartSaVioR, @dongjoon-hyun, and I have all expressed an opinion that this requires a wider discussion on the dev list, followed by a vote if there isn't already clear consensus.

I think that discussion also needs to address how this will affect users with jobs running in Spark 3.0, and why you consider it safe enough for a point release.

@cloud-fan (Contributor, Author)

I'll turn off the config once all tests pass, so that the new behavior is opt-in. I believe this is at least a useful feature for Spark vendors. In the meantime, I'll send an email to discuss it.

BTW just to make it clear: breaking changes are allowed in point releases. You can read the migration guide for each point release to see examples.

@rdblue (Contributor) commented Dec 1, 2020

One potential issue is CREATE TABLE ... LOCATION ... when users want to directly access the files later. It's more of a corner case, and the legacy config should be good enough.

Can you explain this reasoning more clearly? How is changing the file format by default a corner case?

there is no behavior difference between data source and Hive tables.

I don't think this is accurate. What about the file format change above? Is that not a behavior difference?

Hive tables support partitions that use different formats and serde, but Spark tables do not. That is a table behavior difference as well.

@rdblue (Contributor) commented Dec 1, 2020

I'll turn off the config once all tests pass, so that the new behavior is opt-in. I believe this is at least a useful feature for Spark vendors.

I don't think there is any objection to having a config so you can change it in your environment. It's just that there should be more discussion, at a minimum, about breaking changes like this. I also don't think that Spark's versioning policy allows it.

BTW just to make it clear: breaking changes are allowed in point releases. You can read the migration guide for each point release to see examples.

Can you point to where this is allowed by the versioning policy?

@cloud-fan (Contributor, Author) commented Dec 1, 2020

How is changing the file format by default a corner case?

I was saying that accessing the data files of Spark tables directly is a corner case. Users create a table because they want to access the data by table name, not by directory.

Hive tables support partitions that use different formats and serde, but Spark tables do not.

This is a good point. It's not possible to do this via Spark DDL commands, but people can create a table with Spark and then add partitions with a different serde using Hive. I'll put it in the "Does this PR introduce any user-facing change?" section.
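
For example, something along these lines on the Hive side; the table name and serde class are hypothetical:

```sql
-- Run in Hive, not Spark; the table name and serde class are hypothetical.
-- This works when t is a Hive table, but such a per-partition serde is not
-- supported for Spark native tables.
ALTER TABLE t ADD PARTITION (p = 1);
ALTER TABLE t PARTITION (p = 1)
  SET SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe';
```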

Can you point to where this is allowed by the versioning policy?

Quoted below:

The Spark project strives to avoid breaking APIs or silently changing behavior, even at major versions. While this is not always possible, the balance of the following factors should be considered before choosing to break an API.

We try to avoid breaking changes, but we don't completely forbid them. Breaking changes are allowed when they're justified; we definitely don't allow arbitrary breaking changes.

@dongjoon-hyun (Member)

@cloud-fan, if this is now a new-configuration PR that is disabled by default, could you adjust the PR title and description and remove docs/sql-migration-guide.md from this PR?

I'll turn off the config once all tests pass, so that the new behavior is opt-in.

@cloud-fan (Contributor, Author)

All tests pass, I'm changing it now.

@cloud-fan changed the title from [SPARK-30098][SQL] Use default datasource as provider for CREATE TABLE command to [SPARK-30098][SQL] Use default datasource as provider for CREATE TABLE command if the config is set on Dec 1, 2020
@HyukjinKwon changed the title from [SPARK-30098][SQL] Use default datasource as provider for CREATE TABLE command if the config is set to [SPARK-30098][SQL] Add a configuration to use default datasource as provider for CREATE TABLE command on Dec 2, 2020
buildConf("spark.sql.legacy.createHiveTableByDefault")
  .internal()
  .doc("When set to true, CREATE TABLE syntax without a table provider will use hive " +
    s"instead of the value of ${DEFAULT_DATA_SOURCE_NAME.key} as the table provider.")
  .version("3.1.0")
  .booleanConf
  .createWithDefault(true) // true: the new behavior is opt-in, per the discussion above
A Contributor commented:

If this is a "legacy" key, mention how long folks can depend on it.

A Member replied:

There are already a lot of legacy configurations, and there's no plan for how long we'll keep them. That's what I know from following the community. Deciding the lifetime of legacy configurations would be a separate issue, but ideally they will be removed at the next major release bump, I guess. I remember I discussed this with Sean as well somewhere. cc @srowen FYI

A Contributor replied:

If we have a general policy that legacy configs keep their defaults until Spark 4, then I guess there is no need to document it here (I don't remember that conversation, but I was out for a few months last/this year).

@SparkQA commented Dec 2, 2020

Test build #132034 has finished for PR 30554 at commit bff924d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

retest this please

@SparkQA commented Dec 2, 2020

Test build #132043 has finished for PR 30554 at commit bff924d.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor, Author)

retest this please

@rdblue (Contributor) commented Dec 2, 2020

Can you point to where this is allowed by the versioning policy?

Quoted below:

The Spark project strives to avoid breaking APIs or silently changing behavior, even at major versions. While this is not always possible, the balance of the following factors should be considered before choosing to break an API.

We try to avoid breaking changes, but we don't completely forbid them. Breaking changes are allowed when they're justified; we definitely don't allow arbitrary breaking changes.

I think you're misinterpreting it. It means that Spark avoids breaking APIs and behavior even at major releases. It is more strict, not more lax.

It does not mean that the rules are flexible and Spark will make breaking behavior changes. This expectation is set in the first paragraph, where it states that Spark will follow semver, except for a small set of multi-module issues.

// Create a native data source table when either:
// 1. `LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT` is false, or
// 2. It's a CTAS and `conf.convertCTAS` is true.
val createHiveTableByDefault = conf.getConf(SQLConf.LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT)
if (!createHiveTableByDefault || (ctas && conf.convertCTAS)) {
A Contributor commented:

Should this mark convertCTAS as deprecated since it is superseded by the new config?

@cloud-fan (Contributor, Author) replied:

Yeah, I can do it in a follow-up.

@rdblue (Contributor) left a review:

Now that this doesn't change the behavior for EXTERNAL tables and does not enable this by default, I think this is ready once tests are passing.

@SparkQA commented Dec 2, 2020

Test build #132058 has finished for PR 30554 at commit bff924d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) left a review:

+1, LGTM. Thank you, @cloud-fan and all.

  • The 3 Jenkins UT failures are flaky ones; I saw them recently on the master branch.
  • The 1 GitHub Actions UT failure looks weird:
CliSuite.SPARK-29022 Commands using SerDe provided in ADD JAR sql

So, I checked it manually.

$ build/sbt "hive-thriftserver/testOnly *.CliSuite -- -z SPARK-29022" -Phive-thriftserver
[info] CliSuite:
[info] - SPARK-29022: Commands using SerDe provided in --hive.aux.jars.path (16 seconds, 860 milliseconds)
[info] - SPARK-29022 Commands using SerDe provided in ADD JAR sql (14 seconds, 268 milliseconds)
...
[info] All tests passed.
[info] Passed: Total 2, Failed 0, Errors 0, Passed 2

@cloud-fan (Contributor, Author)

retest this please

@SparkQA commented Dec 3, 2020

Test build #132123 has finished for PR 30554 at commit bff924d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 3, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36724/

@cloud-fan (Contributor, Author)

retest this please

@SparkQA commented Dec 3, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36724/

@cloud-fan (Contributor, Author)

GA passed. Merging to master. Thanks for the review!

@cloud-fan closed this in 0706e64 on Dec 3, 2020
@SparkQA commented Dec 3, 2020

Test build #132131 has finished for PR 30554 at commit bff924d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

dongjoon-hyun pushed a commit that referenced this pull request Dec 7, 2020
### What changes were proposed in this pull request?

This is a followup of #30554. Now that we have a new config for converting CREATE TABLE, we don't need the old config that only works for CTAS.

### Why are the changes needed?

It's confusing to have two configs when one can completely cover the other.

### Does this PR introduce _any_ user-facing change?

No, it's deprecating, not removing.

### How was this patch tested?

N/A

Closes #30651 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
dongjoon-hyun pushed the same followup commit, cherry picked from commit 6aff215, on Dec 7, 2020.
dongjoon-hyun added a commit that referenced this pull request Apr 30, 2024
… `false` by default

### What changes were proposed in this pull request?

This PR aims to switch `spark.sql.legacy.createHiveTableByDefault` to `false` by default in order to move away from this legacy behavior starting with `Apache Spark 4.0.0`, while the legacy functionality is preserved during the Apache Spark 4.x period by setting `spark.sql.legacy.createHiveTableByDefault=true`.

### Why are the changes needed?

Historically, this behavior change was merged during `Apache Spark 3.0.0` development in SPARK-30098 and was officially reverted during the `3.0.0 RC` period.

- 2019-12-06: #26736 (58be82a)
- 2019-12-06: https://lists.apache.org/thread/g90dz1og1zt4rr5h091rn1zqo50y759j
- 2020-05-16: #28517

At `Apache Spark 3.1.0`, we had another discussion and defined it as `Legacy` behavior via a new configuration, reusing the JIRA ID SPARK-30098.
- 2020-12-01: https://lists.apache.org/thread/8c8k1jk61pzlcosz3mxo4rkj5l23r204
- 2020-12-03: #30554

Last year, this was proposed again twice, and `Apache Spark 4.0.0` is a good time to make a decision for Apache Spark's future direction.
- SPARK-42603 on 2023-02-27 as an independent idea.
- SPARK-46122 on 2023-11-27 as a part of Apache Spark 4.0.0 idea

### Does this PR introduce _any_ user-facing change?

Yes, the migration document is updated.

### How was this patch tested?

Pass the CIs with the adjusted test cases.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46207 from dongjoon-hyun/SPARK-46122.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
JacobZheng0927 pushed the same commit to JacobZheng0927/spark on May 11, 2024.