[SPARK-19305][SQL] partitioned table should always put partition columns at the end of table schema #16655

Closed
wants to merge 2 commits into from


cloud-fan
Contributor

What changes were proposed in this pull request?

For data source tables, we always reorder the specified table schema, or the query in CTAS, to put partition columns at the end. For example, CREATE TABLE t(a int, b int, c int, d int) USING parquet PARTITIONED BY (d, b) creates a table with schema <a, c, d, b>.

Hive serde tables did not have this problem before, because their CREATE TABLE syntax specifies the data schema and the partition schema separately.

However, after we unified the CREATE TABLE syntax, Hive serde tables also need the reorder. This PR puts the reorder logic in an analyzer rule, which works with both data source tables and Hive serde tables.
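
The reordering described above can be illustrated with a small standalone sketch. This is not the analyzer rule from this PR: `ReorderSketch` and `reorderColumns` are hypothetical names, columns are modeled as plain strings rather than Spark's schema types, and names are matched exactly (the real rule would also have to respect the case-sensitivity setting).

```scala
object ReorderSketch {
  // Hypothetical helper (not Spark's API): moves partition columns to the end
  // of the schema, in the order given by the PARTITIONED BY clause.
  def reorderColumns(schema: Seq[String], partitionCols: Seq[String]): Seq[String] = {
    // Every partition column must exist in the declared schema.
    val missing = partitionCols.filterNot(c => schema.contains(c))
    val missingNames = missing.mkString(", ")
    require(missing.isEmpty, "Partition columns not found in schema: " + missingNames)

    // Keep data columns in their declared order, then append partition columns.
    val isPartition = partitionCols.toSet
    schema.filterNot(isPartition) ++ partitionCols
  }
}
```

Under these assumptions, `ReorderSketch.reorderColumns(Seq("a", "b", "c", "d"), Seq("d", "b"))` returns `Seq("a", "c", "d", "b")`, matching the schema in the example above.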

How was this patch tested?

new regression test

@cloud-fan
Contributor Author

cc @yhuai @gatorsmile @windpiger

@SparkQA

SparkQA commented Jan 20, 2017

Test build #71705 has started for PR 16655 at commit 9ec7d36.

@cloud-fan
Contributor Author

retest this please

@SparkQA

SparkQA commented Jan 20, 2017

Test build #71714 has finished for PR 16655 at commit 9ec7d36.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


val columnNames = if (sparkSession.sessionState.conf.caseSensitiveAnalysis) {
schema.map(_.name)
c.copy(tableDesc = normalizedTable, query = Some(reorderedQuery))
Member

@gatorsmile gatorsmile Jan 21, 2017


How about adding one more check above this line here?

        assert(normalizedTable.schema.isEmpty,
          "Schema may not be specified in a Create Table As Select (CTAS) statement")

Contributor Author


this should be guaranteed by the parser, but we can check it again here.
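
The defensive check discussed in this thread can be sketched in isolation as follows. `TableDesc` and `checkCtas` are hypothetical stand-ins, not Spark's `CatalogTable` or the actual rule; the sketch only shows the shape of re-asserting an invariant that the parser is expected to guarantee.

```scala
object CtasCheckSketch {
  // Hypothetical stand-in for a table description; not Spark's CatalogTable.
  final case class TableDesc(name: String, schema: Seq[String])

  // Returns the table unchanged when no schema was specified for the CTAS;
  // otherwise Predef.assert throws a java.lang.AssertionError.
  def checkCtas(table: TableDesc): TableDesc = {
    assert(table.schema.isEmpty,
      "Schema may not be specified in a Create Table As Select (CTAS) statement")
    table
  }
}
```

Note that Scala assertions can be elided at compile time, so a check like this is a safety net rather than user-facing validation.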

@gatorsmile
Member

LGTM pending test

@windpiger
Contributor

LGTM. After this is merged, I will continue the work on #16593. Thanks~

@SparkQA

SparkQA commented Jan 21, 2017

Test build #71754 has finished for PR 16655 at commit 68f639e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

thanks for the review, merging to master!

@asfgit asfgit closed this in 3c2ba9f Jan 21, 2017
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…mns at the end of table schema

## What changes were proposed in this pull request?

For data source tables, we always reorder the specified table schema, or the query in CTAS, to put partition columns at the end. For example, `CREATE TABLE t(a int, b int, c int, d int) USING parquet PARTITIONED BY (d, b)` creates a table with schema `<a, c, d, b>`.

Hive serde tables did not have this problem before, because their CREATE TABLE syntax specifies the data schema and the partition schema separately.

However, after we unified the CREATE TABLE syntax, Hive serde tables also need the reorder. This PR puts the reorder logic in an analyzer rule, which works with both data source tables and Hive serde tables.

## How was this patch tested?

new regression test

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#16655 from cloud-fan/schema.
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017