[SPARK-19305][SQL] partitioned table should always put partition columns at the end of table schema #16655

cloud-fan · 2017-01-20T06:10:55Z

What changes were proposed in this pull request?

For data source tables, we always reorder the specified table schema, or the query in CTAS, to put partition columns at the end. For example, `CREATE TABLE t(a int, b int, c int, d int) USING parquet PARTITIONED BY (d, b)` creates a table with schema `<a, c, d, b>`.

Hive serde tables did not have this problem before, because their CREATE TABLE syntax specified the data schema and the partition schema separately.

However, after we unified the CREATE TABLE syntax, Hive serde tables also need the reorder. This PR puts the reorder logic in an analyzer rule, which works with both data source tables and Hive serde tables.
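The reordering described above can be sketched in plain Scala. This is only an illustration of the idea, not the analyzer rule from this PR: `Column`, `PartitionColumnReorder`, and `reorder` are hypothetical names, the sketch has no Spark dependency, and it matches column names exactly (case-sensitively), which is an assumption.

```scala
// Hypothetical stand-in for a schema field; this is not Spark's StructField.
final case class Column(name: String, dataType: String)

object PartitionColumnReorder {
  // Moves partition columns to the end of the schema, in PARTITIONED BY order.
  // Assumption of this sketch: names are matched exactly (case-sensitive).
  def reorder(schema: Seq[Column], partitionColumns: Seq[String]): Seq[Column] = {
    val byName = schema.map(c => c.name -> c).toMap

    // Every partition column must exist in the declared schema.
    val missing = partitionColumns.filterNot(byName.contains)
    val missingList = missing.mkString(", ")
    require(missing.isEmpty, s"Partition columns not found in schema: $missingList")

    // Data columns keep their declared order; partition columns follow.
    val partitionSet = partitionColumns.toSet
    val dataColumns = schema.filterNot(c => partitionSet.contains(c.name))
    dataColumns ++ partitionColumns.map(byName)
  }
}
```

With the columns `a, b, c, d` and partition columns `(d, b)` from the example above, `PartitionColumnReorder.reorder` returns the columns in the order `a, c, d, b`.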

How was this patch tested?

new regression test


cloud-fan · 2017-01-20T06:11:36Z

cc @yhuai @gatorsmile @windpiger

SparkQA · 2017-01-20T06:12:42Z

Test build #71705 has started for PR 16655 at commit 9ec7d36.

cloud-fan · 2017-01-20T10:13:11Z

retest this please

SparkQA · 2017-01-20T12:38:37Z

Test build #71714 has finished for PR 16655 at commit 9ec7d36.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-01-21T03:12:17Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala


-      val columnNames = if (sparkSession.sessionState.conf.caseSensitiveAnalysis) {
-        schema.map(_.name)
+        c.copy(tableDesc = normalizedTable, query = Some(reorderedQuery))


How about adding one more check above this line here?

assert(normalizedTable.schema.isEmpty, "Schema may not be specified in a Create Table As Select (CTAS) statement")

this should be guaranteed by the parser, but we can check it again here.
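A minimal sketch of how such a defensive check could sit next to the CTAS reorder, using a simplified model. `TableDesc`, `CreateTablePlan`, and `CtasCheck` are hypothetical types written for illustration, not Spark's classes, and column names are compared exactly.

```scala
// Hypothetical, simplified model of a CREATE TABLE plan; not Spark's CreateTable.
final case class TableDesc(schema: Seq[String], partitionColumns: Seq[String])
final case class CreateTablePlan(tableDesc: TableDesc, queryOutput: Option[Seq[String]])

object CtasCheck {
  // For CTAS the columns come from the query, so the declared schema must be empty.
  // The parser is expected to guarantee this; the assert only re-checks it.
  def reorderCtas(plan: CreateTablePlan): CreateTablePlan = plan.queryOutput match {
    case Some(output) =>
      assert(plan.tableDesc.schema.isEmpty,
        "Schema may not be specified in a Create Table As Select (CTAS) statement")
      val parts = plan.tableDesc.partitionColumns
      // Reorder the query output so that partition columns come last.
      val reordered = output.filterNot(c => parts.contains(c)) ++ parts
      plan.copy(queryOutput = Some(reordered))
    case None =>
      // Not a CTAS: nothing to check or reorder in this sketch.
      plan
  }
}
```

Scala's `assert` throws `java.lang.AssertionError` when the condition is false, so a violation would surface as an internal error rather than a user-facing analysis error, which fits a condition the parser should already rule out.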

gatorsmile · 2017-01-21T03:15:23Z

LGTM pending test

windpiger · 2017-01-21T03:17:12Z

LGTM. After this is merged, I will continue the work on #16593. Thanks~

SparkQA · 2017-01-21T05:53:06Z

Test build #71754 has finished for PR 16655 at commit 68f639e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-01-21T05:58:22Z

thanks for the review, merging to master!

…mns at the end of table schema

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#16655 from cloud-fan/schema.

partitioned table should always put partition columns at the end of t…

9ec7d36

…able schema

cloud-fan mentioned this pull request Jan 20, 2017

[SPARK-19153][SQL]DataFrameWriter.saveAsTable work with create partitioned table #16593

Closed

gatorsmile reviewed Jan 21, 2017


address comments

68f639e

asfgit closed this in 3c2ba9f Jan 21, 2017
