
[SPARK-3343] [SQL] Add serde support for CTAS #2570

Closed
wants to merge 3 commits

Conversation

chenghao-intel
Contributor

Currently, CTAS (Create Table As Select) doesn't support specifying a SerDe in HQL. This PR passes the ASTNode down into the physical operator execution.CreateTableAsSelect, which extracts the CreateTableDesc object via Hive's SemanticAnalyzer. It also updates HiveMetastoreCatalog.createTable to optionally accept a CreateTableDesc for table creation.
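
For illustration, a minimal sketch of the kind of statement this change is meant to support, modeled on the test added in this PR. It assumes a HiveContext named hiveContext and Hive's standard src test table; the table name and SerDe class are only examples.

// assumes: val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql(
  """CREATE TABLE ctas_rcfile
    |ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
    |STORED AS RCFile AS
    |SELECT key, value
    |FROM src
    |ORDER BY key, value""".stripMargin).collect()

With this change the explicit SerDe / storage clauses are carried through to the created table instead of being unsupported.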

@SparkQA

SparkQA commented Sep 29, 2014

QA tests have started for PR 2570 at commit 439ce77.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 29, 2014

QA tests have finished for PR 2570 at commit 439ce77.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • logDebug("Found class for $serdeName")

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20956/

@SparkQA

SparkQA commented Sep 30, 2014

QA tests have started for PR 2570 at commit 4ea462c.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 30, 2014

QA tests have finished for PR 2570 at commit 4ea462c.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • logDebug("Found class for $serdeName")

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21016/

| STORED AS RCFile AS
| SELECT key, value
| FROM src
| ORDER BY key, value""".stripMargin).collect
Contributor

I am not sure we should just check the contents of the created table to test whether the table properties are set correctly.

Contributor Author

That's a good question; I will add a describe command to verify that the properties are correctly set.
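
As a rough, hypothetical sketch of that kind of check (table and SerDe names follow the example earlier, and it assumes DESCRIBE FORMATTED is passed through to Hive): inspect the formatted description of the CTAS result and assert on the reported storage information rather than only comparing row contents.

val description = hiveContext.sql("DESCRIBE FORMATTED ctas_rcfile").collect()
// The formatted description includes the SerDe library and the input/output formats.
assert(description.exists(_.toString.contains("ColumnarSerDe")))
assert(description.exists(_.toString.contains("RCFile")))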

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21099/

@chenghao-intel
Contributor Author

retest this please.

@SparkQA

SparkQA commented Oct 1, 2014

QA tests have started for PR 2570 at commit fcbbc61.

  • This patch merges cleanly.

insertIntoRelation(metastoreRelation).execute
// TODO: ideally, we should get the output data ready first and then
// update the relation, in case a failure occurs during data processing;
// otherwise we may not be able to keep the table metadata consistent.
Contributor

If we populate the metastore after evaluating the query, we also need to make sure the information stored in CreateTableDesc is correctly set on the tableInfo in the FileSinkDesc. Also, if org.apache.hadoop.hive.ql.plan.PlanUtils.getTableDesc(CreateTableDesc, String, String) is used to create the tableInfo, the Hive 0.12 implementation of that method cannot be used because of the bug described in https://issues.apache.org/jira/browse/HIVE-6083. Can you add a note here?

Contributor Author

Oh, thanks, I didn't know this, will do that.
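
For reference, one possible wording for the note being discussed (a suggestion only, not the comment that was merged):

// NOTE: if the metastore is populated after the query has been evaluated, make
// sure the information in CreateTableDesc is also reflected in the tableInfo of
// the FileSinkDesc. Hive 0.12's PlanUtils.getTableDesc(CreateTableDesc, String,
// String) cannot be used to build that tableInfo because of HIVE-6083.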

@SparkQA

SparkQA commented Oct 1, 2014

QA tests have finished for PR 2570 at commit fcbbc61.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • logDebug("Found class for $serdeName")

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21112/

tblName: String,
crtTbl: CreateTableDesc,
schema: Seq[Attribute]) {
// Most of the code is similar to DDLTask.createTable(),
Contributor

It would be good to mention that DDLTask is in Hive's codebase.

Contributor

Can we consolidate the two createTable methods? With your change, it seems the original createTable will only be used by createTable in HiveContext.

Also, can you add comments to HiveContext.createTable explaining which table properties are used when users call this method?
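
A hypothetical shape for that consolidation, as a method on HiveMetastoreCatalog, just to make the suggestion concrete; the parameter list, the Option wrapper, and the defaults are assumptions rather than the API that was eventually merged.

import org.apache.hadoop.hive.ql.plan.CreateTableDesc
import org.apache.spark.sql.catalyst.expressions.Attribute

def createTable(
    databaseName: String,
    tableName: String,
    schema: Seq[Attribute],
    allowExisting: Boolean = false,
    desc: Option[CreateTableDesc] = None): Unit = {
  // When `desc` is defined, honor its SerDe / storage settings; otherwise fall
  // back to the default table properties that HiveContext.createTable documents.
  ???
}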

@chenghao-intel
Contributor Author

Thank you @yhuai, I will update the code accordingly.

@SparkQA

SparkQA commented Oct 8, 2014

QA tests have started for PR 2570 at commit d49596b.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 8, 2014

QA tests have finished for PR 2570 at commit d49596b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • logDebug("Found class for $serdeName")

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21458/

@chenghao-intel
Contributor Author

I've updated the code, @yhuai @marmbrus, any more comments?

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21502/

@SparkQA

SparkQA commented Oct 9, 2014

QA tests have started for PR 2570 at commit ff2e140.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 9, 2014

QA tests have finished for PR 2570 at commit ff2e140.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21509/

override def output = child.output
child: LogicalPlan,
allowExisting: Boolean,
extra: AnyRef = null) extends UnaryNode {
Contributor

Instead of making this an AnyRef, maybe we can just move it into the hive package? Either that, or create a specialized Hive version of this operator if we use it elsewhere.

Contributor Author

What about making extra a generic type parameter? CTAS is probably widely supported across SQL dialects, and creating a specialized version might lead to duplicated code.
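
To make that suggestion concrete, one possible shape is sketched below; the fields beyond those quoted above (databaseName, tableName, desc) are assumptions, not the definition that was merged.

import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, UnaryNode}

case class CreateTableAsSelect[T](
    databaseName: Option[String],
    tableName: String,
    child: LogicalPlan,
    allowExisting: Boolean,
    desc: Option[T] = None) extends UnaryNode {
  // The CTAS node exposes the same attributes as the query it wraps.
  override def output = child.output
}

Each dialect can then instantiate the node with its own descriptor type (for Hive, a CreateTableDesc) instead of an untyped AnyRef.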

@SparkQA

SparkQA commented Oct 27, 2014

Test build #22283 has started for PR 2570 at commit 53d0c7a.

  • This patch merges cleanly.

@chenghao-intel
Contributor Author

@marmbrus, I've rebased the code against the latest master (which supports Hive 0.13.1 but is no longer compatible with Hive 0.12). Please let me know if you have any concerns about this.

@SparkQA

SparkQA commented Oct 27, 2014

Test build #22283 has finished for PR 2570 at commit 53d0c7a.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CreateTableAsSelect[T](
    • logDebug("Found class for $serdeName")

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22283/

@chenghao-intel
Contributor Author

The build failed because this PR is only compatible with Hive 0.13.1 (not 0.12 anymore).

@yhuai
Contributor

yhuai commented Oct 27, 2014

Can you explain the reason that Hive 0.12 is not supported?

@SparkQA

SparkQA commented Oct 28, 2014

Test build #22317 has started for PR 2570 at commit 2ab88c3.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 28, 2014

Test build #22317 has finished for PR 2570 at commit 2ab88c3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CreateTableAsSelect[T](
    • logDebug("Found class for $serdeName")

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22317/

@chenghao-intel
Contributor Author

@yhuai Some of the method signatures changed after upgrading to Hive 0.13; this is actually my concern about how to write the shim code.
For this case:
https://github.com/apache/hive/blob/branch-0.13/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L4130
https://github.com/apache/hive/blob/branch-0.12/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3597

@yhuai
Contributor

yhuai commented Oct 28, 2014

Is it possible to add a method to the shim like the following?

// Hive 0.13
def setTableDataLocation(table: Table, createTableDesc: CreateTableDesc) {
  table.setDataLocation(new Path(createTableDesc.getLocation()));
}
// Hive 0.12
def setTableDataLocation(table: Table, createTableDesc: CreateTableDesc) {
  table.setDataLocation(new Path(createTableDesc.getLocation()).toUri());
}

@chenghao-intel
Contributor Author

Supporting conditional compilation probably needs some workaround here; I think that's a general problem for the Hive upgrade. We need another PR to solve it before merging PRs like this one.

@marmbrus
Contributor

@chenghao-intel there is already support for conditional compilation based on the Hive version. This code can go in sql/hive/v0.X.0/src/main/scala/org/apache/spark/sql/hive/ShimX.scala
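
For example, @yhuai's method could live in the version-specific shim files roughly as sketched below. The object name HiveShim and the exact file names follow the template above but are assumptions here, not verified against the tree.

// sql/hive/v0.12.0/src/main/scala/org/apache/spark/sql/hive/Shim12.scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hive.ql.metadata.Table
import org.apache.hadoop.hive.ql.plan.CreateTableDesc

object HiveShim {
  // Hive 0.12's Table.setDataLocation expects a URI.
  def setTableDataLocation(table: Table, crtTbl: CreateTableDesc): Unit =
    table.setDataLocation(new Path(crtTbl.getLocation).toUri)
}

// sql/hive/v0.13.1/src/main/scala/org/apache/spark/sql/hive/Shim13.scala
// object HiveShim {
//   // Hive 0.13's Table.setDataLocation takes a Path directly.
//   def setTableDataLocation(table: Table, crtTbl: CreateTableDesc): Unit =
//     table.setDataLocation(new Path(crtTbl.getLocation))
// }

Callers would then go through HiveShim.setTableDataLocation, and the build compiles whichever shim file matches the selected Hive profile.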

@chenghao-intel
Contributor Author

Thanks @marmbrus I will update the code.

@SparkQA

SparkQA commented Oct 28, 2014

Test build #22337 has started for PR 2570 at commit e011ef5.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 28, 2014

Test build #22337 has finished for PR 2570 at commit e011ef5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CreateTableAsSelect[T](
    • logDebug("Found class for $serdeName")

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22337/

@marmbrus
Contributor

Thanks for working on this! I'm going to merge it in, but we should consider moving the semantic analyzer part to the analysis phase. These execution APIs are all developer/experimental, so we can change them whenever we need to.

asfgit closed this in 4b55482 on Oct 28, 2014
chenghao-intel deleted the ctas_serde branch on November 17, 2014
asfgit pushed a commit that referenced this pull request Dec 9, 2014
This is the code refactor and follow ups for #2570

Author: Cheng Hao <hao.cheng@intel.com>

Closes #3336 from chenghao-intel/createtbl and squashes the following commits:

3563142 [Cheng Hao] remove the unused variable
e215187 [Cheng Hao] eliminate the compiling warning
4f97f14 [Cheng Hao] fix bug in unittest
5d58812 [Cheng Hao] revert the API changes
b85b620 [Cheng Hao] fix the regression of temp tabl not found in CTAS

(cherry picked from commit 51b1fe1)
Signed-off-by: Michael Armbrust <michael@databricks.com>