
[SPARK-3343] [SQL] Add serde support for CTAS #2570

Closed
wants to merge 3 commits

Conversation

chenghao-intel
Contributor

Currently, CTAS (Create Table As Select) doesn't support specifying a SerDe in HQL. This PR passes the ASTNode down into the physical operator execution.CreateTableAsSelect, which extracts the CreateTableDesc object via Hive's SemanticAnalyzer. It also updates HiveMetastoreCatalog.createTable to optionally accept a CreateTableDesc for table creation.
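
For illustration, a minimal sketch of the kind of statement this change is meant to support, modeled on the test added in this PR. It assumes a HiveContext named hiveContext and Hive's standard src test table; the table name and SerDe class are only examples.

// assumes: val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql(
  """CREATE TABLE ctas_rcfile
    |ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
    |STORED AS RCFile AS
    |SELECT key, value
    |FROM src
    |ORDER BY key, value""".stripMargin).collect()

With this change the explicit SerDe / storage clauses are carried through to the created table instead of being unsupported.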

@SparkQA

SparkQA commented Sep 29, 2014

QA tests have started for PR 2570 at commit 439ce77.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 29, 2014

QA tests have finished for PR 2570 at commit 439ce77.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • logDebug("Found class for $serdeName")

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20956/

@SparkQA

SparkQA commented Sep 30, 2014

QA tests have started for PR 2570 at commit 4ea462c.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 30, 2014

QA tests have finished for PR 2570 at commit 4ea462c.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • logDebug("Found class for $serdeName")

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21016/

| STORED AS RCFile AS
| SELECT key, value
| FROM src
| ORDER BY key, value""".stripMargin).collect
Contributor

I am not sure we should just check the contents of the created table to test whether the table properties are set correctly.

Contributor Author

That's a good question; I will add a describe command to verify that the properties are correctly set.
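
As a rough, hypothetical sketch of that kind of check (table and SerDe names follow the example earlier, and it assumes DESCRIBE FORMATTED is passed through to Hive): inspect the formatted description of the CTAS result and assert on the reported storage information rather than only comparing row contents.

val description = hiveContext.sql("DESCRIBE FORMATTED ctas_rcfile").collect()
// The formatted description includes the SerDe library and the input/output formats.
assert(description.exists(_.toString.contains("ColumnarSerDe")))
assert(description.exists(_.toString.contains("RCFile")))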

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21099/

@chenghao-intel
Contributor Author

retest this please.

@SparkQA

SparkQA commented Oct 1, 2014

QA tests have started for PR 2570 at commit fcbbc61.

  • This patch merges cleanly.

insertIntoRelation(metastoreRelation).execute
// TODO: ideally, we should get the output data ready first and then
// update the relation, in case a failure occurs during data processing;
// otherwise we may not be able to keep the table metadata consistent.
Contributor

If we populate the metastore after evaluating the query, we also need to make sure the information stored in CreateTableDesc is correctly set on the tableInfo in the FileSinkDesc. Also, if org.apache.hadoop.hive.ql.plan.PlanUtils.getTableDesc(CreateTableDesc, String, String) is used to create the tableInfo, the Hive 0.12 implementation of that method cannot be used because of the bug described in https://issues.apache.org/jira/browse/HIVE-6083. Can you add a note here?

Contributor Author

Oh, thanks, I didn't know this, will do that.
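
For reference, one possible wording for the note being discussed (a suggestion only, not the comment that was merged):

// NOTE: if the metastore is populated after the query has been evaluated, make
// sure the information in CreateTableDesc is also reflected in the tableInfo of
// the FileSinkDesc. Hive 0.12's PlanUtils.getTableDesc(CreateTableDesc, String,
// String) cannot be used to build that tableInfo because of HIVE-6083.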

@SparkQA

SparkQA commented Oct 1, 2014

QA tests have finished for PR 2570 at commit fcbbc61.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • logDebug("Found class for $serdeName")

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21112/

tblName: String,
crtTbl: CreateTableDesc,
schema: Seq[Attribute]) {
// Most of the code is similar to DDLTask.createTable(),
Contributor

It would be good to mention that DDLTask is in Hive's codebase.

Contributor

Can we consolidate the two createTable methods? With your change, it seems the original createTable will only be used by createTable in HiveContext.

Also, can you add comments to HiveContext.createTable explaining which table properties are used when users call this method?
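
A hypothetical shape for that consolidation, as a method on HiveMetastoreCatalog, just to make the suggestion concrete; the parameter list, the Option wrapper, and the defaults are assumptions rather than the API that was eventually merged.

import org.apache.hadoop.hive.ql.plan.CreateTableDesc
import org.apache.spark.sql.catalyst.expressions.Attribute

def createTable(
    databaseName: String,
    tableName: String,
    schema: Seq[Attribute],
    allowExisting: Boolean = false,
    desc: Option[CreateTableDesc] = None): Unit = {
  // When `desc` is defined, honor its SerDe / storage settings; otherwise fall
  // back to the default table properties that HiveContext.createTable documents.
  ???
}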

@chenghao-intel
Contributor Author

Thank you @yhuai, I will update the code accordingly.

@SparkQA

SparkQA commented Oct 8, 2014

QA tests have started for PR 2570 at commit d49596b.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 8, 2014

QA tests have finished for PR 2570 at commit d49596b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • logDebug("Found class for $serdeName")

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21458/

@chenghao-intel
Contributor Author

I've updated the code, @yhuai @marmbrus, any more comments?

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21502/

@SparkQA

SparkQA commented Oct 9, 2014

QA tests have started for PR 2570 at commit ff2e140.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 9, 2014

QA tests have finished for PR 2570 at commit ff2e140.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21509/

override def output = child.output
child: LogicalPlan,
allowExisting: Boolean,
extra: AnyRef = null) extends UnaryNode {
Contributor

Instead of making this an AnyRef, maybe we can just move it into the hive package? Either that, or create a specialized Hive version of this operator if we use it elsewhere.

Contributor Author

What about making extra a generic type parameter? CTAS is probably widely supported across SQL dialects, and creating a specialized version might lead to duplicated code.
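
To make that suggestion concrete, one possible shape is sketched below; the fields beyond those quoted above (databaseName, tableName, desc) are assumptions, not the definition that was merged.

import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, UnaryNode}

case class CreateTableAsSelect[T](
    databaseName: Option[String],
    tableName: String,
    child: LogicalPlan,
    allowExisting: Boolean,
    desc: Option[T] = None) extends UnaryNode {
  // The CTAS node exposes the same attributes as the query it wraps.
  override def output = child.output
}

Each dialect can then instantiate the node with its own descriptor type (for Hive, a CreateTableDesc) instead of an untyped AnyRef.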

@SparkQA

SparkQA commented Oct 27, 2014

Test build #22283 has started for PR 2570 at commit 53d0c7a.

  • This patch merges cleanly.

@chenghao-intel
Contributor Author

@marmbrus, I've rebased the code against the latest master (which supports Hive 0.13.1 but is no longer compatible with Hive 0.12). Please let me know if you have any concerns about this.

@SparkQA

SparkQA commented Oct 27, 2014

Test build #22283 has finished for PR 2570 at commit 53d0c7a.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CreateTableAsSelect[T](
    • logDebug("Found class for $serdeName")

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22283/

@chenghao-intel
Contributor Author

The build failed because this PR is only compatible with Hive 0.13.1 (not 0.12 anymore).

@yhuai
Contributor

yhuai commented Oct 27, 2014

Can you explain the reason that Hive 0.12 is not supported?

@SparkQA

SparkQA commented Oct 28, 2014

Test build #22317 has started for PR 2570 at commit 2ab88c3.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 28, 2014

Test build #22317 has finished for PR 2570 at commit 2ab88c3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CreateTableAsSelect[T](
    • logDebug("Found class for $serdeName")

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22317/

@chenghao-intel
Contributor Author

@yhuai Some of the method signatures changed after upgrading to Hive 0.13; this is actually my concern about how to write the shim code.
For this case:
https://github.com/apache/hive/blob/branch-0.13/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L4130
https://github.com/apache/hive/blob/branch-0.12/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3597

@yhuai
Contributor

yhuai commented Oct 28, 2014

Is it possible to add a method to the shim like the following?

// Hive 0.13
def setTableDataLocation(table: Table, createTableDesc: CreateTableDesc) {
  table.setDataLocation(new Path(createTableDesc.getLocation()));
}
// Hive 0.12
def setTableDataLocation(table: Table, createTableDesc: CreateTableDesc) {
  table.setDataLocation(new Path(createTableDesc.getLocation()).toUri());
}

@chenghao-intel
Contributor Author

Supporting conditional compilation probably needs some workaround here; I think that's a general problem for the Hive upgrade. We need another PR to solve it before merging PRs like this one.

@marmbrus
Contributor

@chenghao-intel there is already support for conditional compilation based on the Hive version. This code can go in sql/hive/v0.X.0/src/main/scala/org/apache/spark/sql/hive/ShimX.scala
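
For example, @yhuai's method could live in the version-specific shim files roughly as sketched below. The object name HiveShim and the exact file names follow the template above but are assumptions here, not verified against the tree.

// sql/hive/v0.12.0/src/main/scala/org/apache/spark/sql/hive/Shim12.scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hive.ql.metadata.Table
import org.apache.hadoop.hive.ql.plan.CreateTableDesc

object HiveShim {
  // Hive 0.12's Table.setDataLocation expects a URI.
  def setTableDataLocation(table: Table, crtTbl: CreateTableDesc): Unit =
    table.setDataLocation(new Path(crtTbl.getLocation).toUri)
}

// sql/hive/v0.13.1/src/main/scala/org/apache/spark/sql/hive/Shim13.scala
// object HiveShim {
//   // Hive 0.13's Table.setDataLocation takes a Path directly.
//   def setTableDataLocation(table: Table, crtTbl: CreateTableDesc): Unit =
//     table.setDataLocation(new Path(crtTbl.getLocation))
// }

Callers would then go through HiveShim.setTableDataLocation, and the build compiles whichever shim file matches the selected Hive profile.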

@chenghao-intel
Contributor Author

Thanks @marmbrus I will update the code.

@SparkQA

SparkQA commented Oct 28, 2014

Test build #22337 has started for PR 2570 at commit e011ef5.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 28, 2014

Test build #22337 has finished for PR 2570 at commit e011ef5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CreateTableAsSelect[T](
    • logDebug("Found class for $serdeName")

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22337/

@marmbrus
Contributor

Thanks for working on this! I'm going to merge it in, but we should consider moving the semantic analyzer part to the analysis phase. These execution APIs are all developer/experimental, so we can change them whenever we need to.

asfgit closed this in 4b55482 on Oct 28, 2014
chenghao-intel deleted the ctas_serde branch on November 17, 2014
asfgit pushed a commit that referenced this pull request Dec 9, 2014
This is the code refactor and follow ups for #2570

Author: Cheng Hao <hao.cheng@intel.com>

Closes #3336 from chenghao-intel/createtbl and squashes the following commits:

3563142 [Cheng Hao] remove the unused variable
e215187 [Cheng Hao] eliminate the compiling warning
4f97f14 [Cheng Hao] fix bug in unittest
5d58812 [Cheng Hao] revert the API changes
b85b620 [Cheng Hao] fix the regression of temp tabl not found in CTAS

(cherry picked from commit 51b1fe1)
Signed-off-by: Michael Armbrust <michael@databricks.com>