
[SPARK-19115] [SQL] Supporting Create External Table Like Location #16638

Closed
wants to merge 7 commits into from

Conversation

@ouyangxiaochen commented Jan 19, 2017

What changes were proposed in this pull request?

Support the CREATE [EXTERNAL] TABLE ... LIKE ... LOCATION ... syntax for Hive tables.
In this PR, we follow Spark SQL's design rules:

  1. Support creating an external table like a view, a physical table, or a temporary view, with a location.
  2. Creating an external table without a location throws an OperationNotAllowed exception.
  3. Creating a managed table with a location makes the table external rather than managed.

How was this patch tested?

Add new test cases and update existing test cases

@gatorsmile (Member)

Could you follow the title requirement in http://spark.apache.org/contributing.html?

@@ -58,6 +58,7 @@ import org.apache.spark.util.Utils
case class CreateTableLikeCommand(
Member

Please update the comment of this class.

Author

OK, I will update it later. Thanks!

@gatorsmile (Member)

gatorsmile commented Jan 19, 2017

We have a few test cases you can follow. Please create test cases. Thanks!

@gatorsmile (Member)

In the PR, you might need to consider more scenarios. For example, let me ask a question. How does Hive behave when the specified location is not empty?

@ouyangxiaochen (Author)

Here are the differences between Hive and Spark 2.x:

1. Hive
create table test(id int); --> MANAGED_TABLE
create table test(id int) location '/warehouse/test'; --> MANAGED_TABLE
create external table test(id int) location '/warehouse/test'; --> EXTERNAL_TABLE
create external table test(id int); --> EXTERNAL_TABLE

2. Spark 2.x:
create table test(id int); --> MANAGED_TABLE
create table test(id int) location '/warehouse/test'; --> EXTERNAL_TABLE
create external table test(id int) location '/warehouse/test'; --> EXTERNAL_TABLE
create external table test(id int); --> operationNotAllowed("CREATE EXTERNAL TABLE must be accompanied by LOCATION", ctx)

So, this PR follows Spark 2.x's design rules. Thanks!
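The comparison above can be encoded as a toy sketch. This is illustrative only, not actual Hive or Spark code; the names `hiveTableType` and `sparkTableType` are hypothetical:

```scala
// Toy model of the table-type rules compared above; not real Hive/Spark code.
sealed trait TableType
case object Managed extends TableType
case object External extends TableType

// Hive: the EXTERNAL keyword alone decides the table type.
def hiveTableType(externalKeyword: Boolean): TableType =
  if (externalKeyword) External else Managed

// Spark 2.x: a LOCATION also forces an external table, and EXTERNAL
// without a LOCATION is rejected at parse time.
def sparkTableType(
    externalKeyword: Boolean,
    hasLocation: Boolean): Either[String, TableType] =
  (externalKeyword, hasLocation) match {
    case (true, false)  => Left("CREATE EXTERNAL TABLE must be accompanied by LOCATION")
    case (false, false) => Right(Managed)
    case _              => Right(External)
  }
```

The key divergence is the second Spark rule: a plain CREATE TABLE with a LOCATION becomes external in Spark 2.x but stays managed in Hive.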

@gatorsmile (Member)

gatorsmile commented Jan 23, 2017

First, please change the PR title to [SPARK-19115][SQL]Supporting Create External Table Like Location

@gatorsmile (Member)

Let me rephrase it: if the directory specified in the LOCATION spec contains other files, how does Hive behave?

@gatorsmile (Member)

Please keep updating your PR description. For example, this PR is not relying on manual tests. In addition, you also need to summarize what this PR did. List more details to help reviewers understand your changes and impacts. Thanks!

@ouyangxiaochen ouyangxiaochen changed the title spark-19115 [SPARK-19115] [SQL] Supporting Create External Table Like Location Jan 23, 2017
@ouyangxiaochen (Author)

I am sorry that I didn't grasp the key point of your question. In Hive, if there are data files under the specified path when an external table is created, Hive identifies those files as the table's data files.
In many Spark applications, external table data is generated under the external table path by other applications. So Hive does nothing with the directory specified in the LOCATION.
Thank you for your patience and guidance. @gatorsmile

* }}}
*/
override def visitCreateTableLike(ctx: CreateTableLikeContext): LogicalPlan = withOrigin(ctx) {
val targetTable = visitTableIdentifier(ctx.target)
val sourceTable = visitTableIdentifier(ctx.source)
CreateTableLikeCommand(targetTable, sourceTable, ctx.EXISTS != null)
val location = Option(ctx.locationSpec).map(visitLocationSpec)
if (ctx.EXTERNAL != null && location.isEmpty) {
Member

Add a comment above this line:

    // If we are creating an EXTERNAL table, then the LOCATION field is required

Author

OK, I'll do it later, Thanks!

val location = Option(ctx.locationSpec).map(visitLocationSpec)
if (ctx.EXTERNAL != null && location.isEmpty) {
operationNotAllowed("CREATE EXTERNAL TABLE LIKE must be accompanied by LOCATION", ctx)
}
Member

To the other reviewers, we are following what we did in visitCreateHiveTable

@gatorsmile (Member)

@ouyangxiaochen Please do not duplicate the test cases. Try to combine them.

@cloud-fan @yhuai Could you please check whether such a DDL support is desirable?

@gatorsmile (Member)

ok to test.

@SparkQA

SparkQA commented Jan 24, 2017

Test build #71887 has finished for PR 16638 at commit 713ca97.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

…ala file.

2. Fix the errors in the test cases in HiveDDLSuite.scala; the SQL statements
   were missing a pair of single quotes.
@ouyangxiaochen (Author)

I have fixed the test-case errors and they now run successfully, so please run the tests again. Thanks a lot! @SparkQA

@@ -81,8 +81,8 @@ statement
rowFormat? createFileFormat? locationSpec?
(TBLPROPERTIES tablePropertyList)?
(AS? query)? #createHiveTable
| CREATE TABLE (IF NOT EXISTS)? target=tableIdentifier
LIKE source=tableIdentifier #createTableLike
| CREATE EXTERNAL? TABLE (IF NOT EXISTS)? target=tableIdentifier
Contributor

Since Spark 2.2, we want to hide the managed/external concept from users. It looks reasonable to add a LOCATION clause to CREATE TABLE LIKE, but do we really need the EXTERNAL keyword? We don't need to be exactly the same as Hive.

Member

I am fine

Contributor

OK, then let's simplify the logic: if a location is specified, we create an external table internally; otherwise, we create a managed table.
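The simplified rule could be sketched like this (a hypothetical helper to illustrate the idea, not the actual Spark implementation):

```scala
// Minimal sketch of the simplified rule: the presence of a LOCATION,
// not the EXTERNAL keyword, determines the resulting table type.
sealed trait TableType
case object Managed extends TableType
case object External extends TableType

def tableTypeFor(location: Option[String]): TableType =
  if (location.isDefined) External else Managed
```

Under this rule, CREATE TABLE t LIKE s LOCATION '/path' yields an external table, while CREATE TABLE t LIKE s yields a managed one, so the EXTERNAL keyword becomes unnecessary.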

@SparkQA

SparkQA commented Jan 24, 2017

Test build #71904 has finished for PR 16638 at commit b80f8e6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member)

ping @ouyangxiaochen : )

@ouyangxiaochen (Author)

Happy Chinese New Year! @gatorsmile
The Spring Festival holiday just ended and I returned to work today. What do I need to do next?

@ouyangxiaochen (Author)

ping @gatorsmile

@cloud-fan (Contributor)

please address #16638 (comment)

2. Simplify the logic: if a location is specified, create an external table internally; otherwise, create a managed table.
3. Update test cases.
@@ -51,13 +51,14 @@ import org.apache.spark.util.Utils
*
* The syntax of using this command in SQL is:
* {{{
* CREATE TABLE [IF NOT EXISTS] [db_name.]table_name
* LIKE [other_db_name.]existing_table_name
* CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
Contributor

no EXTERNAL

@@ -518,8 +518,8 @@ class HiveDDLCommandSuite extends PlanTest with SQLTestUtils with TestHiveSingle

test("create table like") {
val v1 = "CREATE TABLE table1 LIKE table2"
val (target, source, exists) = parser.parsePlan(v1).collect {
case CreateTableLikeCommand(t, s, allowExisting) => (t, s, allowExisting)
val (target, source, location, exists) = parser.parsePlan(v1).collect {
Contributor

add an assert to check location is empty

@@ -528,8 +528,8 @@ class HiveDDLCommandSuite extends PlanTest with SQLTestUtils with TestHiveSingle
assert(source.table == "table2")

val v2 = "CREATE TABLE IF NOT EXISTS table1 LIKE table2"
Contributor

add one more test case to check CREATE TABLE LIKE with location

TableIdentifier(targetTabName, Some("default")))

checkCreateTableLike(sourceTable, targetTable)
test("CREATE TABLE LIKE a temporary view [LOCATION]...") {
Contributor

actually we don't need to change the test name

checkCreateTableLike(sourceTable, targetTable)
test("CREATE TABLE LIKE a temporary view [LOCATION]...") {
var createdTableType = "MANAGED"
for ( i <- 0 to 1 ) {
Contributor

you can create a method with parameter location: Option[String], instead of writing a for loop with 2 iterations...

Author

I wrote it this way to reuse this piece of common code, because the basic logic of the two scenarios is almost the same.

Contributor

Creating a method and wrapping this piece of code in it also lets you reuse the code.

private def checkCreateTableLike(
sourceTable: CatalogTable,
targetTable: CatalogTable,
tableType: String): Unit = {
Contributor

Why not pass in a CatalogTableType instead of a String?
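A hypothetical shape for the helper after the reviewers' two suggestions. The `CatalogTable` and `CatalogTableType` definitions below are simplified stand-ins for the real Spark catalog classes, shown only to illustrate the signature change:

```scala
// Simplified stand-ins for Spark's catalog classes, for illustration only.
sealed trait CatalogTableType
case object MANAGED extends CatalogTableType
case object EXTERNAL extends CatalogTableType

case class CatalogTable(name: String, tableType: CatalogTableType)

// Takes the expected type as a typed value rather than a raw String,
// so callers cannot pass a misspelled type name.
def checkCreateTableLike(
    sourceTable: CatalogTable,
    targetTable: CatalogTable,
    expectedType: CatalogTableType): Unit = {
  // The real suite compares many more catalog fields; only the
  // table-type check is modeled here.
  assert(targetTable.tableType == expectedType)
}
```

One helper then covers both scenarios without a two-iteration loop: pass EXTERNAL when a LOCATION was given, MANAGED otherwise.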

@cloud-fan (Contributor)

please resolve the conflict too, thanks!

@SparkQA

SparkQA commented Feb 8, 2017

Test build #72550 has finished for PR 16638 at commit 71f1d12.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@ouyangxiaochen (Author)

I ran into some trouble while resolving the conflict, so could you give me some guidance? Thanks a lot! @cloud-fan

@cloud-fan (Contributor)

you can start with a new branch and apply the changes manually, e.g. copy code from this PR to the new branch.

@SparkQA

SparkQA commented Feb 8, 2017

Test build #72576 has finished for PR 16638 at commit 5dd21b2.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@ouyangxiaochen (Author)

ouyangxiaochen commented Feb 9, 2017

Should I delete my remote master repository first, and then fork a new one? @cloud-fan

@gatorsmile (Member)

Your master is clean (i.e., exactly identical to the upstream/master), right?

@ouyangxiaochen (Author)

My master branch was not synchronized with the Apache master. I did a pull, but it was still not synchronized, and finally I removed my remote repository.
But now I do not know how to associate a new branch with this PR; I think I made a mistake. @gatorsmile

@gatorsmile (Member)

You might not be familiar with GitHub/Git. How about submitting a new PR? : )

@ouyangxiaochen (Author)

Here's how I create a PR:
1. Fork the Apache master.
2. Create a new branch from my master branch.
3. Select the new branch in the branch menu and create a new PR.
4. Edit the code on the new branch.
5. Commit and push.
Can you point out any missing or mistaken steps for me? Thank you for your guidance! @gatorsmile

@gatorsmile (Member)

You do not need to do step 1 every time. You might have missed the following two steps when resolving your conflicts:

git fetch upstream
git merge upstream/master

@ouyangxiaochen (Author)

Oh, I see, I missed the step 'git remote add upstream ...'.
But now I have deleted the repository from my profile, so this PR can't know which repository it should be associated with. Do you have a way to help me work around this problem?

@gatorsmile (Member)

No worries, open/submit a new PR. : )

@gatorsmile (Member)

gatorsmile commented Feb 9, 2017

You might be able to make it by forcefully pushing the new changes by git push -f origin NEW_BRANCH:REMOTE_BRANCH

@ouyangxiaochen (Author)

OK, I'll try it immediately. Thank you very much!

@ouyangxiaochen (Author)

I have created a PR at https://github.com/apache/spark/pull/16868, please review it, Thanks! @gatorsmile @cloud-fan

@maropu maropu mentioned this pull request Apr 23, 2017
maropu added a commit to maropu/spark that referenced this pull request Apr 23, 2017
@asfgit asfgit closed this in e9f9715 Apr 24, 2017
peter-toth pushed a commit to peter-toth/spark that referenced this pull request Oct 6, 2018
This PR proposes to close stale PRs. Currently we have 400+ open PRs, and some are stale: their JIRA tickets have already been closed, or their JIRA tickets do not exist (and they do not seem to be minor issues).

// Open PRs whose JIRA tickets have been already closed
Closes apache#11785
Closes apache#13027
Closes apache#13614
Closes apache#13761
Closes apache#15197
Closes apache#14006
Closes apache#12576
Closes apache#15447
Closes apache#13259
Closes apache#15616
Closes apache#14473
Closes apache#16638
Closes apache#16146
Closes apache#17269
Closes apache#17313
Closes apache#17418
Closes apache#17485
Closes apache#17551
Closes apache#17463
Closes apache#17625

// Open PRs whose JIRA tickets do not exist and which are not minor issues
Closes apache#10739
Closes apache#15193
Closes apache#15344
Closes apache#14804
Closes apache#16993
Closes apache#17040
Closes apache#15180
Closes apache#17238

N/A

Author: Takeshi Yamamuro <yamamuro@apache.org>

Closes apache#17734 from maropu/resolved_pr.

Change-Id: Id2e590aa7283fe5ac01424d30a40df06da6098b5