[SPARK-27732][SQL] Add v2 CreateTable implementation. #24617

rdblue · 2019-05-15T21:14:58Z

What changes were proposed in this pull request?

This adds a v2 implementation of create table:

* `CreateV2Table` is the logical plan, named using v2 to avoid conflicting with the existing plan
* `CreateTableExec` is the physical plan

How was this patch tested?

Added resolution and v2 SQL tests.

rdblue · 2019-05-15T21:16:31Z

@cloud-fan, @mccheah, here's the next v2 operation, create table.

SparkQA · 2019-05-16T00:16:24Z

Test build #105430 has finished for PR 24617 at commit a5b9b10.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class CreateV2Table(
case class CreateTableExec(

cloud-fan · 2019-05-16T01:24:39Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/CreateTableExec.scala

Between catalog.tableExists and catalog.createTable, another user may create the table, and then catalog.createTable will fail.

To make it atomic, shall we just pass the ignoreIfExists parameter to the catalog and ask the catalog to implement it? I checked the Hive catalog; it does have an ignoreIfExists parameter in its createTable method.

No, I don't think that adding more parameters to the API is the right answer. If the table already exists because of a race condition, then createTable throws an exception.

The purpose of this check is not strict correctness under race conditions; it is to enforce consistency. If the catalog returns that the table exists, then Spark must not attempt to create it.

I pushed a fix for the case where the table is created after the exists check and ignoreIfExists is true. If ignoreIfExists is true, then TableAlreadyExistsException should be caught and ignored.
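The behavior described in this thread can be sketched as follows. This is a minimal, self-contained Java sketch, not the code in CreateTableExec: `SketchCatalog`, `InMemorySketchCatalog`, `TableExistsError`, and `CreateTableSketch` are hypothetical stand-ins for the catalog, TableAlreadyExistsException, and the physical plan. It assumes only that the catalog's createTable throws when the table already exists.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for TableAlreadyExistsException.
class TableExistsError extends RuntimeException {
    TableExistsError(String name) {
        super("Table already exists: " + name);
    }
}

// Hypothetical, simplified catalog: createTable takes no ignoreIfExists flag.
interface SketchCatalog {
    boolean tableExists(String name);

    // Assumed to throw TableExistsError if the table is already present.
    void createTable(String name);
}

// Toy in-memory catalog used only to exercise the sketch.
class InMemorySketchCatalog implements SketchCatalog {
    private final Set<String> tables = ConcurrentHashMap.newKeySet();

    public boolean tableExists(String name) {
        return tables.contains(name);
    }

    public void createTable(String name) {
        if (!tables.add(name)) {
            throw new TableExistsError(name);
        }
    }
}

class CreateTableSketch {
    // Returns true if this call created the table, false if it was skipped.
    static boolean run(SketchCatalog catalog, String name, boolean ignoreIfExists) {
        if (catalog.tableExists(name)) {
            // The catalog reports the table, so no create is attempted.
            if (ignoreIfExists) {
                return false;
            }
            throw new TableExistsError(name);
        }
        try {
            catalog.createTable(name);
            return true;
        } catch (TableExistsError e) {
            // Another writer created the table after the exists check.
            if (ignoreIfExists) {
                return false;
            }
            throw e;
        }
    }
}
```

With this shape, a concurrent create surfaces as an error for a plain CREATE TABLE and is ignored for CREATE TABLE IF NOT EXISTS, without adding a flag to the catalog interface.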

SparkQA · 2019-05-16T22:43:44Z

Test build #105470 has finished for PR 24617 at commit cf12faa.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-05-17T12:56:19Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/CreateTableExec.scala

This trick seems to work, but I'm not a database expert, so I don't know the common pitfalls of implementing CREATE TABLE. cc @gatorsmile @dilipbiswal

rdblue · 2019-05-22T22:05:10Z

@cloud-fan, any more comments?

@mccheah and @dongjoon-hyun, do you have any comments?

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/CreateTableExec.scala

mccheah · 2019-05-23T00:59:18Z

Looks good, about what I would expect apart from some small changes.

rdblue · 2019-05-23T01:53:29Z

@mccheah, I made the changes you requested. Should be good to go when tests pass.

SparkQA · 2019-05-23T04:56:11Z

Test build #105708 has finished for PR 24617 at commit 47d89d3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceResolution.scala

sql/core/src/test/scala/org/apache/spark/sql/execution/command/PlanResolutionSuite.scala

cloud-fan · 2019-05-23T13:03:14Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/CreateTableExec.scala

I checked Presto's SPI; it asks the connector to implement a createTable method with the ignoreExisting parameter.

When implementing Hive/JDBC with data source v2, I think it's better to directly pass the ignoreIfExists flag, as these data sources support this flag natively.

I don't know the exact reason why Presto designed its SPI this way; maybe it gives the data source a chance to optimize for the ignoreIfExists flag. I think it's better to follow Presto's design here.

BTW, I think this is a separate issue from adding CREATE TABLE support. I'm fine as long as we add a TODO here.

@cloud-fan, adding an additional argument to the createTable method is a poor choice because it forces Spark to depend on sources to implement consistent behavior. Consistency and reliability are problems in Spark that we are trying to address by making Spark handle these cases itself.

That's why not adding a flag to createTable is the right choice. It keeps the API simpler for implementers and guarantees consistent behavior.

The question is whether the downstream source behaves differently from what Spark wants to enforce when the ignoreIfExists flag is passed to the source, as opposed to Spark deciding how to handle it. I can imagine a discrepancy if the user gets different behavior from running the IF NOT EXISTS query directly on the Hive / SQL DB than from running it through Spark.

I think it's better to keep Spark consistent across sources, which does mean conceding that Spark may be inconsistent with the underlying source in the way described above. Where appropriate, we should document where the behavior of the SQL queries may deviate from the behavior of the underlying source.

Thanks, @mccheah! I talked with Wenchen this morning and I think we are all in agreement now that we should guarantee consistency.
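To illustrate the trade-off discussed in this thread, here is a hedged Java sketch. All types in it (`FlaggedCatalog`, `PlainCatalog`, `HonoringSource`, `ForgetfulSource`, `EngineSide`) are hypothetical and are not Spark, Hive, or Presto APIs. It assumes two sources that treat a pass-through ignoreIfExists flag differently, and contrasts that with handling IF NOT EXISTS in the engine over a flag-free createTable.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical error raised when a table already exists.
class DuplicateTableError extends RuntimeException {
    DuplicateTableError(String name) {
        super("Table already exists: " + name);
    }
}

// Alternative design: each source interprets the flag itself.
interface FlaggedCatalog {
    void createTable(String name, boolean ignoreIfExists);
}

// Flag-free design: the engine decides what IF NOT EXISTS means.
interface PlainCatalog {
    boolean tableExists(String name);

    void createTable(String name);
}

abstract class BaseSource implements PlainCatalog, FlaggedCatalog {
    private final Set<String> tables = new HashSet<>();

    public boolean tableExists(String name) {
        return tables.contains(name);
    }

    public void createTable(String name) {
        if (!tables.add(name)) {
            throw new DuplicateTableError(name);
        }
    }
}

// A source that honors the flag as intended.
class HonoringSource extends BaseSource {
    public void createTable(String name, boolean ignoreIfExists) {
        if (ignoreIfExists && tableExists(name)) {
            return;
        }
        createTable(name);
    }
}

// A source that (hypothetically) forgets to honor the flag.
class ForgetfulSource extends BaseSource {
    public void createTable(String name, boolean ignoreIfExists) {
        createTable(name); // flag is silently ignored
    }
}

class EngineSide {
    // Engine-side IF NOT EXISTS: same outcome for every PlainCatalog.
    static void createIfNotExists(PlainCatalog catalog, String name) {
        if (catalog.tableExists(name)) {
            return;
        }
        try {
            catalog.createTable(name);
        } catch (DuplicateTableError e) {
            // Created concurrently after the check; nothing to do.
        }
    }
}
```

In this sketch, the flagged call succeeds on `HonoringSource` but throws on `ForgetfulSource` for an existing table, while `EngineSide.createIfNotExists` behaves identically on both.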

cloud-fan · 2019-05-23T16:03:08Z

LGTM. I'm fine with not adding an ignoreIfExists flag to the createTable method. If others have different opinions, please leave comments and we can discuss further.

SparkQA · 2019-05-23T18:38:28Z

Test build #105730 has finished for PR 24617 at commit 9664638.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-05-24T03:13:39Z

Thanks, merging to master!

rdblue · 2019-05-24T15:45:03Z

Thanks for merging and reviewing, @cloud-fan!

## What changes were proposed in this pull request?

This adds a v2 implementation of create table:
* `CreateV2Table` is the logical plan, named using v2 to avoid conflicting with the existing plan
* `CreateTableExec` is the physical plan

## How was this patch tested?

Added resolution and v2 SQL tests.

Closes apache#24617 from rdblue/SPARK-27732-add-v2-create-table.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

cloud-fan reviewed May 16, 2019

cloud-fan reviewed May 17, 2019

mccheah reviewed May 23, 2019

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/CreateTableExec.scala

mccheah reviewed May 23, 2019

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/CreateTableExec.scala

cloud-fan reviewed May 23, 2019

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceResolution.scala

cloud-fan reviewed May 23, 2019

sql/core/src/test/scala/org/apache/spark/sql/execution/command/PlanResolutionSuite.scala

cloud-fan reviewed May 23, 2019

rdblue added 4 commits May 23, 2019 08:26

Add v2 CreateTable implementation.

40d8aab

Fix race condition with IF NOT EXISTS.

c585da6

Update to fix review comments.

36682ea

Update to use the default v2 catalog.

9664638

rdblue force-pushed the SPARK-27732-add-v2-create-table branch from 47d89d3 to 9664638 Compare May 23, 2019 15:32

HyukjinKwon mentioned this pull request May 24, 2019

[SPARK-27350][SQL] Support create table on data source V2 #24278

Closed

cloud-fan closed this in 6b28497 May 24, 2019
