Skip to content

Conversation

@rdblue
Copy link
Contributor

@rdblue rdblue commented May 15, 2019

What changes were proposed in this pull request?

This adds a v2 implementation of create table:

  • CreateV2Table is the logical plan, named using v2 to avoid conflicting with the existing plan
  • CreateTableExec is the physical plan

How was this patch tested?

Added resolution and v2 SQL tests.

@rdblue
Copy link
Contributor Author

rdblue commented May 15, 2019

@cloud-fan, @mccheah, here's the next v2 operation, create table.

@SparkQA
Copy link

SparkQA commented May 16, 2019

Test build #105430 has finished for PR 24617 at commit a5b9b10.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class CreateV2Table(
  • case class CreateTableExec(

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

between catalog.tableExists and catalog.createTable, another user may create the table and then catalog.createTable fails.

To make it atomic, shall we just pass the ignoreIfExists parameter to catalog and ask the catalog to implement it? I checked hive catalog, it does have a ignoreIfExists parameter in its createTable method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I don't think that adding more parameters to the API is the right answer. If the table already exists because of a race condition, then createTable throws an exception.

The purpose of this check is not for strict correctness with race conditions, it is to enforce consistency. If the catalog returns that the table exists, then Spark must not attempt to create it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed a fix for the case where the table is created after the exists check and ignoreIfExists is true. If ignoreIfExists is true, then TableAlreadyExistsException should be caught and ignored.

@SparkQA
Copy link

SparkQA commented May 16, 2019

Test build #105470 has finished for PR 24617 at commit cf12faa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This trick seems to work, but I'm not a database expert so I don't know the common pitfalls to implement a CREATE TABLE. cc @gatorsmile @dilipbiswal

@rdblue
Copy link
Contributor Author

rdblue commented May 22, 2019

@cloud-fan, any more comments?

@mccheah and @dongjoon-hyun, do you have any comments?

@mccheah
Copy link
Contributor

mccheah commented May 23, 2019

Looks good, about what I would expect apart from some small changes.

@rdblue
Copy link
Contributor Author

rdblue commented May 23, 2019

@mccheah, I made the changes you requested. Should be good to go when tests pass.

@SparkQA
Copy link

SparkQA commented May 23, 2019

Test build #105708 has finished for PR 24617 at commit 47d89d3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked presto's SPI, it asks the connector to implement a createTable method with the ignoreExisting parameter.

When implementing Hive/JDBC with data source v2, I think it's better to directly pass the ignoreIfExists flag, as these data sources support this flag natively.

I don't know the exact reason why presto designed its SPI in this way, maybe it's because the data source can have a chance to optimize for the ignoreIfExists flag. I think it's better to follow the design of presto here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW I think it's a separated issue from adding CREATE TABLE support. I'm fine as long as we add a TODO here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan, adding an additional argument to the createTable method is a poor choice because it forces Spark to depend on sources to implement consistent behavior. Consistency and reliability is a problem in Spark that we are trying to address by making Spark handle these cases.

That's why not adding a flag to createTable is the right choice. It keeps the API simpler for implementers and guarantees consistent behavior.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The question is if the downstream source has different behavior from what Spark wants to enforce if the ignoreIfExists flag is passed to the source vs. Spark deciding how to handle it. So I can imagine there being a discrepancy if the user gets different behavior from running the IF NOT EXISTS query directly on the Hive / SQL DB vs. running it through Spark.

I think it's better to keep Spark consistent across sources, which does leave a concession for us being inconsistent in the above way. We should document the behavior of the SQL queries where they may deviate from the behavior of the underlying source where appropriate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @mccheah! I talked with Wenchen this morning and I think we are all in agreement now that we should guarantee consistency.

@rdblue rdblue force-pushed the SPARK-27732-add-v2-create-table branch from 47d89d3 to 9664638 Compare May 23, 2019 15:32
@cloud-fan
Copy link
Contributor

LGTM. I'm fine with not adding ignoreIfExists flag in the createTable method. If others have different opinions, please leave comments and we can discuss further.

@SparkQA
Copy link

SparkQA commented May 23, 2019

Test build #105730 has finished for PR 24617 at commit 9664638.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 6b28497 May 24, 2019
@rdblue
Copy link
Contributor Author

rdblue commented May 24, 2019

Thanks for merging and reviewing, @cloud-fan!

mccheah pushed a commit to palantir/spark that referenced this pull request Jun 6, 2019
## What changes were proposed in this pull request?

This adds a v2 implementation of create table:
* `CreateV2Table` is the logical plan, named using v2 to avoid conflicting with the existing plan
* `CreateTableExec` is the physical plan

## How was this patch tested?

Added resolution and v2 SQL tests.

Closes apache#24617 from rdblue/SPARK-27732-add-v2-create-table.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants