- 
                Notifications
    
You must be signed in to change notification settings  - Fork 28.9k
 
[SPARK-27732][SQL] Add v2 CreateTable implementation. #24617
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| 
           @cloud-fan, @mccheah, here's the next v2 operation, create table.  | 
    
| 
           Test build #105430 has finished for PR 24617 at commit  
  | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
between catalog.tableExists and catalog.createTable, another user may create the table and then catalog.createTable fails.
To make it atomic, shall we just pass the ignoreIfExists parameter to catalog and ask the catalog to implement it? I checked hive catalog, it does have a ignoreIfExists parameter in its createTable method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I don't think that adding more parameters to the API is the right answer. If the table already exists because of a race condition, then createTable throws an exception.
The purpose of this check is not for strict correctness with race conditions, it is to enforce consistency. If the catalog returns that the table exists, then Spark must not attempt to create it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed a fix for the case where the table is created after the exists check and ignoreIfExists is true. If ignoreIfExists is true, then TableAlreadyExistsException should be caught and ignored.
| 
           Test build #105470 has finished for PR 24617 at commit  
  | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This trick seems to work, but I'm not a database expert so I don't know the common pitfalls to implement a CREATE TABLE. cc @gatorsmile @dilipbiswal
| 
           @cloud-fan, any more comments? @mccheah and @dongjoon-hyun, do you have any comments?  | 
    
        
          
                sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/CreateTableExec.scala
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/CreateTableExec.scala
              
                Outdated
          
            Show resolved
            Hide resolved
        
      | 
           Looks good, about what I would expect apart from some small changes.  | 
    
| 
           @mccheah, I made the changes you requested. Should be good to go when tests pass.  | 
    
| 
           Test build #105708 has finished for PR 24617 at commit  
  | 
    
        
          
                sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceResolution.scala
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                sql/core/src/test/scala/org/apache/spark/sql/execution/command/PlanResolutionSuite.scala
              
                Outdated
          
            Show resolved
            Hide resolved
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked presto's SPI, it asks the connector to implement a createTable method with the ignoreExisting parameter.
When implementing Hive/JDBC with data source v2, I think it's better to directly pass the ignoreIfExists flag, as these data sources support this flag natively.
I don't know the exact reason why presto designed its SPI in this way, maybe it's because the data source can have a chance to optimize for the ignoreIfExists flag. I think it's better to follow the design of presto here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW I think it's a separated issue from adding CREATE TABLE support. I'm fine as long as we add a TODO here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan, adding an additional argument to the createTable method is a poor choice because it forces Spark to depend on sources to implement consistent behavior. Consistency and reliability is a problem in Spark that we are trying to address by making Spark handle these cases.
That's why not adding a flag to createTable is the right choice. It keeps the API simpler for implementers and guarantees consistent behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The question is if the downstream source has different behavior from what Spark wants to enforce if the ignoreIfExists flag is passed to the source vs. Spark deciding how to handle it. So I can imagine there being a discrepancy if the user gets different behavior from running the IF NOT EXISTS query directly on the Hive / SQL DB vs. running it through Spark.
I think it's better to keep Spark consistent across sources, which does leave a concession for us being inconsistent in the above way. We should document the behavior of the SQL queries where they may deviate from the behavior of the underlying source where appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @mccheah! I talked with Wenchen this morning and I think we are all in agreement now that we should guarantee consistency.
47d89d3    to
    9664638      
    Compare
  
    | 
           LGTM. I'm fine with not adding   | 
    
| 
           Test build #105730 has finished for PR 24617 at commit  
  | 
    
| 
           thanks, merging to master!  | 
    
| 
           Thanks for merging and reviewing, @cloud-fan!  | 
    
## What changes were proposed in this pull request? This adds a v2 implementation of create table: * `CreateV2Table` is the logical plan, named using v2 to avoid conflicting with the existing plan * `CreateTableExec` is the physical plan ## How was this patch tested? Added resolution and v2 SQL tests. Closes apache#24617 from rdblue/SPARK-27732-add-v2-create-table. Authored-by: Ryan Blue <blue@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
This adds a v2 implementation of create table:
CreateV2Tableis the logical plan, named using v2 to avoid conflicting with the existing planCreateTableExecis the physical planHow was this patch tested?
Added resolution and v2 SQL tests.