[SPARK-27953][SQL] Save default constraint with Column into table properties when create Hive table #24792
remove link
@beliefer Thank you for initiating this. This is not a small piece of work. Could you write a design doc? We need to investigate the impact of DEFAULT on all the other DDL/DML commands and on the data source APIs.
Personally, I think we might need to create an umbrella JIRA and estimate the sizing.
@gatorsmile Thanks for your review. As you said, this is not a small piece of work. I refined the PR description and created a parent JIRA.
@srowen Maybe you can help me review this PR, thanks! If not, thanks too.
I don't feel confident enough to review changes to the SQL language support here.
It doesn't matter, thanks.
@beliefer Before submitting PRs, could we first start it with a design doc? Ping me if the design doc is ready to review. Thanks!
@gatorsmile Thanks for your reply. The design doc is ready. How should I pass it to you? What format is recommended for the design doc?
@@ -735,7 +735,7 @@ colTypeList
     ;

 colType
-    : identifier dataType (COMMENT STRING)?
+    : identifier dataType (COMMENT STRING)? (DEFAULT defaultExpression=expression)?
Isn't the defaultExpression=expression scope too big for a DDL default constraint? As I recall, the common default constraints are NULL, NUMBER, STRING, CURRENT_DATE, and CURRENT_TIMESTAMP.
Thanks for your review. As you said, the description of this PR contains a discussion about the scope of the default constraint. Do we need to implement other expressions, like Cast(1 as float), 1 + 2, and so on?
This is Oracle's default constraint. https://docs.oracle.com/javadb/10.8.3.0/ref/rrefsqlj30540.html#rrefsqlj30540__sqlj64478
You can take a look at other DB engines' default constraint.
@lipzhu It's worth referencing, but we need to consider the actual situation in Spark SQL. Thanks.
@lipzhu I reduced the scope of default constraint. Thanks.
Hi, @beliefer. For the umbrella issue, the subtask JIRA ID is enough for the title.
OK. Thanks for your reminder.
@gatorsmile The design doc of default constraint is ready.
What's the progress of this PR? https://issues.apache.org/jira/browse/SPARK-29119 is also associated with this PR. I think this will be a useful feature for users to handle default values or computed columns. @beliefer You can put your design doc on Google Docs for more details and add comparisons with other engines, eg:
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
Background
Column default constraints are part of the ANSI SQL standard.
Hive 3.0+ supports default constraints; see https://issues.apache.org/jira/browse/HIVE-18726
Spark SQL does not implement this feature yet.
Design
Hive is widely used in production environments and is the de facto standard in the big data field.
However, many Hive versions are used in production, and the features differ between versions.
Spark SQL needs to implement default constraints, but there are three points to pay attention to in the design:
First, Spark SQL should reduce coupling with Hive.
Second, default constraints should be compatible with different versions of Hive.
Third, which default constraint expressions should Spark SQL support? I think it should support literal, current_date(), and current_timestamp(). Maybe other expressions should also be supported, like Cast(1 as float), 1 + 2, and so on.
We want to save the metadata of the default constraint into the properties of the Hive table, and then restore the metadata from the properties after the client gets the newest metadata. The implementation is the same as for other metadata (e.g. partition, bucket, statistics).
Because a default constraint is part of a column, I think we could reuse the metadata of StructField. The default constraint will be cached in the metadata of StructField.
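The round trip described above can be sketched in a few lines. This is only an illustration in plain Python, not the Scala code in this PR: the Column class stands in for StructField, and the property prefix example.column.default. is a hypothetical key chosen for the example, since the description does not name the real one.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical property prefix; the PR description does not name the real key.
DEFAULT_PROP_PREFIX = "example.column.default."


@dataclass
class Column:
    """Stand-in for a column with free-form metadata (like StructField metadata)."""
    name: str
    data_type: str
    metadata: Dict[str, str] = field(default_factory=dict)


def save_defaults(columns: List[Column]) -> Dict[str, str]:
    """Flatten each column's default expression into string table properties."""
    return {
        DEFAULT_PROP_PREFIX + c.name: c.metadata["default"]
        for c in columns
        if "default" in c.metadata
    }


def restore_defaults(columns: List[Column], props: Dict[str, str]) -> List[Column]:
    """Rebuild column metadata from properties read back from the metastore."""
    restored = []
    for c in columns:
        meta = dict(c.metadata)
        meta.pop("default", None)  # the stored properties are the source of truth
        key = DEFAULT_PROP_PREFIX + c.name
        if key in props:
            meta["default"] = props[key]
        restored.append(Column(c.name, c.data_type, meta))
    return restored
```

Storing the default as a plain string in table properties means an older Hive version that knows nothing about default constraints simply ignores the extra keys, which is the compatibility goal stated in the second design point.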
Details of this PR
This PR is a subtask of implementing default constraints.
It saves the default constraint into the properties of the Hive table or data source table.
Some issues remain in this PR:
First, how to check whether a user-specified number complies with the precision and range of the data type, such as float or double.
Second, some of the code is not very elegant; I hope to improve it with your suggestions.
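One possible way to approach the first open issue is to compare the exact decimal value of the literal with the value the column type would actually store. The sketch below is plain Python and is not taken from this PR; check_literal and its three result labels are hypothetical names, and "float" and "double" are assumed to mean IEEE 754 single and double precision.

```python
import math
import struct
from decimal import Decimal


def check_literal(literal: str, data_type: str) -> str:
    """Classify a numeric literal as 'ok', 'lossy', or 'out_of_range'."""
    exact = Decimal(literal)
    as_double = float(literal)  # overflows to inf instead of raising
    if math.isinf(as_double):
        return "out_of_range"
    if data_type == "double":
        stored = as_double
    elif data_type == "float":
        try:
            # Round-trip through 4 bytes to get the value a float column would hold.
            stored = struct.unpack("<f", struct.pack("<f", as_double))[0]
        except OverflowError:
            return "out_of_range"
    else:
        raise ValueError("unsupported data type: " + data_type)
    # Decimal(float) is exact, so this detects any rounding of the literal.
    return "ok" if Decimal(stored) == exact else "lossy"
```

Note that a strict check is probably too strict to enforce as-is: a literal such as 0.1 is classified as lossy for both types, because it has no exact binary representation. Whether a lossy default should be rejected, warned about, or silently rounded is a policy decision this sketch does not settle.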
Related PR
This PR is related to https://github.com/apache/spark/pull/24372. Once this PR is finished, target columns that are not specified in an insert into statement can be filled with their default values.
After this PR, I will continue to open other PRs about default constraints, such as alter table and desc table.
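The intended insert into behavior can be illustrated with a small sketch. It is plain Python with hypothetical names (complete_row, and defaults modeled as zero-argument callables), and it does not reflect how either PR implements the feature in Spark.

```python
from datetime import date
from typing import Any, Callable, Dict, List


def complete_row(
    table_columns: List[str],
    defaults: Dict[str, Callable[[], Any]],
    target_columns: List[str],
    values: List[Any],
) -> List[Any]:
    """Expand a partial insert into a full row, in table column order."""
    if len(target_columns) != len(values):
        raise ValueError("number of values does not match number of target columns")
    supplied = dict(zip(target_columns, values))
    unknown = set(supplied) - set(table_columns)
    if unknown:
        raise ValueError("unknown columns: " + ", ".join(sorted(unknown)))
    row = []
    for col in table_columns:
        if col in supplied:
            row.append(supplied[col])
        elif col in defaults:
            row.append(defaults[col]())  # evaluated at insert time
        else:
            row.append(None)  # no default constraint: fall back to NULL
    return row


# Example: only "id" is specified; the other columns come from their defaults.
row = complete_row(
    ["id", "name", "created"],
    {"name": lambda: "unknown", "created": date.today},
    ["id"],
    [1],
)
```

Modeling each default as a callable keeps the distinction between a literal and an expression such as current_date(), which has to be evaluated when the row is inserted rather than when the table is created.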
How was this patch tested?
Unit tests.