[SPARK-21784][SQL] Adds support for defining informational primary key and foreign key constraints using ALTER TABLE DDL. #18994
Conversation
Test build #80851 has finished for PR 18994 at commit
retest this please
Test build #80855 has finished for PR 18994 at commit
Could you evaluate the impact of the other DDL on the constraints? For example, rename.
Sure. DDL that changes the table name, column name, or data type of the referenced primary key will affect foreign key definitions. I will check the Spark DDL that does schema changes and get back to you.
.map(findColumnByName(table.dataSchema, _, resolver))
// Constraints are only supported for basic sql types, throw error for any other data types.
keyColFields.map(_.dataType).foreach {
  case ByteType | ShortType | IntegerType | LongType | FloatType |
BinaryType?
Thanks for the review @viirya. I overlooked the binary type; I will add it.
val TABLE_CONSTRAINT_PREFIX = SPARK_SQL_PREFIX + "constraint."
val TABLE_CONSTRAINT_PRIMARY_KEY = SPARK_SQL_PREFIX + TABLE_CONSTRAINT_PREFIX + "pk"
val TABLE_NUM_FK_CONSTRAINTS = SPARK_SQL_PREFIX + "numFkConstraints"
val TABLE_CONSTRAINT_FOREIGNKEY_PREFIX = SPARK_SQL_PREFIX + TABLE_CONSTRAINT_PREFIX + "fk."
SPARK_SQL_PREFIX is duplicated in TABLE_CONSTRAINT_PRIMARY_KEY and TABLE_CONSTRAINT_FOREIGNKEY_PREFIX. E.g., TABLE_CONSTRAINT_PRIMARY_KEY is SPARK_SQL_PREFIX + SPARK_SQL_PREFIX + "constraint.".
Good catch. I will fix it.
Thank you very much for reviewing @gatorsmile.

ALTER STATEMENTS

ALTER TABLE name RENAME TO new_name
ALTER TABLE name CHANGE column_name new_name new_type

Spark SQL can raise errors if the renamed table, renamed column, or changed data type is referenced by an existing constraint. When Spark adds support for DROP/REPLACE of columns, those statements will also impact informational constraints:

ALTER TABLE name DROP [COLUMN] column_name
ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])

DROP TABLE

DROP TABLE name

Hive drops the referential constraints automatically. Oracle requires the user to specify the [CASCADE CONSTRAINTS] clause to automatically drop the referential constraints, and otherwise raises an error. Should we stick to the Hive behavior?

Fixing the affected DDLs requires carrying additional dependency information as part of the stored primary key definition. Is it OK if I fix the affected DDLs in a separate PR?
@sureshthalamati Sure. Please create sub-JIRAs for them.
Created SPARK-21823 and SPARK-21824 for fixing the DDLs that impact the informational constraints.
Force-pushed 4839e84 to f1f6d35
Test build #81123 has finished for PR 18994 at commit
Force-pushed f1f6d35 to ea39601
Test build #82019 has finished for PR 18994 at commit
retest this please
Test build #82039 has finished for PR 18994 at commit
ping @gatorsmile @cloud-fan @rxin
Before we review the DDL changes, we need to see the PRs that can benefit from this.
Thank you for the input @gatorsmile
@sureshthalamati Hi Suresh, we are planning to proceed with the performance improvements. Will you be able to continue working on this PR? Thanks.
…ow users define informational primary key and foreign key constraints on a table.
… and fixed constraint property names
Force-pushed ea39601 to c126122
@ioana-delaney Thank you for pinging me. I would like to complete this PR. My responses might be slow due to other commitments at my workplace; if I am blocking you, please feel free to take over the PR.
Test build #88456 has finished for PR 18994 at commit
I think Constraint should be designed together with DataSource v2 and can do more than SPARK-19842. Constraints can be used both for data integrity and for optimizer query rewrites.

Above all, we can bring the constraint feature into the DataSource v2 design. The Hive catalog supports constraints, so we can implement this logic in the createTable/alterTable API. Then we can use Spark SQL DDL to create a table with constraints, which are stored to the Hive metastore via the Hive catalog API. As for how to store constraints: because Hive 2.1 provides a constraint API in Hive.java, we can call it directly in the createTable/alterTable API of the Hive catalog. There is no need to use table properties to store these.
Any update on this? Would be awesome to have in Spark 3.0!
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
This PR implements the ALTER TABLE DDL ADD CONSTRAINT statement to add informational primary key and foreign key (referential integrity) constraints in Spark. These constraints will be used in query optimization; more details are in the spec attached to SPARK-19842.
The proposed syntax of the constraints DDL is similar to the Hive 2.1 referential integrity constraints support (https://issues.apache.org/jira/browse/HIVE-13076), which is aligned with Oracle's semantics.
Syntax:
Examples :
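The syntax and example blocks from the original PR description were not captured here. As a hedged sketch modeled on the Hive 2.1 informational-constraint syntax cited above (the exact grammar and table/column names below are illustrative, and may differ from what this PR actually implemented):

```sql
-- Illustrative only; modeled on Hive 2.1 informational constraints.
ALTER TABLE orders ADD CONSTRAINT orders_pk
  PRIMARY KEY (order_id) DISABLE NOVALIDATE;

ALTER TABLE order_items ADD CONSTRAINT order_items_fk
  FOREIGN KEY (order_id) REFERENCES orders (order_id)
  DISABLE NOVALIDATE RELY;
```

DISABLE NOVALIDATE marks the constraint as informational: the engine does not enforce or validate it, while RELY tells the optimizer it may use the constraint for rewrites.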
The constraint information is stored in the table properties as a JSON string per constraint.
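To illustrate the storage scheme, here is a minimal Python sketch of serializing a foreign-key constraint to a JSON string under a table-property key. The key prefixes mirror the constants shown earlier in the review, but the prefix value and the JSON field names are assumptions for illustration, not the PR's actual format:

```python
import json

# Property-key constants mirroring the review snippet; the literal
# prefix value and the JSON layout below are hypothetical.
SPARK_SQL_PREFIX = "spark.sql."
TABLE_CONSTRAINT_PREFIX = SPARK_SQL_PREFIX + "constraint."
TABLE_CONSTRAINT_FOREIGNKEY_PREFIX = TABLE_CONSTRAINT_PREFIX + "fk."

def fk_property(name, columns, ref_table, ref_columns):
    """Serialize one informational FK constraint as a (key, json-string) pair."""
    key = TABLE_CONSTRAINT_FOREIGNKEY_PREFIX + name
    value = json.dumps({
        "name": name,
        "columns": columns,
        "referencedTable": ref_table,
        "referencedColumns": ref_columns,
    })
    return key, value

key, value = fk_property("order_items_fk", ["order_id"], "orders", ["order_id"])
print(key)  # spark.sql.constraint.fk.order_items_fk
```

Because each constraint is just a string-valued table property, reading it back only requires listing properties with the `fk.` prefix and parsing the JSON, which is why this works against any Hive metastore version.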
One of the advantages of storing constraints in the table properties is that this functionality will work in all the supported Hive metastore versions.
An alternative approach we considered was to store the constraint information using the Hive metastore API, which keeps constraints in a separate table. The problem with this approach is that the feature would only work in Spark installations that use a Hive 2.1 metastore, and that version is NOT the current Spark default. More details are in the spec document.
This PR implements the ALTER TABLE constraint DDL using table properties because it is important that it work with Spark's default Hive metastore version.
The syntax to define the constraints as part of create table definition will be implemented in a follow-up Jira.
How was this patch tested?
Added new unit test cases to HiveDDLSuite, and SparkSqlParserSuite