[SPARK-21784][SQL] Adds support for defining informational primary key and foreign key constraints using ALTER TABLE DDL. #18994

Conversation

sureshthalamati
Contributor

What changes were proposed in this pull request?

This PR implements ALTER TABLE DDL ADD CONSTRAINT to add informational primary key and foreign key (referential integrity) constraints in Spark. These constraints will be used in query optimization; more details are in the spec attached to SPARK-19842.

The proposed syntax of the constraint DDL is similar to the Hive 2.1 referential integrity constraint support (https://issues.apache.org/jira/browse/HIVE-13076), which is aligned with Oracle's semantics.

Syntax:

ALTER TABLE [db_name.]table_name ADD [CONSTRAINT constraintName]
  (PRIMARY KEY (col_names) |
  FOREIGN KEY (col_names) REFERENCES [db_name.]table_name [(col_names)])
  [VALIDATE | NOVALIDATE] [RELY | NORELY]

Examples:

ALTER TABLE employee ADD CONSTRAINT pk PRIMARY KEY(empno) VALIDATE RELY
ALTER TABLE department ADD CONSTRAINT emp_fk FOREIGN KEY (mgrno) REFERENCES employee(empno) NOVALIDATE NORELY
ALTER TABLE department ADD PRIMARY KEY(deptno) VALIDATE RELY
ALTER TABLE employee ADD FOREIGN KEY (workdept) REFERENCES department(deptno) VALIDATE RELY;
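
As a usage sketch only (assuming the parser and command changes in this PR are in place, and that the employee and department tables already exist), the DDL above could be issued from a SparkSession:

import org.apache.spark.sql.SparkSession

object AddConstraintExample {
  def main(args: Array[String]): Unit = {
    // Hive support is enabled because the constraints are persisted via the Hive metastore.
    val spark = SparkSession.builder()
      .appName("informational-constraints-demo")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("ALTER TABLE employee ADD CONSTRAINT pk PRIMARY KEY (empno) VALIDATE RELY")
    spark.sql(
      "ALTER TABLE department ADD CONSTRAINT emp_fk FOREIGN KEY (mgrno) " +
        "REFERENCES employee(empno) NOVALIDATE NORELY")

    spark.stop()
  }
}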

The constraint information is stored in the table properties as a JSON string per constraint.
One advantage of storing constraints in the table properties is that this functionality works with all supported Hive metastore versions.
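
As a rough illustration (the property key names follow the spark.sql.constraint prefixes that appear later in this review; the exact JSON layout here is an assumption, not necessarily the PR's actual format), a primary key constraint could end up in the catalog table properties roughly like this:

object ConstraintPropertySketch {
  // One JSON string per constraint, keyed under a spark.sql.constraint.* table property.
  val pkJson: String =
    """{"constraintName":"pk","keyColumns":["empno"],"validate":true,"rely":true}"""

  val tableProperties: Map[String, String] = Map(
    "spark.sql.constraint.pk" -> pkJson,
    "spark.sql.numFkConstraints" -> "0"
  )
}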

An alternative approach we considered was to store the constraint information using the Hive metastore API, which keeps constraints in a separate table. The problem with that approach is that the feature would only work in Spark installations that use the Hive 2.1 metastore, which is NOT the current Spark default. More details are in the spec document.

This PR implements the ALTER TABLE constraint DDL using table properties because it is important for the feature to work with Spark's default Hive metastore version.

The syntax to define constraints as part of the CREATE TABLE definition will be implemented in a follow-up JIRA.

How was this patch tested?

Added new unit test cases to HiveDDLSuite and SparkSqlParserSuite.

@sureshthalamati sureshthalamati changed the title [SPARK-21784][SQL] Adds support for defining information primary key and foreign key constraints using ALTER TABLE DDL. [SPARK-21784][SQL] Adds support for defining informational primary key and foreign key constraints using ALTER TABLE DDL. Aug 18, 2017
@SparkQA

SparkQA commented Aug 18, 2017

Test build #80851 has finished for PR 18994 at commit 4839e84.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class TableConstraints(
  • sealed trait TableConstraint
  • case class PrimaryKey(
  • case class ForeignKey(
  • case class AlterTableAddConstraintCommand(

@sureshthalamati
Contributor Author

retest this please

@SparkQA

SparkQA commented Aug 18, 2017

Test build #80855 has finished for PR 18994 at commit 4839e84.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class TableConstraints(
  • sealed trait TableConstraint
  • case class PrimaryKey(
  • case class ForeignKey(
  • case class AlterTableAddConstraintCommand(

@gatorsmile
Member

gatorsmile commented Aug 19, 2017

Could you evaluate the impact of the other DDL commands on the constraints? For example, rename.

@sureshthalamati
Contributor Author

Sure. DDL that changes the table name, column name, or data type of the referenced primary key will affect foreign key definitions. I will check the Spark DDL that does schema changes and get back to you.

.map(findColumnByName(table.dataSchema, _, resolver))
// Constraints are only supported for basic sql types, throw error for any other data types.
keyColFields.map(_.dataType).foreach {
case ByteType | ShortType | IntegerType | LongType | FloatType |
Member

BinaryType?

Contributor Author

Thanks for the review, @viirya. I overlooked the binary type; I will add it.
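
A minimal sketch of what the updated check could look like with BinaryType included (the style follows the diff above; the exact set of supported types is an assumption):

import org.apache.spark.sql.types._

object ConstraintTypeCheck {
  // Constraints are only supported for basic SQL types; reject any other data type.
  def assertSupportedKeyColumnType(dataType: DataType): Unit = dataType match {
    case ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType |
         _: DecimalType | StringType | BinaryType | BooleanType |
         DateType | TimestampType => // supported key column types
    case other =>
      throw new UnsupportedOperationException(
        s"Constraints are not supported on columns of type ${other.simpleString}")
  }
}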

val TABLE_CONSTRAINT_PREFIX = SPARK_SQL_PREFIX + "constraint."
val TABLE_CONSTRAINT_PRIMARY_KEY = SPARK_SQL_PREFIX + TABLE_CONSTRAINT_PREFIX + "pk"
val TABLE_NUM_FK_CONSTRAINTS = SPARK_SQL_PREFIX + "numFkConstraints"
val TABLE_CONSTRAINT_FOREIGNKEY_PREFIX = SPARK_SQL_PREFIX + TABLE_CONSTRAINT_PREFIX + "fk."
Member

SPARK_SQL_PREFIX is duplicated in TABLE_CONSTRAINT_PRIMARY_KEY and TABLE_CONSTRAINT_FOREIGNKEY_PREFIX.

E.g., TABLE_CONSTRAINT_PRIMARY_KEY is SPARK_SQL_PREFIX + SPARK_SQL_PREFIX + "constraint.".

Contributor Author

Good catch. I will fix it.
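
The fix would presumably just drop the extra prefix, since TABLE_CONSTRAINT_PREFIX already starts with SPARK_SQL_PREFIX; roughly (the concrete prefix value below is assumed for illustration):

object ConstraintPropertyKeys {
  val SPARK_SQL_PREFIX = "spark.sql."  // assumed value, for illustration only
  val TABLE_CONSTRAINT_PREFIX = SPARK_SQL_PREFIX + "constraint."
  val TABLE_CONSTRAINT_PRIMARY_KEY = TABLE_CONSTRAINT_PREFIX + "pk"
  val TABLE_NUM_FK_CONSTRAINTS = SPARK_SQL_PREFIX + "numFkConstraints"
  val TABLE_CONSTRAINT_FOREIGNKEY_PREFIX = TABLE_CONSTRAINT_PREFIX + "fk."
}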

@sureshthalamati
Contributor Author

Thank you very much for reviewing, @gatorsmile.
Scanning through the currently supported DDL syntax for non-partition columns, I think the following DDL statements will impact informational constraints:

ALTER STATEMENTS

ALTER TABLE name RENAME TO new_name
ALTER TABLE name CHANGE column_name new_name new_type

Spark SQL can raise an error if informational constraints are defined on the affected columns, and require the user to drop the constraints before proceeding with the DDL (a sketch of such a guard follows below). In the future we can enhance the affected DDLs to automatically fix up the constraint definition when possible instead of raising an error.
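
A hedged sketch of that kind of guard (the helper name, property prefix, and error message are hypothetical; the real command would raise an AnalysisException):

import org.apache.spark.sql.catalyst.catalog.CatalogTable

object ConstraintGuards {
  // Fail fast if the table carries informational constraints, asking the user
  // to drop them before running a rename or column-changing DDL.
  def assertNoConstraints(table: CatalogTable, operation: String): Unit = {
    val hasConstraints =
      table.properties.keys.exists(_.startsWith("spark.sql.constraint."))
    if (hasConstraints) {
      throw new UnsupportedOperationException(
        s"Cannot $operation table ${table.identifier}: drop its informational constraints first")
    }
  }
}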

When Spark adds support for DROP/REPLACE of columns, those statements will also impact informational constraints.

ALTER TABLE name DROP [COLUMN] column_name
ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])

DROP TABLE

DROP TABLE name

Hive drops the referential constraints automatically. Oracle requires the user to specify the [CASCADE CONSTRAINTS] clause to automatically drop the referential constraints, and otherwise raises an error. Should we stick to the Hive behavior?

Fixing the affected DDLs requires carrying additional dependency information as part of the stored primary key definition. Is it OK if I fix the affected DDLs in a separate PR?

@gatorsmile
Member

@sureshthalamati Sure. Please create sub-JIRAs for them.

@sureshthalamati
Contributor Author

Created SPARK-21823 and SPARK-21824 for fixing the DDLs that impact the informational constraints.

@SparkQA

SparkQA commented Aug 25, 2017

Test build #81123 has finished for PR 18994 at commit f1f6d35.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 21, 2017

Test build #82019 has finished for PR 18994 at commit ea39601.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sureshthalamati
Contributor Author

retest this please

@SparkQA

SparkQA commented Sep 21, 2017

Test build #82039 has finished for PR 18994 at commit ea39601.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sureshthalamati
Contributor Author

ping @gatorsmile @cloud-fan @rxin

@gatorsmile
Member

Before we review the DDL changes, we need to see the PRs that would benefit from this.

@sureshthalamati
Contributor Author

Thank you for the input @gatorsmile

@ioana-delaney
Contributor

@sureshthalamati Hi Suresh, we are planning to proceed with the performance improvements. Will you be able to continue working on this PR? Thanks.

…ow users define informational primary key and foreign key constraints on a table.
@sureshthalamati
Contributor Author

@ioana-delaney Thank you for pinging me. I would like to complete this PR. My responses might be slow due to other commitments at my workplace; if I am blocking you, please feel free to take over the PR.

@SparkQA

SparkQA commented Mar 21, 2018

Test build #88456 has finished for PR 18994 at commit c126122.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@windpiger
Contributor

I think constraint support should be designed together with DataSource v2 and can do more than SPARK-19842.

Constraints can be used for:

  1. Data integrity (not included in SPARK-19842).
  2. Query rewrites in the optimizer to gain performance (not just PK/FK; unique/not null is also useful).

For data integrity, we have two scenarios:
1.1 The DataSource natively supports data integrity, such as MySQL/Oracle and so on.
Spark should only call the read/write API of the DataSource and do nothing about data integrity.
1.2 The DataSource does not support data integrity, such as CSV/JSON/Parquet and so on.
Spark can provide data integrity for such a DataSource like Hive does (maybe with a switch to turn it off), and we can discuss which kinds of constraints to support.
For example, Hive supports PK/FK/UNIQUE (DISABLE RELY)/NOT NULL/DEFAULT; the NOT NULL ENFORCE check is implemented by adding an extra UDF, GenericUDFEnforceNotNullConstraint, to the plan (HIVE-16605).

For optimizer query rewrites:
2.1 We can add constraint information to the CatalogTable returned by the catalog.getTable API. The optimizer can then use it for query rewrites.
2.2 If we cannot get constraint information, we can use SQL hints.

In summary, we can bring the constraint feature into the DataSource v2 design:
a) To support 2.1, we can add constraint information to the createTable/alterTable/getTable APIs in this SPIP (https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#).
b) To support data integrity, we can add a ConstraintSupport mix-in for DataSource v2 (see the sketch after this list):
if a DataSource supports constraints, Spark does nothing extra when inserting data;
if a DataSource does not support constraints but still wants constraint checks, Spark should do the constraint check like Hive (e.g., for NOT NULL, Hive adds an extra UDF, GenericUDFEnforceNotNullConstraint, to the plan);
if a DataSource does not support constraints and does not want constraint checks, Spark does nothing.
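
A rough sketch of what such a mix-in could look like (the trait and case class names are illustrative only, not part of the DataSource v2 SPIP):

// Hypothetical DataSource v2 mix-in; names are illustrative only.
sealed trait TableConstraint
case class PrimaryKey(columns: Seq[String]) extends TableConstraint
case class ForeignKey(columns: Seq[String], referencedTable: String,
    referencedColumns: Seq[String]) extends TableConstraint
case class NotNull(column: String) extends TableConstraint

trait ConstraintSupport {
  // True if the source enforces the constraint natively (e.g. MySQL/Oracle),
  // so Spark can skip its own check on insert.
  def enforcesNatively(constraint: TableConstraint): Boolean

  // True if the source wants Spark to perform the check on its behalf
  // (e.g. CSV/JSON/Parquet), similar to Hive's GenericUDFEnforceNotNullConstraint.
  def wantsSparkSideCheck(constraint: TableConstraint): Boolean
}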

The Hive catalog supports constraints, so we can implement this logic in the createTable/alterTable APIs. Then we can use Spark SQL DDL to create a table with constraints, which are stored in the Hive metastore through the Hive catalog API.
For example: CREATE TABLE t(a STRING, b STRING NOT NULL DISABLE, CONSTRAINT pk1 PRIMARY KEY (a) DISABLE) USING parquet;

As for how to store constraints: because Hive 2.1 provides a constraint API in Hive.java, we can call it directly in the createTable/alterTable APIs of the Hive catalog. There is no need for Spark to store the constraint information in table properties. There are some concerns about using the Hive 2.1 catalog API directly in the docs (https://docs.google.com/document/d/17r-cOqbKF7Px0xb9L7krKg2-RQB_gD2pxOmklm-ehsw/edit#heading=h.lnxbz9), such as Spark's built-in Hive being 1.2.1, but upgrading Hive to 2.3.4 is in progress (SPARK-23710).

@cloud-fan @gatorsmile @sureshthalamati @ioana-delaney

@h-vetinari
Contributor

Any update on this? Would be awesome to have in Spark 3.0!

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Feb 25, 2020
@github-actions github-actions bot closed this Feb 26, 2020