[SPARK-21784][SQL] Adds support for defining informational primary key and foreign key constraints using ALTER TABLE DDL. #18994

Conversation

sureshthalamati
Contributor

What changes were proposed in this pull request?

This PR implements ALTER TABLE DDL ADD CONSTRAINT to add informational primary key and foreign key (referential integrity) constraints in Spark. These constraints will be used in query optimization; more details are in the spec attached to SPARK-19842.

The proposed syntax of the constraint DDL is similar to the Hive 2.1 referential integrity constraint support (https://issues.apache.org/jira/browse/HIVE-13076), which is aligned with Oracle's semantics.

Syntax:

ALTER TABLE [db_name.]table_name ADD [CONSTRAINT constraintName]
  (PRIMARY KEY (col_names) |
  FOREIGN KEY (col_names) REFERENCES [db_name.]table_name [(col_names)])
  [VALIDATE | NOVALIDATE] [RELY | NORELY]

Examples:

ALTER TABLE employee ADD CONSTRAINT pk PRIMARY KEY(empno) VALIDATE RELY
ALTER TABLE department ADD CONSTRAINT emp_fk FOREIGN KEY (mgrno) REFERENCES employee(empno) NOVALIDATE NORELY
ALTER TABLE department ADD PRIMARY KEY(deptno) VALIDATE RELY
ALTER TABLE employee ADD FOREIGN KEY (workdept) REFERENCES department(deptno) VALIDATE RELY;
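
As a usage sketch only (assuming the parser and command changes in this PR are in place, and that the employee and department tables already exist), the DDL above could be issued from a SparkSession:

import org.apache.spark.sql.SparkSession

object AddConstraintExample {
  def main(args: Array[String]): Unit = {
    // Hive support is enabled because the constraints are persisted via the Hive metastore.
    val spark = SparkSession.builder()
      .appName("informational-constraints-demo")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("ALTER TABLE employee ADD CONSTRAINT pk PRIMARY KEY (empno) VALIDATE RELY")
    spark.sql(
      "ALTER TABLE department ADD CONSTRAINT emp_fk FOREIGN KEY (mgrno) " +
        "REFERENCES employee(empno) NOVALIDATE NORELY")

    spark.stop()
  }
}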

The constraint information is stored in the table properties as a JSON string per constraint.
One advantage of storing constraints in the table properties is that this functionality works with all supported Hive metastore versions.
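
As a rough illustration (the property key names follow the spark.sql.constraint prefixes that appear later in this review; the exact JSON layout here is an assumption, not necessarily the PR's actual format), a primary key constraint could end up in the catalog table properties roughly like this:

object ConstraintPropertySketch {
  // One JSON string per constraint, keyed under a spark.sql.constraint.* table property.
  val pkJson: String =
    """{"constraintName":"pk","keyColumns":["empno"],"validate":true,"rely":true}"""

  val tableProperties: Map[String, String] = Map(
    "spark.sql.constraint.pk" -> pkJson,
    "spark.sql.numFkConstraints" -> "0"
  )
}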

An alternative approach we considered was to store the constraint information using the Hive metastore API, which keeps constraints in a separate table. The problem with that approach is that the feature would only work in Spark installations that use the Hive 2.1 metastore, which is NOT the current Spark default. More details are in the spec document.

This PR implements the ALTER TABLE constraint DDL using table properties because it is important for the feature to work with Spark's default Hive metastore version.

The syntax to define constraints as part of the CREATE TABLE definition will be implemented in a follow-up JIRA.

How was this patch tested?

Added new unit test cases to HiveDDLSuite and SparkSqlParserSuite.

@sureshthalamati sureshthalamati changed the title [SPARK-21784][SQL] Adds support for defining information primary key and foreign key constraints using ALTER TABLE DDL. [SPARK-21784][SQL] Adds support for defining informational primary key and foreign key constraints using ALTER TABLE DDL. Aug 18, 2017
@SparkQA

SparkQA commented Aug 18, 2017

Test build #80851 has finished for PR 18994 at commit 4839e84.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class TableConstraints(
  • sealed trait TableConstraint
  • case class PrimaryKey(
  • case class ForeignKey(
  • case class AlterTableAddConstraintCommand(

@sureshthalamati
Contributor Author

retest this please

@SparkQA

SparkQA commented Aug 18, 2017

Test build #80855 has finished for PR 18994 at commit 4839e84.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class TableConstraints(
  • sealed trait TableConstraint
  • case class PrimaryKey(
  • case class ForeignKey(
  • case class AlterTableAddConstraintCommand(

@gatorsmile
Member

gatorsmile commented Aug 19, 2017

Could you evaluate the impact of the other DDL commands on the constraints? For example, rename.

@sureshthalamati
Contributor Author

Sure. DDL that changes the table name, column name, or data type of the referenced primary key will affect foreign key definitions. I will check the Spark DDL that does schema changes and get back to you.

.map(findColumnByName(table.dataSchema, _, resolver))
// Constraints are only supported for basic sql types, throw error for any other data types.
keyColFields.map(_.dataType).foreach {
case ByteType | ShortType | IntegerType | LongType | FloatType |
Member

BinaryType?

Contributor Author

Thanks for the review, @viirya. I overlooked the binary type; I will add it.
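
A minimal sketch of what the updated check could look like with BinaryType included (the style follows the diff above; the exact set of supported types is an assumption):

import org.apache.spark.sql.types._

object ConstraintTypeCheck {
  // Constraints are only supported for basic SQL types; reject any other data type.
  def assertSupportedKeyColumnType(dataType: DataType): Unit = dataType match {
    case ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType |
         _: DecimalType | StringType | BinaryType | BooleanType |
         DateType | TimestampType => // supported key column types
    case other =>
      throw new UnsupportedOperationException(
        s"Constraints are not supported on columns of type ${other.simpleString}")
  }
}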

val TABLE_CONSTRAINT_PREFIX = SPARK_SQL_PREFIX + "constraint."
val TABLE_CONSTRAINT_PRIMARY_KEY = SPARK_SQL_PREFIX + TABLE_CONSTRAINT_PREFIX + "pk"
val TABLE_NUM_FK_CONSTRAINTS = SPARK_SQL_PREFIX + "numFkConstraints"
val TABLE_CONSTRAINT_FOREIGNKEY_PREFIX = SPARK_SQL_PREFIX + TABLE_CONSTRAINT_PREFIX + "fk."
Member

SPARK_SQL_PREFIX is duplicated in TABLE_CONSTRAINT_PRIMARY_KEY and TABLE_CONSTRAINT_FOREIGNKEY_PREFIX.

E.g., TABLE_CONSTRAINT_PRIMARY_KEY is SPARK_SQL_PREFIX + SPARK_SQL_PREFIX + "constraint.".

Contributor Author

Good catch. I will fix it.
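
The fix would presumably just drop the extra prefix, since TABLE_CONSTRAINT_PREFIX already starts with SPARK_SQL_PREFIX; roughly (the concrete prefix value below is assumed for illustration):

object ConstraintPropertyKeys {
  val SPARK_SQL_PREFIX = "spark.sql."  // assumed value, for illustration only
  val TABLE_CONSTRAINT_PREFIX = SPARK_SQL_PREFIX + "constraint."
  val TABLE_CONSTRAINT_PRIMARY_KEY = TABLE_CONSTRAINT_PREFIX + "pk"
  val TABLE_NUM_FK_CONSTRAINTS = SPARK_SQL_PREFIX + "numFkConstraints"
  val TABLE_CONSTRAINT_FOREIGNKEY_PREFIX = TABLE_CONSTRAINT_PREFIX + "fk."
}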

@sureshthalamati
Contributor Author

Thank you very much for reviewing, @gatorsmile.
Scanning through the currently supported DDL syntax for non-partition columns, I think the following DDL statements will impact informational constraints:

ALTER STATEMENTS

ALTER TABLE name RENAME TO new_name
ALTER TABLE name CHANGE column_name new_name new_type

Spark SQL can raise an error if informational constraints are defined on the affected columns, and require the user to drop the constraints before proceeding with the DDL (a sketch of such a guard follows below). In the future we can enhance the affected DDLs to automatically fix up the constraint definition when possible instead of raising an error.
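
A hedged sketch of that kind of guard (the helper name, property prefix, and error message are hypothetical; the real command would raise an AnalysisException):

import org.apache.spark.sql.catalyst.catalog.CatalogTable

object ConstraintGuards {
  // Fail fast if the table carries informational constraints, asking the user
  // to drop them before running a rename or column-changing DDL.
  def assertNoConstraints(table: CatalogTable, operation: String): Unit = {
    val hasConstraints =
      table.properties.keys.exists(_.startsWith("spark.sql.constraint."))
    if (hasConstraints) {
      throw new UnsupportedOperationException(
        s"Cannot $operation table ${table.identifier}: drop its informational constraints first")
    }
  }
}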

When Spark adds support for DROP/REPLACE of columns, those statements will also impact informational constraints.

ALTER TABLE name DROP [COLUMN] column_name
ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])

DROP TABLE

DROP TABLE name

Hive drops the referential constraints automatically. Oracle requires the user to specify the [CASCADE CONSTRAINTS] clause to automatically drop the referential constraints, and otherwise raises an error. Should we stick to the Hive behavior?

Fixing the affected DDLs requires carrying additional dependency information as part of the stored primary key definition. Is it OK if I fix the affected DDLs in a separate PR?

@gatorsmile
Member

@sureshthalamati Sure. Please create sub-JIRAs for them.

@sureshthalamati
Contributor Author

Created SPARK-21823 and SPARK-21824 for fixing the DDLs that impact the informational constraints.

@SparkQA

SparkQA commented Aug 25, 2017

Test build #81123 has finished for PR 18994 at commit f1f6d35.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 21, 2017

Test build #82019 has finished for PR 18994 at commit ea39601.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sureshthalamati
Contributor Author

retest this please

@SparkQA

SparkQA commented Sep 21, 2017

Test build #82039 has finished for PR 18994 at commit ea39601.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sureshthalamati
Contributor Author

ping @gatorsmile @cloud-fan @rxin

@gatorsmile
Member

Before we review the DDL changes, we need to see the PRs that would benefit from this.

@sureshthalamati
Contributor Author

Thank you for the input @gatorsmile

@ioana-delaney
Contributor

@sureshthalamati Hi Suresh, we are planning to proceed with the performance improvements. Will you be able to continue working on this PR? Thanks.

…ow users define informational primary key and foreign key constraints on a table.
@sureshthalamati
Contributor Author

@ioana-delaney Thank you for pinging me. I would like to complete this PR. My responses might be slow due to other commitments at my workplace; if I am blocking you, please feel free to take over the PR.

@SparkQA

SparkQA commented Mar 21, 2018

Test build #88456 has finished for PR 18994 at commit c126122.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@windpiger
Contributor

I think constraint support should be designed together with DataSource v2 and can do more than SPARK-19842.

Constraints can be used for:

  1. Data integrity (not included in SPARK-19842).
  2. Query rewrites in the optimizer to gain performance (not just PK/FK; unique/not null is also useful).

For data integrity, we have two scenarios:
1.1 The DataSource natively supports data integrity, such as MySQL/Oracle and so on.
Spark should only call the read/write API of the DataSource and do nothing about data integrity.
1.2 The DataSource does not support data integrity, such as CSV/JSON/Parquet and so on.
Spark can provide data integrity for such a DataSource like Hive does (maybe with a switch to turn it off), and we can discuss which kinds of constraints to support.
For example, Hive supports PK/FK/UNIQUE (DISABLE RELY)/NOT NULL/DEFAULT; the NOT NULL ENFORCE check is implemented by adding an extra UDF, GenericUDFEnforceNotNullConstraint, to the plan (HIVE-16605).

For optimizer query rewrites:
2.1 We can add constraint information to the CatalogTable returned by the catalog.getTable API. The optimizer can then use it for query rewrites.
2.2 If we cannot get constraint information, we can use SQL hints.

In summary, we can bring the constraint feature into the DataSource v2 design:
a) To support 2.1, we can add constraint information to the createTable/alterTable/getTable APIs in this SPIP (https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#).
b) To support data integrity, we can add a ConstraintSupport mix-in for DataSource v2 (see the sketch after this list):
if a DataSource supports constraints, Spark does nothing extra when inserting data;
if a DataSource does not support constraints but still wants constraint checks, Spark should do the constraint check like Hive (e.g., for NOT NULL, Hive adds an extra UDF, GenericUDFEnforceNotNullConstraint, to the plan);
if a DataSource does not support constraints and does not want constraint checks, Spark does nothing.
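
A rough sketch of what such a mix-in could look like (the trait and case class names are illustrative only, not part of the DataSource v2 SPIP):

// Hypothetical DataSource v2 mix-in; names are illustrative only.
sealed trait TableConstraint
case class PrimaryKey(columns: Seq[String]) extends TableConstraint
case class ForeignKey(columns: Seq[String], referencedTable: String,
    referencedColumns: Seq[String]) extends TableConstraint
case class NotNull(column: String) extends TableConstraint

trait ConstraintSupport {
  // True if the source enforces the constraint natively (e.g. MySQL/Oracle),
  // so Spark can skip its own check on insert.
  def enforcesNatively(constraint: TableConstraint): Boolean

  // True if the source wants Spark to perform the check on its behalf
  // (e.g. CSV/JSON/Parquet), similar to Hive's GenericUDFEnforceNotNullConstraint.
  def wantsSparkSideCheck(constraint: TableConstraint): Boolean
}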

The Hive catalog supports constraints, so we can implement this logic in the createTable/alterTable APIs. Then we can use Spark SQL DDL to create a table with constraints, which are stored in the Hive metastore through the Hive catalog API.
For example: CREATE TABLE t(a STRING, b STRING NOT NULL DISABLE, CONSTRAINT pk1 PRIMARY KEY (a) DISABLE) USING parquet;

As for how to store constraints: because Hive 2.1 provides a constraint API in Hive.java, we can call it directly in the createTable/alterTable APIs of the Hive catalog. There is no need for Spark to store the constraint information in table properties. There are some concerns about using the Hive 2.1 catalog API directly in the docs (https://docs.google.com/document/d/17r-cOqbKF7Px0xb9L7krKg2-RQB_gD2pxOmklm-ehsw/edit#heading=h.lnxbz9), such as Spark's built-in Hive being 1.2.1, but upgrading Hive to 2.3.4 is in progress (SPARK-23710).

@cloud-fan @gatorsmile @sureshthalamati @ioana-delaney

@h-vetinari
Contributor

Any update on this? Would be awesome to have in Spark 3.0!

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Feb 25, 2020
@github-actions github-actions bot closed this Feb 26, 2020