
[SPARK-40000][SQL] Update INSERTs without user-specified fields to not automatically add default values #37430

Closed · wants to merge 26 commits

Conversation


@dtenedor (Contributor) commented Aug 7, 2022

What changes were proposed in this pull request?

Update INSERTs without user-specified fields to not automatically add default values.

For example, with the new behavior, this INSERT INTO command will fail with an error message reporting that the table has two columns but the command only inserted one:

CREATE TABLE t (a INT DEFAULT 1, b INT DEFAULT 2) USING PARQUET;
INSERT INTO t VALUES (42);
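
Under the new behavior, the failing insert above can be fixed by supplying a value for every column. A minimal sketch (the explicit DEFAULT keyword in the second variant is part of the broader column-default feature and is an assumption here):

INSERT INTO t VALUES (42, 2);       -- explicit value for every column
INSERT INTO t VALUES (42, DEFAULT); -- assumes explicit DEFAULT references in VALUES are supported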

For INSERTs with user-specified fields, these commands may now specify fewer field/value pairs than the number of columns in the target table. The analyzer will assign the default value for each remaining column (either NULL, or the explicit DEFAULT value assigned to the column by an earlier command).

For example, with the new behavior, this INSERT INTO command will succeed, inserting the new row (42, 2) into the target table:

CREATE TABLE t (a INT DEFAULT 1, b INT DEFAULT 2) USING PARQUET;
INSERT INTO t (a) VALUES (42);

To implement this behavior, this PR adds the following config, which is true by default:

spark.sql.defaultColumn.addMissingValuesForInsertsWithExplicitColumns

To switch back to the previous behavior of returning errors for INSERT INTO commands with fewer user-specified fields than the number of columns in the target table, set this new config to false.
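
For example, a minimal sketch of switching back in a Spark SQL session (assuming the config is settable at session scope, as is typical for spark.sql.* configs):

SET spark.sql.defaultColumn.addMissingValuesForInsertsWithExplicitColumns=false;
CREATE TABLE t (a INT DEFAULT 1, b INT DEFAULT 2) USING PARQUET;
INSERT INTO t (a) VALUES (42);  -- now fails: fewer user-specified fields than table columns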

Why are the changes needed?

After reviewing the desired SQL semantics, we prefer to be strict and require that the number of inserted columns exactly match the number of columns in the target table, to guard against accidental mistakes.

Does this PR introduce any user-facing change?

Yes, see above.

How was this patch tested?

Updated unit test coverage.

github-actions bot added the SQL label Aug 7, 2022
@dtenedor (Contributor, Author) commented Aug 7, 2022

Hi @gengliangwang, this change will be helpful for letting SQL engines toggle whether to return an error when INSERT commands without user-specified fields include fewer columns than are present in the target table.

@AmplabJenkins commented

Can one of the admins verify this patch?

@dtenedor (Contributor, Author) left a comment

Thanks for your review!

@dtenedor (Contributor, Author) commented

Hi @gengliangwang, thanks for your review, this is ready again :)

@gengliangwang (Member) commented

Had an offline discussion with @dtenedor. Let's make it simple and strict for now: throw an exception if a table insert contains fewer columns than the table and there is no insert column list.
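
In SQL terms, the agreed rule looks like this (a sketch reusing the example table from the PR description):

CREATE TABLE t (a INT DEFAULT 1, b INT DEFAULT 2) USING PARQUET;
INSERT INTO t VALUES (42);      -- throws: no insert column list and fewer values than table columns
INSERT INTO t (a) VALUES (42);  -- succeeds: explicit column list; b is assigned its DEFAULT (2)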

@dtenedor dtenedor changed the title [SPARK-40000][SQL] Add config to toggle whether to automatically add default values for INSERTs without user-specified fields [SPARK-40000][SQL] Update INSERTs without user-specified fields to not automatically add default values Aug 12, 2022
@dtenedor (Contributor, Author) left a comment

Thanks for your review! The unit tests had to change a bit, but the behavior is simpler and more consistent now.

@dtenedor (Contributor, Author) commented

Hi @gengliangwang, I have responded to the comments; this is ready for another round of review :)

@dtenedor (Contributor, Author) commented

Hi @gengliangwang, per our offline discussion, I have added the new flag spark.sql.defaultColumn.addMissingValuesForInsertsWithExplicitColumns to toggle whether commands like INSERT INTO t (a, b) VALUES ... succeed or fail when the number of inserted columns is less than the number of columns in the target table.

@gengliangwang (Member) commented Aug 13, 2022

@dtenedor let's update the migration guide https://github.com/apache/spark/blob/master/docs/sql-migration-guide.md since this changes the default behavior.

Also, please update the PR description as per the latest code changes.

@gengliangwang (Member) commented

@dtenedor FYI there is a conflict with the master branch now.

@dtenedor (Contributor, Author) commented

OK @gengliangwang all the conflicts should be resolved now. 👍

github-actions bot added the DOCS label Aug 15, 2022
dtenedor and others added 5 commits August 15, 2022 12:08

…ysis/ResolveDefaultColumns.scala (Co-authored-by: Gengliang Wang <ltnwgl@gmail.com>)
…ysis/ResolveDefaultColumns.scala (Co-authored-by: Gengliang Wang <ltnwgl@gmail.com>)
…ysis/ResolveDefaultColumns.scala (Co-authored-by: Gengliang Wang <ltnwgl@gmail.com>)
…ysis/ResolveDefaultColumns.scala (Co-authored-by: Gengliang Wang <ltnwgl@gmail.com>)
@dtenedor (Contributor, Author) commented

Thanks @gengliangwang for the reviews; I applied the suggestions and updated the PR description 👍

@gengliangwang (Member) commented

Thanks, merging to master

@dongjoon-hyun (Member) commented

This was reverted via 50c1635

@gengliangwang (Member) commented

@dongjoon-hyun Thanks for the note!
