
[SPARK-40000][SQL] Update INSERTs without user-specified fields to not automatically add default values #37430

Closed · wants to merge 26 commits

Conversation


@dtenedor (Contributor) commented Aug 7, 2022

What changes were proposed in this pull request?

Update INSERTs without user-specified fields to not automatically add default values.

For example, with the new behavior, this INSERT INTO command will fail with an error message reporting that the table has two columns but the command only inserted one:

CREATE TABLE t (a INT DEFAULT 1, b INT DEFAULT 2) USING PARQUET;
INSERT INTO t VALUES (42);
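
Under the new behavior, the failing insert above can be fixed by supplying a value for every column. A minimal sketch (the explicit DEFAULT keyword in the second variant is part of the broader column-default feature and is an assumption here):

INSERT INTO t VALUES (42, 2);       -- explicit value for every column
INSERT INTO t VALUES (42, DEFAULT); -- assumes explicit DEFAULT references in VALUES are supported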

For INSERTs with user-specified fields, these commands may now specify fewer field/value pairs than the number of columns in the target table. The analyzer will assign the default value for each remaining column (either NULL, or the explicit DEFAULT value assigned to the column by an earlier command).

For example, with the new behavior, this INSERT INTO command will succeed, inserting the new row (42, 2) into the target table:

CREATE TABLE t (a INT DEFAULT 1, b INT DEFAULT 2) USING PARQUET;
INSERT INTO t (a) VALUES (42);

To implement this behavior, this PR adds the following config, which is true by default:

spark.sql.defaultColumn.addMissingValuesForInsertsWithExplicitColumns

To switch back to the previous behavior of returning errors for INSERT INTO commands with fewer user-specified fields than the number of columns in the target table, set this new config to false.
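
For example, a minimal sketch of switching back in a Spark SQL session (assuming the config is settable at session scope, as is typical for spark.sql.* configs):

SET spark.sql.defaultColumn.addMissingValuesForInsertsWithExplicitColumns=false;
CREATE TABLE t (a INT DEFAULT 1, b INT DEFAULT 2) USING PARQUET;
INSERT INTO t (a) VALUES (42);  -- now fails: fewer user-specified fields than table columns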

Why are the changes needed?

After reviewing the desired SQL semantics, we prefer to be strict and require that the number of inserted columns exactly match the number of columns in the target table, to guard against accidental mistakes.

Does this PR introduce any user-facing change?

Yes, see above.

How was this patch tested?

Updated unit test coverage.

github-actions bot added the SQL label Aug 7, 2022
@dtenedor (Contributor, Author) commented Aug 7, 2022

Hi @gengliangwang, this change will be helpful for letting SQL engines toggle whether to return an error when INSERT commands without user-specified fields include fewer columns than are present in the target table.

@AmplabJenkins commented

Can one of the admins verify this patch?

@dtenedor (Contributor, Author) left a comment

Thanks for your review!

@dtenedor (Contributor, Author) commented

Hi @gengliangwang, thanks for your review, this is ready again :)

@gengliangwang (Member) commented

Had an offline discussion with @dtenedor. Let's make it simple and strict for now: throw an exception if a table insert contains fewer columns than the table and there is no insert column list.
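
In SQL terms, the agreed rule looks like this (a sketch reusing the example table from the PR description):

CREATE TABLE t (a INT DEFAULT 1, b INT DEFAULT 2) USING PARQUET;
INSERT INTO t VALUES (42);      -- throws: no insert column list and fewer values than table columns
INSERT INTO t (a) VALUES (42);  -- succeeds: explicit column list; b is assigned its DEFAULT (2)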

@dtenedor dtenedor changed the title [SPARK-40000][SQL] Add config to toggle whether to automatically add default values for INSERTs without user-specified fields [SPARK-40000][SQL] Update INSERTs without user-specified fields to not automatically add default values Aug 12, 2022
@dtenedor (Contributor, Author) left a comment

Thanks for your review! The unit tests had to change a bit, but the behavior is simpler and more consistent now.

@dtenedor (Contributor, Author) commented

Hi @gengliangwang, I have responded to the comments; this is ready for another round of review :)

@dtenedor (Contributor, Author) commented

Hi @gengliangwang, per our offline discussion, I have added the new flag spark.sql.defaultColumn.addMissingValuesForInsertsWithExplicitColumns to toggle whether commands like INSERT INTO t (a, b) VALUES ... succeed or fail when the number of inserted columns is less than the number of columns in the target table.

@gengliangwang (Member) commented Aug 13, 2022

@dtenedor let's update the migration guide https://github.com/apache/spark/blob/master/docs/sql-migration-guide.md since this changes the default behavior.

Also, please update the PR description as per the latest code changes.

@gengliangwang (Member) commented

@dtenedor FYI there is a conflict with the master branch now.

@dtenedor (Contributor, Author) commented

OK @gengliangwang all the conflicts should be resolved now. 👍

github-actions bot added the DOCS label Aug 15, 2022
dtenedor and others added 5 commits August 15, 2022 12:08

…ysis/ResolveDefaultColumns.scala (Co-authored-by: Gengliang Wang <ltnwgl@gmail.com>)
…ysis/ResolveDefaultColumns.scala (Co-authored-by: Gengliang Wang <ltnwgl@gmail.com>)
…ysis/ResolveDefaultColumns.scala (Co-authored-by: Gengliang Wang <ltnwgl@gmail.com>)
…ysis/ResolveDefaultColumns.scala (Co-authored-by: Gengliang Wang <ltnwgl@gmail.com>)
@dtenedor (Contributor, Author) commented

Thanks @gengliangwang for the reviews; I applied the suggestions and updated the PR description 👍

@gengliangwang (Member) commented

Thanks, merging to master

@dongjoon-hyun (Member) commented

This was reverted via 50c1635

@gengliangwang (Member) commented

@dongjoon-hyun Thanks for the note!
