
[SPARK-34720][SQL] MERGE ... UPDATE/INSERT * should do by-name resolution #32192

Closed

Conversation

cloud-fan
Contributor

What changes were proposed in this pull request?

In Spark, we have an extension to the MERGE syntax: INSERT/UPDATE *. This is not from the ANSI standard or any other mainstream database, so we need to define the behavior on our own.

The behavior today is very weird: assume the source table has `n1` columns and the target table has `n2` columns. We generate the assignments by taking the first `min(n1, n2)` columns from the source and target tables and pairing them by ordinal.

This PR proposes a more reasonable behavior: take all the columns of the target table as keys, and find the corresponding columns of the source table by name as values.
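The difference can be sketched in plain Scala. This is an illustrative sketch only, not Spark's analyzer code: `byOrdinal`, `byName`, and the string-pair "assignments" are simplified stand-ins.

```scala
object MergeStarSketch {
  // Old behavior: take the first min(n1, n2) columns from source and target
  // and pair them by ordinal. zip truncates to the shorter of the two lists.
  def byOrdinal(target: Seq[String], source: Seq[String]): Seq[(String, String)] =
    target.zip(source)

  // Proposed behavior: every target column is a key; look up the matching
  // source column by name (case-insensitive here for simplicity) and fail
  // when no match exists.
  def byName(target: Seq[String], source: Seq[String]): Seq[(String, String)] =
    target.map { t =>
      source.find(_.equalsIgnoreCase(t)) match {
        case Some(s) => t -> s
        case None    => throw new IllegalArgumentException(s"cannot resolve column $t in source")
      }
    }

  def main(args: Array[String]): Unit = {
    val target = Seq("id", "name", "city")
    val source = Seq("name", "id", "city")
    println(byOrdinal(target, source)) // List((id,name), (name,id), (city,city))
    println(byName(target, source))    // List((id,id), (name,name), (city,city))
  }
}
```

With the same column sets in a different order, the old by-ordinal pairing silently assigns `name` to `id` and vice versa, while by-name pairing produces the intuitive result and keeps working after schema evolution adds or reorders columns.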

Why are the changes needed?

Fix MERGE INSERT/UPDATE * to be more user-friendly and to make schema evolution easier.

Does this PR introduce any user-facing change?

Yes, but MERGE is only supported by very few data sources.

How was this patch tested?

New tests.

@github-actions

Test build #752552040 for PR 32192 at commit ed828c3.

@github-actions github-actions bot added the SQL label Apr 15, 2021
@cloud-fan
Contributor Author

cc @tdas @dongjoon-hyun

```diff
 UpdateAction(
   updateCondition.map(resolveExpressionByPlanChildren(_, m)),
-  resolveAssignments(assignments = None, m, resolveValuesWithSourceOnly = false))
+  // For UPDATE *, the value must be from the source table.
+  resolveAssignments(assignments, m, resolveValuesWithSourceOnly = true))
```
Contributor Author


To avoid resolving UpdateAction again with resolveValuesWithSourceOnly = false in the next analysis round, I added resolveMergeExprOrFail to fail earlier if an attribute can't be resolved.
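That fail-fast idea can be sketched as follows. This is a simplified illustration, not the actual analyzer code: `tryResolve` and the string-based scope are hypothetical placeholders standing in for attribute resolution.

```scala
object ResolveOrFailSketch {
  // Stand-in for attribute resolution: succeed only when the name is in scope,
  // mirroring an analyzer that may return an unresolved expression.
  def tryResolve(name: String, scope: Set[String]): Option[String] =
    if (scope.contains(name)) Some(name) else None

  // Instead of leaving the expression unresolved and retrying in the next
  // analysis round, fail immediately with a clear error, in the spirit of
  // resolveMergeExprOrFail.
  def resolveOrFail(name: String, scope: Set[String]): String =
    tryResolve(name, scope).getOrElse(
      throw new IllegalArgumentException(
        s"Cannot resolve '$name' given columns [${scope.mkString(", ")}]"))
}
```

Failing eagerly here surfaces a precise error at the first resolution attempt, rather than re-running the rule with different flags and reporting a generic unresolved-attribute error later.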

@dongjoon-hyun
Member

cc @aokolnychyi and @RussellSpitzer

@RussellSpitzer
Member

I think this is a great idea; I see folks hitting this extremely frequently. It does feel like it would be a rather large change to the current behavior, though.

@SparkQA

SparkQA commented Apr 15, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42002/

@SparkQA

SparkQA commented Apr 15, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42002/

@SparkQA

SparkQA commented Apr 15, 2021

Test build #137424 has finished for PR 32192 at commit ed828c3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 7, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42778/

@SparkQA

SparkQA commented May 7, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42778/

@SparkQA

SparkQA commented May 7, 2021

Test build #138256 has finished for PR 32192 at commit 19d77ea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment


Although this is not ANSI, there are some references that users might expect. It would be great if you added a comparison section to your PR description comparing your proposal with them, for the record. At least, I guess we can find the following.

@dongjoon-hyun
Member

Also, cc @sunchao and @viirya

@cloud-fan
Contributor Author

I've made it clear at the beginning of the PR description: INSERT/UPDATE * is an extension, and I can't find it in other mainstream databases. If you open the docs you posted, none of them (except Delta Lake) supports INSERT/UPDATE *.

Actually, the original Spark PR that added the MERGE SQL syntax followed https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-merge-into.html . It's a mistake that the INSERT/UPDATE * behavior ended up different, and the current behavior is confusing enough that we'd better fix it.

@dongjoon-hyun
Member

Got it; thanks for the further explanation.

@cloud-fan
Contributor Author

thanks for the review, merging to master!

@cloud-fan cloud-fan closed this in d1b8bd7 May 13, 2021
flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021
Closes apache#32192 from cloud-fan/merge.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>