
[SPARK-34720][SQL] MERGE ... UPDATE/INSERT * should do by-name resolution #32192

Closed

Conversation

cloud-fan
Contributor

What changes were proposed in this pull request?

In Spark, we have an extension to the MERGE syntax: INSERT/UPDATE *. This is not from the ANSI standard or any other mainstream database, so we need to define the behavior on our own.

The behavior today is very weird: assume the source table has `n1` columns and the target table has `n2` columns. We generate the assignments by taking the first `min(n1, n2)` columns from the source and target tables and pairing them by ordinal.

This PR proposes a more reasonable behavior: take all the columns of the target table as keys, and find the corresponding columns of the source table by name as values.
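The difference can be sketched in plain Scala. This is an illustrative sketch only, not Spark's analyzer code: `byOrdinal`, `byName`, and the string-pair "assignments" are simplified stand-ins.

```scala
object MergeStarSketch {
  // Old behavior: take the first min(n1, n2) columns from source and target
  // and pair them by ordinal. zip truncates to the shorter of the two lists.
  def byOrdinal(target: Seq[String], source: Seq[String]): Seq[(String, String)] =
    target.zip(source)

  // Proposed behavior: every target column is a key; look up the matching
  // source column by name (case-insensitive here for simplicity) and fail
  // when no match exists.
  def byName(target: Seq[String], source: Seq[String]): Seq[(String, String)] =
    target.map { t =>
      source.find(_.equalsIgnoreCase(t)) match {
        case Some(s) => t -> s
        case None    => throw new IllegalArgumentException(s"cannot resolve column $t in source")
      }
    }

  def main(args: Array[String]): Unit = {
    val target = Seq("id", "name", "city")
    val source = Seq("name", "id", "city")
    println(byOrdinal(target, source)) // List((id,name), (name,id), (city,city))
    println(byName(target, source))    // List((id,id), (name,name), (city,city))
  }
}
```

With the same column sets in a different order, the old by-ordinal pairing silently assigns `name` to `id` and vice versa, while by-name pairing produces the intuitive result and keeps working after schema evolution adds or reorders columns.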

Why are the changes needed?

Fix MERGE INSERT/UPDATE * to be more user-friendly and to make schema evolution easier.

Does this PR introduce any user-facing change?

Yes, but MERGE is only supported by very few data sources.

How was this patch tested?

New tests.

@github-actions

Test build #752552040 for PR 32192 at commit ed828c3.

@github-actions github-actions bot added the SQL label Apr 15, 2021
@cloud-fan
Contributor Author

cc @tdas @dongjoon-hyun

```diff
 UpdateAction(
   updateCondition.map(resolveExpressionByPlanChildren(_, m)),
-  resolveAssignments(assignments = None, m, resolveValuesWithSourceOnly = false))
+  // For UPDATE *, the value must be from the source table.
+  resolveAssignments(assignments, m, resolveValuesWithSourceOnly = true))
```
Contributor Author


To avoid resolving UpdateAction again with resolveValuesWithSourceOnly = false in the next analysis round, I added resolveMergeExprOrFail to fail earlier if an attribute can't be resolved.
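That fail-fast idea can be sketched as follows. This is a simplified illustration, not the actual analyzer code: `tryResolve` and the string-based scope are hypothetical placeholders standing in for attribute resolution.

```scala
object ResolveOrFailSketch {
  // Stand-in for attribute resolution: succeed only when the name is in scope,
  // mirroring an analyzer that may return an unresolved expression.
  def tryResolve(name: String, scope: Set[String]): Option[String] =
    if (scope.contains(name)) Some(name) else None

  // Instead of leaving the expression unresolved and retrying in the next
  // analysis round, fail immediately with a clear error, in the spirit of
  // resolveMergeExprOrFail.
  def resolveOrFail(name: String, scope: Set[String]): String =
    tryResolve(name, scope).getOrElse(
      throw new IllegalArgumentException(
        s"Cannot resolve '$name' given columns [${scope.mkString(", ")}]"))
}
```

Failing eagerly here surfaces a precise error at the first resolution attempt, rather than re-running the rule with different flags and reporting a generic unresolved-attribute error later.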

@dongjoon-hyun
Member

cc @aokolnychyi and @RussellSpitzer

@RussellSpitzer
Member

I think this is a great idea; I see folks hitting this extremely frequently. It does feel like it would be a rather large change to the current behavior, though.

@SparkQA

SparkQA commented Apr 15, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42002/

@SparkQA

SparkQA commented Apr 15, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42002/

@SparkQA

SparkQA commented Apr 15, 2021

Test build #137424 has finished for PR 32192 at commit ed828c3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 7, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42778/

@SparkQA

SparkQA commented May 7, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42778/

@SparkQA

SparkQA commented May 7, 2021

Test build #138256 has finished for PR 32192 at commit 19d77ea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment


Although this is not ANSI, there are some references that users might expect. It would be great if you added a comparison section to your PR description comparing your proposal with them, for the record. At least, I guess we can find the following.

@dongjoon-hyun
Member

Also, cc @sunchao and @viirya

@cloud-fan
Contributor Author

I've made it clear at the beginning of the PR description: INSERT/UPDATE * is an extension, and I can't find it in other mainstream databases. If you open the docs you posted, none of them (except Delta Lake) supports INSERT/UPDATE *.

Actually, the original Spark PR that added the MERGE SQL syntax followed https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-merge-into.html . It's a mistake that the INSERT/UPDATE * behavior ended up different, and the current behavior is confusing enough that we'd better fix it.

@dongjoon-hyun
Member

Got it; thanks for the further explanation.

@cloud-fan
Contributor Author

thanks for the review, merging to master!

@cloud-fan cloud-fan closed this in d1b8bd7 May 13, 2021
flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021
Closes apache#32192 from cloud-fan/merge.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>