
Conversation

@szehon-ho
Member

@szehon-ho szehon-ho commented Nov 25, 2025

What changes were proposed in this pull request?

Some fixes to allow the DataFrame Merge API to support schema evolution. The DataFrame API is here: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/classic/MergeIntoWriter.scala#L7

The fixes are described inline.
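
For context, a minimal usage sketch of the DataFrame Merge API (MergeIntoWriter) with schema evolution; the table and column names below are hypothetical and not taken from this PR:

import org.apache.spark.sql.functions.col

// The source table carries a column the target lacks; withSchemaEvolution()
// asks MERGE to evolve the target schema to accommodate it.
spark.table("source")
  .mergeInto("target", col("source.pk") === col("target.pk"))
  .whenMatched()
  .updateAll()
  .whenNotMatched()
  .insertAll()
  .withSchemaEvolution()
  .merge()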

Why are the changes needed?

The DataFrame Merge API is broken in schema evolution mode without these fixes.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added unit tests. Will try to refactor later to reduce test duplication.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Nov 25, 2025
} else {
  m transformUpWithNewOutput {
    case r @ DataSourceV2Relation(_: SupportsRowLevelOperations, _, _, _, _, _) =>
      val finalAttrMapping = ArrayBuffer.empty[(Attribute, Attribute)]
Member Author

@szehon-ho szehon-ho Nov 25, 2025


There is a bug here: the transform actually hits both the sourceTable and the targetTable and attempts schema evolution on both, when schema evolution should only ever be performed on the target table.

I had done it this way because of a limitation of transformUpWithNewOutput: it doesn't re-map the attributes of the top-level object (MergeIntoTable). See #52866 (comment) for my finding. So I transformed all children of MergeIntoTable and assumed that matching on a SupportsRowLevelOperations table would be enough to apply schema evolution only to the targetTable, but I was wrong.

So the fix is to transform only the MergeIntoTable's targetTable explicitly, and to add an extra rewriteAttrs call to rewrite the attributes of the top-level MergeIntoTable object.
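
A toy illustration of the bug class in plain Scala (not Spark internals; all names are made up): a rewrite that traverses every child of the merge node and matches only on node type evolves the source relation too, whereas rewriting just the designated target child evolves only the target.

sealed trait Node
case class Relation(name: String, cols: Seq[String]) extends Node
case class Merge(target: Node, source: Node) extends Node

// Buggy shape: traversing all children and matching on type evolves both sides.
def evolveEverywhere(n: Node, newCol: String): Node = n match {
  case Merge(t, s) => Merge(evolveEverywhere(t, newCol), evolveEverywhere(s, newCol))
  case Relation(name, cols) => Relation(name, cols :+ newCol)
}

// Fixed shape: transform only the target child and leave the source untouched.
def evolveTargetOnly(m: Merge, newCol: String): Merge = m.target match {
  case Relation(name, cols) => m.copy(target = Relation(name, cols :+ newCol))
  case _ => m
}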

(assignment.value.resolved && sourcePaths.exists {
  path => MergeIntoTable.isEqual(assignment, path)
})
val hasStarActions = actions.exists {
Member Author

@szehon-ho szehon-ho Nov 25, 2025


The call to canEvaluateSchemaEvolution (which guards whether the schema evolution check gets evaluated) can happen while updateStar and insertStar actions are still unresolved. This is triggered in the DataFrame API and revealed a case I had missed. So here I return false to explicitly skip the check until the stars are resolved; otherwise it hits an analysis error later. This was not triggered in the SQL case, where the stars get resolved earlier.
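
A simplified sketch of that guard's intent, using a hypothetical helper rather than the exact code in this PR (UpdateStarAction and InsertStarAction are the existing catalyst merge action classes):

import org.apache.spark.sql.catalyst.plans.logical.{InsertStarAction, MergeAction, UpdateStarAction}

// While any UPDATE SET * / INSERT * action is still unresolved, schema
// evolution cannot be evaluated yet, so the guard should return false and
// retry after the stars are resolved.
def starsStillPresent(actions: Seq[MergeAction]): Boolean = actions.exists {
  case _: UpdateStarAction | _: InsertStarAction => true
  case _ => false
}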

Member

@dongjoon-hyun dongjoon-hyun left a comment


Thank you, @szehon-ho. I have a few questions in this domain.

  1. May I ask how you are proceeding with testing of this MERGE INTO feature?
  2. Do you think there are more areas that need to be investigated from here?

sql(s"DROP TABLE IF EXISTS $tableNameAsString")
}

test("merge with schema evolution using dataframe API: add new column and set all") {
Member


Thank you for adding an additional 1050 lines.

@dongjoon-hyun
Member

Could you fix the compilation failure, @szehon-ho?

[error] /home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/connector/MergeIntoTableSuiteBase.scala:5686:31: value MERGE_INTO_NESTED_TYPE_UPDATE_BY_FIELD is not a member of object org.apache.spark.sql.internal.SQLConf
[error]           withSQLConf(SQLConf.MERGE_INTO_NESTED_TYPE_UPDATE_BY_FIELD.key ->
[error]                               ^
[warn] 22 warnings found
[error] /home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/connector/MergeIntoTableSuiteBase.scala:5806:31: value MERGE_INTO_NESTED_TYPE_UPDATE_BY_FIELD is not a member of object org.apache.spark.sql.internal.SQLConf
[error]           withSQLConf(SQLConf.MERGE_INTO_NESTED_TYPE_UPDATE_BY_FIELD.key ->
[error]                               ^
[error] /home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/connector/MergeIntoTableSuiteBase.scala:6049:19: value MERGE_INTO_NESTED_TYPE_UPDATE_BY_FIELD is not a member of object org.apache.spark.sql.internal.SQLConf
[error]           SQLConf.MERGE_INTO_NESTED_TYPE_UPDATE_BY_FIELD.key ->
[error]                   ^
[error] /home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/connector/MergeIntoTableSuiteBase.scala:6165:21: value MERGE_INTO_NESTED_TYPE_UPDATE_BY_FIELD is not a member of object org.apache.spark.sql.internal.SQLConf
[error]             SQLConf.MERGE_INTO_NESTED_TYPE_UPDATE_BY_FIELD.key ->
[error]                     ^

@szehon-ho
Member Author

szehon-ho commented Nov 25, 2025

Ah, I didn't rebase on my other fix; let me do it tomorrow.

@dongjoon-hyun
Member

Got it. Thank you, @szehon-ho.

@szehon-ho szehon-ho force-pushed the merge_schema_evolution_bug branch from 8e2996d to b240faf on November 25, 2025 20:01
@szehon-ho
Member Author

szehon-ho commented Nov 25, 2025

Thanks, I fixed it @dongjoon-hyun, sorry for the delay.

Re: testing the overall MERGE INTO schema evolution feature, @aokolnychyi and @cloud-fan are testing it; in fact @aokolnychyi found this issue, as described in the JIRA. Separately (unrelated to this PR), @cloud-fan has some concern about #53149, as there is ambiguity in what the user wants from UPDATE SET * when there is a struct mismatch (by column or by field). We are discussing it and may revert it, or disable that part of it, for the release.
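
For illustration, a hypothetical example of that ambiguity (table and column names are made up): the target's struct column has a field the source struct lacks, so UPDATE SET * could reasonably mean either of two things.

// Suppose the two tables have these schemas:
//   target: pk INT, s STRUCT<a: INT, b: INT, c: INT>
//   source: pk INT, s STRUCT<a: INT, b: INT>
//
// With UPDATE SET * it is unclear what the user wants for column `s`:
//   by column: replace target.s with source.s as a whole (what happens to field c?)
//   by field:  update only s.a and s.b, keeping the existing value of s.c
spark.sql("""
  MERGE INTO target t
  USING source src
  ON t.pk = src.pk
  WHEN MATCHED THEN UPDATE SET *
""")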

@dongjoon-hyun
Member

Ack. Thank you for sharing the updated status, @szehon-ho.

@szehon-ho
Member Author

Build failures look infra-related (free up disk space...).

Member

@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM (Pending CIs).

dongjoon-hyun pushed a commit that referenced this pull request Nov 26, 2025

Closes #53207 from szehon-ho/merge_schema_evolution_bug.

Authored-by: Szehon Ho <szehon.apache@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 9feb1b2)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@dongjoon-hyun
Member

Merged to master/4.1 for Apache Spark 4.1.0.
