
Conversation

@szehon-ho
Member

@szehon-ho szehon-ho commented Nov 25, 2025

What changes were proposed in this pull request?

Some fixes to allow the DataFrame Merge API to support schema evolution. The DataFrame API is here: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/classic/MergeIntoWriter.scala#L7

The fixes are described inline.
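
For context, a minimal usage sketch of the DataFrame Merge API (MergeIntoWriter) with schema evolution; the table and column names below are hypothetical and not taken from this PR:

import org.apache.spark.sql.functions.col

// The source table carries a column the target lacks; withSchemaEvolution()
// asks MERGE to evolve the target schema to accommodate it.
spark.table("source")
  .mergeInto("target", col("source.pk") === col("target.pk"))
  .whenMatched()
  .updateAll()
  .whenNotMatched()
  .insertAll()
  .withSchemaEvolution()
  .merge()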

Why are the changes needed?

The DataFrame Merge API is broken in schema evolution mode without these fixes.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added unit tests. Will try to refactor later to reduce test duplication.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Nov 25, 2025
} else {
  m transformUpWithNewOutput {
    case r @ DataSourceV2Relation(_: SupportsRowLevelOperations, _, _, _, _, _) =>
      val finalAttrMapping = ArrayBuffer.empty[(Attribute, Attribute)]
Member Author

@szehon-ho szehon-ho Nov 25, 2025


There is a bug here: the transform actually hits both the sourceTable and the targetTable and attempts schema evolution on both, when schema evolution should only ever be performed on the target table.

I had done it this way because of a limitation of transformUpWithNewOutput: it doesn't re-map the attributes of the top-level object (MergeIntoTable). See #52866 (comment) for my finding. So I transformed all children of MergeIntoTable and assumed that matching on a SupportsRowLevelOperations table would be enough to apply schema evolution only to the targetTable, but I was wrong.

So the fix is to transform only the MergeIntoTable's targetTable explicitly, and to add an extra rewriteAttrs call to rewrite the attributes of the top-level MergeIntoTable object.
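
A toy illustration of the bug class in plain Scala (not Spark internals; all names are made up): a rewrite that traverses every child of the merge node and matches only on node type evolves the source relation too, whereas rewriting just the designated target child evolves only the target.

sealed trait Node
case class Relation(name: String, cols: Seq[String]) extends Node
case class Merge(target: Node, source: Node) extends Node

// Buggy shape: traversing all children and matching on type evolves both sides.
def evolveEverywhere(n: Node, newCol: String): Node = n match {
  case Merge(t, s) => Merge(evolveEverywhere(t, newCol), evolveEverywhere(s, newCol))
  case Relation(name, cols) => Relation(name, cols :+ newCol)
}

// Fixed shape: transform only the target child and leave the source untouched.
def evolveTargetOnly(m: Merge, newCol: String): Merge = m.target match {
  case Relation(name, cols) => m.copy(target = Relation(name, cols :+ newCol))
  case _ => m
}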

(assignment.value.resolved && sourcePaths.exists {
  path => MergeIntoTable.isEqual(assignment, path)
})
val hasStarActions = actions.exists {
Member Author

@szehon-ho szehon-ho Nov 25, 2025


The call to canEvaluateSchemaEvolution (which guards whether the schema evolution check gets evaluated) can happen while updateStar and insertStar actions are still unresolved. This is triggered in the DataFrame API and revealed a case I had missed. So here I return false to explicitly skip the check until the stars are resolved; otherwise it hits an analysis error later. This was not triggered in the SQL case, where the stars get resolved earlier.
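
A simplified sketch of that guard's intent, using a hypothetical helper rather than the exact code in this PR (UpdateStarAction and InsertStarAction are the existing catalyst merge action classes):

import org.apache.spark.sql.catalyst.plans.logical.{InsertStarAction, MergeAction, UpdateStarAction}

// While any UPDATE SET * / INSERT * action is still unresolved, schema
// evolution cannot be evaluated yet, so the guard should return false and
// retry after the stars are resolved.
def starsStillPresent(actions: Seq[MergeAction]): Boolean = actions.exists {
  case _: UpdateStarAction | _: InsertStarAction => true
  case _ => false
}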

Member

@dongjoon-hyun dongjoon-hyun left a comment


Thank you, @szehon-ho. I have a few questions in this domain.

  1. May I ask how you are proceeding with testing of this MERGE INTO feature?
  2. Do you think there are more areas that need to be investigated from here?

sql(s"DROP TABLE IF EXISTS $tableNameAsString")
}

test("merge with schema evolution using dataframe API: add new column and set all") {
Member


Thank you for adding an additional 1050 lines.

@dongjoon-hyun
Member

Could you fix the compilation failure, @szehon-ho?

[error] /home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/connector/MergeIntoTableSuiteBase.scala:5686:31: value MERGE_INTO_NESTED_TYPE_UPDATE_BY_FIELD is not a member of object org.apache.spark.sql.internal.SQLConf
[error]           withSQLConf(SQLConf.MERGE_INTO_NESTED_TYPE_UPDATE_BY_FIELD.key ->
[error]                               ^
[warn] 22 warnings found
[error] /home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/connector/MergeIntoTableSuiteBase.scala:5806:31: value MERGE_INTO_NESTED_TYPE_UPDATE_BY_FIELD is not a member of object org.apache.spark.sql.internal.SQLConf
[error]           withSQLConf(SQLConf.MERGE_INTO_NESTED_TYPE_UPDATE_BY_FIELD.key ->
[error]                               ^
[error] /home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/connector/MergeIntoTableSuiteBase.scala:6049:19: value MERGE_INTO_NESTED_TYPE_UPDATE_BY_FIELD is not a member of object org.apache.spark.sql.internal.SQLConf
[error]           SQLConf.MERGE_INTO_NESTED_TYPE_UPDATE_BY_FIELD.key ->
[error]                   ^
[error] /home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/connector/MergeIntoTableSuiteBase.scala:6165:21: value MERGE_INTO_NESTED_TYPE_UPDATE_BY_FIELD is not a member of object org.apache.spark.sql.internal.SQLConf
[error]             SQLConf.MERGE_INTO_NESTED_TYPE_UPDATE_BY_FIELD.key ->
[error]                     ^

@szehon-ho
Member Author

szehon-ho commented Nov 25, 2025

Ah, I didn't rebase on my other fix; let me do it tomorrow.

@dongjoon-hyun
Member

Got it. Thank you, @szehon-ho.

@szehon-ho szehon-ho force-pushed the merge_schema_evolution_bug branch from 8e2996d to b240faf on November 25, 2025 20:01
@szehon-ho
Member Author

szehon-ho commented Nov 25, 2025

Thanks, I fixed it @dongjoon-hyun, sorry for the delay.

Re: testing the overall MERGE INTO schema evolution feature, @aokolnychyi and @cloud-fan are testing it; in fact @aokolnychyi found this issue, as described in the JIRA. Separately (unrelated to this PR), @cloud-fan has some concern about #53149, as there is ambiguity in what the user wants from UPDATE SET * when there is a struct mismatch (by column or by field). We are discussing it and may revert it, or disable that part of it, for the release.
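
For illustration, a hypothetical example of that ambiguity (table and column names are made up): the target's struct column has a field the source struct lacks, so UPDATE SET * could reasonably mean either of two things.

// Suppose the two tables have these schemas:
//   target: pk INT, s STRUCT<a: INT, b: INT, c: INT>
//   source: pk INT, s STRUCT<a: INT, b: INT>
//
// With UPDATE SET * it is unclear what the user wants for column `s`:
//   by column: replace target.s with source.s as a whole (what happens to field c?)
//   by field:  update only s.a and s.b, keeping the existing value of s.c
spark.sql("""
  MERGE INTO target t
  USING source src
  ON t.pk = src.pk
  WHEN MATCHED THEN UPDATE SET *
""")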

@dongjoon-hyun
Member

Ack. Thank you for sharing the updated status, @szehon-ho.

@szehon-ho
Member Author

Build failures look infra-related (free up disk space...).

Member

@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM (Pending CIs).

dongjoon-hyun pushed a commit that referenced this pull request Nov 26, 2025

Closes #53207 from szehon-ho/merge_schema_evolution_bug.

Authored-by: Szehon Ho <szehon.apache@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 9feb1b2)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@dongjoon-hyun
Member

Merged to master/4.1 for Apache Spark 4.1.0.
