@szehon-ho
Member

@szehon-ho szehon-ho commented Nov 21, 2025

What changes were proposed in this pull request?

Introduce a new flag spark.sql.merge.nested.type.assign.by.field that lets the UPDATE SET * action in MERGE INTO act as shorthand for assigning every nested struct field to its source counterpart (i.e., UPDATE SET a.b.c = source.a.b.c). As a result, struct fields in the target table that have no source equivalent are preserved when the corresponding source struct has fewer fields than the target struct.

Additional code is added to prevent null expansion in this case (i.e., a null source struct expanding to a struct of nulls).
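
To make the null-expansion issue concrete, here is a minimal plain-Scala sketch that models it outside Spark. The names NullExpansionModel, naiveByField, and guardedByField are hypothetical and are not the PR's code. The sketch also assumes that the guarded result is a null struct, which the description implies but does not spell out.

```scala
// Toy model (not Spark code): a struct value is a map from field name to a
// nullable Int, and a nullable struct is an Option of that map.
object NullExpansionModel {
  type Struct = Map[String, Option[Int]]

  // Naive field-by-field assignment: each target field reads the matching
  // source field. With a null source struct, every read yields null, so the
  // result is a non-null struct whose fields are all null.
  def naiveByField(targetFields: Seq[String], source: Option[Struct]): Option[Struct] =
    Some(targetFields.map(f => f -> source.flatMap(_.getOrElse(f, None))).toMap)

  // Guarded variant (assumption): a null source struct stays a null struct
  // instead of being expanded into a struct of nulls.
  def guardedByField(targetFields: Seq[String], source: Option[Struct]): Option[Struct] =
    source match {
      case None    => None
      case Some(_) => naiveByField(targetFields, source)
    }
}
```

With target fields a and b and a null source struct, naiveByField produces a struct with a and b both null, while guardedByField produces a null struct.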

Why are the changes needed?

Following #52347, we now allow MERGE INTO to have a source table struct with fewer nested fields than the target table struct. In this scenario, a user issuing an UPDATE SET * may intend one of two interpretations.

The user may interpret UPDATE SET * as shorthand for assigning every top-level column, i.e., UPDATE SET struct = source.struct. The target struct is then set to the source struct as is, with missing fields set to NULL. This is the current behavior.

The user may instead mean UPDATE SET * as shorthand for assigning every nested struct field (i.e., UPDATE SET struct.a.b = source.struct.a.b), in which case target struct fields missing from the source are retained. This is similar to how UPDATE SET * does not overwrite existing target columns that are missing from the source. This flag is added for that case.
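
The difference between the two interpretations can be sketched with a small plain-Scala model. This is an illustration only, not the PR's implementation: UpdateSetStarModel, assignWhole, and assignByField are hypothetical names, and the target value's own field names stand in for the target schema.

```scala
// Toy model (not Spark code) of the two readings of UPDATE SET *.
object UpdateSetStarModel {
  sealed trait Value
  case object NullValue extends Value
  final case class IntValue(v: Int) extends Value
  final case class StructValue(fields: Map[String, Value]) extends Value

  // Interpretation 1: struct = source.struct.
  // Target fields missing from the source become NULL.
  def assignWhole(target: StructValue, source: StructValue): StructValue =
    StructValue(target.fields.map { case (name, tv) =>
      val newValue: Value = (tv, source.fields.get(name)) match {
        case (t: StructValue, Some(s: StructValue)) => assignWhole(t, s)
        case (_, Some(s))                           => s
        case (_, None)                              => NullValue
      }
      name -> newValue
    })

  // Interpretation 2: struct.a.b = source.struct.a.b for every nested field.
  // Target fields missing from the source keep their existing value.
  def assignByField(target: StructValue, source: StructValue): StructValue =
    StructValue(target.fields.map { case (name, tv) =>
      val newValue: Value = (tv, source.fields.get(name)) match {
        case (t: StructValue, Some(s: StructValue)) => assignByField(t, s)
        case (_, Some(s))                           => s
        case (_, None)                              => tv
      }
      name -> newValue
    })
}
```

For a target struct with a = 1 and b = 2 and a source struct that only has a = 10, assignWhole yields a = 10 and b = NULL, whereas assignByField yields a = 10 and b = 2.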

Does this PR introduce any user-facing change?

No. Support for source structs having fewer fields than target structs in MERGE INTO is not yet released (#52347), and in any case there is a flag to toggle this functionality.

How was this patch tested?

Unit tests, especially around cases where the source struct is null.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Nov 21, 2025
@szehon-ho szehon-ho changed the title from "[SPARK-54289][SQL] Allow MERGE INTO to preserve existing struct fields for UPDATE SET * when source has nested fields" to "[SPARK-54289][SQL] Allow MERGE INTO to preserve existing struct fields for UPDATE SET * when source struct has less nested fields than target struct" Nov 21, 2025
@szehon-ho szehon-ho force-pushed the merge_schema_evolution_update_nested branch 3 times, most recently from db416a9 to 142d795 Compare November 21, 2025 03:08
…s for UPDATE SET * when source has less fields
@szehon-ho szehon-ho force-pushed the merge_schema_evolution_update_nested branch from 142d795 to fdddef1 Compare November 21, 2025 03:13
Member

@dongjoon-hyun dongjoon-hyun left a comment

Is this an improvement, @szehon-ho ?
It looks like a massive change if it's a bug fix.

Never mind. I checked the JIRA discussion we had before.

@szehon-ho
Member Author

@cloud-fan can you help review? Thanks

}
}

private def applyNestedFieldAssignments(
Member Author

@szehon-ho szehon-ho Nov 21, 2025

Note: this is like applyFieldAssignments above, but recurses into nested structs
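
As a rough illustration of what recursing into nested structs means here, the following plain-Scala sketch walks a toy schema and emits one assignment per leaf field. It is not the body of applyNestedFieldAssignments: the types Schema, AtomicSchema, and StructSchema and the function leafAssignments are hypothetical and exist only for this example.

```scala
// Toy schema model (not Spark's DataType hierarchy).
object NestedAssignmentPaths {
  sealed trait Schema
  case object AtomicSchema extends Schema
  final case class StructSchema(fields: Seq[(String, Schema)]) extends Schema

  // Recurse through struct fields; at each leaf, emit an assignment of the
  // form "a.b.c = source.a.b.c" for the accumulated path.
  def leafAssignments(path: Seq[String], schema: Schema): Seq[String] =
    schema match {
      case StructSchema(fields) =>
        fields.flatMap { case (name, child) => leafAssignments(path :+ name, child) }
      case AtomicSchema =>
        val col = path.mkString(".")
        Seq(s"$col = source.$col")
    }
}
```

For a column a of type struct<b: struct<c>, d>, leafAssignments(Seq("a"), ...) returns the two assignments a.b.c = source.a.b.c and a.d = source.a.d.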

.createWithDefault(true)

val MERGE_INTO_SOURCE_NESTED_TYPE_UPDATE_BY_FIELD =
buildConf("spark.sql.merge.nested.type.assign.by.fieldv2")
Member

@dongjoon-hyun dongjoon-hyun Nov 21, 2025

Hi, @szehon-ho. The namespace design looks odd. Why is "nested" at a different level in the following two config names?

spark.sql.merge.nested.type.assign.by.fieldv2
spark.sql.merge.source.nested.type.coercion.enabled

Member Author

@szehon-ho szehon-ho Nov 22, 2025

Ah yes, I fixed the config string in the latest revision, thanks!

getConf(SQLConf.LEGACY_XML_PARSER_ENABLED)

def coerceMergeNestedTypes: Boolean =
def mergeCoerceNestedTypes: Boolean =
Member

Oh, you also knew that the naming was weird, didn't you, @szehon-ho? That's why you are renaming this in this PR.

Member Author

Ah yes, it's not strictly related, but I realized it's better that they align.

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.

dongjoon-hyun pushed a commit that referenced this pull request Nov 22, 2025
…s for UPDATE SET * when source struct has less nested fields than target struct

Closes #53149 from szehon-ho/merge_schema_evolution_update_nested.

Authored-by: Szehon Ho <szehon.apache@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 966e053)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@dongjoon-hyun
Member

Merged to master/4.1 for Apache Spark 4.1.0.

Happy Thanksgiving, @szehon-ho .
