-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-54289][SQL] Allow MERGE INTO to preserve existing struct fields for UPDATE SET * when source struct has less nested fields than target struct #53149
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
db416a9 to
142d795
Compare
…s for UPDATE SET * when source has less fields
142d795 to
fdddef1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this an improvement, @szehon-ho ?
It looks like a massive change if it's a bug fix.
Never mind. I checked the JIRA discussion we had before.
|
@cloud-fan can you help review? Thanks |
| } | ||
| } | ||
|
|
||
| private def applyNestedFieldAssignments( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: this is like applyFieldAssignments above, but recurses into nested structs
| .createWithDefault(true) | ||
|
|
||
| val MERGE_INTO_SOURCE_NESTED_TYPE_UPDATE_BY_FIELD = | ||
| buildConf("spark.sql.merge.nested.type.assign.by.fieldv2") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, @szehon-ho . The naming space design looks weird. Why nested is at the different level like the following?
spark.sql.merge.nested.type.assign.by.fieldv2
spark.sql.merge.source.nested.type.coercion.enabled
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah yes i fixed the config string in latest one, thanks!
| getConf(SQLConf.LEGACY_XML_PARSER_ENABLED) | ||
|
|
||
| def coerceMergeNestedTypes: Boolean = | ||
| def mergeCoerceNestedTypes: Boolean = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, you also knew that the naming is weird, don't you, @szehon-ho ? That's the reason you renaming this in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah yes, its not strictly related, but realized its better they align.
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM.
…s for UPDATE SET * when source struct has less nested fields than target struct ### What changes were proposed in this pull request? Introduce a new flag spark.sql.merge.nested.type.assign.by.field that allows UPDATE SET * action in MERGE INTO to be shorthand to assign every nested struct to its existing source counterpart (ie, UPDATE SET a.b.c = source.a.b.c). This will have the implication that existing struct field in the target table that has no source equivalent are preserved, when the corresponding source struct has less fields than target. Additional code is added to prevent null expansion in this case (ie, a null source struct expanding to a struct of nulls). ### Why are the changes needed? Following #52347, we now allow MERGE INTO to have a source table struct with less nested fields than target table struct. In this scenario, a user making a UPDATE SET * may have two interpretations. The use may interpret UPDATE SET * as shorthand to assign every top-column level field, ie UPDATE SET struct=source.struct, then the target struct is set to source struct object as is, with missing fields as NULL. This is the current behavior. The user may also mean that UPDATE SET * is short-hand to assign every nested struct field (ie, UPDATE SET struct.a.b = source.struct.a.b), in which case the target struct fields missing in source are retained. This is similar to UPDATE SET * not overriding existing target columns missing in the source, for example. For this case, this flag is added. ### Does this PR introduce _any_ user-facing change? No, the support to allow source structs to have less fields than target structs in MERGE INTO is unreleased yet (#52347), and in any case there is a flag to toggle this functionality. ### How was this patch tested? Unit tests, especially around cases where the source struct is null. ### Was this patch authored or co-authored using generative AI tooling? No Closes #53149 from szehon-ho/merge_schema_evolution_update_nested. Authored-by: Szehon Ho <szehon.apache@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 966e053) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
|
Merged to master/4.1 for Apache Spark 4.1.0. Happy Thanksgiving, @szehon-ho . |
What changes were proposed in this pull request?
Introduce a new flag spark.sql.merge.nested.type.assign.by.field that allows UPDATE SET * action in MERGE INTO to be shorthand to assign every nested struct to its existing source counterpart (ie, UPDATE SET a.b.c = source.a.b.c). This will have the implication that existing struct field in the target table that has no source equivalent are preserved, when the corresponding source struct has less fields than target.
Additional code is added to prevent null expansion in this case (ie, a null source struct expanding to a struct of nulls).
Why are the changes needed?
Following #52347, we now allow MERGE INTO to have a source table struct with less nested fields than target table struct. In this scenario, a user making a UPDATE SET * may have two interpretations.
The use may interpret UPDATE SET * as shorthand to assign every top-column level field, ie UPDATE SET struct=source.struct, then the target struct is set to source struct object as is, with missing fields as NULL. This is the current behavior.
The user may also mean that UPDATE SET * is short-hand to assign every nested struct field (ie, UPDATE SET struct.a.b = source.struct.a.b), in which case the target struct fields missing in source are retained. This is similar to UPDATE SET * not overriding existing target columns missing in the source, for example. For this case, this flag is added.
Does this PR introduce any user-facing change?
No, the support to allow source structs to have less fields than target structs in MERGE INTO is unreleased yet (#52347), and in any case there is a flag to toggle this functionality.
How was this patch tested?
Unit tests, especially around cases where the source struct is null.
Was this patch authored or co-authored using generative AI tooling?
No