[spark] Unify MERGE INTO assignment alignment #7976
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three focused tests under default spark.sql.caseSensitive=false:
- top-level column UPDATE * / INSERT * via expandStarAssignments
- explicit SET LHS resolution via resolveAssignments
- nested struct field matching via resolveStructType recursion

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
leaves12138
left a comment
I reviewed the MERGE INTO alignment changes. The common alignment/evolution path and the Spark 3/4 shims look reasonable to me, and CI is green. I found one documentation mismatch that should be fixed before merge.
`INSERT *` / `UPDATE SET *` expand against the target columns:

- Source columns missing from the target are rejected by default. Enable `spark.paimon.write.merge-schema` to keep them and evolve the table schema at write time (see [Write Merge Schema](#write-merge-schema)).
This section does not seem to match the implementation and the new tests. For top-level source-only columns under `UPDATE SET *` / `INSERT *`, strict mode currently drops them (for example, `source extra columns silently dropped under star expansion (mergeSchema=false)`), while `merge-schema=true` evolves the target schema. Conversely, for target-only columns missing from the source, strict `UPDATE SET *` / `INSERT *` throws, and only merge-schema mode preserves the target value for `UPDATE SET *` or fills default/NULL for `INSERT *`. Could you adjust these bullets to describe those four cases explicitly?
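To make the four cases concrete, here is a minimal sketch of top-level star expansion as described in this comment. It is a toy model in plain Python, not the Paimon implementation: the function name `expand_star`, its parameters, and the `"source"` / `"target"` / `"default"` labels are all hypothetical, and name matching is exact rather than case-insensitive.

```python
def expand_star(target_cols, source_cols, clause, merge_schema):
    """Toy model of top-level star expansion (hypothetical, not Paimon code).

    clause is "update" or "insert". Returns (assignments, output_cols), where
    assignments maps each output column to one of:
      "source"  -> value taken from the source column of the same name
      "target"  -> current target value is kept
      "default" -> column default, or NULL if there is none
    """
    if clause not in ("update", "insert"):
        raise ValueError(f"unknown clause: {clause}")

    source_only = [c for c in source_cols if c not in target_cols]
    target_only = [c for c in target_cols if c not in source_cols]

    if not merge_schema:
        # Strict mode: a target column with no source counterpart is an error,
        # while source-only columns are dropped from the expansion.
        if target_only:
            raise ValueError(f"source is missing target columns: {target_only}")
        return {c: "source" for c in target_cols}, list(target_cols)

    # merge-schema mode: target-only columns keep their value (UPDATE) or are
    # filled with default/NULL (INSERT); source-only columns extend the schema.
    fill = "target" if clause == "update" else "default"
    assignments = {c: ("source" if c in source_cols else fill) for c in target_cols}
    for c in source_only:
        assignments[c] = "source"
    return assignments, list(target_cols) + source_only


# Example calls, one per flag setting.
strict = expand_star(["id", "v"], ["id", "v", "extra"], "update", merge_schema=False)
evolved = expand_star(["id", "v"], ["id", "extra"], "insert", merge_schema=True)
```

In this model, `strict` contains no `extra` column, and `evolved` appends `extra` to the output columns while `v` falls back to `"default"`.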
JingsongLi
left a comment
This is a significant refactoring with good motivations — unifying the assignment alignment logic removes a lot of duplicated version-specific code. The +1777/-1481 diff shows meaningful consolidation.
A few observations:
- `MissingFieldBehavior` semantics are well-thought-out: The three modes (`FailMissing`, `NullForMissing`, `PreserveTarget`) cover the right matrix of explicit vs star clauses under strict/merge-schema modes. The `PreserveTarget` mode for `UPDATE *` on nested structs (substituting `GetStructField(target, ordinal)` for unmentioned subfields) is the correct behavior.
- Test coverage looks thorough: 24 cases in `MergeIntoAlignmentTest` covering the key scenarios. This gives me confidence in the refactoring.
- Question about the Spark 4.0 path: The diff removes `PaimonMergeIntoResolverBase.scala` from `paimon-spark-4.0` and significantly simplifies `Spark41MergeIntoRewrite.scala`. Can you confirm the Spark 4.0 tests pass? The Spark 4 MERGE INTO semantics differ from 3.x in how they handle star clause expansion.
- Documentation update: Good that the docs update mentions the `spark.paimon.write.merge-schema` control. This is a user-facing behavior change worth highlighting in release notes.
Overall this looks like a good consolidation. @Zouxxyy please confirm CI passes for all Spark versions (3.2, 3.3, 3.4, 3.5, 4.0).
Purpose

Route MERGE INTO assignment alignment through `PaimonOutputResolver` so MERGE / UPDATE / INSERT share one name-based, depth-aware alignment path with consistent merge-schema semantics.

Behavior is controlled by `spark.paimon.write.merge-schema`:

- `INSERT *` / `UPDATE *` with a source column missing from the target throws when `merge-schema=false`; when `true`, source-extras are kept so `SchemaHelper` evolves the table at write time.
- `MissingFieldBehavior`:
  - `FailMissing` (`merge-schema=false`): nested missing target / source-extra throws.
  - `NullForMissing` (`merge-schema=true`, INSERT and explicit UPDATE): missing nested fields NULL-fill, source-extras kept at any depth.
  - `PreserveTarget` (`merge-schema=true`, `UPDATE *` on a struct target): missing source subfields are substituted with `GetStructField(target, ordinal)` so unmentioned subfields keep their current value instead of being nulled.

Tests

New `MergeIntoAlignmentTest` (24 cases) covers basic `UPDATE *` / `INSERT *`, source-extra drop under strict star, top-level and nested merge-schema evolution, `PreserveTarget` semantics on nested struct `UPDATE *`, explicit assignments to nested fields, and null-fill for omitted columns.
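
The three `MissingFieldBehavior` modes listed under Purpose can be illustrated with a small sketch. This is a toy model in plain Python that aligns row values held in nested dicts, whereas the PR itself rewrites Catalyst expressions. The names `align_struct`, `FAIL_MISSING`, `NULL_FOR_MISSING` and `PRESERVE_TARGET`, and the sample row, are hypothetical stand-ins chosen for illustration.

```python
from enum import Enum


class MissingFieldBehavior(Enum):
    # Hypothetical Python stand-ins for the three modes named in the PR.
    FAIL_MISSING = "FailMissing"
    NULL_FOR_MISSING = "NullForMissing"
    PRESERVE_TARGET = "PreserveTarget"


def align_struct(target, source, behavior, path=""):
    """Align a source struct to a target struct by field name, recursively.

    Both structs are dicts of current row values; nested structs are nested
    dicts. Toy model only: no case-insensitivity, no type checks.
    """
    strict = behavior is MissingFieldBehavior.FAIL_MISSING
    out = {}
    for name, current in target.items():
        if name not in source:
            if strict:
                raise ValueError(f"source is missing field: {path}{name}")
            # PreserveTarget keeps the current value; NullForMissing writes NULL.
            keep = behavior is MissingFieldBehavior.PRESERVE_TARGET
            out[name] = current if keep else None
        elif isinstance(current, dict) and isinstance(source[name], dict):
            # Depth-aware: recurse into nested structs with the same behavior.
            out[name] = align_struct(current, source[name], behavior, f"{path}{name}.")
        else:
            out[name] = source[name]
    for name, value in source.items():
        if name not in target:
            if strict:
                raise ValueError(f"target has no field: {path}{name}")
            # Source-extra field is kept so the schema can evolve at write time.
            out[name] = value
    return out


# Sample row: the source omits info.b and adds info.c.
target_row = {"id": 1, "info": {"a": 10, "b": 20}}
source_row = {"id": 1, "info": {"a": 11, "c": 30}}

preserved = align_struct(target_row, source_row, MissingFieldBehavior.PRESERVE_TARGET)
nulled = align_struct(target_row, source_row, MissingFieldBehavior.NULL_FOR_MISSING)
```

With the sample row, `preserved` keeps `info.b` at its current value of 20, `nulled` sets it to `None`, and both carry the new `info.c`. Passing `MissingFieldBehavior.FAIL_MISSING` raises on the first mismatch instead.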