[spark] Unify MERGE INTO assignment alignment#7976

Open
Zouxxyy wants to merge 5 commits into apache:master from Zouxxyy:dev/merge-into-update
@Zouxxyy
Contributor

@Zouxxyy Zouxxyy commented May 26, 2026

Purpose

Route MERGE INTO assignment alignment through PaimonOutputResolver so MERGE / UPDATE / INSERT share one name-based, depth-aware alignment path with consistent merge-schema semantics.

Behavior is controlled by spark.paimon.write.merge-schema:

  • Top-level alignment is by name. Unmentioned target columns are NULL-filled under explicit clauses (matches Spark INSERT FILL).
  • INSERT * / UPDATE * with a source column missing from the target throws when merge-schema=false; when true, source-extras are kept so SchemaHelper evolves the table at write time.
  • Nested struct alignment follows MissingFieldBehavior:
    • FailMissing (merge-schema=false): nested missing target / source-extra throws.
    • NullForMissing (merge-schema=true, INSERT and explicit UPDATE): missing nested fields NULL-fill, source-extras kept at any depth.
    • PreserveTarget (merge-schema=true, UPDATE * on a struct target): missing source subfields are substituted with GetStructField(target, ordinal) so unmentioned subfields keep their current value instead of being nulled.
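
The three nested-struct behaviors above can be modeled in a few lines. This is only a plain-Python sketch of the described semantics, not the Paimon implementation: `align_struct`, its parameters, and the dict-based struct representation are hypothetical, it handles a single nesting level, and keeping source-extras under PreserveTarget is an assumption (the description states it only for NullForMissing).

```python
from enum import Enum


class MissingFieldBehavior(Enum):
    FAIL_MISSING = "FailMissing"          # merge-schema=false
    NULL_FOR_MISSING = "NullForMissing"   # merge-schema=true, INSERT / explicit UPDATE
    PRESERVE_TARGET = "PreserveTarget"    # merge-schema=true, UPDATE * on a struct


def align_struct(target_fields, source, current, behavior):
    """Align a source struct to the target's field order, by name.

    target_fields: ordered list of the target struct's field names
    source:        dict of source field name -> value
    current:       dict holding the target row's current struct values
    behavior:      how to treat fields present on only one side
    """
    extras = [name for name in source if name not in target_fields]
    missing = [name for name in target_fields if name not in source]

    # Strict mode: any mismatch between source and target is an error.
    if behavior is MissingFieldBehavior.FAIL_MISSING and (extras or missing):
        raise ValueError(f"missing={missing}, source-extra={extras}")

    aligned = {}
    for name in target_fields:
        if name in source:
            aligned[name] = source[name]
        elif behavior is MissingFieldBehavior.PRESERVE_TARGET:
            # Stand-in for GetStructField(target, ordinal): keep the current value.
            aligned[name] = current[name]
        else:
            aligned[name] = None  # NullForMissing

    # Assumption: source-extras are appended so schema evolution can add them.
    for name in extras:
        aligned[name] = source[name]
    return aligned
```

For a target struct with fields `a` and `b`, a source that supplies only `a` yields `b = None` under NullForMissing and the row's existing `b` under PreserveTarget.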

Tests

New MergeIntoAlignmentTest (24 cases) covers basic UPDATE * / INSERT *, source-extra drop under strict star, top-level and nested merge-schema evolution, PreserveTarget semantics on nested struct UPDATE *, explicit assignments to nested fields, and null-fill for omitted columns.

Zouxxyy and others added 2 commits May 26, 2026 16:05
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Zouxxyy Zouxxyy closed this May 26, 2026
@Zouxxyy Zouxxyy reopened this May 26, 2026
Zouxxyy and others added 2 commits May 26, 2026 17:20
Three focused tests under default spark.sql.caseSensitive=false:
- top-level column UPDATE * / INSERT * via expandStarAssignments
- explicit SET LHS resolution via resolveAssignments
- nested struct field matching via resolveStructType recursion

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
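
The name matching exercised by these three tests can be illustrated with a small sketch. It is a plain-Python model rather than the code in this PR: `resolve_name` and its arguments are hypothetical, and raising on an ambiguous match is an assumption about the desired behavior.

```python
def resolve_name(name, candidates, case_sensitive=False):
    """Return the candidate column matching `name`, or None if there is none.

    With case_sensitive=False (modeling the default spark.sql.caseSensitive=false),
    names are compared ignoring case.
    """
    if case_sensitive:
        matches = [c for c in candidates if c == name]
    else:
        matches = [c for c in candidates if c.lower() == name.lower()]

    if not matches:
        return None
    if len(matches) > 1:
        # Assumption: more than one match is treated as an error.
        raise ValueError(f"ambiguous reference {name!r}: {matches}")
    return matches[0]
```

Under this model, `resolve_name("ID", ["id", "name"])` resolves to the target's `id` column, while the same call with `case_sensitive=True` finds no match. The same comparison would apply at each level when recursing into struct fields.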
Contributor

@leaves12138 leaves12138 left a comment

I reviewed the MERGE INTO alignment changes. The common alignment/evolution path and the Spark 3/4 shims look reasonable to me, and CI is green. I found one documentation mismatch that should be fixed before merge.

Comment thread docs/docs/spark/sql-write.md Outdated

`INSERT *` / `UPDATE SET *` expand against the target columns:

- Source columns missing from the target are rejected by default. Enable `spark.paimon.write.merge-schema` to keep them and evolve the table schema at write time (see [Write Merge Schema](#write-merge-schema)).
Contributor

@leaves12138 leaves12138 May 27, 2026

This section does not seem to match the implementation and the new tests. For top-level source-only columns under UPDATE SET * / INSERT *, strict mode currently drops them (see, for example, the test "source extra columns silently dropped under star expansion (mergeSchema=false)"), while merge-schema=true evolves the target schema. Conversely, for target-only columns missing from the source, strict UPDATE SET * / INSERT * throws; only merge-schema mode preserves the target value for UPDATE SET * or fills the default/NULL for INSERT *. Could you adjust these bullets to describe those four cases explicitly?
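
The four cases can be written down as a small decision table. The sketch below only restates the behavior described in this comment; `star_column_outcome`, its arguments, and the outcome strings are hypothetical labels, not identifiers from the PR.

```python
def star_column_outcome(column_side, merge_schema, clause):
    """Outcome for a top-level column under UPDATE SET * / INSERT *.

    column_side:  "source_only" or "target_only"
    merge_schema: value of spark.paimon.write.merge-schema
    clause:       "UPDATE" or "INSERT"
    """
    if column_side == "source_only":
        # Strict mode drops the extra source column; merge-schema evolves the table.
        return "evolve target schema" if merge_schema else "drop source column"
    if column_side == "target_only":
        if not merge_schema:
            return "throw"
        if clause == "UPDATE":
            return "preserve target value"
        return "fill default or NULL"
    raise ValueError(f"unknown column_side: {column_side!r}")
```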

Contributor

@JingsongLi JingsongLi left a comment

This is a significant refactoring with good motivations: unifying the assignment alignment logic removes a lot of duplicated version-specific code. The +1777/-1481 diff shows meaningful consolidation.

A few observations:

  1. MissingFieldBehavior semantics are well-thought-out: The three modes (FailMissing, NullForMissing, PreserveTarget) cover the right matrix of explicit vs star clauses under strict/merge-schema modes. The PreserveTarget mode for UPDATE * on nested structs (substituting GetStructField(target, ordinal) for unmentioned subfields) is the correct behavior.

  2. Test coverage looks thorough: 24 cases in MergeIntoAlignmentTest covering the key scenarios. This gives me confidence in the refactoring.

  3. Question about the Spark 4.0 path: The diff removes PaimonMergeIntoResolverBase.scala from paimon-spark-4.0 and significantly simplifies Spark41MergeIntoRewrite.scala. Can you confirm the Spark 4.0 tests pass? The Spark 4 MERGE INTO semantics differ from 3.x in how they handle star clause expansion.

  4. Documentation update: Good that the docs update mentions the spark.paimon.write.merge-schema control. This is a user-facing behavior change worth highlighting in the release notes.

Overall this looks like a good consolidation. @Zouxxyy please confirm CI passes for all Spark versions (3.2, 3.3, 3.4, 3.5, 4.0).

@Zouxxyy Zouxxyy marked this pull request as draft May 27, 2026 02:41
@Zouxxyy Zouxxyy marked this pull request as ready for review May 27, 2026 04:36
3 participants