
[SPARK-55953][SQL] Compute net changes in ResolveChangelogTable for batch CDC reads #55583

Closed
SanJSp wants to merge 4 commits into apache:master from SanJSp:SPARK-55953-net-changes

Contributor

@SanJSp SanJSp commented Apr 28, 2026

What changes were proposed in this pull request?

This PR adds the netChanges deduplication mode to the ResolveChangelogTable analyzer rule (SPARK-55668 / #55508). When a CDC read sets deduplicationMode = 'netChanges', the intermediate changes for each row identity are collapsed into a single net effect, following the SPIP's Deduplication Semantics (B.8).

Net change collapse rules (per SPIP)

| Change Sequence | Net Result |
| --- | --- |
| INSERT → DELETE | No row (cancels out) |
| INSERT → UPDATE(s) | Final insert only |
| UPDATE(s) | First pre-image + final post-image |
| UPDATE(s) → DELETE | Delete with original values |

Implementation: 2x2 matrix on (existedBefore, existsAfter)

The four SPIP rules map onto a 2x2 matrix that the implementation evaluates per rowId partition:

  • existedBefore is true iff the partition's first event is delete or update_preimage.
  • existsAfter is true iff the partition's last event is insert or update_postimage.

These two booleans are sufficient to reproduce the SPIP rules above, because the SPIP only cares about whether the row existed at the boundaries of the version range — never about the intermediate events.

| existedBefore | existsAfter | output |
| --- | --- | --- |
| false | false | (cancel) |
| false | true | insert |
| true | false | delete |
| true | true | update_preimage + update_postimage |

If computeUpdates = false, the update_preimage + update_postimage pair is emitted as delete + insert instead.
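As a rough illustration of the matrix above, the decision can be written as a small pure function. This is a plain-Scala sketch, not the analyzer-rule code from this PR; the object name `NetChangeMatrix` and the function `netChangeTypes` are hypothetical and exist only for this example.

```scala
object NetChangeMatrix {
  // Change-type labels as they appear in the _change_type column.
  val Insert = "insert"
  val Delete = "delete"
  val UpdatePreimage = "update_preimage"
  val UpdatePostimage = "update_postimage"

  /** Labels to emit for one rowId partition, given its first and last change type. */
  def netChangeTypes(first: String, last: String, computeUpdates: Boolean): Seq[String] = {
    // The row existed before the range iff the first event removes or replaces it.
    val existedBefore = first == Delete || first == UpdatePreimage
    // The row exists after the range iff the last event creates or rewrites it.
    val existsAfter = last == Insert || last == UpdatePostimage

    (existedBefore, existsAfter) match {
      case (false, false) => Seq.empty   // created and removed inside the range
      case (false, true)  => Seq(Insert)
      case (true, false)  => Seq(Delete)
      case (true, true) if computeUpdates => Seq(UpdatePreimage, UpdatePostimage)
      case (true, true)   => Seq(Delete, Insert)   // computeUpdates = false
    }
  }
}
```

For example, `NetChangeMatrix.netChangeTypes("insert", "delete", computeUpdates = true)` yields an empty sequence, which corresponds to the "cancels out" row of the SPIP table.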

Pipeline: Window (per-rowId aggregates: row number, row count, first/last _change_type) → Filter (keep first and/or last row per partition) → Project (relabel _change_type, drop helper columns).
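The three stages can be mimicked on plain Scala collections to show what each one contributes. This is only a sketch under assumptions: the `Change` case class, the ordering by `version` (ties keep input order), and the choice to keep the first or last event's version on the emitted row are all hypothetical and are not taken from the PR's Catalyst implementation.

```scala
object NetChangesPipelineSketch {
  // Hypothetical in-memory stand-in for one changelog row.
  final case class Change(rowId: Int, version: Long, changeType: String, value: String)

  def netChanges(changes: Seq[Change], computeUpdates: Boolean): Seq[Change] =
    changes
      .groupBy(_.rowId)          // "Window": one partition per rowId
      .toSeq
      .sortBy(_._1)              // deterministic output order for the example
      .flatMap { case (_, events) =>
        val ordered = events.sortBy(_.version)   // stable sort: ties keep input order
        val first = ordered.head
        val last = ordered.last
        val existedBefore =
          first.changeType == "delete" || first.changeType == "update_preimage"
        val existsAfter =
          last.changeType == "insert" || last.changeType == "update_postimage"

        // "Filter": keep the first row only if the row existed before the range,
        // and the last row only if it exists after. "Project": relabel the kept rows.
        val pre =
          if (existedBefore) {
            val label = if (existsAfter && computeUpdates) "update_preimage" else "delete"
            Some(first.copy(changeType = label))
          } else None
        val post =
          if (existsAfter) {
            val label = if (existedBefore && computeUpdates) "update_postimage" else "insert"
            Some(last.copy(changeType = label))
          } else None

        pre.toSeq ++ post.toSeq
      }
}
```

In this sketch an update followed by a delete keeps the first pre-image row and relabels it as `delete`, which is how the "delete with original values" rule falls out of the first/last filtering.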

Why are the changes needed?

This completes the net-change post-processing capability of the DSv2 CDC API per the SPIP. Without it, connectors that surface intermediate changes cannot expose a deduplicated change feed to users via the standard CDC API.

Does this PR introduce any user-facing change?

Yes. Requesting deduplicationMode = 'netChanges' on a CDC read now produces a deduplicated change stream. Previously the same request was rejected up-front.

How was this patch tested?

Added ResolveChangelogTableNetChangesSuite — a trait + 2 concrete suite classes (...WithComputeUpdatesSuite, ...WithoutComputeUpdatesSuite) running the same 16-test body under both modes (32 invocations total). Coverage:

  • 4 single-event tests (lone insert/delete across various range shapes).
  • 9 matrix tests covering all (first_change_type, last_change_type) cells.
  • 1 range-narrowing test (events outside the requested version range are not seen).
  • 2 multi-rowId tests (independent partitions, mixed mode-dependent cells).

Removed 2 obsolete tests in ResolveChangelogTablePostProcessingSuite that asserted the previous "not supported" rejection.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (claude-opus-4-7)

Contributor

@johanl-db johanl-db left a comment

Only minor comments

computeUpdates: Boolean): LogicalPlan = {
val windowedPlan = addNetChangesWindow(plan, cl)
val filteredAndRelabeledPlan =
removeIntermediateChangelogEntriesAndRelabelChangeTypes(windowedPlan, computeUpdates)
Contributor

nit:

Suggested change
- removeIntermediateChangelogEntriesAndRelabelChangeTypes(windowedPlan, computeUpdates)
+ removeIntermediateChanges(windowedPlan, computeUpdates)

Contributor Author

I'd prefer to keep the longer name. It makes both responsibilities of the function explicit (filter + relabel).

Member

@gengliangwang gengliangwang left a comment

Net-change implementation looks correct and the per-cell tests are thorough. Two notes:

  • Dead code after the rejection path was removed: cdcNetChangesNotYetSupported at sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala:3872-3876 and the corresponding INVALID_CDC_OPTION.NET_CHANGES_NOT_YET_SUPPORTED entry at common/utils/src/main/resources/error/error-conditions.json:3295-3299 are now unreferenced — the only call site in evaluateRequirements is gone and the two ..._is_rejected tests in ResolveChangelogTablePostProcessingSuite were deleted in this PR. Consider removing both in this PR (or as a quick follow-up) so the error catalog stays accurate.

  • One inline comment below on test coverage of the combined post-processing pipeline.

cat.setChangelogProperties(ident, ChangelogProperties(
containsIntermediateChanges = true,
containsCarryoverRows = false,
representsUpdateAsDeleteAndInsert = false,
Member

The trait pins representsUpdateAsDeleteAndInsert = false, which keeps addRowLevelPostProcessing (update detection) out of the pipeline. As a result, the chained path where update detection's relabel produces update_preimage/update_postimage rows that then feed injectNetChangeComputation is not exercised end-to-end. Consider at least one variant with representsUpdateAsDeleteAndInsert = true so the integration of the two passes is covered.

Contributor Author

Done, see the new WithUpdateDetectionSuite (the same 16 tests, run with representsUpdateAsDeleteAndInsert = true and computeUpdates = true).

Member

@dongjoon-hyun dongjoon-hyun left a comment

Please hold off on this CDC PR: the master branch was broken by a CDC commit and we are recovering it.

Contributor Author

SanJSp commented Apr 29, 2026

@gengliangwang regarding your top-level comment:
I've removed the dead code. The new WithUpdateDetectionSuite actually caught a bug, so well spotted 👍
Claude's explanation of the bug:

injectNetChangeComputation was calling V2ExpressionUtils.resolveRefs(cl.rowId(), plan) on the Project(Window(...)) produced by addRowLevelPostProcessing, and V2 resolution only works on a bare DataSourceV2Relation. Fixed by hoisting rowId resolution to the top of apply against the bare relation; the resolved Catalyst attributes flow through any wrapping operators by ExprId, so addNetChangesWindow doesn't re-resolve.

Feel free to give it another check, thanks in advance 🙏

Contributor Author

SanJSp commented Apr 29, 2026

Since Gengliang has resolved the issues, this PR should no longer be blocked, @dongjoon-hyun?
See commit 83af083

SanJSp and others added 4 commits April 29, 2026 21:30
- ResolveChangelogTable: move the hoisted V2 rowId resolution inside the
  `if (req.requiresNetChanges)` branch. The hoist was needed to resolve
  against the bare DataSourceV2Relation (V2ExpressionUtils.resolveRefs
  doesn't work on the wrapped Project/Window plan), but it should not run
  unconditionally: connectors that report all of containsCarryoverRows /
  containsIntermediateChanges / representsUpdateAsDeleteAndInsert as
  false are allowed by the Changelog contract to inherit the default
  rowId() impl, which throws. Gating preserves the previous "only call
  when needed" behavior while keeping the V2-resolution fix.

- pom.xml: revert to upstream/master. The diff was stale-rebase noise
  (maven 3.9.15->3.9.14, scala-maven-plugin 4.9.10->4.9.9, commons-codec
  1.22.0->1.21.0, guava 33.6.0->33.5.0, etc.), unrelated to this PR.
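
The gating described in the first commit note can be pictured with a toy model. This sketch is hedged: `ToyChangelog`, `Requirements`, and `resolveRowIdIfNeeded` are hypothetical stand-ins invented for this example and are not the DSv2 `Changelog` interface or the code in `ResolveChangelogTable`. It only shows why an unconditional call would break connectors that inherit a throwing default `rowId()`.

```scala
object GatedRowIdSketch {
  // Toy stand-in for a connector's changelog; NOT the real DSv2 interface.
  trait ToyChangelog {
    def containsIntermediateChanges: Boolean
    // Assumed default: connectors that need no post-processing may leave
    // rowId() unimplemented, in which case calling it throws.
    def rowId(): Seq[String] =
      throw new UnsupportedOperationException("rowId() is not implemented")
  }

  // Toy stand-in for the evaluated post-processing requirements.
  final case class Requirements(requiresNetChanges: Boolean)

  /** Resolves the rowId columns only when net-change computation needs them. */
  def resolveRowIdIfNeeded(cl: ToyChangelog, req: Requirements): Option[Seq[String]] =
    if (req.requiresNetChanges) Some(cl.rowId()) else None

  // A connector that reports no intermediate changes and keeps the default rowId().
  val plain: ToyChangelog = new ToyChangelog {
    val containsIntermediateChanges = false
  }

  // A connector that surfaces intermediate changes and therefore provides rowId().
  val keyed: ToyChangelog = new ToyChangelog {
    val containsIntermediateChanges = true
    override def rowId(): Seq[String] = Seq("id")
  }
}
```

With the gate in place, `resolveRowIdIfNeeded(plain, Requirements(requiresNetChanges = false))` never touches `rowId()`, so the throwing default is harmless; calling `plain.rowId()` unconditionally would raise `UnsupportedOperationException`.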
@gengliangwang gengliangwang force-pushed the SPARK-55953-net-changes branch from 5d71af6 to 312495b on April 29, 2026 21:31
@gengliangwang
Member

Tests passed in https://github.com/gengliangwang/spark/actions/runs/25135903704.
Merging to master
