[SPARK-55953][SQL] Compute net changes in ResolveChangelogTable for batch CDC reads #55583

SanJSp wants to merge 4 commits into apache:master
Conversation
```scala
    computeUpdates: Boolean): LogicalPlan = {
  val windowedPlan = addNetChangesWindow(plan, cl)
  val filteredAndRelabeledPlan =
    removeIntermediateChangelogEntriesAndRelabelChangeTypes(windowedPlan, computeUpdates)
```
nit:
```diff
-    removeIntermediateChangelogEntriesAndRelabelChangeTypes(windowedPlan, computeUpdates)
+    removeIntermediateChanges(windowedPlan, computeUpdates)
```
I'd prefer to keep the longer name. It makes both responsibilities of the function explicit (filter + relabel).
**gengliangwang** left a comment
Net-change implementation looks correct and the per-cell tests are thorough. Two notes:
- Dead code after the rejection path was removed: `cdcNetChangesNotYetSupported` at `sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala:3872-3876` and the corresponding `INVALID_CDC_OPTION.NET_CHANGES_NOT_YET_SUPPORTED` entry at `common/utils/src/main/resources/error/error-conditions.json:3295-3299` are now unreferenced; the only call site in `evaluateRequirements` is gone and the two `..._is_rejected` tests in `ResolveChangelogTablePostProcessingSuite` were deleted in this PR. Consider removing both in this PR (or as a quick follow-up) so the error catalog stays accurate.
- One inline comment below on test coverage of the combined post-processing pipeline.
```scala
cat.setChangelogProperties(ident, ChangelogProperties(
  containsIntermediateChanges = true,
  containsCarryoverRows = false,
  representsUpdateAsDeleteAndInsert = false,
```
The trait pins `representsUpdateAsDeleteAndInsert = false`, which keeps `addRowLevelPostProcessing` (update detection) out of the pipeline. As a result, the chained path where update detection's relabel produces `update_preimage`/`update_postimage` rows that then feed `injectNetChangeComputation` is not exercised end-to-end. Consider at least one variant with `representsUpdateAsDeleteAndInsert = true` so the integration of the two passes is covered.
Done; see the new `WithUpdateDetectionSuite` (16 tests × `representsUpdateAsDeleteAndInsert = true`, `computeUpdates = true`).
**dongjoon-hyun** left a comment
Please hold off on this CDC PR because the master branch was broken at a CDC commit and we are recovering it.
@gengliangwang regarding your top level comment:
Feel free to give it another check, thanks in advance 🙏
As Gengliang has resolved the issues, this PR should no longer be blocked, @dongjoon-hyun?
- `ResolveChangelogTable`: move the hoisted V2 rowId resolution inside the `if (req.requiresNetChanges)` branch. The hoist was needed to resolve against the bare `DataSourceV2Relation` (`V2ExpressionUtils.resolveRefs` doesn't work on the wrapped Project/Window plan), but it should not run unconditionally: connectors that report all of `containsCarryoverRows` / `containsIntermediateChanges` / `representsUpdateAsDeleteAndInsert` as false are allowed by the Changelog contract to inherit the default `rowId()` impl, which throws. Gating preserves the previous "only call when needed" behavior while keeping the V2-resolution fix.
- `pom.xml`: revert to upstream/master. The diff was stale-rebase noise (maven 3.9.15 -> 3.9.14, scala-maven-plugin 4.9.10 -> 4.9.9, commons-codec 1.22.0 -> 1.21.0, guava 33.6.0 -> 33.5.0, etc.), unrelated to this PR.
Force-pushed from 5d71af6 to 312495b.
Tests passed in https://github.com/gengliangwang/spark/actions/runs/25135903704.
### What changes were proposed in this pull request?
This PR adds the `netChanges` deduplication mode to the `ResolveChangelogTable` analyzer rule (SPARK-55668 / #55508). When a CDC read sets `deduplicationMode = 'netChanges'`, intermediate changes per row identity are collapsed into a single net effect, per the SPIP *Deduplication Semantics* in B.8.

**Net change collapse rules (per SPIP)**
**Implementation: 2x2 matrix on `(existedBefore, existsAfter)`**

The four SPIP rules map onto a 2x2 matrix that the implementation evaluates per `rowId` partition: `existedBefore` is true iff the partition's first event is `delete` or `update_preimage`; `existsAfter` is true iff the partition's last event is `insert` or `update_postimage`. These two booleans are sufficient to reproduce the SPIP rules above, because the SPIP only cares about whether the row existed at the boundaries of the version range, never about the intermediate events.
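The 2x2 matrix can be sketched as a pure function. This is a minimal illustrative model (the `NetChangeMatrix` object and `netChange` method are hypothetical names, not the analyzer's internals); only the change-type strings come from the SPIP semantics described above:

```scala
// Illustrative model of the (existedBefore, existsAfter) net-change matrix.
// NetChangeMatrix / netChange are hypothetical names, not Spark internals.
object NetChangeMatrix {
  def netChange(firstEvent: String, lastEvent: String,
                computeUpdates: Boolean): Seq[String] = {
    // existedBefore: the partition's first event implies the row already existed
    val existedBefore = firstEvent == "delete" || firstEvent == "update_preimage"
    // existsAfter: the partition's last event implies the row still exists
    val existsAfter = lastEvent == "insert" || lastEvent == "update_postimage"
    (existedBefore, existsAfter) match {
      case (false, false) => Seq.empty            // created and deleted within the range: no net change
      case (false, true)  => Seq("insert")        // net insert
      case (true, false)  => Seq("delete")        // net delete
      case (true, true) if computeUpdates =>
        Seq("update_preimage", "update_postimage") // net update
      case (true, true)   => Seq("delete", "insert") // update represented as delete + insert
    }
  }
}
```

For example, a row that is inserted and later deleted within the queried range produces no output at all, while a pre-existing row whose last event is `update_postimage` collapses to a single net update.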
If `computeUpdates = false`, the `update_preimage` + `update_postimage` pair is emitted as `delete` + `insert` instead.

Pipeline:
`Window` (per-`rowId` aggregates: row number, row count, first/last `_change_type`) → `Filter` (keep first and/or last row per partition) → `Project` (relabel `_change_type`, drop helper columns).
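The per-partition pipeline can be modeled over an in-memory collection. This is a sketch only, assuming a per-row sequence number stands in for the change ordering; `ChangeRow` and `collapse` are illustrative names, not the rule's actual `Window`/`Filter`/`Project` operators:

```scala
// Collection-based sketch of the three-stage net-change pipeline;
// not Spark's actual Window/Filter/Project operators.
case class ChangeRow(rowId: Int, seq: Long, changeType: String)

object NetChangePipeline {
  def collapse(rows: Seq[ChangeRow], computeUpdates: Boolean): Seq[ChangeRow] =
    rows.groupBy(_.rowId).toSeq.sortBy(_._1).flatMap { case (_, partition) =>
      // "Window" stage: order each rowId partition, find first/last events
      val ordered = partition.sortBy(_.seq)
      val (first, last) = (ordered.head, ordered.last)
      val existedBefore = first.changeType == "delete" || first.changeType == "update_preimage"
      val existsAfter = last.changeType == "insert" || last.changeType == "update_postimage"
      // "Filter" + "Project" stages: keep first/last rows and relabel them
      (existedBefore, existsAfter) match {
        case (false, false) => Seq.empty
        case (false, true)  => Seq(last.copy(changeType = "insert"))
        case (true, false)  => Seq(first.copy(changeType = "delete"))
        case (true, true) if computeUpdates =>
          Seq(first.copy(changeType = "update_preimage"),
              last.copy(changeType = "update_postimage"))
        case (true, true) =>
          Seq(first.copy(changeType = "delete"),
              last.copy(changeType = "insert"))
      }
    }
}
```

A row inserted and then updated within the range collapses to a single insert carrying the final event, and a row inserted and then deleted vanishes entirely.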
### Why are the changes needed?

This completes the net-change post-processing capability of the DSv2 CDC API per the SPIP. Without it, connectors that surface intermediate changes cannot expose a deduplicated change feed to users via the standard CDC API.
### Does this PR introduce any user-facing change?
Yes. Requesting `deduplicationMode = 'netChanges'` on a CDC read now produces a deduplicated change stream. Previously the same request was rejected up-front.

### How was this patch tested?
Added `ResolveChangelogTableNetChangesSuite`, a trait plus 2 concrete suite classes (`...WithComputeUpdatesSuite`, `...WithoutComputeUpdatesSuite`) running the same 16-test body under both modes (32 invocations total).

Removed 2 obsolete tests in `ResolveChangelogTablePostProcessingSuite` that asserted the previous "not supported" rejection.

### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (claude-opus-4-7)