
[SPARK-56686][SQL] Support streaming row-level CDC post-processing #55636

Open
gengliangwang wants to merge 7 commits into apache:master from gengliangwang:streamingCDC

Conversation

@gengliangwang
Member

What changes were proposed in this pull request?

This PR implements row-level CDC post-processing (carry-over removal and update detection) for DSv2 streaming reads. Previously, streaming changes() rejected any post-processing with a blanket INVALID_CDC_OPTION.STREAMING_POST_PROCESSING_NOT_SUPPORTED error.

The batch path (added in #55508 and #55583) uses a Catalyst Window keyed by (rowId, _commit_version), which UnsupportedOperationChecker rejects on streaming queries (NON_TIME_WINDOW_NOT_SUPPORTED_IN_STREAMING). The streaming rewrite in ResolveChangelogTable now expresses the same logic with streaming-allowed primitives:

EventTimeWatermark(_commit_timestamp, 0s)
  -> Aggregate keyed by (rowId..., _commit_version, _commit_timestamp)
       (count_if delete/insert, [min/max/count rowVersion,] collect_list(struct(*)))
  -> [Filter on the carry-over predicate]
  -> Generate(Inline(events))
  -> [Project relabeling _change_type for delete+insert pairs]
  -> Project dropping __spark_cdc_* helpers

Including _commit_timestamp in the grouping keys is required to satisfy the Append-mode streaming aggregation contract (the watermark attribute must appear among the grouping expressions). By CDC convention all rows in a single commit share _commit_timestamp, so this is a no-op semantically relative to the batch (rowId, _commit_version) grouping.
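
The per-commit grouping and re-emit logic can be modeled outside Spark. The following is a minimal plain-Python sketch (field names like `row_id` and `value` are illustrative, not the actual column names) of the Aggregate -> Filter -> Generate(Inline) steps: buffer all events of a (rowId, commit_version, commit_timestamp) group, drop carry-over pairs, and relabel delete+insert pairs as updates:

```python
from collections import defaultdict

def postprocess_commit_groups(rows, drop_carryovers=True, compute_updates=True):
    """Simplified model of the streaming rewrite: group by
    (row_id, commit_version, commit_timestamp), buffer events into a list
    (the collect_list step), then filter/relabel and re-emit each group
    (the Generate(Inline) step)."""
    groups = defaultdict(list)
    for r in rows:
        groups[(r["row_id"], r["commit_version"], r["commit_timestamp"])].append(r)

    out = []
    for events in groups.values():
        deletes = [e for e in events if e["change_type"] == "delete"]
        inserts = [e for e in events if e["change_type"] == "insert"]
        pair = len(deletes) == 1 and len(inserts) == 1
        if drop_carryovers and pair and deletes[0]["value"] == inserts[0]["value"]:
            continue  # CoW carry-over pair: value unchanged, drop the whole group
        if compute_updates and pair:
            out.append({**deletes[0], "change_type": "update_preimage"})
            out.append({**inserts[0], "change_type": "update_postimage"})
        else:
            out.extend(events)
    return out
```

Groups with a single event (plain inserts or deletes) pass through unchanged, matching the optional Filter/Project steps being no-ops for them.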

deduplicationMode = netChanges is still rejected -- net change computation partitions by rowId alone and reasons over the entire requested range, which is fundamentally cross-batch. The existing error class INVALID_CDC_OPTION.STREAMING_POST_PROCESSING_NOT_SUPPORTED is replaced with the more specific INVALID_CDC_OPTION.STREAMING_NET_CHANGES_NOT_SUPPORTED, which now names the offending option and points users at the supported streaming alternatives.
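
Why netChanges is fundamentally cross-batch can be seen in a toy model (plain Python, simplified semantics, illustrative field names): the net result for a row depends on its first and last event over the whole requested range, so it cannot be finalized per micro-batch:

```python
def net_changes(rows):
    """Simplified model of netChanges: partition by row_id alone over the
    ENTIRE requested range and keep only the net effect per row. Each new
    commit can change the answer for a row, so no per-micro-batch result
    is ever final."""
    by_row = {}
    for r in sorted(rows, key=lambda r: r["commit_version"]):
        first, _ = by_row.get(r["row_id"], (r, r))
        by_row[r["row_id"]] = (first, r)
    out = []
    for first, last in by_row.values():
        if first["change_type"] == "insert" and last["change_type"] == "delete":
            continue  # inserted and deleted within the range: no net change
        out.append(last)
    return out
```

Contrast with the sketch of carry-over removal, whose groups are keyed by commit version and therefore close as soon as that commit's micro-batch is processed.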

Doc updates:

  • Changelog.java clarifies that all rows of a single _commit_version must share _commit_timestamp, and that streaming reads expect non-decreasing _commit_timestamp across micro-batches.
  • Changelog.java notes that containsIntermediateChanges() is range-scoped, hence the streaming limitation for netChanges.
  • DataStreamReader.changes() Scaladoc lists the netChanges streaming limitation.

Why are the changes needed?

Without this PR, any streaming CDC read against a connector that emits CoW carry-over pairs (containsCarryoverRows = true) or represents updates as raw delete+insert (representsUpdateAsDeleteAndInsert = true) raises an analysis error, forcing users to fall back to batch reads. The batch-only restriction is unnecessary for these passes -- they don't need cross-version state -- and it surprises users since the same options work on batch reads.

Does this PR introduce any user-facing change?

Yes.

  • Streaming spark.readStream.changes(...) now supports computeUpdates = true and deduplicationMode = dropCarryovers. Previously these threw INVALID_CDC_OPTION.STREAMING_POST_PROCESSING_NOT_SUPPORTED.
  • The error class INVALID_CDC_OPTION.STREAMING_POST_PROCESSING_NOT_SUPPORTED is renamed to INVALID_CDC_OPTION.STREAMING_NET_CHANGES_NOT_SUPPORTED with a more specific message. The new error fires only for deduplicationMode = netChanges on streaming reads.
  • DataStreamReader.changes() Scaladoc is updated accordingly.
  • Changelog.java Scaladoc clarifies the _commit_timestamp contract for streaming.

How was this patch tested?

86 tests across 4 CDC suites (all passing):

  • ResolveChangelogTableStreamingPostProcessingSuite (new, 5 tests) -- plan-shape assertions covering carry-over only, update detection only, both fused, and the no-rewrite pass-through cases. Verifies the EventTimeWatermark + Aggregate + Generate(Inline) rewrite shape.
  • ChangelogResolutionSuite -- the two existing streaming throw-tests are flipped to plan-shape assertions; a new test covers the netChanges streaming throw.
  • ResolveChangelogTablePostProcessingSuite -- the existing streaming throw test is updated to cover the netChanges-only case.
  • ChangelogEndToEndSuite -- three new streaming end-to-end tests using InMemoryChangelogCatalog: carry-over removal drops CoW pairs, update detection relabels delete+insert as update, and netChanges throws.

Also confirmed UnsupportedOperationsSuite (216 tests) still passes -- the rewritten plan does not contain Window or any other streaming-rejected operator.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (claude-opus-4-7)

Implements carry-over removal and update detection for DSv2 CDC streaming
reads, which previously rejected any post-processing with a blanket error.

The batch path uses a Catalyst Window keyed by (rowId, _commit_version),
which UnsupportedOperationChecker rejects on streaming queries
(NON_TIME_WINDOW_NOT_SUPPORTED_IN_STREAMING). The streaming rewrite
expresses the same logic with streaming-allowed primitives:
EventTimeWatermark on _commit_timestamp -> Aggregate keyed by
(rowId..., _commit_version, _commit_timestamp) buffering events into a
collect_list of structs -> [Filter on the carry-over predicate] ->
Generate(Inline(events)) to re-emit rows -> [Project relabeling
_change_type for delete+insert pairs] -> drop helper columns.

deduplicationMode=netChanges remains batch-only; it requires partitioning
by rowId across the entire requested range and is fundamentally
cross-batch. The existing
INVALID_CDC_OPTION.STREAMING_POST_PROCESSING_NOT_SUPPORTED error is
replaced with the more specific
INVALID_CDC_OPTION.STREAMING_NET_CHANGES_NOT_SUPPORTED, which now also
points users at the supported streaming alternatives.

Also clarifies the Changelog.java contract that all rows of a single
_commit_version must share _commit_timestamp and that streaming reads
expect non-decreasing _commit_timestamp across micro-batches, plus a
note in DataStreamReader.changes() Scaladoc about the netChanges
streaming limitation.
@gengliangwang
Member Author

@huaxingao FYI this is part of the SPIP for CDC support (SPARK-55668), targeting the Spark 4.2 release. We're aiming to get it ready and merged ASAP.

@gengliangwang
Member Author

cc @johanl-db @SanJSp

* 6. Final [[Project]] (via [[removeHelperColumns]]) drops `__spark_cdc_*` helpers so
* the output schema matches the connector's declared schema.
*/
private def addStreamingRowLevelPostProcessing(
Member

Does this only make sense for append mode? If so, we may need to enforce it with a check to prevent update/complete mode?

Member Author

Good catch. Implemented in dee5e84 — added a CDC-specific case in UnsupportedOperationChecker.checkForStreaming that detects the rewrite by the __spark_cdc_events helper aggregate expression and throws STREAMING_OUTPUT_MODE.UNSUPPORTED_OPERATION for non-Append modes. Negative end-to-end tests added in ChangelogEndToEndSuite for both Update and Complete output modes.

.writeStream
.format("memory")
.queryName("cdc_stream_carryover")
.outputMode("append")
Member

Can we also add tests for update/complete mode?

Member Author

Done in dee5e84 — instead of behavioral tests, the rewrite now explicitly rejects Update / Complete output modes (see the reply on addStreamingRowLevelPostProcessing above), so the new tests are negative ones asserting STREAMING_OUTPUT_MODE.UNSUPPORTED_OPERATION is raised at writer-start time. The error message names "Change Data Capture (CDC) streaming reads with post-processing" so the failure mode is discoverable.

* contract every row in a single commit shares `_commit_timestamp`, so taking it
* as event time is safe. Note: this is currently the only analyzer rule that
* auto-injects an [[EventTimeWatermark]] (others resolve user-supplied watermarks).
* The watermark metadata is preserved on the user-visible `_commit_timestamp`
Member

This will expose internal watermark metadata, right? This watermark might involve in downstream multi-watermark policy and change the state eviction of later join/aggregation. Can we remove this _commit_timestamp watermark metadata once we don't need it?

Member Author

Right, fixed in dee5e84. Added a final Project at the boundary of the streaming rewrite (stripCommitTimestampWatermarkMetadata) that recreates _commit_timestamp with EventTimeWatermark.delayKey removed from its metadata. The watermark is preserved internally on the Aggregate's grouping attribute (so the rewrite still works), but no longer leaks to the user-visible output, so a downstream withWatermark on a different column won't interact with our auto-injected watermark via the global multi-watermark policy. New plan-shape test: "watermark metadata is stripped from user-visible _commit_timestamp".

@viirya (Member) left a comment

I’m concerned that the proposed streaming contract for _commit_timestamp may be too weak for a zero-delay watermark. The docs say connectors should emit _commit_timestamp in non-decreasing order across micro-batches, but non-decreasing still allows a later micro-batch to contain rows with the same timestamp as the previous batch’s max event time.

With EventTimeWatermark(_commit_timestamp, 0s), once a batch observes max event time T, the next batch can advance the watermark to T and the aggregate may emit/evict groups with eventTime <= T. If another row for the same commit, or another commit with the same _commit_timestamp = T, arrives in that next batch, it may be treated as late or arrive after the group has already been finalized. That could make carry-over removal or update detection operate on an incomplete delete/insert pair.

Should the contract be stronger here? For example, require that all rows for a commit are emitted within the same micro-batch, and that commit timestamps strictly increase across micro-batches when streaming post-processing is enabled. Alternatively, the implementation may need a non-zero delay or another way to avoid finalizing groups while equal-timestamp rows can still arrive.
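
The hazard described here can be reproduced in a toy model of Spark's watermark handling (plain Python, not Spark's actual implementation): both the late-event filter and state eviction use eventTime <= watermark, so with zero delay, equal-timestamp rows in a later micro-batch are dropped:

```python
def run_batches(batches, delay=0):
    """Toy model of a streaming aggregate with a watermark: the late-event
    filter and state eviction both use eventTime <= watermark (mirroring the
    LessThanOrEqual predicate in WatermarkSupport). Each batch is a list of
    event timestamps; returns the timestamps dropped as late."""
    watermark = float("-inf")
    dropped = []
    for batch in batches:
        # Late-event filter runs against the watermark from previous batches.
        dropped += [ts for ts in batch if ts <= watermark]
        if batch:
            watermark = max(watermark, max(batch) - delay)
    return dropped
```

In this model, `run_batches([[20], [20]])` drops the second row (equal timestamp after the watermark reached 20), and `run_batches([[20], [10]])` drops the out-of-order commit, while a strictly increasing sequence like `[[10], [20]]` drops nothing.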

…l rewrite

Three fixes from viirya's review on apache#55636:

1. Strip the auto-injected EventTimeWatermark metadata from the user-visible
   `_commit_timestamp` output. The metadata flowed through `Generate(Inline)`
   onto the public output, where it would have interacted with downstream
   user-supplied watermarks via the global multi-watermark policy. A final
   Project at the boundary of the rewrite now removes
   `EventTimeWatermark.delayKey` so the watermark stays internal-only.

2. Reject non-Append output modes for streaming CDC reads with post-processing.
   The injected streaming Aggregate's append-mode emission (one group per
   `_commit_timestamp` once the watermark advances past it) is the only
   semantically valid mode -- Update would re-emit per-batch state changes,
   Complete would re-emit the full result table per batch, neither matching
   batch CDC semantics. UnsupportedOperationChecker now detects the rewrite
   by the `__spark_cdc_events` helper aggregate expression and throws
   STREAMING_OUTPUT_MODE.UNSUPPORTED_OPERATION for non-Append modes.

3. Tighten the `_commit_timestamp` streaming contract in `Changelog.java`.
   The previous "non-decreasing across micro-batches" wording was too weak:
   Spark's stateful aggregate evicts groups with `eventTime <= watermark`
   (statefulOperators.scala:643-650), so equal-timestamp rows in a later
   micro-batch would be dropped as late. The contract now requires that all
   rows of a single commit appear in the same micro-batch -- the natural
   atomic-commit emission pattern of real CDC connectors (Delta versions,
   Iceberg snapshots) -- which makes the zero-delay watermark sound.

Adds a plan-shape test asserting no watermark metadata leaks to user-visible
output, and two end-to-end negative tests covering Update / Complete output
mode rejection.
@gengliangwang
Member Author

@viirya re: the streaming _commit_timestamp contract concern — you're right. Verified against statefulOperators.scala:643-650 that the eviction predicate is LessThanOrEqual (i.e., a group at event time T evicts when watermark >= T), so equal-timestamp rows arriving in a later micro-batch would be dropped as late under the previous "non-decreasing" wording.

Tightened the contract in dee5e84: Changelog.java now requires that all rows of a single commit appear in the same micro-batch (i.e. micro-batch boundaries align with commit boundaries) for streaming reads with post-processing. This is the natural atomic-commit emission pattern of real CDC connectors (Delta versions, Iceberg snapshots), and it makes the zero-delay watermark sound: within any micro-batch, all rows of a commit are observed before the watermark advances past that commit's timestamp.

I considered a non-zero delay as a belt-and-suspenders guard, but it would only paper over connectors that violate the atomic-microbatch rule (a contract violation anyway), and would delay every user's output by the chosen interval. Happy to revisit if the explicit contract turns out to be too restrictive in practice.

…treaming CDC

Address the second sub-case from viirya's review on apache#55636. The previous
contract change covered the same-commit-split-across-micro-batches case via
"all rows of a single commit must appear in the same micro-batch", but missed
the case where two DIFFERENT commits with the same `_commit_timestamp` arrive
in different micro-batches.

Spark's late-event filter and state-eviction predicate both use
`LessThanOrEqual` (`statefulOperators.scala:633-651`), so once a micro-batch
observes max event time T and advances the watermark to T, any later row at
exactly `_commit_timestamp = T` is silently dropped as late. The
atomic-microbatch contract alone doesn't rule this out for distinct commits.

Adds a second contract requirement: distinct `_commit_version` values must
have distinct `_commit_timestamp` values when streaming post-processing is
enabled. Atomic-commit CDC connectors that derive `_commit_timestamp` from
wall-clock time at commit time (Delta, Iceberg) naturally satisfy this.

Doc-only change; no code modifications. The existing tests already exercise
the supported cases; the unsupported case 2 is by definition a connector
contract violation, so we don't add a test for it.
@gengliangwang
Member Author

@viirya re-reading your concern more carefully: the atomic-microbatch contract from dee5e84 only covers the same-commit-split case. The case you also called out — different commits with the same _commit_timestamp arriving in different micro-batches — is still broken under that contract alone, since Spark's late-event filter also uses LessThanOrEqual (the same WatermarkSupport.watermarkExpression at statefulOperators.scala:633-651 is shared between eviction and late-event filtering). So a v2-row at ts=T arriving in batch 2 after batch 1 advanced the watermark to T would be silently dropped as late.

Tightened the contract again in ffa0646 to add a second requirement: distinct _commit_version values must have distinct _commit_timestamp values when streaming post-processing is enabled. That rules out the different-commit collision. Atomic-commit CDC connectors that derive _commit_timestamp from wall-clock time at commit time (Delta, Iceberg) naturally satisfy this — and the contract fails fast at the connector boundary if a connector violates it (rather than silently producing wrong results).

I still didn't go with a non-zero watermark delay because it would impose latency on every user even when the connector respects the contract; the explicit two-requirement contract makes the failure mode discoverable instead. Happy to revisit if real connectors turn out to need the slack.

…ermark-strip

Two follow-ups on the streaming CDC row-level rewrite:

1. `dev/lint-scala` runs scalafmt on `sql/api`; my prior edit to
   `DataStreamReader.changes()` left the Scaladoc lines wrapped at the
   wrong column. Re-flowed via
   `./build/mvn scalafmt:format -pl sql/api`.

2. Updated the user-visible Scaladoc on `DataStreamReader.changes()` to
   reflect the watermark-metadata strip from dee5e84. The previous wording
   said "the watermark metadata is preserved on the user-visible
   `_commit_timestamp` output ... global watermark becomes the min of the
   two" -- that was accurate before the strip, but is now stale. The new
   wording says the metadata is stripped (so downstream user-supplied
   watermarks do not interact with it via the global multi-watermark
   policy) and explicitly notes that streaming row-level post-processing
   constrains the query to Append output mode.

Note: the Java unidoc CI step is failing on an unrelated pre-existing
name-clash error in `core/target/java/.../JavaSparkContext.java:415`
(`<K,V>union(Seq<JavaPairRDD<K,V>>)` vs `<T>union(Seq<JavaRDD<T>>)` --
same erasure). Verified identical to upstream master, so it's not from
this PR.
…treaming row-level rewrite

Address @zikangh's review on apache#55637 -- the streaming row-level rewrite should
enforce non-NULL _commit_timestamp, mirroring the runtime guard in
CdcNetChangesStatefulProcessor.

A NULL _commit_timestamp on a streaming read is a connector contract
violation that would silently stall the row's group: the downstream
streaming Aggregate uses _commit_timestamp as an event-time watermark
column AND a grouping key, and Spark's eviction predicate is
LessThanOrEqual(eventTime, watermark) -- a NULL group key never
satisfies that, so the group sits in state until end of stream
producing no output and no error.

Add a Filter at the top of the streaming row-level rewrite that raises
CHANGELOG_CONTRACT_VIOLATION.NULL_COMMIT_TIMESTAMP via the same
RaiseError pattern used for the multiple-changes-per-row-version guard
in the batch path. Also adds the new error class to
error-conditions.json.

Tests:
 - Plan-shape tests: assert the guard Filter is present and sits
   directly above the streaming relation (so it runs before any
   downstream operator sees the NULL).
 - End-to-end test: feeding a row with a NULL _commit_timestamp
   surfaces CHANGELOG_CONTRACT_VIOLATION.NULL_COMMIT_TIMESTAMP at the
   streaming query level rather than producing no output.
 - Existing carry-over / update-detection plan-shape tests updated for
   the extra guard Filter (was 1 -> now 2 Filters in carry-over and
   combined paths; was 0 -> now 1 in update-detection-only).

Also refreshed the addStreamingRowLevelPostProcessing Scaladoc to add a
step 0 (the guard) and step 7 (the watermark-metadata strip), keeping
the per-operator detail aligned with the rewrite's actual shape.

Doc-only side effect: scalafmt reflowed the watermark-metadata bullet
in DataStreamReader.changes() Scaladoc (no semantic change).
// CaseWhen returns the default branch (true) for non-null timestamps and
// evaluates the side-effecting RaiseError for nulls; the row never passes the
// filter on a contract violation.
val checkExpr = CaseWhen(Seq(IsNull(commitTsAttr) -> raise), Literal(true))
Member

Spark’s NullPropagation can rewrite IsNull(c) to false when c.nullable == false. Since _commit_timestamp is now a required non-NULL contract field, a connector may reasonably declare it non-nullable. In that case the fail-fast guard could disappear, and a malformed runtime NULL would not raise NULL_COMMIT_TIMESTAMP as intended. The current test catalog appears to declare CDC metadata nullable, so the new test does not cover this case.
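
The optimization hazard can be illustrated with a toy model (plain Python, not Spark's optimizer; the error string mimics the new error class): when the planner believes the column is non-nullable, the IsNull branch constant-folds away and a malformed runtime NULL slips through unchecked:

```python
def guard_filter(rows, nullable=True):
    """Toy model of the CaseWhen null guard under NullPropagation: if the
    optimizer believes the column cannot be NULL, IsNull(c) folds to false
    at plan time and the RaiseError branch is eliminated, so a runtime NULL
    is never checked."""
    def check(ts):
        if not nullable:
            return True  # IsNull folded to false: the guard was optimized away
        if ts is None:
            raise ValueError("CHANGELOG_CONTRACT_VIOLATION.NULL_COMMIT_TIMESTAMP")
        return True
    return [ts for ts in rows if check(ts)]
```

With `nullable=True` a NULL raises the contract violation as intended; with `nullable=False` the same NULL passes the filter silently, which is the gap this review comment points out for connectors that declare `_commit_timestamp` non-nullable.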

* requirement 2 rules out the different-commit case. Atomic-commit CDC connectors
* (e.g. Delta versions, Iceberg snapshots) that derive {@code _commit_timestamp}
* from wall-clock time at commit time naturally satisfy both requirements.
* Behavior is undefined if {@code _commit_timestamp} is {@code NULL} on any row
Member

It says NULL behavior is “undefined,” but the latest code now intentionally raises CHANGELOG_CONTRACT_VIOLATION.NULL_COMMIT_TIMESTAMP. That doc should be updated to match the new fail-fast behavior.

* {@code _commit_timestamp}. For streaming reads with post-processing enabled,
* two additional requirements apply:
* <ol>
* <li>All rows of a single commit must appear in the same micro-batch (i.e.
Member

The new requirements fix the “same commit split across batches” case and the “same timestamp in later batch” case only if timestamps also arrive in increasing event-time order. But the doc no longer explicitly requires that every later micro-batch has _commit_timestamp greater than the previous watermark/max.

Example:

batch 1: commit v2, ts = 20
batch 2: commit v3, ts = 10

Timestamps are distinct, and each commit is atomic, but batch 2 is late after watermark 20. So the real required invariant is closer to: no later micro-batch may contain rows with _commit_timestamp <= previous max event time. Also, “distinct commit versions must have distinct timestamps” is stronger than necessary and may be unrealistic for ms-resolution commit timestamps; equal timestamps are safe if all such commits are emitted before the watermark advances.

@viirya (Member) left a comment

Looks good generally. I have other 3 comments where most are doc-only issues.
