
[SPARK-56686][SQL] Support streaming row-level CDC post-processing #55636

Open
gengliangwang wants to merge 7 commits into apache:master from gengliangwang:streamingCDC

Conversation

@gengliangwang
Member

What changes were proposed in this pull request?

This PR implements row-level CDC post-processing (carry-over removal and update detection) for DSv2 streaming reads. Previously, streaming changes() rejected any post-processing with a blanket INVALID_CDC_OPTION.STREAMING_POST_PROCESSING_NOT_SUPPORTED error.

The batch path (added in #55508 and #55583) uses a Catalyst Window keyed by (rowId, _commit_version), which UnsupportedOperationChecker rejects on streaming queries (NON_TIME_WINDOW_NOT_SUPPORTED_IN_STREAMING). The streaming rewrite in ResolveChangelogTable now expresses the same logic with streaming-allowed primitives:

EventTimeWatermark(_commit_timestamp, 0s)
  -> Aggregate keyed by (rowId..., _commit_version, _commit_timestamp)
       (count_if delete/insert, [min/max/count rowVersion,] collect_list(struct(*)))
  -> [Filter on the carry-over predicate]
  -> Generate(Inline(events))
  -> [Project relabeling _change_type for delete+insert pairs]
  -> Project dropping __spark_cdc_* helpers

Including _commit_timestamp in the grouping keys is required to satisfy the Append-mode streaming aggregation contract (the watermark attribute must appear among the grouping expressions). By CDC convention all rows in a single commit share _commit_timestamp, so this is a no-op semantically relative to the batch (rowId, _commit_version) grouping.
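
The per-commit grouping and re-emit logic can be modeled outside Spark. The following is a minimal plain-Python sketch (field names like `row_id` and `value` are illustrative, not the actual column names) of the Aggregate -> Filter -> Generate(Inline) steps: buffer all events of a (rowId, commit_version, commit_timestamp) group, drop carry-over pairs, and relabel delete+insert pairs as updates:

```python
from collections import defaultdict

def postprocess_commit_groups(rows, drop_carryovers=True, compute_updates=True):
    """Simplified model of the streaming rewrite: group by
    (row_id, commit_version, commit_timestamp), buffer events into a list
    (the collect_list step), then filter/relabel and re-emit each group
    (the Generate(Inline) step)."""
    groups = defaultdict(list)
    for r in rows:
        groups[(r["row_id"], r["commit_version"], r["commit_timestamp"])].append(r)

    out = []
    for events in groups.values():
        deletes = [e for e in events if e["change_type"] == "delete"]
        inserts = [e for e in events if e["change_type"] == "insert"]
        pair = len(deletes) == 1 and len(inserts) == 1
        if drop_carryovers and pair and deletes[0]["value"] == inserts[0]["value"]:
            continue  # CoW carry-over pair: value unchanged, drop the whole group
        if compute_updates and pair:
            out.append({**deletes[0], "change_type": "update_preimage"})
            out.append({**inserts[0], "change_type": "update_postimage"})
        else:
            out.extend(events)
    return out
```

Groups with a single event (plain inserts or deletes) pass through unchanged, matching the optional Filter/Project steps being no-ops for them.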

deduplicationMode = netChanges is still rejected -- net change computation partitions by rowId alone and reasons over the entire requested range, which is fundamentally cross-batch. The existing error class INVALID_CDC_OPTION.STREAMING_POST_PROCESSING_NOT_SUPPORTED is replaced with the more specific INVALID_CDC_OPTION.STREAMING_NET_CHANGES_NOT_SUPPORTED, which now names the offending option and points users at the supported streaming alternatives.
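
Why netChanges is fundamentally cross-batch can be seen in a toy model (plain Python, simplified semantics, illustrative field names): the net result for a row depends on its first and last event over the whole requested range, so it cannot be finalized per micro-batch:

```python
def net_changes(rows):
    """Simplified model of netChanges: partition by row_id alone over the
    ENTIRE requested range and keep only the net effect per row. Each new
    commit can change the answer for a row, so no per-micro-batch result
    is ever final."""
    by_row = {}
    for r in sorted(rows, key=lambda r: r["commit_version"]):
        first, _ = by_row.get(r["row_id"], (r, r))
        by_row[r["row_id"]] = (first, r)
    out = []
    for first, last in by_row.values():
        if first["change_type"] == "insert" and last["change_type"] == "delete":
            continue  # inserted and deleted within the range: no net change
        out.append(last)
    return out
```

Contrast with the sketch of carry-over removal, whose groups are keyed by commit version and therefore close as soon as that commit's micro-batch is processed.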

Doc updates:

  • Changelog.java clarifies that all rows of a single _commit_version must share _commit_timestamp, and that streaming reads expect non-decreasing _commit_timestamp across micro-batches.
  • Changelog.java notes that containsIntermediateChanges() is range-scoped, hence the streaming limitation for netChanges.
  • DataStreamReader.changes() Scaladoc lists the netChanges streaming limitation.

Why are the changes needed?

Without this PR, any streaming CDC read against a connector that emits CoW carry-over pairs (containsCarryoverRows = true) or represents updates as raw delete+insert (representsUpdateAsDeleteAndInsert = true) raises an analysis error, forcing users to fall back to batch reads. The batch-only restriction is unnecessary for these passes -- they don't need cross-version state -- and it surprises users since the same options work on batch reads.

Does this PR introduce any user-facing change?

Yes.

  • Streaming spark.readStream.changes(...) now supports computeUpdates = true and deduplicationMode = dropCarryovers. Previously these threw INVALID_CDC_OPTION.STREAMING_POST_PROCESSING_NOT_SUPPORTED.
  • The error class INVALID_CDC_OPTION.STREAMING_POST_PROCESSING_NOT_SUPPORTED is renamed to INVALID_CDC_OPTION.STREAMING_NET_CHANGES_NOT_SUPPORTED with a more specific message. The new error fires only for deduplicationMode = netChanges on streaming reads.
  • DataStreamReader.changes() Scaladoc is updated accordingly.
  • Changelog.java Scaladoc clarifies the _commit_timestamp contract for streaming.

How was this patch tested?

86 tests across 4 CDC suites (all passing):

  • ResolveChangelogTableStreamingPostProcessingSuite (new, 5 tests) -- plan-shape assertions covering carry-over only, update detection only, both fused, and the no-rewrite pass-through cases. Verifies the EventTimeWatermark + Aggregate + Generate(Inline) rewrite shape.
  • ChangelogResolutionSuite -- the two existing streaming throw-tests are flipped to plan-shape assertions; a new test covers the netChanges streaming throw.
  • ResolveChangelogTablePostProcessingSuite -- the existing streaming throw test is updated to cover the netChanges-only case.
  • ChangelogEndToEndSuite -- three new streaming end-to-end tests using InMemoryChangelogCatalog: carry-over removal drops CoW pairs, update detection relabels delete+insert as update, and netChanges throws.

Also confirmed UnsupportedOperationsSuite (216 tests) still passes -- the rewritten plan does not contain Window or any other streaming-rejected operator.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (claude-opus-4-7)

Implements carry-over removal and update detection for DSv2 CDC streaming
reads, which previously rejected any post-processing with a blanket error.

The batch path uses a Catalyst Window keyed by (rowId, _commit_version),
which UnsupportedOperationChecker rejects on streaming queries
(NON_TIME_WINDOW_NOT_SUPPORTED_IN_STREAMING). The streaming rewrite
expresses the same logic with streaming-allowed primitives:
EventTimeWatermark on _commit_timestamp -> Aggregate keyed by
(rowId..., _commit_version, _commit_timestamp) buffering events into a
collect_list of structs -> [Filter on the carry-over predicate] ->
Generate(Inline(events)) to re-emit rows -> [Project relabeling
_change_type for delete+insert pairs] -> drop helper columns.

deduplicationMode=netChanges remains batch-only; it requires partitioning
by rowId across the entire requested range and is fundamentally
cross-batch. The existing
INVALID_CDC_OPTION.STREAMING_POST_PROCESSING_NOT_SUPPORTED error is
replaced with the more specific
INVALID_CDC_OPTION.STREAMING_NET_CHANGES_NOT_SUPPORTED, which now also
points users at the supported streaming alternatives.

Also clarifies the Changelog.java contract that all rows of a single
_commit_version must share _commit_timestamp and that streaming reads
expect non-decreasing _commit_timestamp across micro-batches, plus a
note in DataStreamReader.changes() Scaladoc about the netChanges
streaming limitation.
@gengliangwang
Member Author

@huaxingao FYI this is part of the SPIP for CDC support (SPARK-55668), targeting the Spark 4.2 release. We're aiming to get it ready and merged ASAP.

@gengliangwang
Member Author

cc @johanl-db @SanJSp

* 6. Final [[Project]] (via [[removeHelperColumns]]) drops `__spark_cdc_*` helpers so
* the output schema matches the connector's declared schema.
*/
private def addStreamingRowLevelPostProcessing(
Member

Does this only make sense for append mode? If so, we may need to enforce it with a check to prevent update/complete mode?

Member Author

Good catch. Implemented in dee5e84 — added a CDC-specific case in UnsupportedOperationChecker.checkForStreaming that detects the rewrite by the __spark_cdc_events helper aggregate expression and throws STREAMING_OUTPUT_MODE.UNSUPPORTED_OPERATION for non-Append modes. Negative end-to-end tests added in ChangelogEndToEndSuite for both Update and Complete output modes.

.writeStream
.format("memory")
.queryName("cdc_stream_carryover")
.outputMode("append")
Member

Can we also add tests for update/complete mode?

Member Author

Done in dee5e84 — instead of behavioral tests, the rewrite now explicitly rejects Update / Complete output modes (see the reply on addStreamingRowLevelPostProcessing above), so the new tests are negative ones asserting STREAMING_OUTPUT_MODE.UNSUPPORTED_OPERATION is raised at writer-start time. The error message names "Change Data Capture (CDC) streaming reads with post-processing" so the failure mode is discoverable.

* contract every row in a single commit shares `_commit_timestamp`, so taking it
* as event time is safe. Note: this is currently the only analyzer rule that
* auto-injects an [[EventTimeWatermark]] (others resolve user-supplied watermarks).
* The watermark metadata is preserved on the user-visible `_commit_timestamp`
Member

This will expose internal watermark metadata, right? This watermark might involve in downstream multi-watermark policy and change the state eviction of later join/aggregation. Can we remove this _commit_timestamp watermark metadata once we don't need it?

Member Author

Right, fixed in dee5e84. Added a final Project at the boundary of the streaming rewrite (stripCommitTimestampWatermarkMetadata) that recreates _commit_timestamp with EventTimeWatermark.delayKey removed from its metadata. The watermark is preserved internally on the Aggregate's grouping attribute (so the rewrite still works), but no longer leaks to the user-visible output, so a downstream withWatermark on a different column won't interact with our auto-injected watermark via the global multi-watermark policy. New plan-shape test: "watermark metadata is stripped from user-visible _commit_timestamp".

@viirya (Member) left a comment

I’m concerned that the proposed streaming contract for _commit_timestamp may be too weak for a zero-delay watermark. The docs say connectors should emit _commit_timestamp in non-decreasing order across micro-batches, but non-decreasing still allows a later micro-batch to contain rows with the same timestamp as the previous batch’s max event time.

With EventTimeWatermark(_commit_timestamp, 0s), once a batch observes max event time T, the next batch can advance the watermark to T and the aggregate may emit/evict groups with eventTime <= T. If another row for the same commit, or another commit with the same _commit_timestamp = T, arrives in that next batch, it may be treated as late or arrive after the group has already been finalized. That could make carry-over removal or update detection operate on an incomplete delete/insert pair.

Should the contract be stronger here? For example, require that all rows for a commit are emitted within the same micro-batch, and that commit timestamps strictly increase across micro-batches when streaming post-processing is enabled. Alternatively, the implementation may need a non-zero delay or another way to avoid finalizing groups while equal-timestamp rows can still arrive.
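
The hazard described here can be reproduced in a toy model of Spark's watermark handling (plain Python, not Spark's actual implementation): both the late-event filter and state eviction use eventTime <= watermark, so with zero delay, equal-timestamp rows in a later micro-batch are dropped:

```python
def run_batches(batches, delay=0):
    """Toy model of a streaming aggregate with a watermark: the late-event
    filter and state eviction both use eventTime <= watermark (mirroring the
    LessThanOrEqual predicate in WatermarkSupport). Each batch is a list of
    event timestamps; returns the timestamps dropped as late."""
    watermark = float("-inf")
    dropped = []
    for batch in batches:
        # Late-event filter runs against the watermark from previous batches.
        dropped += [ts for ts in batch if ts <= watermark]
        if batch:
            watermark = max(watermark, max(batch) - delay)
    return dropped
```

In this model, `run_batches([[20], [20]])` drops the second row (equal timestamp after the watermark reached 20), and `run_batches([[20], [10]])` drops the out-of-order commit, while a strictly increasing sequence like `[[10], [20]]` drops nothing.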

…l rewrite

Three fixes from viirya's review on apache#55636:

1. Strip the auto-injected EventTimeWatermark metadata from the user-visible
   `_commit_timestamp` output. The metadata flowed through `Generate(Inline)`
   onto the public output, where it would have interacted with downstream
   user-supplied watermarks via the global multi-watermark policy. A final
   Project at the boundary of the rewrite now removes
   `EventTimeWatermark.delayKey` so the watermark stays internal-only.

2. Reject non-Append output modes for streaming CDC reads with post-processing.
   The injected streaming Aggregate's append-mode emission (one group per
   `_commit_timestamp` once the watermark advances past it) is the only
   semantically valid mode -- Update would re-emit per-batch state changes,
   Complete would re-emit the full result table per batch, neither matching
   batch CDC semantics. UnsupportedOperationChecker now detects the rewrite
   by the `__spark_cdc_events` helper aggregate expression and throws
   STREAMING_OUTPUT_MODE.UNSUPPORTED_OPERATION for non-Append modes.

3. Tighten the `_commit_timestamp` streaming contract in `Changelog.java`.
   The previous "non-decreasing across micro-batches" wording was too weak:
   Spark's stateful aggregate evicts groups with `eventTime <= watermark`
   (statefulOperators.scala:643-650), so equal-timestamp rows in a later
   micro-batch would be dropped as late. The contract now requires that all
   rows of a single commit appear in the same micro-batch -- the natural
   atomic-commit emission pattern of real CDC connectors (Delta versions,
   Iceberg snapshots) -- which makes the zero-delay watermark sound.

Adds a plan-shape test asserting no watermark metadata leaks to user-visible
output, and two end-to-end negative tests covering Update / Complete output
mode rejection.
@gengliangwang
Member Author

@viirya re: the streaming _commit_timestamp contract concern — you're right. Verified against statefulOperators.scala:643-650 that the eviction predicate is LessThanOrEqual (i.e., a group at event time T evicts when watermark >= T), so equal-timestamp rows arriving in a later micro-batch would be dropped as late under the previous "non-decreasing" wording.

Tightened the contract in dee5e84: Changelog.java now requires that all rows of a single commit appear in the same micro-batch (i.e. micro-batch boundaries align with commit boundaries) for streaming reads with post-processing. This is the natural atomic-commit emission pattern of real CDC connectors (Delta versions, Iceberg snapshots), and it makes the zero-delay watermark sound: within any micro-batch, all rows of a commit are observed before the watermark advances past that commit's timestamp.

I considered a non-zero delay as a belt-and-suspenders guard, but it would only paper over connectors that violate the atomic-microbatch rule (a contract violation anyway), and would delay every user's output by the chosen interval. Happy to revisit if the explicit contract turns out to be too restrictive in practice.

…treaming CDC

Address the second sub-case from viirya's review on apache#55636. The previous
contract change covered the same-commit-split-across-micro-batches case via
"all rows of a single commit must appear in the same micro-batch", but missed
the case where two DIFFERENT commits with the same `_commit_timestamp` arrive
in different micro-batches.

Spark's late-event filter and state-eviction predicate both use
`LessThanOrEqual` (`statefulOperators.scala:633-651`), so once a micro-batch
observes max event time T and advances the watermark to T, any later row at
exactly `_commit_timestamp = T` is silently dropped as late. The
atomic-microbatch contract alone doesn't rule this out for distinct commits.

Adds a second contract requirement: distinct `_commit_version` values must
have distinct `_commit_timestamp` values when streaming post-processing is
enabled. Atomic-commit CDC connectors that derive `_commit_timestamp` from
wall-clock time at commit time (Delta, Iceberg) naturally satisfy this.

Doc-only change; no code modifications. The existing tests already exercise
the supported cases; the unsupported case 2 is by definition a connector
contract violation, so we don't add a test for it.
@gengliangwang
Member Author

@viirya re-reading your concern more carefully: the atomic-microbatch contract from dee5e84 only covers the same-commit-split case. The case you also called out — different commits with the same _commit_timestamp arriving in different micro-batches — is still broken under that contract alone, since Spark's late-event filter also uses LessThanOrEqual (the same WatermarkSupport.watermarkExpression at statefulOperators.scala:633-651 is shared between eviction and late-event filtering). So a v2-row at ts=T arriving in batch 2 after batch 1 advanced the watermark to T would be silently dropped as late.

Tightened the contract again in ffa0646 to add a second requirement: distinct _commit_version values must have distinct _commit_timestamp values when streaming post-processing is enabled. That rules out the different-commit collision. Atomic-commit CDC connectors that derive _commit_timestamp from wall-clock time at commit time (Delta, Iceberg) naturally satisfy this — and the contract fails fast at the connector boundary if a connector violates it (rather than silently producing wrong results).

I still didn't go with a non-zero watermark delay because it would impose latency on every user even when the connector respects the contract; the explicit two-requirement contract makes the failure mode discoverable instead. Happy to revisit if real connectors turn out to need the slack.

…ermark-strip

Two follow-ups on the streaming CDC row-level rewrite:

1. `dev/lint-scala` runs scalafmt on `sql/api`; my prior edit to
   `DataStreamReader.changes()` left the Scaladoc lines wrapped at the
   wrong column. Re-flowed via
   `./build/mvn scalafmt:format -pl sql/api`.

2. Updated the user-visible Scaladoc on `DataStreamReader.changes()` to
   reflect the watermark-metadata strip from dee5e84. The previous wording
   said "the watermark metadata is preserved on the user-visible
   `_commit_timestamp` output ... global watermark becomes the min of the
   two" -- that was accurate before the strip, but is now stale. The new
   wording says the metadata is stripped (so downstream user-supplied
   watermarks do not interact with it via the global multi-watermark
   policy) and explicitly notes that streaming row-level post-processing
   constrains the query to Append output mode.

Note: the Java unidoc CI step is failing on an unrelated pre-existing
name-clash error in `core/target/java/.../JavaSparkContext.java:415`
(`<K,V>union(Seq<JavaPairRDD<K,V>>)` vs `<T>union(Seq<JavaRDD<T>>)` --
same erasure). Verified identical to upstream master, so it's not from
this PR.
…treaming row-level rewrite

Address @zikangh's review on apache#55637 -- the streaming row-level rewrite should
enforce non-NULL _commit_timestamp, mirroring the runtime guard in
CdcNetChangesStatefulProcessor.

A NULL _commit_timestamp on a streaming read is a connector contract
violation that would silently stall the row's group: the downstream
streaming Aggregate uses _commit_timestamp as an event-time watermark
column AND a grouping key, and Spark's eviction predicate is
LessThanOrEqual(eventTime, watermark) -- a NULL group key never
satisfies that, so the group sits in state until end of stream
producing no output and no error.

Add a Filter at the top of the streaming row-level rewrite that raises
CHANGELOG_CONTRACT_VIOLATION.NULL_COMMIT_TIMESTAMP via the same
RaiseError pattern used for the multiple-changes-per-row-version guard
in the batch path. Also adds the new error class to
error-conditions.json.

Tests:
 - Plan-shape tests: assert the guard Filter is present and sits
   directly above the streaming relation (so it runs before any
   downstream operator sees the NULL).
 - End-to-end test: feeding a row with a NULL _commit_timestamp
   surfaces CHANGELOG_CONTRACT_VIOLATION.NULL_COMMIT_TIMESTAMP at the
   streaming query level rather than producing no output.
 - Existing carry-over / update-detection plan-shape tests updated for
   the extra guard Filter (was 1 -> now 2 Filters in carry-over and
   combined paths; was 0 -> now 1 in update-detection-only).

Also refreshed the addStreamingRowLevelPostProcessing Scaladoc to add a
step 0 (the guard) and step 7 (the watermark-metadata strip), keeping
the per-operator detail aligned with the rewrite's actual shape.

Doc-only side effect: scalafmt reflowed the watermark-metadata bullet
in DataStreamReader.changes() Scaladoc (no semantic change).
// CaseWhen returns the default branch (true) for non-null timestamps and
// evaluates the side-effecting RaiseError for nulls; the row never passes the
// filter on a contract violation.
val checkExpr = CaseWhen(Seq(IsNull(commitTsAttr) -> raise), Literal(true))
Member

Spark’s NullPropagation can rewrite IsNull(c) to false when c.nullable == false. Since _commit_timestamp is now a required non-NULL contract field, a connector may reasonably declare it non-nullable. In that case the fail-fast guard could disappear, and a malformed runtime NULL would not raise NULL_COMMIT_TIMESTAMP as intended. The current test catalog appears to declare CDC metadata nullable, so the new test does not cover this case.
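
The optimization hazard can be illustrated with a toy model (plain Python, not Spark's optimizer; the error string mimics the new error class): when the planner believes the column is non-nullable, the IsNull branch constant-folds away and a malformed runtime NULL slips through unchecked:

```python
def guard_filter(rows, nullable=True):
    """Toy model of the CaseWhen null guard under NullPropagation: if the
    optimizer believes the column cannot be NULL, IsNull(c) folds to false
    at plan time and the RaiseError branch is eliminated, so a runtime NULL
    is never checked."""
    def check(ts):
        if not nullable:
            return True  # IsNull folded to false: the guard was optimized away
        if ts is None:
            raise ValueError("CHANGELOG_CONTRACT_VIOLATION.NULL_COMMIT_TIMESTAMP")
        return True
    return [ts for ts in rows if check(ts)]
```

With `nullable=True` a NULL raises the contract violation as intended; with `nullable=False` the same NULL passes the filter silently, which is the gap this review comment points out for connectors that declare `_commit_timestamp` non-nullable.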

* requirement 2 rules out the different-commit case. Atomic-commit CDC connectors
* (e.g. Delta versions, Iceberg snapshots) that derive {@code _commit_timestamp}
* from wall-clock time at commit time naturally satisfy both requirements.
* Behavior is undefined if {@code _commit_timestamp} is {@code NULL} on any row
Member

It says NULL behavior is “undefined,” but the latest code now intentionally raises CHANGELOG_CONTRACT_VIOLATION.NULL_COMMIT_TIMESTAMP. That doc should be updated to match the new fail-fast behavior.

* {@code _commit_timestamp}. For streaming reads with post-processing enabled,
* two additional requirements apply:
* <ol>
* <li>All rows of a single commit must appear in the same micro-batch (i.e.
Member

The new requirements fix the “same commit split across batches” case and the “same timestamp in later batch” case only if timestamps also arrive in increasing event-time order. But the doc no longer explicitly requires that every later micro-batch has _commit_timestamp greater than the previous watermark/max.

Example:

batch 1: commit v2, ts = 20
batch 2: commit v3, ts = 10

Timestamps are distinct, and each commit is atomic, but batch 2 is late after watermark 20. So the real required invariant is closer to: no later micro-batch may contain rows with _commit_timestamp <= previous max event time. Also, “distinct commit versions must have distinct timestamps” is stronger than necessary and may be unrealistic for ms-resolution commit timestamps; equal timestamps are safe if all such commits are emitted before the watermark advances.

@viirya (Member) left a comment

Looks good generally. I have other 3 comments where most are doc-only issues.
