[SPARK-56686][SQL] Support streaming row-level CDC post-processing #55636
gengliangwang wants to merge 7 commits into apache:master
Conversation
Implements carry-over removal and update detection for DSv2 CDC streaming reads, which previously rejected any post-processing with a blanket error.

The batch path uses a Catalyst Window keyed by (rowId, _commit_version), which UnsupportedOperationChecker rejects on streaming queries (NON_TIME_WINDOW_NOT_SUPPORTED_IN_STREAMING). The streaming rewrite expresses the same logic with streaming-allowed primitives: EventTimeWatermark on _commit_timestamp -> Aggregate keyed by (rowId..., _commit_version, _commit_timestamp) buffering events into a collect_list of structs -> [Filter on the carry-over predicate] -> Generate(Inline(events)) to re-emit rows -> [Project relabeling _change_type for delete+insert pairs] -> drop helper columns.

deduplicationMode=netChanges remains batch-only; it requires partitioning by rowId across the entire requested range and is fundamentally cross-batch. The existing INVALID_CDC_OPTION.STREAMING_POST_PROCESSING_NOT_SUPPORTED error is replaced with the more specific INVALID_CDC_OPTION.STREAMING_NET_CHANGES_NOT_SUPPORTED, which now also points users at the supported streaming alternatives.

Also clarifies the Changelog.java contract that all rows of a single _commit_version must share _commit_timestamp and that streaming reads expect non-decreasing _commit_timestamp across micro-batches, plus a note in DataStreamReader.changes() Scaladoc about the netChanges streaming limitation.
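A rough DataFrame-level sketch of that pipeline (the frame `changes` and the columns `id` and `value` are placeholders; the actual implementation is an analyzer rewrite over the logical plan, and the optional carry-over Filter and update-detection Project are only noted in comments here):

```scala
import org.apache.spark.sql.functions._

// `changes` stands in for the streaming CDC DataFrame; `id` is the connector's
// row-id column and `value` a stand-in for the data columns.
val buffered = changes
  .withWatermark("_commit_timestamp", "0 seconds")
  .groupBy(col("id"), col("_commit_version"), col("_commit_timestamp"))
  .agg(collect_list(struct(col("_change_type"), col("value")))
    .as("__spark_cdc_events"))

// The rewrite places the carry-over Filter and the update-detection Project between
// the aggregate and this step; here we only show how the buffered events are
// re-emitted as rows and the helper column dropped from the output.
val reEmitted = buffered
  .select(col("id"), col("_commit_version"), col("_commit_timestamp"),
    inline(col("__spark_cdc_events")))
```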
@huaxingao FYI this is part of the SPIP for CDC support (SPARK-55668), targeting the Spark 4.2 release. We're aiming to get it ready and merged ASAP.
…date netChanges test
…ams and runtime walkthrough
* 6. Final [[Project]] (via [[removeHelperColumns]]) drops `__spark_cdc_*` helpers so
*    the output schema matches the connector's declared schema.
*/
private def addStreamingRowLevelPostProcessing(
Does this only make sense for append mode? If so, we may need to enforce it with a check to prevent update/complete mode?
Good catch. Implemented in dee5e84 — added a CDC-specific case in UnsupportedOperationChecker.checkForStreaming that detects the rewrite by the __spark_cdc_events helper aggregate expression and throws STREAMING_OUTPUT_MODE.UNSUPPORTED_OPERATION for non-Append modes. Negative end-to-end tests added in ChangelogEndToEndSuite for both Update and Complete output modes.
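A minimal, self-contained sketch of the shape of that check (names and the thrown exception are simplified; the real code lives in UnsupportedOperationChecker.checkForStreaming and raises STREAMING_OUTPUT_MODE.UNSUPPORTED_OPERATION through the error-class framework):

```scala
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan}
import org.apache.spark.sql.streaming.OutputMode

// Detect the CDC rewrite by its __spark_cdc_events helper aggregate expression and
// reject any output mode other than Append.
def checkCdcPostProcessingOutputMode(plan: LogicalPlan, mode: OutputMode): Unit = {
  val hasCdcRewrite = plan.exists {
    case a: Aggregate => a.aggregateExpressions.exists(_.name == "__spark_cdc_events")
    case _ => false
  }
  if (hasCdcRewrite && mode != OutputMode.Append()) {
    // The real checker goes through the error framework; a plain exception keeps
    // this sketch self-contained.
    throw new UnsupportedOperationException(
      s"Change Data Capture (CDC) streaming reads with post-processing only " +
        s"support Append output mode, got $mode.")
  }
}
```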
.writeStream
.format("memory")
.queryName("cdc_stream_carryover")
.outputMode("append")
Can we also add tests for update/complete mode?
Done in dee5e84 — instead of behavioral tests, the rewrite now explicitly rejects Update / Complete output modes (see the reply on addStreamingRowLevelPostProcessing above), so the new tests are negative ones asserting STREAMING_OUTPUT_MODE.UNSUPPORTED_OPERATION is raised at writer-start time. The error message names "Change Data Capture (CDC) streaming reads with post-processing" so the failure mode is discoverable.
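Roughly what such a negative test could look like (table name, option spelling, and the `changes()` signature are assumptions; the real tests in ChangelogEndToEndSuite may assert the error differently):

```scala
// Assumes a streaming test suite with a `spark` session and an in-memory CDC catalog.
test("CDC streaming post-processing rejects Update output mode") {
  val ex = intercept[org.apache.spark.sql.AnalysisException] {
    spark.readStream
      .option("computeUpdates", "true")        // enables the post-processing rewrite
      .changes("testcat.ns.changelog_table")   // hypothetical table identifier
      .writeStream
      .format("memory")
      .queryName("cdc_stream_update_mode")
      .outputMode("update")
      .start()
  }
  assert(ex.getMessage.contains("STREAMING_OUTPUT_MODE.UNSUPPORTED_OPERATION"))
}
```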
* contract every row in a single commit shares `_commit_timestamp`, so taking it
* as event time is safe. Note: this is currently the only analyzer rule that
* auto-injects an [[EventTimeWatermark]] (others resolve user-supplied watermarks).
* The watermark metadata is preserved on the user-visible `_commit_timestamp`
This will expose internal watermark metadata, right? This watermark might involve in downstream multi-watermark policy and change the state eviction of later join/aggregation. Can we remove this _commit_timestamp watermark metadata once we don't need it?
Right, fixed in dee5e84. Added a final Project at the boundary of the streaming rewrite (stripCommitTimestampWatermarkMetadata) that recreates _commit_timestamp with EventTimeWatermark.delayKey removed from its metadata. The watermark is preserved internally on the Aggregate's grouping attribute (so the rewrite still works), but no longer leaks to the user-visible output, so a downstream withWatermark on a different column won't interact with our auto-injected watermark via the global multi-watermark policy. New plan-shape test: "watermark metadata is stripped from user-visible _commit_timestamp".
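Shape-wise, such a strip can be a plain Project that re-aliases the column with the watermark delay key removed from its metadata. A minimal sketch (the method name is taken from the comment above; the signature and placement are assumptions):

```scala
import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, NamedExpression}
import org.apache.spark.sql.catalyst.plans.logical.{EventTimeWatermark, LogicalPlan, Project}
import org.apache.spark.sql.types.MetadataBuilder

// Re-alias _commit_timestamp with EventTimeWatermark.delayKey removed so the
// auto-injected watermark does not leak into the user-visible output schema.
def stripCommitTimestampWatermarkMetadata(
    child: LogicalPlan,
    commitTs: Attribute): Project = {
  val cleaned = new MetadataBuilder()
    .withMetadata(commitTs.metadata)
    .remove(EventTimeWatermark.delayKey)
    .build()
  val projectList: Seq[NamedExpression] = child.output.map {
    case a if a.exprId == commitTs.exprId =>
      Alias(a, a.name)(explicitMetadata = Some(cleaned))
    case other => other
  }
  Project(projectList, child)
}
```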
viirya
left a comment
I’m concerned that the proposed streaming contract for _commit_timestamp may be too weak for a zero-delay watermark. The docs say connectors should emit _commit_timestamp in non-decreasing order across micro-batches, but non-decreasing still allows a later micro-batch to contain rows with the same timestamp as the previous batch’s max event time.
With EventTimeWatermark(_commit_timestamp, 0s), once a batch observes max event time T, the next batch can advance the watermark to T and the aggregate may emit/evict groups with eventTime <= T. If another row for the same commit, or another commit with the same _commit_timestamp = T, arrives in that next batch, it may be treated as late or arrive after the group has already been finalized. That could make carry-over removal or update detection operate on an incomplete delete/insert pair.
Should the contract be stronger here? For example, require that all rows for a commit are emitted within the same micro-batch, and that commit timestamps strictly increase across micro-batches when streaming post-processing is enabled. Alternatively, the implementation may need a non-zero delay or another way to avoid finalizing groups while equal-timestamp rows can still arrive.
…l rewrite

Three fixes from viirya's review on apache#55636:

1. Strip the auto-injected EventTimeWatermark metadata from the user-visible `_commit_timestamp` output. The metadata flowed through `Generate(Inline)` onto the public output, where it would have interacted with downstream user-supplied watermarks via the global multi-watermark policy. A final Project at the boundary of the rewrite now removes `EventTimeWatermark.delayKey` so the watermark stays internal-only.

2. Reject non-Append output modes for streaming CDC reads with post-processing. The injected streaming Aggregate's append-mode emission (one group per `_commit_timestamp` once the watermark advances past it) is the only semantically valid mode -- Update would re-emit per-batch state changes, Complete would re-emit the full result table per batch, neither matching batch CDC semantics. UnsupportedOperationChecker now detects the rewrite by the `__spark_cdc_events` helper aggregate expression and throws STREAMING_OUTPUT_MODE.UNSUPPORTED_OPERATION for non-Append modes.

3. Tighten the `_commit_timestamp` streaming contract in `Changelog.java`. The previous "non-decreasing across micro-batches" wording was too weak: Spark's stateful aggregate evicts groups with `eventTime <= watermark` (statefulOperators.scala:643-650), so equal-timestamp rows in a later micro-batch would be dropped as late. The contract now requires that all rows of a single commit appear in the same micro-batch -- the natural atomic-commit emission pattern of real CDC connectors (Delta versions, Iceberg snapshots) -- which makes the zero-delay watermark sound.

Adds a plan-shape test asserting no watermark metadata leaks to user-visible output, and two end-to-end negative tests covering Update / Complete output mode rejection.
@viirya re: the streaming `_commit_timestamp` contract: tightened it in dee5e84. I considered a non-zero delay as a belt-and-suspenders guard, but it would only paper over connectors that violate the atomic-microbatch rule (a contract violation anyway), and would delay every user's output by the chosen interval. Happy to revisit if the explicit contract turns out to be too restrictive in practice.
…treaming CDC

Address the second sub-case from viirya's review on apache#55636. The previous contract change covered the same-commit-split-across-micro-batches case via "all rows of a single commit must appear in the same micro-batch", but missed the case where two DIFFERENT commits with the same `_commit_timestamp` arrive in different micro-batches. Spark's late-event filter and state-eviction predicate both use `LessThanOrEqual` (`statefulOperators.scala:633-651`), so once a micro-batch observes max event time T and advances the watermark to T, any later row at exactly `_commit_timestamp = T` is silently dropped as late. The atomic-microbatch contract alone doesn't rule this out for distinct commits.

Adds a second contract requirement: distinct `_commit_version` values must have distinct `_commit_timestamp` values when streaming post-processing is enabled. Atomic-commit CDC connectors that derive `_commit_timestamp` from wall-clock time at commit time (Delta, Iceberg) naturally satisfy this.

Doc-only change; no code modifications. The existing tests already exercise the supported cases; the unsupported case 2 is by definition a connector contract violation, so we don't add a test for it.
@viirya re-reading your concern more carefully: the atomic-microbatch contract from dee5e84 only covers the same-commit-split case. The case you also called out (different commits with the same `_commit_timestamp` arriving in different micro-batches) wasn't covered. Tightened the contract again in ffa0646 to add a second requirement: distinct `_commit_version` values must have distinct `_commit_timestamp` values when streaming post-processing is enabled. I still didn't go with a non-zero watermark delay because it would impose latency on every user even when the connector respects the contract; the explicit two-requirement contract makes the failure mode discoverable instead. Happy to revisit if real connectors turn out to need the slack.
…ermark-strip

Two follow-ups on the streaming CDC row-level rewrite:

1. `dev/lint-scala` runs scalafmt on `sql/api`; my prior edit to `DataStreamReader.changes()` left the Scaladoc lines wrapped at the wrong column. Re-flowed via `./build/mvn scalafmt:format -pl sql/api`.

2. Updated the user-visible Scaladoc on `DataStreamReader.changes()` to reflect the watermark-metadata strip from dee5e84. The previous wording said "the watermark metadata is preserved on the user-visible `_commit_timestamp` output ... global watermark becomes the min of the two" -- that was accurate before the strip, but is now stale. The new wording says the metadata is stripped (so downstream user-supplied watermarks do not interact with it via the global multi-watermark policy) and explicitly notes that streaming row-level post-processing constrains the query to Append output mode.

Note: the Java unidoc CI step is failing on an unrelated pre-existing name-clash error in `core/target/java/.../JavaSparkContext.java:415` (`<K,V>union(Seq<JavaPairRDD<K,V>>)` vs `<T>union(Seq<JavaRDD<T>>)` -- same erasure). Verified identical to upstream master, so it's not from this PR.
…treaming row-level rewrite

Address @zikangh's review on apache#55637 -- the streaming row-level rewrite should enforce non-NULL _commit_timestamp, mirroring the runtime guard in CdcNetChangesStatefulProcessor.

A NULL _commit_timestamp on a streaming read is a connector contract violation that would silently stall the row's group: the downstream streaming Aggregate uses _commit_timestamp as an event-time watermark column AND a grouping key, and Spark's eviction predicate is LessThanOrEqual(eventTime, watermark) -- a NULL group key never satisfies that, so the group sits in state until end of stream, producing no output and no error.

Add a Filter at the top of the streaming row-level rewrite that raises CHANGELOG_CONTRACT_VIOLATION.NULL_COMMIT_TIMESTAMP via the same RaiseError pattern used for the multiple-changes-per-row-version guard in the batch path. Also adds the new error class to error-conditions.json.

Tests:
- Plan-shape tests: assert the guard Filter is present and sits directly above the streaming relation (so it runs before any downstream operator sees the NULL).
- End-to-end test: feeding a row with a NULL _commit_timestamp surfaces CHANGELOG_CONTRACT_VIOLATION.NULL_COMMIT_TIMESTAMP at the streaming query level rather than producing no output.
- Existing carry-over / update-detection plan-shape tests updated for the extra guard Filter (was 1 -> now 2 Filters in carry-over and combined paths; was 0 -> now 1 in update-detection-only).

Also refreshed the addStreamingRowLevelPostProcessing Scaladoc to add a step 0 (the guard) and step 7 (the watermark-metadata strip), keeping the per-operator detail aligned with the rewrite's actual shape. Doc-only side effect: scalafmt reflowed the watermark-metadata bullet in DataStreamReader.changes() Scaladoc (no semantic change).
// CaseWhen returns the default branch (true) for non-null timestamps and
// evaluates the side-effecting RaiseError for nulls; the row never passes the
// filter on a contract violation.
val checkExpr = CaseWhen(Seq(IsNull(commitTsAttr) -> raise), Literal(true))
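For context, a hedged sketch of where this expression ends up (not the exact PR code; `streamingRelation` stands for the resolved changelog streaming relation):

```scala
import org.apache.spark.sql.catalyst.plans.logical.Filter

// Step 0 of the rewrite: the guard Filter sits directly above the streaming relation,
// so a NULL _commit_timestamp raises the contract violation before the watermark or
// aggregate ever see the row.
val guarded = Filter(checkExpr, streamingRelation)
```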
Spark’s NullPropagation can rewrite IsNull(c) to false when c.nullable == false. Since _commit_timestamp is now a required non-NULL contract field, a connector may reasonably declare it non-nullable. In that case the fail-fast guard could disappear, and a malformed runtime NULL would not raise NULL_COMMIT_TIMESTAMP as intended. The current test catalog appears to declare CDC metadata nullable, so the new test does not cover this case.
* requirement 2 rules out the different-commit case. Atomic-commit CDC connectors
* (e.g. Delta versions, Iceberg snapshots) that derive {@code _commit_timestamp}
* from wall-clock time at commit time naturally satisfy both requirements.
* Behavior is undefined if {@code _commit_timestamp} is {@code NULL} on any row
It says NULL behavior is “undefined,” but the latest code now intentionally raises CHANGELOG_CONTRACT_VIOLATION.NULL_COMMIT_TIMESTAMP. That doc should be updated to match the new fail-fast behavior.
* {@code _commit_timestamp}. For streaming reads with post-processing enabled,
* two additional requirements apply:
* <ol>
* <li>All rows of a single commit must appear in the same micro-batch (i.e.
The new requirements fix the “same commit split across batches” case and the “same timestamp in later batch” case only if timestamps also arrive in increasing event-time order. But the doc no longer explicitly requires that every later micro-batch has _commit_timestamp greater than the previous watermark/max.
Example:
batch 1: commit v2, ts = 20
batch 2: commit v3, ts = 10
Timestamps are distinct, and each commit is atomic, but batch 2 is late after watermark 20. So the real required invariant is closer to: no later micro-batch may contain rows with _commit_timestamp <= previous max event time. Also, “distinct commit versions must have distinct timestamps” is stronger than necessary and may be unrealistic for ms-resolution commit timestamps; equal timestamps are safe if all such commits are emitted before the watermark advances.
viirya
left a comment
Looks good generally. I have 3 other comments; most are doc-only issues.
What changes were proposed in this pull request?
This PR implements row-level CDC post-processing (carry-over removal and update detection) for DSv2 streaming reads. Previously, streaming `changes()` rejected any post-processing with a blanket `INVALID_CDC_OPTION.STREAMING_POST_PROCESSING_NOT_SUPPORTED` error.

The batch path (added in #55508 and #55583) uses a Catalyst `Window` keyed by `(rowId, _commit_version)`, which `UnsupportedOperationChecker` rejects on streaming queries (`NON_TIME_WINDOW_NOT_SUPPORTED_IN_STREAMING`). The streaming rewrite in `ResolveChangelogTable` now expresses the same logic with streaming-allowed primitives:

- `EventTimeWatermark` on `_commit_timestamp`
- `Aggregate` keyed by `(rowId..., _commit_version, _commit_timestamp)`, buffering events into a `collect_list` of structs
- an optional `Filter` on the carry-over predicate
- `Generate(Inline(events))` to re-emit rows
- an optional `Project` relabeling `_change_type` for delete+insert pairs
- a final `Project` dropping the helper columns

Including `_commit_timestamp` in the grouping keys is required to satisfy the Append-mode streaming aggregation contract (the watermark attribute must appear among the grouping expressions). By CDC convention all rows in a single commit share `_commit_timestamp`, so this is semantically a no-op relative to the batch `(rowId, _commit_version)` grouping.

`deduplicationMode = netChanges` is still rejected -- net change computation partitions by `rowId` alone and reasons over the entire requested range, which is fundamentally cross-batch. The existing error class `INVALID_CDC_OPTION.STREAMING_POST_PROCESSING_NOT_SUPPORTED` is replaced with the more specific `INVALID_CDC_OPTION.STREAMING_NET_CHANGES_NOT_SUPPORTED`, which now names the offending option and points users at the supported streaming alternatives.

Doc updates:
- `Changelog.java` clarifies that all rows of a single `_commit_version` must share `_commit_timestamp`, and that streaming reads expect non-decreasing `_commit_timestamp` across micro-batches.
- `Changelog.java` notes that `containsIntermediateChanges()` is range-scoped, hence the streaming limitation for `netChanges`.
- `DataStreamReader.changes()` Scaladoc lists the `netChanges` streaming limitation.

Why are the changes needed?
Without this PR, any streaming CDC read against a connector that emits CoW carry-over pairs (`containsCarryoverRows = true`) or represents updates as raw delete+insert (`representsUpdateAsDeleteAndInsert = true`) raises an analysis error, forcing users to fall back to batch reads. The batch-only restriction is unnecessary for these passes -- they don't need cross-version state -- and it surprises users since the same options work on batch reads.

Does this PR introduce any user-facing change?
Yes.

- `spark.readStream.changes(...)` now supports `computeUpdates = true` and `deduplicationMode = dropCarryovers` (see the usage sketch below). Previously these threw `INVALID_CDC_OPTION.STREAMING_POST_PROCESSING_NOT_SUPPORTED`.
- `INVALID_CDC_OPTION.STREAMING_POST_PROCESSING_NOT_SUPPORTED` is renamed to `INVALID_CDC_OPTION.STREAMING_NET_CHANGES_NOT_SUPPORTED` with a more specific message. The new error fires only for `deduplicationMode = netChanges` on streaming reads.
- `DataStreamReader.changes()` Scaladoc is updated accordingly.
- `Changelog.java` Scaladoc clarifies the `_commit_timestamp` contract for streaming.
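For illustration, a hedged usage sketch of the newly supported combination (the table identifier is a placeholder, and whether the options are passed via `option(...)` or dedicated parameters of `changes()` is an assumption):

```scala
// Sketch only: option spellings and the changes() call are assumed from the PR
// description, not verified against the final API.
val query = spark.readStream
  .option("computeUpdates", "true")
  .option("deduplicationMode", "dropCarryovers")
  .changes("testcat.ns.changelog_table")   // hypothetical table identifier
  .writeStream
  .format("console")
  .outputMode("append")   // the streaming rewrite only supports Append output mode
  .start()
```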
How was this patch tested?

86 tests across 4 CDC suites (all passing):
- `ResolveChangelogTableStreamingPostProcessingSuite` (new, 5 tests) -- plan-shape assertions covering carry-over only, update detection only, both fused, and the no-rewrite pass-through cases. Verifies the `EventTimeWatermark` + `Aggregate` + `Generate(Inline)` rewrite shape.
- `ChangelogResolutionSuite` -- the two existing streaming throw-tests are flipped to plan-shape assertions; a new test covers the `netChanges` streaming throw.
- `ResolveChangelogTablePostProcessingSuite` -- the existing streaming throw test is updated to cover the `netChanges`-only case.
- `ChangelogEndToEndSuite` -- three new streaming end-to-end tests using `InMemoryChangelogCatalog`: carry-over removal drops CoW pairs, update detection relabels delete+insert as update, and `netChanges` throws.
UnsupportedOperationsSuite(216 tests) still passes -- the rewritten plan does not containWindowor any other streaming-rejected operator.Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (claude-opus-4-7)