From 0122bbcf722e66d62dfc1613c02bf8a919e123a7 Mon Sep 17 00:00:00 2001 From: Gengliang Wang Date: Fri, 8 May 2026 17:13:11 -0700 Subject: [PATCH] [SPARK-56798][SQL][DOCS] Clarify streaming CDC emission timing and netChanges scope Address two follow-up review threads on PR #55637 (streaming CDC netChanges): - The "held back" paragraph was worded as if the one-commit emission lag were a netChanges-specific property. It is not -- carry-over removal and update detection use append-mode `Aggregate` keyed on `_commit_timestamp` and have the same lag as the netChanges `transformWithState` timer. - Set realistic expectations for streaming netChanges: for typical CDC sources that produce at most one change per row per commit, the streaming output equals what `computeUpdates` would produce, because only one commit's changes are buffered at a time. Cross-commit merging only kicks in when several commits touch the same row before the older one's output is emitted. Direct users to a batch read for full-range collapse. Both points are now stated up-front in plain language, with a bulleted list and short bold labels for scannability. --- .../sql/connector/catalog/Changelog.java | 19 +++++++++++++++---- 1 file changed, 15 insertions(+), 4 deletions(-) diff --git a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/Changelog.java b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/Changelog.java index 2ef0846ab800b..918422899c311 100644 --- a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/Changelog.java +++ b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/Changelog.java @@ -71,10 +71,21 @@ * *

* Streaming reads support carry-over removal, update detection, and net change - * computation. Net change collapses are kept in the state store keyed by row identity; - * row identities only touched in the latest observed commit are held back until either a - * later commit (with strictly greater `_commit_timestamp`) advances the global watermark - * past them, or the source terminates. + * computation. Two streaming-specific behaviors to be aware of: + *

*

* Pushdown contract. When any post-processing pass applies (carry-over * removal, update detection, or netChanges), Spark only pushes predicates