[SPARK-56798][SQL][DOCS] Clarify streaming CDC emission timing and netChanges scope#55776

Open
gengliangwang wants to merge 1 commit into apache:master from gengliangwang:SPARK-cdc-streaming-doc-clarify

@gengliangwang
Member

What changes were proposed in this pull request?

Address two follow-up review threads on PR #55637 (streaming CDC netChanges) by clarifying the streaming behavior in the Changelog Javadoc.

The previous paragraph read as if the one-commit emission lag were a netChanges-specific property. In fact, carry-over removal and update detection use an append-mode Aggregate keyed on _commit_timestamp and have the same lag as the netChanges transformWithState timer. The paragraph also did not set expectations for what streaming netChanges actually collapses in practice.

Replaced the existing single paragraph with a bulleted list:

  • Output is delayed by one commit. When a micro-batch ingests a commit, that commit's output rows are buffered and not emitted in the same batch. They are emitted by the next micro-batch -- the one that ingests the following commit. The last commit's output is emitted when the source terminates.
  • netChanges only merges changes that are buffered together. For a typical CDC source that produces at most one change per row per commit, only one commit's changes are buffered at a time per row, so the streaming output is the same as computeUpdates. Multiple commits' changes are merged only when those commits touch the same row before the older one's output has been emitted. For full-range collapse, use a batch read.
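
For reviewers who want the timing in the first bullet spelled out, here is a toy model of the one-commit lag. It is only a sketch under stated assumptions: it is plain Java, not Spark code, the class and method names (`OneCommitLagSketch`, `ingestCommit`, `terminate`) are hypothetical, and it assumes each micro-batch ingests exactly one commit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of the one-commit emission lag. Hypothetical names; not Spark's implementation.
final class OneCommitLagSketch {
    // Output rows of the most recently ingested commit, not yet emitted.
    private List<String> buffered = new ArrayList<>();

    // A micro-batch ingests one commit: it emits the rows buffered for the
    // previous commit and buffers the new commit's rows in their place.
    public List<String> ingestCommit(List<String> commitRows) {
        List<String> emitted = buffered;
        buffered = new ArrayList<>(commitRows);
        return emitted;
    }

    // When the source terminates, the last commit's buffered rows are flushed.
    public List<String> terminate() {
        List<String> emitted = buffered;
        buffered = new ArrayList<>();
        return emitted;
    }

    public static void main(String[] args) {
        OneCommitLagSketch stream = new OneCommitLagSketch();
        // Batch 1 ingests commit 1 and emits nothing.
        System.out.println(stream.ingestCommit(Arrays.asList("c1: insert k1"))); // prints []
        // Batch 2 ingests commit 2 and emits commit 1's rows.
        System.out.println(stream.ingestCommit(Arrays.asList("c2: update k1"))); // prints [c1: insert k1]
        // Termination emits the last commit's rows.
        System.out.println(stream.terminate()); // prints [c2: update k1]
    }
}
```

In this model only one commit's rows are ever buffered, which matches the typical case in the second bullet where there is nothing for netChanges to merge across commits.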

This is a sub-task of SPARK-55668.

Why are the changes needed?

Spelling out the emission timing and the practical netChanges scope prevents adopters from forming wrong expectations about what streaming netChanges does for typical (atomic-commit) CDC workloads. Naming the lag and the buffer-window scope explicitly also makes the doc consistent with the implementation, where both facts are properties of all three streaming post-processing paths.

Does this PR introduce any user-facing change?

Documentation only. No behavior change.

How was this patch tested?

Doc-only change. -Xdoclint:html,syntax,accessibility is clean on Changelog.java (the only errors are the expected "cannot find symbol" errors from running without a classpath). No code changed; existing CDC test suites are unaffected.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude opus-4-7

…tChanges scope

Address two follow-up review threads on PR apache#55637 (streaming CDC netChanges):

- The "held back" paragraph was worded as if the one-commit emission lag
  were a netChanges-specific property. It is not -- carry-over removal and
  update detection use append-mode `Aggregate` keyed on `_commit_timestamp`
  and have the same lag as the netChanges `transformWithState` timer.
- Set realistic expectations for streaming netChanges: for typical CDC
  sources that produce at most one change per row per commit, the
  streaming output equals what `computeUpdates` would produce, because
  only one commit's changes are buffered at a time. Cross-commit merging
  only kicks in when several commits touch the same row before the older
  one's output is emitted. Direct users to a batch read for full-range
  collapse.

Both points are now stated up-front in plain language, with a bulleted
list and short bold labels for scannability.
@gengliangwang
Member Author

cc @johanl-db
