Skip to content

[SPARK-55948][SQL][FOLLOWUP] Clarify rowVersion contract in DSv2 Changelog interface#55460

Closed
gengliangwang wants to merge 3 commits intoapache:masterfrom
gengliangwang:cdc_java_doc
Closed

[SPARK-55948][SQL][FOLLOWUP] Clarify rowVersion contract in DSv2 Changelog interface#55460
gengliangwang wants to merge 3 commits intoapache:masterfrom
gengliangwang:cdc_java_doc

Conversation

@gengliangwang
Copy link
Copy Markdown
Member

What changes were proposed in this pull request?

Refine the Javadoc for two methods on the DSv2 Changelog interface introduced in SPARK-55948:

  • rowVersion() — redefined as the commit at which the row's content was last modified: assigned on initial insert, bumped on update, and preserved when the row is rewritten by a copy-on-write operation without a content change. The previous Javadoc described it as an "ordering column", which invited implementers to set it equal to _commit_version.
  • rowId() — extended the "must override" condition to include containsCarryoverRows(), matching the fact that the column is used for carry-over removal in addition to update detection and net change computation.

Documentation only; no code behavior changes.

Why are the changes needed?

When rowVersion is conflated with _commit_version, both halves of a delete+insert pair always carry the same value, so row-version equality cannot discriminate a copy-on-write carry-over from a real update. Post-processing is then forced into full-row equality checks — slower, and incorrect in the edge case where a real update happens to rewrite a row to identical values (it would be silently dropped as a carry-over).

With the clarified contract, carry-over detection becomes a scalar comparison: within a (rowId, _commit_version) group, equal rowVersion ⇒ copy, different rowVersion ⇒ real update.

Does this PR introduce any user-facing change?

No. Javadoc only.

How was this patch tested?

N/A — documentation only.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.7

Redefine rowVersion as the commit at which the row's content was last
modified: assigned on insert, bumped on update, and preserved when the
row is rewritten by a copy-on-write operation without a content change.
This makes rowVersion distinct from _commit_version and lets Spark
distinguish copy from update by a scalar comparison instead of full-row
equality.

Also broaden rowId's Javadoc to cover carry-over removal and extend the
override requirement to include containsCarryoverRows().

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gengliangwang
Copy link
Copy Markdown
Member Author

cc @johanl-db @SanJSp as well

Comment thread sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/Changelog.java Outdated
Comment thread sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/Changelog.java Outdated
Copy link
Copy Markdown

@SanJSp SanJSp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking the initiative!

Copy link
Copy Markdown

@SanJSp SanJSp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to keep Javadoc changes in a PR, this could be a nice addition for this comment of yours #55426 (comment)

Comment thread sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/Changelog.java Outdated
Co-authored-by: Sandro Sp <38314662+SanJSp@users.noreply.github.com>
@gengliangwang
Copy link
Copy Markdown
Member Author

@SanJSp Thanks, I have applied all your suggestions.

@gengliangwang
Copy link
Copy Markdown
Member Author

@SanJSp @viirya thanks for the review. Merging to master

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants