[SPARK-56711][SQL] Restrict CDC `_commit_version` column to LongType or StringType #55663
Closed
gengliangwang wants to merge 1 commit into
…or StringType

### What changes were proposed in this pull request?

Tighten the CDC `Changelog` connector contract so that `_commit_version` must be either `LongType` or `StringType`. Previously any `AtomicType` was accepted, which left several edge-case types (`IntegerType`, `TimestampType`, `BinaryType`, `Decimal`, `Float`, `Double`, `Boolean`, ...) silently allowed.

- `ChangelogTable.validateSchema` now rejects everything outside `LongType` / `StringType` with a `BIGINT or STRING` expected-type message.
- `Changelog` Javadoc updated to state the narrower contract.
- `CdcNetChangesStatefulProcessor` ordering comment updated; the existing Catalyst-routed comparator is left in place for symmetry with the batch `SortOrder`.
- `ChangelogResolutionSuite` updates: accept-list narrowed to `Long` / `String`; reject-list expanded to cover the previously-allowed atomic types (`Integer`, `Timestamp`) plus the existing complex-type cases.

### Why are the changes needed?

`Long` (numeric monotonic version) and `String` (opaque commit identifier) cover every realistic CDC source. The other atomic types are either strict subsets (`IntegerType` -> `LongType`) or duplicate the role of `_commit_timestamp` (`TimestampType`); types like `BinaryType` / `Float` / `Double` add NaN / boxing / ordering foot-guns with no expressive power gained.

Locking down now is non-breaking (no external connectors yet) and keeps the documented surface area small. Relaxing later is non-breaking; restricting later is not.

### Does this PR introduce _any_ user-facing change?

The `Changelog` connector API is `@Evolving` and has no external implementations yet; the restriction only narrows what implementers may return. No user-facing behavior change.

### How was this patch tested?

- `ChangelogResolutionSuite` (27 tests) -- covers the new accept / reject matrix.
- `ResolveChangelogTablePostProcessingSuite`, `ResolveChangelogTableStreamingPostProcessingSuite`, `ResolveChangelogTableNetChangesSuite`, `ChangelogEndToEndSuite` -- 130 existing tests still pass on the new contract.
- `UnsupportedOperationsSuite` (216 tests) still passes.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude opus-4-7
Member, Author:
Locking down the data type before the 4.2 release is non-breaking (no external connectors yet) and keeps the documented surface area small.
viirya approved these changes May 4, 2026
Member:
Merged to master, branch-4.x and branch-4.2.
HyukjinKwon pushed a commit that referenced this pull request May 4, 2026
…or StringType

### What changes were proposed in this pull request?

Tighten the CDC `Changelog` connector contract so that `_commit_version` must be either `LongType` or `StringType`. Previously any `AtomicType` was accepted, which left several edge-case types (`IntegerType`, `TimestampType`, `BinaryType`, `Decimal`, `Float`, `Double`, `Boolean`, ...) silently allowed.

- `ChangelogTable.validateSchema` now rejects everything outside `LongType` / `StringType` with a `BIGINT or STRING` expected-type message.
- `Changelog` Javadoc updated to state the narrower contract and explain the ordering requirement (the netChanges post-processing path sorts rows by this column, so the column's natural ordering must match commit order).
- `CdcNetChangesStatefulProcessor` ordering comment updated; the existing Catalyst-routed comparator is left in place for symmetry with the batch `SortOrder`.
- `ChangelogResolutionSuite` updates: accept-list narrowed to `Long` / `String`; reject-list expanded to cover the previously-allowed atomic types (`Integer`, `Timestamp`) plus the existing complex-type cases.

### Why are the changes needed?

`Long` (numeric monotonic version) and `String` (lexicographically ordered commit identifier) cover every realistic CDC source. The other atomic types are either strict subsets (`IntegerType` -> `LongType`) or duplicate the role of `_commit_timestamp` (`TimestampType`); types like `BinaryType` / `Float` / `Double` add NaN / boxing / ordering foot-guns with no expressive power gained. The narrower contract also lets the Javadoc state the ordering requirement precisely (matching what the netChanges code actually relies on).

Locking down now is non-breaking (no external connectors yet) and keeps the documented surface area small. Relaxing later is non-breaking; restricting later is not.

### Does this PR introduce _any_ user-facing change?

The `Changelog` connector API is `Evolving` and has no external implementations yet; the restriction only narrows what implementers may return. No user-facing behavior change.

### How was this patch tested?

- `ChangelogResolutionSuite` (27 tests) covers the new accept / reject matrix.
- `ResolveChangelogTablePostProcessingSuite`, `ResolveChangelogTableStreamingPostProcessingSuite`, `ResolveChangelogTableNetChangesSuite`, `ChangelogEndToEndSuite` -- 98 existing tests still pass on the new contract.
- `UnsupportedOperationsSuite` (216 tests) still passes.
- `Xdoclint:html,syntax,accessibility` is clean on `Changelog.java`; no new warnings under `Xdoclint:all`.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude opus-4-7

Closes #55663 from gengliangwang/SPARK-56711-restrict-commit-version-type.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit ae5c075)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
HyukjinKwon pushed a commit that referenced this pull request May 4, 2026
### What changes were proposed in this pull request?

Tighten the CDC `Changelog` connector contract so that `_commit_version` must be either `LongType` or `StringType`. Previously any `AtomicType` was accepted, which left several edge-case types (`IntegerType`, `TimestampType`, `BinaryType`, `Decimal`, `Float`, `Double`, `Boolean`, ...) silently allowed.

- `ChangelogTable.validateSchema` now rejects everything outside `LongType` / `StringType` with a `BIGINT or STRING` expected-type message.
- `Changelog` Javadoc updated to state the narrower contract and explain the ordering requirement (the netChanges post-processing path sorts rows by this column, so the column's natural ordering must match commit order).
- `CdcNetChangesStatefulProcessor` ordering comment updated; the existing Catalyst-routed comparator is left in place for symmetry with the batch `SortOrder`.
- `ChangelogResolutionSuite` updates: accept-list narrowed to `Long` / `String`; reject-list expanded to cover the previously-allowed atomic types (`Integer`, `Timestamp`) plus the existing complex-type cases.

### Why are the changes needed?

`Long` (numeric monotonic version) and `String` (lexicographically ordered commit identifier) cover every realistic CDC source. The other atomic types are either strict subsets (`IntegerType` -> `LongType`) or duplicate the role of `_commit_timestamp` (`TimestampType`); types like `BinaryType` / `Float` / `Double` add NaN / boxing / ordering foot-guns with no expressive power gained. The narrower contract also lets the Javadoc state the ordering requirement precisely (matching what the netChanges code actually relies on).

Locking down now is non-breaking (no external connectors yet) and keeps the documented surface area small. Relaxing later is non-breaking; restricting later is not.
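
The ordering requirement matters most for string versions, because strings sort lexicographically rather than numerically. The following is a minimal, language-neutral sketch in Python, not Spark code: `sort_by_commit_version`, `zero_pad`, and `versions` are hypothetical helpers, and rows are modeled as plain dictionaries. It shows that a connector emitting unpadded numeric strings would be sorted out of commit order, while a fixed-width encoding keeps string order aligned with commit order. The last line illustrates the IEEE 754 NaN comparison behavior in Python; Spark SQL defines its own NaN ordering, so that line only hints at why floating-point versions would need special-case rules.

```python
import math


def sort_by_commit_version(rows):
    """Sort rows by `_commit_version` using the value's natural ordering."""
    return sorted(rows, key=lambda row: row["_commit_version"])


def zero_pad(version: int, width: int = 20) -> str:
    """Encode a non-negative version so that string order equals numeric order."""
    return str(version).zfill(width)


def versions(rows):
    """Extract the `_commit_version` values from a list of rows."""
    return [row["_commit_version"] for row in rows]


# Hypothetical commits arriving out of order; the true commit order is 9, 10, 11.
commits = (10, 9, 11)

# Long-like versions: natural (numeric) order matches commit order.
numeric_order = versions(
    sort_by_commit_version([{"_commit_version": v} for v in commits])
)

# Unpadded string versions: lexicographic order puts "10" and "11" before "9".
unpadded_order = versions(
    sort_by_commit_version([{"_commit_version": str(v)} for v in commits])
)

# Zero-padded string versions: lexicographic order matches commit order again.
padded_order = versions(
    sort_by_commit_version([{"_commit_version": zero_pad(v)} for v in commits])
)

# Under IEEE 754 comparisons (as in Python), NaN is neither below nor above 1.0.
nan_comparable = (math.nan < 1.0) or (math.nan >= 1.0)
```

Under these assumptions, a connector that uses `StringType` would need to choose identifiers whose lexicographic order already matches commit order, for example fixed-width or otherwise sortable encodings.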
### Does this PR introduce _any_ user-facing change?

The `Changelog` connector API is `@Evolving` and has no external implementations yet; the restriction only narrows what implementers may return. No user-facing behavior change.

### How was this patch tested?

- `ChangelogResolutionSuite` (27 tests) covers the new accept / reject matrix.
- `ResolveChangelogTablePostProcessingSuite`, `ResolveChangelogTableStreamingPostProcessingSuite`, `ResolveChangelogTableNetChangesSuite`, `ChangelogEndToEndSuite` -- 98 existing tests still pass on the new contract.
- `UnsupportedOperationsSuite` (216 tests) still passes.
- `Xdoclint:html,syntax,accessibility` is clean on `Changelog.java`; no new warnings under `Xdoclint:all`.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude opus-4-7
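
The accept / reject rule described for `ChangelogTable.validateSchema` can be pictured as a simple allow-list check. The sketch below is an illustration in Python under stated assumptions, not the actual Spark implementation: type names are modeled as plain strings, and `ALLOWED_COMMIT_VERSION_TYPES` and `check_commit_version_type` are hypothetical names. Only the allowed types and the `BIGINT or STRING` wording come from the PR description.

```python
# Hypothetical stand-ins for Spark SQL types, mapped to their SQL names.
# This mirrors the contract in the PR description; it is not Spark's API.
ALLOWED_COMMIT_VERSION_TYPES = {"LongType": "BIGINT", "StringType": "STRING"}


def check_commit_version_type(type_name: str) -> None:
    """Raise ValueError unless `type_name` is on the allow-list."""
    if type_name not in ALLOWED_COMMIT_VERSION_TYPES:
        # Builds the "BIGINT or STRING" expected-type text from the allow-list.
        expected = " or ".join(ALLOWED_COMMIT_VERSION_TYPES.values())
        raise ValueError(
            f"_commit_version: expected {expected}, got {type_name}"
        )


# Accepted: the two types the narrowed contract allows.
check_commit_version_type("LongType")
check_commit_version_type("StringType")

# Rejected: types that the previous any-AtomicType rule let through.
rejected = []
for candidate in ("IntegerType", "TimestampType", "BinaryType", "DoubleType"):
    try:
        check_commit_version_type(candidate)
    except ValueError as error:
        rejected.append((candidate, str(error)))
```

An allow-list, rather than a deny-list of known-bad types, matches the reasoning in the description: any type not explicitly listed stays rejected, so the contract can be relaxed later without breaking existing connectors.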