Skip to content

[SPARK-56711][SQL] Restrict CDC _commit_version column to LongType or StringType#55663

Closed
gengliangwang wants to merge 1 commit into
apache:masterfrom
gengliangwang:SPARK-56711-restrict-commit-version-type
Closed

[SPARK-56711][SQL] Restrict CDC _commit_version column to LongType or StringType#55663
gengliangwang wants to merge 1 commit into
apache:masterfrom
gengliangwang:SPARK-56711-restrict-commit-version-type

Conversation

@gengliangwang
Copy link
Copy Markdown
Member

@gengliangwang gengliangwang commented May 3, 2026

What changes were proposed in this pull request?

Tighten the CDC Changelog connector contract so that _commit_version must be either LongType or StringType. Previously any AtomicType was accepted, which left several edge-case types (IntegerType, TimestampType, BinaryType, Decimal, Float, Double, Boolean, ...) silently allowed.

  • ChangelogTable.validateSchema now rejects everything outside LongType / StringType with a BIGINT or STRING expected-type message.
  • Changelog Javadoc updated to state the narrower contract and explain the ordering requirement (the netChanges post-processing path sorts rows by this column, so the column's natural ordering must match commit order).
  • CdcNetChangesStatefulProcessor ordering comment updated; the existing Catalyst-routed comparator is left in place for symmetry with the batch SortOrder.
  • ChangelogResolutionSuite updates: accept-list narrowed to Long / String; reject-list expanded to cover the previously-allowed atomic types (Integer, Timestamp) plus the existing complex-type cases.

Why are the changes needed?

Long (numeric monotonic version) and String (lexicographically ordered commit identifier) cover every realistic CDC source. The other atomic types are either strict subsets (IntegerType -> LongType) or duplicate the role of _commit_timestamp (TimestampType); types like BinaryType / Float / Double add NaN / boxing / ordering foot-guns with no expressive power gained. The narrower contract also lets the Javadoc state the ordering requirement precisely (matching what the netChanges code actually relies on).

Locking down now is non-breaking (no external connectors yet) and keeps the documented surface area small. Relaxing later is non-breaking; restricting later is not.

Does this PR introduce any user-facing change?

The Changelog connector API is @Evolving and has no external implementations yet; the restriction only narrows what implementers may return. No user-facing behavior change.

How was this patch tested?

  • ChangelogResolutionSuite (27 tests) covers the new accept / reject matrix.
  • ResolveChangelogTablePostProcessingSuite, ResolveChangelogTableStreamingPostProcessingSuite, ResolveChangelogTableNetChangesSuite, ChangelogEndToEndSuite -- 98 existing tests still pass on the new contract.
  • UnsupportedOperationsSuite (216 tests) still passes.
  • Xdoclint:html,syntax,accessibility is clean on Changelog.java; no new warnings under Xdoclint:all.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude opus-4-7

…or StringType

### What changes were proposed in this pull request?

Tighten the CDC `Changelog` connector contract so that `_commit_version` must
be either `LongType` or `StringType`. Previously any `AtomicType` was
accepted, which left several edge-case types (`IntegerType`, `TimestampType`,
`BinaryType`, `Decimal`, `Float`, `Double`, `Boolean`, ...) silently allowed.

- `ChangelogTable.validateSchema` now rejects everything outside `LongType` /
  `StringType` with a `BIGINT or STRING` expected-type message.
- `Changelog` Javadoc updated to state the narrower contract.
- `CdcNetChangesStatefulProcessor` ordering comment updated; the existing
  Catalyst-routed comparator is left in place for symmetry with the batch
  `SortOrder`.
- `ChangelogResolutionSuite` updates: accept-list narrowed to `Long` /
  `String`; reject-list expanded to cover the previously-allowed atomic
  types (`Integer`, `Timestamp`) plus the existing complex-type cases.

### Why are the changes needed?

`Long` (numeric monotonic version) and `String` (opaque commit identifier)
cover every realistic CDC source. The other atomic types are either strict
subsets (`IntegerType` -> `LongType`) or duplicate the role of
`_commit_timestamp` (`TimestampType`); types like `BinaryType` / `Float` /
`Double` add NaN / boxing / ordering foot-guns with no expressive power
gained. Locking down now is non-breaking (no external connectors yet) and
keeps the documented surface area small. Relaxing later is non-breaking;
restricting later is not.

### Does this PR introduce _any_ user-facing change?

The `Changelog` connector API is `@Evolving` and has no external
implementations yet; the restriction only narrows what implementers may
return. No user-facing behavior change.

### How was this patch tested?

- `ChangelogResolutionSuite` (27 tests) -- covers the new accept / reject
  matrix.
- `ResolveChangelogTablePostProcessingSuite`,
  `ResolveChangelogTableStreamingPostProcessingSuite`,
  `ResolveChangelogTableNetChangesSuite`, `ChangelogEndToEndSuite` -- 130
  existing tests still pass on the new contract.
- `UnsupportedOperationsSuite` (216 tests) still passes.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude opus-4-7
@gengliangwang
Copy link
Copy Markdown
Member Author

Locking down the data type before 4.2 release is non-breaking (no external connectors yet) and keeps the documented surface area small.
cc @johanl-db @SanJSp as well.

@HyukjinKwon
Copy link
Copy Markdown
Member

Merged to master, branch-4.x and branch-4.2.

HyukjinKwon pushed a commit that referenced this pull request May 4, 2026
…or StringType

### What changes were proposed in this pull request?

Tighten the CDC `Changelog` connector contract so that `_commit_version` must be either `LongType` or `StringType`. Previously any `AtomicType` was accepted, which left several edge-case types (`IntegerType`, `TimestampType`, `BinaryType`, `Decimal`, `Float`, `Double`, `Boolean`, ...) silently allowed.

- `ChangelogTable.validateSchema` now rejects everything outside `LongType` / `StringType` with a `BIGINT or STRING` expected-type message.
- `Changelog` Javadoc updated to state the narrower contract and explain the ordering requirement (the netChanges post-processing path sorts rows by this column, so the column's natural ordering must match commit order).
- `CdcNetChangesStatefulProcessor` ordering comment updated; the existing Catalyst-routed comparator is left in place for symmetry with the batch `SortOrder`.
- `ChangelogResolutionSuite` updates: accept-list narrowed to `Long` / `String`; reject-list expanded to cover the previously-allowed atomic types (`Integer`, `Timestamp`) plus the existing complex-type cases.

### Why are the changes needed?

`Long` (numeric monotonic version) and `String` (lexicographically ordered commit identifier) cover every realistic CDC source. The other atomic types are either strict subsets (`IntegerType` -> `LongType`) or duplicate the role of `_commit_timestamp` (`TimestampType`); types like `BinaryType` / `Float` / `Double` add NaN / boxing / ordering foot-guns with no expressive power gained. The narrower contract also lets the Javadoc state the ordering requirement precisely (matching what the netChanges code actually relies on).

Locking down now is non-breaking (no external connectors yet) and keeps the documented surface area small. Relaxing later is non-breaking; restricting later is not.

### Does this PR introduce _any_ user-facing change?

The `Changelog` connector API is `Evolving` and has no external implementations yet; the restriction only narrows what implementers may return. No user-facing behavior change.

### How was this patch tested?

- `ChangelogResolutionSuite` (27 tests) covers the new accept / reject matrix.
- `ResolveChangelogTablePostProcessingSuite`, `ResolveChangelogTableStreamingPostProcessingSuite`, `ResolveChangelogTableNetChangesSuite`, `ChangelogEndToEndSuite` -- 98 existing tests still pass on the new contract.
- `UnsupportedOperationsSuite` (216 tests) still passes.
- `Xdoclint:html,syntax,accessibility` is clean on `Changelog.java`; no new warnings under `Xdoclint:all`.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude opus-4-7

Closes #55663 from gengliangwang/SPARK-56711-restrict-commit-version-type.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit ae5c075)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
HyukjinKwon pushed a commit that referenced this pull request May 4, 2026
…or StringType

### What changes were proposed in this pull request?

Tighten the CDC `Changelog` connector contract so that `_commit_version` must be either `LongType` or `StringType`. Previously any `AtomicType` was accepted, which left several edge-case types (`IntegerType`, `TimestampType`, `BinaryType`, `Decimal`, `Float`, `Double`, `Boolean`, ...) silently allowed.

- `ChangelogTable.validateSchema` now rejects everything outside `LongType` / `StringType` with a `BIGINT or STRING` expected-type message.
- `Changelog` Javadoc updated to state the narrower contract and explain the ordering requirement (the netChanges post-processing path sorts rows by this column, so the column's natural ordering must match commit order).
- `CdcNetChangesStatefulProcessor` ordering comment updated; the existing Catalyst-routed comparator is left in place for symmetry with the batch `SortOrder`.
- `ChangelogResolutionSuite` updates: accept-list narrowed to `Long` / `String`; reject-list expanded to cover the previously-allowed atomic types (`Integer`, `Timestamp`) plus the existing complex-type cases.

### Why are the changes needed?

`Long` (numeric monotonic version) and `String` (lexicographically ordered commit identifier) cover every realistic CDC source. The other atomic types are either strict subsets (`IntegerType` -> `LongType`) or duplicate the role of `_commit_timestamp` (`TimestampType`); types like `BinaryType` / `Float` / `Double` add NaN / boxing / ordering foot-guns with no expressive power gained. The narrower contract also lets the Javadoc state the ordering requirement precisely (matching what the netChanges code actually relies on).

Locking down now is non-breaking (no external connectors yet) and keeps the documented surface area small. Relaxing later is non-breaking; restricting later is not.

### Does this PR introduce _any_ user-facing change?

The `Changelog` connector API is `Evolving` and has no external implementations yet; the restriction only narrows what implementers may return. No user-facing behavior change.

### How was this patch tested?

- `ChangelogResolutionSuite` (27 tests) covers the new accept / reject matrix.
- `ResolveChangelogTablePostProcessingSuite`, `ResolveChangelogTableStreamingPostProcessingSuite`, `ResolveChangelogTableNetChangesSuite`, `ChangelogEndToEndSuite` -- 98 existing tests still pass on the new contract.
- `UnsupportedOperationsSuite` (216 tests) still passes.
- `Xdoclint:html,syntax,accessibility` is clean on `Changelog.java`; no new warnings under `Xdoclint:all`.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude opus-4-7

Closes #55663 from gengliangwang/SPARK-56711-restrict-commit-version-type.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit ae5c075)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants