[fix](streamingjob) Persist cdc_stream TVF offset across FE checkpoint #62902
Merged
Summary:
I found one blocking test-coverage issue. The production change focuses on persisting and restoring the TVF JDBC offset provider state across checkpoint-image recovery, and the implementation generally follows the existing provider replay model. However, the new regression test does not reliably exercise the mid-snapshot checkpoint recovery path, which is the risky part of the change.
Critical checkpoint conclusions:
- Goal/test proof: The goal is to recover cdc_stream TVF progress after checkpoint GC removes pre-checkpoint journals. The binlog-image path is covered, but the snapshot chw-image path is not reliably proven because the test waits while the job continues to run.
- Scope: The code change is small and focused.
- Concurrency/lifecycle: No new locks or threads are introduced. The relevant lifecycle is FE image load plus later scheduler replay; no direct lock-order issue found.
- Persistence/transactions: The change relies on offsetProviderPersist in the image and txn commit attachments. The main uncovered risk is whether a partially completed snapshot is restored from image without txn replay.
- Parallel paths: Cloud replay still uses the existing latest attachment path; no additional issue found in this PR scope.
- Test results: Added regression output is ordered, but the new test can pass without exercising the intended mid-snapshot checkpoint scenario.
- Observability/performance: Added logs are lightweight; no new hot-path performance issue found.
User focus: No additional user-provided review focus was specified.
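The coverage concern above comes down to timing: if the job keeps running while the test waits, the snapshot may already be complete when the checkpoint happens. A minimal sketch of a precondition guard follows. It is an assumption about how such a test could be made deterministic, not what the PR's regression test does, and `MidSnapshotGuard`, `isMidSnapshot`, and `await` are hypothetical names.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper; not part of the PR or of the Doris regression framework.
final class MidSnapshotGuard {

    /** True only while the snapshot has started but has not finished. */
    static boolean isMidSnapshot(int finishedSplits, int totalSplits) {
        return totalSplits > 0 && finishedSplits > 0 && finishedSplits < totalSplits;
    }

    /** Polls the condition until it holds or the timeout elapses. */
    static boolean await(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (System.nanoTime() < deadline) {
            if (condition.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and report that the condition was not observed.
                Thread.currentThread().interrupt();
                return false;
            }
        }
        // One last check so a condition that became true right at the deadline is not missed.
        return condition.getAsBoolean();
    }
}
```

A test built this way would fail fast when the mid-snapshot state is never observed before the checkpoint, instead of passing without covering the recovery path.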
PR approved by at least one committer and no changes requested.
github-actions bot pushed a commit that referenced this pull request on May 15, 2026:

#62902)

### What problem does this PR solve?

Related PR: #62449

Problem Summary:

PR #62449 fixed streaming job offset state-loss after FE checkpoint restart for the S3 path, but the cdc_stream TVF path has the same root cause and worse impact: after a checkpoint restart in the binlog phase, the job replays from the very beginning of the binlog (because `currentOffset == null` falls through to a fresh `BinlogSplit` with no `startingOffset`).

Root cause: `JdbcTvfSourceOffsetProvider.getPersistInfo()` returns `null`, so `offsetProviderPersist` is never written into the FE image. After checkpoint, the pre-checkpoint journal is GC'd, neither journal-replayed `currentOffset` nor image-persisted state survives, and recovery falls back to a fresh provider with empty `chw`/`bop`.

Only the non-cloud mode is affected. Cloud mode is fine because `replayOnCloudMode` pulls a cumulative attachment from MS.

Fix: reuse the parent's existing `chw`/`bop`/`ts` `@SerializedName` persistence:

- Drop the `getPersistInfo()` override so the parent's `GsonUtils.GSON.toJson(this)` writes `chw/bop/ts` into the image.
- Add a `restoreFromPersistInfo()` override to read them back on FE startup (called from `gsonPostProcess`).
- In `updateOffset` binlog branch, mirror `startingOffset` into `binlogOffsetPersist` so it survives the image (`currentOffset` has no `@SerializedName`).
- In `replayIfNeed` `currentOffset == null` branch, rebuild `BinlogSplit` from `bop`, or apply `chw` (using the existing `null.null` remap) when restoring snapshot phase.
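The binlog-phase part of the fix described above can be sketched with a simplified, standard-library-only model. Everything here is hypothetical: `OffsetPersistSketch`, its nested `Provider` and `BinlogSplit`, and the use of a plain `Map` as the "image" are stand-ins rather than the real Doris classes, Gson is not used, and the snapshot-phase `chw` handling is not modeled. The method names only mirror those in the description.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model; none of these types are the real Doris classes.
final class OffsetPersistSketch {

    /** Stand-in for a binlog split. A null startingOffset means "read from the start". */
    static final class BinlogSplit {
        final Map<String, String> startingOffset;

        BinlogSplit(Map<String, String> startingOffset) {
            this.startingOffset = startingOffset;
        }
    }

    static final class Provider {
        // Persisted state: plays the role of the `bop` field in the PR description.
        Map<String, String> bop = new HashMap<>();
        // In-memory only: lost whenever the provider is rebuilt from the image.
        BinlogSplit currentOffset;

        /** Binlog branch of updateOffset: mirror the offset into the persisted field. */
        void updateOffset(Map<String, String> startingOffset) {
            currentOffset = new BinlogSplit(startingOffset);
            bop = new HashMap<>(startingOffset);
        }

        /** What goes into the image. Returning null here would reproduce the bug. */
        Map<String, String> getPersistInfo() {
            return new HashMap<>(bop);
        }

        /** Rebuild a provider from the image; currentOffset stays null until replay. */
        static Provider restoreFromPersistInfo(Map<String, String> image) {
            Provider p = new Provider();
            if (image != null) {
                p.bop = new HashMap<>(image);
            }
            return p;
        }

        /** currentOffset == null branch: resume from bop instead of a fresh split. */
        void replayIfNeed() {
            if (currentOffset != null) {
                return;
            }
            currentOffset = bop.isEmpty()
                    ? new BinlogSplit(null)
                    : new BinlogSplit(new HashMap<>(bop));
        }
    }

    public static void main(String[] args) {
        Provider before = new Provider();
        before.updateOffset(Map.of("file", "binlog.000003", "pos", "154"));
        // Simulate checkpoint plus restart: only the persisted info survives.
        Provider after = Provider.restoreFromPersistInfo(before.getPersistInfo());
        after.replayIfNeed();
        System.out.println(after.currentOffset.startingOffset);
    }
}
```

In this model, passing `null` to `restoreFromPersistInfo` corresponds to the pre-fix behaviour: `bop` stays empty and `replayIfNeed` falls back to a split with no starting offset.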
### Release note
Fix cdc_stream TVF offset state loss after FE checkpoint restart (non-cloud mode).