[fix](streamingjob) Persist cdc_stream TVF offset across FE checkpoint by JNSimba · Pull Request #62902 · apache/doris

JNSimba · 2026-04-28T07:45:18Z

What problem does this PR solve?

Related PR: #62449

Problem Summary:

PR #62449 fixed streaming job offset state-loss after FE checkpoint restart for the S3 path, but the cdc_stream TVF path has the same root cause and worse impact: after a checkpoint restart in the binlog phase, the job replays from the very beginning of the binlog (because currentOffset == null falls through to a fresh BinlogSplit with no startingOffset).

Root cause: JdbcTvfSourceOffsetProvider.getPersistInfo() returns null, so offsetProviderPersist is never written into the FE image. After checkpoint, the pre-checkpoint journal is GC'd, neither journal-replayed currentOffset nor image-persisted state survives, and recovery falls back to a fresh provider with empty chw/bop.

Only the non-cloud mode is affected. Cloud mode is fine because replayOnCloudMode pulls a cumulative attachment from MS.

Fix — reuse the parent's existing chw/bop/ts @SerializedName persistence:

Drop the getPersistInfo() override so the parent's GsonUtils.GSON.toJson(this) writes chw/bop/ts into the image.
Add a restoreFromPersistInfo() override to read them back on FE startup (called from gsonPostProcess).
In updateOffset binlog branch, mirror startingOffset into binlogOffsetPersist so it survives the image (currentOffset has no @SerializedName).
In replayIfNeed currentOffset == null branch, rebuild BinlogSplit from bop, or apply chw (using the existing null.null remap) when restoring snapshot phase.

Release note

Fix cdc_stream TVF offset state loss after FE checkpoint restart (non-cloud mode).

Check List (For Author)

Test
- Regression test (`test_streaming_job_cdc_stream_postgres_checkpoint_restart_fe.groovy` covers both mid-snapshot and binlog-phase checkpoint restart)
- Unit Test
- Manual test
- No need to test
Behavior changed:
- No.
Does this need documentation?
- No.

Check List (For Reviewer who merge this PR)

Confirm the release note
Confirm test cases
Confirm document
Add branch pick label

…lized

JNSimba · 2026-04-29T02:18:48Z

/review

JNSimba · 2026-04-29T02:39:03Z

run buildall

JNSimba · 2026-05-06T09:24:09Z

run p0

JNSimba · 2026-05-06T09:24:21Z

run nonConcurrent

JNSimba · 2026-05-11T06:05:43Z

/review

github-actions

Summary:

I found one blocking test-coverage issue. The production change is focused on persisting and restoring the TVF JDBC offset provider state across checkpoint-image recovery, and the implementation generally follows the existing provider replay model, but the new regression test does not reliably exercise the mid-snapshot checkpoint recovery path that is the risky part of the change.

Critical checkpoint conclusions:

Goal/test proof: The goal is to recover cdc_stream TVF progress after checkpoint GC removes pre-checkpoint journals. The binlog-image path is covered, but the snapshot chw-image path is not reliably proven because the test waits while the job continues to run.
Scope: The code change is small and focused.
Concurrency/lifecycle: No new locks or threads are introduced. The relevant lifecycle is FE image load plus later scheduler replay; no direct lock-order issue found.
Persistence/transactions: The change relies on offsetProviderPersist in the image and txn commit attachments. The main uncovered risk is whether a partially completed snapshot is restored from image without txn replay.
Parallel paths: Cloud replay still uses the existing latest attachment path; no additional issue found in this PR scope.
Test results: Added regression output is ordered, but the new test can pass without exercising the intended mid-snapshot checkpoint scenario.
Observability/performance: Added logs are lightweight; no new hot-path performance issue found.

User focus: No additional user-provided review focus was specified.

liaoxin01

LGTM

github-actions · 2026-05-14T15:14:26Z

PR approved by at least one committer and no changes requested.

already fixed

#62902) ### What problem does this PR solve? Related PR: #62449 Problem Summary: PR #62449 fixed streaming job offset state-loss after FE checkpoint restart for the S3 path, but the cdc_stream TVF path has the same root cause and worse impact: after a checkpoint restart in the binlog phase, the job replays from the very beginning of the binlog (because `currentOffset == null` falls through to a fresh `BinlogSplit` with no `startingOffset`). Root cause: `JdbcTvfSourceOffsetProvider.getPersistInfo()` returns `null`, so `offsetProviderPersist` is never written into the FE image. After checkpoint, the pre-checkpoint journal is GC'd, neither journal-replayed `currentOffset` nor image-persisted state survives, and recovery falls back to a fresh provider with empty `chw`/`bop`. Only the non-cloud mode is affected. Cloud mode is fine because `replayOnCloudMode` pulls a cumulative attachment from MS. Fix — reuse the parent's existing `chw`/`bop`/`ts` `@SerializedName` persistence: - Drop the `getPersistInfo()` override so the parent's `GsonUtils.GSON.toJson(this)` writes `chw/bop/ts` into the image. - Add a `restoreFromPersistInfo()` override to read them back on FE startup (called from `gsonPostProcess`). - In `updateOffset` binlog branch, mirror `startingOffset` into `binlogOffsetPersist` so it survives the image (`currentOffset` has no `@SerializedName`). - In `replayIfNeed` `currentOffset == null` branch, rebuild `BinlogSplit` from `bop`, or apply `chw` (using the existing `null.null` remap) when restoring snapshot phase.

[fix](fe) Persist cdc_stream TVF offset across FE checkpoint

f60d3bc

JNSimba changed the title ~~[fix](fe) Persist cdc_stream TVF offset across FE checkpoint~~ [fix](streamingjob) Persist cdc_stream TVF offset across FE checkpoint Apr 28, 2026

[fix](fe) Avoid wiping restored chunkHighWatermarkMap in ensureInitia…

dcc0b38

…lized

JNSimba requested a review from Copilot April 28, 2026 08:58

JNSimba added the dev/4.1.x label Apr 29, 2026

github-actions Bot previously requested changes May 11, 2026

View reviewed changes

Comment thread ...p0/streaming_job/cdc/tvf/test_streaming_job_cdc_stream_postgres_checkpoint_restart_fe.groovy

liaoxin01 approved these changes May 14, 2026

View reviewed changes

github-actions Bot added the approved Indicates a PR has been approved by one committer. label May 14, 2026

JNSimba merged commit 73b3f5e into apache:master May 15, 2026
33 checks passed

github-actions Bot mentioned this pull request May 15, 2026

branch-4.1: [fix](streamingjob) Persist cdc_stream TVF offset across FE checkpoint #62902 #63270

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[fix](streamingjob) Persist cdc_stream TVF offset across FE checkpoint#62902

[fix](streamingjob) Persist cdc_stream TVF offset across FE checkpoint#62902
JNSimba merged 2 commits into
apache:masterfrom
JNSimba:fix_cdc_tvf_checkpoint

JNSimba commented Apr 28, 2026

Uh oh!

JNSimba commented Apr 29, 2026

Uh oh!

JNSimba commented Apr 29, 2026

Uh oh!

JNSimba commented May 6, 2026

Uh oh!

JNSimba commented May 6, 2026

Uh oh!

JNSimba commented May 11, 2026

Uh oh!

github-actions Bot left a comment

Uh oh!

Uh oh!

liaoxin01 left a comment

Uh oh!

github-actions Bot commented May 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

JNSimba commented Apr 28, 2026

What problem does this PR solve?

Release note

Check List (For Author)

Check List (For Reviewer who merge this PR)

Uh oh!

JNSimba commented Apr 29, 2026

Uh oh!

JNSimba commented Apr 29, 2026

Uh oh!

JNSimba commented May 6, 2026

Uh oh!

JNSimba commented May 6, 2026

Uh oh!

JNSimba commented May 11, 2026

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

liaoxin01 left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented May 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants