Skip to content

branch-4.1: [fix](streaming-job) start counting task max interval after the first record is received #63141#63163

Merged
yiguolei merged 1 commit into
branch-4.1from
auto-pick-63141-branch-4.1
May 13, 2026
Merged

branch-4.1: [fix](streaming-job) start counting task max interval after the first record is received #63141#63163
yiguolei merged 1 commit into
branch-4.1from
auto-pick-63141-branch-4.1

Conversation

@github-actions
Copy link
Copy Markdown
Contributor

Cherry-picked from #63141

… record is received (#63141)

### What problem does this PR solve?

Problem Summary:

For PG CDC streaming jobs, when a task creates a fresh logical
replication
connection, the walsender must re-decode the WAL region from
`slot.restart_lsn`
up to the client-supplied `startLsn` before any event can be emitted. On
high
RTT networks (e.g. cross-region Aurora) this catch-up alone can take
several
seconds.

Under the previous behavior the task's `maxInterval` window started the
moment
the task entered `writeRecords`, so the time spent on WAL position
lookup +
walsender catch-up was charged against the interval. With
`max_interval=10s`
this consistently caused tasks to finish with `0 records, 0 heartbeats`,
the
slot's `confirmed_flush_lsn` never advanced, and every following task
repeated
the same catch-up — the slot was effectively stuck and replication lag
grew
indefinitely.

This PR delays the start of the `maxInterval` countdown until the first
record
(or heartbeat) is actually received from the source reader, so the
per-task
interval governs the real streaming window rather than being consumed by
setup.
The FE-side `streaming_task_timeout_multiplier * maxIntervalSec` still
acts as
the hard ceiling.
@github-actions github-actions Bot requested a review from yiguolei as a code owner May 12, 2026 06:31
@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@hello-stephen
Copy link
Copy Markdown
Contributor

run buildall

@yiguolei yiguolei merged commit 97d0400 into branch-4.1 May 13, 2026
29 of 32 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants