[fix](streaming-job) bound cdc_client RPCs with per-category timeouts by JNSimba · Pull Request #62870 · apache/doris

JNSimba · 2026-04-27T10:09:52Z

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

JdbcSourceOffsetProvider.cleanMeta() and several other cdc_client RPCs called future.get() with no timeout. When the cdc_client (or the PG/MySQL instance behind it) hangs, the call blocks forever. For cleanMeta() this is fatal because it runs inside JobManager.dropJobInternal() while holding JobManager.writeLock(): any subsequent CREATE / DROP / SHOW JOB on streaming jobs then queues behind a lock that is never released, effectively freezing the streaming-job control plane.
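The failure mode above can be reproduced in miniature. The sketch below is not the Doris code: `BoundedCleanupSketch`, `cleanUnderLock`, and the lock field are hypothetical stand-ins, and a never-completed `CompletableFuture` simulates a hung cdc_client. It only illustrates why a bounded `get` lets the write lock be released.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for the drop-job path; these are not the Doris classes.
final class BoundedCleanupSketch {
    private static final ReentrantReadWriteLock LOCK = new ReentrantReadWriteLock();

    /** Waits for the RPC while holding the write lock, but never longer than timeoutMs. */
    static boolean cleanUnderLock(Future<String> rpc, long timeoutMs) {
        LOCK.writeLock().lock();
        try {
            // rpc.get() with no arguments would block here forever if the peer hangs.
            rpc.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException | ExecutionException e) {
            return false; // give up so the finally block can release the lock
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            LOCK.writeLock().unlock();
        }
    }

    static boolean lockIsFree() {
        return !LOCK.isWriteLocked();
    }

    public static void main(String[] args) {
        Future<String> hung = new CompletableFuture<>(); // never completed: simulates a hung cdc_client
        boolean finished = cleanUnderLock(hung, 200);
        System.out.println("finished=" + finished + ", lockFree=" + lockIsFree());
    }
}
```

With the bounded wait, the call returns after the timeout and other threads can take the lock again; with an unbounded `get()` the `finally` block would never run.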

Fix

Introduce two configurable timeouts (mirroring the BE brpc_light/heavy_work_pool naming) and apply them to all 8 cdc_client RPC call sites:

- streaming_cdc_light_rpc_timeout_sec = 90 for /api/close, /api/compareOffset, /api/fetchEndOffset, /api/getTaskOffset, /api/getFailReason (server-side single-statement queries or cache lookups, expected to be sub-second). The default is 90s rather than 30s to absorb cdc_client cold start: when the BE-spawned cdc_client process is not yet running, start_cdc_client performs a health-check loop (worst case ~45s) before serving the request. 90s leaves enough headroom to avoid spurious timeouts during this window while still bounding how long JobManager.writeLock is held.
- streaming_cdc_heavy_rpc_timeout_sec = 600 for /api/initReader, /api/fetchSplits, /api/writeRecords (may legitimately take minutes for replication slot creation, large snapshot split computation, or batch writes).
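To make the light/heavy split concrete, here is a minimal sketch of the endpoint-to-timeout mapping described above. The class, enum, and lookup table are hypothetical (the PR passes the timeout at each call site and does not necessarily use a table), and the two static fields merely stand in for the FE configs; only the endpoint names and default values come from the description.

```java
import java.util.Map;

// Hypothetical illustration of the light/heavy categorisation; not the Doris implementation.
final class CdcRpcTimeoutSketch {
    // Stand-ins for the two FE configs. Volatile so a runtime change is seen by later calls.
    static volatile int lightTimeoutSec = 90;
    static volatile int heavyTimeoutSec = 600;

    enum Category { LIGHT, HEAVY }

    private static final Map<String, Category> CATEGORY_BY_API = Map.of(
            "/api/close", Category.LIGHT,
            "/api/compareOffset", Category.LIGHT,
            "/api/fetchEndOffset", Category.LIGHT,
            "/api/getTaskOffset", Category.LIGHT,
            "/api/getFailReason", Category.LIGHT,
            "/api/initReader", Category.HEAVY,
            "/api/fetchSplits", Category.HEAVY,
            "/api/writeRecords", Category.HEAVY);

    static int apiCount() {
        return CATEGORY_BY_API.size();
    }

    /** Reads the current timeout at call time, so a config change applies without a restart. */
    static int timeoutSecFor(String api) {
        Category category = CATEGORY_BY_API.get(api);
        if (category == null) {
            throw new IllegalArgumentException("unknown cdc_client api: " + api);
        }
        return category == Category.HEAVY ? heavyTimeoutSec : lightTimeoutSec;
    }

    public static void main(String[] args) {
        System.out.println("/api/close -> " + timeoutSecFor("/api/close") + "s");
        System.out.println("/api/writeRecords -> " + timeoutSecFor("/api/writeRecords") + "s");
    }
}
```

Reading the value on every call, rather than caching it, is what makes a mutable setting take effect for subsequent RPCs.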

Both configs are mutable = true so they can be tuned via ADMIN SET FRONTEND CONFIG without restarting FE.

BackendServiceClient.requestCdcClient gains a timeout overload that applies a gRPC withDeadlineAfter; the per-call-site future.get(...) also passes the same timeout so the deadline is enforced on both sides.
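A rough model of the "both sides" enforcement, using only the JDK: `CompletableFuture.orTimeout` plays the role of the transport-level deadline (the PR uses gRPC `withDeadlineAfter`, which is not reproduced here), and `get(timeout, unit)` is the caller-side bound. All names in the sketch are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// JDK-only model of a deadline applied at two layers; not the gRPC or Doris API.
final class TwoSidedDeadlineSketch {
    /** Stand-in for the transport-side deadline (gRPC withDeadlineAfter in the PR). */
    static CompletableFuture<String> sendWithDeadline(CompletableFuture<String> rawCall, long timeoutMs) {
        return rawCall.orTimeout(timeoutMs, TimeUnit.MILLISECONDS);
    }

    /** Caller-side wait with the same bound; reports which layer ended the call. */
    static String callBounded(CompletableFuture<String> rawCall, long timeoutMs) {
        CompletableFuture<String> future = sendWithDeadline(rawCall, timeoutMs);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return "caller-timeout"; // the caller's own wait expired first
        } catch (ExecutionException e) {
            // The transport-side deadline completed the future exceptionally.
            return e.getCause() instanceof TimeoutException ? "transport-deadline" : "failed";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
    }

    public static void main(String[] args) {
        System.out.println(callBounded(CompletableFuture.completedFuture("ok"), 200));
        // Which of the two bounds fires first for a hung call is a race, so either label may print.
        System.out.println(callBounded(new CompletableFuture<>(), 200));
    }
}
```

Either bound alone ends the wait; having both means the caller is released even if one layer fails to honor its deadline.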

On timeout we WARN with a uniform line carrying api / jobId / backend / timeout_sec for easy log aggregation. cleanMeta keeps its existing swallow-on-failure semantics (a cleanup hiccup must not fail DROP JOB); the other seven sites throw JobException consistent with their existing ExecutionException handling.
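The handling policy in the paragraph above might look roughly like the following. This is a sketch under assumptions: the log wording, the `awaitOrHandle` helper, and the `swallowOnFailure` flag are invented for illustration, and `JobException` is a local unchecked stand-in defined here only to keep the example short. Only the four logged fields and the swallow-versus-throw split come from the description.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.logging.Logger;

// Hypothetical sketch of uniform timeout logging plus swallow-or-throw handling.
final class TimeoutHandlingSketch {
    private static final Logger LOG = Logger.getLogger("cdc-rpc-sketch");

    /** Local stand-in for Doris's JobException; unchecked here only for brevity. */
    static final class JobException extends RuntimeException {
        JobException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** One fixed key=value layout so log aggregation can match on a single pattern. */
    static String timeoutWarnLine(String api, long jobId, String backend, int timeoutSec) {
        return String.format("cdc_client rpc timeout, api=%s, jobId=%d, backend=%s, timeout_sec=%d",
                api, jobId, backend, timeoutSec);
    }

    /** Returns true on success; on failure either returns false (swallow) or throws. */
    static boolean awaitOrHandle(Future<?> future, String api, long jobId, String backend,
                                 int timeoutSec, boolean swallowOnFailure) {
        try {
            future.get(timeoutSec, TimeUnit.SECONDS);
            return true;
        } catch (TimeoutException e) {
            String line = timeoutWarnLine(api, jobId, backend, timeoutSec);
            LOG.warning(line);
            if (swallowOnFailure) {
                return false; // cleanMeta-style: a cleanup hiccup must not fail the caller
            }
            throw new JobException(line, e);
        } catch (ExecutionException e) {
            if (swallowOnFailure) {
                return false;
            }
            throw new JobException("cdc_client rpc failed, api=" + api, e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            if (swallowOnFailure) {
                return false;
            }
            throw new JobException("interrupted while waiting for " + api, e);
        }
    }

    public static void main(String[] args) {
        // The backend address below is a made-up example value.
        Future<String> hung = new CompletableFuture<>();
        System.out.println(awaitOrHandle(hung, "/api/close", 42L, "10.0.0.1:9060", 1, true));
    }
}
```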

Release note

Add streaming-job FE configs streaming_cdc_light_rpc_timeout_sec (default 90s) and streaming_cdc_heavy_rpc_timeout_sec (default 600s) to bound cdc_client RPCs.

Check List (For Author)

Test
- Regression test
- Unit Test
- Manual test (add detailed scripts or steps below)
- No need to test or manual test. Explain why:
  - This is a refactor/code format and no logic has been changed.
  - Previous test can cover this change.
  - No code files have been changed.
  - Other reason: defensive timeout — main path behavior is unchanged, the new branch only triggers when cdc_client itself is misbehaving (which is hard to reproduce in CI).
Behavior changed:
- No.
- Yes.
Does this need documentation?
- No.
- Yes.

Check List (For Reviewer who merge this PR)

Confirm the release note
Confirm test cases
Confirm document
Add branch pick label

Thearas · 2026-04-27T10:09:58Z

Thank you for your contribution to Apache Doris.

JNSimba · 2026-04-28T01:47:50Z

/review

JNSimba · 2026-04-28T06:22:02Z

run buildall

Copilot

Pull request overview

This PR prevents the streaming-job FE control plane from hanging indefinitely by bounding cdc_client RPC waits with configurable per-category timeouts, enforced both via gRPC deadlines and Future.get(...) timeouts.

Changes:

Add timeout-aware requestCdcClient overloads in BackendServiceClient/BackendServiceProxy using gRPC withDeadlineAfter.
Apply per-RPC timeouts (light vs heavy) and explicit future.get(timeout, ...) across streaming-job cdc_client call sites, with consistent timeout WARN logging.
Introduce new mutable FE configs for cdc_client RPC timeouts.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Show a summary per file

| File | Description |
| --- | --- |
| fe/fe-core/src/main/java/org/apache/doris/rpc/BackendServiceProxy.java | Adds `requestCdcClient(..., timeoutSec)` overload to pass timeout to client. |
| fe/fe-core/src/main/java/org/apache/doris/rpc/BackendServiceClient.java | Adds timeout overload applying gRPC deadline for cdc_client RPCs. |
| fe/fe-core/src/main/java/org/apache/doris/job/offset/jdbc/JdbcTvfSourceOffsetProvider.java | Bounds `/api/getTaskOffset` wait with configurable timeout and adds timeout logging. |
| fe/fe-core/src/main/java/org/apache/doris/job/offset/jdbc/JdbcSourceOffsetProvider.java | Bounds multiple cdc_client RPCs with configurable timeouts and adds timeout handling/logging. |
| fe/fe-core/src/main/java/org/apache/doris/job/extensions/insert/streaming/StreamingMultiTblTask.java | Bounds `/api/writeRecords` and `/api/getFailReason` calls and adds timeout logging/handling. |
| fe/fe-common/src/main/java/org/apache/doris/common/Config.java | Introduces new mutable FE configs for light/heavy cdc_client RPC timeouts. |


JNSimba · 2026-04-29T02:19:22Z

/review

JNSimba · 2026-04-29T02:41:48Z

run buildall

### What problem does this PR solve?

Issue Number: close #xxx

Problem Summary:

`JdbcSourceOffsetProvider.cleanMeta()` and several other cdc_client RPCs called `future.get()` with no timeout. When the cdc_client (or the PG/MySQL instance behind it) hangs, the call blocks forever. For `cleanMeta()` this is fatal because it runs inside `JobManager.dropJobInternal()` while holding `JobManager.writeLock()`: any subsequent CREATE / DROP / SHOW JOB on streaming jobs then queues behind a lock that is never released, effectively freezing the streaming-job control plane.

### Fix

Introduce two configurable timeouts (mirroring the BE `brpc_light/heavy_work_pool` naming) and apply them to all 8 cdc_client RPC call sites:

- `streaming_cdc_light_rpc_timeout_sec = 30` for `/api/close`, `/api/compareOffset`, `/api/getTaskOffset`, `/api/getFailReason` (expected sub-second).
- `streaming_cdc_heavy_rpc_timeout_sec = 600` for `/api/initReader`, `/api/fetchSplits`, `/api/fetchEndOffset`, `/api/writeRecords` (may take minutes for schema discovery / large snapshot splits).

Both configs are `mutable = true` so they can be tuned via `ADMIN SET FRONTEND CONFIG` without restarting FE.

`BackendServiceClient.requestCdcClient` gains a timeout overload that applies a gRPC `withDeadlineAfter`; the per-call-site `future.get(...)` also passes the same timeout so the deadline is enforced on both sides.

On timeout we WARN with a uniform line carrying api / jobId / backend / timeout_sec for easy log aggregation. `cleanMeta` keeps its existing swallow-on-failure semantics (a cleanup hiccup must not fail DROP JOB); the other seven sites throw `JobException` consistent with their existing `ExecutionException` handling.

### Release note

Add streaming-job FE configs `streaming_cdc_light_rpc_timeout_sec` and `streaming_cdc_heavy_rpc_timeout_sec` to bound cdc_client RPCs.

liaoxin01

LGTM

github-actions · 2026-04-29T13:51:28Z

PR approved by at least one committer and no changes requested.

github-actions · 2026-04-29T13:51:31Z

PR approved by anyone and no changes requested.

JNSimba · 2026-04-30T00:59:54Z

run buildall

JNSimba · 2026-04-30T07:42:28Z

run p0

JNSimba · 2026-04-30T07:42:35Z

run cloud_p0

JNSimba · 2026-04-30T09:47:25Z

run p0

…#62870)

### What problem does this PR solve?

Introduce two configurable timeouts (mirroring the BE `brpc_light/heavy_work_pool` naming) and apply them to all 8 cdc_client RPC call sites:

- `streaming_cdc_light_rpc_timeout_sec = 90` for `/api/close`, `/api/compareOffset`, `/api/fetchEndOffset`, `/api/getTaskOffset`, `/api/getFailReason` (server-side single-statement queries / cache lookups, expected sub-second). Default is 90s rather than 30s to absorb cdc_client cold-start: when the BE-spawned cdc_client process is not yet running, `start_cdc_client` performs a health-check loop (worst case ~45s) before serving the request. 90s gives enough headroom to avoid spurious timeouts during this window while still bounding `JobManager.writeLock` hold time.
- `streaming_cdc_heavy_rpc_timeout_sec = 600` for `/api/initReader`, `/api/fetchSplits`, `/api/writeRecords` (may legitimately take minutes for replication slot creation, large snapshot split computation, or batch writes).

JNSimba force-pushed the fix-doris-25420-cdc-rpc-timeout branch 3 times, most recently from d462026 to 403fefc on April 28, 2026 06:21

JNSimba added the dev/4.1.x label Apr 28, 2026

JNSimba requested a review from Copilot April 28, 2026 08:45

Copilot started reviewing on behalf of JNSimba April 28, 2026 08:46

Copilot AI reviewed Apr 28, 2026

JNSimba force-pushed the fix-doris-25420-cdc-rpc-timeout branch from 403fefc to ac29a53 on April 29, 2026 02:54

liaoxin01 approved these changes Apr 29, 2026

github-actions Bot added the approved Indicates a PR has been approved by one committer. label Apr 29, 2026

github-actions Bot added the reviewed label Apr 29, 2026

JNSimba closed this Apr 29, 2026

JNSimba reopened this Apr 29, 2026

JNSimba merged commit 235822d into apache:master Apr 30, 2026
49 of 50 checks passed

github-actions Bot mentioned this pull request Apr 30, 2026

branch-4.1: [fix](streaming-job) bound cdc_client RPCs with per-category timeouts #62870 #62983

Open

4 participants
