[fix](streaming-job) bound cdc_client RPCs with per-category timeouts #62870

Merged
JNSimba merged 1 commit into apache:master from JNSimba:fix-doris-25420-cdc-rpc-timeout
Apr 30, 2026

Conversation

@JNSimba
Member

@JNSimba JNSimba commented Apr 27, 2026

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

JdbcSourceOffsetProvider.cleanMeta() and several other cdc_client RPCs called future.get() with no timeout. When the cdc_client (or the PG/MySQL instance behind it) hangs, the call blocks forever. For cleanMeta() this is fatal because it runs inside JobManager.dropJobInternal() while holding JobManager.writeLock(): any subsequent CREATE / DROP / SHOW JOB on streaming jobs then queues behind a lock that is never released, effectively freezing the streaming-job control plane.
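
The failure mode above can be reproduced in miniature. The following is a minimal sketch, not the Doris code: `BoundedWaitUnderLock`, `dropWithBoundedCleanup`, and `lockIsFree` are hypothetical names, and the lock stands in for `JobManager.writeLock()`. It only illustrates that a bounded `Future.get(timeout, unit)` lets the `finally` block release the lock even when the remote side never answers.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for the drop-job path; not the actual JobManager.
final class BoundedWaitUnderLock {
    private static final ReentrantReadWriteLock LOCK = new ReentrantReadWriteLock();

    /** Returns true if the RPC finished in time, false if the wait was abandoned or failed. */
    static boolean dropWithBoundedCleanup(Future<String> rpc, long timeoutMs) {
        LOCK.writeLock().lock();
        try {
            // Bounded wait: an unbounded rpc.get() here would hold the lock forever on a hang.
            rpc.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            rpc.cancel(true); // stop waiting on the hung call
            return false;
        } catch (ExecutionException e) {
            return false; // the RPC itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
            return false;
        } finally {
            LOCK.writeLock().unlock(); // always reached because the wait is bounded
        }
    }

    /** True when no thread currently holds the write lock. */
    static boolean lockIsFree() {
        return !LOCK.isWriteLocked();
    }
}
```

Passing a `CompletableFuture` that is never completed simulates a hung cdc_client: the method returns `false` after the timeout and the lock is free again for the next caller.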

Fix

Introduce two configurable timeouts (mirroring the BE brpc_light/heavy_work_pool naming) and apply them to all 8 cdc_client RPC call sites:

  • streaming_cdc_light_rpc_timeout_sec = 90 for /api/close, /api/compareOffset, /api/fetchEndOffset, /api/getTaskOffset, /api/getFailReason (server-side single-statement queries / cache lookups, expected sub-second). Default is 90s rather than 30s to absorb cdc_client cold-start: when the BE-spawned cdc_client process is not yet running, start_cdc_client performs a health-check loop (worst case ~45s) before serving the request — 90s gives enough headroom to avoid spurious timeouts during this window while still bounding JobManager.writeLock hold time.
  • streaming_cdc_heavy_rpc_timeout_sec = 600 for /api/initReader, /api/fetchSplits, /api/writeRecords (may legitimately take minutes for replication slot creation, large snapshot split computation, or batch writes).
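
The light/heavy split above can be summarized as a lookup. This is an illustrative sketch only: the class `CdcRpcTimeouts` and the method `timeoutSecFor` are hypothetical, and the PR may simply pass the relevant config value at each call site rather than use a table. The two static fields stand in for the FE config values and carry the defaults stated in the description.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper; the field names are stand-ins for the two FE configs.
final class CdcRpcTimeouts {
    // Defaults from the PR description. Volatile so a runtime change is seen by all threads.
    static volatile int lightTimeoutSec = 90;
    static volatile int heavyTimeoutSec = 600;

    private static final Set<String> LIGHT_APIS = new HashSet<>(Arrays.asList(
            "/api/close", "/api/compareOffset", "/api/fetchEndOffset",
            "/api/getTaskOffset", "/api/getFailReason"));

    private static final Set<String> HEAVY_APIS = new HashSet<>(Arrays.asList(
            "/api/initReader", "/api/fetchSplits", "/api/writeRecords"));

    /** Resolves the timeout at call time, so a changed config applies to the next RPC. */
    static int timeoutSecFor(String api) {
        if (LIGHT_APIS.contains(api)) {
            return lightTimeoutSec;
        }
        if (HEAVY_APIS.contains(api)) {
            return heavyTimeoutSec;
        }
        throw new IllegalArgumentException("unknown cdc_client api: " + api);
    }
}
```

Reading the value on every call, rather than caching it, is what makes a mutable config useful: a new value takes effect for the next request without a restart.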

Both configs are mutable = true so they can be tuned via ADMIN SET FRONTEND CONFIG without restarting FE.

BackendServiceClient.requestCdcClient gains a timeout overload that applies a gRPC withDeadlineAfter; the per-call-site future.get(...) also passes the same timeout so the deadline is enforced on both sides.
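
A sketch of the "same timeout on both sides" idea follows. To stay runnable without the gRPC dependency, it uses a hypothetical `AsyncCall` interface in place of the real stub; in an actual gRPC client the transport-side bound would come from `withDeadlineAfter(duration, unit)` on the stub. `TwoSidedDeadline` and `callWithTimeout` are illustrative names, not the PR's API.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical wrapper; AsyncCall stands in for a gRPC future stub.
final class TwoSidedDeadline {
    /** Starts an async call; implementations are expected to apply the deadline on the transport. */
    interface AsyncCall {
        Future<String> start(long timeout, TimeUnit unit);
    }

    static String callWithTimeout(AsyncCall call, long timeout, TimeUnit unit)
            throws TimeoutException, ExecutionException, InterruptedException {
        // Side 1: the transport receives the deadline, so the RPC layer can abort the call.
        Future<String> future = call.start(timeout, unit);
        try {
            // Side 2: the caller also bounds its own wait with the same value.
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            future.cancel(true); // do not leave an abandoned call running
            throw e;
        }
    }
}
```

The caller-side bound is the backstop: even if the transport deadline fails to fire for some reason, the waiting thread is still released after the same interval.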

On timeout we WARN with a uniform line carrying api / jobId / backend / timeout_sec for easy log aggregation. cleanMeta keeps its existing swallow-on-failure semantics (a cleanup hiccup must not fail DROP JOB); the other seven sites throw JobException consistent with their existing ExecutionException handling.
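
The two failure policies described above might look roughly like this. Everything here is an assumption made for illustration: the nested `JobException` is a stand-in for the Doris class of the same name, the exact log wording is invented (only the four fields api / jobId / backend / timeout_sec come from the description), and `callOrThrow` and `cleanupQuietly` are hypothetical method names.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.logging.Logger;

// Hypothetical sketch of "throw on timeout" versus "swallow on cleanup".
final class CdcTimeoutHandling {
    /** Stand-in for the Doris JobException; defined here so the sketch compiles alone. */
    static final class JobException extends Exception {
        JobException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    private static final Logger LOG = Logger.getLogger("cdc_client");

    /** One uniform line per timeout so logs can be aggregated by field (wording is assumed). */
    static String timeoutLine(String api, long jobId, String backend, long timeoutSec) {
        return String.format("cdc_client rpc timeout, api=%s, jobId=%d, backend=%s, timeout_sec=%d",
                api, jobId, backend, timeoutSec);
    }

    /** Policy for regular call sites: log a WARN line, then surface the failure to the caller. */
    static String callOrThrow(Future<String> rpc, String api, long jobId, String backend,
                              long timeoutSec) throws JobException {
        try {
            return rpc.get(timeoutSec, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            rpc.cancel(true);
            String line = timeoutLine(api, jobId, backend, timeoutSec);
            LOG.warning(line);
            throw new JobException(line, e);
        } catch (ExecutionException e) {
            throw new JobException("cdc_client rpc failed, api=" + api, e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new JobException("cdc_client rpc interrupted, api=" + api, e);
        }
    }

    /** Policy for cleanup: the failure is logged but never propagated to the DROP JOB path. */
    static void cleanupQuietly(Future<String> rpc, String api, long jobId, String backend,
                               long timeoutSec) {
        try {
            callOrThrow(rpc, api, jobId, backend, timeoutSec);
        } catch (JobException e) {
            LOG.warning("ignoring cleanup failure: " + e.getMessage());
        }
    }
}
```

Keeping the key=value fields in a fixed order makes it straightforward to filter timeouts by API or backend in a log aggregator.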

Release note

Add streaming-job FE configs streaming_cdc_light_rpc_timeout_sec (default 90s) and streaming_cdc_heavy_rpc_timeout_sec (default 600s) to bound cdc_client RPCs.

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason: defensive timeout — main path behavior is unchanged, the new branch only triggers when cdc_client itself is misbehaving (which is hard to reproduce in CI).
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Contributor

Thearas commented Apr 27, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@JNSimba
Member Author

JNSimba commented Apr 28, 2026

/review

@JNSimba JNSimba force-pushed the fix-doris-25420-cdc-rpc-timeout branch 3 times, most recently from d462026 to 403fefc Compare April 28, 2026 06:21
@JNSimba
Member Author

JNSimba commented Apr 28, 2026

run buildall

Contributor

Copilot AI left a comment

Pull request overview

This PR prevents the streaming-job FE control plane from hanging indefinitely by bounding cdc_client RPC waits with configurable per-category timeouts, enforced both via gRPC deadlines and Future.get(...) timeouts.

Changes:

  • Add timeout-aware requestCdcClient overloads in BackendServiceClient/BackendServiceProxy using gRPC withDeadlineAfter.
  • Apply per-RPC timeouts (light vs heavy) and explicit future.get(timeout, ...) across streaming-job cdc_client call sites, with consistent timeout WARN logging.
  • Introduce new mutable FE configs for cdc_client RPC timeouts.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| fe/fe-core/src/main/java/org/apache/doris/rpc/BackendServiceProxy.java | Adds requestCdcClient(..., timeoutSec) overload to pass timeout to client. |
| fe/fe-core/src/main/java/org/apache/doris/rpc/BackendServiceClient.java | Adds timeout overload applying gRPC deadline for cdc_client RPCs. |
| fe/fe-core/src/main/java/org/apache/doris/job/offset/jdbc/JdbcTvfSourceOffsetProvider.java | Bounds /api/getTaskOffset wait with configurable timeout and adds timeout logging. |
| fe/fe-core/src/main/java/org/apache/doris/job/offset/jdbc/JdbcSourceOffsetProvider.java | Bounds multiple cdc_client RPCs with configurable timeouts and adds timeout handling/logging. |
| fe/fe-core/src/main/java/org/apache/doris/job/extensions/insert/streaming/StreamingMultiTblTask.java | Bounds /api/writeRecords and /api/getFailReason calls and adds timeout logging/handling. |
| fe/fe-common/src/main/java/org/apache/doris/common/Config.java | Introduces new mutable FE configs for light/heavy cdc_client RPC timeouts. |

Comment thread fe/fe-common/src/main/java/org/apache/doris/common/Config.java
@JNSimba
Member Author

JNSimba commented Apr 29, 2026

/review

@JNSimba
Member Author

JNSimba commented Apr 29, 2026

run buildall

### What problem does this PR solve?

Issue Number: close #xxx

Problem Summary:

`JdbcSourceOffsetProvider.cleanMeta()` and several other cdc_client RPCs
called `future.get()` with no timeout. When the cdc_client (or PG/MySQL
behind it) hangs, the call blocks forever. For `cleanMeta()` this is
fatal because it runs inside `JobManager.dropJobInternal()` while
holding `JobManager.writeLock()` — any subsequent CREATE / DROP / SHOW
JOB on streaming jobs is then serialized behind the dead lock,
effectively freezing the streaming-job control plane.

### Fix

Introduce two configurable timeouts (mirroring the BE
`brpc_light/heavy_work_pool` naming) and apply them to all 8 cdc_client
RPC call sites:

- `streaming_cdc_light_rpc_timeout_sec = 30` for
  `/api/close`, `/api/compareOffset`, `/api/getTaskOffset`,
  `/api/getFailReason` (expected sub-second).
- `streaming_cdc_heavy_rpc_timeout_sec = 600` for
  `/api/initReader`, `/api/fetchSplits`, `/api/fetchEndOffset`,
  `/api/writeRecords` (may take minutes for schema discovery / large
  snapshot splits).

Both configs are `mutable = true` so they can be tuned via
`ADMIN SET FRONTEND CONFIG` without restarting FE.

`BackendServiceClient.requestCdcClient` gains a timeout overload that
applies a gRPC `withDeadlineAfter`; the per-call site `future.get(...)`
also passes the same timeout so the deadline is enforced on both sides.

On timeout we WARN with a uniform line carrying api / jobId / backend /
timeout_sec for easy log aggregation. `cleanMeta` keeps its existing
swallow-on-failure semantics (a cleanup hiccup must not fail DROP JOB);
the other seven sites throw `JobException` consistent with their
existing `ExecutionException` handling.

### Release note

Add streaming-job FE configs `streaming_cdc_light_rpc_timeout_sec` and
`streaming_cdc_heavy_rpc_timeout_sec` to bound cdc_client RPCs.
@JNSimba JNSimba force-pushed the fix-doris-25420-cdc-rpc-timeout branch from 403fefc to ac29a53 Compare April 29, 2026 02:54
Contributor

@liaoxin01 liaoxin01 left a comment

LGTM

@github-actions github-actions Bot added the approved label Apr 29, 2026
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

@JNSimba JNSimba closed this Apr 29, 2026
@JNSimba JNSimba reopened this Apr 29, 2026
@JNSimba
Member Author

JNSimba commented Apr 30, 2026

run buildall

@JNSimba
Member Author

JNSimba commented Apr 30, 2026

run p0

@JNSimba
Member Author

JNSimba commented Apr 30, 2026

run cloud_p0

@JNSimba
Member Author

JNSimba commented Apr 30, 2026

run p0

@JNSimba JNSimba merged commit 235822d into apache:master Apr 30, 2026
49 of 50 checks passed
github-actions Bot pushed a commit that referenced this pull request Apr 30, 2026
…#62870)

### What problem does this PR solve?

Introduce two configurable timeouts (mirroring the BE
`brpc_light/heavy_work_pool` naming) and apply them to all 8 cdc_client
RPC call sites:

- `streaming_cdc_light_rpc_timeout_sec = 90` for `/api/close`,
`/api/compareOffset`, `/api/fetchEndOffset`, `/api/getTaskOffset`,
`/api/getFailReason` (server-side single-statement queries / cache
lookups, expected sub-second). Default is 90s rather than 30s to absorb
cdc_client cold-start: when the BE-spawned cdc_client process is not yet
running, `start_cdc_client` performs a health-check loop (worst case
~45s) before serving the request — 90s gives enough headroom to avoid
spurious timeouts during this window while still bounding
`JobManager.writeLock` hold time.
- `streaming_cdc_heavy_rpc_timeout_sec = 600` for `/api/initReader`,
`/api/fetchSplits`, `/api/writeRecords` (may legitimately take minutes
for replication slot creation, large snapshot split computation, or
batch writes).
Labels

approved, dev/4.1.x, reviewed

4 participants