Fix peer disconnect detection in waitValueOrSignal (#12935) #13017
saintstack merged 2 commits into apple:release-7.4
Pull request overview
This PR forward-ports a fix to prevent waitValueOrSignal() from hanging indefinitely on dead/replaced connections by also waiting on the associated Peer’s disconnect signal, allowing callers (e.g., load-balance retry loops) to react immediately.
Changes:
- Add a when() clause in waitValueOrSignal() to watch peer->disconnect and return request_maybe_delivered() on disconnect.
- Add new unit tests covering peer disconnect detection, no-peer behavior, and a simple retry pattern.
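The wait-on-either idea can be sketched in standard C++. This is only an analogy, not the Flow actor code from the PR: `PendingReply`, `waitValueOrDisconnect`, `sendWithRetry`, and the `RequestMaybeDelivered` exception type are hypothetical names invented for illustration, and a condition variable stands in for Flow's `when()` clauses.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <stdexcept>

// Hypothetical stand-in for the request_maybe_delivered() error.
struct RequestMaybeDelivered : std::runtime_error {
    RequestMaybeDelivered() : std::runtime_error("request_maybe_delivered") {}
};

// Hypothetical shared state for one in-flight request.
struct PendingReply {
    std::mutex m;
    std::condition_variable cv;
    bool hasValue = false;         // set when the reply arrives
    int value = 0;
    bool peerDisconnected = false; // set when the connection is dropped

    void deliver(int v) {
        {
            std::lock_guard<std::mutex> g(m);
            value = v;
            hasValue = true;
        }
        cv.notify_all();
    }
    void disconnect() {
        {
            std::lock_guard<std::mutex> g(m);
            peerDisconnected = true;
        }
        cv.notify_all();
    }
};

// Block until either the reply arrives or the peer disconnects.
// Without the peerDisconnected term, a dead connection would block forever.
int waitValueOrDisconnect(PendingReply& r) {
    std::unique_lock<std::mutex> lk(r.m);
    r.cv.wait(lk, [&] { return r.hasValue || r.peerDisconnected; });
    if (r.hasValue)
        return r.value;
    throw RequestMaybeDelivered();
}

// Simple retry loop: re-issue the request when the peer went away.
// `send` is any callable taking the attempt number and returning the reply.
template <class Send>
int sendWithRetry(Send send, int maxAttempts) {
    assert(maxAttempts > 0);
    for (int attempt = 1;; ++attempt) {
        try {
            return send(attempt);
        } catch (const RequestMaybeDelivered&) {
            if (attempt == maxAttempts)
                throw; // give up and surface the error to the caller
        }
    }
}
```

As the error name suggests, the request may already have reached the peer before the disconnect, so a retry loop like this is only appropriate when re-sending the request is safe.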
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| fdbrpc/include/fdbrpc/genericactors.actor.h | Detect peer disconnects inside waitValueOrSignal() and fail fast with request_maybe_delivered(). |
| fdbrpc/FlowTests.actor.cpp | Add unit tests validating the new peer-disconnect behavior and preserving prior semantics without a peer. |
    }
    when(wait(peer.isValid() ? peer->disconnect.getFuture() : Never())) {
        CODE_PROBE(true, "waitValueOrSignal detected peer disconnect");
        TraceEvent("WaitValueOrSignalPeerDisconnect")
WaitValueOrSignalPeerDisconnect TraceEvent is emitted on every peer disconnect without any rate limiting. In unstable networks this could produce high-volume logs across all clients/servers waiting on replies. Consider adding .suppressFor(...) (and/or lowering severity / sampling) to avoid log spam while keeping diagnostic value.
Suggested change:
    - TraceEvent("WaitValueOrSignalPeerDisconnect")
    + TraceEvent("WaitValueOrSignalPeerDisconnect")
    +     .suppressFor(1.0)
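To illustrate the kind of rate limiting being suggested, here is a rough standalone sketch of suppressing repeated events of the same type within a time window. It is an analogy only and makes no claim about how FoundationDB implements `suppressFor`; the `SuppressedLogger` class and its `shouldEmit` method are hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <string>
#include <unordered_map>

// Hypothetical helper: decides whether an event of a given type should be
// logged, allowing at most one emission per type per interval.
class SuppressedLogger {
    using Clock = std::chrono::steady_clock;
    std::unordered_map<std::string, Clock::time_point> lastEmitted;

public:
    // Returns true if the event should be logged now, false if suppressed.
    // `now` is a parameter so the behavior can be checked deterministically.
    bool shouldEmit(const std::string& type,
                    std::chrono::duration<double> interval,
                    Clock::time_point now = Clock::now()) {
        assert(interval.count() >= 0);
        auto it = lastEmitted.find(type);
        if (it != lastEmitted.end() && now - it->second < interval)
            return false; // same event type seen too recently
        lastEmitted[type] = now;
        return true;
    }
};
```

With a 1.0 second window, a burst of disconnects on a flaky network would produce at most one log line per second per event type in this sketch, while the first event in each window is still recorded.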
    #include "flow/IThreadPool.h"
    #include "flow/WriteOnlySet.h"
    #include "fdbrpc/fdbrpc.h"
    #include "fdbrpc/FlowTransport.h"
fdbrpc/fdbrpc.h already includes fdbrpc/FlowTransport.h, so this additional include is redundant here. Removing it would keep includes minimal and reduce unnecessary compile dependencies.
Suggested change:
    - #include "fdbrpc/FlowTransport.h"
@saintstack wdyt about #13017 (comment)?
I like it. Addressed in this next push. Thanks.
0b7d4a2 to da80499
Forward-port from 7.3 to 7.4.
Add a when() clause watching peer->disconnect in waitValueOrSignal (genericactors.actor.h) so dead connections (e.g., from NAT timeouts) are detected immediately instead of hanging indefinitely waiting on a connection the lower layer has already replaced.
We saw this in an incident where we were waiting on a long reply on a network with frequent disconnects: the low-level FDB transport would make a new connection, but the high-level caller kept waiting until it timed out on the original.
Includes unit tests for the peer disconnect detection.
20260418-153449-stack_forward_port_waitValu-96d073ac1c50ff92 compressed=True data_size=41481497 duration=4484727 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=0:51:47 sanity=False started=100000 stopped=20260418-162636 submitted=20260418-153449 timeout=5400 username=stack_forward_port_waitValueOrSignal