Skip to content

Fix peer disconnect detection in waitValueOrSignal (#12935)#13017

Merged
saintstack merged 2 commits intoapple:release-7.4from
saintstack:cherry-pick-waitValueOrSignal-7.4
Apr 27, 2026
Merged

Fix peer disconnect detection in waitValueOrSignal (#12935)#13017
saintstack merged 2 commits intoapple:release-7.4from
saintstack:cherry-pick-waitValueOrSignal-7.4

Conversation

@saintstack
Copy link
Copy Markdown
Contributor

@saintstack saintstack commented Apr 18, 2026

Forward-port from 7.3 to 7.4.

Add a when() clause watching peer->disconnect in waitValueOrSignal (genericactors.actor.h) so dead connections (e.g., from NAT timeouts) are detected immediately instead of hanging indefinitely waiting on a connection the lower layer has already replaced.

We saw this in an incident where waiting on a long reply on a network with frequent disconnects; low level fdb would make a new connection but high-level would wait until we timed out on the original.

Includes unit tests for the peer disconnect detection.

20260418-153449-stack_forward_port_waitValu-96d073ac1c50ff92 compressed=True data_size=41481497 duration=4484727 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=0:51:47 sanity=False started=100000 stopped=20260418-162636 submitted=20260418-153449 timeout=5400 username=stack_forward_port_waitValueOrSignal

@saintstack saintstack requested a review from spraza as a code owner April 18, 2026 00:06
@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 0b7d4a2
  • Duration 0:49:01
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 0b7d4a2
  • Duration 0:55:07
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 0b7d4a2
  • Duration 1:07:05
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 0b7d4a2
  • Duration 2:26:01
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@saintstack saintstack requested a review from gxglass April 18, 2026 16:31
gxglass
gxglass previously approved these changes Apr 18, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR forward-ports a fix to prevent waitValueOrSignal() from hanging indefinitely on dead/replaced connections by also waiting on the associated Peer’s disconnect signal, allowing callers (e.g., load-balance retry loops) to react immediately.

Changes:

  • Add a when() clause in waitValueOrSignal() to watch peer->disconnect and return request_maybe_delivered() on disconnect.
  • Add new unit tests covering peer disconnect detection, no-peer behavior, and a simple retry pattern.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
fdbrpc/include/fdbrpc/genericactors.actor.h Detect peer disconnects inside waitValueOrSignal() and fail fast with request_maybe_delivered().
fdbrpc/FlowTests.actor.cpp Add unit tests validating the new peer-disconnect behavior and preserving prior semantics without a peer.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

}
when(wait(peer.isValid() ? peer->disconnect.getFuture() : Never())) {
CODE_PROBE(true, "waitValueOrSignal detected peer disconnect");
TraceEvent("WaitValueOrSignalPeerDisconnect")
Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WaitValueOrSignalPeerDisconnect TraceEvent is emitted on every peer disconnect without any rate limiting. In unstable networks this could produce high-volume logs across all clients/servers waiting on replies. Consider adding .suppressFor(...) (and/or lowering severity / sampling) to avoid log spam while keeping diagnostic value.

Suggested change
TraceEvent("WaitValueOrSignalPeerDisconnect")
TraceEvent("WaitValueOrSignalPeerDisconnect")
.suppressFor(1.0)

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@saintstack like this suggestion?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Addressed.

Comment thread fdbrpc/FlowTests.actor.cpp Outdated
#include "flow/IThreadPool.h"
#include "flow/WriteOnlySet.h"
#include "fdbrpc/fdbrpc.h"
#include "fdbrpc/FlowTransport.h"
Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fdbrpc/fdbrpc.h already includes fdbrpc/FlowTransport.h, so this additional include is redundant here. Removing it would keep includes minimal and reduce unnecessary compile dependencies.

Suggested change
#include "fdbrpc/FlowTransport.h"

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@spraza
Copy link
Copy Markdown
Collaborator

spraza commented Apr 23, 2026

@saintstack wdyt about #13017 (comment)?

Forward-port from 7.3 to 7.4.

Add a when() clause watching peer->disconnect in waitValueOrSignal
(genericactors.actor.h) so dead connections (e.g., from NAT timeouts)
are detected immediately instead of hanging indefinitely waiting on a
connection the lower layer has already replaced.

We saw this in an incident where waiting on a long reply on a network
with frequent disconnects; low level fdb would make a new connection
but high-level would wait until we timed out on the original.

Includes unit tests for the peer disconnect detection.
@saintstack
Copy link
Copy Markdown
Contributor Author

@spraza

@saintstack wdyt about #13017 (comment)?

I like it. Addressed in this next push. Thanks.

@saintstack saintstack force-pushed the cherry-pick-waitValueOrSignal-7.4 branch from 0b7d4a2 to da80499 Compare April 24, 2026 17:43
@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: da80499
  • Duration 0:46:50
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: da80499
  • Duration 1:00:50
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: da80499
  • Duration 1:04:29
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@saintstack
Copy link
Copy Markdown
Contributor Author

20260424-181329-stack_all_waitforsignal_74-c0bae45e23d10c40 compressed=True data_size=41486571 duration=5177307 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=0:56:27 sanity=False started=100000 stopped=20260424-190956 submitted=20260424-181329 timeout=5400 username=stack_all_waitforsignal_74

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: da80499
  • Duration 2:32:54
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@saintstack saintstack requested a review from gxglass April 24, 2026 20:46
@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: da80499
  • Duration 0:40:38
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: da80499
  • Duration 0:44:15
  • Result: ❌ FAILED
  • Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /usr/local/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 254
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: da80499
  • Duration 0:54:33
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@saintstack saintstack merged commit cec7693 into apple:release-7.4 Apr 27, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants