
Fix ASAN shutdown leaks and gRPC use-after-return crash #13188

Merged
saintstack merged 1 commit into
apple:mainfrom
saintstack:asan
May 11, 2026

@saintstack
Contributor

@saintstack saintstack commented May 10, 2026

Two issues addressed:

  1. LSan reports 4648+ bytes leaked per Peer object at shutdown in fdb_c_api_tester and unit_tests. FlowTransport is allocated into a global slot and never deleted; deleting it triggers actor cancellation cascades that access freed state. Add suppressions for the known shutdown-time Peer/DDSketch/connectionKeeper leaks.

  2. GrpcServer::deregisterRoleServices() takes a const UID& then co_awaits stopServer(). The caller passes interf.id() which becomes a dangling reference after the coroutine suspends (the caller actor frame can be reclaimed). This causes a stack-use-after-return crash detected by ASAN, manifesting as a segfault in fdbcli during configure new. Fix by passing UID by value at both call sites.

20260511-165553-stack-034f0b9d67223a7a compressed=True data_size=36949840 duration=5027086 ended=100000 fail=1 fail_fast=10 max_runs=100000 pass=99999 priority=100 remaining=0:00:00 runtime=0:59:35 sanity=False started=100000 submitted=20260511-165553 timeout=5400 username=stack

Each nightly ASAN run reports about 11 of these leaks across its ctest runs:


=================================================================
==20156==ERROR: LeakSanitizer: detected memory leaks

Indirect leak of 1672 byte(s) in 1 object(s) allocated from:
    #0 0x000000533a2c in malloc /tmp/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cpp:68:3
    #1 0x7f0d4b5984a3 in operator new(unsigned long) stdlib_new_delete.cpp
    #2 0x7f0d49e5061c in resize /usr/local/bin/../include/c++/v1/vector:1805:11
    #3 0x7f0d49e5061c in setBucketSize /codebuild/output/src41032683/src/github.com/apple/foundationdb/fdbrpc/include/fdbrpc/DDSketch.h:213:48
    #4 0x7f0d49e5061c in DDSketch<double>::DDSketch(double) /codebuild/output/src41032683/src/github.com/apple/foundationdb/fdbrpc/include/fdbrpc/DDSketch.h:226:9
    #5 0x7f0d4aae6ab3 in Peer::Peer(TransportData*, NetworkAddress const&) /codebuild/output/src41032683/src/github.com/apple/foundationdb/fdbrpc/FlowTransport.cpp:1057:5
    #6 0x7f0d4aae9758 in makeReference<Peer, TransportData *, const NetworkAddress &> /codebuild/output/src41032683/src/github.com/apple/foundationdb/flow/include/flow/FastRef.h:195:26
    #7 0x7f0d4aae9758 in TransportData::getOrOpenPeer(NetworkAddress const&, bool) /codebuild/output/src41032683/src/github.com/apple/foundationdb/fdbrpc/FlowTransport.cpp:1750:10
    #8 0x7f0d4aaee1c9 in FlowTransport::addPeerReference(Endpoint const&, bool) /codebuild/output/src41032683/src/github.com/apple/foundationdb/fdbrpc/FlowTransport.cpp:1897:31
    #9 0x7f0d49fcd79c in FlowReceiver /codebuild/output/src41032683/src/github.com/apple/foundationdb/fdbrpc/include/fdbrpc/fdbrpc.h:48:30
    #10 0x7f0d49fcd79c in NetNotifiedQueue<GetLeaderRequest, true>::NetNotifiedQueue(int, int, Endpoint const&) /codebuild/output/src41032683/src/github.com/apple/foundationdb/fdbrpc/include/fdbrpc/fdbrpc.h:701:43
    #11 0x7f0d49ec43cb in RequestStream /codebuild/output/src41032683/src/github.com/apple/foundationdb/fdbrpc/include/fdbrpc/fdbrpc.h:900:63
    #12 0x7f0d49ec43cb in ClientLeaderRegInterface::ClientLeaderRegInterface(NetworkAddress) /codebuild/output/src41032683/src/github.com/apple/foundationdb/fdbclient/MonitorLeader.cpp:474:5
    #13 0x7f0d49ee079a in monitorProxiesOneGeneration(Reference<IClusterConnectionRecord>, Reference<AsyncVar<ClientDBInfo>>, Reference<AsyncVar<Optional<ClientLeaderRegInterface>>>, MonitorLeaderInfo, Reference<ReferencedObject<Standalone<VectorRef<ClientVersionRef, (VecSerStrategy)0>>>>, Standalone<StringRef>, IsInternal) /codebuild/output/src41032683/src/github.com/apple/foundationdb/fdbclient/MonitorLeader.cpp:862:33
    #14 0x7f0d49eea1c6 in monitorProxies(Reference<AsyncVar<Reference<IClusterConnectionRecord>>>, Reference<AsyncVar<ClientDBInfo>>, Reference<AsyncVar<Optional<ClientLeaderRegInterface>>>, Reference<ReferencedObject<Standalone<VectorRef<ClientVersionRef, (VecSerStrategy)0>>>>, Standalone<StringRef>, IsInternal) /codebuild/output/src41032683/src/github.com/apple/foundationdb/fdbclient/MonitorLeader.cpp:991:7
    #15 0x7f0d491d78bf in Database::createDatabase(Reference<IClusterConnectionRecord>, int, IsInternal, LocalityData const&, DatabaseContext*) /codebuild/output/src41032683/src/github.com/apple/foundationdb/fdbclient/NativeAPI.actor.cpp:577:35
    #16 0x7f0d4a9d6ca0 in operator() /codebuild/output/src41032683/src/github.com/apple/foundationdb/fdbclient/ThreadSafeTransaction.cpp:148:4
    #17 0x7f0d4a9d6ca0 in a_body1cont1 /codebuild/output/src41032683/src/github.com/apple/foundationdb/flow/include/flow/ThreadHelper.actor.h:45:2
    #18 0x7f0d4a9d6ca0 in internal_thread_helper::DoOnMainThreadVoidActorState<ThreadSafeDatabase::ThreadSafeDatabase(ThreadSafeDatabase::ConnectionRecordType, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, int)::$_0, internal_thread_helper::DoOnMainThreadVoidActor<ThreadSafeDatabase::ThreadSafeDatabase(ThreadSafeDatabase::ConnectionRecordType, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, int)::$_0>>::a_body1when1(Void const&, int) /codebuild/output/src41032683/src/github.com/apple/foundationdb/build_output/flow/include/flow/ThreadHelper.actor.g.h.py_gen:130:15
    #19 0x7f0d4a9d647d in a_callback_fire /codebuild/output/src41032683/src/github.com/apple/foundationdb/build_output/flow/include/flow/ThreadHelper.actor.g.h.py_gen:155:4
    #20 0x7f0d4a9d647d in ActorCallback<internal_thread_helper::DoOnMainThreadVoidActor<ThreadSafeDatabase::ThreadSafeDatabase(ThreadSafeDatabase::ConnectionRecordType, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, int)::$_0>, 0, Void>::fire(Void const&) /codebuild/output/src41032683/src/github.com/apple/foundationdb/flow/include/flow/flow.h:1538:34
    #21 0x7f0d4b021cef in send<Void> /codebuild/output/src41032683/src/github.com/apple/foundationdb/flow/include/flow/flow.h:771:23
    #22 0x7f0d4b021cef in send<Void> /codebuild/output/src41032683/src/github.com/apple/foundationdb/flow/include/flow/flow.h:1094:8
    #23 0x7f0d4b021cef in operator() /codebuild/output/src41032683/src/github.com/apple/foundationdb/flow/Net2.cpp:292:12
    #24 0x7f0d4b021cef in N2::Net2::run() /codebuild/output/src41032683/src/github.com/apple/foundationdb/flow/Net2.cpp:1705:5
    #25 0x7f0d491e117a in runNetwork() /codebuild/output/src41032683/src/github.com/apple/foundationdb/fdbclient/NativeAPI.actor.cpp:942:13
    #26 0x7f0d4a9c3148 in ThreadSafeApi::runNetwork() /codebuild/output/src41032683/src/github.com/apple/foundationdb/fdbclient/ThreadSafeTransaction.cpp:559:3
    #27 0x7f0d4907fb86 in MultiVersionApi::runNetwork() /codebuild/output/src41032683/src/github.com/apple/foundationdb/fdbclient/MultiVersionTransaction.cpp:2373:21
    #28 0x7f0d48f53a1b in fdb_run_network /codebuild/output/src41032683/src/github.com/apple/foundationdb/bindings/c/fdb_c.cpp:166:45
    #29 0x00000062fab3 in operator() /codebuild/output/src41032683/src/github.com/apple/foundationdb/bindings/c/test/unit/unit_tests.cpp:2629:45
    #30 0x00000062fab3 in __invoke<(lambda at /codebuild/output/src41032683/src/github.com/apple/foundationdb/bindings/c/test/unit/unit_tests.cpp:2629:30)> /usr/local/bin/../include/c++/v1/__type_traits/invoke.h:149:25
    #31 0x00000062fab3 in __thread_execute<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, (lambda at /codebuild/output/src41032683/src/github.com/apple/foundationdb/bindings/c/test/unit/unit_tests.cpp:2629:30)> /usr/local/bin/../include/c++/v1/__thread/thread.h:192:3
    #32 0x00000062fab3 in void* std::__1::__thread_proxy[abi:ne190105]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, main::$_0>>(void*) /usr/local/bin/../include/c++/v1/__thread/thread.h:201:3
    #33 0x000000465a03 in asan_thread_start(void*) /tmp/llvm-project/compiler-rt/lib/asan/asan_interceptors.cpp:239:43

Indirect leak of 1672 byte(s) in 1 object(s) allocated from:
    #0 0x000000533a2c in malloc /tmp/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cpp:68:3
    #1 0x7f0d4b5984a3 in operator new(unsigned long) stdlib_new_delete.cpp
    #2 0x7f0d49e5061c in resize /usr/local/bin/../include/c++/v1/vector:1805:11
    #3 0x7f0d49e5061c in setBucketSize /codebuild/output/src41032683/src/github.com/apple/foundationdb/fdbrpc/include/fdbrpc/DDSketch.h:213:48
    #4 0x7f0d49e5061c in DDSketch<double>::DDSketch(double) /codebuild/output/src41032683/src/github.com/apple/foundationdb/fdbrpc/include/fdbrpc/DDSketch.h:226:9
    #5 0x7f0d4aae6828 in Peer::Peer(TransportData*, NetworkAddress const&) /codebuild/output/src41032683/src/github.com/apple/foundationdb/fdbrpc/FlowTransport.cpp:1053:5
    #6 0x7f0d4aae9758 in makeReference<Peer, TransportData *, const NetworkAddress &> /codebuild/output/src41032683/src/github.com/apple/foundationdb/flow/include/flow/FastRef.h:195:26
    #7 0x7f0d4aae9758 in TransportData::getOrOpenPeer(NetworkAddress const&, bool) /codebuild/output/src41032683/src/github.com/apple/foundationdb/fdbrpc/FlowTransport.cpp:1750:10
    #8 0x7f0d4aaee1c9 in FlowTransport::addPeerReference(Endpoint const&, bool) /codebuild/output/src41032683/src/github.com/apple/foundationdb/fdbrpc/FlowTransport.cpp:1897:31
    #9 0x7f0d49fcd79c in FlowReceiver /codebuild/output/src41032683/src/github.com/apple/foundationdb/fdbrpc/include/fdbrpc/fdbrpc.h:48:30
    #10 0x7f0d49fcd79c in NetNotifiedQueue<GetLeaderRequest, true>::NetNotifiedQueue(int, int, Endpoint const&) /codebuild/output/src41032683/src/github.com/apple/foundationdb/fdbrpc/include/fdbrpc/fdbrpc.h:701:43
    #11 0x7f0d49ec43cb in RequestStream /codebuild/output/src41032683/src/github.com/apple/foundationdb/fdbrpc/include/fdbrpc/fdbrpc.h:900:63
    #12 0x7f0d49ec43cb in ClientLeaderRegInterface::ClientLeaderRegInterface(NetworkAddress) /codebuild/output/src41032683/src/github.com/apple/foundationdb/fdbclient/MonitorLeader.cpp:474:5
    #13 0x7f0d49ee079a in monitorProxiesOneGeneration(Reference<IClusterConnectionRecord>, Reference<AsyncVar<ClientDBInfo>>, Reference<AsyncVar<Optional<ClientLeaderRegInterface>>>, MonitorLeaderInfo, Reference<ReferencedObject<Standalone<VectorRef<ClientVersionRef, (VecSerStrategy)0>>>>, Standalone<StringRef>, IsInternal) /codebuild/output/src41032683/src/github.com/apple/foundationdb/fdbclient/MonitorLeader.cpp:862:33
    #14 0x7f0d49eea1c6 in monitorProxies(Reference<AsyncVar<Reference<IClusterConnectionRecord>>>, Reference<AsyncVar<ClientDBInfo>>, Reference<AsyncVar<Optional<ClientLeaderRegInterface>>>, Reference<ReferencedObject<Standalone<VectorRef<ClientVersionRef, (VecSerStrategy)0>>>>, Standalone<StringRef>, IsInternal) /codebuild/output/src41032683/src/github.com/apple/foundationdb/fdbclient/MonitorLeader.cpp:991:7
    #15 0x7f0d491d78bf in Database::createDatabase(Reference<IClusterConnectionRecord>, int, IsInternal, LocalityData const&, DatabaseContext*) /codebuild/output/src41032683/src/github.com/apple/foundationdb/fdbclient/NativeAPI.actor.cpp:577:35
    #16 0x7f0d4a9d6ca0 in operator() /codebuild/output/src41032683/src/github.com/apple/foundationdb/fdbclient/ThreadSafeTransaction.cpp:148:4
    #17 0x7f0d4a9d6ca0 in a_body1cont1 /codebuild/output/src41032683/src/github.com/apple/foundationdb/flow/include/flow/ThreadHelper.actor.h:45:2
    #18 0x7f0d4a9d6ca0 in internal_thread_helper::DoOnMainThreadVoidActorState<ThreadSafeDatabase::ThreadSafeDatabase(ThreadSafeDatabase::ConnectionRecordType, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, int)::$_0, 


... etc.

@saintstack saintstack added the nightlies Issues to address failures in the nightly runs. label May 10, 2026
@saintstack saintstack requested a review from Copilot May 10, 2026 04:34
Contributor

Copilot AI left a comment


Pull request overview

This PR addresses shutdown-time ASAN/LSan issues and a gRPC coroutine lifetime bug by eliminating a dangling-reference hazard when deregistering gRPC role services, and by adding LeakSanitizer suppressions for known shutdown leaks.

Changes:

  • Change GrpcServer::deregisterRoleServices() and worker-side deregisterGrpcService() to take UID by value to avoid referencing a temporary across co_await.
  • Update the gRPC server declaration/definition to match the new by-value signature.
  • Add LSan suppressions for shutdown-time leaks related to transport peers / connection keeper/monitor and DDSketch allocations.
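
For readers unfamiliar with the format, a LeakSanitizer suppressions file is a list of `leak:<pattern>` lines, where each pattern is matched against the function, source file, or module names in a leak's allocation stack. The entries below are a hypothetical sketch derived from the names mentioned in this PR (Peer, DDSketch, connectionKeeper); they are assumptions, not the actual contents of contrib/lsan.suppressions.

```
# Hypothetical entries for illustration; the real file may use different patterns.
leak:Peer::Peer
leak:DDSketch
leak:connectionKeeper
```

Such a file is typically passed to an instrumented binary through the environment, for example `LSAN_OPTIONS=suppressions=contrib/lsan.suppressions`.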

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| fdbserver/worker/worker.actor.cpp | Passes gRPC service owner IDs by value to avoid coroutine lifetime issues. |
| fdbrpc/include/fdbrpc/FlowGrpc.h | Updates gRPC deregistration API signature (and retains sync variant by ref). |
| fdbrpc/FlowGrpc.cpp | Updates implementation of deregistration to match new by-value signature. |
| contrib/lsan.suppressions | Adds suppressions for known shutdown-time leaks. |


Comment thread fdbrpc/include/fdbrpc/FlowGrpc.h Outdated
Comment thread contrib/lsan.suppressions Outdated
@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: c1df042
  • Duration 0:29:08
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: c1df042
  • Duration 0:44:39
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: c1df042
  • Duration 0:54:24
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: c1df042
  • Duration 1:04:06
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: c1df042
  • Duration 1:06:17
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: c1df042
  • Duration 1:12:46
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: c1df042
  • Duration 1:15:41
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: 6258ce0
  • Duration 0:25:36
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 6258ce0
  • Duration 0:47:37
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: 6258ce0
  • Duration 0:55:30
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 6258ce0
  • Duration 1:01:49
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 6258ce0
  • Duration 1:03:37
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 6258ce0
  • Duration 1:11:45
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@saintstack
Contributor Author

I ran 100k again after the feedback changes: 20260511-181411-stack-2938cba1fd0a9b3d compressed=True data_size=36950053 duration=5675640 ended=100000 fail=2 fail_fast=10 max_runs=100000 pass=99998 priority=100 remaining=0 runtime=1:07:38 sanity=False started=100000 stopped=20260511-192149 submitted=20260511-181411 timeout=5400 username=stack.

The two failures are:

  • RandomSeed="1470358130" SourceVersion="6258ce078de67405475fc0b15fd299ac221596cf" Time="1778524017" BuggifyEnabled="1" DeterminismCheck="0" FaultInjectionEnabled="1" TestFile="tests/slow/DifferentClustersSameRV.toml"
  • RandomSeed="3100002028" SourceVersion="6258ce078de67405475fc0b15fd299ac221596cf" Time="1778525283" BuggifyEnabled="1" DeterminismCheck="0" FaultInjectionEnabled="1" TestFile="tests/fast/MoveKeysCycle.toml", which is #13176, I believe.

I compiled with ASAN and see ASAN makecontext/swapcontext warnings, which show up as Severity=40, so some of the ctests fail. No stack-use-after-return anymore, though. Let's try this PR.

@saintstack saintstack merged commit 73442bb into apple:main May 11, 2026
6 of 7 checks passed
@foundationdb-ci
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: 6258ce0
  • Duration 2:32:18
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

saintstack pushed a commit to saintstack/foundationdb that referenced this pull request May 26, 2026
…elling actors

Building on apple#13188 (Peer LSAN suppressions + gRPC use-after-return fix) and
apple#13255 (client-side shutdown leak suppressions), this commit addresses the root
causes rather than suppressing symptoms.

Problem: CI ASAN nightly builds reported 128K+ bytes leaked in 2000+
allocations across 27 tests. The leaks came from two sources:

1. TaskQueue::clear() in Net2::stopImmediately() swapped out the timer and
   ready queues but never deleted the PromiseTask* pointers inside them.
   Each leaked PromiseTask held a Promise<Void> whose destruction would have
   freed the waiting actors' coroutine frames.

2. DatabaseContext::~DatabaseContext() cancelled some background actors but
   missed four others: logger (databaseLogger + tssLogger),
   clientStatusUpdater.actor, throttleExpirer, and statusLeaderMon.

Fix:

- TaskQueue::clear() now swaps queues into locals then iterates and deletes
  all Task* pointers. Deleting a PromiseTask fires broken_promise to waiting
  futures, which cancels the associated actors and frees their coroutine
  frames. The swap-first approach prevents infinite loops from actors that
  catch broken_promise and retry with a new delay().

- DatabaseContext destructor now cancels all four leaked background actors,
  following the same pattern already used for clientDBInfoMonitor et al.

- LSAN suppressions trimmed from 30+ entries to 8. The remaining suppressions
  cover genuinely unfixable cases: Peer objects (no safe destructor path),
  fdbcli transaction references not released before stopNetwork(), external
  client DatabaseContext cleanup deferred via onMainThreadVoid after network
  stop, and monitorProtocolVersion cross-thread dispatch.

Result: Zero LSAN reports, 91% ctest pass rate (remaining 4 failures are
ASAN slowness timeouts and the pre-existing makecontext/Severity=40 issue).
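
The swap-first clear described in the commit message above can be sketched as follows. This is a simplified illustration under stated assumptions, not the Net2 implementation: `Task`, `TaskQueue`, `onDestroy`, and `demo` are hypothetical, and the `onDestroy` callback only models an actor that reacts to its task being destroyed by scheduling a new one.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for a queued task; the real Task/PromiseTask types differ.
struct Task {
    // Models side effects of destruction, such as a broken_promise handler
    // that schedules a retry.
    std::function<void()> onDestroy;
    ~Task() {
        if (onDestroy) onDestroy();
    }
};

struct TaskQueue {
    std::vector<Task*> ready;

    void push(Task* t) { ready.push_back(t); }

    // Swap-first clear: detach the current contents, then delete them.
    // Tasks enqueued while deleting land in the (now empty) member queue and
    // are not visited by this call, so the loop is bounded.
    std::size_t clear() {
        std::vector<Task*> local;
        local.swap(ready);
        for (Task* t : local) delete t;
        return local.size();
    }
};

struct ClearResult {
    std::size_t deleted;
    std::size_t requeued;
};

ClearResult demo() {
    TaskQueue q;
    // Each task re-enqueues a fresh task when destroyed, like an actor that
    // catches broken_promise and retries with a new delay().
    for (int i = 0; i < 3; ++i) {
        Task* t = new Task;
        t->onDestroy = [&q] { q.push(new Task); };
        q.push(t);
    }
    std::size_t deleted = q.clear();       // deletes only the original three
    std::size_t requeued = q.ready.size(); // the retries scheduled during clear()
    q.clear();                             // retries have no onDestroy, so this drains the queue
    assert(q.ready.empty());
    return { deleted, requeued };
}
```

In this sketch `demo()` yields `deleted == 3` and `requeued == 3`. Iterating the member queue directly while tasks re-enqueue themselves would instead mutate the container during iteration and might never terminate.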
saintstack added a commit that referenced this pull request May 26, 2026
…elling actors (#13278)

saintstack pushed a commit to saintstack/foundationdb that referenced this pull request May 28, 2026
The existing leak:ThreadSafeDatabase::createTransaction (added in apple#13278)
only catches the direct ThreadSafeTransaction allocation (2x2048 bytes).
Indirect leaks from transaction operations (TransactionState,
extractReadVersion, WriteMap entries) are attributed to their own allocation
sites, which have ThreadSafeTransaction methods in the call stack but don't
match the createTransaction suppression.

This causes the ASAN nightly to report 10,296 bytes in 48 allocations across
multiple fdb_c_api_test_* tests even with the fix from apple#13278 applied.

Related:
- apple#13188 — Initial LSAN suppressions + gRPC use-after-return fix
- apple#13242 — Remove explicit __lsan_do_leak_check() (symbolizer deadlock)
- apple#13255 — Client-side shutdown leak suppressions (30+ entries)
- apple#13278 — Fix root causes: TaskQueue::clear() + DatabaseContext destructor
- apple#13288 — Increase upgrade test shutdown timeout for ASAN
saintstack added a commit that referenced this pull request May 28, 2026