Fix ASAN shutdown leaks and gRPC use-after-return crash#13188
Conversation
There was a problem hiding this comment.
Pull request overview
This PR addresses shutdown-time ASAN/LSan issues and a gRPC coroutine lifetime bug by eliminating a dangling-reference hazard when deregistering gRPC role services, and by adding LeakSanitizer suppressions for known shutdown leaks.
Changes:
- Change
GrpcServer::deregisterRoleServices()and worker-sidederegisterGrpcService()to takeUIDby value to avoid referencing a temporary acrossco_await. - Update the gRPC server declaration/definition to match the new by-value signature.
- Add LSan suppressions for shutdown-time leaks related to transport peers / connection keeper/monitor and DDSketch allocations.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| fdbserver/worker/worker.actor.cpp | Passes gRPC service owner IDs by value to avoid coroutine lifetime issues. |
| fdbrpc/include/fdbrpc/FlowGrpc.h | Updates gRPC deregistration API signature (and retains sync variant by ref). |
| fdbrpc/FlowGrpc.cpp | Updates implementation of deregistration to match new by-value signature. |
| contrib/lsan.suppressions | Adds suppressions for known shutdown-time leaks. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Two issues addressed: 1. LSan reports 4648+ bytes leaked per Peer object at shutdown in fdb_c_api_tester and unit_tests. FlowTransport is allocated into a global slot and never deleted; deleting it triggers actor cancellation cascades that access freed state. Add suppressions for the known shutdown-time Peer/DDSketch/connectionKeeper leaks. 2. GrpcServer::deregisterRoleServices() takes a const UID& then co_awaits stopServer(). The caller passes interf.id() which becomes a dangling reference after the coroutine suspends (the caller actor frame can be reclaimed). This causes a stack-use-after-return crash detected by ASAN, manifesting as a segfault in fdbcli during configure new. Fix by passing UID by value at both call sites.
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
|
I ran 100k again after feedback changes... I compiled with ASAN and see ASAN makecontext/swapcontext warnings which show as Severity=40 so some of the ctests fail. No stack-use-after-return anymore though. Lets try this PR. |
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
…elling actors Building on apple#13188 (Peer LSAN suppressions + gRPC use-after-return fix) and apple#13255 (client-side shutdown leak suppressions), this commit addresses the root causes rather than suppressing symptoms. Problem: CI ASAN nightly builds reported 128K+ bytes leaked in 2000+ allocations across 27 tests. The leaks came from two sources: 1. TaskQueue::clear() in Net2::stopImmediately() swapped out the timer and ready queues but never deleted the PromiseTask* pointers inside them. Each leaked PromiseTask held a Promise<Void> whose destruction would have freed waiting actors coroutine frames. 2. DatabaseContext::~DatabaseContext() cancelled some background actors but missed four others: logger (databaseLogger + tssLogger), clientStatusUpdater.actor, throttleExpirer, and statusLeaderMon. Fix: - TaskQueue::clear() now swaps queues into locals then iterates and deletes all Task* pointers. Deleting a PromiseTask fires broken_promise to waiting futures, which cancels the associated actors and frees their coroutine frames. The swap-first approach prevents infinite loops from actors that catch broken_promise and retry with a new delay(). - DatabaseContext destructor now cancels all four leaked background actors, following the same pattern already used for clientDBInfoMonitor et al. - LSAN suppressions trimmed from 30+ entries to 8. The remaining suppressions cover genuinely unfixable cases: Peer objects (no safe destructor path), fdbcli transaction references not released before stopNetwork(), external client DatabaseContext cleanup deferred via onMainThreadVoid after network stop, and monitorProtocolVersion cross-thread dispatch. Result: Zero LSAN reports, 91% ctest pass rate (remaining 4 failures are ASAN slowness timeouts and the pre-existing makecontext/Severity=40 issue).
…elling actors Building on apple#13188 (Peer LSAN suppressions + gRPC use-after-return fix) and apple#13255 (client-side shutdown leak suppressions), this commit addresses the root causes rather than suppressing symptoms. Problem: CI ASAN nightly builds reported 128K+ bytes leaked in 2000+ allocations across 27 tests. The leaks came from two sources: 1. TaskQueue::clear() in Net2::stopImmediately() swapped out the timer and ready queues but never deleted the PromiseTask* pointers inside them. Each leaked PromiseTask held a Promise<Void> whose destruction would have freed waiting actors coroutine frames. 2. DatabaseContext::~DatabaseContext() cancelled some background actors but missed four others: logger (databaseLogger + tssLogger), clientStatusUpdater.actor, throttleExpirer, and statusLeaderMon. Fix: - TaskQueue::clear() now swaps queues into locals then iterates and deletes all Task* pointers. Deleting a PromiseTask fires broken_promise to waiting futures, which cancels the associated actors and frees their coroutine frames. The swap-first approach prevents infinite loops from actors that catch broken_promise and retry with a new delay(). - DatabaseContext destructor now cancels all four leaked background actors, following the same pattern already used for clientDBInfoMonitor et al. - LSAN suppressions trimmed from 30+ entries to 8. The remaining suppressions cover genuinely unfixable cases: Peer objects (no safe destructor path), fdbcli transaction references not released before stopNetwork(), external client DatabaseContext cleanup deferred via onMainThreadVoid after network stop, and monitorProtocolVersion cross-thread dispatch. Result: Zero LSAN reports, 91% ctest pass rate (remaining 4 failures are ASAN slowness timeouts and the pre-existing makecontext/Severity=40 issue).
…elling actors (#13278) Building on #13188 (Peer LSAN suppressions + gRPC use-after-return fix) and #13255 (client-side shutdown leak suppressions), this commit addresses the root causes rather than suppressing symptoms. Problem: CI ASAN nightly builds reported 128K+ bytes leaked in 2000+ allocations across 27 tests. The leaks came from two sources: 1. TaskQueue::clear() in Net2::stopImmediately() swapped out the timer and ready queues but never deleted the PromiseTask* pointers inside them. Each leaked PromiseTask held a Promise<Void> whose destruction would have freed waiting actors coroutine frames. 2. DatabaseContext::~DatabaseContext() cancelled some background actors but missed four others: logger (databaseLogger + tssLogger), clientStatusUpdater.actor, throttleExpirer, and statusLeaderMon. Fix: - TaskQueue::clear() now swaps queues into locals then iterates and deletes all Task* pointers. Deleting a PromiseTask fires broken_promise to waiting futures, which cancels the associated actors and frees their coroutine frames. The swap-first approach prevents infinite loops from actors that catch broken_promise and retry with a new delay(). - DatabaseContext destructor now cancels all four leaked background actors, following the same pattern already used for clientDBInfoMonitor et al. - LSAN suppressions trimmed from 30+ entries to 8. The remaining suppressions cover genuinely unfixable cases: Peer objects (no safe destructor path), fdbcli transaction references not released before stopNetwork(), external client DatabaseContext cleanup deferred via onMainThreadVoid after network stop, and monitorProtocolVersion cross-thread dispatch. Result: Zero LSAN reports, 91% ctest pass rate (remaining 4 failures are ASAN slowness timeouts and the pre-existing makecontext/Severity=40 issue).
The existing leak:ThreadSafeDatabase::createTransaction (added in apple#13278) only catches the direct ThreadSafeTransaction allocation (2x2048 bytes). Indirect leaks from transaction operations (TransactionState, extractReadVersion, WriteMap entries) are attributed to their own allocation sites, which have ThreadSafeTransaction methods in the call stack but don't match the createTransaction suppression. This causes the ASAN nightly to report 10,296 bytes in 48 allocations across multiple fdb_c_api_test_* tests even with the fix from apple#13278 applied. Related: - apple#13188 — Initial LSAN suppressions + gRPC use-after-return fix - apple#13242 — Remove explicit __lsan_do_leak_check() (symbolizer deadlock) - apple#13255 — Client-side shutdown leak suppressions (30+ entries) - apple#13278 — Fix root causes: TaskQueue::clear() + DatabaseContext destructor - apple#13288 — Increase upgrade test shutdown timeout for ASAN
The existing leak:ThreadSafeDatabase::createTransaction (added in #13278) only catches the direct ThreadSafeTransaction allocation (2x2048 bytes). Indirect leaks from transaction operations (TransactionState, extractReadVersion, WriteMap entries) are attributed to their own allocation sites, which have ThreadSafeTransaction methods in the call stack but don't match the createTransaction suppression. This causes the ASAN nightly to report 10,296 bytes in 48 allocations across multiple fdb_c_api_test_* tests even with the fix from #13278 applied. Related: - #13188 — Initial LSAN suppressions + gRPC use-after-return fix - #13242 — Remove explicit __lsan_do_leak_check() (symbolizer deadlock) - #13255 — Client-side shutdown leak suppressions (30+ entries) - #13278 — Fix root causes: TaskQueue::clear() + DatabaseContext destructor - #13288 — Increase upgrade test shutdown timeout for ASAN
Two issues addressed:
LSan reports 4648+ bytes leaked per Peer object at shutdown in fdb_c_api_tester and unit_tests. FlowTransport is allocated into a global slot and never deleted; deleting it triggers actor cancellation cascades that access freed state. Add suppressions for the known shutdown-time Peer/DDSketch/connectionKeeper leaks.
GrpcServer::deregisterRoleServices() takes a const UID& then co_awaits stopServer(). The caller passes interf.id() which becomes a dangling reference after the coroutine suspends (the caller actor frame can be reclaimed). This causes a stack-use-after-return crash detected by ASAN, manifesting as a segfault in fdbcli during configure new. Fix by passing UID by value at both call sites.
20260511-165553-stack-034f0b9d67223a7a compressed=True data_size=36949840 duration=5027086 ended=100000 fail=1 fail_fast=10 max_runs=100000 pass=99999 priority=100 remaining=0:00:00 runtime=0:59:35 sanity=False started=100000 submitted=20260511-165553 timeout=5400 username=stackEach nightly ASAN run has about 11 of these in ctest runs....