Fix client-side LSAN shutdown leaks by freeing pending tasks and cancelling actors#13278
Conversation
There was a problem hiding this comment.
Pull request overview
This PR addresses client shutdown memory leaks reported by LSAN/ASAN by explicitly freeing pending Net2 tasks during stop and by cancelling additional long-lived client background actors that were previously left running through DatabaseContext destruction.
Changes:
- Fix
TaskQueue::clear()to delete queuedTask*objects (timers, ready queue, and cross-thread queue) after swapping queues out. - Cancel additional
DatabaseContextbackground actors (logger,clientStatusUpdater.actor,throttleExpirer,statusLeaderMon) during destruction to prevent leaked actor frames/state. - Trim
contrib/lsan.suppressionsto keep only remaining known/unavoidable shutdown leak sources.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| flow/include/flow/TaskQueue.h | Swap-and-drain task queues during shutdown and delete pending tasks to avoid leaking PromiseTask/actor frames. |
| fdbclient/DatabaseContext.cpp | Cancel previously-missed background actors in DatabaseContext destructor to prevent shutdown-time leaks. |
| contrib/lsan.suppressions | Remove now-unnecessary suppressions and keep a smaller set for remaining intentional/unfixable shutdown leaks. |
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
…elling actors Building on apple#13188 (Peer LSAN suppressions + gRPC use-after-return fix) and apple#13255 (client-side shutdown leak suppressions), this commit addresses the root causes rather than suppressing symptoms. Problem: CI ASAN nightly builds reported 128K+ bytes leaked in 2000+ allocations across 27 tests. The leaks came from two sources: 1. TaskQueue::clear() in Net2::stopImmediately() swapped out the timer and ready queues but never deleted the PromiseTask* pointers inside them. Each leaked PromiseTask held a Promise<Void> whose destruction would have freed waiting actors coroutine frames. 2. DatabaseContext::~DatabaseContext() cancelled some background actors but missed four others: logger (databaseLogger + tssLogger), clientStatusUpdater.actor, throttleExpirer, and statusLeaderMon. Fix: - TaskQueue::clear() now swaps queues into locals then iterates and deletes all Task* pointers. Deleting a PromiseTask fires broken_promise to waiting futures, which cancels the associated actors and frees their coroutine frames. The swap-first approach prevents infinite loops from actors that catch broken_promise and retry with a new delay(). - DatabaseContext destructor now cancels all four leaked background actors, following the same pattern already used for clientDBInfoMonitor et al. - LSAN suppressions trimmed from 30+ entries to 8. The remaining suppressions cover genuinely unfixable cases: Peer objects (no safe destructor path), fdbcli transaction references not released before stopNetwork(), external client DatabaseContext cleanup deferred via onMainThreadVoid after network stop, and monitorProtocolVersion cross-thread dispatch. Result: Zero LSAN reports, 91% ctest pass rate (remaining 4 failures are ASAN slowness timeouts and the pre-existing makecontext/Severity=40 issue).
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
|
Retry to see if 'Error: Error while executing command: ctest -j ${NPROC} --no-compress-output -T test --output-on-failure. Reason: exit status 8' is related ... |
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
|
RandomSeed="4124547677" SourceVersion="e4cf9ad08c87d71db0ba0f4d2592514a053ed50b" Time="1779819717" BuggifyEnabled="1" DeterminismCheck="0" FaultInjectionEnabled="1" TestFile="tests/noSim/KeyValueStoreRocksDBTest.toml" |
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
|
Build timed out on last one. |
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
The existing leak:ThreadSafeDatabase::createTransaction (added in apple#13278) only catches the direct ThreadSafeTransaction allocation (2x2048 bytes). Indirect leaks from transaction operations (TransactionState, extractReadVersion, WriteMap entries) are attributed to their own allocation sites, which have ThreadSafeTransaction methods in the call stack but don't match the createTransaction suppression. This causes the ASAN nightly to report 10,296 bytes in 48 allocations across multiple fdb_c_api_test_* tests even with the fix from apple#13278 applied. Related: - apple#13188 — Initial LSAN suppressions + gRPC use-after-return fix - apple#13242 — Remove explicit __lsan_do_leak_check() (symbolizer deadlock) - apple#13255 — Client-side shutdown leak suppressions (30+ entries) - apple#13278 — Fix root causes: TaskQueue::clear() + DatabaseContext destructor - apple#13288 — Increase upgrade test shutdown timeout for ASAN
The existing leak:ThreadSafeDatabase::createTransaction (added in #13278) only catches the direct ThreadSafeTransaction allocation (2x2048 bytes). Indirect leaks from transaction operations (TransactionState, extractReadVersion, WriteMap entries) are attributed to their own allocation sites, which have ThreadSafeTransaction methods in the call stack but don't match the createTransaction suppression. This causes the ASAN nightly to report 10,296 bytes in 48 allocations across multiple fdb_c_api_test_* tests even with the fix from #13278 applied. Related: - #13188 — Initial LSAN suppressions + gRPC use-after-return fix - #13242 — Remove explicit __lsan_do_leak_check() (symbolizer deadlock) - #13255 — Client-side shutdown leak suppressions (30+ entries) - #13278 — Fix root causes: TaskQueue::clear() + DatabaseContext destructor - #13288 — Increase upgrade test shutdown timeout for ASAN
After #13188 (Peer LSAN suppressions + gRPC use-after-return fix) and #13255 (client-side shutdown leak suppressions), after finding new unsuppressed leaks in the ASAN nightly run, this commit goes back from adding yet more targeted suppressions and instead spends some time on cleaning up running processes at shutdown... so can make do with a smaller set of suppressions (The fear is so many suppressions, we miss actual leaks).
CI ASAN nightly builds reported 128K+ bytes leaked in 2000+ allocations across 27 tests. These leaks came from two sources missing suppressions:
TaskQueue::clear() in Net2::stopImmediately() swapped out the timer and ready queues but never deleted the PromiseTask* pointers inside them. Each leaked PromiseTask held a Promise whose destruction would have freed waiting actors coroutine frames.
DatabaseContext::~DatabaseContext() cancelled some background actors but missed four others: logger (databaseLogger + tssLogger), clientStatusUpdater.actor, throttleExpirer, and statusLeaderMon.
Rather than add more suppressions, work on some clean up in shutdown:
TaskQueue::clear() now swaps queues into locals then iterates and deletes all Task* pointers. Deleting a PromiseTask fires broken_promise to waiting futures, which cancels the associated actors and frees their coroutine frames. The swap-first approach prevents infinite loops from actors that catch broken_promise and retry with a new delay().
DatabaseContext destructor now cancels all four leaked background actors, following the same pattern already used for clientDBInfoMonitor et al.
LSAN suppressions trimmed from 30+ entries to 8. The remaining suppressions cover genuinely unfixable cases: Peer objects (no safe destructor path), fdbcli transaction references not released before stopNetwork(), external client DatabaseContext cleanup deferred via onMainThreadVoid after network stop, and monitorProtocolVersion cross-thread dispatch.
In test runs, zero LSAN reports, 91% ctest pass rate (remaining 4 failures are ASAN slowness timeouts and the pre-existing makecontext/Severity=40 issue). Hopefully same when run up in nightly.