fix(runtime): break Rc cycle that leaked fds when tasks parked on CancelToken/Submit/SubmitMulti#909
Conversation
|
Could you also add the tests you've mentioned in the description? And BTW please solve the clippy warnings. |
|
After some thoughts, I think that |
…kens
Every compio::time::sleep, timeout, and in-flight io_uring op parked a
task that held one of CancelToken::Inner, Submit, or SubmitMulti. All
three stored a strong Runtime (= Rc<RuntimeInner>) clone, forming a
reference cycle:
task -> CancelToken/Submit/SubmitMulti
-> Rc<RuntimeInner>
-> executor
-> task
Runtime::drop checks Rc::strong_count(&self.0) > 1 and early-returns
without calling executor.clear(). But executor.clear() is the only
thing that would ever drop those tasks. The cycle is never broken, so
the io_uring fd and every socket/pipe fd owned by any unfinished task
leak for the life of the process.
Fix: remove the stored runtime handle from Submit, SubmitMulti, and
CancelToken::Inner entirely. All three now obtain the runtime on demand
via the thread-local (Runtime::current / try_with_current):
- Poll paths call Runtime::current(), which is always valid because
tasks are polled from within the executor, which runs inside
Runtime::enter().
- PinnedDrop and CancelToken::cancel use Runtime::try_with_current(),
which silently no-ops if no runtime is active. In practice the drop
path runs inside executor.clear() -> Runtime::enter(), so the
thread-local IS set; the fallback handles the rare case of an op
being dropped after the runtime exits completely.
With no Rc stored inside tasks, strong_count is always 1 when the
last user-facing Runtime drops, executor.clear() always runs, and
every fd is closed.
Also adds a regression test (test_task_dropped_when_runtime_drops)
that verifies a task parked on sleep is dropped when the runtime drops.
The test fails on the unfixed code and passes after.
d3e606d to
060427c
Compare
|
Thanks for the review. I've reworked the fix along the lines you suggested — no New approach: thread-local instead of stored handle
No Overhead vs Regression test Added I also fixed the two RE: removing |
|
@paddor This comment looks REALLY like LLM-generated to me... Can you please not? |
|
Yep, I was just about to apologize for the AI blabla above. It's so very eager to update the PR even though I'm still testing the new fix locally. Give me a few minutes. |
|
OMQ's bench suite just completed successfully without resource exhaustion. So the fix seems to work as intended. Are you okay with the code quality itself? If you like I can update the PR to remove those ugly em dashes so it's ASCII only. |
Problem
Every
compio::time::sleep,timeout, and in-flight io_uring op parks a task that holds one ofCancelToken::Inner,Submit, orSubmitMulti. All three stored a strongRuntimeclone (Rc<RuntimeInner>), forming a reference cycle:Runtime::dropchecksRc::strong_count(&self.0) > 1and early-returns without callingexecutor.clear(). The only thing that would ever drop those tasks isexecutor.clear(). The cycle is never broken, so the io_uring fd and every socket/pipe fd owned by any unfinished task leak for the life of the process.Fix
Store
Weak<RuntimeInner>in all three structs instead of a fullRuntime.PinnedDrop(triggered byexecutor.clear()) and inCancelToken::cancelwhen the runtime is already gone,Weak::upgradereturnsNone— silently no-op, because the io_uring fd is closed and all pending ops are already cancelled by the kernel.Future::poll/Stream::poll_next,upgrade()always succeeds (the future is polled by the executor, which is owned by the runtime — the runtime must be alive), so we panic-upgrade there to keep the hot path clean.Adds
Runtime::weak()andRuntime::from_rc()aspub(crate)helpers.Tests
Existing tests in
compio-runtimecover the relevant scenarios and all pass: