Skip to content

Fix eviction silent failure and orphaned subprocesses#159

Merged
dcellison merged 4 commits intomainfrom
fix/eviction-silent-failure
Mar 24, 2026
Merged

Fix eviction silent failure and orphaned subprocesses#159
dcellison merged 4 commits intomainfrom
fix/eviction-silent-failure

Conversation

@dcellison
Copy link
Copy Markdown
Owner

Summary

Fix two resource management bugs in pool.py where subprocess instances were removed from tracking before shutdown completed, causing orphaned processes on failure.

Problem 1: Pop-before-shutdown ordering

Both _eviction_loop() and force_kill() called self._pool.pop() before instance.shutdown(). If shutdown failed, the subprocess was still running but invisible to the pool - no retry, no cleanup, just a silent memory leak.

Problem 2: Incomplete exception handling in force_kill()

force_kill() only caught TimeoutError. Any other exception from shutdown() (CancelledError, OSError, etc.) propagated to the /stop handler, crashing the user's command while leaving the subprocess untrackable.

Fix

  • Defer pop until after shutdown: Instance stays in the pool during shutdown. Only removed after the subprocess is confirmed dead.
  • SIGKILL fallback on any failure: Both paths now call instance.force_kill() (raw SIGKILL, effectively infallible) as a last resort before removing from tracking.
  • Catch all exceptions in force_kill(): Widen from TimeoutError to Exception so nothing propagates to callers.

Changes

Location Before After
force_kill() Pop first, catch only TimeoutError Pop after shutdown, catch all Exception
_eviction_loop() Pop first, log and move on Pop after shutdown, SIGKILL fallback on failure

Test plan

  • test_force_kill_catches_non_timeout_exceptions - RuntimeError caught, SIGKILL fallback, instance removed
  • test_force_kill_pops_after_successful_shutdown - instance in pool during shutdown, removed after
  • test_eviction_failure_triggers_sigkill_fallback - failed shutdown triggers force_kill, instance not orphaned
  • test_eviction_pops_after_shutdown - instance in pool during eviction shutdown, removed after
  • Existing tests pass (force_kill specific user, timeout fallback, no instance, TOCTOU guards)
  • All 1097 tests pass, make check clean

Fixes #154

Defer pool.pop() until after shutdown completes in both the eviction
loop and force_kill(). Previously, instances were removed from
tracking before shutdown, so a failed shutdown left the subprocess
running but invisible to the pool.

Eviction loop: on shutdown failure, escalate to instance.force_kill()
(raw SIGKILL) before removing from tracking. The subprocess is always
confirmed dead before being untracked.

force_kill(): widen exception handler from TimeoutError to Exception.
Any failure from shutdown() now falls back to SIGKILL instead of
propagating to the /stop handler and crashing the user's command.

Fixes #154
@dcellison
Copy link
Copy Markdown
Owner Author

Review by Kai

Review: fix/eviction-silent-failure

The core fix (defer pop until after confirmed-dead) is correct and well-motivated. Three issues worth flagging:


Warning — Dead instances leak from eviction loop (regression)

pool.py, _eviction_loop

The _pool.pop() and _last_activity.pop() calls are now inside the if instance and instance.is_alive: block. If an instance is in the pool but has already died on its own (is_alive=False), the eviction loop skips both the shutdown and the cleanup. That instance stays in the pool indefinitely.

Original code always popped unconditionally before the alive check. The fix needs either an else: branch or to move the pop outside the if block for the dead-instance case:

instance = self._pool.get(chat_id)
if instance and instance.is_alive:
    try:
        await instance.shutdown()
    except Exception:
        log.exception(...)
        instance.force_kill()
# Pop unconditionally — clean up dead instances too
self._pool.pop(chat_id, None)
self._last_activity.pop(chat_id, None)

Warning — CancelledError not caught by except Exception (misleading comment + broken promise)

pool.py, force_kill, line ~222

The comment says:

# Any failure (timeout, CancelledError, OSError, etc.) -

But in Python 3.8+, asyncio.CancelledError is a BaseException, not an Exception. except Exception: will not catch it. If the task is cancelled during await asyncio.wait_for(...), the exception propagates and the _pool.pop() / _last_activity.pop() at the end never execute — the instance stays in the pool untracked.

Fix: use except BaseException: (and re-raise if it's CancelledError after doing cleanup), or add an explicit finally: block for the pop:

try:
    await asyncio.wait_for(instance.shutdown(), timeout=_FORCE_KILL_TIMEOUT)
except Exception:
    instance.force_kill()
    log.warning(...)
finally:
    self._pool.pop(chat_id, None)
    self._last_activity.pop(chat_id, None)

Suggestion — Double-shutdown window in force_kill()

pool.py, force_kill

Between self._pool.get(chat_id) and the eventual self._pool.pop() there is a suspension point (await wait_for). If two coroutines call force_kill(111) concurrently — e.g. a user command races with an eviction tick — both see the instance in the pool and both call shutdown() on the same object.

The old pop-first approach prevented this for concurrent force_kill pairs. The pool docstring notes reliance on a per-chat caller lock, but the eviction loop doesn't hold one. A simple guard (e.g. a _being_killed: set[int] checked at entry) would close the window without introducing a full lock.


Clean otherwise

The exception-broadening logic, the SIGKILL fallback placement, and the new tests (especially test_force_kill_pops_after_successful_shutdown checking pool membership during shutdown) are solid. The existing TOCTOU guards in _eviction_loop are preserved correctly.

- Move pop outside the is_alive block in eviction loop so dead
  instances are cleaned up (not leaked indefinitely)
- Use finally block in force_kill() so cleanup runs even if
  CancelledError (BaseException, not caught by except Exception)
  propagates
- Fix misleading comment about CancelledError being caught
@dcellison
Copy link
Copy Markdown
Owner Author

Review by Kai

Review: fix/eviction-silent-failure (round 2)

All three prior review issues have been addressed. The dead-instance leak in the eviction loop is fixed (pops moved outside the if instance and instance.is_alive: block). The CancelledError gap in force_kill is fixed via finally. The double-shutdown suggestion was not taken — that's the author's call.

One new issue:


Warning — Eviction loop still doesn't protect against CancelledError

pool.py, _eviction_loop, lines ~330–340

force_kill() now correctly uses finally for the pop. The eviction loop does not:

if instance and instance.is_alive:
    try:
        await instance.shutdown()
    except Exception:          # CancelledError escapes here
        instance.force_kill()
# NOT in a finally — skipped if CancelledError propagates
self._pool.pop(chat_id, None)
self._last_activity.pop(chat_id, None)

If CancelledError is raised during await instance.shutdown(), neither the SIGKILL fallback nor the cleanup pops execute. The instance stays in the pool and the subprocess may be left running. This is the same structural gap that was fixed in force_kill() but not carried over here.

Practical blast radius is narrow — this only fires if the eviction task is cancelled mid-shutdown, at which point pool.shutdown() presumably handles cleanup — but it's an inconsistency worth closing:

try:
    if instance and instance.is_alive:
        try:
            await instance.shutdown()
        except Exception:
            log.exception("Error evicting subprocess for user %d, sending SIGKILL", chat_id)
            instance.force_kill()
finally:
    self._pool.pop(chat_id, None)
    self._last_activity.pop(chat_id, None)

Everything else looks clean. The finally-based cleanup in force_kill, the SIGKILL fallback placement, and the test coverage (particularly check_pool_during_shutdown verifying pool membership during the await) are all solid.

Same pattern as force_kill(): use try/finally so cleanup runs even if
CancelledError propagates from shutdown(). Ensures no orphaned
instances if the eviction task is cancelled mid-shutdown.
@dcellison
Copy link
Copy Markdown
Owner Author

Review by Kai

Review: fix/eviction-silent-failure (round 3)

The round 2 warning (eviction loop CancelledError leaving the instance in the pool) is addressed. The outer try/finally in _eviction_loop now mirrors the fix in force_kill, so cleanup pops execute even if CancelledError escapes await instance.shutdown().

One new issue:


Suggestion — force_kill docstring overclaims

pool.py, force_kill docstring, lines ~210–215 (new text)

The docstring says:

Falls back to raw SIGKILL on any failure (timeout, cancellation, or other exception), so this never blocks callers for long and never propagates exceptions.

CancelledError is a BaseException, not an Exception. The except Exception block does not catch it. If the task is cancelled during asyncio.wait_for(...), CancelledError propagates out of force_kill — both the SIGKILL fallback and the "never propagates" guarantee are violated (though the finally correctly handles pool cleanup). The inline comment correctly removed CancelledError from its example list, but the docstring still overstates the guarantee.

Suggested fix — qualify the claim:

# Falls back to raw SIGKILL on any non-cancellation failure,
# so this never propagates ordinary exceptions to callers.

Clean otherwise

The finally-based cleanup in both force_kill and _eviction_loop, the SIGKILL fallback placement, the dead-instance pop (now outside the is_alive guard), and the test coverage (especially check_pool_during_shutdown verifying pool membership during await) are all correct. The double-shutdown window (suggestion, not taken) is the author's call.

@dcellison
Copy link
Copy Markdown
Owner Author

Review by Kai

Review: fix/eviction-silent-failure (round 4)

All three prior warning-level issues have been addressed. The force_kill docstring now correctly qualifies the guarantee ("non-cancellation failure"), and the outer try/finally in _eviction_loop ensures pool cleanup even on CancelledError. One minor test inconsistency:


Suggestion — test_eviction_pops_after_shutdown missing _last_activity assertion

tests/test_pool.py, line ~580

        # Removed after shutdown completed
        assert 111 not in pool._pool
        # _last_activity not checked here

Every other eviction test in this PR checks both pool._pool and pool._last_activity after cleanup (e.g. test_eviction_failure_triggers_sigkill_fallback asserts both). This test only asserts _pool. Inconsistent and leaves a gap — if _last_activity cleanup regressed in the success path, this test wouldn't catch it.


Clean otherwise

The finally-based cleanup in both force_kill and _eviction_loop, the corrected docstring, the SIGKILL fallback placement, and the check_pool_during_shutdown/check_pool pattern for verifying pool membership during await are all solid. The double-shutdown window (suggestion, not taken) remains the author's call.

@dcellison dcellison merged commit 1cf748c into main Mar 24, 2026
1 check passed
@dcellison dcellison deleted the fix/eviction-silent-failure branch March 24, 2026 22:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Lock eviction can fail silently, orphaning subprocesses

2 participants