Fix eviction silent failure and orphaned subprocesses #159
Defer pool.pop() until after shutdown completes in both the eviction loop and force_kill(). Previously, instances were removed from tracking before shutdown, so a failed shutdown left the subprocess running but invisible to the pool.

Eviction loop: on shutdown failure, escalate to instance.force_kill() (raw SIGKILL) before removing from tracking. The subprocess is always confirmed dead before being untracked.

force_kill(): widen exception handler from TimeoutError to Exception. Any failure from shutdown() now falls back to SIGKILL instead of propagating to the /stop handler and crashing the user's command.

Fixes #154
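The ordering change described above can be illustrated with a small, self-contained sketch. It is not the code from pool.py: `FakeInstance`, `evict_pop_first`, and `evict_pop_last` are hypothetical names, and the pool is assumed to be a plain dict keyed by chat id.

```python
import asyncio


class FakeInstance:
    """Hypothetical stand-in for a pooled subprocess wrapper."""

    def __init__(self, fail_shutdown: bool = False) -> None:
        self.is_alive = True
        self._fail_shutdown = fail_shutdown

    async def shutdown(self) -> None:
        # Simulate a graceful shutdown that may fail.
        if self._fail_shutdown:
            raise RuntimeError("graceful shutdown failed")
        self.is_alive = False

    def force_kill(self) -> None:
        # Stands in for a raw SIGKILL; assumed not to fail.
        self.is_alive = False


async def evict_pop_first(pool: dict, chat_id: int) -> None:
    """Old ordering: untrack first, then shut down."""
    instance = pool.pop(chat_id, None)
    if instance and instance.is_alive:
        await instance.shutdown()  # on failure the instance is already untracked


async def evict_pop_last(pool: dict, chat_id: int) -> None:
    """New ordering: shut down (or kill), then untrack."""
    instance = pool.get(chat_id)
    if instance and instance.is_alive:
        try:
            await instance.shutdown()
        except Exception:
            instance.force_kill()
    pool.pop(chat_id, None)


async def demo() -> tuple:
    old_inst = FakeInstance(fail_shutdown=True)
    new_inst = FakeInstance(fail_shutdown=True)
    old_pool, new_pool = {1: old_inst}, {1: new_inst}
    try:
        await evict_pop_first(old_pool, 1)
    except RuntimeError:
        pass  # the failure surfaces, but the instance has already left the pool
    await evict_pop_last(new_pool, 1)
    return (old_inst.is_alive, 1 in old_pool, new_inst.is_alive, 1 in new_pool)


result = asyncio.run(demo())
```

With the old ordering the failing instance ends up alive and untracked, which is the orphan case. With the new ordering it is killed first and only then removed.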
Review by Kai
Review: fix/eviction-silent-failure
The core fix (defer pop until after confirmed-dead) is correct and well-motivated. Three issues worth flagging:
Warning: Dead instances leak from eviction loop (regression)
The original code always popped unconditionally before the alive check. The fix needs either an

    instance = self._pool.get(chat_id)
    if instance and instance.is_alive:
        try:
            await instance.shutdown()
        except Exception:
            log.exception(...)
            instance.force_kill()
    # Pop unconditionally: clean up dead instances too
    self._pool.pop(chat_id, None)
    self._last_activity.pop(chat_id, None)

Warning:
- Move pop outside the is_alive block in eviction loop so dead instances are cleaned up (not leaked indefinitely)
- Use finally block in force_kill() so cleanup runs even if CancelledError (BaseException, not caught by except Exception) propagates
- Fix misleading comment about CancelledError being caught
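As an illustration of the widened handler combined with a finally block, here is a minimal sketch. The real force_kill() in pool.py is not shown in this conversation, so `pool_force_kill` and `FakeInstance` are hypothetical, and the pool is assumed to be a dict keyed by chat id.

```python
import asyncio


class FakeInstance:
    """Hypothetical fake whose shutdown() can raise a chosen error."""

    def __init__(self, error=None) -> None:
        self.is_alive = True
        self.killed = False
        self._error = error

    async def shutdown(self) -> None:
        if self._error is not None:
            raise self._error
        self.is_alive = False

    def force_kill(self) -> None:
        # Stands in for a raw SIGKILL.
        self.killed = True
        self.is_alive = False


async def pool_force_kill(pool: dict, chat_id: int) -> None:
    instance = pool.get(chat_id)
    if instance is None:
        return
    try:
        await instance.shutdown()
    except Exception:
        # Widened handler: any ordinary failure falls back to SIGKILL.
        instance.force_kill()
    finally:
        # Runs even if a BaseException such as CancelledError propagates.
        pool.pop(chat_id, None)


async def demo() -> dict:
    outcomes = {}
    cases = [("timeout", TimeoutError()), ("os_error", OSError("boom")), ("ok", None)]
    for name, err in cases:
        inst = FakeInstance(err)
        pool = {1: inst}
        await pool_force_kill(pool, 1)
        outcomes[name] = (inst.killed, inst.is_alive, 1 in pool)
    return outcomes


outcomes = asyncio.run(demo())
```

In this sketch both failure cases end with the instance killed and untracked, and the success case is untracked without a kill.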
Review by Kai
Review: fix/eviction-silent-failure (round 2)
All three prior review issues have been addressed. The dead-instance leak in the eviction loop is fixed (pops moved outside the
One new issue:
Warning: Eviction loop still doesn't protect against CancelledError
    if instance and instance.is_alive:
        try:
            await instance.shutdown()
        except Exception:  # CancelledError escapes here
            instance.force_kill()
    # NOT in a finally: skipped if CancelledError propagates
    self._pool.pop(chat_id, None)
    self._last_activity.pop(chat_id, None)

If
Practical blast radius is narrow: this only fires if the eviction task is cancelled mid-shutdown, at which point

    try:
        if instance and instance.is_alive:
            try:
                await instance.shutdown()
            except Exception:
                log.exception("Error evicting subprocess for user %d, sending SIGKILL", chat_id)
                instance.force_kill()
    finally:
        self._pool.pop(chat_id, None)
        self._last_activity.pop(chat_id, None)

Everything else looks clean. The
Same pattern as force_kill(): use try/finally so cleanup runs even if CancelledError propagates from shutdown(). Ensures no orphaned instances if the eviction task is cancelled mid-shutdown.
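The cancellation behaviour this commit relies on can be checked in isolation. The sketch below assumes Python 3.8 or later, where asyncio.CancelledError derives from BaseException rather than Exception. `slow_shutdown` and `evict` are hypothetical stand-ins, not functions from pool.py.

```python
import asyncio

events = []


async def slow_shutdown() -> None:
    # Stands in for a shutdown() that is still waiting when the task is cancelled.
    await asyncio.sleep(10)


async def evict(pool: dict, chat_id: int) -> None:
    try:
        try:
            await slow_shutdown()
        except Exception:
            # Not reached on cancellation: CancelledError is not an Exception.
            events.append("except Exception ran")
    finally:
        pool.pop(chat_id, None)
        events.append("finally ran")


async def main() -> dict:
    pool = {42: object()}
    task = asyncio.create_task(evict(pool, 42))
    await asyncio.sleep(0)  # let the task start and block inside slow_shutdown()
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        events.append("cancelled")
    return pool


pool_after = asyncio.run(main())
```

Only the finally block runs during cancellation, which is why the pops need to live there rather than after the except clause.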
Review by Kai
Review: fix/eviction-silent-failure (round 3)
The round 2 warning (eviction loop
One new issue:
Suggestion:
Review by Kai
Review: fix/eviction-silent-failure (round 4)
All three prior warning-level issues have been addressed. The
Suggestion:
Summary
Fix two resource management bugs in pool.py where subprocess instances were removed from tracking before shutdown completed, causing orphaned processes on failure.

Problem 1: Pop-before-shutdown ordering
Both _eviction_loop() and force_kill() called self._pool.pop() before instance.shutdown(). If shutdown failed, the subprocess was still running but invisible to the pool: no retry, no cleanup, just a silent memory leak.

Problem 2: Incomplete exception handling in force_kill()
force_kill() only caught TimeoutError. Any other exception from shutdown() (CancelledError, OSError, etc.) propagated to the /stop handler, crashing the user's command while leaving the subprocess untrackable.

Fix
- Eviction loop: on shutdown failure, escalate to instance.force_kill() (raw SIGKILL, effectively infallible) as a last resort before removing from tracking.
- force_kill(): widen the exception handler from TimeoutError to Exception so nothing propagates to callers.

Changes
- force_kill(): TimeoutError to Exception
- _eviction_loop()

Test plan
- test_force_kill_catches_non_timeout_exceptions - RuntimeError caught, SIGKILL fallback, instance removed
- test_force_kill_pops_after_successful_shutdown - instance in pool during shutdown, removed after
- test_eviction_failure_triggers_sigkill_fallback - failed shutdown triggers force_kill, instance not orphaned
- test_eviction_pops_after_shutdown - instance in pool during eviction shutdown, removed after
- make check clean

Fixes #154
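For readers who want to see what a check like test_eviction_pops_after_shutdown might look like, here is a rough sketch. The actual tests are not shown in this conversation. `RecordingInstance`, `evict`, and `check_pops_after_shutdown` are hypothetical, and the eviction step simply follows the try/finally shape proposed in the review.

```python
import asyncio


class RecordingInstance:
    """Hypothetical fake that records whether it was tracked during shutdown."""

    def __init__(self, pool: dict, chat_id: int) -> None:
        self.is_alive = True
        self.tracked_during_shutdown = None
        self._pool = pool
        self._chat_id = chat_id

    async def shutdown(self) -> None:
        # Observe the pool from inside shutdown().
        self.tracked_during_shutdown = self._chat_id in self._pool
        self.is_alive = False

    def force_kill(self) -> None:
        self.is_alive = False


async def evict(pool: dict, chat_id: int) -> None:
    """Eviction step using the try/finally shape from the review."""
    instance = pool.get(chat_id)
    try:
        if instance and instance.is_alive:
            try:
                await instance.shutdown()
            except Exception:
                instance.force_kill()
    finally:
        pool.pop(chat_id, None)


def check_pops_after_shutdown() -> RecordingInstance:
    pool: dict = {}
    instance = RecordingInstance(pool, chat_id=7)
    pool[7] = instance
    asyncio.run(evict(pool, 7))
    assert instance.tracked_during_shutdown is True  # still tracked mid-shutdown
    assert 7 not in pool  # removed only afterwards
    return instance


checked = check_pops_after_shutdown()
```

Recording the pool state from inside the fake's shutdown() is one way to assert ordering without relying on timing.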