Skip to content

Await cleanup after pool force_kill#140

Merged
dcellison merged 2 commits intomainfrom
fix/pool-force-kill-cleanup
Mar 23, 2026
Merged

Await cleanup after pool force_kill#140
dcellison merged 2 commits intomainfrom
fix/pool-force-kill-cleanup

Conversation

@dcellison
Copy link
Copy Markdown
Owner

Summary

  • Make ClaudePool.force_kill() async so it can call instance.shutdown() for full process cleanup
  • Previously, force_kill() called instance.force_kill() which only sends raw SIGKILL with no cleanup: stderr task leaked, process state not cleared, process group not killed
  • shutdown() does SIGTERM -> SIGKILL escalation -> cancel stderr -> clear state -> killpg
  • Wrapped in asyncio.wait_for(timeout=5) with raw force_kill() fallback so it never blocks indefinitely
  • Instance is popped from the pool before awaiting, so no concurrent access during cleanup

Changes

File Change
src/kai/pool.py force_kill() now async, calls shutdown() with timeout fallback
src/kai/bot.py Both callers (lock timeout, /stop) updated to await
tests/test_pool.py Existing tests made async, new timeout-fallback test added
tests/test_bot.py _make_mock_claude pool mock uses AsyncMock for force_kill

Test plan

  • test_force_kill_specific_user - verifies shutdown called on target, not on other users
  • test_force_kill_no_instance - no-op for unknown user
  • test_force_kill_shutdown_timeout_falls_back - when shutdown hangs, falls back to raw SIGKILL
  • Existing bot tests (test_sets_stop_event_and_kills, test_timeout_kills_claude) pass with async mock
  • All 1069 tests pass, make check clean

Fixes #133

Make ClaudePool.force_kill() async so it can call instance.shutdown()
instead of raw instance.force_kill(). shutdown() does full cleanup:
SIGTERM, SIGKILL escalation, stderr task cancellation, process state
clearing, and process group kill. Without this, force_kill left
orphaned stderr tasks, zombie processes, and unkilled process groups.

Wraps shutdown() in a 5-second timeout with raw force_kill() fallback
so it never blocks callers indefinitely.

Fixes #133
@dcellison
Copy link
Copy Markdown
Owner Author

Review by Kai

PR Review: fix/pool-force-kill-cleanup

The overall approach is solid — pop-then-shutdown prevents concurrent access, and the timeout fallback prevents indefinite blocking. A few issues worth addressing:


Warning

TimeoutError catch breaks on Python < 3.11 (pool.py, except TimeoutError)

asyncio.wait_for raises asyncio.TimeoutError. Before Python 3.11, asyncio.TimeoutError is not a subclass of the built-in TimeoutError (it descends from concurrent.futures.TimeoutErrorException, while built-in TimeoutError descends from OSError). On Python < 3.11 the except clause misses the timeout, the exception propagates uncaught, and the fallback force_kill() is never called.

Fix:

except (TimeoutError, asyncio.TimeoutError):

This is safe across all versions and explicit about intent.


Suggestion

Timeout test will take ~5 seconds in CI (tests/test_pool.py, test_force_kill_shutdown_timeout_falls_back)

The test triggers the real 5-second asyncio.wait_for timeout. Consider patching the timeout constant or passing timeout as a parameter to make the test fast:

with patch("kai.pool._FORCE_KILL_TIMEOUT", 0.01):
    await pool.force_kill(111)

Or parameterize the timeout in force_kill. As written, every CI run pays a 5-second penalty for this one test.


Suggestion

Unhandled non-timeout exceptions from shutdown() (pool.py)

If shutdown() raises something other than TimeoutError (e.g. ProcessLookupError, OSError), it propagates from force_kill() to the caller. This may be intentional, but the docstring says "never blocks callers for long" without mentioning it can raise. Worth either catching Exception broadly with a log or documenting the exception contract explicitly.


Clean

  • bot.py call-site updates are correct and minimal.
  • Pop-before-await pattern correctly eliminates re-entrancy risk.
  • AsyncMock fix in test_bot.py is the right change.
  • test_force_kill_specific_user correctly verifies the target/non-target distinction.

The TimeoutError compatibility issue is the only one worth fixing before merge if the project runs on Python 3.10.

…nalty

The timeout test was waiting the full 5 seconds for wait_for to fire.
Extract the timeout as a module constant and patch it to 0.01s in the
test.
@dcellison
Copy link
Copy Markdown
Owner Author

Review by Kai

Review: fix/pool-force-kill-cleanup (revision)

The prior review's suggestions were handled correctly:

  • Timeout test slowness — fixed via patch("kai.pool._FORCE_KILL_TIMEOUT", 0.01) in test_force_kill_shutdown_timeout_falls_back. No longer a 5-second CI penalty.
  • TimeoutError compatibility and non-timeout exception handling — unchanged; author has seen these and made the call.

No new issues.

Spot-checks on the new test (test_force_kill_shutdown_timeout_falls_back):

  • side_effect=hang_forever correctly causes mock.shutdown() to return the hang_forever() coroutine, so asyncio.wait_for sees a real awaitable and the timeout fires as expected. The mechanics are sound.
  • _FORCE_KILL_TIMEOUT is a module-level constant (not captured in a closure at import time), so patch("kai.pool._FORCE_KILL_TIMEOUT", 0.01) correctly overrides it inside the with block.
  • The pop-before-await pattern means b's instance is never touched in test_force_kill_specific_user, which is what the assert_not_called() verifies.

bot.py call sites — both await pool.force_kill(chat_id) updates are correct and complete.

Clean. Ready to merge.

@dcellison dcellison merged commit 6382059 into main Mar 23, 2026
1 check passed
@dcellison dcellison deleted the fix/pool-force-kill-cleanup branch March 23, 2026 22:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

force_kill doesn't await cleanup, orphans stderr task and zombie process

2 participants