Skip to content

_drain_stderr observes stale self._proc due to ordering in _kill() #124

@dcellison

Description

@dcellison

Summary

In claude.py, _kill() clears self._proc = None (line 782) before cancelling self._stderr_task (line 797-798). The _drain_stderr task (line 276) has a while-loop that checks self._proc on each iteration. Between the proc reference being cleared and the task being cancelled, _drain_stderr can observe self._proc as None in a state that was not intended to be visible to it.

The ordering in _kill()

async def _kill(self) -> None:
    if self._proc:
        # ... SIGKILL + wait ...
        self._proc = None          # line 782: proc reference cleared
        self._pgid = None           # line 783
        self._session_id = None     # line 784
        self._session_started_at = None  # line 785
        # ... pgid cleanup ...
    if self._stderr_task:           # line 797: task cancelled AFTER proc cleared
        self._stderr_task.cancel()
        self._stderr_task = None

The drain loop in _drain_stderr:

while self._proc and self._proc.stderr:       # line 276
    try:
        line = await self._proc.stderr.readline()  # line 278

Race window

When _kill() sends SIGKILL and awaits self._proc.wait() at line 779, it yields to the event loop. Both _kill and _drain_stderr are waiting on the same process to die. When the process exits, the event loop can resume either coroutine first.

If _kill resumes first:

  1. _kill clears self._proc = None (synchronous, no yield)
  2. _kill proceeds to cancel _stderr_task (still synchronous)
  3. _drain_stderr resumes, checks while self._proc - False, exits cleanly

If _drain_stderr resumes first:

  1. readline returns b"" (EOF), drain checks if not line: break, exits
  2. _kill resumes, clears proc, cancels the already-completed task

Both orderings happen to work correctly today, but only because:

  • The while condition re-checks self._proc on every iteration (so a None proc exits the loop)
  • No yield point exists between the while-condition check and the self._proc.stderr.readline() expression evaluation (so proc can't become None between the guard and the access)
  • CancelledError is a BaseException, not caught by the except Exception on line 284 (so task cancellation propagates cleanly)

Why this matters

The correctness depends on three implementation details that are not documented and could easily be broken by routine maintenance:

  1. If someone adds a yield point (logging, metrics) between the while check and the readline, self._proc could become None between the guard and the self._proc.stderr access, causing an AttributeError.

  2. If someone changes except Exception to except BaseException (a common "catch everything" reflex), CancelledError gets swallowed and the drain loop continues checking a None proc.

  3. If the drain loop is refactored to cache proc = self._proc locally (a common optimization), the while-condition guard no longer protects against the reference going stale mid-iteration.

The fundamental issue is that _kill() invalidates state that another running task depends on, then cancels that task as an afterthought. The invariant should be: cancel the consumer before destroying what it consumes.

Severity

MEDIUM - not an active bug under current code, but a latent defect. The "accidental correctness" makes this a maintenance trap that will surface as a confusing AttributeError or silent malfunction the next time someone touches either _kill() or _drain_stderr().

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workinggood first issueGood for newcomers

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions