fix(envd): stop memory exhaustion when client disconnects from streaming process#2620
fix(envd): stop memory exhaustion when client disconnects from streaming process#2620arkamar wants to merge 17 commits into
Conversation
…onnect TestStart_ClientDisconnectLeavesOrphanProcess shows that when a client disconnects from a Start RPC, the child process and handler goroutines remain alive because procCtx is derived from context.Background().
TestStart_DisconnectStormHeapGrowth runs 5 Start RPC cycles with a fast stdout producer (yes), disconnects each time, and asserts that heap growth stays under 50 MiB. Currently fails because orphaned handlers keep pumping data into unbounded channel buffers.
When all Start/Connect RPC subscribers disconnect, the fan-out loop stops consuming from Source (via receiveWhenReady), which lets the Source buffer fill, which blocks the reader goroutine, which fills the pipe, which pauses the child process — natural Unix back-pressure with zero memory growth. Reader goroutines select on outCtx.Done() to unblock from a full Source send when the child exits (cmd.Wait cancels outCtx), preventing deadlock between the reader and the cleanup path. Also restructures handler.Wait() to call cmd.Wait() first (reap the child and close pipes), then cancel outCtx to unblock readers.
…cribers gone Send to an unbuffered Source channel deadlocks after the last subscriber is cancelled. receiveWhenReady parks on <-sig waiting for a subscriber that will never arrive, so the producer (Wait) blocks forever. This leaks the Wait goroutine and prevents processes.Delete from running.
…(Source) Calling close(Source) instead of CloseSource() after the last subscriber is cancelled leaks the fan-out goroutine. The closed flag is never set and NotifySubscriberChange is never called, so receiveWhenReady stays parked on <-sig forever. This matches start.go:101 where `defer close(startMultiplexer.Source)` bypasses CloseSource().
Two related fixes for regressions introduced by the back-pressure commit (3e6e57e): 1. EndEvent.Source send deadlock: Wait() sends to EndEvent.Source after all subscribers are gone. With buffer=0, this blocks forever because receiveWhenReady parks on <-sig. Fix: use buffer=1 for EndEvent so the single send always succeeds, and call CloseSource() after the send so the fan-out exits. 2. startMultiplexer fan-out leak: start.go used bare close(Source) instead of CloseSource(), so the closed flag was never set and NotifySubscriberChange was never called. The fan-out goroutine stayed parked on <-sig forever. Fix: use CloseSource().
When cancel() and CloseSource() fired in quick succession, the fan-out goroutine could grab a signal channel created *by* CloseSource's NotifySubscriberChange, then park on it forever since no further notifications would arrive. Re-check closed after acquiring the signal to close the window. Replace exited atomic.Bool with a done channel closed when run() finishes. Fork() selects on it to detect shutdown, and tests use it to wait deterministically for run() to exit.
The heap measurement is inherently racy — GC timing, ambient allocations, and uint64 underflow on heap shrinkage make it unreliable at high -count values. The back-pressure behavior it validated is already covered by the multiplex and disconnect tests.
There was a problem hiding this comment.
An organization admin can view or raise the cap at claude.ai/admin-settings/claude-code. The cap resets at the start of the next billing period.
Once the cap resets or is raised, reopen this pull request to trigger a review.
PR SummaryMedium Risk Overview Potential issues: Reviewed by Cursor Bugbot for commit 360254e. Bugbot is set up for automated code reviews on this repo. Configure here. |
❌ 9 Tests Failed:
View the full list of 15 ❄️ flaky test(s)
To view more test analytics, go to the Test Analytics Dashboard |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c55adcc559
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
Code Review
The receiveWhenReady function contains a race condition that can lead to a lost wake-up and a deadlock of the fan-out loop. If a subscriber is added or removed after the HasSubscribers check but before the subSignal is fetched, the loop may acquire an unclosed signal channel and block indefinitely. Fetching the signal channel before checking the conditions ensures that any subsequent change will correctly close the channel being waited on.
Avoids a race where a subscriber added between HasSubscribers() and fetching subSignal could leave the fan-out parked on a fresh unclosed signal channel.
34be6db to
cde8dc9
Compare
cmd.StdoutPipe/StderrPipe are managed by cmd.Wait which closes the pipe read-ends on return, racing with readers that haven't finished. Replace them with manual os.Pipe so we control the lifecycle: write-ends are closed after Start (child inherited them), read-ends stay open until readers finish naturally via EOF. After cmd.Wait reaps the child, call Drain() on the data multiplexer to disable back-pressure, letting stuck readers unblock and see EOF. Then wait for all readers to exit before proceeding.
…han children After cmd.Wait() reaps the child, use SetReadDeadline on the pipe read-ends so readers drain any buffered data (reads with available data return instantly) then exit on deadline instead of blocking forever when an orphan grandchild holds the write-end open. Readers treat the deadline timeout the same as EOF — clean exit, no data loss. Add a dedicated readersDone channel that closes when stdout/stderr reader goroutines actually exit, replacing the outCtx/outCancel mechanism which is no longer needed. Fixes TestStart_OrphanGrandchildDoesNotHangStream and the TestCommandKillNextApp CI failure.
Add TestStart_OrphanGrandchildDoesNotHangStream: verifies that killing a process whose grandchild holds stdout open delivers the EndEvent within the stream timeout instead of hanging.
cde8dc9 to
ab0315b
Compare
…er on Fork Move close(m.done) inside the m.mu critical section in run() so that Fork()'s re-check under the same lock always observes the shutdown. Previously, close(m.done) happened after Unlock, creating a window where Fork could add a subscriber that run() never cleans up — leaving the channel open forever and hanging consumers.
Readers exit after the SetReadDeadline timeout fires, but the read-end file descriptors were never closed. Each non-PTY process leaked two fds. Close them after readersDone signals.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit c342710. Configure here.
…ang Wait The SetReadDeadline escape mechanism only applied to non-PTY pipe read-ends. For PTY processes, the reader goroutine would block on tty.Read() forever when an orphan grandchild held the PTY slave open, deadlocking Wait() at <-readersDone. Set the same deadline on p.tty and treat timeout as EOF in the PTY reader loop.

When a Start/Connect client disconnects while a process is producing output, the fan-out loop keeps draining the Source channel with no subscribers — discarding every value but keeping the reader goroutines hot. Each reader allocates 32 KiB per read cycle, and with a fast producer envd RSS grows to hundreds of MiB in seconds, OOM-killing other processes in the sandbox.
The fix adds back-pressure: when no active subscribers remain, the fan-out stops consuming from Source, the channel fills, the reader blocks, and the OS pipe back-pressures the child. The child resumes instantly when a new client calls Fork. This is safe because output with no subscribers is already lost — the SDK has no replay mechanism.
Stdout/stderr pipes are replaced with manual
os.Pipe()socmd.Wait()doesn't close the read-ends prematurely, fixing output truncation on fast commands. After the child is reaped,SetReadDeadlineon the read-ends lets readers drain buffered data then exit cleanly instead of blocking forever when an orphan grandchild holds the write-end open. Reader goroutines close their own read-ends via defer.EndEvent buffer is changed from 0 to 1 so the send in
Wait()succeeds when the fan-out is parked. TheoutCtx/outCancelmechanism is removed entirely — a dedicatedreadersDonechannel tracks when readers actually exit, whichSendSignal(SIGKILL)can no longer bypass.close(m.done)is moved under the lock soFork()racing with shutdown cannot orphan a subscriber.