fix(envd): stop memory exhaustion when client disconnects from streaming process#2620

Draft
arkamar wants to merge 17 commits into main from fix/envd-backpressure-on-disconnect

Conversation

Contributor

@arkamar arkamar commented May 11, 2026

When a Start/Connect client disconnects while a process is producing output, the fan-out loop keeps draining the Source channel with no subscribers — discarding every value but keeping the reader goroutines hot. Each reader allocates 32 KiB per read cycle, and with a fast producer envd RSS grows to hundreds of MiB in seconds, OOM-killing other processes in the sandbox.

The fix adds back-pressure: when no active subscribers remain, the fan-out stops consuming from Source, the channel fills, the reader blocks, and the OS pipe back-pressures the child. The child resumes instantly when a new client calls Fork. This is safe because output with no subscribers is already lost — the SDK has no replay mechanism.

Stdout/stderr pipes are replaced with manual os.Pipe() so cmd.Wait() doesn't close the read-ends prematurely, fixing output truncation on fast commands. After the child is reaped, SetReadDeadline on the read-ends lets readers drain buffered data then exit cleanly instead of blocking forever when an orphan grandchild holds the write-end open. Reader goroutines close their own read-ends via defer.

EndEvent buffer is changed from 0 to 1 so the send in Wait() succeeds when the fan-out is parked. The outCtx/outCancel mechanism is removed entirely — a dedicated readersDone channel tracks when readers actually exit, which SendSignal(SIGKILL) can no longer bypass. close(m.done) is moved under the lock so Fork() racing with shutdown cannot orphan a subscriber.
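The shutdown/Fork race fix can be sketched as follows (field and method names here are illustrative, not envd's actual types): closing `done` under the same mutex `Fork` takes means `Fork` either completes before the cleanup sweep or observes the shutdown and refuses.

```go
package main

import (
	"fmt"
	"sync"
)

type mux struct {
	mu   sync.Mutex
	done chan struct{}
	subs []chan int
}

func newMux() *mux { return &mux{done: make(chan struct{})} }

func (m *mux) shutdown() {
	m.mu.Lock()
	defer m.mu.Unlock()
	close(m.done) // under the lock: no Fork can slip in after this
	for _, ch := range m.subs {
		close(ch) // every registered subscriber gets cleaned up
	}
	m.subs = nil
}

func (m *mux) Fork() (<-chan int, bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	select {
	case <-m.done:
		return nil, false // shutdown won the race; don't orphan a channel
	default:
	}
	ch := make(chan int, 1)
	m.subs = append(m.subs, ch)
	return ch, true
}

func main() {
	m := newMux()
	_, ok1 := m.Fork()
	m.shutdown()
	_, ok2 := m.Fork()
	fmt.Println(ok1, ok2)
}
```

With `close(m.done)` outside the lock, a `Fork` could register a subscriber between the cleanup sweep and the close, leaving a channel that is never closed.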

arkamar added 10 commits May 11, 2026 09:12
…onnect

TestStart_ClientDisconnectLeavesOrphanProcess shows that when a client
disconnects from a Start RPC, the child process and handler goroutines
remain alive because procCtx is derived from context.Background().
TestStart_DisconnectStormHeapGrowth runs 5 Start RPC cycles with a
fast stdout producer (yes), disconnects each time, and asserts that
heap growth stays under 50 MiB. Currently fails because orphaned
handlers keep pumping data into unbounded channel buffers.
When all Start/Connect RPC subscribers disconnect, the fan-out loop
stops consuming from Source (via receiveWhenReady), which lets the
Source buffer fill, which blocks the reader goroutine, which fills
the pipe, which pauses the child process — natural Unix back-pressure
with zero memory growth.

Reader goroutines select on outCtx.Done() to unblock from a full
Source send when the child exits (cmd.Wait cancels outCtx), preventing
deadlock between the reader and the cleanup path.

Also restructures handler.Wait() to call cmd.Wait() first (reap the
child and close pipes), then cancel outCtx to unblock readers.
…cribers gone

Send to an unbuffered Source channel deadlocks after the last
subscriber is cancelled.  receiveWhenReady parks on <-sig waiting
for a subscriber that will never arrive, so the producer (Wait)
blocks forever.  This leaks the Wait goroutine and prevents
processes.Delete from running.
…(Source)

Calling close(Source) instead of CloseSource() after the last
subscriber is cancelled leaks the fan-out goroutine.  The closed
flag is never set and NotifySubscriberChange is never called, so
receiveWhenReady stays parked on <-sig forever.

This matches start.go:101 where `defer close(startMultiplexer.Source)`
bypasses CloseSource().
Two related fixes for regressions introduced by the back-pressure
commit (3e6e57e):

1. EndEvent.Source send deadlock: Wait() sends to EndEvent.Source
   after all subscribers are gone.  With buffer=0, this blocks
   forever because receiveWhenReady parks on <-sig.  Fix: use
   buffer=1 for EndEvent so the single send always succeeds, and
   call CloseSource() after the send so the fan-out exits.

2. startMultiplexer fan-out leak: start.go used bare close(Source)
   instead of CloseSource(), so the closed flag was never set and
   NotifySubscriberChange was never called.  The fan-out goroutine
   stayed parked on <-sig forever.  Fix: use CloseSource().
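Fix 1 hinges on channel buffering semantics; a minimal illustration:

```go
package main

import "fmt"

func main() {
	// An unbuffered send blocks until a receiver is ready — with the
	// fan-out parked and no subscribers, that receiver never comes.
	// With capacity 1, the single terminal send completes immediately
	// and can be received (or dropped) later.
	end := make(chan string, 1)
	end <- "end-event" // does not block even with no receiver yet
	fmt.Println(<-end)
}
```

Buffer 1 is sufficient precisely because `Wait()` sends exactly one terminal value.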
When cancel() and CloseSource() fired in quick succession, the fan-out
goroutine could grab a signal channel created *by* CloseSource's
NotifySubscriberChange, then park on it forever since no further
notifications would arrive.  Re-check closed after acquiring the
signal to close the window.
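The window and its fix can be sketched like this (illustrative names; the check is folded under the lock here for brevity): CloseSource's notify replaces the signal channel with a fresh one that no later notification will ever close, so a consumer must verify `closed` before parking on whatever channel it acquired.

```go
package main

import (
	"fmt"
	"sync"
)

type mux struct {
	mu        sync.Mutex
	closed    bool
	subSignal chan struct{}
}

func newMux() *mux { return &mux{subSignal: make(chan struct{})} }

func (m *mux) CloseSource() {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.closed = true
	close(m.subSignal)                // wake anyone already parked
	m.subSignal = make(chan struct{}) // fresh channel no one will close
}

// acquireSignal returns (nil, false) when the mux is closed, instead of
// handing back a channel that can never fire.
func (m *mux) acquireSignal() (<-chan struct{}, bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.closed { // the re-check that closes the window
		return nil, false
	}
	return m.subSignal, true
}

func main() {
	m := newMux()
	m.CloseSource()
	_, ok := m.acquireSignal()
	fmt.Println(ok) // false: the fan-out exits instead of parking forever
}
```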

Replace exited atomic.Bool with a done channel closed when run()
finishes.  Fork() selects on it to detect shutdown, and tests use
it to wait deterministically for run() to exit.
The heap measurement is inherently racy — GC timing, ambient
allocations, and uint64 underflow on heap shrinkage make it
unreliable at high -count values. The back-pressure behavior it
validated is already covered by the multiplex and disconnect tests.

@claude claude Bot left a comment


⚠️ Code review skipped — your organization has reached its monthly code review spending cap.

An organization admin can view or raise the cap at claude.ai/admin-settings/claude-code. The cap resets at the start of the next billing period.

Once the cap resets or is raised, reopen this pull request to trigger a review.


cursor Bot commented May 11, 2026

PR Summary

Medium Risk
Touches process lifecycle and streaming I/O concurrency; mistakes here can cause hangs, truncated output, or panics under disconnect/kill scenarios.

Overview
MultiplexedChannel now stops consuming from Source when there are no active subscribers to avoid unbounded work/memory growth after client disconnects, adding explicit Drain() and CloseSource() signaling and new shutdown coordination (done, subSignal). The process handler switches stdout/stderr to manual os.Pipe() and adds post-Wait() draining/read-deadline logic so fast commands don’t lose final output and orphaned grandchildren don’t hang the stream.

Potential issues: CloseSource() is not idempotent and will panic on double-close; Handler.Wait() unconditionally calls p.tty.Close() which can panic when the process is not PTY-backed; and the new back-pressure semantics can block writers in unexpected paths if Drain()/CloseSource() ordering is wrong.

Reviewed by Cursor Bugbot for commit 360254e.


codecov Bot commented May 11, 2026

❌ 9 Tests Failed:

Tests completed: 2621 · Failed: 9 · Passed: 2612 · Skipped: 7
View the full list of 15 ❄️ flaky test(s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestSandboxNotAutoPause

Flake rate in main: 54.43% (Passed 149 times, Failed 178 times)

Stack Traces | 15s run time
=== RUN   TestSandboxNotAutoPause
=== PAUSE TestSandboxNotAutoPause
=== CONT  TestSandboxNotAutoPause
    sandbox_auto_pause_test.go:149: Sandbox creation failed status=500 body={"code":500,"message":"Failed to place sandbox"}
    sandbox_auto_pause_test.go:149: Sandbox creation={Body:[123 34 99 111 100 101 34 58 53 48 48 44 34 109 101 115 115 97 103 101 34 58 34 70 97 105 108 101 100 32 116 111 32 112 108 97 99 101 32 115 97 110 100 98 111 120 34 125] HTTPResponse:0xc001199680 JSON201:<nil> JSON400:<nil> JSON401:<nil> JSON500:0xc0001a52f0}
    sandbox_auto_pause_test.go:149: 
        	Error Trace:	.../internal/utils/sandbox.go:163
        	            				.../api/sandboxes/sandbox_auto_pause_test.go:149
        	Error:      	Not equal: 
        	            	expected: 201
        	            	actual  : 500
        	Test:       	TestSandboxNotAutoPause
--- FAIL: TestSandboxNotAutoPause (15.02s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig

Flake rate in main: 76.54% (Passed 156 times, Failed 509 times)

Stack Traces | 220s run time
=== RUN   TestUpdateNetworkConfig
=== PAUSE TestUpdateNetworkConfig
=== CONT  TestUpdateNetworkConfig
--- FAIL: TestUpdateNetworkConfig (219.59s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig/13_allow_internet_access_true_is_noop

Flake rate in main: 55.66% (Passed 141 times, Failed 177 times)

Stack Traces | 5.51s run time
=== RUN   TestUpdateNetworkConfig/13_allow_internet_access_true_is_noop
    sandbox_network_update_test.go:328: Command [curl] output: event:{start:{pid:1358}}
Executing command ls in sandbox ipkzy41garo2p1v2qpivd (user: root)
    sandbox_network_update_test.go:328: Command [curl] output: event:{end:{exit_code:28 exited:true status:"exit status 28" error:"exit status 28"}}
    sandbox_network_update_test.go:328: 
        	Error Trace:	.../api/sandboxes/sandbox_network_out_test.go:67
        	            				.../api/sandboxes/sandbox_network_update_test.go:58
        	            				.../api/sandboxes/sandbox_network_update_test.go:328
        	Error:      	Received unexpected error:
        	            	command curl in sandbox ir44vbmsico3zbjj4hdqr failed with exit code 28
        	Test:       	TestUpdateNetworkConfig/13_allow_internet_access_true_is_noop
        	Messages:   	https://8.8.8.8 should be reachable
--- FAIL: TestUpdateNetworkConfig/13_allow_internet_access_true_is_noop (5.51s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false

Flake rate in main: 77.03% (Passed 150 times, Failed 503 times)

Stack Traces | 4.49s run time
=== RUN   TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
Executing command curl in sandbox izkad7gtgvww8tg8rmci4
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1367}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35 exited:true status:"exit status 35" error:"exit status 35"}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1368}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35 exited:true status:"exit status 35" error:"exit status 35"}}
Executing command curl in sandbox ir44vbmsico3zbjj4hdqr
    sandbox_network_update_test.go:391: Command [curl] output: event:{start:{pid:1369}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{data:{stdout:"HTTP/2 302 \r\nx-content-type-options: nosniff\r\nlocation: https://dns.google/\r\ndate: Wed, 13 May 2026 13:36:01 GMT\r\ncontent-type: text/html; charset=UTF-8\r\nserver: HTTP server (unknown)\r\ncontent-length: 216\r\nx-xss-protection: 0\r\nx-frame-options: SAMEORIGIN\r\nalt-svc: h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000\r\n\r\n"}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_network_update_test.go:391: Command [curl] completed successfully in sandbox ir44vbmsico3zbjj4hdqr
    sandbox_network_update_test.go:391: 
        	Error Trace:	.../api/sandboxes/sandbox_network_out_test.go:74
        	            				.../api/sandboxes/sandbox_network_update_test.go:60
        	            				.../api/sandboxes/sandbox_network_update_test.go:391
        	Error:      	An error is expected but got nil.
        	Test:       	TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
        	Messages:   	https://8.8.8.8 should be blocked
--- FAIL: TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false (4.49s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/templates::TestTemplateBuildENV

Flake rate in main: 58.99% (Passed 146 times, Failed 210 times)

Stack Traces | 0s run time
=== RUN   TestTemplateBuildENV
=== PAUSE TestTemplateBuildENV
=== CONT  TestTemplateBuildENV
--- FAIL: TestTemplateBuildENV (0.00s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/templates::TestTemplateBuildENV/ENV_with_multiline_value

Flake rate in main: 59.83% (Passed 139 times, Failed 207 times)

Stack Traces | 7.08s run time
=== RUN   TestTemplateBuildENV/ENV_with_multiline_value
=== PAUSE TestTemplateBuildENV/ENV_with_multiline_value
=== CONT  TestTemplateBuildENV/ENV_with_multiline_value
    build_template_test.go:134: test-ubuntu-env-multiline: [info] Building template 94mdu7k2sldi732uk4vy/02ab3128-418b-4b14-b5a3-b3f03127bc29
    build_template_test.go:134: test-ubuntu-env-multiline: [info] CACHED [base] FROM ubuntu:22.04 [ffd709f131f42dfab282de47a91dd2c139e900c1c11fc574b49b517a05ef0a32]
    build_template_test.go:134: test-ubuntu-env-multiline: [info] CACHED [base] DEFAULT USER user [90bdd4afa342293c931373351bf578872dec9179214ba3e8bf9edba311466213]
    build_template_test.go:134: test-ubuntu-env-multiline: [info] [builder 1/2] ENV MULTILINE line1
        line2
        line3 [e93da3f3765f20eb6407c336b9e4e0b9321d994ec5f6cb547743a2a4070eed23]
    build_template_test.go:134: test-ubuntu-env-multiline: [info] [builder 2/2] RUN [[ $(echo "$MULTILINE" | wc -l) -eq 3 ]] || exit 1 [477610d61cdf858776262d3331809539bcbcf16f706aac18515a57337bae1786]
    build_template_test.go:134: test-ubuntu-env-multiline: [error] Build failed: failed to run command '[[ $(echo "$MULTILINE" | wc -l) -eq 3 ]] || exit 1': exit status 1
    build_template_test.go:374: Build failed: {<nil> failed to run command '[[ $(echo "$MULTILINE" | wc -l) -eq 3 ]] || exit 1': exit status 1 0xc00099f650}
--- FAIL: TestTemplateBuildENV/ENV_with_multiline_value (7.08s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/templates::TestTemplateBuildRUN

Flake rate in main: 55.02% (Passed 148 times, Failed 181 times)

Stack Traces | 0s run time
=== RUN   TestTemplateBuildRUN
=== PAUSE TestTemplateBuildRUN
=== CONT  TestTemplateBuildRUN
--- FAIL: TestTemplateBuildRUN (0.00s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/templates::TestTemplateBuildRUN/Single_RUN_command

Flake rate in main: 55.02% (Passed 148 times, Failed 181 times)

Stack Traces | 178s run time
=== RUN   TestTemplateBuildRUN/Single_RUN_command
=== PAUSE TestTemplateBuildRUN/Single_RUN_command
=== CONT  TestTemplateBuildRUN/Single_RUN_command
    build_template_test.go:134: test-ubuntu-run: [info] Building template fpraqq7yalxj05gl36y8/1a9ea4aa-59ae-4cef-8491-23ee46c4a9fb
    build_template_test.go:134: test-ubuntu-run: [info] [base] FROM ubuntu:22.04 [ffd709f131f42dfab282de47a91dd2c139e900c1c11fc574b49b517a05ef0a32]
    build_template_test.go:134: test-ubuntu-run: [info] Base Docker image size: 30 MB
    build_template_test.go:134: test-ubuntu-run: [info] Creating file system and pulling Docker image
    build_template_test.go:134: test-ubuntu-run: [info] Uncompressing layer sha256:f63eb04151bcac21ad049f8d781b97b219aba392c5457907f8f3e88e43eb48ec 30 MB
    build_template_test.go:134: test-ubuntu-run: [info] Uncompressing layer sha256:7a7ee39ccc2214c3040d7fd378451dc3dd7382cbc81d5d55e5d9adf3e4086c47 12 MB
    build_template_test.go:134: test-ubuntu-run: [info] Uncompressing layer sha256:8c4b1b28875140ed3abacaf16ad0d696f6bef912f52d2148f261a23e3349465b 168 B
    build_template_test.go:134: test-ubuntu-run: [info] Layers extracted
    build_template_test.go:134: test-ubuntu-run: [info] Root filesystem structure: bin, boot, dev, etc, home, lib, lib32, lib64, libx32, media, mnt, opt, proc, root, run, sbin, srv, sys, tmp, usr, var
    build_template_test.go:134: test-ubuntu-run: [info] Provisioning sandbox template
    build_template_test.go:134: test-ubuntu-run: [info] Provisioning was successful, cleaning up
    build_template_test.go:134: test-ubuntu-run: [info] Sandbox template provisioned
    build_template_test.go:134: test-ubuntu-run: [info] [base] DEFAULT USER user [90bdd4afa342293c931373351bf578872dec9179214ba3e8bf9edba311466213]
    build_template_test.go:134: test-ubuntu-run: [info] [builder 1/1] RUN echo 'Hello, World!' [0f555930fc7ecac094fbde7e0c82c834b8c4c2a8ac16f4b0938c4a705d74f4fd]
    build_template_test.go:134: test-ubuntu-run: [info] [builder 1/1] [stdout]: Hello, World!
    build_template_test.go:134: test-ubuntu-run: [info] [finalize] Finalizing template build [b22adeee315d1e1b1ce9456d90a0416b80f86274893376bcb4dc6b00fe3ddf0f]
    build_template_test.go:134: test-ubuntu-run: [error] Build failed: build was cancelled
    build_template_test.go:167: Build failed: {<nil> build was cancelled <nil>}
--- FAIL: TestTemplateBuildRUN/Single_RUN_command (177.57s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost

Flake rate in main: 57.14% (Passed 258 times, Failed 344 times)

Stack Traces | 0s run time
=== RUN   TestBindLocalhost
=== PAUSE TestBindLocalhost
=== CONT  TestBindLocalhost
--- FAIL: TestBindLocalhost (0.00s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_0_0_0_0

Flake rate in main: 62.85% (Passed 146 times, Failed 247 times)

Stack Traces | 8.99s run time
=== RUN   TestBindLocalhost/bind_0_0_0_0
=== PAUSE TestBindLocalhost/bind_0_0_0_0
=== CONT  TestBindLocalhost/bind_0_0_0_0
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1266}}
Executing command python in sandbox intbq8ifind7rpxwwe2gs
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_0_0_0_0
        	Messages:   	Unexpected status code 502 for bind address 0.0.0.0
--- FAIL: TestBindLocalhost/bind_0_0_0_0 (8.99s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_127_0_0_1

Flake rate in main: 58.07% (Passed 148 times, Failed 205 times)

Stack Traces | 6.94s run time
=== RUN   TestBindLocalhost/bind_127_0_0_1
=== PAUSE TestBindLocalhost/bind_127_0_0_1
=== CONT  TestBindLocalhost/bind_127_0_0_1
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1266}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_127_0_0_1
        	Messages:   	Unexpected status code 502 for bind address 127.0.0.1
--- FAIL: TestBindLocalhost/bind_127_0_0_1 (6.94s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_::1

Flake rate in main: 64.73% (Passed 146 times, Failed 268 times)

Stack Traces | 7.91s run time
=== RUN   TestBindLocalhost/bind_::1
=== PAUSE TestBindLocalhost/bind_::1
=== CONT  TestBindLocalhost/bind_::1
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1266}}
Executing command python in sandbox ikl350cgeh0owhj5w2tet
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_::1
        	Messages:   	Unexpected status code 502 for bind address ::1
--- FAIL: TestBindLocalhost/bind_::1 (7.91s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_localhost

Flake rate in main: 64.65% (Passed 146 times, Failed 267 times)

Stack Traces | 9.34s run time
=== RUN   TestBindLocalhost/bind_localhost
=== PAUSE TestBindLocalhost/bind_localhost
=== CONT  TestBindLocalhost/bind_localhost
Executing command python in sandbox i5d8lcz12szsosgvmsppr
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1266}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_localhost
        	Messages:   	Unexpected status code 502 for bind address localhost
--- FAIL: TestBindLocalhost/bind_localhost (9.34s)
Executing command python in sandbox ip676rs7d9wwlhi6ltoxm
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity

Flake rate in main: 66.23% (Passed 156 times, Failed 306 times)

Stack Traces | 81.3s run time
=== RUN   TestSandboxMemoryIntegrity
=== PAUSE TestSandboxMemoryIntegrity
=== CONT  TestSandboxMemoryIntegrity
    sandbox_memory_integrity_test.go:26: Build completed successfully
--- FAIL: TestSandboxMemoryIntegrity (81.35s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity/tmpfs_hash

Flake rate in main: 67.26% (Passed 146 times, Failed 300 times)

Stack Traces | 44.9s run time
=== RUN   TestSandboxMemoryIntegrity/tmpfs_hash
=== PAUSE TestSandboxMemoryIntegrity/tmpfs_hash
=== CONT  TestSandboxMemoryIntegrity/tmpfs_hash
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{start:{pid:1277}}
Executing command bash in sandbox is2z7212t462vgpv77o2z (user: root)
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Total memory: 985 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Used memory before tmpfs mount: 183 MB\nFree memory before tmpfs mount: 801 MB\nMemory to use in integrity test (80% of free, min 64MB): 640 MB\n"}}
Executing command bash in sandbox is2z7212t462vgpv77o2z (user: root)
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"640+0 records in\n640+0 records out\n671088640 bytes (671 MB, 640 MiB) copied, 3.23417 s, 207 MB/s\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"\tCommand being timed: \"dd if=/dev/urandom of=/mnt/testfile bs=1M count=640\"\n\tUser time (seconds): 0.00\n\tSystem time (seconds): 3.19\n\tPercent of CPU this job got: 98%\n\tElapsed (wall clock) time (h:mm:ss or m:ss): 0:03.24\n\tAverage shared text size (kbytes): 0\n\tAverage unshared data size (kbytes): 0\n\tAverage stack size (kbytes): 0\n\tAverage total size (kbytes): 0\n\tMaximum resident set size (kbytes): 2724\n\tAverage resident set size (kbytes): 0\n\tMajor (requiring I/O) page faults: 2\n\tMinor (reclaiming a frame) page faults: 345\n\tVoluntary context switches: 3\n\tInvoluntary context switches: 29\n\tSwaps: 0\n\tFile system inputs: 176\n\tFile system outputs: 0\n\tSocket messages sent: 0\n\tSocket messages received: 0\n\tSignals delivered: 0\n\tPage size (bytes): 4096\n\tExit status: 0\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Used memory after tmpfs mount and file fill: 830 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{end:{exited:true  status:"exit status 0"}}
    sandbox_memory_integrity_test.go:70: Command [bash] completed successfully in sandbox im0a9wuadu7fwlb2eod5m
Executing command bash in sandbox im0a9wuadu7fwlb2eod5m (user: root)
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{start:{pid:1294}}
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{data:{stdout:"9296417b9fe3941f16d9a8e6b99ab635642c10012f81614806922684c0f54e4a\n"}}
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{end:{exited:true  status:"exit status 0"}}
    sandbox_memory_integrity_test.go:74: Command [bash] completed successfully in sandbox im0a9wuadu7fwlb2eod5m
Executing command bash in sandbox im0a9wuadu7fwlb2eod5m (user: root)
    sandbox_memory_integrity_test.go:99: Command [bash] output: event:{start:{pid:1297}}
    sandbox_memory_integrity_test.go:99: Command [bash] output: event:{data:{stdout:"9296417b9fe3941f16d9a8e6b99ab635642c10012f81614806922684c0f54e4a\n"}}
    sandbox_memory_integrity_test.go:99: Command [bash] output: event:{end:{exited:true  status:"exit status 0"}}
    sandbox_memory_integrity_test.go:99: Command [bash] completed successfully in sandbox im0a9wuadu7fwlb2eod5m
Executing command bash in sandbox im0a9wuadu7fwlb2eod5m (user: root)
    sandbox_memory_integrity_test.go:99: Command [bash] output: event:{start:{pid:1300}}
    sandbox_memory_integrity_test.go:100: 
        	Error Trace:	.../tests/orchestrator/sandbox_memory_integrity_test.go:100
        	Error:      	Received unexpected error:
        	            	failed to execute command bash in sandbox im0a9wuadu7fwlb2eod5m: invalid_argument: protocol error: incomplete envelope: unexpected EOF
        	Test:       	TestSandboxMemoryIntegrity/tmpfs_hash
--- FAIL: TestSandboxMemoryIntegrity/tmpfs_hash (44.89s)



@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c55adcc559


Comment thread packages/envd/internal/services/process/handler/handler.go Outdated
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

The receiveWhenReady function contains a race condition that can lead to a lost wake-up and a deadlock of the fan-out loop. If a subscriber is added or removed after the HasSubscribers check but before the subSignal is fetched, the loop may acquire an unclosed signal channel and block indefinitely. Fetching the signal channel before checking the conditions ensures that any subsequent change will correctly close the channel being waited on.

Comment thread packages/envd/internal/services/process/handler/multiplex.go
@arkamar arkamar marked this pull request as draft May 11, 2026 12:40
Avoids a race where a subscriber added between HasSubscribers()
and fetching subSignal could leave the fan-out parked on a fresh
unclosed signal channel.
Comment thread packages/envd/internal/services/process/handler/handler.go Outdated
@linear-code

linear-code Bot commented May 11, 2026

ENG-3933

Comment thread packages/envd/internal/services/process/handler/multiplex.go
@arkamar arkamar force-pushed the fix/envd-backpressure-on-disconnect branch from 34be6db to cde8dc9 Compare May 11, 2026 20:03
Comment thread packages/envd/internal/services/process/handler/handler.go
arkamar added 3 commits May 11, 2026 23:27
cmd.StdoutPipe/StderrPipe are managed by cmd.Wait which closes the
pipe read-ends on return, racing with readers that haven't finished.
Replace them with manual os.Pipe so we control the lifecycle:
write-ends are closed after Start (child inherited them), read-ends
stay open until readers finish naturally via EOF.

After cmd.Wait reaps the child, call Drain() on the data multiplexer
to disable back-pressure, letting stuck readers unblock and see EOF.
Then wait for all readers to exit before proceeding.
…han children

After cmd.Wait() reaps the child, use SetReadDeadline on the pipe
read-ends so readers drain any buffered data (reads with available
data return instantly) then exit on deadline instead of blocking
forever when an orphan grandchild holds the write-end open. Readers
treat the deadline timeout the same as EOF — clean exit, no data
loss.

Add a dedicated readersDone channel that closes when stdout/stderr
reader goroutines actually exit, replacing the outCtx/outCancel
mechanism which is no longer needed.

Fixes TestStart_OrphanGrandchildDoesNotHangStream and the
TestCommandKillNextApp CI failure.
Add TestStart_OrphanGrandchildDoesNotHangStream: verifies that
killing a process whose grandchild holds stdout open delivers the
EndEvent within the stream timeout instead of hanging.
@arkamar arkamar force-pushed the fix/envd-backpressure-on-disconnect branch from cde8dc9 to ab0315b Compare May 12, 2026 15:14
Comment thread packages/envd/internal/services/process/handler/handler.go
arkamar added 2 commits May 12, 2026 20:23
…er on Fork

Move close(m.done) inside the m.mu critical section in run() so that
Fork()'s re-check under the same lock always observes the shutdown.
Previously, close(m.done) happened after Unlock, creating a window
where Fork could add a subscriber that run() never cleans up —
leaving the channel open forever and hanging consumers.
Readers exit after the SetReadDeadline timeout fires, but the
read-end file descriptors were never closed. Each non-PTY process
leaked two fds. Close them after readersDone signals.

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit c342710.

Comment thread packages/envd/internal/services/process/handler/handler.go
…ang Wait

The SetReadDeadline escape mechanism only applied to non-PTY pipe
read-ends. For PTY processes, the reader goroutine would block on
tty.Read() forever when an orphan grandchild held the PTY slave
open, deadlocking Wait() at <-readersDone. Set the same deadline
on p.tty and treat timeout as EOF in the PTY reader loop.