
fix(envd): fix fan-out deadlock when process subscriber disconnects#2579

Merged
arkamar merged 2 commits into main from envd/fix-multiplex-fanout-deadlock on May 7, 2026

Conversation


@arkamar arkamar commented May 6, 2026

The fan-out loop sent to unbuffered subscriber channels while holding
RLock. If a subscriber stopped reading (e.g. client disconnect), the
send blocked forever, preventing remove() from acquiring the write lock.
This froze the output stream for all subscribers and hung any new
Connect RPC to that process.

Each subscriber now carries a done channel; fan-out delivers via
`select { case s.ch <- v: case <-s.done: }` so a cancelled subscriber
never wedges the loop. Includes regression tests.

Also bumps envd version to 0.5.16.


@claude claude Bot left a comment


⚠️ Code review skipped — your organization has reached its monthly code review spending cap.

An organization admin can view or raise the cap at claude.ai/admin-settings/claude-code. The cap resets at the start of the next billing period.

Once the cap resets or is raised, push a new commit or reopen this pull request to trigger a review.

@qodo-code-review

ⓘ You've reached your Qodo monthly free-tier limit. Reviews pause until next month — upgrade your plan to continue now, or link your paid account if you already have one.


cursor Bot commented May 6, 2026

PR Summary

Medium Risk
Touches concurrency and channel lifecycle in process event fan-out; mistakes could introduce missed events, panics on close, or new deadlocks, though changes are localized and covered by new tests.

Overview
Fixes a deadlock in MultiplexedChannel where fan-out could block forever sending to a subscriber that stopped reading, preventing unsubscribe and wedging output for all subscribers. Subscribers now carry a done signal so sends are guarded by select and cancellation unblocks in-flight sends; shutdown also closes remaining subscriber channels, adds regression tests for abandon/cancel/close/leak cases, and bumps version to 0.5.16.

Reviewed by Cursor Bugbot for commit a7ab522.


codecov Bot commented May 6, 2026

❌ 8 Tests Failed:

| Tests completed | Failed | Passed | Skipped |
| --------------- | ------ | ------ | ------- |
| 2593            | 8      | 2585   | 7       |
View the full list of 10 ❄️ flaky test(s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/metrics::TestTeamMetrics

Flake rate in main: 70.45% (Passed 39 times, Failed 93 times)

Stack Traces | 2.04s run time
=== RUN   TestTeamMetrics
=== PAUSE TestTeamMetrics
=== CONT  TestTeamMetrics
    team_metrics_test.go:61: 
        	Error Trace:	.../api/metrics/team_metrics_test.go:61
        	Error:      	Should be true
        	Test:       	TestTeamMetrics
        	Messages:   	MaxConcurrentSandboxes should be >= 0
--- FAIL: TestTeamMetrics (2.04s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig

Flake rate in main: 72.96% (Passed 43 times, Failed 116 times)

Stack Traces | 47.5s run time
=== RUN   TestUpdateNetworkConfig
=== PAUSE TestUpdateNetworkConfig
=== CONT  TestUpdateNetworkConfig
--- FAIL: TestUpdateNetworkConfig (47.48s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false

Flake rate in main: 73.20% (Passed 41 times, Failed 112 times)

Stack Traces | 6.24s run time
=== RUN   TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1353}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35 exited:true status:"exit status 35" error:"exit status 35"}}
Executing command curl in sandbox ivhkhkl5vebpdq90utdbx
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1354}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35 exited:true status:"exit status 35" error:"exit status 35"}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{start:{pid:1355}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{data:{stdout:"HTTP/2 302 \r\nx-content-type-options: nosniff\r\nlocation: https://dns.google/\r\ndate: Thu, 07 May 2026 17:12:13 GMT\r\ncontent-type: text/html; charset=UTF-8\r\nserver: HTTP server (unknown)\r\ncontent-length: 216\r\nx-xss-protection: 0\r\nx-frame-options: SAMEORIGIN\r\nalt-svc: h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000\r\n\r\n"}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_network_update_test.go:391: Command [curl] completed successfully in sandbox ivhkhkl5vebpdq90utdbx
    sandbox_network_update_test.go:391: 
        	Error Trace:	.../api/sandboxes/sandbox_network_out_test.go:74
        	            				.../api/sandboxes/sandbox_network_update_test.go:60
        	            				.../api/sandboxes/sandbox_network_update_test.go:391
        	Error:      	An error is expected but got nil.
        	Test:       	TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
        	Messages:   	https://8.8.8.8 should be blocked
--- FAIL: TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false (6.24s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/templates::TestDeleteTemplate

Flake rate in main: 51.32% (Passed 37 times, Failed 39 times)

Stack Traces | 300s run time
=== RUN   TestDeleteTemplate
=== PAUSE TestDeleteTemplate
=== CONT  TestDeleteTemplate
    build_template_test.go:134: test-to-delete: [info] Building template cjyjs1jbmv4drkdzzpdr/bd029b34-723a-420a-8c1b-6f679621f710
    build_template_test.go:134: test-to-delete: [info] [base] FROM ubuntu:22.04 [ffd709f131f42dfab282de47a91dd2c139e900c1c11fc574b49b517a05ef0a32]
    build_template_test.go:134: test-to-delete: [info] Base Docker image size: 30 MB
    build_template_test.go:134: test-to-delete: [info] Creating file system and pulling Docker image
    build_template_test.go:134: test-to-delete: [info] Uncompressing layer sha256:f63eb04151bcac21ad049f8d781b97b219aba392c5457907f8f3e88e43eb48ec 30 MB
    build_template_test.go:134: test-to-delete: [info] Uncompressing layer sha256:77ad4d6ec0899dbcb40e2e711c6da6ca86243abfdc7d627b9fab816655132da4 12 MB
    build_template_test.go:134: test-to-delete: [info] Uncompressing layer sha256:8c4b1b28875140ed3abacaf16ad0d696f6bef912f52d2148f261a23e3349465b 168 B
    build_template_test.go:134: test-to-delete: [info] Layers extracted
    build_template_test.go:134: test-to-delete: [info] Root filesystem structure: bin, boot, dev, etc, home, lib, lib32, lib64, libx32, media, mnt, opt, proc, root, run, sbin, srv, sys, tmp, usr, var
    build_template_test.go:134: test-to-delete: [info] Provisioning sandbox template
    build_template_test.go:134: test-to-delete: [info] Provisioning was successful, cleaning up
    build_template_test.go:134: test-to-delete: [info] Sandbox template provisioned
    build_template_test.go:134: test-to-delete: [info] CACHED [base] DEFAULT USER user [90bdd4afa342293c931373351bf578872dec9179214ba3e8bf9edba311466213]
    build_template_test.go:134: test-to-delete: [info] [builder 1/1] RUN echo 'Hello, World!' [0f555930fc7ecac094fbde7e0c82c834b8c4c2a8ac16f4b0938c4a705d74f4fd]
    build_template_test.go:134: test-to-delete: [info] [builder 1/1] [stdout]: Hello, World!
    build_template_test.go:134: test-to-delete: [info] [finalize] Finalizing template build [b22adeee315d1e1b1ce9456d90a0416b80f86274893376bcb4dc6b00fe3ddf0f]
    delete_template_test.go:19: 
        	Error Trace:	.../api/templates/build_template_test.go:97
        	            				.../api/templates/delete_template_test.go:19
        	Error:      	Received unexpected error:
        	            	Get "http://localhost:.../builds/bd029b34-723a-420a-8c1b-6f679621f710/status?level=info&logsOffset=16": context deadline exceeded
        	Test:       	TestDeleteTemplate
--- FAIL: TestDeleteTemplate (300.42s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost

Flake rate in main: 54.20% (Passed 60 times, Failed 71 times)

Stack Traces | 0s run time
=== RUN   TestBindLocalhost
=== PAUSE TestBindLocalhost
=== CONT  TestBindLocalhost
--- FAIL: TestBindLocalhost (0.00s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_0_0_0_0

Flake rate in main: 58.43% (Passed 37 times, Failed 52 times)

Stack Traces | 7.82s run time
=== RUN   TestBindLocalhost/bind_0_0_0_0
=== PAUSE TestBindLocalhost/bind_0_0_0_0
=== CONT  TestBindLocalhost/bind_0_0_0_0
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1253}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_0_0_0_0
        	Messages:   	Unexpected status code 502 for bind address 0.0.0.0
--- FAIL: TestBindLocalhost/bind_0_0_0_0 (7.82s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_::1

Flake rate in main: 59.78% (Passed 37 times, Failed 55 times)

Stack Traces | 7.79s run time
=== RUN   TestBindLocalhost/bind_::1
=== PAUSE TestBindLocalhost/bind_::1
=== CONT  TestBindLocalhost/bind_::1
Executing command python in sandbox i5g5jbi80s050ocb4dmwv
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1253}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_::1
        	Messages:   	Unexpected status code 502 for bind address ::1
--- FAIL: TestBindLocalhost/bind_::1 (7.79s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_localhost

Flake rate in main: 60.22% (Passed 37 times, Failed 56 times)

Stack Traces | 8.77s run time
=== RUN   TestBindLocalhost/bind_localhost
=== PAUSE TestBindLocalhost/bind_localhost
=== CONT  TestBindLocalhost/bind_localhost
Executing command python in sandbox iyww42n91rybhawsunf43
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1253}}
Executing command python in sandbox if3ij7f43j9037nur3w7c
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_localhost
        	Messages:   	Unexpected status code 502 for bind address localhost
--- FAIL: TestBindLocalhost/bind_localhost (8.77s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity

Flake rate in main: 60.50% (Passed 47 times, Failed 72 times)

Stack Traces | 75s run time
=== RUN   TestSandboxMemoryIntegrity
=== PAUSE TestSandboxMemoryIntegrity
=== CONT  TestSandboxMemoryIntegrity
    sandbox_memory_integrity_test.go:26: Build completed successfully
--- FAIL: TestSandboxMemoryIntegrity (75.02s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity/tmpfs_hash

Flake rate in main: 64.08% (Passed 37 times, Failed 66 times)

Stack Traces | 25.4s run time
=== RUN   TestSandboxMemoryIntegrity/tmpfs_hash
=== PAUSE TestSandboxMemoryIntegrity/tmpfs_hash
=== CONT  TestSandboxMemoryIntegrity/tmpfs_hash
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{start:{pid:1259}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Total memory: 985 MB\nUsed memory before tmpfs mount: 185 MB\nFree memory before tmpfs mount: 799 MB\nMemory to use in integrity test (80% of free, min 64MB): 639 MB\n"}}
Executing command bash in sandbox ii8om2v22docpk6etbsmi (user: root)
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"639+0 records in\n639+0 records out\n670040064 bytes (670 MB, 639 MiB) copied, 3.22555 s, 208 MB/s\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"\tCommand being timed: \"dd if=/dev/urandom of=/mnt/testfile bs=1M count=639\"\n\tUser time (seconds): 0.00\n\tSystem time (seconds): 3.20\n\tPercent of CPU this job got: 99%\n\tElapsed (wall clock) time (h:mm:ss or m:ss): 0:03.23\n\tAverage shared text size (kbytes): 0\n\tAverage unshared data size (kbytes): 0\n\tAverage stack size (kbytes): 0\n\tAverage total size (kbytes): 0\n\tMaximum resident set size (kbytes): 2632\n\tAverage resident set size (kbytes): 0\n\tMajor (requiring I/O) page faults: 2\n\tMinor (reclaiming a frame) page faults: 344\n\tVoluntary context switches: 3\n\tInvoluntary context switches: 38\n\tSwaps: 0\n\tFile system inputs: 176\n\tFile system outputs: 0\n\tSocket messages sent: 0\n\tSocket messages received: 0\n\tSignals delivered: 0\n\tPage size (bytes): 4096\n\tExit status: 0\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Used memory after tmpfs mount and file fill: 829 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{end:{exited:true  status:"exit status 0"}}
    sandbox_memory_integrity_test.go:70: Command [bash] completed successfully in sandbox i15h96cqd5r2mt8opi5t1
Executing command bash in sandbox i15h96cqd5r2mt8opi5t1 (user: root)
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{start:{pid:1275}}
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{data:{stdout:"5aca405b4adc35a939e9fc5347c1d342813d1bf0eaee4338f1e6eac295570d7f\n"}}
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{end:{exited:true  status:"exit status 0"}}
    sandbox_memory_integrity_test.go:74: Command [bash] completed successfully in sandbox i15h96cqd5r2mt8opi5t1
Executing command bash in sandbox i15h96cqd5r2mt8opi5t1 (user: root)
    sandbox_memory_integrity_test.go:99: Command [bash] output: event:{start:{pid:1278}}
    sandbox_memory_integrity_test.go:100: 
        	Error Trace:	.../tests/orchestrator/sandbox_memory_integrity_test.go:100
        	Error:      	Received unexpected error:
        	            	failed to execute command bash in sandbox i15h96cqd5r2mt8opi5t1: invalid_argument: protocol error: incomplete envelope: unexpected EOF
        	Test:       	TestSandboxMemoryIntegrity/tmpfs_hash
--- FAIL: TestSandboxMemoryIntegrity/tmpfs_hash (25.42s)



@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 62d63c2d5f


Comment thread packages/envd/internal/services/process/handler/multiplex.go Outdated

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

  • The unbuffered channel returned by Fork can cause a deadlock during bootstrap event writes if the consumer goroutine exits early. Providing a buffer of at least 1 ensures the initial write completes without hanging the request handler.
  • Iterating over the channels slice without holding a mutex is unsafe because the remove method modifies the underlying array in-place. This race condition can cause the fan-out loop to skip subscribers or process them multiple times, so the loop should use a shallow copy of the slice.

Comment thread packages/envd/internal/services/process/handler/multiplex.go
Comment thread packages/envd/internal/services/process/handler/multiplex.go Outdated
Comment thread packages/envd/internal/services/process/handler/multiplex.go Outdated
@arkamar arkamar force-pushed the envd/fix-multiplex-fanout-deadlock branch from 62d63c2 to 2b1c1d4 Compare May 7, 2026 14:50

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit 2b1c1d4.

Comment thread packages/envd/internal/services/process/handler/multiplex.go
arkamar added 2 commits May 7, 2026 18:59
@arkamar arkamar force-pushed the envd/fix-multiplex-fanout-deadlock branch from 2b1c1d4 to a7ab522 Compare May 7, 2026 17:01
@arkamar arkamar merged commit a67f983 into main May 7, 2026
51 checks passed
@arkamar arkamar deleted the envd/fix-multiplex-fanout-deadlock branch May 7, 2026 21:33
arkamar added a commit that referenced this pull request May 13, 2026
…eration

run() copied only the slice header under RLock, sharing the backing
array with remove(). When remove() shifted elements in-place via
append(channels[:i], channels[i+1:]...), the fan-out's stale snapshot
would skip one subscriber and deliver to another twice — silent
stdout/stderr corruption. Deep-copy the slice under RLock so the
iteration is immune to concurrent mutations.

Fixes: a67f983 ("fix(envd): fix fan-out deadlock when process subscriber disconnects (#2579)")
AdaAibaby pushed a commit to AdaAibaby/infra that referenced this pull request May 14, 2026
…2b-dev#2579)

