PR Summary: Medium Risk. Reviewed by Cursor Bugbot for commit d59d81a.
```go
// MapSubscriber receives lifecycle notifications from the sandbox Map.
//
// Callbacks are invoked synchronously from the goroutine that performed the
// state change. Implementations must be non-blocking; if async work is needed,
// the subscriber is responsible for dispatching it.
type MapSubscriber interface {
	// OnInsert is triggered when a sandbox transitions to the running state.
	OnInsert(ctx context.Context, sandbox *Sandbox)
```
🟣 NFSHandler.OnNetworkRelease performs blocking I/O (blocking on <-doneCh inside chroot.Close()), which violates the new non-blocking contract added to MapSubscriber in this PR (map.go:14-21). This is a pre-existing issue: NetworkReleased was already invoked synchronously before this PR and NFSHandler was already blocking, but the PR replaces the old 'synchronous, caller can rely on completion' comment with a new 'must be non-blocking' requirement without updating NFSHandler to comply. A slow chroot teardown will stall the goroutine calling NetworkReleased, delaying sandbox network slot recycling.
Extended reasoning...
What the bug is
The PR adds an interface-level contract to MapSubscriber (map.go:14-18):
Callbacks are invoked synchronously from the goroutine that performed the state change. Implementations must be non-blocking; if async work is needed, the subscriber is responsible for dispatching it.
At the same time, the PR removes the old NetworkReleased comment that said "Subscribers are invoked synchronously so the caller can rely on them having completed before taking any follow-up action." The old contract explicitly allowed blocking; the new contract explicitly forbids it.
The code path that triggers the violation
NFSHandler.OnNetworkRelease (nfsproxy/chroot/nfs.go:93-113) acquires h.mu.Lock(), removes entries from an internal map, releases the lock, then calls chroot.Close() in a loop for each removed entry. chroot.Close() calls fs.ns.Close() (chroot.go:119-121), which dispatches to mountNS.Close() (mountns.go:76-99). mountNS.Close() closes the stopCh channel to signal a dedicated OS thread to restore the original mount namespace, then blocks on <-doneCh (mountns.go:91), waiting for that thread to call unix.Setns() and signal completion. This is a genuine synchronous blocking operation that can take non-trivial time for each chroot being torn down.
Why existing code doesn't prevent it
NetworkReleased calls m.trigger() synchronously (no goroutine dispatch), and trigger() iterates all subscribers calling OnNetworkRelease in sequence. There is no timeout, context cancellation check, or goroutine wrapper around the subscriber calls. OnNetworkRelease is called directly on the same goroutine that initiated the network release.
Impact
Each chroot teardown in OnNetworkRelease stalls the goroutine calling NetworkReleased for the duration of the mount namespace restoration syscall. When multiple network slots are released concurrently or sequentially, these stalls accumulate, delaying sandbox network slot recycling. Future contributors reading the MapSubscriber interface will see "must be non-blocking" and assume compliance, potentially introducing race conditions or deadlocks if they rely on that guarantee.
Proof by example
- A sandbox is stopped; its IP slot is released via NetworkReleased(ctx, ip).
- NetworkReleased calls m.trigger(), which iterates subscribers and calls NFSHandler.OnNetworkRelease(ctx, sbx).
- NFSHandler.OnNetworkRelease iterates its chroot map and calls chroot.Close() for each entry.
- Each chroot.Close() → mountNS.Close() closes stopCh and then does <-doneCh.
- The dedicated OS goroutine receives the stop signal, calls unix.Setns() to restore the original namespace, then closes doneCh.
- Only after doneCh is closed does OnNetworkRelease return, unblocking NetworkReleased.
- The network slot recycling (and any follow-up logic) is delayed by the sum of all Setns syscall latencies.
How to fix
Wrap the blocking work in a goroutine inside NFSHandler.OnNetworkRelease, or provide a separate async teardown path. For example:
```go
func (h *NFSHandler) OnNetworkRelease(ctx context.Context, sbx *sandbox.Sandbox) {
	chroots := h.collectChroots(sbx) // non-blocking extraction
	go func() {
		for _, c := range chroots {
			c.Close()
		}
	}()
}
```

This would satisfy the new non-blocking contract while preserving the teardown semantics.
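One design caveat with a bare go func() is that in-flight teardowns become untracked at process shutdown. A variant, sketched here with hypothetical types (chrootHandle, handler, and collectChroots stand-ins are assumptions), keeps OnNetworkRelease non-blocking while using a sync.WaitGroup so a shutdown path can still wait for teardowns to finish:

```go
package main

import (
	"fmt"
	"sync"
)

// chrootHandle (hypothetical) stands in for a chroot whose Close does
// the blocking mount-namespace restoration.
type chrootHandle struct{ closed bool }

func (c *chrootHandle) Close() { c.closed = true }

type handler struct {
	wg sync.WaitGroup
}

// onNetworkRelease returns immediately; the blocking Close calls run
// on a tracked background goroutine.
func (h *handler) onNetworkRelease(chroots []*chrootHandle) {
	h.wg.Add(1)
	go func() {
		defer h.wg.Done()
		for _, c := range chroots {
			c.Close() // blocking work, off the caller's goroutine
		}
	}()
}

// Shutdown waits for any in-flight teardowns before the process exits.
func (h *handler) Shutdown() { h.wg.Wait() }

func main() {
	h := &handler{}
	cs := []*chrootHandle{{}, {}}
	h.onNetworkRelease(cs) // non-blocking
	h.Shutdown()           // deterministically waits for the teardown goroutine
	fmt.Println(cs[0].closed && cs[1].closed)
}
```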
Make OnInsert sync