chore: make the callback sync #2467

Merged

jakubno merged 1 commit into main from fix/make-oninsert-sync on Apr 21, 2026

Conversation

@jakubno jakubno (Member) commented Apr 21, 2026

Make OnInsert sync

@cursor cursor Bot commented Apr 21, 2026

PR Summary

Medium Risk
Switching callbacks from async to sync can introduce new latency or deadlock risks if any subscriber performs slow/locking work on the critical sandbox lifecycle path.

Overview
Changes sandbox map lifecycle notifications so MapSubscriber callbacks (notably OnInsert during MarkRunning) run synchronously in the state-changing goroutine instead of being dispatched asynchronously, and updates documentation to require subscribers to remain non-blocking.

Reviewed by Cursor Bugbot for commit d59d81a. Bugbot is set up for automated code reviews on this repo.

@jakubno jakubno requested review from arkamar and removed request for ValentaTomas and dobrac April 21, 2026 12:53
@jakubno jakubno enabled auto-merge (squash) April 21, 2026 12:54
@jakubno jakubno merged commit 6f76dc2 into main Apr 21, 2026
47 checks passed
@jakubno jakubno deleted the fix/make-oninsert-sync branch April 21, 2026 13:03
Comment on lines 14 to 21
// MapSubscriber receives lifecycle notifications from the sandbox Map.
//
// Callbacks are invoked synchronously from the goroutine that performed the
// state change. Implementations must be non-blocking; if async work is needed,
// the subscriber is responsible for dispatching it.
type MapSubscriber interface {
// OnInsert is triggered when a sandbox transitions to the running state.
OnInsert(ctx context.Context, sandbox *Sandbox)

🟣 NFSHandler.OnNetworkRelease performs blocking I/O (blocking on <-doneCh inside chroot.Close()), which violates the new non-blocking contract added to MapSubscriber in this PR (map.go:14-21). This is a pre-existing issue: NetworkReleased was already invoked synchronously before this PR and NFSHandler was already blocking, but the PR replaces the old 'synchronous, caller can rely on completion' comment with a new 'must be non-blocking' requirement without updating NFSHandler to comply. A slow chroot teardown will stall the goroutine calling NetworkReleased, delaying sandbox network slot recycling.

Extended reasoning...

What the bug is

The PR adds an interface-level contract to MapSubscriber (map.go:14-18):

Callbacks are invoked synchronously from the goroutine that performed the state change. Implementations must be non-blocking; if async work is needed, the subscriber is responsible for dispatching it.

At the same time, the PR removes the old NetworkReleased comment that said "Subscribers are invoked synchronously so the caller can rely on them having completed before taking any follow-up action." The old contract explicitly allowed blocking; the new contract explicitly forbids it.

The code path that triggers the violation

NFSHandler.OnNetworkRelease (nfsproxy/chroot/nfs.go:93-113) acquires h.mu.Lock(), removes entries from an internal map, releases the lock, then calls chroot.Close() in a loop for each removed entry. chroot.Close() calls fs.ns.Close() (chroot.go:119-121), which dispatches to mountNS.Close() (mountns.go:76-99). mountNS.Close() closes the stopCh channel to signal a dedicated OS thread to restore the original mount namespace, then blocks on <-doneCh (mountns.go:91), waiting for that thread to call unix.Setns() and signal completion. This is a genuine synchronous blocking operation that can take non-trivial time for each chroot being torn down.
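The stopCh/doneCh handshake described above can be reduced to this self-contained sketch. The closer type and its cleanup callback are illustrative names, not the actual nfsproxy types; the point is only that Close blocks its caller until a dedicated goroutine finishes its work.

```go
package main

import "fmt"

// closer mimics the mountNS.Close() shape: closing stopCh signals a
// dedicated goroutine to run cleanup, then Close blocks on <-doneCh
// until that cleanup has finished.
type closer struct {
	stopCh chan struct{}
	doneCh chan struct{}
}

func newCloser(cleanup func()) *closer {
	c := &closer{stopCh: make(chan struct{}), doneCh: make(chan struct{})}
	go func() {
		<-c.stopCh // wait for the stop signal
		cleanup()  // stands in for unix.Setns() restoring the namespace
		close(c.doneCh)
	}()
	return c
}

// Close blocks the caller until the worker goroutine is done -- this is
// the synchronous wait that stalls OnNetworkRelease.
func (c *closer) Close() {
	close(c.stopCh)
	<-c.doneCh
}

func main() {
	done := false
	c := newCloser(func() { done = true })
	c.Close() // returns only after cleanup ran
	fmt.Println("cleanup done:", done)
}
```

Because doneCh is closed only after cleanup returns, the channel receive in Close gives a happens-before guarantee on the cleanup's effects, which is exactly why the real call cannot be non-blocking as written.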

Why existing code doesn't prevent it

NetworkReleased calls m.trigger() synchronously (no goroutine dispatch), and trigger() iterates all subscribers calling OnNetworkRelease in sequence. There is no timeout, context cancellation check, or goroutine wrapper around the subscriber calls. OnNetworkRelease is called directly on the same goroutine that initiated the network release.

Impact

Each chroot teardown in OnNetworkRelease stalls the goroutine calling NetworkReleased for the duration of the mount namespace restoration syscall. When multiple network slots are released concurrently or sequentially, these stalls accumulate, delaying sandbox network slot recycling. Future contributors reading the MapSubscriber interface will see "must be non-blocking" and assume compliance, potentially introducing race conditions or deadlocks if they rely on that guarantee.

Proof by example

  1. A sandbox is stopped; its IP slot is released via NetworkReleased(ctx, ip).
  2. NetworkReleased calls m.trigger(), which iterates subscribers and calls NFSHandler.OnNetworkRelease(ctx, sbx).
  3. NFSHandler.OnNetworkRelease iterates its chroot map and calls chroot.Close() for each entry.
  4. Each chroot.Close() → mountNS.Close() closes stopCh and then does <-doneCh.
  5. The dedicated OS goroutine receives the stop signal, calls unix.Setns() to restore the original namespace, then closes doneCh.
  6. Only after doneCh is closed does OnNetworkRelease return, unblocking NetworkReleased.
  7. The network slot recycling (and any follow-up logic) is delayed by the sum of all Setns syscall latencies.

How to fix

Wrap the blocking work in a goroutine inside NFSHandler.OnNetworkRelease, or provide a separate async teardown path. For example:

func (h *NFSHandler) OnNetworkRelease(ctx context.Context, sbx *sandbox.Sandbox) {
    chroots := h.collectChroots(sbx) // non-blocking extraction
    go func() {
        for _, c := range chroots {
            c.Close()
        }
    }()
}

This would satisfy the new non-blocking contract while preserving the teardown semantics.
