fix(orch): cgroup.kill backstop on rmdir EBUSY by ValentaTomas · Pull Request #2560 · e2b-dev/infra

ValentaTomas · 2026-05-04T18:07:15Z

INC-582: six firecracker processes for the same sandbox left running on orch-client-n2-r8sb after a burst of checkpoints. Root cause is fc.Process.Stop signaling only the unshare/bash/ip-netns-exec wrapper PID — ip netns exec forks firecracker as a child without forwarding signals, so firecracker is reparented to init and keeps running. The orchestrator never sees an error.
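For readers unfamiliar with the failure mode, here is a minimal sketch of one common remedy: start the wrapper as the leader of its own process group and signal the whole group instead of a single PID. This is not what this PR does, and it is not the code in fc.Process.Stop. The helper names (`startInOwnGroup`, `signalGroup`, `demoGroupKill`) are hypothetical, the `sh -c` wrapper is only a stand-in for the unshare/bash/ip-netns-exec chain, and it assumes no descendant moves itself into a new process group or session.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
	"syscall"
)

// startInOwnGroup starts the command as the leader of a new process group,
// so the group ID equals the PID of the started process.
func startInOwnGroup(name string, args ...string) (*exec.Cmd, error) {
	cmd := exec.Command(name, args...)
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	return cmd, nil
}

// signalGroup sends sig to every process in the group led by cmd.
// A negative PID tells kill(2) to target the process group.
func signalGroup(cmd *exec.Cmd, sig syscall.Signal) error {
	return syscall.Kill(-cmd.Process.Pid, sig)
}

// demoGroupKill uses a shell with a background child as an illustrative
// stand-in for a wrapper that does not forward signals to its child.
func demoGroupKill() error {
	cmd, err := startInOwnGroup("sh", "-c", "sleep 300 & wait")
	if err != nil {
		return err
	}
	if err := signalGroup(cmd, syscall.SIGKILL); err != nil {
		_ = cmd.Process.Kill()
		_ = cmd.Wait()
		return err
	}
	// Wait reports an error when the wrapper was terminated by a signal.
	if err := cmd.Wait(); err == nil {
		return errors.New("wrapper exited cleanly; expected it to be killed")
	}
	return nil
}

func main() {
	fmt.Println("group kill error:", demoGroupKill())
}
```

Signaling only `cmd.Process.Pid` (a positive PID) would reach the shell but not the background `sleep`, which mirrors the leak described above.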

This PR is a defensive cleanup-only fix: in cgroup.Remove we keep the existing rmdir as the fast path, and on EBUSY (process still in the cgroup) log the original error and fall back to cgroup.kill + retry rmdir for up to 2s. The signal path in fc.Process.Stop is intentionally untouched — fixing the firecracker SIGTERM-propagation properly is a behavioral change to sandbox shutdown that needs its own PR.
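The flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in manager.go: `removeWithBackstop` and `simulate` are hypothetical names, the rmdir and kill steps are injected as functions so the sketch runs without a real cgroup, and the fallback triggers on any rmdir failure rather than checking specifically for EBUSY.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"syscall"
	"time"
)

// removeWithBackstop tries rmdir once (the silent fast path). On failure it
// logs the original error, calls kill once, then retries rmdir until it
// succeeds, the budget runs out, or ctx is cancelled.
func removeWithBackstop(ctx context.Context, rmdir, kill func() error, budget, backoff time.Duration) error {
	err := rmdir()
	if err == nil {
		return nil
	}
	log.Printf("cgroup rmdir failed, falling back to cgroup.kill: %v", err)

	if killErr := kill(); killErr != nil {
		log.Printf("cgroup.kill failed: %v", killErr)
	}

	deadline := time.Now().Add(budget)
	for {
		if err = rmdir(); err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("cgroup still not removable after %s: %w", budget, err)
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("%w (last rmdir error: %v)", ctx.Err(), err)
		case <-time.After(backoff):
		}
	}
}

// simulate runs the sketch against a fake cgroup that reports EBUSY until
// kill has been called and busyAfterKill further rmdir attempts have failed.
func simulate(busyAfterKill int) (attempts int, killed bool, err error) {
	rmdir := func() error {
		attempts++
		if !killed {
			return syscall.EBUSY
		}
		if busyAfterKill > 0 {
			busyAfterKill--
			return syscall.EBUSY
		}
		return nil
	}
	kill := func() error { killed = true; return nil }
	err = removeWithBackstop(context.Background(), rmdir, kill, 2*time.Second, 10*time.Millisecond)
	return attempts, killed, err
}

func main() {
	attempts, killed, err := simulate(2)
	fmt.Println(attempts, killed, err)
}
```

In a real caller, `rmdir` would wrap `os.Remove` on the cgroup directory and `kill` would write to the cgroup.kill file; the 2s budget matches the description above, while the 10ms back-off in `simulate` is only chosen to keep the demo fast.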

Trade-off: we still leave firecracker running until the next cgroup.Remove call, but the sandbox will then actually die instead of accumulating, and the new "cgroup rmdir failed, falling back to cgroup.kill" warning is a queryable Loki signal we don't have today.

Related: PR #2453 (already merged Apr 21) fixes the duplicate-SandboxID overwrite in the sandboxes map, which compounded the leak when the deployed ec97b441 orchestrator allowed multiple in-flight checkpoints to clobber each other's map entry. The next foxtrot deploy picks that up automatically.

Full evidence, code-path trace, empirical signal-propagation test and follow-up plan in investigation/2026-05-04-inc-582-stale-checkpoint-fcs/REPORT.md in the debugger repo.

cursor · 2026-05-04T18:07:24Z

PR Summary

Medium Risk
Medium risk because cleanup now attempts to kill any remaining processes in the cgroup and can block removal for up to 2 seconds, which could affect sandbox shutdown behavior under load.

Overview
If os.Remove of the cgroup directory fails (typically EBUSY), Remove now logs a warning, writes to cgroup.kill, and retries rmdir for up to 2 seconds; this can kill leftover processes in the cgroup and can delay cleanup. The cgroup.kill write is best-effort (missing file is ignored), so on older kernels the removal can still fail after the retry window.
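A best-effort write of this kind might look like the sketch below. It is an assumption about the shape of the code, not a copy of it: `writeCgroupKill` and `demo` are hypothetical names. The file is opened without `O_CREATE` so that a missing cgroup.kill is reported as "not written" instead of being created, which matters when the sketch runs against an ordinary directory rather than cgroupfs.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// writeCgroupKill writes "1" to <cgroupDir>/cgroup.kill.
// It returns (false, nil) when the file does not exist, treating an
// unsupported kernel as a non-error.
func writeCgroupKill(cgroupDir string) (bool, error) {
	f, err := os.OpenFile(filepath.Join(cgroupDir, "cgroup.kill"), os.O_WRONLY, 0)
	if errors.Is(err, fs.ErrNotExist) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	defer f.Close()

	if _, err := f.WriteString("1"); err != nil {
		return false, err
	}
	return true, nil
}

// demo exercises both branches against a temporary directory that stands in
// for a cgroup directory.
func demo() (wroteWhenMissing, wroteWhenPresent bool, err error) {
	dir, err := os.MkdirTemp("", "cgroup-kill-sketch")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)

	wroteWhenMissing, err = writeCgroupKill(dir)
	if err != nil {
		return false, false, err
	}
	if err = os.WriteFile(filepath.Join(dir, "cgroup.kill"), nil, 0o644); err != nil {
		return false, false, err
	}
	wroteWhenPresent, err = writeCgroupKill(dir)
	return wroteWhenMissing, wroteWhenPresent, err
}

func main() {
	fmt.Println(demo())
}
```

When the write is skipped, nothing kills the remaining processes, so the subsequent rmdir retries can still exhaust their budget, as the overview notes.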

Reviewed by Cursor Bugbot for commit 48b9424. Bugbot is set up for automated code reviews on this repo.

codecov · 2026-05-04T18:09:32Z

❌ 2 Tests Failed:

Tests completed	Failed	Passed	Skipped
2366	2	2364	5

View the full list of 2 ❄️ flaky test(s)

github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig

Flake rate in main: 55.56% (Passed 8 times, Failed 10 times)

Stack Traces | 25.9s run time

=== RUN   TestUpdateNetworkConfig
=== PAUSE TestUpdateNetworkConfig
=== CONT  TestUpdateNetworkConfig
Executing command curl in sandbox i8zn2nlz96gxmb86bp1tt
--- FAIL: TestUpdateNetworkConfig (25.95s)

github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false

Flake rate in main: 50.00% (Passed 8 times, Failed 8 times)

Stack Traces | 1.29s run time

=== RUN   TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
Executing command curl in sandbox ikrgwp4ydd25kfxvno4tx
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1346}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35 exited:true status:"exit status 35" error:"exit status 35"}}
Executing command curl in sandbox ikrgwp4ydd25kfxvno4tx
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1347}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35 exited:true status:"exit status 35" error:"exit status 35"}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{start:{pid:1348}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{data:{stdout:"HTTP/2 302 \r\nx-content-type-options: nosniff\r\nlocation: https://dns.google/\r\ndate: Tue, 05 May 2026 02:43:29 GMT\r\ncontent-type: text/html; charset=UTF-8\r\nserver: HTTP server (unknown)\r\ncontent-length: 216\r\nx-xss-protection: 0\r\nx-frame-options: SAMEORIGIN\r\nalt-svc: h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000\r\n\r\n"}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_network_update_test.go:391: Command [curl] completed successfully in sandbox ikrgwp4ydd25kfxvno4tx
    sandbox_network_update_test.go:391: 
        	Error Trace:	.../api/sandboxes/sandbox_network_out_test.go:74
        	            				.../api/sandboxes/sandbox_network_update_test.go:60
        	            				.../api/sandboxes/sandbox_network_update_test.go:391
        	Error:      	An error is expected but got nil.
        	Test:       	TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
        	Messages:   	https://8.8.8.8 should be blocked
--- FAIL: TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false (1.29s)

gemini-code-assist

Code Review

The error from os.ReadFile in countCgroupProcs is ignored, which could cause killCgroup to incorrectly assume no processes are running if a read failure occurs, potentially leaving orphaned processes and causing the cgroup removal to fail.
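One way to address this is to surface the read error to the caller so that "could not read" is never confused with "zero processes". The sketch below is a hypothetical version of `countCgroupProcs`; the real function's signature and body are not shown in this thread, so the signature here is an assumption.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"path/filepath"
)

// countCgroupProcs returns the number of PIDs listed in cgroup.procs.
// A read failure is returned to the caller instead of being reported as 0.
func countCgroupProcs(cgroupDir string) (int, error) {
	data, err := os.ReadFile(filepath.Join(cgroupDir, "cgroup.procs"))
	if err != nil {
		return 0, fmt.Errorf("read cgroup.procs: %w", err)
	}
	// cgroup.procs holds one PID per line; Fields splits on any whitespace.
	return len(bytes.Fields(data)), nil
}

// demo checks both the error path and the counting path against a temporary
// directory that stands in for a cgroup directory.
func demo() (count int, missingErr error, err error) {
	dir, err := os.MkdirTemp("", "cgroup-procs-sketch")
	if err != nil {
		return 0, nil, err
	}
	defer os.RemoveAll(dir)

	_, missingErr = countCgroupProcs(dir)

	if err = os.WriteFile(filepath.Join(dir, "cgroup.procs"), []byte("101\n202\n"), 0o644); err != nil {
		return 0, missingErr, err
	}
	count, err = countCgroupProcs(dir)
	return count, missingErr, err
}

func main() {
	fmt.Println(demo())
}
```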

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix prepared a fix for the issue found in the latest run.

✅ Fixed: Test asserts on zombie process that still exists
- Added cmd.Wait() call before signal check to reap zombie process, allowing the assertion to correctly detect process death.

Or push these changes by commenting:

@cursor push ba60dcab2d

Preview (ba60dcab2d)

diff --git a/packages/orchestrator/pkg/sandbox/cgroup/manager_test.go b/packages/orchestrator/pkg/sandbox/cgroup/manager_test.go
--- a/packages/orchestrator/pkg/sandbox/cgroup/manager_test.go
+++ b/packages/orchestrator/pkg/sandbox/cgroup/manager_test.go
@@ -296,6 +296,8 @@
 	_, statErr := os.Stat(handle.Path())
 	assert.True(t, os.IsNotExist(statErr), "cgroup directory should be removed")
 
+	cmd.Wait()
+
 	procErr := cmd.Process.Signal(syscall.Signal(0))
 	assert.Error(t, procErr, "leaked process should have been killed by cgroup.kill")
 }

Reviewed by Cursor Bugbot for commit 9b23f28.
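The zombie issue can be reproduced outside the test suite with a small sketch. `probeAfterKill` is a hypothetical helper and the `sleep` binary is assumed to be on the PATH. It probes a killed child with signal 0 before and after reaping; only the post-Wait result is something the sketch relies on, since the pre-Wait result depends on whether the kernel still holds the unreaped process entry.

```go
package main

import (
	"fmt"
	"os/exec"
	"syscall"
)

// probeAfterKill starts a child, kills it, and sends signal 0 to it before
// and after cmd.Wait() reaps it.
func probeAfterKill() (beforeWait, afterWait, err error) {
	cmd := exec.Command("sleep", "300")
	if err = cmd.Start(); err != nil {
		return nil, nil, err
	}
	if err = cmd.Process.Kill(); err != nil {
		_ = cmd.Wait()
		return nil, nil, err
	}

	// Until the child is reaped, a liveness probe can still succeed.
	beforeWait = cmd.Process.Signal(syscall.Signal(0))

	// Wait reaps the child; its own error (killed by signal) is expected.
	_ = cmd.Wait()

	// After reaping, os.Process knows the child is gone and reports an error.
	afterWait = cmd.Process.Signal(syscall.Signal(0))
	return beforeWait, afterWait, nil
}

func main() {
	before, after, err := probeAfterKill()
	fmt.Printf("before Wait: %v\nafter Wait: %v\nsetup error: %v\n", before, after, err)
}
```

This is why an assertion that the signal fails is only reliable once the child has been reaped.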

arkamar

I am generally in favor of this change, but I would suggest the following order: first try the regular os.Remove() as we did before, then log the error if it fails, and finally write cgroup.kill and remove again. I think we want to have the error logged. The current solution will hide issues.

qodo-code-review · 2026-05-05T01:41:26Z

PR Reviewer Guide 🔍

Warning

/review is deprecated. Use /agentic_review instead (removed after 2026-05-31).

Here are some key observations to aid the review process:

⚡ Recommended focus areas for review

Test Flakiness

TestCgroupHandleRemoveKillsLeakedProcess only skips when not running as root, but it will also fail on kernels or cgroup configurations without cgroup.kill support (or where it is disabled): Remove will warn, and then the process may remain alive and/or rmdir may fail. The test should detect cgroup.kill availability (or cgroup v2 plus kernel support) and skip accordingly to avoid environment-dependent CI failures.

func TestCgroupHandleRemoveKillsLeakedProcess(t *testing.T) {
	t.Parallel()

	if os.Geteuid() != 0 {
		t.Skip("test requires root privileges")
	}

	ctx := context.Background()
	mgr, err := NewManager()
	require.NoError(t, err)

	err = mgr.Initialize(ctx)
	require.NoError(t, err)

	handle, err := mgr.Create(ctx, "test-remove-kills-leaked")
	require.NoError(t, err)

	// Simulate the production failure mode: a wrapper process exec-chain that
	// leaves a long-lived child in the cgroup after the parent exits. Here we
	// just spawn a sleep directly and walk away from it.
	cmd := exec.CommandContext(ctx, "sleep", "300")
	cmd.SysProcAttr = &syscall.SysProcAttr{
		UseCgroupFD: true,
		CgroupFD:    handle.GetFD(),
	}
	require.NoError(t, cmd.Start())
	defer func() { _ = cmd.Process.Kill() }()

	require.NoError(t, handle.ReleaseCgroupFD())

	require.NoError(t, handle.Remove(ctx))

	_, statErr := os.Stat(handle.Path())
	assert.True(t, os.IsNotExist(statErr), "cgroup directory should be removed")

	// Reap and confirm cgroup.kill terminated the process. cmd.Wait() also
	// avoids a zombie that would otherwise pass a kill -0 liveness check.
	waitDone := make(chan error, 1)
	go func() { waitDone <- cmd.Wait() }()
	select {
	case waitErr := <-waitDone:
		require.Error(t, waitErr, "leaked process should have been killed by cgroup.kill")
	case <-time.After(3 * time.Second):
		t.Fatal("leaked process did not exit after cgroup.kill")
	}
}

ValentaTomas · 2026-05-05T01:44:52Z

@arkamar addressed in 9e7065a: cgroup.Remove now tries plain rmdir first; on EBUSY it logs "cgroup rmdir failed, falling back to cgroup.kill" (with the original errno) before writing cgroup.kill and retrying. The common no-leak path stays silent.

@qodo-merge-pro test now skips on kernels without cgroup.kill (probes for the file in the freshly created cgroup) instead of failing.
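A probe like the one mentioned here could be as small as the sketch below. `hasCgroupKill` is a hypothetical name, and the temporary directory in `demo` only stands in for a freshly created cgroup.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// hasCgroupKill reports whether the cgroup directory exposes a cgroup.kill
// file, which the kernel only provides when the feature is supported.
func hasCgroupKill(cgroupDir string) bool {
	_, err := os.Stat(filepath.Join(cgroupDir, "cgroup.kill"))
	return err == nil
}

// demo probes a temporary directory before and after the file exists.
func demo() (before, after bool, err error) {
	dir, err := os.MkdirTemp("", "cgroup-kill-probe")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)

	before = hasCgroupKill(dir)
	if err = os.WriteFile(filepath.Join(dir, "cgroup.kill"), nil, 0o200); err != nil {
		return before, false, err
	}
	after = hasCgroupKill(dir)
	return before, after, nil
}

func main() {
	fmt.Println(demo())
}
```

In the test, such a check would sit right after the cgroup is created, for example `if !hasCgroupKill(handle.Path()) { t.Skip("kernel lacks cgroup.kill") }`, so that unsupported environments skip instead of failing.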

ValentaTomas

@arkamar — addressed in 9e7065a. cgroup.Remove now tries plain rmdir first; on EBUSY it logs a warn before falling back to cgroup.kill + retry. Leaks stay visible in metrics. Sorry for the previous garbled reply on this thread.

ValentaTomas · 2026-05-05T02:28:33Z

Scope reduced to just the cgroup.kill backstop — pure defensive cleanup, the signal path in fc.Process.Stop is intentionally untouched. Per arkamar's review the rmdir runs first; on EBUSY we log the original error before falling back to cgroup.kill + retry, so leaks stay visible. Force-pushed to a single commit (d9772fd).

cgroup.Remove only does rmdir; if a process leaked into the cgroup (see INC-582 — fc.Process.Stop signals only the wrapper, not the firecracker child reparented to init), rmdir fails with EBUSY and the sandbox stays running on the node. Common path is unchanged. On EBUSY we now log the original error so leaks are visible in metrics, then write cgroup.kill (cgroups v2, kernel 5.14+) and retry rmdir for up to 2s. The signal path in fc.Process.Stop is intentionally untouched to avoid changing how firecracker is shut down today.

claude

LGTM — small, defensive cleanup with rmdir-first fast path and a bounded cgroup.kill fallback on EBUSY.

Extended reasoning...

Overview

Single-file change to packages/orchestrator/pkg/sandbox/cgroup/manager.go adding a backstop in CgroupHandle.Remove: try plain rmdir first, and on failure log the original error before writing cgroup.kill and retrying rmdir for up to 2s (with context cancellation honored and 200ms back-off). Motivated by INC-582 leaked firecracker processes.

Security risks

None meaningful. The new write target (cgroup.kill) is inside the cgroup directory the orchestrator already owns and manages, so no new privilege boundary is crossed. Worst-case is a 2s delay in Remove on a busy cgroup, which is bounded and ctx-cancelable.

Level of scrutiny

Low-to-moderate. This touches sandbox shutdown — production-relevant — but the change is purely additive on the failure path: the no-leak common path stays a single rmdir, and the fallback only runs when rmdir already failed. The 2s budget and context-aware retry loop are reasonable.

Other factors

The PR went through several iterations and addressed all prior bot/reviewer feedback (gemini's ignored-error issue, cursor's zombie-test issue, qodo's environment-skip request). Scope was deliberately narrowed to the defensive backstop only — the signal-propagation root cause in fc.Process.Stop is correctly left to a follow-up. New test coverage exists and the author confirmed it passes in arm64 CI.

e2b-request-same-site-reviewers Bot assigned tvi May 4, 2026

gemini-code-assist Bot reviewed May 4, 2026

Comment thread packages/orchestrator/pkg/sandbox/cgroup/manager.go Outdated

cursor Bot reviewed May 4, 2026

Comment thread packages/orchestrator/pkg/sandbox/cgroup/manager_test.go Outdated

arkamar reviewed May 4, 2026

ValentaTomas mentioned this pull request May 4, 2026

feat(block): add Tracker with NotPresent/Dirty/Zero states #2545

Merged

ValentaTomas changed the title ~~fix(orch): kill leaked firecracker via cgroup.kill before rmdir~~ fix(orch): cgroup.kill before rmdir as defense-in-depth for leaked FCs May 4, 2026

ValentaTomas unassigned tvi May 5, 2026

ValentaTomas changed the title ~~fix(orch): cgroup.kill before rmdir as defense-in-depth for leaked FCs~~ fix(orch): signal fc process group + cgroup.kill backstop for leaked firecrackers May 5, 2026

ValentaTomas marked this pull request as ready for review May 5, 2026 01:41

ValentaTomas requested review from dobrac and jakubno as code owners May 5, 2026 01:41

e2b-request-same-site-reviewers Bot assigned levb May 5, 2026

ValentaTomas commented May 5, 2026

claude Bot reviewed May 5, 2026

ValentaTomas unassigned levb May 5, 2026

ValentaTomas force-pushed the fix/orch-cgroup-kill-leaked-fc branch from 9e7065a to d9772fd Compare May 5, 2026 02:28

ValentaTomas changed the title ~~fix(orch): signal fc process group + cgroup.kill backstop for leaked firecrackers~~ fix(orch): cgroup.kill backstop on rmdir EBUSY May 5, 2026

ValentaTomas force-pushed the fix/orch-cgroup-kill-leaked-fc branch from d9772fd to 48b9424 Compare May 5, 2026 02:34

ValentaTomas enabled auto-merge (squash) May 5, 2026 02:54

claude Bot reviewed May 5, 2026

ValentaTomas requested review from arkamar and sitole and removed request for dobrac and jakubno May 5, 2026 04:39

arkamar approved these changes May 5, 2026

ValentaTomas merged commit 031e111 into main May 5, 2026
48 checks passed

ValentaTomas deleted the fix/orch-cgroup-kill-leaked-fc branch May 5, 2026 09:51
