
fix(orch): cgroup.kill backstop on rmdir EBUSY #2560

Merged
ValentaTomas merged 1 commit into main from fix/orch-cgroup-kill-leaked-fc on May 5, 2026
Conversation

@ValentaTomas
Member

@ValentaTomas ValentaTomas commented May 4, 2026

INC-582: six firecracker processes for the same sandbox left running on orch-client-n2-r8sb after a burst of checkpoints. Root cause is fc.Process.Stop signaling only the unshare/bash/ip-netns-exec wrapper PID — ip netns exec forks firecracker as a child without forwarding signals, so firecracker is reparented to init and keeps running. The orchestrator never sees an error.
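
The failure mode described above can be reproduced in miniature. What follows is a minimal sketch, not the orchestrator's code: it assumes a Unix host with sh and sleep on the PATH, uses a plain shell in place of the unshare/bash/ip-netns-exec wrapper, and the helper name childSurvivesWrapperKill is hypothetical. It signals only the wrapper PID and then checks whether the backgrounded child is still alive.

```go
package main

import (
	"bufio"
	"fmt"
	"os/exec"
	"strconv"
	"strings"
	"syscall"
)

// childSurvivesWrapperKill (hypothetical helper) starts a shell that
// backgrounds a long-lived child, kills only the shell, and reports whether
// the child is still running afterwards.
func childSurvivesWrapperKill() (bool, error) {
	// The shell stands in for the wrapper; "sleep 30" stands in for firecracker.
	cmd := exec.Command("sh", "-c", "sleep 30 & echo $!; wait")
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		return false, err
	}
	if err := cmd.Start(); err != nil {
		return false, err
	}

	// The shell prints the PID of the backgrounded child on its first line.
	line, readErr := bufio.NewReader(stdout).ReadString('\n')

	// Signal only the wrapper PID, then reap the wrapper.
	_ = cmd.Process.Kill()
	_ = cmd.Wait()

	if readErr != nil {
		return false, readErr
	}
	childPID, err := strconv.Atoi(strings.TrimSpace(line))
	if err != nil {
		return false, err
	}
	// Clean up the orphaned child once it has been inspected.
	defer func() { _ = syscall.Kill(childPID, syscall.SIGKILL) }()

	// Signal 0 delivers nothing; it only checks that the PID still exists.
	return syscall.Kill(childPID, 0) == nil, nil
}

func main() {
	alive, err := childSurvivesWrapperKill()
	fmt.Println("child alive after killing wrapper:", alive, "err:", err)
}
```

The wrapper exits cleanly from the caller's point of view, which matches the observation that the orchestrator never sees an error.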

This PR is a defensive cleanup-only fix: in cgroup.Remove we keep the existing rmdir as the fast path, and on EBUSY (process still in the cgroup) log the original error and fall back to cgroup.kill + retry rmdir for up to 2s. The signal path in fc.Process.Stop is intentionally untouched — fixing the firecracker SIGTERM-propagation properly is a behavioral change to sandbox shutdown that needs its own PR.

Trade-off: we still leave firecracker running until the next cgroup.Remove call, but the sandbox will then actually die instead of accumulating, and the new "cgroup rmdir failed, falling back to cgroup.kill" warning is a queryable Loki signal we don't have today.

Related: PR #2453 (already merged Apr 21) fixes the duplicate-SandboxID overwrite in the sandboxes map, which compounded the leak when the deployed ec97b441 orchestrator allowed multiple in-flight checkpoints to clobber each other's map entry. The next foxtrot deploy picks that up automatically.

Full evidence, code-path trace, empirical signal-propagation test and follow-up plan in investigation/2026-05-04-inc-582-stale-checkpoint-fcs/REPORT.md in the debugger repo.

@cursor

cursor Bot commented May 4, 2026

PR Summary

Medium Risk
Medium risk because cleanup now attempts to kill any remaining processes in the cgroup and can block removal for up to 2 seconds, which could affect sandbox shutdown behavior under load.

Overview
If os.Remove of the cgroup directory fails (typically EBUSY), Remove now logs a warning, writes to cgroup.kill, and retries rmdir for up to 2 seconds; this can kill leftover processes in the cgroup and can delay cleanup. The cgroup.kill write is best-effort (missing file is ignored), so on older kernels the removal can still fail after the retry window.

Reviewed by Cursor Bugbot for commit 48b9424.

@codecov

codecov Bot commented May 4, 2026

❌ 2 Tests Failed:

Tests completed: 2366 | Failed: 2 | Passed: 2364 | Skipped: 5
View the full list of 2 ❄️ flaky test(s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig

Flake rate in main: 55.56% (Passed 8 times, Failed 10 times)

Stack Traces | 25.9s run time
=== RUN   TestUpdateNetworkConfig
=== PAUSE TestUpdateNetworkConfig
=== CONT  TestUpdateNetworkConfig
Executing command curl in sandbox i8zn2nlz96gxmb86bp1tt
--- FAIL: TestUpdateNetworkConfig (25.95s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false

Flake rate in main: 50.00% (Passed 8 times, Failed 8 times)

Stack Traces | 1.29s run time
=== RUN   TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
Executing command curl in sandbox ikrgwp4ydd25kfxvno4tx
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1346}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35 exited:true status:"exit status 35" error:"exit status 35"}}
Executing command curl in sandbox ikrgwp4ydd25kfxvno4tx
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1347}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35 exited:true status:"exit status 35" error:"exit status 35"}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{start:{pid:1348}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{data:{stdout:"HTTP/2 302 \r\nx-content-type-options: nosniff\r\nlocation: https://dns.google/\r\ndate: Tue, 05 May 2026 02:43:29 GMT\r\ncontent-type: text/html; charset=UTF-8\r\nserver: HTTP server (unknown)\r\ncontent-length: 216\r\nx-xss-protection: 0\r\nx-frame-options: SAMEORIGIN\r\nalt-svc: h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000\r\n\r\n"}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_network_update_test.go:391: Command [curl] completed successfully in sandbox ikrgwp4ydd25kfxvno4tx
    sandbox_network_update_test.go:391: 
        	Error Trace:	.../api/sandboxes/sandbox_network_out_test.go:74
        	            				.../api/sandboxes/sandbox_network_update_test.go:60
        	            				.../api/sandboxes/sandbox_network_update_test.go:391
        	Error:      	An error is expected but got nil.
        	Test:       	TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
        	Messages:   	https://8.8.8.8 should be blocked
--- FAIL: TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false (1.29s)


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

The error from os.ReadFile in countCgroupProcs is ignored, which could cause killCgroup to incorrectly assume no processes are running if a read failure occurs, potentially leaving orphaned processes and causing the cgroup removal to fail.

Comment thread packages/orchestrator/pkg/sandbox/cgroup/manager.go Outdated

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Test asserts on zombie process that still exists
    • Added cmd.Wait() call before signal check to reap zombie process, allowing the assertion to correctly detect process death.

Preview (ba60dcab2d)
diff --git a/packages/orchestrator/pkg/sandbox/cgroup/manager_test.go b/packages/orchestrator/pkg/sandbox/cgroup/manager_test.go
--- a/packages/orchestrator/pkg/sandbox/cgroup/manager_test.go
+++ b/packages/orchestrator/pkg/sandbox/cgroup/manager_test.go
@@ -296,6 +296,8 @@
 	_, statErr := os.Stat(handle.Path())
 	assert.True(t, os.IsNotExist(statErr), "cgroup directory should be removed")
 
+	cmd.Wait()
+
 	procErr := cmd.Process.Signal(syscall.Signal(0))
 	assert.Error(t, procErr, "leaked process should have been killed by cgroup.kill")
 }
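
The reason the reap matters can be shown in isolation. The sketch below is not taken from the PR's test file: the helper name goneAfterReap is hypothetical, and it assumes a Unix host with sleep on the PATH. It kills a child, reaps it with Wait, and only then probes it with signal 0.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// goneAfterReap (hypothetical helper) kills a child process, reaps it, and
// reports whether a signal-0 probe then sees the process as finished.
func goneAfterReap() (bool, error) {
	cmd := exec.Command("sleep", "300")
	if err := cmd.Start(); err != nil {
		return false, err
	}
	if err := cmd.Process.Kill(); err != nil {
		return false, err
	}

	// Until Wait reaps it, the killed child lingers as a zombie whose PID
	// still exists, so a liveness probe before this point can be misleading.
	if waitErr := cmd.Wait(); waitErr == nil {
		return false, errors.New("expected a non-nil exit error for a killed process")
	}

	// After Wait, os.Process knows the child is done and refuses to signal it.
	probeErr := cmd.Process.Signal(syscall.Signal(0))
	return errors.Is(probeErr, os.ErrProcessDone), nil
}

func main() {
	gone, err := goneAfterReap()
	fmt.Println("reported as finished after Wait:", gone, "err:", err)
}
```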


Reviewed by Cursor Bugbot for commit 9b23f28.

Comment thread packages/orchestrator/pkg/sandbox/cgroup/manager_test.go Outdated
Contributor

@arkamar arkamar left a comment


I am generally in favor of this change, but I would suggest the following order: first try the regular os.Remove() as we did before, then log the error if it fails, and only then write cgroup.kill and remove. I think we want to have the error logged. The current solution will hide issues.

@ValentaTomas ValentaTomas changed the title from "fix(orch): kill leaked firecracker via cgroup.kill before rmdir" to "fix(orch): cgroup.kill before rmdir as defense-in-depth for leaked FCs" May 4, 2026
@ValentaTomas ValentaTomas changed the title from "fix(orch): cgroup.kill before rmdir as defense-in-depth for leaked FCs" to "fix(orch): signal fc process group + cgroup.kill backstop for leaked firecrackers" May 5, 2026
@ValentaTomas ValentaTomas marked this pull request as ready for review May 5, 2026 01:41
@qodo-code-review

PR Reviewer Guide 🔍

Warning

/review is deprecated. Use /agentic_review instead (removed after 2026-05-31).

Here are some key observations to aid the review process:

⚡ Recommended focus areas for review

Test Flakiness

TestCgroupHandleRemoveKillsLeakedProcess only skips when not root, but it will fail on kernels/cgroup configurations without cgroup.kill support (or where it’s disabled), since Remove will warn and then the process may remain alive and/or rmdir may fail; the test should detect cgroup.kill availability (or cgroup v2 + kernel support) and skip accordingly to avoid environment-dependent CI failures.

func TestCgroupHandleRemoveKillsLeakedProcess(t *testing.T) {
	t.Parallel()

	if os.Geteuid() != 0 {
		t.Skip("test requires root privileges")
	}

	ctx := context.Background()
	mgr, err := NewManager()
	require.NoError(t, err)

	err = mgr.Initialize(ctx)
	require.NoError(t, err)

	handle, err := mgr.Create(ctx, "test-remove-kills-leaked")
	require.NoError(t, err)

	// Simulate the production failure mode: a wrapper process exec-chain that
	// leaves a long-lived child in the cgroup after the parent exits. Here we
	// just spawn a sleep directly and walk away from it.
	cmd := exec.CommandContext(ctx, "sleep", "300")
	cmd.SysProcAttr = &syscall.SysProcAttr{
		UseCgroupFD: true,
		CgroupFD:    handle.GetFD(),
	}
	require.NoError(t, cmd.Start())
	defer func() { _ = cmd.Process.Kill() }()

	require.NoError(t, handle.ReleaseCgroupFD())

	require.NoError(t, handle.Remove(ctx))

	_, statErr := os.Stat(handle.Path())
	assert.True(t, os.IsNotExist(statErr), "cgroup directory should be removed")

	// Reap and confirm cgroup.kill terminated the process. cmd.Wait() also
	// avoids a zombie that would otherwise pass a kill -0 liveness check.
	waitDone := make(chan error, 1)
	go func() { waitDone <- cmd.Wait() }()
	select {
	case waitErr := <-waitDone:
		require.Error(t, waitErr, "leaked process should have been killed by cgroup.kill")
	case <-time.After(3 * time.Second):
		t.Fatal("leaked process did not exit after cgroup.kill")
	}
}

@ValentaTomas
Member Author

@arkamar addressed in 9e7065a. cgroup.Remove now tries plain rmdir first; on EBUSY it logs "cgroup rmdir failed, falling back to cgroup.kill" (with the original errno) before writing cgroup.kill and retrying. The common no-leak path stays silent.

@qodo-merge-pro test now skips on kernels without cgroup.kill (probes for the file in the freshly created cgroup) instead of failing.

Member Author

@ValentaTomas ValentaTomas left a comment


@arkamar — addressed in 9e7065a. cgroup.Remove now tries plain rmdir first; on EBUSY it logs a warn before falling back to cgroup.kill + retry. Leaks stay visible in metrics. Sorry for the previous garbled reply on this thread.


@claude claude Bot left a comment


⚠️ Code review skipped — your organization's overage spend limit has been reached.

Code review is billed via overage credits. To resume reviews, an organization admin can raise the monthly limit at claude.ai/admin-settings/claude-code.

Once credits are available, reopen this pull request to trigger a review.

@ValentaTomas ValentaTomas force-pushed the fix/orch-cgroup-kill-leaked-fc branch from 9e7065a to d9772fd May 5, 2026 02:28
@ValentaTomas ValentaTomas changed the title from "fix(orch): signal fc process group + cgroup.kill backstop for leaked firecrackers" to "fix(orch): cgroup.kill backstop on rmdir EBUSY" May 5, 2026
@ValentaTomas
Member Author

Scope reduced to just the cgroup.kill backstop — pure defensive cleanup, the signal path in fc.Process.Stop is intentionally untouched. Per arkamar's review the rmdir runs first; on EBUSY we log the original error before falling back to cgroup.kill + retry, so leaks stay visible. Force-pushed to a single commit (d9772fd).

cgroup.Remove only does rmdir; if a process leaked into the cgroup
(see INC-582 — fc.Process.Stop signals only the wrapper, not the
firecracker child reparented to init), rmdir fails with EBUSY and the
sandbox stays running on the node. Common path is unchanged. On EBUSY
we now log the original error so leaks are visible in metrics, then
write cgroup.kill (cgroups v2, kernel 5.14+) and retry rmdir for up to
2s. The signal path in fc.Process.Stop is intentionally untouched to
avoid changing how firecracker is shut down today.
@ValentaTomas ValentaTomas force-pushed the fix/orch-cgroup-kill-leaked-fc branch from d9772fd to 48b9424 May 5, 2026 02:34
@ValentaTomas ValentaTomas enabled auto-merge (squash) May 5, 2026 02:54

@claude claude Bot left a comment


LGTM — small, defensive cleanup with rmdir-first fast path and a bounded cgroup.kill fallback on EBUSY.

Extended reasoning...

Overview

Single-file change to packages/orchestrator/pkg/sandbox/cgroup/manager.go adding a backstop in CgroupHandle.Remove: try plain rmdir first, and on failure log the original error before writing cgroup.kill and retrying rmdir for up to 2s (with context cancellation honored and 200ms back-off). Motivated by INC-582 leaked firecracker processes.

Security risks

None meaningful. The new write target (cgroup.kill) is inside the cgroup directory the orchestrator already owns and manages, so no new privilege boundary is crossed. Worst-case is a 2s delay in Remove on a busy cgroup, which is bounded and ctx-cancelable.

Level of scrutiny

Low-to-moderate. This touches sandbox shutdown — production-relevant — but the change is purely additive on the failure path: the no-leak common path stays a single rmdir, and the fallback only runs when rmdir already failed. The 2s budget and context-aware retry loop are reasonable.

Other factors

The PR went through several iterations and addressed all prior bot/reviewer feedback (gemini's ignored-error issue, cursor's zombie-test issue, qodo's environment-skip request). Scope was deliberately narrowed to the defensive backstop only — the signal-propagation root cause in fc.Process.Stop is correctly left to a follow-up. New test coverage exists and the author confirmed it passes in arm64 CI.

@ValentaTomas ValentaTomas requested review from arkamar and sitole and removed request for dobrac and jakubno May 5, 2026 04:39
@ValentaTomas ValentaTomas merged commit 031e111 into main May 5, 2026
48 checks passed
@ValentaTomas ValentaTomas deleted the fix/orch-cgroup-kill-leaked-fc branch May 5, 2026 09:51