fix(orch): cgroup.kill backstop on rmdir EBUSY #2560
Conversation
PR Summary: Medium Risk. Reviewed by Cursor Bugbot for commit 48b9424.
❌ 2 Tests Failed: both are ❄️ flaky tests.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Test asserts on zombie process that still exists
- Added cmd.Wait() call before signal check to reap zombie process, allowing the assertion to correctly detect process death.
Or push these changes by commenting:
@cursor push ba60dcab2d
Preview (ba60dcab2d)
diff --git a/packages/orchestrator/pkg/sandbox/cgroup/manager_test.go b/packages/orchestrator/pkg/sandbox/cgroup/manager_test.go
--- a/packages/orchestrator/pkg/sandbox/cgroup/manager_test.go
+++ b/packages/orchestrator/pkg/sandbox/cgroup/manager_test.go
@@ -296,6 +296,8 @@
_, statErr := os.Stat(handle.Path())
assert.True(t, os.IsNotExist(statErr), "cgroup directory should be removed")
+ cmd.Wait()
+
procErr := cmd.Process.Signal(syscall.Signal(0))
assert.Error(t, procErr, "leaked process should have been killed by cgroup.kill")
}
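For context on the fix above: kill(pid, 0) keeps succeeding on a zombie until the parent reaps it, so the assertion only turns into an error after cmd.Wait(). A minimal, self-contained sketch of that behavior — using sleep and an explicit Kill() as stand-ins for the cgroup.kill path, not the PR's actual test code:

```go
// Standalone illustration (assumption: not the PR's test) of why the zombie
// must be reaped before asserting on Signal(0).
package main

import (
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

func main() {
	cmd := exec.Command("sleep", "60")
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	_ = cmd.Process.Kill()             // child dies but stays a zombie until reaped
	time.Sleep(100 * time.Millisecond) // give the kernel time to deliver SIGKILL

	// Signal(0) still succeeds: the zombie's PID exists as far as kill(2) is concerned.
	fmt.Println("before Wait:", cmd.Process.Signal(syscall.Signal(0)))

	_ = cmd.Wait() // reap the zombie

	// Now the process is gone and the signal check returns an error.
	fmt.Println("after Wait:", cmd.Process.Signal(syscall.Signal(0)))
}
```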
Reviewed by Cursor Bugbot for commit 9b23f28.
arkamar left a comment:
I am generally in favor of this change, but I would suggest: first try the regular os.Remove() as we did before, then log the error if it fails, and finally send cgroup.kill and remove again. I think we want the error logged; the current solution will hide issues.
@arkamar addressed in 9e7065a — @qodo-merge-pro: the test now skips on kernels without cgroup.kill support.
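A hypothetical sketch of that environment check, assuming the test stats cgroup.kill inside the cgroup it just created (the file only exists in non-root cgroups on cgroups v2, kernel 5.14+); the helper and parameter names are mine, not the PR's:

```go
package cgroup_test

import (
	"os"
	"path/filepath"
	"testing"
)

// skipIfNoCgroupKill skips the test when the kernel does not expose cgroup.kill
// for the given (non-root) cgroup directory, e.g. handle.Path() in the test above.
func skipIfNoCgroupKill(t *testing.T, cgroupPath string) {
	t.Helper()
	if _, err := os.Stat(filepath.Join(cgroupPath, "cgroup.kill")); err != nil {
		t.Skipf("cgroup.kill not supported here (needs cgroups v2, kernel 5.14+): %v", err)
	}
}
```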
Code review is billed via overage credits. To resume reviews, an organization admin can raise the monthly limit at claude.ai/admin-settings/claude-code.
Once credits are available, reopen this pull request to trigger a review.
Force-pushed 9e7065a to d9772fd
Scope reduced to just the cgroup.Remove backstop.
cgroup.Remove only does rmdir; if a process leaked into the cgroup (see INC-582 — fc.Process.Stop signals only the wrapper, not the firecracker child reparented to init), rmdir fails with EBUSY and the sandbox stays running on the node. Common path is unchanged. On EBUSY we now log the original error so leaks are visible in metrics, then write cgroup.kill (cgroups v2, kernel 5.14+) and retry rmdir for up to 2s. The signal path in fc.Process.Stop is intentionally untouched to avoid changing how firecracker is shut down today.
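A minimal sketch of the shape this describes — rmdir fast path, warn on EBUSY, cgroup.kill, bounded ctx-aware retry. The type, field, constant, and logger names here are assumptions, not the actual manager.go code:

```go
package cgroup

import (
	"context"
	"errors"
	"log/slog"
	"os"
	"path/filepath"
	"syscall"
	"time"
)

const (
	killRetryBudget  = 2 * time.Second        // total budget for the fallback path
	killRetryBackoff = 200 * time.Millisecond // pause between rmdir retries
)

// CgroupHandle is a stand-in for the real handle type; field names are assumptions.
type CgroupHandle struct {
	path   string
	logger *slog.Logger
}

// Remove keeps plain rmdir as the fast path. On EBUSY it logs the original error,
// writes cgroup.kill (cgroups v2, kernel 5.14+) to SIGKILL everything still in the
// cgroup, then retries rmdir for up to 2s, honoring context cancellation.
func (h *CgroupHandle) Remove(ctx context.Context) error {
	rmErr := os.Remove(h.path)
	if rmErr == nil || os.IsNotExist(rmErr) {
		return nil // common path unchanged: empty cgroup, single rmdir
	}
	if !errors.Is(rmErr, syscall.EBUSY) {
		return rmErr
	}

	// A process is still inside the cgroup (e.g. a leaked firecracker child).
	h.logger.Warn("cgroup rmdir failed, falling back to cgroup.kill", "error", rmErr)

	// Writing "1" to cgroup.kill kills every process in the cgroup and its descendants.
	if err := os.WriteFile(filepath.Join(h.path, "cgroup.kill"), []byte("1"), 0o644); err != nil {
		return errors.Join(rmErr, err)
	}

	deadline := time.Now().Add(killRetryBudget)
	for {
		err := os.Remove(h.path)
		if err == nil || os.IsNotExist(err) {
			return nil
		}
		if time.Now().After(deadline) {
			return err
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(killRetryBackoff):
		}
	}
}
```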
Force-pushed d9772fd to 48b9424
LGTM — small, defensive cleanup with rmdir-first fast path and a bounded cgroup.kill fallback on EBUSY.
Extended reasoning...
Overview
Single-file change to packages/orchestrator/pkg/sandbox/cgroup/manager.go adding a backstop in CgroupHandle.Remove: try plain rmdir first, and on failure log the original error before writing cgroup.kill and retrying rmdir for up to 2s (with context cancellation honored and 200ms back-off). Motivated by INC-582 leaked firecracker processes.
Security risks
None meaningful. The new write target (cgroup.kill) is inside the cgroup directory the orchestrator already owns and manages, so no new privilege boundary is crossed. Worst-case is a 2s delay in Remove on a busy cgroup, which is bounded and ctx-cancelable.
Level of scrutiny
Low-to-moderate. This touches sandbox shutdown — production-relevant — but the change is purely additive on the failure path: the no-leak common path stays a single rmdir, and the fallback only runs when rmdir already failed. The 2s budget and context-aware retry loop are reasonable.
Other factors
The PR went through several iterations and addressed all prior bot/reviewer feedback (gemini's ignored-error issue, cursor's zombie-test issue, qodo's environment-skip request). Scope was deliberately narrowed to the defensive backstop only — the signal-propagation root cause in fc.Process.Stop is correctly left to a follow-up. New test coverage exists and the author confirmed it passes in arm64 CI.


INC-582: six firecracker processes for the same sandbox left running on orch-client-n2-r8sb after a burst of checkpoints. Root cause is fc.Process.Stop signaling only the unshare/bash/ip-netns-exec wrapper PID — ip netns exec forks firecracker as a child without forwarding signals, so firecracker is reparented to init and keeps running. The orchestrator never sees an error.

This PR is a defensive cleanup-only fix: in cgroup.Remove we keep the existing rmdir as the fast path, and on EBUSY (process still in the cgroup) log the original error and fall back to cgroup.kill + retry rmdir for up to 2s. The signal path in fc.Process.Stop is intentionally untouched — fixing the firecracker SIGTERM propagation properly is a behavioral change to sandbox shutdown that needs its own PR.

Trade-off: we still leave firecracker running until the next cgroup.Remove call, but the sandbox will then actually die instead of accumulating, and the new "cgroup rmdir failed, falling back to cgroup.kill" warn is a queryable Loki signal we don't have today.

Related: PR #2453 (already merged Apr 21) fixes the duplicate-SandboxID overwrite in the sandboxes map, which compounded the leak when the deployed ec97b441 orchestrator allowed multiple in-flight checkpoints to clobber each other's map entry. The next foxtrot deploy picks that up automatically.

Full evidence, code-path trace, empirical signal-propagation test and follow-up plan in investigation/2026-05-04-inc-582-stale-checkpoint-fcs/REPORT.md in the debugger repo.
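Not the empirical test from the report, but a rough, self-contained stand-in for the propagation problem: bash plays the role of the unshare/ip-netns-exec wrapper and sleep the role of firecracker. Signaling only the wrapper PID kills bash while the child keeps running, reparented to init (exact behavior can vary by shell):

```go
package main

import (
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

func main() {
	// Wrapper with a foreground child; the trailing "true" keeps bash from exec'ing sleep.
	wrapper := exec.Command("bash", "-c", "sleep 300; true")
	if err := wrapper.Start(); err != nil {
		panic(err)
	}

	// Signal only the wrapper PID, as the current stop path does.
	_ = wrapper.Process.Signal(syscall.SIGTERM)
	_ = wrapper.Wait()
	time.Sleep(200 * time.Millisecond)

	// The child was never signaled: it is reparented to init and keeps running.
	out, _ := exec.Command("pgrep", "-f", "sleep 300").Output()
	fmt.Printf("surviving child PIDs:\n%s", out)
}
```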