-
Notifications
You must be signed in to change notification settings - Fork 0
Pipeline Design 441
Problem: The ruflo_with_timeout function in scripts/lib/ruflo-adapter.sh (lines 373–437) leaks child Node processes when timeouts fire. Each loop iteration leaks ~1.5 processes (385 orphaned procs × ~4 MB = 1.5 GB cumulative over ~260 iterations).
Root Cause: When ruflo_with_timeout invokes a shell function that spawns Node, the Node process becomes a child of the wrapper subshell. Node then spawns grandchildren (agentdb workers, LLM daemon processes). The current cleanup uses pkill -TERM -P "$bg_pid" (line 416), which kills direct children only. Grandchildren remain alive with Node as parent; when Node is killed, grandchildren become orphans, are adopted by init(1), and accumulate in memory.
Constraints:
- Must support bash 3.2 (macOS default) — no associative arrays, no modern shell features
- Must be fail-open: timeout failures never block the pipeline
- Must preserve temp file cleanup on all code paths
- Must handle systems without
pkillbinary
┌─────────────────────────────────────────────────────────────────┐
│ ruflo_with_timeout() │
│ (wrapper function responsible for timeout + cleanup) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Process Group Manager (setsid isolation) │ │
│ │ ├─ Creates isolated process group at spawn (line 404) │ │
│ │ └─ All descendants inherit group membership │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │ │
│ ├─ spawns ─────────────────────────┐ │
│ │ │
│ ┌──────────────────────────────────┐ │ │
│ │ Shell Subshell (bg_pid) │ │ │
│ │ • Creates new process group │ │ │
│ │ via setsid │ │ │
│ └──────────────────────────────────┘ │ │
│ │ │ │
│ ├─ execs ─────────────┐ │ │
│ │ │ │
│ ┌──────────────────────────────┐│ │ │
│ │ Ruflo/Node Process ││ │ │
│ │ • Same process group ││ │ │
│ │ • Parent = subshell ││ │ │
│ └──────────────────────────────┘│ │ │
│ │ │ │ │
│ ├─ spawns ──────────┐ │ │ │
│ │ │ │ │
│ ┌──────────────────────────────┐┌─┘ │ │ │
│ │ Grandchildren (agentdb, ││ │ │ │
│ │ LLM daemon, etc.) ││ │ │ │
│ │ • Same process group ││ │ │ │
│ │ • Parent = Node ││ │ │ │
│ └──────────────────────────────┘│ │ │ │
│ │ │ │ │
│ ┌──────────────────────────────────────────────────┴───┴──────┐│
│ │ Signal Delivery Layer ││
│ │ ├─ Timeout detected: kill -TERM -<negative_bg_pid> ││
│ │ │ (kills ENTIRE process group, not just direct children) ││
│ │ ├─ Grace period: sleep 1 ││
│ │ └─ Force kill: kill -KILL -<negative_bg_pid> ││
│ │ (ensures all descendants terminate) ││
│ └───────────────────────────────────────────────────────────┘│
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Resource Cleanup (guaranteed on all paths) │ │
│ │ ├─ rm -f "$_rft_tmp" (temp file) │ │
│ │ └─ wait $bg_pid (reap zombie) │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
// Runs a command with a wall-clock timeout and graceful process group termination.
// STDIN/STDOUT/STDERR are captured in a temp file and printed on success.
// On timeout or failure, circuit-breaker increments RUFLO_FAILURE_COUNT.
function ruflo_with_timeout(
timeout_seconds: number, // Wall-clock limit (1–3600 seconds)
...command: string[] // Command + args to run (shell function or binary)
): {
exit_code: number; // 0 = success, 124 = timeout, >0 = command failure
stdout: string; // Output from command (printed to stdout)
side_effects: {
RUFLO_FAILURE_COUNT: number; // Incremented on failure, exported
RUFLO_AVAILABLE: boolean; // May flip to false if failure_count >= 5
};
}
// Error contracts:
// - Returns 1 (non-zero) immediately if ruflo is disabled (fail-open)
// - Catches mktemp failures → returns 1, doesn't run command
// - Catches timeout → sends SIGTERM then SIGKILL to process group
// - Catches wait failures → returns exit code from child
// - ALWAYS cleans up temp file, even on error// Send a signal to an entire process group (all descendants of a parent PID).
// POSIX standard: negative PID = process group ID.
// Fallback to pkill when kill -s <sig> -<pid> fails.
function _kill_process_group(
signal: string, // "TERM" or "KILL" (uppercase, no SIG prefix)
pid: number // Process ID (will be negated to form group ID)
): {
exit_code: number; // 0 = signal sent, 1 = signal failed/not supported
fallback_used: boolean; // true if kill -s failed and pkill was used
}-
Process Group Isolation: Subshell runs with
setsid⟹ all children inherit group ID -
Signal Propagation:
kill -s <signal> -<pid>sends signal to entire group, not just direct children - Cleanup Guarantee: Temp file removed regardless of code path (success, timeout, error)
- Circuit Breaker: Failure counter increments atomically; disabled once counter ≥ RUFLO_MAX_FAILURES (default 5)
1. ruflo_with_timeout(30, "shell_fn", "arg1")
│
├─ mktemp → _rft_tmp created
│
├─ ( setsid shell_fn "arg1" ) >_rft_tmp &
│ └─ Subshell spawned, running in new process group
│ ├─ Inherits setsid group isolation
│ ├─ Spawns Node → inherited group
│ └─ Node spawns grandchildren → inherited group
│
├─ Poll: while kill -0 $bg_pid && waited < 30
│ ├─ kill -0 checks process still alive (no signal sent)
│ └─ sleep 1 per iteration
│
├─ Process completes
│ └─ wait $bg_pid → exit 0
│
├─ cat _rft_tmp → print output
│
└─ rm -f _rft_tmp → cleanup, return 0
1. Wait loop reaches timeout_s seconds
│
├─ kill -0 $bg_pid → still alive after timeout
│
├─ Send SIGTERM to entire process group:
│ └─ kill -TERM -<negative_bg_pid> (negation sends to group)
│
├─ Grace period: sleep 1
│ ├─ Node gets SIGTERM, starts cleanup
│ └─ Grandchildren get SIGTERM (same signal)
│
├─ If still alive, send SIGKILL:
│ └─ kill -KILL -<negative_bg_pid>
│ ├─ Ungraceful termination
│ └─ All descendants killed immediately
│
├─ wait $bg_pid → reap zombie
│
├─ rm -f _rft_tmp → cleanup
│
├─ Increment RUFLO_FAILURE_COUNT (exported)
│
└─ return 124 (timeout exit code)
1. If kill -TERM -<negative_pid> fails (not supported):
│
├─ Call _kill_process_group with fallback:
│ └─ pkill -TERM -P $bg_pid
│ ├─ Kills direct children only (suboptimal)
│ ├─ Grandchildren may survive
│ └─ Better than nothing on systems without process group support
│
├─ Emit warning: "Process group kill failed, using pkill fallback"
│
└─ Continue with grace period + SIGKILL
| Component | Responsibility | Error Handling |
|---|---|---|
| Process Group Creator (setsid) | Isolate all descendants | If setsid unavailable: continue without group isolation; fallback to pkill -P |
| Timeout Watcher | Detect wall-clock expiry | Always succeeds (polling loop has failsafe bound) |
| Signal Sender | Kill process group with SIGTERM/SIGKILL | Errors caught; fallback to pkill; always proceeds with grace period |
| Process Reaper (wait) | Collect exit status, prevent zombies | Errors suppressed (process may already be dead); captured as exit code |
| Temp File Manager | Clean up FD resources | Always runs; errors suppressed to prevent abort-on-cleanup (fail-open) |
| Circuit Breaker | Disable ruflo after N failures | Increment atomic counter; exported immediately; threshold checked at next call |
Caller (e.g., ruflo_hive_init)
│
└─ ruflo_with_timeout(30, "some_cmd")
│
├─ Any error inside → exit_code = 1 or 124
│ └─ Caller receives non-zero exit
│ └─ RUFLO_FAILURE_COUNT incremented (exported)
│ └─ If counter ≥ 5, RUFLO_AVAILABLE set to false
│
└─ Caller decides: retry, skip, or disable ruflo
(never aborts pipeline due to fail-open semantics)
Chosen Approach: Use POSIX process groups (setsid + negative PID kill) with explicit TERM→KILL sequence, bash 3.2 compatible.
Why This Design:
- Correctness: Process groups capture ALL descendants (Node + grandchildren), not just direct children
- Portability: POSIX standard; works on Linux, macOS, BSD — no GNU-only dependencies
-
Compatibility: Bash 3.2 supports
setsidand negative PID inkill— no modern shell features required - Reliability: Two-phase kill (SIGTERM + grace + SIGKILL) ensures termination even if processes ignore SIGTERM
-
Safety: Fallback to
pkill -Pwhen process groups unavailable — degrades gracefully rather than leaking
| Alternative | Pros | Cons | Why Not Selected |
|---|---|---|---|
| Process groups + TERM→KILL (selected) | ✓ Captures all descendants ✓ POSIX standard ✓ Bash 3.2 compatible | Slightly more code | Solves root cause, most reliable |
Kill by name (pkill -f "ruflo") |
Simple, one line | Kills unrelated processes globally; huge blast radius | Too risky; no isolation |
System timeout binary + --kill-after
|
GNU standard tool | macOS timeout(1) doesn't support --kill-after; requires fallback anyway | Forces fallback complexity; we already have one |
| EXIT trap cleanup only | Single cleanup point | Doesn't clean mid-loop; 1.5 GB leak persists until pipeline end | Doesn't solve the core issue |
| Resource limits (ulimit -p) | Prevents spawn | Doesn't clean existing orphans; breaks legitimate multi-process pipelines | Incomplete solution |
| Cgroup isolation (Linux only) | Perfect isolation | Not available on macOS; breaks portability of test suite | Lost macOS support |
File: scripts/lib/ruflo-adapter.sh
-
Add
_kill_process_grouphelper (lines 366–372, beforeruflo_with_timeout)_kill_process_group() { local _sig="${1:-TERM}" _pid="${2:-}" [[ -z "$_pid" ]] && return 1 if kill -s "$_sig" -"$_pid" 2>/dev/null; then return 0 elif declare -f pkill >/dev/null 2>&1; then pkill -"${_sig,,}" -P "$_pid" 2>/dev/null || true fi return 1 }
-
Add
setsidto subprocess spawn (line 404)- Change:
( "$@" ) >"$_rft_tmp" & - To:
( setsid "$@" ) >"$_rft_tmp" &
- Change:
-
Replace direct
pkill -Pwith process group kill (line 416)- Change:
pkill -TERM -P "$bg_pid" 2>/dev/null || true - To:
_kill_process_group "TERM" "$bg_pid"
- Change:
-
Add explicit TERM→KILL grace period (after line 416)
sleep 1 _kill_process_group "KILL" "$bg_pid"
File: scripts/sw-ruflo-timeout-test.sh
- Add Test 8: Run 10 timeout iterations, spawn grandchild per iteration, verify zero orphans
- Update header: Note both issues #426 and #441 fixed
File: scripts/sw-e2e-smoke-test.sh
-
Baseline measurement: Before first test, capture
pgrep -c -f "node"count - Delta check: After final test, measure final count, assert delta ≤ 3 (allow variance)
- Zero orphaned processes after timeout (verified by pgrep post-test)
- All 8 tests pass in
sw-ruflo-timeout-test.sh(including new Test 8) - No FD regression (Tests 1–7 still pass — issue #426 fixed)
- Process delta ≤ 3 in smoke test (allowing natural variance)
- Manual 10-iteration test: ≤1 proc leak per iteration (vs. ~1.48 before fix)
- Bash 3.2 compatible — no modern shell constructs
- Fail-open semantics preserved — timeouts never block pipeline
- Temp file always cleaned — no FD leaks on any code path
- Circuit breaker still functional — failure count increments, ruflo disabled after threshold
| Risk | Likelihood | Mitigation |
|---|---|---|
setsid unavailable on some system |
Low | Fallback: pkill -P still works (suboptimal but functional) |
| Negative PID unsupported | Very low | POSIX standard since 1988; all major systems support |
| Grace period too short (1s) | Very low | Node responds <100ms; 1s is conservative |
| Grace period blocks pipeline | Low | Worst-case +260s for full test run (~4 min); acceptable |
| Trap re-entrancy | None | Only called once per ruflo_with_timeout invocation |
| Flock contention | None | Process group kill ensures all locks released |
| Silent failure (no error signal) | Low | Emit warning if kill -s fails; continue anyway |
-
Modified:
scripts/lib/ruflo-adapter.sh(lines 404, 416, +grace period, +helper) -
Modified:
scripts/sw-ruflo-timeout-test.sh(add Test 8) -
Modified:
scripts/sw-e2e-smoke-test.sh(baseline + delta check) - Related issues fixed: #426 (FD hang), #441 (process leak)