Skip to content

Pipeline Design 441

ezigus edited this page Apr 27, 2026 · 1 revision

Architecture Decision Record

Design: fix(harness): ruflo memory calls leak child Node processes


Context

Problem: The ruflo_with_timeout function in scripts/lib/ruflo-adapter.sh (lines 373–437) leaks child Node processes when timeouts fire. Each loop iteration leaks ~1.5 processes (385 orphaned procs × ~4 MB = 1.5 GB cumulative over ~260 iterations).

Root Cause: When ruflo_with_timeout invokes a shell function that spawns Node, the Node process becomes a child of the wrapper subshell. Node then spawns grandchildren (agentdb workers, LLM daemon processes). The current cleanup uses pkill -TERM -P "$bg_pid" (line 416), which kills direct children only. Grandchildren remain alive with Node as parent; when Node is killed, grandchildren become orphans, are adopted by init(1), and accumulate in memory.

Constraints:

  • Must support bash 3.2 (macOS default) — no associative arrays, no modern shell features
  • Must be fail-open: timeout failures never block the pipeline
  • Must preserve temp file cleanup on all code paths
  • Must handle systems without pkill binary

Component Diagram

┌─────────────────────────────────────────────────────────────────┐
│                    ruflo_with_timeout()                          │
│  (wrapper function responsible for timeout + cleanup)             │
├─────────────────────────────────────────────────────────────────┤
│                                                                   │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  Process Group Manager (setsid isolation)                │   │
│  │  ├─ Creates isolated process group at spawn (line 404)  │   │
│  │  └─ All descendants inherit group membership             │   │
│  └──────────────────────────────────────────────────────────┘   │
│                            │                                      │
│                            ├─ spawns ─────────────────────────┐   │
│                                                               │   │
│                         ┌──────────────────────────────────┐ │   │
│                         │  Shell Subshell (bg_pid)         │ │   │
│                         │  • Creates new process group     │ │   │
│                         │    via setsid                    │ │   │
│                         └──────────────────────────────────┘ │   │
│                                    │                         │   │
│                                    ├─ execs ─────────────┐  │   │
│                                                          │  │   │
│                         ┌──────────────────────────────┐│  │   │
│                         │  Ruflo/Node Process          ││  │   │
│                         │  • Same process group        ││  │   │
│                         │  • Parent = subshell         ││  │   │
│                         └──────────────────────────────┘│  │   │
│                                    │                    │  │   │
│                                    ├─ spawns ──────────┐ │  │   │
│                                                        │ │  │   │
│                     ┌──────────────────────────────┐┌─┘ │  │   │
│                     │  Grandchildren (agentdb,    ││   │  │   │
│                     │  LLM daemon, etc.)           ││   │  │   │
│                     │  • Same process group        ││   │  │   │
│                     │  • Parent = Node             ││   │  │   │
│                     └──────────────────────────────┘│   │  │   │
│                                                      │   │  │   │
│  ┌──────────────────────────────────────────────────┴───┴──────┐│
│  │  Signal Delivery Layer                                       ││
│  │  ├─ Timeout detected: kill -TERM -<negative_bg_pid>         ││
│  │  │  (kills ENTIRE process group, not just direct children) ││
│  │  ├─ Grace period: sleep 1                                   ││
│  │  └─ Force kill: kill -KILL -<negative_bg_pid>              ││
│  │     (ensures all descendants terminate)                     ││
│  └───────────────────────────────────────────────────────────┘│
│                                                                   │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  Resource Cleanup (guaranteed on all paths)              │   │
│  │  ├─ rm -f "$_rft_tmp"  (temp file)                       │   │
│  │  └─ wait $bg_pid       (reap zombie)                     │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                   │
└─────────────────────────────────────────────────────────────────┘

Interface Contracts

Public API: ruflo_with_timeout

// Runs a command with a wall-clock timeout and graceful process group termination.
// STDIN/STDOUT/STDERR are captured in a temp file and printed on success.
// On timeout or failure, circuit-breaker increments RUFLO_FAILURE_COUNT.

function ruflo_with_timeout(
  timeout_seconds: number,     // Wall-clock limit (1–3600 seconds)
  ...command: string[]          // Command + args to run (shell function or binary)
): {
  exit_code: number;            // 0 = success, 124 = timeout, >0 = command failure
  stdout: string;               // Output from command (printed to stdout)
  side_effects: {
    RUFLO_FAILURE_COUNT: number;     // Incremented on failure, exported
    RUFLO_AVAILABLE: boolean;        // May flip to false if failure_count >= 5
  };
}

// Error contracts:
// - Returns 1 (non-zero) immediately if ruflo is disabled (fail-open)
// - Catches mktemp failures → returns 1, doesn't run command
// - Catches timeout → sends SIGTERM then SIGKILL to process group
// - Catches wait failures → returns exit code from child
// - ALWAYS cleans up temp file, even on error

Internal Helper: _kill_process_group (NEW)

// Send a signal to an entire process group (all descendants of a parent PID).
// POSIX standard: negative PID = process group ID.
// Fallback to pkill when kill -s <sig> -<pid> fails.

function _kill_process_group(
  signal: string,           // "TERM" or "KILL" (uppercase, no SIG prefix)
  pid: number              // Process ID (will be negated to form group ID)
): {
  exit_code: number;       // 0 = signal sent, 1 = signal failed/not supported
  fallback_used: boolean;  // true if kill -s failed and pkill was used
}

Invariants

  • Process Group Isolation: Subshell runs with setsid ⟹ all children inherit group ID
  • Signal Propagation: kill -s <signal> -<pid> sends signal to entire group, not just direct children
  • Cleanup Guarantee: Temp file removed regardless of code path (success, timeout, error)
  • Circuit Breaker: Failure counter increments atomically; disabled once counter ≥ RUFLO_MAX_FAILURES (default 5)

Data Flow

Success Path

1. ruflo_with_timeout(30, "shell_fn", "arg1")
   │
   ├─ mktemp → _rft_tmp created
   │
   ├─ ( setsid shell_fn "arg1" ) >_rft_tmp &
   │  └─ Subshell spawned, running in new process group
   │     ├─ Inherits setsid group isolation
   │     ├─ Spawns Node → inherited group
   │     └─ Node spawns grandchildren → inherited group
   │
   ├─ Poll: while kill -0 $bg_pid && waited < 30
   │  ├─ kill -0 checks process still alive (no signal sent)
   │  └─ sleep 1 per iteration
   │
   ├─ Process completes
   │  └─ wait $bg_pid → exit 0
   │
   ├─ cat _rft_tmp → print output
   │
   └─ rm -f _rft_tmp → cleanup, return 0

Timeout Path

1. Wait loop reaches timeout_s seconds
   │
   ├─ kill -0 $bg_pid → still alive after timeout
   │
   ├─ Send SIGTERM to entire process group:
   │  └─ kill -TERM -<negative_bg_pid>  (negation sends to group)
   │
   ├─ Grace period: sleep 1
   │  ├─ Node gets SIGTERM, starts cleanup
   │  └─ Grandchildren get SIGTERM (same signal)
   │
   ├─ If still alive, send SIGKILL:
   │  └─ kill -KILL -<negative_bg_pid>
   │     ├─ Ungraceful termination
   │     └─ All descendants killed immediately
   │
   ├─ wait $bg_pid → reap zombie
   │
   ├─ rm -f _rft_tmp → cleanup
   │
   ├─ Increment RUFLO_FAILURE_COUNT (exported)
   │
   └─ return 124 (timeout exit code)

Fallback Path (no pkill/setsid)

1. If kill -TERM -<negative_pid> fails (not supported):
   │
   ├─ Call _kill_process_group with fallback:
   │  └─ pkill -TERM -P $bg_pid
   │     ├─ Kills direct children only (suboptimal)
   │     ├─ Grandchildren may survive
   │     └─ Better than nothing on systems without process group support
   │
   ├─ Emit warning: "Process group kill failed, using pkill fallback"
   │
   └─ Continue with grace period + SIGKILL

Error Boundaries

Component Responsibility Error Handling
Process Group Creator (setsid) Isolate all descendants If setsid unavailable: continue without group isolation; fallback to pkill -P
Timeout Watcher Detect wall-clock expiry Always succeeds (polling loop has failsafe bound)
Signal Sender Kill process group with SIGTERM/SIGKILL Errors caught; fallback to pkill; always proceeds with grace period
Process Reaper (wait) Collect exit status, prevent zombies Errors suppressed (process may already be dead); captured as exit code
Temp File Manager Clean up FD resources Always runs; errors suppressed to prevent abort-on-cleanup (fail-open)
Circuit Breaker Disable ruflo after N failures Increment atomic counter; exported immediately; threshold checked at next call

Error Propagation

Caller (e.g., ruflo_hive_init)
   │
   └─ ruflo_with_timeout(30, "some_cmd")
      │
      ├─ Any error inside → exit_code = 1 or 124
      │  └─ Caller receives non-zero exit
      │  └─ RUFLO_FAILURE_COUNT incremented (exported)
      │  └─ If counter ≥ 5, RUFLO_AVAILABLE set to false
      │
      └─ Caller decides: retry, skip, or disable ruflo
         (never aborts pipeline due to fail-open semantics)

Decision

Chosen Approach: Use POSIX process groups (setsid + negative PID kill) with explicit TERM→KILL sequence, bash 3.2 compatible.

Why This Design:

  1. Correctness: Process groups capture ALL descendants (Node + grandchildren), not just direct children
  2. Portability: POSIX standard; works on Linux, macOS, BSD — no GNU-only dependencies
  3. Compatibility: Bash 3.2 supports setsid and negative PID in kill — no modern shell features required
  4. Reliability: Two-phase kill (SIGTERM + grace + SIGKILL) ensures termination even if processes ignore SIGTERM
  5. Safety: Fallback to pkill -P when process groups unavailable — degrades gracefully rather than leaking

Alternatives Considered

Alternative Pros Cons Why Not Selected
Process groups + TERM→KILL (selected) ✓ Captures all descendants ✓ POSIX standard ✓ Bash 3.2 compatible Slightly more code Solves root cause, most reliable
Kill by name (pkill -f "ruflo") Simple, one line Kills unrelated processes globally; huge blast radius Too risky; no isolation
System timeout binary + --kill-after GNU standard tool macOS timeout(1) doesn't support --kill-after; requires fallback anyway Forces fallback complexity; we already have one
EXIT trap cleanup only Single cleanup point Doesn't clean mid-loop; 1.5 GB leak persists until pipeline end Doesn't solve the core issue
Resource limits (ulimit -p) Prevents spawn Doesn't clean existing orphans; breaks legitimate multi-process pipelines Incomplete solution
Cgroup isolation (Linux only) Perfect isolation Not available on macOS; breaks portability of test suite Lost macOS support

Implementation Plan

Phase 1: Core Fix

File: scripts/lib/ruflo-adapter.sh

  1. Add _kill_process_group helper (lines 366–372, before ruflo_with_timeout)

    _kill_process_group() {
      local _sig="${1:-TERM}" _pid="${2:-}"
      [[ -z "$_pid" ]] && return 1
      if kill -s "$_sig" -"$_pid" 2>/dev/null; then
        return 0
      elif declare -f pkill >/dev/null 2>&1; then
        pkill -"${_sig,,}" -P "$_pid" 2>/dev/null || true
      fi
      return 1
    }
  2. Add setsid to subprocess spawn (line 404)

    • Change: ( "$@" ) >"$_rft_tmp" &
    • To: ( setsid "$@" ) >"$_rft_tmp" &
  3. Replace direct pkill -P with process group kill (line 416)

    • Change: pkill -TERM -P "$bg_pid" 2>/dev/null || true
    • To: _kill_process_group "TERM" "$bg_pid"
  4. Add explicit TERM→KILL grace period (after line 416)

    sleep 1
    _kill_process_group "KILL" "$bg_pid"

Phase 2: Test Coverage

File: scripts/sw-ruflo-timeout-test.sh

  1. Add Test 8: Run 10 timeout iterations, spawn grandchild per iteration, verify zero orphans
  2. Update header: Note both issues #426 and #441 fixed

Phase 3: Integration Validation

File: scripts/sw-e2e-smoke-test.sh

  1. Baseline measurement: Before first test, capture pgrep -c -f "node" count
  2. Delta check: After final test, measure final count, assert delta ≤ 3 (allow variance)

Validation Criteria

  • Zero orphaned processes after timeout (verified by pgrep post-test)
  • All 8 tests pass in sw-ruflo-timeout-test.sh (including new Test 8)
  • No FD regression (Tests 1–7 still pass — issue #426 fixed)
  • Process delta ≤ 3 in smoke test (allowing natural variance)
  • Manual 10-iteration test: ≤1 proc leak per iteration (vs. ~1.48 before fix)
  • Bash 3.2 compatible — no modern shell constructs
  • Fail-open semantics preserved — timeouts never block pipeline
  • Temp file always cleaned — no FD leaks on any code path
  • Circuit breaker still functional — failure count increments, ruflo disabled after threshold

Risk Areas

Risk Likelihood Mitigation
setsid unavailable on some system Low Fallback: pkill -P still works (suboptimal but functional)
Negative PID unsupported Very low POSIX standard since 1988; all major systems support
Grace period too short (1s) Very low Node responds <100ms; 1s is conservative
Grace period blocks pipeline Low Worst-case +260s for full test run (~4 min); acceptable
Trap re-entrancy None Only called once per ruflo_with_timeout invocation
Flock contention None Process group kill ensures all locks released
Silent failure (no error signal) Low Emit warning if kill -s fails; continue anyway

Key Files

  • Modified: scripts/lib/ruflo-adapter.sh (lines 404, 416, +grace period, +helper)
  • Modified: scripts/sw-ruflo-timeout-test.sh (add Test 8)
  • Modified: scripts/sw-e2e-smoke-test.sh (baseline + delta check)
  • Related issues fixed: #426 (FD hang), #441 (process leak)

Clone this wiki locally