Skip to content

feat: never leave child processes alive after the parent#1886

Merged
paul-nechifor merged 1 commit intodevfrom
paul/feat/process-cleanup
Apr 21, 2026
Merged

feat: never leave child processes alive after the parent#1886
paul-nechifor merged 1 commit intodevfrom
paul/feat/process-cleanup

Conversation

@paul-nechifor
Copy link
Copy Markdown
Contributor

Problem

Certain child processes might not get shut down correctly, especially processes started by modules.

Closes DIM-XXX

Solution

  • Mark all processes by adding a DIMOS_RUN_ID env var for them.
  • Start a watchdog process which watches for the main process to die. When it dies, all children (and children of children, etc) are killed.
  • Also kill when a new process starts.

Breaking Changes

None.

How to Test

Contributor License Agreement

  • I have read and approved the CLA.

@paul-nechifor paul-nechifor enabled auto-merge (squash) April 20, 2026 23:36
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 20, 2026

Greptile Summary

This PR introduces a three-layer orphan-process prevention system: every child process inherits a DIMOS_RUN_ID env var, a watchdog sidecar kills remaining children when the main PID disappears, and cleanup_stale() sweeps survivors at the next dimos run. The design is sound overall and covers the SIGKILL case that purely signal-based teardown misses.

  • P1 (process_lifecycle.py): wait_for_pid_exit treats PermissionError as "still alive." If the watched PID is recycled by a higher-privileged process, the watchdog loops indefinitely and never sweeps the run's children, creating a permanent orphan watchdog and surviving worker processes.

Confidence Score: 4/5

Safe to merge for most deployments, but the PID-reuse infinite-loop in wait_for_pid_exit should be addressed before shipping to production environments with high process churn.

One P1 finding: PermissionError in wait_for_pid_exit can cause an infinite watchdog loop on PID reuse under a different UID. The scenario is rare on a dedicated robotics system but is a correctness issue with no existing fallback once the watchdog stalls. All other findings are P2.

dimos/core/coordination/process_lifecycle.py — specifically the wait_for_pid_exit function's PermissionError handling.

Important Files Changed

Filename Overview
dimos/core/coordination/process_lifecycle.py New module: provides kill_run_processes (psutil env-scan sweep), spawn_watchdog (sidecar Popen), and wait_for_pid_exit (poll loop). The PermissionError branch in wait_for_pid_exit can cause an infinite loop on PID reuse by a privileged process; watchdog stderr is silently discarded.
dimos/core/coordination/watchdog_main.py New sidecar entry-point: waits for the main PID to disappear, sleeps 0.5 s grace period, then calls kill_run_processes. Logic is straightforward; correctness depends on wait_for_pid_exit behaving correctly.
dimos/core/coordination/test_process_lifecycle.py New test suite covering kill targeting, exclusion, SIGTERM escalation, psutil error tolerance, PID-exit wait, and watchdog sweep. The watchdog integration test spawns the watchdog without stripping DIMOS_RUN_ID_ENV, so the watchdog itself is tagged with the run ID and may not be cleaned up after the test.
dimos/core/daemon.py Refactored: install_signal_handlers now calls kill_run_processes in the SIGTERM/SIGINT handler. The double entry.remove() on SIGTERM in the foreground path is safe but redundant.
dimos/core/run_registry.py Added cleanup_stale which sweeps descendants of dead runs before removing their registry entries, providing a second-line defence if the watchdog also died. Logic is correct and safe for concurrent dimos run invocations.
dimos/robot/cli/dimos.py Wires spawn_watchdog into both daemon and foreground startup paths; sets DIMOS_RUN_ID_ENV on os.environ before any children are forked. Double entry.remove() on foreground SIGTERM is safe but a minor control-flow issue.

Sequence Diagram

sequenceDiagram
    participant CLI as dimos run (main)
    participant Coord as ModuleCoordinator(workers)
    participant WD as watchdog_main sidecar
    participant Reg as RunRegistry

    CLI->>CLI: os.environ[DIMOS_RUN_ID_ENV] = run_id
    CLI->>Coord: ModuleCoordinator.build(blueprint)
    CLI->>Reg: entry.save()
    CLI->>WD: spawn_watchdog(run_id) — no DIMOS_RUN_ID_ENV in env
    activate WD
    WD->>WD: wait_for_pid_exit(main_pid)

    Note over CLI,Coord: Normal or signal-triggered shutdown

    CLI->>Coord: coordinator.stop()
    CLI->>CLI: kill_run_processes(run_id) [excludes self]
    CLI->>Reg: entry.remove()
    CLI->>CLI: sys.exit(0) / process ends

    WD->>WD: PID gone, sleep 0.5 s grace
    WD->>WD: kill_run_processes(run_id) — sweeps any survivors
    deactivate WD

    Note over Reg: Next dimos run calls cleanup_stale() as third-line safety net
Loading

Comments Outside Diff (1)

  1. dimos/robot/cli/dimos.py, line 327-330 (link)

    P2 Double entry.remove() on SIGTERM in foreground mode

    When SIGTERM is received while coordinator.loop() is running, _shutdown in daemon.py calls entry.remove() and then sys.exit(0). The resulting SystemExit propagates through the try block and hits the finally clause, which calls entry.remove() a second time. The second call is harmless (missing_ok=True), but this is a subtle control-flow gap worth closing.

Reviews (1): Last reviewed commit: "feat: never leave child processes alive ..." | Re-trigger Greptile

Comment thread dimos/core/coordination/process_lifecycle.py
Comment thread dimos/core/coordination/process_lifecycle.py
Copy link
Copy Markdown
Contributor

@leshy leshy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, is it hard to give watchdog voice in the logs?
I'd like to get a warning when it's killing some runaway children, what do you think?

@paul-nechifor paul-nechifor merged commit 2eb12b8 into dev Apr 21, 2026
4 checks passed
@paul-nechifor paul-nechifor deleted the paul/feat/process-cleanup branch April 21, 2026 09:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants