Client::stop() SIGKILLs CLI without grace period — orphans MCP child processes downstream #1381

@austenstone

Summary

Client::stop() in src/lib.rs SIGKILLs the agent runtime CLI immediately after session.destroy returns, with no SIGTERM grace period. This races with the runtime's own MCP cleanup and causes orphaned MCP stdio child processes to accumulate across normal app restarts in every downstream consumer (notably the GitHub Copilot Tauri app).

Current behavior

// crates/copilot-sdk/src/lib.rs:1894 (vendored into github/github-app)
if let Some(mut child) = child
    && let Err(e) = child.kill().await   // tokio Child::kill = SIGKILL on Unix
{
    errors.push(Error::Io(e));
}

Tokio's Child::kill() is unconditionally SIGKILL on Unix. SIGKILL is uncatchable, so the runtime's MCP cleanup (which calls a synchronous pgrep -P enumeration inside the Node process before signaling descendants via process.kill) is interrupted mid-flight whenever cleanup takes longer than the few ms between session.destroy returning and SIGKILL landing.

This nullifies the protections from:

  • copilot-agent-runtime PR #7517 (killProcessTree on transport close)
  • copilot-agent-runtime PR #8103 (fire-and-forget MCP shutdown in dispose())

Both have shipped and both are verified present in the running runtime, yet the leak still happens.
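
To make the catchable-versus-uncatchable distinction concrete, here is a minimal, Unix-only sketch using only the standard library. It is not the SDK or the runtime: FAKE_RUNTIME is a hypothetical shell stand-in whose SIGTERM trap plays the role of the MCP cleanup, and the signal is sent through the shell's builtin kill rather than a Rust signal API.

```rust
use std::io::Read;
use std::process::{Command, Stdio};

/// Hypothetical stand-in for the runtime: prints "ready" once its SIGTERM
/// handler is installed, then idles. The handler prints "cleanup" and exits.
const FAKE_RUNTIME: &str =
    "trap 'echo cleanup; exit 0' TERM; echo ready; while :; do sleep 1; done";

/// Spawn the stand-in, send it `signal` ("TERM" or "KILL"), and return
/// everything it printed after the "ready" handshake.
fn stop_with(signal: &str) -> std::io::Result<String> {
    let mut child = Command::new("sh")
        .arg("-c")
        .arg(FAKE_RUNTIME)
        .stdout(Stdio::piped())
        .spawn()?;
    let mut stdout = child.stdout.take().expect("stdout was piped");

    // Wait for "ready\n" so the trap is known to be installed.
    let mut ready = [0u8; 6];
    stdout.read_exact(&mut ready)?;

    // Deliver the signal to the stand-in only (not its process group).
    Command::new("sh")
        .arg("-c")
        .arg(format!("kill -{signal} {}", child.id()))
        .status()?;

    // Read until every holder of the pipe's write end has exited.
    let mut rest = String::new();
    stdout.read_to_string(&mut rest)?;
    child.wait()?;
    Ok(rest)
}
```

With "TERM" the trap gets a chance to run and its output is captured; with "KILL" the shell dies before any handler can execute, which is the situation the runtime's cleanup is in today.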

Evidence

On my machine running github-app 0aa3a6b41 (May 21 build) + copilot-agent-runtime built from HEAD (May 20), after 3 days of normal Copilot usage:

11 orphaned `uv tool uvx microsoft-fabric-rti-mcp` processes
~220 MB resident total
oldest 3 days, newest 1 hour old (well after #8103 merged)
all ppid=1 (reparented to init)
each carries a live python child also orphaned (22 leaked processes total)

Detection:

ps -Ao pid,ppid,etime,rss,command | awk '$2==1 && /uvx|fabric-rti/'

The leak is reliably reproducible by quitting GitHub Copilot. Node-based MCP servers (npm-exec'd) clean up correctly via stdin-EOF voluntary exit; uvx-launched servers do not because uv tool uvx is a Rust supervisor that holds the child's stdin pipe open from its own side, so the python child never sees EOF. The only reliable cleanup path for uvx-launched servers is the runtime's killProcessTree — which is racing the SDK's SIGKILL and losing.
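
For anyone who wants the detection step as a check in Rust rather than a shell one-liner, the awk filter above can be sketched roughly as follows. ProcRow, parse_row, and orphaned_mcp_servers are hypothetical names for illustration. Unlike the awk version, which matches against the whole line, this sketch matches only the command column.

```rust
/// One row of `ps -Ao pid,ppid,etime,rss,command` (hypothetical helper type).
#[derive(Debug, PartialEq)]
struct ProcRow {
    pid: u32,
    ppid: u32,
    etime: String,
    rss_kb: u64,
    command: String,
}

/// Parse a single ps row; returns None for the header or malformed lines.
fn parse_row(line: &str) -> Option<ProcRow> {
    let mut fields = line.split_whitespace();
    let pid = fields.next()?.parse().ok()?;
    let ppid = fields.next()?.parse().ok()?;
    let etime = fields.next()?.to_string();
    let rss_kb = fields.next()?.parse().ok()?;
    // Everything after the fourth column is the command line.
    let command = fields.collect::<Vec<_>>().join(" ");
    if command.is_empty() {
        return None;
    }
    Some(ProcRow { pid, ppid, etime, rss_kb, command })
}

/// Keep rows that were reparented to init (ppid == 1) and look like the
/// leaked servers, mirroring `awk '$2==1 && /uvx|fabric-rti/'`.
fn orphaned_mcp_servers(ps_output: &str) -> Vec<ProcRow> {
    ps_output
        .lines()
        .filter_map(parse_row)
        .filter(|row| row.ppid == 1)
        .filter(|row| row.command.contains("uvx") || row.command.contains("fabric-rti"))
        .collect()
}
```

Feeding it the captured stdout of the ps command above and asserting that the result is empty after quitting the app would turn the manual check into a regression test.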

Proposed fix

In Client::stop(), replace child.kill().await with a SIGTERM-then-SIGKILL escalation:

#[cfg(unix)]
{
    use nix::sys::signal::{self, Signal};
    use nix::unistd::Pid;
    if let Some(pid_raw) = child.id() {
        let _ = signal::kill(Pid::from_raw(pid_raw as i32), Signal::SIGTERM);
    }
    match tokio::time::timeout(std::time::Duration::from_secs(3), child.wait()).await {
        Ok(_) => {}  // graceful exit, killProcessTree had time to run
        Err(_) => {
            if let Err(e) = child.kill().await {
                errors.push(Error::Io(e));
            }
        }
    }
}
#[cfg(not(unix))]
{
    if let Err(e) = child.kill().await {
        errors.push(Error::Io(e));
    }
}

Pair this with a runtime-side fix that installs process.on("SIGTERM"|"SIGINT") handlers in the CLI entrypoint and awaits session.dispose(). That fix is also needed because today the runtime has no signal handlers at all: grep process.on.*SIG src --include="*.ts" returns zero hits. Tracked at github/copilot-agent-runtime#8598.
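
To experiment with the SIGTERM-then-SIGKILL escalation outside the SDK, here is a blocking, standard-library-only sketch of the same idea. It is not the proposed patch: stop_with_grace and Stopped are hypothetical names, it polls std::process::Child::try_wait instead of awaiting tokio's child.wait(), and it sends SIGTERM through the shell's builtin kill as a substitute for nix::sys::signal::kill.

```rust
use std::io;
use std::process::{Child, Command, ExitStatus};
use std::thread::sleep;
use std::time::{Duration, Instant};

/// How the child ended up exiting (hypothetical type for this sketch).
#[derive(Debug)]
enum Stopped {
    /// The child exited on its own within the grace period.
    Graceful(ExitStatus),
    /// The grace period elapsed and the child was SIGKILLed.
    Forced(ExitStatus),
}

/// Send SIGTERM, wait up to `grace` for the child to exit, then SIGKILL it.
fn stop_with_grace(child: &mut Child, grace: Duration) -> io::Result<Stopped> {
    // Ask politely first so the child's own cleanup has a chance to run.
    Command::new("sh")
        .arg("-c")
        .arg(format!("kill -TERM {}", child.id()))
        .status()?;

    // Poll for exit until the deadline passes.
    let deadline = Instant::now() + grace;
    loop {
        if let Some(status) = child.try_wait()? {
            return Ok(Stopped::Graceful(status));
        }
        if Instant::now() >= deadline {
            break;
        }
        sleep(Duration::from_millis(20));
    }

    // Grace period exhausted: fall back to SIGKILL and reap the child.
    child.kill()?;
    Ok(Stopped::Forced(child.wait()?))
}
```

The grace period is a parameter here so that tests can use a short value. The 3 seconds in the proposal above is the figure to carry over into the SDK.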
