Skip to content

Container auto-fix: red container → configurable agent session, with safeguards#35

Open
falkoro wants to merge 2 commits into
masterfrom
feat/container-autofix
Open

Container auto-fix: red container → configurable agent session, with safeguards#35
falkoro wants to merge 2 commits into
masterfrom
feat/container-autofix

Conversation

@falkoro
Copy link
Copy Markdown
Owner

@falkoro falkoro commented Jun 3, 2026

What

A red container (unhealthy / restarting / crashed) gets a Fix button (🔧). Clicking it opens a danger-gated modal; on confirm, ShellDeck spawns a tmux session running a configurable agent, seeded with the container's read-only diagnostics, to attempt a safe remediation while you watch it live.

  • Configurable "fix agents" registry — {label, command}, seeded with Claude/Grok/Codex, managed in-app (a "Fix agents" editor; GET/POST /api/fix-agents). {prompt} / {promptfile} placeholders are substituted, else the prompt is appended.
  • Per-host protected flag → propose-only. On a protected host the agent is told to inspect read-only and print recommended commands, run nothing. Unprotected hosts attempt a safe fix. (Checkbox in the remote-hosts editor.)
  • Danger gate. Modal shows host/container, a red "potentially dangerous" warning (or blue propose-only banner), an agent picker, and a required acknowledgement checkbox — Confirm is disabled until it's checked.
  • Read-only diagnostic (ps -a / inspect / logs --tail 200, zero mutating verbs). Prompt is written to a file and passed quoted. A destructive-verb ban (rm/rmi/compose down -v/volume prune/system prune/kill/reboot…) is baked into the agent's instructions, scoped to only the target container.
  • Fix sessions are dynamic sessions — they appear as normal cards and close for real via POST /api/sessions/remove (kill + drop).

Safety model (read this)

ShellDeck cannot hard-sandbox an interactive agent session on a host — so protection is layered, not absolute:

  1. danger-ack gate before spawn, 2. read-only diagnostic, 3. protected→propose-only prompt mode, 4. destructive-verb ban in the prompt.

Two residual risks, both LOW/by-design (from the security review):

  • propose-only is advisory (prompt-level; not server-enforced). A protected host gives no technical guarantee against mutation if the agent ignores the prompt.
  • a hostile container's logs are embedded in the prompt → prompt-injection risk against the fixer agent (out of the shell-injection threat model, but real).

Security review

Independent security-reviewer pass: no CRITICAL/HIGH/MED findings. No command-injection path — name/engine/target are charset-validated to a metacharacter-free set, the attacker-influenced prompt/diagnostics are shell_word single-quote-escaped (or never inlined on the default branch), and the quoting composes correctly through tmux → sh -c → zsh -lic. All write endpoints are login+unlock+CSRF-gated; GET /api/fix-agents is unlock-gated; write_prompt can't escape its dir.

⚠️ Overlap with #34

This branch adds its own src/dynamic_sessions.rs (master lacks it). Its DynamicSession is {name,label,family,alias,badge,start}, which diverges from #34's {name,label,start,cwd}. Whichever lands second needs a small struct-unification merge. Recommend merging #34 first, then I'll reconcile this on top (one dynamic_sessions module).

Verification

  • 40 Rust tests pass (incl. new fix.rs guards: quoting can't be broken by attacker prompt, diagnostic is read-only, session names stay in charset).
  • Browser-verified on an isolated instance against a live (unhealthy) container: Fix button only on red cards; modal agent picker + ack-gates-Confirm + danger banner; fix-agents editor; protected checkbox; end-to-end spawn captured live diagnostics + correct SAFE-FIX instructions, and close killed the tmux session + cleared dynamic-sessions.
  • No cargo fmt churn; compact style preserved.

Note for the productization track

The spawn shell is hardcoded zsh (fine for this box). For the open-source/enterprise build it should honor $SHELL/a configured default-command — follow-up, not a blocker.

🤖 Generated with Claude Code

A red (unhealthy/restarting/crashed) container gets a Fix button that spawns a
tmux session running a configurable agent, seeded with the container's
read-only diagnostics, to attempt a safe remediation while the operator watches.

- Configurable "fix agents" registry (label + command), seeded with
  Claude/Grok/Codex, editable in-app via /api/fix-agents.
- Per-host `protected` flag → propose-only mode (agent inspects + prints
  recommended commands, runs nothing). Advisory (prompt-level).
- Danger modal: agent picker + required acknowledgement gate before spawn.
- Read-only diagnostic (ps/inspect/logs, no mutating verbs); prompt written to
  a file and passed quoted; destructive-verb ban baked into the agent prompt.
- Fix sessions are dynamic sessions: appear as cards, real close via
  /api/sessions/remove (kill + drop from dashboard).
- All write endpoints login+unlock+CSRF gated; GET /api/fix-agents unlock-gated.

Security-reviewed: no injection (charset-validated name/engine/target +
shell_word escaping holds through tmux→sh -c→zsh -lic). 40 tests pass;
UI + spawn/close browser-verified against a live unhealthy container.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 3, 2026 14:49
…cord

Commit security review caught a protected-host safety-gate bypass: the SSH
target was taken from the client request while `protected` was recomputed by
re-matching that target — so addressing a protected host via an alternate
target string (IP vs hostname, different user@) failed the string match,
propose became false, and the agent would run in full mutating mode against a
host the operator marked protected.

resolve_host() now derives target + protected + label from ONE record: a named
host (id/label) is authoritative and the client-sent target is ignored; a
free-form target to an unconfigured host fails safe to propose-only; a matching
record honours its own flag. Covered by unit tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a guarded “Fix” workflow for unhealthy/restarting/crashed containers that spawns a tmux-backed “fix agent” session seeded with read-only diagnostics, plus new runtime configuration for fix agents, protected remote hosts (propose-only mode), and persisted dynamic sessions.

Changes:

  • Add backend support for /api/fix and /api/fix-agents (load/save), including prompt-file generation and tmux session spawning.
  • Introduce persisted “dynamic sessions” and merge them into the session list; add a new “remove” endpoint for sessions.
  • Update the UI to show a Fix (🔧) button on “red” containers, a danger-gated modal, a “Fix agents” editor, and a protected flag in the remote-host editor.

Reviewed changes

Copilot reviewed 14 out of 15 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
src/tmux.rs Adds dynamic sessions into the session model and introduces create_dynamic_session.
src/routes.rs Adds /api/fix, /api/fix-agents, and /api/sessions/remove; adjusts stop behavior to remove dynamic entries.
src/remote_hosts.rs Persists/normalizes new protected flag for remote hosts and updates tests.
src/main.rs Wires new modules (dynamic_sessions, fix, fix_agents).
src/fix.rs Implements fix-session creation: diagnostics, prompt generation, agent command construction, and dynamic session spawn.
src/fix_agents.rs Implements fix-agent registry load/save with normalization and seeding.
src/dynamic_sessions.rs Implements persistence layer for dynamic sessions (load/save/remove).
src/config.rs Adds config paths for fix agents and dynamic sessions, plus protected on remote hosts.
public/metrics.js Adds Fix button + danger-gated modal + /api/fix call in served JS.
public/events.js Adds “Fix agents” button to the Configure area.
public/actions.js Reworks remote-host editor into row UI with protected; adds fix-agents editor.
frontend/metrics.ts TypeScript source for Fix button + modal + /api/fix call.
frontend/events.ts TypeScript source for “Fix agents” button insertion.
frontend/actions.ts TypeScript source for remote-host editor rows + protected, plus fix-agents editor.
.gitignore Ignores fix-agents.json, dynamic-sessions.json, fix-prompts/, and local Playwright scratch.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/fix.rs
Comment on lines +58 to +63

// Resolve the SSH target, propose-only flag, and display label from ONE authoritative source.
// The `protected` (propose-only) gate must be decided by the same record that supplies the
// target, so a client can't dodge a protected host by sending an alternate target string that
// fails to string-match the configured record. Returns (target, propose_only, host_label).
fn resolve_host(
Comment thread src/tmux.rs
Comment on lines +396 to +400
pub async fn create_dynamic_session(config: Arc<Config>, spec: KnownSession) -> Result<String, String> {
dynamic_sessions::save_entry(config.clone(), &spec).await?;
match launch_spec(&spec, "created").await {
Ok(message) => Ok(message),
Err(error) => {
Comment thread src/routes.rs
Comment on lines +730 to 743
async fn api_session_remove(
State(state): State<AppState>,
headers: HeaderMap,
connect: ConnectInfo<SocketAddr>,
axum::Json(body): axum::Json<NameBody>,
) -> Response {
if let Some(response) = guard(&state, &headers, &connect) {
return response;
}
if let Err(response) = require_unlock(&state, &headers).and_then(|_| require_action(&headers)) {
return response;
}
stop_session_result(state.config.clone(), &body.name).await
}
Comment thread src/routes.rs
Comment on lines +1030 to +1044
async fn stop_session_result(config: std::sync::Arc<config::Config>, name: &str) -> Response {
match tmux::stop_session(name).await {
Ok(message) => {
let _ = dynamic_sessions::remove(config, name).await;
webutil::json_response(
StatusCode::OK,
&serde_json::json!({ "ok": true, "message": message }),
)
}
Err(error) => webutil::json_response(
StatusCode::BAD_REQUEST,
&serde_json::json!({ "error": error }),
),
}
}
Comment thread frontend/actions.ts
Comment on lines +509 to 523
function readRemoteHostRows(root: HTMLElement): RemoteHostEntry[] {
return Array.from(root.querySelectorAll<HTMLElement>('[data-remote-host-row]'))
.map((row) => {
const value = (field: string): string => row.querySelector<HTMLInputElement>(`[data-host-field="${field}"]`)?.value.trim() || '';
const label = value('label');
const id = value('id') || idFromLabel(label);
return {
id,
label,
target: value('target'),
protected: Boolean(row.querySelector<HTMLInputElement>('[data-host-field="protected"]')?.checked),
};
})
.filter((host) => host.id && host.label && host.target);
}
Comment thread public/actions.js
Comment on lines +498 to 512
function readRemoteHostRows(root) {
return Array.from(root.querySelectorAll('[data-remote-host-row]'))
.map((row) => {
const value = (field) => row.querySelector(`[data-host-field="${field}"]`)?.value.trim() || '';
const label = value('label');
const id = value('id') || idFromLabel(label);
return {
id,
label,
target: value('target'),
protected: Boolean(row.querySelector('[data-host-field="protected"]')?.checked),
};
})
.filter((host) => host.id && host.label && host.target);
}
Comment thread frontend/metrics.ts
Comment on lines +570 to +571
await Promise.all([loadFixAgents(), loadRemoteHostConfig()]);
const remote = remoteHostConfig.find((item) => item.id === host || item.label === host);
Comment thread public/metrics.js
Comment on lines +542 to +543
await Promise.all([loadFixAgents(), loadRemoteHostConfig()]);
const remote = remoteHostConfig.find((item) => item.id === host || item.label === host);
@falkoro
Copy link
Copy Markdown
Owner Author

falkoro commented Jun 7, 2026

🧠 Grok Composer review (grok-composer-2.5)

Review (diff only)

Scope note: The diff is truncated in the middle (container-fix handler, fix_agents / dynamic_sessions modules, and related routes). Findings below are from the visible hunks only.

1. Brand consistency

Clean. No header/footer logos, favicons, og:image/schema, hero mockups, or marketing copy in this diff.

2. Spot Suite architecture

Not applicable / clean for suite rules. This is a local ops dashboard (tmux, SSH, Docker/Podman, JSON config on disk)—not tenant-scoped Workers/D1/OAuth/SECRETS KV.

Positive:

  • .gitignore:12-16fix-agents.json, dynamic-sessions.json, fix-prompts/, .playwright-mcp/ kept out of git.
  • main.rs (fix-agents save handler) — require_unlock + require_action before persisting agents.

No flags for: client-supplied tenant_id, unscoped D1, committed secrets, missing OAuth/cron/migration, or marketing Pages deploy.

3. Correctness + security

Medium — confirm server enforces protected (not visible in diff).
frontend/actions.ts:490 adds protected?: boolean and frontend/actions.ts:507 exposes it in the remote-host editor; frontend/metrics.ts:248 shows a Fix action for unhealthy containers. If the container-fix API only checks unlock/action headers and does not read protected from persisted host config, production hosts can be auto-fixed via direct API call. Needs server-side check in the truncated fix handler.

Medium — arbitrary shell execution surface (likely intentional, still worth hardening).
frontend/actions.ts:576 stores free-form command for fix agents; src/tmux.rs:391 runs spec.start via tmux new-session. Anyone who can unlock the dashboard and save agents / spawn dynamic sessions effectively has shell access. Acceptable for a personal ops tool; risky if this endpoint is ever network-exposed beyond localhost/trusted operators. Consider allowlisting agent IDs/commands or validating against configured agents only at execution time.

Low — dynamic-session cleanup looks correct.
src/tmux.rs:399-406 rolls back dynamic_sessions on failed launch; main.rs:1030-1042 stop_session_result removes dynamic session on stop.

Low — UI escaping is consistent.
frontend/actions.ts:507, frontend/actions.ts:576, frontend/metrics.ts:248 use escapeHtml for injected attribute values.

Low — frontend/events.ts:217-224 duplicates editRemoteHostsBtn lookup to inject the Fix agents button. Works, but brittle if markup already includes #editFixAgentsBtn.

Clean: No customer-facing “workspace” copy; protected-host label says “production,” not suite tenant terminology.


Verdict: Solid ops-dashboard extension with sensible gitignore and auth on the visible fix-agents API; verify truncated server code enforces protected on fix and treats fix-agent commands as privileged shell, not just validated form input.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants