Container auto-fix: red container → configurable agent session, with safeguards#35
Container auto-fix: red container → configurable agent session, with safeguards#35falkoro wants to merge 2 commits into
Conversation
A red (unhealthy/restarting/crashed) container gets a Fix button that spawns a tmux session running a configurable agent, seeded with the container's read-only diagnostics, to attempt a safe remediation while the operator watches. - Configurable "fix agents" registry (label + command), seeded with Claude/Grok/Codex, editable in-app via /api/fix-agents. - Per-host `protected` flag → propose-only mode (agent inspects + prints recommended commands, runs nothing). Advisory (prompt-level). - Danger modal: agent picker + required acknowledgement gate before spawn. - Read-only diagnostic (ps/inspect/logs, no mutating verbs); prompt written to a file and passed quoted; destructive-verb ban baked into the agent prompt. - Fix sessions are dynamic sessions: appear as cards, real close via /api/sessions/remove (kill + drop from dashboard). - All write endpoints login+unlock+CSRF gated; GET /api/fix-agents unlock-gated. Security-reviewed: no injection (charset-validated name/engine/target + shell_word escaping holds through tmux→sh -c→zsh -lic). 40 tests pass; UI + spawn/close browser-verified against a live unhealthy container. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…cord Commit security review caught a protected-host safety-gate bypass: the SSH target was taken from the client request while `protected` was recomputed by re-matching that target — so addressing a protected host via an alternate target string (IP vs hostname, different user@) failed the string match, propose became false, and the agent would run in full mutating mode against a host the operator marked protected. resolve_host() now derives target + protected + label from ONE record: a named host (id/label) is authoritative and the client-sent target is ignored; a free-form target to an unconfigured host fails safe to propose-only; a matching record honours its own flag. Covered by unit tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR adds a guarded “Fix” workflow for unhealthy/restarting/crashed containers that spawns a tmux-backed “fix agent” session seeded with read-only diagnostics, plus new runtime configuration for fix agents, protected remote hosts (propose-only mode), and persisted dynamic sessions.
Changes:
- Add backend support for
/api/fixand/api/fix-agents(load/save), including prompt-file generation and tmux session spawning. - Introduce persisted “dynamic sessions” and merge them into the session list; add a new “remove” endpoint for sessions.
- Update the UI to show a Fix (🔧) button on “red” containers, a danger-gated modal, a “Fix agents” editor, and a
protectedflag in the remote-host editor.
Reviewed changes
Copilot reviewed 14 out of 15 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| src/tmux.rs | Adds dynamic sessions into the session model and introduces create_dynamic_session. |
| src/routes.rs | Adds /api/fix, /api/fix-agents, and /api/sessions/remove; adjusts stop behavior to remove dynamic entries. |
| src/remote_hosts.rs | Persists/normalizes new protected flag for remote hosts and updates tests. |
| src/main.rs | Wires new modules (dynamic_sessions, fix, fix_agents). |
| src/fix.rs | Implements fix-session creation: diagnostics, prompt generation, agent command construction, and dynamic session spawn. |
| src/fix_agents.rs | Implements fix-agent registry load/save with normalization and seeding. |
| src/dynamic_sessions.rs | Implements persistence layer for dynamic sessions (load/save/remove). |
| src/config.rs | Adds config paths for fix agents and dynamic sessions, plus protected on remote hosts. |
| public/metrics.js | Adds Fix button + danger-gated modal + /api/fix call in served JS. |
| public/events.js | Adds “Fix agents” button to the Configure area. |
| public/actions.js | Reworks remote-host editor into row UI with protected; adds fix-agents editor. |
| frontend/metrics.ts | TypeScript source for Fix button + modal + /api/fix call. |
| frontend/events.ts | TypeScript source for “Fix agents” button insertion. |
| frontend/actions.ts | TypeScript source for remote-host editor rows + protected, plus fix-agents editor. |
| .gitignore | Ignores fix-agents.json, dynamic-sessions.json, fix-prompts/, and local Playwright scratch. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| // Resolve the SSH target, propose-only flag, and display label from ONE authoritative source. | ||
| // The `protected` (propose-only) gate must be decided by the same record that supplies the | ||
| // target, so a client can't dodge a protected host by sending an alternate target string that | ||
| // fails to string-match the configured record. Returns (target, propose_only, host_label). | ||
| fn resolve_host( |
| pub async fn create_dynamic_session(config: Arc<Config>, spec: KnownSession) -> Result<String, String> { | ||
| dynamic_sessions::save_entry(config.clone(), &spec).await?; | ||
| match launch_spec(&spec, "created").await { | ||
| Ok(message) => Ok(message), | ||
| Err(error) => { |
| async fn api_session_remove( | ||
| State(state): State<AppState>, | ||
| headers: HeaderMap, | ||
| connect: ConnectInfo<SocketAddr>, | ||
| axum::Json(body): axum::Json<NameBody>, | ||
| ) -> Response { | ||
| if let Some(response) = guard(&state, &headers, &connect) { | ||
| return response; | ||
| } | ||
| if let Err(response) = require_unlock(&state, &headers).and_then(|_| require_action(&headers)) { | ||
| return response; | ||
| } | ||
| stop_session_result(state.config.clone(), &body.name).await | ||
| } |
| async fn stop_session_result(config: std::sync::Arc<config::Config>, name: &str) -> Response { | ||
| match tmux::stop_session(name).await { | ||
| Ok(message) => { | ||
| let _ = dynamic_sessions::remove(config, name).await; | ||
| webutil::json_response( | ||
| StatusCode::OK, | ||
| &serde_json::json!({ "ok": true, "message": message }), | ||
| ) | ||
| } | ||
| Err(error) => webutil::json_response( | ||
| StatusCode::BAD_REQUEST, | ||
| &serde_json::json!({ "error": error }), | ||
| ), | ||
| } | ||
| } |
| function readRemoteHostRows(root: HTMLElement): RemoteHostEntry[] { | ||
| return Array.from(root.querySelectorAll<HTMLElement>('[data-remote-host-row]')) | ||
| .map((row) => { | ||
| const value = (field: string): string => row.querySelector<HTMLInputElement>(`[data-host-field="${field}"]`)?.value.trim() || ''; | ||
| const label = value('label'); | ||
| const id = value('id') || idFromLabel(label); | ||
| return { | ||
| id, | ||
| label, | ||
| target: value('target'), | ||
| protected: Boolean(row.querySelector<HTMLInputElement>('[data-host-field="protected"]')?.checked), | ||
| }; | ||
| }) | ||
| .filter((host) => host.id && host.label && host.target); | ||
| } |
| function readRemoteHostRows(root) { | ||
| return Array.from(root.querySelectorAll('[data-remote-host-row]')) | ||
| .map((row) => { | ||
| const value = (field) => row.querySelector(`[data-host-field="${field}"]`)?.value.trim() || ''; | ||
| const label = value('label'); | ||
| const id = value('id') || idFromLabel(label); | ||
| return { | ||
| id, | ||
| label, | ||
| target: value('target'), | ||
| protected: Boolean(row.querySelector('[data-host-field="protected"]')?.checked), | ||
| }; | ||
| }) | ||
| .filter((host) => host.id && host.label && host.target); | ||
| } |
| await Promise.all([loadFixAgents(), loadRemoteHostConfig()]); | ||
| const remote = remoteHostConfig.find((item) => item.id === host || item.label === host); |
| await Promise.all([loadFixAgents(), loadRemoteHostConfig()]); | ||
| const remote = remoteHostConfig.find((item) => item.id === host || item.label === host); |
🧠 Grok Composer review (grok-composer-2.5)Review (diff only)Scope note: The diff is truncated in the middle (container-fix handler, 1. Brand consistencyClean. No header/footer logos, favicons, og:image/schema, hero mockups, or marketing copy in this diff. 2. Spot Suite architectureNot applicable / clean for suite rules. This is a local ops dashboard (tmux, SSH, Docker/Podman, JSON config on disk)—not tenant-scoped Workers/D1/OAuth/SECRETS KV. Positive:
No flags for: client-supplied 3. Correctness + securityMedium — confirm server enforces Medium — arbitrary shell execution surface (likely intentional, still worth hardening). Low — dynamic-session cleanup looks correct. Low — UI escaping is consistent. Low — Clean: No customer-facing “workspace” copy; protected-host label says “production,” not suite tenant terminology. Verdict: Solid ops-dashboard extension with sensible gitignore and auth on the visible fix-agents API; verify truncated server code enforces |
What
A red container (unhealthy / restarting / crashed) gets a Fix button (🔧). Clicking it opens a danger-gated modal; on confirm, ShellDeck spawns a tmux session running a configurable agent, seeded with the container's read-only diagnostics, to attempt a safe remediation while you watch it live.
{label, command}, seeded with Claude/Grok/Codex, managed in-app (a "Fix agents" editor;GET/POST /api/fix-agents).{prompt}/{promptfile}placeholders are substituted, else the prompt is appended.protectedflag → propose-only. On a protected host the agent is told to inspect read-only and print recommended commands, run nothing. Unprotected hosts attempt a safe fix. (Checkbox in the remote-hosts editor.)ps -a/inspect/logs --tail 200, zero mutating verbs). Prompt is written to a file and passed quoted. A destructive-verb ban (rm/rmi/compose down -v/volume prune/system prune/kill/reboot…) is baked into the agent's instructions, scoped to only the target container.POST /api/sessions/remove(kill + drop).Safety model (read this)
ShellDeck cannot hard-sandbox an interactive agent session on a host — so protection is layered, not absolute:
Two residual risks, both LOW/by-design (from the security review):
Security review
Independent security-reviewer pass: no CRITICAL/HIGH/MED findings. No command-injection path —
name/engine/targetare charset-validated to a metacharacter-free set, the attacker-influenced prompt/diagnostics areshell_wordsingle-quote-escaped (or never inlined on the default branch), and the quoting composes correctly throughtmux → sh -c → zsh -lic. All write endpoints are login+unlock+CSRF-gated;GET /api/fix-agentsis unlock-gated;write_promptcan't escape its dir.This branch adds its own
src/dynamic_sessions.rs(master lacks it). ItsDynamicSessionis{name,label,family,alias,badge,start}, which diverges from #34's{name,label,start,cwd}. Whichever lands second needs a small struct-unification merge. Recommend merging #34 first, then I'll reconcile this on top (onedynamic_sessionsmodule).Verification
fix.rsguards: quoting can't be broken by attacker prompt, diagnostic is read-only, session names stay in charset).(unhealthy)container: Fix button only on red cards; modal agent picker + ack-gates-Confirm + danger banner; fix-agents editor; protected checkbox; end-to-end spawn captured live diagnostics + correct SAFE-FIX instructions, and close killed the tmux session + cleared dynamic-sessions.cargo fmtchurn; compact style preserved.Note for the productization track
The spawn shell is hardcoded
zsh(fine for this box). For the open-source/enterprise build it should honor$SHELL/a configured default-command — follow-up, not a blocker.🤖 Generated with Claude Code