
Plan: transparent safe harness updates #277

@shiny-code-bot

Plan: Transparent Safe Harness Updates

Intent

Make Every Code keep itself current without interrupting active work. The desired end state is that Every Code stages safe updates in the background and automatically moves to the newest safe harness at the next quiet boundary, while preserving local dev workflows, package-manager ownership, and long-running review/job safety.

This plan is based on the finalized chat design and agent approval pass from May 31, 2026. Agents approved the direction with required guardrails around signed updates, rollback, quiescence, local dev isolation, and background job durability.

Product Policy

Default eventual behavior:

  • Direct Every Code-managed release installs: auto-check, auto-download, verify, stage, and restart after a quiet boundary.
  • Local dev builds: never pull a GitHub release over the dev binary; detect newer local builds and restart after a quiet boundary into the local build.
  • npm/Homebrew/package-managed installs: do not mutate package-owned files; either provide package-manager instructions or route execution through a launcher-owned harness cache.
  • Active non-durable background work: update may stage, but restart must defer.
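
The policy above can be read as a small decision table. The following is a minimal sketch, not the planned implementation: `InstallClass`, `UpdateSource`, `UpdatePolicy`, and `policy_for` are hypothetical names, and the package-managed case models only the "provide instructions" branch, not the launcher-owned cache branch.

```rust
/// How this copy of Every Code was installed (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InstallClass {
    DirectRelease,
    LocalDev,
    PackageManaged,
}

/// Where a newer harness may come from for a given install class.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UpdateSource {
    SignedRelease,
    LocalBuild,
    PackageManagerInstructions,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct UpdatePolicy {
    pub source: UpdateSource,
    /// Whether the updater may download and stage in the background.
    pub may_stage: bool,
    /// Whether a restart may be scheduled at the next quiet boundary.
    pub may_auto_restart: bool,
}

pub fn policy_for(class: InstallClass, has_non_durable_work: bool) -> UpdatePolicy {
    // Active non-durable work never blocks staging, only the restart.
    let quiet = !has_non_durable_work;
    match class {
        InstallClass::DirectRelease => UpdatePolicy {
            source: UpdateSource::SignedRelease,
            may_stage: true,
            may_auto_restart: quiet,
        },
        // Dev builds never pull a release; they only reload a newer local build.
        InstallClass::LocalDev => UpdatePolicy {
            source: UpdateSource::LocalBuild,
            may_stage: false,
            may_auto_restart: quiet,
        },
        // Package-owned files are never mutated in this sketch.
        InstallClass::PackageManaged => UpdatePolicy {
            source: UpdateSource::PackageManagerInstructions,
            may_stage: false,
            may_auto_restart: false,
        },
    }
}
```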

"Transparent" means automatic and low-friction, not invisible. Every restart decision must be observable through logs/status and recoverable through rollback.

Required Architecture

code command
  -> stable launcher / supervisor
      -> versioned harness binary
          -> TUI/core session
          -> rollout + history snapshot
          -> background job registry

The stable launcher owns version selection, staged update activation, rollback, restart intent, and process handoff. The harness owns Every Code runtime behavior. The launcher must not become a second runtime.

Required Guardrails

  • Signed update manifests: use Ed25519 or equivalent. Sign version, platform assets, SHA256 hashes, archive sizes, and channel metadata.
  • Atomic staging and rollback: download into launcher-owned cache, activate with an atomic pointer/version switch, keep previous version, and mark failed versions ignored.
  • Build identity: embed channel/fingerprint metadata such as release channel, git SHA, build timestamp/profile, and package version.
  • Core quiescence API: expose a structured core-owned RestartQuiescence::Ready | Blocked(Vec<Blocker>) result instead of scattered TUI heuristics.
  • Restart intent record: write a durable CODE_HOME intent containing staged version, previous version, session id, rollout path, cwd, argv, safe env subset, timestamp, and rollback pointer.
  • Flush barrier: force and await rollout/history snapshot flush before restart/resume.
  • Background job registry: track background reviews/execs/agents with is_durable; non-durable jobs block restart.
  • User controls: support settings and env overrides for auto update/restart, especially for CI and dogfood rollout.
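
As one illustration of the restart intent and atomic-write guardrails, the sketch below writes an intent record to a temporary file and renames it into place. It assumes a line-based `key=value` encoding and the file name `restart-intent`, neither of which is specified by this plan. It carries only a subset of the listed fields (argv, safe env subset, and rollback pointer are omitted), and it does not escape newlines in values.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Subset of the restart intent fields listed above (hypothetical layout).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RestartIntent {
    pub staged_version: String,
    pub previous_version: String,
    pub session_id: String,
    pub rollout_path: PathBuf,
    pub cwd: PathBuf,
    pub timestamp_unix_secs: u64,
}

impl RestartIntent {
    /// Encode as key=value lines; a real schema would need escaping.
    fn encode(&self) -> String {
        format!(
            "schema=1\nstaged_version={}\nprevious_version={}\nsession_id={}\nrollout_path={}\ncwd={}\ntimestamp_unix_secs={}\n",
            self.staged_version,
            self.previous_version,
            self.session_id,
            self.rollout_path.display(),
            self.cwd.display(),
            self.timestamp_unix_secs
        )
    }
}

/// Write the intent so readers see either the old file or the new one,
/// never a partially written file.
pub fn write_intent_atomically(dir: &Path, intent: &RestartIntent) -> io::Result<PathBuf> {
    fs::create_dir_all(dir)?;
    let tmp = dir.join("restart-intent.tmp");
    let dst = dir.join("restart-intent");
    let mut file = File::create(&tmp)?;
    file.write_all(intent.encode().as_bytes())?;
    // Flush file contents to disk before the rename makes them visible.
    file.sync_all()?;
    drop(file);
    // Rename within one directory replaces the destination in a single step.
    fs::rename(&tmp, &dst)?;
    Ok(dst)
}
```

A fully durable version would likely also sync the containing directory after the rename; that step is left out here.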

Quiescence Blockers

Restart must defer when any blocker is active:

  • active model turn or streaming response
  • foreground exec or tool call
  • pending approval, request-user-input, or dynamic tool callback
  • queued post-turn input or unsent user input
  • pending manual compact
  • active Auto Drive coordinator phase that is not restart-safe
  • active review/subagent/background command that is not durable
  • unflushed rollout/history snapshot
  • shutdown already in progress
  • incompatible rollout/job/restart-intent schema
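
The blocker list maps naturally onto the `RestartQuiescence::Ready | Blocked(Vec<Blocker>)` shape named in the guardrails. The sketch below shows one possible form. The `Blocker` variant names, the `CoreSnapshot` struct, and `restart_quiescence` are assumptions for illustration; the real core state would be richer than a set of flags.

```rust
/// One variant per blocker in the list above (hypothetical names).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Blocker {
    ActiveModelTurn,
    ForegroundExec,
    PendingApproval,
    QueuedInput,
    PendingManualCompact,
    AutoDriveNotRestartSafe,
    NonDurableBackgroundJob { job_id: String },
    UnflushedSnapshot,
    ShutdownInProgress,
    IncompatibleSchema,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum RestartQuiescence {
    Ready,
    Blocked(Vec<Blocker>),
}

/// Flattened view of core state, assumed for this sketch only.
#[derive(Debug, Default, Clone)]
pub struct CoreSnapshot {
    pub model_turn_active: bool,
    pub foreground_exec_active: bool,
    pub approval_pending: bool,
    pub input_queued: bool,
    pub manual_compact_pending: bool,
    pub auto_drive_unsafe_phase: bool,
    pub snapshot_unflushed: bool,
    pub shutdown_in_progress: bool,
    pub schema_incompatible: bool,
    pub non_durable_jobs: Vec<String>,
}

/// Collect every active blocker so callers can report all of them at once.
pub fn restart_quiescence(s: &CoreSnapshot) -> RestartQuiescence {
    let flags = vec![
        (s.model_turn_active, Blocker::ActiveModelTurn),
        (s.foreground_exec_active, Blocker::ForegroundExec),
        (s.approval_pending, Blocker::PendingApproval),
        (s.input_queued, Blocker::QueuedInput),
        (s.manual_compact_pending, Blocker::PendingManualCompact),
        (s.auto_drive_unsafe_phase, Blocker::AutoDriveNotRestartSafe),
        (s.snapshot_unflushed, Blocker::UnflushedSnapshot),
        (s.shutdown_in_progress, Blocker::ShutdownInProgress),
        (s.schema_incompatible, Blocker::IncompatibleSchema),
    ];
    let mut blockers: Vec<Blocker> = flags
        .into_iter()
        .filter(|(active, _)| *active)
        .map(|(_, blocker)| blocker)
        .collect();
    // Each non-durable job is reported separately, after the flag blockers.
    blockers.extend(
        s.non_durable_jobs
            .iter()
            .map(|id| Blocker::NonDurableBackgroundJob { job_id: id.clone() }),
    );
    if blockers.is_empty() {
        RestartQuiescence::Ready
    } else {
        RestartQuiescence::Blocked(blockers)
    }
}
```

Returning the full list rather than the first blocker would let the status notice name the specific reason a restart was deferred.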

Milestones

1. Design Contract

Document install classes, update trust model, launcher ownership, quiescence blockers, restart intent schema, rollback behavior, and dev/release separation.

2. Staging-Only Secure Updater

Implement signed manifest verification, checksum verification, cache layout, update locks, atomic staged writes, stale cleanup, and package-manager/dev exclusions. Keep the user-facing update flow manual at this stage.
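
For the update lock in this milestone, one simple option is an exclusive lock file created with `create_new`, which fails if the file already exists. This is a sketch under assumptions: the `UpdateLock` type and the `update.lock` file name are hypothetical, and a crashed process would leave the file behind, which is the case the stale cleanup step would need to handle.

```rust
use std::fs::{self, OpenOptions};
use std::io;
use std::path::{Path, PathBuf};

/// Guard for the updater's exclusive lock; the file is removed on drop.
pub struct UpdateLock {
    path: PathBuf,
}

impl UpdateLock {
    /// Returns Ok(None) when another updater already holds the lock.
    pub fn try_acquire(cache_dir: &Path) -> io::Result<Option<UpdateLock>> {
        fs::create_dir_all(cache_dir)?;
        let path = cache_dir.join("update.lock");
        // create_new makes "check and create" a single atomic operation.
        match OpenOptions::new().write(true).create_new(true).open(&path) {
            Ok(_) => Ok(Some(UpdateLock { path })),
            Err(e) if e.kind() == io::ErrorKind::AlreadyExists => Ok(None),
            Err(e) => Err(e),
        }
    }
}

impl Drop for UpdateLock {
    fn drop(&mut self) {
        // Best effort: a leftover file is handled by stale cleanup.
        let _ = fs::remove_file(&self.path);
    }
}
```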

3. Stable Launcher

Add launcher-owned harness selection:

active version
staged version
previous version
ignored broken versions
local dev override

Direct installs can use the full launcher flow. npm/Homebrew must not be mutated unless they explicitly delegate to this launcher cache.
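
A possible shape for the launcher's selection step is sketched below. The precedence order (local dev override, then staged, then active, then previous, skipping ignored versions) is an assumption made for this example, as are the `LauncherState`, `HarnessChoice`, and `select_harness` names; the plan does not fix an order.

```rust
use std::collections::HashSet;
use std::path::PathBuf;

/// The five pieces of launcher state listed above (hypothetical layout).
#[derive(Debug, Clone, Default)]
pub struct LauncherState {
    pub active: Option<String>,
    pub staged: Option<String>,
    pub previous: Option<String>,
    pub ignored_broken: HashSet<String>,
    pub local_dev_override: Option<PathBuf>,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum HarnessChoice {
    LocalDev(PathBuf),
    Version(String),
    NoneUsable,
}

pub fn select_harness(state: &LauncherState) -> HarnessChoice {
    // A local dev override always wins so dev sessions stay on local builds.
    if let Some(path) = &state.local_dev_override {
        return HarnessChoice::LocalDev(path.clone());
    }
    // Assumed precedence: staged, then active, then previous.
    let candidates = vec![
        state.staged.as_ref(),
        state.active.as_ref(),
        state.previous.as_ref(),
    ];
    for version in candidates.into_iter().flatten() {
        // Versions marked broken are skipped, which is what makes rollback work.
        if !state.ignored_broken.contains(version) {
            return HarnessChoice::Version(version.clone());
        }
    }
    HarnessChoice::NoneUsable
}
```

With this ordering, marking a failed staged version as ignored is enough to fall back to the active or previous version on the next launch.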

4. Safe Restart From Rollout

Implement restart-after-turn using the current session id and rollout resume. Force snapshot flush, shut down cleanly, restart, and resume. This is the first dogfood milestone that should materially reduce harness restart friction.

5. Core Quiescence Gate

Add structured blockers and tests. TUI/update logic must ask core whether restart is safe. TaskComplete is necessary but not sufficient, because other blockers can still be active after a turn completes.

6. Automatic Restart Scheduling

Wire staged updates to restart automatically when quiescent. If blocked, defer and show a small status notice.

7. Local Dev Reload

Dev builds skip GitHub release updates. They may watch the fingerprint of the resolved local executable or build output and schedule a restart after the current turn into the newer local build.
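
One cheap way to detect a newer local build is to compare file metadata for the resolved executable. The sketch below assumes a fingerprint of file size plus modification time; `BuildFingerprint`, `fingerprint`, and `local_build_changed` are hypothetical names, and a real implementation might prefer the embedded build identity (git SHA, build timestamp) described under the guardrails.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::time::SystemTime;

/// Metadata-based fingerprint of a build output (an assumption of this sketch).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct BuildFingerprint {
    pub len: u64,
    pub modified: SystemTime,
}

/// Read the current fingerprint of the executable on disk.
pub fn fingerprint(exe: &Path) -> io::Result<BuildFingerprint> {
    let meta = fs::metadata(exe)?;
    Ok(BuildFingerprint {
        len: meta.len(),
        modified: meta.modified()?,
    })
}

/// True when the file on disk no longer matches what the running process
/// recorded at startup, meaning a restart should be scheduled.
pub fn local_build_changed(running: &BuildFingerprint, exe: &Path) -> io::Result<bool> {
    Ok(fingerprint(exe)? != *running)
}
```

The result would only schedule a restart; the quiescence gate still decides when it actually happens.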

8. Durable Jobs

Add a CODE_HOME/state/jobs registry for background reviews/execs/agents. Only durable jobs may survive restart; non-durable jobs defer restart.
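
The registry's role in restart decisions can be sketched in memory as follows. Persistence under CODE_HOME/state/jobs is omitted, and `JobKind`, `JobRecord`, `JobRegistry`, and the method names are assumptions for illustration; only the `is_durable` flag comes from this plan.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JobKind {
    Review,
    Exec,
    Agent,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct JobRecord {
    pub kind: JobKind,
    /// Durable jobs may survive a restart; non-durable jobs defer it.
    pub is_durable: bool,
}

/// In-memory stand-in for the on-disk registry.
#[derive(Debug, Default)]
pub struct JobRegistry {
    jobs: BTreeMap<String, JobRecord>,
}

impl JobRegistry {
    pub fn register(&mut self, id: &str, kind: JobKind, is_durable: bool) {
        self.jobs.insert(id.to_string(), JobRecord { kind, is_durable });
    }

    pub fn finish(&mut self, id: &str) {
        self.jobs.remove(id);
    }

    /// Ids of running jobs that must finish before a restart is allowed.
    pub fn restart_blockers(&self) -> Vec<&str> {
        self.jobs
            .iter()
            .filter(|(_, job)| !job.is_durable)
            .map(|(id, _)| id.as_str())
            .collect()
    }
}
```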

9. Default-On Release

After dogfooding and at least two stable releases with clean telemetry, enable background staging and restart-after-turn by default for direct-managed release installs.

Validation Gates

Before default-on release behavior:

  • manifest signature valid/invalid/tampered
  • checksum mismatch
  • wrong platform/channel/schema
  • interrupted download/extract/switch
  • rollback after broken staged binary
  • direct vs npm vs Homebrew vs dev install classification
  • local dev build never pulls GitHub release
  • quiescence blockers individually tested
  • restart after normal turn resumes same session
  • restart request during active review defers
  • restart request during background exec defers
  • rollout/history snapshot flushed before restart
  • TUI raw mode/alternate screen restored
  • VT100 snapshots for update ready/deferred/restarted/rollback notices

UX Copy Direction

Keep update state quiet but visible when relevant:

Update staged: Every Code v0.7.2 will start after this turn.
Restart deferred: background review is still running.
Restarting into Every Code v0.7.2...
Update failed; restored Every Code v0.7.1.
New local build detected. Restarting after this turn.
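
Keeping these strings behind a single type makes them easier to cover with the VT100 snapshot tests listed in the validation gates. A minimal sketch, assuming a hypothetical `UpdateNotice` enum whose `Display` output reproduces the copy above:

```rust
use std::fmt;

/// One variant per notice in the copy direction above (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum UpdateNotice {
    Staged { version: String },
    Deferred { reason: String },
    Restarting { version: String },
    RolledBack { restored: String },
    LocalBuildDetected,
}

impl fmt::Display for UpdateNotice {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            UpdateNotice::Staged { version } => write!(
                f,
                "Update staged: Every Code v{} will start after this turn.",
                version
            ),
            UpdateNotice::Deferred { reason } => {
                write!(f, "Restart deferred: {}.", reason)
            }
            UpdateNotice::Restarting { version } => {
                write!(f, "Restarting into Every Code v{}...", version)
            }
            UpdateNotice::RolledBack { restored } => {
                write!(f, "Update failed; restored Every Code v{}.", restored)
            }
            UpdateNotice::LocalBuildDetected => {
                write!(f, "New local build detected. Restarting after this turn.")
            }
        }
    }
}
```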

Current Status

Planned. No implementation has started from this plan. The recommended first implementation PR is the restart contract: flush barrier, quiescence API, restart intent, and resume flow. Secure background update staging should follow after the restart path is boring and testable.

Finish Line

Every Code can safely stage verified updates in the background and automatically restart into the newest compatible harness at a quiet boundary, while local dev sessions keep using local builds, package-manager-owned installs are not mutated unexpectedly, and active non-durable reviews/jobs are never killed by update automation.

Labels: plan (durable planning issue), plan:active (current active plan)