docs(plans): redesign Agent UI as a thin host over sidecar agents (v2) #1913
Recasts the Agent UI architecture from a fat Python backend that hosts each agent in-process (~19K LoC) to a thin host that renders UI and supervises out-of-process agent sidecars. Every agent exposes ONE REST contract (fixed-function endpoints plus a POST /query agent loop over SSE that reasons and chains tools into multi-step workflows), consumed identically by the gaia <agent> CLI, the Agent UI, and integrators. The UI thus dogfoods the exact product third parties integrate.

Captures the design in two docs (overview drives, detailed plan follows):
- agent-ui.mdx: architecture, thin-host boundary, /query SSE, hub mirror, dev mode, card-rendering change, and a strangler-fig migration path.
- agent-ui-agent-capabilities-plan.md §0: the per-sidecar REST contract, the /query SSE event schema, the host AgentSidecarManager + proxy, mid-workflow confirmation, OAuth forwarding (host owns consent + refresh), CLI parity, shared-service custody (memory/RAG/session index), and what the fat backend's routers become.

Grounded in and corrected against the current code: the frontend renders cards by fence-parsing today (a real frontend change), /v1/connections is currently a host-side intake (role inverts to host-forwards-out), user memory/RAG are cross-agent (host custody, not per-sidecar), and destructive-step confirmations run over SSE without a WebSocket.

Supersedes the incremental email cutover (#1910).
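For readers less familiar with SSE framing, here is a minimal sketch of how a client of POST /query might split the stream into events. The event names and JSON payloads in the usage note are placeholders, not the plan's §0.2 schema, and `parse_sse` is a hypothetical helper rather than anything in the GAIA codebase.

```python
import json
from typing import Dict, Iterable, Iterator, List


def parse_sse(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one dict per Server-Sent Event from an iterable of text lines.

    Assumption: every `data:` payload is JSON. That is a choice made for
    this sketch, not something the plan states.
    """
    event = "message"          # SSE default event type
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":         # a blank line terminates the current event
            if data:
                yield {"event": event, "data": json.loads("\n".join(data))}
            event, data = "message", []
        elif line.startswith(":"):
            continue           # comment line, commonly used as a keep-alive
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format strips at most one leading space after the colon.
            data.append(value[1:] if value.startswith(" ") else value)
```

Feeding it the lines `event: answer`, `data: {"text": "done"}` and a blank line yields a single event dict; a relay or test harness could consume the same generator.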
Verdict: Approve ✅

Docs-only PR adding a v2 "sidecar-first" architecture design across two plan docs: every agent becomes an out-of-process REST sidecar (with a

The standout: the "grounded against the current code" claims actually check out. I verified every code reference and every LoC figure, and all are accurate. That's exactly the discipline a design doc needs to be trustworthy, and it's rare.

One trivial cleanup before merge (non-blocking): a stray full-width bracket artifact in one line. No code paths touched, so no tests/evals in play.

🔍 Technical details
Verified accurate (spot-checked, all correct):
🟢 Minor: full-width bracket artifact (
Strengths:
Addresses gaps an architecture review surfaced against the sidecar-first design:
- Host↔sidecar auth + reverse callback contract (§0.11): a loopback port that can send email is not safe unauthenticated, so the host mints a per-spawn secret required on every request; the same secret gates the sidecar→host custody callback API.
- Shared LLM broker (§0.12): Lemonade is single-tenant per model slot, so N sidecars would race-evict each other's models; a host-owned broker serializes model loads and leases the slot.
- Concurrency/cancellation/failure (§0.13): per-run_id isolation, a cancel path wired to stop/disconnect, synthetic crash error events, idle-reaping + a live-sidecar cap.
- One instance per machine (§0.14): CLI and UI attach to the same sidecar via a lockfile/registry rather than spawning rivals that re-race token refresh.
- Version negotiation + render fallback (§0.15); dev-mode discovery for unpublished agents (§0.16); a /query SSE eval + relay-test strategy (§0.17).
- Conversation history is host custody, not sidecar (a stateless, uninstall-able process can't be the system of record); approve-what-you-saw invariant for confirmations (§0.4); MCP registry + Lemonade added to the custody table (§0.9); phased roadmap (A–D) reconciled with the sidecar model.
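A minimal sketch of the per-spawn secret idea from §0.11, assuming the host generates a random token when it starts a sidecar and the sidecar checks it on every request. The function names are hypothetical, and how the secret reaches the sidecar is deliberately left out here.

```python
import hmac
import secrets
from typing import Optional


def mint_spawn_secret() -> str:
    """Host side: create a fresh random secret for one sidecar spawn."""
    return secrets.token_urlsafe(32)


def is_authorized(presented: Optional[str], expected: str) -> bool:
    """Sidecar side: accept a request only if it carries the spawn secret.

    hmac.compare_digest runs in time independent of where the inputs
    differ, which avoids leaking the secret through response timing.
    """
    if presented is None:
        return False
    return hmac.compare_digest(presented.encode(), expected.encode())
```

The same check could gate the sidecar→host callback direction, since the commit says one secret covers both legs.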
…ion, audit) Second adversarial review surfaced 11 more gaps, all addressed: - Authorization, not just authentication (§0.11): the per-spawn secret proved "the host spawned me" but not WHICH agent — unscoped callback routes let a hub-installed third-party agent read all user memory/RAG and any session transcript. Bind the secret to the agent id and scope every /host/v1/* route per-agent, with a per-agent shared-read grant surfaced at install. - Dispatch (§0.18): with N agents the host had no defined way to pick which sidecar answers a free-form message — v1 = explicit agent picker. - Existing-user data migration (§0.10 step 0): code migration ≠ data migration; existing ~/.gaia sessions/memory have no agent tag/partition. - Audit as host-custody sink (§0.19): actions happen in sidecars, so an agent-private log leaves the observability dashboard blind and is erased on uninstall — append consequential actions to a host-owned, per-agent-scoped log. - Broker covers ALL model families (§0.12): embedder/VLM/voice/SD contention re-creates the #1030 eviction bug; plus interactive>background priority, hot-model affinity, and a "switching model" status event. - /query request contract (§0.1): host mints run_id (fixes the cancel race), context pushed in the body (not pulled), body fields specced. - Uninstall data lifecycle (§0.20), offline/sideload + install footprint (§0.21), and reconciled internal inconsistencies (§0.7↔§0.14 CLI attach; §0.4 stateless vs run_id-resume artifacts; secret delivery off env).
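The authorization point (the secret must identify WHICH agent is calling, and routes must be scoped per agent) can be sketched as follows. Everything here is illustrative: the class names, the `"shared"` partition label, and the shape of the shared-read grant are assumptions, not the plan's actual schema.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable


@dataclass(frozen=True)
class SpawnCredential:
    agent_id: str
    shared_read: FrozenSet[str] = frozenset()  # stores the user granted at install


class CallbackAuthorizer:
    """Maps a per-spawn secret to one agent id and scopes reads to it."""

    def __init__(self) -> None:
        self._by_secret: Dict[str, SpawnCredential] = {}

    def register(self, secret: str, agent_id: str,
                 shared_read: Iterable[str] = ()) -> None:
        self._by_secret[secret] = SpawnCredential(agent_id, frozenset(shared_read))

    def can_read(self, secret: str, store: str, partition: str) -> bool:
        # Note: a dict lookup is not constant-time; a real implementation
        # would need to consider that when validating secrets.
        cred = self._by_secret.get(secret)
        if cred is None:
            return False                      # unknown secret: not spawned by us
        if partition == cred.agent_id:
            return True                       # an agent may read its own partition
        # Cross-agent data is readable only with an explicit install-time grant.
        return partition == "shared" and store in cred.shared_read
```

With this shape, a hub-installed third-party agent that was never granted shared reads gets a denial instead of the whole user memory.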
…lock) Third review found two architecture holes the sidecar split created: - The host = a headless custody/supervisor daemon, NOT the web UI (§0.0). The web UI and the gaia <agent> CLI are both thin clients that attach to it, so CLI-only use (no browser open) still reaches /host/v1/* for memory/RAG/audit — the hole the callback + audit contracts otherwise reopened whenever the UI wasn't running. - Autonomy: the host daemon holds the cron clock; the sidecar does the work, spawned at fire time (§0.22, §0.9). A reaped/idle sidecar (§0.13) can't fire its own schedule, so the clock cannot live inside it — resolves the §0.9↔§0.13 contradiction. - Custom render cards are first-party-only in v1 (§0.15): a sidecar can't inject a React component into the signed thin-UI bundle; third-party cards degrade to the generic result card.
…uth) Round-4 review found two consequences of the §0.0 host-daemon split that the pre-split sections hadn't absorbed: - Sole spawner: §0.7/§0.14 (and the mdx three-clients list) still had the CLI/UI spawn the sidecar and race as rivals. Post-split only the daemon spawns; UI and CLI attach to it. Rewrote §0.7 (CLI is a daemon client), §0.14 (one daemon per machine; the daemon holds one sidecar per agent — the CLI-vs-UI rival-spawn is structurally gone), and the mdx list. - Client→daemon auth (new leg): the daemon now exposes the custody API + the /v1/<agent>/* proxy on a loopback port, and UI/CLI are external clients with no defined credential — an unauthenticated local API that could read all memory/RAG/transcripts or drive any sidecar. §0.11 now specs three auth legs; the daemon mints a 0600 client-auth token (or UDS peer-cred) clients present.
…sequencing A code-grounded feasibility pass found no fatal blocker but corrected three overclaims and named what's reuse vs net-new (§0.23): - Event vocab (§0.2): the loop→SSE seam (SSEOutputHandler) already exists, but its vocabulary (status/step/thinking/tool_start/tool_end/chunk/answer) is NOT §0.2's contract — a translation layer is required, specced as the wire contract. - Streaming proxy (§0.3): today's EmailSidecarProxy fully BUFFERS (requests + resp.json()); SSE passthrough + cancel is net-new httpx streaming, not "generalize the existing proxy." - Footprint (§0.21): the ~90MB email binary is small only because it excludes the ML stack; an in-process-ML agent can't, and PyInstaller has no cross-binary linking — "shared runtime" is aspirational, keep in-process ML out of sidecars. - §0.23 build sequencing: Lemonade broker (net-new cross-process IPC, zero scaffolding today) + a streaming/freeze spike are the critical path; /query, CLI rewrite, and OAuth forward-out are de-risked by existing seams.
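To make the broker on the critical path more concrete, here is a single-process sketch of "serialize the slot, lease it, and let interactive requests queue ahead of background ones". The real broker is described as net-new cross-process IPC, so this only illustrates the queueing policy; the class, constants, and method names are hypothetical, and no Lemonade API is called.

```python
import heapq
import itertools
from typing import List, Optional, Tuple

INTERACTIVE, BACKGROUND = 0, 1   # lower number = served first


class ModelSlotBroker:
    """One model slot, one lease holder at a time, priority-ordered waiters."""

    def __init__(self) -> None:
        self._waiting: List[Tuple[int, int, str, str]] = []
        self._seq = itertools.count()          # FIFO tie-break within a priority
        self.holder: Optional[Tuple[str, str]] = None   # (agent_id, model)
        self.loaded_model: Optional[str] = None

    def request(self, agent_id: str, model: str,
                priority: int = BACKGROUND) -> Optional[Tuple[str, str]]:
        heapq.heappush(self._waiting, (priority, next(self._seq), agent_id, model))
        return self._grant_next()

    def release(self, agent_id: str) -> Optional[Tuple[str, str]]:
        if self.holder is not None and self.holder[0] == agent_id:
            self.holder = None
        return self._grant_next()

    def _grant_next(self) -> Optional[Tuple[str, str]]:
        # Non-preemptive: a current lease is never revoked, only outlived.
        if self.holder is not None or not self._waiting:
            return self.holder
        _, _, agent_id, model = heapq.heappop(self._waiting)
        self.loaded_model = model   # a real broker would load/evict the model here
        self.holder = (agent_id, model)
        return self.holder
```

Because the lease is never preempted, a background job finishes its turn, but an interactive request that arrives later is still granted the slot before older background waiters.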
…0.24) A security/threat-model review found the plan authenticated the sidecar↔host channel well but had no model for CONTAINING a hostile agent — critical once one-click third-party agents are installable. Added §0.24: - Sign the lock/catalog (SHA-256 is integrity, not authenticity) + anti-rollback; signed updates with re-consent on scope widening. - Tier-gate + least-privilege the OAuth forward (a third-party agent shouldn't get the whole live-mailbox grant one-click); explicit install consent naming scopes. - Constrain sidecar network egress (a hostile sidecar with a mailbox token + open network can exfiltrate mail and break "100% local") — no-network-default + declared allowlist. - Encrypt data at rest (tokens/memory/RAG/transcripts/secrets under ~/.gaia are plaintext today; 0600 doesn't stop a stolen laptop) — OS keychain + pull the credential vault forward to when §0.6 lands. - Tamper-evident audit log (hash-chain); cross-agent prompt-injection taint model; no unpinned npm MCP auto-install; telemetry local-only by default. The coupled trust-root + containment decision (signing + tiers + egress + least-priv) gates third-party agents and needs maintainer sign-off.
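The tamper-evident audit log (hash-chain) item can be illustrated with a short sketch: each entry's hash covers the previous entry's hash, so editing or removing an earlier entry breaks verification. The entry layout and function names are assumptions made for illustration only.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64   # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, action: Dict) -> str:
    # sort_keys gives a stable serialization so the hash is reproducible.
    body = json.dumps(action, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(log: List[Dict], action: Dict) -> Dict:
    """Append an action, chaining it to the hash of the previous entry."""
    prev = log[-1]["hash"] if log else GENESIS
    entry = {"prev": prev, "action": action, "hash": _digest(prev, action)}
    log.append(entry)
    return entry


def verify_chain(log: List[Dict]) -> bool:
    """Recompute every link; any edited or reordered entry fails the check."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev or entry["hash"] != _digest(prev, entry["action"]):
            return False
        prev = entry["hash"]
    return True
```

A chain like this only detects tampering by someone who cannot also rewrite every later hash, so it is a complement to, not a substitute for, protecting the log file itself.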
An operational/lifecycle review found the plan specced the daemon's responsibilities but never how it's born/reborn/controlled/updated: - §0.25 Daemon lifecycle: OS process manager starts it at boot + restarts on crash (fixes the §0.0 "auto-start on first call" vs §0.22 "always-on cron clock" contradiction — nothing fires the 8am brief after a reboot); stale instance.json liveness-check + atomic reclaim; `gaia daemon status|stop|restart|logs` + run_id stamped across all tiers for diagnostics; daemon↔client version skew on app update (drain + restart the stale daemon). - §0.26 On-disk state layout + update-survival map (ephemeral vs durable vs config; daemon config had no home). - §0.5: install provisions the MODEL, not just the binary (cold-start #1655 class), + the full first-run sequence. - §0.9: session-focus model when two clients drive one host-owned transcript.
A full-plan consistency re-review (post-security/operational) found two real section-vs-section contradictions; both fixed: - §0.24 egress vs §0.11/§0.12: "no-network-by-default" would sever the sidecar→ daemon loopback (callback API + broker) the custody model requires. Added a carve-out: the daemon's loopback control channel is always-allowed, distinct from the external-host egress allowlist; a net-namespace variant must plumb the daemon socket in. - §0.24 keychain vs §0.11/§0.14 client-auth token: the same secret was specified two ways. Separated the classes — durable custody secrets (OAuth refresh tokens, custody stores) go in the OS keychain / encrypted at rest; the ephemeral, re-minted client-auth + per-spawn secrets stay 0600 (client-readable, harmless to lose), matching §0.26's ephemeral/durable layout. The re-review verified the other six flagged seams (daemon restart vs runs, model-pull vs broker, audit host-sink vs private log, session-focus vs run isolation, etc.) are already consistent.
…ans (§0.27) A requirements-traceability + cross-plan review found: - Requirement over-claim: the plan said "every capability is a deterministic call," but specification.html marks the advanced/personalization tier "agent only." Corrected §0.1 + the overview: the sidecar exposes the FULL capability set split across two surfaces — the deterministic core as fixed-function endpoints, the advanced tier (task extraction, follow-up, daily briefing, scheduled send, sender prioritization, inbox profiling, preferences) behind /query + §0.22. - §0.27 Relationship to sibling plans: v2 SUPERSEDES the "sidecar reads grants.json directly" model (connectors.mdx + email-sidecar-agent-ui.md → §0.6 forwards tokens), security-model.mdx's "localhost is the trust boundary" (→ §0.11/§0.24), and the impl plan's "UI backend owns the sidecar" (interim only; the daemon owns it in v2). RECONCILES autonomy-engine (= the v2 daemon), MCP server ownership (host owns registry, sidecar owns connections), and setup-wizard vs §0.5 model provisioning. Cross-links hub-publish Part 2 = §0.5. - Added a supersede banner to the v1 email-sidecar-agent-ui.md (the doc PR #1910 was built from) so readers don't follow the inverted grants model. Requirements 2–7 (CLI parity, thin UI, dev mode, hub mirror, both surfaces, backwards-compat) verified well-captured.
A one-paragraph test plan was thin for a system that now spans a daemon, a cross-process broker, an SSE relay, three auth legs, and a data migration. §0.17 now covers four tiers: /query behavior eval (event-sequence assertions, serial), deterministic seam tests (SSE relay cancel/crash; all three auth legs incl. per-agent callback scoping; broker serialization+priority; daemon stale-lock + restart), migration idempotency/cold-state, and an on-hardware e2e golden path.
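An event-sequence assertion of the kind the first tier describes might look like the sketch below. It checks that required events occur in order while tolerating extra events in between, which suits non-deterministic agent runs. The helper is hypothetical, and the event names in the usage note are borrowed from the existing handler vocabulary purely as sample data.

```python
from typing import Iterable, Sequence


def contains_in_order(events: Iterable[str], expected: Sequence[str]) -> bool:
    """True if `expected` appears within `events` as an ordered subsequence."""
    remaining = iter(events)
    # `want in remaining` consumes the iterator up to the first match, so each
    # later lookup resumes after the previous hit and ordering is enforced.
    return all(want in remaining for want in expected)
```

For example, a run that emitted `status, tool_start, tool_end, chunk, answer` satisfies the expectation `tool_start, tool_end, answer` but not `answer, tool_start`.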
🟡 The "What this replaces" summary in 🔍 Technical details
"RAG, memory" appear in the sidecar parenthetical, but:
Suggested fix for the "What this replaces" sentence:
…ency, ids) A data-model review found the design was contract-heavy but specified almost none of the contracts it leans on: - §0.28 Agent manifest schema — the load-bearing artifact ~8 sections reference as "the manifest declares X" but was never defined; the existing binaries.lock.json carries NONE of those fields. Separates the lock (binary integrity) from the manifest (capability + policy), requires the manifest to ride INSIDE the §0.24 signature envelope (else its egress/scopes/tier are forgeable), and tables the fields with the section that needs each + fail-loud install validation. - §0.29 Custody-store consistency — all writes serialized through the daemon (single physical writer, generalizing §0.6); SERIALIZED audit appends as a hard requirement of §0.24's hash-chain (concurrent appends fork the chain → false tamper); SQLite WAL/busy_timeout + RAG read-during-index. - §0.15 Contract evolution beyond install: version the /host/v1/* callback API for new-daemon/old-sidecar skew; unknown SSE event type surfaces "unsupported," never silently dropped; additive-MINOR vs deprecation-window policy. - §0.30 Identifier catalog — session id (the callback authz key, undefined), action_id (audit + uninstall handle, absent), batch_id; run_id is the model.
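The custody-store consistency rules (one physical writer, serialized audit appends, SQLite WAL plus busy_timeout) can be sketched with the standard library. The table layout, class name, and timeout value are assumptions; the in-process lock stands in for "all writes go through the daemon".

```python
import sqlite3
import threading


def open_custody_db(path: str, busy_timeout_ms: int = 5000) -> sqlite3.Connection:
    """Open a file-backed SQLite DB in WAL mode with a busy timeout."""
    conn = sqlite3.connect(path, check_same_thread=False)
    conn.execute("PRAGMA journal_mode=WAL")        # readers do not block the writer
    conn.execute(f"PRAGMA busy_timeout={int(busy_timeout_ms)}")
    return conn


class SerializedAuditLog:
    """Funnels every append through one lock so sequence numbers never fork."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn
        self._lock = threading.Lock()
        with self._conn:
            self._conn.execute(
                "CREATE TABLE IF NOT EXISTS audit ("
                "seq INTEGER PRIMARY KEY AUTOINCREMENT, "
                "agent_id TEXT NOT NULL, action TEXT NOT NULL)"
            )

    def append(self, agent_id: str, action: str) -> int:
        with self._lock:           # single writer: one append at a time
            with self._conn:       # commits on success, rolls back on error
                cur = self._conn.execute(
                    "INSERT INTO audit (agent_id, action) VALUES (?, ?)",
                    (agent_id, action),
                )
                return cur.lastrowid
```

Serializing appends is what would let a hash-chain layered on top stay linear, since two concurrent appends can never claim the same predecessor.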
The reverse contract was referenced by §0.11/§0.9/§0.19/§0.29 but never listed, the same 'load-bearing but undefined' pattern as the manifest. §0.31 tables its routes (rag/query, memory get/post, sessions fetch, audit append, model lease) with shapes, per-agent authz scoping, the serialized-audit-append + single-writer rules, its own versioned MAJOR, and a typed fail-loud error taxonomy (403/409/429/503), with no silent empty results.
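A typed, fail-loud error taxonomy on the sidecar side could be sketched as below. The commit names only the status codes; the class names and the meaning attached to each code follow standard HTTP semantics and are assumptions, not the §0.31 definitions.

```python
class CustodyError(Exception):
    """Base class: any non-2xx callback response raises, never returns empty."""
    status = 500


class Forbidden(CustodyError):          # 403: caller is outside its scope
    status = 403


class Conflict(CustodyError):           # 409: write conflicts with current state
    status = 409


class TooManyRequests(CustodyError):    # 429: caller should back off and retry
    status = 429


class Unavailable(CustodyError):        # 503: host or store temporarily down
    status = 503


_BY_STATUS = {cls.status: cls
              for cls in (Forbidden, Conflict, TooManyRequests, Unavailable)}


def raise_for_custody_status(status: int, body: str = "") -> None:
    """Turn an HTTP status from a /host/v1/* call into a typed exception."""
    if 200 <= status < 300:
        return
    raise _BY_STATUS.get(status, CustodyError)(f"{status}: {body}")
```

Callers can then distinguish "retry later" (429/503) from "stop, you are not allowed" (403) instead of treating an empty result as success.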
…workflow goal
The core stated requirement — /query for complex, cross-agent workflow automation
('summarize this and email it') — was deferred in §0.18 without design. §0.32
designs it: the orchestrator is itself a first-class agent (sidecar), keeping the
host thin; agent-to-agent calls route THROUGH the host (POST /host/v1/agents/{id}/
invoke) so every hop is authorized/audited/broker-leased/taint-tracked (no direct
sidecar mesh); taint travels with cross-agent data and every destructive step keeps
the confirmation gate (can't launder an injected instruction into an un-approved
send); cost is surfaced with interactive priority. Sequenced after the v1 picker,
on the same contract.
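The "route every agent-to-agent hop through the host" idea can be sketched as a small mediator that authorizes, audits, and taint-marks each call. This is an in-memory illustration only: the `Mediator` class, the grant representation, and the result envelope are hypothetical, and broker leasing and the confirmation gate are omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class Mediator:
    grants: Set[Tuple[str, str]]                 # approved (caller, callee) pairs
    agents: Dict[str, Callable[[dict], dict]]    # callee id -> stand-in for a sidecar
    audit: List[dict] = field(default_factory=list)

    def invoke(self, caller: str, callee: str, payload: dict) -> dict:
        allowed = (caller, callee) in self.grants
        # Record every hop, including denied ones, before doing any work.
        self.audit.append({"caller": caller, "callee": callee, "allowed": allowed})
        if not allowed:
            raise PermissionError(f"{caller} may not invoke {callee}")
        result = self.agents[callee](payload)
        # Cross-agent data is marked so downstream steps can treat it as untrusted.
        return {"result": result, "source": callee, "tainted": True}
```

A direct sidecar-to-sidecar call would skip all three steps, which is the reason the commit rules out a sidecar mesh.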
…table §0.32's agent-to-agent invoke route was specified in §0.32 but missing from §0.31's /host/v1/* table; add it (orchestrator-scoped) so the API spec is complete.
Per feedback: /query isn't UI-specific — it belongs on every agent-serving REST surface. The OpenAI-compatible API server (src/gaia/api/, `gaia api`) today serves /v1/chat/completions + /v1/models + an in-process /v1/email mount, but not the agentic loop. §0.33: - Adds POST /v1/<agent>/query (SSE) to the API server — the SAME contract, proxied to the sidecar (one agent-loop implementation, no in-process loop). - Keeps /v1/chat/completions (OpenAI-SDK drop-in) alongside /query (the agentic superset with tool events + confirmation + workflows). - Supersedes the API server's in-process /v1/email mount (openai_server.py:143 — the surface left out of PR #1910's scope) → sidecar-backed like the UI's. - Flags that this is a NETWORK-exposed surface (unlike the loopback daemon), so /query here sits behind API-key auth and still enforces the confirmation gate + containment — it must NOT inherit §0.11's loopback trust assumptions. Reframed the overview's "three thin clients" → "one contract, many front-doors" (CLI, UI, the gaia api server, integrators), all proxying to the one sidecar loop.
🟡 §0.27 identifies four plan docs as "now wrong" but this push adds a banner to only one of them — the others still actively mislead implementers.
The fix is small: add the same
🔍 Technical details
Files needing banners (per §0.27, none modified in this PR):
Suggested banner (same pattern as
> **⚠️ Partially superseded by Agent UI v2** ([`agent-ui.mdx`](agent-ui.mdx) +
> [`agent-ui-agent-capabilities-plan.md`](agent-ui-agent-capabilities-plan.md) §0).
> [one-line summary of what specifically changed]. Read v2 §0 for the current model.
Per the 'is this a fit for full autonomy?' question: the architecture is infrastructure-ready (always-on daemon + scheduler, tamper-evident audit, persistent memory, background broker priority, containment) but policy-hostile — its safety/interaction model is human-in-the-loop by construction. §0.34 names the six autonomy-layer gaps (pre-authorization policy replacing per-action approval; event-driven triggers vs cron-only; long-lived goals/self-initiation; async notify-and-resume escalation; graduated autonomy levels; continuous-monitoring vs ephemeral reaping) and the structural implication: autonomy is a NEW host-side layer ABOVE the agent contract (the autonomy-engine.mdx engine, hosted in the daemon), driving /query under pre-authorization with audit + containment + broker priority as guardrails. Sequenced after the human-in-the-loop v1.
…g fixes) The structural review found the architecture SOUND (coherent custodian, not a god-object; versioned runtime cycle, not a build cycle) and flagged framing + premature-hardening trims. Applied: - Honesty on independence/dogfooding: the sidecar owns no durable state, so custody-backed /query needs a host; publish /host/v1/* as a third-party- implementable custody interface + formalize the bare-integrator degraded tier with a capability matrix (§0.35, mdx). - Naming: stop calling the daemon "thin host" — it's the custodian daemon; "thin" is the UI only (mdx heading + boundary table + naming note). - Slogan: "one AGENT contract, many front-doors; two CONTROL-PLANE contracts behind it" (callback §0.31 + client↔daemon §0.11/§0.25). - Assert the daemon's internal module seams; mark the broker as later-extractable. - Trim premature hardening to third-party-gate (v1: plain append-only audit not hash-chain; proxy allowlist not netns; priority queueing not preemption). - Render boundary: ship generic render primitives as the default; bespoke cards are the first-party exception.
…al (§0.36) Per 'can vendors easily digest the sidecars?': yes at the deterministic/bare tier (language-agnostic HTTP, OpenAI-compat chat, self-contained binary+npm, versioned contract), but the FULL agentic experience has two frictions to design out — (1) custody (memory/RAG/audit live in the host, so rich /query needs the custodian or a third-party /host/v1/* impl) and (2) the LLM backend dependency (agent + a reachable model host; email forces local-only). §0.36 formalizes a published, tested tier/capability matrix (standalone / +OAuth / full) so vendors pick a tier with eyes open instead of hitting the frictions by surprise, and makes the LLM backend an explicit documented dependency.
…(§0.37) Per 'can one sidecar provide both experiences?': yes, and it resolves the §0.35 #1 risk at its root. §0.9 put custody in the host because N sidecars sharing one user's data need a single writer — but that only holds in the multi-agent Agent UI; a standalone agent is its own single writer. So custody is a PLUGGABLE provider (ports & adapters) behind the /host/v1/* interface, with three auto-selected adapters: embedded (default — self-contained rich agent, own SQLite), delegated (host injects /host/v1/* → shared single-writer host, the §0.9 model), ephemeral (stateless drop-in). Single-writer invariant holds in every mode (never multi-writer). One binary does both; footprint stays light (storage, not ML libs); /host/v1/* is now both the delegated wire protocol and the third-party interface (the embedded adapter is GAIA's reference impl). Refines §0.9's 'custody must be the host' to 'host is the provider the Agent UI selects.'
…des (§0.32)
A2A mediation is a host-role capability like custody (§0.37) + brokering (§0.12).
Delegated = full mediated A2A via /host/v1/agents/{id}/invoke; embedded single-agent
= no A2A; embedded multi-agent needs a coordinator (host or a third-party impl) —
direct sidecar->sidecar loses taint/audit/broker guarantees; ephemeral = caller-
orchestrated call-chaining. Unifying point: the host is the multi-agent coordination
plane bundling custody + brokering + A2A mediation, needed only with multiple agents.
…spectrum) (§0.40) Per 'is there a hybrid where some state is shared and some embedded?': yes, the natural generalization of §0.37. Custody is several stores (grants/memory/RAG/ sessions/audit), so the provider choice is per-STORE, not global — a composite CustodyProvider. Governing rule preserves §0.9's invariant per store: shared-across- agents ⇒ delegated (host single writer); agent-private ⇒ embedded (agent single writer); never a shared store with multiple writers. High-value configs: shared identity + private cognition (share logins, isolate memories), shared memory + private corpus, central audit + local rest. Benefits: fine-grained sharing + graceful degradation (host needed only for delegated stores). Cost: per-store config in the manifest (extends §0.28). Embedded (all-private) and Delegated (all-shared) are the spectrum endpoints.
'Store' was load-bearing but undefined. A store = one named, independently-addressable custody data-domain (own data kind, interface, lifecycle, sharing scope): grants/memory/rag/sessions/audit, each mapped to its /host/v1/* interface. Boundary test: a domain is its own store iff it has (a) a distinct single-writer requirement and (b) a coherent independently-swappable interface; state that shares a lifecycle+writer is one store (transcript + session-index = one 'sessions' store), which keeps it at ~5 stores, not per-field. 'Per-store' = each independently picks embedded or delegated.
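The per-store choice can be pictured as a small resolver that maps each store's sharing scope to an adapter. The store names and the scope-to-adapter mapping come from the commits above; the function, the dict-based config shape, and the private-by-default choice are assumptions made for this sketch.

```python
from typing import Dict

STORES = ("grants", "memory", "rag", "sessions", "audit")

# shared across agents => host is the single writer; private => the agent is.
_ADAPTER_FOR_SCOPE = {"shared": "delegated", "private": "embedded", "none": "ephemeral"}


def resolve_custody(scopes: Dict[str, str]) -> Dict[str, str]:
    """Return the adapter chosen for every store, failing loudly on bad input."""
    unknown = set(scopes) - set(STORES)
    if unknown:
        raise ValueError(f"unknown store(s): {sorted(unknown)}")
    plan = {}
    for store in STORES:
        scope = scopes.get(store, "private")   # assumption: private by default
        if scope not in _ADAPTER_FOR_SCOPE:
            raise ValueError(f"unknown scope {scope!r} for store {store!r}")
        plan[store] = _ADAPTER_FOR_SCOPE[scope]
    return plan
```

Because each store resolves to exactly one adapter, there is never a store with two writers, which is the invariant the commit says must hold in every configuration.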
…1 capstone) Per 'can we collapse the three modes into one architecture that fits A2A + privacy + autonomy?': yes — they're facets of ONE primitive, a scoped capability grant enforced by an always-present mediation plane (the object-capability / policy-enforcement-point model). Everything an agent touches (store, agent, connector, action, autonomous act) is a grant-checked, mediated, audited, taint-tracked capability; an agent runs under grant = manifest needs ∩ user consent. This resolves privacy (the grant IS the boundary — email agent has the mailbox, others don't), A2A (B runs under B's own grant; A gets only B's taint-tracked result), autonomy (an autonomy-level field per capability), and the custody modes (a store is a capability with a scope: private=embedded/shared=delegated/none=ephemeral). One code path + one security kernel; the modes are host-richness/scope configs — even embedded = sidecar + a minimal bundled host. §0.32/§0.34/§0.37/§0.40 are facets of this one architecture.
…prior-art §0.42 Stress-test verdict: the 'one capability grant unifies everything' framing was a vocabulary, not an architecture — it relabeled five orthogonal mechanisms (authz, information-flow, confirmation, arbitration, storage-topology) as one, and contradicted §0.34/§0.24. Corrected to clean SEPARATION OF CONCERNS: the grant is the authorization/isolation plane (one genuine unification of §0.11+§0.40+§0.32 routing); safety (pragmatic stack, not formal IFC), confirmation (§0.4), arbitration (§0.12), and autonomy (§0.34 layer) are separate planes. Adds the concrete fixes (grant checks tool boundary not reasoning; revocation = per-call grant re-read; embedded has no PEP; audit actions not reads; taint is a hint). §0.42 benchmarks vs OpenClaw (permission cascade; 'compromised skill inherits all' = our process isolation is stronger), Hermes (one-core-many-frontends validates §0.33; persistent memory table-stakes), and MCP/A2A (align, don't invent). Also restores the accidentally-dropped '## 1.' header.
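Two of the concrete fixes (the grant is checked at the tool boundary, and revocation works because the grant is re-read on every call) can be sketched together, with the grant computed as manifest needs intersected with user consent. The class, function, and capability names are hypothetical.

```python
from typing import Callable, Dict, FrozenSet, Iterable, Set


class ConsentStore:
    """Holds what the user has consented to, per agent."""

    def __init__(self) -> None:
        self._consent: Dict[str, Set[str]] = {}

    def grant(self, agent_id: str, capability: str) -> None:
        self._consent.setdefault(agent_id, set()).add(capability)

    def revoke(self, agent_id: str, capability: str) -> None:
        self._consent.get(agent_id, set()).discard(capability)

    def effective(self, agent_id: str, manifest_needs: Iterable[str]) -> FrozenSet[str]:
        # Effective grant = what the manifest asks for AND the user approved.
        return frozenset(manifest_needs) & frozenset(self._consent.get(agent_id, set()))


def call_tool(store: ConsentStore, agent_id: str, manifest_needs: Iterable[str],
              capability: str, tool: Callable[[], object]) -> object:
    """Check the grant at the tool boundary, re-reading it on every call."""
    if capability not in store.effective(agent_id, manifest_needs):
        raise PermissionError(f"{agent_id} lacks {capability}")
    return tool()
```

Since nothing is cached between calls, a revocation takes effect on the very next tool invocation, and consent for a capability the manifest never declared grants nothing.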
…mmary for onboarding Adds an 'At a glance' section to the overview: a plain-language mental model for a new colleague (sidecars = per-agent out-of-process services; the custodian daemon manages/holds-shared-state/brokers-the-model/routes-A2A; UI+CLI+API are thin clients; grant = an agent's reach; process isolation contains compromise; security = a few simple separate layers; align with MCP/A2A standards), an ASCII diagram of the three tiers + the two control-plane edges, and the security-planes + custody-spectrum insets.
…roduction line New sibling doc to the runtime architecture. The runtime (agent-ui) is how agents RUN; the factory is how they're PRODUCED. Thesis: the field can learn skills but almost none MANUFACTURE trustworthy, shippable, isolated agent PRODUCTS — the differentiator is the production line, not codegen. Stitches existing components (Builder scaffold, skill-synthesis, skill-format, tool-loader, the src/gaia/eval scorecard framework, packaging/freeze+gen_*, the release_agent_*.yml CI) into one recipe-driven line: specify → scaffold → compose → synthesize → EVAL-GATE 🚦 → manifest → package → SIGN 🚦 → publish → (runtime installs). Key properties: eval is a hard quality-gate (the trust differentiator vs skill-learning with no bar); a declarative reproducible recipe.yaml (a Dockerfile for agents); the manifest (§0.28) is the factory→runtime hand-off with provenance (recipe hash + scorecard digest) inside the signed envelope (§0.24); an optional closed-loop generate→eval→refine mode. Includes the component inventory (exists vs net-new stitch), a strangler-fig phased build (email as reference), the honest competitive read, and open decisions.
…C (live SDK) Critical correction: an agent is NOT a static artifact frozen against an SDK snapshot — the SDK changes constantly, so a snapshot rots. The factory is the DEVELOPER FLOW automated, against the LIVE SDK: clone → scope → GitHub issues/milestones → spec → iterate spec → synthetic datasets → implement (against live SDK) → eval + optimize → PRs into the codebase → build + ship → and MAINTAIN as the SDK evolves. Engine = agentic coding: Claude Code (skills + memory, already in CI via claude.yml) and/or a custom Anthropic Agent-SDK orchestrator, driving the GAIA coder (CodeAgent on origin/coder), running in an isolated live-SDK worktree. This very session (clone, scope, spec, adversarial-review iterate, memory, live repo, PR) is the manual prototype. Keystone property: continuous maintenance — an SDK delta that breaks an agent's eval is a factory trigger (re-scope/re-implement/re-eval/re-PR/re-ship), with synthetic datasets + baselines as the regression net; builds pin the SDK commit as provenance. The factory also opens PRs into the codebase (agent code, and SDK improvements it needs), through the real claude.yml review + CI. Grounds every stage in existing components (coder branch, eval framework + --fix + synthetic corpus, gh, code_index, packaging line); the net-new work is the orchestrator + the SDK-delta loop. Keeps the manifest (§0.28)/Hub/signing seam to the runtime.
…into the factory

Restructured the build into difficulty-ordered milestones (§10, easiest→hardest) and folded the adversarial review's corrections (§11.5). Milestones automate deterministic/existing work first, judgment-heavy/unsolved-research work last:
  M0 Generalize the ship half (recipe-driven, per-agent; prove on a 2nd non-email agent)
  M1 Provenance + edge-verified releases (no LLM in the loop)
  M2 Independent eval oracle + confidence-bound gate (human-curated held-out set)
  M3 Assisted dev automation (agentic coding, human-gated)
  M4 SDK-delta maintenance loop (last; hard rules: serial-eval cap, no SDK auto-PR)

Review corrections:
- "reproducible against an SDK commit" is a category error (LLM codegen isn't reproducible) → traceability + source-hash, not regeneration;
- "orchestrate don't reinvent" overclaims reuse (release_agent_email.yml is 718 lines, email-hardcoded, npm OIDC bound to the filename) → M0 proves generalization empirically;
- the eval gate has no independent oracle (factory writes the agent AND its eval data AND sets the bar) → M2 adds a human-curated held-out oracle + LCB-over-k-runs gating as a hard prereq for M3-M4;
- M4's maintenance loop has a convergence hazard + a serial-eval throughput ceiling (N agents on one Lemonade slot);
- AI auto-PRing the shared SDK is a hard boundary, not a rec (fails + files an issue); escalated.

Downgraded the coder/#1913 deps to "exists on a branch."
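One way to read "LCB-over-k-runs gating" is: run the eval k times and require a lower confidence bound on the mean score, not the mean itself, to clear the bar. The sketch below uses a one-sided normal approximation with z = 1.645; both the formula and the z value are assumptions (for small k a t-quantile would be more conservative), and nothing here reflects what `scorecard_gate.py` actually computes.

```python
import math
import statistics
from typing import Sequence


def lower_confidence_bound(scores: Sequence[float], z: float = 1.645) -> float:
    """Mean minus z standard errors over k repeated eval runs."""
    k = len(scores)
    if k < 2:
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.fmean(scores)
    standard_error = statistics.stdev(scores) / math.sqrt(k)
    return mean - z * standard_error


def passes_gate(scores: Sequence[float], bar: float, z: float = 1.645) -> bool:
    """Gate on the lower bound, so a noisy agent cannot pass on a lucky mean."""
    return lower_confidence_bound(scores, z) >= bar
```

With a bar of 0.8, four steady runs at 0.9 pass, while runs of 1.0, 0.7, 1.0, 0.7 fail even though their mean is 0.85, because the spread pulls the lower bound to roughly 0.71.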
…radius holes in-place Second review found the first-round fixes were partly prose over unclosed holes. Fixed the findings in the CANONICAL sections (not just a critique capstone — the review flagged that anti-pattern): - Held-out oracle self-certification: "different provenance" was defined against the LLM implementer (trivially true). Now requires curator ≠ SPEC AUTHOR (the real circular source), plus a human-judged real-data label slice (by-construction labels alone encode the author's definition as truth). §5.5. - Oracle was consumed (stages 7/13) but produced nowhere and could rot as the agent grows. Added stage 5b (curate/extend the held-out oracle) with a coverage-delta gate. §3, §5.5. - SDK-release human gate fired blind to blast radius (approving a tag before the N-agent fan-out is known = relocated rubber stamp). Now feeds the gate a pre-cut all-agent dry-run (approve the radius, not the tag) and disambiguates SDK-release (M4) from agent-publish (stage 16). §2.5, M4. - Propagated LCB/k-runs gating into the canonical gate spec (§5, stage 13), not only §11.5. - §10 now names unmerged-branch prerequisites (#1913 manifest schema; origin/coder) instead of "exists"; M0 flags per-agent OIDC provisioning as supply-chain work + packaging-parity risk. Non-convergence capacity cost (mis-sized-cadence signal) added to §2.5. - 🔒 seed-from-real is a recurring PII intake → scrub/consent gate on every refresh (§5.5). - §11.6 records what the fixes did/didn't close (closed / hardened / fixed / residual).
…cy layer, not levels) Verified all six §0.x cross-references to the runtime doc (#1913) are accurate (§0.4 confirmation, §0.5 install/Hub, §0.15 contract-version, §0.24 signing, §0.28 manifest schema, §0.34 autonomy). Corrected one: the runtime doc explicitly says §0.34 is 'a policy + temporal engine, not one enum field,' so calling it 'autonomy levels' contradicted the sibling doc — now 'autonomy policy layer.'
…he live SDK) (amd#1914)

## Why this matters

Agents rot when frozen against an SDK snapshot: the GAIA SDK changes constantly. The **Agent Factory is the developer lifecycle itself, automated, against the *live* SDK**: clone → scope → issues → spec → synthetic data → implement → eval-gate → PR → build/ship → **maintain on SDK deltas**. It splits into a **dev half** (net-new agentic-coding automation) and a **ship half** (the existing, rigorous `release_agent_email.yml` + `packaging/*` pipeline, which the factory orchestrates rather than reinvents), tied by the maintenance loop, the differentiator nobody in the field (Hermes/OpenClaw learn *skills*) ships.

The plan is deliberately hardened: it has been through two adversarial architecture reviews, an independent cold read, and two domain-specialist reviews (eval methodology vs `scorecard_gate.py`/`benchmark.py`; release mechanics vs the live workflow/Hub Worker), with every correction folded into the normative sections (§11.5–§11.7 record what changed). Every "exists" claim is code-verified, including the honest ones: the whole-package zip is disabled (Cloudflare 413), the coder lives on `origin/coder`, and the manifest schema depends on amd#1913.

Load-bearing design points: a concrete `recipe` as the single authored input (§1.5) · per-stage human approve/deny gates incl. a blast-radius dry-run before any SDK release (§2.5) · the factory's own least-privilege authority (§2.6) · a statistically sound eval gate (fixed bar + non-inferiority band, matching the shipped `scorecard_gate.py`; §5) · synthetic-data discipline with a human-curated held-out oracle, curator ≠ spec author, split by thread-id (§5.5) · recovery levers that match the real infra (roll-forward, dist-tag, version pin; §6.5) · difficulty-ordered milestones M0→M4 with merge-prerequisites named (§10).

## Test plan

- [ ] Review `docs/plans/agent-factory.md` end-to-end (reading map at top; §11.5–§11.7 record the review corrections).
- [ ] Confirm the §12 open decisions (orchestrator substrate; trusted-lane auto-approve scope; stage-18 trigger policy; recipe-vs-manifest).
- [ ] 🔒 Sign off on the two escalated items: the PII scrub/consent gate for seed-from-real (§5.5) and the SDK-release blast-radius gate (§2.5).
- [ ] Docs-only; no code paths affected.
## Why this matters

Today the Agent UI is a ~19K-LoC Python backend that hosts each agent in-process, so the app can't dogfood the out-of-process product integrators actually consume, and every agent's logic is entangled with the UI. This plan recasts it: every agent is its own out-of-process sidecar exposing one REST contract, and the Agent UI becomes a thin host that renders components and supervises sidecars. All agent reasoning runs in the sidecar; the UI (and the `gaia <agent>` CLI, and integrators) all drive the same contract.

The keystone the current product is missing: a `POST /query` agent endpoint (SSE) that reasons and chains tools into multi-step workflows, co-equal with the deterministic fixed-function endpoints. `gaia email` and the sidecar become endpoint-identical.

This supersedes the incremental email cutover (#1910), whose in-UI `EmailProxyAgent` (a reduced 5-tool loop) is replaced by relaying to the sidecar's full `/query`, which also removes that PR's tool-surface reduction.

## What's in the PR

Design captured across two docs (overview drives, detailed plan follows):

- `agent-ui.mdx`: architecture, the thin-host boundary, `/query` SSE, hub mirror (one-click install/uninstall), dev mode, the card-rendering change, and a strangler-fig migration path.
- `agent-ui-agent-capabilities-plan.md` §0: the per-sidecar REST contract, the `/query` SSE event schema, the host `AgentSidecarManager` + proxy, mid-workflow confirmation, OAuth forwarding, CLI parity, shared-service custody, and what each fat-backend router becomes.

## Grounded/corrected against the current code (not hand-waved)

The design was reviewed against the real code and three optimistic claims were fixed:

- `MessageBubble.tsx` `STRUCTURED_PAYLOAD_LANGS`) → v2's typed `render` SSE event is a real (small) frontend change, called out honestly.
- `/v1/connections` is a host-side intake today (`connectors.py:94`, `api.py:266`) → v2 inverts the role to host-forwards-out, and assigns token refresh to the single-writer host.

## Open decisions flagged for sign-off (not silently resolved)

## Test plan

- `agent-ui.mdx` "Architecture (v2 — sidecar-first)" section for the vision + boundaries.
- `agent-ui-agent-capabilities-plan.md` §0.1–§0.10 for the contract, `/query` schema, supervisor, migration path.