v0.4.4.5 feat: autonomous investigation orchestrator (Approach D) — gated off#232
WZ wants to merge 17 commits into
…ybrid stop
New agent (src/agents/orchestrator.ts) for Approach D: an unbounded read-only move-loop that wraps (not replaces) the fixed investigation DAG. This increment is the pure, fully-injected control flow, unit-testable without an LLM or MCP.
- Moves: hypothesize / query / test / conclude (spawn-subagent + follow-cause recognized but deferred to increments 3-4).
- DECISION 1 (hybrid stop): conclude only stops the loop when the leading hypothesis is deterministically confirmed by the keystone (verdict 'satisfied'); the LLM's self-confidence is recorded but never the gate.
- DECISION 2 (safety harness): budget / depth / strikes→operator-pause / tool-cap / wall-clock, all hard limits, checked before each move. Plus a no-progress stall guard and a 1000-move backstop.
- Gated behind config.agent.orchestratorEnabled (default false).
- 14 unit tests cover the hybrid stop (rejects high self-confidence without keystone backing), every guard outcome, real-keystone integration, and graceful handling of bad moves.
tsc clean.
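As a rough illustration of the two decisions in this commit, the sketch below shows a loop where guards run before every move and `conclude` only ends the run on a deterministic `satisfied` verdict. It is a synchronous simplification, not the code in src/agents/orchestrator.ts: `runLoop`, `State`, `Guards`, and the two-move `Move` union are hypothetical names chosen for the example.

```typescript
// Hypothetical, simplified shapes; the real orchestrator types are not shown in this PR text.
type Verdict = "satisfied" | "violated" | "inconclusive";
type Move =
  | { kind: "hypothesize"; claim: string }
  | { kind: "conclude"; selfConfidence: number };

interface State {
  moves: number;
  strikes: number;
  leader: string | null; // current leading hypothesis
}

interface Guards {
  maxMoves: number;
  maxStrikes: number;
}

type Outcome = "confirmed" | "operator-pause" | "limit";

function runLoop(
  decideMove: (state: State) => Move, // injected: an LLM in production, a stub in tests
  evaluate: (claim: string) => Verdict, // injected: the deterministic keystone
  guards: Guards,
): { outcome: Outcome; state: State } {
  const state: State = { moves: 0, strikes: 0, leader: null };
  for (;;) {
    // Safety harness: hard limits are checked before every move.
    if (state.strikes >= guards.maxStrikes) return { outcome: "operator-pause", state };
    if (state.moves >= guards.maxMoves) return { outcome: "limit", state };

    const move = decideMove(state);
    state.moves += 1;

    if (move.kind === "hypothesize") {
      state.leader = move.claim;
      continue;
    }

    // Hybrid stop: move.selfConfidence is deliberately not consulted here.
    // Only the keystone verdict on the leading hypothesis can end the loop.
    const verdict = state.leader === null ? "inconclusive" : evaluate(state.leader);
    if (verdict === "satisfied") return { outcome: "confirmed", state };
    state.strikes += 1; // an unbacked conclude counts as a strike
  }
}
```

Because both dependencies are plain functions, a test can pass a stub that always concludes with high self-confidence and check that the loop still ends in `operator-pause` rather than `confirmed`.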
… runner
Make the orchestrator core actually run.
- createLlmDecideMove: the agent's brain. An LLM picks the next move from the read-only state. Follows the project's structured-output convention (generateText + JSON parse, NO tools / NO responseFormat; sidesteps the gpt-oss <|constrain|>json quirk). parseMove is robust to fenced/prose-wrapped JSON and schema drift (→ graceful null, never a throw); LlmUnavailableError propagates so the runner fails cleanly.
- runAutonomousOrchestrator: headless entry wiring the pure core's three injected deps to real impls: decideMove (LLM), gatherEvidence (createGatherEvidence, read-only by construction), evaluate (evaluatePrediction keystone). Decide + query token usage feeds the budget guard via estimateTokens.
- export HypothesisPredictionSchema (reused to validate LLM-emitted predictions).
- 12 unit tests (parseMove per move type + messy output, prompt rendering, decide-fn with injected callModel incl. error degradation). callModel is a test seam so move selection is verified without a live model.
tsc clean, full suite 2434 green.
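To make the "graceful null, never a throw" behaviour concrete, here is a minimal sketch of a tolerant parser. It is not the project's parseMove: `parseMoveSketch`, the `move` field name, and the four-move whitelist are assumptions for illustration, and the real implementation presumably validates much more of the schema.

```typescript
// Assumed move names (taken from the first increment) and an assumed `move` field.
type MoveKind = "hypothesize" | "query" | "test" | "conclude";
const KNOWN_MOVES: readonly string[] = ["hypothesize", "query", "test", "conclude"];

interface ParsedMove {
  move: MoveKind;
  [key: string]: unknown;
}

function parseMoveSketch(raw: string): ParsedMove | null {
  // Take the outermost {...} span so code fences and surrounding prose are ignored.
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) return null;
  try {
    const value: unknown = JSON.parse(raw.slice(start, end + 1));
    if (typeof value !== "object" || value === null) return null;
    const move = (value as Record<string, unknown>).move;
    // Schema drift (missing or unknown move) degrades to null instead of throwing.
    if (typeof move !== "string" || !KNOWN_MOVES.includes(move)) return null;
    return value as ParsedMove;
  } catch {
    return null; // malformed JSON
  }
}
```

The outermost-brace heuristic is deliberately crude: prose that itself contains braces will fail to parse and return null, which is still the safe outcome for the caller.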
…l, gate
Make the orchestrator triggerable and streamable over the WebSocket.
- WS protocol: orchestrator_investigate (client) + orchestrator:started/step/complete/error (server). Reuses the existing AgentStreamEvent (one stream shape for deep mode + orchestrator); adds OrchestratorStreamStats footer.
- orchestrate adapter on createMastraAdapters: reuses investigation providers + model, runs runAutonomousOrchestrator, maps the core's TraceEntry stream to AgentStreamEvent. DEFAULT_ORCHESTRATOR_GUARDS (conservative; config knobs later).
- ws-handler: handleOrchestratorInvestigate (seeds focus + time window from a completed investigation) + runOrchestratorStreamed. Gated on config.agent.orchestratorEnabled; rejects when off (defense in depth).
- traceEntryToStreamEvent: pure, plain-English move→stream mapping, 7 tests.
- server injects window.__ORCHESTRATOR_ENABLED__ only when enabled; Window global declared. Trigger stays hidden client-side by default.
tsc clean, 33 orchestrator tests green.
Generalize the shipped DeepModeStream into a reusable AgentStream and wire the orchestrator trigger + live stream.
- AgentStream: the shared structured stream (colored verbs, status icons, indented sub-steps, generic footer items). DeepModeStream is now a thin wrapper over it (label + footer only); no behavior change for deep mode.
- OrchestratorStream: same rendering, orchestrator footer (moves / queries / depth / strikes / tokens / elapsed; strikes turn amber once >0).
- InvestigationPane: 'Investigate autonomously' trigger (Compass icon) gated on window.__ORCHESTRATOR_ENABLED__ + isComplete; handles orchestrator:started/step/complete/error; renders OrchestratorStream. Unlike deep mode it needs no prior loopOutcome (it investigates from scratch).
- App: onOrchestrate → orchestrator_investigate WS message.
tsc clean, web bundle builds, full suite 2442 green.
Add the subagent capability to the core loop + surface run outcomes.
- spawn-subagent move now folds a depth-1 sub-investigation's findings back into evidence (injected spawnSubagent dep), counts subagents, and enforces a new maxSubagents guard. Absent dep / over-limit → graceful skip + trace. follow-cause stays deferred (increment 4).
- OrchestratorState.subagents + stats.subagents + OrchestratorStreamStats + ws-handler complete mapping + OrchestratorStream footer.
- Outcome banner (the freebie): AgentStream gains a terminal banner slot; OrchestratorStream maps outcome → plain-English callout (confirmed / paused / hit-a-limit / inconclusive) so a run never just stops ambiguously. Full operator-pause continue/escalate/wait card stays in increment 5.
- 4 new core tests (folds findings, graceful skip, maxSubagents limit, follow-cause still deferred).
tsc clean, web builds, full suite 2445 green. Live dispatch (spawnSubagent → investigationAgent.investigate) is increment 3b.
Wire spawn-subagent to a real scoped sub-investigation.
- orchestrate adapter: spawnSubagent resolves a ServiceConfig (from config or a minimal one) and runs investigationAgent.investigate(..., template=standard, readOnlyTools=true) on the target service, folding its conclusion (rootCause + summary) back as one infra observation the orchestrator can test against. Read-only; failures degrade to no findings (never aborts the parent).
- runAutonomousOrchestrator passes spawnSubagent through to the core dep.
- System prompt now teaches the spawn-subagent move + when to use it (after local hypotheses keep failing, investigate a related service instead of guessing); without this the LLM never picked the move.
- Subagent token usage is bounded by maxSubagents + wall-clock in v1 (not the token budget); noted for a later refinement.
tsc clean, full suite green.
…tStream
A run was looking hung between completed steps (decideMove thinking, a long query, a running subagent showed no activity). Add a liveness signal to the shared AgentStream (both deep mode + orchestrator):
- header shows ticking '· live · Ns' while running
- a pulsing '◉ working… Ns' row below the steps while running
A progress bar doesn't fit an unbounded agent (no known total), so this is a moving liveness indicator instead. Frontend-only; tsc clean.
The agent can now follow the incident into a connected service instead of
exhausting local hypotheses and pausing.
- ws-handler resolves the incident service's dependency-graph neighbors (both
directions, via inferDependencyGraph over the stack's services — mirrors
GET /api/dependencies/:service) and threads them in. Empty graph → empty
list → follow-cause disables gracefully.
- core: OrchestratorState.dependencies + a real follow-cause move — a scoped
read-only sub-investigation on a neighbor, grounded so the agent can ONLY
follow into a known dependency (not wander). Reuses the subagent budget.
- prompt: teaches follow-cause + nudges the agent to pivot to a dependency
after just 1-2 local rule-outs (addresses the 'burned all strikes locally
then paused' behavior).
- stream: follow-cause/spawn-subagent now map to done rows ('followed the trail
to …'). 4 new core tests + a prompt-rendering test. tsc clean, suite 2448.
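The grounding rule above (follow only into a known dependency, skip gracefully otherwise, share the subagent budget) can be pictured as a small pure check. This is a sketch under assumptions: `resolveFollowCause`, `FollowCauseMove`, and the `FollowResult` shape are hypothetical names, not the orchestrator's actual API.

```typescript
// Hypothetical shapes for illustration only.
interface FollowCauseMove {
  kind: "follow-cause";
  service: string; // the neighbor the agent wants to investigate
}

type FollowResult =
  | { status: "followed"; service: string }
  | { status: "skipped"; reason: string };

function resolveFollowCause(
  move: FollowCauseMove,
  dependencies: readonly string[], // neighbors resolved from the dependency graph
  subagentsUsed: number,
  maxSubagents: number,
): FollowResult {
  // Empty graph: follow-cause is effectively disabled.
  if (dependencies.length === 0) {
    return { status: "skipped", reason: "no known dependencies" };
  }
  // Grounding: the target must be a known neighbor, so the agent cannot wander.
  if (!dependencies.includes(move.service)) {
    return { status: "skipped", reason: `${move.service} is not a known dependency` };
  }
  // follow-cause reuses the subagent budget.
  if (subagentsUsed >= maxSubagents) {
    return { status: "skipped", reason: "subagent budget exhausted" };
  }
  return { status: "followed", service: move.service };
}
```

Returning a `skipped` result with a reason, rather than throwing, matches the "graceful skip + trace" behaviour described for the subagent move.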
Each follow-cause/spawn ran a full standard sub-investigation (~2-3 min); an autonomous run can spawn several (the impala validation took 8.6 min for two). Drop subagents to the quick (metrics-only) template — ~1 min each. Trades some depth (no logs) for cost; revisit if subagents miss log-based causes.
…chain
Turn cross-service runs from 'exhausted' into 'confirmed', and surface the chain.
- Prompt (part A): after a follow-cause/subagent returns findings, the agent must hypothesize the specific cause they point to (with a checkable prediction) and test it; never stop right after following. Directly fixes the impala caveat (followed both deps but never confirmed).
- Causal chain (part B): assembleCausalChain derives the ordered path (incident service → each followed dependency → confirmed root cause) from the finished run's trace. Emitted on orchestrator:complete; OrchestratorStream renders a 'Causal chain' summary card (root cause in green) below the stream. This finally closes the long-standing CAUSAL CHAIN gap.
- 2 new chain-assembly tests.
tsc clean, full suite green, web builds.
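A minimal sketch of how such a chain could be derived from a finished trace is shown below. The `TraceEntry` union here is a hypothetical reduction of the real trace type, and `assembleChainSketch` is an illustrative name rather than the shipped assembleCausalChain. It also collapses repeat visits to the same service into one link, which this PR adds in a later commit.

```typescript
// Hypothetical, reduced trace shape for illustration.
type TraceEntry =
  | { kind: "follow-cause"; service: string }
  | { kind: "confirmed"; rootCause: string }
  | { kind: "other" };

interface CausalChain {
  services: string[]; // incident service first, then followed dependencies in order
  rootCause: string | null; // null when the run never confirmed a cause
}

function assembleChainSketch(
  incidentService: string,
  trace: readonly TraceEntry[],
): CausalChain {
  const services = [incidentService];
  let rootCause: string | null = null;
  for (const entry of trace) {
    if (entry.kind === "follow-cause") {
      // A service followed more than once is still one link.
      if (!services.includes(entry.service)) services.push(entry.service);
    } else if (entry.kind === "confirmed") {
      rootCause = entry.rootCause;
    }
  }
  return { services, rootCause };
}
```

Deriving the chain from the trace after the run, instead of maintaining it during the loop, keeps the core loop unchanged and makes the assembly a pure function that is easy to unit test.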
…ent 6 polish
Increment 5 — the strike limit now hands the call to a human instead of just
stopping. When `maxStrikes` is hit the loop emits `orchestrator:operator_pause`
and blocks on the operator's decision:
- continue → reset strikes and resume (other guards still bound it)
- escalate / wait → stop with that disposition
Pieces:
- core: `onOperatorPause` hook (was stranded) + a hard cap of
MAX_OPERATOR_CONTINUES so a hung/looping operator can't spin the loop
forever. Unit tests: continue-then-stop, escalate immediate-stop, no-hook
(unchanged behavior), and the continue cap.
- wiring: threaded through runAutonomousOrchestrator + the orchestrate adapter.
- WS: `orchestrator:operator_pause` (server) / `orchestrator_decision`
(client); a per-connection pending-pause registry resolves the decision,
with a 5-minute timeout → escalate and cleanup on WS close so a disconnect
never strands a blocked loop.
- UI: OperatorPauseCard (continue / escalate to on-call / instrument & wait);
InvestigationPane pause handlers + disposition banner; App wires the send.
escalate/wait have no backend yet (recorded only) — noted as follow-up.
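The pending-pause registry described in the WS bullet could look roughly like the sketch below: a blocked loop awaits a promise that is settled by the operator's decision, by a timeout that falls back to escalate, or by connection close. `PendingPauses` and its method names are hypothetical, and resolving with `escalate` on close is an assumption (the commit only says that close triggers cleanup).

```typescript
type Decision = "continue" | "escalate" | "wait";

class PendingPauses {
  private readonly pending = new Map<string, (decision: Decision) => void>();

  // Called by the loop: blocks until a decision arrives or the timeout fires.
  wait(runId: string, timeoutMs: number): Promise<Decision> {
    return new Promise<Decision>((resolve) => {
      // Timeout falls back to "escalate", as the commit describes.
      const timer = setTimeout(() => this.resolve(runId, "escalate"), timeoutMs);
      this.pending.set(runId, (decision) => {
        clearTimeout(timer);
        this.pending.delete(runId);
        resolve(decision);
      });
    });
  }

  // Called when the client sends a decision; false if nothing was pending.
  resolve(runId: string, decision: Decision): boolean {
    const settle = this.pending.get(runId);
    if (!settle) return false;
    settle(decision);
    return true;
  }

  // Called on WS close so a disconnect never strands a blocked loop.
  // Assumption: a closed connection is treated like an escalation.
  closeAll(): void {
    for (const runId of [...this.pending.keys()]) this.resolve(runId, "escalate");
  }
}
```

With the 5-minute timeout from the commit, the loop side would be `await pauses.wait(runId, 5 * 60 * 1000)`; the separate MAX_OPERATOR_CONTINUES cap still lives in the core, since the registry only transports decisions.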
Increment 6 polish (causal chain shipped earlier; these were the remaining
spec §8 items):
- source attribution: each causal-chain link now carries the finding /
prediction it rests on (CausalChainLink type), preferring the subagent's
folded conclusion for followed links.
- one-line trace summary ("N moves · M queries · K subagents · <outcome> at
depth D") on the run footer.
Stays gated off (config.agent.orchestratorEnabled defaults false).
Validated live on localhost over the WS against `impala` (2 deps): the loop hit
the strike limit, emitted the pause, resumed on a `continue` decision, and
confirmed a root cause — full causal chain (incident → 2 deps → root cause)
with attribution + trace summary. tsc clean, full suite 2456 passing.
Increments 5 + 6 polish landed.
Increment 5: interactive operator-pause. The strike limit now emits `orchestrator:operator_pause` and blocks on the operator's decision.
Increment 6 polish. Causal-chain links now carry source attribution (the finding/prediction each rests on), plus a one-line trace summary on the footer.
Still gated off (`config.agent.orchestratorEnabled` defaults false). Validated live over the WS: a run hit the strike limit, paused, resumed on a `continue` decision, and confirmed a root cause.
Both bite only when config.agent.orchestratorEnabled is flipped on, but they let a direct WS client bypass the UI's gating:
- Rate limiting: classifyWsMessage only bucketed `chat` and `deep_investigate` as investigation traffic, so `orchestrator_investigate` (and `deep_mode_investigate`, same defect) fell through to the looser `general` 20/min cap. Both are heavy autonomous LLM runs; route them through the stricter investigation bucket.
- Precondition: the orchestrator handler ran for any existing investigation id, including `running`/`failed`/report-less rows. The UI only shows the trigger after completion; a direct WS message bypassed that. Reject non-complete or report-less rows, mirroring the deep-mode handler.
Adds classifyWsMessage cases for both new types and four orchestrator guard tests (gate off, not-found, still-running, report-less).
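The rate-limiting fix amounts to widening the set of message types that count as investigation traffic. A minimal sketch, assuming the classifier keys on the message `type` string and returns a bucket name (`classifyWsMessageSketch` and `Bucket` are illustrative names; the real classifyWsMessage signature is not shown here):

```typescript
type Bucket = "investigation" | "general";

// All four types start heavy LLM runs, so they share the stricter bucket.
const INVESTIGATION_TYPES: ReadonlySet<string> = new Set([
  "chat",
  "deep_investigate",
  "deep_mode_investigate", // previously fell through to "general"
  "orchestrator_investigate", // previously fell through to "general"
]);

function classifyWsMessageSketch(type: string): Bucket {
  return INVESTIGATION_TYPES.has(type) ? "investigation" : "general";
}
```

Keeping the list in one set means a future heavy message type needs a single new entry, and a test can assert the bucket for each known type.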
…dog/abort, false-confirm guard
The first broader-validation pass (6 real incidents) surfaced three blockers
that all argued against un-gating. Fixed on this branch (still gated off):
1. Quick-template synthesis input bug. The synthesis step's input schema
requires the parallel-keyed shape `{ "metrics-evidence": … }`, but the quick
template chained `.then(metricsStep).then(synthesisStep)` — handing synthesis
the metrics step's RAW output (no `metrics-evidence` key) → "Step input
validation failed" on every quick run, which degraded to an empty report.
Since orchestrator subagents use the quick template, this silently gutted
their findings ("Investigation complete" with no content). Fix: quick now
uses `.parallel([metricsStep])` like standard/full. Regression test runs the
real quick workflow to success (red→green). Live: subagent findings are now
substantive ("lacked resilience to single-pod operation…").
2. No silent hangs / pile-on. Two runs streamed zero steps in 8 min under
resource contention. Added: a per-operation watchdog (`opTimeoutMs`, default
150s) that abandons a hung gather/subagent so one stuck MCP/LLM call can't
strand the loop between guard checks; a cooperative abort `signal` checked
each move, wired so WS disconnect aborts the run (no headless run-on); and a
per-connection concurrency guard (one run per investigation — a double-click
no longer spawns a second parallel run). New `aborted` outcome.
3. Cross-service false-confirm. agw-admin-ui (a 0-replica deployment) was
"confirmed" as caused by a degraded payment-service with NO follow-cause into
it — a keystone false-confirm (the prediction was observably true but not
causally linked). Guard: a confirmed cause that names a dependency never
followed-cause'd into is rejected (nudges to investigate it first); mentions
of the incident service's own behaviour are fine. Prompt hardened: observing
a dependency is unhealthy is correlational; cross-service causes need a
follow-cause. Live: agw now confirms the correct self-service cause
("deployment is missing from the cluster entirely [confirmed by not_found]").
Also: dedupe the causal chain (a service followed more than once is one link).
tsc clean, full suite 2466. Live re-validation confirmed all three on real
incidents + the concurrency guard / abort-on-disconnect over the WS.
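For item 2, the combination of a per-operation watchdog and a cooperative abort check can be sketched as follows. The names `withWatchdog`, `runMoves`, and `Findings` are hypothetical, and the sketch only stops waiting on a hung operation; it does not cancel the underlying call, which may or may not match what the real `opTimeoutMs` handling does.

```typescript
type Findings = string[];

// Races one operation against a timer; resolves to null if the timer wins.
async function withWatchdog(
  op: () => Promise<Findings>,
  opTimeoutMs: number,
): Promise<Findings | null> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<null>((resolve) => {
    timer = setTimeout(() => resolve(null), opTimeoutMs);
  });
  try {
    return await Promise.race([op(), timeout]);
  } finally {
    clearTimeout(timer); // never leave a stray timer behind
  }
}

interface RunResult {
  outcome: "done" | "aborted";
  findings: Findings;
  abandoned: number; // operations given up on by the watchdog
}

async function runMoves(
  ops: ReadonlyArray<() => Promise<Findings>>,
  opTimeoutMs: number,
  signal: AbortSignal,
): Promise<RunResult> {
  const findings: Findings = [];
  let abandoned = 0;
  for (const op of ops) {
    // Cooperative abort: checked before each move, e.g. after a WS disconnect.
    if (signal.aborted) return { outcome: "aborted", findings, abandoned };
    const result = await withWatchdog(op, opTimeoutMs);
    if (result === null) abandoned += 1;
    else findings.push(...result);
  }
  return { outcome: "done", findings, abandoned };
}
```

On the server side, the handler would create an `AbortController` per run and call `abort()` from the socket's close handler, so the next guard check ends the run with the `aborted` outcome.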
QA of the inc-7 fixes surfaced a layout wart: once every chain link carries an attribution subline (the inc-6 + inc-7 source-attribution work), the inline horizontal arrows go ragged — the separator floats far-right of a multi-line block and links wrap inconsistently. Render the chain as a vertical stack instead: one link per row, a subtle ↓ connector between rows, evidence indented beneath each label, root cause in green. Reads cleanly as cause→effect and handles multi-line attribution. Presentation-only (OrchestratorStream.tsx).
Add docs/orchestrator-agentic-loop.md — the move-loop, the two decisions (hybrid keystone stop + safety harness), the interactive operator-pause, cross-service follow-cause + the false-confirm guard, the causal-chain/trace output, and the layer-by-layer architecture. Mermaid flow charts: the move loop, the operator-pause sequence, and the component/data-flow. Mirrors the style of docs/architecture-overview.md.
Summary
The autonomous investigation orchestrator (Approach D): an unbounded, read-only move-loop where an LLM picks the next move to find the real root cause — the version that investigates (forms hypotheses, gathers evidence, follows the dependency graph across services), vs deep mode which only re-judges an existing answer.
Ships OFF by default (`config.agent.orchestratorEnabled`, default `false`): hidden from users, server-gated (the "Investigate autonomously" trigger is suppressed; `orchestrator_investigate` is rejected when off). This PR lands the validated core; it is not exposed.
What it does (increments 1 → 6)
- … (`src/agents/orchestrator.ts`): pure, fully-injected control flow. Moves: `hypothesize` / `query` / `test` / `conclude` / `spawn-subagent` / `follow-cause`.
- … `conclude`, but the loop only stops when the Step-2 corroboration keystone (`evaluatePrediction`) deterministically confirms the leader (`satisfied`). Self-confidence directs the search; it never ends it.
- `createLlmDecideMove` (structured output via `generateText` + JSON parse, no tools/`responseFormat` → sidesteps the gpt-oss quirk; robust to messy output) wired to `createGatherEvidence` (read-only) + the keystone.
- … `DeepModeStream` into a reusable `AgentStream`; `OrchestratorStream` reuses it with a live "working… Ns" indicator, a terminal outcome banner (so runs never stop ambiguously), and a Causal chain card.
- `spawn-subagent` runs a scoped read-only sub-investigation and folds its findings back into evidence (`maxSubagents` budget; `quick` template for cost).
- … (`inferDependencyGraph`) and lets the agent follow the cause into a known dependency (grounded: it can't wander).
- `assembleCausalChain` derives the ordered path (incident → followed deps → root cause) and renders it.
Validation (live, on real incidents)
Each major step was smoke-tested against a live stack (read-only MCP):
- … `outcome=confirmed`).
Tests
`tsc --noEmit` clean; full suite 2450 passing (175 files).
Deferred (follow-ups)
- … `quick` template (~1 min each); an autonomous run can still take several minutes. The budget/wall-clock guards bound it.
🤖 Generated with Claude Code