v0.4.4.5 feat: autonomous investigation orchestrator (Approach D) — gated off by WZ · Pull Request #232 · WZ/dops-assistant

WZ · 2026-06-03T02:40:38Z

Summary

Adds the autonomous investigation orchestrator (Approach D): an unbounded, read-only move-loop in which an LLM picks the next move to find the real root cause. Unlike deep mode, which only re-judges an existing answer, the orchestrator actually investigates: it forms hypotheses, gathers evidence, and follows the dependency graph across services.

Ships OFF by default (config.agent.orchestratorEnabled, default false) — hidden from users, server-gated (the "Investigate autonomously" trigger is suppressed; orchestrator_investigate is rejected when off). This PR lands the validated core; it is not exposed.

What it does (increments 1 → 6)

Move-loop + safety harness + hybrid stop (src/agents/orchestrator.ts) — pure, fully-injected control flow. Moves: hypothesize / query / test / conclude / spawn-subagent / follow-cause.
- Hybrid stop (the crux): the agent may propose conclude, but the loop only stops when the Step-2 corroboration keystone (evaluatePrediction) deterministically confirms the leader (satisfied). Self-confidence directs the search; it never ends it.
- Safety harness: budget / depth / strikes → operator-pause / tool-cap / wall-clock / stall guard / 1000-move backstop — all checked before each move.
LLM brain + headless runner — createLlmDecideMove (structured output via generateText + JSON parse, no tools/responseFormat → sidesteps the gpt-oss quirk; robust to messy output) wired to createGatherEvidence (read-only) + the keystone.
WS trigger + shared agent-stream UX — generalized the shipped DeepModeStream into a reusable AgentStream; OrchestratorStream reuses it with a live "working… Ns" indicator, a terminal outcome banner (so runs never stop ambiguously), and a Causal chain card.
Subagents (depth-1) — spawn-subagent runs a scoped read-only sub-investigation and folds its findings back into evidence (maxSubagents budget; quick template for cost).
Follow-cause via the dependency graph — resolves the incident service's neighbors (inferDependencyGraph) and lets the agent follow the cause into a known dependency (grounded — can't wander).
Hypothesize-from-findings + causal chain — after following a dependency, the agent turns the finding into a tested hypothesis; on finish, assembleCausalChain derives the ordered path (incident → followed deps → root cause) and renders it.
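
The hybrid stop described above can be sketched roughly as follows. This is an illustrative, synchronous simplification and not the code in src/agents/orchestrator.ts: the `Move`, `Verdict`, `LoopState`, and `Deps` shapes and the `runLoop` name are assumptions made for the example. Only the rule itself (a proposed conclude ends the loop only when the keystone returns `satisfied`) comes from the description.

```typescript
// Hypothetical shapes for illustration only.
type Verdict = "satisfied" | "refuted" | "inconclusive";

type Move =
  | { kind: "hypothesize"; hypothesis: string }
  | { kind: "conclude"; selfConfidence: number };

interface LoopState {
  leader: string | null; // current leading hypothesis
  lastSelfConfidence: number | null; // recorded, never used as the gate
  moves: number;
}

interface Deps {
  decideMove(state: LoopState): Move; // the LLM brain (injected)
  evaluate(hypothesis: string): Verdict; // the deterministic keystone (injected)
}

type Outcome = "confirmed" | "hit-a-limit";

function runLoop(deps: Deps, maxMoves: number): { outcome: Outcome; state: LoopState } {
  const state: LoopState = { leader: null, lastSelfConfidence: null, moves: 0 };
  while (state.moves < maxMoves) {
    const move = deps.decideMove(state);
    state.moves += 1;
    if (move.kind === "hypothesize") {
      state.leader = move.hypothesis;
      continue;
    }
    // conclude: record the self-confidence, but let only the keystone stop the loop.
    state.lastSelfConfidence = move.selfConfidence;
    if (state.leader !== null && deps.evaluate(state.leader) === "satisfied") {
      return { outcome: "confirmed", state };
    }
  }
  return { outcome: "hit-a-limit", state };
}
```

A conclude with 0.99 self-confidence on a refuted hypothesis leaves the loop running in this sketch; the search continues until the keystone agrees or a limit is reached.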

Validation (live, on real incidents)

Each major step was smoke-tested against a live stack (read-only MCP):

Base loop: ruled out OOMKill → pivoted → keystone-confirmed nginx-ingress degradation (outcome=confirmed).
Operator-pause: ruled out 3 local causes → stopped + banner (didn't guess).
Follow-cause: on a service with 2 dependency neighbors, ruled out local cause → followed both neighbors (2 scoped sub-investigations), folding findings back.

Tests

tsc --noEmit clean; full suite 2450 passing (175 files).
New unit tests: the harness (every guard outcome), hybrid stop (rejects high self-confidence without keystone backing), move parsing, subagent/follow-cause handling, causal-chain assembly, trace→stream mapping.

Deferred (follow-ups)

Increment 5 — the interactive operator-pause card (continue / escalate / instrument-&-wait). The core pause hook is designed; the WS round-trip + UI card aren't in this PR.
Increment 7 — broader accuracy validation across more incident types.
Cost: subagents use the quick template (~1 min each); an autonomous run can still take several minutes. The budget/wall-clock guards bound it.

🤖 Generated with Claude Code

…ybrid stop

New agent (src/agents/orchestrator.ts) for Approach D: an unbounded read-only move-loop that wraps (not replaces) the fixed investigation DAG. This increment is the pure, fully-injected control flow, unit-testable without an LLM or MCP.

- Moves: hypothesize / query / test / conclude (spawn-subagent + follow-cause recognized but deferred to increments 3-4).
- DECISION 1 (hybrid stop): conclude only stops the loop when the leading hypothesis is deterministically confirmed by the keystone (verdict 'satisfied'); the LLM's self-confidence is recorded but never the gate.
- DECISION 2 (safety harness): budget / depth / strikes→operator-pause / tool-cap / wall-clock, all hard limits, checked before each move. Plus a no-progress stall guard and a 1000-move backstop.
- Gated behind config.agent.orchestratorEnabled (default false).
- 14 unit tests cover the hybrid stop (rejects high self-confidence without keystone backing), every guard outcome, real-keystone integration, and graceful handling of bad moves.

tsc clean.
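
As a rough illustration of DECISION 2, the pre-move guard check could look like the sketch below. The field names, the `GuardStop` labels, and the order in which guards are evaluated are all assumptions; only the list of guards (budget, depth, strikes, tool-cap, wall-clock, stall, 1000-move backstop) is taken from the commit message.

```typescript
// Hypothetical names throughout; not the real guard types.
interface Guards {
  maxTokens: number;
  maxDepth: number;
  maxStrikes: number;
  maxToolCalls: number;
  maxWallClockMs: number;
  maxStalledMoves: number;
}

interface RunState {
  tokens: number;
  depth: number;
  strikes: number;
  toolCalls: number;
  startedAtMs: number;
  movesSinceProgress: number;
  moves: number;
}

type GuardStop =
  | "budget"
  | "depth"
  | "operator-pause"
  | "tool-cap"
  | "wall-clock"
  | "stall"
  | "backstop";

const MOVE_BACKSTOP = 1000;

// Returns the first guard that trips, or null when the next move may run.
function checkGuards(s: RunState, g: Guards, nowMs: number): GuardStop | null {
  if (s.moves >= MOVE_BACKSTOP) return "backstop";
  if (s.tokens >= g.maxTokens) return "budget";
  if (s.depth > g.maxDepth) return "depth";
  if (s.strikes >= g.maxStrikes) return "operator-pause";
  if (s.toolCalls >= g.maxToolCalls) return "tool-cap";
  if (nowMs - s.startedAtMs >= g.maxWallClockMs) return "wall-clock";
  if (s.movesSinceProgress >= g.maxStalledMoves) return "stall";
  return null;
}
```

Passing the clock in as `nowMs` keeps the check pure, which fits the "fully-injected, unit-testable" goal stated above.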

… runner

Make the orchestrator core actually run.

- createLlmDecideMove: the agent's brain. An LLM picks the next move from the read-only state. Follows the project's structured-output convention (generateText + JSON parse, NO tools / NO responseFormat; sidesteps the gpt-oss <|constrain|>json quirk). parseMove is robust to fenced/prose-wrapped JSON and schema drift (→ graceful null, never a throw); LlmUnavailableError propagates so the runner fails cleanly.
- runAutonomousOrchestrator: headless entry wiring the pure core's three injected deps to real impls: decideMove (LLM), gatherEvidence (createGatherEvidence, read-only by construction), evaluate (evaluatePrediction keystone). Decide + query token usage feeds the budget guard via estimateTokens.
- export HypothesisPredictionSchema (reused to validate LLM-emitted predictions).
- 12 unit tests (parseMove per move type + messy output, prompt rendering, decide-fn with injected callModel incl. error degradation). callModel is a test seam so move selection is verified without a live model.

tsc clean, full suite 2434 green.
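
To make the "graceful null, never a throw" behavior concrete, here is a minimal sketch of a tolerant move parser. It is not the project's parseMove: the `parseMoveSketch` name, the `kind` field, and the strategy of slicing from the first `{` to the last `}` are assumptions for illustration. The move names are the ones listed in this PR.

```typescript
const MOVE_KINDS = [
  "hypothesize",
  "query",
  "test",
  "conclude",
  "spawn-subagent",
  "follow-cause",
] as const;

type MoveKind = (typeof MOVE_KINDS)[number];

// Hypothetical shape: a discriminator plus whatever payload the model emitted.
interface ParsedMove {
  kind: MoveKind;
  [key: string]: unknown;
}

function parseMoveSketch(raw: string): ParsedMove | null {
  // Ignore code fences or surrounding prose by keeping only the outermost braces.
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) return null;

  let value: unknown;
  try {
    value = JSON.parse(raw.slice(start, end + 1));
  } catch {
    return null; // malformed JSON degrades to null
  }

  if (typeof value !== "object" || value === null || Array.isArray(value)) return null;
  const kind = (value as Record<string, unknown>).kind;
  if (typeof kind !== "string") return null;
  if ((MOVE_KINDS as readonly string[]).indexOf(kind) === -1) return null; // schema drift
  return value as ParsedMove;
}
```

A caller can treat `null` as a bad move and let the loop carry on, rather than wrapping every model call in its own error handling.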

…l, gate

Make the orchestrator triggerable and streamable over the WebSocket.

- WS protocol: orchestrator_investigate (client) + orchestrator:started/step/complete/error (server). Reuses the existing AgentStreamEvent (one stream shape for deep mode + orchestrator); adds OrchestratorStreamStats footer.
- orchestrate adapter on createMastraAdapters: reuses investigation providers + model, runs runAutonomousOrchestrator, maps the core's TraceEntry stream to AgentStreamEvent. DEFAULT_ORCHESTRATOR_GUARDS (conservative; config knobs later).
- ws-handler: handleOrchestratorInvestigate (seeds focus + time window from a completed investigation) + runOrchestratorStreamed. Gated on config.agent.orchestratorEnabled; rejects when off (defense in depth).
- traceEntryToStreamEvent: pure, plain-English move→stream mapping, 7 tests.
- server injects window.__ORCHESTRATOR_ENABLED__ only when enabled; Window global declared. Trigger stays hidden client-side by default.

tsc clean, 33 orchestrator tests green.

Generalize the shipped DeepModeStream into a reusable AgentStream and wire the orchestrator trigger + live stream.

- AgentStream: the shared structured stream (colored verbs, status icons, indented sub-steps, generic footer items). DeepModeStream is now a thin wrapper over it (label + footer only); no behavior change for deep mode.
- OrchestratorStream: same rendering, orchestrator footer (moves / queries / depth / strikes / tokens / elapsed; strikes turn amber once >0).
- InvestigationPane: 'Investigate autonomously' trigger (Compass icon) gated on window.__ORCHESTRATOR_ENABLED__ + isComplete; handles orchestrator:started/step/complete/error; renders OrchestratorStream. Unlike deep mode it needs no prior loopOutcome (it investigates from scratch).
- App: onOrchestrate → orchestrator_investigate WS message.

tsc clean, web bundle builds, full suite 2442 green.

Add the subagent capability to the core loop + surface run outcomes.

- spawn-subagent move now folds a depth-1 sub-investigation's findings back into evidence (injected spawnSubagent dep), counts subagents, and enforces a new maxSubagents guard. Absent dep / over-limit → graceful skip + trace. follow-cause stays deferred (increment 4).
- OrchestratorState.subagents + stats.subagents + OrchestratorStreamStats + ws-handler complete mapping + OrchestratorStream footer.
- Outcome banner (the freebie): AgentStream gains a terminal banner slot; OrchestratorStream maps outcome → plain-English callout (confirmed / paused / hit-a-limit / inconclusive) so a run never just stops ambiguously. Full operator-pause continue/escalate/wait card stays in increment 5.
- 4 new core tests (folds findings, graceful skip, maxSubagents limit, follow-cause still deferred).

tsc clean, web builds, full suite 2445 green. Live dispatch (spawnSubagent → investigationAgent.investigate) is increment 3b.

Wire spawn-subagent to a real scoped sub-investigation.

- orchestrate adapter: spawnSubagent resolves a ServiceConfig (from config or a minimal one) and runs investigationAgent.investigate(..., template=standard, readOnlyTools=true) on the target service, folding its conclusion (rootCause + summary) back as one infra observation the orchestrator can test against. Read-only; failures degrade to no findings (never aborts the parent).
- runAutonomousOrchestrator passes spawnSubagent through to the core dep.
- System prompt now teaches the spawn-subagent move + when to use it (after local hypotheses keep failing, investigate a related service instead of guessing); without this the LLM never picked the move.
- Subagent token usage is bounded by maxSubagents + wall-clock in v1 (not the token budget); noted for a later refinement.

tsc clean, full suite green.

…tStream

A run was looking hung between completed steps (decideMove thinking, a long query, a running subagent showed no activity). Add a liveness signal to the shared AgentStream (both deep mode + orchestrator):

- header shows ticking '· live · Ns' while running
- a pulsing '◉ working… Ns' row below the steps while running

A progress bar doesn't fit an unbounded agent (no known total), so this is a moving liveness indicator instead. Frontend-only; tsc clean.

The agent can now follow the incident into a connected service instead of exhausting local hypotheses and pausing.

- ws-handler resolves the incident service's dependency-graph neighbors (both directions, via inferDependencyGraph over the stack's services; mirrors GET /api/dependencies/:service) and threads them in. Empty graph → empty list → follow-cause disables gracefully.
- core: OrchestratorState.dependencies + a real follow-cause move: a scoped read-only sub-investigation on a neighbor, grounded so the agent can ONLY follow into a known dependency (not wander). Reuses the subagent budget.
- prompt: teaches follow-cause + nudges the agent to pivot to a dependency after just 1-2 local rule-outs (addresses the 'burned all strikes locally then paused' behavior).
- stream: follow-cause/spawn-subagent now map to done rows ('followed the trail to …').

4 new core tests + a prompt-rendering test. tsc clean, suite 2448.
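
The grounding rule for follow-cause can be pictured as a small validation step before any sub-investigation is dispatched. The sketch below is an assumption-laden illustration, not the core's implementation: `resolveFollowCause`, the `FollowResult` shape, and the service names in the usage note are hypothetical. It reflects two facts from the commit message: the target must be a known dependency, and follow-cause shares the subagent budget.

```typescript
interface FollowCauseMove {
  kind: "follow-cause";
  target: string; // service the agent wants to follow the cause into
}

type FollowResult =
  | { ok: true; target: string }
  | { ok: false; reason: "unknown-dependency" | "subagent-budget-exhausted" };

function resolveFollowCause(
  move: FollowCauseMove,
  dependencies: readonly string[], // neighbors resolved from the dependency graph
  subagentsUsed: number,
  maxSubagents: number,
): FollowResult {
  // Grounding: the agent may only follow into a known neighbor.
  if (dependencies.indexOf(move.target) === -1) {
    return { ok: false, reason: "unknown-dependency" };
  }
  // follow-cause reuses the subagent budget.
  if (subagentsUsed >= maxSubagents) {
    return { ok: false, reason: "subagent-budget-exhausted" };
  }
  return { ok: true, target: move.target };
}
```

With an empty dependency list every target is rejected, which matches the "empty graph → follow-cause disables gracefully" behavior: for example, `resolveFollowCause({ kind: "follow-cause", target: "billing-db" }, [], 0, 2)` yields an `unknown-dependency` rejection rather than an error.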

Each follow-cause/spawn ran a full standard sub-investigation (~2-3 min); an autonomous run can spawn several (the impala validation took 8.6 min for two). Drop subagents to the quick (metrics-only) template — ~1 min each. Trades some depth (no logs) for cost; revisit if subagents miss log-based causes.

…chain

Turn cross-service runs from 'exhausted' into 'confirmed', and surface the chain.

- Prompt (part A): after a follow-cause/subagent returns findings, the agent must hypothesize the specific cause they point to (with a checkable prediction) and test it; never stop right after following. Directly fixes the impala caveat (followed both deps but never confirmed).
- Causal chain (part B): assembleCausalChain derives the ordered path (incident service → each followed dependency → confirmed root cause) from the finished run's trace. Emitted on orchestrator:complete; OrchestratorStream renders a 'Causal chain' summary card (root cause in green) below the stream. This finally closes the long-standing CAUSAL CHAIN gap.
- 2 new chain-assembly tests.

tsc clean, full suite green, web builds.
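
One way to picture the chain assembly is a single pass over the finished trace. The sketch below is illustrative only: the `TraceEntrySketch` and `ChainLink` shapes and the `assembleChainSketch` name are assumptions, not the real TraceEntry type or assembleCausalChain signature. It also collapses repeated follows of the same service into one link, which a later commit in this PR describes as a dedupe fix.

```typescript
// Hypothetical, simplified trace entries.
type TraceEntrySketch =
  | { kind: "follow-cause"; target: string }
  | { kind: "confirmed"; rootCause: string }
  | { kind: "other" };

interface ChainLink {
  role: "incident" | "dependency" | "root-cause";
  label: string;
}

function assembleChainSketch(
  incidentService: string,
  trace: readonly TraceEntrySketch[],
): ChainLink[] {
  const chain: ChainLink[] = [{ role: "incident", label: incidentService }];
  const seen = new Set<string>([incidentService]);
  let rootCause: string | null = null;

  for (const entry of trace) {
    if (entry.kind === "follow-cause") {
      // A service followed more than once still contributes a single link.
      if (!seen.has(entry.target)) {
        seen.add(entry.target);
        chain.push({ role: "dependency", label: entry.target });
      }
    } else if (entry.kind === "confirmed") {
      rootCause = entry.rootCause;
    }
  }

  // The root cause is appended only when the run actually confirmed one.
  if (rootCause !== null) chain.push({ role: "root-cause", label: rootCause });
  return chain;
}
```

A run that followed dependencies but never confirmed anything ends on a dependency link in this sketch, so a renderer can tell an unconfirmed trail from a confirmed chain.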

…or, gated off)

…ent 6 polish

Increment 5: the strike limit now hands the call to a human instead of just stopping. When `maxStrikes` is hit the loop emits `orchestrator:operator_pause` and blocks on the operator's decision:

- continue → reset strikes and resume (other guards still bound it)
- escalate / wait → stop with that disposition

Pieces:

- core: `onOperatorPause` hook (was stranded) + a hard cap of MAX_OPERATOR_CONTINUES so a hung/looping operator can't spin the loop forever. Unit tests: continue-then-stop, escalate immediate-stop, no-hook (unchanged behavior), and the continue cap.
- wiring: threaded through runAutonomousOrchestrator + the orchestrate adapter.
- WS: `orchestrator:operator_pause` (server) / `orchestrator_decision` (client); a per-connection pending-pause registry resolves the decision, with a 5-minute timeout → escalate and cleanup on WS close so a disconnect never strands a blocked loop.
- UI: OperatorPauseCard (continue / escalate to on-call / instrument & wait); InvestigationPane pause handlers + disposition banner; App wires the send. escalate/wait have no backend yet (recorded only); noted as follow-up.

Increment 6 polish (causal chain shipped earlier; these were the remaining spec §8 items):

- source attribution: each causal-chain link now carries the finding / prediction it rests on (CausalChainLink type), preferring the subagent's folded conclusion for followed links.
- one-line trace summary ("N moves · M queries · K subagents · <outcome> at depth D") on the run footer.

Stays gated off (config.agent.orchestratorEnabled defaults false). Validated live on localhost over the WS against `impala` (2 deps): the loop hit the strike limit, emitted the pause, resumed on a `continue` decision, and confirmed a root cause, with a full causal chain (incident → 2 deps → root cause) plus attribution + trace summary. tsc clean, full suite 2456 passing.
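
The per-connection pending-pause registry might be shaped roughly like the sketch below. The class and method names are hypothetical, and resolving with `escalate` on WS close is an assumption (the commit only says close triggers cleanup). What is taken from the commit message is the overall behavior: the loop blocks on a decision, a timeout falls back to escalate, and a disconnect never leaves a loop blocked.

```typescript
type OperatorDecision = "continue" | "escalate" | "wait";

interface PendingPause {
  resolve: (decision: OperatorDecision) => void;
  timer: ReturnType<typeof setTimeout>;
}

// Hypothetical registry; one instance per WebSocket connection.
class PendingPauseRegistry {
  private pending = new Map<string, PendingPause>();

  // Called by the loop when it pauses. Resolves with the operator's decision,
  // or with "escalate" if nothing arrives within timeoutMs.
  waitForDecision(runId: string, timeoutMs: number): Promise<OperatorDecision> {
    this.settle(runId, "escalate"); // clear any stale pause for the same run
    return new Promise<OperatorDecision>((resolve) => {
      const timer = setTimeout(() => this.settle(runId, "escalate"), timeoutMs);
      this.pending.set(runId, { resolve, timer });
    });
  }

  // Called when a decision message arrives. Returns false if nothing was pending.
  settle(runId: string, decision: OperatorDecision): boolean {
    const entry = this.pending.get(runId);
    if (!entry) return false;
    clearTimeout(entry.timer);
    this.pending.delete(runId);
    entry.resolve(decision);
    return true;
  }

  // Called on WS close so no blocked loop is left waiting (assumed: escalate).
  closeAll(): void {
    for (const runId of Array.from(this.pending.keys())) {
      this.settle(runId, "escalate");
    }
  }

  get size(): number {
    return this.pending.size;
  }
}
```

In this arrangement a late or duplicate `orchestrator_decision` simply gets `false` back from `settle`, so the handler can ignore it without touching the loop.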

WZ · 2026-06-03T04:04:41Z

Increments 5 + 6 polish landed in 32976d9.

Increment 5 — interactive operator-pause. The strike limit now emits orchestrator:operator_pause and blocks on the operator's decision: continue (reset strikes, resume — other guards still bound it), escalate to on-call, or instrument & wait. Core gets a MAX_OPERATOR_CONTINUES cap so a hung/looping operator can't spin forever; WS layer has a per-connection pending-pause registry with a 5-min timeout→escalate and cleanup on disconnect; UI gets the pause card + disposition banner. escalate/wait record intent only (no paging/scheduler backend yet — follow-up).

Increment 6 polish. Causal-chain links now carry source attribution (the finding/prediction each rests on), plus a one-line trace summary on the footer.

Still gated off (config.agent.orchestratorEnabled defaults false).

Validated live over the WS: a run hit the strike limit, paused, resumed on a continue decision, and confirmed a root cause with a full causal chain (incident → 2 deps → root cause) + attribution + trace summary. tsc clean, full suite 2456 passing. The pause card is code-complete + type-checked + builds but hasn't had a visual browser pass yet.

Both bite only when config.agent.orchestratorEnabled is flipped on, but they let a direct WS client bypass the UI's gating:

- Rate limiting: classifyWsMessage only bucketed `chat` and `deep_investigate` as investigation traffic, so `orchestrator_investigate` (and `deep_mode_investigate`, same defect) fell through to the looser `general` 20/min cap. Both are heavy autonomous LLM runs, so route them through the stricter investigation bucket.
- Precondition: the orchestrator handler ran for any existing investigation id, including `running`/`failed`/report-less rows. The UI only shows the trigger after completion; a direct WS message bypassed that. Reject non-complete or report-less rows, mirroring the deep-mode handler.

Adds classifyWsMessage cases for both new types and four orchestrator guard tests (gate off, not-found, still-running, report-less).
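
The two server-side fixes can be summarized in a small sketch. The function names, the `InvestigationRow` shape, the `complete` status value, and the rejection labels are assumptions for illustration; the message types, the `general` bucket, and the four rejected cases come from the commit message.

```typescript
type WsBucket = "investigation" | "general";

// Message types that start heavy LLM runs and belong in the stricter bucket.
const INVESTIGATION_TYPES = new Set<string>([
  "chat",
  "deep_investigate",
  "deep_mode_investigate",
  "orchestrator_investigate",
]);

function classifyWsMessageSketch(type: string): WsBucket {
  return INVESTIGATION_TYPES.has(type) ? "investigation" : "general";
}

// Hypothetical row shape for a stored investigation.
interface InvestigationRow {
  status: "running" | "failed" | "complete";
  report: string | null;
}

type Rejection = "gate-off" | "not-found" | "not-complete" | "no-report";

// Returns null when the orchestrator may run, otherwise the reason it is rejected.
function checkOrchestratePrecondition(
  orchestratorEnabled: boolean,
  row: InvestigationRow | undefined,
): Rejection | null {
  if (!orchestratorEnabled) return "gate-off";
  if (row === undefined) return "not-found";
  if (row.status !== "complete") return "not-complete";
  if (row.report === null || row.report.length === 0) return "no-report";
  return null;
}
```

Checking the gate first means a disabled deployment answers the same way regardless of which investigation id a direct WS client supplies.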

…dog/abort, false-confirm guard

The first broader-validation pass (6 real incidents) surfaced three blockers that all argued against un-gating. Fixed on this branch (still gated off):

1. Quick-template synthesis input bug. The synthesis step's input schema requires the parallel-keyed shape `{ "metrics-evidence": … }`, but the quick template chained `.then(metricsStep).then(synthesisStep)`, handing synthesis the metrics step's RAW output (no `metrics-evidence` key) → "Step input validation failed" on every quick run, which degraded to an empty report. Since orchestrator subagents use the quick template, this silently gutted their findings ("Investigation complete" with no content). Fix: quick now uses `.parallel([metricsStep])` like standard/full. Regression test runs the real quick workflow to success (red→green). Live: subagent findings are now substantive ("lacked resilience to single-pod operation…").
2. No silent hangs / pile-on. Two runs streamed zero steps in 8 min under resource contention. Added: a per-operation watchdog (`opTimeoutMs`, default 150s) that abandons a hung gather/subagent so one stuck MCP/LLM call can't strand the loop between guard checks; a cooperative abort `signal` checked each move, wired so WS disconnect aborts the run (no headless run-on); and a per-connection concurrency guard (one run per investigation, so a double-click no longer spawns a second parallel run). New `aborted` outcome.
3. Cross-service false-confirm. agw-admin-ui (a 0-replica deployment) was "confirmed" as caused by a degraded payment-service with NO follow-cause into it: a keystone false-confirm (the prediction was observably true but not causally linked). Guard: a confirmed cause that names a dependency never followed-cause'd into is rejected (nudges to investigate it first); mentions of the incident service's own behaviour are fine. Prompt hardened: observing a dependency is unhealthy is correlational; cross-service causes need a follow-cause. Live: agw now confirms the correct self-service cause ("deployment is missing from the cluster entirely [confirmed by not_found]").

Also: dedupe the causal chain (a service followed more than once is one link).

tsc clean, full suite 2466. Live re-validation confirmed all three on real incidents + the concurrency guard / abort-on-disconnect over the WS.
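
The false-confirm guard from item 3 can be illustrated with a deliberately simple check. How the real guard decides that a cause "names" a dependency is not stated in this PR; the case-insensitive substring match below, along with the `ConfirmCheck` shape and both function names, is an assumption made for the sketch.

```typescript
interface ConfirmCheck {
  rootCause: string; // the cause the keystone just confirmed
  dependencies: readonly string[]; // known neighbors of the incident service
  followed: readonly string[]; // dependencies the run actually followed into
}

// Lists dependencies the cause mentions that were never followed into.
function unfollowedDependenciesNamed(c: ConfirmCheck): string[] {
  const text = c.rootCause.toLowerCase();
  return c.dependencies.filter(
    (dep) => text.indexOf(dep.toLowerCase()) !== -1 && c.followed.indexOf(dep) === -1,
  );
}

// A confirmation stands only if every dependency it blames was investigated.
function acceptConfirmation(c: ConfirmCheck): boolean {
  return unfollowedDependenciesNamed(c).length === 0;
}
```

A cause that only describes the incident service's own behaviour names no dependency and passes, while one that blames an uninvestigated neighbor is sent back so the agent follows the cause first.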

QA of the inc-7 fixes surfaced a layout wart: once every chain link carries an attribution subline (the inc-6 + inc-7 source-attribution work), the inline horizontal arrows go ragged — the separator floats far-right of a multi-line block and links wrap inconsistently. Render the chain as a vertical stack instead: one link per row, a subtle ↓ connector between rows, evidence indented beneath each label, root cause in green. Reads cleanly as cause→effect and handles multi-line attribution. Presentation-only (OrchestratorStream.tsx).

Add docs/orchestrator-agentic-loop.md — the move-loop, the two decisions (hybrid keystone stop + safety harness), the interactive operator-pause, cross-service follow-cause + the false-confirm guard, the causal-chain/trace output, and the layer-by-layer architecture. Mermaid flow charts: the move loop, the operator-pause sequence, and the component/data-flow. Mirrors the style of docs/architecture-overview.md.

WZ added 12 commits June 2, 2026 12:24

chore(release): bump VERSION 0.4.4.4 → 0.4.4.5 (autonomous orchestrat…

46a1cf6

…or, gated off)

WZ added 3 commits June 2, 2026 21:26

WZ force-pushed the feat/orchestrator branch from c641238 to bb27c28 · June 3, 2026 16:39

WZ force-pushed the feat/orchestrator branch from bb27c28 to c48b7e0 · June 3, 2026 16:42

docs(orchestrator): add pseudocode of the move-loop while loop

e5fb892

WZ wants to merge 17 commits into main from feat/orchestrator
