v0.4.4.5 feat: autonomous investigation orchestrator (Approach D) — gated off #232

Open
WZ wants to merge 17 commits into main from feat/orchestrator

WZ (Owner) commented Jun 3, 2026

Summary

The autonomous investigation orchestrator (Approach D): an unbounded, read-only move-loop where an LLM picks the next move to find the real root cause — the version that investigates (forms hypotheses, gathers evidence, follows the dependency graph across services), vs deep mode which only re-judges an existing answer.

Ships OFF by default (config.agent.orchestratorEnabled, default false) — hidden from users, server-gated (the "Investigate autonomously" trigger is suppressed; orchestrator_investigate is rejected when off). This PR lands the validated core; it is not exposed.

What it does (increments 1 → 6)

  • Move-loop + safety harness + hybrid stop (src/agents/orchestrator.ts) — pure, fully-injected control flow. Moves: hypothesize / query / test / conclude / spawn-subagent / follow-cause.
    • Hybrid stop (the crux): the agent may propose conclude, but the loop only stops when the Step-2 corroboration keystone (evaluatePrediction) deterministically confirms the leader (satisfied). Self-confidence directs the search; it never ends it.
    • Safety harness: budget / depth / strikes → operator-pause / tool-cap / wall-clock / stall guard / 1000-move backstop — all checked before each move.
  • LLM brain + headless runner: createLlmDecideMove (structured output via generateText + JSON parse, no tools/responseFormat → sidesteps the gpt-oss quirk; robust to messy output) wired to createGatherEvidence (read-only) + the keystone.
  • WS trigger + shared agent-stream UX — generalized the shipped DeepModeStream into a reusable AgentStream; OrchestratorStream reuses it with a live "working… Ns" indicator, a terminal outcome banner (so runs never stop ambiguously), and a Causal chain card.
  • Subagents (depth-1): spawn-subagent runs a scoped read-only sub-investigation and folds its findings back into evidence (maxSubagents budget; quick template for cost).
  • Follow-cause via the dependency graph — resolves the incident service's neighbors (inferDependencyGraph) and lets the agent follow the cause into a known dependency (grounded — can't wander).
  • Hypothesize-from-findings + causal chain — after following a dependency, the agent turns the finding into a tested hypothesis; on finish, assembleCausalChain derives the ordered path (incident → followed deps → root cause) and renders it.
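
A minimal sketch of the hybrid stop described above, assuming hypothetical type and function names (`ConcludeMove`, `handleConclude`, and the `Evaluate` stand-in for evaluatePrediction are illustrative and not taken from src/agents/orchestrator.ts). It shows only the gating rule: the keystone verdict decides whether the loop stops, and the self-confidence value is carried along without being consulted.

```typescript
// Hypothetical shapes: the real orchestrator types are not shown in this PR text.
type Verdict = "satisfied" | "refuted" | "inconclusive";

interface Hypothesis {
  id: string;
  statement: string;
}

interface ConcludeMove {
  kind: "conclude";
  hypothesisId: string;
  selfConfidence: number; // 0..1, as reported by the LLM
}

// Stand-in for the deterministic keystone (evaluatePrediction in the PR).
type Evaluate = (leader: Hypothesis) => Verdict;

interface ConcludeResult {
  stop: boolean;
  verdict: Verdict;
  recordedConfidence: number;
}

function handleConclude(
  move: ConcludeMove,
  leader: Hypothesis,
  evaluate: Evaluate,
): ConcludeResult {
  const verdict = evaluate(leader);
  // Self-confidence is recorded for the trace but never used as the gate.
  return {
    stop: verdict === "satisfied",
    verdict,
    recordedConfidence: move.selfConfidence,
  };
}
```

With this shape, a conclude move at 0.99 self-confidence still leaves the loop running whenever the keystone returns anything other than `satisfied`.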

Validation (live, on real incidents)

Each major step was smoke-tested against a live stack (read-only MCP):

  • Base loop: ruled out OOMKill → pivoted → keystone-confirmed nginx-ingress degradation (outcome=confirmed).
  • Operator-pause: ruled out 3 local causes → stopped + banner (didn't guess).
  • Follow-cause: on a service with 2 dependency neighbors, ruled out local cause → followed both neighbors (2 scoped sub-investigations), folding findings back.

Tests

  • tsc --noEmit clean; full suite 2450 passing (175 files).
  • New unit tests: the harness (every guard outcome), hybrid stop (rejects high self-confidence without keystone backing), move parsing, subagent/follow-cause handling, causal-chain assembly, trace→stream mapping.

Deferred (follow-ups)

  • Increment 5 — the interactive operator-pause card (continue / escalate / instrument-&-wait). The core pause hook is designed; the WS round-trip + UI card aren't in this PR.
  • Increment 7 — broader accuracy validation across more incident types.
  • Cost: subagents use the quick template (~1 min each); an autonomous run can still take several minutes. The budget/wall-clock guards bound it.

🤖 Generated with Claude Code

WZ added 12 commits June 2, 2026 12:24
…ybrid stop

New agent (src/agents/orchestrator.ts) for Approach D: an unbounded read-only
move-loop that wraps (not replaces) the fixed investigation DAG. This increment
is the pure, fully-injected control flow — unit-testable without an LLM or MCP.

- Moves: hypothesize / query / test / conclude (spawn-subagent + follow-cause
  recognized but deferred to increments 3-4).
- DECISION 1 (hybrid stop): conclude only stops the loop when the leading
  hypothesis is deterministically confirmed by the keystone (verdict
  'satisfied'); the LLM's self-confidence is recorded but never the gate.
- DECISION 2 (safety harness): budget / depth / strikes→operator-pause /
  tool-cap / wall-clock, all hard limits, checked before each move. Plus a
  no-progress stall guard and a 1000-move backstop.
- Gated behind config.agent.orchestratorEnabled (default false).
- 14 unit tests cover the hybrid stop (rejects high self-confidence without
  keystone backing), every guard outcome, real-keystone integration, and
  graceful handling of bad moves. tsc clean.
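
For illustration, the guard check that runs before each move might look roughly like the sketch below. The field names, the check order, and the `StopReason` labels are assumptions made for this example; only the list of guards and the 1000-move backstop come from the commit message.

```typescript
// Hypothetical limits and counters; names are illustrative only.
interface Guards {
  maxTokens: number;
  maxDepth: number;
  maxStrikes: number;
  maxToolCalls: number;
  maxWallClockMs: number;
  maxStalledMoves: number;
}

interface LoopState {
  tokensUsed: number;
  depth: number;
  strikes: number;
  toolCalls: number;
  startedAtMs: number;
  movesSinceProgress: number;
  moves: number;
}

type StopReason =
  | "budget"
  | "depth"
  | "operator-pause"
  | "tool-cap"
  | "wall-clock"
  | "stalled"
  | "backstop";

const MOVE_BACKSTOP = 1000; // the 1000-move backstop named in the commit

// Returns the first tripped guard, or null when the next move may proceed.
function checkGuards(s: LoopState, g: Guards, nowMs: number): StopReason | null {
  if (s.tokensUsed >= g.maxTokens) return "budget";
  if (s.depth > g.maxDepth) return "depth";
  if (s.strikes >= g.maxStrikes) return "operator-pause";
  if (s.toolCalls >= g.maxToolCalls) return "tool-cap";
  if (nowMs - s.startedAtMs >= g.maxWallClockMs) return "wall-clock";
  if (s.movesSinceProgress >= g.maxStalledMoves) return "stalled";
  if (s.moves >= MOVE_BACKSTOP) return "backstop";
  return null;
}
```

Passing `nowMs` in, rather than reading the clock inside the function, keeps the check pure, which matches the commit's goal of unit-testing the control flow without an LLM or MCP.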
… runner

Make the orchestrator core actually run.

- createLlmDecideMove: the agent's brain. An LLM picks the next move from the
  read-only state. Follows the project's structured-output convention
  (generateText + JSON parse, NO tools / NO responseFormat — sidesteps the
  gpt-oss <|constrain|>json quirk). parseMove is robust to fenced/prose-wrapped
  JSON and schema drift (→ graceful null, never a throw); LlmUnavailableError
  propagates so the runner fails cleanly.
- runAutonomousOrchestrator: headless entry wiring the pure core's three
  injected deps to real impls — decideMove (LLM), gatherEvidence
  (createGatherEvidence, read-only by construction), evaluate
  (evaluatePrediction keystone). Decide + query token usage feeds the budget
  guard via estimateTokens.
- export HypothesisPredictionSchema (reused to validate LLM-emitted predictions).
- 12 unit tests (parseMove per move type + messy output, prompt rendering,
  decide-fn with injected callModel incl. error degradation). callModel is a
  test seam so move selection is verified without a live model. tsc clean,
  full suite 2434 green.
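
As a rough illustration of the tolerant parsing described for parseMove, the sketch below extracts the outermost JSON object from fenced or prose-wrapped model output and returns null on anything it cannot validate. The `kind` field name, the `Move` shape, and the function name `parseMoveSketch` are assumptions; the real parser also validates against the project's schemas, which are not shown here.

```typescript
type MoveKind =
  | "hypothesize"
  | "query"
  | "test"
  | "conclude"
  | "spawn-subagent"
  | "follow-cause";

const MOVE_KINDS: readonly MoveKind[] = [
  "hypothesize",
  "query",
  "test",
  "conclude",
  "spawn-subagent",
  "follow-cause",
];

// Hypothetical move shape: a discriminating "kind" plus arbitrary payload fields.
interface Move {
  kind: MoveKind;
  [key: string]: unknown;
}

function parseMoveSketch(raw: string): Move | null {
  // Take the span from the first "{" to the last "}" so that code fences
  // and surrounding prose are ignored.
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) return null;

  let parsed: unknown;
  try {
    parsed = JSON.parse(raw.slice(start, end + 1));
  } catch {
    return null; // malformed JSON degrades to null instead of throwing
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return null;
  }

  const obj = parsed as Record<string, unknown>;
  const kind = obj.kind;
  if (typeof kind !== "string" || MOVE_KINDS.indexOf(kind as MoveKind) === -1) {
    return null; // schema drift: unknown or missing move kind
  }
  return { ...obj, kind: kind as MoveKind };
}
```

Returning null rather than throwing lets the loop treat a bad move as a recoverable event, while a genuine model outage can still propagate as a separate error type.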
…l, gate

Make the orchestrator triggerable and streamable over the WebSocket.

- WS protocol: orchestrator_investigate (client) + orchestrator:started/step/
  complete/error (server). Reuses the existing AgentStreamEvent (one stream
  shape for deep mode + orchestrator); adds OrchestratorStreamStats footer.
- orchestrate adapter on createMastraAdapters: reuses investigation providers +
  model, runs runAutonomousOrchestrator, maps the core's TraceEntry stream to
  AgentStreamEvent. DEFAULT_ORCHESTRATOR_GUARDS (conservative; config knobs later).
- ws-handler: handleOrchestratorInvestigate (seeds focus + time window from a
  completed investigation) + runOrchestratorStreamed. Gated on
  config.agent.orchestratorEnabled — rejects when off (defense in depth).
- traceEntryToStreamEvent: pure, plain-English move→stream mapping, 7 tests.
- server injects window.__ORCHESTRATOR_ENABLED__ only when enabled; Window
  global declared. Trigger stays hidden client-side by default.
tsc clean, 33 orchestrator tests green.

Generalize the shipped DeepModeStream into a reusable AgentStream and wire the
orchestrator trigger + live stream.

- AgentStream: the shared structured stream (colored verbs, status icons,
  indented sub-steps, generic footer items). DeepModeStream is now a thin
  wrapper over it (label + footer only); no behavior change for deep mode.
- OrchestratorStream: same rendering, orchestrator footer (moves / queries /
  depth / strikes / tokens / elapsed; strikes turn amber once >0).
- InvestigationPane: 'Investigate autonomously' trigger (Compass icon) gated on
  window.__ORCHESTRATOR_ENABLED__ + isComplete; handles orchestrator:started/
  step/complete/error; renders OrchestratorStream. Unlike deep mode it needs no
  prior loopOutcome (it investigates from scratch).
- App: onOrchestrate → orchestrator_investigate WS message.
tsc clean, web bundle builds, full suite 2442 green.

Add the subagent capability to the core loop + surface run outcomes.

- spawn-subagent move now folds a depth-1 sub-investigation's findings back
  into evidence (injected spawnSubagent dep), counts subagents, and enforces a
  new maxSubagents guard. Absent dep / over-limit → graceful skip + trace.
  follow-cause stays deferred (increment 4).
- OrchestratorState.subagents + stats.subagents + OrchestratorStreamStats +
  ws-handler complete mapping + OrchestratorStream footer.
- Outcome banner (the freebie): AgentStream gains a terminal banner slot;
  OrchestratorStream maps outcome → plain-English callout (confirmed / paused /
  hit-a-limit / inconclusive) so a run never just stops ambiguously. Full
  operator-pause continue/escalate/wait card stays in increment 5.
- 4 new core tests (folds findings, graceful skip, maxSubagents limit,
  follow-cause still deferred). tsc clean, web builds, full suite 2445 green.

Live dispatch (spawnSubagent → investigationAgent.investigate) is increment 3b.

Wire spawn-subagent to a real scoped sub-investigation.

- orchestrate adapter: spawnSubagent resolves a ServiceConfig (from config or a
  minimal one) and runs investigationAgent.investigate(..., template=standard,
  readOnlyTools=true) on the target service, folding its conclusion (rootCause +
  summary) back as one infra observation the orchestrator can test against.
  Read-only; failures degrade to no findings (never aborts the parent).
- runAutonomousOrchestrator passes spawnSubagent through to the core dep.
- System prompt now teaches the spawn-subagent move + when to use it (after
  local hypotheses keep failing, investigate a related service instead of
  guessing) — without this the LLM never picked the move.
- Subagent token usage is bounded by maxSubagents + wall-clock in v1 (not the
  token budget); noted for a later refinement. tsc clean, full suite green.
…tStream

A run looked hung between completed steps: while decideMove was thinking, a
long query was in flight, or a subagent was running, the stream showed no
activity. Add a liveness signal to the shared AgentStream (both deep mode +
orchestrator):
- header shows ticking '· live · Ns' while running
- a pulsing '◉ working… Ns' row below the steps while running
A progress bar doesn't fit an unbounded agent (no known total), so this is a
moving liveness indicator instead. Frontend-only; tsc clean.

The agent can now follow the incident into a connected service instead of
exhausting local hypotheses and pausing.

- ws-handler resolves the incident service's dependency-graph neighbors (both
  directions, via inferDependencyGraph over the stack's services — mirrors
  GET /api/dependencies/:service) and threads them in. Empty graph → empty
  list → follow-cause disables gracefully.
- core: OrchestratorState.dependencies + a real follow-cause move — a scoped
  read-only sub-investigation on a neighbor, grounded so the agent can ONLY
  follow into a known dependency (not wander). Reuses the subagent budget.
- prompt: teaches follow-cause + nudges the agent to pivot to a dependency
  after just 1-2 local rule-outs (addresses the 'burned all strikes locally
  then paused' behavior).
- stream: follow-cause/spawn-subagent now map to done rows ('followed the trail
  to …'). 4 new core tests + a prompt-rendering test. tsc clean, suite 2448.
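
The grounding rule for follow-cause described in this commit can be sketched as a small pure check. The function name, the result shape, and the reason strings are hypothetical; the commit only states that the agent can follow into a known dependency, that an empty graph disables the move, and that the subagent budget is reused.

```typescript
type FollowResult =
  | { ok: true; target: string }
  | { ok: false; reason: string };

// Decide whether a requested follow-cause target is allowed.
function resolveFollowCause(
  requested: string,
  dependencies: readonly string[],
  subagentsUsed: number,
  maxSubagents: number,
): FollowResult {
  if (dependencies.length === 0) {
    // Empty graph: the move is disabled rather than treated as an error.
    return { ok: false, reason: "no dependency graph available" };
  }
  if (dependencies.indexOf(requested) === -1) {
    // Grounding: only neighbors from the dependency graph may be followed.
    return { ok: false, reason: `"${requested}" is not a known dependency` };
  }
  if (subagentsUsed >= maxSubagents) {
    // follow-cause shares the subagent budget.
    return { ok: false, reason: "subagent budget exhausted" };
  }
  return { ok: true, target: requested };
}
```

A rejected request would be skipped and traced, so an ungrounded target costs a move but never launches a sub-investigation.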
Each follow-cause/spawn ran a full standard sub-investigation (~2-3 min); an
autonomous run can spawn several (the impala validation took 8.6 min for two).
Drop subagents to the quick (metrics-only) template — ~1 min each. Trades some
depth (no logs) for cost; revisit if subagents miss log-based causes.
…chain

Turn cross-service runs from 'exhausted' into 'confirmed', and surface the chain.

- Prompt (part A): after a follow-cause/subagent returns findings, the agent
  must hypothesize the specific cause they point to (with a checkable
  prediction) and test it — never stop right after following. Directly fixes
  the impala caveat (followed both deps but never confirmed).
- Causal chain (part B): assembleCausalChain derives the ordered path —
  incident service → each followed dependency → confirmed root cause — from the
  finished run's trace. Emitted on orchestrator:complete; OrchestratorStream
  renders a 'Causal chain' summary card (root cause in green) below the stream.
  This finally closes the long-standing CAUSAL CHAIN gap.
- 2 new chain-assembly tests. tsc clean, full suite green, web builds.
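
A minimal sketch of deriving the ordered path from a finished trace follows. The `TraceEntry` union and the function name `assembleChainSketch` are assumptions for this example, not the shapes used by assembleCausalChain. The sketch also skips a service that was already followed, which a later commit in this PR describes as deduping the chain.

```typescript
// Hypothetical, simplified trace entries.
type TraceEntry =
  | { kind: "follow-cause"; service: string }
  | { kind: "confirmed"; rootCause: string }
  | { kind: "other" };

// Builds: incident service -> each followed dependency -> confirmed root cause.
function assembleChainSketch(
  incidentService: string,
  trace: readonly TraceEntry[],
): string[] {
  const chain: string[] = [incidentService];
  const seen = new Set<string>([incidentService]);
  let rootCause: string | null = null;

  for (const entry of trace) {
    if (entry.kind === "follow-cause") {
      // A service followed more than once contributes a single link.
      if (!seen.has(entry.service)) {
        seen.add(entry.service);
        chain.push(entry.service);
      }
    } else if (entry.kind === "confirmed") {
      rootCause = entry.rootCause;
    }
  }

  // The root cause is appended only when the run actually confirmed one.
  if (rootCause !== null) chain.push(rootCause);
  return chain;
}
```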
…ent 6 polish

Increment 5 — the strike limit now hands the call to a human instead of just
stopping. When `maxStrikes` is hit the loop emits `orchestrator:operator_pause`
and blocks on the operator's decision:
  - continue          → reset strikes and resume (other guards still bound it)
  - escalate / wait   → stop with that disposition

Pieces:
  - core: `onOperatorPause` hook (was stranded) + a hard cap of
    MAX_OPERATOR_CONTINUES so a hung/looping operator can't spin the loop
    forever. Unit tests: continue-then-stop, escalate immediate-stop, no-hook
    (unchanged behavior), and the continue cap.
  - wiring: threaded through runAutonomousOrchestrator + the orchestrate adapter.
  - WS: `orchestrator:operator_pause` (server) / `orchestrator_decision`
    (client); a per-connection pending-pause registry resolves the decision,
    with a 5-minute timeout → escalate and cleanup on WS close so a disconnect
    never strands a blocked loop.
  - UI: OperatorPauseCard (continue / escalate to on-call / instrument & wait);
    InvestigationPane pause handlers + disposition banner; App wires the send.
    escalate/wait have no backend yet (recorded only) — noted as follow-up.
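
The per-connection pending-pause registry could be sketched as below. This is a callback-based simplification under stated assumptions: the class name, method names, and keying by run id are hypothetical, and settling as `escalate` on connection close is a guess, since the commit only says cleanup happens on WS close. The 5-minute timeout that falls back to escalate is from the commit.

```typescript
type Decision = "continue" | "escalate" | "wait";

const PAUSE_TIMEOUT_MS = 5 * 60 * 1000; // 5-minute timeout from the commit

interface PendingPause {
  settle: (decision: Decision) => void;
  timer: ReturnType<typeof setTimeout>;
}

// One registry instance per WebSocket connection (hypothetical design).
class PendingPauseRegistry {
  private pending = new Map<string, PendingPause>();

  // Park a paused run until the operator decides or the timeout fires.
  register(
    runId: string,
    onDecision: (decision: Decision) => void,
    timeoutMs: number = PAUSE_TIMEOUT_MS,
  ): void {
    const timer = setTimeout(() => this.resolve(runId, "escalate"), timeoutMs);
    this.pending.set(runId, { settle: onDecision, timer });
  }

  // Deliver a decision; returns false when nothing is pending for runId.
  resolve(runId: string, decision: Decision): boolean {
    const entry = this.pending.get(runId);
    if (!entry) return false;
    clearTimeout(entry.timer);
    this.pending.delete(runId);
    entry.settle(decision);
    return true;
  }

  // Called on connection close so no blocked loop is left stranded.
  // Assumption: a disconnect settles every pending pause as "escalate".
  closeAll(): void {
    for (const runId of Array.from(this.pending.keys())) {
      this.resolve(runId, "escalate");
    }
  }

  get size(): number {
    return this.pending.size;
  }
}
```

Deleting the entry before invoking the callback means a late or duplicate `orchestrator_decision` message for the same run is ignored instead of resuming the loop twice.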

Increment 6 polish (causal chain shipped earlier; these were the remaining
spec §8 items):
  - source attribution: each causal-chain link now carries the finding /
    prediction it rests on (CausalChainLink type), preferring the subagent's
    folded conclusion for followed links.
  - one-line trace summary ("N moves · M queries · K subagents · <outcome> at
    depth D") on the run footer.

Stays gated off (config.agent.orchestratorEnabled defaults false).

Validated live on localhost over the WS against `impala` (2 deps): the loop hit
the strike limit, emitted the pause, resumed on a `continue` decision, and
confirmed a root cause — full causal chain (incident → 2 deps → root cause)
with attribution + trace summary. tsc clean, full suite 2456 passing.

WZ (Owner, Author) commented Jun 3, 2026

Increments 5 + 6 polish landed in 32976d9.

Increment 5 — interactive operator-pause. The strike limit now emits orchestrator:operator_pause and blocks on the operator's decision: continue (reset strikes, resume — other guards still bound it), escalate to on-call, or instrument & wait. Core gets a MAX_OPERATOR_CONTINUES cap so a hung/looping operator can't spin forever; WS layer has a per-connection pending-pause registry with a 5-min timeout→escalate and cleanup on disconnect; UI gets the pause card + disposition banner. escalate/wait record intent only (no paging/scheduler backend yet — follow-up).

Increment 6 polish. Causal-chain links now carry source attribution (the finding/prediction each rests on), plus a one-line trace summary on the footer.

Still gated off (config.agent.orchestratorEnabled defaults false).

Validated live over the WS: a run hit the strike limit, paused, resumed on a continue decision, and confirmed a root cause with a full causal chain (incident → 2 deps → root cause) + attribution + trace summary. tsc clean, full suite 2456 passing. The pause card is code-complete + type-checked + builds but hasn't had a visual browser pass yet.

WZ added 3 commits June 2, 2026 21:26
Both bite only when config.agent.orchestratorEnabled is flipped on, but
they let a direct WS client bypass the UI's gating:

- Rate limiting: classifyWsMessage only bucketed `chat` and
  `deep_investigate` as investigation traffic, so `orchestrator_investigate`
  (and `deep_mode_investigate`, same defect) fell through to the looser
  `general` 20/min cap. Both are heavy autonomous LLM runs — route them
  through the stricter investigation bucket.

- Precondition: the orchestrator handler ran for any existing investigation
  id, including `running`/`failed`/report-less rows. The UI only shows the
  trigger after completion; a direct WS message bypassed that. Reject
  non-complete or report-less rows, mirroring the deep-mode handler.

Adds classifyWsMessage cases for both new types and four orchestrator
guard tests (gate off, not-found, still-running, report-less).
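
A sketch of the bucketing fix, assuming a simple two-bucket classifier (the function name `classifyWsMessageSketch` and the `RateBucket` type are illustrative; the message type names are taken from the commit message):

```typescript
type RateBucket = "investigation" | "general";

// Message types that start heavy LLM runs and belong in the stricter bucket.
const INVESTIGATION_TYPES: readonly string[] = [
  "chat",
  "deep_investigate",
  "deep_mode_investigate", // previously fell through to "general"
  "orchestrator_investigate", // previously fell through to "general"
];

function classifyWsMessageSketch(messageType: string): RateBucket {
  return INVESTIGATION_TYPES.indexOf(messageType) !== -1
    ? "investigation"
    : "general";
}
```

Keeping the heavy types in an explicit allow-list means any future message type defaults to `general`, so each new autonomous entry point has to be added here deliberately.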
…dog/abort, false-confirm guard

The first broader-validation pass (6 real incidents) surfaced three blockers
that all argued against un-gating. Fixed on this branch (still gated off):

1. Quick-template synthesis input bug. The synthesis step's input schema
   requires the parallel-keyed shape `{ "metrics-evidence": … }`, but the quick
   template chained `.then(metricsStep).then(synthesisStep)` — handing synthesis
   the metrics step's RAW output (no `metrics-evidence` key) → "Step input
   validation failed" on every quick run, which degraded to an empty report.
   Since orchestrator subagents use the quick template, this silently gutted
   their findings ("Investigation complete" with no content). Fix: quick now
   uses `.parallel([metricsStep])` like standard/full. Regression test runs the
   real quick workflow to success (red→green). Live: subagent findings are now
   substantive ("lacked resilience to single-pod operation…").

2. No silent hangs / pile-on. Two runs streamed zero steps in 8 min under
   resource contention. Added: a per-operation watchdog (`opTimeoutMs`, default
   150s) that abandons a hung gather/subagent so one stuck MCP/LLM call can't
   strand the loop between guard checks; a cooperative abort `signal` checked
   each move, wired so WS disconnect aborts the run (no headless run-on); and a
   per-connection concurrency guard (one run per investigation — a double-click
   no longer spawns a second parallel run). New `aborted` outcome.

3. Cross-service false-confirm. agw-admin-ui (a 0-replica deployment) was
   "confirmed" as caused by a degraded payment-service with NO follow-cause into
   it — a keystone false-confirm (the prediction was observably true but not
   causally linked). Guard: a confirmed cause that names a dependency never
   followed-cause'd into is rejected (nudges to investigate it first); mentions
   of the incident service's own behaviour are fine. Prompt hardened: observing
   a dependency is unhealthy is correlational; cross-service causes need a
   follow-cause. Live: agw now confirms the correct self-service cause
   ("deployment is missing from the cluster entirely [confirmed by not_found]").

Also: dedupe the causal chain (a service followed more than once is one link).
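
The cross-service false-confirm guard from item 3 might be approximated as follows. Matching dependency names by case-insensitive substring is an assumption made for this sketch, as are the function and type names; the commit does not say how the real guard detects that a cause names a dependency.

```typescript
interface ConfirmCheck {
  accepted: boolean;
  unfollowed: string[]; // dependencies named in the cause but never followed
}

// Reject a confirmed cause that blames a dependency the run never
// investigated via follow-cause.
function checkCrossServiceConfirm(
  rootCause: string,
  dependencies: readonly string[],
  followed: ReadonlySet<string>,
): ConfirmCheck {
  const text = rootCause.toLowerCase();
  const unfollowed = dependencies.filter(
    (dep) => text.indexOf(dep.toLowerCase()) !== -1 && !followed.has(dep),
  );
  return { accepted: unfollowed.length === 0, unfollowed };
}
```

Because the incident service is not in its own dependency list, a cause that only describes the incident service's behaviour passes this check unchanged.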

tsc clean, full suite 2466. Live re-validation confirmed all three on real
incidents + the concurrency guard / abort-on-disconnect over the WS.
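
The per-operation watchdog from item 2 can be illustrated with a timeout race. The `withWatchdog` name and `WatchdogResult` type are hypothetical; `opTimeoutMs` is the knob named in the commit (default 150s there). Note that this sketch only stops waiting for the operation; it does not cancel it, which is why the commit pairs the watchdog with a cooperative abort signal.

```typescript
type WatchdogResult<T> =
  | { status: "done"; value: T }
  | { status: "timed-out" };

// Wait for op, but give up after opTimeoutMs so one stuck call
// cannot strand the loop between guard checks.
function withWatchdog<T>(
  op: () => Promise<T>,
  opTimeoutMs: number,
): Promise<WatchdogResult<T>> {
  return new Promise<WatchdogResult<T>>((resolve, reject) => {
    const timer = setTimeout(
      () => resolve({ status: "timed-out" }),
      opTimeoutMs,
    );
    op().then(
      (value) => {
        clearTimeout(timer);
        resolve({ status: "done", value });
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}
```

A caller could treat `timed-out` like a failed gather or subagent: record it in the trace, fold in no findings, and let the next guard check decide whether the run continues.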
QA of the inc-7 fixes surfaced a layout wart: once every chain link carries an
attribution subline (the inc-6 + inc-7 source-attribution work), the inline
horizontal arrows go ragged — the separator floats far-right of a multi-line
block and links wrap inconsistently. Render the chain as a vertical stack
instead: one link per row, a subtle ↓ connector between rows, evidence indented
beneath each label, root cause in green. Reads cleanly as cause→effect and
handles multi-line attribution. Presentation-only (OrchestratorStream.tsx).

WZ force-pushed the feat/orchestrator branch from c641238 to bb27c28, June 3, 2026 16:39

Add docs/orchestrator-agentic-loop.md: the move-loop, the two decisions
(hybrid keystone stop + safety harness), the interactive operator-pause,
cross-service follow-cause + the false-confirm guard, the causal-chain/trace
output, and the layer-by-layer architecture. Mermaid flow charts: the move
loop, the operator-pause sequence, and the component/data-flow. Mirrors the
style of docs/architecture-overview.md.

WZ force-pushed the feat/orchestrator branch from bb27c28 to c48b7e0, June 3, 2026 16:42