Skip to content

feat: improve investigation agent with agentic RCA patterns#5

Merged
WZ merged 3 commits into
mainfrom
feature/rca-agentic-patterns
Mar 8, 2026
Merged

feat: improve investigation agent with agentic RCA patterns#5
WZ merged 3 commits into
mainfrom
feature/rca-agentic-patterns

Conversation

@WZ
Copy link
Copy Markdown
Owner

@WZ WZ commented Mar 7, 2026

Summary

  • Investigation agent improvements: LLM-based time range extraction (handles natural language dates), enriched evidence sections in RCA reports, parallel evidence phase tuning (6 iterations: 4 productive + 2 wind-down), raw metric data and condensed logs passed to synthesis, real Grafana dashboard URL pre-population, deterministic severity/evidence guardrails
  • Code cleanup: Removed ~100 lines of over-engineered code (STOP_WORDS list, complex regex date parser, suggestStepSeconds heuristics, toRfc3339Window complexity). Simplified extractTimeRange to ISO-only fallback, extractQueryKeywords to length filter, suggestStepSeconds to formula-based calculation
  • README documentation: Added detailed ASCII pipeline diagram showing all 7 investigation phases, LLM call summary table, and explanation of how pre-fetch relates to the discovery agent

Test plan

  • npx vitest run — all tests pass
  • npx tsc --noEmit — no type errors
  • Manual end-to-end investigation with natural language date ("March 3, 2026")
  • Verify enriched evidence section in RCA report output
  • Verify real dashboard URLs appear in report

🤖 Generated with Claude Code

…iciency

- Add LLM-based time range extraction (handles any date format including
  unicode dashes, natural language like "March 3" or "this Thursday")
- Add timeRangeFrom/timeRangeTo to AnomalyAssessment schema for LLM extraction
- Pre-populate real dashboard URLs from GRAFANA_URL instead of LLM hallucination
- Pass raw metric tool data to synthesis for data-grounded analysis
- Strengthen synthesis evidence requirements (3-5 items per category with examples)
- Add verified-counts-only instruction to log correlation prompt
- Increase log sampleLines from 3→5 and add lastSeen to condensed logs
- Reduce evidence iterations from 7→6 (4 productive + 2 wind-down)
- Refactor: simplify extractTimeRange to ISO-only fallback (LLM handles rest)
- Refactor: replace 150-word STOP_WORDS + extractQueryKeywords with simple tokenizer
- Refactor: simplify suggestStepSeconds to formula-based (durationSec/100)
- Refactor: simplify toRfc3339Window, remove unused Grafana relative parsing
- Add toolData to PhaseResult for raw evidence pass-through

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 7, 2026 23:37
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR upgrades the investigation workflow to a more structured, multi-phase “agentic RCA” pipeline with improved evidence grounding (raw tool data + stronger prompts), deterministic dashboard linking/panel capture, and a CLI UX refresh (intent routing, better input handling, and abort shortcuts).

Changes:

  • Refactors InvestigationAgent into a 6-phase pipeline (planning → parallel evidence → timeline → synthesis → reflection) with structured findings schemas and stronger prompting.
  • Improves tool/evidence handling (tool response compaction, hallucinated tool-call filtering, deterministic dashboard URLs, time-window extraction).
  • Updates CLI experience (custom text input, intent routing, keyboard shortcuts, logging initialization order) and documentation.

Reviewed changes

Copilot reviewed 20 out of 21 changed files in this pull request and generated 11 comments.

Show a summary per file
File Description
src/llm/openai.ts Adjusts Responses API handling (truncation behavior, hallucinated function calls) and default text fallback behavior.
src/llm/openai.test.ts Updates tests to reflect tool-call behavior when tools are provided.
src/interfaces/cli/cli-utils.test.ts Updates formatting assertions to match new RCA markdown/emoji headings.
src/interfaces/cli/CliTextInput.tsx Adds custom Ink text input with cursor control and readline-style shortcuts.
src/interfaces/cli/App.tsx Switches to CliTextInput, replaces intent classifier with router, adds UI routing labels and abort/shortcut handling.
src/index.ts Replaces IntentClassifier wiring with IntentRouter.
src/cli.tsx Reworks dotenv/log-level initialization and uses dynamic imports to avoid ESM hoisting issues.
src/agent/types.ts Extends AnomalyAssessment with timeRangeFrom/timeRangeTo.
src/agent/rca-types.ts Introduces structured observation types and expands RCA-related phase/report types.
src/agent/rca-prompts.ts Replaces prompt constants with prompt builders; adds planning/reflection prompts and new JSON schemas.
src/agent/rca-prompts.test.ts Updates prompt/schema tests for new builders and structured schemas.
src/agent/prompts.ts Requires time window extraction and adds timeRangeFrom/timeRangeTo to anomaly schema.
src/agent/investigation.ts Major refactor: planning + parallel evidence + timeline + synthesis + reflection, tool prefetching, severity validation, truncation/compaction.
src/agent/investigation.test.ts Updates tests for new phases (planning/reflection), structured findings, and helper functions.
src/agent/intent.ts Replaces IntentClassifier with IntentRouter; adds direct service matching from user text.
src/agent/intent.test.ts Updates tests for IntentRouter and new service matching behavior.
src/agent/core.ts Enhances create_temp_panel tool to include datasource/time range and auto-detect Prometheus datasource.
package.json Adds an npm bin entry for rca-agent.
package-lock.json Mirrors the new npm bin entry.
README.md Expands docs: architecture, pipeline explanation, examples, and CLI shortcuts.
CLAUDE.md Adds contributor-facing safety/process notes (secrets scanning, staging discipline).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/interfaces/cli/App.tsx
Comment thread src/agent/core.ts
Comment thread package.json
Comment thread src/agent/investigation.ts
Comment thread src/agent/investigation.ts
Comment thread src/llm/openai.ts
Comment thread src/agent/investigation.ts
Comment thread src/agent/investigation.ts
Comment thread src/agent/investigation.ts
Comment thread src/agent/investigation.ts
Wilson Li and others added 2 commits March 7, 2026 16:05
…y to README

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@WZ WZ merged commit 4ba125f into main Mar 8, 2026
1 check passed
WZ added a commit that referenced this pull request Apr 15, 2026
Two related bugs surfaced in prod on the next discover run after shipping
the JSONC parser fix:

1. Tool-arg coercions never ran for discovery. The guard at the top of
   the wrap call was:

       const tools = wrappedOnToolCall
         ? wrapToolsWithCallbacks(discoveryTools, wrappedOnToolCall, "discovery")
         : discoveryTools;

   When `config.onToolCall` is undefined (auto-discovery on cold start,
   CLI runs without callbacks), tools passed through raw, so both
   `coercePrometheusArgs` and `coerceLokiArgs` were skipped.
   Consequence: the LLM's standard `startTime: null` on instant queries
   crashed the Grafana MCP tool with a schema validation error, the same
   class of bug fix #5 was supposed to prevent for everyone.

   Fix: always wrap discovery tools. `wrapToolsWithCallbacks` already
   handles undefined `onToolCall` via `?.`, so coercions now run
   unconditionally and the callback path is still optional.

2. Datasource UIDs were hallucinated. The discover agent called
   `grafana_list_loki_label_names` with `datasourceUid: "loki"`
   (made-up short name) instead of the real `ef34v1a9h2fi8c` it had
   already learned from its own `grafana_list_datasources` call earlier
   in the same run. Result: label lookup failed, every service ended
   up with `logLabels: {}`, and fix #4 (Loki-compatible log labels from
   PR #85) was effectively neutralized.

   Fix: mirror the evidence-agent pattern and pre-fetch datasource UIDs
   at the start of `runDiscoverStep`, then inject them into the prompt
   as an `<untrusted_datasource_hints>` block. Uses the same UID-first
   wording evidence uses. Graceful fallback to empty hints if
   list_datasources isn't available or the call fails — the agent can
   still run, it'll just eat the round-trip discovering them itself
   (same as today).

Both fixes live in `src/workflows/steps/discover.ts` — scope-contained
to the discovery path. No tests changed because the coercion logic
is already covered by tool-utils.test.ts; the new hint fetcher is a
best-effort wrapper around an existing MCP tool call with no branching
to meaningfully unit-test.
WZ added a commit that referenced this pull request Apr 16, 2026
Two related bugs surfaced in prod on the next discover run after shipping
the JSONC parser fix:

1. Tool-arg coercions never ran for discovery. The guard at the top of
   the wrap call was:

       const tools = wrappedOnToolCall
         ? wrapToolsWithCallbacks(discoveryTools, wrappedOnToolCall, "discovery")
         : discoveryTools;

   When `config.onToolCall` is undefined (auto-discovery on cold start,
   CLI runs without callbacks), tools passed through raw, so both
   `coercePrometheusArgs` and `coerceLokiArgs` were skipped.
   Consequence: the LLM's standard `startTime: null` on instant queries
   crashed the Grafana MCP tool with a schema validation error, the same
   class of bug fix #5 was supposed to prevent for everyone.

   Fix: always wrap discovery tools. `wrapToolsWithCallbacks` already
   handles undefined `onToolCall` via `?.`, so coercions now run
   unconditionally and the callback path is still optional.

2. Datasource UIDs were hallucinated. The discover agent called
   `grafana_list_loki_label_names` with `datasourceUid: "loki"`
   (made-up short name) instead of the real `ef34v1a9h2fi8c` it had
   already learned from its own `grafana_list_datasources` call earlier
   in the same run. Result: label lookup failed, every service ended
   up with `logLabels: {}`, and fix #4 (Loki-compatible log labels from
   PR #85) was effectively neutralized.

   Fix: mirror the evidence-agent pattern and pre-fetch datasource UIDs
   at the start of `runDiscoverStep`, then inject them into the prompt
   as an `<untrusted_datasource_hints>` block. Uses the same UID-first
   wording evidence uses. Graceful fallback to empty hints if
   list_datasources isn't available or the call fails — the agent can
   still run, it'll just eat the round-trip discovering them itself
   (same as today).

Both fixes live in `src/workflows/steps/discover.ts` — scope-contained
to the discovery path. No tests changed because the coercion logic
is already covered by tool-utils.test.ts; the new hint fetcher is a
best-effort wrapper around an existing MCP tool call with no branching
to meaningfully unit-test.
WZ added a commit that referenced this pull request Apr 16, 2026
* fix(discover): parse JSONC and extract top-level arrays

Observed in prod: the discover agent emitted a valid-looking JSON array
with JavaScript-style block comments as section dividers:

  [
    {...},
    /* StatefulSets */
    {...},
    /* DaemonSets */
    {...}
  ]

`JSON.parse` rejected it. `safeJsonParse` fell through to its markdown-
fence-extraction fallback, which still called `JSON.parse` on the body
and hit the same rejection. The remaining fallbacks only looked for
top-level `{...}` objects, so they found individual service records inside
the array and the discover step saw `parsed = <single object>` instead
of `parsed = <array>`, failed its `Array.isArray` check, and logged
"discovery returned empty result" three times in a row.

Two fixes:

1. `stripJsoncComments()` — a string-aware pass that removes `// ...`
   line comments and `/* ... */` block comments and trailing commas
   before `]` / `}`. Respects JSON string boundaries so content like
   `"see /* important */ docs"` survives unchanged. Tried after direct
   parse and after markdown-fence extraction.

2. `extractLastJsonArray()` / `extractFirstJsonArray()` — balanced-
   bracket scan for top-level `[...]` as a last-resort fallback
   (object extraction still runs first so evidence/metrics agents
   keep returning their enclosing object, not the inner observations
   array).

Also tightened the discover agent prompt to explicitly forbid JS-style
comments and trailing commas in the output JSON.

Test coverage: 6 new regression cases in processors.test.ts including
the exact JSONC-with-block-comments failure mode from prod, string-
preservation of comment-like sequences inside values, and trailing-
comma tolerance.

* fix(discover): always apply arg coercions + pre-inject datasource hints

Two related bugs surfaced in prod on the next discover run after shipping
the JSONC parser fix:

1. Tool-arg coercions never ran for discovery. The guard at the top of
   the wrap call was:

       const tools = wrappedOnToolCall
         ? wrapToolsWithCallbacks(discoveryTools, wrappedOnToolCall, "discovery")
         : discoveryTools;

   When `config.onToolCall` is undefined (auto-discovery on cold start,
   CLI runs without callbacks), tools passed through raw, so both
   `coercePrometheusArgs` and `coerceLokiArgs` were skipped.
   Consequence: the LLM's standard `startTime: null` on instant queries
   crashed the Grafana MCP tool with a schema validation error, the same
   class of bug fix #5 was supposed to prevent for everyone.

   Fix: always wrap discovery tools. `wrapToolsWithCallbacks` already
   handles undefined `onToolCall` via `?.`, so coercions now run
   unconditionally and the callback path is still optional.

2. Datasource UIDs were hallucinated. The discover agent called
   `grafana_list_loki_label_names` with `datasourceUid: "loki"`
   (made-up short name) instead of the real `ef34v1a9h2fi8c` it had
   already learned from its own `grafana_list_datasources` call earlier
   in the same run. Result: label lookup failed, every service ended
   up with `logLabels: {}`, and fix #4 (Loki-compatible log labels from
   PR #85) was effectively neutralized.

   Fix: mirror the evidence-agent pattern and pre-fetch datasource UIDs
   at the start of `runDiscoverStep`, then inject them into the prompt
   as an `<untrusted_datasource_hints>` block. Uses the same UID-first
   wording evidence uses. Graceful fallback to empty hints if
   list_datasources isn't available or the call fails — the agent can
   still run, it'll just eat the round-trip discovering them itself
   (same as today).

Both fixes live in `src/workflows/steps/discover.ts` — scope-contained
to the discovery path. No tests changed because the coercion logic
is already covered by tool-utils.test.ts; the new hint fetcher is a
best-effort wrapper around an existing MCP tool call with no branching
to meaningfully unit-test.

* fix(tool-utils): strip Mastra tool marker so our coercions actually run

Root cause of every "our coercion didn't run" symptom we've been chasing:

Mastra's CoreToolBuilder.createExecute() validates tool input against the
tool's inputSchema BEFORE calling our wrapped execute, but only for tools
that Mastra considers its own (detected via the
`Symbol.for("mastra.core.tool.Tool")` marker, which is copied by
`{ ...tool }` spread). When the LLM sends `startTime: null` on an instant
Prometheus query, Mastra's validateToolInput rejects it with

    Tool input validation failed ... startTime: must be string

and returns the error to the LLM — our `coercePrometheusArgs` never gets
a chance to fire because execute was never called.

The same latent bug was silently hiding `coerceLokiArgs` as well, but that
one only fixes shape-valid-but-wrong-value inputs (direction "forward" →
"backward") so validation passed and our wrapper ran, masking the problem.
`coercePrometheusArgs` fixes shape-invalid inputs (null → string), which
validation blocks upstream — which is why the same bug kept resurfacing
in prod even after three "fixes".

The fix: strip the `MASTRA_TOOL_MARKER` symbol (and the outputSchema) from
the wrapped tool object. Mastra's `isMastraTool()` now returns false, so
the framework takes the Vercel/AI-SDK tool path, which SKIPS input
validation. Our wrapped execute runs on the raw LLM args, coerces them,
then calls the original Mastra Tool instance's execute directly — which
forwards to the underlying MCP tool with the fixed args.

Tool-spec fidelity (what the LLM sees) is unchanged because inputSchema
is preserved. Only the server-side validation step is bypassed. Output
validation is also skipped (outputSchema dropped) to avoid false negatives
on the permissive MCP content-wrapper response shape.

Test coverage in tool-utils-coercion.test.ts:
- marker stripped, outputSchema dropped, inputSchema preserved, execute
  replaced (direct assertions on the wrapped object)
- coercePrometheusArgs runs end-to-end via the wrapper (already present)
- coerceLokiArgs drops stepSeconds and forces direction=backward end-to-end

* chore(helm): default imagePullPolicy to Always

Fixed-tag deploys (0.1.4, 0.2.0, etc.) that get re-pushed under the same
tag don't pick up the new digest when the node has a cached copy under
that tag. This bit us immediately after the marker-strip fix — I rebuilt
and re-pushed 0.1.4, but the rollout kept running the old digest because
kubelet saw "already have 0.1.4 cached" and skipped the pull.

Always-pull costs ~2-5s on pod start in exchange for the guarantee that
re-pushing a tag actually does what you expect. For a single-replica
deployment that only rolls on real code changes, that's fine.

Override to IfNotPresent per-environment if you never re-push tags and
care about faster rollouts.

* chore: bump to 0.1.5

* fix(discover): log complete prompt hints at debug level, not a placeholder

The `logLlmCall` at the success + error paths of runDiscoverStep was
passing `prompt: "${discoverPrompt}\n\n[hints: ${fullHints.length} chars]"`,
which replaced the actual hints content with a char-count placeholder.
With LLM_LOG_LEVEL=debug set, the operator should see the ENTIRE LLM
request — user message + the full datasource hints + recipes + skills —
not a stub. Inline `fullHints` directly so the debug entry is reproducible
from the log alone.

Bump to 0.1.6.

* fix(web): redirect to Operations Desk after accepting discovery

Accepting a discovery result used to leave the operator on the Services
page with the grid re-rendered. They then had to manually click the
Operations Desk (Dashboard) nav item to see the freshly-populated service
list in the live catalog. Automate that last step: after the
`discover:accept` WS message is sent, ServicesPage fires a new
`onDiscoveryAccepted` callback which App.tsx wires to
`setLeftPane({ type: "dashboard" })`.

No-op default on the prop so existing tests (and anyone calling the
component without the callback) keep working unchanged.

Bumping to 0.1.7.

* feat(observability): log full prompts on start + per-tool agentic loop

Adds two debugging affordances that were missing from the 0.1.4 LLM
observability work:

1. logLlmCallStart now accepts an optional `prompt` field and emits the
   full content at debug level as a separate "LLM <agent> start prompt"
   line. Previously the full prompt was only visible in the completion
   log, so you couldn't inspect what the model saw during a hang.

2. logToolCall now emits an info summary (tool name + bytes + duration)
   plus debug details (full args + result) for every tool call inside
   discover, evidence phases, and chat. Previously discover and evidence
   collected tool events silently and only flushed them in the final
   LLM call summary, making the agentic loop invisible in real time.

Bump 0.1.7 -> 0.1.8.

* fix(discover): coerce hallucinated datasource UIDs + harden prompt

The discover agent consistently passed datasourceUid: "prometheus" (a
hallucinated short name) instead of the real UID from the pre-injected
hints block. The MCP server rejected it with "Tool input validation"
errors, burning 2-4 retries per discovery run.

Two-layer fix:

1. wrapToolsWithCallbacks now intercepts hallucinated short names via a
   pre-built Map<shortName, realUid> and substitutes the real UID before
   the MCP call. Logs at info level when substitution fires so we can
   count hallucinations and track whether the prompt fix reduces them.

2. The discover agent's system prompt now renders datasource UIDs in a
   dedicated "CRITICAL: non-negotiable" block, separated from the recipe
   hints that previously said "these are suggestions".

Extended the coercion map through WorkflowConfig to evidence phases and
anomaly detection for consistency.

Bump 0.1.8 -> 0.1.9.

* fix(discover): truncated JSON recovery + queryType coercion + retry UI

Three discovery improvements found via 0.1.8/0.1.9 per-tool-call logs:

1. Response truncation: max_tokens 16384 was too small for 71-service
   environments. Bumped to 32768 and added recoverTruncatedJsonArray
   to safeJsonParse — finds the last complete } in a truncated [{...},
   {... array and closes it. Strategy runs before single-object
   extraction so we don't pick up a stray inner object.

2. queryType: null coercion in coercePrometheusArgs — defaults to
   "instant", eliminating 1 wasted tool call per discovery attempt.

3. Retry visibility: new discover:retry WebSocket event + UI indicator
   showing "Attempt 2 of 3" so the user knows it's retrying, not stuck.

Also added parse failure diagnostics: logs response length + first/last
200 chars so truncation is diagnosable from a single log line.

Bump 0.1.9 -> 0.1.10.

* feat(server): log version on startup

Reads package.json version and includes it in the startup log line:
"dops-assistant v0.1.10 web server running on port 3000"

Structured fields include both port and version for log queries.

* fix(web): reset discovery state on Run Discovery click

When clicking Run Discovery after a previous run, the UI showed
"Validating services..." (stale phase from prior run) instead of
"Discovering services...". The onStartDiscovery handler only sent the
WS message without resetting discoveryState. Now resets to phase
"discovery" + status "running" before sending, so the UI immediately
shows the correct initial state.

* feat(web): add shimmer animation to discovery progress bar

The progress bar was static between tool calls, making the UI look
stuck during LLM thinking phases. Added a shimmer gradient overlay
that animates continuously, giving visual feedback that the process
is still active even when no iteration events are firing.

* fix(discover): cap tool results at 30k chars to prevent token exhaustion

k8s_pods_list returns 83k chars of raw pod data. With max_tokens=32768,
the LLM exhausts its entire token budget digesting the pod list and
produces a 0-char response (32938 output tokens, 0 chars text).

Added maxToolResultChars param to wrapToolsWithCallbacks. Discovery
tools are now capped at 30k chars with a truncation note. Moved
truncateMcpResult to tool-utils.ts (was duplicated in agents.ts).

This cuts discovery input tokens from ~90k to ~40k and prevents the
0-char exhaustion failure mode.

* fix(discover): prompt hint to avoid unfiltered pod lists

Replace the 30k tool result cap with a prompt instruction telling the
agent to prefer deployments/statefulsets or metric queries over raw pod
lists. Raw k8s_pods_list returns 83k+ chars and exhausts the token
budget. The prompt now explicitly tells the agent to use fieldSelector
or labelSelector if it must list pods, and to prefer pre-aggregated
metric queries instead.

Kept truncateMcpResult in tool-utils.ts as a shared utility but removed
the discovery-specific cap — the prompt hint is the primary fix.

* fix(agents): tighten k8s and log query guidance to reduce token waste

infra.ts: tell the agent to pass labelSelector/fieldSelector directly
in tool calls instead of fetching all pods/events then filtering
in-memory. An unfiltered k8s_pods_list returns 83k+ chars.

logs.ts: clarify that "without filter" means without pattern filter,
not without the service label selector. Added explicit line limits
(100 for error context, 10 for existence check).

discover.ts: fix blocklist test — use generic "metric or log query
tool" instead of hardcoded tool names in the datasource UID block.

* fix(discover): use filtered pod list as second pass, not skip it entirely

The previous prompt told the agent to prefer metrics over pod lists,
which caused it to skip k8s_pods_list entirely. This lost 38 of 47
services — container-level services (celery workers, sidecars,
multi-container pods) whose names differ from their deployment name
are invisible to kube_deployment_status_replicas metrics.

Now: metrics first for the initial sweep, then a filtered pod list
(excluding kube-system/kube-public/kube-node-lease) as a second pass
to catch container-level services.

* fix: consul skill exhaustive discovery + stackId in poller logs

consul-bare-metal-discovery.md: changed "Common bare-metal services"
to "Important: discover ALL Consul services, not just known ones".
The agent was treating the example list as exhaustive and skipping
the consul_catalog query, missing 6 sub-components (hdfs-journalnode,
hdfs-zkfc, hive-metastore, impala-catalog, kudu-tserver, fazbdregistry).

service-health-poller.ts: added stackId to all poll log lines (tool
not found, query result, empty query result, poll complete). Previously
impossible to tell which stack's poller was failing.

* fix(test): add buildDatasourceUidMap to ws-handler mock
WZ added a commit that referenced this pull request Apr 23, 2026
* feat(scan): Slice A — schema + registry shape for discovery-owned probe rules

Prepares the scan pipeline for discovery-written probe rules. Schema adds
the per-rule `source` discriminator ("metrics" | "logs") and the per-service
`probeRules` field; registry persists a new top-level object shape
{services, globalProbeRules} with forward-compat reads of legacy flat-array
files. No runtime behavior change — Slice B (discovery prompt) and Slice C
(four-track probe evaluator) build on this.

Schema (src/config/schema.ts):
  - ProbeMetricRuleSchema gains `source: z.enum(["metrics","logs"]).default("metrics")`
    so the probe knows which MCP tool role to dispatch against. Default is
    "metrics" — every existing config and #115 k8s-native rule continues to
    route through the Prometheus tool unchanged.
  - ServiceSchema gains `probeRules: z.array(ProbeMetricRuleSchema).optional().default([])`.
    Populated by discovery in Slice B; empty today.
  - ProbeSchema gains `logsQueryTimeoutMs: z.number().default(10_000)`.
    Loki `count_over_time` over a 15m window regularly takes 5-20s; reusing
    the 3s Prometheus timeout would produce silent false negatives. Eng-review
    decision 2026-04-22.
  - ProbeMetricRuleSchema + ThresholdSchema moved above ServiceSchema so the
    probeRules field can reference them (Zod requires forward declaration).

Registry (src/services/registry.ts):
  - Persisted shape inverted from ServiceConfig[] to RegistryFile
    {services, globalProbeRules}.
  - Forward-compat reader: legacy flat-array services.yaml files parse into
    {services: parsed, globalProbeRules: []} without migration. The first
    write upgrades the file on disk. Operators on old files continue to load.
  - Public API preserved: load()/save()/getVersion()/rollback()/listVersions()
    keep their signatures. No call site in routes.ts, agents.ts,
    scan-scheduler.ts, or service-health-poller.ts had to change.
  - save() internally reads current globalProbeRules and carries them forward
    on every write — routes.ts metadata edits, service renames, and rollback
    cannot silently clobber the discovery-written top-level rules. Eng-review
    decision 2026-04-22 (silent-clobber guard).
  - rollback() preserves CURRENT globalProbeRules (not historic ones). Historic
    snapshots from before Slice A have globalProbeRules=[]; rolling back to
    them would wipe the live rules.
  - New methods: loadGlobalRules(), saveGlobalRules(rules, source),
    loadAll() (atomic combined snapshot for the scan probe's per-tick read),
    saveAll(file, source) (atomic combined write — preferred for discovery,
    produces one version entry instead of two), getVersionFile(id) (historic
    snapshot including globals-as-of-that-version).

GUI validator (src/server/scan-rule-validator.ts):
  - RuleSchema gains `source: z.enum(["metrics","logs"]).default("metrics")`
    so GUI-authored rules parse into the canonical ProbeMetricRule shape.
    Log-source rules come from discovery in Slice B; the GUI editor stays
    metrics-only for v1.

Discovery types (src/types/discovery-types.ts):
  - ValidatedServiceConfigSchema and ServiceRegistryVersionSchema pre-thread
    probeRules so Slice B can populate the field without a second type-level
    change. Empty default today.

Tests (10 new, 3 updated):
  - schema.test.ts: source defaults, log-source accept, invalid source reject,
    logsQueryTimeoutMs default + validation, per-service probeRules parse.
  - registry.test.ts: legacy-shape read, saveGlobalRules round-trip,
    save()-preserves-globals (silent-clobber guard), rollback()-preserves-
    current-globals, saveAll() atomic write (one version entry), forward-
    compat loadAll() on legacy files, first-save upgrades on-disk shape.
  - scan-service-override.test.ts + scan-settings.test.ts: two shape
    assertions now include `source: "metrics"` where the validator defaults it.

Full suite: 1381/1381 pass in isolation per file. One rate-limit test flakes
under parallel load, passes standalone — not caused by this change.

Design doc: ~/.gstack/projects/WZ-dops-assistant/wli02-feat-llm-driven-probe-rules-design-20260422-continuation.md

* feat(discover): Slice B — LLM writes probe rules + discover-eval harness

Teaches the discovery agent to emit per-service probeRules and top-level
globalProbeRules, then plumbs them through validate → accept → registry
so they land in services.yaml. Adds discover-eval as a quality gate so a
prompt regression that stops emitting rules surfaces before it ships.

No runtime behavior change for the probe yet — Slice C (four-track probe
evaluator) is what actually reads the rules. This slice is data-producing
only: the fields show up in services.yaml, the probe still runs against
the hardcoded config.yaml defaults.

Prompt changes (src/agents/discover.ts):
  - Output shape: JSON OBJECT {services, globalProbeRules} instead of a
    bare services array. Bare-array form stays accepted for backward-compat
    with stacks that don't produce globals.
  - Per-service probeRules guidance: the agent writes `pod_restarts` for
    k8s workloads with a resolvable namespace, and `log_errors` for
    services with non-empty logLabels. Both are spec'd with canonical
    names, selectors, and thresholds so output stays consistent run to run.
  - globalProbeRules guidance: label-key introspection (majority-wins on
    app / service / job / deployment / statefulset / daemonset). Rewrites
    availability rules using the correct key for the stack. When the stack
    matches the hardcoded k8s defaults, globalProbeRules stays empty (no
    duplication).

Data flow (src/workflows/steps/discover.ts, src/workflows/discovery.ts,
src/server/agents.ts, src/server/ws-handler.ts, src/cli/commands/discover.tsx):
  - runDiscoverStep returns `{services, globalProbeRules}`; both legacy
    bare-array and new object-form agent outputs parse correctly.
  - runDiscovery threads the pair through validation.
  - IDiscoverAgent.discover() returns DiscoveryResult; accept() takes an
    optional third globalProbeRules arg.
  - MastraDiscoverAdapter.accept() routes to registryStore.saveAll() when
    globals are provided (one atomic version entry), else falls through to
    save() which preserves the file's current globals (silent-clobber
    guard from Slice A). Exactly the API shape Slice A's registry was
    designed to accept.
  - ws-handler caches the full DiscoveryResult in pendingDiscovery so the
    discover:accept handler can pull globalProbeRules server-side — the UI
    only echoes services over the wire.
  - cli discover.tsx holds globalProbeRules in state alongside services
    and passes both on accept.

Eval harness (src/eval/discover-eval.ts, src/eval/fixtures/discover-k8s-fixture.yaml):
  - Four 25-point dimensions, 100 total: globals present, per-service
    present (partial credit when minority of services have rules),
    PromQL parses, LogQL parses.
  - Lightweight parsers catch the failure modes an LLM actually produces:
    empty string, unbalanced braces/parens, YOUR_NAMESPACE / <namespace>
    placeholder tokens. Runtime NaN fail-safe in Slice C catches
    "parses but returns empty vector".
  - Fixture at src/eval/fixtures/discover-k8s-fixture.yaml represents
    expected well-formed output on an `app=`-keyed k8s stack. Scores
    100/100 today; any drift in scoring calibration surfaces as a test
    failure immediately.
  - npm run test:discover-eval — gates at --min-score 75 against the
    fixture. Wire up against a live services.yaml in CI once the agent
    has run on a real cluster.

Tests:
  - discovery.test.ts: new case covering the object-form agent output
    path; existing cases updated to assert {services, globalProbeRules}
    shape.
  - discover-eval.test.ts (new, 17 tests): PromQL/LogQL parsers, each
    dimension scorer, top-level eval against fixture + empty + legacy
    flat-array inputs.

Full suite: 1399/1400. One rate-limit flake under parallel load, passes
standalone — same pre-existing one from the Slice A run.

Design doc: ~/.gstack/projects/WZ-dops-assistant/wli02-feat-llm-driven-probe-rules-design-20260422-continuation.md

* feat(scan): Slice C — four-track probe evaluator + Loki metric-queryType

Makes the probe READ what Slice B's discovery agent writes. Replaces the
single-track config.yaml-only loop with a four-track evaluator that fans out
across global rules, per-service rules (metrics + logs), config defaults,
and a probe.logs fallback. Hysteresis state keys are origin-namespaced so
same-named rules on different tracks track independently.

This is the slice that actually changes runtime behavior. Stacks where
discovery has run will now evaluate the discovery-written rules; stacks
where it hasn't continue to fall through to PR #115's k8s-native defaults
(byte-identical regression-tested).

Probe (src/server/anomaly-probe.ts):
  - ProbeOptions gains `registryStore: ServiceRegistryStore` and
    `lokiDatasourceUid?: string`. The store is read once per tick via
    `loadAll()` for an atomic {services, globalProbeRules} snapshot —
    discovery running mid-tick can't produce a half-state.
  - New `RuleOrigin` type ("global" | "service" | "default" | "override" |
    "logs-fallback"). Threaded into ProbeHit and the state key.
  - `stateKey(service, origin, ruleName)` exported — namespaced so a
    global "availability" and a per-service "availability" with the same
    name share neither hysteresis nor a state-clearing pass.
  - Shared `withTimeoutAndAbort(tool, args, signal, ms)` helper extracted
    from executeInstant. Both metric and log executors use it. DRY win
    flagged in the code-quality review.
  - `executeInstantLogs(...)` calls the Loki MCP tool with
    `queryType: "metric"` so `count_over_time(...)` returns a scalar
    (vs `query_loki_logs`'s default which returns log lines). Uses
    probe.logsQueryTimeoutMs (10s default) — Prom's 3s timeout would
    silently NaN out on slow Loki clusters. NaN on any failure, never
    throws. If the tool rejects queryType:"metric" (older Grafana MCP),
    the error is logged once via withTimeoutAndAbort and every log-source
    rule scores NaN for the rest of the tick.
  - `findLogQueryTool(tools)` mirrors findMetricQueryTool: prefers
    `query_loki_logs`, falls back to generic log-query tool names.
  - `runProbe` rewritten as four-track evaluator. Per service:
    * Operator override (disabled or rules) wins. Marked origin:"override".
    * Tracks 1 vs 4 — globalProbeRules REPLACE config.yaml defaults when
      non-empty (origin:"global"); otherwise config defaults fire
      (origin:"default"). Tier 4 stays as the ultimate fallback for
      stacks where discovery has never run.
    * Tracks 2 + 3 — per-service probeRules are ADDITIVE on top of the
      base. Discovery writes only the rules it has unique context for
      (pod_restarts with real namespace, log_errors with real Loki
      labels). origin:"service".
    * probe.logs generic fallback fires only when probe.logs.enabled,
      the service has logLabels, and no per-service log-source rule was
      written. origin:"logs-fallback". Threshold scales errorRateThreshold
      by window minutes (raw count = per-min × min).
  - Single shared mapWithConcurrency semaphore across all four tracks +
    fallback — probe.concurrency caps total in-flight queries per tick.
  - Service registry resolution and per-service config lookup is O(1) via
    a Map built once per tick.

Scheduler (src/server/scan-scheduler.ts):
  - Threads registryStore + lokiDatasourceUid through to runProbe.
  - New optional dep `getLokiDatasourceUid?: () => string | undefined` —
    when undefined, log-source rules score NaN, metrics-source rules
    continue. Wired live so operators can add Loki later without restart.
  - resetHysteresisForService updated for 3-part key: switched from
    `slice(0, lastIndexOf(":")) === service` to `startsWith(service+":")`.
    The previous parser would have silently failed to clear any
    Slice-C-format key (it would have compared "svc-a" to "svc-a:global").
    Bug never reached prod — caught in this slice's own development.
  - resetHysteresisForChangedRules works unchanged: rule name is still
    after the last colon in the new key format.

Tests (src/server/anomaly-probe.test.ts):
  - All 14 existing runProbe call sites updated to pass `registryStore`
    via a small `fakeRegistryStore({...})` helper. State-key assertions
    use the new origin-namespaced helper `defaultKey(service, ruleName)`
    so the change is visible at the test layer too.
  - `mockLogsTools` added at module scope; getToolsByRole mock now routes
    by role so log-source tracks can supply their own tool surface.
  - Nine new four-track scenarios:
    * Track 1 replaces track 4 when global rules exist.
    * Track 4 regression: byte-identical to PR #115 when no globals.
    * Track 2 additive — global + per-service both fire.
    * Track 3 calls the LOGS tool (not metrics) with queryType:"metric"
      and the Loki UID.
    * Track 3 scores NaN with no logs tool wired.
    * probe.logs fallback fires from logLabels when no per-service
      log-source rule exists.
    * probe.logs fallback does NOT fire when one does.
    * Origin-namespaced state keys: same-named rules on different tracks
      keep independent counters.
    * Registry snapshot atomicity: loadAll() called exactly once per
      tick across N services.

Full suite: 1407/1407 (102/102 files green this run — no rate-limit flake).
Type check clean.

Design doc: ~/.gstack/projects/WZ-dops-assistant/wli02-feat-llm-driven-probe-rules-design-20260422-continuation.md

* fix(scan): address /review adversarial findings — validate LLM output, preserve globals on empty sweep, GC orphaned hysteresis

Three HIGH-priority findings from the pre-landing /review that both Claude
(subagent) and Codex adversarial passes converged on. Plus three mechanical
auto-fixes from the critical-pass.

## Mechanical auto-fixes

- `src/eval/rca-eval.test.ts` — ProbeHit literal missing the new required
  `origin` field. tsconfig.json excludes `*.test.ts` from tsc so the type
  error didn't surface; vitest passed because buildInvestigationMessage
  doesn't read `origin`. Fixed before the next consumer trips over it.
- `src/web/components/scan/types.ts` — comment claimed "keep in lockstep
  with ProbeMetricRule" but RuleDraft deliberately omits `source` (GUI
  editor is metrics-only in v1; validator defaults to "metrics" on PUT).
  Comment updated to reflect the actual contract.
- `src/server/anomaly-probe.ts buildGenericLogQLFromLabels` — escape
  backslash before double-quote so logLabel values containing `\` don't
  double-escape. k8s RFC 1123 label values never need this in practice,
  but the function takes an untrusted map and the fix is 3 characters.
- `src/server/scan-scheduler.ts:329` — inline comment said
  `"service:ruleName"` but Slice C changed the format to
  `"service:origin:ruleName"`. Parse still works (lastIndexOf gets the
  right colon), but the comment would mislead a future maintainer trying
  to extract the service prefix. Updated with an explicit warning not to
  use `slice(0, colonIdx)` for service extraction.

## Adversarial fix A — validate LLM-written rules before persistence

Cross-model HIGH (Claude F3 + Codex #3). runDiscoverStep was casting raw
JSON from the agent to ProbeMetricRule with `as`. Worst cases documented:
malformed threshold op silently never trips; rule name `db:slow` breaks
state-key parsing; `source: "log"` (typo) routes a LogQL query to the
Prometheus tool and NaNs — the operator sees a services.yaml that claims
log monitoring exists, but none runs.

Fix in `src/workflows/steps/discover.ts`:
- `ProbeMetricRuleSchema` exported from `src/config/schema.ts` so the
  discovery path can safeParse against the canonical shape.
- `validateDiscoveredRules(raw, source)` drops rules that fail shape
  validation (including invalid threshold op, missing query, wrong
  source value) AND rules whose name violates `^[^:]+$` (the state-key
  delimiter constraint scan-rule-validator already enforces on the GUI
  path). Structured warn log per dropped rule so operators can see what
  the LLM produced.
- `validateDiscoveredServices(raw)` applies the same guard to per-service
  `probeRules` before the service reaches the full-shape validator.

The scan-rule-validator (GUI path) is explicitly scoped — reusing it
would have bundled its `.strict()` mode which rejects unknown keys and
would break when we add fields later. Parallel schema is the right
tradeoff.

## Adversarial fix B — globals survive empty service sweep

Codex HIGH #2. runDiscoverStep:283 required `parsed.services.length > 0`
to accept the object form, but runDiscovery:41 has an explicit
`complete-empty` phase that surfaces globalProbeRules when services is
empty. Today a transient empty discovery would silently wipe the learned
label-key override, pushing the probe back to config.yaml defaults.

Fix: the object-form branch now accepts when EITHER services OR
globalProbeRules is non-empty. A discovery run that only wrote globals
(label-key introspection succeeded, service sweep came back empty) is
a valid useful result.

## Adversarial fix C — GC orphaned hysteresis state at tick start

Cross-model HIGH (Claude F1 + Codex #5). `resetHysteresisForChangedRules`
only ran on `PUT /api/scan/settings` (config.yaml reload). Discovery's
accept() path writes to services.yaml but never resets state. Old
counters under `{svc}:global:OLD_RULE_NAME` leaked forever after a
discovery-driven rule rename. Same pattern for removed per-service
rules and hidden services.

Fix in `runProbe` (`src/server/anomaly-probe.ts`): after building the
tick's task list, compute the set of active stateKeys, and drop every
consecutiveState entry not in that set. Covers every rule-change path —
discovery writes, config reloads, override toggles, hidden-service
changes — with one Map diff per tick. Debug-log the orphan count.

## Testing

- `discovery.test.ts`: +3 new tests exercising the three fixes
  (B: object form accepted with empty services + non-empty globals;
  A: rule with `db:slow` name dropped; A: rule with invalid threshold
  op dropped). Mock adapter extended with a per-test reply-override
  escape hatch.
- `anomaly-probe.test.ts`: +1 test pre-loading orphan state keys,
  running a tick with a different rule name, verifying both orphans
  were GC'd. Exercises both the "rule renamed" and "service removed"
  cases in one assertion.
- Full suite: 102/102 files green. Type check clean.

## Deferred (noted in the design doc "Not in Scope")

- File locking on registry writes — RMW race window is real but narrow;
  no existing writer uses locking. P2.
- Static query-cost bounds in discover-eval — runtime timeouts
  (queryTimeoutMs 3s, logsQueryTimeoutMs 10s) cap per-query wall-clock.
  P2 if prompt-driven DDoS ever surfaces.

Design doc: ~/.gstack/projects/WZ-dops-assistant/wli02-feat-llm-driven-probe-rules-design-20260422-continuation.md

* chore: bump version and changelog (v0.1.0.0)

MINOR bump — discovery-owned probe rules ship. Discovery agent writes
per-service `probeRules` and top-level `globalProbeRules` into services.yaml;
four-track probe evaluator reads them; new discover-eval harness gates
output quality.

Also syncs `package.json.version` (0.1.14, stale from old 3-digit scheme)
back in line with the VERSION file's 4-digit scheme.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs: update CLAUDE.md for discovery-owned probe rules (v0.1.0.0)

- Add Where-to-Look entries for anomaly-probe, scan-scheduler, discover-eval
- Note services.yaml shape change ({services, globalProbeRules}, flat-array forward-compat)
- Document four-track probe evaluator + LLM output validation
- Add npm run test:discover-eval to commands
- Refresh test-file count (73 -> 100+)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
WZ added a commit that referenced this pull request Apr 23, 2026
* refactor(scan): remove Status section from Settings tab

Status display (next run / last run / ticking) and "Scan now" trigger now
live in the Operation Desk view. Keeping them here duplicated the same
data and forced a 10s polling interval that the Settings tab doesn't need.

Removes the status fetch, the polling interval + its cleanup, the trigger
handler + timers, and the Field / formatTimestamp helpers that were only
used here.

Addresses finding #2 from /design-review on Settings → Scan.

* style(scan): plain-language copy on the Settings tab

Rewrites every helper sentence to avoid internal jargon ("probe", "tick",
"trip", "PromQL") that operators don't need to learn to read the settings
page. "Cron expression" label becomes "Schedule" (the expression itself is
still shown). Helper text is lifted from 10px mono to 12px regular so it's
actually readable. The top-of-page blurb now leads with what the feature
does for the user, not how it works internally.

Addresses finding #3 from /design-review on Settings → Scan.

* feat(scan): cron preset shortcuts on Settings tab

Adds a row of pill buttons above the schedule input (Every 15 min, Hourly,
Every 4 hours, Daily, Weekly) that fill the cron field in one click. The
pill matching the current value is highlighted so the user can see which
preset their schedule matches, if any. Typing a custom expression still
works and simply leaves no pill highlighted.

Addresses finding #4 from /design-review on Settings → Scan.

* feat(scan): default timezone to the browser's timezone

When the user hasn't saved a GUI timezone yet (settings.source.timezone
is still "config", i.e. the UTC default from config.yaml), pre-fill the
form with Intl.DateTimeFormat().resolvedOptions().timeZone. Marks the
form dirty so the Save button lights up and the user can persist it with
one click.

Users who explicitly want UTC can clear the field and save; a user who
has already saved a specific timezone keeps theirs.

Addresses finding #5 from /design-review on Settings → Scan.

* style(scan): tidy per-rule editor layout

Collapses the three header actions (move up, move down, remove) from
three differently-styled bordered buttons into a single horizontal icon
strip with subtle hover backgrounds. Each icon is a 28px square, no
per-button border, destructive hover only on the × button. The rule
card now has one clean header row instead of two ragged columns.

Other touch-ups in the same editor:
- "Consecutive ticks" label → "Scans in a row" (matches the vocabulary
  used elsewhere on the tab)
- Drops the redundant helper under the ticks input (label is clear)
- Rewrites the query helper to drop "Probe" jargon
- Remove-confirm dialog uses plain language instead of "hysteresis state"

Addresses finding #1 from /design-review on Settings → Scan.

* style(design): dark-theme Add recipient modal

Align the email recipient editor modal with the dark-theme design tokens
used by EmailRecipientsSection. Swap the white card for bg-card, adopt the
mono uppercase label/title pattern, use bg-background/40 inputs with
border-border/40, accent-primary radios/checkboxes, and replace raw button
elements with the shared Button (outline Cancel, primary Save). Backdrop
gains backdrop-blur and proper dialog ARIA.

* chore: bump version and changelog (v0.1.2.0)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants