Skip to content

refactor: remove Slack/scheduler, rename ChatAgent, add Dockerfile#4

Merged
WZ merged 2 commits into
mainfrom
feature/mvp
Mar 5, 2026
Merged

refactor: remove Slack/scheduler, rename ChatAgent, add Dockerfile#4
WZ merged 2 commits into
mainfrom
feature/mvp

Conversation

@WZ
Copy link
Copy Markdown
Owner

@WZ WZ commented Mar 5, 2026

Summary

  • Remove Slack bot interface, webhook notifications, RCA blocks, and scheduler
  • Rename AgentCoreChatAgent, AgentTaskChatRequest, AgentResultChatResponse
  • Remove related Prometheus metrics and config schemas
  • Add production Dockerfile (multi-stage build)
  • Add historical design/plan docs

Test plan

  • npx tsc --noEmit passes
  • npx vitest run passes
  • docker build . succeeds

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings March 5, 2026 09:03
Wilson Li and others added 2 commits March 5, 2026 01:05
…e to ChatAgent

- Remove Slack bot interface, webhook notifications, and RCA blocks
- Remove scheduler and anomaly check config
- Rename AgentCore → ChatAgent, AgentTask → ChatRequest, AgentResult → ChatResponse
- Remove related metrics (scheduler_checks, slack_messages, alert_notifications)
- Clean up config schema, index.ts, and docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@WZ WZ merged commit bc92edc into main Mar 5, 2026
1 check passed
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the project from a Slack + scheduler-driven assistant into a CLI-first assistant by removing Slack/scheduler functionality, renaming the core agent API, and adding containerization and historical design/plan docs.

Changes:

  • Remove Slack bot/webhook notifications, the scheduler, and related metrics/config schema.
  • Rename the core agent surface area (AgentCoreChatAgent, runchat, plus associated request/response types).
  • Add a multi-stage production Dockerfile and update CLI rendering/prompting to better fit terminal usage.

Reviewed changes

Copilot reviewed 30 out of 31 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/scheduler/scheduler.ts Removed scheduler implementation.
src/scheduler/scheduler.test.ts Removed scheduler tests.
src/observability/metrics.ts Removed Slack/scheduler-related Prometheus counters.
src/notifications/slack-webhook.ts Removed Slack webhook notifier.
src/notifications/slack-webhook.test.ts Removed Slack webhook tests.
src/notifications/rca-blocks.ts Removed Slack Block Kit formatting.
src/notifications/rca-blocks.test.ts Removed RCA Slack blocks tests.
src/interfaces/slack.ts Removed Slack bot interface.
src/interfaces/slack.test.ts Removed Slack bot tests.
src/interfaces/cli/Markdown.tsx Add table width capping/truncation for terminal rendering.
src/interfaces/cli/App.tsx Update CLI app to use ChatAgent.chat() API.
src/index.ts Remove Slack/scheduler wiring; keep MCP/LLM init + observability + discovery.
src/config/schema.ts Remove Slack/scheduler config; keep discovery + agent config.
src/config/schema.test.ts Remove scheduler config schema test.
src/cli.tsx Update CLI to instantiate ChatAgent and adjust startup output.
src/agent/types.ts Rename request/response types to ChatRequest/ChatResponse.
src/agent/prompts.ts Add instruction to avoid markdown tables in terminal output.
src/agent/core.ts Rename AgentCoreChatAgent and run()chat().
src/agent/core.test.ts Update tests to new ChatAgent.chat() API.
package.json Remove Slack/scheduler deps and update package description.
docs/runbook.md Remove Slack/scheduler setup/config and update wording to CLI mode.
docs/plans/2026-02-28-investigation-context-gathering-plan.md Add historical plan doc.
docs/plans/2026-02-28-investigation-context-gathering-design.md Add historical design doc.
docs/plans/2026-02-27-service-discovery-plan.md Add historical plan doc.
docs/plans/2026-02-27-service-discovery-design.md Add historical design doc.
docs/plans/2026-02-25-rca-investigation-pipeline-plan.md Add historical plan doc.
docs/architecture-overiew.md Update architecture doc to CLI-first components.
config.yaml.example Remove Slack/scheduler sections from example config.
README.md Update README to describe CLI-first usage and remove Slack references.
Dockerfile Add multi-stage production Docker build.
Comments suppressed due to low confidence (1)

package.json:11

  • npm start/main still point at dist/index.js, but src/index.ts no longer starts any UI or request handler (it just connects to MCP and logs "running"). This makes the default production entrypoint effectively do nothing. Align the entrypoint with the CLI (e.g., make start run the compiled CLI entry, or add a real server interface in index.ts and document it) and update main accordingly so consumers don’t run a no-op process.
  "description": "Agentic infrastructure monitoring assistant — Grafana MCP + CLI",
  "type": "module",
  "main": "dist/index.js",
  "scripts": {

    "build": "tsc",
    "start": "node dist/index.js",
    "dev": "tsx src/index.ts",

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread README.md
Comment on lines 40 to +88
@@ -61,21 +57,15 @@ cp config.yaml.example config.yaml
# 3. Set environment variables (or use a .env loader)
export OPENAI_API_KEY=sk-...
export GRAFANA_SERVICE_ACCOUNT_TOKEN=glsa_...
export SLACK_BOT_TOKEN=xoxb-...
export SLACK_APP_TOKEN=xapp-...
export SLACK_WEBHOOK_URL=https://hooks.slack.com/...

# 4. (Optional) Auto-discover services from Consul/Prometheus
npm run discover

# 5. Run in development mode
npm run dev
# 5. Run the CLI
npm run cli

# 6. Or build and run in production
npm run build && npm start

# 6. Or run with Docker Compose
docker compose up --build
```

The config path defaults to `config.yaml` in the working directory. Override with `CONFIG_PATH=/path/to/config.yaml`.
@@ -88,17 +78,14 @@ The config path defaults to `config.yaml` in the working directory. Override wit
| `grafana.mcpServer` | MCP transport (`stdio` or `http`), command/URL, enabled tools |
| `services` | Services to monitor — PromQL queries and Loki log labels per service |
| `discovery` | Auto-discovery settings — `autoRefresh`, `excludeServices`, `consulMetric` |
| `scheduler.anomalyCheck` | Check interval (e.g. `"5m"`), which services, max concurrency |
| `agent` | Max agentic loop iterations, conversation memory size and TTL |
| `notifications.slack` | Webhook URL and channel for outbound anomaly alerts |
| `interfaces.slack` | Enable the conversational Slack bot, bot token, app token |

See [docs/runbook.md](docs/runbook.md) for a full annotated configuration reference and setup guides for Slack and Grafana.
See [docs/runbook.md](docs/runbook.md) for a full annotated configuration reference and Grafana setup guide.

## Documentation

- [docs/architecture.md](docs/architecture.md) — component walkthrough and design decisions
- [docs/runbook.md](docs/runbook.md) — Slack setup, Grafana MCP setup, full config reference, tuning, troubleshooting
- [docs/runbook.md](docs/runbook.md) — Grafana MCP setup, full config reference, tuning, troubleshooting
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

README references docs/architecture.md, but the repo contains docs/architecture-overiew.md (and no docs/architecture.md). This makes the documentation links in the README broken; update the links (and/or rename the architecture doc file) so users can actually navigate to the architecture writeup.

Copilot uses AI. Check for mistakes.
Comment on lines 5 to 8
dops-assistant is built as a layered dependency graph — each layer depends only on the layers below it. This makes components testable in isolation (mocking the layer below) and makes it straightforward to swap implementations (e.g. a different LLM provider) without touching the layers above.

The three entry points into the system — the Scheduler, Slack Bot, and CLI — all delegate to the Agent Core, which is the only component that knows about both the LLM and the MCP client. A separate Discovery Agent uses the same LLM + MCP stack to auto-discover services at startup or via the `npm run discover` CLI.
The CLI classifies user input via the IntentClassifier, then delegates to **ChatAgent** (conversational questions) or **InvestigationAgent** (structured RCA). A separate **DiscoveryAgent** uses the same LLM + MCP stack to auto-discover services at startup or via `npm run discover`.

Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This overview frames the CLI as the entrypoint, but later in this doc (outside this hunk) it states config is loaded once in src/index.ts. Given the refactor away from Slack/scheduler and the updated README emphasizing the CLI, please ensure the rest of this architecture doc consistently reflects the current entrypoints and where config is loaded (CLI vs index.ts vs discover).

Copilot uses AI. Check for mistakes.
Comment thread Dockerfile
# Default path; override with CONFIG_PATH env var
ENV CONFIG_PATH=/app/config.yaml

CMD ["node", "dist/index.js"]
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The image CMD runs dist/index.js, but src/index.ts doesn’t start an interactive CLI or any API handler anymore (it only connects to MCP and starts observability). As a result, the container will start successfully but provide no way to interact with the assistant. Update the container entrypoint to launch the intended interface (likely the CLI) or implement/expose a non-interactive service interface in index.ts and keep the Docker CMD consistent with that.

Suggested change
CMD ["node", "dist/index.js"]
CMD ["node", "dist/cli.js"]

Copilot uses AI. Check for mistakes.
Comment thread src/index.ts
Comment on lines 40 to 68
@@ -70,47 +67,8 @@ async function main(): Promise<void> {
}
}
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

index.ts now constructs ChatAgent, InvestigationAgent, IntentClassifier, ConversationMemory, and even runs optional service discovery, but none of these are ever used after initialization. This leaves a long-running process that holds an MCP connection open and logs "dops-assistant running" without any interface to accept user input. Either wire these components into an actual entrypoint (CLI or API), or remove the unused initialization and make index.ts explicitly just an observability/health daemon.

Copilot uses AI. Check for mistakes.
Add to `scripts` section:

```json
"discover": "NODE_TLS_REJECT_UNAUTHORIZED=0 NODE_NO_WARNINGS=1 tsx src/discover.tsx"
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The discover script uses NODE_TLS_REJECT_UNAUTHORIZED=0, which disables TLS certificate verification for the discovery CLI and any HTTPS requests it makes. An attacker on the network could man-in-the-middle connections to Grafana or your LLM endpoint, intercepting API keys and investigation data or injecting tampered responses. Replace this with proper certificate trust (for example, custom CAs or per-client TLS options) and ensure CLI scripts run with TLS verification enabled by default.

Suggested change
"discover": "NODE_TLS_REJECT_UNAUTHORIZED=0 NODE_NO_WARNINGS=1 tsx src/discover.tsx"
"discover": "NODE_NO_WARNINGS=1 tsx src/discover.tsx"

Copilot uses AI. Check for mistakes.
WZ pushed a commit that referenced this pull request Mar 23, 2026
… fallback

- Add onTokenUsage callback for the generateText call so token usage
  is tracked in phase metrics (Codex adversarial finding #4)
- Wrap regex fallback in try/catch with ultimate 8h default so a bad
  date in the user message can never abort the workflow (Codex #2)
WZ added a commit that referenced this pull request Mar 24, 2026
* feat: LLM-based time range extraction and report time window display

Replace regex-only time parsing with LLM-based extraction in the anomaly
step so natural language like "last Fri around 4PM" resolves to the correct
investigation window. Display the investigated time range in the RCA report
header across web, CLI, and markdown export.

- Add generateText call in anomaly step with 10s timeout and validation
  (no future dates, no >30d ago, from <= to)
- Fallback chain: LLM → regex (resolved to absolute UTC) → 8h default
- New resolveTimeRangeToAbsolute() converts Grafana-relative strings to
  absolute UTC for report persistence (prevents stale "now-8h" metadata)
- Thread timeRange through evidence → synthesis → post-synthesis pipeline
  via pass-through field in shared evidence factory
- Add timeRange to RcaReport type, all workflow schemas, and server adapter
- Display in RcaReport.tsx header with formatTimeRange helper
- Add to CLI App.tsx and formatRcaMarkdown.ts
- 7 new tests (resolveTimeRangeToAbsolute + markdown timeRange)

* fix: move clearTimeout to finally block in LLM time extraction

Prevents timeout callback leak when generateText throws (e.g., AbortController
fires). The callback was harmless (abort on aborted controller is a no-op) but
the finally block is the correct pattern.

* fix: address adversarial review findings

- Guard against invalid ISO dates in resolveTimeRangeToAbsolute (e.g.,
  "2026-02-30" which creates Invalid Date and throws on toISOString)
- Include changesFindings in timeRange fallback chain in synthesis step
- Use toLocaleString instead of toLocaleDateString for cross-day ranges
  (toLocaleDateString ignores hour/minute options in some environments)

* fix: report token usage from time extraction LLM call and guard regex fallback

- Add onTokenUsage callback for the generateText call so token usage
  is tracked in phase metrics (Codex adversarial finding #4)
- Wrap regex fallback in try/catch with ultimate 8h default so a bad
  date in the user message can never abort the workflow (Codex #2)

* fix: use correct AI SDK token usage property names

LanguageModelUsage uses inputTokens/outputTokens, not
promptTokens/completionTokens.

* fix: LLM prompt must convert local times to UTC

The LLM was treating user-stated times (e.g., "4PM") as UTC instead of
converting from the server's local timezone. Added explicit timezone
conversion instructions with worked examples showing the UTC offset
calculation. "around 4PM" in PDT (UTC-7) should produce 23:00Z, not 16:00Z.

---------

Co-authored-by: Wilson Li <wli02@fortinet.com>
WZ added a commit that referenced this pull request Apr 2, 2026
refactor: remove Slack/scheduler, rename ChatAgent, add Dockerfile
WZ added a commit that referenced this pull request Apr 2, 2026
* feat: LLM-based time range extraction and report time window display

Replace regex-only time parsing with LLM-based extraction in the anomaly
step so natural language like "last Fri around 4PM" resolves to the correct
investigation window. Display the investigated time range in the RCA report
header across web, CLI, and markdown export.

- Add generateText call in anomaly step with 10s timeout and validation
  (no future dates, no >30d ago, from <= to)
- Fallback chain: LLM → regex (resolved to absolute UTC) → 8h default
- New resolveTimeRangeToAbsolute() converts Grafana-relative strings to
  absolute UTC for report persistence (prevents stale "now-8h" metadata)
- Thread timeRange through evidence → synthesis → post-synthesis pipeline
  via pass-through field in shared evidence factory
- Add timeRange to RcaReport type, all workflow schemas, and server adapter
- Display in RcaReport.tsx header with formatTimeRange helper
- Add to CLI App.tsx and formatRcaMarkdown.ts
- 7 new tests (resolveTimeRangeToAbsolute + markdown timeRange)

* fix: move clearTimeout to finally block in LLM time extraction

Prevents timeout callback leak when generateText throws (e.g., AbortController
fires). The callback was harmless (abort on aborted controller is a no-op) but
the finally block is the correct pattern.

* fix: address adversarial review findings

- Guard against invalid ISO dates in resolveTimeRangeToAbsolute (e.g.,
  "2026-02-30" which creates Invalid Date and throws on toISOString)
- Include changesFindings in timeRange fallback chain in synthesis step
- Use toLocaleString instead of toLocaleDateString for cross-day ranges
  (toLocaleDateString ignores hour/minute options in some environments)

* fix: report token usage from time extraction LLM call and guard regex fallback

- Add onTokenUsage callback for the generateText call so token usage
  is tracked in phase metrics (Codex adversarial finding #4)
- Wrap regex fallback in try/catch with ultimate 8h default so a bad
  date in the user message can never abort the workflow (Codex #2)

* fix: use correct AI SDK token usage property names

LanguageModelUsage uses inputTokens/outputTokens, not
promptTokens/completionTokens.

* fix: LLM prompt must convert local times to UTC

The LLM was treating user-stated times (e.g., "4PM") as UTC instead of
converting from the server's local timezone. Added explicit timezone
conversion instructions with worked examples showing the UTC offset
calculation. "around 4PM" in PDT (UTC-7) should produce 23:00Z, not 16:00Z.

---------

Co-authored-by: Wilson Li <wli02@fortinet.com>
WZ added a commit that referenced this pull request Apr 15, 2026
Two related bugs surfaced in prod on the next discover run after shipping
the JSONC parser fix:

1. Tool-arg coercions never ran for discovery. The guard at the top of
   the wrap call was:

       const tools = wrappedOnToolCall
         ? wrapToolsWithCallbacks(discoveryTools, wrappedOnToolCall, "discovery")
         : discoveryTools;

   When `config.onToolCall` is undefined (auto-discovery on cold start,
   CLI runs without callbacks), tools passed through raw, so both
   `coercePrometheusArgs` and `coerceLokiArgs` were skipped.
   Consequence: the LLM's standard `startTime: null` on instant queries
   crashed the Grafana MCP tool with a schema validation error, the same
   class of bug fix #5 was supposed to prevent for everyone.

   Fix: always wrap discovery tools. `wrapToolsWithCallbacks` already
   handles undefined `onToolCall` via `?.`, so coercions now run
   unconditionally and the callback path is still optional.

2. Datasource UIDs were hallucinated. The discover agent called
   `grafana_list_loki_label_names` with `datasourceUid: "loki"`
   (made-up short name) instead of the real `ef34v1a9h2fi8c` it had
   already learned from its own `grafana_list_datasources` call earlier
   in the same run. Result: label lookup failed, every service ended
   up with `logLabels: {}`, and fix #4 (Loki-compatible log labels from
   PR #85) was effectively neutralized.

   Fix: mirror the evidence-agent pattern and pre-fetch datasource UIDs
   at the start of `runDiscoverStep`, then inject them into the prompt
   as an `<untrusted_datasource_hints>` block. Uses the same UID-first
   wording evidence uses. Graceful fallback to empty hints if
   list_datasources isn't available or the call fails — the agent can
   still run, it'll just eat the round-trip discovering them itself
   (same as today).

Both fixes live in `src/workflows/steps/discover.ts` — scope-contained
to the discovery path. No tests changed because the coercion logic
is already covered by tool-utils.test.ts; the new hint fetcher is a
best-effort wrapper around an existing MCP tool call with no branching
to meaningfully unit-test.
WZ added a commit that referenced this pull request Apr 16, 2026
Two related bugs surfaced in prod on the next discover run after shipping
the JSONC parser fix:

1. Tool-arg coercions never ran for discovery. The guard at the top of
   the wrap call was:

       const tools = wrappedOnToolCall
         ? wrapToolsWithCallbacks(discoveryTools, wrappedOnToolCall, "discovery")
         : discoveryTools;

   When `config.onToolCall` is undefined (auto-discovery on cold start,
   CLI runs without callbacks), tools passed through raw, so both
   `coercePrometheusArgs` and `coerceLokiArgs` were skipped.
   Consequence: the LLM's standard `startTime: null` on instant queries
   crashed the Grafana MCP tool with a schema validation error, the same
   class of bug fix #5 was supposed to prevent for everyone.

   Fix: always wrap discovery tools. `wrapToolsWithCallbacks` already
   handles undefined `onToolCall` via `?.`, so coercions now run
   unconditionally and the callback path is still optional.

2. Datasource UIDs were hallucinated. The discover agent called
   `grafana_list_loki_label_names` with `datasourceUid: "loki"`
   (made-up short name) instead of the real `ef34v1a9h2fi8c` it had
   already learned from its own `grafana_list_datasources` call earlier
   in the same run. Result: label lookup failed, every service ended
   up with `logLabels: {}`, and fix #4 (Loki-compatible log labels from
   PR #85) was effectively neutralized.

   Fix: mirror the evidence-agent pattern and pre-fetch datasource UIDs
   at the start of `runDiscoverStep`, then inject them into the prompt
   as an `<untrusted_datasource_hints>` block. Uses the same UID-first
   wording evidence uses. Graceful fallback to empty hints if
   list_datasources isn't available or the call fails — the agent can
   still run, it'll just eat the round-trip discovering them itself
   (same as today).

Both fixes live in `src/workflows/steps/discover.ts` — scope-contained
to the discovery path. No tests changed because the coercion logic
is already covered by tool-utils.test.ts; the new hint fetcher is a
best-effort wrapper around an existing MCP tool call with no branching
to meaningfully unit-test.
WZ added a commit that referenced this pull request Apr 16, 2026
* fix(discover): parse JSONC and extract top-level arrays

Observed in prod: the discover agent emitted a valid-looking JSON array
with JavaScript-style block comments as section dividers:

  [
    {...},
    /* StatefulSets */
    {...},
    /* DaemonSets */
    {...}
  ]

`JSON.parse` rejected it. `safeJsonParse` fell through to its markdown-
fence-extraction fallback, which still called `JSON.parse` on the body
and hit the same rejection. The remaining fallbacks only looked for
top-level `{...}` objects, so they found individual service records inside
the array and the discover step saw `parsed = <single object>` instead
of `parsed = <array>`, failed its `Array.isArray` check, and logged
"discovery returned empty result" three times in a row.

Two fixes:

1. `stripJsoncComments()` — a string-aware pass that removes `// ...`
   line comments and `/* ... */` block comments and trailing commas
   before `]` / `}`. Respects JSON string boundaries so content like
   `"see /* important */ docs"` survives unchanged. Tried after direct
   parse and after markdown-fence extraction.

2. `extractLastJsonArray()` / `extractFirstJsonArray()` — balanced-
   bracket scan for top-level `[...]` as a last-resort fallback
   (object extraction still runs first so evidence/metrics agents
   keep returning their enclosing object, not the inner observations
   array).

Also tightened the discover agent prompt to explicitly forbid JS-style
comments and trailing commas in the output JSON.

Test coverage: 6 new regression cases in processors.test.ts including
the exact JSONC-with-block-comments failure mode from prod, string-
preservation of comment-like sequences inside values, and trailing-
comma tolerance.

* fix(discover): always apply arg coercions + pre-inject datasource hints

Two related bugs surfaced in prod on the next discover run after shipping
the JSONC parser fix:

1. Tool-arg coercions never ran for discovery. The guard at the top of
   the wrap call was:

       const tools = wrappedOnToolCall
         ? wrapToolsWithCallbacks(discoveryTools, wrappedOnToolCall, "discovery")
         : discoveryTools;

   When `config.onToolCall` is undefined (auto-discovery on cold start,
   CLI runs without callbacks), tools passed through raw, so both
   `coercePrometheusArgs` and `coerceLokiArgs` were skipped.
   Consequence: the LLM's standard `startTime: null` on instant queries
   crashed the Grafana MCP tool with a schema validation error, the same
   class of bug fix #5 was supposed to prevent for everyone.

   Fix: always wrap discovery tools. `wrapToolsWithCallbacks` already
   handles undefined `onToolCall` via `?.`, so coercions now run
   unconditionally and the callback path is still optional.

2. Datasource UIDs were hallucinated. The discover agent called
   `grafana_list_loki_label_names` with `datasourceUid: "loki"`
   (made-up short name) instead of the real `ef34v1a9h2fi8c` it had
   already learned from its own `grafana_list_datasources` call earlier
   in the same run. Result: label lookup failed, every service ended
   up with `logLabels: {}`, and fix #4 (Loki-compatible log labels from
   PR #85) was effectively neutralized.

   Fix: mirror the evidence-agent pattern and pre-fetch datasource UIDs
   at the start of `runDiscoverStep`, then inject them into the prompt
   as an `<untrusted_datasource_hints>` block. Uses the same UID-first
   wording evidence uses. Graceful fallback to empty hints if
   list_datasources isn't available or the call fails — the agent can
   still run, it'll just eat the round-trip discovering them itself
   (same as today).

Both fixes live in `src/workflows/steps/discover.ts` — scope-contained
to the discovery path. No tests changed because the coercion logic
is already covered by tool-utils.test.ts; the new hint fetcher is a
best-effort wrapper around an existing MCP tool call with no branching
to meaningfully unit-test.

* fix(tool-utils): strip Mastra tool marker so our coercions actually run

Root cause of every "our coercion didn't run" symptom we've been chasing:

Mastra's CoreToolBuilder.createExecute() validates tool input against the
tool's inputSchema BEFORE calling our wrapped execute, but only for tools
that Mastra considers its own (detected via the
`Symbol.for("mastra.core.tool.Tool")` marker, which is copied by
`{ ...tool }` spread). When the LLM sends `startTime: null` on an instant
Prometheus query, Mastra's validateToolInput rejects it with

    Tool input validation failed ... startTime: must be string

and returns the error to the LLM — our `coercePrometheusArgs` never gets
a chance to fire because execute was never called.

The same latent bug was silently hiding `coerceLokiArgs` as well, but that
one only fixes shape-valid-but-wrong-value inputs (direction "forward" →
"backward") so validation passed and our wrapper ran, masking the problem.
`coercePrometheusArgs` fixes shape-invalid inputs (null → string), which
validation blocks upstream — which is why the same bug kept resurfacing
in prod even after three "fixes".

The fix: strip the `MASTRA_TOOL_MARKER` symbol (and the outputSchema) from
the wrapped tool object. Mastra's `isMastraTool()` now returns false, so
the framework takes the Vercel/AI-SDK tool path, which SKIPS input
validation. Our wrapped execute runs on the raw LLM args, coerces them,
then calls the original Mastra Tool instance's execute directly — which
forwards to the underlying MCP tool with the fixed args.

Tool-spec fidelity (what the LLM sees) is unchanged because inputSchema
is preserved. Only the server-side validation step is bypassed. Output
validation is also skipped (outputSchema dropped) to avoid false negatives
on the permissive MCP content-wrapper response shape.

Test coverage in tool-utils-coercion.test.ts:
- marker stripped, outputSchema dropped, inputSchema preserved, execute
  replaced (direct assertions on the wrapped object)
- coercePrometheusArgs runs end-to-end via the wrapper (already present)
- coerceLokiArgs drops stepSeconds and forces direction=backward end-to-end

* chore(helm): default imagePullPolicy to Always

Fixed-tag deploys (0.1.4, 0.2.0, etc.) that get re-pushed under the same
tag don't pick up the new digest when the node has a cached copy under
that tag. This bit us immediately after the marker-strip fix — I rebuilt
and re-pushed 0.1.4, but the rollout kept running the old digest because
kubelet saw "already have 0.1.4 cached" and skipped the pull.

Always-pull costs ~2-5s on pod start in exchange for the guarantee that
re-pushing a tag actually does what you expect. For a single-replica
deployment that only rolls on real code changes, that's fine.

Override to IfNotPresent per-environment if you never re-push tags and
care about faster rollouts.

* chore: bump to 0.1.5

* fix(discover): log complete prompt hints at debug level, not a placeholder

The `logLlmCall` at the success + error paths of runDiscoverStep was
passing `prompt: "${discoverPrompt}\n\n[hints: ${fullHints.length} chars]"`,
which replaced the actual hints content with a char-count placeholder.
With LLM_LOG_LEVEL=debug set, the operator should see the ENTIRE LLM
request — user message + the full datasource hints + recipes + skills —
not a stub. Inline `fullHints` directly so the debug entry is reproducible
from the log alone.

Bump to 0.1.6.

* fix(web): redirect to Operations Desk after accepting discovery

Accepting a discovery result used to leave the operator on the Services
page with the grid re-rendered. They then had to manually click the
Operations Desk (Dashboard) nav item to see the freshly-populated service
list in the live catalog. Automate that last step: after the
`discover:accept` WS message is sent, ServicesPage fires a new
`onDiscoveryAccepted` callback which App.tsx wires to
`setLeftPane({ type: "dashboard" })`.

No-op default on the prop so existing tests (and anyone calling the
component without the callback) keep working unchanged.

Bumping to 0.1.7.

* feat(observability): log full prompts on start + per-tool agentic loop

Adds two debugging affordances that were missing from the 0.1.4 LLM
observability work:

1. logLlmCallStart now accepts an optional `prompt` field and emits the
   full content at debug level as a separate "LLM <agent> start prompt"
   line. Previously the full prompt was only visible in the completion
   log, so you couldn't inspect what the model saw during a hang.

2. logToolCall now emits an info summary (tool name + bytes + duration)
   plus debug details (full args + result) for every tool call inside
   discover, evidence phases, and chat. Previously discover and evidence
   collected tool events silently and only flushed them in the final
   LLM call summary, making the agentic loop invisible in real time.

Bump 0.1.7 -> 0.1.8.

* fix(discover): coerce hallucinated datasource UIDs + harden prompt

The discover agent consistently passed datasourceUid: "prometheus" (a
hallucinated short name) instead of the real UID from the pre-injected
hints block. The MCP server rejected it with "Tool input validation"
errors, burning 2-4 retries per discovery run.

Two-layer fix:

1. wrapToolsWithCallbacks now intercepts hallucinated short names via a
   pre-built Map<shortName, realUid> and substitutes the real UID before
   the MCP call. Logs at info level when substitution fires so we can
   count hallucinations and track whether the prompt fix reduces them.

2. The discover agent's system prompt now renders datasource UIDs in a
   dedicated "CRITICAL: non-negotiable" block, separated from the recipe
   hints that previously said "these are suggestions".

Extended the coercion map through WorkflowConfig to evidence phases and
anomaly detection for consistency.

Bump 0.1.8 -> 0.1.9.

* fix(discover): truncated JSON recovery + queryType coercion + retry UI

Three discovery improvements found via 0.1.8/0.1.9 per-tool-call logs:

1. Response truncation: max_tokens 16384 was too small for 71-service
   environments. Bumped to 32768 and added recoverTruncatedJsonArray
   to safeJsonParse — finds the last complete } in a truncated [{...},
   {... array and closes it. Strategy runs before single-object
   extraction so we don't pick up a stray inner object.

2. queryType: null coercion in coercePrometheusArgs — defaults to
   "instant", eliminating 1 wasted tool call per discovery attempt.

3. Retry visibility: new discover:retry WebSocket event + UI indicator
   showing "Attempt 2 of 3" so the user knows it's retrying, not stuck.

Also added parse failure diagnostics: logs response length + first/last
200 chars so truncation is diagnosable from a single log line.

Bump 0.1.9 -> 0.1.10.

* feat(server): log version on startup

Reads package.json version and includes it in the startup log line:
"dops-assistant v0.1.10 web server running on port 3000"

Structured fields include both port and version for log queries.

* fix(web): reset discovery state on Run Discovery click

When clicking Run Discovery after a previous run, the UI showed
"Validating services..." (stale phase from prior run) instead of
"Discovering services...". The onStartDiscovery handler only sent the
WS message without resetting discoveryState. Now resets to phase
"discovery" + status "running" before sending, so the UI immediately
shows the correct initial state.

* feat(web): add shimmer animation to discovery progress bar

The progress bar was static between tool calls, making the UI look
stuck during LLM thinking phases. Added a shimmer gradient overlay
that animates continuously, giving visual feedback that the process
is still active even when no iteration events are firing.

* fix(discover): cap tool results at 30k chars to prevent token exhaustion

k8s_pods_list returns 83k chars of raw pod data. With max_tokens=32768,
the LLM exhausts its entire token budget digesting the pod list and
produces a 0-char response (32938 output tokens, 0 chars text).

Added maxToolResultChars param to wrapToolsWithCallbacks. Discovery
tools are now capped at 30k chars with a truncation note. Moved
truncateMcpResult to tool-utils.ts (was duplicated in agents.ts).

This cuts discovery input tokens from ~90k to ~40k and prevents the
0-char exhaustion failure mode.

* fix(discover): prompt hint to avoid unfiltered pod lists

Replace the 30k tool result cap with a prompt instruction telling the
agent to prefer deployments/statefulsets or metric queries over raw pod
lists. Raw k8s_pods_list returns 83k+ chars and exhausts the token
budget. The prompt now explicitly tells the agent to use fieldSelector
or labelSelector if it must list pods, and to prefer pre-aggregated
metric queries instead.

Kept truncateMcpResult in tool-utils.ts as a shared utility but removed
the discovery-specific cap — the prompt hint is the primary fix.

* fix(agents): tighten k8s and log query guidance to reduce token waste

infra.ts: tell the agent to pass labelSelector/fieldSelector directly
in tool calls instead of fetching all pods/events then filtering
in-memory. An unfiltered k8s_pods_list returns 83k+ chars.

logs.ts: clarify that "without filter" means without pattern filter,
not without the service label selector. Added explicit line limits
(100 for error context, 10 for existence check).

discover.ts: fix blocklist test — use generic "metric or log query
tool" instead of hardcoded tool names in the datasource UID block.

* fix(discover): use filtered pod list as second pass, not skip it entirely

The previous prompt told the agent to prefer metrics over pod lists,
which caused it to skip k8s_pods_list entirely. This lost 38 of 47
services — container-level services (celery workers, sidecars,
multi-container pods) whose names differ from their deployment name
are invisible to kube_deployment_status_replicas metrics.

Now: metrics first for the initial sweep, then a filtered pod list
(excluding kube-system/kube-public/kube-node-lease) as a second pass
to catch container-level services.

* fix: consul skill exhaustive discovery + stackId in poller logs

consul-bare-metal-discovery.md: changed "Common bare-metal services"
to "Important: discover ALL Consul services, not just known ones".
The agent was treating the example list as exhaustive and skipping
the consul_catalog query, missing 6 sub-components (hdfs-journalnode,
hdfs-zkfc, hive-metastore, impala-catalog, kudu-tserver, fazbdregistry).

service-health-poller.ts: added stackId to all poll log lines (tool
not found, query result, empty query result, poll complete). Previously
impossible to tell which stack's poller was failing.

* fix(test): add buildDatasourceUidMap to ws-handler mock
WZ added a commit that referenced this pull request Apr 23, 2026
* refactor(scan): remove Status section from Settings tab

Status display (next run / last run / ticking) and "Scan now" trigger now
live in the Operation Desk view. Keeping them here duplicated the same
data and forced a 10s polling interval that the Settings tab doesn't need.

Removes the status fetch, the polling interval + its cleanup, the trigger
handler + timers, and the Field / formatTimestamp helpers that were only
used here.

Addresses finding #2 from /design-review on Settings → Scan.

* style(scan): plain-language copy on the Settings tab

Rewrites every helper sentence to avoid internal jargon ("probe", "tick",
"trip", "PromQL") that operators don't need to learn to read the settings
page. "Cron expression" label becomes "Schedule" (the expression itself is
still shown). Helper text is lifted from 10px mono to 12px regular so it's
actually readable. The top-of-page blurb now leads with what the feature
does for the user, not how it works internally.

Addresses finding #3 from /design-review on Settings → Scan.

* feat(scan): cron preset shortcuts on Settings tab

Adds a row of pill buttons above the schedule input (Every 15 min, Hourly,
Every 4 hours, Daily, Weekly) that fill the cron field in one click. The
pill matching the current value is highlighted so the user can see which
preset their schedule matches, if any. Typing a custom expression still
works and simply leaves no pill highlighted.

Addresses finding #4 from /design-review on Settings → Scan.

* feat(scan): default timezone to the browser's timezone

When the user hasn't saved a GUI timezone yet (settings.source.timezone
is still "config", i.e. the UTC default from config.yaml), pre-fill the
form with Intl.DateTimeFormat().resolvedOptions().timeZone. Marks the
form dirty so the Save button lights up and the user can persist it with
one click.

Users who explicitly want UTC can clear the field and save; a user who
has already saved a specific timezone keeps theirs.

Addresses finding #5 from /design-review on Settings → Scan.

* style(scan): tidy per-rule editor layout

Collapses the three header actions (move up, move down, remove) from
three differently-styled bordered buttons into a single horizontal icon
strip with subtle hover backgrounds. Each icon is a 28px square, no
per-button border, destructive hover only on the × button. The rule
card now has one clean header row instead of two ragged columns.

Other touch-ups in the same editor:
- "Consecutive ticks" label → "Scans in a row" (matches the vocabulary
  used elsewhere on the tab)
- Drops the redundant helper under the ticks input (label is clear)
- Rewrites the query helper to drop "Probe" jargon
- Remove-confirm dialog uses plain language instead of "hysteresis state"

Addresses finding #1 from /design-review on Settings → Scan.

* style(design): dark-theme Add recipient modal

Align the email recipient editor modal with the dark-theme design tokens
used by EmailRecipientsSection. Swap the white card for bg-card, adopt the
mono uppercase label/title pattern, use bg-background/40 inputs with
border-border/40, accent-primary radios/checkboxes, and replace raw button
elements with the shared Button (outline Cancel, primary Save). Backdrop
gains backdrop-blur and proper dialog ARIA.

* chore: bump version and changelog (v0.1.2.0)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants