feat(llm-logger): call-start + first-chunk logging, fix discover tool tracking by WZ · Pull Request #85 · WZ/dops-assistant

WZ · 2026-04-15T20:13:25Z

Summary

Ships v0.1.4. Two themes in one branch:

LLM observability (the enabling work)

Call-start + first-chunk logging at info level so mid-stream hangs are visible without enabling debug
Error-path logging so failed calls emit a matching completion line
Fixed discover.ts hardcoding toolCalls: [] in its log record
Dropped dead observability.port/logLevel config fields

Six production fixes surfaced by the new observability

Health poller metric-aware zero classification — replicas = 0 → unknown (scaled down), up = 0 → down (real scrape failure). Stops the 5 false-positive auto-investigations we were seeing on every cold start.
Metrics agent prompt carveout — defense-in-depth for flat-zero replica series.
Express trust proxy — configurable via TRUST_PROXY_HOPS env var (default 1). Fixes ERR_ERL_UNEXPECTED_X_FORWARDED_FOR and broken per-client rate limiting behind k8s ingress. Closes TODO feat: service detail page, sidebar navigation, and page reorganization #36.
Discover agent logLabels guidance — emit Loki-compatible labels (container/pod/app), not kube-state-metrics labels. The {statefulset: yb-master} label was wasting 5-6 Loki fallback queries per investigation.
Tool-utils .Z RFC3339 normalization — repairs .Z, dangling dot, and missing-Z variants that LLMs consistently produce.
Tool-utils drop Loki stepSeconds — LLMs pass it as a Prometheus leftover; Loki rejects it.

Plus a red-square favicon so the tab is identifiable.

Test Coverage

Tests: 73 → 80 (+7 new files, 96 new test cases).

CODE PATH COVERAGE — v0.1.4
========================================

[+] src/server/service-health-poller.ts (metric-aware)
    ├── [★★★] matchResultsToServices: replicas=0 → unknown, up=0 → down, NaN → unknown
    ├── [★★★] Merge priority: healthy > down > unknown across 3 batches
    ├── [★★★] First-poll: scaled-zero does NOT fire transition
    ├── [★★★] First-poll: up=0 DOES fire transition (regression test for 72fa3de)
    └── [★★★] healthy → down and healthy → unknown transitions both fire
    Coverage: 11/11 cases

[+] src/server/index.ts (trust proxy)
    ├── [★★★] X-Forwarded-For resolves to real client with trust proxy
    └── [★★★] Regression baseline: without trust proxy, XFF is ignored
    Coverage: 2/2 cases

[+] src/agents/metrics.ts (prompt)
    ├── [★★]  INTENTIONALLY-DISABLED SERVICES block present in prompt
    └── [★★]  FULL investigation window guidance retained
    Coverage: 2/2 cases (text-content only — behavioral regression requires rca-eval)

[+] src/workflows/tool-utils.ts (.Z + stepSeconds)
    ├── [★★★] .Z → Z (empty-fraction malformed case)
    ├── [★★★] dangling dot → Z
    ├── [★★★] missing Z → Z appended
    ├── [★★★] .0Z / .000Z / valid Z all left unchanged (idempotent)
    ├── [★★★] non-time field 'prefix.Z' unchanged
    └── [★★★] coerceLokiArgs drops stepSeconds, preserves direction/limit
    Coverage: 9/9 cases

[+] src/agents/discover.ts (logLabels prompt) — no test (LLM behavior)

─────────────────────────────────
COVERAGE: 24/24 new code paths tested (100%)
QUALITY:  ★★★ 20, ★★ 2, ★ 0
─────────────────────────────────

Pre-Landing Review

1 informational: coercePrometheusArgs defaults assume queryType: instant — latent for range queries (not triggered in practice). Skipped as tracked-for-future.

Design Review

Scope: 1 frontend file (src/web/index.html favicon — 1 line). Skipped.

Adversarial Review

Both Claude subagent and Codex ran. 7 findings across the two reviewers, 2 cross-model confirmed:

CROSS-MODEL: trust proxy attack surface → FIXED: made TRUST_PROXY_HOPS env-var configurable with safety notes
CROSS-MODEL: replicas=0 → unknown hides real outages → KNOWN TRADE-OFF (explicit "NOT in scope" in the plan, tracked for follow-up with proper desired vs ready comparison)
.Z regex narrowness → FIXED: also handles dangling dot and missing Z
discover.ts unbounded argsStr → FIXED: sliced to 500 chars, wrapped in try/catch
Loki stepSeconds silent break if tool adds it → monitor (tracked in plan failure modes)
Metrics prompt trusts LLM compliance → defense-in-depth; Fix feat: CLI, discovery, RCA pipeline, panel images, and investigation hardening #1 is the real gate
logLlmCallStart/FirstChunk skews on throw → minor observability edge case

Codex structured review: GATE PASS (no [P1] markers).
PR Quality Score: 9/10.

Plan Completion

Plan: /Users/wli02/.claude/plans/moonlit-nibbling-star.md (as amended by /plan-eng-review).

  [DONE]    Fix #1 — metric-aware health classification
  [DONE]    Fix #2 — metrics prompt + prompt-content test
  [DONE]    Fix #3 — trust proxy (elevated to env-var via adversarial review)
  [DONE]    Fix #4 — discover agent logLabels guidance
  [DONE]    Fix #5+6 — isTimeField extract + .Z + stepSeconds + tests
  [DONE]    Adversarial follow-ups: .Z regex, discover argsStr, trust proxy env
  [DONE]    Red-square favicon (user side request)
  [DONE]    Version bump 0.1.4 + CHANGELOG

COMPLETION: 8/8 DONE — PASS

Deferred / known trade-offs

Desired-vs-ready health detection — not in scope for this PR; current fix stops false positives but leaves a latent false-negative for desired > 0, ready = 0 workloads. Captured in plan's "NOT in scope" section.
TRUST_PROXY_HOPS topology — default 1 is correct for single k8s ingress with use-forwarded-headers: false. Deployments with a CDN/service mesh/ingress-pass-through need to set this explicitly. Documented inline and in Helm values.

Verification (after redeploy)

Health poller: summary.down should drop from ~5 to the count of services with real up = 0 targets
Zero ERR_ERL_UNEXPECTED_X_FORWARDED_FOR on WS connects
New investigations: no .Z parse errors in Loki, no stepSeconds: 300 in Loki args, no toolCallCount: 0 in discover
Red-square favicon in browser tab

Test plan

TypeScript: tsc --noEmit clean
Unit tests: 96 new cases, all pass
Docker: wliftnt/dops-assistant:0.1.4 built and pushed (digest sha256:90c8fc80…)
Adversarial review: Claude + Codex cross-model, gate PASS, 2 cross-model findings fixed inline
Runtime smoke (post-redeploy): verify the 4 items above

🤖 Generated with Claude Code

… tracking - Add logLlmCallStart() and logLlmCallFirstChunk() at info level so hangs before the first chunk are visible without enabling debug logs. - Wire into chat (agents.ts), discover, and evidence call sites. - Fix discover.ts hardcoding toolCalls: [] — collect tool events via onStepFinish like evidence.ts does, so toolCallCount reflects reality. - Drop dead observability block from config.yaml.example (port and logLevel fields are not read by any runtime code). - Bump to 0.1.3.

discover.ts and agents.ts chat both had gaps where a failing LLM call emitted `LLM ... start:` but no matching completion log — only a warn from the app logger. Now logLlmCall fires on both success and failure, with the error message in the `error` field so failed calls are visible in llm-logger output and pair cleanly with their start event. evidence.ts already recorded generateError, so no change there.

The Grafana MCP query_prometheus tool's schema requires startTime/endTime as strings and stepSeconds as a number even for instant queries, where these fields are logically meaningless. LLMs default them to null and eat a validation-error round-trip before retrying with "now". Add coercePrometheusArgs() alongside coerceLokiArgs() to fill defaults before the schema check runs. The "now" string gets converted to RFC3339 by the existing coerceToolArgs path. Saves ~1s + a few hundred tokens per discover run and removes a confusing error from the debug log.

The poller's three batch queries have opposite semantics for value=0: - kube_deployment_status_replicas = 0: intentionally scaled down (unknown) - kube_statefulset_status_replicas = 0: intentionally scaled down (unknown) - up = 0: scrape target is actually down (down) Before this change, all three were classified as "down" when value=0, causing five legitimately-scaled-to-zero services to fire false-positive auto-investigations on every cold start (each burning ~60s and 25k+ LLM tokens). Fix: pass a zeroMeans parameter ("down" | "unknown") into matchResultsToServices, call it per-batch in pollOnce, and merge results with explicit priority (healthy > down > unknown). The first-poll transition gate from commit 72fa3de is preserved — real up=0 scrape failures still fire investigations on cold start; scaled-to-zero services do not. Inline ASCII decision diagram added to matchResultsToServices for future maintainers. Test coverage: 11 new cases, including a regression test for the 72fa3de cold-start intent (up=0 must still fire onTransition).

Defense in depth for the health-poller fix. Adds an INTENTIONALLY-DISABLED SERVICES block to the metrics agent system prompt so that a webhook-triggered investigation against a scaled-to-zero workload reports "not deployed" instead of "severity: high, scaled down or failed to start". Only triggers when a replica metric (deployment/statefulset/daemonset) reports a flat zero across the ENTIRE investigation window. A transition from >0 to 0 inside the window, or ready < desired, still reports as an anomaly. Test: deterministic assertion on the prompt text (metrics.test.ts). Runtime behavioral verification requires rca-eval against real data.

Behind the k8s ingress, X-Forwarded-For is set but Express defaults to 'trust proxy: false'. This means req.ip resolves to the ingress IP, not the real client, and express-rate-limit v8 logs ERR_ERL_UNEXPECTED_X_FORWARDED_FOR on every request while effectively sharing one bucket across all clients. Fix: app.set("trust proxy", 1). The '1' (not 'true') means one trusted hop, so a client without an upstream can't spoof its own X-Forwarded-For. Closes TODOS.md #36 (from PR #57 Codex adversarial review). Test: supertest assertions in rate-limit.test.ts for both the fix and the pre-fix baseline.

…etrics labels The discover agent was writing {"statefulset": "yb-master"} into services.yaml, but Loki in most envs doesn't expose a 'statefulset' stream label — only pod, container, namespace, app, and so on. The downstream logs agent would then query Loki with {statefulset="..."} and get empty results, wasting 5-6 fallback queries per investigation. Prompt guidance rewritten to prefer {container}/{pod}/{app} labels and explicitly warn that kube-state-metrics labels like deployment/ statefulset/daemonset do NOT transfer to Loki. Only affects new discovery runs. Existing services.yaml needs regeneration via `npm run discover`.

Two small LLM-quirk coercions observed in prod logs: 1. Malformed RFC3339: the LLM consistently truncates "2026-04-15T20:27:00.000Z" to "2026-04-15T20:27:00.Z" (trailing dot without fractional digits). The Grafana MCP tool rejects this, costing one wasted tool call per logs investigation. Normalize .Z -> Z on time-typed fields in coerceToolArgs. 2. Stray stepSeconds: the LLM passes stepSeconds: 300 to grafana_query_loki_logs (a Prometheus-only concept). Drop it in coerceLokiArgs before the tool sees it. DRY cleanup: extracted the inline "is this a time field?" check into a file-local isTimeField() helper shared by the 'now' and .Z branches. Test coverage: 7 cases in new src/workflows/tool-utils.test.ts.

Inline SVG data URI so the browser tab has an identifiable marker instead of the default globe. Red #dc2626 (Tailwind red-600) matches the rest of the palette. Placeholder — swap for a real icon later.

Follow-up fixes from the /ship adversarial review (Claude subagent + Codex): 1. tool-utils.ts: widen the RFC3339 normalizer to also catch the "dangling dot" and "missing Z entirely" variants, in addition to the original ".Z" empty-fraction case. Keeps ".000Z" and ".0Z" (valid zero fractions) unchanged. Three new unit tests. 2. steps/discover.ts: bound memory on onStepFinish by slicing argsStr to 500 chars (matches evidence.ts pattern), and wrap the per-step record in try/catch so JSON.stringify exceptions (BigInt, circular refs) don't crash the discovery step. 3. server/index.ts: make `trust proxy` configurable via TRUST_PROXY_HOPS env var (default 1). Both reviewers flagged that a hardcoded 1 is wrong when the ingress is configured with use-forwarded-headers:true, or when there's more than one proxy hop (CDN + ingress, service mesh + ingress). Too-low → clients share one bucket (false 429s); too-high → XFF is spoofable. Doc the tradeoffs inline and in the Helm values. Helm chart values.yaml gets an env-var commented example for TRUST_PROXY_HOPS.

The previous rewrite hardcoded "Loki" throughout, but the `logs` role in provider-registry is generic — the wired-up log system could be Loki, Elasticsearch, Splunk, CloudWatch, or anything else with a `logs` role. Rewrote the prompt to: 1. Frame the rule as "match actual stream labels / index fields the log provider exposes", not Loki-specific 2. Tell the agent to call a label-discovery tool up front if one exists (list_loki_label_names, list_indices, describe_log_groups, etc.) and prefer labels from that result 3. Keep the kube-state-metrics gotcha because it's the common LLM mistake across every k8s log pipeline 4. Keep the container/pod/namespace/app fallback guidance since those are universal across promtail, fluent-bit, filebeat, and CloudWatch CI

Two related bugs surfaced in prod on the next discover run after shipping the JSONC parser fix: 1. Tool-arg coercions never ran for discovery. The guard at the top of the wrap call was: const tools = wrappedOnToolCall ? wrapToolsWithCallbacks(discoveryTools, wrappedOnToolCall, "discovery") : discoveryTools; When `config.onToolCall` is undefined (auto-discovery on cold start, CLI runs without callbacks), tools passed through raw, so both `coercePrometheusArgs` and `coerceLokiArgs` were skipped. Consequence: the LLM's standard `startTime: null` on instant queries crashed the Grafana MCP tool with a schema validation error, the same class of bug fix #5 was supposed to prevent for everyone. Fix: always wrap discovery tools. `wrapToolsWithCallbacks` already handles undefined `onToolCall` via `?.`, so coercions now run unconditionally and the callback path is still optional. 2. Datasource UIDs were hallucinated. The discover agent called `grafana_list_loki_label_names` with `datasourceUid: "loki"` (made-up short name) instead of the real `ef34v1a9h2fi8c` it had already learned from its own `grafana_list_datasources` call earlier in the same run. Result: label lookup failed, every service ended up with `logLabels: {}`, and fix #4 (Loki-compatible log labels from PR #85) was effectively neutralized. Fix: mirror the evidence-agent pattern and pre-fetch datasource UIDs at the start of `runDiscoverStep`, then inject them into the prompt as an `<untrusted_datasource_hints>` block. Uses the same UID-first wording evidence uses. Graceful fallback to empty hints if list_datasources isn't available or the call fails — the agent can still run, it'll just eat the round-trip discovering them itself (same as today). Both fixes live in `src/workflows/steps/discover.ts` — scope-contained to the discovery path. No tests changed because the coercion logic is already covered by tool-utils.test.ts; the new hint fetcher is a best-effort wrapper around an existing MCP tool call with no branching to meaningfully unit-test.

* fix(discover): parse JSONC and extract top-level arrays Observed in prod: the discover agent emitted a valid-looking JSON array with JavaScript-style block comments as section dividers: [ {...}, /* StatefulSets */ {...}, /* DaemonSets */ {...} ] `JSON.parse` rejected it. `safeJsonParse` fell through to its markdown- fence-extraction fallback, which still called `JSON.parse` on the body and hit the same rejection. The remaining fallbacks only looked for top-level `{...}` objects, so they found individual service records inside the array and the discover step saw `parsed = <single object>` instead of `parsed = <array>`, failed its `Array.isArray` check, and logged "discovery returned empty result" three times in a row. Two fixes: 1. `stripJsoncComments()` — a string-aware pass that removes `// ...` line comments and `/* ... */` block comments and trailing commas before `]` / `}`. Respects JSON string boundaries so content like `"see /* important */ docs"` survives unchanged. Tried after direct parse and after markdown-fence extraction. 2. `extractLastJsonArray()` / `extractFirstJsonArray()` — balanced- bracket scan for top-level `[...]` as a last-resort fallback (object extraction still runs first so evidence/metrics agents keep returning their enclosing object, not the inner observations array). Also tightened the discover agent prompt to explicitly forbid JS-style comments and trailing commas in the output JSON. Test coverage: 6 new regression cases in processors.test.ts including the exact JSONC-with-block-comments failure mode from prod, string- preservation of comment-like sequences inside values, and trailing- comma tolerance. * fix(discover): always apply arg coercions + pre-inject datasource hints Two related bugs surfaced in prod on the next discover run after shipping the JSONC parser fix: 1. Tool-arg coercions never ran for discovery. The guard at the top of the wrap call was: const tools = wrappedOnToolCall ? wrapToolsWithCallbacks(discoveryTools, wrappedOnToolCall, "discovery") : discoveryTools; When `config.onToolCall` is undefined (auto-discovery on cold start, CLI runs without callbacks), tools passed through raw, so both `coercePrometheusArgs` and `coerceLokiArgs` were skipped. Consequence: the LLM's standard `startTime: null` on instant queries crashed the Grafana MCP tool with a schema validation error, the same class of bug fix #5 was supposed to prevent for everyone. Fix: always wrap discovery tools. `wrapToolsWithCallbacks` already handles undefined `onToolCall` via `?.`, so coercions now run unconditionally and the callback path is still optional. 2. Datasource UIDs were hallucinated. The discover agent called `grafana_list_loki_label_names` with `datasourceUid: "loki"` (made-up short name) instead of the real `ef34v1a9h2fi8c` it had already learned from its own `grafana_list_datasources` call earlier in the same run. Result: label lookup failed, every service ended up with `logLabels: {}`, and fix #4 (Loki-compatible log labels from PR #85) was effectively neutralized. Fix: mirror the evidence-agent pattern and pre-fetch datasource UIDs at the start of `runDiscoverStep`, then inject them into the prompt as an `<untrusted_datasource_hints>` block. Uses the same UID-first wording evidence uses. Graceful fallback to empty hints if list_datasources isn't available or the call fails — the agent can still run, it'll just eat the round-trip discovering them itself (same as today). Both fixes live in `src/workflows/steps/discover.ts` — scope-contained to the discovery path. No tests changed because the coercion logic is already covered by tool-utils.test.ts; the new hint fetcher is a best-effort wrapper around an existing MCP tool call with no branching to meaningfully unit-test. * fix(tool-utils): strip Mastra tool marker so our coercions actually run Root cause of every "our coercion didn't run" symptom we've been chasing: Mastra's CoreToolBuilder.createExecute() validates tool input against the tool's inputSchema BEFORE calling our wrapped execute, but only for tools that Mastra considers its own (detected via the `Symbol.for("mastra.core.tool.Tool")` marker, which is copied by `{ ...tool }` spread). When the LLM sends `startTime: null` on an instant Prometheus query, Mastra's validateToolInput rejects it with Tool input validation failed ... startTime: must be string and returns the error to the LLM — our `coercePrometheusArgs` never gets a chance to fire because execute was never called. The same latent bug was silently hiding `coerceLokiArgs` as well, but that one only fixes shape-valid-but-wrong-value inputs (direction "forward" → "backward") so validation passed and our wrapper ran, masking the problem. `coercePrometheusArgs` fixes shape-invalid inputs (null → string), which validation blocks upstream — which is why the same bug kept resurfacing in prod even after three "fixes". The fix: strip the `MASTRA_TOOL_MARKER` symbol (and the outputSchema) from the wrapped tool object. Mastra's `isMastraTool()` now returns false, so the framework takes the Vercel/AI-SDK tool path, which SKIPS input validation. Our wrapped execute runs on the raw LLM args, coerces them, then calls the original Mastra Tool instance's execute directly — which forwards to the underlying MCP tool with the fixed args. Tool-spec fidelity (what the LLM sees) is unchanged because inputSchema is preserved. Only the server-side validation step is bypassed. Output validation is also skipped (outputSchema dropped) to avoid false negatives on the permissive MCP content-wrapper response shape. Test coverage in tool-utils-coercion.test.ts: - marker stripped, outputSchema dropped, inputSchema preserved, execute replaced (direct assertions on the wrapped object) - coercePrometheusArgs runs end-to-end via the wrapper (already present) - coerceLokiArgs drops stepSeconds and forces direction=backward end-to-end * chore(helm): default imagePullPolicy to Always Fixed-tag deploys (0.1.4, 0.2.0, etc.) that get re-pushed under the same tag don't pick up the new digest when the node has a cached copy under that tag. This bit us immediately after the marker-strip fix — I rebuilt and re-pushed 0.1.4, but the rollout kept running the old digest because kubelet saw "already have 0.1.4 cached" and skipped the pull. Always-pull costs ~2-5s on pod start in exchange for the guarantee that re-pushing a tag actually does what you expect. For a single-replica deployment that only rolls on real code changes, that's fine. Override to IfNotPresent per-environment if you never re-push tags and care about faster rollouts. * chore: bump to 0.1.5 * fix(discover): log complete prompt hints at debug level, not a placeholder The `logLlmCall` at the success + error paths of runDiscoverStep was passing `prompt: "${discoverPrompt}\n\n[hints: ${fullHints.length} chars]"`, which replaced the actual hints content with a char-count placeholder. With LLM_LOG_LEVEL=debug set, the operator should see the ENTIRE LLM request — user message + the full datasource hints + recipes + skills — not a stub. Inline `fullHints` directly so the debug entry is reproducible from the log alone. Bump to 0.1.6. * fix(web): redirect to Operations Desk after accepting discovery Accepting a discovery result used to leave the operator on the Services page with the grid re-rendered. They then had to manually click the Operations Desk (Dashboard) nav item to see the freshly-populated service list in the live catalog. Automate that last step: after the `discover:accept` WS message is sent, ServicesPage fires a new `onDiscoveryAccepted` callback which App.tsx wires to `setLeftPane({ type: "dashboard" })`. No-op default on the prop so existing tests (and anyone calling the component without the callback) keep working unchanged. Bumping to 0.1.7. * feat(observability): log full prompts on start + per-tool agentic loop Adds two debugging affordances that were missing from the 0.1.4 LLM observability work: 1. logLlmCallStart now accepts an optional `prompt` field and emits the full content at debug level as a separate "LLM <agent> start prompt" line. Previously the full prompt was only visible in the completion log, so you couldn't inspect what the model saw during a hang. 2. logToolCall now emits an info summary (tool name + bytes + duration) plus debug details (full args + result) for every tool call inside discover, evidence phases, and chat. Previously discover and evidence collected tool events silently and only flushed them in the final LLM call summary, making the agentic loop invisible in real time. Bump 0.1.7 -> 0.1.8. * fix(discover): coerce hallucinated datasource UIDs + harden prompt The discover agent consistently passed datasourceUid: "prometheus" (a hallucinated short name) instead of the real UID from the pre-injected hints block. The MCP server rejected it with "Tool input validation" errors, burning 2-4 retries per discovery run. Two-layer fix: 1. wrapToolsWithCallbacks now intercepts hallucinated short names via a pre-built Map<shortName, realUid> and substitutes the real UID before the MCP call. Logs at info level when substitution fires so we can count hallucinations and track whether the prompt fix reduces them. 2. The discover agent's system prompt now renders datasource UIDs in a dedicated "CRITICAL: non-negotiable" block, separated from the recipe hints that previously said "these are suggestions". Extended the coercion map through WorkflowConfig to evidence phases and anomaly detection for consistency. Bump 0.1.8 -> 0.1.9. * fix(discover): truncated JSON recovery + queryType coercion + retry UI Three discovery improvements found via 0.1.8/0.1.9 per-tool-call logs: 1. Response truncation: max_tokens 16384 was too small for 71-service environments. Bumped to 32768 and added recoverTruncatedJsonArray to safeJsonParse — finds the last complete } in a truncated [{...}, {... array and closes it. Strategy runs before single-object extraction so we don't pick up a stray inner object. 2. queryType: null coercion in coercePrometheusArgs — defaults to "instant", eliminating 1 wasted tool call per discovery attempt. 3. Retry visibility: new discover:retry WebSocket event + UI indicator showing "Attempt 2 of 3" so the user knows it's retrying, not stuck. Also added parse failure diagnostics: logs response length + first/last 200 chars so truncation is diagnosable from a single log line. Bump 0.1.9 -> 0.1.10. * feat(server): log version on startup Reads package.json version and includes it in the startup log line: "dops-assistant v0.1.10 web server running on port 3000" Structured fields include both port and version for log queries. * fix(web): reset discovery state on Run Discovery click When clicking Run Discovery after a previous run, the UI showed "Validating services..." (stale phase from prior run) instead of "Discovering services...". The onStartDiscovery handler only sent the WS message without resetting discoveryState. Now resets to phase "discovery" + status "running" before sending, so the UI immediately shows the correct initial state. * feat(web): add shimmer animation to discovery progress bar The progress bar was static between tool calls, making the UI look stuck during LLM thinking phases. Added a shimmer gradient overlay that animates continuously, giving visual feedback that the process is still active even when no iteration events are firing. * fix(discover): cap tool results at 30k chars to prevent token exhaustion k8s_pods_list returns 83k chars of raw pod data. With max_tokens=32768, the LLM exhausts its entire token budget digesting the pod list and produces a 0-char response (32938 output tokens, 0 chars text). Added maxToolResultChars param to wrapToolsWithCallbacks. Discovery tools are now capped at 30k chars with a truncation note. Moved truncateMcpResult to tool-utils.ts (was duplicated in agents.ts). This cuts discovery input tokens from ~90k to ~40k and prevents the 0-char exhaustion failure mode. * fix(discover): prompt hint to avoid unfiltered pod lists Replace the 30k tool result cap with a prompt instruction telling the agent to prefer deployments/statefulsets or metric queries over raw pod lists. Raw k8s_pods_list returns 83k+ chars and exhausts the token budget. The prompt now explicitly tells the agent to use fieldSelector or labelSelector if it must list pods, and to prefer pre-aggregated metric queries instead. Kept truncateMcpResult in tool-utils.ts as a shared utility but removed the discovery-specific cap — the prompt hint is the primary fix. * fix(agents): tighten k8s and log query guidance to reduce token waste infra.ts: tell the agent to pass labelSelector/fieldSelector directly in tool calls instead of fetching all pods/events then filtering in-memory. An unfiltered k8s_pods_list returns 83k+ chars. logs.ts: clarify that "without filter" means without pattern filter, not without the service label selector. Added explicit line limits (100 for error context, 10 for existence check). discover.ts: fix blocklist test — use generic "metric or log query tool" instead of hardcoded tool names in the datasource UID block. * fix(discover): use filtered pod list as second pass, not skip it entirely The previous prompt told the agent to prefer metrics over pod lists, which caused it to skip k8s_pods_list entirely. This lost 38 of 47 services — container-level services (celery workers, sidecars, multi-container pods) whose names differ from their deployment name are invisible to kube_deployment_status_replicas metrics. Now: metrics first for the initial sweep, then a filtered pod list (excluding kube-system/kube-public/kube-node-lease) as a second pass to catch container-level services. * fix: consul skill exhaustive discovery + stackId in poller logs consul-bare-metal-discovery.md: changed "Common bare-metal services" to "Important: discover ALL Consul services, not just known ones". The agent was treating the example list as exhaustive and skipping the consul_catalog query, missing 6 sub-components (hdfs-journalnode, hdfs-zkfc, hive-metastore, impala-catalog, kudu-tserver, fazbdregistry). service-health-poller.ts: added stackId to all poll log lines (tool not found, query result, empty query result, poll complete). Previously impossible to tell which stack's poller was failing. * fix(test): add buildDatasourceUidMap to ws-handler mock

WZ added 13 commits April 15, 2026 13:11

chore(web): add red-square favicon

c6b71ae

Inline SVG data URI so the browser tab has an identifiable marker instead of the default globe. Red #dc2626 (Tailwind red-600) matches the rest of the palette. Placeholder — swap for a real icon later.

chore: bump version to 0.1.4

322257c

docs: CHANGELOG for 0.1.4

d6227a6

WZ merged commit 3e67e54 into main Apr 15, 2026
1 check passed

WZ mentioned this pull request Apr 15, 2026

fix(discover): parse JSONC and extract top-level arrays #86

Merged

4 tasks

WZ mentioned this pull request Apr 24, 2026

fix(scan+discovery): four P1/P2 quick wins — terminal-emit, NaN log, budget warn, webhook hint #133

Merged

5 tasks

WZ deleted the feat/llm-call-start-logging branch April 25, 2026 03:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(llm-logger): call-start + first-chunk logging, fix discover tool tracking#85

feat(llm-logger): call-start + first-chunk logging, fix discover tool tracking#85
WZ merged 13 commits into
mainfrom
feat/llm-call-start-logging

WZ commented Apr 15, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

WZ commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test Coverage

Pre-Landing Review

Design Review

Adversarial Review

Plan Completion

Deferred / known trade-offs

Verification (after redeploy)

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

WZ commented Apr 15, 2026 •

edited

Loading