Skip to content

feat(llm-logger): call-start + first-chunk logging, fix discover tool tracking#85

Merged
WZ merged 13 commits into
mainfrom
feat/llm-call-start-logging
Apr 15, 2026
Merged

feat(llm-logger): call-start + first-chunk logging, fix discover tool tracking#85
WZ merged 13 commits into
mainfrom
feat/llm-call-start-logging

Conversation

@WZ
Copy link
Copy Markdown
Owner

@WZ WZ commented Apr 15, 2026

Summary

Ships v0.1.4. Two themes in one branch:

LLM observability (the enabling work)

  • Call-start + first-chunk logging at info level so mid-stream hangs are visible without enabling debug
  • Error-path logging so failed calls emit a matching completion line
  • Fixed discover.ts hardcoding toolCalls: [] in its log record
  • Dropped dead observability.port/logLevel config fields

Six production fixes surfaced by the new observability

  1. Health poller metric-aware zero classificationreplicas = 0unknown (scaled down), up = 0down (real scrape failure). Stops the 5 false-positive auto-investigations we were seeing on every cold start.
  2. Metrics agent prompt carveout — defense-in-depth for flat-zero replica series.
  3. Express trust proxy — configurable via TRUST_PROXY_HOPS env var (default 1). Fixes ERR_ERL_UNEXPECTED_X_FORWARDED_FOR and broken per-client rate limiting behind k8s ingress. Closes TODO feat: service detail page, sidebar navigation, and page reorganization #36.
  4. Discover agent logLabels guidance — emit Loki-compatible labels (container/pod/app), not kube-state-metrics labels. The {statefulset: yb-master} label was wasting 5-6 Loki fallback queries per investigation.
  5. Tool-utils .Z RFC3339 normalization — repairs .Z, dangling dot, and missing-Z variants that LLMs consistently produce.
  6. Tool-utils drop Loki stepSeconds — LLMs pass it as a Prometheus leftover; Loki rejects it.

Plus a red-square favicon so the tab is identifiable.

Test Coverage

Tests: 73 → 80 (+7 new files, 96 new test cases).

CODE PATH COVERAGE — v0.1.4
========================================

[+] src/server/service-health-poller.ts (metric-aware)
    ├── [★★★] matchResultsToServices: replicas=0 → unknown, up=0 → down, NaN → unknown
    ├── [★★★] Merge priority: healthy > down > unknown across 3 batches
    ├── [★★★] First-poll: scaled-zero does NOT fire transition
    ├── [★★★] First-poll: up=0 DOES fire transition (regression test for 72fa3de)
    └── [★★★] healthy → down and healthy → unknown transitions both fire
    Coverage: 11/11 cases

[+] src/server/index.ts (trust proxy)
    ├── [★★★] X-Forwarded-For resolves to real client with trust proxy
    └── [★★★] Regression baseline: without trust proxy, XFF is ignored
    Coverage: 2/2 cases

[+] src/agents/metrics.ts (prompt)
    ├── [★★]  INTENTIONALLY-DISABLED SERVICES block present in prompt
    └── [★★]  FULL investigation window guidance retained
    Coverage: 2/2 cases (text-content only — behavioral regression requires rca-eval)

[+] src/workflows/tool-utils.ts (.Z + stepSeconds)
    ├── [★★★] .Z → Z (empty-fraction malformed case)
    ├── [★★★] dangling dot → Z
    ├── [★★★] missing Z → Z appended
    ├── [★★★] .0Z / .000Z / valid Z all left unchanged (idempotent)
    ├── [★★★] non-time field 'prefix.Z' unchanged
    └── [★★★] coerceLokiArgs drops stepSeconds, preserves direction/limit
    Coverage: 9/9 cases

[+] src/agents/discover.ts (logLabels prompt) — no test (LLM behavior)

─────────────────────────────────
COVERAGE: 24/24 new code paths tested (100%)
QUALITY:  ★★★ 20, ★★ 2, ★ 0
─────────────────────────────────

Pre-Landing Review

1 informational: coercePrometheusArgs defaults assume queryType: instant — latent for range queries (not triggered in practice). Skipped as tracked-for-future.

Design Review

Scope: 1 frontend file (src/web/index.html favicon — 1 line). Skipped.

Adversarial Review

Both Claude subagent and Codex ran. 7 findings across the two reviewers, 2 cross-model confirmed:

  • CROSS-MODEL: trust proxy attack surfaceFIXED: made TRUST_PROXY_HOPS env-var configurable with safety notes
  • CROSS-MODEL: replicas=0 → unknown hides real outages → KNOWN TRADE-OFF (explicit "NOT in scope" in the plan, tracked for follow-up with proper desired vs ready comparison)
  • .Z regex narrowness → FIXED: also handles dangling dot and missing Z
  • discover.ts unbounded argsStr → FIXED: sliced to 500 chars, wrapped in try/catch
  • Loki stepSeconds silent break if tool adds it → monitor (tracked in plan failure modes)
  • Metrics prompt trusts LLM compliance → defense-in-depth; Fix feat: CLI, discovery, RCA pipeline, panel images, and investigation hardening #1 is the real gate
  • logLlmCallStart/FirstChunk skews on throw → minor observability edge case

Codex structured review: GATE PASS (no [P1] markers).
PR Quality Score: 9/10.

Plan Completion

Plan: /Users/wli02/.claude/plans/moonlit-nibbling-star.md (as amended by /plan-eng-review).

  [DONE]    Fix #1 — metric-aware health classification
  [DONE]    Fix #2 — metrics prompt + prompt-content test
  [DONE]    Fix #3 — trust proxy (elevated to env-var via adversarial review)
  [DONE]    Fix #4 — discover agent logLabels guidance
  [DONE]    Fix #5+6 — isTimeField extract + .Z + stepSeconds + tests
  [DONE]    Adversarial follow-ups: .Z regex, discover argsStr, trust proxy env
  [DONE]    Red-square favicon (user side request)
  [DONE]    Version bump 0.1.4 + CHANGELOG

COMPLETION: 8/8 DONE — PASS

Deferred / known trade-offs

  • Desired-vs-ready health detection — not in scope for this PR; current fix stops false positives but leaves a latent false-negative for desired > 0, ready = 0 workloads. Captured in plan's "NOT in scope" section.
  • TRUST_PROXY_HOPS topology — default 1 is correct for single k8s ingress with use-forwarded-headers: false. Deployments with a CDN/service mesh/ingress-pass-through need to set this explicitly. Documented inline and in Helm values.

Verification (after redeploy)

  • Health poller: summary.down should drop from ~5 to the count of services with real up = 0 targets
  • Zero ERR_ERL_UNEXPECTED_X_FORWARDED_FOR on WS connects
  • New investigations: no .Z parse errors in Loki, no stepSeconds: 300 in Loki args, no toolCallCount: 0 in discover
  • Red-square favicon in browser tab

Test plan

  • TypeScript: tsc --noEmit clean
  • Unit tests: 96 new cases, all pass
  • Docker: wliftnt/dops-assistant:0.1.4 built and pushed (digest sha256:90c8fc80…)
  • Adversarial review: Claude + Codex cross-model, gate PASS, 2 cross-model findings fixed inline
  • Runtime smoke (post-redeploy): verify the 4 items above

🤖 Generated with Claude Code

WZ added 13 commits April 15, 2026 13:11
… tracking

- Add logLlmCallStart() and logLlmCallFirstChunk() at info level so hangs
  before the first chunk are visible without enabling debug logs.
- Wire into chat (agents.ts), discover, and evidence call sites.
- Fix discover.ts hardcoding toolCalls: [] — collect tool events via
  onStepFinish like evidence.ts does, so toolCallCount reflects reality.
- Drop dead observability block from config.yaml.example (port and
  logLevel fields are not read by any runtime code).
- Bump to 0.1.3.
discover.ts and agents.ts chat both had gaps where a failing LLM call
emitted `LLM ... start:` but no matching completion log — only a warn
from the app logger. Now logLlmCall fires on both success and failure,
with the error message in the `error` field so failed calls are
visible in llm-logger output and pair cleanly with their start event.

evidence.ts already recorded generateError, so no change there.
The Grafana MCP query_prometheus tool's schema requires startTime/endTime
as strings and stepSeconds as a number even for instant queries, where
these fields are logically meaningless. LLMs default them to null and
eat a validation-error round-trip before retrying with "now".

Add coercePrometheusArgs() alongside coerceLokiArgs() to fill defaults
before the schema check runs. The "now" string gets converted to RFC3339
by the existing coerceToolArgs path.

Saves ~1s + a few hundred tokens per discover run and removes a
confusing error from the debug log.
The poller's three batch queries have opposite semantics for value=0:
- kube_deployment_status_replicas = 0: intentionally scaled down (unknown)
- kube_statefulset_status_replicas = 0: intentionally scaled down (unknown)
- up = 0: scrape target is actually down (down)

Before this change, all three were classified as "down" when value=0,
causing five legitimately-scaled-to-zero services to fire false-positive
auto-investigations on every cold start (each burning ~60s and 25k+ LLM
tokens).

Fix: pass a zeroMeans parameter ("down" | "unknown") into matchResultsToServices,
call it per-batch in pollOnce, and merge results with explicit priority
(healthy > down > unknown). The first-poll transition gate from commit
72fa3de is preserved — real up=0 scrape failures still fire investigations
on cold start; scaled-to-zero services do not.

Inline ASCII decision diagram added to matchResultsToServices for future
maintainers.

Test coverage: 11 new cases, including a regression test for the 72fa3de
cold-start intent (up=0 must still fire onTransition).
Defense in depth for the health-poller fix. Adds an INTENTIONALLY-DISABLED
SERVICES block to the metrics agent system prompt so that a webhook-triggered
investigation against a scaled-to-zero workload reports "not deployed"
instead of "severity: high, scaled down or failed to start".

Only triggers when a replica metric (deployment/statefulset/daemonset)
reports a flat zero across the ENTIRE investigation window. A transition
from >0 to 0 inside the window, or ready < desired, still reports as an
anomaly.

Test: deterministic assertion on the prompt text (metrics.test.ts).
Runtime behavioral verification requires rca-eval against real data.
Behind the k8s ingress, X-Forwarded-For is set but Express defaults to
'trust proxy: false'. This means req.ip resolves to the ingress IP, not
the real client, and express-rate-limit v8 logs ERR_ERL_UNEXPECTED_X_FORWARDED_FOR
on every request while effectively sharing one bucket across all clients.

Fix: app.set("trust proxy", 1). The '1' (not 'true') means one trusted
hop, so a client without an upstream can't spoof its own X-Forwarded-For.

Closes TODOS.md #36 (from PR #57 Codex adversarial review).

Test: supertest assertions in rate-limit.test.ts for both the fix and
the pre-fix baseline.
…etrics labels

The discover agent was writing {"statefulset": "yb-master"} into
services.yaml, but Loki in most envs doesn't expose a 'statefulset'
stream label — only pod, container, namespace, app, and so on. The
downstream logs agent would then query Loki with {statefulset="..."}
and get empty results, wasting 5-6 fallback queries per investigation.

Prompt guidance rewritten to prefer {container}/{pod}/{app} labels
and explicitly warn that kube-state-metrics labels like deployment/
statefulset/daemonset do NOT transfer to Loki.

Only affects new discovery runs. Existing services.yaml needs
regeneration via `npm run discover`.
Two small LLM-quirk coercions observed in prod logs:

1. Malformed RFC3339: the LLM consistently truncates "2026-04-15T20:27:00.000Z"
   to "2026-04-15T20:27:00.Z" (trailing dot without fractional digits). The
   Grafana MCP tool rejects this, costing one wasted tool call per logs
   investigation. Normalize .Z -> Z on time-typed fields in coerceToolArgs.

2. Stray stepSeconds: the LLM passes stepSeconds: 300 to grafana_query_loki_logs
   (a Prometheus-only concept). Drop it in coerceLokiArgs before the tool
   sees it.

DRY cleanup: extracted the inline "is this a time field?" check into a
file-local isTimeField() helper shared by the 'now' and .Z branches.

Test coverage: 7 cases in new src/workflows/tool-utils.test.ts.
Inline SVG data URI so the browser tab has an identifiable marker
instead of the default globe. Red #dc2626 (Tailwind red-600) matches
the rest of the palette. Placeholder — swap for a real icon later.
Follow-up fixes from the /ship adversarial review (Claude subagent + Codex):

1. tool-utils.ts: widen the RFC3339 normalizer to also catch the "dangling
   dot" and "missing Z entirely" variants, in addition to the original
   ".Z" empty-fraction case. Keeps ".000Z" and ".0Z" (valid zero fractions)
   unchanged. Three new unit tests.

2. steps/discover.ts: bound memory on onStepFinish by slicing argsStr to
   500 chars (matches evidence.ts pattern), and wrap the per-step record
   in try/catch so JSON.stringify exceptions (BigInt, circular refs) don't
   crash the discovery step.

3. server/index.ts: make `trust proxy` configurable via TRUST_PROXY_HOPS
   env var (default 1). Both reviewers flagged that a hardcoded 1 is wrong
   when the ingress is configured with use-forwarded-headers:true, or
   when there's more than one proxy hop (CDN + ingress, service mesh +
   ingress). Too-low → clients share one bucket (false 429s); too-high
   → XFF is spoofable. Doc the tradeoffs inline and in the Helm values.

Helm chart values.yaml gets an env-var commented example for TRUST_PROXY_HOPS.
The previous rewrite hardcoded "Loki" throughout, but the `logs` role in
provider-registry is generic — the wired-up log system could be Loki,
Elasticsearch, Splunk, CloudWatch, or anything else with a `logs` role.

Rewrote the prompt to:
1. Frame the rule as "match actual stream labels / index fields the log
   provider exposes", not Loki-specific
2. Tell the agent to call a label-discovery tool up front if one exists
   (list_loki_label_names, list_indices, describe_log_groups, etc.) and
   prefer labels from that result
3. Keep the kube-state-metrics gotcha because it's the common LLM mistake
   across every k8s log pipeline
4. Keep the container/pod/namespace/app fallback guidance since those are
   universal across promtail, fluent-bit, filebeat, and CloudWatch CI
@WZ WZ merged commit 3e67e54 into main Apr 15, 2026
1 check passed
WZ added a commit that referenced this pull request Apr 15, 2026
Two related bugs surfaced in prod on the next discover run after shipping
the JSONC parser fix:

1. Tool-arg coercions never ran for discovery. The guard at the top of
   the wrap call was:

       const tools = wrappedOnToolCall
         ? wrapToolsWithCallbacks(discoveryTools, wrappedOnToolCall, "discovery")
         : discoveryTools;

   When `config.onToolCall` is undefined (auto-discovery on cold start,
   CLI runs without callbacks), tools passed through raw, so both
   `coercePrometheusArgs` and `coerceLokiArgs` were skipped.
   Consequence: the LLM's standard `startTime: null` on instant queries
   crashed the Grafana MCP tool with a schema validation error, the same
   class of bug fix #5 was supposed to prevent for everyone.

   Fix: always wrap discovery tools. `wrapToolsWithCallbacks` already
   handles undefined `onToolCall` via `?.`, so coercions now run
   unconditionally and the callback path is still optional.

2. Datasource UIDs were hallucinated. The discover agent called
   `grafana_list_loki_label_names` with `datasourceUid: "loki"`
   (made-up short name) instead of the real `ef34v1a9h2fi8c` it had
   already learned from its own `grafana_list_datasources` call earlier
   in the same run. Result: label lookup failed, every service ended
   up with `logLabels: {}`, and fix #4 (Loki-compatible log labels from
   PR #85) was effectively neutralized.

   Fix: mirror the evidence-agent pattern and pre-fetch datasource UIDs
   at the start of `runDiscoverStep`, then inject them into the prompt
   as an `<untrusted_datasource_hints>` block. Uses the same UID-first
   wording evidence uses. Graceful fallback to empty hints if
   list_datasources isn't available or the call fails — the agent can
   still run, it'll just eat the round-trip discovering them itself
   (same as today).

Both fixes live in `src/workflows/steps/discover.ts` — scope-contained
to the discovery path. No tests changed because the coercion logic
is already covered by tool-utils.test.ts; the new hint fetcher is a
best-effort wrapper around an existing MCP tool call with no branching
to meaningfully unit-test.
WZ added a commit that referenced this pull request Apr 16, 2026
Two related bugs surfaced in prod on the next discover run after shipping
the JSONC parser fix:

1. Tool-arg coercions never ran for discovery. The guard at the top of
   the wrap call was:

       const tools = wrappedOnToolCall
         ? wrapToolsWithCallbacks(discoveryTools, wrappedOnToolCall, "discovery")
         : discoveryTools;

   When `config.onToolCall` is undefined (auto-discovery on cold start,
   CLI runs without callbacks), tools passed through raw, so both
   `coercePrometheusArgs` and `coerceLokiArgs` were skipped.
   Consequence: the LLM's standard `startTime: null` on instant queries
   crashed the Grafana MCP tool with a schema validation error, the same
   class of bug fix #5 was supposed to prevent for everyone.

   Fix: always wrap discovery tools. `wrapToolsWithCallbacks` already
   handles undefined `onToolCall` via `?.`, so coercions now run
   unconditionally and the callback path is still optional.

2. Datasource UIDs were hallucinated. The discover agent called
   `grafana_list_loki_label_names` with `datasourceUid: "loki"`
   (made-up short name) instead of the real `ef34v1a9h2fi8c` it had
   already learned from its own `grafana_list_datasources` call earlier
   in the same run. Result: label lookup failed, every service ended
   up with `logLabels: {}`, and fix #4 (Loki-compatible log labels from
   PR #85) was effectively neutralized.

   Fix: mirror the evidence-agent pattern and pre-fetch datasource UIDs
   at the start of `runDiscoverStep`, then inject them into the prompt
   as an `<untrusted_datasource_hints>` block. Uses the same UID-first
   wording evidence uses. Graceful fallback to empty hints if
   list_datasources isn't available or the call fails — the agent can
   still run, it'll just eat the round-trip discovering them itself
   (same as today).

Both fixes live in `src/workflows/steps/discover.ts` — scope-contained
to the discovery path. No tests changed because the coercion logic
is already covered by tool-utils.test.ts; the new hint fetcher is a
best-effort wrapper around an existing MCP tool call with no branching
to meaningfully unit-test.
WZ added a commit that referenced this pull request Apr 16, 2026
* fix(discover): parse JSONC and extract top-level arrays

Observed in prod: the discover agent emitted a valid-looking JSON array
with JavaScript-style block comments as section dividers:

  [
    {...},
    /* StatefulSets */
    {...},
    /* DaemonSets */
    {...}
  ]

`JSON.parse` rejected it. `safeJsonParse` fell through to its markdown-
fence-extraction fallback, which still called `JSON.parse` on the body
and hit the same rejection. The remaining fallbacks only looked for
top-level `{...}` objects, so they found individual service records inside
the array and the discover step saw `parsed = <single object>` instead
of `parsed = <array>`, failed its `Array.isArray` check, and logged
"discovery returned empty result" three times in a row.

Two fixes:

1. `stripJsoncComments()` — a string-aware pass that removes `// ...`
   line comments and `/* ... */` block comments and trailing commas
   before `]` / `}`. Respects JSON string boundaries so content like
   `"see /* important */ docs"` survives unchanged. Tried after direct
   parse and after markdown-fence extraction.

2. `extractLastJsonArray()` / `extractFirstJsonArray()` — balanced-
   bracket scan for top-level `[...]` as a last-resort fallback
   (object extraction still runs first so evidence/metrics agents
   keep returning their enclosing object, not the inner observations
   array).

Also tightened the discover agent prompt to explicitly forbid JS-style
comments and trailing commas in the output JSON.

Test coverage: 6 new regression cases in processors.test.ts including
the exact JSONC-with-block-comments failure mode from prod, string-
preservation of comment-like sequences inside values, and trailing-
comma tolerance.

* fix(discover): always apply arg coercions + pre-inject datasource hints

Two related bugs surfaced in prod on the next discover run after shipping
the JSONC parser fix:

1. Tool-arg coercions never ran for discovery. The guard at the top of
   the wrap call was:

       const tools = wrappedOnToolCall
         ? wrapToolsWithCallbacks(discoveryTools, wrappedOnToolCall, "discovery")
         : discoveryTools;

   When `config.onToolCall` is undefined (auto-discovery on cold start,
   CLI runs without callbacks), tools passed through raw, so both
   `coercePrometheusArgs` and `coerceLokiArgs` were skipped.
   Consequence: the LLM's standard `startTime: null` on instant queries
   crashed the Grafana MCP tool with a schema validation error, the same
   class of bug fix #5 was supposed to prevent for everyone.

   Fix: always wrap discovery tools. `wrapToolsWithCallbacks` already
   handles undefined `onToolCall` via `?.`, so coercions now run
   unconditionally and the callback path is still optional.

2. Datasource UIDs were hallucinated. The discover agent called
   `grafana_list_loki_label_names` with `datasourceUid: "loki"`
   (made-up short name) instead of the real `ef34v1a9h2fi8c` it had
   already learned from its own `grafana_list_datasources` call earlier
   in the same run. Result: label lookup failed, every service ended
   up with `logLabels: {}`, and fix #4 (Loki-compatible log labels from
   PR #85) was effectively neutralized.

   Fix: mirror the evidence-agent pattern and pre-fetch datasource UIDs
   at the start of `runDiscoverStep`, then inject them into the prompt
   as an `<untrusted_datasource_hints>` block. Uses the same UID-first
   wording evidence uses. Graceful fallback to empty hints if
   list_datasources isn't available or the call fails — the agent can
   still run, it'll just eat the round-trip discovering them itself
   (same as today).

Both fixes live in `src/workflows/steps/discover.ts` — scope-contained
to the discovery path. No tests changed because the coercion logic
is already covered by tool-utils.test.ts; the new hint fetcher is a
best-effort wrapper around an existing MCP tool call with no branching
to meaningfully unit-test.

* fix(tool-utils): strip Mastra tool marker so our coercions actually run

Root cause of every "our coercion didn't run" symptom we've been chasing:

Mastra's CoreToolBuilder.createExecute() validates tool input against the
tool's inputSchema BEFORE calling our wrapped execute, but only for tools
that Mastra considers its own (detected via the
`Symbol.for("mastra.core.tool.Tool")` marker, which is copied by
`{ ...tool }` spread). When the LLM sends `startTime: null` on an instant
Prometheus query, Mastra's validateToolInput rejects it with

    Tool input validation failed ... startTime: must be string

and returns the error to the LLM — our `coercePrometheusArgs` never gets
a chance to fire because execute was never called.

The same latent bug was silently hiding `coerceLokiArgs` as well, but that
one only fixes shape-valid-but-wrong-value inputs (direction "forward" →
"backward") so validation passed and our wrapper ran, masking the problem.
`coercePrometheusArgs` fixes shape-invalid inputs (null → string), which
validation blocks upstream — which is why the same bug kept resurfacing
in prod even after three "fixes".

The fix: strip the `MASTRA_TOOL_MARKER` symbol (and the outputSchema) from
the wrapped tool object. Mastra's `isMastraTool()` now returns false, so
the framework takes the Vercel/AI-SDK tool path, which SKIPS input
validation. Our wrapped execute runs on the raw LLM args, coerces them,
then calls the original Mastra Tool instance's execute directly — which
forwards to the underlying MCP tool with the fixed args.

Tool-spec fidelity (what the LLM sees) is unchanged because inputSchema
is preserved. Only the server-side validation step is bypassed. Output
validation is also skipped (outputSchema dropped) to avoid false negatives
on the permissive MCP content-wrapper response shape.

Test coverage in tool-utils-coercion.test.ts:
- marker stripped, outputSchema dropped, inputSchema preserved, execute
  replaced (direct assertions on the wrapped object)
- coercePrometheusArgs runs end-to-end via the wrapper (already present)
- coerceLokiArgs drops stepSeconds and forces direction=backward end-to-end

* chore(helm): default imagePullPolicy to Always

Fixed-tag deploys (0.1.4, 0.2.0, etc.) that get re-pushed under the same
tag don't pick up the new digest when the node has a cached copy under
that tag. This bit us immediately after the marker-strip fix — I rebuilt
and re-pushed 0.1.4, but the rollout kept running the old digest because
kubelet saw "already have 0.1.4 cached" and skipped the pull.

Always-pull costs ~2-5s on pod start in exchange for the guarantee that
re-pushing a tag actually does what you expect. For a single-replica
deployment that only rolls on real code changes, that's fine.

Override to IfNotPresent per-environment if you never re-push tags and
care about faster rollouts.

* chore: bump to 0.1.5

* fix(discover): log complete prompt hints at debug level, not a placeholder

The `logLlmCall` at the success + error paths of runDiscoverStep was
passing `prompt: "${discoverPrompt}\n\n[hints: ${fullHints.length} chars]"`,
which replaced the actual hints content with a char-count placeholder.
With LLM_LOG_LEVEL=debug set, the operator should see the ENTIRE LLM
request — user message + the full datasource hints + recipes + skills —
not a stub. Inline `fullHints` directly so the debug entry is reproducible
from the log alone.

Bump to 0.1.6.

* fix(web): redirect to Operations Desk after accepting discovery

Accepting a discovery result used to leave the operator on the Services
page with the grid re-rendered. They then had to manually click the
Operations Desk (Dashboard) nav item to see the freshly-populated service
list in the live catalog. Automate that last step: after the
`discover:accept` WS message is sent, ServicesPage fires a new
`onDiscoveryAccepted` callback which App.tsx wires to
`setLeftPane({ type: "dashboard" })`.

No-op default on the prop so existing tests (and anyone calling the
component without the callback) keep working unchanged.

Bumping to 0.1.7.

* feat(observability): log full prompts on start + per-tool agentic loop

Adds two debugging affordances that were missing from the 0.1.4 LLM
observability work:

1. logLlmCallStart now accepts an optional `prompt` field and emits the
   full content at debug level as a separate "LLM <agent> start prompt"
   line. Previously the full prompt was only visible in the completion
   log, so you couldn't inspect what the model saw during a hang.

2. logToolCall now emits an info summary (tool name + bytes + duration)
   plus debug details (full args + result) for every tool call inside
   discover, evidence phases, and chat. Previously discover and evidence
   collected tool events silently and only flushed them in the final
   LLM call summary, making the agentic loop invisible in real time.

Bump 0.1.7 -> 0.1.8.

* fix(discover): coerce hallucinated datasource UIDs + harden prompt

The discover agent consistently passed datasourceUid: "prometheus" (a
hallucinated short name) instead of the real UID from the pre-injected
hints block. The MCP server rejected it with "Tool input validation"
errors, burning 2-4 retries per discovery run.

Two-layer fix:

1. wrapToolsWithCallbacks now intercepts hallucinated short names via a
   pre-built Map<shortName, realUid> and substitutes the real UID before
   the MCP call. Logs at info level when substitution fires so we can
   count hallucinations and track whether the prompt fix reduces them.

2. The discover agent's system prompt now renders datasource UIDs in a
   dedicated "CRITICAL: non-negotiable" block, separated from the recipe
   hints that previously said "these are suggestions".

Extended the coercion map through WorkflowConfig to evidence phases and
anomaly detection for consistency.

Bump 0.1.8 -> 0.1.9.

* fix(discover): truncated JSON recovery + queryType coercion + retry UI

Three discovery improvements found via 0.1.8/0.1.9 per-tool-call logs:

1. Response truncation: max_tokens 16384 was too small for 71-service
   environments. Bumped to 32768 and added recoverTruncatedJsonArray
   to safeJsonParse — finds the last complete } in a truncated [{...},
   {... array and closes it. Strategy runs before single-object
   extraction so we don't pick up a stray inner object.

2. queryType: null coercion in coercePrometheusArgs — defaults to
   "instant", eliminating 1 wasted tool call per discovery attempt.

3. Retry visibility: new discover:retry WebSocket event + UI indicator
   showing "Attempt 2 of 3" so the user knows it's retrying, not stuck.

Also added parse failure diagnostics: logs response length + first/last
200 chars so truncation is diagnosable from a single log line.

Bump 0.1.9 -> 0.1.10.

* feat(server): log version on startup

Reads package.json version and includes it in the startup log line:
"dops-assistant v0.1.10 web server running on port 3000"

Structured fields include both port and version for log queries.

* fix(web): reset discovery state on Run Discovery click

When clicking Run Discovery after a previous run, the UI showed
"Validating services..." (stale phase from prior run) instead of
"Discovering services...". The onStartDiscovery handler only sent the
WS message without resetting discoveryState. Now resets to phase
"discovery" + status "running" before sending, so the UI immediately
shows the correct initial state.

* feat(web): add shimmer animation to discovery progress bar

The progress bar was static between tool calls, making the UI look
stuck during LLM thinking phases. Added a shimmer gradient overlay
that animates continuously, giving visual feedback that the process
is still active even when no iteration events are firing.

* fix(discover): cap tool results at 30k chars to prevent token exhaustion

k8s_pods_list returns 83k chars of raw pod data. With max_tokens=32768,
the LLM exhausts its entire token budget digesting the pod list and
produces a 0-char response (32938 output tokens, 0 chars text).

Added maxToolResultChars param to wrapToolsWithCallbacks. Discovery
tools are now capped at 30k chars with a truncation note. Moved
truncateMcpResult to tool-utils.ts (was duplicated in agents.ts).

This cuts discovery input tokens from ~90k to ~40k and prevents the
0-char exhaustion failure mode.

* fix(discover): prompt hint to avoid unfiltered pod lists

Replace the 30k tool result cap with a prompt instruction telling the
agent to prefer deployments/statefulsets or metric queries over raw pod
lists. Raw k8s_pods_list returns 83k+ chars and exhausts the token
budget. The prompt now explicitly tells the agent to use fieldSelector
or labelSelector if it must list pods, and to prefer pre-aggregated
metric queries instead.

Kept truncateMcpResult in tool-utils.ts as a shared utility but removed
the discovery-specific cap — the prompt hint is the primary fix.

* fix(agents): tighten k8s and log query guidance to reduce token waste

infra.ts: tell the agent to pass labelSelector/fieldSelector directly
in tool calls instead of fetching all pods/events then filtering
in-memory. An unfiltered k8s_pods_list returns 83k+ chars.

logs.ts: clarify that "without filter" means without pattern filter,
not without the service label selector. Added explicit line limits
(100 for error context, 10 for existence check).

discover.ts: fix blocklist test — use generic "metric or log query
tool" instead of hardcoded tool names in the datasource UID block.

* fix(discover): use filtered pod list as second pass, not skip it entirely

The previous prompt told the agent to prefer metrics over pod lists,
which caused it to skip k8s_pods_list entirely. This lost 38 of 47
services — container-level services (celery workers, sidecars,
multi-container pods) whose names differ from their deployment name
are invisible to kube_deployment_status_replicas metrics.

Now: metrics first for the initial sweep, then a filtered pod list
(excluding kube-system/kube-public/kube-node-lease) as a second pass
to catch container-level services.

* fix: consul skill exhaustive discovery + stackId in poller logs

consul-bare-metal-discovery.md: changed "Common bare-metal services"
to "Important: discover ALL Consul services, not just known ones".
The agent was treating the example list as exhaustive and skipping
the consul_catalog query, missing 6 sub-components (hdfs-journalnode,
hdfs-zkfc, hive-metastore, impala-catalog, kudu-tserver, fazbdregistry).

service-health-poller.ts: added stackId to all poll log lines (tool
not found, query result, empty query result, poll complete). Previously
impossible to tell which stack's poller was failing.

* fix(test): add buildDatasourceUidMap to ws-handler mock
@WZ WZ deleted the feat/llm-call-start-logging branch April 25, 2026 03:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant