feat(llm-logger): call-start + first-chunk logging, fix discover tool tracking#85
Merged
Conversation
… tracking - Add logLlmCallStart() and logLlmCallFirstChunk() at info level so hangs before the first chunk are visible without enabling debug logs. - Wire into chat (agents.ts), discover, and evidence call sites. - Fix discover.ts hardcoding toolCalls: [] — collect tool events via onStepFinish like evidence.ts does, so toolCallCount reflects reality. - Drop dead observability block from config.yaml.example (port and logLevel fields are not read by any runtime code). - Bump to 0.1.3.
discover.ts and agents.ts chat both had gaps where a failing LLM call emitted `LLM ... start:` but no matching completion log — only a warn from the app logger. Now logLlmCall fires on both success and failure, with the error message in the `error` field so failed calls are visible in llm-logger output and pair cleanly with their start event. evidence.ts already recorded generateError, so no change there.
The Grafana MCP query_prometheus tool's schema requires startTime/endTime as strings and stepSeconds as a number even for instant queries, where these fields are logically meaningless. LLMs default them to null and eat a validation-error round-trip before retrying with "now". Add coercePrometheusArgs() alongside coerceLokiArgs() to fill defaults before the schema check runs. The "now" string gets converted to RFC3339 by the existing coerceToolArgs path. Saves ~1s + a few hundred tokens per discover run and removes a confusing error from the debug log.
The poller's three batch queries have opposite semantics for value=0:
- kube_deployment_status_replicas = 0: intentionally scaled down (unknown)
- kube_statefulset_status_replicas = 0: intentionally scaled down (unknown)
- up = 0: scrape target is actually down (down)
Before this change, all three were classified as "down" when value=0,
causing five legitimately-scaled-to-zero services to fire false-positive
auto-investigations on every cold start (each burning ~60s and 25k+ LLM
tokens).
Fix: pass a zeroMeans parameter ("down" | "unknown") into matchResultsToServices,
call it per-batch in pollOnce, and merge results with explicit priority
(healthy > down > unknown). The first-poll transition gate from commit
72fa3de is preserved — real up=0 scrape failures still fire investigations
on cold start; scaled-to-zero services do not.
Inline ASCII decision diagram added to matchResultsToServices for future
maintainers.
Test coverage: 11 new cases, including a regression test for the 72fa3de
cold-start intent (up=0 must still fire onTransition).
Defense in depth for the health-poller fix. Adds an INTENTIONALLY-DISABLED SERVICES block to the metrics agent system prompt so that a webhook-triggered investigation against a scaled-to-zero workload reports "not deployed" instead of "severity: high, scaled down or failed to start". Only triggers when a replica metric (deployment/statefulset/daemonset) reports a flat zero across the ENTIRE investigation window. A transition from >0 to 0 inside the window, or ready < desired, still reports as an anomaly. Test: deterministic assertion on the prompt text (metrics.test.ts). Runtime behavioral verification requires rca-eval against real data.
Behind the k8s ingress, X-Forwarded-For is set but Express defaults to
'trust proxy: false'. This means req.ip resolves to the ingress IP, not
the real client, and express-rate-limit v8 logs ERR_ERL_UNEXPECTED_X_FORWARDED_FOR
on every request while effectively sharing one bucket across all clients.
Fix: app.set("trust proxy", 1). The '1' (not 'true') means one trusted
hop, so a client without an upstream can't spoof its own X-Forwarded-For.
Closes TODOS.md #36 (from PR #57 Codex adversarial review).
Test: supertest assertions in rate-limit.test.ts for both the fix and
the pre-fix baseline.
…etrics labels
The discover agent was writing {"statefulset": "yb-master"} into
services.yaml, but Loki in most envs doesn't expose a 'statefulset'
stream label — only pod, container, namespace, app, and so on. The
downstream logs agent would then query Loki with {statefulset="..."}
and get empty results, wasting 5-6 fallback queries per investigation.
Prompt guidance rewritten to prefer {container}/{pod}/{app} labels
and explicitly warn that kube-state-metrics labels like deployment/
statefulset/daemonset do NOT transfer to Loki.
Only affects new discovery runs. Existing services.yaml needs
regeneration via `npm run discover`.
Two small LLM-quirk coercions observed in prod logs: 1. Malformed RFC3339: the LLM consistently truncates "2026-04-15T20:27:00.000Z" to "2026-04-15T20:27:00.Z" (trailing dot without fractional digits). The Grafana MCP tool rejects this, costing one wasted tool call per logs investigation. Normalize .Z -> Z on time-typed fields in coerceToolArgs. 2. Stray stepSeconds: the LLM passes stepSeconds: 300 to grafana_query_loki_logs (a Prometheus-only concept). Drop it in coerceLokiArgs before the tool sees it. DRY cleanup: extracted the inline "is this a time field?" check into a file-local isTimeField() helper shared by the 'now' and .Z branches. Test coverage: 7 cases in new src/workflows/tool-utils.test.ts.
Inline SVG data URI so the browser tab has an identifiable marker instead of the default globe. Red #dc2626 (Tailwind red-600) matches the rest of the palette. Placeholder — swap for a real icon later.
Follow-up fixes from the /ship adversarial review (Claude subagent + Codex): 1. tool-utils.ts: widen the RFC3339 normalizer to also catch the "dangling dot" and "missing Z entirely" variants, in addition to the original ".Z" empty-fraction case. Keeps ".000Z" and ".0Z" (valid zero fractions) unchanged. Three new unit tests. 2. steps/discover.ts: bound memory on onStepFinish by slicing argsStr to 500 chars (matches evidence.ts pattern), and wrap the per-step record in try/catch so JSON.stringify exceptions (BigInt, circular refs) don't crash the discovery step. 3. server/index.ts: make `trust proxy` configurable via TRUST_PROXY_HOPS env var (default 1). Both reviewers flagged that a hardcoded 1 is wrong when the ingress is configured with use-forwarded-headers:true, or when there's more than one proxy hop (CDN + ingress, service mesh + ingress). Too-low → clients share one bucket (false 429s); too-high → XFF is spoofable. Doc the tradeoffs inline and in the Helm values. Helm chart values.yaml gets an env-var commented example for TRUST_PROXY_HOPS.
The previous rewrite hardcoded "Loki" throughout, but the `logs` role in provider-registry is generic — the wired-up log system could be Loki, Elasticsearch, Splunk, CloudWatch, or anything else with a `logs` role. Rewrote the prompt to: 1. Frame the rule as "match actual stream labels / index fields the log provider exposes", not Loki-specific 2. Tell the agent to call a label-discovery tool up front if one exists (list_loki_label_names, list_indices, describe_log_groups, etc.) and prefer labels from that result 3. Keep the kube-state-metrics gotcha because it's the common LLM mistake across every k8s log pipeline 4. Keep the container/pod/namespace/app fallback guidance since those are universal across promtail, fluent-bit, filebeat, and CloudWatch CI
4 tasks
WZ
added a commit
that referenced
this pull request
Apr 15, 2026
Two related bugs surfaced in prod on the next discover run after shipping
the JSONC parser fix:
1. Tool-arg coercions never ran for discovery. The guard at the top of
the wrap call was:
const tools = wrappedOnToolCall
? wrapToolsWithCallbacks(discoveryTools, wrappedOnToolCall, "discovery")
: discoveryTools;
When `config.onToolCall` is undefined (auto-discovery on cold start,
CLI runs without callbacks), tools passed through raw, so both
`coercePrometheusArgs` and `coerceLokiArgs` were skipped.
Consequence: the LLM's standard `startTime: null` on instant queries
crashed the Grafana MCP tool with a schema validation error, the same
class of bug fix #5 was supposed to prevent for everyone.
Fix: always wrap discovery tools. `wrapToolsWithCallbacks` already
handles undefined `onToolCall` via `?.`, so coercions now run
unconditionally and the callback path is still optional.
2. Datasource UIDs were hallucinated. The discover agent called
`grafana_list_loki_label_names` with `datasourceUid: "loki"`
(made-up short name) instead of the real `ef34v1a9h2fi8c` it had
already learned from its own `grafana_list_datasources` call earlier
in the same run. Result: label lookup failed, every service ended
up with `logLabels: {}`, and fix #4 (Loki-compatible log labels from
PR #85) was effectively neutralized.
Fix: mirror the evidence-agent pattern and pre-fetch datasource UIDs
at the start of `runDiscoverStep`, then inject them into the prompt
as an `<untrusted_datasource_hints>` block. Uses the same UID-first
wording evidence uses. Graceful fallback to empty hints if
list_datasources isn't available or the call fails — the agent can
still run, it'll just eat the round-trip discovering them itself
(same as today).
Both fixes live in `src/workflows/steps/discover.ts` — scope-contained
to the discovery path. No tests changed because the coercion logic
is already covered by tool-utils.test.ts; the new hint fetcher is a
best-effort wrapper around an existing MCP tool call with no branching
to meaningfully unit-test.
WZ
added a commit
that referenced
this pull request
Apr 16, 2026
Two related bugs surfaced in prod on the next discover run after shipping
the JSONC parser fix:
1. Tool-arg coercions never ran for discovery. The guard at the top of
the wrap call was:
const tools = wrappedOnToolCall
? wrapToolsWithCallbacks(discoveryTools, wrappedOnToolCall, "discovery")
: discoveryTools;
When `config.onToolCall` is undefined (auto-discovery on cold start,
CLI runs without callbacks), tools passed through raw, so both
`coercePrometheusArgs` and `coerceLokiArgs` were skipped.
Consequence: the LLM's standard `startTime: null` on instant queries
crashed the Grafana MCP tool with a schema validation error, the same
class of bug fix #5 was supposed to prevent for everyone.
Fix: always wrap discovery tools. `wrapToolsWithCallbacks` already
handles undefined `onToolCall` via `?.`, so coercions now run
unconditionally and the callback path is still optional.
2. Datasource UIDs were hallucinated. The discover agent called
`grafana_list_loki_label_names` with `datasourceUid: "loki"`
(made-up short name) instead of the real `ef34v1a9h2fi8c` it had
already learned from its own `grafana_list_datasources` call earlier
in the same run. Result: label lookup failed, every service ended
up with `logLabels: {}`, and fix #4 (Loki-compatible log labels from
PR #85) was effectively neutralized.
Fix: mirror the evidence-agent pattern and pre-fetch datasource UIDs
at the start of `runDiscoverStep`, then inject them into the prompt
as an `<untrusted_datasource_hints>` block. Uses the same UID-first
wording evidence uses. Graceful fallback to empty hints if
list_datasources isn't available or the call fails — the agent can
still run, it'll just eat the round-trip discovering them itself
(same as today).
Both fixes live in `src/workflows/steps/discover.ts` — scope-contained
to the discovery path. No tests changed because the coercion logic
is already covered by tool-utils.test.ts; the new hint fetcher is a
best-effort wrapper around an existing MCP tool call with no branching
to meaningfully unit-test.
WZ
added a commit
that referenced
this pull request
Apr 16, 2026
* fix(discover): parse JSONC and extract top-level arrays
Observed in prod: the discover agent emitted a valid-looking JSON array
with JavaScript-style block comments as section dividers:
[
{...},
/* StatefulSets */
{...},
/* DaemonSets */
{...}
]
`JSON.parse` rejected it. `safeJsonParse` fell through to its markdown-
fence-extraction fallback, which still called `JSON.parse` on the body
and hit the same rejection. The remaining fallbacks only looked for
top-level `{...}` objects, so they found individual service records inside
the array and the discover step saw `parsed = <single object>` instead
of `parsed = <array>`, failed its `Array.isArray` check, and logged
"discovery returned empty result" three times in a row.
Two fixes:
1. `stripJsoncComments()` — a string-aware pass that removes `// ...`
line comments and `/* ... */` block comments and trailing commas
before `]` / `}`. Respects JSON string boundaries so content like
`"see /* important */ docs"` survives unchanged. Tried after direct
parse and after markdown-fence extraction.
2. `extractLastJsonArray()` / `extractFirstJsonArray()` — balanced-
bracket scan for top-level `[...]` as a last-resort fallback
(object extraction still runs first so evidence/metrics agents
keep returning their enclosing object, not the inner observations
array).
Also tightened the discover agent prompt to explicitly forbid JS-style
comments and trailing commas in the output JSON.
Test coverage: 6 new regression cases in processors.test.ts including
the exact JSONC-with-block-comments failure mode from prod, string-
preservation of comment-like sequences inside values, and trailing-
comma tolerance.
* fix(discover): always apply arg coercions + pre-inject datasource hints
Two related bugs surfaced in prod on the next discover run after shipping
the JSONC parser fix:
1. Tool-arg coercions never ran for discovery. The guard at the top of
the wrap call was:
const tools = wrappedOnToolCall
? wrapToolsWithCallbacks(discoveryTools, wrappedOnToolCall, "discovery")
: discoveryTools;
When `config.onToolCall` is undefined (auto-discovery on cold start,
CLI runs without callbacks), tools passed through raw, so both
`coercePrometheusArgs` and `coerceLokiArgs` were skipped.
Consequence: the LLM's standard `startTime: null` on instant queries
crashed the Grafana MCP tool with a schema validation error, the same
class of bug fix #5 was supposed to prevent for everyone.
Fix: always wrap discovery tools. `wrapToolsWithCallbacks` already
handles undefined `onToolCall` via `?.`, so coercions now run
unconditionally and the callback path is still optional.
2. Datasource UIDs were hallucinated. The discover agent called
`grafana_list_loki_label_names` with `datasourceUid: "loki"`
(made-up short name) instead of the real `ef34v1a9h2fi8c` it had
already learned from its own `grafana_list_datasources` call earlier
in the same run. Result: label lookup failed, every service ended
up with `logLabels: {}`, and fix #4 (Loki-compatible log labels from
PR #85) was effectively neutralized.
Fix: mirror the evidence-agent pattern and pre-fetch datasource UIDs
at the start of `runDiscoverStep`, then inject them into the prompt
as an `<untrusted_datasource_hints>` block. Uses the same UID-first
wording evidence uses. Graceful fallback to empty hints if
list_datasources isn't available or the call fails — the agent can
still run, it'll just eat the round-trip discovering them itself
(same as today).
Both fixes live in `src/workflows/steps/discover.ts` — scope-contained
to the discovery path. No tests changed because the coercion logic
is already covered by tool-utils.test.ts; the new hint fetcher is a
best-effort wrapper around an existing MCP tool call with no branching
to meaningfully unit-test.
* fix(tool-utils): strip Mastra tool marker so our coercions actually run
Root cause of every "our coercion didn't run" symptom we've been chasing:
Mastra's CoreToolBuilder.createExecute() validates tool input against the
tool's inputSchema BEFORE calling our wrapped execute, but only for tools
that Mastra considers its own (detected via the
`Symbol.for("mastra.core.tool.Tool")` marker, which is copied by
`{ ...tool }` spread). When the LLM sends `startTime: null` on an instant
Prometheus query, Mastra's validateToolInput rejects it with
Tool input validation failed ... startTime: must be string
and returns the error to the LLM — our `coercePrometheusArgs` never gets
a chance to fire because execute was never called.
The same latent bug was silently hiding `coerceLokiArgs` as well, but that
one only fixes shape-valid-but-wrong-value inputs (direction "forward" →
"backward") so validation passed and our wrapper ran, masking the problem.
`coercePrometheusArgs` fixes shape-invalid inputs (null → string), which
validation blocks upstream — which is why the same bug kept resurfacing
in prod even after three "fixes".
The fix: strip the `MASTRA_TOOL_MARKER` symbol (and the outputSchema) from
the wrapped tool object. Mastra's `isMastraTool()` now returns false, so
the framework takes the Vercel/AI-SDK tool path, which SKIPS input
validation. Our wrapped execute runs on the raw LLM args, coerces them,
then calls the original Mastra Tool instance's execute directly — which
forwards to the underlying MCP tool with the fixed args.
Tool-spec fidelity (what the LLM sees) is unchanged because inputSchema
is preserved. Only the server-side validation step is bypassed. Output
validation is also skipped (outputSchema dropped) to avoid false negatives
on the permissive MCP content-wrapper response shape.
Test coverage in tool-utils-coercion.test.ts:
- marker stripped, outputSchema dropped, inputSchema preserved, execute
replaced (direct assertions on the wrapped object)
- coercePrometheusArgs runs end-to-end via the wrapper (already present)
- coerceLokiArgs drops stepSeconds and forces direction=backward end-to-end
* chore(helm): default imagePullPolicy to Always
Fixed-tag deploys (0.1.4, 0.2.0, etc.) that get re-pushed under the same
tag don't pick up the new digest when the node has a cached copy under
that tag. This bit us immediately after the marker-strip fix — I rebuilt
and re-pushed 0.1.4, but the rollout kept running the old digest because
kubelet saw "already have 0.1.4 cached" and skipped the pull.
Always-pull costs ~2-5s on pod start in exchange for the guarantee that
re-pushing a tag actually does what you expect. For a single-replica
deployment that only rolls on real code changes, that's fine.
Override to IfNotPresent per-environment if you never re-push tags and
care about faster rollouts.
* chore: bump to 0.1.5
* fix(discover): log complete prompt hints at debug level, not a placeholder
The `logLlmCall` at the success + error paths of runDiscoverStep was
passing `prompt: "${discoverPrompt}\n\n[hints: ${fullHints.length} chars]"`,
which replaced the actual hints content with a char-count placeholder.
With LLM_LOG_LEVEL=debug set, the operator should see the ENTIRE LLM
request — user message + the full datasource hints + recipes + skills —
not a stub. Inline `fullHints` directly so the debug entry is reproducible
from the log alone.
Bump to 0.1.6.
* fix(web): redirect to Operations Desk after accepting discovery
Accepting a discovery result used to leave the operator on the Services
page with the grid re-rendered. They then had to manually click the
Operations Desk (Dashboard) nav item to see the freshly-populated service
list in the live catalog. Automate that last step: after the
`discover:accept` WS message is sent, ServicesPage fires a new
`onDiscoveryAccepted` callback which App.tsx wires to
`setLeftPane({ type: "dashboard" })`.
No-op default on the prop so existing tests (and anyone calling the
component without the callback) keep working unchanged.
Bumping to 0.1.7.
* feat(observability): log full prompts on start + per-tool agentic loop
Adds two debugging affordances that were missing from the 0.1.4 LLM
observability work:
1. logLlmCallStart now accepts an optional `prompt` field and emits the
full content at debug level as a separate "LLM <agent> start prompt"
line. Previously the full prompt was only visible in the completion
log, so you couldn't inspect what the model saw during a hang.
2. logToolCall now emits an info summary (tool name + bytes + duration)
plus debug details (full args + result) for every tool call inside
discover, evidence phases, and chat. Previously discover and evidence
collected tool events silently and only flushed them in the final
LLM call summary, making the agentic loop invisible in real time.
Bump 0.1.7 -> 0.1.8.
* fix(discover): coerce hallucinated datasource UIDs + harden prompt
The discover agent consistently passed datasourceUid: "prometheus" (a
hallucinated short name) instead of the real UID from the pre-injected
hints block. The MCP server rejected it with "Tool input validation"
errors, burning 2-4 retries per discovery run.
Two-layer fix:
1. wrapToolsWithCallbacks now intercepts hallucinated short names via a
pre-built Map<shortName, realUid> and substitutes the real UID before
the MCP call. Logs at info level when substitution fires so we can
count hallucinations and track whether the prompt fix reduces them.
2. The discover agent's system prompt now renders datasource UIDs in a
dedicated "CRITICAL: non-negotiable" block, separated from the recipe
hints that previously said "these are suggestions".
Extended the coercion map through WorkflowConfig to evidence phases and
anomaly detection for consistency.
Bump 0.1.8 -> 0.1.9.
* fix(discover): truncated JSON recovery + queryType coercion + retry UI
Three discovery improvements found via 0.1.8/0.1.9 per-tool-call logs:
1. Response truncation: max_tokens 16384 was too small for 71-service
environments. Bumped to 32768 and added recoverTruncatedJsonArray
to safeJsonParse — finds the last complete } in a truncated [{...},
{... array and closes it. Strategy runs before single-object
extraction so we don't pick up a stray inner object.
2. queryType: null coercion in coercePrometheusArgs — defaults to
"instant", eliminating 1 wasted tool call per discovery attempt.
3. Retry visibility: new discover:retry WebSocket event + UI indicator
showing "Attempt 2 of 3" so the user knows it's retrying, not stuck.
Also added parse failure diagnostics: logs response length + first/last
200 chars so truncation is diagnosable from a single log line.
Bump 0.1.9 -> 0.1.10.
* feat(server): log version on startup
Reads package.json version and includes it in the startup log line:
"dops-assistant v0.1.10 web server running on port 3000"
Structured fields include both port and version for log queries.
* fix(web): reset discovery state on Run Discovery click
When clicking Run Discovery after a previous run, the UI showed
"Validating services..." (stale phase from prior run) instead of
"Discovering services...". The onStartDiscovery handler only sent the
WS message without resetting discoveryState. Now resets to phase
"discovery" + status "running" before sending, so the UI immediately
shows the correct initial state.
* feat(web): add shimmer animation to discovery progress bar
The progress bar was static between tool calls, making the UI look
stuck during LLM thinking phases. Added a shimmer gradient overlay
that animates continuously, giving visual feedback that the process
is still active even when no iteration events are firing.
* fix(discover): cap tool results at 30k chars to prevent token exhaustion
k8s_pods_list returns 83k chars of raw pod data. With max_tokens=32768,
the LLM exhausts its entire token budget digesting the pod list and
produces a 0-char response (32938 output tokens, 0 chars text).
Added maxToolResultChars param to wrapToolsWithCallbacks. Discovery
tools are now capped at 30k chars with a truncation note. Moved
truncateMcpResult to tool-utils.ts (was duplicated in agents.ts).
This cuts discovery input tokens from ~90k to ~40k and prevents the
0-char exhaustion failure mode.
* fix(discover): prompt hint to avoid unfiltered pod lists
Replace the 30k tool result cap with a prompt instruction telling the
agent to prefer deployments/statefulsets or metric queries over raw pod
lists. Raw k8s_pods_list returns 83k+ chars and exhausts the token
budget. The prompt now explicitly tells the agent to use fieldSelector
or labelSelector if it must list pods, and to prefer pre-aggregated
metric queries instead.
Kept truncateMcpResult in tool-utils.ts as a shared utility but removed
the discovery-specific cap — the prompt hint is the primary fix.
* fix(agents): tighten k8s and log query guidance to reduce token waste
infra.ts: tell the agent to pass labelSelector/fieldSelector directly
in tool calls instead of fetching all pods/events then filtering
in-memory. An unfiltered k8s_pods_list returns 83k+ chars.
logs.ts: clarify that "without filter" means without pattern filter,
not without the service label selector. Added explicit line limits
(100 for error context, 10 for existence check).
discover.ts: fix blocklist test — use generic "metric or log query
tool" instead of hardcoded tool names in the datasource UID block.
* fix(discover): use filtered pod list as second pass, not skip it entirely
The previous prompt told the agent to prefer metrics over pod lists,
which caused it to skip k8s_pods_list entirely. This lost 38 of 47
services — container-level services (celery workers, sidecars,
multi-container pods) whose names differ from their deployment name
are invisible to kube_deployment_status_replicas metrics.
Now: metrics first for the initial sweep, then a filtered pod list
(excluding kube-system/kube-public/kube-node-lease) as a second pass
to catch container-level services.
* fix: consul skill exhaustive discovery + stackId in poller logs
consul-bare-metal-discovery.md: changed "Common bare-metal services"
to "Important: discover ALL Consul services, not just known ones".
The agent was treating the example list as exhaustive and skipping
the consul_catalog query, missing 6 sub-components (hdfs-journalnode,
hdfs-zkfc, hive-metastore, impala-catalog, kudu-tserver, fazbdregistry).
service-health-poller.ts: added stackId to all poll log lines (tool
not found, query result, empty query result, poll complete). Previously
impossible to tell which stack's poller was failing.
* fix(test): add buildDatasourceUidMap to ws-handler mock
Merged
5 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Ships v0.1.4. Two themes in one branch:
LLM observability (the enabling work)
discover.tshardcodingtoolCalls: []in its log recordobservability.port/logLevelconfig fieldsSix production fixes surfaced by the new observability
replicas = 0→unknown(scaled down),up = 0→down(real scrape failure). Stops the 5 false-positive auto-investigations we were seeing on every cold start.trust proxy— configurable viaTRUST_PROXY_HOPSenv var (default1). FixesERR_ERL_UNEXPECTED_X_FORWARDED_FORand broken per-client rate limiting behind k8s ingress. Closes TODO feat: service detail page, sidebar navigation, and page reorganization #36.container/pod/app), not kube-state-metrics labels. The{statefulset: yb-master}label was wasting 5-6 Loki fallback queries per investigation..ZRFC3339 normalization — repairs.Z, dangling dot, and missing-Z variants that LLMs consistently produce.stepSeconds— LLMs pass it as a Prometheus leftover; Loki rejects it.Plus a red-square favicon so the tab is identifiable.
Test Coverage
Tests: 73 → 80 (+7 new files, 96 new test cases).
Pre-Landing Review
1 informational:
coercePrometheusArgsdefaults assumequeryType: instant— latent for range queries (not triggered in practice). Skipped as tracked-for-future.Design Review
Scope: 1 frontend file (
src/web/index.htmlfavicon — 1 line). Skipped.Adversarial Review
Both Claude subagent and Codex ran. 7 findings across the two reviewers, 2 cross-model confirmed:
TRUST_PROXY_HOPSenv-var configurable with safety notesdesired vs readycomparison).Zregex narrowness → FIXED: also handles dangling dot and missing Zdiscover.tsunbounded argsStr → FIXED: sliced to 500 chars, wrapped in try/catchstepSecondssilent break if tool adds it → monitor (tracked in plan failure modes)logLlmCallStart/FirstChunkskews on throw → minor observability edge caseCodex structured review: GATE PASS (no [P1] markers).
PR Quality Score: 9/10.
Plan Completion
Plan:
/Users/wli02/.claude/plans/moonlit-nibbling-star.md(as amended by /plan-eng-review).Deferred / known trade-offs
desired > 0, ready = 0workloads. Captured in plan's "NOT in scope" section.TRUST_PROXY_HOPStopology — default1is correct for single k8s ingress withuse-forwarded-headers: false. Deployments with a CDN/service mesh/ingress-pass-through need to set this explicitly. Documented inline and in Helm values.Verification (after redeploy)
summary.downshould drop from ~5 to the count of services with realup = 0targetsERR_ERL_UNEXPECTED_X_FORWARDED_FORon WS connects.Zparse errors in Loki, nostepSeconds: 300in Loki args, notoolCallCount: 0in discoverTest plan
tsc --noEmitcleanwliftnt/dops-assistant:0.1.4built and pushed (digestsha256:90c8fc80…)🤖 Generated with Claude Code