Skip to content

fix(scan+discovery): four P1/P2 quick wins — terminal-emit, NaN log, budget warn, webhook hint#133

Merged
WZ merged 3 commits into
mainfrom
fix/quick-wins-p1-p2
Apr 24, 2026
Merged

fix(scan+discovery): four P1/P2 quick wins — terminal-emit, NaN log, budget warn, webhook hint#133
WZ merged 3 commits into
mainfrom
fix/quick-wins-p1-p2

Conversation

@WZ
Copy link
Copy Markdown
Owner

@WZ WZ commented Apr 24, 2026

Summary

Batch of four low-risk items from docs/TODOS.md (P1 and P2 S/XS effort) landed together.

  • AP2 (P1) — src/workflows/discovery.ts: wrap the workflow in try/finally so the caller always receives a terminal phase. A validation-step stall/throw now falls back to the unverified discovered services (tagged confidence: "unverified" with a validationNotes explanation) instead of losing them to a thrown promise. Closes the "declared likely fixed but never re-verified" gap from QA Issue fix: investigation robustness — 6/10 to 10/10 conclusive RCA #7.
  • AP4 (P2) — src/server/anomaly-probe.ts: log at INFO when a rule scores NaN (empty vector, Loki datasource unwired, MCP timeout/error). Aggregate tick counts already existed; this adds the per-rule line operators need to find the specific broken rule.
  • AP6 (P2) — src/server/anomaly-probe.ts: WARN when a tick generates more than 200 queries. Observability only — not a hard block. Catches cardinality growth (services × per-service rules) before it floods the metrics backend.
  • AP9 (P2) — src/server/webhook-handler.ts: expand the 503 hint from "Set webhook.secret in config.yaml and restart" to name the config section and explicitly say "restart the server" so operators can act without a support loop.

All four changes are covered by unit tests alongside the affected files.

Test plan

  • npx vitest run src/workflows/discovery.test.ts — 12/12 pass (adds 2 AP2 tests)
  • npx vitest run src/server/anomaly-probe.test.ts — 49/49 pass (adds 1 AP4 + 2 AP6 tests)
  • npx vitest run src/server/webhook-handler.test.ts — 11/11 pass (AP9 regex tightened)
  • npx tsc --noEmit — clean
  • Full npx vitest run — 1649/1650 pass; the single failure is a known-flaky rate-limit timing test (rate-limit.test.ts:19) that passes in isolation and is unrelated to this diff (last touched PR feat(llm-logger): call-start + first-chunk logging, fix discover tool tracking #85)

Notes

  • No behaviour change for the happy path — the try/finally only activates on validation failure, and the terminal "complete" phase is emitted at the same logical point the function used to return.
  • The AP4 NaN log fires once per broken rule per tick. At 200 rules with all empty, that's 200 INFO lines per tick; at the default scan cadence this is bounded and actionable.
  • The AP9 hint text diverges slightly from the TODO wording ("under the scan section") because webhook is actually a top-level config section, not nested under scan.

🤖 Generated with Claude Code

WZ added 3 commits April 24, 2026 02:06
…ng, budget warn, webhook hint

AP2 — src/workflows/discovery.ts: wrap the workflow body in try/finally so
the caller always receives a terminal phase ("complete" / "complete-empty"
/ "complete-validation-failed" / "complete-failed") even when the
validation step stalls or throws mid-flow. If validation fails, fall back
to the unverified discovered services tagged with confidence "unverified"
and a validationNotes explaining the fallback, so services.yaml is still
written instead of being lost to a thrown promise. Covers the "declared
likely fixed but never re-verified" gap from QA Issue #7.

AP4 — src/server/anomaly-probe.ts: log at INFO when a rule scores NaN
(empty vector, Loki datasource unwired, MCP timeout/error). The tick
summary already counts probeErrors / queriesEmpty in aggregate; this
per-rule line is what operators need to locate the specific broken rule.

AP6 — src/server/anomaly-probe.ts: WARN at the end of task build when a
tick generates more than 200 queries. Not a hard block; observability
only so operators notice cardinality growth (many services × many
per-service rules) before it floods the metrics backend.

AP9 — src/server/webhook-handler.ts: expand the 503 hint from "Set
webhook.secret in config.yaml and restart" to name the config section
and explicitly mention restarting the server, so operators hitting this
code path can act without a support loop.

All four changes are covered by unit tests alongside the affected files.
Follow-ups from /review on this branch:

- discovery.ts: factor terminal phase strings into an exported
  TERMINAL_DISCOVERY_PHASES union/const so the WS bridge no longer
  relies on a `startsWith("complete")` string-prefix convention. Move
  the "complete-failed" assignment into an explicit outer catch (was
  an initializer default). Pass the Error object (not message) to the
  validation-failure WARN so the stack survives. Clarify the
  validationNotes text ("validation did not complete") so it reads
  correctly on the throw-after-start path.
- ws-handler.ts: swap `phase.startsWith("complete")` for the shared
  TERMINAL_DISCOVERY_PHASES membership check.
- anomaly-probe.ts: tighten AP6 threshold comment (drop arithmetic-
  wrong citation) and trim the AP4 NaN-log comment (remove
  self-contradicting abort paragraph).
- tests: add a beforeEach reset for loggerCalls so future assertions
  don't inherit prior-test noise; drop the dead "query budget"
  predicate clause; add a boundary test at tasks.length === 200; add
  a kind="error" path test for the AP4 NaN log; assert the
  confidence="unverified" + validationNotes contract on the AP2
  validation-throws fallback.
QA turned up a duplicate 503 response in `src/server/index.ts:438`
registered when no webhook.secret is configured at all (vs the "secret
set but wrong bearer" path the initial AP9 fix touched in
webhook-handler.ts). The two bodies had drifted already; the fallback
stub was still emitting the old terse hint.

Extract WEBHOOK_NOT_CONFIGURED_BODY as a shared constant exported from
webhook-handler.ts so both routes render the same operator-facing text.
Existing webhook-handler.test.ts regex coverage still guards the
contract; the index.ts stub now imports the constant directly so it
cannot drift without the regex failing too.
@WZ WZ merged commit 7e5ad94 into main Apr 24, 2026
2 checks passed
@WZ WZ deleted the fix/quick-wins-p1-p2 branch April 25, 2026 03:28
WZ added a commit that referenced this pull request Apr 25, 2026
…budget warn, webhook hint (#133)

* fix(scan+discovery): four P1/P2 quick wins — terminal-emit, NaN logging, budget warn, webhook hint

AP2 — src/workflows/discovery.ts: wrap the workflow body in try/finally so
the caller always receives a terminal phase ("complete" / "complete-empty"
/ "complete-validation-failed" / "complete-failed") even when the
validation step stalls or throws mid-flow. If validation fails, fall back
to the unverified discovered services tagged with confidence "unverified"
and a validationNotes explaining the fallback, so services.yaml is still
written instead of being lost to a thrown promise. Covers the "declared
likely fixed but never re-verified" gap from QA Issue #7.

AP4 — src/server/anomaly-probe.ts: log at INFO when a rule scores NaN
(empty vector, Loki datasource unwired, MCP timeout/error). The tick
summary already counts probeErrors / queriesEmpty in aggregate; this
per-rule line is what operators need to locate the specific broken rule.

AP6 — src/server/anomaly-probe.ts: WARN at the end of task build when a
tick generates more than 200 queries. Not a hard block; observability
only so operators notice cardinality growth (many services × many
per-service rules) before it floods the metrics backend.

AP9 — src/server/webhook-handler.ts: expand the 503 hint from "Set
webhook.secret in config.yaml and restart" to name the config section
and explicitly mention restarting the server, so operators hitting this
code path can act without a support loop.

All four changes are covered by unit tests alongside the affected files.

* refactor: address /review findings on quick-wins PR

Follow-ups from /review on this branch:

- discovery.ts: factor terminal phase strings into an exported
  TERMINAL_DISCOVERY_PHASES union/const so the WS bridge no longer
  relies on a `startsWith("complete")` string-prefix convention. Move
  the "complete-failed" assignment into an explicit outer catch (was
  an initializer default). Pass the Error object (not message) to the
  validation-failure WARN so the stack survives. Clarify the
  validationNotes text ("validation did not complete") so it reads
  correctly on the throw-after-start path.
- ws-handler.ts: swap `phase.startsWith("complete")` for the shared
  TERMINAL_DISCOVERY_PHASES membership check.
- anomaly-probe.ts: tighten AP6 threshold comment (drop arithmetic-
  wrong citation) and trim the AP4 NaN-log comment (remove
  self-contradicting abort paragraph).
- tests: add a beforeEach reset for loggerCalls so future assertions
  don't inherit prior-test noise; drop the dead "query budget"
  predicate clause; add a boundary test at tasks.length === 200; add
  a kind="error" path test for the AP4 NaN log; assert the
  confidence="unverified" + validationNotes contract on the AP2
  validation-throws fallback.

* fix(webhook): apply AP9 hint text to the unconfigured-fallback route too

QA turned up a duplicate 503 response in `src/server/index.ts:438`
registered when no webhook.secret is configured at all (vs the "secret
set but wrong bearer" path the initial AP9 fix touched in
webhook-handler.ts). The two bodies had drifted already; the fallback
stub was still emitting the old terse hint.

Extract WEBHOOK_NOT_CONFIGURED_BODY as a shared constant exported from
webhook-handler.ts so both routes render the same operator-facing text.
Existing webhook-handler.test.ts regex coverage still guards the
contract; the index.ts stub now imports the constant directly so it
cannot drift without the regex failing too.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant