feat(scan): k8s-native default probe rules by WZ · Pull Request #115 · WZ/dops-assistant

WZ · 2026-04-22T07:49:26Z

Stacked on #114. Merge chain: #108 → #112 → #113 → #114 → this.

What this PR does

Replaces the three default probe rules with k8s-native equivalents. The old defaults used service="{service}" label selectors that don't exist on most k8s Prometheus setups. Confirmed live: all three old defaults returned rawResultCount: 0 on every tested service.

Smoke-test findings that drove this

Full writeup in docs/plans/2026-04-22-scan-smoke-test-findings.md (local only). Highlights:

| Default rule (old) | rawResultCount against real Prometheus |
|---|---|
| `up{service="stolon-proxy"}` | 0 (no `service=` label in this stack) |
| `http_requests_total{service="stolon-proxy"}` | 0 (same) |
| `http_request_duration_seconds_bucket{service="stolon-proxy"}` | 0 (same) |

| New default rule | Tested value |
|---|---|
| `kube_deployment_status_replicas_available{deployment="stolon-proxy"}` | 2 ✓ |
| `kube_statefulset_status_replicas_ready{statefulset="..."}` | works per-service |
| `kube_daemonset_status_number_ready{daemonset="..."}` | works per-service |

The existing service-health-poller matches by the same label precedence (service-health-poller.ts:162-166 — deployment → statefulset → daemonset → job → instance). This PR brings the scan probe to the same schema.
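The precedence order above can be sketched as a first-match lookup. This is an illustration only: `LabelSet`, `LABEL_PRECEDENCE`, and `matchWorkloadLabel` are hypothetical names, not the code at service-health-poller.ts:162-166.

```typescript
// Hypothetical sketch: names are illustrative, not taken from service-health-poller.ts.
type LabelSet = Record<string, string>;

// Precedence order quoted in the PR description.
const LABEL_PRECEDENCE = ["deployment", "statefulset", "daemonset", "job", "instance"] as const;
type WorkloadLabel = (typeof LABEL_PRECEDENCE)[number];

// Returns the first label key, in precedence order, whose value equals the service name.
function matchWorkloadLabel(labels: LabelSet, service: string): WorkloadLabel | undefined {
  return LABEL_PRECEDENCE.find((key) => labels[key] === service);
}

// A series carrying both labels resolves to the higher-precedence one.
const picked = matchWorkloadLabel(
  { deployment: "stolon-proxy", instance: "stolon-proxy" },
  "stolon-proxy",
);
```

A series that only carries a `service=` label matches nothing here, which mirrors why the old defaults returned no results on this stack.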

End-to-end dispatch was proven working: 3 scan-triggered investigations ran to completion, were persisted, and were scored via rca-eval --source scan. They averaged 50/100 on the synthetic trigger, which is expected: this is the "silent degradation" failure mode the design doc flags when there is no real evidence to investigate.

Changes

src/config/schema.ts:

- 3 new default rules, one per k8s workload type (Deployment / StatefulSet / DaemonSet).
- consecutiveTicks: 3 across the board. Rolling deploys briefly drop _available below desired; on the 4h cron a rule trips on the third consecutive breached tick (t=0, 4h, 8h), an 8h filter window that is long enough to ignore rollout churn and short enough to catch real outages on the next cadence.
- Dropped the old error_rate and latency_p99 defaults. Both assumed labeled application-level HTTP metrics (http_requests_total / http_request_duration_seconds_bucket) that are too environment-specific. Operators with those metrics add them via Settings → Scan → Rules (Step 3 UI) or a config.yaml override.
- Comment above the default block explains the reasoning and points to the smoke-test findings.
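The consecutiveTicks arithmetic can be checked with a small counter. This is a hypothetical sketch, not the project's probe code; it assumes ticks land at t=0, 4h, 8h, which is the timing the later review fix in this PR settles on (trip on the third breached tick, 8h after the first).

```typescript
// Hypothetical sketch of consecutive-tick hysteresis; not the project's actual probe code.
const CRON_INTERVAL_HOURS = 4; // default scan cadence stated in the PR
const CONSECUTIVE_TICKS = 3; // new default for all three rules

// Returns the index of the tick on which the rule trips, or undefined if it never does.
// Any healthy tick resets the streak.
function firstTripTick(breaches: boolean[], required: number): number | undefined {
  let streak = 0;
  for (let tick = 0; tick < breaches.length; tick++) {
    streak = breaches[tick] ? streak + 1 : 0;
    if (streak >= required) return tick;
  }
  return undefined;
}

// A rollout that dips, recovers, and dips again never reaches three in a row.
const rolloutTrip = firstTripTick([true, false, true, true, false], CONSECUTIVE_TICKS);

// A sustained outage trips on the third tick (index 2), 8h after the first breach.
const outageTrip = firstTripTick([true, true, true], CONSECUTIVE_TICKS);
const hoursToTrip = outageTrip === undefined ? undefined : outageTrip * CRON_INTERVAL_HOURS;
```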

src/config/schema.test.ts:

- Assertion updated to match new defaults.
- Added invariant check: all defaults must include the {service} placeholder (regression guard against someone accidentally committing a rule that can't substitute per-service).
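A minimal sketch of what such an invariant could look like, written without a test framework so it runs standalone. The `ProbeRule` shape and `rulesMissingPlaceholder` helper are assumptions for illustration; the rule names and metric names come from this PR, and the threshold is assumed to be stored separately from the query.

```typescript
// Hypothetical rule shape; the real schema lives in src/config/schema.ts.
interface ProbeRule {
  name: string;
  query: string;
}

// Rule and metric names as listed in this PR (threshold assumed to live in a separate field).
const DEFAULT_RULES: ProbeRule[] = [
  { name: "deployment_availability", query: 'kube_deployment_status_replicas_available{deployment="{service}"}' },
  { name: "statefulset_availability", query: 'kube_statefulset_status_replicas_ready{statefulset="{service}"}' },
  { name: "daemonset_availability", query: 'kube_daemonset_status_number_ready{daemonset="{service}"}' },
];

// Returns the names of rules that cannot be substituted per-service.
function rulesMissingPlaceholder(rules: ProbeRule[]): string[] {
  return rules.filter((rule) => !rule.query.includes("{service}")).map((rule) => rule.name);
}

const offenders = rulesMissingPlaceholder(DEFAULT_RULES);
```

Returning the offending names rather than a boolean keeps the failure message useful when someone adds a hard-coded rule.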

What this doesn't do

- Doesn't cover non-k8s environments. Bare-metal, ECS, and Nomad stacks will need custom rules via the GUI or a config.yaml override. Documented in the schema comment.
- Doesn't re-add error_rate / latency_p99. These need labels that only exist in some stacks. Adding them back as defaults would just reproduce the same rawResultCount: 0 problem for most users. Left for operators to add via the GUI.

Test plan

- Type check + full suite: 1296/1296 passing on branch
- Fresh install with no scan.probe.metrics in config.yaml → gets the 3 k8s defaults
- Existing users with source: "gui" on rules → unaffected (DB override wins over config default, same pattern as scan.enabled etc.)
- Existing users with custom scan.probe.metrics in config.yaml → unaffected (explicit override)
- On a k8s Prometheus with kube-state-metrics: POST /api/scan/rules/test returns rawResultCount: 1 and a numeric value for at least one service per rule-type
- On a non-k8s environment: rawResultCount: 0 on all three defaults → operator follows the GUI override path
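The three "who sees the new defaults" cases in the test plan reduce to a precedence lookup. The sketch below is an assumption about shape only: `resolveProbeRules` and the `"config"` / `"default"` source tags are hypothetical, and only the `"gui"` tag and the DB-over-config ordering come from the PR text.

```typescript
// Hypothetical source tags; only "gui" appears in the PR description.
type RuleSource = "gui" | "config" | "default";

interface ResolvedRules {
  source: RuleSource;
  rules: string[];
}

// DB/GUI override wins, then an explicit config.yaml list, then the built-in defaults.
function resolveProbeRules(
  dbOverride: string[] | undefined,
  configYaml: string[] | undefined,
  builtInDefaults: string[],
): ResolvedRules {
  if (dbOverride !== undefined) return { source: "gui", rules: dbOverride };
  if (configYaml !== undefined) return { source: "config", rules: configYaml };
  return { source: "default", rules: builtInDefaults };
}

const k8sDefaults = ["deployment_availability", "statefulset_availability", "daemonset_availability"];

// Fresh install: nothing overridden, so the three k8s defaults apply.
const fresh = resolveProbeRules(undefined, undefined, k8sDefaults);

// Existing GUI user: unaffected by the changed defaults.
const guiUser = resolveProbeRules(["custom_rule"], undefined, k8sDefaults);
```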

Rollout

Zero runtime behavior change on merge for anyone who already has scan.enabled: true — they presumably have working rules already (either config.yaml override or DB/GUI override). Only config.yaml-default users see different defaults on next reload.

Replace the three default probe rules in config.scan.probe.metrics.

The old defaults used 'service="{service}"' label selectors, which don't exist on most k8s Prometheus setups (kube-state-metrics labels are deployment= / statefulset= / daemonset=). Confirmed by smoke test 2026-04-22: all three old defaults returned rawResultCount:0 for every tested service in a live k8s environment.

New defaults, one per workload type:

    deployment_availability   kube_deployment_status_replicas_available{deployment="{service}"} < 1
    statefulset_availability  kube_statefulset_status_replicas_ready{statefulset="{service}"} < 1
    daemonset_availability    kube_daemonset_status_number_ready{daemonset="{service}"} < 1

Each service matches exactly one (or none, for services not tracked by kube-state-metrics). Non-matching rules return an empty vector and score 0 harmlessly. Matches the label precedence the existing service-health-poller already uses (service-health-poller.ts:162-166).

Dropped the old error_rate + latency_p99 defaults. Both assume labeled application-level HTTP metrics (http_requests_total, http_request_duration_seconds_bucket), which are too environment-specific to be useful defaults. Operators with those metrics add them via Settings → Scan → Rules (Step 3) or a config.yaml override.

consecutiveTicks bumped from 1/2 to 3 across the board. Rolling deploys briefly drop _available below desired; 3 consecutive ticks on the default 4h cron (t=0, 4h, 8h, so a trip 8h after the first breach) is long enough to filter any reasonable rollout window without missing a real incident.

For non-k8s environments: override via Settings → Scan with queries that match your label schema. The GUI rule editor from Step 3 is the supported path for per-environment tuning.

Full test suite 1296/1296 green (schema.test.ts updated to assert the new defaults + {service} placeholder presence + correct label selectors).

Findings writeup: docs/plans/2026-04-22-scan-smoke-test-findings.md (local).

Review-caught regression fix. The three k8s availability defaults would trip on legitimate edge cases the previous iteration missed:

- HPA min=0 / cron-style / paused deployments (spec=0, available=0): would false-positive on `available < 1`.
- Arch-mismatched daemonsets (e.g. kube-flannel-ds-arm on amd64-only nodes): observed firing in the 2026-04-22 smoke test because desired_number_scheduled=0 and number_ready=0 both satisfy `< 1`.
- StatefulSets scaled to 0 for maintenance: same pattern.

Fix: every default query is now ANDed against a "desired/spec > 0" guard. When the workload is intentionally at zero, the guard's right side returns empty, the `and` eliminates the left side, and the query's total result is empty, which the probe scores as NaN → no trip. When the guard is true, the left side's value flows through unchanged for threshold evaluation.

PromQL changes (one per rule):

    deployment:  + and kube_deployment_spec_replicas{deployment="..."} > 0
    statefulset: + and kube_statefulset_replicas{statefulset="..."} > 0
    daemonset:   + and kube_daemonset_status_desired_number_scheduled{daemonset="..."} > 0

Tested against live Prometheus: `stolon-proxy` (healthy, spec=2, avail=2) still returns value=2 with the guarded query. rawResultCount=1.

Also: comment math fixed. Previous comment claimed "3 ticks × 4h = 12h" for the hysteresis window. Correct math is 8h (ticks at t=0, t=4h, t=8h = 3 consecutive breaches = trip on the 3rd). Updated.

Test additions: regression assertions in schema.test.ts that lock the guard clauses into the default rule schema. Any future change that removes the `> 0` guard or the `kube_*_spec_*` / `_desired_*` metric references will fail the suite. 1296/1296 green.

Net effect: operators running the default rules will get cleaner, false-positive-free scans at the cost of NOT detecting "service intended to run but spec=0 config drift." That's a config-management issue, not a runtime anomaly, so it is out of scope for the scan probe.
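The guard composition and the "empty result scores NaN, NaN never trips" behaviour can be sketched as follows. Metric names are the ones listed in this commit; `guardedQuery`, `trips`, and the decision to keep the `< 1` threshold outside the query string are assumptions for illustration, not the project's code. In PromQL, comparison operators bind tighter than `and`, so `a and b > 0` reads as `a and (b > 0)`.

```typescript
// Metric names are from the PR; the helper names and shapes are hypothetical.
const WORKLOAD_METRICS = {
  deployment: { ready: "kube_deployment_status_replicas_available", desired: "kube_deployment_spec_replicas" },
  statefulset: { ready: "kube_statefulset_status_replicas_ready", desired: "kube_statefulset_replicas" },
  daemonset: { ready: "kube_daemonset_status_number_ready", desired: "kube_daemonset_status_desired_number_scheduled" },
} as const;
type Workload = keyof typeof WORKLOAD_METRICS;

// Builds "<ready>{kind="svc"} and <desired>{kind="svc"} > 0".
function guardedQuery(kind: Workload, service: string): string {
  const { ready, desired } = WORKLOAD_METRICS[kind];
  const selector = `{${kind}="${service}"}`;
  return `${ready}${selector} and ${desired}${selector} > 0`;
}

// An empty result vector is scored as NaN; NaN < threshold is always false, so no trip.
function trips(samples: number[], threshold: number): boolean {
  const value = samples[0] ?? NaN;
  return value < threshold;
}

const query = guardedQuery("deployment", "stolon-proxy");
const scaledToZero = trips([], 1); // guard filtered everything out
const realOutage = trips([0], 1); // desired > 0 but nothing available
const healthy = trips([2], 1); // stolon-proxy case from the smoke test
```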

Adds a quality gate to rca-eval. `--min-score N` exits non-zero when the average total score falls below N, so operators can gate pre-ship checks on RCA quality instead of eyeballing a table.

Example usage:

    npx tsx src/eval/rca-eval.ts --min-score 75
    npm run test:rca-eval   # min-score 20 (interim floor)

Interim floor is 20 because current avg is 24/100. The real issue is that 77% of reports persist as empty stubs (0/0/0/0/0), not that non-empty reports score poorly. Raise the floor once reliability work lands.

CI wiring intentionally deferred: ci.yml runs without dops.sqlite, so a fixture DB is needed before this can become a hard CI check.

Surfaced by /autoplan CEO review: the RCA eval harness had no programmatic gate, so quality regressions (like the current 67→24 drop) had no way to fail a build.
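The gate logic described above is small enough to sketch. `gateExitCode` is a hypothetical helper, not the code in src/eval/rca-eval.ts, and treating an empty score list as a failure is an assumption the commit message does not address.

```typescript
// Hypothetical sketch of a --min-score gate; not the actual rca-eval implementation.
// Returns the process exit code: 0 when the average meets the floor, 1 otherwise.
function gateExitCode(totalScores: number[], minScore: number): number {
  // Assumption: with nothing to score, fail closed rather than pass silently.
  if (totalScores.length === 0) return 1;
  const average = totalScores.reduce((sum, score) => sum + score, 0) / totalScores.length;
  return average < minScore ? 1 : 0;
}

// Three empty stubs and one strong report average 24, matching the figure in the commit.
const scores = [0, 0, 0, 96];
const interimGate = gateExitCode(scores, 20); // passes the interim floor
const targetGate = gateExitCode(scores, 75); // fails the eventual target
```

A caller would pass the returned value to the process exit so that CI or an npm script sees the failure.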

* feat(scan): Slice A - schema + registry shape for discovery-owned probe rules

Prepares the scan pipeline for discovery-written probe rules. Schema adds the per-rule `source` discriminator ("metrics" | "logs") and the per-service `probeRules` field; registry persists a new top-level object shape {services, globalProbeRules} with forward-compat reads of legacy flat-array files.

No runtime behavior change. Slice B (discovery prompt) and Slice C (four-track probe evaluator) build on this.

Schema (src/config/schema.ts):

- ProbeMetricRuleSchema gains `source: z.enum(["metrics","logs"]).default("metrics")` so the probe knows which MCP tool role to dispatch against. Default is "metrics", so every existing config and #115 k8s-native rule continues to route through the Prometheus tool unchanged.
- ServiceSchema gains `probeRules: z.array(ProbeMetricRuleSchema).optional().default([])`. Populated by discovery in Slice B; empty today.
- ProbeSchema gains `logsQueryTimeoutMs: z.number().default(10_000)`. Loki `count_over_time` over a 15m window regularly takes 5-20s; reusing the 3s Prometheus timeout would produce silent false negatives. Eng-review decision 2026-04-22.
- ProbeMetricRuleSchema + ThresholdSchema moved above ServiceSchema so the probeRules field can reference them (Zod requires forward declaration).

Registry (src/services/registry.ts):

- Persisted shape inverted from ServiceConfig[] to RegistryFile {services, globalProbeRules}.
- Forward-compat reader: legacy flat-array services.yaml files parse into {services: parsed, globalProbeRules: []} without migration. The first write upgrades the file on disk. Operators on old files continue to load.
- Public API preserved: load()/save()/getVersion()/rollback()/listVersions() keep their signatures. No call site in routes.ts, agents.ts, scan-scheduler.ts, or service-health-poller.ts had to change.
- save() internally reads current globalProbeRules and carries them forward on every write, so routes.ts metadata edits, service renames, and rollback cannot silently clobber the discovery-written top-level rules. Eng-review decision 2026-04-22 (silent-clobber guard).
- rollback() preserves CURRENT globalProbeRules (not historic ones). Historic snapshots from before Slice A have globalProbeRules=[]; rolling back to them would wipe the live rules.
- New methods: loadGlobalRules(), saveGlobalRules(rules, source), loadAll() (atomic combined snapshot for the scan probe's per-tick read), saveAll(file, source) (atomic combined write, preferred for discovery, produces one version entry instead of two), getVersionFile(id) (historic snapshot including globals-as-of-that-version).

GUI validator (src/server/scan-rule-validator.ts):

- RuleSchema gains `source: z.enum(["metrics","logs"]).default("metrics")` so GUI-authored rules parse into the canonical ProbeMetricRule shape. Log-source rules come from discovery in Slice B; the GUI editor stays metrics-only for v1.

Discovery types (src/types/discovery-types.ts):

- ValidatedServiceConfigSchema and ServiceRegistryVersionSchema pre-thread probeRules so Slice B can populate the field without a second type-level change. Empty default today.

Tests (10 new, 3 updated):

- schema.test.ts: source defaults, log-source accept, invalid source reject, logsQueryTimeoutMs default + validation, per-service probeRules parse.
- registry.test.ts: legacy-shape read, saveGlobalRules round-trip, save()-preserves-globals (silent-clobber guard), rollback()-preserves-current-globals, saveAll() atomic write (one version entry), forward-compat loadAll() on legacy files, first-save upgrades on-disk shape.
- scan-service-override.test.ts + scan-settings.test.ts: two shape assertions now include `source: "metrics"` where the validator defaults it.

Full suite: 1381/1381 pass in isolation per file.
One rate-limit test flakes under parallel load, passes standalone; not caused by this change.

Design doc: ~/.gstack/projects/WZ-dops-assistant/wli02-feat-llm-driven-probe-rules-design-20260422-continuation.md

* feat(discover): Slice B - LLM writes probe rules + discover-eval harness

Teaches the discovery agent to emit per-service probeRules and top-level globalProbeRules, then plumbs them through validate → accept → registry so they land in services.yaml. Adds discover-eval as a quality gate so a prompt regression that stops emitting rules surfaces before it ships.

No runtime behavior change for the probe yet. Slice C (four-track probe evaluator) is what actually reads the rules. This slice is data-producing only: the fields show up in services.yaml, the probe still runs against the hardcoded config.yaml defaults.

Prompt changes (src/agents/discover.ts):

- Output shape: JSON OBJECT {services, globalProbeRules} instead of a bare services array. Bare-array form stays accepted for backward-compat with stacks that don't produce globals.
- Per-service probeRules guidance: the agent writes `pod_restarts` for k8s workloads with a resolvable namespace, and `log_errors` for services with non-empty logLabels. Both are spec'd with canonical names, selectors, and thresholds so output stays consistent run to run.
- globalProbeRules guidance: label-key introspection (majority-wins on app / service / job / deployment / statefulset / daemonset). Rewrites availability rules using the correct key for the stack. When the stack matches the hardcoded k8s defaults, globalProbeRules stays empty (no duplication).

Data flow (src/workflows/steps/discover.ts, src/workflows/discovery.ts, src/server/agents.ts, src/server/ws-handler.ts, src/cli/commands/discover.tsx):

- runDiscoverStep returns `{services, globalProbeRules}`; both legacy bare-array and new object-form agent outputs parse correctly.
- runDiscovery threads the pair through validation.
- IDiscoverAgent.discover() returns DiscoveryResult; accept() takes an optional third globalProbeRules arg.
- MastraDiscoverAdapter.accept() routes to registryStore.saveAll() when globals are provided (one atomic version entry), else falls through to save() which preserves the file's current globals (silent-clobber guard from Slice A). Exactly the API shape Slice A's registry was designed to accept.
- ws-handler caches the full DiscoveryResult in pendingDiscovery so the discover:accept handler can pull globalProbeRules server-side; the UI only echoes services over the wire.
- cli discover.tsx holds globalProbeRules in state alongside services and passes both on accept.

Eval harness (src/eval/discover-eval.ts, src/eval/fixtures/discover-k8s-fixture.yaml):

- Four 25-point dimensions, 100 total: globals present, per-service present (partial credit when minority of services have rules), PromQL parses, LogQL parses.
- Lightweight parsers catch the failure modes an LLM actually produces: empty string, unbalanced braces/parens, YOUR_NAMESPACE / <namespace> placeholder tokens. Runtime NaN fail-safe in Slice C catches "parses but returns empty vector".
- Fixture at src/eval/fixtures/discover-k8s-fixture.yaml represents expected well-formed output on an `app=`-keyed k8s stack. Scores 100/100 today; any drift in scoring calibration surfaces as a test failure immediately.
- npm run test:discover-eval gates at --min-score 75 against the fixture. Wire up against a live services.yaml in CI once the agent has run on a real cluster.

Tests:

- discovery.test.ts: new case covering the object-form agent output path; existing cases updated to assert {services, globalProbeRules} shape.
- discover-eval.test.ts (new, 17 tests): PromQL/LogQL parsers, each dimension scorer, top-level eval against fixture + empty + legacy flat-array inputs.

Full suite: 1399/1400. One rate-limit flake under parallel load, passes standalone; same pre-existing one from the Slice A run.
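The "lightweight parsers" in the eval harness notes above check for three failure modes: empty strings, unbalanced delimiters, and leftover placeholder tokens. A minimal sketch of such a check follows. `looksWellFormed` is a hypothetical name, this is not the code in src/eval/discover-eval.ts, and also balancing square brackets (for range selectors like `[15m]`) is an addition beyond what the commit lists.

```typescript
// Placeholder tokens named in the commit message.
const PLACEHOLDER_TOKENS = ["YOUR_NAMESPACE", "<namespace>"];

// Hypothetical shallow check: it does not parse PromQL or LogQL, it only rejects
// the obvious LLM failure modes (empty, placeholder left in, unbalanced delimiters).
function looksWellFormed(query: string): boolean {
  if (query.trim() === "") return false;
  if (PLACEHOLDER_TOKENS.some((token) => query.includes(token))) return false;

  const openerFor: Record<string, string> = { ")": "(", "}": "{", "]": "[" };
  const stack: string[] = [];
  let inString = false;

  for (let i = 0; i < query.length; i++) {
    const ch = query.charAt(i);
    if (inString) {
      if (ch === "\\") i++; // skip the escaped character
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === "(" || ch === "{" || ch === "[") stack.push(ch);
    else if (ch === ")" || ch === "}" || ch === "]") {
      if (stack.pop() !== openerFor[ch]) return false;
    }
  }
  return stack.length === 0 && !inString;
}

const okQuery = looksWellFormed('sum(rate(kube_pod_container_status_restarts_total{namespace="db"}[15m]))');
const unbalanced = looksWellFormed('count_over_time({app="x"}[15m]');
const placeholder = looksWellFormed('up{namespace="YOUR_NAMESPACE"}');
```

A check like this only proves the string is plausibly shaped; a query that parses but matches nothing still has to be caught at runtime, which is the role the commit assigns to the NaN fail-safe.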
Design doc: ~/.gstack/projects/WZ-dops-assistant/wli02-feat-llm-driven-probe-rules-design-20260422-continuation.md

* feat(scan): Slice C - four-track probe evaluator + Loki metric-queryType

Makes the probe READ what Slice B's discovery agent writes. Replaces the single-track config.yaml-only loop with a four-track evaluator that fans out across global rules, per-service rules (metrics + logs), config defaults, and a probe.logs fallback. Hysteresis state keys are origin-namespaced so same-named rules on different tracks track independently.

This is the slice that actually changes runtime behavior. Stacks where discovery has run will now evaluate the discovery-written rules; stacks where it hasn't continue to fall through to PR #115's k8s-native defaults (byte-identical regression-tested).

Probe (src/server/anomaly-probe.ts):

- ProbeOptions gains `registryStore: ServiceRegistryStore` and `lokiDatasourceUid?: string`. The store is read once per tick via `loadAll()` for an atomic {services, globalProbeRules} snapshot, so discovery running mid-tick can't produce a half-state.
- New `RuleOrigin` type ("global" | "service" | "default" | "override" | "logs-fallback"). Threaded into ProbeHit and the state key.
- `stateKey(service, origin, ruleName)` exported. It is namespaced so a global "availability" and a per-service "availability" with the same name share neither hysteresis nor a state-clearing pass.
- Shared `withTimeoutAndAbort(tool, args, signal, ms)` helper extracted from executeInstant. Both metric and log executors use it. DRY win flagged in the code-quality review.
- `executeInstantLogs(...)` calls the Loki MCP tool with `queryType: "metric"` so `count_over_time(...)` returns a scalar (vs `query_loki_logs`'s default which returns log lines). Uses probe.logsQueryTimeoutMs (10s default); Prom's 3s timeout would silently NaN out on slow Loki clusters. NaN on any failure, never throws.
  If the tool rejects queryType:"metric" (older Grafana MCP), the error is logged once via withTimeoutAndAbort and every log-source rule scores NaN for the rest of the tick.
- `findLogQueryTool(tools)` mirrors findMetricQueryTool: prefers `query_loki_logs`, falls back to generic log-query tool names.
- `runProbe` rewritten as four-track evaluator. Per service:
  * Operator override (disabled or rules) wins. Marked origin:"override".
  * Tracks 1 vs 4: globalProbeRules REPLACE config.yaml defaults when non-empty (origin:"global"); otherwise config defaults fire (origin:"default"). Track 4 stays as the ultimate fallback for stacks where discovery has never run.
  * Tracks 2 + 3: per-service probeRules are ADDITIVE on top of the base. Discovery writes only the rules it has unique context for (pod_restarts with real namespace, log_errors with real Loki labels). origin:"service".
  * probe.logs generic fallback fires only when probe.logs.enabled, the service has logLabels, and no per-service log-source rule was written. origin:"logs-fallback". Threshold scales errorRateThreshold by window minutes (raw count = per-min × min).
- Single shared mapWithConcurrency semaphore across all four tracks + fallback, so probe.concurrency caps total in-flight queries per tick.
- Service registry resolution and per-service config lookup is O(1) via a Map built once per tick.

Scheduler (src/server/scan-scheduler.ts):

- Threads registryStore + lokiDatasourceUid through to runProbe.
- New optional dep `getLokiDatasourceUid?: () => string | undefined`. When undefined, log-source rules score NaN and metrics-source rules continue. Wired live so operators can add Loki later without restart.
- resetHysteresisForService updated for 3-part key: switched from `slice(0, lastIndexOf(":")) === service` to `startsWith(service+":")`. The previous parser would have silently failed to clear any Slice-C-format key (it would have compared "svc-a" to "svc-a:global").
  Bug never reached prod; it was caught in this slice's own development.
- resetHysteresisForChangedRules works unchanged: rule name is still after the last colon in the new key format.

Tests (src/server/anomaly-probe.test.ts):

- All 14 existing runProbe call sites updated to pass `registryStore` via a small `fakeRegistryStore({...})` helper. State-key assertions use the new origin-namespaced helper `defaultKey(service, ruleName)` so the change is visible at the test layer too.
- `mockLogsTools` added at module scope; getToolsByRole mock now routes by role so log-source tracks can supply their own tool surface.
- Nine new four-track scenarios:
  * Track 1 replaces track 4 when global rules exist.
  * Track 4 regression: byte-identical to PR #115 when no globals.
  * Track 2 additive: global + per-service both fire.
  * Track 3 calls the LOGS tool (not metrics) with queryType:"metric" and the Loki UID.
  * Track 3 scores NaN with no logs tool wired.
  * probe.logs fallback fires from logLabels when no per-service log-source rule exists.
  * probe.logs fallback does NOT fire when one does.
  * Origin-namespaced state keys: same-named rules on different tracks keep independent counters.
  * Registry snapshot atomicity: loadAll() called exactly once per tick across N services.

Full suite: 1407/1407 (102/102 files green this run, no rate-limit flake). Type check clean.

Design doc: ~/.gstack/projects/WZ-dops-assistant/wli02-feat-llm-driven-probe-rules-design-20260422-continuation.md

* fix(scan): address /review adversarial findings - validate LLM output, preserve globals on empty sweep, GC orphaned hysteresis

Three HIGH-priority findings from the pre-landing /review that both Claude (subagent) and Codex adversarial passes converged on. Plus three mechanical auto-fixes from the critical-pass.

## Mechanical auto-fixes

- `src/eval/rca-eval.test.ts`: ProbeHit literal missing the new required `origin` field.
  tsconfig.json excludes `*.test.ts` from tsc so the type error didn't surface; vitest passed because buildInvestigationMessage doesn't read `origin`. Fixed before the next consumer trips over it.
- `src/web/components/scan/types.ts`: comment claimed "keep in lockstep with ProbeMetricRule" but RuleDraft deliberately omits `source` (GUI editor is metrics-only in v1; validator defaults to "metrics" on PUT). Comment updated to reflect the actual contract.
- `src/server/anomaly-probe.ts buildGenericLogQLFromLabels`: escape backslash before double-quote so logLabel values containing `\` don't double-escape. k8s RFC 1123 label values never need this in practice, but the function takes an untrusted map and the fix is 3 characters.
- `src/server/scan-scheduler.ts:329`: inline comment said `"service:ruleName"` but Slice C changed the format to `"service:origin:ruleName"`. Parse still works (lastIndexOf gets the right colon), but the comment would mislead a future maintainer trying to extract the service prefix. Updated with an explicit warning not to use `slice(0, colonIdx)` for service extraction.

## Adversarial fix A - validate LLM-written rules before persistence

Cross-model HIGH (Claude F3 + Codex #3). runDiscoverStep was casting raw JSON from the agent to ProbeMetricRule with `as`. Worst cases documented: malformed threshold op silently never trips; rule name `db:slow` breaks state-key parsing; `source: "log"` (typo) routes a LogQL query to the Prometheus tool and NaNs, so the operator sees a services.yaml that claims log monitoring exists, but none runs.

Fix in `src/workflows/steps/discover.ts`:

- `ProbeMetricRuleSchema` exported from `src/config/schema.ts` so the discovery path can safeParse against the canonical shape.
- `validateDiscoveredRules(raw, source)` drops rules that fail shape validation (including invalid threshold op, missing query, wrong source value) AND rules whose name violates `^[^:]+$` (the state-key delimiter constraint scan-rule-validator already enforces on the GUI path). Structured warn log per dropped rule so operators can see what the LLM produced.
- `validateDiscoveredServices(raw)` applies the same guard to per-service `probeRules` before the service reaches the full-shape validator.

The scan-rule-validator (GUI path) is explicitly scoped. Reusing it would have bundled its `.strict()` mode, which rejects unknown keys and would break when we add fields later. Parallel schema is the right tradeoff.

## Adversarial fix B - globals survive empty service sweep

Codex HIGH #2. runDiscoverStep:283 required `parsed.services.length > 0` to accept the object form, but runDiscovery:41 has an explicit `complete-empty` phase that surfaces globalProbeRules when services is empty. Today a transient empty discovery would silently wipe the learned label-key override, pushing the probe back to config.yaml defaults.

Fix: the object-form branch now accepts when EITHER services OR globalProbeRules is non-empty. A discovery run that only wrote globals (label-key introspection succeeded, service sweep came back empty) is a valid useful result.

## Adversarial fix C - GC orphaned hysteresis state at tick start

Cross-model HIGH (Claude F1 + Codex #5). `resetHysteresisForChangedRules` only ran on `PUT /api/scan/settings` (config.yaml reload). Discovery's accept() path writes to services.yaml but never resets state. Old counters under `{svc}:global:OLD_RULE_NAME` leaked forever after a discovery-driven rule rename. Same pattern for removed per-service rules and hidden services.

Fix in `runProbe` (`src/server/anomaly-probe.ts`): after building the tick's task list, compute the set of active stateKeys, and drop every consecutiveState entry not in that set.
Covers every rule-change path (discovery writes, config reloads, override toggles, hidden-service changes) with one Map diff per tick. Debug-log the orphan count.

## Testing

- `discovery.test.ts`: +3 new tests exercising the three fixes (B: object form accepted with empty services + non-empty globals; A: rule with `db:slow` name dropped; A: rule with invalid threshold op dropped). Mock adapter extended with a per-test reply-override escape hatch.
- `anomaly-probe.test.ts`: +1 test pre-loading orphan state keys, running a tick with a different rule name, verifying both orphans were GC'd. Exercises both the "rule renamed" and "service removed" cases in one assertion.
- Full suite: 102/102 files green. Type check clean.

## Deferred (noted in the design doc "Not in Scope")

- File locking on registry writes: RMW race window is real but narrow; no existing writer uses locking. P2.
- Static query-cost bounds in discover-eval: runtime timeouts (queryTimeoutMs 3s, logsQueryTimeoutMs 10s) cap per-query wall-clock. P2 if prompt-driven DDoS ever surfaces.

Design doc: ~/.gstack/projects/WZ-dops-assistant/wli02-feat-llm-driven-probe-rules-design-20260422-continuation.md

* chore: bump version and changelog (v0.1.0.0)

MINOR bump: discovery-owned probe rules ship. Discovery agent writes per-service `probeRules` and top-level `globalProbeRules` into services.yaml; four-track probe evaluator reads them; new discover-eval harness gates output quality.

Also syncs `package.json.version` (0.1.14, stale from old 3-digit scheme) back in line with the VERSION file's 4-digit scheme.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs: update CLAUDE.md for discovery-owned probe rules (v0.1.0.0)

- Add Where-to-Look entries for anomaly-probe, scan-scheduler, discover-eval
- Note services.yaml shape change ({services, globalProbeRules}, flat-array forward-compat)
- Document four-track probe evaluator + LLM output validation
- Add npm run test:discover-eval to commands
- Refresh test-file count (73 -> 100+)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
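The origin-namespaced state key and the per-tick orphan GC described in the Slice C and "Adversarial fix C" notes above can be sketched together. The `RuleOrigin` values, the `stateKey(service, origin, ruleName)` signature, and the `service:origin:ruleName` format are taken from the commit text; `dropOrphans` and the use of a plain counter map are assumptions for illustration, not the code in src/server/anomaly-probe.ts.

```typescript
// Origin values and key format as described in the commit messages.
type RuleOrigin = "global" | "service" | "default" | "override" | "logs-fallback";

function stateKey(service: string, origin: RuleOrigin, ruleName: string): string {
  return `${service}:${origin}:${ruleName}`;
}

// Hypothetical GC pass: drop every counter whose key is not active this tick.
// Returns how many entries were removed (the commit debug-logs this count).
function dropOrphans(consecutiveState: Map<string, number>, activeKeys: Set<string>): number {
  let dropped = 0;
  for (const key of Array.from(consecutiveState.keys())) {
    if (!activeKeys.has(key)) {
      consecutiveState.delete(key);
      dropped++;
    }
  }
  return dropped;
}

// State left over from a renamed global rule and a removed service.
const state = new Map<string, number>([
  [stateKey("svc-a", "global", "old_availability"), 2],
  [stateKey("svc-gone", "default", "deployment_availability"), 1],
  [stateKey("svc-a", "service", "pod_restarts"), 1],
]);

// This tick only evaluates the renamed global rule and the per-service rule.
const active = new Set<string>([
  stateKey("svc-a", "global", "availability"),
  stateKey("svc-a", "service", "pod_restarts"),
]);

const orphanCount = dropOrphans(state, active);
```

Because the service name is the first segment and the rule name is the last, extracting the service needs a prefix match (`startsWith(service + ":")`) rather than splitting at the last colon, which is the scheduler bug the Slice C notes describe.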

WZ force-pushed the feat/proactive-scan-lane-b-per-service branch from 8a0d233 to 05231c6 Compare April 22, 2026 15:25

WZ force-pushed the feat/proactive-scan-sensible-defaults branch from 51ca7ed to 79b11f6 Compare April 22, 2026 15:25

WZ mentioned this pull request Apr 22, 2026

land: stacked scan + email notifications (#112, #113, #114, #116) #118

Merged

3 tasks

WZ added 3 commits April 22, 2026 09:50

WZ force-pushed the feat/proactive-scan-sensible-defaults branch from 1943508 to 08fb02c Compare April 22, 2026 16:51

WZ changed the base branch from feat/proactive-scan-lane-b-per-service to main April 22, 2026 16:51

WZ merged commit 4bb01f4 into main Apr 22, 2026

WZ mentioned this pull request Apr 23, 2026

feat(scan): discovery-owned probe rules (v0.1.0.0) #119

Merged

6 tasks

WZ deleted the feat/proactive-scan-sensible-defaults branch April 25, 2026 03:28

feat(scan): k8s-native default probe rules #115: WZ merged 3 commits into main from feat/proactive-scan-sensible-defaults
