crickets diagnostics

title: diagnostics — design
status: launched
kind: design
scope: feature
area: crickets/diagnostics
governs: [src/diagnostics/]
parent: crickets-hld.md
seeded: 2026-06-20
approved: 2026-06-23

Note

LAUNCHED: approved 2026-06-23, then lifted into tracked wiki/designs/ on 2026-06-24 (AG Phase 3). This is a child design of the crickets HLD covering the diagnostics capability: failure analysis, hypothesis generation, and cross-session failure-pattern memory.

diagnostics

Objective

diagnostics is the capability that makes sense of failures. When something breaks — a CI run, a test, a stack trace at runtime — it reads the output, classifies the failure, proposes ranked hypotheses for the cause, and remembers the pattern so the next occurrence is faster to resolve. It is the Troubleshooter/SRE persona's composition. It is a diagnosis engine — it analyzes and remembers, it never fixes; repair is a caller's job (development-lifecycle's /bugfix, maintenance's dependency fixer). It pairs the observability command /observe with a failure-analysis engine.

Overview

Diagnosis runs as a loop:

observe — instrument the system so a failure leaves usable signal (/observe, delivered).
diagnose — when it breaks, read the logs, classify the failure, and propose ranked hypotheses (/diagnose, greenfield).
remember — write the failure as an incident to memory, so the pattern is recallable.
recall — on the next failure, surface prior similar incidents before re-deriving.

The failure-diagnosis engine depends on agentm's memory system — cross-session failure-pattern memory is the recall engine put to work.

```mermaid
graph TD
    F["a failure<br/><i>CI · test · runtime</i>"] --> C["Layer 0 — classify<br/><i>build/test/type/lint/runtime</i>"]
    C --> FP["Layer 1 — fingerprint<br/>exact-match"]
    FP -- "hit" --> PR["prior incident + fix<br/><i>zero inference</i>"]
    FP -- "miss" --> SEM["Layer 2 — semantic fallback<br/><i>RRF + graph hop</i>"]
    SEM --> HY["ranked hypotheses"]
    PR --> HY
    HY --> INC["write failure-incident<br/><i>scrubbed by privacy</i>"]
    SEM -. "alias back" .-> FP
    classDef hit fill:#eef7ee,stroke:#3A9D5D,color:#1E5435;
    class PR hit;
```

Deterministic-first: classify, then a fingerprint exact-match (a hit returns the prior fix with zero inference); only a miss falls to semantic recall; the incident is scrubbed + written, and a drifted match aliases back so the next recurrence is an exact hit.
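A minimal sketch of that control flow, assuming each layer is an injected callable. The names `run_ladder`, `LadderResult`, and the three callables are hypothetical and not part of the design; the sketch only illustrates that a Layer-1 hit returns before Layer 2 is ever invoked.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LadderResult:
    namespace: str            # Layer 0 output, e.g. "test" or "runtime"
    layer: int                # 1 = fingerprint hit, 2 = semantic fallback
    incident: Optional[str]   # path of the matched prior incident, if any


def run_ladder(
    failure_text: str,
    classify: Callable[[str], str],
    fingerprint_lookup: Callable[[str, str], Optional[str]],
    semantic_recall: Callable[[str, str], Optional[str]],
) -> LadderResult:
    """Walk the ladder and stop at the first layer that answers."""
    namespace = classify(failure_text)                  # Layer 0: no inference
    hit = fingerprint_lookup(namespace, failure_text)   # Layer 1: exact match
    if hit is not None:
        return LadderResult(namespace, 1, hit)          # short-circuit: Layer 2 never runs
    match = semantic_recall(namespace, failure_text)    # Layer 2: only on a miss
    return LadderResult(namespace, 2, match)
```

Passing the layers in as callables keeps the ordering testable on its own: a test can assert that the semantic path was not called when the fingerprint lookup returned a hit.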

Design

`/observe` — the instrumentation side (delivered)

The discipline for adding observability to code, so a failure leaves usable signal for /diagnose to read.

Entry: invoked when instrumenting a production path — typically while building a feature, and again from /launch before go-live.
Exit: structured logging, RED metrics (rate · errors · duration), OpenTelemetry spans, and symptom-based alerts wired; a test request confirms metrics emit.
Automated: log events (not strings); RED metrics on new paths; trace spans; symptom-based alerting (an infra-only alert with no symptom check is rejected).
Artifacts: none in memory — /observe instruments the codebase itself.
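As an illustration of "log events (not strings)" and RED metrics on a path, here is a standard-library-only sketch. The names `log_event`, `observed`, and the in-memory `RED` table are assumptions made for this example; a real path would export to a metrics backend and add OpenTelemetry spans, which are left out here.

```python
import json
import logging
import time
from collections import defaultdict
from functools import wraps

logger = logging.getLogger("observe_sketch")

# In-memory RED counters, keyed by path (illustrative only).
RED = defaultdict(lambda: {"requests": 0, "errors": 0, "duration_s": 0.0})


def log_event(event: str, **fields) -> str:
    """Emit one structured event (a JSON object), not a free-form string."""
    line = json.dumps({"event": event, **fields}, sort_keys=True)
    logger.info(line)
    return line


def observed(path: str):
    """Record rate, errors, and duration for every call on `path`."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            RED[path]["requests"] += 1
            try:
                return fn(*args, **kwargs)
            except Exception as exc:
                RED[path]["errors"] += 1
                log_event("request_failed", path=path,
                          error_class=type(exc).__name__)
                raise  # instrumentation never swallows the failure
            finally:
                RED[path]["duration_s"] += time.perf_counter() - start
        return wrapper
    return decorator
```

The error event carries the exception class as a field, which is the kind of stable, parseable signal a later diagnosis step can read without scraping message text.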

(/observe lives here; /launch in development-lifecycle calls it before go-live — a soft cross-capability reference.)

`/diagnose` — the failure-diagnosis engine (greenfield)

Takes a failure and returns ranked causes, remembering each one. It never mutates code — it emits hypotheses and logs an incident; the fix is a caller's.

Entry: /diagnose <log-or-failure-input> — invoked when something breaks (a red CI run, a failing test, a runtime stack trace), directly or by a repair caller.
Exit: the failure classified, 2–3 ranked hypotheses (each with a confidence signal + a suggested next probe), and one failure-incident written to memory.
Automated: the recall ladder below — classify → recall (fingerprint-first) → generate → write — never a fix.
Enhancements: hypothesis verification: ask the adversarial reviewers "real bug, or an assumption?" (code-review).
Artifacts: one `kind: failure-incident` memory entry per run, via the save path → the configured memory backend (the device-local store, or <vault>/… when agentm is mounted).
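The exit contract above could be carried by a small result type. This is a sketch under assumptions: the class and field names are hypothetical, and the 0.0 to 1.0 confidence scale is a choice made for the example, since the design only calls for "a confidence signal".

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Hypothesis:
    cause: str
    confidence: float   # assumed scale: 0.0 to 1.0
    next_probe: str     # the suggested next check


@dataclass
class Diagnosis:
    namespace: str      # build / test / type / lint / runtime
    hypotheses: List[Hypothesis] = field(default_factory=list)

    def ranked(self) -> List[Hypothesis]:
        """Return hypotheses, most confident first."""
        return sorted(self.hypotheses, key=lambda h: h.confidence, reverse=True)

    def validate(self) -> None:
        """Enforce the exit contract: 2 to 3 hypotheses, each with a probe."""
        if not 2 <= len(self.hypotheses) <= 3:
            raise ValueError("expected 2-3 hypotheses")
        if any(not h.next_probe for h in self.hypotheses):
            raise ValueError("every hypothesis needs a next probe")
```

Note that the type has no field for a patch or a fix action, which mirrors the rule that the engine emits hypotheses and leaves repair to the caller.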

[PENDING-IMPL] — build the classifier + symptom extraction + hypothesis generator (documenter flips to as-built when the engine lands); today nothing does failure analysis.
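The classifier named above might start as a deterministic rule table. The sketch below is an assumption throughout: the tool-to-namespace mapping, the function name `classify`, and the choice of "runtime" as the default for unrecognized failures are all hypothetical. It reads only the exit code, the emitting tool, and a parsed rule id.

```python
import re
from typing import Optional

NAMESPACES = ("build", "test", "type", "lint", "runtime")

# Hypothetical tool-to-namespace table; the real mapping is not specified.
TOOL_NAMESPACE = {
    "tsc": "type", "mypy": "type",
    "eslint": "lint", "ruff": "lint",
    "pytest": "test", "jest": "test",
    "make": "build",
}

TS_RULE = re.compile(r"\bTS\d{4,5}\b")   # TypeScript rule ids such as TS2345


def classify(tool: Optional[str], exit_code: int, output: str) -> Optional[str]:
    """Return a failure namespace, or None when the run did not fail."""
    if exit_code == 0:
        return None
    if TS_RULE.search(output):           # a parsed rule id is the strongest signal
        return "type"
    if tool in TOOL_NAMESPACE:           # otherwise trust the emitting tool
        return TOOL_NAMESPACE[tool]
    return "runtime"                     # assumed default for everything else
```

Every branch is a lookup or a regex match, so the same input always yields the same namespace and no model call is involved.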

Cross-session failure-pattern memory

/diagnose doesn't re-reason from scratch each time — it recalls deterministically first, and infers only when nothing matches. The recall ladder, over a corpus of kind: failure-incident entries:

Layer 0 — classify (free, deterministic). Read the error namespace — build · test · type · lint · runtime — from the exit code, the emitting tool, and parsed structured output (a rule id like TS2345, a failing-test id, an exception class). No inference; the cheapest layer, and it picks the fingerprint's namespace.
Layer 1 — fingerprint exact-match (the "seen this exact failure?" lookup). Normalize the error signature from stable fields only — error class + the top in-app frame(s) (file basename · symbol · line), with volatile tokens stripped (absolute paths, PIDs, timestamps, hashes, addresses) — and hash it (fingerprint, with a versioned fp_algo). Stored as a frontmatter key, it materializes as an indexed column in the SQLite metadata table (V6-11), so recall is SELECT path WHERE fingerprint = :fp — an exact-match lookup that returns the prior incident + its fix with zero inference, short-circuiting the vector search entirely on a hit.
Layer 2 — semantic fallback (only on a miss). A genuinely new or drifted failure falls through to the V6 hybrid path: an exact metadata prefilter (namespace · project · status: active) first, then sqlite-vec + BM25 fused by RRF, plus one knowledge-graph hop along a same-root-cause / supersedes edge to catch "same cause, drifted signature."
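For the fusion step in Layer 2, reciprocal rank fusion scores each candidate as the sum of 1 / (k + rank) over the ranked lists it appears in. The sketch below assumes the vector and BM25 searches each return an ordered list of incident ids; the function name `rrf_fuse` is hypothetical, and k = 60 is a commonly used default rather than a value the design specifies.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_hits = ["a", "b", "c"]   # e.g. from the vector index
bm25_hits = ["b", "d", "a"]     # e.g. from the keyword index
fused = rrf_fuse([vector_hits, bm25_hits])
# fused == ["b", "a", "d", "c"]
```

RRF uses ranks only, so the two retrievers' raw scores never need to be put on a common scale.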

It gets more deterministic with use. When Layer 2 finds a drifted match that is the same failure, the new signature attaches as an additional fingerprint alias on the existing incident — so the next recurrence collapses to a Layer-1 exact hit. The corpus leans less on inference over time, not more. The flip side is the cold start: a genuinely new failure — or an empty corpus — gets pure Layer-2 inference, so the deterministic win is recurrence-only. Early on, diagnostics is an LLM diagnoser with a growing memory; it sharpens as the corpus fills — the same accumulate-and-compound curve as the rest of memory.
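The alias mechanism can be pictured as an index in which several fingerprints point at one incident. The `Incident` and `IncidentIndex` names below are hypothetical, and an in-memory dict stands in for the metadata table.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Incident:
    path: str
    fingerprints: Set[str] = field(default_factory=set)  # primary + aliases


class IncidentIndex:
    """Maps every fingerprint (primary or alias) to one incident."""

    def __init__(self) -> None:
        self._by_fp: Dict[str, Incident] = {}

    def add(self, incident: Incident) -> None:
        for fp in incident.fingerprints:
            self._by_fp[fp] = incident

    def lookup(self, fp: str) -> Optional[Incident]:
        return self._by_fp.get(fp)   # Layer 1: exact match or nothing

    def alias(self, incident: Incident, drifted_fp: str) -> None:
        """Attach a drifted signature that Layer 2 confirmed as the same failure."""
        incident.fingerprints.add(drifted_fp)
        self._by_fp[drifted_fp] = incident
```

After `alias` runs once for a drifted signature, the next occurrence of that signature resolves through `lookup` alone, which is the "collapses to a Layer-1 exact hit" behavior described above.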

A failure-incident is one markdown entry (symptom · root cause · fix/workaround · outcome) with kind: failure-incident. This is a new kind, distinct from kind: fix because an incident may have no fix yet, and entries are superseded via the existing evolve primitive. The capability leans on the agentm memory engine by name (recall + the metadata table + the save path), one-way; agentm never depends back on diagnostics. Because the body is captured from logs and stack traces, it is deterministically scrubbed of secrets and PII before it lands: a regex/pattern pass (the privacy scrubber), never an LLM judgment and never optional on a log capture.
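A deterministic scrub of this kind is an ordered list of patterns applied to the incident body before the write. The patterns and labels below are illustrative assumptions only; the real rule set belongs to the privacy capability and is not reproduced here.

```python
import re

# Illustrative patterns; not the privacy capability's actual rules.
PATTERNS = [
    ("aws-access-key-id", re.compile(r"\bAKIA[0-9A-Z]{16}\b")),
    ("bearer-token", re.compile(r"\bbearer\s+[a-z0-9._~+/=-]{8,}", re.IGNORECASE)),
    ("email", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("home-path", re.compile(r"/(?:home|Users)/[^/\s]+")),
]


def scrub(text: str) -> str:
    """Replace every match with a labeled placeholder; no model call involved."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[REDACTED:{label}]", text)
    return text


raw = ("Authorization: Bearer abc123def456 user=ana@example.com "
       "at /home/ana/app/main.py key AKIAABCDEFGHIJKLMNOP")
clean = scrub(raw)
```

Because the pass is pure pattern substitution, the same input always produces the same output, and running it a second time on already scrubbed text changes nothing.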

[PENDING-IMPL] — build the fingerprint normalizer + the failure-incident kind + the recall ladder alongside V6-11 (the metadata-table column is Layer-1's substrate — they ship together; documenter flips when they land).
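A sketch of the fingerprint normalizer and the exact-match lookup, under stated assumptions: the table name `incidents`, the `FP_ALGO` tag, the `top_n` parameter, and the single address-stripping rule are hypothetical, and frames are assumed to arrive already filtered to in-app code. The actual V6-11 schema and strip list are not defined in this design.

```python
import hashlib
import os
import re
import sqlite3
from typing import Optional, Sequence, Tuple

FP_ALGO = "v1"                             # hypothetical version tag
ADDRESS = re.compile(r"0x[0-9a-fA-F]+")    # one volatile token; the full list is undecided

Frame = Tuple[str, str, int]               # (file path, symbol, line)


def fingerprint(error_class: str, frames: Sequence[Frame], top_n: int = 2) -> str:
    """Hash stable fields only: error class plus the top in-app frames."""
    parts = [FP_ALGO, error_class]
    for path, symbol, line in frames[:top_n]:
        # Basename only, so absolute paths never reach the hash.
        parts.append(f"{os.path.basename(path)}:{ADDRESS.sub('', symbol)}:{line}")
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


def lookup(conn: sqlite3.Connection, fp: str) -> Optional[str]:
    """Layer 1: return the prior incident's path on an exact match, else None."""
    row = conn.execute(
        "SELECT path FROM incidents WHERE fingerprint = ?", (fp,)
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE incidents (path TEXT, fingerprint TEXT)")
conn.execute("CREATE INDEX idx_fingerprint ON incidents(fingerprint)")
```

Including the algorithm tag in the hashed input means a v1 fingerprint can never equal a v2 fingerprint for the same failure, which makes a version bump an explicit re-derivation rather than a silent mismatch.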

Scheduled health passes (designed)

A scheduled pass could run health checks against projects during idle/dream cycles and append a report — still diagnosis, not repair. This leans on agentm's scheduler, which is designed-not-built (the Experience design). diagnostics names the interface and defers the slice.

[PENDING-IMPL] — build the scheduled health pass once the agentm scheduler ships (documenter); on-demand /diagnose ships first.

The boundary — a diagnosis engine, never a repairer

diagnostics analyzes and remembers; it never fixes. Repair is always a caller's job, by composition:

maintenance (the renamed github-ci) owns reactive known-breakage repair — its dependabot-fixer calls the diagnose engine to classify a red-CI failure, then runs its own bounded dependency-fix loop.
development-lifecycle owns defect repair — /bugfix may call the diagnose engine for its Analyze step, then fixes under the loop.

The shape they share is find-cause → fix → verify — a shared shape, not a shared concern. The lock: diagnostics = the engine; maintenance + development-lifecycle = the callers; no capability both diagnoses and repairs. This is also where github-ci's reframe lands — maintenance is the caller-home for dependency repair, diagnostics is the primitive-home for diagnosis.

First slice

On-demand /diagnose: classify → recall (fingerprint-first) → rank hypotheses → log a failure-incident — built with V6-11 so Layer-1 is deterministic from day one. Scheduled health passes are deferred (blocked on the scheduler).

Opinions it consumes

diagnostics leans on how-we-engineer: its deterministic-first discipline (classify before hypothesize, recall before infer) is that opinion applied to diagnosis. It also composes the good opinion, via code-review, when it asks the reviewers to verify a hypothesis. (Hardwired today; request-by-name is Phase-3/4; see the Opinions design.)

Dependencies

enhances development-lifecycle (soft) — failure analysis strengthens /review and /work; /bugfix's Analyze step may call the diagnose engine.
composed by maintenance — dependabot-fixer calls the diagnose engine, then runs its own dependency-fix loop (the repair stays in the caller).
enhances code-review (soft) — diagnosis can ask the adversarial reviewers to verify a hypothesis.
enhanced by privacy — incident bodies are run through privacy's deterministic secret/PII scrubber (check-no-pii.sh patterns + gitleaks rules — regex, not inference) before the memory write. (Forward note: privacy today scrubs git ranges; an incident-body scrub surface is a [PENDING-IMPL] on the privacy sub-design.)
leans on the agentm memory engine by name — recall + the V6-11 SQLite metadata table (the fingerprint index) + the save path; one-way, agentm never depends back. Introduces the failure-incident kind into the open kind taxonomy (agentm Memory System).
leans on the agentm scheduler (designed-not-built) for the scheduled-health slice only (Experience design).
Points up at the crickets HLD; the requires/enhances mechanics are in crickets-composition; the Troubleshooter/SRE persona is in agentm Personas.

Risks & open questions

Fingerprint mis-calibration: determinism cuts both ways. A too-loose signature over-groups (two distinct failures share a fingerprint → Layer-1 confidently returns the wrong prior fix); a too-tight one under-groups (a real recurrence misses and silently degrades to inference). Determinism makes a bad fingerprint more dangerous than a fuzzy semantic match, because it is served with confidence and short-circuits the fallback. A Layer-1 hit is therefore a strong prior, not gospel: the hypotheses are still verified, and a human/auto fingerprint override plus the alias mechanism back it up. The fp_algo normalization is the lever, and it leaves two open questions: who owns the volatile-token strip list, and how often the algorithm is bumped, since moving v1 → v2 invalidates and re-derives the corpus's fingerprints. Settle ownership and cadence before incidents accrete.
Secrets / PII in incidents — a deterministic scrub on write is mandatory. Logs + stack traces routinely carry tokens, keys, absolute paths, PII; an incident write persists them. Every write runs through privacy's deterministic scrubber first (regex/pattern, not LLM judgment) — never optional on a log capture. (privacy scrubs git ranges today; an incident-body scrub surface is a [PENDING-IMPL] on the privacy sub-design.)
Layer-1 must be project-scoped. SELECT WHERE fingerprint = :fp collides across projects if the fingerprint doesn't carry the repo (same error class + frame in two repos → the wrong incident). Fold project into the fingerprint namespace, or filter Layer-1 on project too.
The greenfield is real — the diagnose engine + the recall ladder are the heaviest build of the new capabilities.
V6-11 is a hard prerequisite for Layer-1 — the deterministic short-circuit needs the indexed fingerprint column, so the metadata table is built alongside diagnostics (operator call); no grep-over-frontmatter interim.
/bugfix coupling is a phasing call — composition is the design; whether /bugfix's Analyze wires to the engine on day one or after diagnostics proves out is a build-phase decision.
The scheduled slice is blocked on the unbuilt agentm scheduler — named and deferred.
Re-audit triggers: calibrate the fingerprint (over/under-grouping) on a real incident corpus; confirm the privacy incident-scrub surface ships before the first incident write; add a failure-incident retention / consolidation policy (one-off incidents accrete as noise — ties to the V6 lifecycle layer) before volume; sequence the engine before its callers (dependabot-fixer + /bugfix keep inline diagnosis until diagnostics ships, then compose it); flip the [PENDING-IMPL] markers as the engine + V6-11, then the scheduled pass, land; confirm maintenance is the dependency-repair caller-home when the github-ci rename lands.
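The project-scoping risk in this list can be shown with a small in-memory table. The schema, table name, and function names below are hypothetical; the sketch takes the "filter Layer-1 on project too" option, with a composite index so the scoped lookup stays an exact-match read.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; the real V6-11 schema is not specified here.
conn.execute("CREATE TABLE incidents (path TEXT, project TEXT, fingerprint TEXT)")
conn.execute("CREATE INDEX idx_project_fp ON incidents(project, fingerprint)")
conn.executemany(
    "INSERT INTO incidents VALUES (?, ?, ?)",
    [
        ("a/incident-1.md", "repo-a", "fp-123"),
        ("b/incident-9.md", "repo-b", "fp-123"),  # same signature, different repo
    ],
)


def lookup_unscoped(fp):
    """Fingerprint-only lookup: returns incidents from every project."""
    rows = conn.execute(
        "SELECT path FROM incidents WHERE fingerprint = ? ORDER BY path", (fp,))
    return [r[0] for r in rows]


def lookup_scoped(project, fp):
    """Project-filtered lookup: only the caller's own incidents can match."""
    rows = conn.execute(
        "SELECT path FROM incidents WHERE project = ? AND fingerprint = ?",
        (project, fp))
    return [r[0] for r in rows]
```

With the unscoped query, a failure in repo-b can be answered with repo-a's incident; the scoped query returns only the caller's own record.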

References

The observability command: /observe (observe.md) — the instrumentation discipline
The memory substrate it leans on: agentm harness/skills/memory/scripts/recall.py (the recall engine) · scripts/memory_mcp_tools.py (memory_recall / phase_recall) · the V6-11 SQLite metadata table (the fingerprint exact-match column) — agentm Memory System
The scheduler (designed-not-built): agentm Experience design
Siblings: crickets HLD · development-lifecycle (/observe's caller via /launch; /bugfix calls the engine) · maintenance (the renamed github-ci; dependabot-fixer calls the engine) · composition · agentm Personas (Troubleshooter/SRE) · agentm Memory System

Amendment log

2026-06-23 — added the recall-ladder diagram (diagram backfill). Per the every-design-carries-a-diagram rule.

2026-06-23 — added an Opinions-it-consumes clause (portfolio backfill). Made explicit which opinions diagnostics leans on (how-we-engineer; good via code-review) — a standard Design clause adopted across the capability designs.

2026-06-23 — authored, reviewed, and finalized.

The diagnostics capability: a diagnosis engine that never fixes — /observe (instrumentation, delivered) + /diagnose (failure analysis, greenfield). When something breaks it classifies the failure, ranks hypotheses, and logs a kind: failure-incident to memory. Repair is a caller's job by composition — maintenance (the renamed github-ci, via dependabot-fixer) and development-lifecycle (/bugfix) call the engine; no capability both diagnoses and repairs. (Why not a diagnostics-and-repair merge: it re-creates the seam the lifecycle merge just removed, /bugfix is loop-wired, and dependency-repair is a maintenance concern.) /observe's home is settled here.

Cross-session failure memory is a deterministic-first recall ladder: Layer 0 classify (build/test/type/lint/runtime) → Layer 1 fingerprint exact-match via the V6-11 indexed column (zero-inference short-circuit) → Layer 2 semantic fallback (RRF + a graph hop), with self-reinforcing fingerprint aliases (more deterministic with use), on a new kind: failure-incident, built alongside V6-11. Risk-hardened against its own failure modes: fingerprint mis-calibration (a Layer-1 hit is a strong prior, not gospel), a mandatory deterministic PII/secret scrub on incident-write (privacy's regex/gitleaks scrubber — a [PENDING-IMPL] incident-body surface), and project-scoped Layer-1; cold-start (recurrence-only win) folded into the memory section. Built-vs-designed: /observe delivered; the engine + ladder + V6-11 greenfield; the scheduled health pass blocked on the unbuilt scheduler. Re-audit triggers: calibrate the fingerprint + settle fp_algo ownership; ship the privacy incident-scrub surface before the first write; add a failure-incident retention policy; sequence the engine before its callers; flip [PENDING-IMPL] as the engine + V6-11 land; confirm maintenance is the dependency-repair caller-home at the github-ci rename.
