Summary
Add proof instrumentation so the durable Auto Review concept can be evaluated with data, not vibes.
Scope
Emit structured counters/events for run lifecycle, duplicate reuse/skips, supersede/cancel reasons, findings surfaced/inspected/applied/dismissed, ledger tokens, detail tokens, and token-spend estimates.
Add a diagnostic surface such as /review-stats if useful.
Build deterministic scanners or fixtures for stale/superseded/duplicate/fix-train signatures from logs/rollouts where appropriate.
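As a rough illustration of the structured counters/events described in this scope, the sketch below models one event record and a small log that counts events by kind. Every name here (`ReviewEvent`, `ReviewEventLog`, the `kind` and `reason` strings) is hypothetical and not taken from the codebase; it only shows the shape such instrumentation could take.

```python
from collections import Counter
from dataclasses import asdict, dataclass, field
import json
import time


@dataclass
class ReviewEvent:
    """One structured Auto Review event (hypothetical schema)."""
    run_id: str
    kind: str          # e.g. "run_started", "duplicate_skipped", "finding_surfaced"
    reason: str = ""   # supersede/cancel/skip reason; empty when not applicable
    tokens: int = 0    # token-spend estimate attributed to this event
    ts: float = field(default_factory=time.time)


class ReviewEventLog:
    """Collects events in order and keeps a per-kind counter."""

    def __init__(self):
        self.events = []
        self.counters = Counter()

    def emit(self, event):
        self.events.append(event)
        self.counters[event.kind] += 1

    def to_jsonl(self):
        # One JSON object per line so a scanner can replay the log deterministically.
        return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in self.events)


log = ReviewEventLog()
log.emit(ReviewEvent("run-1", "run_started"))
log.emit(ReviewEvent("run-2", "duplicate_skipped", reason="same_snapshot"))
log.emit(ReviewEvent("run-1", "finding_surfaced", tokens=120))
```

Writing events as JSON lines keeps them out of assistant context and lets fixtures for stale, superseded, or duplicate signatures be built from recorded logs rather than a live session.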
Acceptance Criteria
Metrics include duplicate review rate, skipped/adopted/superseded/cancelled counts, unsurfaced terminal findings, ledger overhead, avoided token estimate, time to surface findings, and finding usefulness/disposition.
Each Auto Review run records enough proof data to explain latency: model, reasoning effort, resolve model/effort, phase timing, follow-up count, token count when available, prompt token estimate, and terminal reason.
Restart recovery and duplicate avoidance are testable without a live TUI where possible.
Dogfood diagnostics can compare before/after behavior across real sessions and identify whether slowness came from first review pass, follow-up loops, worktree/lock contention, retries, or prompt bloat.
Metrics do not inject bulky telemetry into normal assistant context; ordinary turns receive only bounded actionable review state.
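To make two of these criteria concrete, the sketch below derives a duplicate review rate from a list of run records and reports which phase dominated a run's latency. The `RunRecord` type, its fields, and the phase names are assumptions for illustration, not the actual `AutoReviewRun` shape.

```python
from dataclasses import dataclass, field


@dataclass
class RunRecord:
    """Per-run proof data (hypothetical subset of fields)."""
    run_id: str
    outcome: str                 # "completed", "skipped", "adopted", "superseded", "cancelled"
    duplicate: bool = False      # True when the run repeated an already-reviewed state
    phase_seconds: dict = field(default_factory=dict)  # phase name -> wall-clock seconds


def duplicate_review_rate(runs):
    """Fraction of runs flagged as duplicates; 0.0 for an empty list."""
    if not runs:
        return 0.0
    return sum(1 for r in runs if r.duplicate) / len(runs)


def slowest_phase(run):
    """Name of the phase with the most recorded time, or None if no timing exists."""
    if not run.phase_seconds:
        return None
    return max(run.phase_seconds, key=run.phase_seconds.get)


runs = [
    RunRecord("run-1", "completed",
              phase_seconds={"first_pass": 42.0, "follow_ups": 9.5, "lock_wait": 1.2}),
    RunRecord("run-2", "skipped", duplicate=True),
    RunRecord("run-3", "completed",
              phase_seconds={"first_pass": 12.0, "follow_ups": 30.0}),
    RunRecord("run-4", "adopted", duplicate=True),
]

rate = duplicate_review_rate(runs)           # 0.5 for the sample data above
dominant = [slowest_phase(r) for r in runs]  # ["first_pass", None, "follow_ups", None]
```

Keeping phase timings per run is what allows a diagnostic to separate a slow first review pass from follow-up loops or lock contention, instead of reporting a single total duration.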
Current Status
State: Planned after #329. This is the proof layer for the Auto Review love gate.
Recent evidence to carry forward:
Current config can be correct while reviews still feel slow; diagnostics need to show which model/reasoning settings were actually used by each background Auto Review run.
Lowering Auto Review follow-ups reduces worst-case loop count but does not speed the initial review pass, so phase timing matters.
prompt_token_estimate exists on AutoReviewRun but is not currently populated, leaving prompt bloat mostly unprovable.
Dogfooding needs before/after comparisons for duplicate review spend, stale review confusion, ledger overhead, and finding usefulness.
Next action after #329: populate prompt/token/timing/disposition fields and expose compact diagnostics for dogfood comparison without adding normal-turn context bulk.
Blocked by: #329 for dedupe outcome fields and policy events.
Last verified: 2026-06-02 during Auto Review latency/settings planning review.
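One inexpensive way to start populating prompt_token_estimate and to compare sessions is sketched below. It assumes a characters-per-token heuristic (roughly four characters per token for English text is a common rule of thumb, not a tokenizer result), and the function names and metric keys are hypothetical. A real tokenizer count should replace the heuristic wherever one is available.

```python
import math

CHARS_PER_TOKEN = 4  # assumed rule of thumb; not a substitute for a real tokenizer


def estimate_prompt_tokens(prompt):
    """Rough prompt size estimate, rounded up; 0 for an empty prompt."""
    return math.ceil(len(prompt) / CHARS_PER_TOKEN)


def compare_sessions(before, after):
    """Per-metric delta (after minus before) for metrics present in both sessions."""
    return {name: after[name] - before[name] for name in before if name in after}


before = {"duplicate_runs": 6, "prompt_token_estimate": 18000}
after = {"duplicate_runs": 2, "prompt_token_estimate": 15500}
delta = compare_sessions(before, after)
```

A negative delta means the metric dropped between the two sessions, which is the kind of compact before/after signal dogfooding needs without adding bulk to normal-turn context.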
Relationships
Parent: #324
Depends on: #325, #327, #329
Related: #43, #50
Finish Line
Every Code emits enough Auto Review metrics and diagnostics to prove duplicate review reduction, avoided token spend, surfaced findings, ledger overhead, restart recovery, and finding usefulness during dogfooding.