Of the five ways AI coding agents fail on 100k-line codebases, this one is the most under-fixed:
Didn't run the right tests — ran the unit tests, missed the integration tests.
It's not an attitude problem. The agent doesn't know which tests cover which changes. It doesn't know which tests are already failing for known reasons. It doesn't know the test you care about needs a real account to run.
This knowledge lives in Slack threads, in the release engineer's head, in postmortems nobody re-reads. It evaporates between sessions.
better-test captures it — the testing half of a Full Context + Lite Control framework. A persistent test playbook plus a developer-feedback loop that stops you from re-filing the same closed bug three times.
Your team has knowledge CI doesn't capture:
- "The WebSocket tests only matter when `subscribe.rs` changed."
- "That flaky one in the auth group — dev says it's a test bug, ignore it."
- "Last release we missed the keychain regression because nobody ran group H. Don't miss it again."
Three files hold this:
| File | What it contains |
|---|---|
| `test-groups.md` | How tests are grouped, what each covers, what each needs to run |
| `impact-map.md` | Changed files / keywords → the test groups they affect |
| `known-issues.md` | What's already known to fail, why, and the developer's verdict |
When you change code and run `/better-test strategy`, the skill reads all three plus your git diff, then recommends a minimal test set with reasoning. You see exactly why groups A, B, D were picked — and which already-triaged items were excluded.
Most testing skills stop at "run these tests." better-test also asks: what did the developer say about the last failure?
When a bug report gets a developer response — "that's expected behavior," "fixed in a different way," "won't fix" — you feed it back:
```
/better-test feedback D-04 not-a-bug --note "dev confirmed — cancel returning 404 is expected"
```
The skill writes the verdict to `history/`, extracts a suppress rule into `feedback-rules.json`, and the next `/better-test strategy` run quietly excludes D-04. You don't re-file the same bug three times.
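The extracted suppress rule might look roughly like this. The real schema is owned by the skill and auto-maintained (do not hand-edit it); this fragment only sketches the idea:

```json
{
  "D-04": {
    "verdict": "not-a-bug",
    "note": "dev confirmed: cancel returning 404 is expected"
  }
}
```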
Six verdict types:
| Verdict | Meaning | Effect |
|---|---|---|
| `not-a-bug` | Developer confirms expected behavior | Excluded from active failures |
| `fixed` | Addressed in this release | Re-tested once, then archived |
| `fixed-differently` | Fixed, but not how you expected | Re-tested with new expected output |
| `wontfix` | Acknowledged, won't address | Excluded permanently with a note |
| `deferred` | Known issue, postponed | Excluded until the target version |
| `revoke` | Retract a previous verdict | Re-activates the test ID |
```
git clone https://github.com/d-wwei/better-test.git ~/repos/better-test
ln -s ~/repos/better-test ~/.claude/skills/better-test
```

Adapter install commands for Cursor, Gemini CLI, Codex, OpenCode, and OpenClaw live in `references/adapters.md`. Test knowledge files produced by `/better-test init` are platform-agnostic.
Inside a project directory (works best if `/better-code init` has already been run, so `.better-work/shared/` exists):

```
/better-test init
```
The skill classifies the testing situation (library / daemon / API / CLI / multi-service), explores the existing test structure, and writes:
- `.better-work/test/protocol.md` — testing cognitive constraints
- `.better-work/test/test-groups.md` — test group definitions with run conditions
- `.better-work/test/impact-map.md` — changed-file patterns → test groups
- `.better-work/test/known-issues.md` — known failures + verdicts
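A `protocol.md` for a daemon project might contain constraints along these lines (illustrative, drawn from the examples elsewhere in this README):

```markdown
# Testing protocol (≤15 lines)
- Pass must verify returned fields, not exit codes.
- Group H (keychain) runs on every release branch — we missed it once.
- The flaky auth test is a known test bug; don't file it again.
```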
Then the typical loop:
```
# After making code changes:
/better-test strategy
  → reads impact-map.md + known-issues.md + your git diff
  → recommends: "Run groups A, B, D — 22 tests, ~5 min"
  → explains why: "Changed src/subscribe.rs touches the subscription flow (group D)..."

# If a test fails and the dev responds:
/better-test feedback D-04 not-a-bug --note "dev confirmed expected behavior"
  → writes verdict to history/
  → extracts a suppress rule
  → next strategy auto-excludes D-04
```
After modifying `src/rest/funds.rs` and `src/auth/session.rs`:

```
Recommended: groups A (auth), B (REST read), C (REST POST)
— 22 tests, ~8 minutes, bring-your-own mode

Reasoning:
• src/auth/session.rs matches impact-map keyword "auth" → group A (9 items)
• src/rest/funds.rs matches "REST" → groups B, C (5 + 8 items)

Skipping: groups D, E, F, H, I (no change signal)
Excluded: C-03 (wontfix, deferred to v1.5), B-07 (not-a-bug from 2026-03-12)

Run with: cargo test -- --test-groups A,B,C
```
Every recommendation points to its impact-map entry. Every exclusion points to its feedback verdict. The reasoning is auditable; you can override anything before running.
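The selection logic above can be sketched in a few lines. This is a simplification, not better-test's actual implementation; the keywords, groups, and verdicts are the ones from the example:

```python
# Rough sketch of impact-map matching; not better-test's real code.
# Keywords (as they might appear in impact-map.md) -> test groups.
IMPACT_MAP = {
    "auth": ["A"],
    "rest": ["B", "C"],
    "subscribe": ["D"],
}

# Verdicts that suppress a test ID from the recommended set.
SUPPRESSING = {"not-a-bug", "wontfix", "deferred"}

def recommend(changed_files, verdicts):
    """Return (groups to run, excluded test IDs) for a set of changed files."""
    groups = set()
    for path in changed_files:
        for keyword, matched in IMPACT_MAP.items():
            if keyword in path.lower():
                groups.update(matched)
    excluded = sorted(tid for tid, v in verdicts.items() if v in SUPPRESSING)
    return sorted(groups), excluded

groups, excluded = recommend(
    ["src/auth/session.rs", "src/rest/funds.rs"],
    {"C-03": "wontfix", "B-07": "not-a-bug"},
)
print(groups)    # ['A', 'B', 'C']
print(excluded)  # ['B-07', 'C-03']
```

The real skill adds the reasoning strings and run-condition checks on top, but the shape — diff in, minimal group set plus auditable exclusions out — is the same.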
| Command | What it does |
|---|---|
| `/better-test init` | First-time exploration of the test structure + generate knowledge files |
| `/better-test update` | Signal-driven incremental update |
| `/better-test strategy` | Analyze git diff + impact-map → recommend a minimal test set with reasoning |
| `/better-test feedback <id> <verdict>` | Record a developer verdict → auto-refine suppress rules |
| `/better-test checkpoint` | Save the current test task state |
| `/better-test resume` | Read progress and continue |
All six work identically whether invoked directly or via `/better-work test <cmd>` (when better-work is installed).
Test knowledge lives under `.better-work/test/` (a symlink to `~/.better-work/<project>/test/`):
```
<project>/.better-work/  →  ~/.better-work/<project-name>/
├── shared/                 (read; written only when needed, tagged [better-test])
│   └── index.md            project entry point
├── code/                   (read-only; informs test priority)
│   └── danger-zones.md     high-risk files → more thorough tests
└── test/                   (better-test writes here)
    ├── protocol.md         ≤15 lines — testing cognitive constraints
    ├── test-groups.md      group definitions + run conditions
    ├── impact-map.md       change keyword → affected groups
    ├── known-issues.md     known failures / expected behaviors / triage
    ├── status.md           auto-refreshed summary
    ├── progress.md         gitignored — current test task state
    └── history/            test run history, git-tracked
        ├── feedback-rules.json   auto-maintained, do not hand-edit
        └── <version>/
            └── run-NNN-<ts>/     results.json + summary.md per run
```
Pass must verify returned fields, not exit codes. A daemon that returns an empty list still exits with code 0 — if "exit 0 = pass", you've just greenlit a broken API. `protocol.md` enforces this as a red line. It's advisory (better-test can't run your tests for you), but it surfaces the bad test at review time.
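To make the red line concrete, here is a minimal sketch in Python (the daemon call is faked; none of these names come from better-test). The exit-code check passes on an empty response, while the field assertion catches it:

```python
# Illustrative only: a fake daemon query that "succeeds" but returns nothing.
def list_accounts():
    """Pretend daemon call: exits cleanly but returns an empty account list."""
    return {"exit_code": 0, "accounts": []}

resp = list_accounts()

# Bad test: "exit 0 = pass" greenlights the broken API.
exit_code_pass = resp["exit_code"] == 0      # True, a false positive

# Good test: verify the returned fields themselves.
fields_pass = len(resp["accounts"]) > 0      # False, catches the bug
```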
- better-work — Lite Control + series entry point. Start there for the full design story.
- better-code — Full Context for coding
- better-test (this repo) — Full Context for testing
better-test reads from `shared/index.md` (project identity) and `code/danger-zones.md` (high-risk files → more thorough tests) when other subskills have populated them. It writes only to `test/` and, when necessary, to `shared/` with a `[better-test]` commit tag.
- No test runner built in. `better-test` recommends which tests to run and why; actually running them is your project's existing tooling (`cargo test` / `pytest` / `go test` / a custom harness). Feed `results.json` back via `/better-test feedback` or save it into `history/`.
- `impact-map.md` accuracy depends on feedback. Initial entries are seeded from keywords; true accuracy grows as `/better-test feedback` and `/better-test update` refine the mappings.
- `feedback-rules.json` is auto-generated. Do NOT hand-edit. Use `/better-test feedback <id> revoke` to retract a verdict, then re-enter a fresh one.
- `strategy` doesn't run tests. It returns the test set plus the invocation command; you run them.
- No CI integration yet. GitHub Actions / GitLab CI integration is planned.
- `protocol.md` enforcement is advisory. It can't prevent you from writing a bad test; it can only flag it at review time.
MIT.
Companion write-up: the full Full Context + Lite Control story lives in the series entry-point README.
Questions, issues, discussion: GitHub issues.