Of the five ways AI coding agents fail on 100k-line codebases, this one is the most under-fixed:
Didn't run the right tests — ran the unit tests, missed the integration tests.
It's not an attitude problem. The agent doesn't know which tests cover which changes. It doesn't know which tests are already failing for known reasons. It doesn't know the test you care about needs a real account to run.
This knowledge lives in Slack threads, in the release engineer's head, in postmortems nobody re-reads. It evaporates between sessions.
better-test captures it — the testing half of a Full Context + Lite Control framework. A persistent test playbook plus a developer-feedback loop that stops you from re-filing the same closed bug three times.
Your team has knowledge CI doesn't capture:
- "The WebSocket tests only matter when `subscribe.rs` changed."
- "That flaky one in the auth group — dev says it's a test bug, ignore it."
- "Last release we missed the keychain regression because nobody ran group H. Don't miss it again."
Three files hold this:
| File | What it contains |
|---|---|
| `test-groups.md` | How tests are grouped, what each covers, what each needs to run |
| `impact-map.md` | Changed files / keywords → the test groups they affect |
| `known-issues.md` | What's already known to fail, why, and the developer's verdict |
When you change code and run `/better-test strategy`, the skill reads all three plus your git diff, then recommends a minimal test set with reasoning. You see exactly why groups A, B, D were picked — and which already-triaged items were excluded.
Most testing skills stop at "run these tests." better-test also asks: what did the developer say about the last failure?
When a bug report gets a developer response — "that's expected behavior," "fixed in a different way," "won't fix" — you feed it back:
```
/better-test feedback D-04 not-a-bug --note "dev confirmed — cancel returning 404 is expected"
```
The skill writes the verdict to `history/`, extracts a suppress rule into `feedback-rules.json`, and the next `/better-test strategy` run quietly excludes D-04. You don't re-file the same bug three times.
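The extracted suppress rule might look roughly like this. The real schema is owned by the skill and auto-maintained (do not hand-edit it); this fragment only sketches the idea:

```json
{
  "D-04": {
    "verdict": "not-a-bug",
    "note": "dev confirmed: cancel returning 404 is expected"
  }
}
```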
Six verdict types:
| Verdict | Meaning | Effect |
|---|---|---|
| `not-a-bug` | Developer confirms expected behavior | Excluded from active failures |
| `fixed` | Addressed in this release | Re-tested once, then archived |
| `fixed-differently` | Fixed, but not how you expected | Re-tested with new expected output |
| `wontfix` | Acknowledged, won't address | Excluded permanently with a note |
| `deferred` | Known issue, postponed | Excluded until the target version |
| `revoke` | Retract a previous verdict | Re-activates the test ID |
```
git clone https://github.com/d-wwei/better-test.git ~/repos/better-test
ln -s ~/repos/better-test ~/.claude/skills/better-test
```

Adapter install commands for Cursor, Gemini CLI, Codex, OpenCode, and OpenClaw live in `references/adapters.md`. Test knowledge files produced by `/better-test init` are platform-agnostic.
Inside a project directory (works best if `/better-code init` has already been run, so `.better-work/shared/` exists):

```
/better-test init
```
The skill classifies the testing situation (library / daemon / API / CLI / multi-service), explores the existing test structure, and writes:
- `.better-work/test/protocol.md` — testing cognitive constraints
- `.better-work/test/test-groups.md` — test group definitions with run conditions
- `.better-work/test/impact-map.md` — changed-file patterns → test groups
- `.better-work/test/known-issues.md` — known failures + verdicts
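A `protocol.md` for a daemon project might contain constraints along these lines (illustrative, drawn from the examples elsewhere in this README):

```markdown
# Testing protocol (≤15 lines)
- Pass must verify returned fields, not exit codes.
- Group H (keychain) runs on every release branch — we missed it once.
- The flaky auth test is a known test bug; don't file it again.
```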
Then the typical loop:
```
# After making code changes:
/better-test strategy
  → reads impact-map.md + known-issues.md + your git diff
  → recommends: "Run groups A, B, D — 22 tests, ~5 min"
  → explains why: "Changed src/subscribe.rs touches the subscription flow (group D)..."

# If a test fails and the dev responds:
/better-test feedback D-04 not-a-bug --note "dev confirmed expected behavior"
  → writes verdict to history/
  → extracts a suppress rule
  → next strategy auto-excludes D-04
```
After modifying `src/rest/funds.rs` and `src/auth/session.rs`:

```
Recommended: groups A (auth), B (REST read), C (REST POST)
— 22 tests, ~8 minutes, bring-your-own mode

Reasoning:
• src/auth/session.rs matches impact-map keyword "auth" → group A (9 items)
• src/rest/funds.rs matches "REST" → groups B, C (5 + 8 items)

Skipping: groups D, E, F, H, I (no change signal)
Excluded: C-03 (wontfix, deferred to v1.5), B-07 (not-a-bug from 2026-03-12)

Run with: cargo test -- --test-groups A,B,C
```
Every recommendation points to its impact-map entry. Every exclusion points to its feedback verdict. The reasoning is auditable; you can override anything before running.
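The selection logic above can be sketched in a few lines. This is a simplification, not better-test's actual implementation; the keywords, groups, and verdicts are the ones from the example:

```python
# Rough sketch of impact-map matching; not better-test's real code.
# Keywords (as they might appear in impact-map.md) -> test groups.
IMPACT_MAP = {
    "auth": ["A"],
    "rest": ["B", "C"],
    "subscribe": ["D"],
}

# Verdicts that suppress a test ID from the recommended set.
SUPPRESSING = {"not-a-bug", "wontfix", "deferred"}

def recommend(changed_files, verdicts):
    """Return (groups to run, excluded test IDs) for a set of changed files."""
    groups = set()
    for path in changed_files:
        for keyword, matched in IMPACT_MAP.items():
            if keyword in path.lower():
                groups.update(matched)
    excluded = sorted(tid for tid, v in verdicts.items() if v in SUPPRESSING)
    return sorted(groups), excluded

groups, excluded = recommend(
    ["src/auth/session.rs", "src/rest/funds.rs"],
    {"C-03": "wontfix", "B-07": "not-a-bug"},
)
print(groups)    # ['A', 'B', 'C']
print(excluded)  # ['B-07', 'C-03']
```

The real skill adds the reasoning strings and run-condition checks on top, but the shape — diff in, minimal group set plus auditable exclusions out — is the same.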
| Command | What it does |
|---|---|
| `/better-test init` | First-time exploration of the test structure + generate knowledge files |
| `/better-test update` | Signal-driven incremental update |
| `/better-test strategy` | Analyze git diff + impact-map → recommend a minimal test set with reasoning |
| `/better-test feedback <id> <verdict>` | Record a developer verdict → auto-refine suppress rules |
| `/better-test checkpoint` | Save the current test task state |
| `/better-test resume` | Read progress and continue |
All six work identically whether invoked directly or via `/better-work test <cmd>` (when better-work is installed).
Test knowledge lives under `.better-work/test/` (a symlink to `~/.better-work/<project>/test/`):
```
<project>/.better-work/  →  ~/.better-work/<project-name>/
├── shared/                 (read; written only when needed, tagged [better-test])
│   └── index.md            project entry point
├── code/                   (read-only; informs test priority)
│   └── danger-zones.md     high-risk files → more thorough tests
└── test/                   (better-test writes here)
    ├── protocol.md         ≤15 lines — testing cognitive constraints
    ├── test-groups.md      group definitions + run conditions
    ├── impact-map.md       change keyword → affected groups
    ├── known-issues.md     known failures / expected behaviors / triage
    ├── status.md           auto-refreshed summary
    ├── progress.md         gitignored — current test task state
    └── history/            test run history, git-tracked
        ├── feedback-rules.json   auto-maintained, do not hand-edit
        └── <version>/
            └── run-NNN-<ts>/     results.json + summary.md per run
```
Pass must verify returned fields, not exit codes. A daemon that returns an empty list still exits with code 0 — if "exit 0 = pass", you've just greenlit a broken API. `protocol.md` enforces this as a red line. It's advisory (better-test can't run your tests for you), but it surfaces the bad test at review time.
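To make the red line concrete, here is a minimal sketch in Python (the daemon call is faked; none of these names come from better-test). The exit-code check passes on an empty response, while the field assertion catches it:

```python
# Illustrative only: a fake daemon query that "succeeds" but returns nothing.
def list_accounts():
    """Pretend daemon call: exits cleanly but returns an empty account list."""
    return {"exit_code": 0, "accounts": []}

resp = list_accounts()

# Bad test: "exit 0 = pass" greenlights the broken API.
exit_code_pass = resp["exit_code"] == 0      # True, a false positive

# Good test: verify the returned fields themselves.
fields_pass = len(resp["accounts"]) > 0      # False, catches the bug
```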
- better-work — Lite Control + series entry point. Start there for the full design story.
- better-code — Full Context for coding
- better-test (this repo) — Full Context for testing
better-test reads from `shared/index.md` (project identity) and `code/danger-zones.md` (high-risk files → more thorough tests) when other subskills have populated them. It writes only to `test/` and, when necessary, to `shared/` with a `[better-test]` commit tag.
- No test runner built in. `better-test` recommends which tests to run and why; actually running them is your project's existing tooling (`cargo test` / `pytest` / `go test` / a custom harness). Feed `results.json` back via `/better-test feedback` or save it into `history/`.
- `impact-map.md` accuracy depends on feedback. Initial entries are seeded from keywords; true accuracy grows as `/better-test feedback` and `/better-test update` refine the mappings.
- `feedback-rules.json` is auto-generated. Do NOT hand-edit. Use `/better-test feedback <id> revoke` to retract a verdict, then re-enter a fresh one.
- `strategy` doesn't run tests. It returns the test set plus the invocation command; you run them.
- No CI integration yet. GitHub Actions / GitLab CI integration is planned.
- `protocol.md` enforcement is advisory. It can't prevent you from writing a bad test; it can only flag it at review time.
MIT.
Companion write-up: the full Full Context + Lite Control story lives in the series entry-point README.
Questions, issues, discussion: GitHub issues.