Mutation-driven test-backfill agent. Writes pytest suites that catch real bugs instead of merely covering lines.
You point TestForge at a public Python repo and a poorly-tested module. It:
1. Clones the repo into a temp workspace.
2. Reads the module.
3. Drafts a pytest suite.
4. Runs pytest until everything is green.
5. Runs mutmut to mutate the source and see which mutants survive.
6. Reads the diff of each surviving mutant and writes a targeted assertion that kills it.
7. Repeats steps 5–6 until the mutation score hits the target (or the budget caps it).
8. Forks the repo, opens a PR with the suite and a written explanation of what each test catches.
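The mutate, read survivors, add assertions, repeat cycle described above can be sketched roughly as follows. This is a minimal illustration and not TestForge's actual code: `backfill_loop`, `run_mutation`, and `kill_survivor` are hypothetical names, the mutation run is faked with a toy set, and the budget cap is left out for brevity.

```python
from typing import Callable, List, Tuple


def backfill_loop(
    run_mutation: Callable[[], Tuple[int, List[str]]],
    kill_survivor: Callable[[str], None],
    target_score: float = 0.90,
    max_iterations: int = 6,
) -> float:
    """Repeat mutate -> read survivors -> add assertions until a limit is hit.

    run_mutation (hypothetical) returns (total_mutants, surviving_mutant_diffs).
    kill_survivor (hypothetical) adds a test aimed at one surviving diff.
    """
    score = 0.0
    for _ in range(max_iterations):
        total, survivors = run_mutation()
        # Mutation score = killed mutants / total mutants.
        score = 1.0 if total == 0 else (total - len(survivors)) / total
        if score >= target_score:
            break
        for diff in survivors:
            kill_survivor(diff)
    return score


# Toy stand-ins: 10 mutants, 4 of which survive the first draft of the suite.
# Each "targeted assertion" kills exactly the mutant it was written for.
alive = {f"mutant-{i}" for i in range(4)}
score = backfill_loop(lambda: (10, sorted(alive)), alive.discard)
print(score)
```

In this toy run the first pass scores 0.6, the four survivors are each targeted once, and the second pass reaches the target, so the loop exits early.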
No human in the loop after invocation.
Line coverage rewards weak tests. A test like assert result is not None executes a function but does not catch a sign-flip bug. Mutation score measures what coverage cannot: whether the suite fails when the code is wrong. It is also a signal an LLM can reason about and self-correct against, because each surviving mutant is a concrete diff the suite failed to detect. TestForge's loop is a closed-loop optimizer whose reward signal is "how many mutants did I kill."
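To make the sign-flip point concrete, here is a small sketch. The function `net_change` and its mutant are hypothetical and hand-written for illustration; they stand in for the kind of change a mutation tool might make and are not output from mutmut or TestForge.

```python
def net_change(credits: int, debits: int) -> int:
    """Original implementation."""
    return credits - debits


def net_change_mutant(credits: int, debits: int) -> int:
    """Hand-written mutant: the operator is flipped from - to +."""
    return credits + debits


def weak_test(fn) -> bool:
    """Executes every line of fn but only checks that something came back."""
    return fn(5, 3) is not None


def strong_test(fn) -> bool:
    """Pins the exact value, so a flipped sign changes the outcome."""
    return fn(5, 3) == 2


# The weak test passes on both versions, so the mutant survives.
print(weak_test(net_change), weak_test(net_change_mutant))      # True True
# The strong test passes on the original only, so the mutant is killed.
print(strong_test(net_change), strong_test(net_change_mutant))  # True False
```

Both tests give the function full line coverage; only the second distinguishes the correct code from the broken one.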
pip install testforge
export ANTHROPIC_API_KEY=...
export GITHUB_TOKEN=... # gh CLI must be authenticated
testforge \
--repo https://github.com/owner/repo \
--module src/path/module.py \
--target-mutation-score 0.90 \
--max-iterations 6 \
--budget-usd 3

Add --no-pr to write tests to a local clone without forking or opening a PR.
PLANNER (Claude Opus 4.7, tool-use, prompt-cached)
↓
TOOL EXECUTOR (read_module, write_test_file, lint, run_pytest, run_mutmut, read_mutant_diff, finish)
↓
OBSERVER (state machine + 5 stop conditions)
↓ loop ↑ |
finish → fork + PR
Stop conditions: target met / iteration cap / budget cap / oscillation (two consecutive iterations with zero new mutant kills) / agent self-declared finish.
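One way the five stop conditions could be checked is sketched below. Everything here is an assumption made for illustration: `LoopState`, its field names, the `stop_reason` function, the returned labels, and the order in which the conditions are tested are not taken from TestForge's source. The default limits simply mirror the values in the example invocation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LoopState:
    """Hypothetical snapshot of the agent loop after an iteration."""
    mutation_score: float          # killed / total mutants, in [0, 1]
    iteration: int                 # iterations completed so far
    spent_usd: float               # API spend so far
    new_kills_history: List[int]   # newly killed mutants, one entry per iteration
    agent_called_finish: bool = False


def stop_reason(
    state: LoopState,
    target_score: float = 0.90,
    max_iterations: int = 6,
    budget_usd: float = 3.0,
) -> Optional[str]:
    """Return a label for the first stop condition that holds, else None."""
    if state.agent_called_finish:
        return "agent-finish"
    if state.mutation_score >= target_score:
        return "target-met"
    if state.iteration >= max_iterations:
        return "iter-cap"
    if state.spent_usd >= budget_usd:
        return "budget-cap"
    # Oscillation: the last two iterations each killed zero new mutants.
    last_two = state.new_kills_history[-2:]
    if len(last_two) == 2 and all(kills == 0 for kills in last_two):
        return "oscillation"
    return None  # keep looping


print(stop_reason(LoopState(0.72, 3, 1.10, [5, 0, 0])))  # oscillation
```

A single iteration with zero new kills does not stop the loop in this sketch; it takes two in a row.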
Python only. Synchronous, mostly-pure-function modules only. No async, no I/O-heavy modules, no Hypothesis. See docs/superpowers/specs/2026-05-02-testforge-design.md for the full scope contract.
- Overview — the problem, the insight, the rubric fit
- Architecture — module map, the agent loop in detail, design decisions
- Quickstart — install, test, run, troubleshoot
- Interview prep — anticipated Q&A about why-this-not-that
[2-min video link — added on submission day]