Ship a second strategy — Test-Driven — alongside OpenEvolve (proposed in sibling issue #47). Where OpenEvolve is for optimization (evolve a self-contained artifact against a scalar fitness), Test-Driven is for specification — pin the desired behaviour in a failing test, then drive implementation until it passes. Between the two strategies, autoloop covers most of its useful problem space.
Motivation
Every agent sandbox has limits on what toolchains it can install (network restrictions, missing runtimes). When the sandbox can't reliably run a project's type-check/test suite, iterations get accepted based on the agent's self-evaluation — and PRs land with red CI because the agent wrote code that doesn't compile or tests against methods that don't exist.
The Test-Driven strategy flips the flow. Each iteration starts by pinning behaviour in a failing test, and acceptance ends with CI green on the pushed commit (composing directly with the CI-gated acceptance from #37). The agent can't skip writing tests, can't accept without a real test passing, and can't slip code with missing behaviour past CI.
Concrete use cases:
- **Bug fixing.** A `bugfix` program picks a bug from an issue label, writes the failing repro, makes it green. First-class autoloop use case.
- **API porting.** Implementing a target API against a reference (e.g., porting one library to another language). Every iteration pins one behaviour, ports the matching reference test, makes it pass.
- **Spec-driven implementation.** A `spec` program satisfies one bullet per iteration from a spec document.
Test-Driven also composes with everything else recently landed:
No further changes to workflows/autoloop.md are needed — the "Strategy Discovery" prompt section from #47 already handles any strategy whose playbook is pointed at from program.md's ## Evolution Strategy section.
Content to ship — .autoloop/strategies/test-driven/
strategies/test-driven/strategy.md
# Test-Driven Strategy — <CUSTOMIZE: program-name>
This file is the **runtime playbook** for this program. The autoloop agent reads it at the start of every iteration and follows it literally. It supersedes the generic Iteration Loop in the default autoloop workflow — state read, branch management, state file updates, and CI gating still apply.
## Problem framing
<CUSTOMIZE: 2–4 sentences describing what the program specifies. What is the target artifact? What does "correct" mean — are we implementing an API against a reference, fixing bugs against a repro, adding behaviour against a spec document? Name the source of truth the agent checks whenever ambiguity arises (e.g. "pandas' `Series.sort_values` semantics are authoritative; when our behaviour diverges, pandas wins unless the divergence is documented").>
## Per-iteration loop

### Step 1. Load state

1. Read `program.md` — Goal, Target, Evaluation.
2. Read the program's state file from the repo-memory folder (`{program-name}.md`). Locate the `## ✅ Test Harness` subsection. If it does not exist, create it using the schema in [Test Harness schema](#test-harness-schema).
3. Read <CUSTOMIZE: the source-of-truth references the agent consults — reference docs for the current target, the issue whose bug we're fixing, a spec document>.
### Step 2. Pick target
Pick **one** unit of work — a single behaviour to pin or fix. Size it so that the entire red → green → refactor cycle fits in one iteration:
- <CUSTOMIZE: concrete guidance for how to size work in this program. E.g. "one method signature" for an API-porting program, "one failing repro" for a bug-fixing program, "one spec bullet" for a spec-driven program.>
Deterministic overrides (apply *before* free choice):
- If the Test Harness has any entry with status `failing` that is **not** marked `blocked`, pick that one. A failing test is an obligation — you don't add new tests while old ones are still red.
- If the most recent 3 iterations were all `error` (validity pre-check failed, test didn't even compile), force a `rethink-test` iteration — the problem is the test, not the implementation. See Step 4's rethink branch.
Record the chosen target in the iteration's reasoning.
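The deterministic overrides above can be sketched as a small selection helper. The `HarnessEntry` and `IterationRecord` shapes are hypothetical — the playbook stores this state as markdown, not JSON — so treat this as a sketch of the decision order only:

```typescript
// Hypothetical shapes; the real harness lives as markdown in the state file.
type HarnessEntry = { name: string; status: "passing" | "failing" | "blocked" };
type IterationRecord = { outcome: "accepted" | "rejected" | "error" };

// Apply the deterministic overrides before any free target choice.
function pickTarget(
  harness: HarnessEntry[],
  recent: IterationRecord[], // most recent first
): { kind: "resume-failing"; entry: HarnessEntry } | { kind: "rethink-test" } | { kind: "free-choice" } {
  // Override 1: a failing, non-blocked test is an obligation.
  const failing = harness.find((e) => e.status === "failing");
  if (failing) return { kind: "resume-failing", entry: failing };

  // Override 2: three consecutive `error` iterations mean the test is the problem.
  if (recent.length >= 3 && recent.slice(0, 3).every((r) => r.outcome === "error")) {
    return { kind: "rethink-test" };
  }

  return { kind: "free-choice" };
}
```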
### Step 3. Red — write the failing test
Use `strategy/prompts/write-test.md` as framing.
Before writing the test, state (in visible reasoning):
1. What behaviour you are pinning. One sentence, specific.
2. The source-of-truth reference.
3. The minimum set of assertions that captures "this is correct" without over-specifying implementation details.
4. Edge cases the test must include.
Then write the test file (or append to an existing one). Before continuing: **run the test and confirm it fails with a useful error message**. If it passes already, you picked wrong — either the target is already implemented (pick a different one) or the test is too weak (rewrite).
Record the new test in the Test Harness with status `failing` and the iteration number.
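The "run it and confirm it fails usefully" gate can be named as a tiny decision function. In practice the agent reads the real runner's output; `test` here is a hypothetical thunk wrapping one test case:

```typescript
// Sketch of the red-phase gate: the new test must fail, and fail readably.
function confirmRed(test: () => void): { red: boolean; message: string } {
  try {
    test();
    // Passing now means the behaviour already exists or the test is too weak —
    // either way, do not continue to Step 4.
    return { red: false, message: "test already passes — pick a new target or strengthen it" };
  } catch (e) {
    // The message is what gets recorded in the iteration's reasoning.
    const message = e instanceof Error ? e.message : String(e);
    return { red: true, message };
  }
}
```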
### Step 4. Green — implement until the test passes
Use `strategy/prompts/make-green.md` as framing.
Before writing any implementation code, state:
1. Parent state of the target file(s) — one-line summary of what exists now.
2. The **minimum** change needed to make the failing test pass. Resist scope creep; the test defines the requirement, nothing else.
3. Which invariants of the existing tests must continue to hold (list them).
Then write the implementation. Run the full test suite (not just the new test): **every existing test must still pass, and the new one must now pass too.**
If the test still fails after implementation:
- **Attempt ≤ 3**: re-analyze what's missing and try again (stay in Step 4).
- **Attempt ≥ 4**: consider that the test itself may be wrong — re-enter the `rethink-test` branch. Read the source of truth again, weaken/rewrite the test to match the *real* spec, then restart Step 4. Document the change in the Test Harness entry as a `test-revised` note.
- **After 5 total attempts in the same iteration**: stop. Mark the target `blocked` in the Test Harness with a `blocked_reason`. Set `paused: true` on the state file with `pause_reason: "td-stuck: <target>"`. End the iteration.
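The escalation ladder above reduces to a one-line function — a sketch, with `attempt` counting implementation attempts within a single iteration, starting at 1:

```typescript
// Escalation ladder for a stuck green phase.
function greenPhaseAction(attempt: number): "retry" | "rethink-test" | "block" {
  if (attempt <= 3) return "retry";        // re-analyze, stay in Step 4
  if (attempt <= 5) return "rethink-test"; // suspect the test, re-read the spec
  return "block";                          // mark blocked, pause the program
}
```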
### Step 5. Refactor (optional, gated on green)
Only if the test suite is fully green, consider a refactor. Use `strategy/prompts/refactor.md` as framing.
Pick a refactor only if you can name a concrete clarity/complexity improvement. Cosmetic changes are not refactors — they are diffs in search of a justification. If nothing is worth refactoring, skip this step. Record the choice in reasoning either way.
After any refactor, the full test suite must still be green. If it isn't, revert the refactor and continue without it.
### Step 6. Evaluate
Run the evaluation command from `program.md`. For most TDD programs this is simply "the full test suite passes" — a boolean, not a scalar. Emit `{"metric": <count>, "passing": N, "failing": 0}` where `metric` is `passing` (higher is better).
Some TDD programs have a secondary metric (bundle size, coverage percentage). In that case `metric` can be the secondary metric, with the hard constraint that `failing == 0` — no reduction in coverage counts as progress if tests are red.
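One way to encode the emission rule above, including the "red tests veto any secondary metric" constraint — the pass/fail counts are assumed to come from the runner, and the zero-metric fallback for red suites is a sketch choice, not something the playbook mandates:

```typescript
// Step 6 sketch: build the metric JSON from the runner's pass/fail counts.
function emitMetric(passing: number, failing: number, secondary?: number): string {
  if (failing > 0) {
    // Red tests veto progress: no secondary metric counts while anything fails.
    return JSON.stringify({ metric: 0, passing, failing });
  }
  // Default metric is the passing count (higher is better); a secondary
  // metric (coverage %, etc.) may replace it once failing === 0.
  const metric = secondary ?? passing;
  return JSON.stringify({ metric, passing, failing: 0 });
}
```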
### Step 7. Update the Test Harness
Append the iteration's actions to `## ✅ Test Harness`:
- New test → add entry with status `passing` (it was just made green).
- Existing failing test became green → flip its status.
- A test became blocked → set status `blocked`, fill `blocked_reason`.
Enforce size discipline: keep at most <CUSTOMIZE: harness_size_cap, default 100> test entries visible; older entries can collapse into compressed range summaries (`### Tests 40–80 — ✅ passing (N batch additions for X feature): brief summary`).
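The size-discipline rule can be sketched as: keep the newest `cap` entries verbatim and collapse everything older into one compressed range summary. The `Entry` shape and the summary wording are hypothetical; only the keep-newest/collapse-oldest split is from the playbook:

```typescript
// Hypothetical harness entry: its iteration index plus a one-line summary.
type Entry = { index: number; summary: string };

// Keep the newest `cap` entries (harness is newest-first); collapse the rest.
function compressHarness(
  entries: Entry[],
  cap = 100,
): { kept: Entry[]; collapsed: string | null } {
  if (entries.length <= cap) return { kept: entries, collapsed: null };
  const kept = entries.slice(0, cap);
  const old = entries.slice(cap);
  const lo = Math.min(...old.map((e) => e.index));
  const hi = Math.max(...old.map((e) => e.index));
  return { kept, collapsed: `### Tests ${lo}–${hi} — ✅ passing (${old.length} collapsed entries)` };
}
```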
### Step 8. Fold through to the default loop
Continue with the default workflow's accept/reject + CI gating + state file update steps. The only additional requirements from Test-Driven are:
- The Iteration History entry must include `phase` (red / green / refactor / rethink-test), `target`, `new_tests` count, `existing_tests_status` (all-green / regression-introduced-and-fixed).
- Lessons Learned additions should be phrased as *transferable heuristics* about the problem space.
## Test Harness schema
The harness lives in the state file `{program-name}.md` on the `memory/autoloop` branch as a subsection:
```markdown
## ✅ Test Harness

> 🤖 *Managed by the Test-Driven strategy. One entry per pinned behaviour. Newest first.*

### <test-name or describe-block> · gen <N>

- **Status**: ✅ passing / ❌ failing / 🚧 blocked
- **Target**: <file:line or target entity being specified>
- **Spec source**: <URL or reference>
- **Added iteration**: <N>
- **Made green iteration**: <M> (if applicable)
- **Blocked reason**: <one-line, if blocked>
- **Notes**: <one sentence on what the test pins, not how>

---
```
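For concreteness, a hypothetical filled-in entry — every name, path, and iteration number here is invented for illustration:

```markdown
### sortValues orders NaN last when naPosition is last · gen 12

- **Status**: ✅ passing
- **Target**: src/series/sort.ts — `sortValues`
- **Spec source**: pandas `Series.sort_values` documentation (`na_position`)
- **Added iteration**: 12
- **Made green iteration**: 13
- **Notes**: Pins NaN placement under `naPosition: "last"`, not the sort algorithm used.

---
```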
Identifiers:
- Test names should match the test file's names verbatim so they're greppable.
- Status `blocked` means attempts have been exhausted and a human must intervene. The test is still present in the test file but may be `skip`ped (with a TODO comment pointing at the blocked entry here).
## Invariants the agent must not violate

- **Never loosen an existing test to make a new one pass.** If an existing test fails because of your implementation change, the change is wrong — fix the implementation, not the test.
- **Never skip a failing test** to get CI green (except via `blocked` with a recorded reason, and only after the 5-attempt budget).
- **Never delete a test.** Revise the assertions if the spec has genuinely changed (and document the change), but don't remove coverage.
- **Tests are the acceptance criterion.** CI green on the pushed commit is the accept signal (composes with #37's CI-gated acceptance). If CI has tests passing but the Test Harness shows any `failing`, the state file is out of sync — reconcile before ending the iteration.
- **Tests pin behaviour, not implementation.** Don't write tests that check private state, internal method names, or structural details a future refactor would legitimately change.
strategies/test-driven/prompts/write-test.md
# Write-test prompt — <CUSTOMIZE: program-name>
Framing for the **red** phase of an iteration. Read before writing the failing test.
---
You are writing a test that **pins desired behaviour**. The point is to capture a specification the implementation must satisfy — not to exhaustively probe the current implementation's structure.
## Domain knowledge
<CUSTOMIZE: 5–15 bullets with the high-leverage facts for *this* domain. Source-of-truth URL, known edge cases this problem has (NaN in sort order, Unicode collation, timezone transitions, integer overflow boundaries), conventions for this test suite.>
## How to write a good failing test

1. **One behaviour per test.** If you catch yourself writing `&&` in a single assertion, split the test.
2. **Name the test after the behaviour, not the method.** "`sortValues orders NaN last when naPosition is last`" is useful; "`test sort 3`" is not.
3. **Prefer property assertions over fixtures** for cross-cutting invariants (lengths, permutations, idempotence, ordering). Use concrete fixtures for the canonical happy path and the known-tricky cases.
4. **Test the behaviour, not the implementation.** Checking that `result.length === input.length` is a behaviour. Checking that `result._internalSortAlgorithm === "quicksort"` is implementation.
5. **Assert on both "what's there" and "what's not there"** where both matter.
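A behaviour-pinning test under those rules might look like the following. `sortValues` and its options are hypothetical names echoing the examples above (stubbed here so the assertions have something to run against), and the assertions are written as plain checks rather than any particular runner's API:

```typescript
// Hypothetical target: sortValues(xs, { naPosition }) sorts numbers and
// places NaN according to naPosition. Stub implementation for the demo.
function sortValues(xs: number[], opts: { naPosition: "first" | "last" }): number[] {
  const nans = xs.filter(Number.isNaN);
  const rest = xs.filter((x) => !Number.isNaN(x)).sort((a, b) => a - b);
  return opts.naPosition === "last" ? [...rest, ...nans] : [...nans, ...rest];
}

// "sortValues orders NaN last when naPosition is last" — behaviour, not structure.
const input = [3, NaN, 1, 2];
const result = sortValues(input, { naPosition: "last" });

// Behaviour: output has the same length as the input (a permutation).
console.assert(result.length === input.length);
// Behaviour: every NaN sits after every non-NaN value.
console.assert(result.findIndex(Number.isNaN) === result.length - 1);
// NOT asserted: which sort algorithm ran, or any private field of the result —
// those are implementation details a future refactor may legitimately change.
```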
## Red-phase checklist
Before moving on:
- The new test file compiles.
- Running the test produces **one clear failure message** — not a parse error, not a compile error, not a stack trace from uninitialized state. The failure should read to a human as "this behaviour is missing" or "this behaviour is wrong in this specific way."
- The failure message names the expected vs. actual value in a form a future contributor can act on.
- The test would *still pass* under a reasonable future refactor of the implementation (no implementation-detail coupling).
## What the reasoning output must contain
Before writing the test:
- Target: what behaviour are you pinning?
- Spec source: URL / reference.
- Edge cases included: list the specific cases this test covers.
- Edge cases intentionally excluded: list what this test *doesn't* cover and why.
After writing the test:
- A "Red summary" line: 10–20 words, suitable for the Test Harness and Iteration History.
- The concrete failure message observed when the test runs.
strategies/test-driven/prompts/make-green.md
# Make-green prompt — <CUSTOMIZE: program-name>
Framing for the **green** phase of an iteration. Read before writing the implementation.
---
You are making a failing test pass with the **minimum** change. Scope creep is the enemy — the test defines the requirement, nothing else.
## Domain knowledge
<CUSTOMIZE: same facts as write-test.md. Keep them in sync when one is updated.>
## How to make a test pass without scope creep

1. **Re-read the failing test.** Don't skim. The exact assertions tell you exactly what must change.
2. **Identify the smallest code change** that would make the failing test pass without breaking any existing test. Name it concretely.
3. **Write only that change.** If a helper would make the code cleaner, note it for a later refactor iteration — don't add it now.
4. **Never modify existing tests to make the new test pass.** If the change you're considering breaks an existing test, something is wrong with your change, not with the old test.
5. **Run the full test suite, not just the new test.** Regressions in unrelated tests must be fixed before the iteration is accepted.
## Anti-patterns to avoid

- **Overfitting to the test.** Don't hard-code the test's expected value in the implementation.
- **Speculative generality.** "While I'm here, let me also handle <edge case the test doesn't cover>." No — that edge case gets its own test.
- **Parallel implementations.** If the existing implementation has a branch your new behaviour doesn't fit into, think carefully before adding an `if (<new case>)` branch.
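The overfitting anti-pattern in concrete form — both versions below make a single-case test green, but only the second implements the behaviour. The `clamp` example and names are invented for illustration:

```typescript
// Test being made green: clamp(15, 0, 10) === 10.

// Overfit: hard-codes the test's expected value; any other input exposes it.
function clampOverfit(x: number, lo: number, hi: number): number {
  if (x === 15 && lo === 0 && hi === 10) return 10;
  return x;
}

// Minimal general change: implements the behaviour the test pins.
function clamp(x: number, lo: number, hi: number): number {
  return Math.min(hi, Math.max(lo, x));
}

console.assert(clampOverfit(15, 0, 10) === 10); // green, but a lie
console.assert(clamp(15, 0, 10) === 10);        // green for the right reason
console.assert(clamp(-5, 0, 10) === 0);         // and still correct elsewhere
```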
## Green-phase checklist
Before moving on:
- The new test passes.
- Every existing test still passes.
- The code change is as small as possible while satisfying the above two.
- You can explain in one sentence *why* the change makes the test pass.
## What the reasoning output must contain
Before writing the implementation:
- Target file(s) and what exists there now (one-line summary).
- The minimum change you're about to make.
- Existing-test invariants that must continue to hold.
After writing the implementation:
- A "Green summary" line.
- Confirmation that the full suite is green (attach the relevant pass count).
- Any follow-up refactor candidates you noted but deliberately did *not* implement in this iteration.
strategies/test-driven/prompts/refactor.md
# Refactor prompt — <CUSTOMIZE: program-name>
Framing for the optional **refactor** phase. Read only after the green phase is complete and the full test suite is passing.
---
You are deciding whether — and how — to refactor. The bar is: can you name a concrete clarity or complexity improvement that a future reader of this code will thank you for? If not, skip the refactor.
## When to refactor

- Duplicated logic that just appeared — the change in this iteration introduced a near-copy of existing code. Deduplicate now, while the context is fresh.
- A function that grew past its natural boundaries — if the target file now has a function with more than one responsibility that used to have one, extract.
- Naming that drifted — the variable or function name no longer reflects what the code does after the change. Rename.
## When NOT to refactor

- The code looks fine. Skip.
- You want to rearrange for aesthetic preference. Skip.
- You want to add abstractions you might need later. Strongly skip.
- The existing code is old and has nothing to do with this iteration. Not your job, not this iteration.
## Refactor-phase checklist
Before committing the refactor:
- Every test that was green at the end of the green phase is still green.
- You can summarize the refactor in one line.
- The diff is smaller than the green phase's diff. If it's not, you're probably doing something that wants its own iteration.
## What the reasoning output must contain
If you chose to refactor:
- The one-line improvement.
- Before/after of the specific thing you changed (not the whole file).
- Confirmation all tests still pass.
If you chose not to refactor:
- One sentence on why. "No refactor needed" is a valid outcome.
strategies/test-driven/CUSTOMIZE.md
# Test-Driven — Customization Guide (read by the program-creator agent)
This file is **not** copied into programs. It tells the creator agent how to turn the generic Test-Driven template into a problem-specific strategy in a new program directory.
## When to pick this strategy
Pick Test-Driven when **all** of the following hold:
- The goal is defined by **behaviour that can be captured in tests** — an API to implement, a bug to fix, a spec to satisfy.
- Progress is additive: each iteration pins one more piece of behaviour and makes it work, without regressing the pieces already pinned.
- There is a **source of truth** for correctness — a reference implementation, a spec document, a failing repro on an issue.
- Validity is cheap (the test suite runs in seconds to a few minutes).
Do **not** pick Test-Driven for: optimizing a scalar metric (use OpenEvolve), exploratory research without a clear spec, or pure refactoring tasks.
## What to copy
Copy these files from this template into the new program directory at `.autoloop/programs/<program-name>/strategy/`:
- `strategy.md` → `strategy/test-driven.md`
- `prompts/write-test.md` → `strategy/prompts/write-test.md`
- `prompts/make-green.md` → `strategy/prompts/make-green.md`
- `prompts/refactor.md` → `strategy/prompts/refactor.md`
The `## ✅ Test Harness` subsection is not a filesystem concept — it lives in the program's state file on `memory/autoloop`.
## What to customize
Every `<CUSTOMIZE: …>` marker must be resolved before enabling the program.
In `strategy/test-driven.md`:
-**Problem framing** — 2–4 sentences on the target artifact, source of truth for correctness, what makes a test "good" in this domain.
-**Step 2 target-sizing guidance** — the concrete unit of work per iteration (one method, one bug, one spec bullet).
-**Harness size cap** — default 100, bump for large porting efforts.
In `strategy/prompts/*`:
-**Domain knowledge block** — the facts an expert would put on a whiteboard. Keep these three files in sync.
## What goes in program.md
Replace the program's `## Evolution Strategy` section with a pointer block:
```markdown
## Evolution Strategy
This program uses the **Test-Driven** strategy. On every iteration, read `strategy/test-driven.md` and follow it literally — it supersedes the generic iteration steps in the default autoloop loop.
Support files:
- `strategy/test-driven.md` — the runtime playbook (red/green/refactor phases, rethink-test rule).
- `strategy/prompts/write-test.md` — framing for the red phase.
- `strategy/prompts/make-green.md` — framing for the green phase.
- `strategy/prompts/refactor.md` — framing for the optional refactor phase.
Test Harness lives in the state file on the `memory/autoloop` branch under the `## ✅ Test Harness` subsection (see the playbook for the schema).
```

## What NOT to put in the program directory

- Do not duplicate the Test Harness into a file in the program dir — it lives in the state file.
- Do not copy `CUSTOMIZE.md` — creator-time only.
- Do not invent new phases beyond red/green/refactor/rethink-test.
Scope
Mirrors #47's shape so the two strategies feel like siblings.
Acceptance
`.autoloop/strategies/test-driven/` exists with `strategy.md`, `CUSTOMIZE.md`, `prompts/write-test.md`, `prompts/make-green.md`, `prompts/refactor.md`.

Out of scope
Related