Status
RFC in 48h FCP — spike validation passed all four feasibility checks (summary, artifacts in benchmarks/spike-tdd-kernel/). RFC body below has been updated with the spike-derived refinements.
Motivation
Every coding agent on the market (Claude Code, Cursor, Aider, Codex, Continue) does TDD as a prompt — "please write the test first". It's a suggestion. Models drop it under pressure, frequently on the hard tasks where it matters most.
Reasonix has a structural advantage no one else does: an append-only event log (events.jsonl) and a tool dispatcher that already gates on hooks (src/hooks.ts). Red→green can be a kernel invariant — not a prompt, not a system message — by refusing edit_file at dispatch time unless the log contains a paired test event that flipped fail → pass.
This is the only place in the agent stack where TDD can be enforced rather than encouraged. As a side effect, every accepted edit becomes a (red test, edit, green test) triple in events.jsonl — a deterministic, labelled corpus for future evals.
Proposal
New event types
type TestRunEvent = {
  type: 'test_run';
  test_id: string; // <rel-path>::<fullName> OR user annotation slug
  test_id_source: 'native' | 'annotation';
  status: 'pass' | 'fail';
  command: string; // the test command actually run
  duration_ms: number;
  ts: number;
};

type EditClaimEvent = {
  type: 'edit_claim';
  test_id: string; // "this edit advances this red test"
  edit_target: string;
  ts: number;
};
test_id resolution
Default test_id = <rel-path>::<fullName> from vitest's --reporter=json output. If the test source has a // @reasonix-test-id: <slug> comment within 3 lines above the matched it(/test(, the slug overrides and test_id_source = 'annotation'. Brownfield is zero-churn (existing test files use the default); rename-stable for users who opt in.
Failure-mode comparison vs content-hash and annotation-only schemes: benchmarks/spike-tdd-kernel/test-id-spec.md.
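The resolution rule above can be sketched roughly as follows. This is a minimal illustration, not the spike's implementation: `resolveTestId`, its parameters, and the exact regular expression are assumptions, and it presumes the caller already knows the 0-based line of the matched it(/test( call.

```typescript
type TestIdSource = 'native' | 'annotation';

interface ResolvedTestId {
  test_id: string;
  test_id_source: TestIdSource;
}

// Assumed annotation syntax, taken from the text: `// @reasonix-test-id: <slug>`.
const ANNOTATION = /\/\/\s*@reasonix-test-id:\s*(\S+)/;

/**
 * Hypothetical helper.
 * sourceLines: the test file split into lines.
 * testLine: 0-based index of the line holding the matched it(/test( call.
 */
function resolveTestId(
  relPath: string,
  fullName: string,
  sourceLines: readonly string[],
  testLine: number,
): ResolvedTestId {
  // Scan at most 3 lines above the test call, nearest line first.
  for (let i = testLine - 1; i >= Math.max(0, testLine - 3); i--) {
    const line = sourceLines[i] ?? '';
    const m = ANNOTATION.exec(line);
    if (m && m[1]) {
      return { test_id: m[1], test_id_source: 'annotation' };
    }
  }
  // No annotation in range: fall back to the default <rel-path>::<fullName>.
  return { test_id: `${relPath}::${fullName}`, test_id_source: 'native' };
}
```

An unannotated file always takes the fallback branch, which is what keeps brownfield repos zero-churn.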
Kernel invariant (src/loop.ts + src/tools.ts)
An edit_file dispatch is accepted only if all of the following hold:
- The most recent test_run for test_id is fail (red exists).
- The model emitted a matching edit_claim after that red.
- Post-edit verification comes back green. The dispatcher appends the claimed test_id(s) to a per-turn coalescing buffer; the buffer flushes once at end-of-turn as a single vitest -t a -t b -t c invocation, and the resulting test_run events are appended. If all are green, the edits commit to the in-memory edit set. If any is red, the offending edits are reverted and a repair event fires; the existing storm-breaker handles retry. (Per-edit invocations are infeasible: spike Exp 4 measured ~1.9s vitest framework boot per call, so batching is mandatory.)
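As a rough sketch of the pre-dispatch half of the invariant (a red test_run exists and a claim follows it), the check can be written as a pure function over the event log. `mayDispatchEdit` and `GateResult` are hypothetical names, and treating "matching" as "same test_id and same edit_target" is an assumption; the event types repeat the definitions given earlier.

```typescript
type TestRunEvent = {
  type: 'test_run';
  test_id: string;
  test_id_source: 'native' | 'annotation';
  status: 'pass' | 'fail';
  command: string;
  duration_ms: number;
  ts: number;
};

type EditClaimEvent = {
  type: 'edit_claim';
  test_id: string;
  edit_target: string;
  ts: number;
};

type LogEvent = TestRunEvent | EditClaimEvent;

// Hypothetical result shape: either allowed, or refused with a reason.
type GateResult = { allowed: true } | { allowed: false; reason: string };

function mayDispatchEdit(
  log: readonly LogEvent[], // append-only, so array order is time order
  testId: string,
  editTarget: string,
): GateResult {
  let lastRun: TestRunEvent | undefined;
  let claimedAfterRun = false;
  for (const e of log) {
    if (e.test_id !== testId) continue;
    if (e.type === 'test_run') {
      lastRun = e;
      claimedAfterRun = false; // claims made before this run no longer count
    } else if (e.edit_target === editTarget) {
      claimedAfterRun = true;
    }
  }
  if (!lastRun) {
    return { allowed: false, reason: 'no test_run recorded for test_id' };
  }
  if (lastRun.status !== 'fail') {
    return { allowed: false, reason: 'most recent test_run is not red' };
  }
  if (!claimedAfterRun) {
    return { allowed: false, reason: 'no matching edit_claim after the red run' };
  }
  return { allowed: true };
}
```

Keeping the check a pure function of the log means the same predicate could be replayed offline against events.jsonl when auditing accepted edits.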
/refactor bypass
Pure restructure / docs / config / dep bumps don't have a meaningful failing test. /refactor flips a session flag: edits permitted, but npm run verify (or reasonix.config.ts-configured equivalent) must pass before exit. Plan mode is the precedent.
Plan integration
submit_plan step schema gains optional test_id. Steps without it can only run inside /refactor. TUI shows red/green dots per step.
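One possible shape for the extended step, as a hedged sketch: only the optional test_id comes from the proposal. The `title` field, `canRunStep`, `stepDot`, and the `'none'` dot state for steps without a recorded run are illustrative assumptions.

```typescript
// Hypothetical step shape; only the optional test_id is named in the proposal.
interface PlanStep {
  title: string; // assumed field, for illustration only
  test_id?: string;
}

type StepDot = 'red' | 'green' | 'none';

// Steps without a test_id can only run inside /refactor.
function canRunStep(step: PlanStep, refactorMode: boolean): boolean {
  return step.test_id !== undefined || refactorMode;
}

// Derive the TUI dot from the latest known status per test_id.
function stepDot(
  step: PlanStep,
  latestStatus: Readonly<Record<string, 'pass' | 'fail' | undefined>>,
): StepDot {
  if (step.test_id === undefined) return 'none';
  const status = latestStatus[step.test_id];
  if (status === 'pass') return 'green';
  if (status === 'fail') return 'red';
  return 'none'; // no test_run recorded yet for this step
}
```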
Cost analysis — does this stay cheap?
The whole pitch rests on staying cheap, so this is the load-bearing question.
- Tests are short and stable. A test file changes far less than impl. Once the red test_run is in the prefix, green iterations cache-hit on it.
- Red runs once. Dispatcher caches by (test_id, command). Only the post-edit candidate run is new.
- Failure bounded. Two failed greens → revert + repair event → existing harness engages. No infinite loop.
- Augmentation is cache-positive, not negative. Spike Exp 1 measured a +10pt cache-hit improvement (83.5% → 93.6%) under the augmented edit_file tool_result vs baseline, because the new ~80-token footer sits in the prefix (cached on subsequent turns) not the tail (always missed). See benchmarks/spike-tdd-kernel/cost-results.md.
Worst case: +1 vitest invocation per edit-batch (typically per turn) — spike Exp 4 median 1.9s framework boot + actual test runtime. Per turn the model-visible additions are edit_claim (~30 tok per claim) and the appended test_run footer (~80 tok per batch). Versus branching (N× tokens), negligible.
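The "one invocation per edit-batch" bound depends on the per-turn coalescing buffer. A minimal sketch, assuming a hypothetical `TestRunBuffer` class: the argv shape simply mirrors the `vitest -t a -t b -t c` form quoted in this RFC and is not independently checked against the vitest CLI here.

```typescript
// Hypothetical per-turn buffer: collects test selectors and flushes them once.
class TestRunBuffer {
  private pending: string[] = [];

  // Called after each edit with the test_id(s) that edit claimed.
  add(selectors: readonly string[]): void {
    for (const s of selectors) {
      // De-duplicate so two edits claiming the same test add one selector.
      if (this.pending.indexOf(s) === -1) this.pending.push(s);
    }
  }

  // Called once at end-of-turn. Returns the argv for a single batched run,
  // or null when no edit happened this turn (so no invocation is paid for).
  flush(): string[] | null {
    if (this.pending.length === 0) return null;
    const argv = ['vitest'];
    for (const s of this.pending) argv.push('-t', s);
    this.pending = [];
    return argv;
  }
}
```

With this shape, a turn that makes three edits against two tests still pays the framework boot once.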
Open questions
- Greenfield test discovery. Resolved by spike Exp 3 (10/10 prompts pass-all on the substantive checks). Plan steps with test_id must also carry test_file_path; the model authors the failing test in step 1 with a sharp system message, so no structured author_failing_test tool is needed. reasonix doctor warns when a plan step has test_id but no test_file_path.
- Slow test suites. Narrowed by spike Exp 4. Resolution: per-test-id selectors via vitest -t, batched per turn (median 1.9s, p95 5.0s). test_command_for(test_id) in reasonix.config.ts stays as the knob for non-vitest runners.
- Mutation sweeps. Codemod across 50 files probably falls under /refactor, but worth a separate carve-out.
- Untested codebases. A user with zero tests is locked out. reasonix doctor could detect and default first session to /refactor.
- Strict by default, or --strict opt-in? Strongly leaning strict-by-default. Flag if you disagree.
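The two reasonix doctor behaviours mentioned in this list could look roughly like the sketch below. Everything here is illustrative: `doctorWarnings`, `defaultSessionMode`, the `title` field, and the `'strict'` mode label are assumptions, and defaulting a zero-test repo to /refactor is still an open question rather than settled behaviour.

```typescript
// Hypothetical step shape with the test_file_path requirement from Exp 3.
interface PlanStep {
  title: string; // assumed field, for illustration only
  test_id?: string;
  test_file_path?: string;
}

// Warn for every plan step that names a test_id but no test_file_path.
function doctorWarnings(steps: readonly PlanStep[]): string[] {
  const warnings: string[] = [];
  steps.forEach((step, i) => {
    if (step.test_id !== undefined && step.test_file_path === undefined) {
      warnings.push(`step ${i + 1} ("${step.title}") has test_id but no test_file_path`);
    }
  });
  return warnings;
}

// Possible first-session default: a repo with zero test files starts in /refactor.
function defaultSessionMode(testFileCount: number): 'strict' | 'refactor' {
  return testFileCount === 0 ? 'refactor' : 'strict';
}
```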
Out of scope
- Generating tests for the user — that's the model's job.
- Branch-and-select at the test level (token-expensive, contradicts cheap pitch).
- Coverage gating — orthogonal.
Prior art
- Cursor / Claude Code: prompt-level, not enforced.
- Aider: --auto-test runs tests but doesn't gate dispatch on red.
- No known agent enforces TDD at the dispatcher.
Ask
Looking for feedback on the kernel-invariant shape, the greenfield flow, and any way the cost story breaks.