RFC: Kernel-level red-green — every edit_file requires a paired test event #25

@esengine

Status

RFC in 48h FCP (final comment period). Spike validation passed all four feasibility checks (summary and artifacts in benchmarks/spike-tdd-kernel/). The RFC body below has been updated with the spike-derived refinements.

Motivation

Every coding agent on the market (Claude Code, Cursor, Aider, Codex, Continue) does TDD as a prompt — "please write the test first". It's a suggestion. Models drop it under pressure, frequently on the hard tasks where it matters most.

Reasonix has a structural advantage no other agent has: an append-only event log (events.jsonl) and a tool dispatcher that already gates on hooks (src/hooks.ts). Red→green can be a kernel invariant rather than a prompt or a system message: the dispatcher refuses edit_file at dispatch time unless the log contains a paired test event that flipped fail→pass.

This is the only place in the agent stack where TDD can be enforced rather than encouraged. As a side effect, every accepted edit becomes a (red test, edit, green test) triple in events.jsonl — a deterministic, labelled corpus for future evals.

Proposal

New event types

type TestRunEvent = {
  type: 'test_run';
  test_id: string;            // <rel-path>::<fullName>  OR  user annotation slug
  test_id_source: 'native' | 'annotation';
  status: 'pass' | 'fail';
  command: string;            // the test command actually run
  duration_ms: number;
  ts: number;
};

type EditClaimEvent = {
  type: 'edit_claim';
  test_id: string;            // "this edit advances this red test"
  edit_target: string;
  ts: number;
};

test_id resolution

Default test_id = <rel-path>::<fullName> from vitest's --reporter=json output. If the test source has a // @reasonix-test-id: <slug> comment within 3 lines above the matched it(/test(, the slug overrides the default and test_id_source = 'annotation'. Brownfield adoption is zero-churn, since existing test files keep using the default; users who opt into annotations get IDs that survive test renames.
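The resolution rule above can be sketched as a small pure function. This is a minimal sketch under assumptions: resolveTestId, its parameters, and the regular expression are hypothetical names for illustration, and the caller is assumed to already know the 0-based line index of the matched it(/test( call.

```typescript
// Hypothetical helper; none of these names come from the Reasonix codebase.
type ResolvedTestId = {
  test_id: string;
  test_id_source: 'native' | 'annotation';
};

// Matches "// @reasonix-test-id: <slug>" and captures the slug.
const ANNOTATION = /\/\/\s*@reasonix-test-id:\s*(\S+)/;

function resolveTestId(
  relPath: string,        // test file path relative to the repo root
  fullName: string,       // full test name as reported by the runner
  sourceLines: string[],  // lines of the test file
  testLine: number,       // 0-based index of the matched it(/test( line
): ResolvedTestId {
  // Scan the 3 lines directly above the test call, nearest first.
  for (let i = testLine - 1; i >= Math.max(0, testLine - 3); i--) {
    const slug = ANNOTATION.exec(sourceLines[i] ?? '')?.[1];
    if (slug) {
      return { test_id: slug, test_id_source: 'annotation' };
    }
  }
  // No annotation in range: fall back to the native default.
  return { test_id: `${relPath}::${fullName}`, test_id_source: 'native' };
}
```

An annotation four or more lines above the call is ignored in this sketch, so the native ID is used in that case.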

Failure-mode comparison vs content-hash and annotation-only schemes: benchmarks/spike-tdd-kernel/test-id-spec.md.

Kernel invariant (src/loop.ts + src/tools.ts)

edit_file dispatch refuses unless all are true:

  1. Most recent test_run for test_id is fail (red exists).
  2. Model emitted a matching edit_claim after that red.
  3. Post-edit, the dispatcher appends test_id(s) to a per-turn coalescing buffer; the buffer flushes once at end-of-turn as a single vitest -t a -t b -t c invocation, and the resulting test_run events are appended. If all green, edits commit to the in-memory edit set. If any red, the offending edits are reverted and a repair event fires — existing storm-breaker handles retry. (Per-edit invocations are infeasible: spike Exp 4 measured ~1.9s vitest framework boot per call, so batching is mandatory.)
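Conditions 1 and 2 amount to a pure check over the event log. A minimal sketch, assuming the log is available as an in-memory array in append order; checkEditGate, GateResult, and the choice to also match the claim on edit_target are assumptions of this sketch, not part of the RFC.

```typescript
type TestRunEvent = {
  type: 'test_run';
  test_id: string;
  test_id_source: 'native' | 'annotation';
  status: 'pass' | 'fail';
  command: string;
  duration_ms: number;
  ts: number;
};

type EditClaimEvent = {
  type: 'edit_claim';
  test_id: string;
  edit_target: string;
  ts: number;
};

type LogEvent = TestRunEvent | EditClaimEvent;
type GateResult = { allowed: true } | { allowed: false; reason: string };

// Hypothetical gate: decides whether an edit_file call may be dispatched.
function checkEditGate(
  log: readonly LogEvent[],
  testId: string,
  editTarget: string,
): GateResult {
  // Condition 1: the most recent test_run for this test_id must be a fail.
  let redIndex = -1;
  log.forEach((event, index) => {
    if (event.type === 'test_run' && event.test_id === testId) {
      redIndex = event.status === 'fail' ? index : -1;
    }
  });
  if (redIndex === -1) {
    return { allowed: false, reason: `no current red test_run for ${testId}` };
  }
  // Condition 2: a matching edit_claim must appear after that red run.
  const claimed = log.some(
    (event, index) =>
      index > redIndex &&
      event.type === 'edit_claim' &&
      event.test_id === testId &&
      event.edit_target === editTarget,
  );
  return claimed
    ? { allowed: true }
    : { allowed: false, reason: `no edit_claim for ${testId} after the red run` };
}
```

Condition 3 (the post-edit batched run) is deliberately left out here, since it happens after dispatch rather than before it.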

/refactor bypass

Pure restructure / docs / config / dep bumps don't have a meaningful failing test. /refactor flips a session flag: edits permitted, but npm run verify (or reasonix.config.ts-configured equivalent) must pass before exit. Plan mode is the precedent.
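The exit rule for /refactor could look roughly like the following. This is a sketch only: SessionMode, ExitDecision, and decideExit are hypothetical names, and runVerify stands in for whatever runs npm run verify (or the configured equivalent) and returns its exit code.

```typescript
type SessionMode = 'strict' | 'refactor';
type ExitDecision = { ok: true } | { ok: false; reason: string };

// Hypothetical exit check: /refactor sessions must pass verify before exit.
function decideExit(mode: SessionMode, runVerify: () => number): ExitDecision {
  if (mode === 'strict') {
    // Strict sessions were already gated edit by edit; no extra check here.
    return { ok: true };
  }
  const exitCode = runVerify();
  return exitCode === 0
    ? { ok: true }
    : { ok: false, reason: `verify exited with code ${exitCode}` };
}
```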

Plan integration

submit_plan step schema gains optional test_id. Steps without it can only run inside /refactor. TUI shows red/green dots per step.
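As a rough illustration of the step rule, combined with the test_file_path requirement from open question 1, a per-step check might look like this. The PlanStep shape beyond test_id and test_file_path (here just title) and the names checkPlanStep and StepCheck are assumptions for the sketch.

```typescript
type PlanStep = {
  title: string;            // hypothetical; other existing step fields omitted
  test_id?: string;         // optional, per this RFC
  test_file_path?: string;  // expected alongside test_id (open question 1)
};

type StepCheck = { runnable: boolean; warnings: string[] };

// Hypothetical validation of one plan step against the current session mode.
function checkPlanStep(step: PlanStep, refactorMode: boolean): StepCheck {
  const warnings: string[] = [];
  if (step.test_id === undefined) {
    // Steps without a test_id may only run inside /refactor.
    if (!refactorMode) {
      warnings.push(`step "${step.title}" has no test_id and needs /refactor`);
    }
    return { runnable: refactorMode, warnings };
  }
  if (step.test_file_path === undefined) {
    // Mirrors the reasonix doctor warning described in open question 1.
    warnings.push(`step "${step.title}" has test_id but no test_file_path`);
  }
  return { runnable: true, warnings };
}
```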

Cost analysis — does this stay cheap?

The whole pitch is cheap, so this is the load-bearing question.

  • Tests are short and stable. A test file changes far less than impl. Once the red test_run is in the prefix, green iterations cache-hit on it.
  • Red runs once. Dispatcher caches by (test_id, command). Only the post-edit candidate run is new.
  • Failure bounded. Two failed greens → revert + repair event → existing harness engages. No infinite loop.
  • Augmentation is cache-positive, not negative. Spike Exp 1 measured a +10pt cache-hit improvement (83.5% → 93.6%) under the augmented edit_file tool_result vs baseline, because the new ~80-token footer sits in the prefix (cached on subsequent turns) not the tail (always missed). See benchmarks/spike-tdd-kernel/cost-results.md.
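The "red runs once" point can be illustrated with a small memoizing wrapper keyed by (test_id, command). This is a sketch under assumptions: RedRunCache and the Runner callback are hypothetical, and the real dispatcher may store full test_run events rather than just a status.

```typescript
type TestStatus = 'pass' | 'fail';
type Runner = (testId: string, command: string) => TestStatus;

// Hypothetical cache: the baseline (red) run executes once per key.
class RedRunCache {
  private readonly results = new Map<string, TestStatus>();
  private readonly runner: Runner;

  constructor(runner: Runner) {
    this.runner = runner;
  }

  // Returns the cached baseline status, running the test only on a miss.
  baseline(testId: string, command: string): TestStatus {
    // JSON-encode the pair so the two parts cannot collide across keys.
    const key = JSON.stringify([testId, command]);
    const cached = this.results.get(key);
    if (cached !== undefined) {
      return cached;
    }
    const status = this.runner(testId, command);
    this.results.set(key, status);
    return status;
  }
}
```

Post-edit candidate runs would bypass this cache, since their result depends on the edit being evaluated.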

Worst case: +1 vitest invocation per edit-batch (typically per turn) — spike Exp 4 median 1.9s framework boot + actual test runtime. Per turn the model-visible additions are edit_claim (~30 tok per claim) and the appended test_run footer (~80 tok per batch). Versus branching (N× tokens), negligible.
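The per-turn batching that bounds this worst case could be sketched as a buffer that collects test_ids and emits one command line at end-of-turn. Everything named here (TurnTestBuffer, escapeRegExp) is hypothetical. The sketch handles only native <rel-path>::<fullName> IDs, and it joins the names into a single -t regular expression rather than repeating the flag, which is a choice of this sketch and not something the RFC specifies.

```typescript
// Escape regex metacharacters so a test name matches literally.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical per-turn buffer; flush() yields argv for one vitest run.
class TurnTestBuffer {
  private readonly files = new Set<string>();
  private readonly names = new Set<string>();

  add(testId: string): void {
    const sep = testId.indexOf('::');
    if (sep === -1) {
      // Annotation slugs would need a lookup table, omitted in this sketch.
      throw new Error(`not a native test_id: ${testId}`);
    }
    this.files.add(testId.slice(0, sep));
    this.names.add(testId.slice(sep + 2));
  }

  // Returns null when nothing was buffered; otherwise clears the buffer.
  flush(): string[] | null {
    if (this.names.size === 0) {
      return null;
    }
    const pattern = Array.from(this.names).map(escapeRegExp).join('|');
    // Note: the pattern applies to every listed file, so a name shared by
    // two files could select an extra test.
    const argv = ['vitest', 'run', ...Array.from(this.files), '-t', pattern];
    this.files.clear();
    this.names.clear();
    return argv;
  }
}
```

Duplicate test_ids within a turn collapse into one entry, so repeated edits against the same red test still cost a single invocation.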

Open questions

  1. Greenfield test discovery. Resolved by spike Exp 3 (10/10 prompts pass-all on the substantive checks). Plan steps with test_id must also carry test_file_path; the model authors the failing test in step 1 with a sharp system message — no structured author_failing_test tool needed. reasonix doctor warns when a plan step has test_id but no test_file_path.
  2. Slow test suites. Narrowed by spike Exp 4. Resolution: per-test-id selectors via vitest -t, batched per turn (median 1.9s, p95 5.0s). test_command_for(test_id) in reasonix.config.ts stays as the knob for non-vitest runners.
  3. Mutation sweeps. Codemod across 50 files probably falls under /refactor, but worth a separate carve-out.
  4. Untested codebases. A user with zero tests is locked out. reasonix doctor could detect and default first session to /refactor.
  5. Strict by default, or --strict opt-in? Strongly leaning strict-by-default. Flag if you disagree.

Out of scope

  • Generating tests for the user — that's the model's job.
  • Branch-and-select at the test level (token-expensive, contradicts cheap pitch).
  • Coverage gating — orthogonal.

Prior art

  • Cursor / Claude Code: prompt-level, not enforced.
  • Aider: --auto-test runs tests but doesn't gate dispatch on red.
  • No known agent enforces TDD at the dispatcher.

Ask

Looking for feedback on the kernel-invariant shape, the greenfield flow, and any way the cost story breaks.

Labels: enhancement (New feature or request), rfc (Architecture proposal / request for comments)