RFC: Kernel-level red-green — every `edit_file` requires a paired test event

## Status
RFC is in a 48h final comment period (FCP). Spike validation passed all four feasibility checks ([summary](#issuecomment-4358477014), artifacts in `benchmarks/spike-tdd-kernel/`). The RFC body below has been updated with the spike-derived refinements.

## Motivation

Every coding agent on the market (Claude Code, Cursor, Aider, Codex, Continue) does TDD as a *prompt* — "please write the test first". It's a suggestion. Models drop it under pressure, frequently on the hard tasks where it matters most.

Reasonix has a structural advantage no one else has: an append-only event log (`events.jsonl`) and a tool dispatcher that already gates on hooks (`src/hooks.ts`). Red→green can be a **kernel invariant**, not a prompt and not a system message: the dispatcher refuses `edit_file` unless the log contains a failing test event paired with the edit, and the edit only commits once that test flips `fail` → `pass`.

This is the only place in the agent stack where TDD can be *enforced* rather than *encouraged*. As a side effect, every accepted edit becomes a (red test, edit, green test) triple in `events.jsonl` — a deterministic, labelled corpus for future evals.

## Proposal

### New event types

```ts
type TestRunEvent = {
  type: 'test_run';
  test_id: string;            // <rel-path>::<fullName>  OR  user annotation slug
  test_id_source: 'native' | 'annotation';
  status: 'pass' | 'fail';
  command: string;            // the test command actually run
  duration_ms: number;
  ts: number;
};

type EditClaimEvent = {
  type: 'edit_claim';
  test_id: string;            // "this edit advances this red test"
  edit_target: string;
  ts: number;
};
```

### `test_id` resolution

Default `test_id = <rel-path>::<fullName>` from vitest's `--reporter=json` output. If the test source has a `// @reasonix-test-id: <slug>` comment within 3 lines above the matched `it(`/`test(`, the slug overrides the default and `test_id_source = 'annotation'`. Brownfield adoption is zero-churn (existing test files use the default), and users who opt in to annotations get IDs that survive renames.
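The resolution rule above can be sketched as a small pure function. This is an illustration only: `resolveTestId`, its parameters, and the regular expression are hypothetical names, and it assumes the caller has already located the line index of the matched `it(`/`test(` call.

```ts
type TestIdSource = 'native' | 'annotation';

interface ResolvedTestId {
  test_id: string;
  test_id_source: TestIdSource;
}

// Matches `// @reasonix-test-id: <slug>` and captures the slug.
const ANNOTATION = /\/\/\s*@reasonix-test-id:\s*(\S+)/;

/**
 * Hypothetical helper: resolve a test_id from the reporter's path and full name,
 * letting an annotation within 3 lines above the test call override the default.
 * `testLineIndex` is the 0-based index of the matched `it(` / `test(` line.
 */
function resolveTestId(
  relPath: string,
  fullName: string,
  sourceLines: readonly string[],
  testLineIndex: number,
): ResolvedTestId {
  const start = Math.max(0, testLineIndex - 3);
  // Scan upward so the annotation closest to the test call wins.
  for (let i = testLineIndex - 1; i >= start; i--) {
    const slug = ANNOTATION.exec(sourceLines[i] ?? '')?.[1];
    if (slug) {
      return { test_id: slug, test_id_source: 'annotation' };
    }
  }
  return { test_id: `${relPath}::${fullName}`, test_id_source: 'native' };
}
```

An annotation four or more lines above the call is ignored under this reading, so the default `<rel-path>::<fullName>` applies.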

Failure-mode comparison vs content-hash and annotation-only schemes: `benchmarks/spike-tdd-kernel/test-id-spec.md`.

### Kernel invariant (`src/loop.ts` + `src/tools.ts`)

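`edit_file` dispatch is gated on **all** of the following: two preconditions (1-2) checked before the edit, and one post-edit check (3) that decides whether the edit commits: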

1. Most recent `test_run` for `test_id` is `fail` (red exists).
2. Model emitted a matching `edit_claim` after that red.
3. Post-edit, the dispatcher appends `test_id`(s) to a per-turn coalescing buffer; the buffer flushes once at end-of-turn as a single `vitest -t a -t b -t c` invocation, and the resulting `test_run` events are appended. If all green, edits commit to the in-memory edit set. If any red, the offending edits are reverted and a `repair` event fires — existing storm-breaker handles retry. (Per-edit invocations are infeasible: spike Exp 4 measured ~1.9s vitest framework boot per call, so batching is mandatory.)
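The two pre-dispatch conditions (1 and 2) can be sketched as a pure function over the event log. This is a minimal sketch, not the actual dispatcher: `checkEditGate`, `LogEvent`, and `GateResult` are hypothetical names, "after" is read as "later in the append-only log", and a claim is assumed to match when both `test_id` and `edit_target` agree.

```ts
type TestRunEvent = {
  type: 'test_run';
  test_id: string;
  test_id_source: 'native' | 'annotation';
  status: 'pass' | 'fail';
  command: string;
  duration_ms: number;
  ts: number;
};

type EditClaimEvent = {
  type: 'edit_claim';
  test_id: string;
  edit_target: string;
  ts: number;
};

type LogEvent = TestRunEvent | EditClaimEvent;
type GateResult = { allowed: true } | { allowed: false; reason: string };

function checkEditGate(
  log: readonly LogEvent[],
  testId: string,
  editTarget: string,
): GateResult {
  // Condition 1: the most recent test_run for this test_id must be red.
  let runIndex = -1;
  let lastStatus: 'pass' | 'fail' | undefined;
  for (let i = log.length - 1; i >= 0; i--) {
    const e = log[i];
    if (e && e.type === 'test_run' && e.test_id === testId) {
      runIndex = i;
      lastStatus = e.status;
      break;
    }
  }
  if (lastStatus === undefined) {
    return { allowed: false, reason: `no test_run recorded for ${testId}` };
  }
  if (lastStatus !== 'fail') {
    return { allowed: false, reason: `latest test_run for ${testId} is not red` };
  }

  // Condition 2: a matching edit_claim must appear after that red run.
  const claimed = log
    .slice(runIndex + 1)
    .some((e) => e.type === 'edit_claim' && e.test_id === testId && e.edit_target === editTarget);
  if (!claimed) {
    return { allowed: false, reason: `no edit_claim for ${testId} after the red run` };
  }
  return { allowed: true };
}
```

Condition 3 (the batched post-edit run, commit, or revert) is deliberately left out here, since it runs at end-of-turn rather than at dispatch time.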

### `/refactor` bypass

Pure restructure / docs / config / dep bumps don't have a meaningful failing test. `/refactor` flips a session flag: edits permitted, but `npm run verify` (or `reasonix.config.ts`-configured equivalent) must pass before exit. Plan mode is the precedent.
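One way to picture the session flag is a tiny state holder. This is a sketch under assumptions: `RefactorSession` and its method names are hypothetical, the verify command is injected as a callback rather than spawned, and "exit" is read as leaving refactor mode.

```ts
class RefactorSession {
  private active = false;

  // `/refactor` flips the flag on.
  enter(): void {
    this.active = true;
  }

  // While the flag is set, edits are permitted without a paired red test.
  editsPermitted(): boolean {
    return this.active;
  }

  /**
   * Attempt to leave refactor mode. `runVerify` stands in for `npm run verify`
   * (or the configured equivalent) and returns true when it passes.
   * Returns true if the session exited, false if verification blocked the exit.
   */
  exit(runVerify: () => boolean): boolean {
    if (!this.active) return true;
    if (!runVerify()) return false; // stay in refactor mode until verify passes
    this.active = false;
    return true;
  }
}
```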

### Plan integration

`submit_plan` step schema gains optional `test_id`. Steps without it can only run inside `/refactor`. TUI shows red/green dots per step.
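A sketch of how the extended step schema and its runnability rule might look. The `PlanStep` shape beyond `test_id` is an assumption: `title` is a hypothetical field, and `test_file_path` plus the `doctor` warning follow the resolution of open question 1 in this RFC.

```ts
interface PlanStep {
  title: string;            // hypothetical field, for illustration only
  test_id?: string;         // optional: the red test this step advances
  test_file_path?: string;  // expected alongside test_id (open question 1)
}

// Steps without a test_id can only run inside /refactor.
function canRunStep(step: PlanStep, refactorMode: boolean): boolean {
  return step.test_id !== undefined || refactorMode;
}

// Mirrors the proposed `reasonix doctor` warning: test_id without test_file_path.
function doctorWarnings(steps: readonly PlanStep[]): string[] {
  return steps
    .filter((s) => s.test_id !== undefined && s.test_file_path === undefined)
    .map((s) => `step "${s.title}" has test_id but no test_file_path`);
}
```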

## Cost analysis — does this stay cheap?

The whole pitch is that this is cheap, so this is the load-bearing question.

- **Tests are short and stable.** A test file changes far less than impl. Once the red `test_run` is in the prefix, green iterations cache-hit on it.
- **Red runs once.** Dispatcher caches by `(test_id, command)`. Only the post-edit candidate run is new.
- **Failure bounded.** Two failed greens → revert + `repair` event → existing harness engages. No infinite loop.
- **Augmentation is cache-positive, not negative.** Spike Exp 1 measured a +10.1pt cache-hit improvement (83.5% → 93.6%) with the augmented `edit_file` tool_result vs baseline, because the new ~80-token footer sits in the prefix (cached on subsequent turns), not the tail (always missed). See `benchmarks/spike-tdd-kernel/cost-results.md`.
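The "red runs once" bullet amounts to memoising the pre-edit run on the `(test_id, command)` pair. A minimal sketch, assuming a synchronous injected runner and no invalidation; `TestRunCache` and `getOrRun` are hypothetical names, not existing dispatcher APIs.

```ts
type TestStatus = 'pass' | 'fail';

class TestRunCache {
  private readonly results = new Map<string, TestStatus>();

  // JSON-encode the pair so no separator choice can make two distinct keys collide.
  private static key(testId: string, command: string): string {
    return JSON.stringify([testId, command]);
  }

  /** Return the cached status for (testId, command), invoking `run` only on a miss. */
  getOrRun(testId: string, command: string, run: () => TestStatus): TestStatus {
    const k = TestRunCache.key(testId, command);
    const cached = this.results.get(k);
    if (cached !== undefined) return cached;
    const status = run();
    this.results.set(k, status);
    return status;
  }
}
```

The post-edit candidate run would bypass this cache, since its result is exactly what the invariant needs to observe fresh.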

Worst case: +1 vitest invocation per edit-batch (typically per turn) — spike Exp 4 median 1.9s framework boot + actual test runtime. Per turn the model-visible additions are `edit_claim` (~30 tok per claim) and the appended `test_run` footer (~80 tok per batch). Versus branching (N× tokens), negligible.
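The worst-case arithmetic can be made concrete with the figures quoted above. The function names are illustrative only, and the constants are the approximate spike numbers (~30 tok per claim, ~80 tok per batch footer, 1.9s median framework boot), not guarantees.

```ts
const EDIT_CLAIM_TOKENS = 30;      // ~30 tok per edit_claim
const TEST_RUN_FOOTER_TOKENS = 80; // ~80 tok per batched test_run footer
const VITEST_BOOT_MS = 1900;       // spike Exp 4 median framework boot

// Model-visible token overhead for one turn.
function tokenOverheadPerTurn(claims: number, batches: number = 1): number {
  return claims * EDIT_CLAIM_TOKENS + batches * TEST_RUN_FOOTER_TOKENS;
}

// Framework boot time only; actual test runtime comes on top.
function bootCostMs(invocations: number): number {
  return invocations * VITEST_BOOT_MS;
}

// A turn with 3 edits, flushed as a single batch:
const tokens = tokenOverheadPerTurn(3); // 3 * 30 + 80 = 170 tokens
const batchedBoot = bootCostMs(1);      // 1900 ms
const perEditBoot = bootCostMs(3);      // 5700 ms if each edit booted vitest separately
```

Under these numbers, batching saves 3800 ms of boot time on a three-edit turn while adding 170 model-visible tokens.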

## Open questions

1. ~~**Greenfield test discovery.**~~ **Resolved by spike Exp 3** (10/10 prompts pass-all on the substantive checks). Plan steps with `test_id` must also carry `test_file_path`; the model authors the failing test in step 1 with a sharp system message — no structured `author_failing_test` tool needed. `reasonix doctor` warns when a plan step has `test_id` but no `test_file_path`.
2. ~~**Slow test suites.**~~ **Narrowed by spike Exp 4.** Resolution: per-test-id selectors via `vitest -t`, batched per turn (median 1.9s, p95 5.0s). `test_command_for(test_id)` in `reasonix.config.ts` stays as the knob for non-vitest runners.
3. **Mutation sweeps.** Codemod across 50 files probably falls under `/refactor`, but worth a separate carve-out.
4. **Untested codebases.** A user with zero tests is locked out. `reasonix doctor` could detect and default first session to `/refactor`.
5. **Strict by default, or `--strict` opt-in?** Strongly leaning strict-by-default. Flag if you disagree.

## Out of scope

- Generating tests for the user — that's the model's job.
- Branch-and-select at the test level (token-expensive, contradicts cheap pitch).
- Coverage gating — orthogonal.

## Prior art

- Cursor / Claude Code: prompt-level, not enforced.
- Aider: `--auto-test` runs tests but doesn't gate dispatch on red.
- No known agent enforces TDD at the dispatcher.

## Ask

Looking for feedback on the kernel-invariant shape, the greenfield flow, and any way the cost story breaks.



