
feat(http): record/replay for net.fetch (closes #7) #15

Merged
escapeboy merged 2 commits into master from feat/0.5-s7-net-record-replay on Apr 25, 2026

Conversation

@escapeboy
Owner

Sprint 0.5-S7 — closes #7 (FleetQ P2, fifth in a row)

Boruna scripts are deterministic by design; external HTTP is not. This bridges the gap so agent CI loops become genuinely reproducible. Distinctive selling point per FleetQ: "one of the few runtimes where record/replay would be ergonomic." Pulled forward from 0.5.0 (same pattern as #6, #8).

Surface

```bash
# Record once against the real upstream:
boruna run app.ax --policy allow-all --live \
  --record-net-to fixtures/run-001.tape.json

# Replay forever, no network access:
boruna run app.ax --policy allow-all \
  --replay-net-from fixtures/run-001.tape.json
```

Tape file:
```json
{
  "format_version": 1,
  "transactions": [
    { "method": "GET", "url": "https://api.example.com/users/42",
      "request_body": null, "response_body": "{\"id\":42}" },
    { "method": "POST", "url": "https://api.example.com/events",
      "request_body": "{\"event\":\"click\"}",
      "response_body": "{\"ok\":true}" }
  ]
}
```

Match strategy (locked, in design doc)

  • Strict ordered, key on `(method, url, request_body)`
  • Headers EXCLUDED from match key — auth tokens change between sessions
  • Mismatch → typed error: `position N: method differs / url differs / request_body differs`
  • Exhaustion → typed error: `tape exhausted (N consumed, asked for more at position N)`
  • Under-consumption → silently OK (trailing tape entries unused)
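The match strategy above can be sketched in plain Rust. This is a minimal illustration of the strict ordered (method, url, request_body) key with typed mismatch/exhaustion errors; the type names mirror the PR's vocabulary but the exact signatures are assumptions, not the real `ReplayingHttpHandler` API:

```rust
// Sketch of strict ordered replay matching. Headers are deliberately
// absent from the match key, per the locked design.
#[derive(Debug, Clone, PartialEq)]
struct NetTransaction {
    method: String,
    url: String,
    request_body: Option<String>,
    response_body: String,
}

#[derive(Debug, PartialEq)]
enum ReplayError {
    // Which field differed, and at which tape position.
    Mismatch { position: usize, field: &'static str },
    // The script asked for more transactions than the tape holds.
    Exhausted { consumed: usize },
}

struct ReplayingTape {
    transactions: Vec<NetTransaction>,
    cursor: usize,
}

impl ReplayingTape {
    fn next_response(
        &mut self,
        method: &str,
        url: &str,
        request_body: Option<&str>,
    ) -> Result<String, ReplayError> {
        let tx = self
            .transactions
            .get(self.cursor)
            .ok_or(ReplayError::Exhausted { consumed: self.cursor })?;
        if tx.method != method {
            return Err(ReplayError::Mismatch { position: self.cursor, field: "method" });
        }
        if tx.url != url {
            return Err(ReplayError::Mismatch { position: self.cursor, field: "url" });
        }
        if tx.request_body.as_deref() != request_body {
            return Err(ReplayError::Mismatch { position: self.cursor, field: "request_body" });
        }
        self.cursor += 1;
        Ok(tx.response_body.clone())
    }
}
```

Under-consumption falls out for free: the run simply ends with `cursor` short of `transactions.len()`, and no error is raised.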

Tests

  • 18 new tests in `boruna-vm` (round-trip, in-order match, all three mismatch flavors, exhaustion, under-consumption, default-method-GET, case normalization, mock pass-through, save-on-drop, drop-without-path, parser-agreement, back-to-back-identical-calls for polling scripts, format-version compatibility, save-then-load round trip)
  • All 124+ existing VM tests pass under `--features http`
  • All 591+ existing workspace tests pass
  • `cargo clippy --workspace -- -D warnings` clean (with and without http)
  • `cargo fmt --all -- --check` clean

Review

`ce-correctness-reviewer` surfaced 1 HIGH + 3 MEDIUM findings. All addressed before commit:

| # | Finding | Fix |
|---|---------|-----|
| 1 (HIGH) | Save-on-drop swallows tape errors → CI fixture corruption | CLI write-probe at startup; Drop is the safety net only |
| 2 (MED) | Tape lost on panic mid-recording | Documented as v1 limitation; streaming tape is the future fix |
| 3 (MED) | Missing test for back-to-back identical calls (polling) | Added |
| 4 (MED) | `describe_net_fetch_request` could drift from `HttpHandler` parser | Extracted shared `parse_net_fetch_args` in `http_handler.rs`; both call it. Regression test asserts agreement. |
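The fix for finding 4 is the shared-parser pattern: one parse function, two callers, so the real path and the instrumentation mirror cannot drift. The sketch below illustrates the shape only; the argument layout (positional url, optional method defaulting to GET with case normalization, optional body) is an assumption for illustration, not the real `parse_net_fetch_args` signature:

```rust
// Hypothetical fetch-request shape; the real one lives in http_handler.rs.
#[derive(Debug, PartialEq)]
struct FetchRequest {
    method: String,
    url: String,
    body: Option<String>,
}

// The single shared parser. Default method is GET; method is
// case-normalized so "get" and "GET" produce the same match key.
fn parse_net_fetch_args(args: &[&str]) -> Result<FetchRequest, String> {
    let url = args.first().ok_or("net.fetch: missing url")?.to_string();
    let method = args
        .get(1)
        .map_or("GET".to_string(), |m| m.to_ascii_uppercase());
    let body = args.get(2).map(|b| b.to_string());
    Ok(FetchRequest { method, url, body })
}

// The recording layer's describe helper calls the SAME parser as the real
// handler would, so a comment saying "keep in lock-step" is unnecessary.
fn describe_net_fetch_request(args: &[&str]) -> Result<String, String> {
    let req = parse_net_fetch_args(args)?;
    Ok(format!("{} {}", req.method, req.url))
}
```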

Documented limitations (per review)

  • Request headers NOT in match key (auth tokens change between sessions)
  • Response status/headers NOT recorded (handler returns body only today)
  • Failed transactions NOT taped in v1
  • Non-UTF-8 response bodies inherit `HttpHandler`'s "not valid UTF-8" error
  • Tape file size unbounded (~2× pretty-print multiplier)
  • Panic during record may lose the tape (esp. `panic = "abort"`)

What's NOT in this PR (follow-ups)

  • `boruna workflow run --record-net-to / --replay-net-from` — same machinery; defer to follow-up sprint when there's a workflow-level ask
  • `boruna_run` MCP parameter form — defer until asked
  • Recording response status + headers — requires `HttpHandler::handle_net_fetch` return-type change; defer
  • Recording failed transactions — requires teaching the tape format about errors; defer
  • Header-aware matching — opt-in `match_headers: [...]` config; defer
  • `db.query` / `llm.call` record/replay — each is its own sprint; this PR establishes the pattern

Closes

  • Closes #7 (FleetQ P2: record/replay for net.fetch)
FleetQ status after this PR

5 of 9 P1/P2 asks closed (#3, #5, #6, #7, #8). Only #9 (per-call OpenTelemetry observability) remains as the last small-sprint pick before the big-sprint pivot to `0.3-S2` (persistent state).

🤖 Generated with Claude Code

escapeboy and others added 2 commits April 25, 2026 19:03
Sprint 0.5-S7, pulled forward from 0.5.0. Fifth FleetQ ask shipped in a
row (after #3, #6, #5, #8). Boruna scripts are deterministic by design;
external HTTP is not. This bridges the gap so agent CI loops become
genuinely reproducible.

## Surface

```bash
# Record once against the real upstream:
boruna run app.ax --policy allow-all --live \
  --record-net-to fixtures/run-001.tape.json

# Replay forever, no network access:
boruna run app.ax --policy allow-all \
  --replay-net-from fixtures/run-001.tape.json
```

## What's in this PR

- New module `crates/llmvm/src/net_record_replay.rs` (feature `http`):
  - `NetTransaction { method, url, request_body, response_body }`
  - `NetTape { format_version: 1, transactions: [...] }` with
    save/load and version compatibility check
  - `RecordingHttpHandler` wraps `HttpHandler`; records on each call;
    `with_save_path` constructor enables save-on-drop
  - `ReplayingHttpHandler` serves from a loaded tape; strict ordered
    match on (method, url, request_body); typed errors for mismatch /
    exhaustion; under-consumption is silently OK
- New CLI flags `--record-net-to <FILE>` and `--replay-net-from <FILE>`
  on `boruna run`. Mutually exclusive (clap `conflicts_with`).
  Record requires `--live`; replay overrides `--live`.
- New shared parser `parse_net_fetch_args` in `http_handler.rs` used
  by BOTH the real handler and the recording layer — eliminates the
  silent-drift risk the reviewer flagged on the duplicated parser.
- CLI write-probe: writes an empty tape to the target path BEFORE the
  run starts, so disk errors surface in process exit code instead of
  a stale fixture from a pipeline like `record && verify`.
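The write-probe described above is a small amount of code with outsized value: any disk error surfaces as a nonzero exit before the run starts. A minimal std-only sketch (the empty-tape bytes and function name are assumptions mirroring the documented tape format, not the real CLI code):

```rust
use std::fs;
use std::io;
use std::path::Path;

// An empty but valid tape, matching the documented format_version-1 shape.
const EMPTY_TAPE: &str = "{\n  \"format_version\": 1,\n  \"transactions\": []\n}\n";

// Pre-flight probe: attempt the write BEFORE the run starts. An unwritable
// directory, bad permissions, or a full disk fails here, in the exit code,
// instead of leaving a stale fixture for a `record && verify` pipeline.
fn probe_record_path(path: &Path) -> io::Result<()> {
    fs::write(path, EMPTY_TAPE)
}
```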

## Match strategy (locked, in design doc)

- Strict ordered, key on (method, url, request_body)
- Headers EXCLUDED from match key (auth tokens change between sessions)
- Mismatch returns typed error: `position N: method differs / url
  differs / request_body differs`
- Exhaustion: `tape exhausted (N transactions consumed, ... at position N)`
- Under-consumption: silently OK (trailing tape entries unused)

## Tape format

```json
{ "format_version": 1, "transactions": [ ... ] }
```

Format version is bumped on breaking shape changes; additive ones keep
the version. Loading a tape with an unsupported version returns a typed
error.
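The version-compatibility rule above amounts to a single guard at load time. A sketch, assuming a `u64` version field and a hypothetical error type (the real loader's names may differ):

```rust
// Only version-1 tapes load; anything newer (a breaking shape change)
// returns a typed error rather than misinterpreting the file.
const SUPPORTED_FORMAT_VERSION: u64 = 1;

#[derive(Debug, PartialEq)]
enum TapeLoadError {
    UnsupportedVersion { found: u64, supported: u64 },
}

fn check_format_version(found: u64) -> Result<(), TapeLoadError> {
    if found != SUPPORTED_FORMAT_VERSION {
        return Err(TapeLoadError::UnsupportedVersion {
            found,
            supported: SUPPORTED_FORMAT_VERSION,
        });
    }
    Ok(())
}
```

Additive fields keep the version, so this check stays stable until the transaction shape itself breaks.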

## Tests

- 18 new tests in `boruna-vm` (tape round-trip, in-order match, all
  three mismatch flavors, exhaustion, under-consumption, default-method-
  GET, case normalization, mock pass-through, save-on-drop, drop-without-
  path, parser-agreement-with-http_handler, back-to-back-identical-calls,
  bad-format-version, save-then-load round trip, RecordingHttpHandler
  empty/len helpers).
- All 124+ existing VM tests pass under `--features http`.
- All 591+ existing workspace tests pass.
- `cargo clippy --workspace -- -D warnings` clean (with and without http).
- `cargo fmt --all -- --check` clean.

## Review

`ce-correctness-reviewer` surfaced 1 HIGH + 3 MEDIUM findings. All
addressed before commit:

1. (HIGH) Save-on-drop swallows tape errors → CI fixture corruption.
   Fixed by CLI write-probe at startup; Drop becomes the safety net.
2. (MED) Tape lost on panic mid-recording. Documented as v1 limitation;
   streaming append-only tape is the future fix.
3. (MED) Missing test for back-to-back identical calls. Added.
4. (MED) `describe_net_fetch_request` duplication may drift from
   `HttpHandler::handle_net_fetch`. Extracted shared parser
   `parse_net_fetch_args` in http_handler.rs; both call sites use it.
   Added regression test asserting parser agreement.

## Documented limitations (per review)

- Request headers NOT in match key (auth tokens change between sessions)
- Response status/headers NOT recorded (handler returns body only today)
- Failed transactions NOT taped in v1 (re-recording is user's job)
- Non-UTF-8 response bodies inherit `HttpHandler`'s "not valid UTF-8"
  error — recording cannot capture binary payloads
- Tape file size unbounded (a 100k-call agent → multi-GB JSON, ~2×
  pretty-print multiplier)
- Panic during record may lose tape if Drop doesn't run (esp. under
  `panic = "abort"`)

## Closes

- Closes #7 (FleetQ P2: record/replay for net.fetch)

## FleetQ status after this PR

5 of 9 P1/P2 asks closed (#3, #5, #6, #7, #8). Only #9 (per-call
OpenTelemetry observability) remains as the last small-sprint pick
before the big-sprint pivot to 0.3-S2 (persistent state).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captures the 4 review findings (1 HIGH + 3 MEDIUM) and how each was
addressed. Notable: the Drop-based-save-with-eprintln correctness gap
(silent stale fixtures in CI && chains) and the paired-parser
extraction (eliminating a silent-drift risk that comments couldn't
enforce).

Establishes two new project conventions:
- For Drop-based side effects, pair with a pre-flight probe at the
  CLI integration point. Drop is ergonomic; pre-flight is exit-code-honest.
- For paired parsers (real path + instrumentation mirror path),
  extract a shared function. Comments saying "keep in lock-step"
  aren't enforceable; a single function IS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@escapeboy escapeboy merged commit e08b9cf into master Apr 25, 2026
2 of 3 checks passed
@escapeboy escapeboy deleted the feat/0.5-s7-net-record-replay branch April 25, 2026 17:25


Development

Successfully merging this pull request may close these issues.

[P2] Record/replay for net.fetch