feat(e2e): MSW cassette layer for hermetic e2e tests#1920

Draft
Stephen Belanger (Qard) wants to merge 5 commits into main from t3code/e2e-vcr-capture
@Qard Stephen Belanger (Qard) commented Apr 29, 2026

Summary

Adds an inbound provider HTTP capture/replay layer (cassettes) to the e2e test suite so hermetic CI runs replay recorded traffic instead of hitting live provider APIs. Built on MSW (already in the workspace as a dev dep for integrations/langchain-js and integrations/otel-js) plus ~600 LoC of cassette-format/matcher/recorder glue.

E2E scenarios previously hit live provider APIs on every CI run. Flakiness sources: rate limits, transient 5xx, model-output drift breaking exact-string snapshot fields. With this layer:

  • Replay mode (CI default) hits no network. Subprocess MSW interceptor returns canned responses from __cassettes__/<variantKey>.json.
  • Recording is local-only via BRAINTRUST_E2E_CASSETTE_MODE=record-missing. CI never records.
  • The existing outbound mock-braintrust-server.ts and __snapshots__/ are untouched. Cassettes are the inbound mirror: provider→SDK, where snapshots are SDK→Braintrust.

Status: ~506 tests passing in hermetic mode across 25 scenarios, 2 of 3 consecutive runs deterministic (the third had an unrelated turbopack-auto-instrumentation Next.js compile timeout flake — not introduced by this PR).

Architecture

| Layer | Direction | Captured by | Stored in |
|---|---|---|---|
| Existing | SDK → Braintrust | mock-braintrust-server.ts (parent process HTTP server) | __snapshots__/*.json |
| New | Provider → SDK | cassette/preload.mjs boots an MSW setupServer() in subprocess | __cassettes__/<variantKey>.json |
  • e2e/helpers/cassette/preload.mjs — loaded into each scenario subprocess via node --import=<preload>. Boots MSW synchronously and intercepts provider HTTP traffic.
  • A per-key call counter selects successive entries when the same request is retried; if only one entry exists for a key, that entry is reused on every match.
  • Recorder server runs in the parent vitest process; subprocess POSTs captured wire to it; parent merges entries across variant runs and writes once at scenario end.
  • cassetteTagsFor(import.meta.url, variantKey) auto-tags scenarios with hermetic based on cassette file presence — opt-in is by committing the cassette.
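The per-key call counter described above could look roughly like the following. This is a minimal sketch, not the code in preload.mjs: the `CassetteEntry` shape and the `ReplayMatcher` class are hypothetical, and treating a call past the end of a recorded sequence as a miss is an assumption.

```typescript
// Hypothetical entry shape; the real cassette format may differ.
interface CassetteEntry {
  key: string; // match key, e.g. "POST https://api.example.test/v1/chat"
  status: number;
  body: string;
}

class ReplayMatcher {
  private readonly byKey = new Map<string, CassetteEntry[]>();
  private readonly callCounts = new Map<string, number>();

  constructor(entries: CassetteEntry[]) {
    // Group entries by match key, preserving recording order.
    for (const entry of entries) {
      const bucket = this.byKey.get(entry.key) ?? [];
      bucket.push(entry);
      this.byKey.set(entry.key, bucket);
    }
  }

  // Returns the entry for the next call with this key, or undefined on a miss.
  next(key: string): CassetteEntry | undefined {
    const bucket = this.byKey.get(key);
    if (!bucket || bucket.length === 0) return undefined;
    // A lone entry is reused on every match.
    if (bucket.length === 1) return bucket[0];
    const n = this.callCounts.get(key) ?? 0;
    this.callCounts.set(key, n + 1);
    // Assumption: running past the recorded sequence counts as a miss.
    return bucket[n];
  }
}
```

In replay mode, an `undefined` result here would be the point where a miss error is raised.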

Cassette modes (BRAINTRUST_E2E_CASSETTE_MODE)

  • replay (default in CI): match or throw CassetteMissError.
  • record: overwrite cassette fresh.
  • record-missing: match if possible, else live + record. Standard re-record loop.
  • passthrough: bypass cassettes entirely (local debugging).
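As an illustration of how the four modes differ, the sketch below maps a mode plus "did the cassette match?" to an action. The function names, the `Action` labels, and the choice to fall back to replay when the variable is unset are assumptions made for this example, not the harness's actual code.

```typescript
type CassetteMode = "replay" | "record" | "record-missing" | "passthrough";
type Action = "serve-recorded" | "call-live" | "call-live-and-record" | "throw-miss";

const MODES: readonly CassetteMode[] = ["replay", "record", "record-missing", "passthrough"];

// Hypothetical parser for the BRAINTRUST_E2E_CASSETTE_MODE value.
// Assumption: an unset or empty value means replay.
function resolveMode(raw: string | undefined): CassetteMode {
  if (raw === undefined || raw === "") return "replay";
  const mode = MODES.find((m) => m === raw);
  if (mode === undefined) throw new Error(`Unknown cassette mode: ${raw}`);
  return mode;
}

// What to do with one intercepted request.
function decide(mode: CassetteMode, hasMatch: boolean): Action {
  switch (mode) {
    case "replay":
      return hasMatch ? "serve-recorded" : "throw-miss";
    case "record":
      return "call-live-and-record"; // cassette is rebuilt from scratch
    case "record-missing":
      return hasMatch ? "serve-recorded" : "call-live-and-record";
    case "passthrough":
      return "call-live"; // cassettes are ignored entirely
  }
}
```

A caller would typically pass the environment value in, for example `resolveMode(process.env.BRAINTRUST_E2E_CASSETTE_MODE)`.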

Recording safeguards

  • Skip-list filters transient/auth failures so they don't poison the cassette: 400/401/403/429/5xx are not persisted. Caught one real bug here — google-genai cassettes had been recorded with a bad key, capturing 400 API_KEY_INVALID responses. The 400 case was added after.
  • Automatic retry-with-backoff for 429/5xx during recording (2 attempts, 5s/10s, abort-aware to respect SDK timeouts).
  • Volatile-header scrub strips authorization, x-api-key, api-key, x-goog-api-key, cohere-api-key, cookies, request IDs, rate-limit windows, content-encoding, etc. before persisting. (Caught a near-miss on this PR — the initial commit leaked x-api-key for Anthropic; volatile-header set has been broadened and a scrub run removed leaked values.)
  • Headers forwarding fix: new Headers(request.headers) silently drops most headers when the source is an MSW-intercepted request (Authorization included). The forwarder copies via forEach instead. This one bug was responsible for the bulk of the recording failures during initial migration (every Mistral 401, plenty of others).
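Three of the safeguards above can be sketched in a few lines. The helper names are hypothetical, the header set is only the subset named in this description (with "cookies" read as `cookie` and `set-cookie`, which is an assumption), and the real implementation may be structured differently.

```typescript
// Skip-list: 400/401/403/429 and any 5xx are never written to a cassette.
function shouldPersist(status: number): boolean {
  if (status === 400 || status === 401 || status === 403 || status === 429) return false;
  return status < 500 || status > 599;
}

// Subset of the volatile headers listed above; the real set is longer.
const VOLATILE_HEADERS = new Set([
  "authorization",
  "x-api-key",
  "api-key",
  "x-goog-api-key",
  "cohere-api-key",
  "cookie",
  "set-cookie",
  "content-encoding",
]);

// Returns only the headers that are safe to persist.
function scrubHeaders(headers: Headers): Record<string, string> {
  const kept: Record<string, string> = {};
  headers.forEach((value, name) => {
    if (!VOLATILE_HEADERS.has(name.toLowerCase())) kept[name] = value;
  });
  return kept;
}

// Copies entry by entry via forEach instead of `new Headers(source)`.
function copyHeaders(source: Headers): Headers {
  const copy = new Headers();
  source.forEach((value, name) => {
    copy.append(name, value);
  });
  return copy;
}
```

Scrubbing is applied only to what gets persisted; the live request forwarded during recording still needs the credentials, which is why `copyHeaders` keeps everything.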

Scenarios with complete cassettes (hermetic green)

  • anthropic-instrumentation (6 variants)
  • openai-instrumentation (3 variants)
  • claude-agent-sdk-instrumentation
  • openrouter-instrumentation (2 variants)
  • ai-sdk-instrumentation (4 variants)
  • ai-sdk-otel-export (2 variants)
  • groq-instrumentation (2 variants)
  • huggingface-instrumentation (3 variants)
  • openrouter-agent-instrumentation
  • wrap-langchain-js-traces
  • cassette-replay (meta-scenario validating record→replay loop end-to-end)
  • cohere-instrumentation v7-14-0 (1 of 5 variants — see below)

Scenarios still missing cassettes (auto-skipped in hermetic mode)

These auto-skip cleanly because cassetteTagsFor only applies the hermetic tag when the cassette file is present. CI does not fail on them today; they need a follow-up record run with working credentials.
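The presence-based tagging can be pictured as below. This is a sketch under assumptions: `hermeticTagsFor` is a hypothetical stand-in for cassetteTagsFor, the `__cassettes__` directory is assumed to sit beside the scenario file, and the injectable `fileExists` parameter exists only to make the example easy to exercise.

```typescript
import { existsSync } from "node:fs";
import { dirname, join } from "node:path";
import { fileURLToPath } from "node:url";

// Returns ["hermetic"] only when the variant's cassette file is present.
function hermeticTagsFor(
  scenarioUrl: string, // typically import.meta.url of the scenario test file
  variantKey: string,
  fileExists: (path: string) => boolean = existsSync,
): string[] {
  const scenarioDir = dirname(fileURLToPath(scenarioUrl));
  // Assumption: cassettes live in __cassettes__/ next to the scenario.
  const cassettePath = join(scenarioDir, "__cassettes__", `${variantKey}.json`);
  return fileExists(cassettePath) ? ["hermetic"] : [];
}
```

With no cassette on disk the tag list is empty, so a run filtered to the hermetic tag never selects the scenario and nothing fails.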

mistral-instrumentation — needs re-record after rebase

  • This branch was originally based on PR #1891 (Add OpenAI Agents auto-instrumentation with real e2e coverage). After being rebased onto main, the existing mistral cassettes no longer match: main extended the mistral scenario with new thinking/reasoning model coverage (NATIVE_REASONING_MODEL, ADJUSTABLE_REASONING_MODEL) that wasn't in the older scenario shape the cassettes were recorded against.
  • Re-recording also hit aggressive provider rate limits on the new reasoning models — the existing cassette layer's retry-with-backoff (5s/10s, 2 attempts) is not sufficient for that endpoint, so re-recording stalled mid-suite. Solving this likely requires either a longer per-call throttle (mirroring what cohere-instrumentation already does — COHERE_RECORD_THROTTLE_MS = 60_000) or running mistral variants serially with longer waits between calls.
  • Cassettes for the older scenario shape were dropped to keep the suite green; mistral now auto-skips in hermetic mode until the follow-up re-record lands.
  • Action needed: add a record-time throttle to mistral's scenario.impl.mjs and run BRAINTRUST_E2E_CASSETTE_MODE=record-missing pnpm --filter=@braintrust/js-e2e-tests vitest run scenarios/mistral-instrumentation.
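For reference, an abort-aware retry-with-backoff of the kind mentioned above might look like this. It is a sketch, not the recorder's code: the function names are hypothetical, and "2 attempts, 5s/10s" is interpreted here as up to two retries after the initial call.

```typescript
// Sleeps for `ms`, rejecting early if the signal aborts (e.g. an SDK timeout).
function abortableSleep(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    if (signal?.aborted) {
      reject(new Error("aborted"));
      return;
    }
    const onAbort = () => {
      clearTimeout(timer);
      reject(new Error("aborted"));
    };
    const timer = setTimeout(() => {
      signal?.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    signal?.addEventListener("abort", onAbort, { once: true });
  });
}

function isRetryable(status: number): boolean {
  return status === 429 || (status >= 500 && status <= 599);
}

// Calls `send`, retrying once per entry in `delaysMs` while the status is retryable.
async function sendWithBackoff<T extends { status: number }>(
  send: () => Promise<T>,
  delaysMs: readonly number[] = [5_000, 10_000],
  signal?: AbortSignal,
): Promise<T> {
  let response = await send();
  for (const delay of delaysMs) {
    if (!isRetryable(response.status)) break;
    await abortableSleep(delay, signal);
    response = await send();
  }
  return response;
}
```

Passing a longer `delaysMs` list is one way to experiment with the longer waits discussed for the reasoning-model endpoints.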

google-genai-instrumentation — Gemini quota exhausted

  • Quota on the recording key is fully consumed (RESOURCE_EXHAUSTED 429).
  • Initial cassettes had been recorded back when the key was invalid and captured 400 API_KEY_INVALID responses; those were detected and deleted in this PR. The skip-list now rejects 400 to prevent recurrence.
  • Action needed: rotate to a working Gemini key (or wait for quota reset) and run BRAINTRUST_E2E_CASSETTE_MODE=record-missing pnpm --filter=@braintrust/js-e2e-tests vitest run scenarios/google-genai-instrumentation.

cohere-instrumentation — per-MONTH quota exhausted (4 of 5 variants)

  • Cohere account is past the per-MONTH request limit on the chat models. Quote from API: "You are past the per-month request limit for this model, please wait and try again later." This is monthly, not daily — recovers on the next billing cycle.
  • The v7-14-0 variant is fully recorded (chat + chat-stream + embed + rerank) and replays green. The 4 remaining variants (v7-20-0, v7-21-0, v7 default, v8) auto-skip until re-recorded.
  • The scenario impl now includes a 60s throttle between calls during recording (COHERE_RECORD_THROTTLE_MS) to land each call in a fresh budget window once quota is restored — but the throttle can't help with monthly exhaustion.
  • Action needed: wait for the next billing cycle (or upgrade Cohere plan), then run the same record command for the cohere scenarios.
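A record-time throttle along the lines of COHERE_RECORD_THROTTLE_MS could be sketched as follows. `createThrottle` and its injectable clock and sleep parameters are hypothetical and exist to keep the example self-contained; the sketch assumes calls are made serially.

```typescript
const COHERE_RECORD_THROTTLE_MS = 60_000;

// Returns a wrapper that keeps at least `minIntervalMs` between the start of
// consecutive calls. Assumes serial use; concurrent callers are not queued.
function createThrottle(
  minIntervalMs: number,
  now: () => number = Date.now,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => {
      setTimeout(resolve, ms);
    }),
) {
  let lastCallAt: number | undefined;
  return async function throttled<T>(call: () => Promise<T>): Promise<T> {
    if (lastCallAt !== undefined) {
      const wait = lastCallAt + minIntervalMs - now();
      if (wait > 0) await sleep(wait);
    }
    lastCallAt = now();
    return call();
  };
}
```

A scenario would presumably wrap provider calls with `createThrottle(COHERE_RECORD_THROTTLE_MS)` only while recording, so replay runs stay fast.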

google-adk-instrumentation — model-behavior drift, unrelated to cassette layer

  • ADK does not have any __cassettes__/ files in this PR, so it auto-skips in hermetic mode. There is pre-existing snapshot drift unrelated to this PR which should be triaged independently.

Risks / things to watch

  • Cassette files are large and committed. claude-agent-sdk-instrumentation and ai-sdk-instrumentation have the largest cassettes (long transcripts). This is intentional — diff-ability matters for review, and we want byte-identical replay.
  • Re-record workflow is documented in e2e/README.md ("Cassettes" section) and .agents/skills/e2e-tests/SKILL.md.
  • Day-one scope excluded Deno scenarios, nextjs-instrumentation, turbopack-auto-instrumentation, and OTEL-only scenarios. Those need separate preload mechanisms; they continue running as before (or are already hermetic-ish via different machinery).

Test plan

  • pnpm --filter=@braintrust/js-e2e-tests exec vitest run --tags-filter=hermetic — green (506 passed, 396 skipped, 0 failed)
  • Same command run consecutively for determinism — green twice in a row (one intermediate run hit an unrelated turbopack flake)
  • pnpm run formatting clean
  • pnpm run lint — 0 errors
  • No leaked credentials in committed cassettes (x-api-key, api-key, x-goog-api-key, authorization, etc. all stripped)
  • CI hermetic suite green
  • Re-record blocked scenarios (mistral, google-genai, remaining cohere variants) once credentials/quota are available — follow-up PR

🤖 Generated with Claude Code

Stephen Belanger (Qard) and others added 2 commits May 1, 2026 14:23
Vendors the seinfeld VCR/cassette library into the monorepo under
dev-packages/seinfeld. The package wraps MSW to record and replay
HTTP traffic in tests — record mode hits real providers and writes
JSON cassette files; replay mode intercepts fetch and serves the
recorded responses deterministically.

Key features:
- Two-pipeline design: normalizers (matching-only) vs redactors (persistence)
- Built-in filter presets strip volatile headers/params before matching
- Paranoid redaction preset masks auth headers and credential-shaped body fields
- Vitest integration (setupCassettes) for per-test cassette lifecycle
- passthroughHosts option to exempt specific hosts from interception

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
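The two-pipeline idea in this commit message can be illustrated as below. None of these names come from seinfeld's API; `RecordedRequest`, `Transform`, and the two example transforms are hypothetical and only show why matching and persistence are kept separate.

```typescript
// Hypothetical request shape and transform type for illustration only.
interface RecordedRequest {
  method: string;
  url: string;
  headers: Record<string, string>;
  body: string;
}
type Transform = (req: RecordedRequest) => RecordedRequest;

const applyAll = (req: RecordedRequest, transforms: readonly Transform[]): RecordedRequest =>
  transforms.reduce((acc, transform) => transform(acc), req);

// Normalizer: affects matching only, so volatile values do not cause misses.
const dropRequestId: Transform = (req) => ({
  ...req,
  headers: Object.fromEntries(
    Object.entries(req.headers).filter(([name]) => name !== "x-request-id"),
  ),
});

// Redactor: affects what is written to disk, so secrets never reach the cassette.
const maskAuthorization: Transform = (req) => ({
  ...req,
  headers:
    "authorization" in req.headers
      ? { ...req.headers, authorization: "[REDACTED]" }
      : req.headers,
});

// The match key is computed from the normalized request.
function matchKey(req: RecordedRequest, normalizers: readonly Transform[]): string {
  const n = applyAll(req, normalizers);
  return JSON.stringify([n.method, n.url, n.headers, n.body]);
}

// The persisted form is computed from the redacted request.
function toPersisted(req: RecordedRequest, redactors: readonly Transform[]): RecordedRequest {
  return applyAll(req, redactors);
}
```

Keeping the pipelines apart means a normalizer can be loosened or tightened without changing what is stored, and a redactor can be added without invalidating existing match keys.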

Removes e2e/helpers/cassette/ and the parent-process recorder server.
Replaces them with @braintrust/seinfeld, the workspace package added in
the preceding commit.

Key changes:
- cassette-preload.mjs: ~80-line subprocess preload that boots seinfeld
  via createCassette(), replacing the old 450-line preload.mjs. The
  subprocess writes its cassette file directly; no parent recorder server needed.
- cassette-filters.mjs: per-scenario FilterSpec registry, porting the
  AI-SDK volatile-field normalizer and Mistral agent-name normalizer to
  seinfeld's FilterConfig API.
- scenario-harness.ts: drops startCassetteRecorderServer, parseCassetteMode,
  and all parent-side recorder wiring. record-missing mode replaced with
  plain record (seinfeld overwrites cassette files in full).
- 26 cassette files migrated from the legacy format to seinfeld's format
  (version + meta wrapper, body payloads as { kind, value } objects) using
  dev-packages/seinfeld/scripts/migrate-from-legacy.mjs.
- cassette-replay scenario removed (covered by seinfeld's own test suite).
- record-cassettes.mjs simplified: always uses record mode, --record-fresh
  flag dropped.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Stephen Belanger (Qard) and others added 3 commits May 1, 2026 14:57
- Run prettier on all cassette JSON files and tsconfig.json
- Add dev-packages/seinfeld workspace to knip.jsonc with correct entry
  points so internal exports are traced from src/index.ts
- Remove export keyword from internal-only constants (format/v1.ts
  intermediate Zod schemas, normalizer/redactor preset objects and header
  arrays) that are only used within their own module
- Remove unused recordResponse export from msw.ts
- Remove redundant computeMatchKey re-export from recorder.ts (it is
  still exported from matcher/index.ts which is what tests import)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…oad use

AsyncLocalStorage.enterWith() does not propagate through async boundaries
created by MSW's request interceptors. When start() is called from a
Node.js --import preload, als.getStore() returns undefined in the MSW
handler, causing every intercepted request to passthrough to the real
network instead of replaying from the cassette.

Fix: alongside als.enterWith(ctx), also set a module-level processLevelCtx.
The handler checks als.getStore() first (so concurrent use() calls via
vitest's beforeEach still work correctly) and falls back to processLevelCtx.
stop() clears it when the cassette is torn down.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
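The fallback described in this commit message can be sketched as follows. The `CassetteContext` shape and the `currentContext` helper are hypothetical; only `als`, `processLevelCtx`, `start`, and `stop` echo names used in the message, and the real functions do more than shown here.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Hypothetical context shape for illustration.
interface CassetteContext {
  name: string;
}

const als = new AsyncLocalStorage<CassetteContext>();
let processLevelCtx: CassetteContext | undefined;

function start(ctx: CassetteContext): void {
  als.enterWith(ctx); // covers code that stays in this async context
  processLevelCtx = ctx; // fallback for handlers that lose the ALS store
}

function stop(): void {
  processLevelCtx = undefined;
}

// Used by the request handler: prefer the ALS store, then the process-level one.
function currentContext(): CassetteContext | undefined {
  return als.getStore() ?? processLevelCtx;
}
```

Checking the ALS store first keeps scoped contexts (for example one per test) isolated, while the module-level value covers the preload case where the handler sees no store.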

seinfeld's createJsonFileStore appends .cassette.json when resolving
cassette file paths. Rename all cassette files accordingly and update
the extension references in tags.ts and scenario-harness.ts.

Also simplify cassette-preload.mjs to pass the __cassettes__ directory
to createJsonFileStore rather than a full file path, letting the store
handle name→path resolution naturally.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>