feat(e2e): MSW cassette layer for hermetic e2e tests #1920
Draft
Stephen Belanger (Qard) wants to merge 5 commits into main
Vendors the seinfeld VCR/cassette library into the monorepo under
dev-packages/seinfeld. The package wraps MSW to record and replay HTTP
traffic in tests: record mode hits real providers and writes JSON cassette
files; replay mode intercepts fetch and serves the recorded responses
deterministically.
Key features:
- Two-pipeline design: normalizers (matching-only) vs redactors (persistence)
- Built-in filter presets strip volatile headers/params before matching
- Paranoid redaction preset masks auth headers and credential-shaped body fields
- Vitest integration (setupCassettes) for per-test cassette lifecycle
- passthroughHosts option to exempt specific hosts from interception
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
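The normalizer/redactor split described in this commit message can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the type names, the header lists, the "[REDACTED]" placeholder, and the decision to also ignore secret headers when matching are hypothetical and are not seinfeld's actual FilterConfig API.

```typescript
// Hypothetical request shape; seinfeld's real cassette types may differ.
type HeaderMap = Record<string, string>;

interface RecordedRequest {
  method: string;
  url: string;
  headers: HeaderMap;
}

// Assumed header lists, chosen only for this example.
const VOLATILE_HEADERS = new Set(["date", "user-agent", "x-request-id"]);
const SECRET_HEADERS = new Set(["authorization", "x-api-key", "api-key"]);

// Pipeline 1 (matching only): produce a canonical view of the request.
// The result is used to compute a match key and is never written to disk.
function normalizeForMatching(req: RecordedRequest): RecordedRequest {
  const headers: HeaderMap = {};
  for (const [name, value] of Object.entries(req.headers)) {
    const lower = name.toLowerCase();
    if (!VOLATILE_HEADERS.has(lower) && !SECRET_HEADERS.has(lower)) {
      headers[lower] = value;
    }
  }
  return { method: req.method.toUpperCase(), url: req.url, headers };
}

// Pipeline 2 (persistence): mask secrets in the copy that gets saved.
// The input request is left untouched.
function redactForPersistence(req: RecordedRequest): RecordedRequest {
  const headers: HeaderMap = {};
  for (const [name, value] of Object.entries(req.headers)) {
    headers[name] = SECRET_HEADERS.has(name.toLowerCase()) ? "[REDACTED]" : value;
  }
  return { ...req, headers };
}

// Match key built from the normalized view only.
function matchKey(req: RecordedRequest): string {
  const n = normalizeForMatching(req);
  const headerLines = Object.keys(n.headers)
    .sort()
    .map((name) => `${name}:${n.headers[name]}`);
  return [n.method, n.url, ...headerLines].join("\n");
}
```

Keeping the two pipelines separate means a matching rule can be loosened or tightened without changing what is stored, and a redaction rule can be made stricter without invalidating existing matches.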
Removes e2e/helpers/cassette/ and the parent-process recorder server.
Replaces them with @braintrust/seinfeld, the workspace package added in
the preceding commit.
Key changes:
- cassette-preload.mjs: ~80-line subprocess preload that boots seinfeld
via createCassette(), replacing the old 450-line preload.mjs. The
subprocess writes its cassette file directly; no parent recorder server needed.
- cassette-filters.mjs: per-scenario FilterSpec registry, porting the
AI-SDK volatile-field normalizer and Mistral agent-name normalizer to
seinfeld's FilterConfig API.
- scenario-harness.ts: drops startCassetteRecorderServer, parseCassetteMode,
and all parent-side recorder wiring. record-missing mode replaced with
plain record (seinfeld overwrites cassette files in full).
- 26 cassette files migrated from the legacy format to seinfeld's format
(version + meta wrapper, body payloads as { kind, value } objects) using
dev-packages/seinfeld/scripts/migrate-from-legacy.mjs.
- cassette-replay scenario removed (covered by seinfeld's own test suite).
- record-cassettes.mjs simplified: always uses record mode, --record-fresh
flag dropped.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Run prettier on all cassette JSON files and tsconfig.json
- Add dev-packages/seinfeld workspace to knip.jsonc with correct entry points so internal exports are traced from src/index.ts
- Remove export keyword from internal-only constants (format/v1.ts intermediate Zod schemas, normalizer/redactor preset objects and header arrays) that are only used within their own module
- Remove unused recordResponse export from msw.ts
- Remove redundant computeMatchKey re-export from recorder.ts (it is still exported from matcher/index.ts which is what tests import)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…oad use
AsyncLocalStorage.enterWith() does not propagate through async boundaries created by MSW's request interceptors. When start() is called from a Node.js --import preload, als.getStore() returns undefined in the MSW handler, causing every intercepted request to pass through to the real network instead of replaying from the cassette.
Fix: alongside als.enterWith(ctx), also set a module-level processLevelCtx. The handler checks als.getStore() first (so concurrent use() calls via vitest's beforeEach still work correctly) and falls back to processLevelCtx. stop() clears it when the cassette is torn down.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
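The fallback described in this commit can be sketched roughly as follows. This is an illustration under stated assumptions, not seinfeld's actual code: CassetteContext, start, stop, and resolveContext are hypothetical names, and only AsyncLocalStorage with its enterWith() and getStore() methods comes from Node.js itself.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Hypothetical context object; the real one presumably carries more state.
interface CassetteContext {
  name: string;
}

const als = new AsyncLocalStorage<CassetteContext>();

// Module-level fallback for callers whose async chain does not inherit
// the store set by enterWith() (the preload case described above).
let processLevelCtx: CassetteContext | undefined;

function start(ctx: CassetteContext): void {
  als.enterWith(ctx); // scoped context for the current async chain
  processLevelCtx = ctx; // process-wide fallback
}

function stop(): void {
  // Clear the fallback when the cassette is torn down.
  processLevelCtx = undefined;
}

// What a request handler would call: prefer the async-local store so that
// concurrent scoped contexts stay isolated, then fall back to the
// process-level context.
function resolveContext(): CassetteContext | undefined {
  return als.getStore() ?? processLevelCtx;
}
```

Checking the async-local store first keeps per-test isolation intact, while the module-level variable covers the single-cassette-per-process preload case where only one context is ever active.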
seinfeld's createJsonFileStore appends .cassette.json when resolving cassette file paths. Rename all cassette files accordingly and update the extension references in tags.ts and scenario-harness.ts.
Also simplify cassette-preload.mjs to pass the __cassettes__ directory to createJsonFileStore rather than a full file path, letting the store handle name→path resolution naturally.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
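The name→path resolution mentioned in this commit amounts to joining a directory, a cassette name, and the fixed extension. A minimal sketch, assuming a hypothetical helper resolveCassettePath (this is not createJsonFileStore's real implementation, and the directory and cassette name in the usage line are made up):

```typescript
// Extension taken from the commit message above.
const CASSETTE_EXTENSION = ".cassette.json";

// Hypothetical helper: map a cassette name to a file path inside a directory.
function resolveCassettePath(directory: string, name: string): string {
  // Drop trailing slashes so the join never produces "//".
  const base = directory.replace(/\/+$/, "");
  return `${base}/${name}${CASSETTE_EXTENSION}`;
}

// Example with made-up values.
const examplePath = resolveCassettePath("scenarios/example/__cassettes__/", "example-variant");
```

Passing only the directory keeps the extension in one place, so callers such as the preload script do not need to know the file-naming convention.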
Summary
Adds an inbound provider HTTP capture/replay layer (cassettes) to the e2e test suite so hermetic CI runs replay recorded traffic instead of hitting live provider APIs. Built on MSW (already in the workspace as a dev dep for integrations/langchain-js and integrations/otel-js) plus ~600 LoC of cassette-format/matcher/recorder glue.

E2E scenarios previously hit live provider APIs on every CI run. Flakiness sources: rate limits, transient 5xx, model-output drift breaking exact-string snapshot fields. With this layer:
- __cassettes__/<variantKey>.json.
- BRAINTRUST_E2E_CASSETTE_MODE=record-missing. CI never records.
- mock-braintrust-server.ts and __snapshots__/ are untouched. Cassettes are the inbound mirror: provider→SDK, where snapshots are SDK→Braintrust.

Status: ~506 tests passing in hermetic mode across 25 scenarios, 2 of 3 consecutive runs deterministic (the third had an unrelated turbopack-auto-instrumentation Next.js compile timeout flake, not introduced by this PR).

Architecture
- SDK→Braintrust: mock-braintrust-server.ts (parent process HTTP server); stored in __snapshots__/*.json
- provider→SDK: cassette/preload.mjs boots an MSW setupServer() in subprocess; stored in __cassettes__/<variantKey>.json

- e2e/helpers/cassette/preload.mjs: loaded into each scenario subprocess via node --import=<preload>. Boots MSW synchronously and intercepts provider HTTP traffic.
- cassetteTagsFor(import.meta.url, variantKey) auto-tags scenarios with hermetic based on cassette file presence; opt-in is by committing the cassette.

Cassette modes (BRAINTRUST_E2E_CASSETTE_MODE)
- replay (default in CI): match or throw CassetteMissError.
- record: overwrite cassette fresh.
- record-missing: match if possible, else live + record. Standard re-record loop.
- passthrough: bypass cassettes entirely (local debugging).

Recording safeguards
- 400 API_KEY_INVALID responses. The 400 case was added after.
- authorization, x-api-key, api-key, x-goog-api-key, cohere-api-key, cookies, request IDs, rate-limit windows, content-encoding, etc. before persisting. (Caught a near-miss on this PR: the initial commit leaked x-api-key for Anthropic; volatile-header set has been broadened and a scrub run removed leaked values.)
- new Headers(request.headers) silently drops most headers when the source is an MSW-intercepted request (Authorization included). The forwarder copies via forEach instead. This one bug was responsible for the bulk of the recording failures during initial migration (every Mistral 401, plenty of others).

Scenarios with complete cassettes (hermetic green)
- anthropic-instrumentation (6 variants)
- openai-instrumentation (3 variants)
- claude-agent-sdk-instrumentation
- openrouter-instrumentation (2 variants)
- ai-sdk-instrumentation (4 variants)
- ai-sdk-otel-export (2 variants)
- groq-instrumentation (2 variants)
- huggingface-instrumentation (3 variants)
- openrouter-agent-instrumentation
- wrap-langchain-js-traces
- cassette-replay (meta-scenario validating record→replay loop end-to-end)
- cohere-instrumentation v7-14-0 (1 of 5 variants, see below)

Scenarios still missing cassettes (auto-skipped in hermetic mode)
These auto-skip cleanly because cassetteTagsFor only applies the hermetic tag when the cassette file is present. CI does not fail on them today; they need a follow-up record run with working credentials.

mistral-instrumentation: needs re-record after rebase
- After the rebase onto main, the existing mistral cassettes no longer match: main extended the mistral scenario with new thinking/reasoning model coverage (NATIVE_REASONING_MODEL, ADJUSTABLE_REASONING_MODEL) that wasn't in the older scenario shape the cassettes were recorded against.
- cohere-instrumentation already does: COHERE_RECORD_THROTTLE_MS = 60_000) or running mistral variants serially with longer waits between calls.
- scenario.impl.mjs and run BRAINTRUST_E2E_CASSETTE_MODE=record-missing pnpm --filter=@braintrust/js-e2e-tests vitest run scenarios/mistral-instrumentation.

google-genai-instrumentation: Gemini quota exhausted
- RESOURCE_EXHAUSTED 429).
- 400 API_KEY_INVALID responses; those were detected and deleted in this PR. The skip-list now rejects 400 to prevent recurrence.
- BRAINTRUST_E2E_CASSETTE_MODE=record-missing pnpm --filter=@braintrust/js-e2e-tests vitest run scenarios/google-genai-instrumentation.

cohere-instrumentation: per-MONTH quota exhausted (4 of 5 variants)
- "You are past the per-month request limit for this model, please wait and try again later." This is monthly, not daily; it recovers on the next billing cycle.
- The v7-14-0 variant is fully recorded (chat + chat-stream + embed + rerank) and replays green. The 4 remaining variants (v7-20-0, v7-21-0, v7default, v8) auto-skip until re-recorded.
- COHERE_RECORD_THROTTLE_MS) to land each call in a fresh budget window once quota is restored, but the throttle can't help with monthly exhaustion.

google-adk-instrumentation: model-behavior drift, unrelated to cassette layer
- No __cassettes__/ files in this PR, so it auto-skips in hermetic mode. There is pre-existing snapshot drift unrelated to this PR which should be triaged independently.

Risks / things to watch
- claude-agent-sdk-instrumentation and ai-sdk-instrumentation have the largest cassettes (long transcripts). This is intentional: diff-ability matters for review, and we want byte-identical replay.
- e2e/README.md ("Cassettes" section) and .agents/skills/e2e-tests/SKILL.md.
- nextjs-instrumentation, turbopack-auto-instrumentation, and OTEL-only scenarios. Those need separate preload mechanisms; they continue running as before (or are already hermetic-ish via different machinery).

Test plan
- pnpm --filter=@braintrust/js-e2e-tests exec vitest run --tags-filter=hermetic: green (506 passed, 396 skipped, 0 failed)
- pnpm run formatting clean
- pnpm run lint: 0 errors
- x-api-key, api-key, x-goog-api-key, authorization, etc. all stripped)

🤖 Generated with Claude Code
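The four values of BRAINTRUST_E2E_CASSETTE_MODE listed under "Cassette modes" can be read as a small per-request decision. The sketch below is only an illustration of that description: the decide function, the Action type, and the Map-based cassette are hypothetical, and the CassetteMissError class here is a stand-in defined for the example rather than the PR's own class.

```typescript
type CassetteMode = "replay" | "record" | "record-missing" | "passthrough";

// Stand-in error class for the example.
class CassetteMissError extends Error {
  constructor(key: string) {
    super(`No recorded response for request: ${key}`);
    this.name = "CassetteMissError";
  }
}

// What to do with one intercepted request.
type Action =
  | { type: "replay"; response: string } // serve the recorded response
  | { type: "live"; record: boolean }; // call the provider, optionally record

// Hypothetical decision function. `cassette` maps a match key to a
// recorded response body.
function decide(mode: CassetteMode, key: string, cassette: Map<string, string>): Action {
  switch (mode) {
    case "passthrough":
      // Bypass cassettes entirely.
      return { type: "live", record: false };
    case "record":
      // Always go live and overwrite whatever was recorded.
      return { type: "live", record: true };
    case "record-missing": {
      // Match if possible, else go live and record the new exchange.
      const hit = cassette.get(key);
      return hit !== undefined
        ? { type: "replay", response: hit }
        : { type: "live", record: true };
    }
    case "replay": {
      // Match or throw; never touch the network.
      const hit = cassette.get(key);
      if (hit === undefined) throw new CassetteMissError(key);
      return { type: "replay", response: hit };
    }
  }
}
```

In this reading, replay is the only mode that can fail a test on a missing entry, which is what makes it suitable as the CI default.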