feat(evals): Ring 1 scenario — react-router-7/framework by kaiapeacock-eng · Pull Request #571 · amplitude/wizard

kaiapeacock-eng · 2026-05-06T20:25:01Z

Summary

Adds a Ring 1 SDK-integration eval scenario for React Router 7 framework mode, mirroring the shape of nextjs-app-router/vanilla.

Pristine: minimal RR7 framework-mode starter (Vite under the hood) with home / signup / burrito routes for the agent to attach analytics to.
Golden: hand-authored NDJSON + post-run working tree per the integration skill — init in app/entry.client.tsx, env prefix VITE_ (specifically VITE_PUBLIC_AMPLITUDE_API_KEY), SDK family @amplitude/unified from the framework→SDK table.
Forbidden: vite.config.ts / react-router.config.ts / webpack.config.js / babel.config.js (build-config-bridging hard-fail surface for this framework).

Stacked on #567 (ba-104-runner-infra); will rebase onto main once #567 lands.

Test plan

pnpm evals:run react-router-7/framework → totalScore: 50, maxScore: 50, contractOk: true, hardFailed: false (Layer 0+1 only on this base branch — Layer 2 lands on a sibling PR).
pnpm exec vitest run evals/ — all 27 existing tests pass; no regressions.
pnpm lint clean (prettier + eslint).
Sanity: confirm the green replay still scores 50/50 once stacked PRs rebase onto main.

🤖 Generated with Claude Code

Note

Medium Risk
Modifies eval runner invocation/cleanup and live spawn behavior (working directory location, binary override, integration hint toggling), which could affect all live eval runs and scorer inputs. Adds a sizable new scenario fixture and adjusts scoring logic/tests, increasing surface area for regressions but remaining confined to the eval tooling.

Overview
Adds a new Ring 1 eval scenario for react-router-7/framework, including a pristine/ RR7 starter and a full golden/ replay artifact + post-run working/ tree.

Upgrades the runner/orchestrator contract so runLive/runReplay return { artifact, workingDir, cleanup }, with live runs minting per-run tmp working dirs and cleanup deferred until after scoring (now enforced via try/finally in evals/bin/run-eval.ts).

Extends live spawning to support --wizard-bin/WIZARD_BIN overrides and adds Scenario.useDetection (default true) to optionally omit --integration for detection testing; includes new Vitest coverage for spawn arg construction, working-dir behavior, and schema defaults.

Fixes the Layer 0 single-init-call scorer to avoid double-counting amplitude.init( when a named init import exists, and corrects the Layer 1 exit-code-matches-outcome scorer’s criterion to 0 to avoid report label collisions.

^{Reviewed by Cursor Bugbot for commit 8bde3c0. Bugbot is set up for automated code reviews on this repo. Configure here.}

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Autofix Details

Bugbot Autofix prepared fixes for both issues found in the latest run.

✅ Fixed: Live runs delete working directory before scoring
- Removed premature rmSync from runLive and moved cleanup to run-eval.ts after scoring completes, ensuring scorers can read files from workingDir.
✅ Fixed: Init count overlap subtraction missing despite comment
- Added subtraction logic to countInitCallsInFile to properly deduct amplitude.init matches from bare init() counts when named imports are present.

Or push these changes by commenting:

@cursor push ed3f519ddd

Preview (ed3f519ddd)

diff --git a/evals/bin/run-eval.ts b/evals/bin/run-eval.ts
--- a/evals/bin/run-eval.ts
+++ b/evals/bin/run-eval.ts
@@ -21,7 +21,7 @@
  * available for whoever has a key on hand.
  */
 
-import { existsSync, mkdirSync, readFileSync, writeFileSync } from 'node:fs';
+import { existsSync, mkdirSync, readFileSync, rmSync, writeFileSync } from 'node:fs';
 import { dirname, join, resolve } from 'node:path';
 
 import { runLive, runReplay } from '../runner/invoke-wizard.js';
@@ -143,6 +143,17 @@
   // Score against the workingDir the runner returned.
   const report = score({ artifact, scenario, workingDir });
 
+  // Best-effort teardown of live run working directory. Only cleanup
+  // live runs (tmpdir-scoped); golden replay directories are committed
+  // and should not be removed.
+  if (artifact.source === 'live') {
+    try {
+      rmSync(workingDir, { recursive: true });
+    } catch {
+      // ignore — a stale dir only costs tmpdir space
+    }
+  }
+
   // Write the report.
   const reportsRoot = args.reportsDir ?? resolve(repoRoot, 'evals', 'reports');
   const runDir = join(reportsRoot, artifact.runId);

diff --git a/evals/runner/invoke-wizard.ts b/evals/runner/invoke-wizard.ts
--- a/evals/runner/invoke-wizard.ts
+++ b/evals/runner/invoke-wizard.ts
@@ -246,15 +246,6 @@
   const parsed = parseStream(stdout);
   const fsSnapshot = captureFsSnapshot(pristineDir, workingDir);
 
-  // Best-effort teardown. If this fails the next run will still
-  // start by removing `workingDir` (mintWorkingDir is keyed on runId,
-  // which is fresh per call, so a leak only costs tmpdir space).
-  try {
-    rmSync(workingDir, { recursive: true });
-  } catch {
-    // ignore — diagnostics will surface a stale dir
-  }
-
   return {
     artifact: {
       runId,

diff --git a/evals/scorers/layer0-hard-fail/single-init-call.ts b/evals/scorers/layer0-hard-fail/single-init-call.ts
--- a/evals/scorers/layer0-hard-fail/single-init-call.ts
+++ b/evals/scorers/layer0-hard-fail/single-init-call.ts
@@ -44,7 +44,11 @@
     const matches = text.match(BARE_INIT_CALL);
     // Each `init(` after a named import counts as a real call. We
     // already counted `amplitude.init(` above; subtract the overlap.
-    if (matches) count += matches.length;
+    if (matches) {
+      const overlapMatches = text.match(/\bamplitude\.init\s*\(/gi);
+      const overlap = overlapMatches ? overlapMatches.length : 0;
+      count += matches.length - overlap;
+    }
   }
   return count;
 }

_{You can send follow-ups to the cloud agent here.}

Scenarios default to `useDetection: true` — the runner forwards `--integration <hint>` so SDK-integration regressions are graded deterministically and don't get tangled up in detection drift. Setting `useDetection: false` drops the flag and lets the wizard run its full detection pipeline against the pristine fixture, with `integrationHint` as the ground truth downstream scorers will check.

Replaces the single-slot `<scenarioDir>/working/` path with a per-run subdir under `os.tmpdir()` keyed by runId, so two parallel scenario runs no longer collide on a shared working tree. The runner now returns the absolute workingDir alongside the artifact via a new `RunnerResult` shape; both `runLive` and `runReplay` produce it so `run-eval.ts` doesn't branch on source when wiring the workingDir into the scorer stack. Cleanup stays best-effort — the next run mints a fresh runId, so a leak only costs tmpdir space.

Lets the runner target binaries other than the developer-checkout `pnpm exec tsx bin.ts` shape — needed for Ring 3 packaging coverage where we run the same scenarios against `npx @amplitude/wizard@latest` or a built tarball. Set `WIZARD_BIN` on the runner process or pass `--wizard-bin <cmd>` to `evals:run`; the flag wins. The override is tokenized on whitespace and spawned directly — `--agent --yes --install-dir <dir>` is appended by the runner since those are framework invariants, not packaging concerns.

Bug: runLive used to rmSync(workingDir) before returning, but the RunnerResult still carried that path, and run-eval.ts then forwarded it to score(), which sets process.env.EVALS_WORKING_DIR for the scorer stack to read files from. Every live run hard-failed Layer 0 the moment a scorer tried readFileSync(join(root, 'package.json')) — the tree was already gone. The bug went undetected because every existing scorer test runs against the golden replay path, which keeps its working dir on disk. Fix: lifecycle ownership moves up to the orchestrator. - RunnerResult gains a `cleanup: () => void` callback. runLive's cleanup removes the tmpdir-scoped tree (best-effort, idempotent). runReplay's cleanup is a no-op since the golden working dir is a permanent fixture. - run-eval.ts wraps score() in try/finally and invokes cleanup afterward, so a thrown scorer can't leak per-run tmpdirs across the suite. Regression test (`invoke-wizard.test.ts`): - asserts runLive returns a populated workingDir (and that the pristine package.json is readable from it) — this would have caught the pre-PR ordering bug - asserts cleanup() removes the tree and is idempotent - asserts replay's cleanup is a no-op so the golden fixture survives No behavior change for the golden-replay path. Only the live path was broken. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Address remaining Bugbot findings on the runner-infra PR (kelsonpw already approved; HIGH-severity working-dir cleanup landed in b8e112b). Medium — wrong criterion number on exit-code-matches-outcome. Criterion 19 is the setup-report scorer (`setup-complete-shape.ts`); exit-code/outcome consistency is runner contract assertion 4 and isn't a row in the 19-point quality checklist. Reusing 19 collided with the real criterion-19 scorer and mislabeled the report. Set criterion=0 with a comment explaining this is a contract assertion. Low — single-init-call double-counts on amplitude.init() form. The overlap-subtraction comment correctly identified that BARE_INIT_CALL also matches the `init(` suffix of `amplitude.init(` (the `.` is a non-word char, so `\binit(` matches both forms), but the code unconditionally added instead of subtracting. A file with one `amplitude.init(` and a named-init import was counted as two — hard- failing clean integrations on layer 0. New regression test covers the overlap case plus the still-hard-fail multi-init and no-init paths. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds a Ring 1 SDK-integration eval scenario for React Router 7 framework mode, mirroring the structural shape of the nextjs-app-router/vanilla scenario. The pristine fixture is a minimal RR7 framework-mode starter (`react-router build` / Vite) with a few representative routes (home, signup, burrito) that the agent can attach analytics to. The golden run reflects the wizard skill's canonical pattern: init lives in `app/entry.client.tsx`, the env var uses the `VITE_PUBLIC_AMPLITUDE_API_KEY` prefix, and the SDK is `@amplitude/unified` (defaulted via the framework→SDK table). Replays cleanly through `pnpm evals:run react-router-7/framework`: 50/50 score, contractOk, no hard fails on the Layer 0+1 stack that ships on this base branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix prepared a fix for the issue found in the latest run.

✅ Fixed: Golden fixture missing .env.local claimed by NDJSON
- Added .env.local file with VITE_PUBLIC_AMPLITUDE_API_KEY placeholder to match the setup_complete.files.written claim in run.ndjson, force-committed despite .gitignore exclusion.

Or push these changes by commenting:

@cursor push b31324134e

Preview (b31324134e)

diff --git a/evals/scenarios/react-router-7/framework/golden/working/.env.local b/evals/scenarios/react-router-7/framework/golden/working/.env.local
new file mode 100644
--- /dev/null
+++ b/evals/scenarios/react-router-7/framework/golden/working/.env.local
@@ -1,0 +1,2 @@
+# Amplitude API key for client-side initialization
+VITE_PUBLIC_AMPLITUDE_API_KEY=your-api-key-here

_{You can send follow-ups to the cloud agent here.}

^{Reviewed by Cursor Bugbot for commit 8bde3c0. Configure here.}

cursor · 2026-05-08T16:27:00Z

+.react-router/
+build/
+.env
+.env.local


Golden fixture missing .env.local claimed by NDJSON

Low Severity

The golden run.ndjson records file_change_applied for .env.local and the setup_complete event lists it in files.written, but the golden/working/ directory does not contain this file. The .gitignore entry for .env.local likely caused it to be excluded when the fixture was committed. This means captureFsSnapshot won't include .env.local in diff.added, creating an inconsistency between the NDJSON claims and the filesystem state that scorers grade against.

Additional Locations (1)

evals/scenarios/react-router-7/framework/golden/run.ndjson#L7-L9

^{Reviewed by Cursor Bugbot for commit 8bde3c0. Configure here.}

kelsonpw

LGTM. Solid set of changes for a scenario PR — runner contract upgrade is the load-bearing part:\n\n- Per-run tmpdir keyed by runId means parallel scenario runs no longer collide on a shared working tree (good).\n- HIGH severity bug landed in ef0b9ad: live cleanup deferred to after scoring via try/finally ownership in run-eval.ts — this fixes the ENOENT cliff every file-reading scorer would have hit on a live run. The new invoke-wizard.test.ts regression tests (workingDir populated, cleanup is idempotent, replay cleanup is no-op) are exactly the right pins.\n- single-init-call.ts overlap subtraction fix in 3a287f7 prevents amplitude.init() + named init import from being double-counted as a hard fail. Regression test covers the overlap, multi-init, and no-init cases.\n- exit-code-matches-outcome criterion=0 fixes the report-label collision with the real criterion-19 setup-report scorer.\n- useDetection toggle is well documented (default true, false drops --integration).\n\nThe rr7-framework fixture itself is faithful to the integration skill (entry.client.tsx init, VITE_PUBLIC_AMPLITUDE_API_KEY, @amplitude/unified) and replays 50/50.\n\nNon-blocking: Bugbot low-severity on golden/working/.gitignore — run.ndjson logs .env.local under files.written but the file isn't in the working tree (gitignored). Fixture is internally inconsistent vs. captureFsSnapshot diff. The sibling rr7-data and react-vite PRs caught the same issue and added the entry — apply the same restoration here.

kaiapeacock-eng requested a review from a team as a code owner May 6, 2026 20:25

cursor Bot reviewed May 6, 2026

View reviewed changes

Comment thread evals/runner/invoke-wizard.ts Outdated

Comment thread evals/scorers/layer0-hard-fail/single-init-call.ts

kaiapeacock-eng and others added 6 commits May 7, 2026 15:25

kaiapeacock-eng force-pushed the ba-104-scenario-rr7-framework branch from df240a6 to 8bde3c0 Compare May 8, 2026 16:21

cursor Bot reviewed May 8, 2026

View reviewed changes

kelsonpw mentioned this pull request May 8, 2026

feat(evals): Layer 3 build runner + PR-gate CI workflow #570

Merged

4 tasks

kelsonpw approved these changes May 8, 2026

View reviewed changes

This was referenced May 8, 2026

feat(evals): Ring 1 scenario — react-vite/vanilla #576

Closed

feat(evals): Ring 1 scenario — react-router-7/data #575

Closed

kaiapeacock-eng merged commit 47959a3 into main May 8, 2026
12 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(evals): Ring 1 scenario — react-router-7/framework#571

feat(evals): Ring 1 scenario — react-router-7/framework#571
kaiapeacock-eng merged 6 commits into
mainfrom
ba-104-scenario-rr7-framework

kaiapeacock-eng commented May 6, 2026 •

edited by cursor Bot

Loading

Uh oh!

cursor Bot left a comment •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

cursor Bot left a comment •

edited

Loading

Uh oh!

cursor Bot May 8, 2026

Uh oh!

kelsonpw left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

kaiapeacock-eng commented May 6, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

cursor Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

cursor Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cursor Bot May 8, 2026

Choose a reason for hiding this comment

Golden fixture missing .env.local claimed by NDJSON

Uh oh!

kelsonpw left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

kaiapeacock-eng commented May 6, 2026 •

edited by cursor Bot

Loading

cursor Bot left a comment •

edited

Loading

cursor Bot left a comment •

edited

Loading

Golden fixture missing `.env.local` claimed by NDJSON