Skip to content

feat(evals): Ring 1 scenario — react-router-7/framework#571

Merged
kaiapeacock-eng merged 6 commits into
mainfrom
ba-104-scenario-rr7-framework
May 8, 2026
Merged

feat(evals): Ring 1 scenario — react-router-7/framework#571
kaiapeacock-eng merged 6 commits into
mainfrom
ba-104-scenario-rr7-framework

Conversation

@kaiapeacock-eng
Copy link
Copy Markdown
Collaborator

@kaiapeacock-eng kaiapeacock-eng commented May 6, 2026

Summary

Adds a Ring 1 SDK-integration eval scenario for React Router 7 framework mode, mirroring the shape of nextjs-app-router/vanilla.

  • Pristine: minimal RR7 framework-mode starter (Vite under the hood) with home / signup / burrito routes for the agent to attach analytics to.
  • Golden: hand-authored NDJSON + post-run working tree per the integration skill — init in app/entry.client.tsx, env prefix VITE_ (specifically VITE_PUBLIC_AMPLITUDE_API_KEY), SDK family @amplitude/unified from the framework→SDK table.
  • Forbidden: vite.config.ts / react-router.config.ts / webpack.config.js / babel.config.js (build-config-bridging hard-fail surface for this framework).

Stacked on #567 (ba-104-runner-infra); will rebase onto main once #567 lands.

Test plan

  • pnpm evals:run react-router-7/frameworktotalScore: 50, maxScore: 50, contractOk: true, hardFailed: false (Layer 0+1 only on this base branch — Layer 2 lands on a sibling PR).
  • pnpm exec vitest run evals/ — all 27 existing tests pass; no regressions.
  • pnpm lint clean (prettier + eslint).
  • Sanity: confirm the green replay still scores 50/50 once stacked PRs rebase onto main.

🤖 Generated with Claude Code


Note

Medium Risk
Modifies eval runner invocation/cleanup and live spawn behavior (working directory location, binary override, integration hint toggling), which could affect all live eval runs and scorer inputs. Adds a sizable new scenario fixture and adjusts scoring logic/tests, increasing surface area for regressions but remaining confined to the eval tooling.

Overview
Adds a new Ring 1 eval scenario for react-router-7/framework, including a pristine/ RR7 starter and a full golden/ replay artifact + post-run working/ tree.

Upgrades the runner/orchestrator contract so runLive/runReplay return { artifact, workingDir, cleanup }, with live runs minting per-run tmp working dirs and cleanup deferred until after scoring (now enforced via try/finally in evals/bin/run-eval.ts).

Extends live spawning to support --wizard-bin/WIZARD_BIN overrides and adds Scenario.useDetection (default true) to optionally omit --integration for detection testing; includes new Vitest coverage for spawn arg construction, working-dir behavior, and schema defaults.

Fixes the Layer 0 single-init-call scorer to avoid double-counting amplitude.init( when a named init import exists, and corrects the Layer 1 exit-code-matches-outcome scorer’s criterion to 0 to avoid report label collisions.

Reviewed by Cursor Bugbot for commit 8bde3c0. Bugbot is set up for automated code reviews on this repo. Configure here.

@kaiapeacock-eng kaiapeacock-eng requested a review from a team as a code owner May 6, 2026 20:25
Copy link
Copy Markdown
Contributor

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Autofix Details

Bugbot Autofix prepared fixes for both issues found in the latest run.

  • ✅ Fixed: Live runs delete working directory before scoring
    • Removed premature rmSync from runLive and moved cleanup to run-eval.ts after scoring completes, ensuring scorers can read files from workingDir.
  • ✅ Fixed: Init count overlap subtraction missing despite comment
    • Added subtraction logic to countInitCallsInFile to properly deduct amplitude.init matches from bare init() counts when named imports are present.

Create PR

Or push these changes by commenting:

@cursor push ed3f519ddd
Preview (ed3f519ddd)
diff --git a/evals/bin/run-eval.ts b/evals/bin/run-eval.ts
--- a/evals/bin/run-eval.ts
+++ b/evals/bin/run-eval.ts
@@ -21,7 +21,7 @@
  * available for whoever has a key on hand.
  */
 
-import { existsSync, mkdirSync, readFileSync, writeFileSync } from 'node:fs';
+import { existsSync, mkdirSync, readFileSync, rmSync, writeFileSync } from 'node:fs';
 import { dirname, join, resolve } from 'node:path';
 
 import { runLive, runReplay } from '../runner/invoke-wizard.js';
@@ -143,6 +143,17 @@
   // Score against the workingDir the runner returned.
   const report = score({ artifact, scenario, workingDir });
 
+  // Best-effort teardown of live run working directory. Only cleanup
+  // live runs (tmpdir-scoped); golden replay directories are committed
+  // and should not be removed.
+  if (artifact.source === 'live') {
+    try {
+      rmSync(workingDir, { recursive: true });
+    } catch {
+      // ignore — a stale dir only costs tmpdir space
+    }
+  }
+
   // Write the report.
   const reportsRoot = args.reportsDir ?? resolve(repoRoot, 'evals', 'reports');
   const runDir = join(reportsRoot, artifact.runId);

diff --git a/evals/runner/invoke-wizard.ts b/evals/runner/invoke-wizard.ts
--- a/evals/runner/invoke-wizard.ts
+++ b/evals/runner/invoke-wizard.ts
@@ -246,15 +246,6 @@
   const parsed = parseStream(stdout);
   const fsSnapshot = captureFsSnapshot(pristineDir, workingDir);
 
-  // Best-effort teardown. If this fails the next run will still
-  // start by removing `workingDir` (mintWorkingDir is keyed on runId,
-  // which is fresh per call, so a leak only costs tmpdir space).
-  try {
-    rmSync(workingDir, { recursive: true });
-  } catch {
-    // ignore — diagnostics will surface a stale dir
-  }
-
   return {
     artifact: {
       runId,

diff --git a/evals/scorers/layer0-hard-fail/single-init-call.ts b/evals/scorers/layer0-hard-fail/single-init-call.ts
--- a/evals/scorers/layer0-hard-fail/single-init-call.ts
+++ b/evals/scorers/layer0-hard-fail/single-init-call.ts
@@ -44,7 +44,11 @@
     const matches = text.match(BARE_INIT_CALL);
     // Each `init(` after a named import counts as a real call. We
     // already counted `amplitude.init(` above; subtract the overlap.
-    if (matches) count += matches.length;
+    if (matches) {
+      const overlapMatches = text.match(/\bamplitude\.init\s*\(/gi);
+      const overlap = overlapMatches ? overlapMatches.length : 0;
+      count += matches.length - overlap;
+    }
   }
   return count;
 }

You can send follow-ups to the cloud agent here.

Comment thread evals/runner/invoke-wizard.ts Outdated
Comment thread evals/scorers/layer0-hard-fail/single-init-call.ts
kaiapeacock-eng and others added 6 commits May 7, 2026 15:25
Scenarios default to `useDetection: true` — the runner forwards
`--integration <hint>` so SDK-integration regressions are graded
deterministically and don't get tangled up in detection drift.
Setting `useDetection: false` drops the flag and lets the wizard run
its full detection pipeline against the pristine fixture, with
`integrationHint` as the ground truth downstream scorers will check.
Replaces the single-slot `<scenarioDir>/working/` path with a
per-run subdir under `os.tmpdir()` keyed by runId, so two parallel
scenario runs no longer collide on a shared working tree. The runner
now returns the absolute workingDir alongside the artifact via a new
`RunnerResult` shape; both `runLive` and `runReplay` produce it so
`run-eval.ts` doesn't branch on source when wiring the workingDir
into the scorer stack. Cleanup stays best-effort — the next run
mints a fresh runId, so a leak only costs tmpdir space.
Lets the runner target binaries other than the developer-checkout
`pnpm exec tsx bin.ts` shape — needed for Ring 3 packaging coverage
where we run the same scenarios against `npx
@amplitude/wizard@latest` or a built tarball. Set `WIZARD_BIN` on the
runner process or pass `--wizard-bin <cmd>` to `evals:run`; the
flag wins. The override is tokenized on whitespace and spawned
directly — `--agent --yes --install-dir <dir>` is appended by the
runner since those are framework invariants, not packaging concerns.
Bug: runLive used to rmSync(workingDir) before returning, but the
RunnerResult still carried that path, and run-eval.ts then forwarded it
to score(), which sets process.env.EVALS_WORKING_DIR for the scorer
stack to read files from. Every live run hard-failed Layer 0 the
moment a scorer tried readFileSync(join(root, 'package.json')) — the
tree was already gone. The bug went undetected because every existing
scorer test runs against the golden replay path, which keeps its
working dir on disk.

Fix: lifecycle ownership moves up to the orchestrator.

  - RunnerResult gains a `cleanup: () => void` callback. runLive's
    cleanup removes the tmpdir-scoped tree (best-effort, idempotent).
    runReplay's cleanup is a no-op since the golden working dir is a
    permanent fixture.
  - run-eval.ts wraps score() in try/finally and invokes cleanup
    afterward, so a thrown scorer can't leak per-run tmpdirs across
    the suite.

Regression test (`invoke-wizard.test.ts`):

  - asserts runLive returns a populated workingDir (and that the
    pristine package.json is readable from it) — this would have
    caught the pre-PR ordering bug
  - asserts cleanup() removes the tree and is idempotent
  - asserts replay's cleanup is a no-op so the golden fixture survives

No behavior change for the golden-replay path. Only the live path was
broken.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address remaining Bugbot findings on the runner-infra PR (kelsonpw
already approved; HIGH-severity working-dir cleanup landed in b8e112b).

Medium — wrong criterion number on exit-code-matches-outcome.
Criterion 19 is the setup-report scorer (`setup-complete-shape.ts`);
exit-code/outcome consistency is runner contract assertion 4 and
isn't a row in the 19-point quality checklist. Reusing 19 collided
with the real criterion-19 scorer and mislabeled the report. Set
criterion=0 with a comment explaining this is a contract assertion.

Low — single-init-call double-counts on amplitude.init() form.
The overlap-subtraction comment correctly identified that BARE_INIT_CALL
also matches the `init(` suffix of `amplitude.init(` (the `.` is a
non-word char, so `\binit(` matches both forms), but the code
unconditionally added instead of subtracting. A file with one
`amplitude.init(` and a named-init import was counted as two — hard-
failing clean integrations on layer 0. New regression test covers the
overlap case plus the still-hard-fail multi-init and no-init paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a Ring 1 SDK-integration eval scenario for React Router 7
framework mode, mirroring the structural shape of the
nextjs-app-router/vanilla scenario.

The pristine fixture is a minimal RR7 framework-mode starter
(`react-router build` / Vite) with a few representative routes
(home, signup, burrito) that the agent can attach analytics to.
The golden run reflects the wizard skill's canonical pattern:
init lives in `app/entry.client.tsx`, the env var uses the
`VITE_PUBLIC_AMPLITUDE_API_KEY` prefix, and the SDK is
`@amplitude/unified` (defaulted via the framework→SDK table).

Replays cleanly through `pnpm evals:run react-router-7/framework`:
50/50 score, contractOk, no hard fails on the Layer 0+1 stack
that ships on this base branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kaiapeacock-eng kaiapeacock-eng force-pushed the ba-104-scenario-rr7-framework branch from df240a6 to 8bde3c0 Compare May 8, 2026 16:21
Copy link
Copy Markdown
Contributor

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Golden fixture missing .env.local claimed by NDJSON
    • Added .env.local file with VITE_PUBLIC_AMPLITUDE_API_KEY placeholder to match the setup_complete.files.written claim in run.ndjson, force-committed despite .gitignore exclusion.

Create PR

Or push these changes by commenting:

@cursor push b31324134e
Preview (b31324134e)
diff --git a/evals/scenarios/react-router-7/framework/golden/working/.env.local b/evals/scenarios/react-router-7/framework/golden/working/.env.local
new file mode 100644
--- /dev/null
+++ b/evals/scenarios/react-router-7/framework/golden/working/.env.local
@@ -1,0 +1,2 @@
+# Amplitude API key for client-side initialization
+VITE_PUBLIC_AMPLITUDE_API_KEY=your-api-key-here

You can send follow-ups to the cloud agent here.

Reviewed by Cursor Bugbot for commit 8bde3c0. Configure here.

.react-router/
build/
.env
.env.local
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Golden fixture missing .env.local claimed by NDJSON

Low Severity

The golden run.ndjson records file_change_applied for .env.local and the setup_complete event lists it in files.written, but the golden/working/ directory does not contain this file. The .gitignore entry for .env.local likely caused it to be excluded when the fixture was committed. This means captureFsSnapshot won't include .env.local in diff.added, creating an inconsistency between the NDJSON claims and the filesystem state that scorers grade against.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 8bde3c0. Configure here.

Copy link
Copy Markdown
Member

@kelsonpw kelsonpw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Solid set of changes for a scenario PR — runner contract upgrade is the load-bearing part:\n\n- Per-run tmpdir keyed by runId means parallel scenario runs no longer collide on a shared working tree (good).\n- HIGH severity bug landed in ef0b9ad: live cleanup deferred to after scoring via try/finally ownership in run-eval.ts — this fixes the ENOENT cliff every file-reading scorer would have hit on a live run. The new invoke-wizard.test.ts regression tests (workingDir populated, cleanup is idempotent, replay cleanup is no-op) are exactly the right pins.\n- single-init-call.ts overlap subtraction fix in 3a287f7 prevents amplitude.init() + named init import from being double-counted as a hard fail. Regression test covers the overlap, multi-init, and no-init cases.\n- exit-code-matches-outcome criterion=0 fixes the report-label collision with the real criterion-19 setup-report scorer.\n- useDetection toggle is well documented (default true, false drops --integration).\n\nThe rr7-framework fixture itself is faithful to the integration skill (entry.client.tsx init, VITE_PUBLIC_AMPLITUDE_API_KEY, @amplitude/unified) and replays 50/50.\n\nNon-blocking: Bugbot low-severity on golden/working/.gitignore — run.ndjson logs .env.local under files.written but the file isn't in the working tree (gitignored). Fixture is internally inconsistent vs. captureFsSnapshot diff. The sibling rr7-data and react-vite PRs caught the same issue and added the entry — apply the same restoration here.

@kaiapeacock-eng kaiapeacock-eng merged commit 47959a3 into main May 8, 2026
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants