feat(evals): Ring 1 scenario — react-router-7/framework#571
Conversation
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Autofix Details
Bugbot Autofix prepared fixes for both issues found in the latest run.
- ✅ Fixed: Live runs delete working directory before scoring
- Removed premature rmSync from runLive and moved cleanup to run-eval.ts after scoring completes, ensuring scorers can read files from workingDir.
- ✅ Fixed: Init count overlap subtraction missing despite comment
- Added subtraction logic to countInitCallsInFile to properly deduct amplitude.init matches from bare init() counts when named imports are present.
Or push these changes by commenting:
@cursor push ed3f519ddd
Preview (ed3f519ddd)
diff --git a/evals/bin/run-eval.ts b/evals/bin/run-eval.ts
--- a/evals/bin/run-eval.ts
+++ b/evals/bin/run-eval.ts
@@ -21,7 +21,7 @@
* available for whoever has a key on hand.
*/
-import { existsSync, mkdirSync, readFileSync, writeFileSync } from 'node:fs';
+import { existsSync, mkdirSync, readFileSync, rmSync, writeFileSync } from 'node:fs';
import { dirname, join, resolve } from 'node:path';
import { runLive, runReplay } from '../runner/invoke-wizard.js';
@@ -143,6 +143,17 @@
// Score against the workingDir the runner returned.
const report = score({ artifact, scenario, workingDir });
+ // Best-effort teardown of live run working directory. Only cleanup
+ // live runs (tmpdir-scoped); golden replay directories are committed
+ // and should not be removed.
+ if (artifact.source === 'live') {
+ try {
+ rmSync(workingDir, { recursive: true });
+ } catch {
+ // ignore — a stale dir only costs tmpdir space
+ }
+ }
+
// Write the report.
const reportsRoot = args.reportsDir ?? resolve(repoRoot, 'evals', 'reports');
const runDir = join(reportsRoot, artifact.runId);
diff --git a/evals/runner/invoke-wizard.ts b/evals/runner/invoke-wizard.ts
--- a/evals/runner/invoke-wizard.ts
+++ b/evals/runner/invoke-wizard.ts
@@ -246,15 +246,6 @@
const parsed = parseStream(stdout);
const fsSnapshot = captureFsSnapshot(pristineDir, workingDir);
- // Best-effort teardown. If this fails the next run will still
- // start by removing `workingDir` (mintWorkingDir is keyed on runId,
- // which is fresh per call, so a leak only costs tmpdir space).
- try {
- rmSync(workingDir, { recursive: true });
- } catch {
- // ignore — diagnostics will surface a stale dir
- }
-
return {
artifact: {
runId,
diff --git a/evals/scorers/layer0-hard-fail/single-init-call.ts b/evals/scorers/layer0-hard-fail/single-init-call.ts
--- a/evals/scorers/layer0-hard-fail/single-init-call.ts
+++ b/evals/scorers/layer0-hard-fail/single-init-call.ts
@@ -44,7 +44,11 @@
const matches = text.match(BARE_INIT_CALL);
// Each `init(` after a named import counts as a real call. We
// already counted `amplitude.init(` above; subtract the overlap.
- if (matches) count += matches.length;
+ if (matches) {
+ const overlapMatches = text.match(/\bamplitude\.init\s*\(/gi);
+ const overlap = overlapMatches ? overlapMatches.length : 0;
+ count += matches.length - overlap;
+ }
}
return count;
}You can send follow-ups to the cloud agent here.
Scenarios default to `useDetection: true` — the runner forwards `--integration <hint>` so SDK-integration regressions are graded deterministically and don't get tangled up in detection drift. Setting `useDetection: false` drops the flag and lets the wizard run its full detection pipeline against the pristine fixture, with `integrationHint` as the ground truth downstream scorers will check.
Replaces the single-slot `<scenarioDir>/working/` path with a per-run subdir under `os.tmpdir()` keyed by runId, so two parallel scenario runs no longer collide on a shared working tree. The runner now returns the absolute workingDir alongside the artifact via a new `RunnerResult` shape; both `runLive` and `runReplay` produce it so `run-eval.ts` doesn't branch on source when wiring the workingDir into the scorer stack. Cleanup stays best-effort — the next run mints a fresh runId, so a leak only costs tmpdir space.
Lets the runner target binaries other than the developer-checkout `pnpm exec tsx bin.ts` shape — needed for Ring 3 packaging coverage where we run the same scenarios against `npx @amplitude/wizard@latest` or a built tarball. Set `WIZARD_BIN` on the runner process or pass `--wizard-bin <cmd>` to `evals:run`; the flag wins. The override is tokenized on whitespace and spawned directly — `--agent --yes --install-dir <dir>` is appended by the runner since those are framework invariants, not packaging concerns.
Bug: runLive used to rmSync(workingDir) before returning, but the
RunnerResult still carried that path, and run-eval.ts then forwarded it
to score(), which sets process.env.EVALS_WORKING_DIR for the scorer
stack to read files from. Every live run hard-failed Layer 0 the
moment a scorer tried readFileSync(join(root, 'package.json')) — the
tree was already gone. The bug went undetected because every existing
scorer test runs against the golden replay path, which keeps its
working dir on disk.
Fix: lifecycle ownership moves up to the orchestrator.
- RunnerResult gains a `cleanup: () => void` callback. runLive's
cleanup removes the tmpdir-scoped tree (best-effort, idempotent).
runReplay's cleanup is a no-op since the golden working dir is a
permanent fixture.
- run-eval.ts wraps score() in try/finally and invokes cleanup
afterward, so a thrown scorer can't leak per-run tmpdirs across
the suite.
Regression test (`invoke-wizard.test.ts`):
- asserts runLive returns a populated workingDir (and that the
pristine package.json is readable from it) — this would have
caught the pre-PR ordering bug
- asserts cleanup() removes the tree and is idempotent
- asserts replay's cleanup is a no-op so the golden fixture survives
No behavior change for the golden-replay path. Only the live path was
broken.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address remaining Bugbot findings on the runner-infra PR (kelsonpw already approved; HIGH-severity working-dir cleanup landed in b8e112b). Medium — wrong criterion number on exit-code-matches-outcome. Criterion 19 is the setup-report scorer (`setup-complete-shape.ts`); exit-code/outcome consistency is runner contract assertion 4 and isn't a row in the 19-point quality checklist. Reusing 19 collided with the real criterion-19 scorer and mislabeled the report. Set criterion=0 with a comment explaining this is a contract assertion. Low — single-init-call double-counts on amplitude.init() form. The overlap-subtraction comment correctly identified that BARE_INIT_CALL also matches the `init(` suffix of `amplitude.init(` (the `.` is a non-word char, so `\binit(` matches both forms), but the code unconditionally added instead of subtracting. A file with one `amplitude.init(` and a named-init import was counted as two — hard- failing clean integrations on layer 0. New regression test covers the overlap case plus the still-hard-fail multi-init and no-init paths. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a Ring 1 SDK-integration eval scenario for React Router 7 framework mode, mirroring the structural shape of the nextjs-app-router/vanilla scenario. The pristine fixture is a minimal RR7 framework-mode starter (`react-router build` / Vite) with a few representative routes (home, signup, burrito) that the agent can attach analytics to. The golden run reflects the wizard skill's canonical pattern: init lives in `app/entry.client.tsx`, the env var uses the `VITE_PUBLIC_AMPLITUDE_API_KEY` prefix, and the SDK is `@amplitude/unified` (defaulted via the framework→SDK table). Replays cleanly through `pnpm evals:run react-router-7/framework`: 50/50 score, contractOk, no hard fails on the Layer 0+1 stack that ships on this base branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
df240a6 to
8bde3c0
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Golden fixture missing
.env.localclaimed by NDJSON- Added .env.local file with VITE_PUBLIC_AMPLITUDE_API_KEY placeholder to match the setup_complete.files.written claim in run.ndjson, force-committed despite .gitignore exclusion.
Or push these changes by commenting:
@cursor push b31324134e
Preview (b31324134e)
diff --git a/evals/scenarios/react-router-7/framework/golden/working/.env.local b/evals/scenarios/react-router-7/framework/golden/working/.env.local
new file mode 100644
--- /dev/null
+++ b/evals/scenarios/react-router-7/framework/golden/working/.env.local
@@ -1,0 +1,2 @@
+# Amplitude API key for client-side initialization
+VITE_PUBLIC_AMPLITUDE_API_KEY=your-api-key-hereYou can send follow-ups to the cloud agent here.
Reviewed by Cursor Bugbot for commit 8bde3c0. Configure here.
| .react-router/ | ||
| build/ | ||
| .env | ||
| .env.local |
There was a problem hiding this comment.
Golden fixture missing .env.local claimed by NDJSON
Low Severity
The golden run.ndjson records file_change_applied for .env.local and the setup_complete event lists it in files.written, but the golden/working/ directory does not contain this file. The .gitignore entry for .env.local likely caused it to be excluded when the fixture was committed. This means captureFsSnapshot won't include .env.local in diff.added, creating an inconsistency between the NDJSON claims and the filesystem state that scorers grade against.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 8bde3c0. Configure here.
kelsonpw
left a comment
There was a problem hiding this comment.
LGTM. Solid set of changes for a scenario PR — runner contract upgrade is the load-bearing part:\n\n- Per-run tmpdir keyed by runId means parallel scenario runs no longer collide on a shared working tree (good).\n- HIGH severity bug landed in ef0b9ad: live cleanup deferred to after scoring via try/finally ownership in run-eval.ts — this fixes the ENOENT cliff every file-reading scorer would have hit on a live run. The new invoke-wizard.test.ts regression tests (workingDir populated, cleanup is idempotent, replay cleanup is no-op) are exactly the right pins.\n- single-init-call.ts overlap subtraction fix in 3a287f7 prevents amplitude.init() + named init import from being double-counted as a hard fail. Regression test covers the overlap, multi-init, and no-init cases.\n- exit-code-matches-outcome criterion=0 fixes the report-label collision with the real criterion-19 setup-report scorer.\n- useDetection toggle is well documented (default true, false drops --integration).\n\nThe rr7-framework fixture itself is faithful to the integration skill (entry.client.tsx init, VITE_PUBLIC_AMPLITUDE_API_KEY, @amplitude/unified) and replays 50/50.\n\nNon-blocking: Bugbot low-severity on golden/working/.gitignore — run.ndjson logs .env.local under files.written but the file isn't in the working tree (gitignored). Fixture is internally inconsistent vs. captureFsSnapshot diff. The sibling rr7-data and react-vite PRs caught the same issue and added the entry — apply the same restoration here.



Summary
Adds a Ring 1 SDK-integration eval scenario for React Router 7 framework mode, mirroring the shape of
nextjs-app-router/vanilla.home/signup/burritoroutes for the agent to attach analytics to.app/entry.client.tsx, env prefixVITE_(specificallyVITE_PUBLIC_AMPLITUDE_API_KEY), SDK family@amplitude/unifiedfrom the framework→SDK table.vite.config.ts/react-router.config.ts/webpack.config.js/babel.config.js(build-config-bridging hard-fail surface for this framework).Stacked on #567 (
ba-104-runner-infra); will rebase onto main once #567 lands.Test plan
pnpm evals:run react-router-7/framework→totalScore: 50, maxScore: 50, contractOk: true, hardFailed: false(Layer 0+1 only on this base branch — Layer 2 lands on a sibling PR).pnpm exec vitest run evals/— all 27 existing tests pass; no regressions.pnpm lintclean (prettier + eslint).🤖 Generated with Claude Code
Note
Medium Risk
Modifies eval runner invocation/cleanup and live spawn behavior (working directory location, binary override, integration hint toggling), which could affect all live eval runs and scorer inputs. Adds a sizable new scenario fixture and adjusts scoring logic/tests, increasing surface area for regressions but remaining confined to the eval tooling.
Overview
Adds a new Ring 1 eval scenario for
react-router-7/framework, including apristine/RR7 starter and a fullgolden/replay artifact + post-runworking/tree.Upgrades the runner/orchestrator contract so
runLive/runReplayreturn{ artifact, workingDir, cleanup }, with live runs minting per-run tmp working dirs and cleanup deferred until after scoring (now enforced viatry/finallyinevals/bin/run-eval.ts).Extends live spawning to support
--wizard-bin/WIZARD_BINoverrides and addsScenario.useDetection(defaulttrue) to optionally omit--integrationfor detection testing; includes new Vitest coverage for spawn arg construction, working-dir behavior, and schema defaults.Fixes the Layer 0
single-init-callscorer to avoid double-countingamplitude.init(when a namedinitimport exists, and corrects the Layer 1exit-code-matches-outcomescorer’scriterionto0to avoid report label collisions.Reviewed by Cursor Bugbot for commit 8bde3c0. Bugbot is set up for automated code reviews on this repo. Configure here.