feat(evals): add verifier benchmark instrumentation#2138
Conversation
|
There was a problem hiding this comment.
3 issues found across 5 files
Confidence score: 3/5
- There is some merge risk because
scripts/cross-verify-parallel.shhas concrete shell-safety issues (null-delimiting forxargs, and quoting redirect targets) that can fail on valid paths containing spaces, quotes, or glob characters. - The most severe issue is the
xargsinput handling inscripts/cross-verify-parallel.sh(6/10, confidence 7/10): withoutprintf '%s\0'+xargs -0, path parsing can break and cause incorrect task execution. scripts/cross-verify.shalso has a quoting concern around$TRAJECTORY_GLOB; unquoted expansion can mis-handle special characters, so this is user-facing in environments with non-trivial file names.- Pay close attention to
scripts/cross-verify-parallel.shandscripts/cross-verify.sh- path quoting and argument delimiting need hardening to avoid shell parsing errors.
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="scripts/cross-verify-parallel.sh">
<violation number="1" location="scripts/cross-verify-parallel.sh:40">
P2: Unquoted redirect target will cause "ambiguous redirect" if `$task` contains spaces or glob characters. Quote the path to make the redirect robust.</violation>
<violation number="2" location="scripts/cross-verify-parallel.sh:58">
P2: Use null-delimited input (`printf '%s\0'` + `xargs -0`) to prevent `xargs` from misinterpreting directory paths that contain spaces, quotes, or backslashes.</violation>
</file>
<file name="scripts/cross-verify.sh">
<violation number="1" location="scripts/cross-verify.sh:22">
P2: Quote `$TRAJECTORY_GLOB` or use an array to safely expand the glob. Unquoted variable expansion combined with globbing can break when paths contain spaces or special characters.</violation>
</file>
Architecture diagram
sequenceDiagram
participant CLI as Script / CLI
participant Eval as Evaluator (runWithVerifier)
participant BT as Braintrust Span
participant Cache as RubricCache
participant Agent as Agent Executor
participant Recorder as TrajectoryRecorder
participant Verifier as Verifier (V3Evaluator)
Note over CLI,Verifier: Verifier Benchmark Flow
CLI->>Eval: runWithVerifier(taskSpec, dataset)
Eval->>BT: tracedSpan("verifier.rubric")
alt Precomputed rubric
BT->>BT: Use taskSpec.precomputedRubric
else Cache disabled (env)
BT->>Eval: evaluator.generateRubric(taskSpec)
else Cache enabled
BT->>Cache: cache.read(taskSpec)
alt Cache hit
Cache-->>BT: Return cached rubric
else Cache miss
BT->>Eval: evaluator.generateRubric(taskSpec)
Eval->>Cache: cache.write(taskSpec, rubric)
end
end
BT->>BT: span.log({output, metadata})
BT-->>Eval: Return resolved rubric
Eval->>Eval: Hydrate TaskSpec with rubric
Eval->>BT: tracedSpan("agent.execute")
BT->>Agent: agent.execute(instruction)
Agent-->>BT: AgentResult (including usage)
BT->>BT: span.log({output, metrics})
BT-->>Eval: Return AgentResult
Eval->>Recorder: recorder.finish(agentResult)
Recorder-->>Eval: Trajectory
Eval->>BT: tracedSpan("verifier.verify")
BT->>Verifier: evaluator.verify(trajectory, hydratedTaskSpec)
Verifier-->>BT: Verdict
BT->>BT: span.log({output, scores, metadata})
BT-->>Eval: Return Verdict
Eval->>Recorder: recorder.persistVerdict(verdict)
opt Cross-Verification (scripts/cross-verify-parallel.sh)
CLI->>CLI: Enumerate trajectory dirs
loop Each trajectory
par Approach A (VERIFIER_APPROACH=a)
CLI->>Eval: verify() with approach=a
Eval->>BT: tracedSpan("verifier.rubric") & tracedSpan("verifier.verify")
Note over Eval,Verifier: Same agent trajectory, different verifier logic
Eval-->>CLI: Score output to scores/mmrubric_cross-a.json
and Approach B (VERIFIER_APPROACH=b)
CLI->>Eval: verify() with approach=b
Eval->>BT: tracedSpan("verifier.rubric") & tracedSpan("verifier.verify")
Eval-->>CLI: Score output to scores/mmrubric_cross-b.json
end
end
end
Note over CLI,Verifier: Default backend remains "legacy" until benchmark shows verifier improvement
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review, or fix all with cubic.
Re-trigger cubic
| JOBS+=("$d|a") | ||
| done | ||
|
|
||
| printf '%s\n' "${JOBS[@]}" | xargs -I {} -n 1 -P "$PARALLEL" bash -c ' |
There was a problem hiding this comment.
P2: Use null-delimited input (printf '%s\0' + xargs -0) to prevent xargs from misinterpreting directory paths that contain spaces, quotes, or backslashes.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At scripts/cross-verify-parallel.sh, line 58:
<comment>Use null-delimited input (`printf '%s\0'` + `xargs -0`) to prevent `xargs` from misinterpreting directory paths that contain spaces, quotes, or backslashes.</comment>
<file context>
@@ -0,0 +1,63 @@
+ JOBS+=("$d|a")
+done
+
+printf '%s\n' "${JOBS[@]}" | xargs -I {} -n 1 -P "$PARALLEL" bash -c '
+ IFS="|" read -r dir approach <<< "$1"
+ run_one "$dir" "$approach"
</file context>
| DIRS=() | ||
| while IFS= read -r d; do | ||
| DIRS+=("$d") | ||
| done < <(find $TRAJECTORY_GLOB -mindepth 1 -maxdepth 1 -type d | sort) |
There was a problem hiding this comment.
P2: Quote $TRAJECTORY_GLOB or use an array to safely expand the glob. Unquoted variable expansion combined with globbing can break when paths contain spaces or special characters.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At scripts/cross-verify.sh, line 22:
<comment>Quote `$TRAJECTORY_GLOB` or use an array to safely expand the glob. Unquoted variable expansion combined with globbing can break when paths contain spaces or special characters.</comment>
<file context>
@@ -0,0 +1,44 @@
+DIRS=()
+while IFS= read -r d; do
+ DIRS+=("$d")
+done < <(find $TRAJECTORY_GLOB -mindepth 1 -maxdepth 1 -type d | sort)
+
+echo "Found ${#DIRS[@]} trajectory dirs"
</file context>
57873b1 to
eede85a
Compare
09ba439 to
6774e3d
Compare
3d202fa to
e4fbe53
Compare
1ae8368 to
573e4e0
Compare
e4fbe53 to
248b0ea
Compare
573e4e0 to
ed49853
Compare
248b0ea to
9326826
Compare
ed49853 to
ec15b9b
Compare
9326826 to
81ee924
Compare
Why
The verifier should not become the default until it is benchmarked against legacy and Approach A/B are compared on the same saved trajectories. This PR adds the instrumentation and scripts needed for that evidence gate.
What Changed
legacyuntil benchmark evidence supports a flip.Tests
pnpm -w exec prettier --check packages/core/lib/v3/index.ts packages/core/lib/v3/verifier packages/core/tests/unit/verifier-failure-step-parser.test.ts packages/core/tests/unit/verifier-trajectory.test.ts packages/evals/cli.ts packages/evals/framework packages/evals/tests/framework/rubricCache.test.ts packages/evals/tests/framework/verifierAdapter.test.ts packages/evals/tests/tui/commandTree.test.ts packages/evals/tui packages/evals/tasks/bench/agentpnpm --filter @browserbasehq/stagehand run typecheckpnpm --filter @browserbasehq/stagehand run build:esmpnpm --filter @browserbasehq/stagehand run test:core -- packages/core/dist/esm/tests/unit/verifier-failure-step-parser.test.js packages/core/dist/esm/tests/unit/verifier-trajectory.test.jspnpm --filter @browserbasehq/stagehand-evals run typecheckpnpm --dir packages/evals exec vitest run tests/framework/rubricCache.test.ts tests/framework/verifierAdapter.test.ts tests/tui/commandTree.test.ts tests/framework/claudeCodeRunner.test.ts tests/framework/codexRunner.test.ts tests/cli.test.ts