Record arena comparisons as EvalOps trace annotations#48
Conversation
PR SummaryMedium Risk Overview This introduces Reviewed by Cursor Bugbot for commit 66cfba3. Bugbot is set up for automated code reviews on this repo. Configure here. |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Bugbot Autofix is ON, but it could not run because the branch was deleted or merged before autofix could start.
Reviewed by Cursor Bugbot for commit 66cfba3. Configure here.
| modelName: response.modelName | ||
| }) | ||
| }], | ||
| qualityPerDollar: qualityPerDollar(won ? 1 : 0, estimateModelCostUsd(response.model, 0, estimateTokens(response.content ?? ''))), |
There was a problem hiding this comment.
Vote cost estimate omits prompt input tokens
Medium Severity
In recordEvalOpsArenaVote, estimateModelCostUsd is called with 0 for input tokens, so qualityPerDollar only reflects output token cost. Meanwhile, buildArenaCompletionAnnotation correctly uses estimateTokens(input.prompt) for the same metric. The EvalOpsRecordArenaVoteRequest interface lacks a prompt field, even though the arenaStore has session.prompt available when building the vote request. This produces inflated and inconsistent qualityPerDollar values between completion and vote annotations for the same response.
Additional Locations (2)
Reviewed by Cursor Bugbot for commit 66cfba3. Configure here.
| name: 'arena.run', | ||
| kind: 'arena', | ||
| tokenInput: promptTokens, | ||
| tokenOutput: input.responses.reduce((total, response) => total + estimateTokens(response.content ?? ''), 0), |
There was a problem hiding this comment.
Error message text counted as LLM output tokens
Low Severity
When a model response fails with no actual output, the arenaStore replaces content with a human-readable error string like "Error from ModelName: ...". The new recordEvalOpsArenaTrace and buildArenaCompletionAnnotation functions compute tokenOutput, costUsd, and responseChars from response.content without checking response.error, so error spans get fabricated token counts and cost estimates derived from the error message text instead of zeros.
Additional Locations (2)
Reviewed by Cursor Bugbot for commit 66cfba3. Configure here.


Summary
Closes #18
Validation
Note: npm ci currently reaches the existing better-sqlite3 Electron 41 native rebuild mismatch during postinstall, but the production build succeeds once dependencies are present.