Record arena comparisons as EvalOps trace annotations #48

Merged: haasonsaas merged 1 commit into main from codex/evalops-arena-trace-annotations on Apr 21, 2026

Conversation

@haasonsaas (Collaborator)

Summary

  • add TraceQualityAnnotation support to the EvalOps traces client and IPC surface
  • record one arena trace with a root span plus one child span per model response
  • emit completion and user-vote annotations so arena comparisons feed EvalOps trace quality data

Closes #18
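
The annotation payload the summary describes might look roughly like the sketch below. This is a minimal illustration only: the field names and the `buildVoteAnnotation` helper are assumptions inferred from the summary bullets, not the actual EvalOps SDK types.

```typescript
// Hypothetical shape of a trace quality annotation; all names are
// assumptions, not the real consumer SDK surface.
type AnnotationKind = 'completion' | 'user-vote';

interface TraceQualityAnnotation {
  traceId: string;
  spanId: string;
  kind: AnnotationKind;
  score: number;                      // e.g. 1 for the winning model, 0 otherwise
  metadata?: Record<string, string>;
}

// Build the user-vote annotation emitted after an arena comparison.
function buildVoteAnnotation(
  traceId: string,
  spanId: string,
  won: boolean,
): TraceQualityAnnotation {
  return {
    traceId,
    spanId,
    kind: 'user-vote',
    score: won ? 1 : 0,
    metadata: { source: 'arena' },
  };
}
```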

Validation

  • npm run build
  • git diff --check

Note: npm ci still hits the known better-sqlite3 native rebuild mismatch against Electron 41 during postinstall, but the production build succeeds once dependencies are present.

@cursor cursor Bot commented Apr 21, 2026

PR Summary

Medium Risk
Adds new cross-process IPC surfaces and automatic telemetry emission for arena sessions, including heuristic token/cost calculations that could affect data quality and volume if incorrect.

Overview
Arena runs are now exported to EvalOps as a single trace (root span + one child span per model response), including estimated token/cost/latency metadata and offline handling when unauthenticated.

This introduces TraceQualityAnnotation/AnnotateTraceQuality support end-to-end (consumer SDK types + TracesClient, main-process IPC handlers/services, and shared IPC types), and the renderer’s arenaStore now generates stable traceId/spanIds, captures per-response timings, and invokes new evalops:arena:recordTrace and evalops:arena:recordVote IPC calls to emit completion-heuristic and user-vote annotations.
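
The trace structure described above (one root span plus one child span per model response) could be assembled roughly as follows. This is a hedged sketch: `buildArenaTrace`, the span fields, and the 4-chars-per-token heuristic are all assumptions standing in for the repo's actual helpers and the "heuristic token calculations" the review mentions.

```typescript
// Hypothetical inputs; real arena responses carry more fields.
interface ArenaResponse {
  model: string;
  content: string;
  latencyMs: number;
}

interface Span {
  spanId: string;
  parentSpanId?: string;
  name: string;
  tokenOutput: number;
  latencyMs: number;
}

// Rough 4-characters-per-token heuristic (an assumption, not the repo's).
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// One root span ("arena.run") plus one child span per model response.
function buildArenaTrace(
  traceId: string,
  responses: ArenaResponse[],
): { traceId: string; spans: Span[] } {
  const root: Span = {
    spanId: `${traceId}-root`,
    name: 'arena.run',
    tokenOutput: responses.reduce((t, r) => t + estimateTokens(r.content), 0),
    latencyMs: Math.max(0, ...responses.map((r) => r.latencyMs)),
  };
  const children: Span[] = responses.map((r, i) => ({
    spanId: `${traceId}-span-${i}`,
    parentSpanId: root.spanId,
    name: `arena.response.${r.model}`,
    tokenOutput: estimateTokens(r.content),
    latencyMs: r.latencyMs,
  }));
  return { traceId, spans: [root, ...children] };
}
```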

Reviewed by Cursor Bugbot for commit 66cfba3. Bugbot is set up for automated code reviews on this repo.

@haasonsaas haasonsaas merged commit 4f3b019 into main Apr 21, 2026
5 checks passed
@haasonsaas haasonsaas deleted the codex/evalops-arena-trace-annotations branch April 21, 2026 10:26

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Bugbot Autofix is ON, but it could not run because the branch was deleted or merged before autofix could start.

modelName: response.modelName
})
}],
qualityPerDollar: qualityPerDollar(won ? 1 : 0, estimateModelCostUsd(response.model, 0, estimateTokens(response.content ?? ''))),

Vote cost estimate omits prompt input tokens

Medium Severity

In recordEvalOpsArenaVote, estimateModelCostUsd is called with 0 for input tokens, so qualityPerDollar only reflects output token cost. Meanwhile, buildArenaCompletionAnnotation correctly uses estimateTokens(input.prompt) for the same metric. The EvalOpsRecordArenaVoteRequest interface lacks a prompt field, even though the arenaStore has session.prompt available when building the vote request. This produces inflated and inconsistent qualityPerDollar values between completion and vote annotations for the same response.
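
A fix along the lines this finding suggests would thread the session prompt through the vote request so input tokens are no longer hardcoded to zero. The sketch below is an assumption-laden illustration: the names mirror the review text (`EvalOpsRecordArenaVoteRequest`, `estimateModelCostUsd`), but the signatures, the flat placeholder pricing, and the 4-chars-per-token heuristic are all invented for the example.

```typescript
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Toy flat per-token rate; a real implementation would use a model
// pricing table keyed by `model`.
function estimateModelCostUsd(
  model: string,
  inputTokens: number,
  outputTokens: number,
): number {
  const ratePerToken = 0.000002; // placeholder rate, not real pricing
  return (inputTokens + outputTokens) * ratePerToken;
}

interface EvalOpsRecordArenaVoteRequest {
  prompt: string; // the field the review says the interface lacks
  model: string;
  content: string;
  won: boolean;
}

// With the prompt available, input tokens are estimated instead of
// passed as 0, so vote and completion annotations agree.
function voteCostUsd(req: EvalOpsRecordArenaVoteRequest): number {
  return estimateModelCostUsd(
    req.model,
    estimateTokens(req.prompt),
    estimateTokens(req.content),
  );
}
```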

Additional Locations (2)


name: 'arena.run',
kind: 'arena',
tokenInput: promptTokens,
tokenOutput: input.responses.reduce((total, response) => total + estimateTokens(response.content ?? ''), 0),
The reason will be displayed to describe this comment to others. Learn more.

Error message text counted as LLM output tokens

Low Severity

When a model response fails with no actual output, the arenaStore replaces content with a human-readable error string like "Error from ModelName: ...". The new recordEvalOpsArenaTrace and buildArenaCompletionAnnotation functions compute tokenOutput, costUsd, and responseChars from response.content without checking response.error, so error spans get fabricated token counts and cost estimates derived from the error message text instead of zeros.
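
A guard for this finding could check the error flag before counting tokens, so error text contributes zero output. The sketch below is an assumption: the `ArenaResponse` field names are inferred from the review text, and the token heuristic is a placeholder.

```typescript
// Hypothetical response shape; `error` is set when the model call failed
// and `content` then holds a human-readable message like
// "Error from ModelName: ...".
interface ArenaResponse {
  model: string;
  content?: string;
  error?: string;
}

const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Errored responses produced no real LLM output, so report zero tokens
// instead of counting the error message text.
function outputTokensFor(response: ArenaResponse): number {
  if (response.error) return 0;
  return estimateTokens(response.content ?? '');
}
```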

Additional Locations (2)




Development

Successfully merging this pull request may close these issues.

traces: arena-mode model comparisons as TraceQualityAnnotations
