Record arena comparisons as EvalOps trace annotations by haasonsaas · Pull Request #48 · evalops/kestrel

haasonsaas · 2026-04-21T10:22:56Z

Summary

add TraceQualityAnnotation support to the EvalOps traces client and IPC surface
record one arena trace with a root span plus one child span per model response
emit completion and user-vote annotations so arena comparisons feed EvalOps trace quality data

Closes #18

Validation

npm run build
git diff --check

Note: npm ci currently hits the existing better-sqlite3 native rebuild mismatch against Electron 41 during postinstall, but the production build succeeds once dependencies are present.

cursor · 2026-04-21T10:23:01Z

PR Summary

Medium Risk
Adds new cross-process IPC surfaces and automatic telemetry emission for arena sessions, including heuristic token/cost calculations that could affect data quality and volume if incorrect.

Overview
Arena runs are now exported to EvalOps as a single trace (root span + one child span per model response), including estimated token/cost/latency metadata and offline handling when unauthenticated.

This introduces TraceQualityAnnotation/AnnotateTraceQuality support end-to-end: consumer SDK types and TracesClient, main-process IPC handlers and services, and shared IPC types. The renderer’s arenaStore now generates stable traceId/spanIds, captures per-response timings, and invokes the new evalops:arena:recordTrace and evalops:arena:recordVote IPC calls to emit completion-heuristic and user-vote annotations.
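
The trace layout described in the overview (one root span plus one child span per model response) can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the ArenaResponseSketch and SpanSketch types, the buildArenaSpans helper, the 'arena.response' span name, and the ID scheme are all hypothetical. Only the 'arena.run' root span name appears in the diff.

```typescript
// Hypothetical shapes for illustration; the real EvalOps SDK types may differ.
interface ArenaResponseSketch {
  modelName: string;
  content?: string;
  latencyMs: number;
}

interface SpanSketch {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
  name: string;
  attributes: Record<string, string | number>;
}

// Builds one root span for the arena run plus one child span per response.
// Span IDs are derived from the trace ID so repeated calls yield the same IDs.
function buildArenaSpans(
  traceId: string,
  prompt: string,
  responses: ArenaResponseSketch[],
): SpanSketch[] {
  const rootSpanId = `${traceId}-root`;
  const root: SpanSketch = {
    traceId,
    spanId: rootSpanId,
    name: 'arena.run',
    attributes: { promptChars: prompt.length, responseCount: responses.length },
  };
  const children = responses.map(
    (response, index): SpanSketch => ({
      traceId,
      spanId: `${traceId}-response-${index}`,
      parentSpanId: rootSpanId, // every response span hangs off the root
      name: 'arena.response', // assumed name, not taken from the PR
      attributes: { modelName: response.modelName, latencyMs: response.latencyMs },
    }),
  );
  return [root, ...children];
}
```

Deriving span IDs from the trace ID is one way to keep them stable, so a later vote annotation can point at the same response span without a lookup.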

Reviewed by Cursor Bugbot for commit 66cfba3. Bugbot is set up for automated code reviews on this repo.

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Bugbot Autofix is ON, but it could not run because the branch was deleted or merged before autofix could start.


cursor · 2026-04-21T10:30:25Z

+          modelName: response.modelName
+        })
+      }],
+      qualityPerDollar: qualityPerDollar(won ? 1 : 0, estimateModelCostUsd(response.model, 0, estimateTokens(response.content ?? ''))),


Vote cost estimate omits prompt input tokens

Medium Severity

In recordEvalOpsArenaVote, estimateModelCostUsd is called with 0 for input tokens, so qualityPerDollar reflects only output-token cost. buildArenaCompletionAnnotation, by contrast, uses estimateTokens(input.prompt) for the same metric. The EvalOpsRecordArenaVoteRequest interface lacks a prompt field, even though arenaStore has session.prompt available when building the vote request. Because the vote path underestimates cost, its qualityPerDollar values are inflated and inconsistent with the completion annotation for the same response.

Additional Locations (2)

src/shared/ipc.ts#L522-L528

src/renderer/main/src/stores/arenaStore.ts#L199-L218
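
One possible shape for the fix is to pass the prompt through the vote request so both annotation paths price input tokens the same way. The sketch below is an assumption-laden illustration: the bodies of estimateTokens, estimateModelCostUsd, and qualityPerDollar are not shown in this review, so the versions here are stand-ins with made-up prices, and VoteCostInput and voteQualityPerDollar are hypothetical names.

```typescript
// Stand-in: roughly four characters per token (an assumption, not the PR's heuristic).
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Stand-in: flat, made-up per-token prices; the model argument is ignored here.
const estimateModelCostUsd = (
  _model: string,
  inputTokens: number,
  outputTokens: number,
): number => (inputTokens * 1 + outputTokens * 3) / 1e6;

// Stand-in: quality divided by cost, with 0 when cost is not positive.
const qualityPerDollar = (quality: number, costUsd: number): number =>
  costUsd > 0 ? quality / costUsd : 0;

// Hypothetical request slice: the vote path also receives the prompt.
interface VoteCostInput {
  model: string;
  prompt: string;
  content?: string;
  won: boolean;
}

// Prices prompt tokens as input instead of hard-coding 0.
function voteQualityPerDollar(input: VoteCostInput): number {
  const costUsd = estimateModelCostUsd(
    input.model,
    estimateTokens(input.prompt),
    estimateTokens(input.content ?? ''),
  );
  return qualityPerDollar(input.won ? 1 : 0, costUsd);
}
```

With any positive input price, including prompt tokens raises the estimated cost, so the vote's qualityPerDollar drops relative to the output-only calculation and lines up with the completion annotation.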


cursor · 2026-04-21T10:30:25Z

+    name: 'arena.run',
+    kind: 'arena',
+    tokenInput: promptTokens,
+    tokenOutput: input.responses.reduce((total, response) => total + estimateTokens(response.content ?? ''), 0),


Error message text counted as LLM output tokens

Low Severity

When a model response fails with no actual output, the arenaStore replaces content with a human-readable error string like "Error from ModelName: ...". The new recordEvalOpsArenaTrace and buildArenaCompletionAnnotation functions compute tokenOutput, costUsd, and responseChars from response.content without checking response.error, so error spans get fabricated token counts and cost estimates derived from the error message text instead of zeros.

Additional Locations (2)

src/main/evalops/services.ts#L259-L261

src/main/evalops/services.ts#L407-L410
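
A guard on response.error before estimating tokens would avoid metering the error string. The following is a minimal sketch under assumptions: the response shape (an optional string error field), the estimateTokens stand-in, and the billableContent and totalOutputTokens helpers are hypothetical and not taken from the PR.

```typescript
// Hypothetical response shape; the review only tells us content and error exist.
interface ArenaResponseSketch {
  content?: string;
  error?: string;
}

// Stand-in: roughly four characters per token (an assumption).
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Returns the text that should be metered: nothing for a failed response,
// even if content was overwritten with a human-readable error message.
function billableContent(response: ArenaResponseSketch): string {
  return response.error ? '' : (response.content ?? '');
}

// Sums estimated output tokens across responses, skipping failed ones.
function totalOutputTokens(responses: ArenaResponseSketch[]): number {
  return responses.reduce(
    (total, response) => total + estimateTokens(billableContent(response)),
    0,
  );
}
```

The same billableContent value could feed the cost and character-count fields, so failed spans report zeros consistently.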


Record Kestrel arena traces in EvalOps

66cfba3

haasonsaas merged commit 4f3b019 into main Apr 21, 2026
5 checks passed

haasonsaas deleted the codex/evalops-arena-trace-annotations branch April 21, 2026 10:26

haasonsaas mentioned this pull request Apr 21, 2026

[umbrella] EvalOps platform integration — roadmap for becoming a proper platform consumer #20

Closed

9 tasks

haasonsaas merged 1 commit into main from codex/evalops-arena-trace-annotations
