feat(scorecard): link validators and metrics to originating run event #310

Merged: Atharva-Kanherkar merged 2 commits into main from feat/issue-302-validator-source-pointer on Apr 17, 2026
@Atharva-Kanherkar (Collaborator)

Summary

  • Adds an optional source pointer on ValidatorDetail and MetricDetail so the scorecard Inspector Sheet can deep-link into the exact replay step that produced the evaluated value.
  • New ScorecardSource union with kind discriminator (run_event / tool_call / final_output), sequence (run event sequence_number), event_type (denormalized for UI labels), and field_path.
  • Replay timeline gained a ?step=<sequence> URL param that highlights and auto-scrolls to the matching step.

Closes #302.

Design decisions (with SOTA precedent)

  • Explicit kind discriminator rather than Langfuse-style null-coalescing over trace_id / observation_id / session_id. Follows LangSmith's feedback_level enum and Phoenix's separate annotation types — self-describing, unambiguous.
  • sequence (int64) as the addressable pointer, not event_id (UUID). The run_events table uses (run_agent_id, sequence_number) as the unique key; event_id on the envelope is a producer-side dedup string and isn't persisted as a column. Matches OpenAI Evals' event_id: int monotonic sequence pattern.
  • field_path kept despite no competing precedent (Inspect AI, Langfuse, LangSmith, Phoenix all stop at event level). A scorecard that can say "this validator looked at final_output" versus "at file:generated_code" is worth the mild novelty, and there's no existing convention to conflict with.
  • Dropped "step" kind from the issue proposal. Our replay steps are pairs of start/completed events — they're addressable via any contained event's sequence, so a separate kind would just duplicate run_event.
  • Source is nil by design for aggregate evidence (token totals, latency spans, challenge-input validators) rather than lying with a synthetic pointer. UI renders the link only when a single originating event exists.
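To make the discriminated shape and the nil-source rule concrete, here is a minimal TypeScript sketch. The field names follow the description above. The optionality of field_path, the reduced ValidatorDetail shape, and the replayHref helper are assumptions for illustration, not the code in this PR.

```typescript
// Hypothetical sketch: names mirror the PR description, details are assumed.
type ScorecardSourceKind = "run_event" | "tool_call" | "final_output";

interface ScorecardSource {
  kind: ScorecardSourceKind; // explicit discriminator
  sequence: number;          // run event sequence_number (int64 on the backend;
                             // a JS number is exact only up to Number.MAX_SAFE_INTEGER)
  event_type: string;        // denormalized for UI labels
  field_path?: string;       // e.g. "final_output" or "file:generated_code" (optionality assumed)
}

// Reduced, hypothetical view of ValidatorDetail: only what the link needs.
interface ValidatorDetail {
  key: string;
  source?: ScorecardSource | null; // absent or null for aggregate evidence
}

// Returns a replay deep link, or null when there is no single originating event.
function replayHref(replayPath: string, detail: ValidatorDetail): string | null {
  const source = detail.source;
  if (!source) return null;
  return `${replayPath}?step=${encodeURIComponent(String(source.sequence))}`;
}
```

A caller would render the link only when replayHref returns a string, which keeps aggregate evidence (token totals, latency spans) free of synthetic pointers.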

Backend

  • scoring.Source + SourceKind enum in backend/internal/scoring/source.go.
  • scoring.Event now carries SequenceNumber; both conversion sites (repository/run_agent_evaluation.go and workflow/scoring.go) thread it through.
  • extractedEvidence gained finalOutputSource, capturedFileSources, capturedDirListingSources, codeExecutionSources maps, populated when the relevant events are consumed in buildEvidence.
  • New resolveEvidenceSource, a mirror of resolveEvidenceValue, produces the *Source for each validator target (final_output and file:* resolve to a source; every other target yields nil).
  • ValidatorResult.Source / MetricResult.Source surface through buildRunAgentScorecardDocument alongside the existing evidence field.

OpenAPI

  • New ScorecardSource schema and source property on ValidatorDetail.
  • Added MetricDetail schema (previously undocumented) and wired it to ScorecardDocument.metric_details.

Frontend

  • ScorecardSource TS union added to web/src/lib/api/types.ts; ValidatorDetail / MetricDetail now carry an optional source.
  • InspectorSheet renders a "View in replay →" card for validators/metrics whose evidence points at a single event; the link uses ?step=<sequence>.
  • ReplayTimeline highlights and auto-scrolls to the matching step when ?step= is present. Uses a sequence-range match with nearest-earlier fallback so links still land somewhere sensible for events emitted outside the agent's stepping loop (e.g. grader verification).
  • Pure findHighlightIndex helper extracted into replay-highlight.ts for deterministic unit testing.
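On the receiving side, the timeline has to turn ?step=<sequence> into a number before it can match anything. The sketch below shows one defensive way to do that. parseStepParam is a hypothetical helper name, and how ReplayTimeline actually reads the parameter is not shown in this PR description.

```typescript
// Hypothetical helper: parse the ?step=<sequence> query parameter.
// Returns null for a missing, non-numeric, negative, or unsafe value so the
// caller can simply skip highlighting.
function parseStepParam(search: string): number | null {
  const raw = new URLSearchParams(search).get("step");
  if (raw === null || !/^\d+$/.test(raw)) return null;
  const value = Number(raw);
  return Number.isSafeInteger(value) ? value : null;
}
```

Rejecting malformed values up front means a hand-edited URL degrades to "no highlight" rather than scrolling to an arbitrary step.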

Test plan

  • go test -short -race -count=1 ./... (backend)
  • npx tsc --noEmit (web)
  • npx vitest run src/components/replay/ src/app/(workspace)/workspaces/.../scorecard/
  • npx eslint on changed files
  • npx @redocly/cli lint docs/api-server/openapi.yaml — new schemas validate (the 4 pre-existing errors are unrelated to this PR)
  • Manual: load a scorecard with a failing final_output validator and confirm "View in replay →" renders, navigates to /.../replay?step=N, and scrolls/highlights the matching step.
  • Manual: confirm an aggregate metric (e.g. token usage) renders no source link.

@vercel (bot) commented Apr 17, 2026

Deployment status: project agentclash, Ready, updated Apr 17, 2026 7:20pm (UTC).

@Atharva-Kanherkar (Collaborator, Author)

Addressed all five review findings in 7a8299f:

High — wrapper step swallowing the highlight
findHighlightIndex now picks the narrowest containing range rather than the first match, so stacked wrappers (run / scoring / agent_step) no longer swallow deeper matches. Ties on span width break toward the later started_sequence (deeper in the stack). Regression test: prefers the innermost wrapper when nested steps overlap.

Medium — out-of-window fallback
The nearest-earlier fallback is now gated to the currently-loaded sequence window. When the target exceeds max(completed_sequence) across loaded steps we return -1 instead of highlighting an unrelated early card. Regression test: returns -1 when the target is beyond the loaded window.
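The two highlight fixes above can be sketched together as follows. This is an illustration under stated assumptions, not the contents of replay-highlight.ts: the StepRange shape, the requirement that every loaded step has both sequences, and the reading of "nearest-earlier" as "largest completed_sequence below the target" are all assumed.

```typescript
// Hypothetical step shape: each loaded step covers an inclusive sequence range.
interface StepRange {
  started_sequence: number;
  completed_sequence: number;
}

function findHighlightIndex(steps: StepRange[], target: number): number {
  let best = -1;
  let bestWidth = Infinity;
  let bestStart = -Infinity;
  let maxCompleted = -Infinity;

  // Pass 1: narrowest containing range; ties on width go to the later start.
  steps.forEach((step, i) => {
    maxCompleted = Math.max(maxCompleted, step.completed_sequence);
    if (target < step.started_sequence || target > step.completed_sequence) return;
    const width = step.completed_sequence - step.started_sequence;
    if (width < bestWidth || (width === bestWidth && step.started_sequence > bestStart)) {
      best = i;
      bestWidth = width;
      bestStart = step.started_sequence;
    }
  });
  if (best !== -1) return best;

  // Window gate: a target beyond everything loaded gets no highlight.
  // (With no steps, maxCompleted stays -Infinity, so this also returns -1.)
  if (target > maxCompleted) return -1;

  // Pass 2: nearest-earlier fallback within the loaded window.
  let fallback = -1;
  let fallbackCompleted = -Infinity;
  steps.forEach((step, i) => {
    if (step.completed_sequence < target && step.completed_sequence > fallbackCompleted) {
      fallback = i;
      fallbackCompleted = step.completed_sequence;
    }
  });
  return fallback;
}
```

Choosing the narrowest range instead of the first match is what stops an outer run or scoring wrapper, which contains almost every sequence, from capturing the highlight.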

Medium — metric source exposed but never populated
Dropped Source from MetricResult, metric_details[].source, the OpenAPI MetricDetail schema, the TS MetricDetail type, and the inspector's metric-side replay link. No collector sets it today, and shipping an always-absent field was misleading. Will re-introduce when a metric collector has a genuine single originating event.

Medium — system.run.completed overwriting the finalized source
buildEvidence now only records finalOutputSource from system.run.completed when it is still nil, so the dedicated system.output.finalized event wins (as it should — it's the narrower producer). Regression test: TestEvaluateRunAgent_PrefersFinalizedEventOverRunCompletedWrapper.
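The precedence rule is easy to state in code. The backend is Go; the TypeScript below only illustrates the rule and is not a port of buildEvidence. The RunEvent and Source shapes, the toSource and pickFinalOutputSource helpers, and the choice of "final_output" as the kind are assumptions.

```typescript
// Hypothetical shapes for illustration only.
interface RunEvent {
  sequence_number: number;
  event_type: string;
}

interface Source {
  kind: "final_output";
  sequence: number;
  event_type: string;
  field_path: string;
}

function toSource(event: RunEvent): Source {
  return {
    kind: "final_output",
    sequence: event.sequence_number,
    event_type: event.event_type,
    field_path: "final_output",
  };
}

// system.output.finalized always wins; system.run.completed only fills a gap.
function pickFinalOutputSource(events: RunEvent[]): Source | null {
  let finalOutputSource: Source | null = null;
  for (const event of events) {
    if (event.event_type === "system.output.finalized") {
      finalOutputSource = toSource(event);
    } else if (event.event_type === "system.run.completed" && finalOutputSource === null) {
      finalOutputSource = toSource(event);
    }
  }
  return finalOutputSource;
}
```

With this rule the result does not depend on which of the two events is consumed first, while runs that only emit system.run.completed still get a usable pointer.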

Low — code_execution FieldPath
Now uses validator.Target (e.g. file:generated_code) instead of "file:" + validator.Key. Regression test: TestEvaluateRunAgent_CodeExecutionSourceUsesValidatorTargetAsFieldPath.

Full backend suite + web vitest/tsc/eslint all green.

@Atharva-Kanherkar merged commit 293af61 into main on Apr 17, 2026
6 checks passed
AyushRajSinghParihar added a commit that referenced this pull request Apr 26, 2026
fix(web): redirect /workspaces/{id} in middleware to avoid React #310
Linked issue (may be closed by merging this pull request): Link validators and metrics to their replay step (evidence source)
