feat(scorecard): link validators and metrics to originating run event by Atharva-Kanherkar · Pull Request #310 · agentclash/agentclash

Atharva-Kanherkar · 2026-04-17T18:40:15Z

Summary

Adds an optional source pointer on ValidatorDetail and MetricDetail so the scorecard Inspector Sheet can deep-link into the exact replay step that produced the evaluated value.
New ScorecardSource union with kind discriminator (run_event / tool_call / final_output), sequence (run event sequence_number), event_type (denormalized for UI labels), and field_path.
Replay timeline gained a ?step=<sequence> URL param that highlights and auto-scrolls to the matching step.
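
A minimal sketch of how such a discriminated union might look on the TypeScript side. The field names kind, sequence, event_type, and field_path come from the summary above; which fields are optional, the ScorecardSourceFields helper, and the sourceLabel function are assumptions made for illustration, not the definitions shipped in web/src/lib/api/types.ts.

```typescript
// Hypothetical sketch; the real types may differ in optionality and layout.
interface ScorecardSourceFields {
  sequence: number; // run event sequence_number
  event_type: string; // denormalized so the UI can label the link
  field_path?: string; // assumed optional, e.g. "final_output" or "file:generated_code"
}

type ScorecardSource =
  | ({ kind: "run_event" } & ScorecardSourceFields)
  | ({ kind: "tool_call" } & ScorecardSourceFields)
  | ({ kind: "final_output" } & ScorecardSourceFields);

// Exhaustive switch on the discriminator. The `never` branch makes the
// compiler flag any kind that is added later without a matching case.
function sourceLabel(source: ScorecardSource): string {
  switch (source.kind) {
    case "run_event":
      return `Run event #${source.sequence} (${source.event_type})`;
    case "tool_call":
      return `Tool call #${source.sequence} (${source.event_type})`;
    case "final_output":
      return `Final output #${source.sequence}`;
    default: {
      const unreachable: never = source;
      return unreachable;
    }
  }
}
```

Because every variant carries an explicit kind, a consumer never has to infer the pointer type from which ID fields happen to be non-null.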

Closes #302.

Design decisions (with SOTA precedent)

Explicit kind discriminator rather than Langfuse-style null-coalescing over trace_id / observation_id / session_id. Follows LangSmith's feedback_level enum and Phoenix's separate annotation types — self-describing, unambiguous.
sequence (int64) as the addressable pointer, not event_id (UUID). The run_events table uses (run_agent_id, sequence_number) as the unique key; event_id on the envelope is a producer-side dedup string and isn't persisted as a column. Matches OpenAI Evals' event_id: int monotonic sequence pattern.
field_path kept despite no competing precedent (Inspect AI, Langfuse, LangSmith, Phoenix all stop at event level). A scorecard that can say "this validator looked at final_output" versus "at file:generated_code" is worth the mild novelty, and there's no existing convention to conflict with.
Dropped "step" kind from the issue proposal. Our replay steps are pairs of start/completed events — they're addressable via any contained event's sequence, so a separate kind would just duplicate run_event.
Source is nil by design for aggregate evidence (token totals, latency spans, challenge-input validators) rather than lying with a synthetic pointer. UI renders the link only when a single originating event exists.

Backend

scoring.Source + SourceKind enum in backend/internal/scoring/source.go.
scoring.Event now carries SequenceNumber; both conversion sites (repository/run_agent_evaluation.go and workflow/scoring.go) thread it through.
extractedEvidence gained finalOutputSource, capturedFileSources, capturedDirListingSources, codeExecutionSources maps, populated when the relevant events are consumed in buildEvidence.
New resolveEvidenceSource, mirroring resolveEvidenceValue, produces the *Source for each validator target: final_output and file:* resolve to a source, and any other target yields nil.
ValidatorResult.Source / MetricResult.Source surface through buildRunAgentScorecardDocument alongside the existing evidence field.

OpenAPI

New ScorecardSource schema and source property on ValidatorDetail.
Added MetricDetail schema (previously undocumented) and wired it to ScorecardDocument.metric_details.

Frontend

ScorecardSource TS union added to web/src/lib/api/types.ts; ValidatorDetail / MetricDetail now carry an optional source.
InspectorSheet renders a "View in replay →" card for validators/metrics whose evidence points at a single event; the link uses ?step=<sequence>.
ReplayTimeline highlights and auto-scrolls to the matching step when ?step= is present. Uses a sequence-range match with nearest-earlier fallback so links still land somewhere sensible for events emitted outside the agent's stepping loop (e.g. grader verification).
Pure findHighlightIndex helper extracted into replay-highlight.ts for deterministic unit testing.
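
The link contract between the inspector and the replay page can be sketched as two small pure helpers. Only the ?step=<sequence> parameter comes from the description above; buildReplayHref, parseStepParam, the SourcePointer type, and the replayPath argument are hypothetical names, and the real components may build and read the URL differently.

```typescript
// Hypothetical helpers; names and signatures are illustrative only.
interface SourcePointer {
  sequence: number;
}

// Returns null when there is no single originating event, so the caller
// can skip rendering the "View in replay" card entirely.
function buildReplayHref(replayPath: string, source: SourcePointer | undefined): string | null {
  if (!source) return null;
  const params = new URLSearchParams({ step: String(source.sequence) });
  return `${replayPath}?${params.toString()}`;
}

// Reads ?step= from a query string; anything that is not a plain
// non-negative integer is treated as "no highlight requested".
function parseStepParam(search: string): number | null {
  const raw = new URLSearchParams(search).get("step");
  if (raw === null || !/^\d+$/.test(raw)) return null;
  return Number(raw);
}
```

Keeping both helpers free of component state makes the round trip (sequence to URL and back) easy to unit test alongside findHighlightIndex.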

Test plan

go test -short -race -count=1 ./... (backend)
npx tsc --noEmit (web)
npx vitest run src/components/replay/ src/app/(workspace)/workspaces/.../scorecard/
npx eslint on changed files
npx @redocly/cli lint docs/api-server/openapi.yaml — new schemas validate (the 4 pre-existing errors are unrelated to this PR)
Manual: load a scorecard with a failing final_output validator and confirm "View in replay →" renders, navigates to /.../replay?step=N, and scrolls/highlights the matching step.
Manual: confirm an aggregate metric (e.g. token usage) renders no source link.

vercel · 2026-04-17T18:40:21Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
agentclash	Ready	Preview, Comment	Apr 17, 2026 7:20pm

Atharva-Kanherkar · 2026-04-17T19:19:08Z

Addressed all five review findings in 7a8299f:

High — wrapper step swallowing the highlight
findHighlightIndex now picks the narrowest containing range rather than the first match, so stacked wrappers (run / scoring / agent_step) no longer swallow deeper matches. Ties on span width break toward the later started_sequence (deeper in the stack). Regression test: prefers the innermost wrapper when nested steps overlap.

Medium — out-of-window fallback
The nearest-earlier fallback is now gated to the currently-loaded sequence window. When the target exceeds max(completed_sequence) across loaded steps we return -1 instead of highlighting an unrelated early card. Regression test: returns -1 when the target is beyond the loaded window.
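
The two fixes above (narrowest containing range with a tie-break toward the later started_sequence, and a fallback gated to the loaded window) can be sketched as follows. This is not the code in replay-highlight.ts: the ReplayStepRange type is hypothetical, and "nearest earlier" is assumed here to mean the loaded step with the greatest completed_sequence below the target.

```typescript
// Hypothetical step shape; only the two sequence fields are named in the PR.
interface ReplayStepRange {
  started_sequence: number;
  completed_sequence: number;
}

function findHighlightIndex(steps: ReplayStepRange[], target: number): number {
  // Pass 1: narrowest range containing the target. Equal widths break
  // toward the later started_sequence (deeper in the stack).
  let best = -1;
  let bestWidth = Infinity;
  let bestStart = -Infinity;
  steps.forEach((step, i) => {
    if (target < step.started_sequence || target > step.completed_sequence) return;
    const width = step.completed_sequence - step.started_sequence;
    const deeperTie = width === bestWidth && step.started_sequence > bestStart;
    if (width < bestWidth || deeperTie) {
      best = i;
      bestWidth = width;
      bestStart = step.started_sequence;
    }
  });
  if (best >= 0) return best;

  // Pass 2: the fallback only applies inside the loaded sequence window.
  let maxCompleted = -Infinity;
  for (const step of steps) {
    maxCompleted = Math.max(maxCompleted, step.completed_sequence);
  }
  if (target > maxCompleted) return -1;

  // Nearest earlier step: greatest completed_sequence below the target.
  let fallback = -1;
  let nearest = -Infinity;
  steps.forEach((step, i) => {
    if (step.completed_sequence < target && step.completed_sequence > nearest) {
      nearest = step.completed_sequence;
      fallback = i;
    }
  });
  return fallback;
}
```

With nested wrappers such as 1..100, 10..50, and 20..30, a target of 25 selects the 20..30 step, while a target of 200 yields -1 because it lies beyond the loaded window.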

Medium — metric source exposed but never populated
Dropped Source from MetricResult, metric_details[].source, the OpenAPI MetricDetail schema, the TS MetricDetail type, and the inspector's metric-side replay link. No collector sets it today, and shipping an always-absent field was misleading. Will re-introduce when a metric collector has a genuine single originating event.

Medium — system.run.completed overwriting the finalized source
buildEvidence now only records finalOutputSource from system.run.completed when it is still nil, so the dedicated system.output.finalized event wins (as it should — it's the narrower producer). Regression test: TestEvaluateRunAgent_PrefersFinalizedEventOverRunCompletedWrapper.

Low — code_execution FieldPath
Now uses validator.Target (e.g. file:generated_code) instead of "file:" + validator.Key. Regression test: TestEvaluateRunAgent_CodeExecutionSourceUsesValidatorTargetAsFieldPath.

Full backend suite + web vitest/tsc/eslint all green.

feat(scorecard): link validators and metrics to originating run event

47cbc52

vercel Bot deployed to Preview April 17, 2026 18:40 View deployment

Atharva-Kanherkar self-assigned this Apr 17, 2026

fix(scorecard): harden validator source pointer against review findings

7a8299f

vercel Bot deployed to Preview April 17, 2026 19:20 View deployment

Atharva-Kanherkar merged commit 293af61 into main Apr 17, 2026
6 checks passed

Atharva-Kanherkar mentioned this pull request Apr 17, 2026

Retry failed runs and preview re-scores without mutating canonical results #314

Open

greptile-apps Bot mentioned this pull request Apr 26, 2026

fix(web): redirect /workspaces/{id} in middleware to avoid React #310 #415

Merged

12 tasks

AyushRajSinghParihar added a commit that referenced this pull request Apr 26, 2026

Merge pull request #415 from agentclash/worktree-peaceful-herding-robin

e1de6fc

fix(web): redirect /workspaces/{id} in middleware to avoid React #310

Atharva-Kanherkar merged 2 commits into main from feat/issue-302-validator-source-pointer
