feat(scorecard): link validators and metrics to originating run event #310
Addressed all five review findings in 7a8299f:
- High: wrapper step swallowing the highlight
- Medium: out-of-window fallback
- Medium: metric source exposed but never populated
- Medium:
- Low: `code_execution` `FieldPath`

Full backend suite + web vitest/tsc/eslint all green.
fix(web): redirect /workspaces/{id} in middleware to avoid React #310
Summary
- `source` pointer on `ValidatorDetail` and `MetricDetail` so the scorecard Inspector Sheet can deep-link into the exact replay step that produced the evaluated value.
- `ScorecardSource` union with a `kind` discriminator (`run_event` / `tool_call` / `final_output`), `sequence` (run event `sequence_number`), `event_type` (denormalized for UI labels), and `field_path`.
- `?step=<sequence>` URL param that highlights and auto-scrolls to the matching step.

Closes #302.
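As a rough illustration of the shape summarized above, the union and the deep link could look like the sketch below. Field optionality, the `SourceBase` helper type, the `replayHref` function, and the base path argument are assumptions for illustration, not the definitions that landed in `web/src/lib/api/types.ts`.

```typescript
// Hypothetical sketch: fields shared by every source kind (optionality is assumed).
interface SourceBase {
  sequence: number; // run event sequence_number
  event_type?: string; // denormalized for UI labels
  field_path?: string; // e.g. "final_output" or "file:generated_code"
}

// Union keyed on the `kind` discriminator named in the summary.
type ScorecardSource =
  | ({ kind: "run_event" } & SourceBase)
  | ({ kind: "tool_call" } & SourceBase)
  | ({ kind: "final_output" } & SourceBase);

// Hypothetical helper: builds the replay deep link, or null when no source is attached.
function replayHref(replayBasePath: string, source?: ScorecardSource): string | null {
  if (!source) return null;
  return `${replayBasePath}?step=${encodeURIComponent(String(source.sequence))}`;
}
```

Because `source` is optional, a caller would only render the link when `replayHref` returns a non-null value.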
Design decisions (with SOTA precedent)
- `kind` discriminator rather than Langfuse-style null-coalescing over `trace_id` / `observation_id` / `session_id`. Follows LangSmith's `feedback_level` enum and Phoenix's separate annotation types: self-describing and unambiguous.
- `sequence` (int64) as the addressable pointer, not `event_id` (UUID). The `run_events` table uses `(run_agent_id, sequence_number)` as the unique key; `event_id` on the envelope is a producer-side dedup string and isn't persisted as a column. Matches OpenAI Evals' `event_id: int` monotonic sequence pattern.
- `field_path` kept despite no competing precedent (Inspect AI, Langfuse, LangSmith, Phoenix all stop at event level). A scorecard that can say "this validator looked at `final_output`" versus "at `file:generated_code`" is worth the mild novelty, and there's no existing convention to conflict with.
- Dropped the `"step"` kind from the issue proposal. Our replay steps are pairs of start/completed events, so they're addressable via any contained event's sequence, and a separate kind would just duplicate `run_event`.

Backend
- `scoring.Source` + `SourceKind` enum in `backend/internal/scoring/source.go`.
- `scoring.Event` now carries `SequenceNumber`; both conversion sites (`repository/run_agent_evaluation.go` and `workflow/scoring.go`) thread it through.
- `extractedEvidence` gained `finalOutputSource`, `capturedFileSources`, `capturedDirListingSources`, `codeExecutionSources` maps, populated when the relevant events are consumed in `buildEvidence`.
- `resolveEvidenceSource` mirror of `resolveEvidenceValue` produces the `*Source` for each validator target (`final_output`, `file:*`, otherwise nil).
- `ValidatorResult.Source` / `MetricResult.Source` surface through `buildRunAgentScorecardDocument` alongside the existing `evidence` field.

OpenAPI
- Added the `ScorecardSource` schema and a `source` property on `ValidatorDetail`.
- Added the `MetricDetail` schema (previously undocumented) and wired it to `ScorecardDocument.metric_details`.

Frontend
- `ScorecardSource` TS union added to `web/src/lib/api/types.ts`; `ValidatorDetail` / `MetricDetail` now carry an optional `source`.
- `InspectorSheet` renders a "View in replay →" card for validators/metrics whose evidence points at a single event; the link uses `?step=<sequence>`.
- `ReplayTimeline` highlights and auto-scrolls to the matching step when `?step=` is present. Uses a sequence-range match with nearest-earlier fallback so links still land somewhere sensible for events emitted outside the agent's stepping loop (e.g. grader verification).
- `findHighlightIndex` helper extracted into `replay-highlight.ts` for deterministic unit testing.

Test plan
- `go test -short -race -count=1 ./...` (backend)
- `npx tsc --noEmit` (web)
- `npx vitest run src/components/replay/ src/app/(workspace)/workspaces/.../scorecard/`
- `npx eslint` on changed files
- `npx @redocly/cli lint docs/api-server/openapi.yaml`: new schemas validate (the 4 pre-existing errors are unrelated to this PR)
- Manual check: open a `final_output` validator and confirm "View in replay →" renders, navigates to `/.../replay?step=N`, and scrolls/highlights the matching step.
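
For reviewers who want a picture of the highlight lookup described under Frontend, a sequence-range match with nearest-earlier fallback might look roughly like the sketch below. The `ReplayStep` shape, its field names, and the `-1` return convention are assumptions made for illustration; this is not the code in `replay-highlight.ts`, and it does not try to handle nested wrapper steps.

```typescript
// Hypothetical step shape: each replay step spans an inclusive range of
// run event sequence numbers (its start event through its completed event).
interface ReplayStep {
  startSequence: number;
  endSequence: number;
}

// Returns the index of the step to highlight for `target`, or -1 if no step
// contains it and no step ends before it.
function findHighlightIndex(steps: ReplayStep[], target: number): number {
  let nearestEarlier = -1;
  let nearestEnd = -Infinity;
  for (let i = 0; i < steps.length; i++) {
    const step = steps[i];
    if (!step) continue;
    // Range match: the target sequence falls inside this step.
    if (target >= step.startSequence && target <= step.endSequence) return i;
    // Fallback candidate: the step that ended most recently before the target.
    if (step.endSequence < target && step.endSequence > nearestEnd) {
      nearestEarlier = i;
      nearestEnd = step.endSequence;
    }
  }
  return nearestEarlier;
}
```

Keeping the lookup a pure function over plain data is what makes it straightforward to unit test without rendering the timeline.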