[codex] Complete issue #149 eval session UI#373
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
Greptile SummaryThis PR completes the user-facing surface for repeated statistical eval sessions (issue #149) by adding a create dialog, a list tab, and a detail page — all wired to the existing backend aggregation APIs. It also threads Key changes:
Confidence Score: 4/5Safe to merge after fixing the primary-metric label display bug in eval-session-list.tsx; all other issues are minor style improvements. Backend changes are mechanically straightforward and well-covered by tests. The frontend architecture is sound — SSR initial load + client polling, proper error handling, form validation. One P1 display bug exists: "pass@k"/"pass^k" are rendered as "Pass@K"/"Pass^K" due to misapplication of formatEvalSessionMetricName. This is visible to every user with completed eval sessions but doesn't break any data or flow. Remaining comments are P2 style/robustness suggestions. No security issues, no data-loss risk. eval-session-list.tsx (primary metric label rendering bug); create-eval-session-dialog.tsx (empty deployment state UX)
|
| Filename | Overview |
|---|---|
| web/src/app/(workspace)/workspaces/[workspaceId]/runs/eval-session-list.tsx | New eval session list component with polling; has a display bug where formatEvalSessionMetricName is incorrectly applied to already-formatted labels like "pass@k", rendering them as "Pass@K" |
| web/src/app/(workspace)/workspaces/[workspaceId]/runs/create-eval-session-dialog.tsx | New dialog for creating eval sessions with form validation, input-set loading, and participant selection; has a redundant client-side status filter and lacks an empty state when no deployments exist |
| web/src/app/(workspace)/workspaces/[workspaceId]/eval-sessions/[evalSessionId]/eval-session-detail-client.tsx | New client component for eval session detail with polling, aggregate scorecards, participant cards, and child-run table; uses warning text as React key which could cause collisions on duplicate warnings |
| backend/internal/api/agent_deployments.go | Exposes current_build_version_id in the list response; straightforward addition that mirrors the existing AgentDeploymentSummary struct fields |
| backend/db/queries/agent_deployments.sql | Added current_build_version_id to SELECT; DISTINCT ON ordering is correct for picking the latest snapshot per deployment |
| backend/internal/repository/sqlc/agent_deployments.sql.go | SQLC-generated file updated to scan current_build_version_id; treated as non-nullable uuid.UUID consistent with the DB schema |
| backend/internal/repository/repository.go | Added CurrentBuildVersionID to AgentDeploymentSummary and its mapping; clean implementation |
| web/src/lib/api/types.ts | Comprehensive TypeScript types for eval session flows added; AgentDeployment now includes current_build_version_id matching the updated backend response |
| web/src/lib/eval-sessions.ts | Helper library for eval session display logic; formatting functions are well-factored and correctly handle null/NaN edge cases |
| web/src/app/(workspace)/workspaces/[workspaceId]/runs/page.tsx | Adds tabbed layout for runs and eval sessions with parallel SSR data fetching; clean integration of new components |
| web/src/app/(workspace)/workspaces/[workspaceId]/runs/status-variant.ts | Added evalSessionStatusVariant record covering all six EvalSessionStatus values; exhaustive and safe |
| web/src/lib/tests/eval-sessions.test.ts | Unit tests for eval-session helpers cover title derivation, mode detection, value formatting, and dimension sorting; coverage is adequate |
| web/src/lib/api/tests/deployments.test.ts | Tests verify that the deployment list response includes current_build_version_id and that error paths surface correctly |
Sequence Diagram
sequenceDiagram
participant User
participant RunsPage as Runs Page (SSR)
participant CreateDialog as CreateEvalSessionDialog
participant EvalSessionList as EvalSessionList (polling)
participant DetailPage as EvalSessionDetailClient (polling)
participant API as Backend API
RunsPage->>API: GET /v1/workspaces/{id}/runs
RunsPage->>API: GET /v1/eval-sessions?workspace_id=...
API-->>RunsPage: runs + eval sessions (SSR initial data)
User->>CreateDialog: Open New Eval Session
CreateDialog->>API: GET /v1/workspaces/{id}/challenge-packs
CreateDialog->>API: GET /v1/workspaces/{id}/agent-deployments
API-->>CreateDialog: packs + active deployments (with current_build_version_id)
User->>CreateDialog: Select pack / version / deployments / config
CreateDialog->>API: POST /v1/eval-sessions
API-->>CreateDialog: 201 CreateEvalSessionResponse or 422 validation errors
CreateDialog->>DetailPage: navigate to /eval-sessions/{id}
DetailPage->>API: GET /v1/eval-sessions/{id} (SSR)
API-->>DetailPage: EvalSessionDetail (initial)
loop every 5s while status in queued running aggregating
DetailPage->>API: GET /v1/eval-sessions/{id}
API-->>DetailPage: updated EvalSessionDetail
end
loop every 5s while any session active
EvalSessionList->>API: GET /v1/eval-sessions?workspace_id=...
API-->>EvalSessionList: updated list
end
Prompt To Fix All With AI
This is a comment left during a code review.
Path: web/src/app/(workspace)/workspaces/[workspaceId]/runs/eval-session-list.tsx
Line: 151-155
Comment:
**`formatEvalSessionMetricName` applied to already-human-readable labels**
`formatPrimaryMetric` returns `"pass@k"` or `"pass^k"` — these are already display-ready. Passing them through `formatEvalSessionMetricName` (which capitalizes after word boundaries, i.e. after any non-word character) turns them into `"Pass@K"` and `"Pass^K"` because `@` and `^` are non-word characters, so `\b` matches before `k` and capitalizes it.
The function is designed for underscore-separated backend identifiers like `pass_at_k`, not for these labels.
```suggestion
<TableCell className="text-sm text-muted-foreground">
{formatPrimaryMetric(item)}
</TableCell>
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: web/src/app/(workspace)/workspaces/[workspaceId]/runs/create-eval-session-dialog.tsx
Line: 103-105
Comment:
**Redundant client-side filter — backend already guarantees `status = 'active'`**
The `ListActiveAgentDeploymentsByWorkspaceID` SQL query includes `AND ad.status = 'active'` in its `WHERE` clause. Filtering again on `deployment.status === "active"` in the client is harmless but misleading — it implies the API might return non-active deployments.
```suggestion
setDeployments(deploymentsResponse.items);
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: web/src/app/(workspace)/workspaces/[workspaceId]/runs/create-eval-session-dialog.tsx
Line: 452-492
Comment:
**No empty state when the workspace has no active deployments**
When `deployments.length === 0` after load, the participant section renders an empty `<div className="space-y-3">` with no message. This leaves users with a blank card and a permanently disabled submit button and no explanation. A short empty-state message (e.g. "No active deployments found. Deploy an agent to get started.") would prevent confusion.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: web/src/app/(workspace)/workspaces/[workspaceId]/eval-sessions/[evalSessionId]/eval-session-detail-client.tsx
Line: 271-280
Comment:
**Duplicate warning strings will cause React key collisions**
The evidence warnings are keyed by their text content. If the backend returns two evidence warnings with identical text, React will log a duplicate-key warning and may mishandle reconciliation. Since this list is never reordered, using the array position as the key is safer than using the warning message itself.
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "Add eval session UI for repeated runs" | Re-trigger Greptile
| <TableCell className="text-sm text-muted-foreground"> | ||
| {formatPrimaryMetric(item) === "—" | ||
| ? "—" | ||
| : formatEvalSessionMetricName(formatPrimaryMetric(item))} | ||
| </TableCell> |
There was a problem hiding this comment.
formatEvalSessionMetricName applied to already-human-readable labels
formatPrimaryMetric returns "pass@k" or "pass^k" — these are already display-ready. Passing them through formatEvalSessionMetricName (which capitalizes after word boundaries, i.e. after any non-word character) turns them into "Pass@K" and "Pass^K" because @ and ^ are non-word characters, so \b matches before k and capitalizes it.
The function is designed for underscore-separated backend identifiers like pass_at_k, not for these labels.
| <TableCell className="text-sm text-muted-foreground"> | |
| {formatPrimaryMetric(item) === "—" | |
| ? "—" | |
| : formatEvalSessionMetricName(formatPrimaryMetric(item))} | |
| </TableCell> | |
| <TableCell className="text-sm text-muted-foreground"> | |
| {formatPrimaryMetric(item)} | |
| </TableCell> |
Prompt To Fix With AI
This is a comment left during a code review.
Path: web/src/app/(workspace)/workspaces/[workspaceId]/runs/eval-session-list.tsx
Line: 151-155
Comment:
**`formatEvalSessionMetricName` applied to already-human-readable labels**
`formatPrimaryMetric` returns `"pass@k"` or `"pass^k"` — these are already display-ready. Passing them through `formatEvalSessionMetricName` (which capitalizes after word boundaries, i.e. after any non-word character) turns them into `"Pass@K"` and `"Pass^K"` because `@` and `^` are non-word characters, so `\b` matches before `k` and capitalizes it.
The function is designed for underscore-separated backend identifiers like `pass_at_k`, not for these labels.
```suggestion
<TableCell className="text-sm text-muted-foreground">
{formatPrimaryMetric(item)}
</TableCell>
```
How can I resolve this? If you propose a fix, please make it concise.| setDeployments( | ||
| deploymentsResponse.items.filter((deployment) => deployment.status === "active"), | ||
| ); |
There was a problem hiding this comment.
Redundant client-side filter — backend already guarantees
status = 'active'
The ListActiveAgentDeploymentsByWorkspaceID SQL query includes AND ad.status = 'active' in its WHERE clause. Filtering again on deployment.status === "active" in the client is harmless but misleading — it implies the API might return non-active deployments.
| setDeployments( | |
| deploymentsResponse.items.filter((deployment) => deployment.status === "active"), | |
| ); | |
| setDeployments(deploymentsResponse.items); |
Prompt To Fix With AI
This is a comment left during a code review.
Path: web/src/app/(workspace)/workspaces/[workspaceId]/runs/create-eval-session-dialog.tsx
Line: 103-105
Comment:
**Redundant client-side filter — backend already guarantees `status = 'active'`**
The `ListActiveAgentDeploymentsByWorkspaceID` SQL query includes `AND ad.status = 'active'` in its `WHERE` clause. Filtering again on `deployment.status === "active"` in the client is harmless but misleading — it implies the API might return non-active deployments.
```suggestion
setDeployments(deploymentsResponse.items);
```
How can I resolve this? If you propose a fix, please make it concise.|
|
||
| <div className="space-y-3"> | ||
| {deployments.map((deployment) => { | ||
| const checked = selectedDeploymentIds.includes(deployment.id); | ||
| return ( | ||
| <div | ||
| key={deployment.id} | ||
| className="rounded-lg border border-border bg-background/60 p-3" | ||
| > | ||
| <label className="flex items-center gap-3 text-sm font-medium"> | ||
| <input | ||
| type="checkbox" | ||
| checked={checked} | ||
| onChange={() => toggleDeployment(deployment.id)} | ||
| className="size-4 rounded border-border accent-primary" | ||
| /> | ||
| <span>{deployment.name}</span> | ||
| </label> | ||
| {checked ? ( | ||
| <div className="mt-3"> | ||
| <label className="mb-1.5 block text-xs font-medium uppercase tracking-[0.12em] text-muted-foreground"> | ||
| Participant label | ||
| </label> | ||
| <input | ||
| type="text" | ||
| value={participantLabels[deployment.id] ?? deployment.name} | ||
| onChange={(event) => | ||
| setParticipantLabels((current) => ({ | ||
| ...current, | ||
| [deployment.id]: event.target.value, | ||
| })) | ||
| } | ||
| className={inputClass} | ||
| /> | ||
| </div> | ||
| ) : null} | ||
| </div> | ||
| ); | ||
| })} | ||
| </div> | ||
| </div> |
There was a problem hiding this comment.
No empty state when the workspace has no active deployments
When deployments.length === 0 after load, the participant section renders an empty <div className="space-y-3"> with no message. This leaves users with a blank card and a permanently disabled submit button and no explanation. A short empty-state message (e.g. "No active deployments found. Deploy an agent to get started.") would prevent confusion.
Prompt To Fix With AI
This is a comment left during a code review.
Path: web/src/app/(workspace)/workspaces/[workspaceId]/runs/create-eval-session-dialog.tsx
Line: 452-492
Comment:
**No empty state when the workspace has no active deployments**
When `deployments.length === 0` after load, the participant section renders an empty `<div className="space-y-3">` with no message. This leaves users with a blank card and a permanently disabled submit button and no explanation. A short empty-state message (e.g. "No active deployments found. Deploy an agent to get started.") would prevent confusion.
How can I resolve this? If you propose a fix, please make it concise.| <div className="space-y-2"> | ||
| {detail.evidence_warnings.map((warning) => ( | ||
| <div | ||
| key={warning} | ||
| className="flex items-start gap-2 rounded-lg border border-amber-500/30 bg-amber-500/10 p-3 text-sm text-amber-100" | ||
| > | ||
| <AlertTriangle className="mt-0.5 size-4 shrink-0" /> | ||
| <span>{warning}</span> | ||
| </div> | ||
| ))} |
There was a problem hiding this comment.
Duplicate warning strings will cause React key collisions
The evidence warnings are keyed by their text content. If the backend returns two evidence warnings with identical text, React will log a duplicate-key warning and may mishandle reconciliation. Since this list is never reordered, using the array position as the key is safer than using the warning message itself.
Prompt To Fix With AI
This is a comment left during a code review.
Path: web/src/app/(workspace)/workspaces/[workspaceId]/eval-sessions/[evalSessionId]/eval-session-detail-client.tsx
Line: 271-280
Comment:
**Duplicate warning strings will cause React key collisions**
The evidence warnings are keyed by their text content. If the backend returns two evidence warnings with identical text, React will log a duplicate-key warning and may mishandle reconciliation. Since this list is never reordered, using the array position as the key is safer than using the warning message itself.
How can I resolve this? If you propose a fix, please make it concise.
Summary
Completes the remaining user-facing surface for repeated statistical evals in issue #149.
This PR adds a first-class eval-session experience in the web app, threads the missing deployment build-version identity needed by the new create flow, and aligns the API docs with the shipped payload.
Closes #149.
What Changed
New Eval Sessionflow from the runs area for repeated evalsEval Sessionstab in the runs area with recent-session list polling and aggregate summariescurrent_build_version_idin the agent-deployment list response so the UI can construct eval-session participants correctlyWhy
The backend workflow, aggregation, and read APIs for repeated eval sessions were already in place, but the product surface for issue #149 was still largely missing. Users could not create, discover, or inspect eval sessions from the web app, which left the issue incomplete from an end-user perspective.
User Impact
Users can now:
Validation
go test ./internal/api ./internal/repository ./internal/workflow -run 'AgentDeployment|EvalSession'pnpm vitest run 'src/lib/api/__tests__/deployments.test.ts' 'src/lib/__tests__/eval-sessions.test.ts' 'src/app/(workspace)/workspaces/[workspaceId]/runs/create-eval-session-dialog.test.tsx'pnpm exec tsc --noEmitpnpm exec eslint 'src/app/(workspace)/workspaces/[workspaceId]/runs/create-eval-session-dialog.tsx' 'src/app/(workspace)/workspaces/[workspaceId]/runs/eval-session-list.tsx' 'src/app/(workspace)/workspaces/[workspaceId]/runs/page.tsx' 'src/app/(workspace)/workspaces/[workspaceId]/eval-sessions/[evalSessionId]/page.tsx' 'src/app/(workspace)/workspaces/[workspaceId]/eval-sessions/[evalSessionId]/eval-session-detail-client.tsx' 'src/lib/eval-sessions.ts' 'src/lib/api/types.ts'Notes
Manual browser validation against a live workspace was not run in this session; validation relied on targeted tests, typecheck, lint, and the existing production build check run during development of the change.