Skip to content

[codex] Complete issue #149 eval session UI#373

Merged
Atharva-Kanherkar merged 1 commit into
mainfrom
issue-149-eval-session-ui
Apr 20, 2026
Merged

[codex] Complete issue #149 eval session UI#373
Atharva-Kanherkar merged 1 commit into
mainfrom
issue-149-eval-session-ui

Conversation

@Atharva-Kanherkar
Copy link
Copy Markdown
Collaborator

Summary

Completes the remaining user-facing surface for repeated statistical evals in issue #149.

This PR adds a first-class eval-session experience in the web app, threads the missing deployment build-version identity needed by the new create flow, and aligns the API docs with the shipped payload.

Closes #149.

What Changed

  • added a New Eval Session flow from the runs area for repeated evals
  • added form support for repetitions, aggregation settings, optional success threshold, optional reliability weight override, routing/task properties, and participant selection from active deployments
  • added an Eval Sessions tab in the runs area with recent-session list polling and aggregate summaries
  • added an eval-session detail page with status, config snapshot, warnings, aggregate scorecards, pass@k, pass^k, metric routing guidance, comparison summaries, participant breakdowns, and child-run links
  • exposed current_build_version_id in the agent-deployment list response so the UI can construct eval-session participants correctly
  • updated frontend types, helper coverage, targeted tests, and OpenAPI docs

Why

The backend workflow, aggregation, and read APIs for repeated eval sessions were already in place, but the product surface for issue #149 was still largely missing. Users could not create, discover, or inspect eval sessions from the web app, which left the issue incomplete from an end-user perspective.

User Impact

Users can now:

  • configure repeated eval sessions from the workspace UI instead of hand-writing API requests
  • compare aggregated repeated-run results directly in the product
  • inspect pass metrics and routing guidance to understand when pass@k vs pass^k is emphasized
  • drill into the underlying child runs using the existing run detail flow

Validation

  • go test ./internal/api ./internal/repository ./internal/workflow -run 'AgentDeployment|EvalSession'
  • pnpm vitest run 'src/lib/api/__tests__/deployments.test.ts' 'src/lib/__tests__/eval-sessions.test.ts' 'src/app/(workspace)/workspaces/[workspaceId]/runs/create-eval-session-dialog.test.tsx'
  • pnpm exec tsc --noEmit
  • pnpm exec eslint 'src/app/(workspace)/workspaces/[workspaceId]/runs/create-eval-session-dialog.tsx' 'src/app/(workspace)/workspaces/[workspaceId]/runs/eval-session-list.tsx' 'src/app/(workspace)/workspaces/[workspaceId]/runs/page.tsx' 'src/app/(workspace)/workspaces/[workspaceId]/eval-sessions/[evalSessionId]/page.tsx' 'src/app/(workspace)/workspaces/[workspaceId]/eval-sessions/[evalSessionId]/eval-session-detail-client.tsx' 'src/lib/eval-sessions.ts' 'src/lib/api/types.ts'

Notes

Manual browser validation against a live workspace was not run in this session; validation relied on targeted tests, typecheck, lint, and the existing production build check run during development of the change.

@vercel
Copy link
Copy Markdown

vercel Bot commented Apr 20, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
agentclash Ready Ready Preview, Comment Apr 20, 2026 7:57pm

@Atharva-Kanherkar Atharva-Kanherkar marked this pull request as ready for review April 20, 2026 19:57
@Atharva-Kanherkar Atharva-Kanherkar self-assigned this Apr 20, 2026
@Atharva-Kanherkar Atharva-Kanherkar merged commit c0a62da into main Apr 20, 2026
6 checks passed
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 20, 2026

Greptile Summary

This PR completes the user-facing surface for repeated statistical eval sessions (issue #149) by adding a create dialog, a list tab, and a detail page — all wired to the existing backend aggregation APIs. It also threads current_build_version_id through the agent-deployment list endpoint (SQL → sqlc → repository → API handler → frontend types) so the create form can correctly identify eval-session participants.

Key changes:

  • Backend: ListActiveAgentDeploymentsByWorkspaceID query and repository layer updated to surface current_build_version_id; the field is now included in the list API JSON response.
  • Frontend types: AgentDeployment and the full suite of eval-session request/response types added to types.ts.
  • CreateEvalSessionDialog: Multi-step form covering pack/version/input-set selection, participant labeling, repetitions, aggregation config, optional success threshold, and metric routing hints. Handles 422 validation errors and navigates to the detail page on success.
  • EvalSessionList: Polling table in a new "Eval Sessions" tab showing status, run-count summaries, and primary metric.
  • EvalSessionDetailClient: Full detail page with config snapshot, evidence warnings, aggregate scorecards, metric routing guidance, comparison summary, participant breakdowns, and child-run links; polls every 5 s while active.
  • One P1 display bug: formatEvalSessionMetricName is applied to already-human-readable \"pass@k\" / \"pass^k\" labels in the list component, rendering them as \"Pass@K\" / \"Pass^K\" due to \\b matching after @ and ^.

Confidence Score: 4/5

Safe to merge after fixing the primary-metric label display bug in eval-session-list.tsx; all other issues are minor style improvements.

Backend changes are mechanically straightforward and well-covered by tests. The frontend architecture is sound — SSR initial load + client polling, proper error handling, form validation. One P1 display bug exists: "pass@k"/"pass^k" are rendered as "Pass@K"/"Pass^K" due to misapplication of formatEvalSessionMetricName. This is visible to every user with completed eval sessions but doesn't break any data or flow. Remaining comments are P2 style/robustness suggestions. No security issues, no data-loss risk.

eval-session-list.tsx (primary metric label rendering bug); create-eval-session-dialog.tsx (empty deployment state UX)

Important Files Changed

Filename Overview
web/src/app/(workspace)/workspaces/[workspaceId]/runs/eval-session-list.tsx New eval session list component with polling; has a display bug where formatEvalSessionMetricName is incorrectly applied to already-formatted labels like "pass@k", rendering them as "Pass@K"
web/src/app/(workspace)/workspaces/[workspaceId]/runs/create-eval-session-dialog.tsx New dialog for creating eval sessions with form validation, input-set loading, and participant selection; has a redundant client-side status filter and lacks an empty state when no deployments exist
web/src/app/(workspace)/workspaces/[workspaceId]/eval-sessions/[evalSessionId]/eval-session-detail-client.tsx New client component for eval session detail with polling, aggregate scorecards, participant cards, and child-run table; uses warning text as React key which could cause collisions on duplicate warnings
backend/internal/api/agent_deployments.go Exposes current_build_version_id in the list response; straightforward addition that mirrors the existing AgentDeploymentSummary struct fields
backend/db/queries/agent_deployments.sql Added current_build_version_id to SELECT; DISTINCT ON ordering is correct for picking the latest snapshot per deployment
backend/internal/repository/sqlc/agent_deployments.sql.go SQLC-generated file updated to scan current_build_version_id; treated as non-nullable uuid.UUID consistent with the DB schema
backend/internal/repository/repository.go Added CurrentBuildVersionID to AgentDeploymentSummary and its mapping; clean implementation
web/src/lib/api/types.ts Comprehensive TypeScript types for eval session flows added; AgentDeployment now includes current_build_version_id matching the updated backend response
web/src/lib/eval-sessions.ts Helper library for eval session display logic; formatting functions are well-factored and correctly handle null/NaN edge cases
web/src/app/(workspace)/workspaces/[workspaceId]/runs/page.tsx Adds tabbed layout for runs and eval sessions with parallel SSR data fetching; clean integration of new components
web/src/app/(workspace)/workspaces/[workspaceId]/runs/status-variant.ts Added evalSessionStatusVariant record covering all six EvalSessionStatus values; exhaustive and safe
web/src/lib/tests/eval-sessions.test.ts Unit tests for eval-session helpers cover title derivation, mode detection, value formatting, and dimension sorting; coverage is adequate
web/src/lib/api/tests/deployments.test.ts Tests verify that the deployment list response includes current_build_version_id and that error paths surface correctly

Sequence Diagram

sequenceDiagram
    participant User
    participant RunsPage as Runs Page (SSR)
    participant CreateDialog as CreateEvalSessionDialog
    participant EvalSessionList as EvalSessionList (polling)
    participant DetailPage as EvalSessionDetailClient (polling)
    participant API as Backend API

    RunsPage->>API: GET /v1/workspaces/{id}/runs
    RunsPage->>API: GET /v1/eval-sessions?workspace_id=...
    API-->>RunsPage: runs + eval sessions (SSR initial data)

    User->>CreateDialog: Open New Eval Session
    CreateDialog->>API: GET /v1/workspaces/{id}/challenge-packs
    CreateDialog->>API: GET /v1/workspaces/{id}/agent-deployments
    API-->>CreateDialog: packs + active deployments (with current_build_version_id)
    User->>CreateDialog: Select pack / version / deployments / config
    CreateDialog->>API: POST /v1/eval-sessions
    API-->>CreateDialog: 201 CreateEvalSessionResponse or 422 validation errors
    CreateDialog->>DetailPage: navigate to /eval-sessions/{id}

    DetailPage->>API: GET /v1/eval-sessions/{id} (SSR)
    API-->>DetailPage: EvalSessionDetail (initial)
    loop every 5s while status in queued running aggregating
        DetailPage->>API: GET /v1/eval-sessions/{id}
        API-->>DetailPage: updated EvalSessionDetail
    end

    loop every 5s while any session active
        EvalSessionList->>API: GET /v1/eval-sessions?workspace_id=...
        API-->>EvalSessionList: updated list
    end
Loading

Fix All in Codex

Prompt To Fix All With AI
This is a comment left during a code review.
Path: web/src/app/(workspace)/workspaces/[workspaceId]/runs/eval-session-list.tsx
Line: 151-155

Comment:
**`formatEvalSessionMetricName` applied to already-human-readable labels**

`formatPrimaryMetric` returns `"pass@k"` or `"pass^k"` — these are already display-ready. Passing them through `formatEvalSessionMetricName` (which capitalizes after word boundaries, i.e. after any non-word character) turns them into `"Pass@K"` and `"Pass^K"` because `@` and `^` are non-word characters, so `\b` matches before `k` and capitalizes it.

The function is designed for underscore-separated backend identifiers like `pass_at_k`, not for these labels.

```suggestion
                  <TableCell className="text-sm text-muted-foreground">
                    {formatPrimaryMetric(item)}
                  </TableCell>
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: web/src/app/(workspace)/workspaces/[workspaceId]/runs/create-eval-session-dialog.tsx
Line: 103-105

Comment:
**Redundant client-side filter — backend already guarantees `status = 'active'`**

The `ListActiveAgentDeploymentsByWorkspaceID` SQL query includes `AND ad.status = 'active'` in its `WHERE` clause. Filtering again on `deployment.status === "active"` in the client is harmless but misleading — it implies the API might return non-active deployments.

```suggestion
      setDeployments(deploymentsResponse.items);
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: web/src/app/(workspace)/workspaces/[workspaceId]/runs/create-eval-session-dialog.tsx
Line: 452-492

Comment:
**No empty state when the workspace has no active deployments**

When `deployments.length === 0` after load, the participant section renders an empty `<div className="space-y-3">` with no message. This leaves users with a blank card and a permanently disabled submit button and no explanation. A short empty-state message (e.g. "No active deployments found. Deploy an agent to get started.") would prevent confusion.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: web/src/app/(workspace)/workspaces/[workspaceId]/eval-sessions/[evalSessionId]/eval-session-detail-client.tsx
Line: 271-280

Comment:
**Duplicate warning strings will cause React key collisions**

The evidence warnings are keyed by their text content. If the backend returns two evidence warnings with identical text, React will log a duplicate-key warning and may mishandle reconciliation. Since this list is never reordered, using the array position as the key is safer than using the warning message itself.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "Add eval session UI for repeated runs" | Re-trigger Greptile

Comment on lines +151 to +155
<TableCell className="text-sm text-muted-foreground">
{formatPrimaryMetric(item) === "—"
? "—"
: formatEvalSessionMetricName(formatPrimaryMetric(item))}
</TableCell>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 formatEvalSessionMetricName applied to already-human-readable labels

formatPrimaryMetric returns "pass@k" or "pass^k" — these are already display-ready. Passing them through formatEvalSessionMetricName (which capitalizes after word boundaries, i.e. after any non-word character) turns them into "Pass@K" and "Pass^K" because @ and ^ are non-word characters, so \b matches before k and capitalizes it.

The function is designed for underscore-separated backend identifiers like pass_at_k, not for these labels.

Suggested change
<TableCell className="text-sm text-muted-foreground">
{formatPrimaryMetric(item) === "—"
? "—"
: formatEvalSessionMetricName(formatPrimaryMetric(item))}
</TableCell>
<TableCell className="text-sm text-muted-foreground">
{formatPrimaryMetric(item)}
</TableCell>
Prompt To Fix With AI
This is a comment left during a code review.
Path: web/src/app/(workspace)/workspaces/[workspaceId]/runs/eval-session-list.tsx
Line: 151-155

Comment:
**`formatEvalSessionMetricName` applied to already-human-readable labels**

`formatPrimaryMetric` returns `"pass@k"` or `"pass^k"` — these are already display-ready. Passing them through `formatEvalSessionMetricName` (which capitalizes after word boundaries, i.e. after any non-word character) turns them into `"Pass@K"` and `"Pass^K"` because `@` and `^` are non-word characters, so `\b` matches before `k` and capitalizes it.

The function is designed for underscore-separated backend identifiers like `pass_at_k`, not for these labels.

```suggestion
                  <TableCell className="text-sm text-muted-foreground">
                    {formatPrimaryMetric(item)}
                  </TableCell>
```

How can I resolve this? If you propose a fix, please make it concise.

Fix in Codex

Comment on lines +103 to +105
setDeployments(
deploymentsResponse.items.filter((deployment) => deployment.status === "active"),
);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Redundant client-side filter — backend already guarantees status = 'active'

The ListActiveAgentDeploymentsByWorkspaceID SQL query includes AND ad.status = 'active' in its WHERE clause. Filtering again on deployment.status === "active" in the client is harmless but misleading — it implies the API might return non-active deployments.

Suggested change
setDeployments(
deploymentsResponse.items.filter((deployment) => deployment.status === "active"),
);
setDeployments(deploymentsResponse.items);
Prompt To Fix With AI
This is a comment left during a code review.
Path: web/src/app/(workspace)/workspaces/[workspaceId]/runs/create-eval-session-dialog.tsx
Line: 103-105

Comment:
**Redundant client-side filter — backend already guarantees `status = 'active'`**

The `ListActiveAgentDeploymentsByWorkspaceID` SQL query includes `AND ad.status = 'active'` in its `WHERE` clause. Filtering again on `deployment.status === "active"` in the client is harmless but misleading — it implies the API might return non-active deployments.

```suggestion
      setDeployments(deploymentsResponse.items);
```

How can I resolve this? If you propose a fix, please make it concise.

Fix in Codex

Comment on lines +452 to +492

<div className="space-y-3">
{deployments.map((deployment) => {
const checked = selectedDeploymentIds.includes(deployment.id);
return (
<div
key={deployment.id}
className="rounded-lg border border-border bg-background/60 p-3"
>
<label className="flex items-center gap-3 text-sm font-medium">
<input
type="checkbox"
checked={checked}
onChange={() => toggleDeployment(deployment.id)}
className="size-4 rounded border-border accent-primary"
/>
<span>{deployment.name}</span>
</label>
{checked ? (
<div className="mt-3">
<label className="mb-1.5 block text-xs font-medium uppercase tracking-[0.12em] text-muted-foreground">
Participant label
</label>
<input
type="text"
value={participantLabels[deployment.id] ?? deployment.name}
onChange={(event) =>
setParticipantLabels((current) => ({
...current,
[deployment.id]: event.target.value,
}))
}
className={inputClass}
/>
</div>
) : null}
</div>
);
})}
</div>
</div>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 No empty state when the workspace has no active deployments

When deployments.length === 0 after load, the participant section renders an empty <div className="space-y-3"> with no message. This leaves users with a blank card and a permanently disabled submit button and no explanation. A short empty-state message (e.g. "No active deployments found. Deploy an agent to get started.") would prevent confusion.

Prompt To Fix With AI
This is a comment left during a code review.
Path: web/src/app/(workspace)/workspaces/[workspaceId]/runs/create-eval-session-dialog.tsx
Line: 452-492

Comment:
**No empty state when the workspace has no active deployments**

When `deployments.length === 0` after load, the participant section renders an empty `<div className="space-y-3">` with no message. This leaves users with a blank card and a permanently disabled submit button and no explanation. A short empty-state message (e.g. "No active deployments found. Deploy an agent to get started.") would prevent confusion.

How can I resolve this? If you propose a fix, please make it concise.

Fix in Codex

Comment on lines +271 to +280
<div className="space-y-2">
{detail.evidence_warnings.map((warning) => (
<div
key={warning}
className="flex items-start gap-2 rounded-lg border border-amber-500/30 bg-amber-500/10 p-3 text-sm text-amber-100"
>
<AlertTriangle className="mt-0.5 size-4 shrink-0" />
<span>{warning}</span>
</div>
))}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Duplicate warning strings will cause React key collisions

The evidence warnings are keyed by their text content. If the backend returns two evidence warnings with identical text, React will log a duplicate-key warning and may mishandle reconciliation. Since this list is never reordered, using the array position as the key is safer than using the warning message itself.

Prompt To Fix With AI
This is a comment left during a code review.
Path: web/src/app/(workspace)/workspaces/[workspaceId]/eval-sessions/[evalSessionId]/eval-session-detail-client.tsx
Line: 271-280

Comment:
**Duplicate warning strings will cause React key collisions**

The evidence warnings are keyed by their text content. If the backend returns two evidence warnings with identical text, React will log a duplicate-key warning and may mishandle reconciliation. Since this list is never reordered, using the array position as the key is safer than using the warning message itself.

How can I resolve this? If you propose a fix, please make it concise.

Fix in Codex

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Statistical run aggregation for non-deterministic agents

1 participant