[codex] Complete issue #149 eval session UI by Atharva-Kanherkar · Pull Request #373 · agentclash/agentclash

Atharva-Kanherkar · 2026-04-20T19:56:43Z

Summary

Completes the remaining user-facing surface for repeated statistical evals in issue #149.

This PR adds a first-class eval-session experience in the web app, threads the missing deployment build-version identity needed by the new create flow, and aligns the API docs with the shipped payload.

Closes #149.

What Changed

added a New Eval Session flow from the runs area for repeated evals
added form support for repetitions, aggregation settings, optional success threshold, optional reliability weight override, routing/task properties, and participant selection from active deployments
added an Eval Sessions tab in the runs area with recent-session list polling and aggregate summaries
added an eval-session detail page with status, config snapshot, warnings, aggregate scorecards, pass@k, pass^k, metric routing guidance, comparison summaries, participant breakdowns, and child-run links
exposed current_build_version_id in the agent-deployment list response so the UI can construct eval-session participants correctly
updated frontend types, helper coverage, targeted tests, and OpenAPI docs

Why

The backend workflow, aggregation, and read APIs for repeated eval sessions were already in place, but the product surface for issue #149 was still largely missing. Users could not create, discover, or inspect eval sessions from the web app, which left the issue incomplete from an end-user perspective.

User Impact

Users can now:

configure repeated eval sessions from the workspace UI instead of hand-writing API requests
compare aggregated repeated-run results directly in the product
inspect pass metrics and routing guidance to understand when pass@k vs pass^k is emphasized
drill into the underlying child runs using the existing run detail flow
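
For readers new to the two pass metrics, the following is a minimal sketch of how they are commonly estimated from `n` repetitions of which `c` passed. It assumes the usual definitions (pass@k: at least one of `k` sampled attempts passes; pass^k: all `k` sampled attempts pass). It is not taken from this PR and may differ from how the backend aggregates results; the function names are hypothetical.

```typescript
// Hypothetical helpers: the PR does not show the backend's actual formulas.

// Binomial coefficient C(n, k), built up iteratively to avoid large factorials.
function choose(n: number, k: number): number {
  if (k < 0 || k > n) return 0;
  let result = 1;
  for (let i = 1; i <= k; i++) {
    result = (result * (n - k + i)) / i;
  }
  return result;
}

// pass@k: probability that at least one of k attempts, drawn without
// replacement from n recorded attempts (c of them passing), is a pass.
// Assumes 1 <= k <= n.
function passAtK(n: number, c: number, k: number): number {
  return 1 - choose(n - c, k) / choose(n, k);
}

// pass^k: probability that all k drawn attempts are passes.
// Assumes 1 <= k <= n.
function passHatK(n: number, c: number, k: number): number {
  return choose(c, k) / choose(n, k);
}
```

Under these assumed definitions, 5 repetitions with 3 passes and `k = 2` give pass@k = 0.9 and pass^k = 0.3, which illustrates how a session can look strong on best-case success yet weak on consistency.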

Validation

go test ./internal/api ./internal/repository ./internal/workflow -run 'AgentDeployment|EvalSession'
pnpm vitest run 'src/lib/api/__tests__/deployments.test.ts' 'src/lib/__tests__/eval-sessions.test.ts' 'src/app/(workspace)/workspaces/[workspaceId]/runs/create-eval-session-dialog.test.tsx'
pnpm exec tsc --noEmit
pnpm exec eslint 'src/app/(workspace)/workspaces/[workspaceId]/runs/create-eval-session-dialog.tsx' 'src/app/(workspace)/workspaces/[workspaceId]/runs/eval-session-list.tsx' 'src/app/(workspace)/workspaces/[workspaceId]/runs/page.tsx' 'src/app/(workspace)/workspaces/[workspaceId]/eval-sessions/[evalSessionId]/page.tsx' 'src/app/(workspace)/workspaces/[workspaceId]/eval-sessions/[evalSessionId]/eval-session-detail-client.tsx' 'src/lib/eval-sessions.ts' 'src/lib/api/types.ts'

Notes

Manual browser validation against a live workspace was not run in this session; validation relied on targeted tests, typecheck, lint, and the existing production build check run during development of the change.

vercel · 2026-04-20T19:56:50Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
agentclash	Ready	Preview, Comment	Apr 20, 2026 7:57pm

greptile-apps · 2026-04-20T20:02:10Z

Greptile Summary

This PR completes the user-facing surface for repeated statistical eval sessions (issue #149) by adding a create dialog, a list tab, and a detail page — all wired to the existing backend aggregation APIs. It also threads current_build_version_id through the agent-deployment list endpoint (SQL → sqlc → repository → API handler → frontend types) so the create form can correctly identify eval-session participants.

Key changes:

Backend: ListActiveAgentDeploymentsByWorkspaceID query and repository layer updated to surface current_build_version_id; the field is now included in the list API JSON response.
Frontend types: AgentDeployment and the full suite of eval-session request/response types added to types.ts.
CreateEvalSessionDialog: Multi-step form covering pack/version/input-set selection, participant labeling, repetitions, aggregation config, optional success threshold, and metric routing hints. Handles 422 validation errors and navigates to the detail page on success.
EvalSessionList: Polling table in a new "Eval Sessions" tab showing status, run-count summaries, and primary metric.
EvalSessionDetailClient: Full detail page with config snapshot, evidence warnings, aggregate scorecards, metric routing guidance, comparison summary, participant breakdowns, and child-run links; polls every 5 s while active.
One P1 display bug: formatEvalSessionMetricName is applied to already-human-readable "pass@k" / "pass^k" labels in the list component, rendering them as "Pass@K" / "Pass^K" due to \b matching after @ and ^.

Confidence Score: 4/5

Safe to merge after fixing the primary-metric label display bug in eval-session-list.tsx; all other issues are minor style improvements.

Backend changes are mechanically straightforward and well-covered by tests. The frontend architecture is sound — SSR initial load + client polling, proper error handling, form validation. One P1 display bug exists: "pass@k"/"pass^k" are rendered as "Pass@K"/"Pass^K" due to misapplication of formatEvalSessionMetricName. This is visible to every user with completed eval sessions but doesn't break any data or flow. Remaining comments are P2 style/robustness suggestions. No security issues, no data-loss risk.

eval-session-list.tsx (primary metric label rendering bug); create-eval-session-dialog.tsx (empty deployment state UX)

Important Files Changed

Filename	Overview
web/src/app/(workspace)/workspaces/[workspaceId]/runs/eval-session-list.tsx	New eval session list component with polling; has a display bug where `formatEvalSessionMetricName` is incorrectly applied to already-formatted labels like "pass@k", rendering them as "Pass@K"
web/src/app/(workspace)/workspaces/[workspaceId]/runs/create-eval-session-dialog.tsx	New dialog for creating eval sessions with form validation, input-set loading, and participant selection; has a redundant client-side status filter and lacks an empty state when no deployments exist
web/src/app/(workspace)/workspaces/[workspaceId]/eval-sessions/[evalSessionId]/eval-session-detail-client.tsx	New client component for eval session detail with polling, aggregate scorecards, participant cards, and child-run table; uses warning text as React key which could cause collisions on duplicate warnings
backend/internal/api/agent_deployments.go	Exposes `current_build_version_id` in the list response; straightforward addition that mirrors the existing AgentDeploymentSummary struct fields
backend/db/queries/agent_deployments.sql	Added `current_build_version_id` to SELECT; DISTINCT ON ordering is correct for picking the latest snapshot per deployment
backend/internal/repository/sqlc/agent_deployments.sql.go	SQLC-generated file updated to scan `current_build_version_id`; treated as non-nullable uuid.UUID consistent with the DB schema
backend/internal/repository/repository.go	Added `CurrentBuildVersionID` to `AgentDeploymentSummary` and its mapping; clean implementation
web/src/lib/api/types.ts	Comprehensive TypeScript types for eval session flows added; `AgentDeployment` now includes `current_build_version_id` matching the updated backend response
web/src/lib/eval-sessions.ts	Helper library for eval session display logic; formatting functions are well-factored and correctly handle null/NaN edge cases
web/src/app/(workspace)/workspaces/[workspaceId]/runs/page.tsx	Adds tabbed layout for runs and eval sessions with parallel SSR data fetching; clean integration of new components
web/src/app/(workspace)/workspaces/[workspaceId]/runs/status-variant.ts	Added `evalSessionStatusVariant` record covering all six EvalSessionStatus values; exhaustive and safe
web/src/lib/__tests__/eval-sessions.test.ts	Unit tests for eval-session helpers cover title derivation, mode detection, value formatting, and dimension sorting; coverage is adequate
web/src/lib/api/__tests__/deployments.test.ts	Tests verify that the deployment list response includes `current_build_version_id` and that error paths surface correctly

Sequence Diagram

sequenceDiagram
    participant User
    participant RunsPage as Runs Page (SSR)
    participant CreateDialog as CreateEvalSessionDialog
    participant EvalSessionList as EvalSessionList (polling)
    participant DetailPage as EvalSessionDetailClient (polling)
    participant API as Backend API

    RunsPage->>API: GET /v1/workspaces/{id}/runs
    RunsPage->>API: GET /v1/eval-sessions?workspace_id=...
    API-->>RunsPage: runs + eval sessions (SSR initial data)

    User->>CreateDialog: Open New Eval Session
    CreateDialog->>API: GET /v1/workspaces/{id}/challenge-packs
    CreateDialog->>API: GET /v1/workspaces/{id}/agent-deployments
    API-->>CreateDialog: packs + active deployments (with current_build_version_id)
    User->>CreateDialog: Select pack / version / deployments / config
    CreateDialog->>API: POST /v1/eval-sessions
    API-->>CreateDialog: 201 CreateEvalSessionResponse or 422 validation errors
    CreateDialog->>DetailPage: navigate to /eval-sessions/{id}

    DetailPage->>API: GET /v1/eval-sessions/{id} (SSR)
    API-->>DetailPage: EvalSessionDetail (initial)
    loop every 5s while status in queued running aggregating
        DetailPage->>API: GET /v1/eval-sessions/{id}
        API-->>DetailPage: updated EvalSessionDetail
    end

    loop every 5s while any session active
        EvalSessionList->>API: GET /v1/eval-sessions?workspace_id=...
        API-->>EvalSessionList: updated list
    end
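
The two polling loops in the diagram can be sketched as a pair of pure helpers that decide whether another poll should be scheduled. This is an illustrative sketch only: the active status names come from the diagram, while the helper names and the idea of returning a delay (or `null` to stop) are assumptions rather than the components' actual code.

```typescript
// Statuses the diagram treats as "active". The PR does not list the terminal
// status names, so every other value is treated as terminal here.
const ACTIVE_STATUSES: ReadonlySet<string> = new Set([
  "queued",
  "running",
  "aggregating",
]);

const POLL_INTERVAL_MS = 5000;

function isActiveStatus(status: string): boolean {
  return ACTIVE_STATUSES.has(status);
}

// Detail page: keep polling while this one session is active.
// Returns the delay before the next poll, or null to stop polling.
function detailPollDelay(detail: { status: string }): number | null {
  return isActiveStatus(detail.status) ? POLL_INTERVAL_MS : null;
}

// List tab: keep polling while any listed session is active.
function listPollDelay(
  items: ReadonlyArray<{ status: string }>,
): number | null {
  return items.some((item) => isActiveStatus(item.status))
    ? POLL_INTERVAL_MS
    : null;
}
```

A component could call one of these after each response and pass a non-null result to `setTimeout`, so polling stops on its own once every session has settled.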

Prompt To Fix All With AI

This is a comment left during a code review.
Path: web/src/app/(workspace)/workspaces/[workspaceId]/runs/eval-session-list.tsx
Line: 151-155

Comment:
**`formatEvalSessionMetricName` applied to already-human-readable labels**

`formatPrimaryMetric` returns `"pass@k"` or `"pass^k"` — these are already display-ready. Passing them through `formatEvalSessionMetricName` (which capitalizes after word boundaries, i.e. after any non-word character) turns them into `"Pass@K"` and `"Pass^K"` because `@` and `^` are non-word characters, so `\b` matches before `k` and capitalizes it.

The function is designed for underscore-separated backend identifiers like `pass_at_k`, not for these labels.

```suggestion
                  <TableCell className="text-sm text-muted-foreground">
                    {formatPrimaryMetric(item)}
                  </TableCell>
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: web/src/app/(workspace)/workspaces/[workspaceId]/runs/create-eval-session-dialog.tsx
Line: 103-105

Comment:
**Redundant client-side filter — backend already guarantees `status = 'active'`**

The `ListActiveAgentDeploymentsByWorkspaceID` SQL query includes `AND ad.status = 'active'` in its `WHERE` clause. Filtering again on `deployment.status === "active"` in the client is harmless but misleading — it implies the API might return non-active deployments.

```suggestion
      setDeployments(deploymentsResponse.items);
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: web/src/app/(workspace)/workspaces/[workspaceId]/runs/create-eval-session-dialog.tsx
Line: 452-492

Comment:
**No empty state when the workspace has no active deployments**

When `deployments.length === 0` after load, the participant section renders an empty `<div className="space-y-3">` with no message. This leaves users with a blank card, a permanently disabled submit button, and no explanation. A short empty-state message (e.g. "No active deployments found. Deploy an agent to get started.") would prevent confusion.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: web/src/app/(workspace)/workspaces/[workspaceId]/eval-sessions/[evalSessionId]/eval-session-detail-client.tsx
Line: 271-280

Comment:
**Duplicate warning strings will cause React key collisions**

The evidence warnings are keyed by their text content. If the backend returns two evidence warnings with identical text, React will log a duplicate-key warning and may mishandle reconciliation. Since this list is never reordered, using the array position as the key is safer than using the warning message itself.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "Add eval session UI for repeated runs"

greptile-apps · 2026-04-20T20:02:13Z

+                  <TableCell className="text-sm text-muted-foreground">
+                    {formatPrimaryMetric(item) === "—"
+                      ? "—"
+                      : formatEvalSessionMetricName(formatPrimaryMetric(item))}
+                  </TableCell>


formatEvalSessionMetricName applied to already-human-readable labels

formatPrimaryMetric returns "pass@k" or "pass^k" — these are already display-ready. Passing them through formatEvalSessionMetricName (which capitalizes after word boundaries, i.e. after any non-word character) turns them into "Pass@K" and "Pass^K" because @ and ^ are non-word characters, so \b matches before k and capitalizes it.

The function is designed for underscore-separated backend identifiers like pass_at_k, not for these labels.
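
The word-boundary behavior described in this comment can be reproduced with a small stand-in. The real body of `formatEvalSessionMetricName` is not shown in the PR, so `titleCaseIdentifier` below is a hypothetical approximation that assumes the formatter replaces underscores with spaces and then upper-cases every character matched by `/\b\w/g`.

```typescript
// Hypothetical stand-in for formatEvalSessionMetricName (actual body not shown).
// Intended input: underscore-separated backend identifiers such as "pass_at_k".
function titleCaseIdentifier(name: string): string {
  return name
    .replace(/_/g, " ")
    // \b sits between a non-word and a word character, so it matches at the
    // start of the string and also right after "@" or "^".
    .replace(/\b\w/g, (ch) => ch.toUpperCase());
}

console.log(titleCaseIdentifier("pass_at_k")); // "Pass At K"
console.log(titleCaseIdentifier("pass@k")); // "Pass@K"
console.log(titleCaseIdentifier("pass^k")); // "Pass^K"
```

The first call shows the intended use; the other two show why display-ready labels should bypass the formatter.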

Suggested change

-                  <TableCell className="text-sm text-muted-foreground">
-                    {formatPrimaryMetric(item) === "—"
-                      ? "—"
-                      : formatEvalSessionMetricName(formatPrimaryMetric(item))}
-                  </TableCell>
+                  <TableCell className="text-sm text-muted-foreground">
+                    {formatPrimaryMetric(item)}
+                  </TableCell>


greptile-apps · 2026-04-20T20:02:14Z

+      setDeployments(
+        deploymentsResponse.items.filter((deployment) => deployment.status === "active"),
+      );


Redundant client-side filter — backend already guarantees status = 'active'

The ListActiveAgentDeploymentsByWorkspaceID SQL query includes AND ad.status = 'active' in its WHERE clause. Filtering again on deployment.status === "active" in the client is harmless but misleading — it implies the API might return non-active deployments.

Suggested change

-      setDeployments(
-        deploymentsResponse.items.filter((deployment) => deployment.status === "active"),
-      );
+      setDeployments(deploymentsResponse.items);


greptile-apps · 2026-04-20T20:02:15Z

+
+            <div className="space-y-3">
+              {deployments.map((deployment) => {
+                const checked = selectedDeploymentIds.includes(deployment.id);
+                return (
+                  <div
+                    key={deployment.id}
+                    className="rounded-lg border border-border bg-background/60 p-3"
+                  >
+                    <label className="flex items-center gap-3 text-sm font-medium">
+                      <input
+                        type="checkbox"
+                        checked={checked}
+                        onChange={() => toggleDeployment(deployment.id)}
+                        className="size-4 rounded border-border accent-primary"
+                      />
+                      <span>{deployment.name}</span>
+                    </label>
+                    {checked ? (
+                      <div className="mt-3">
+                        <label className="mb-1.5 block text-xs font-medium uppercase tracking-[0.12em] text-muted-foreground">
+                          Participant label
+                        </label>
+                        <input
+                          type="text"
+                          value={participantLabels[deployment.id] ?? deployment.name}
+                          onChange={(event) =>
+                            setParticipantLabels((current) => ({
+                              ...current,
+                              [deployment.id]: event.target.value,
+                            }))
+                          }
+                          className={inputClass}
+                        />
+                      </div>
+                    ) : null}
+                  </div>
+                );
+              })}
+            </div>
+          </div>


No empty state when the workspace has no active deployments

When deployments.length === 0 after load, the participant section renders an empty <div className="space-y-3"> with no message. This leaves users with a blank card, a permanently disabled submit button, and no explanation. A short empty-state message (e.g. "No active deployments found. Deploy an agent to get started.") would prevent confusion.
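
One possible shape for the fix is a small helper that decides whether the participant section should show a message instead of the list. This is a sketch under assumptions: `participantEmptyMessage`, the `loading` flag, and the `DeploymentOption` type are hypothetical names, and the message text is the one proposed in the comment above.

```typescript
// Hypothetical minimal shape; the real AgentDeployment type has more fields.
interface DeploymentOption {
  id: string;
  name: string;
}

// Returns the text to show in place of the participant list, or null when
// the list itself should be rendered (or data is still loading).
function participantEmptyMessage(
  loading: boolean,
  deployments: ReadonlyArray<DeploymentOption>,
): string | null {
  if (loading) return null; // avoid flashing the empty state before load
  if (deployments.length > 0) return null;
  return "No active deployments found. Deploy an agent to get started.";
}
```

The dialog could render the returned string when it is non-null and fall back to the existing `deployments.map(...)` block otherwise.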


greptile-apps · 2026-04-20T20:02:16Z

+          <div className="space-y-2">
+            {detail.evidence_warnings.map((warning) => (
+              <div
+                key={warning}
+                className="flex items-start gap-2 rounded-lg border border-amber-500/30 bg-amber-500/10 p-3 text-sm text-amber-100"
+              >
+                <AlertTriangle className="mt-0.5 size-4 shrink-0" />
+                <span>{warning}</span>
+              </div>
+            ))}


Duplicate warning strings will cause React key collisions

The evidence warnings are keyed by their text content. If the backend returns two evidence warnings with identical text, React will log a duplicate-key warning and may mishandle reconciliation. Since this list is never reordered, using the array position as the key is safer than using the warning message itself.
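
The collision and the position-based alternative can be illustrated without any React code. The helper names and the `evidence-warning-` prefix below are hypothetical; the sketch only shows that text-based keys repeat when warnings repeat, while index-based keys do not.

```typescript
// True when keying by warning text would produce duplicate keys.
function hasDuplicateWarnings(warnings: ReadonlyArray<string>): boolean {
  return new Set(warnings).size !== warnings.length;
}

// Position-based keys: unique per item, and stable as long as the list is
// never reordered (which the review comment states is the case here).
function warningKeys(warnings: ReadonlyArray<string>): string[] {
  return warnings.map((_, index) => `evidence-warning-${index}`);
}
```

In the component, the `map` callback would take the index as its second argument and use it for `key` in place of the warning text.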


Add eval session UI for repeated runs

d1c02ce

Atharva-Kanherkar marked this pull request as ready for review April 20, 2026 19:57

vercel Bot deployed to Preview April 20, 2026 19:57 View deployment

Atharva-Kanherkar self-assigned this Apr 20, 2026

Atharva-Kanherkar merged commit c0a62da into main Apr 20, 2026
6 checks passed

greptile-apps Bot reviewed Apr 20, 2026

Atharva-Kanherkar merged 1 commit into main from issue-149-eval-session-ui
