[codex] Fix same-run agent scorecard comparison#429
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
Greptile SummaryThis PR fixes the HTTP 500 that occurred when
Confidence Score: 3/5Not safe to merge as-is: the migration enables the new code path but leaves a data-integrity gap that will silently corrupt the stored comparison for same-run multi-agent scenarios. The application-layer and test changes are correct and well-covered. However, the DB migration fixes only the CHECK constraint while the UNIQUE index and the upsert conflict key remain keyed on (baseline_run_id, candidate_run_id) — the same key for every same-run comparison. This means concurrent or sequential same-run comparisons for different agent pairs will overwrite each other in the database, which is the exact scenario this PR enables and needs to handle correctly. backend/db/migrations/00030_same_run_agent_comparisons.sql and backend/db/queries/run_comparisons.sql (upsert conflict key) need attention before merging.
|
| Filename | Overview |
|---|---|
| backend/db/migrations/00030_same_run_agent_comparisons.sql | Replaces the old CHECK constraint with one that allows same-run/different-agent comparisons, but does not update the UNIQUE (baseline_run_id, candidate_run_id) constraint — causing same-run comparisons for different agent pairs to silently overwrite each other. |
| backend/internal/api/compare_reads.go | Adds ErrInvalidCompareRequest sentinel, validation logic for same-run/different-agent path, and HTTP 400 mapping in the handler; isCompareValidationError has a fragile string-based fallback with two stale entries. |
| backend/internal/api/compare_reads_test.go | Adds three manager-level tests and one handler test covering the new same-run validation paths; coverage is good. |
| backend/internal/repository/run_comparison.go | Adds validateBuildRunComparisonParams with correct same-run/same-agent rejection; upsert logic unchanged and still keyed on (baseline_run_id, candidate_run_id), which collides for same-run different-agent pairs. |
| backend/internal/repository/run_comparison_test.go | Adds unit tests for all three validateBuildRunComparisonParams branches (allow different agents, reject same agent, reject missing agent IDs). |
| backend/internal/repository/repository_integration_test.go | Adds a DB-backed integration test for same-run/different-agent BuildRunComparison; guarded by DATABASE_URL so it runs in CI only. |
| testing/issue-424-eval-scorecard-same-run-500.md | Review-checkpoint document describing contract expectations and validation test commands for issue #424. |
Sequence Diagram
sequenceDiagram
participant CLI as CLI / Client
participant H as HTTP Handler
participant M as CompareReadManager
participant R as Repository
participant DB as Postgres
CLI->>H: GET /v1/compare?baseline_run_id=X&candidate_run_id=X&baseline_run_agent_id=A&candidate_run_agent_id=B
H->>H: compareInputFromRequest (parse UUIDs)
H->>M: GetRunComparison(input)
M->>M: validate: same run + both agent IDs present + IDs differ
alt validation fails (same agent or missing IDs)
M-->>H: invalidCompareRequest error (wraps ErrInvalidCompareRequest)
H-->>CLI: 400 invalid_compare_request
end
M->>R: GetRunByID(X) x2
R-->>M: Run{X}
M->>M: AuthorizeWorkspace
M->>R: BuildRunComparison(X, X, A, B)
R->>R: validateBuildRunComparisonParams (same-run guard)
R->>DB: ListRunAgentsByRunID(X)
DB-->>R: [agentA, agentB, ...]
R->>R: resolveRunComparisonParticipants (pin A and B)
R->>DB: load scorecards, replays, judge results for A and B
R->>R: buildComparableRunComparisonSummary
R->>DB: UpsertRunComparison ON CONFLICT (baseline_run_id, candidate_run_id) overwrites any prior same-run row
DB-->>R: RunComparison row (A vs B)
R-->>M: RunComparison
M-->>H: GetRunComparisonResult
H-->>CLI: 200 JSON with state/deltas
Prompt To Fix All With AI
This is a comment left during a code review.
Path: backend/db/migrations/00030_same_run_agent_comparisons.sql
Line: 19-26
Comment:
**Same-run comparisons collide on the existing `UNIQUE (baseline_run_id, candidate_run_id)` constraint**
This migration drops and replaces the `CHECK` constraint but leaves the `UNIQUE (baseline_run_id, candidate_run_id)` index untouched (introduced in `00011_run_comparisons.sql`). For same-run comparisons `baseline_run_id = candidate_run_id`, so every distinct agent-pair comparison maps to the *same* key `(run_id, run_id)`.
The `UpsertRunComparison` query uses `ON CONFLICT (baseline_run_id, candidate_run_id) DO UPDATE SET baseline_run_agent_id = EXCLUDED.baseline_run_agent_id, ...`, which means a comparison of agents A vs B will be silently overwritten the moment agents A vs C are compared for the same run. The row stored in `run_comparisons` will always reflect only the *most recent* same-run request.
`GetRunComparisonByRunIDs` reads by `(baseline_run_id, candidate_run_id)` and is exposed through the repository interface, so any caller that uses that read path after a subsequent same-run comparison will receive stale data for the wrong agent pair. The migration needs to also extend the unique index to distinguish same-run comparisons by agent pair, e.g.:
```sql
ALTER TABLE run_comparisons DROP CONSTRAINT run_comparisons_baseline_run_id_candidate_run_id_key;
CREATE UNIQUE INDEX run_comparisons_cross_run_unique
ON run_comparisons (baseline_run_id, candidate_run_id)
WHERE baseline_run_id <> candidate_run_id;
CREATE UNIQUE INDEX run_comparisons_same_run_agent_unique
ON run_comparisons (baseline_run_id, candidate_run_id, baseline_run_agent_id, candidate_run_agent_id)
WHERE baseline_run_id = candidate_run_id;
```
The `UpsertRunComparison` query and the SQLC-generated code would also need to be updated to conflict on the appropriate key depending on whether this is a same-run or cross-run comparison.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: backend/internal/api/compare_reads.go
Line: 258-273
Comment:
**String-based fallback in `isCompareValidationError` contains stale entries and is fragile**
The `errors.Is(err, ErrInvalidCompareRequest)` sentinel check correctly handles all errors produced by this file's own `invalidCompareRequest()` helper. The string-based `switch` below it is intended as a compatibility net for repository-layer plain `fmt.Errorf` strings, but two of the listed strings are no longer produced by any code path in the repo:
- `"baseline_run_id is required"` — the API now wraps this before the service is called; the repo never returned it
- `"baseline_run_id and candidate_run_id must differ"` — replaced with the wrapped form in this PR
These dead entries will never match and could mislead future readers. More importantly, the string-matching approach will silently stop working if error messages are renamed. Prefer wrapping repository validation errors with `ErrInvalidCompareRequest` (or a shared sentinel) at the service boundary so that the `errors.Is` path alone is sufficient.
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "fix: allow same-run agent comparisons" | Re-trigger Greptile
| ADD CONSTRAINT run_comparisons_distinct_runs_or_agents_check | ||
| CHECK ( | ||
| baseline_run_id <> candidate_run_id | ||
| OR ( | ||
| baseline_run_agent_id IS NOT NULL | ||
| AND candidate_run_agent_id IS NOT NULL | ||
| AND baseline_run_agent_id <> candidate_run_agent_id | ||
| ) |
There was a problem hiding this comment.
Same-run comparisons collide on the existing
UNIQUE (baseline_run_id, candidate_run_id) constraint
This migration drops and replaces the CHECK constraint but leaves the UNIQUE (baseline_run_id, candidate_run_id) index untouched (introduced in 00011_run_comparisons.sql). For same-run comparisons baseline_run_id = candidate_run_id, so every distinct agent-pair comparison maps to the same key (run_id, run_id).
The UpsertRunComparison query uses ON CONFLICT (baseline_run_id, candidate_run_id) DO UPDATE SET baseline_run_agent_id = EXCLUDED.baseline_run_agent_id, ..., which means a comparison of agents A vs B will be silently overwritten the moment agents A vs C are compared for the same run. The row stored in run_comparisons will always reflect only the most recent same-run request.
GetRunComparisonByRunIDs reads by (baseline_run_id, candidate_run_id) and is exposed through the repository interface, so any caller that uses that read path after a subsequent same-run comparison will receive stale data for the wrong agent pair. The migration needs to also extend the unique index to distinguish same-run comparisons by agent pair, e.g.:
ALTER TABLE run_comparisons DROP CONSTRAINT run_comparisons_baseline_run_id_candidate_run_id_key;
CREATE UNIQUE INDEX run_comparisons_cross_run_unique
ON run_comparisons (baseline_run_id, candidate_run_id)
WHERE baseline_run_id <> candidate_run_id;
CREATE UNIQUE INDEX run_comparisons_same_run_agent_unique
ON run_comparisons (baseline_run_id, candidate_run_id, baseline_run_agent_id, candidate_run_agent_id)
WHERE baseline_run_id = candidate_run_id;The UpsertRunComparison query and the SQLC-generated code would also need to be updated to conflict on the appropriate key depending on whether this is a same-run or cross-run comparison.
Prompt To Fix With AI
This is a comment left during a code review.
Path: backend/db/migrations/00030_same_run_agent_comparisons.sql
Line: 19-26
Comment:
**Same-run comparisons collide on the existing `UNIQUE (baseline_run_id, candidate_run_id)` constraint**
This migration drops and replaces the `CHECK` constraint but leaves the `UNIQUE (baseline_run_id, candidate_run_id)` index untouched (introduced in `00011_run_comparisons.sql`). For same-run comparisons `baseline_run_id = candidate_run_id`, so every distinct agent-pair comparison maps to the *same* key `(run_id, run_id)`.
The `UpsertRunComparison` query uses `ON CONFLICT (baseline_run_id, candidate_run_id) DO UPDATE SET baseline_run_agent_id = EXCLUDED.baseline_run_agent_id, ...`, which means a comparison of agents A vs B will be silently overwritten the moment agents A vs C are compared for the same run. The row stored in `run_comparisons` will always reflect only the *most recent* same-run request.
`GetRunComparisonByRunIDs` reads by `(baseline_run_id, candidate_run_id)` and is exposed through the repository interface, so any caller that uses that read path after a subsequent same-run comparison will receive stale data for the wrong agent pair. The migration needs to also extend the unique index to distinguish same-run comparisons by agent pair, e.g.:
```sql
ALTER TABLE run_comparisons DROP CONSTRAINT run_comparisons_baseline_run_id_candidate_run_id_key;
CREATE UNIQUE INDEX run_comparisons_cross_run_unique
ON run_comparisons (baseline_run_id, candidate_run_id)
WHERE baseline_run_id <> candidate_run_id;
CREATE UNIQUE INDEX run_comparisons_same_run_agent_unique
ON run_comparisons (baseline_run_id, candidate_run_id, baseline_run_agent_id, candidate_run_agent_id)
WHERE baseline_run_id = candidate_run_id;
```
The `UpsertRunComparison` query and the SQLC-generated code would also need to be updated to conflict on the appropriate key depending on whether this is a same-run or cross-run comparison.
How can I resolve this? If you propose a fix, please make it concise.| func isCompareValidationError(err error) bool { | ||
| if errors.Is(err, ErrInvalidCompareRequest) { | ||
| return true | ||
| } | ||
| switch err.Error() { | ||
| case "baseline_run_id is required", | ||
| "candidate_run_id is required", | ||
| "baseline_run_id and candidate_run_id must differ", | ||
| "baseline and candidate run ids must differ", | ||
| "same-run comparison requires baseline and candidate run agent ids", | ||
| "baseline and candidate run agent ids must differ for same-run comparison": | ||
| return true | ||
| default: | ||
| return false | ||
| } | ||
| } |
There was a problem hiding this comment.
String-based fallback in
isCompareValidationError contains stale entries and is fragile
The errors.Is(err, ErrInvalidCompareRequest) sentinel check correctly handles all errors produced by this file's own invalidCompareRequest() helper. The string-based switch below it is intended as a compatibility net for repository-layer plain fmt.Errorf strings, but two of the listed strings are no longer produced by any code path in the repo:
"baseline_run_id is required"— the API now wraps this before the service is called; the repo never returned it"baseline_run_id and candidate_run_id must differ"— replaced with the wrapped form in this PR
These dead entries will never match and could mislead future readers. More importantly, the string-matching approach will silently stop working if error messages are renamed. Prefer wrapping repository validation errors with ErrInvalidCompareRequest (or a shared sentinel) at the service boundary so that the errors.Is path alone is sufficient.
Prompt To Fix With AI
This is a comment left during a code review.
Path: backend/internal/api/compare_reads.go
Line: 258-273
Comment:
**String-based fallback in `isCompareValidationError` contains stale entries and is fragile**
The `errors.Is(err, ErrInvalidCompareRequest)` sentinel check correctly handles all errors produced by this file's own `invalidCompareRequest()` helper. The string-based `switch` below it is intended as a compatibility net for repository-layer plain `fmt.Errorf` strings, but two of the listed strings are no longer produced by any code path in the repo:
- `"baseline_run_id is required"` — the API now wraps this before the service is called; the repo never returned it
- `"baseline_run_id and candidate_run_id must differ"` — replaced with the wrapped form in this PR
These dead entries will never match and could mislead future readers. More importantly, the string-matching approach will silently stop working if error messages are renamed. Prefer wrapping repository validation errors with `ErrInvalidCompareRequest` (or a shared sentinel) at the service boundary so that the `errors.Is` path alone is sufficient.
How can I resolve this? If you propose a fix, please make it concise.# Conflicts: # backend/db/migrations/00030_same_run_agent_comparisons.sql # backend/internal/api/compare_reads.go # backend/internal/api/compare_reads_test.go # backend/internal/repository/repository_integration_test.go # backend/internal/repository/run_comparison.go # backend/internal/repository/run_comparison_test.go
Summary
Fixes #424.
This fixes the compare path used by
agentclash eval scorecardwhen the bookmarked baseline and selected candidate are two different agents from the same multi-agent run. That workflow should compare the two run agents, not fail before explicit agent IDs are considered.What changed
baseline_run_idandcandidate_run_idwhen explicit baseline/candidate run-agent IDs are present and different.400 invalid_compare_requestinstead of500 internal_error.baseline_run_id <> candidate_run_idcheck with a distinct-run-or-distinct-agent check.testing/.Validation
cd backend && go test ./internal/api -run 'TestCompare|TestGetRunComparison'\n-cd backend && go test ./internal/repository -run 'TestValidateBuildRunComparisonParams|TestRepositoryBuildRunComparison'\n-cd cli && go test ./cmd -run 'TestEvalScorecard|TestCompareRuns'\n-cd backend && go test ./internal/api ./internal/repository\n-cd cli && go test ./cmd\n\nNote: repository integration tests that requireDATABASE_URLare skipped locally in this environment; the same-run integration case is included for database-backed CI validation.