Skip to content

[codex] Fix same-run agent scorecard comparison#429

Merged
Atharva-Kanherkar merged 3 commits into
mainfrom
issue-424-eval-scorecard-same-run-500
Apr 29, 2026
Merged

[codex] Fix same-run agent scorecard comparison#429
Atharva-Kanherkar merged 3 commits into
mainfrom
issue-424-eval-scorecard-same-run-500

Conversation

@Atharva-Kanherkar
Copy link
Copy Markdown
Collaborator

Summary

Fixes #424.

This fixes the compare path used by agentclash eval scorecard when the bookmarked baseline and selected candidate are two different agents from the same multi-agent run. That workflow should compare the two run agents, not fail before explicit agent IDs are considered.

What changed

  • Allows same baseline_run_id and candidate_run_id when explicit baseline/candidate run-agent IDs are present and different.
  • Rejects same-run same-agent and same-run ambiguous comparisons as 400 invalid_compare_request instead of 500 internal_error.
  • Adds a DB migration replacing the old baseline_run_id <> candidate_run_id check with a distinct-run-or-distinct-agent check.
  • Adds review-checkpoint contract coverage under testing/.

Validation

  • cd backend && go test ./internal/api -run 'TestCompare|TestGetRunComparison'\n- cd backend && go test ./internal/repository -run 'TestValidateBuildRunComparisonParams|TestRepositoryBuildRunComparison'\n- cd cli && go test ./cmd -run 'TestEvalScorecard|TestCompareRuns'\n- cd backend && go test ./internal/api ./internal/repository\n- cd cli && go test ./cmd\n\nNote: repository integration tests that require DATABASE_URL are skipped locally in this environment; the same-run integration case is included for database-backed CI validation.

@vercel
Copy link
Copy Markdown

vercel Bot commented Apr 28, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
agentclash Ready Ready Preview, Comment Apr 29, 2026 6:50am

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 28, 2026

Greptile Summary

This PR fixes the HTTP 500 that occurred when eval scorecard compared two different agents from the same multi-agent run. It adds same-run/different-agent validation in both the API manager and the repository layer, maps those validation errors to HTTP 400, and ships a DB migration relaxing the old baseline_run_id <> candidate_run_id CHECK constraint.

  • P1 — migration incomplete: The new CHECK constraint is correct, but the existing UNIQUE (baseline_run_id, candidate_run_id) index (from migration 00011) is unchanged, and the UpsertRunComparison query conflicts on that same key. For a same-run comparison baseline_run_id = candidate_run_id = run_id, so every agent-pair comparison for a given run collides at (run_id, run_id). A second same-run comparison (agents A vs C) will silently overwrite the first (agents A vs B), leaving only the latest comparison in the DB. GetRunComparisonByRunIDs — exposed through the repository interface — would return the wrong agent-pair result for any caller that uses it after a subsequent same-run comparison.

Confidence Score: 3/5

Not safe to merge as-is: the migration enables the new code path but leaves a data-integrity gap that will silently corrupt the stored comparison for same-run multi-agent scenarios.

The application-layer and test changes are correct and well-covered. However, the DB migration fixes only the CHECK constraint while the UNIQUE index and the upsert conflict key remain keyed on (baseline_run_id, candidate_run_id) — the same key for every same-run comparison. This means concurrent or sequential same-run comparisons for different agent pairs will overwrite each other in the database, which is the exact scenario this PR enables and needs to handle correctly.

backend/db/migrations/00030_same_run_agent_comparisons.sql and backend/db/queries/run_comparisons.sql (upsert conflict key) need attention before merging.

Important Files Changed

Filename Overview
backend/db/migrations/00030_same_run_agent_comparisons.sql Replaces the old CHECK constraint with one that allows same-run/different-agent comparisons, but does not update the UNIQUE (baseline_run_id, candidate_run_id) constraint — causing same-run comparisons for different agent pairs to silently overwrite each other.
backend/internal/api/compare_reads.go Adds ErrInvalidCompareRequest sentinel, validation logic for same-run/different-agent path, and HTTP 400 mapping in the handler; isCompareValidationError has a fragile string-based fallback with two stale entries.
backend/internal/api/compare_reads_test.go Adds three manager-level tests and one handler test covering the new same-run validation paths; coverage is good.
backend/internal/repository/run_comparison.go Adds validateBuildRunComparisonParams with correct same-run/same-agent rejection; upsert logic unchanged and still keyed on (baseline_run_id, candidate_run_id), which collides for same-run different-agent pairs.
backend/internal/repository/run_comparison_test.go Adds unit tests for all three validateBuildRunComparisonParams branches (allow different agents, reject same agent, reject missing agent IDs).
backend/internal/repository/repository_integration_test.go Adds a DB-backed integration test for same-run/different-agent BuildRunComparison; guarded by DATABASE_URL so it runs in CI only.
testing/issue-424-eval-scorecard-same-run-500.md Review-checkpoint document describing contract expectations and validation test commands for issue #424.

Sequence Diagram

sequenceDiagram
    participant CLI as CLI / Client
    participant H as HTTP Handler
    participant M as CompareReadManager
    participant R as Repository
    participant DB as Postgres

    CLI->>H: GET /v1/compare?baseline_run_id=X&candidate_run_id=X&baseline_run_agent_id=A&candidate_run_agent_id=B
    H->>H: compareInputFromRequest (parse UUIDs)
    H->>M: GetRunComparison(input)
    M->>M: validate: same run + both agent IDs present + IDs differ
    alt validation fails (same agent or missing IDs)
        M-->>H: invalidCompareRequest error (wraps ErrInvalidCompareRequest)
        H-->>CLI: 400 invalid_compare_request
    end
    M->>R: GetRunByID(X) x2
    R-->>M: Run{X}
    M->>M: AuthorizeWorkspace
    M->>R: BuildRunComparison(X, X, A, B)
    R->>R: validateBuildRunComparisonParams (same-run guard)
    R->>DB: ListRunAgentsByRunID(X)
    DB-->>R: [agentA, agentB, ...]
    R->>R: resolveRunComparisonParticipants (pin A and B)
    R->>DB: load scorecards, replays, judge results for A and B
    R->>R: buildComparableRunComparisonSummary
    R->>DB: UpsertRunComparison ON CONFLICT (baseline_run_id, candidate_run_id) overwrites any prior same-run row
    DB-->>R: RunComparison row (A vs B)
    R-->>M: RunComparison
    M-->>H: GetRunComparisonResult
    H-->>CLI: 200 JSON with state/deltas
Loading

Fix All in Codex

Prompt To Fix All With AI
This is a comment left during a code review.
Path: backend/db/migrations/00030_same_run_agent_comparisons.sql
Line: 19-26

Comment:
**Same-run comparisons collide on the existing `UNIQUE (baseline_run_id, candidate_run_id)` constraint**

This migration drops and replaces the `CHECK` constraint but leaves the `UNIQUE (baseline_run_id, candidate_run_id)` index untouched (introduced in `00011_run_comparisons.sql`). For same-run comparisons `baseline_run_id = candidate_run_id`, so every distinct agent-pair comparison maps to the *same* key `(run_id, run_id)`.

The `UpsertRunComparison` query uses `ON CONFLICT (baseline_run_id, candidate_run_id) DO UPDATE SET baseline_run_agent_id = EXCLUDED.baseline_run_agent_id, ...`, which means a comparison of agents A vs B will be silently overwritten the moment agents A vs C are compared for the same run. The row stored in `run_comparisons` will always reflect only the *most recent* same-run request.

`GetRunComparisonByRunIDs` reads by `(baseline_run_id, candidate_run_id)` and is exposed through the repository interface, so any caller that uses that read path after a subsequent same-run comparison will receive stale data for the wrong agent pair. The migration needs to also extend the unique index to distinguish same-run comparisons by agent pair, e.g.:

```sql
ALTER TABLE run_comparisons DROP CONSTRAINT run_comparisons_baseline_run_id_candidate_run_id_key;

CREATE UNIQUE INDEX run_comparisons_cross_run_unique
    ON run_comparisons (baseline_run_id, candidate_run_id)
    WHERE baseline_run_id <> candidate_run_id;

CREATE UNIQUE INDEX run_comparisons_same_run_agent_unique
    ON run_comparisons (baseline_run_id, candidate_run_id, baseline_run_agent_id, candidate_run_agent_id)
    WHERE baseline_run_id = candidate_run_id;
```

The `UpsertRunComparison` query and the SQLC-generated code would also need to be updated to conflict on the appropriate key depending on whether this is a same-run or cross-run comparison.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: backend/internal/api/compare_reads.go
Line: 258-273

Comment:
**String-based fallback in `isCompareValidationError` contains stale entries and is fragile**

The `errors.Is(err, ErrInvalidCompareRequest)` sentinel check correctly handles all errors produced by this file's own `invalidCompareRequest()` helper. The string-based `switch` below it is intended as a compatibility net for repository-layer plain `fmt.Errorf` strings, but two of the listed strings are no longer produced by any code path in the repo:

- `"baseline_run_id is required"` — the API now wraps this before the service is called; the repo never returned it
- `"baseline_run_id and candidate_run_id must differ"` — replaced with the wrapped form in this PR

These dead entries will never match and could mislead future readers. More importantly, the string-matching approach will silently stop working if error messages are renamed. Prefer wrapping repository validation errors with `ErrInvalidCompareRequest` (or a shared sentinel) at the service boundary so that the `errors.Is` path alone is sufficient.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix: allow same-run agent comparisons" | Re-trigger Greptile

Comment on lines +19 to +26
ADD CONSTRAINT run_comparisons_distinct_runs_or_agents_check
CHECK (
baseline_run_id <> candidate_run_id
OR (
baseline_run_agent_id IS NOT NULL
AND candidate_run_agent_id IS NOT NULL
AND baseline_run_agent_id <> candidate_run_agent_id
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Same-run comparisons collide on the existing UNIQUE (baseline_run_id, candidate_run_id) constraint

This migration drops and replaces the CHECK constraint but leaves the UNIQUE (baseline_run_id, candidate_run_id) index untouched (introduced in 00011_run_comparisons.sql). For same-run comparisons baseline_run_id = candidate_run_id, so every distinct agent-pair comparison maps to the same key (run_id, run_id).

The UpsertRunComparison query uses ON CONFLICT (baseline_run_id, candidate_run_id) DO UPDATE SET baseline_run_agent_id = EXCLUDED.baseline_run_agent_id, ..., which means a comparison of agents A vs B will be silently overwritten the moment agents A vs C are compared for the same run. The row stored in run_comparisons will always reflect only the most recent same-run request.

GetRunComparisonByRunIDs reads by (baseline_run_id, candidate_run_id) and is exposed through the repository interface, so any caller that uses that read path after a subsequent same-run comparison will receive stale data for the wrong agent pair. The migration needs to also extend the unique index to distinguish same-run comparisons by agent pair, e.g.:

ALTER TABLE run_comparisons DROP CONSTRAINT run_comparisons_baseline_run_id_candidate_run_id_key;

CREATE UNIQUE INDEX run_comparisons_cross_run_unique
    ON run_comparisons (baseline_run_id, candidate_run_id)
    WHERE baseline_run_id <> candidate_run_id;

CREATE UNIQUE INDEX run_comparisons_same_run_agent_unique
    ON run_comparisons (baseline_run_id, candidate_run_id, baseline_run_agent_id, candidate_run_agent_id)
    WHERE baseline_run_id = candidate_run_id;

The UpsertRunComparison query and the SQLC-generated code would also need to be updated to conflict on the appropriate key depending on whether this is a same-run or cross-run comparison.

Prompt To Fix With AI
This is a comment left during a code review.
Path: backend/db/migrations/00030_same_run_agent_comparisons.sql
Line: 19-26

Comment:
**Same-run comparisons collide on the existing `UNIQUE (baseline_run_id, candidate_run_id)` constraint**

This migration drops and replaces the `CHECK` constraint but leaves the `UNIQUE (baseline_run_id, candidate_run_id)` index untouched (introduced in `00011_run_comparisons.sql`). For same-run comparisons `baseline_run_id = candidate_run_id`, so every distinct agent-pair comparison maps to the *same* key `(run_id, run_id)`.

The `UpsertRunComparison` query uses `ON CONFLICT (baseline_run_id, candidate_run_id) DO UPDATE SET baseline_run_agent_id = EXCLUDED.baseline_run_agent_id, ...`, which means a comparison of agents A vs B will be silently overwritten the moment agents A vs C are compared for the same run. The row stored in `run_comparisons` will always reflect only the *most recent* same-run request.

`GetRunComparisonByRunIDs` reads by `(baseline_run_id, candidate_run_id)` and is exposed through the repository interface, so any caller that uses that read path after a subsequent same-run comparison will receive stale data for the wrong agent pair. The migration needs to also extend the unique index to distinguish same-run comparisons by agent pair, e.g.:

```sql
ALTER TABLE run_comparisons DROP CONSTRAINT run_comparisons_baseline_run_id_candidate_run_id_key;

CREATE UNIQUE INDEX run_comparisons_cross_run_unique
    ON run_comparisons (baseline_run_id, candidate_run_id)
    WHERE baseline_run_id <> candidate_run_id;

CREATE UNIQUE INDEX run_comparisons_same_run_agent_unique
    ON run_comparisons (baseline_run_id, candidate_run_id, baseline_run_agent_id, candidate_run_agent_id)
    WHERE baseline_run_id = candidate_run_id;
```

The `UpsertRunComparison` query and the SQLC-generated code would also need to be updated to conflict on the appropriate key depending on whether this is a same-run or cross-run comparison.

How can I resolve this? If you propose a fix, please make it concise.

Fix in Codex

Comment on lines +258 to +273
func isCompareValidationError(err error) bool {
if errors.Is(err, ErrInvalidCompareRequest) {
return true
}
switch err.Error() {
case "baseline_run_id is required",
"candidate_run_id is required",
"baseline_run_id and candidate_run_id must differ",
"baseline and candidate run ids must differ",
"same-run comparison requires baseline and candidate run agent ids",
"baseline and candidate run agent ids must differ for same-run comparison":
return true
default:
return false
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 String-based fallback in isCompareValidationError contains stale entries and is fragile

The errors.Is(err, ErrInvalidCompareRequest) sentinel check correctly handles all errors produced by this file's own invalidCompareRequest() helper. The string-based switch below it is intended as a compatibility net for repository-layer plain fmt.Errorf strings, but two of the listed strings are no longer produced by any code path in the repo:

  • "baseline_run_id is required" — the API now wraps this before the service is called; the repo never returned it
  • "baseline_run_id and candidate_run_id must differ" — replaced with the wrapped form in this PR

These dead entries will never match and could mislead future readers. More importantly, the string-matching approach will silently stop working if error messages are renamed. Prefer wrapping repository validation errors with ErrInvalidCompareRequest (or a shared sentinel) at the service boundary so that the errors.Is path alone is sufficient.

Prompt To Fix With AI
This is a comment left during a code review.
Path: backend/internal/api/compare_reads.go
Line: 258-273

Comment:
**String-based fallback in `isCompareValidationError` contains stale entries and is fragile**

The `errors.Is(err, ErrInvalidCompareRequest)` sentinel check correctly handles all errors produced by this file's own `invalidCompareRequest()` helper. The string-based `switch` below it is intended as a compatibility net for repository-layer plain `fmt.Errorf` strings, but two of the listed strings are no longer produced by any code path in the repo:

- `"baseline_run_id is required"` — the API now wraps this before the service is called; the repo never returned it
- `"baseline_run_id and candidate_run_id must differ"` — replaced with the wrapped form in this PR

These dead entries will never match and could mislead future readers. More importantly, the string-matching approach will silently stop working if error messages are renamed. Prefer wrapping repository validation errors with `ErrInvalidCompareRequest` (or a shared sentinel) at the service boundary so that the `errors.Is` path alone is sufficient.

How can I resolve this? If you propose a fix, please make it concise.

Fix in Codex

# Conflicts:
#	backend/db/migrations/00030_same_run_agent_comparisons.sql
#	backend/internal/api/compare_reads.go
#	backend/internal/api/compare_reads_test.go
#	backend/internal/repository/repository_integration_test.go
#	backend/internal/repository/run_comparison.go
#	backend/internal/repository/run_comparison_test.go
@Atharva-Kanherkar Atharva-Kanherkar merged commit 9134809 into main Apr 29, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug(cli/api): eval scorecard can 500 when comparing two agents from the same run

1 participant