[codex] Fix same-run agent scorecard comparison by Atharva-Kanherkar · Pull Request #429 · agentclash/agentclash

Atharva-Kanherkar · 2026-04-28T21:12:36Z

Summary

Fixes #424.

This fixes the compare path used by agentclash eval scorecard when the bookmarked baseline and selected candidate are two different agents from the same multi-agent run. That workflow should compare the two run agents, not fail before explicit agent IDs are considered.

What changed

Allows same baseline_run_id and candidate_run_id when explicit baseline/candidate run-agent IDs are present and different.
Rejects same-run same-agent and same-run ambiguous comparisons as 400 invalid_compare_request instead of 500 internal_error.
Adds a DB migration replacing the old baseline_run_id <> candidate_run_id check with a distinct-run-or-distinct-agent check.
Adds review-checkpoint contract coverage under testing/.

Validation

cd backend && go test ./internal/api -run 'TestCompare|TestGetRunComparison'\n- cd backend && go test ./internal/repository -run 'TestValidateBuildRunComparisonParams|TestRepositoryBuildRunComparison'\n- cd cli && go test ./cmd -run 'TestEvalScorecard|TestCompareRuns'\n- cd backend && go test ./internal/api ./internal/repository\n- cd cli && go test ./cmd\n\nNote: repository integration tests that require DATABASE_URL are skipped locally in this environment; the same-run integration case is included for database-backed CI validation.

vercel · 2026-04-28T21:12:40Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
agentclash	Ready	Preview, Comment	Apr 29, 2026 6:50am

greptile-apps · 2026-04-28T21:24:09Z

Greptile Summary

This PR fixes the HTTP 500 that occurred when eval scorecard compared two different agents from the same multi-agent run. It adds same-run/different-agent validation in both the API manager and the repository layer, maps those validation errors to HTTP 400, and ships a DB migration relaxing the old baseline_run_id <> candidate_run_id CHECK constraint.

P1 — migration incomplete: The new CHECK constraint is correct, but the existing UNIQUE (baseline_run_id, candidate_run_id) index (from migration 00011) is unchanged, and the UpsertRunComparison query conflicts on that same key. For a same-run comparison baseline_run_id = candidate_run_id = run_id, so every agent-pair comparison for a given run collides at (run_id, run_id). A second same-run comparison (agents A vs C) will silently overwrite the first (agents A vs B), leaving only the latest comparison in the DB. GetRunComparisonByRunIDs — exposed through the repository interface — would return the wrong agent-pair result for any caller that uses it after a subsequent same-run comparison.

Confidence Score: 3/5

Not safe to merge as-is: the migration enables the new code path but leaves a data-integrity gap that will silently corrupt the stored comparison for same-run multi-agent scenarios.

The application-layer and test changes are correct and well-covered. However, the DB migration fixes only the CHECK constraint while the UNIQUE index and the upsert conflict key remain keyed on (baseline_run_id, candidate_run_id) — the same key for every same-run comparison. This means concurrent or sequential same-run comparisons for different agent pairs will overwrite each other in the database, which is the exact scenario this PR enables and needs to handle correctly.

backend/db/migrations/00030_same_run_agent_comparisons.sql and backend/db/queries/run_comparisons.sql (upsert conflict key) need attention before merging.

Important Files Changed

Filename	Overview
backend/db/migrations/00030_same_run_agent_comparisons.sql	Replaces the old CHECK constraint with one that allows same-run/different-agent comparisons, but does not update the UNIQUE (baseline_run_id, candidate_run_id) constraint — causing same-run comparisons for different agent pairs to silently overwrite each other.
backend/internal/api/compare_reads.go	Adds ErrInvalidCompareRequest sentinel, validation logic for same-run/different-agent path, and HTTP 400 mapping in the handler; isCompareValidationError has a fragile string-based fallback with two stale entries.
backend/internal/api/compare_reads_test.go	Adds three manager-level tests and one handler test covering the new same-run validation paths; coverage is good.
backend/internal/repository/run_comparison.go	Adds validateBuildRunComparisonParams with correct same-run/same-agent rejection; upsert logic unchanged and still keyed on (baseline_run_id, candidate_run_id), which collides for same-run different-agent pairs.
backend/internal/repository/run_comparison_test.go	Adds unit tests for all three validateBuildRunComparisonParams branches (allow different agents, reject same agent, reject missing agent IDs).
backend/internal/repository/repository_integration_test.go	Adds a DB-backed integration test for same-run/different-agent BuildRunComparison; guarded by DATABASE_URL so it runs in CI only.
testing/issue-424-eval-scorecard-same-run-500.md	Review-checkpoint document describing contract expectations and validation test commands for issue #424.

Sequence Diagram

sequenceDiagram
    participant CLI as CLI / Client
    participant H as HTTP Handler
    participant M as CompareReadManager
    participant R as Repository
    participant DB as Postgres

    CLI->>H: GET /v1/compare?baseline_run_id=X&candidate_run_id=X&baseline_run_agent_id=A&candidate_run_agent_id=B
    H->>H: compareInputFromRequest (parse UUIDs)
    H->>M: GetRunComparison(input)
    M->>M: validate: same run + both agent IDs present + IDs differ
    alt validation fails (same agent or missing IDs)
        M-->>H: invalidCompareRequest error (wraps ErrInvalidCompareRequest)
        H-->>CLI: 400 invalid_compare_request
    end
    M->>R: GetRunByID(X) x2
    R-->>M: Run{X}
    M->>M: AuthorizeWorkspace
    M->>R: BuildRunComparison(X, X, A, B)
    R->>R: validateBuildRunComparisonParams (same-run guard)
    R->>DB: ListRunAgentsByRunID(X)
    DB-->>R: [agentA, agentB, ...]
    R->>R: resolveRunComparisonParticipants (pin A and B)
    R->>DB: load scorecards, replays, judge results for A and B
    R->>R: buildComparableRunComparisonSummary
    R->>DB: UpsertRunComparison ON CONFLICT (baseline_run_id, candidate_run_id) overwrites any prior same-run row
    DB-->>R: RunComparison row (A vs B)
    R-->>M: RunComparison
    M-->>H: GetRunComparisonResult
    H-->>CLI: 200 JSON with state/deltas

Prompt To Fix All With AI

This is a comment left during a code review.
Path: backend/db/migrations/00030_same_run_agent_comparisons.sql
Line: 19-26

Comment:
**Same-run comparisons collide on the existing `UNIQUE (baseline_run_id, candidate_run_id)` constraint**

This migration drops and replaces the `CHECK` constraint but leaves the `UNIQUE (baseline_run_id, candidate_run_id)` index untouched (introduced in `00011_run_comparisons.sql`). For same-run comparisons `baseline_run_id = candidate_run_id`, so every distinct agent-pair comparison maps to the *same* key `(run_id, run_id)`.

The `UpsertRunComparison` query uses `ON CONFLICT (baseline_run_id, candidate_run_id) DO UPDATE SET baseline_run_agent_id = EXCLUDED.baseline_run_agent_id, ...`, which means a comparison of agents A vs B will be silently overwritten the moment agents A vs C are compared for the same run. The row stored in `run_comparisons` will always reflect only the *most recent* same-run request.

`GetRunComparisonByRunIDs` reads by `(baseline_run_id, candidate_run_id)` and is exposed through the repository interface, so any caller that uses that read path after a subsequent same-run comparison will receive stale data for the wrong agent pair. The migration needs to also extend the unique index to distinguish same-run comparisons by agent pair, e.g.:

```sql
ALTER TABLE run_comparisons DROP CONSTRAINT run_comparisons_baseline_run_id_candidate_run_id_key;

CREATE UNIQUE INDEX run_comparisons_cross_run_unique
    ON run_comparisons (baseline_run_id, candidate_run_id)
    WHERE baseline_run_id <> candidate_run_id;

CREATE UNIQUE INDEX run_comparisons_same_run_agent_unique
    ON run_comparisons (baseline_run_id, candidate_run_id, baseline_run_agent_id, candidate_run_agent_id)
    WHERE baseline_run_id = candidate_run_id;
```

The `UpsertRunComparison` query and the SQLC-generated code would also need to be updated to conflict on the appropriate key depending on whether this is a same-run or cross-run comparison.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: backend/internal/api/compare_reads.go
Line: 258-273

Comment:
**String-based fallback in `isCompareValidationError` contains stale entries and is fragile**

The `errors.Is(err, ErrInvalidCompareRequest)` sentinel check correctly handles all errors produced by this file's own `invalidCompareRequest()` helper. The string-based `switch` below it is intended as a compatibility net for repository-layer plain `fmt.Errorf` strings, but two of the listed strings are no longer produced by any code path in the repo:

- `"baseline_run_id is required"` — the API now wraps this before the service is called; the repo never returned it
- `"baseline_run_id and candidate_run_id must differ"` — replaced with the wrapped form in this PR

These dead entries will never match and could mislead future readers. More importantly, the string-matching approach will silently stop working if error messages are renamed. Prefer wrapping repository validation errors with `ErrInvalidCompareRequest` (or a shared sentinel) at the service boundary so that the `errors.Is` path alone is sufficient.

How can I resolve this? If you propose a fix, please make it concise.

_{Reviews (1): Last reviewed commit: "fix: allow same-run agent comparisons" | Re-trigger Greptile}

greptile-apps · 2026-04-28T21:24:13Z

+ADD CONSTRAINT run_comparisons_distinct_runs_or_agents_check
+CHECK (
+    baseline_run_id <> candidate_run_id
+    OR (
+        baseline_run_agent_id IS NOT NULL
+        AND candidate_run_agent_id IS NOT NULL
+        AND baseline_run_agent_id <> candidate_run_agent_id
+    )


Same-run comparisons collide on the existing UNIQUE (baseline_run_id, candidate_run_id) constraint

This migration drops and replaces the CHECK constraint but leaves the UNIQUE (baseline_run_id, candidate_run_id) index untouched (introduced in 00011_run_comparisons.sql). For same-run comparisons baseline_run_id = candidate_run_id, so every distinct agent-pair comparison maps to the same key (run_id, run_id).

The UpsertRunComparison query uses ON CONFLICT (baseline_run_id, candidate_run_id) DO UPDATE SET baseline_run_agent_id = EXCLUDED.baseline_run_agent_id, ..., which means a comparison of agents A vs B will be silently overwritten the moment agents A vs C are compared for the same run. The row stored in run_comparisons will always reflect only the most recent same-run request.

GetRunComparisonByRunIDs reads by (baseline_run_id, candidate_run_id) and is exposed through the repository interface, so any caller that uses that read path after a subsequent same-run comparison will receive stale data for the wrong agent pair. The migration needs to also extend the unique index to distinguish same-run comparisons by agent pair, e.g.:

ALTER TABLE run_comparisons DROP CONSTRAINT run_comparisons_baseline_run_id_candidate_run_id_key; CREATE UNIQUE INDEX run_comparisons_cross_run_unique ON run_comparisons (baseline_run_id, candidate_run_id) WHERE baseline_run_id <> candidate_run_id; CREATE UNIQUE INDEX run_comparisons_same_run_agent_unique ON run_comparisons (baseline_run_id, candidate_run_id, baseline_run_agent_id, candidate_run_agent_id) WHERE baseline_run_id = candidate_run_id;

The UpsertRunComparison query and the SQLC-generated code would also need to be updated to conflict on the appropriate key depending on whether this is a same-run or cross-run comparison.

Prompt To Fix With AI

This is a comment left during a code review. Path: backend/db/migrations/00030_same_run_agent_comparisons.sql Line: 19-26 Comment: **Same-run comparisons collide on the existing `UNIQUE (baseline_run_id, candidate_run_id)` constraint** This migration drops and replaces the `CHECK` constraint but leaves the `UNIQUE (baseline_run_id, candidate_run_id)` index untouched (introduced in `00011_run_comparisons.sql`). For same-run comparisons `baseline_run_id = candidate_run_id`, so every distinct agent-pair comparison maps to the *same* key `(run_id, run_id)`. The `UpsertRunComparison` query uses `ON CONFLICT (baseline_run_id, candidate_run_id) DO UPDATE SET baseline_run_agent_id = EXCLUDED.baseline_run_agent_id, ...`, which means a comparison of agents A vs B will be silently overwritten the moment agents A vs C are compared for the same run. The row stored in `run_comparisons` will always reflect only the *most recent* same-run request. `GetRunComparisonByRunIDs` reads by `(baseline_run_id, candidate_run_id)` and is exposed through the repository interface, so any caller that uses that read path after a subsequent same-run comparison will receive stale data for the wrong agent pair. The migration needs to also extend the unique index to distinguish same-run comparisons by agent pair, e.g.: ```sql ALTER TABLE run_comparisons DROP CONSTRAINT run_comparisons_baseline_run_id_candidate_run_id_key; CREATE UNIQUE INDEX run_comparisons_cross_run_unique ON run_comparisons (baseline_run_id, candidate_run_id) WHERE baseline_run_id <> candidate_run_id; CREATE UNIQUE INDEX run_comparisons_same_run_agent_unique ON run_comparisons (baseline_run_id, candidate_run_id, baseline_run_agent_id, candidate_run_agent_id) WHERE baseline_run_id = candidate_run_id; ``` The `UpsertRunComparison` query and the SQLC-generated code would also need to be updated to conflict on the appropriate key depending on whether this is a same-run or cross-run comparison. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-04-28T21:24:14Z

+func isCompareValidationError(err error) bool {
+	if errors.Is(err, ErrInvalidCompareRequest) {
+		return true
+	}
+	switch err.Error() {
+	case "baseline_run_id is required",
+		"candidate_run_id is required",
+		"baseline_run_id and candidate_run_id must differ",
+		"baseline and candidate run ids must differ",
+		"same-run comparison requires baseline and candidate run agent ids",
+		"baseline and candidate run agent ids must differ for same-run comparison":
+		return true
+	default:
+		return false
+	}
+}


String-based fallback in isCompareValidationError contains stale entries and is fragile

The errors.Is(err, ErrInvalidCompareRequest) sentinel check correctly handles all errors produced by this file's own invalidCompareRequest() helper. The string-based switch below it is intended as a compatibility net for repository-layer plain fmt.Errorf strings, but two of the listed strings are no longer produced by any code path in the repo:

"baseline_run_id is required" — the API now wraps this before the service is called; the repo never returned it

"baseline_run_id and candidate_run_id must differ" — replaced with the wrapped form in this PR

These dead entries will never match and could mislead future readers. More importantly, the string-matching approach will silently stop working if error messages are renamed. Prefer wrapping repository validation errors with ErrInvalidCompareRequest (or a shared sentinel) at the service boundary so that the errors.Is path alone is sufficient.

Prompt To Fix With AI

This is a comment left during a code review. Path: backend/internal/api/compare_reads.go Line: 258-273 Comment: **String-based fallback in `isCompareValidationError` contains stale entries and is fragile** The `errors.Is(err, ErrInvalidCompareRequest)` sentinel check correctly handles all errors produced by this file's own `invalidCompareRequest()` helper. The string-based `switch` below it is intended as a compatibility net for repository-layer plain `fmt.Errorf` strings, but two of the listed strings are no longer produced by any code path in the repo: - `"baseline_run_id is required"` — the API now wraps this before the service is called; the repo never returned it - `"baseline_run_id and candidate_run_id must differ"` — replaced with the wrapped form in this PR These dead entries will never match and could mislead future readers. More importantly, the string-matching approach will silently stop working if error messages are renamed. Prefer wrapping repository validation errors with `ErrInvalidCompareRequest` (or a shared sentinel) at the service boundary so that the `errors.Is` path alone is sufficient. How can I resolve this? If you propose a fix, please make it concise.

# Conflicts: # backend/db/migrations/00030_same_run_agent_comparisons.sql # backend/internal/api/compare_reads.go # backend/internal/api/compare_reads_test.go # backend/internal/repository/repository_integration_test.go # backend/internal/repository/run_comparison.go # backend/internal/repository/run_comparison_test.go

fix: allow same-run agent comparisons

2b2fdce

vercel Bot deployed to Preview April 28, 2026 21:15 View deployment

Atharva-Kanherkar marked this pull request as ready for review April 28, 2026 21:19

greptile-apps Bot reviewed Apr 28, 2026

View reviewed changes

fix: isolate same-run comparison upserts

df2096b

vercel Bot deployed to Preview April 28, 2026 21:33 View deployment

vercel Bot deployed to Preview April 29, 2026 06:50 View deployment

Atharva-Kanherkar merged commit 9134809 into main Apr 29, 2026
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[codex] Fix same-run agent scorecard comparison#429

[codex] Fix same-run agent scorecard comparison#429
Atharva-Kanherkar merged 3 commits into
mainfrom
issue-424-eval-scorecard-same-run-500

Atharva-Kanherkar commented Apr 28, 2026

Uh oh!

vercel Bot commented Apr 28, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot commented Apr 28, 2026

Important Files Changed

Uh oh!

greptile-apps Bot Apr 28, 2026

Uh oh!

greptile-apps Bot Apr 28, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Atharva-Kanherkar commented Apr 28, 2026

Summary

What changed

Validation

Uh oh!

vercel Bot commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

greptile-apps Bot commented Apr 28, 2026

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps Bot Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

vercel Bot commented Apr 28, 2026 •

edited

Loading