[codex] Implement repeated-eval pass@k and comparison semantics by Atharva-Kanherkar · Pull Request #371 · agentclash/agentclash

Atharva-Kanherkar · 2026-04-20T19:07:23Z

Summary

implement the repeated-eval aggregation semantics from issue #362 (pass@k, pass^k, AgentScore, and statistically aware repeated-eval comparisons), including task_success, pass@k, pass^k, metric_routing, and composite AgentScore
derive task outcomes from persisted challenge judge_results with a scorecard fallback, and suppress noisy top-level comparison winners unless repeated-session evidence is clear
plumb optional aggregation.reliability_weight, update the eval-session read surface and OpenAPI contract, and add focused repository/API coverage

Locked design artifacts

research-docs/issues/issue-362-passk-comparison-plan.md
testing/codex-issue-362-passk-comparisons.md

Validation

cd backend && go test ./internal/repository ./internal/api
cd backend && go test ./...
cd backend && go vet ./...
npx @redocly/cli lint docs/api-server/openapi.yaml

Closes #362

vercel · 2026-04-20T19:07:28Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
agentclash	Ready	Preview, Comment	Apr 20, 2026 7:30pm

greptile-apps · 2026-04-20T19:23:11Z

Greptile Summary

This PR implements repeated-eval pass@k / pass^k metrics, composite AgentScore, metric_routing, and comparison semantics for eval sessions (issue #362). It extends AggregateEvalSession to derive per-task success outcomes from persisted judge_results, compute Bernoulli-based pass metrics across all k-values, route the primary comparison metric based on inferred task properties or an optional manual reliability_weight override, and suppress noisy top-level comparison winners unless confidence intervals are non-overlapping.

Key changes:

eval_session_aggregation.go: adds buildEvalSessionAggregateBehavior, loadEvalSessionParticipantTaskOutcomes, buildEvalSessionTaskSuccess, buildEvalSessionPassMetricSeries, buildEvalSessionMetricRouting, resolveEvalSessionReliabilityWeight, and buildEvalSessionRepeatedComparison — all well-decomposed and independently testable
eval_sessions.go / eval_session_service.go: plumbs optional aggregation.reliability_weight (validated as float ∈ [0, 1]) through decode → input → snapshot
openapi.yaml: adds EvalSessionTaskSuccess, EvalSessionPassMetricSeries, EvalSessionMetricRouting, and EvalSessionRepeatedComparison schemas that align with the Go structs
Test coverage is thorough: unit tests validate pass metric math, k=1 equivalence, monotonicity invariants, suite fallback behaviour, and all three comparison outcomes (clear_winner, no_clear_winner, insufficient_evidence); integration tests exercise the full repository path with real judge-result records
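The pass metric properties that the tests check (k=1 equivalence, monotonicity) can be illustrated with a minimal sketch. It assumes the plain Bernoulli plug-in form, where p is a task's observed success rate across repeated runs. `passAtK` and `passPowK` are hypothetical names, and the repository's actual estimator may differ.

```go
package main

import (
	"fmt"
	"math"
)

// passAtK estimates the chance that at least one of k independent attempts
// succeeds, given a per-attempt success rate p (Bernoulli assumption).
func passAtK(p float64, k int) float64 {
	return 1 - math.Pow(1-p, float64(k))
}

// passPowK estimates the chance that all k independent attempts succeed.
func passPowK(p float64, k int) float64 {
	return math.Pow(p, float64(k))
}

func main() {
	p := 0.5 // hypothetical: 5 successes out of 10 repeated runs
	for _, k := range []int{1, 2, 4} {
		fmt.Printf("k=%d pass@k=%.4f pass^k=%.4f\n", k, passAtK(p, k), passPowK(p, k))
	}
}
```

Under these assumptions both metrics equal p at k=1, pass@k never decreases as k grows, and pass^k never increases.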

Three minor observations:

intervalsOverlap uses <= so touching CIs count as overlapping (conservative but not the standard statistical definition — may surprise consumers near the boundary)
The metric_routing_mismatch early-return leaves Status: "insufficient_evidence", which is semantically misleading (though the branch is unreachable in practice since all participants share the same session behavior)
Scores are accumulated in deriveEvalSessionChallengeSuccess even when verdictSeen is true, then silently discarded — wasteful but harmless

Confidence Score: 4/5

Safe to merge with only non-blocking style observations remaining

The implementation is well-designed, thoroughly tested (unit + integration), and correctly wired through the API surface and OpenAPI contract. The three flagged items are all P2 style/defensive-code observations — none affect correctness, data integrity, or the primary user path. The metric_routing_mismatch branch is unreachable in the current design, intervalsOverlap is intentionally conservative, and the score-accumulation waste is trivial. No P0/P1 bugs found.

backend/internal/repository/eval_session_aggregation.go — contains all the new metric logic; the three observations above all live in this file

Important Files Changed

Filename	Overview
backend/internal/repository/eval_session_aggregation.go	Core implementation of pass@k / pass^k metrics, metric routing, composite AgentScore, and comparison semantics — large and well-structured; three minor style/logic observations flagged
backend/internal/repository/eval_session_aggregation_test.go	Comprehensive unit tests covering pass metric math, monotonicity, manual vs inferred reliability weight, comparison outcomes, and suite fallback — no issues found
backend/internal/repository/eval_session_aggregation_integration_test.go	Integration tests covering single-agent and two-participant comparison scenarios with real judge results; adds insertJudgeResultRecordWithVerdict and lookupChallengeIdentityID helpers
backend/internal/api/eval_sessions.go	Decodes and validates aggregation.reliability_weight as an optional float in [0, 1]; validation logic and error codes are correct
docs/api-server/openapi.yaml	Adds reliability_weight to aggregation config, and new schemas for EvalSessionTaskSuccess, EvalSessionPassMetricSeries, EvalSessionMetricRouting, and EvalSessionRepeatedComparison — schemas match implementation
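The optional aggregation.reliability_weight validation described for eval_sessions.go could look roughly like the sketch below. Only the field name and the [0, 1] range come from the PR; the struct, function, and error names are hypothetical.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// aggregationConfig is a hypothetical request fragment. The pointer lets the
// decoder distinguish "field omitted" (nil) from an explicit value.
type aggregationConfig struct {
	ReliabilityWeight *float64 `json:"reliability_weight,omitempty"`
}

var errReliabilityWeightRange = errors.New("aggregation.reliability_weight must be in [0, 1]")

// decodeAggregation parses the fragment and rejects out-of-range weights.
func decodeAggregation(raw []byte) (aggregationConfig, error) {
	var cfg aggregationConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return aggregationConfig{}, err
	}
	if w := cfg.ReliabilityWeight; w != nil && (*w < 0 || *w > 1) {
		return aggregationConfig{}, errReliabilityWeightRange
	}
	return cfg, nil
}

func main() {
	inputs := []string{`{}`, `{"reliability_weight": 0.7}`, `{"reliability_weight": 1.5}`}
	for _, raw := range inputs {
		cfg, err := decodeAggregation([]byte(raw))
		fmt.Println(raw, "set:", cfg.ReliabilityWeight != nil, "err:", err)
	}
}
```

Keeping the field as a pointer means an omitted weight stays distinguishable from an explicit 0, which matters if the inferred routing should apply only when no manual override is given.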

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[AggregateEvalSession] --> B[buildEvalSessionAggregateBehavior]
    B --> B1[KValues, EffectiveK, SuccessThreshold, ManualReliabilityWeight]
    A --> C[ListRunsByEvalSessionID]
    C --> D[For each run: GetRunScorecard]
    D --> F[buildEvalSessionAggregateParticipantSources]
    F --> G[loadEvalSessionParticipantTaskOutcomes]
    G --> H{judge_results available?}
    H -- yes --> I[deriveEvalSessionChallengeTaskOutcomes]
    H -- no --> J[deriveEvalSessionSuiteFallbackOutcome]
    I --> K[evalSessionTaskOutcomeAccumulator]
    J --> K
    K --> L[buildEvalSessionTaskSuccess]
    L --> M[buildEvalSessionPassMetricSeries]
    M --> N[resolveEvalSessionReliabilityWeight]
    N --> O[buildEvalSessionMetricRouting]
    O --> P{Participant count}
    P -- 1 --> Q[TopLevelSource: sole_participant]
    P -- 2+ --> R[buildEvalSessionRepeatedComparison]
    R --> S{Sufficient evidence?}
    S -- no --> W[Status: insufficient_evidence]
    S -- yes --> T{Intervals overlap?}
    T -- no --> U[Status: clear_winner]
    T -- yes --> V[Status: no_clear_winner]
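
The last three decision nodes of the flowchart can be sketched as a small function. This is an illustration under stated assumptions, not the repository code: `interval`, `participant`, and `compare` are hypothetical names, and "sufficient evidence" is assumed here to mean that both participants have a confidence interval.

```go
package main

import "fmt"

// interval is a hypothetical stand-in for evalSessionAggregateInterval.
type interval struct{ Lower, Upper float64 }

// participant holds a point estimate of the primary metric and, when enough
// repeated runs exist, a confidence interval around it.
type participant struct {
	Label    string
	Estimate float64
	CI       *interval
}

// intervalsOverlap mirrors the conservative check quoted in the review:
// a missing interval counts as overlapping, and touching bounds overlap.
func intervalsOverlap(a, b *interval) bool {
	if a == nil || b == nil {
		return true
	}
	return a.Lower <= b.Upper && b.Lower <= a.Upper
}

// compare returns a status and, only for clear_winner, the winning label.
func compare(a, b participant) (status, winner string) {
	// Assumption: evidence is sufficient only when both intervals exist.
	if a.CI == nil || b.CI == nil {
		return "insufficient_evidence", ""
	}
	if intervalsOverlap(a.CI, b.CI) {
		return "no_clear_winner", ""
	}
	if a.Estimate > b.Estimate {
		return "clear_winner", a.Label
	}
	return "clear_winner", b.Label
}

func main() {
	a := participant{"agent-a", 0.90, &interval{0.82, 0.96}}
	b := participant{"agent-b", 0.60, &interval{0.48, 0.71}}
	c := participant{"agent-c", 0.85, &interval{0.70, 0.93}}
	fmt.Println(compare(a, b))
	fmt.Println(compare(a, c))
	fmt.Println(compare(a, participant{Label: "agent-d", Estimate: 0.5}))
}
```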

Reviews (1): Last reviewed commit: "issue-362: add pass@k comparison semanti..."

greptile-apps · 2026-04-20T19:23:15Z

+func intervalsOverlap(left, right *evalSessionAggregateInterval) bool {
+	if left == nil || right == nil {
+		return true
+	}
+	return left.Lower <= right.Upper && right.Lower <= left.Upper
+}


<= treats touching intervals as overlapping

intervalsOverlap uses <= for both bounds, meaning two CIs that share only a single endpoint (e.g., [0.4, 0.6] and [0.6, 0.8]) are classified as overlapping and produce no_clear_winner. For a standard statistics interpretation, touching intervals do not overlap. Using < instead would give clear_winner in that edge case:

Suggested change

 func intervalsOverlap(left, right *evalSessionAggregateInterval) bool {
 	if left == nil || right == nil {
 		return true
 	}
-	return left.Lower <= right.Upper && right.Lower <= left.Upper
+	return left.Lower < right.Upper && right.Lower < left.Upper
 }

The current <= behaviour is deliberately conservative: it declines to claim a winner even when the two intervals share only a single boundary point, so it is a defensible design choice. Worth keeping in mind if consumers see unexpected no_clear_winner results for intervals that touch exactly.
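
The edge case can be checked directly with the example intervals from this comment. The sketch below uses a hypothetical `interval` type and two hypothetical helper names to put the inclusive and strict comparisons side by side.

```go
package main

import "fmt"

// interval is a hypothetical stand-in for evalSessionAggregateInterval.
type interval struct{ Lower, Upper float64 }

// overlapsInclusive treats a shared endpoint as an overlap (the <= form).
func overlapsInclusive(a, b interval) bool {
	return a.Lower <= b.Upper && b.Lower <= a.Upper
}

// overlapsStrict requires a shared region wider than one point (the < form).
func overlapsStrict(a, b interval) bool {
	return a.Lower < b.Upper && b.Lower < a.Upper
}

func main() {
	left := interval{0.4, 0.6}
	right := interval{0.6, 0.8}
	fmt.Println("inclusive:", overlapsInclusive(left, right)) // true
	fmt.Println("strict:", overlapsStrict(left, right))       // false
}
```

The two forms differ only when a bound of one interval exactly equals a bound of the other; for intervals with a gap or a real shared region they agree.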

greptile-apps · 2026-04-20T19:23:16Z

+		Status:     "insufficient_evidence",
+		ReasonCode: "participant_metrics_unavailable",
+		EffectiveK: effectiveK,
+	}
+	if len(comparables) < 2 {


Unreachable metric_routing_mismatch branch returns misleading status

When leader.Routing.PrimaryMetric != runnerUp.Routing.PrimaryMetric, the function returns early with comparison.Status still set to its initial value "insufficient_evidence", giving the response { status: "insufficient_evidence", reason_code: "metric_routing_mismatch" }. The status is semantically incorrect for a mismatch — it looks like a data problem rather than a routing divergence.

In the current design this branch is unreachable: all participants in a session share the same evalSessionAggregateBehavior, so PrimaryMetric is always identical. However, if per-participant reliability weights are ever introduced, this would silently produce a misleading status. Consider at least documenting the assumption, or using a distinct status such as "comparison_not_possible":

if leader.Routing.PrimaryMetric != runnerUp.Routing.PrimaryMetric {
	// currently unreachable: all participants share the same session behavior
	comparison.Status = "insufficient_evidence"
	comparison.ReasonCode = "metric_routing_mismatch"
	return comparison
}

greptile-apps · 2026-04-20T19:23:17Z

+	verdictSeen := false
+	scores := make([]float64, 0, len(results))
+
+	for _, result := range results {
+		if result.Verdict != nil {
+			verdict := strings.ToLower(strings.TrimSpace(*result.Verdict))
+			if verdict != "" {
+				verdictSeen = true
+				if verdict != "pass" {
+					return false, "judge_results_verdict", true
+				}
+			}
+		}
+		if result.NormalizedScore != nil {
+			scores = append(scores, *result.NormalizedScore)
+		}
+	}
+
+	if verdictSeen {
+		return true, "judge_results_verdict", true
+	}
+	if len(scores) > 0 {
+		return kahanMean(scores) >= threshold, "judge_results_threshold", true
+	}
+	return false, "", false


Accumulated scores slice is silently discarded when a verdict is seen

In deriveEvalSessionChallengeSuccess, NormalizedScore values are appended to scores even when a verdict is already set. After the loop, if verdictSeen == true the function returns before ever consulting scores, so every appended score is wasted allocation. This is harmless but confusing — a reader might think scores contribute to the outcome even alongside verdicts.

Consider skipping score accumulation when a verdict has already been seen:

if result.NormalizedScore != nil && !verdictSeen {
	scores = append(scores, *result.NormalizedScore)
}

Or add a comment clarifying the intent:

// scores is only consulted when no verdict is present; accumulated here for the common no-verdict path
if result.NormalizedScore != nil {
	scores = append(scores, *result.NormalizedScore)
}
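
For reference, the verdict-first logic with the first suggestion applied can be sketched as a self-contained program. `judgeResult` and `deriveSuccess` are simplified, hypothetical stand-ins for the repository types and for deriveEvalSessionChallengeSuccess, and the `kahanMean` body is an assumed compensated-summation implementation, since the PR excerpt shows only its call site.

```go
package main

import (
	"fmt"
	"strings"
)

// judgeResult is a hypothetical, trimmed-down judge result record.
type judgeResult struct {
	Verdict         *string
	NormalizedScore *float64
}

// kahanMean averages values with Kahan compensated summation (assumed body).
func kahanMean(values []float64) float64 {
	if len(values) == 0 {
		return 0
	}
	var sum, comp float64
	for _, v := range values {
		y := v - comp
		t := sum + y
		comp = (t - sum) - y
		sum = t
	}
	return sum / float64(len(values))
}

// deriveSuccess returns (success, source, found). Any non-empty verdict other
// than "pass" fails immediately; scores matter only if no verdict was seen.
func deriveSuccess(results []judgeResult, threshold float64) (bool, string, bool) {
	verdictSeen := false
	scores := make([]float64, 0, len(results))
	for _, r := range results {
		if r.Verdict != nil {
			verdict := strings.ToLower(strings.TrimSpace(*r.Verdict))
			if verdict != "" {
				verdictSeen = true
				if verdict != "pass" {
					return false, "judge_results_verdict", true
				}
			}
		}
		// Stop collecting scores once a verdict makes them irrelevant.
		if r.NormalizedScore != nil && !verdictSeen {
			scores = append(scores, *r.NormalizedScore)
		}
	}
	if verdictSeen {
		return true, "judge_results_verdict", true
	}
	if len(scores) > 0 {
		return kahanMean(scores) >= threshold, "judge_results_threshold", true
	}
	return false, "", false
}

func main() {
	pass, high, low := " PASS ", 0.9, 0.7
	fmt.Println(deriveSuccess([]judgeResult{{Verdict: &pass, NormalizedScore: &low}}, 0.95))
	fmt.Println(deriveSuccess([]judgeResult{{NormalizedScore: &high}, {NormalizedScore: &low}}, 0.75))
	fmt.Println(deriveSuccess(nil, 0.75))
}
```

Scores appended before the first verdict are still discarded in this sketch; the guard only avoids further appends once a verdict has been seen.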

issue-362: add pass@k comparison semantics

855beb9

vercel Bot deployed to Preview April 20, 2026 19:08 View deployment

Atharva-Kanherkar marked this pull request as ready for review April 20, 2026 19:08

greptile-apps Bot reviewed Apr 20, 2026

View reviewed changes

issue-362: address greptile aggregation nits

749c51c

vercel Bot deployed to Preview April 20, 2026 19:30 View deployment

Atharva-Kanherkar merged commit 553be31 into main Apr 20, 2026
4 checks passed

Atharva-Kanherkar merged 2 commits into main from codex/issue-362-passk-comparisons
