
[codex] Implement repeated-eval pass@k and comparison semantics #371

Merged
Atharva-Kanherkar merged 2 commits into main from codex/issue-362-passk-comparisons on Apr 20, 2026



@Atharva-Kanherkar
Collaborator

Summary

  • implement issue #362 (pass@k, pass^k, AgentScore, and statistically aware repeated-eval comparisons): repeated-eval aggregation semantics, including task_success, pass@k, pass^k, metric_routing, and composite AgentScore
  • derive task outcomes from persisted challenge judge_results with a scorecard fallback, and suppress noisy top-level comparison winners unless repeated-session evidence is clear
  • plumb optional aggregation.reliability_weight, update the eval-session read surface and OpenAPI contract, and add focused repository/API coverage

Locked design artifacts

  • research-docs/issues/issue-362-passk-comparison-plan.md
  • testing/codex-issue-362-passk-comparisons.md

Validation

  • cd backend && go test ./internal/repository ./internal/api
  • cd backend && go test ./...
  • cd backend && go vet ./...
  • npx @redocly/cli lint docs/api-server/openapi.yaml

Closes #362

@vercel

vercel Bot commented Apr 20, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| agentclash | Ready | Preview, Comment | Apr 20, 2026 7:30pm |

@greptile-apps
Contributor

greptile-apps Bot commented Apr 20, 2026

Greptile Summary

This PR implements repeated-eval pass@k / pass^k metrics, composite AgentScore, metric_routing, and comparison semantics for eval sessions (issue #362). It extends AggregateEvalSession to derive per-task success outcomes from persisted judge_results, compute Bernoulli-based pass metrics across all k-values, route the primary comparison metric based on inferred task properties or an optional manual reliability_weight override, and suppress noisy top-level comparison winners unless confidence intervals are non-overlapping.

Key changes:

  • eval_session_aggregation.go: adds buildEvalSessionAggregateBehavior, loadEvalSessionParticipantTaskOutcomes, buildEvalSessionTaskSuccess, buildEvalSessionPassMetricSeries, buildEvalSessionMetricRouting, resolveEvalSessionReliabilityWeight, and buildEvalSessionRepeatedComparison — all well-decomposed and independently testable
  • eval_sessions.go / eval_session_service.go: plumbs optional aggregation.reliability_weight (validated as float ∈ [0, 1]) through decode → input → snapshot
  • openapi.yaml: adds EvalSessionTaskSuccess, EvalSessionPassMetricSeries, EvalSessionMetricRouting, and EvalSessionRepeatedComparison schemas that align with the Go structs
  • Test coverage is thorough: unit tests validate pass metric math, k=1 equivalence, monotonicity invariants, suite fallback behaviour, and all three comparison outcomes (clear_winner, no_clear_winner, insufficient_evidence); integration tests exercise the full repository path with real judge-result records
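
To make the pass-metric invariants listed above concrete, here is a minimal sketch. It assumes the plain Bernoulli form, where a task with per-attempt success rate p has pass@k = 1 - (1 - p)^k and pass^k = p^k. The function names and the exact formula are assumptions for illustration; buildEvalSessionPassMetricSeries in the PR may compute these differently.

```go
package main

import (
	"fmt"
	"math"
)

// passAtK returns the probability that at least one of k independent
// attempts succeeds, given a per-attempt success probability p.
// Assumed Bernoulli form; not copied from the PR.
func passAtK(p float64, k int) float64 {
	return 1 - math.Pow(1-p, float64(k))
}

// passHatK returns the probability that all k independent attempts succeed.
func passHatK(p float64, k int) float64 {
	return math.Pow(p, float64(k))
}

func main() {
	p := 0.5 // hypothetical per-attempt task success rate
	for _, k := range []int{1, 2, 4} {
		fmt.Printf("k=%d pass@k=%.4f pass^k=%.4f\n", k, passAtK(p, k), passHatK(p, k))
	}
}
```

Under this form both metrics equal p at k=1, pass@k never decreases as k grows, and pass^k never increases, which matches the k=1 equivalence and monotonicity checks named in the review.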

Three minor observations:

  • intervalsOverlap uses <= so touching CIs count as overlapping (conservative but not the standard statistical definition — may surprise consumers near the boundary)
  • The metric_routing_mismatch early-return leaves Status: "insufficient_evidence", which is semantically misleading (though the branch is unreachable in practice since all participants share the same session behavior)
  • Scores are accumulated in deriveEvalSessionChallengeSuccess even when verdictSeen is true, then silently discarded — wasteful but harmless

Confidence Score: 4/5

Safe to merge with only non-blocking style observations remaining

The implementation is well-designed, thoroughly tested (unit + integration), and correctly wired through the API surface and OpenAPI contract. The three flagged items are all P2 style/defensive-code observations — none affect correctness, data integrity, or the primary user path. The metric_routing_mismatch branch is unreachable in the current design, intervalsOverlap is intentionally conservative, and the score-accumulation waste is trivial. No P0/P1 bugs found.

backend/internal/repository/eval_session_aggregation.go — contains all the new metric logic; the three observations above all live in this file

Important Files Changed

| Filename | Overview |
| --- | --- |
| backend/internal/repository/eval_session_aggregation.go | Core implementation of pass@k / pass^k metrics, metric routing, composite AgentScore, and comparison semantics. Large and well-structured; three minor style/logic observations flagged. |
| backend/internal/repository/eval_session_aggregation_test.go | Comprehensive unit tests covering pass metric math, monotonicity, manual vs inferred reliability weight, comparison outcomes, and suite fallback; no issues found. |
| backend/internal/repository/eval_session_aggregation_integration_test.go | Integration tests covering single-agent and two-participant comparison scenarios with real judge results; adds insertJudgeResultRecordWithVerdict and lookupChallengeIdentityID helpers. |
| backend/internal/api/eval_sessions.go | Decodes and validates aggregation.reliability_weight as an optional float in [0, 1]; validation logic and error codes are correct. |
| docs/api-server/openapi.yaml | Adds reliability_weight to aggregation config, and new schemas for EvalSessionTaskSuccess, EvalSessionPassMetricSeries, EvalSessionMetricRouting, and EvalSessionRepeatedComparison; schemas match implementation. |
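
The optional-float validation described for eval_sessions.go can be sketched roughly as follows. Only the reliability_weight field name and the [0, 1] range come from the PR; the struct, function, and error names are hypothetical, and the real handler's error codes are not shown here.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// aggregationConfig is a hypothetical request shape; only the
// reliability_weight field name is taken from the PR description.
type aggregationConfig struct {
	ReliabilityWeight *float64 `json:"reliability_weight,omitempty"`
}

var errReliabilityWeightRange = errors.New("aggregation.reliability_weight must be within [0, 1]")

// parseAggregation decodes the config and rejects out-of-range weights.
// A nil pointer means the field was omitted, so routing would be inferred.
func parseAggregation(raw []byte) (aggregationConfig, error) {
	var cfg aggregationConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return cfg, err
	}
	if w := cfg.ReliabilityWeight; w != nil && (*w < 0 || *w > 1) {
		return cfg, errReliabilityWeightRange
	}
	return cfg, nil
}

func main() {
	inputs := []string{`{}`, `{"reliability_weight":0.7}`, `{"reliability_weight":1.5}`}
	for _, raw := range inputs {
		cfg, err := parseAggregation([]byte(raw))
		fmt.Printf("%s set=%t err=%v\n", raw, cfg.ReliabilityWeight != nil, err)
	}
}
```

Using a pointer keeps "omitted" distinct from an explicit 0, which matters if an explicit weight of 0 is meant to override inferred routing.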

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[AggregateEvalSession] --> B[buildEvalSessionAggregateBehavior]
    B --> B1[KValues, EffectiveK, SuccessThreshold, ManualReliabilityWeight]
    A --> C[ListRunsByEvalSessionID]
    C --> D[For each run: GetRunScorecard]
    D --> F[buildEvalSessionAggregateParticipantSources]
    F --> G[loadEvalSessionParticipantTaskOutcomes]
    G --> H{judge_results available?}
    H -- yes --> I[deriveEvalSessionChallengeTaskOutcomes]
    H -- no --> J[deriveEvalSessionSuiteFallbackOutcome]
    I --> K[evalSessionTaskOutcomeAccumulator]
    J --> K
    K --> L[buildEvalSessionTaskSuccess]
    L --> M[buildEvalSessionPassMetricSeries]
    M --> N[resolveEvalSessionReliabilityWeight]
    N --> O[buildEvalSessionMetricRouting]
    O --> P{Participant count}
    P -- 1 --> Q[TopLevelSource: sole_participant]
    P -- 2+ --> R[buildEvalSessionRepeatedComparison]
    R --> S{Sufficient evidence?}
    S -- no --> W[Status: insufficient_evidence]
    S -- yes --> T{Intervals overlap?}
    T -- no --> U[Status: clear_winner]
    T -- yes --> V[Status: no_clear_winner]
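
The final branch of the flowchart (two or more participants) can be approximated by the sketch below. The interval type and the comparisonStatus function are hypothetical stand-ins, and treating a missing interval as insufficient evidence is an assumption; the PR's own evidence check is not shown in this conversation.

```go
package main

import "fmt"

// interval is a hypothetical stand-in for evalSessionAggregateInterval.
type interval struct{ Lower, Upper float64 }

// comparisonStatus maps the leader's and runner-up's confidence intervals
// to one of the three statuses named in the flowchart.
// Assumption: a missing interval means the evidence is insufficient.
func comparisonStatus(leader, runnerUp *interval) string {
	if leader == nil || runnerUp == nil {
		return "insufficient_evidence"
	}
	// Closed-interval overlap, matching the <= comparison used in the PR.
	if leader.Lower <= runnerUp.Upper && runnerUp.Lower <= leader.Upper {
		return "no_clear_winner"
	}
	return "clear_winner"
}

func main() {
	fmt.Println(comparisonStatus(&interval{0.7, 0.9}, &interval{0.4, 0.6}))
	fmt.Println(comparisonStatus(&interval{0.5, 0.8}, &interval{0.4, 0.6}))
	fmt.Println(comparisonStatus(nil, &interval{0.4, 0.6}))
}
```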

Reviews (1): Last reviewed commit: "issue-362: add pass@k comparison semanti..."

Comment on lines +1154 to +1159
func intervalsOverlap(left, right *evalSessionAggregateInterval) bool {
    if left == nil || right == nil {
        return true
    }
    return left.Lower <= right.Upper && right.Lower <= left.Upper
}
Contributor


P2 <= treats touching intervals as overlapping

intervalsOverlap uses <= for both bounds, meaning two CIs that share only a single endpoint (e.g., [0.4, 0.6] and [0.6, 0.8]) are classified as overlapping and produce no_clear_winner. For a standard statistics interpretation, touching intervals do not overlap. Using < instead would give clear_winner in that edge case:

Suggested change
 func intervalsOverlap(left, right *evalSessionAggregateInterval) bool {
     if left == nil || right == nil {
         return true
     }
-    return left.Lower <= right.Upper && right.Lower <= left.Upper
+    return left.Lower < right.Upper && right.Lower < left.Upper
 }

The current <= behaviour is deliberately conservative: it declines to declare a winner even when the two intervals share only a single boundary point, so it is a defensible design choice. Worth keeping in mind if consumers see unexpected no_clear_winner results for intervals that just touch.
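
The boundary case is easy to check in isolation. The sketch below uses a hypothetical interval type and two illustrative function names (overlapClosed, overlapStrict) to compare the two operators on the [0.4, 0.6] and [0.6, 0.8] example from the comment.

```go
package main

import "fmt"

// interval is a hypothetical stand-in for evalSessionAggregateInterval.
type interval struct{ Lower, Upper float64 }

// overlapClosed treats intervals as closed: a shared endpoint counts as overlap.
func overlapClosed(a, b interval) bool {
	return a.Lower <= b.Upper && b.Lower <= a.Upper
}

// overlapStrict uses strict comparisons: intervals that only touch do not overlap.
func overlapStrict(a, b interval) bool {
	return a.Lower < b.Upper && b.Lower < a.Upper
}

func main() {
	a, b := interval{0.4, 0.6}, interval{0.6, 0.8}
	fmt.Println(overlapClosed(a, b), overlapStrict(a, b)) // true false

	// Two identical zero-width intervals, e.g. both participants pinned at 1.0.
	c := interval{1.0, 1.0}
	fmt.Println(overlapClosed(c, c), overlapStrict(c, c)) // true false
}
```

One side effect of the strict form is that two identical zero-width intervals are reported as non-overlapping. If collapsed intervals can occur in practice, switching to < would need a guard for that case before it could safely drive a clear_winner decision.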

Comment on lines +1051 to +1055
    Status:     "insufficient_evidence",
    ReasonCode: "participant_metrics_unavailable",
    EffectiveK: effectiveK,
}
if len(comparables) < 2 {
Contributor


P2 Unreachable metric_routing_mismatch branch returns misleading status

When leader.Routing.PrimaryMetric != runnerUp.Routing.PrimaryMetric, the function returns early with comparison.Status still set to its initial value "insufficient_evidence", giving the response { status: "insufficient_evidence", reason_code: "metric_routing_mismatch" }. The status is semantically incorrect for a mismatch — it looks like a data problem rather than a routing divergence.

In the current design this branch is unreachable: all participants in a session share the same evalSessionAggregateBehavior, so PrimaryMetric is always identical. However, if per-participant reliability weights are ever introduced, this would silently produce a misleading status. Consider using a distinct status such as "comparison_not_possible", or at least documenting the assumption in place:

if leader.Routing.PrimaryMetric != runnerUp.Routing.PrimaryMetric {
    // currently unreachable: all participants share the same session behavior
    comparison.Status = "insufficient_evidence"
    comparison.ReasonCode = "metric_routing_mismatch"
    return comparison
}

Comment on lines +577 to +601
verdictSeen := false
scores := make([]float64, 0, len(results))

for _, result := range results {
    if result.Verdict != nil {
        verdict := strings.ToLower(strings.TrimSpace(*result.Verdict))
        if verdict != "" {
            verdictSeen = true
            if verdict != "pass" {
                return false, "judge_results_verdict", true
            }
        }
    }
    if result.NormalizedScore != nil {
        scores = append(scores, *result.NormalizedScore)
    }
}

if verdictSeen {
    return true, "judge_results_verdict", true
}
if len(scores) > 0 {
    return kahanMean(scores) >= threshold, "judge_results_threshold", true
}
return false, "", false
Contributor


P2 Accumulated scores slice is silently discarded when a verdict is seen

In deriveEvalSessionChallengeSuccess, NormalizedScore values are appended to scores even when a verdict is already set. After the loop, if verdictSeen == true the function returns before ever consulting scores, so every appended score is wasted allocation. This is harmless but confusing — a reader might think scores contribute to the outcome even alongside verdicts.

Consider skipping score accumulation when a verdict has already been seen:

if result.NormalizedScore != nil && !verdictSeen {
    scores = append(scores, *result.NormalizedScore)
}

Or add a comment clarifying the intent:

// scores is only consulted when no verdict is present; accumulated here for the common no-verdict path
if result.NormalizedScore != nil {
    scores = append(scores, *result.NormalizedScore)
}
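
A third option, sketched here under assumptions, is to separate the verdict scan from the score average so that no scores are collected unless they are needed. Note that the !verdictSeen guard still appends scores from any results that precede the first verdict. The judgeResult type, deriveSuccess, and ptr are hypothetical names, and a plain mean stands in for the PR's kahanMean helper, whose implementation is not shown.

```go
package main

import (
	"fmt"
	"strings"
)

// judgeResult is a hypothetical, trimmed-down record; the field names
// follow the snippet under review.
type judgeResult struct {
	Verdict         *string
	NormalizedScore *float64
}

// ptr is a small helper for building pointer fields in examples.
func ptr[T any](v T) *T { return &v }

// deriveSuccess gives verdicts precedence: any non-"pass" verdict fails the
// challenge, and one or more "pass" verdicts succeed. Scores are only read
// when no non-empty verdict exists.
func deriveSuccess(results []judgeResult, threshold float64) (bool, string, bool) {
	verdictSeen := false
	for _, r := range results {
		if r.Verdict == nil {
			continue
		}
		verdict := strings.ToLower(strings.TrimSpace(*r.Verdict))
		if verdict == "" {
			continue
		}
		if verdict != "pass" {
			return false, "judge_results_verdict", true
		}
		verdictSeen = true
	}
	if verdictSeen {
		return true, "judge_results_verdict", true
	}

	// No verdicts: fall back to the mean normalized score.
	// Plain mean used here in place of the PR's kahanMean (assumption).
	sum, n := 0.0, 0
	for _, r := range results {
		if r.NormalizedScore != nil {
			sum += *r.NormalizedScore
			n++
		}
	}
	if n > 0 {
		return sum/float64(n) >= threshold, "judge_results_threshold", true
	}
	return false, "", false
}

func main() {
	// A passing verdict wins even though the score is below the threshold.
	results := []judgeResult{{Verdict: ptr("pass"), NormalizedScore: ptr(0.1)}}
	success, source, ok := deriveSuccess(results, 0.5)
	fmt.Println(success, source, ok)
}
```

The second loop costs one extra pass over the results, but it makes the precedence rule explicit: a reader can see that scores never influence the outcome once any verdict is present.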

@Atharva-Kanherkar Atharva-Kanherkar merged commit 553be31 into main Apr 20, 2026
4 checks passed


Development

Successfully merging this pull request may close these issues.

pass@k, pass^k, AgentScore, and statistically aware repeated-eval comparisons
