[codex] Implement repeated-eval pass@k and comparison semantics#371
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
Greptile SummaryThis PR implements repeated-eval Key changes:
Three minor observations:
Confidence Score: 4/5Safe to merge with only non-blocking style observations remaining The implementation is well-designed, thoroughly tested (unit + integration), and correctly wired through the API surface and OpenAPI contract. The three flagged items are all P2 style/defensive-code observations — none affect correctness, data integrity, or the primary user path. The metric_routing_mismatch branch is unreachable in the current design, intervalsOverlap is intentionally conservative, and the score-accumulation waste is trivial. No P0/P1 bugs found. backend/internal/repository/eval_session_aggregation.go — contains all the new metric logic; the three observations above all live in this file Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[AggregateEvalSession] --> B[buildEvalSessionAggregateBehavior]
B --> B1[KValues, EffectiveK, SuccessThreshold, ManualReliabilityWeight]
A --> C[ListRunsByEvalSessionID]
C --> D[For each run: GetRunScorecard]
D --> F[buildEvalSessionAggregateParticipantSources]
F --> G[loadEvalSessionParticipantTaskOutcomes]
G --> H{judge_results available?}
H -- yes --> I[deriveEvalSessionChallengeTaskOutcomes]
H -- no --> J[deriveEvalSessionSuiteFallbackOutcome]
I --> K[evalSessionTaskOutcomeAccumulator]
J --> K
K --> L[buildEvalSessionTaskSuccess]
L --> M[buildEvalSessionPassMetricSeries]
M --> N[resolveEvalSessionReliabilityWeight]
N --> O[buildEvalSessionMetricRouting]
O --> P{Participant count}
P -- 1 --> Q[TopLevelSource: sole_participant]
P -- 2+ --> R[buildEvalSessionRepeatedComparison]
R --> S{Sufficient evidence?}
S -- no --> W[Status: insufficient_evidence]
S -- yes --> T{Intervals overlap?}
T -- no --> U[Status: clear_winner]
T -- yes --> V[Status: no_clear_winner]
Reviews (1): Last reviewed commit: "issue-362: add pass@k comparison semanti..." | Re-trigger Greptile |
| func intervalsOverlap(left, right *evalSessionAggregateInterval) bool { | ||
| if left == nil || right == nil { | ||
| return true | ||
| } | ||
| return left.Lower <= right.Upper && right.Lower <= left.Upper | ||
| } |
There was a problem hiding this comment.
<= treats touching intervals as overlapping
intervalsOverlap uses <= for both bounds, meaning two CIs that share only a single endpoint (e.g., [0.4, 0.6] and [0.6, 0.8]) are classified as overlapping and produce no_clear_winner. For a standard statistics interpretation, touching intervals do not overlap. Using < instead would give clear_winner in that edge case:
| func intervalsOverlap(left, right *evalSessionAggregateInterval) bool { | |
| if left == nil || right == nil { | |
| return true | |
| } | |
| return left.Lower <= right.Upper && right.Lower <= left.Upper | |
| } | |
| func intervalsOverlap(left, right *evalSessionAggregateInterval) bool { | |
| if left == nil || right == nil { | |
| return true | |
| } | |
| return left.Lower < right.Upper && right.Lower < left.Upper | |
| } |
The current <= behaviour is deliberately conservative — it avoids claiming a winner when there is zero probability mass in common, so it is a defensible design choice. Worth keeping in mind if consumers see unexpected no_clear_winner results for near-touching intervals.
| Status: "insufficient_evidence", | ||
| ReasonCode: "participant_metrics_unavailable", | ||
| EffectiveK: effectiveK, | ||
| } | ||
| if len(comparables) < 2 { |
There was a problem hiding this comment.
Unreachable
metric_routing_mismatch branch returns misleading status
When leader.Routing.PrimaryMetric != runnerUp.Routing.PrimaryMetric, the function returns early with comparison.Status still set to its initial value "insufficient_evidence", giving the response { status: "insufficient_evidence", reason_code: "metric_routing_mismatch" }. The status is semantically incorrect for a mismatch — it looks like a data problem rather than a routing divergence.
In the current design this branch is unreachable: all participants in a session share the same evalSessionAggregateBehavior, so PrimaryMetric is always identical. However, if per-participant reliability weights are ever introduced, this would silently produce a misleading status. Consider at least documenting the assumption, or using a distinct status such as "comparison_not_possible":
if leader.Routing.PrimaryMetric != runnerUp.Routing.PrimaryMetric {
// currently unreachable: all participants share the same session behavior
comparison.Status = "insufficient_evidence"
comparison.ReasonCode = "metric_routing_mismatch"
return comparison
}| verdictSeen := false | ||
| scores := make([]float64, 0, len(results)) | ||
|
|
||
| for _, result := range results { | ||
| if result.Verdict != nil { | ||
| verdict := strings.ToLower(strings.TrimSpace(*result.Verdict)) | ||
| if verdict != "" { | ||
| verdictSeen = true | ||
| if verdict != "pass" { | ||
| return false, "judge_results_verdict", true | ||
| } | ||
| } | ||
| } | ||
| if result.NormalizedScore != nil { | ||
| scores = append(scores, *result.NormalizedScore) | ||
| } | ||
| } | ||
|
|
||
| if verdictSeen { | ||
| return true, "judge_results_verdict", true | ||
| } | ||
| if len(scores) > 0 { | ||
| return kahanMean(scores) >= threshold, "judge_results_threshold", true | ||
| } | ||
| return false, "", false |
There was a problem hiding this comment.
Accumulated
scores slice is silently discarded when a verdict is seen
In deriveEvalSessionChallengeSuccess, NormalizedScore values are appended to scores even when a verdict is already set. After the loop, if verdictSeen == true the function returns before ever consulting scores, so every appended score is wasted allocation. This is harmless but confusing — a reader might think scores contribute to the outcome even alongside verdicts.
Consider skipping score accumulation when a verdict has already been seen:
if result.NormalizedScore != nil && !verdictSeen {
scores = append(scores, *result.NormalizedScore)
}Or add a comment clarifying the intent:
// scores is only consulted when no verdict is present; accumulated here for the common no-verdict path
if result.NormalizedScore != nil {
scores = append(scores, *result.NormalizedScore)
}
Summary
task_success,pass@k,pass^k,metric_routing, and compositeAgentScorejudge_resultswith a scorecard fallback, and suppress noisy top-level comparison winners unless repeated-session evidence is clearaggregation.reliability_weight, update the eval-session read surface and OpenAPI contract, and add focused repository/API coverageLocked design artifacts
research-docs/issues/issue-362-passk-comparison-plan.mdtesting/codex-issue-362-passk-comparisons.mdValidation
cd backend && go test ./internal/repository ./internal/apicd backend && go test ./...cd backend && go vet ./...npx @redocly/cli lint docs/api-server/openapi.yamlCloses #362