feat(eval): show mean score instead of pass/fail in report and viewer#534

Merged
shivammittal274 merged 1 commit into main from fix/eval-browser-context on Mar 23, 2026

Conversation

@shivammittal274 (Contributor)

No description provided.

shivammittal274 merged commit f14942c into main on Mar 23, 2026
9 of 10 checks passed

greptile-apps bot commented Mar 23, 2026

Greptile Summary

This PR replaces binary pass/fail reporting with a continuous mean score (0–100%) across the eval weekly-report generator and the task viewer. passRate is removed in favour of avgScore, color thresholds are updated (green ≥75%, orange ≥40%, red <40%), and the CLI trend bar is adjusted accordingly.

Key changes:

  • `RunSummary.passRate` removed; `avgScore` (already scaled to 0–100) is used everywhere: in the chart, stat cards, table, config detail card, tooltip, and console trend bar.
  • `resolveGrade` in viewer.html now returns a percentage label when the grader result carries a numeric `score` field, falling back to PASS/FAIL for older data.
  • Three issues found:
    1. `resolveGrade` emits only `pass`/`fail` CSS classes, silently skipping the neutral (orange) tier for 40–75% scores that is already defined in the stylesheet and used throughout weekly-report.ts.
    2. `resolveGrade` picks `keys[0]` by insertion order rather than the canonical `PASS_FAIL_GRADER_ORDER`, so the per-task score in the viewer may come from a different grader than the one the report averages.
    3. In weekly-report.ts, tasks whose `graderResults` carry only a `pass` boolean (older data) silently contribute 0 to the `scoreSum` while still incrementing `scoredCount`, potentially dragging down historical averages.
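To make the headline change concrete, the difference between the old pass rate and the new mean score can be sketched as follows (a minimal sketch with hypothetical types and helper names; the real implementation lives in weekly-report.ts):

```typescript
// Sketch of the passRate -> avgScore change. GraderResult, passRate, and
// avgScore are illustrative names, not the actual exports.
interface GraderResult {
  pass: boolean;
  score?: number; // 0..1; may be absent in older manifest records
}

// Old metric: fraction of graded tasks that passed, as a percentage.
function passRate(results: GraderResult[]): number {
  const passed = results.filter((r) => r.pass).length;
  return (passed / results.length) * 100;
}

// New metric: mean of the numeric scores, scaled to 0-100.
function avgScore(results: GraderResult[]): number {
  const scored = results.filter((r) => typeof r.score === "number");
  const sum = scored.reduce((acc, r) => acc + (r.score as number), 0);
  return (sum / scored.length) * 100;
}

const results: GraderResult[] = [
  { pass: true, score: 1.0 },
  { pass: false, score: 0.5 },
];
console.log(passRate(results)); // 50
console.log(avgScore(results)); // 75
```

A partially successful task (score 0.5) now contributes half credit instead of counting as a plain failure, which is the point of the change.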

Confidence Score: 4/5

  • Safe to merge after addressing the missing neutral color tier in viewer.html; the other two issues are low-risk in practice.
  • The core logic — replacing passRate with avgScore — is correct and consistent throughout weekly-report.ts. The three issues are bounded in scope: the neutral-class omission is a one-line fix, the grader-key ordering only matters for tasks with multiple graders, and the historical-data zero-score concern depends on whether older manifests without a score field still exist.
  • packages/browseros-agent/apps/eval/src/dashboard/viewer.html — missing neutral color tier and arbitrary grader key selection in resolveGrade.

Important Files Changed

| Filename | Overview |
| --- | --- |
| `packages/browseros-agent/apps/eval/scripts/weekly-report.ts` | Replaces pass/fail `passRate` with numeric `avgScore` throughout the report generator; the calculation is correct, but tasks missing a `score` field (older data) silently contribute 0 to the average while still being counted. |
| `packages/browseros-agent/apps/eval/src/dashboard/viewer.html` | New `resolveGrade` path shows a numeric score, but uses `keys[0]` (insertion order) instead of the canonical grader priority, and the color tier is missing the "neutral" (orange) class for 40–75% scores that exists in both the stylesheet and the weekly report. |

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Manifest task] --> B{graderResults present?}
    B -- No --> C[Skip task]
    B -- Yes --> D[Iterate PASS_FAIL_GRADER_ORDER]
    D --> E{Grader key found?}
    E -- No --> C
    E -- Yes --> F{score field present?}

    subgraph weekly-report.ts
        F -- Yes --> G["scoreSum += score; scoredCount++"]
        F -- "No (bug)" --> H["scoreSum += 0; scoredCount++ (skews avg)"]
        G --> I["avgScore = scoreSum / scoredCount * 100"]
    end

    subgraph viewer.html resolveGrade
        J["keys[0] (arbitrary order)"] --> K{"typeof score === 'number'?"}
        K -- Yes --> L{pct >= 75?}
        L -- Yes --> M["pass (green)"]
        L -- No --> N["fail (red; missing neutral tier)"]
        K -- No --> O["anyPass -> PASS / FAIL"]
    end

    I --> P["RunSummary.avgScore used in chart, table, stats"]
    A --> J
```


Comment on line 1228

```javascript
const score = graders[firstKey].score;
if (typeof score === 'number') {
  const pct = Math.round(score * 100);
  return { label: pct + '%', cls: pct >= 75 ? 'pass' : 'fail' };
```
**P1: Missing "neutral" class for mid-range scores**

`weekly-report.ts` uses three tiers: `pass` (≥75%), `neutral` (≥40%), `fail` (<40%). The new `resolveGrade` path in the viewer only emits `pass` or `fail`, so any score between 40% and 74% will appear red (fail) in the task list even though the weekly report charts it as orange (neutral). The `.neutral` CSS class is already defined in the viewer's stylesheet.

Suggested change:

```diff
-      return { label: pct + '%', cls: pct >= 75 ? 'pass' : 'fail' };
+      return { label: pct + '%', cls: pct >= 75 ? 'pass' : pct >= 40 ? 'neutral' : 'fail' };
```
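Pulled out as a standalone helper, the suggested three-tier mapping looks like this (hypothetical function name; the real logic is inline in `resolveGrade`):

```typescript
// Three-tier color class matching weekly-report.ts:
// pass (>=75%), neutral (>=40%), fail (<40%).
function scoreClass(pct: number): "pass" | "neutral" | "fail" {
  return pct >= 75 ? "pass" : pct >= 40 ? "neutral" : "fail";
}

console.log(scoreClass(80)); // "pass"
console.log(scoreClass(60)); // "neutral" (the tier the original code dropped)
console.log(scoreClass(30)); // "fail"
```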

Comment on lines +1224 to +1225
```javascript
const firstKey = keys[0];
const score = graders[firstKey].score;
```
**P2: Grader selection uses arbitrary key order instead of priority order**

`weekly-report.ts` iterates `PASS_FAIL_GRADER_ORDER` (`performance_grader` → `webvoyager_grader` → `fara_combined` → `fara_grader`) to pick the canonical grader score for each task. Here, `keys[0]` relies on `Object.keys()` insertion order, which may resolve to a different grader. When a task has results from multiple graders, the viewer can display a score from a different grader than the one that contributes to the report's `avgScore`, making the per-task numbers inconsistent with the aggregate.
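One way to mirror the report's ordering in the viewer is to walk the same priority list before falling back to insertion order (a sketch; the grader names come from the comment above, and `pickGraderKey` is a hypothetical helper):

```typescript
// Pick the canonical grader by priority rather than Object.keys() order.
const PASS_FAIL_GRADER_ORDER = [
  "performance_grader",
  "webvoyager_grader",
  "fara_combined",
  "fara_grader",
];

function pickGraderKey(graders: Record<string, unknown>): string | undefined {
  // Prefer the first priority-listed grader that is present; otherwise
  // fall back to whatever key happens to come first.
  return (
    PASS_FAIL_GRADER_ORDER.find((name) => name in graders) ??
    Object.keys(graders)[0]
  );
}

// Even though fara_grader was inserted first, the priority list wins.
const graders = { fara_grader: { score: 0.6 }, performance_grader: { score: 0.9 } };
console.log(pickGraderKey(graders)); // "performance_grader"
```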


Comment on lines 143 to 146
```typescript
if (task.graderResults[name]) {
  graded++
  if (task.graderResults[name].pass) passed++
  scoredCount++
  scoreSum += task.graderResults[name].score ?? 0
  break
```
**P2: Historical tasks without `score` field skew the average to zero**

`task.graderResults[name].score ?? 0` treats a missing `score` (e.g. older manifest records that only carry `pass: boolean`) as a score of `0`, while still incrementing `scoredCount`. This silently dilutes `avgScore` for any run that contains historical tasks.

By contrast, `viewer.html` correctly handles this case with `if (typeof score === 'number')` and falls back to the `pass`/`fail` path, meaning the two codepaths diverge for the same data.

Consider mirroring the viewer's check:

```typescript
const scoreVal = task.graderResults[name].score;
if (typeof scoreVal === 'number') {
  scoredCount++;
  scoreSum += scoreVal;
}
```
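To make the dilution concrete, here is a minimal reproduction comparing the two approaches on made-up data (`avgWithNullish` and `avgWithTypeCheck` are hypothetical names for the two strategies):

```typescript
// How `score ?? 0` dilutes the average when older records carry no score.
type Grader = { pass: boolean; score?: number };

function avgWithNullish(tasks: Grader[]): number {
  let sum = 0;
  for (const t of tasks) sum += t.score ?? 0; // missing score counted as 0
  return (sum / tasks.length) * 100;
}

function avgWithTypeCheck(tasks: Grader[]): number {
  let sum = 0;
  let scoredCount = 0;
  for (const t of tasks) {
    if (typeof t.score === "number") {
      sum += t.score;
      scoredCount++;
    }
  }
  return scoredCount === 0 ? 0 : (sum / scoredCount) * 100;
}

const tasks: Grader[] = [
  { pass: true, score: 1.0 },
  { pass: true }, // older record: pass boolean only, no score field
];
console.log(avgWithNullish(tasks)); // 50 (the passing legacy task drags the avg down)
console.log(avgWithTypeCheck(tasks)); // 100
```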
