feat(eval): show mean score instead of pass/fail in report and viewer#534

Merged
shivammittal274 merged 1 commit into main from fix/eval-browser-context on Mar 23, 2026

Conversation

@shivammittal274 (Contributor)

No description provided.

shivammittal274 merged commit f14942c into main on Mar 23, 2026
9 of 10 checks passed

greptile-apps bot commented Mar 23, 2026

Greptile Summary

This PR replaces binary pass/fail reporting with a continuous mean score (0–100%) across the eval weekly-report generator and the task viewer. passRate is removed in favour of avgScore, color thresholds are updated (green ≥75%, orange ≥40%, red <40%), and the CLI trend bar is adjusted accordingly.

Key changes:

  • `RunSummary.passRate` removed; `avgScore` (already scaled to 0–100) is used everywhere: in the chart, stat cards, table, config detail card, tooltip, and console trend bar.
  • `resolveGrade` in viewer.html now returns a percentage label when the grader result carries a numeric `score` field, falling back to PASS/FAIL for older data.
  • Three issues found:
    1. `resolveGrade` emits only `pass`/`fail` CSS classes, silently skipping the neutral (orange) tier for 40–75% scores that is already defined in the stylesheet and used throughout weekly-report.ts.
    2. `resolveGrade` picks `keys[0]` by insertion order rather than the canonical `PASS_FAIL_GRADER_ORDER`, so the per-task score in the viewer may come from a different grader than the one the report averages.
    3. In weekly-report.ts, tasks whose `graderResults` carry only a `pass` boolean (older data) silently contribute 0 to the `scoreSum` while still incrementing `scoredCount`, potentially dragging down historical averages.
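To make the headline change concrete, the difference between the old pass rate and the new mean score can be sketched as follows (a minimal sketch with hypothetical types and helper names; the real implementation lives in weekly-report.ts):

```typescript
// Sketch of the passRate -> avgScore change. GraderResult, passRate, and
// avgScore are illustrative names, not the actual exports.
interface GraderResult {
  pass: boolean;
  score?: number; // 0..1; may be absent in older manifest records
}

// Old metric: fraction of graded tasks that passed, as a percentage.
function passRate(results: GraderResult[]): number {
  const passed = results.filter((r) => r.pass).length;
  return (passed / results.length) * 100;
}

// New metric: mean of the numeric scores, scaled to 0-100.
function avgScore(results: GraderResult[]): number {
  const scored = results.filter((r) => typeof r.score === "number");
  const sum = scored.reduce((acc, r) => acc + (r.score as number), 0);
  return (sum / scored.length) * 100;
}

const results: GraderResult[] = [
  { pass: true, score: 1.0 },
  { pass: false, score: 0.5 },
];
console.log(passRate(results)); // 50
console.log(avgScore(results)); // 75
```

A partially successful task (score 0.5) now contributes half credit instead of counting as a plain failure, which is the point of the change.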

Confidence Score: 4/5

  • Safe to merge after addressing the missing neutral color tier in viewer.html; the other two issues are low-risk in practice.
  • The core logic — replacing passRate with avgScore — is correct and consistent throughout weekly-report.ts. The three issues are bounded in scope: the neutral-class omission is a one-line fix, the grader-key ordering only matters for tasks with multiple graders, and the historical-data zero-score concern depends on whether older manifests without a score field still exist.
  • packages/browseros-agent/apps/eval/src/dashboard/viewer.html — missing neutral color tier and arbitrary grader key selection in resolveGrade.

Important Files Changed

| Filename | Overview |
| --- | --- |
| `packages/browseros-agent/apps/eval/scripts/weekly-report.ts` | Replaces pass/fail `passRate` with numeric `avgScore` throughout the report generator; the calculation is correct, but tasks missing a `score` field (older data) silently contribute 0 to the average while still being counted. |
| `packages/browseros-agent/apps/eval/src/dashboard/viewer.html` | New `resolveGrade` path shows a numeric score, but uses `keys[0]` (insertion order) instead of the canonical grader priority, and the color tier is missing the "neutral" (orange) class for 40–75% scores that exists in both the stylesheet and the weekly report. |

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Manifest task] --> B{graderResults present?}
    B -- No --> C[Skip task]
    B -- Yes --> D[Iterate PASS_FAIL_GRADER_ORDER]
    D --> E{Grader key found?}
    E -- No --> C
    E -- Yes --> F{score field present?}

    subgraph weekly-report.ts
        F -- Yes --> G["scoreSum += score; scoredCount++"]
        F -- "No (bug)" --> H["scoreSum += 0; scoredCount++ (skews avg)"]
        G --> I["avgScore = scoreSum / scoredCount * 100"]
    end

    subgraph viewer.html resolveGrade
        J["keys[0] (arbitrary order)"] --> K{"typeof score === 'number'?"}
        K -- Yes --> L{pct >= 75?}
        L -- Yes --> M["pass (green)"]
        L -- No --> N["fail (red; missing neutral tier)"]
        K -- No --> O["anyPass -> PASS / FAIL"]
    end

    I --> P["RunSummary.avgScore used in chart, table, stats"]
    A --> J
```


Comment on line 1228

```javascript
const score = graders[firstKey].score;
if (typeof score === 'number') {
  const pct = Math.round(score * 100);
  return { label: pct + '%', cls: pct >= 75 ? 'pass' : 'fail' };
```
**P1: Missing "neutral" class for mid-range scores**

`weekly-report.ts` uses three tiers: `pass` (≥75%), `neutral` (≥40%), `fail` (<40%). The new `resolveGrade` path in the viewer only emits `pass` or `fail`, so any score between 40% and 74% will appear red (fail) in the task list even though the weekly report charts it as orange (neutral). The `.neutral` CSS class is already defined in the viewer's stylesheet.

Suggested change:

```diff
-      return { label: pct + '%', cls: pct >= 75 ? 'pass' : 'fail' };
+      return { label: pct + '%', cls: pct >= 75 ? 'pass' : pct >= 40 ? 'neutral' : 'fail' };
```
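Pulled out as a standalone helper, the suggested three-tier mapping looks like this (hypothetical function name; the real logic is inline in `resolveGrade`):

```typescript
// Three-tier color class matching weekly-report.ts:
// pass (>=75%), neutral (>=40%), fail (<40%).
function scoreClass(pct: number): "pass" | "neutral" | "fail" {
  return pct >= 75 ? "pass" : pct >= 40 ? "neutral" : "fail";
}

console.log(scoreClass(80)); // "pass"
console.log(scoreClass(60)); // "neutral" (the tier the original code dropped)
console.log(scoreClass(30)); // "fail"
```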

Comment on lines +1224 to +1225
```javascript
const firstKey = keys[0];
const score = graders[firstKey].score;
```
**P2: Grader selection uses arbitrary key order instead of priority order**

`weekly-report.ts` iterates `PASS_FAIL_GRADER_ORDER` (`performance_grader` → `webvoyager_grader` → `fara_combined` → `fara_grader`) to pick the canonical grader score for each task. Here, `keys[0]` relies on `Object.keys()` insertion order, which may resolve to a different grader. When a task has results from multiple graders, the viewer can display a score from a different grader than the one that contributes to the report's `avgScore`, making the per-task numbers inconsistent with the aggregate.
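One way to mirror the report's ordering in the viewer is to walk the same priority list before falling back to insertion order (a sketch; the grader names come from the comment above, and `pickGraderKey` is a hypothetical helper):

```typescript
// Pick the canonical grader by priority rather than Object.keys() order.
const PASS_FAIL_GRADER_ORDER = [
  "performance_grader",
  "webvoyager_grader",
  "fara_combined",
  "fara_grader",
];

function pickGraderKey(graders: Record<string, unknown>): string | undefined {
  // Prefer the first priority-listed grader that is present; otherwise
  // fall back to whatever key happens to come first.
  return (
    PASS_FAIL_GRADER_ORDER.find((name) => name in graders) ??
    Object.keys(graders)[0]
  );
}

// Even though fara_grader was inserted first, the priority list wins.
const graders = { fara_grader: { score: 0.6 }, performance_grader: { score: 0.9 } };
console.log(pickGraderKey(graders)); // "performance_grader"
```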


Comment on lines 143 to 146
```typescript
if (task.graderResults[name]) {
  graded++
  if (task.graderResults[name].pass) passed++
  scoredCount++
  scoreSum += task.graderResults[name].score ?? 0
  break
```
**P2: Historical tasks without `score` field skew the average to zero**

`task.graderResults[name].score ?? 0` treats a missing `score` (e.g. older manifest records that only carry `pass: boolean`) as a score of `0`, while still incrementing `scoredCount`. This silently dilutes `avgScore` for any run that contains historical tasks.

By contrast, `viewer.html` correctly handles this case with `if (typeof score === 'number')` and falls back to the `pass`/`fail` path, meaning the two codepaths diverge for the same data.

Consider mirroring the viewer's check:

```typescript
const scoreVal = task.graderResults[name].score;
if (typeof scoreVal === 'number') {
  scoredCount++;
  scoreSum += scoreVal;
}
```
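To make the dilution concrete, here is a minimal reproduction comparing the two approaches on made-up data (`avgWithNullish` and `avgWithTypeCheck` are hypothetical names for the two strategies):

```typescript
// How `score ?? 0` dilutes the average when older records carry no score.
type Grader = { pass: boolean; score?: number };

function avgWithNullish(tasks: Grader[]): number {
  let sum = 0;
  for (const t of tasks) sum += t.score ?? 0; // missing score counted as 0
  return (sum / tasks.length) * 100;
}

function avgWithTypeCheck(tasks: Grader[]): number {
  let sum = 0;
  let scoredCount = 0;
  for (const t of tasks) {
    if (typeof t.score === "number") {
      sum += t.score;
      scoredCount++;
    }
  }
  return scoredCount === 0 ? 0 : (sum / scoredCount) * 100;
}

const tasks: Grader[] = [
  { pass: true, score: 1.0 },
  { pass: true }, // older record: pass boolean only, no score field
];
console.log(avgWithNullish(tasks)); // 50 (the passing legacy task drags the avg down)
console.log(avgWithTypeCheck(tasks)); // 100
```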
