[feature] add golden-answer benchmark workflow across CLI and web #3
Open
keyur-prabhu-glean wants to merge 4 commits into
Conversation
added 4 commits
April 29, 2026 16:17
Harden agent and judge API calls against transient failures and capture per-call token/latency telemetry so long benchmark runs are resilient and auditable. Made-with: Cursor
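As a minimal sketch of what that hardening can look like, assuming a generic async API call; the wrapper name, telemetry shape, and backoff schedule below are illustrative, not this PR's actual code:

```ts
// Sketch of a retry wrapper with exponential backoff that records
// per-call latency and token usage. All names here are illustrative.
interface CallTelemetry {
  attempts: number;
  latencyMs: number;
  tokens?: number; // populated when the API reports usage
}

async function callWithRetry<T>(
  fn: () => Promise<{ result: T; tokens?: number }>,
  maxAttempts = 3,
): Promise<{ result: T; telemetry: CallTelemetry }> {
  const start = Date.now();
  for (let attempt = 1; ; attempt++) {
    try {
      const { result, tokens } = await fn();
      return {
        result,
        telemetry: { attempts: attempt, latencyMs: Date.now() - start, tokens },
      };
    } catch (err) {
      if (attempt >= maxAttempts) throw err; // retries exhausted: surface the error
      // Exponential backoff: 500ms, 1s, 2s, ...
      await new Promise((r) => setTimeout(r, 500 * 2 ** (attempt - 1)));
    }
  }
}
```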
Improve faithfulness reliability by evaluating hallucination risk and groundedness with separate statement-level prompts and rubric-specific category mappings. Made-with: Cursor
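A hedged sketch of statement-level scoring, assuming a `judge` callback that stands in for the LLM call; the split heuristic, category names, and prompt wording are placeholders for the PR's actual prompts and rubric mappings:

```ts
// Illustrative statement-level faithfulness scoring: split the answer into
// statements and judge each one separately against the sources.
type FaithfulnessCategory = 'grounded' | 'unsupported' | 'contradicted';

async function scoreFaithfulness(
  answer: string,
  sources: string[],
  judge: (prompt: string) => Promise<FaithfulnessCategory>,
): Promise<{ statement: string; category: FaithfulnessCategory }[]> {
  // Naive sentence split; a production splitter would be more careful.
  const statements = answer.split(/(?<=[.!?])\s+/).filter((s) => s.trim());
  return Promise.all(
    statements.map(async (statement) => ({
      statement,
      // One prompt per statement keeps each judgment focused and lets
      // rubric-specific category mappings apply per statement.
      category: await judge(
        `Given these sources:\n${sources.join('\n')}\n\n` +
          `Is this statement grounded, unsupported, or contradicted?\n"${statement}"`,
      ),
    })),
  );
}
```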
Introduce golden answer and golden source evaluation across schema, judge, CLI, and web flows so benchmark runs can score accuracy, completeness, and citations. Made-with: Cursor
Use /runs/ so benchmark artifacts are ignored without masking tracked web/app/runs paths. Made-with: Cursor
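The gitignore semantics this relies on: a leading slash anchors the pattern to the directory containing the .gitignore, so only the top-level directory matches:

```gitignore
# Anchored: ignores only the repo-root runs/ directory (benchmark artifacts).
/runs/
# An unanchored "runs/" would also match web/app/runs/, which must stay tracked.
```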
In src/import-golden-xlsx.ts:

```ts
return {
  agentId: opts.agentId!,
  xlsxPath: opts.xlsxPath || 'Gloden Evaluation Set.xlsx',
```
Issue: Typo in default XLSX filename: 'Gloden Evaluation Set.xlsx' should be 'Golden Evaluation Set.xlsx'. When the importer is run without --xlsx, it will attempt to read a non-existent file with the misspelled name.
Suggested fix: In src/import-golden-xlsx.ts, change the default XLSX filename and its mention in the help text from 'Gloden Evaluation Set.xlsx' to 'Golden Evaluation Set.xlsx'.
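Applied to the excerpt above, the corrected default would read as follows (only the filename changes; the closing brace is assumed from the truncated diff):

```ts
return {
  agentId: opts.agentId!,
  xlsxPath: opts.xlsxPath || 'Golden Evaluation Set.xlsx',
};
```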
We need to benchmark agent responses against reviewer-provided golden answers and source links, but current Seer flows only support thematic eval guidance and do not persist/score golden references end-to-end.
This PR introduces the golden benchmark workflow across schema, judge pipeline, CLI ingestion, and web run APIs/UI defaults.
Stack dependency: review after #2 (this branch includes the commits from PRs #1 and #2).
Summary
- `golden_answer`/`golden_sources` support in schema/types/migrations (shape sketched below)
- New judge metrics (`answer_accuracy`, `answer_completeness`, `citation_correctness`) and a golden judge call path
- `src/import-golden-xlsx.ts` importer for golden cases
- Web run APIs/UI defaults (`cases`, `runs`) and a result export button
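As a rough sketch of the data shape the first bullet implies; the field and type names here are assumptions, and the real column definitions live in this PR's schema and migrations:

```ts
// Hypothetical shape of a benchmark case carrying golden references.
interface GoldenCase {
  id: string;
  prompt: string;
  goldenAnswer?: string;    // reviewer-provided reference answer
  goldenSources?: string[]; // source links the agent's answer should cite
}

// The three judge metrics scored against the golden references.
type GoldenMetric =
  | 'answer_accuracy'
  | 'answer_completeness'
  | 'citation_correctness';
```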
Test plan

- `bunx tsc --noEmit`
- `cd web && bun run build` currently fails on an existing Drizzle typing mismatch in `web/app/api/cases/route.ts` (pre-existing; not introduced by this split)

Made with Cursor