
[feature] add golden-answer benchmark workflow across CLI and web #3

Open
keyur-prabhu-glean wants to merge 4 commits into askscio:main from keyur-prabhu-glean:seer/golden-benchmark

Conversation

@keyur-prabhu-glean

We need to benchmark agent responses against reviewer-provided golden answers and source links, but current Seer flows only support thematic eval guidance and do not persist/score golden references end-to-end.

This PR introduces the golden benchmark workflow across schema, judge pipeline, CLI ingestion, and web run APIs/UI defaults.

Stack dependency: review after #2 (this branch includes PR #1 and #2 commits).

Summary

  • Add golden_answer / golden_sources support in schema/types/migrations
  • Add golden criteria (answer_accuracy, answer_completeness, citation_correctness) and golden judge call path (see the type sketch after this list)
  • Add URL-based golden source doc fetch + correctness/citation judging
  • Add CLI golden flags and header-aware CSV import expansion + src/import-golden-xlsx.ts
  • Add web API support (cases, runs) and result export button
  • Switch default judge selection to GPT-5 for this benchmark mode
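For orientation, here is a minimal sketch of the shapes this introduces, assuming the TypeScript types mirror the schema fields one-to-one. Only the field and criteria names below come from this PR; the type names are illustrative:

// Sketch only: type names are illustrative, not taken from the diff.
export type GoldenCriterion =
  | 'answer_accuracy'
  | 'answer_completeness'
  | 'citation_correctness';

export interface GoldenReference {
  golden_answer: string;    // reviewer-provided expected answer
  golden_sources: string[]; // URLs the agent is expected to cite
}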

Test plan

  • bunx tsc --noEmit
  • cd web && bun run build (currently fails on a pre-existing Drizzle typing mismatch in web/app/api/cases/route.ts; not introduced by this split)

Made with Cursor

Keyur Prabhu added 4 commits April 29, 2026 16:17
Harden agent and judge API calls against transient failures and capture per-call
token/latency telemetry so long benchmark runs are resilient and auditable.

Made-with: Cursor
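For reviewers unfamiliar with the pattern, "harden against transient failures" typically means a retry wrapper along these lines. This is a generic sketch, not the commit's code; the helper name and telemetry shape are assumptions:

// Generic retry-with-backoff sketch; withRetries and CallTelemetry are illustrative names.
interface CallTelemetry {
  attempts: number;
  latencyMs: number;
}

async function withRetries<T>(
  call: () => Promise<T>,
  maxAttempts = 3,
): Promise<{ result: T; telemetry: CallTelemetry }> {
  const start = Date.now();
  for (let attempt = 1; ; attempt++) {
    try {
      const result = await call();
      return { result, telemetry: { attempts: attempt, latencyMs: Date.now() - start } };
    } catch (err) {
      if (attempt >= maxAttempts) throw err;
      // Exponential backoff with jitter before retrying.
      const delayMs = 2 ** attempt * 250 + Math.random() * 100;
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
}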
Improve faithfulness reliability by evaluating hallucination risk and groundedness
with separate statement-level prompts and rubric-specific category mappings.

Made-with: Cursor
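As a rough illustration of statement-level judging (the prompt text, verdict labels, and function names here are invented, not lifted from the commit):

// Illustrative sketch: split the response into statements and judge each one
// against the sources; judgeLLM stands in for the actual judge call.
async function judgeGroundedness(
  response: string,
  sources: string[],
  judgeLLM: (prompt: string) => Promise<'grounded' | 'ungrounded'>,
): Promise<number> {
  // Naive sentence split; a real pipeline may use a smarter segmenter.
  const statements = response.split(/(?<=[.!?])\s+/).filter((s) => s.trim().length > 0);
  const verdicts = await Promise.all(
    statements.map((s) =>
      judgeLLM(`Is the statement supported by the sources?\nStatement: ${s}\nSources:\n${sources.join('\n')}`),
    ),
  );
  const grounded = verdicts.filter((v) => v === 'grounded').length;
  return statements.length > 0 ? grounded / statements.length : 1;
}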
Introduce golden answer and golden source evaluation across schema, judge, CLI,
and web flows so benchmark runs can score accuracy, completeness, and citations.

Made-with: Cursor
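One plausible shape for the citation-correctness piece, sketched under the assumption that golden sources and agent citations are compared as normalized URLs (the normalization policy below is a guess, not the PR's):

// Illustrative: fraction of golden source URLs the agent actually cited.
function normalizeUrl(raw: string): string {
  const url = new URL(raw);
  url.hash = ''; // ignore fragments when comparing
  return url.toString().replace(/\/$/, '');
}

function citationCorrectness(citedUrls: string[], goldenSources: string[]): number {
  const cited = new Set(citedUrls.map(normalizeUrl));
  const hits = goldenSources.filter((g) => cited.has(normalizeUrl(g))).length;
  return goldenSources.length > 0 ? hits / goldenSources.length : 1;
}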
Use /runs/ in .gitignore so benchmark artifacts at the repository root are ignored without masking tracked web/app/runs paths.

Made-with: Cursor
Comment thread: src/import-golden-xlsx.ts

return {
  agentId: opts.agentId!,
  xlsxPath: opts.xlsxPath || 'Gloden Evaluation Set.xlsx',

Issue: Typo in default XLSX filename: 'Gloden Evaluation Set.xlsx' should be 'Golden Evaluation Set.xlsx'. When the importer is run without --xlsx, it will attempt to read a non-existent file with the misspelled name.

Suggested fix: In src/import-golden-xlsx.ts, change the default XLSX filename and its mention in the help text from 'Gloden Evaluation Set.xlsx' to 'Golden Evaluation Set.xlsx'.
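In suggestion form, with the surrounding lines reproduced from the excerpt above, the fix is the one-line change:

return {
  agentId: opts.agentId!,
  xlsxPath: opts.xlsxPath || 'Golden Evaluation Set.xlsx',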

