[feature] add golden-answer benchmark workflow across CLI and web #3
Open
keyur-prabhu-glean wants to merge 4 commits into
Conversation
added 4 commits
April 29, 2026 16:17
Harden agent and judge API calls against transient failures and capture per-call token/latency telemetry so long benchmark runs are resilient and auditable. Made-with: Cursor
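As a minimal sketch of what that hardening can look like, assuming a generic async API call; the wrapper name, telemetry shape, and backoff schedule below are illustrative, not this PR's actual code:

```ts
// Sketch of a retry wrapper with exponential backoff that records
// per-call latency and token usage. All names here are illustrative.
interface CallTelemetry {
  attempts: number;
  latencyMs: number;
  tokens?: number; // populated when the API reports usage
}

async function callWithRetry<T>(
  fn: () => Promise<{ result: T; tokens?: number }>,
  maxAttempts = 3,
): Promise<{ result: T; telemetry: CallTelemetry }> {
  const start = Date.now();
  for (let attempt = 1; ; attempt++) {
    try {
      const { result, tokens } = await fn();
      return {
        result,
        telemetry: { attempts: attempt, latencyMs: Date.now() - start, tokens },
      };
    } catch (err) {
      if (attempt >= maxAttempts) throw err; // retries exhausted: surface the error
      // Exponential backoff: 500ms, 1s, 2s, ...
      await new Promise((r) => setTimeout(r, 500 * 2 ** (attempt - 1)));
    }
  }
}
```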
Improve faithfulness reliability by evaluating hallucination risk and groundedness with separate statement-level prompts and rubric-specific category mappings. Made-with: Cursor
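A hedged sketch of statement-level scoring, assuming a `judge` callback that stands in for the LLM call; the split heuristic, category names, and prompt wording are placeholders for the PR's actual prompts and rubric mappings:

```ts
// Illustrative statement-level faithfulness scoring: split the answer into
// statements and judge each one separately against the sources.
type FaithfulnessCategory = 'grounded' | 'unsupported' | 'contradicted';

async function scoreFaithfulness(
  answer: string,
  sources: string[],
  judge: (prompt: string) => Promise<FaithfulnessCategory>,
): Promise<{ statement: string; category: FaithfulnessCategory }[]> {
  // Naive sentence split; a production splitter would be more careful.
  const statements = answer.split(/(?<=[.!?])\s+/).filter((s) => s.trim());
  return Promise.all(
    statements.map(async (statement) => ({
      statement,
      // One prompt per statement keeps each judgment focused and lets
      // rubric-specific category mappings apply per statement.
      category: await judge(
        `Given these sources:\n${sources.join('\n')}\n\n` +
          `Is this statement grounded, unsupported, or contradicted?\n"${statement}"`,
      ),
    })),
  );
}
```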
Introduce golden answer and golden source evaluation across schema, judge, CLI, and web flows so benchmark runs can score accuracy, completeness, and citations. Made-with: Cursor
Use /runs/ so benchmark artifacts are ignored without masking tracked web/app/runs paths. Made-with: Cursor
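The gitignore semantics this relies on: a leading slash anchors the pattern to the directory containing the .gitignore, so only the top-level directory matches:

```gitignore
# Anchored: ignores only the repo-root runs/ directory (benchmark artifacts).
/runs/
# An unanchored "runs/" would also match web/app/runs/, which must stay tracked.
```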
In src/import-golden-xlsx.ts:

```ts
return {
  agentId: opts.agentId!,
  xlsxPath: opts.xlsxPath || 'Gloden Evaluation Set.xlsx',
```
Issue: Typo in default XLSX filename: 'Gloden Evaluation Set.xlsx' should be 'Golden Evaluation Set.xlsx'. When the importer is run without --xlsx, it will attempt to read a non-existent file with the misspelled name.
Suggested fix: In src/import-golden-xlsx.ts, change the default XLSX filename and its mention in the help text from 'Gloden Evaluation Set.xlsx' to 'Golden Evaluation Set.xlsx'.
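Applied to the excerpt above, the corrected default would read as follows (only the filename changes; the closing brace is assumed from the truncated diff):

```ts
return {
  agentId: opts.agentId!,
  xlsxPath: opts.xlsxPath || 'Golden Evaluation Set.xlsx',
};
```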
We need to benchmark agent responses against reviewer-provided golden answers and source links, but current Seer flows only support thematic eval guidance and do not persist/score golden references end-to-end.
This PR introduces the golden benchmark workflow across schema, judge pipeline, CLI ingestion, and web run APIs/UI defaults.
Stack dependency: review after #2 (this branch includes the commits from PRs #1 and #2).
Summary
- `golden_answer`/`golden_sources` support in schema/types/migrations (shape sketched below)
- New judge metrics (`answer_accuracy`, `answer_completeness`, `citation_correctness`) and a golden judge call path
- `src/import-golden-xlsx.ts` importer for golden cases
- Web run APIs/UI defaults (`cases`, `runs`) and a result export button
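As a rough sketch of the data shape the first bullet implies; the field and type names here are assumptions, and the real column definitions live in this PR's schema and migrations:

```ts
// Hypothetical shape of a benchmark case carrying golden references.
interface GoldenCase {
  id: string;
  prompt: string;
  goldenAnswer?: string;    // reviewer-provided reference answer
  goldenSources?: string[]; // source links the agent's answer should cite
}

// The three judge metrics scored against the golden references.
type GoldenMetric =
  | 'answer_accuracy'
  | 'answer_completeness'
  | 'citation_correctness';
```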
Test plan

- `bunx tsc --noEmit`
- `cd web && bun run build` currently fails on an existing Drizzle typing mismatch in `web/app/api/cases/route.ts` (pre-existing; not introduced by this split)

Made with Cursor