feat(bench): implement BenchRunner and async bench command handler by bug-ops · Pull Request #3408 · bug-ops/zeph

bug-ops · 2026-04-25T19:27:48Z

Summary

Implements BenchRunner driving Agent<BenchmarkChannel> in baseline mode (no tools, no memory)
Adds concise-answer system prompt via InstructionBlock — primary driver of score improvement
Adds post_process_response: takes first non-empty line, strips markdown formatting
LOCOMO Token F1 normalizer: lowercase + alphanumeric-only before matching
GAIA subscript-digit folding: maps Unicode ₂ → ASCII 2 before exact-match comparison
Adds --data-file, --provider, --scenario, --resume, --output flags to bench run
Gates build_vault_provider behind cfg(feature = "bench") — fixes dead_code in other bundles
Rebased onto main (df49867a vault/sanitizer/config refactor)
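The GAIA subscript-digit folding can be illustrated with a minimal sketch. The function names below (`fold_digit`, `normalize_for_exact_match`, `exact_match`) are hypothetical and not taken from zeph-bench; superscript handling is an assumption suggested only by the later "subscript/superscript match arms" commit title, and the actual GAIA normalizer may apply further steps.

```rust
/// Fold a Unicode subscript or superscript digit to its ASCII digit.
/// Any other character is returned unchanged.
/// (Hypothetical helper; the name and superscript support are assumptions.)
fn fold_digit(c: char) -> char {
    match c {
        // Subscript digits U+2080..=U+2089 are contiguous.
        '\u{2080}'..='\u{2089}' => char::from(b'0' + (c as u32 - 0x2080) as u8),
        // Superscript digits are not contiguous: 1, 2, 3 live in Latin-1.
        '\u{2070}' => '0',
        '\u{00B9}' => '1',
        '\u{00B2}' => '2',
        '\u{00B3}' => '3',
        '\u{2074}'..='\u{2079}' => char::from(b'4' + (c as u32 - 0x2074) as u8),
        other => other,
    }
}

/// Trim surrounding whitespace and fold digits before comparison.
fn normalize_for_exact_match(s: &str) -> String {
    s.trim().chars().map(fold_digit).collect()
}

/// Exact match on the normalized forms of both strings.
fn exact_match(prediction: &str, reference: &str) -> bool {
    normalize_for_exact_match(prediction) == normalize_for_exact_match(reference)
}
```

With this sketch, `exact_match("H₂O", "H2O")` holds, while `exact_match("H2O", "H3O")` does not.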

Benchmark results (gpt-5.4-mini, 2026-04-25)

Dataset	Scorer	Scenarios	Mean score	Exact match
LOCOMO	Token F1 ≥ 0.5	11	1.0000	11/11
GAIA	GAIA normalized exact	8	1.0000	8/8
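For readers unfamiliar with the LOCOMO scorer, here is a rough sketch of token-level F1 with the lowercase, alphanumeric-only normalization described in the summary. The names `normalize_tokens`, `token_f1`, and `passes` are illustrative assumptions, as is the reading of "Token F1 ≥ 0.5" as a per-scenario pass threshold; the zeph-bench evaluator may differ in tokenization and edge cases.

```rust
use std::collections::HashMap;

/// Split on whitespace, keep only alphanumeric characters, lowercase them,
/// and drop tokens that become empty (e.g. pure punctuation).
fn normalize_tokens(text: &str) -> Vec<String> {
    text.split_whitespace()
        .map(|tok| {
            tok.chars()
                .filter(|c| c.is_alphanumeric())
                .flat_map(char::to_lowercase)
                .collect::<String>()
        })
        .filter(|tok| !tok.is_empty())
        .collect()
}

/// Token F1: harmonic mean of precision and recall over the multiset
/// overlap between predicted and reference tokens.
fn token_f1(prediction: &str, reference: &str) -> f64 {
    let pred = normalize_tokens(prediction);
    let gold = normalize_tokens(reference);
    if pred.is_empty() || gold.is_empty() {
        // Assumption: two empty answers agree; one empty answer scores 0.
        return if pred.is_empty() && gold.is_empty() { 1.0 } else { 0.0 };
    }
    // Count reference tokens, then consume them as predictions match.
    let mut gold_counts: HashMap<&str, usize> = HashMap::new();
    for tok in &gold {
        *gold_counts.entry(tok.as_str()).or_insert(0) += 1;
    }
    let mut overlap = 0usize;
    for tok in &pred {
        if let Some(n) = gold_counts.get_mut(tok.as_str()) {
            if *n > 0 {
                *n -= 1;
                overlap += 1;
            }
        }
    }
    if overlap == 0 {
        return 0.0;
    }
    let precision = overlap as f64 / pred.len() as f64;
    let recall = overlap as f64 / gold.len() as f64;
    2.0 * precision * recall / (precision + recall)
}

/// Hypothetical pass criterion matching the "Token F1 ≥ 0.5" column label.
fn passes(prediction: &str, reference: &str) -> bool {
    token_f1(prediction, reference) >= 0.5
}
```

Under these assumptions a verbose answer such as "The answer is Paris" scores 0.4 against the reference "Paris" (precision 0.25, recall 1.0) and fails the threshold, while the terse answer "Paris." scores 1.0. This is consistent with the PR's observation that terse answers are what the evaluators reward.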

Before fix (gpt-4o, no system prompt): LOCOMO 0.0833, GAIA 0.0000.

Test plan

cargo nextest run -p zeph-bench --lib — all unit tests pass
cargo check --features "desktop,ide,server,chat,pdf,scheduler" — clean
cargo check --features bench — clean
cargo clippy --features "desktop,ide,server,chat,pdf,scheduler" -- -D warnings — clean
End-to-end bench run with gpt-5.4-mini against sample LOCOMO and GAIA datasets

Add the missing execution layer for zeph-bench:

- BenchRunner in zeph-bench/src/runner.rs drives Agent<BenchmarkChannel> over dataset scenarios in baseline mode (no tools, no memory)
- NoopExecutor enables tool-free agent construction inside zeph-bench
- Agent::into_channel(self) -> C added to zeph-core for response retrieval
- handle_bench_command made async; bench run dispatches to loader/evaluator pairs for locomo, gaia, and frames datasets
- build_vault_provider helper added to bootstrap for pre-AppBuilder vault init
- --data-file flag added to bench run (required until download is implemented)

Sample datasets in .local/bench-data/ verified end-to-end with gpt-4o:

- locomo: 3/3 scenarios complete, mean_score=0.08 (baseline, no concise prompt)
- gaia: 3/3 scenarios complete, mean_score=0.00 (exact match requires terse answer)

Low scores are expected for baseline mode; evaluators need terse answers that the agent does not produce without a concise-answer system prompt.
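The `Agent::into_channel(self) -> C` pattern mentioned in this commit message can be sketched as follows. Only the names `Agent`, `BenchmarkChannel`, and `into_channel` come from the PR; the `Channel` trait, the `responses` field, and the `respond` method are hypothetical stand-ins, and the real zeph-core types are certainly richer than this.

```rust
/// Hypothetical minimal channel abstraction: something the agent writes to.
trait Channel {
    fn send(&mut self, text: &str);
}

/// Stand-in for a benchmark channel that records every response in memory.
#[derive(Default)]
struct BenchmarkChannel {
    responses: Vec<String>,
}

impl Channel for BenchmarkChannel {
    fn send(&mut self, text: &str) {
        self.responses.push(text.to_string());
    }
}

/// Stand-in agent that owns its channel for the duration of a run.
struct Agent<C: Channel> {
    channel: C,
}

impl<C: Channel> Agent<C> {
    fn new(channel: C) -> Self {
        Self { channel }
    }

    /// Placeholder for a real model call: echoes the prompt to the channel.
    fn respond(&mut self, prompt: &str) {
        let reply = format!("echo: {prompt}");
        self.channel.send(&reply);
    }

    /// Consume the agent and hand the channel back to the caller,
    /// so recorded responses can be inspected after the run.
    fn into_channel(self) -> C {
        self.channel
    }
}
```

Taking `self` by value means the agent cannot be used after the channel is extracted, which lets a runner collect the recorded responses without cloning them or sharing them behind a lock.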

…provements

Inject a system prompt via InstructionBlock that forces the model to emit only the shortest possible answer. Add post_process_response, which keeps only the first non-empty line of the response and removes markdown formatting before the result is passed to evaluators. Expand the LOCOMO and GAIA sample datasets for broader coverage.
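A minimal sketch of what such a post-processing step might look like is given below. The function name matches the PR, but the body is an assumption: the exact set of markdown constructs that zeph-bench strips is not stated, so this version handles only heading and blockquote markers, a leading bullet, and `*` or backtick emphasis characters.

```rust
/// Keep the first non-empty line of a model response and strip common
/// markdown decoration from it. (Illustrative only; the stripped constructs
/// are an assumption, not the zeph-bench implementation.)
fn post_process_response(raw: &str) -> String {
    // First line that still has content after trimming whitespace.
    let first = raw
        .lines()
        .map(str::trim)
        .find(|line| !line.is_empty())
        .unwrap_or("");

    // Drop leading heading ('#') and blockquote ('>') markers.
    let stripped = first
        .trim_start_matches(|c: char| c == '#' || c == '>')
        .trim_start();

    // Drop a single leading list bullet ("- " or "* ").
    let stripped = stripped
        .strip_prefix("- ")
        .or_else(|| stripped.strip_prefix("* "))
        .unwrap_or(stripped);

    // Remove emphasis and inline-code markers anywhere in the line.
    // Underscores are left alone so identifiers like snake_case survive.
    stripped
        .chars()
        .filter(|c| !matches!(c, '*' | '`'))
        .collect::<String>()
        .trim()
        .to_string()
}
```

For example, a response consisting of two blank lines, then `**Paris**`, then further explanation would be reduced to `Paris` under this sketch.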


github-actions Bot added the rust, core, enhancement, dependencies, and size/XL labels Apr 25, 2026

bug-ops force-pushed the bench-fix branch from 35f58d5 to 07b5199 Compare April 25, 2026 19:29

bug-ops enabled auto-merge (squash) April 25, 2026 19:31

bug-ops added 4 commits April 25, 2026 21:37

chore: update lockfile and stage bench runner async call site

1d7b75b

style(bench): collapse subscript/superscript match arms

13bfb5d

bug-ops force-pushed the bench-fix branch from 07b5199 to 13bfb5d Compare April 25, 2026 19:37

bug-ops added 2 commits April 25, 2026 21:42

fix(bench): gate build_vault_provider behind cfg(feature = "bench")

b06209f

docs(bench): add baseline results and quick-start to README

373eb89

github-actions Bot added the documentation Improvements or additions to documentation label Apr 25, 2026

bug-ops added 3 commits April 25, 2026 21:48

docs(bench): rewrite README with results, CLI usage, library examples…

b7b629c

…, dataset table

fix(core): remove unresolvable cross-crate intra-doc link in into_cha…

920455d

…nnel doc

fix(core): replace uncompilable doc-test in into_channel with no_run …

67cf0c0

…stub

bug-ops merged commit 84ec0eb into main Apr 25, 2026
36 checks passed

bug-ops deleted the bench-fix branch April 25, 2026 20:09

bug-ops merged 9 commits into main from bench-fix
