Skip to content

feat(bench): implement BenchRunner and async bench command handler#3408

Merged
bug-ops merged 9 commits intomainfrom
bench-fix
Apr 25, 2026
Merged

feat(bench): implement BenchRunner and async bench command handler#3408
bug-ops merged 9 commits intomainfrom
bench-fix

Conversation

@bug-ops
Copy link
Copy Markdown
Owner

@bug-ops bug-ops commented Apr 25, 2026

Summary

  • Implements BenchRunner driving Agent<BenchmarkChannel> in baseline mode (no tools, no memory)
  • Adds concise-answer system prompt via InstructionBlock — primary driver of score improvement
  • Adds post_process_response: takes first non-empty line, strips markdown formatting
  • LOCOMO Token F1 normalizer: lowercase + alphanumeric-only before matching
  • GAIA subscript-digit folding: maps Unicode → ASCII 2 before exact-match comparison
  • Adds --data-file, --provider, --scenario, --resume, --output flags to bench run
  • Gates build_vault_provider behind cfg(feature = "bench") — fixes dead_code in other bundles
  • Rebased onto main (df49867a vault/sanitizer/config refactor)

Benchmark results (gpt-5.4-mini, 2026-04-25)

Dataset Scorer Scenarios Mean score Exact match
LOCOMO Token F1 ≥ 0.5 11 1.0000 11/11
GAIA GAIA normalized exact 8 1.0000 8/8

Before fix (gpt-4o, no system prompt): LOCOMO 0.0833, GAIA 0.0000.

Test plan

  • cargo nextest run -p zeph-bench --lib — all unit tests pass
  • cargo check --features "desktop,ide,server,chat,pdf,scheduler" — clean
  • cargo check --features bench — clean
  • cargo clippy --features "desktop,ide,server,chat,pdf,scheduler" -- -D warnings — clean
  • End-to-end bench run with gpt-5.4-mini against sample LOCOMO and GAIA datasets

@github-actions github-actions Bot added rust Rust code changes core zeph-core crate enhancement New feature or request dependencies Dependency updates size/XL Extra large PR (500+ lines) labels Apr 25, 2026
@bug-ops bug-ops enabled auto-merge (squash) April 25, 2026 19:31
bug-ops added 4 commits April 25, 2026 21:37
Add the missing execution layer for zeph-bench:
- BenchRunner in zeph-bench/src/runner.rs drives Agent<BenchmarkChannel>
  over dataset scenarios in baseline mode (no tools, no memory)
- NoopExecutor enables tool-free agent construction inside zeph-bench
- Agent::into_channel(self) -> C added to zeph-core for response retrieval
- handle_bench_command made async; bench run dispatches to loader/evaluator
  pairs for locomo, gaia, and frames datasets
- build_vault_provider helper added to bootstrap for pre-AppBuilder vault init
- --data-file flag added to bench run (required until download is implemented)

Sample datasets in .local/bench-data/ verified end-to-end with gpt-4o:
- locomo: 3/3 scenarios complete, mean_score=0.08 (baseline, no concise prompt)
- gaia:   3/3 scenarios complete, mean_score=0.00 (exact match requires terse answer)

Low scores are expected for baseline mode; evaluators need terse answers
that the agent does not produce without a concise-answer system prompt.
…provements

Inject a system prompt via InstructionBlock forcing the model to emit only
the shortest possible answer. Add post_process_response to strip the first
non-empty line and remove markdown formatting before passing to evaluators.
Expand LOCOMO and GAIA sample datasets for broader coverage.
@github-actions github-actions Bot added the documentation Improvements or additions to documentation label Apr 25, 2026
@bug-ops bug-ops merged commit 84ec0eb into main Apr 25, 2026
36 checks passed
@bug-ops bug-ops deleted the bench-fix branch April 25, 2026 20:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core zeph-core crate dependencies Dependency updates documentation Improvements or additions to documentation enhancement New feature or request rust Rust code changes size/XL Extra large PR (500+ lines)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant