Add the missing execution layer for zeph-bench:

- BenchRunner in zeph-bench/src/runner.rs drives Agent<BenchmarkChannel> over dataset scenarios in baseline mode (no tools, no memory)
- NoopExecutor enables tool-free agent construction inside zeph-bench
- Agent::into_channel(self) -> C added to zeph-core for response retrieval
- handle_bench_command made async; bench run dispatches to loader/evaluator pairs for locomo, gaia, and frames datasets
- build_vault_provider helper added to bootstrap for pre-AppBuilder vault init
- --data-file flag added to bench run (required until download is implemented)

Sample datasets in .local/bench-data/ verified end-to-end with gpt-4o:

- locomo: 3/3 scenarios complete, mean_score=0.08 (baseline, no concise prompt)
- gaia: 3/3 scenarios complete, mean_score=0.00 (exact match requires terse answer)

Low scores are expected for baseline mode; evaluators need terse answers that the agent does not produce without a concise-answer system prompt.
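The `into_channel(self) -> C` accessor mentioned above can be illustrated with a minimal sketch. The `Agent` and `RecordingChannel` types below are hypothetical stand-ins written for this example, not the zeph-core definitions; only the consuming `into_channel` signature is taken from the commit message.

```rust
/// Hypothetical stand-in for a benchmark channel that records responses.
#[derive(Default)]
struct RecordingChannel {
    responses: Vec<String>,
}

/// Hypothetical, heavily simplified agent that owns its channel.
struct Agent<C> {
    channel: C,
}

impl<C> Agent<C> {
    fn new(channel: C) -> Self {
        Self { channel }
    }

    /// Consume the agent and hand the channel back to the caller,
    /// so the recorded responses can be read after the run.
    fn into_channel(self) -> C {
        self.channel
    }
}

impl Agent<RecordingChannel> {
    /// Assumed helper for the sketch: push one response into the channel.
    fn respond(&mut self, text: &str) {
        self.channel.responses.push(text.to_string());
    }
}
```

Taking `self` by value means the agent cannot be used after the channel is extracted, which suits a runner that builds one agent per scenario and only needs the collected output afterwards.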
…provements

Inject a system prompt via InstructionBlock forcing the model to emit only the shortest possible answer. Add post_process_response to keep only the first non-empty line and remove markdown formatting before passing the response to evaluators. Expand LOCOMO and GAIA sample datasets for broader coverage.
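The post-processing step described above might look roughly like the following. This is a sketch under assumptions: the function name comes from the commit message, but the signature and the exact set of markdown markers stripped here (heading, quote, and bullet prefixes, plus `*` and backticks) are guesses made for illustration.

```rust
/// Hypothetical sketch: keep the first non-empty line of a model response
/// and strip common markdown markers from it.
fn post_process_response(raw: &str) -> String {
    // First line that contains something other than whitespace.
    let line = raw
        .lines()
        .map(str::trim)
        .find(|l| !l.is_empty())
        .unwrap_or("");

    // Drop leading heading and block-quote markers such as "## " or "> ".
    let line = line
        .trim_start_matches(|c: char| c == '#' || c == '>')
        .trim_start();

    // Drop a leading bullet marker ("- " or "* ").
    let line = line
        .strip_prefix("- ")
        .or_else(|| line.strip_prefix("* "))
        .unwrap_or(line);

    // Remove inline emphasis and code markers. Underscores are left alone
    // so that identifiers like snake_case survive (an assumption of this sketch).
    line.chars()
        .filter(|c| !matches!(c, '*' | '`'))
        .collect::<String>()
        .trim()
        .to_string()
}
```

For a response such as `"\n**Paris**\nBecause it is the capital."`, this returns `Paris`, which is the kind of terse string an exact-match evaluator can compare directly.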
Summary
- `BenchRunner` driving `Agent<BenchmarkChannel>` in baseline mode (no tools, no memory)
- `InstructionBlock`: primary driver of score improvement
- `post_process_response`: takes first non-empty line, strips markdown formatting
- `₂` → ASCII `2` before exact-match comparison
- `--data-file`, `--provider`, `--scenario`, `--resume`, `--output` flags to `bench run`
- `build_vault_provider` behind `cfg(feature = "bench")`: fixes dead_code in other bundles
- `df49867a` vault/sanitizer/config refactor)

Benchmark results (gpt-5.4-mini, 2026-04-25)
Before fix (gpt-4o, no system prompt): LOCOMO 0.0833, GAIA 0.0000.
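To make the scoring concrete, here is a rough sketch of an exact-match comparison that maps Unicode subscript digits to ASCII first, plus the mean over per-scenario scores. All three function names are hypothetical, and the trimming and case-insensitive comparison are assumptions of this sketch rather than confirmed behavior of the zeph-bench evaluators.

```rust
/// Map Unicode subscript digits (U+2080..=U+2089) to ASCII digits,
/// so that "H₂O" and "H2O" compare equal.
fn normalize_subscripts(s: &str) -> String {
    s.chars()
        .map(|c| match c {
            '\u{2080}'..='\u{2089}' => {
                char::from_digit(c as u32 - 0x2080, 10).unwrap_or(c)
            }
            other => other,
        })
        .collect()
}

/// Hypothetical exact-match score: 1.0 when the normalized strings are
/// identical, 0.0 otherwise. Lowercasing is an assumption of this sketch.
fn exact_match(prediction: &str, reference: &str) -> f64 {
    let p = normalize_subscripts(prediction.trim()).to_lowercase();
    let r = normalize_subscripts(reference.trim()).to_lowercase();
    if p == r { 1.0 } else { 0.0 }
}

/// Mean of per-scenario scores; an empty run scores 0.0.
fn mean_score(scores: &[f64]) -> f64 {
    if scores.is_empty() {
        return 0.0;
    }
    scores.iter().sum::<f64>() / scores.len() as f64
}
```

Under a binary metric like this, a verbose but correct answer still scores 0.0, which is consistent with the low baseline numbers reported before the concise-answer prompt was added.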
Test plan
- `cargo nextest run -p zeph-bench --lib`: all unit tests pass
- `cargo check --features "desktop,ide,server,chat,pdf,scheduler"`: clean
- `cargo check --features bench`: clean
- `cargo clippy --features "desktop,ide,server,chat,pdf,scheduler" -- -D warnings`: clean
- `bench run` with gpt-5.4-mini against sample LOCOMO and GAIA datasets