feat(bench): Aggregate stats, Turn-based Scenario, baseline dual-run by bug-ops · Pull Request #3415 · bug-ops/zeph

bug-ops · 2026-04-25T21:02:40Z

Summary

Aggregate extended with median_score, stddev, error_count (spec 034 §NFR-008)
Scenario migrated from prompt: String to turns: Vec<Turn> with Role enum, Scenario::single() constructor, and primary_prompt() -> Result<&str, BenchError> (§multi-turn)
--baseline dual-run (memory-off / memory-on): writes results to <output>/baseline/{memory-off,memory-on}/ + comparison.json; scoped to longmemeval and locomo (§FR-006)
All 5 dataset loaders migrated to Scenario::single()
111 zeph-bench tests pass; 8 520 workspace tests pass
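For readers without the diff open, the Scenario change can be pictured roughly as below. This is a minimal sketch, not the code from this PR: the `Role` variants, the `id` field, the `BenchError::EmptyScenario` variant, and the rule that the "primary" prompt is the first user turn are all assumptions made for illustration. Only the names `Scenario`, `Turn`, `Role`, `Scenario::single()`, and the `primary_prompt() -> Result<&str, BenchError>` signature come from the summary above.

```rust
use std::fmt;

/// Speaker of a single turn. The variant names are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Role {
    User,
    Assistant,
}

/// One message in a scenario.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Turn {
    pub role: Role,
    pub content: String,
}

/// Stand-in error type; the real BenchError almost certainly has other variants.
#[derive(Debug, PartialEq, Eq)]
pub enum BenchError {
    /// Hypothetical variant: the scenario contains no user turn.
    EmptyScenario,
}

impl fmt::Display for BenchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            BenchError::EmptyScenario => write!(f, "scenario has no user turn"),
        }
    }
}

impl std::error::Error for BenchError {}

/// A benchmark case expressed as a list of turns instead of one prompt string.
#[derive(Debug, Clone)]
pub struct Scenario {
    pub id: String, // assumed field, for illustration only
    pub turns: Vec<Turn>,
}

impl Scenario {
    /// Wraps a single prompt in a one-turn scenario, so loaders that used to
    /// fill `prompt: String` need only a one-line change.
    pub fn single(id: impl Into<String>, prompt: impl Into<String>) -> Self {
        Self {
            id: id.into(),
            turns: vec![Turn { role: Role::User, content: prompt.into() }],
        }
    }

    /// Returns the text of the first user turn (assumed meaning of "primary"),
    /// or an error when no user turn exists.
    pub fn primary_prompt(&self) -> Result<&str, BenchError> {
        self.turns
            .iter()
            .find(|t| t.role == Role::User)
            .map(|t| t.content.as_str())
            .ok_or(BenchError::EmptyScenario)
    }
}
```

Under these assumptions, `Scenario::single("q1", "What is 2 + 2?").primary_prompt()` yields the original prompt, while a scenario with an empty `turns` vector yields an error instead of panicking, which is presumably why the accessor returns a `Result`.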

Test plan

cargo nextest run -p zeph-bench — 111/111 pass
cargo nextest run --workspace --lib --bins — 8 520/8 520 pass
cargo test --doc -p zeph-bench — 58/58 pass
cargo +nightly fmt --check — clean
cargo clippy --workspace --lib --bins -- -D warnings — clean

…io, baseline dual-run

- Aggregate: add median_score, stddev, error_count fields with correct N=0/N=1 edge cases
- Scenario: replace prompt: String with turns: Vec<Turn>, add Role enum, Scenario::single() constructor, primary_prompt() -> Result<&str, BenchError>
- RunOptions: add memory_mode: MemoryMode field (Off by default)
- BenchRunner: add with_memory_params(), wire SemanticMemory via with_sqlite_backend when On
- bench run --baseline: dual-pass (memory-off/on) writing to <output>/baseline/{memory-off,memory-on}/ + comparison.json; gated to longmemeval/locomo; tool datasets return an error
- All 5 loaders migrated from struct literals to Scenario::single()

Closes spec 034 gaps: §Data Model NFR-008, §Scenario multi-turn, §FR-006 baseline dual-run
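The N=0/N=1 edge cases mentioned for Aggregate are easy to get wrong, so here is a small sketch of one way to handle them. It is an illustration under stated assumptions, not the implementation from this PR: the `total` and `mean_score` fields, the `aggregate` function, the use of population (not sample) standard deviation, and the choice to exclude errored scenarios from the score statistics are all hypothetical. Only `median_score`, `stddev`, and `error_count` are named in the commit message.

```rust
/// Summary statistics over one benchmark run.
#[derive(Debug, Clone, PartialEq)]
pub struct Aggregate {
    pub total: usize,      // assumed field: number of scenarios attempted
    pub mean_score: f64,   // assumed field
    pub median_score: f64,
    pub stddev: f64,
    pub error_count: usize,
}

/// Builds an Aggregate from per-scenario outcomes.
/// `Ok(score)` is a scored scenario; `Err(_)` is one that failed to run.
pub fn aggregate(results: &[Result<f64, String>]) -> Aggregate {
    // Keep only successful scores; failures are counted separately.
    let mut scores: Vec<f64> = results
        .iter()
        .filter_map(|r| r.as_ref().ok().copied())
        .collect();
    let error_count = results.len() - scores.len();
    let n = scores.len();

    // N = 0: nothing to average, so report zeros instead of NaN.
    if n == 0 {
        return Aggregate {
            total: results.len(),
            mean_score: 0.0,
            median_score: 0.0,
            stddev: 0.0,
            error_count,
        };
    }

    // total_cmp gives a total order on f64, so sorting cannot panic on NaN.
    scores.sort_by(f64::total_cmp);

    let mean = scores.iter().sum::<f64>() / n as f64;
    let median = if n % 2 == 1 {
        scores[n / 2]
    } else {
        (scores[n / 2 - 1] + scores[n / 2]) / 2.0
    };

    // Population variance (divide by N). With N = 1 the single deviation is
    // zero, so stddev is 0.0 without a special case or a divide-by-zero.
    let variance = scores.iter().map(|s| (s - mean).powi(2)).sum::<f64>() / n as f64;

    Aggregate {
        total: results.len(),
        mean_score: mean,
        median_score: median,
        stddev: variance.sqrt(),
        error_count,
    }
}
```

A sample standard deviation (dividing by N-1) would need an explicit guard at N=1; whichever variant spec 034 requires, the point of the sketch is that both degenerate sizes should produce finite numbers.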

github-actions Bot added labels documentation (Improvements or additions to documentation), rust (Rust code changes), dependencies (Dependency updates) on Apr 25, 2026

bug-ops enabled auto-merge (squash) April 25, 2026 21:02

github-actions Bot added labels enhancement (New feature or request), size/XL (Extra large PR (500+ lines)) on Apr 25, 2026

bug-ops disabled auto-merge April 25, 2026 21:04

bug-ops added 4 commits April 25, 2026 23:34

docs(bench): add FRAMES results to README baseline table

e4dbf05

fix(bench): create SQLite conversation row before memory-on pass; create_dir_all for nested output dirs

7cbd2b2

fix(bench): split handle_run_baseline to stay within 100-line clippy limit

c9e0d5a
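The nested-output-directory fix in the commits above relates to the `<output>/baseline/{memory-off,memory-on}/` layout from the summary. A rough sketch of preparing that layout with the standard library follows. The `BaselinePaths` struct and `prepare_baseline_dirs` function are hypothetical names, and placing `comparison.json` directly under `<output>/baseline/` is an assumption, since the PR text does not say exactly where that file lives.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Locations used by a dual-run baseline (hypothetical helper type).
pub struct BaselinePaths {
    pub memory_off: PathBuf,
    pub memory_on: PathBuf,
    pub comparison: PathBuf,
}

/// Creates `<output>/baseline/memory-off` and `<output>/baseline/memory-on`.
///
/// `create_dir_all` creates every missing parent and succeeds if the
/// directory already exists, whereas `create_dir` fails when `<output>` or
/// `<output>/baseline` is missing.
pub fn prepare_baseline_dirs(output: &Path) -> io::Result<BaselinePaths> {
    let base = output.join("baseline");
    let memory_off = base.join("memory-off");
    let memory_on = base.join("memory-on");

    fs::create_dir_all(&memory_off)?;
    fs::create_dir_all(&memory_on)?;

    Ok(BaselinePaths {
        memory_off,
        memory_on,
        // Assumed location of the comparison report.
        comparison: base.join("comparison.json"),
    })
}
```

Calling the helper before each pass keeps the memory-off and memory-on results separate even when the user passes an output path that does not exist yet.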

bug-ops force-pushed the bench-fix branch from 97437af to c9e0d5a April 25, 2026 21:39

bug-ops enabled auto-merge (squash) April 25, 2026 21:40

bug-ops merged commit f0e48d7 into main Apr 25, 2026
36 checks passed

bug-ops deleted the bench-fix branch April 25, 2026 21:45

bug-ops merged 4 commits into main from bench-fix