feat(bench): Aggregate stats, Turn-based Scenario, baseline dual-run #3415

Merged: bug-ops merged 4 commits into main from bench-fix on Apr 25, 2026
@bug-ops (Owner) commented on Apr 25, 2026

Summary

  • Aggregate extended with median_score, stddev, error_count (spec 034 §NFR-008)
  • Scenario migrated from prompt: String to turns: Vec<Turn> with Role enum, Scenario::single() constructor, and primary_prompt() -> Result<&str, BenchError> (§multi-turn)
  • --baseline dual-run (memory-off / memory-on): writes results to <output>/baseline/{memory-off,memory-on}/ + comparison.json; scoped to longmemeval and locomo (§FR-006)
  • All 5 dataset loaders migrated to Scenario::single()
  • 111 zeph-bench tests pass; 8 520 workspace tests pass
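
The Scenario migration described above could look roughly like the following. This is a minimal sketch, not the code from this PR: only `Scenario`, `Turn`, `Role`, `turns`, `Scenario::single()`, `primary_prompt()`, and `BenchError` are named in the summary. The `id` and `content` fields, the `Role` variants, the `NoUserTurn` error variant, and the choice of the first user turn as the "primary" prompt are all assumptions made for illustration.

```rust
/// Speaker of a turn. The variant set is an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Role {
    System,
    User,
    Assistant,
}

/// One message in a scenario. Field names are assumptions.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Turn {
    pub role: Role,
    pub content: String,
}

/// Stand-in for the crate's error type; the variant is hypothetical.
#[derive(Debug, PartialEq, Eq)]
pub enum BenchError {
    /// The scenario contains no user turn to use as a prompt.
    NoUserTurn,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Scenario {
    /// Hypothetical identifier field.
    pub id: String,
    /// Replaces the former `prompt: String`.
    pub turns: Vec<Turn>,
}

impl Scenario {
    /// Builds a one-turn scenario, which is what a loader that
    /// previously set `prompt` directly would call.
    pub fn single(id: impl Into<String>, prompt: impl Into<String>) -> Self {
        Self {
            id: id.into(),
            turns: vec![Turn {
                role: Role::User,
                content: prompt.into(),
            }],
        }
    }

    /// Returns the content of the first user turn (assumed meaning of
    /// "primary"), or an error when there is none.
    pub fn primary_prompt(&self) -> Result<&str, BenchError> {
        self.turns
            .iter()
            .find(|t| t.role == Role::User)
            .map(|t| t.content.as_str())
            .ok_or(BenchError::NoUserTurn)
    }
}
```

Under these assumptions, `Scenario::single("s1", "What is 2+2?").primary_prompt()` yields `Ok("What is 2+2?")`, while a scenario with an empty `turns` vector yields an error instead of panicking. Returning a `Result` rather than `&str` is what lets callers handle scenarios that have no usable prompt.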

Test plan

  • cargo nextest run -p zeph-bench — 111/111 pass
  • cargo nextest run --workspace --lib --bins — 8 520/8 520 pass
  • cargo test --doc -p zeph-bench — 58/58 pass
  • cargo +nightly fmt --check — clean
  • cargo clippy --workspace --lib --bins -- -D warnings — clean

@github-actions bot added the documentation, rust, and dependencies labels on Apr 25, 2026
@bug-ops enabled auto-merge (squash) on April 25, 2026 at 21:02
@github-actions bot added the enhancement and size/XL labels on Apr 25, 2026
@bug-ops disabled auto-merge on April 25, 2026 at 21:04
bug-ops added 4 commits April 25, 2026 23:34
…io, baseline dual-run

- Aggregate: add median_score, stddev, error_count fields with correct N=0/N=1 edge cases
- Scenario: replace prompt: String with turns: Vec<Turn>, add Role enum, Scenario::single()
  constructor, primary_prompt() -> Result<&str, BenchError>
- RunOptions: add memory_mode: MemoryMode field (Off by default)
- BenchRunner: add with_memory_params(), wire SemanticMemory via with_sqlite_backend when On
- bench run --baseline: dual-pass (memory-off/on) writing to <output>/baseline/{memory-off,memory-on}/
  + comparison.json; gated to longmemeval/locomo; tool datasets return an error
- All 5 loaders migrated from struct literals to Scenario::single()

Closes spec 034 gaps: §Data Model NFR-008, §Scenario multi-turn, §FR-006 baseline dual-run
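
The N=0/N=1 edge cases mentioned for Aggregate can be illustrated with a short sketch. It is not taken from the PR: `median_score`, `stddev`, and `error_count` are the field names from the commit message, but `mean_score`, the `from_results` constructor, and the `Result<f64, String>` input shape are hypothetical. The sketch also assumes the sample standard deviation (divisor n−1), which is undefined for a single score and is reported here as 0.0, and assumes all statistics are 0.0 when no scenario succeeded.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Aggregate {
    /// Hypothetical field, included only so the sketch is complete.
    pub mean_score: f64,
    pub median_score: f64,
    pub stddev: f64,
    pub error_count: usize,
}

impl Aggregate {
    /// Hypothetical constructor: each entry is one scenario outcome,
    /// `Ok(score)` on success or `Err(message)` on failure.
    pub fn from_results(results: &[Result<f64, String>]) -> Self {
        let error_count = results.iter().filter(|r| r.is_err()).count();

        // Only successful scenarios contribute to the score statistics.
        let mut scores: Vec<f64> = results
            .iter()
            .filter_map(|r| r.as_ref().ok().copied())
            .collect();
        scores.sort_by(|a, b| a.total_cmp(b));

        let n = scores.len();
        if n == 0 {
            // N=0: nothing to average; assumed convention is all zeros.
            return Self {
                mean_score: 0.0,
                median_score: 0.0,
                stddev: 0.0,
                error_count,
            };
        }

        let mean = scores.iter().sum::<f64>() / n as f64;

        // Median of the sorted scores: middle element, or the mean of
        // the two middle elements when the count is even.
        let median = if n % 2 == 1 {
            scores[n / 2]
        } else {
            (scores[n / 2 - 1] + scores[n / 2]) / 2.0
        };

        // N=1: the n-1 divisor would be zero, so report 0.0 instead of NaN.
        let stddev = if n < 2 {
            0.0
        } else {
            let sum_sq: f64 = scores.iter().map(|s| (s - mean).powi(2)).sum();
            (sum_sq / (n - 1) as f64).sqrt()
        };

        Self {
            mean_score: mean,
            median_score: median,
            stddev,
            error_count,
        }
    }
}
```

Without the explicit guards, an empty run would divide by zero when computing the mean, and a single-score run would do the same in the variance, so both would surface as NaN in the serialized report.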
@bug-ops enabled auto-merge (squash) on April 25, 2026 at 21:40
@bug-ops merged commit f0e48d7 into main on Apr 25, 2026
36 checks passed
@bug-ops deleted the bench-fix branch on April 25, 2026 at 21:45
Labels

  • dependencies: Dependency updates
  • documentation: Improvements or additions to documentation
  • enhancement: New feature or request
  • rust: Rust code changes
  • size/XL: Extra large PR (500+ lines)