Release v0.1.0
Initial public release. The crates, the Python SDK, and the protocol all start at
this version.
Highlights
- Code-first eval framework: Eval = Dataset(Sample…) + Subject + [Scorer…] crossed with a provider-agnostic Target matrix, a broad built-in scorer vocabulary (text, tools, budgets, files, combinators, LLM-judge), and an in-process runner (#2).
- The eval protocol (1.0): newline-delimited JSON over stdio between the study and the host, with MAJOR.MINOR versioning, capability negotiation, and a machine-readable JSON Schema generated from the wire types (#16).
- Native Python SDK: author studies in pure-stdlib Python (no Rust dependency); wire types and protocol metadata are generated from the schema with a drift guard (#22, #25).
- Trials, pass@k, and seeds: first-class N-sampling for pass-rate and variance with an unbiased pass@k estimator and reproducible per-trial seeds (#24).
- Multimodal & interactive evals: typed multimodal content (input attachments + graded output) and simulated-user multi-turn dialogs folded into one transcript (#28).
- Provider-backed LLM judge + N/A semantics: LlmJudge scorers over OpenAI/Anthropic and a third "couldn't evaluate" state, so infra failures degrade to N/A instead of a false fail (#6, #8).
- Adaptive matrix concurrency: bounded, provider-aware throttling that multiplexes runs over one pipe and backs off on rate limits (#4).
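For readers unfamiliar with the "unbiased pass@k estimator" mentioned in the highlights, here is a minimal sketch of the estimator commonly used for this purpose: given n trials of which c passed, it estimates the probability that at least one of k randomly drawn trials passes as 1 - C(n-c, k) / C(n, k). These notes do not state which formula Mira uses, so treat this as an assumption. The function name pass_at_k is hypothetical and is not part of the Mira SDK.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n trials with c passes (hypothetical helper).

    Assumes the standard combinatorial estimator; not taken from Mira's code.
    """
    if n < 1 or not 0 <= c <= n:
        raise ValueError("need n >= 1 and 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets made up only of failing trials;
    # it is 0 when fewer than k trials failed, which yields an estimate of 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 10 trials, 3 passes: estimate for a single draw and for five draws.
    print(pass_at_k(10, 3, 1))
    print(pass_at_k(10, 3, 5))
```

With k = 1 the estimate reduces to the plain pass rate c / n; larger k values reward samples that pass at least occasionally.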
What's Changed
- Targets, not models: rename ModelSpec→Target + --axis/--preset selection (#34) by @chaliy
- feat(protocol): reserve the study→host reverse-request channel seam (#32) by @chaliy
- feat(protocol): cursor-paginated sample listing (1.10) (#31) by @chaliy
- feat(protocol): promote multimodal output + capability params to the wire (1.11) (#30) by @chaliy
- feat(protocol): cancel an in-flight run by id (protocol 1.8) (#29) by @chaliy
- feat: multimodality, interactive multi-turn evals, and structured capability params (#28) by @chaliy
- feat(protocol): typed, correlated event/log notifications (1.9) (#27) by @chaliy
- feat(protocol): metadata columns for samples/models + report --group-by (#26) by @chaliy
- feat(sdks): generate protocol metadata for the Python SDK drift guard (#25) by @chaliy
- feat(protocol): trials/repetitions + seed with pass@k aggregation (#24) by @chaliy
- feat(protocol): structured RPC errors (protocol 1.5) (#23) by @chaliy
- feat(sdks): native Python SDK for authoring eval studies (#22) by @chaliy
- feat(protocol): make metadata open-ended (string → JSON) (#21) by @chaliy
- feat(cli): record environment metadata in saved runs (#20) by @chaliy
- feat(cli): add AI-friendly mira help --full and reword tagline (#18) by @chaliy
- feat(protocol): machine-readable JSON Schema generated from wire types (#16) by @chaliy
- feat(cli): --save run archive with run ids, timestamps, and mira.toml (#15) by @chaliy
- feat: split subject execution from scoring (execute/score, rescore) (#11) by @chaliy
- feat(metrics): extensible numeric metrics map + generic budget scorers (#10) by @chaliy
- feat: surface infrastructure errors as N/A (not failures), retryable (#8) by @chaliy
- feat(scorer): N/A score state + provider-backed LLM judge (#6) by @chaliy
- feat(exec): bounded, provider-aware, adaptive matrix concurrency (#4) by @chaliy
- feat: live progress bar and session-backed checkpoints for mira run (#3) by @chaliy
- Productionize the Mira eval-framework PoC into a published workspace (#2) by @chaliy
- chore(protocol): reset protocol version to the 1.0 baseline (#33) by @chaliy
- chore(just): add install recipe (#17) by @chaliy
- chore(ship): resolve addressed PR review comments (#13) by @chaliy
- chore(skills): adopt ship skill and split public/internal skill layout (#9) by @chaliy
- docs: finish Target/expected rename in docs and examples (follow-up to #34) (#35) by @chaliy
- docs: add docs index + public-docs spec, reconcile drift (#19) by @chaliy
- docs(contributing): document main branch-protection gate (#14) by @chaliy
- docs(readme): reframe as evals toolkit with overview diagram (#12) by @chaliy
- docs: surface agentic-trajectory eval as a headline strength (#7) by @chaliy
- docs: extensibility guide + custom-subject example (#5) by @chaliy