Release v0.1.0

@github-actions github-actions released this 23 Jun 01:11
· 22 commits to main since this release
c12cc75

Initial public release. The crates, the Python SDK, and the protocol all start at
this version.

Highlights

  • Code-first eval framework: Eval = Dataset(Sample…) + Subject + [Scorer…] crossed with a provider-agnostic Target matrix, a broad built-in scorer vocabulary (text, tools, budgets, files, combinators, LLM-judge), and an in-process runner (#2).
  • The eval protocol (1.0) — newline-delimited JSON over stdio between the study and the host, with MAJOR.MINOR versioning, capability negotiation, and a machine-readable JSON Schema generated from the wire types (#16).
  • Native Python SDK — author studies in pure-stdlib Python (no Rust dependency); wire types and protocol metadata are generated from the schema with a drift guard (#22, #25).
  • Trials, pass@k, and seeds — first-class N-sampling for pass-rate and variance with an unbiased pass@k estimator and reproducible per-trial seeds (#24).
  • Multimodal & interactive evals — typed multimodal content (input attachments + graded output) and simulated-user multi-turn dialogs folded into one transcript (#28).
  • Provider-backed LLM judge + N/A semantics: LlmJudge scorers over OpenAI/Anthropic and a third "couldn't evaluate" state, so infra failures degrade to N/A instead of a false fail (#6, #8).
  • Adaptive matrix concurrency — bounded, provider-aware throttling that multiplexes runs over one pipe and backs off on rate limits (#4).
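
For readers unfamiliar with the "unbiased pass@k estimator" mentioned in the trials highlight, here is a minimal sketch of the commonly used form: given n trials of which c passed, pass@k = 1 - C(n-c, k) / C(n, k). The function name `pass_at_k` and its signature are hypothetical and not taken from the Mira crates or the Python SDK; whether Mira computes it exactly this way is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n trials with c passes (hypothetical helper).

    Computes 1 - C(n - c, k) / C(n, k): the probability that a random
    subset of k trials, drawn without replacement from the n observed
    trials, contains at least one pass.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failures: every k-subset must contain a pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 10 trials, 3 passes: report pass@1 and pass@5.
    print(pass_at_k(10, 3, 1))
    print(pass_at_k(10, 3, 5))
```

With k = 1 the estimate reduces to the plain pass rate c / n; larger k rewards a subject that succeeds at least occasionally across trials.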

What's Changed

  • Targets, not models: rename ModelSpec→Target + --axis/--preset selection (#34) by @chaliy
  • feat(protocol): reserve the study→host reverse-request channel seam (#32) by @chaliy
  • feat(protocol): cursor-paginated sample listing (1.10) (#31) by @chaliy
  • feat(protocol): promote multimodal output + capability params to the wire (1.11) (#30) by @chaliy
  • feat(protocol): cancel an in-flight run by id (protocol 1.8) (#29) by @chaliy
  • feat: multimodality, interactive multi-turn evals, and structured capability params (#28) by @chaliy
  • feat(protocol): typed, correlated event/log notifications (1.9) (#27) by @chaliy
  • feat(protocol): metadata columns for samples/models + report --group-by (#26) by @chaliy
  • feat(sdks): generate protocol metadata for the Python SDK drift guard (#25) by @chaliy
  • feat(protocol): trials/repetitions + seed with pass@k aggregation (#24) by @chaliy
  • feat(protocol): structured RPC errors (protocol 1.5) (#23) by @chaliy
  • feat(sdks): native Python SDK for authoring eval studies (#22) by @chaliy
  • feat(protocol): make metadata open-ended (string → JSON) (#21) by @chaliy
  • feat(cli): record environment metadata in saved runs (#20) by @chaliy
  • feat(cli): add AI-friendly mira help --full and reword tagline (#18) by @chaliy
  • feat(protocol): machine-readable JSON Schema generated from wire types (#16) by @chaliy
  • feat(cli): --save run archive with run ids, timestamps, and mira.toml (#15) by @chaliy
  • feat: split subject execution from scoring (execute/score, rescore) (#11) by @chaliy
  • feat(metrics): extensible numeric metrics map + generic budget scorers (#10) by @chaliy
  • feat: surface infrastructure errors as N/A (not failures), retryable (#8) by @chaliy
  • feat(scorer): N/A score state + provider-backed LLM judge (#6) by @chaliy
  • feat(exec): bounded, provider-aware, adaptive matrix concurrency (#4) by @chaliy
  • feat: live progress bar and session-backed checkpoints for mira run (#3) by @chaliy
  • Productionize the Mira eval-framework PoC into a published workspace (#2) by @chaliy
  • chore(protocol): reset protocol version to the 1.0 baseline (#33) by @chaliy
  • chore(just): add install recipe (#17) by @chaliy
  • chore(ship): resolve addressed PR review comments (#13) by @chaliy
  • chore(skills): adopt ship skill and split public/internal skill layout (#9) by @chaliy
  • docs: finish Target/expected rename in docs and examples (follow-up to #34) (#35) by @chaliy
  • docs: add docs index + public-docs spec, reconcile drift (#19) by @chaliy
  • docs(contributing): document main branch-protection gate (#14) by @chaliy
  • docs(readme): reframe as evals toolkit with overview diagram (#12) by @chaliy
  • docs: surface agentic-trajectory eval as a headline strength (#7) by @chaliy
  • docs: extensibility guide + custom-subject example (#5) by @chaliy