Releases: everruns/mira
Release v0.3.0
Added
- Self-describing results. A RunResult (and the persisted
cases/<key>/result.json) now carries the sample's input (the prompt turns
sent) and expected (the reference value, when the dataset provides one), so a
saved result can be read back without the original dataset. Both are optional
on the wire: input is omitted when empty, expected when absent.
- Docs diagrams. Five new committed SVGs visualise the model, each in its
topical guide: the end-to-end workflow (mira-workflow.svg: author →
plan → execute → score → report) in getting-started.md,
the entity hierarchy (mira-entities.svg: study ▸ eval ▸
dataset/subject/scorers/targets/axes, expanded into cases · trials ·
transcripts · scores) in authoring.md, the host ⇄
study run lifecycle (mira-run-lifecycle.svg: the protocol sequence for one
run) in how-it-works.md, the subject fan-in
(mira-subjects.svg: the three subject shapes normalising into one
Transcript) in subjects.md, and the scoring flow
(mira-scoring.svg: transcript surfaces → scorers → case verdict) in
scorers.md. Indexed in docs/README.md.
- JSONL and CSV report formats (--format jsonl / --format csv) for
un-aggregated, analysis-ready exports. jsonl writes one RunResult per line
(lossless; the line-delimited dual of json); csv is long-format, one row
per (case × score) with the case columns repeated and open-vocabulary
metrics/metadata flattened into stable metric.* / meta.* columns. Both
work anywhere --out/--format do (run, report, score); a --group-by
view is intentionally not folded in, so the consumer aggregates the rows.
- Per-case wall-clock timeout: give up on a case after a budget of seconds,
cancelling the in-flight run (best-effort cancel over the protocol) and
recording it as a failed case. Set it on the CLI (mira run --timeout SECONDS,
all targets), per target in mira.toml ([targets.LABEL].timeout), or as a
preset default ([presets.NAME].timeout). Precedence, first set wins:
--timeout > per-target > preset; unset ⇒ no limit. A timeout is non-retryable
(retrying would burn the same budget) and counts as a target failure.
- Glob case selection. --targets, --samples (new), and --evals (new)
match the target label / sample id / eval name by glob (*, ?, [set],
{a,b}); a literal value stays an exact match. --axis values are globbed
too. A small dep-free matcher (mira::glob_match) backs both the host and the
in-process Runner (Runner::samples(…), glob-aware Runner::targets(…)).
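The long-format CSV shape described above (one row per case × score, case columns repeated, open-vocabulary maps flattened into metric.* / meta.* columns) can be sketched in Python for readers who post-process the jsonl export themselves. This is only an illustration: the field names key, target, scores, scorer, value, metrics, and metadata are assumptions made for the example, not the actual RunResult schema.

```python
import csv
import io
import json


def flatten_result(result):
    """Yield one long-format row per score for a single result dict.

    All field names used here are hypothetical stand-ins for the real schema.
    """
    # Case-level columns are repeated on every row belonging to the case.
    base = {"case": result["key"], "target": result.get("target", "")}
    # Open-vocabulary maps become prefixed columns: metric.* and meta.*
    for name, value in result.get("metrics", {}).items():
        base[f"metric.{name}"] = value
    for name, value in result.get("metadata", {}).items():
        base[f"meta.{name}"] = value
    for score in result.get("scores", []):
        yield {**base, "scorer": score["scorer"], "value": score["value"]}


def jsonl_to_csv(jsonl_text):
    """Convert JSONL text (one result per line) into long-format CSV text."""
    rows = []
    for line in jsonl_text.splitlines():
        if line.strip():
            rows.extend(flatten_result(json.loads(line)))
    # Stable column order: fixed columns first, then sorted dynamic ones.
    fixed = ["case", "target", "scorer", "value"]
    dynamic = sorted({key for row in rows for key in row} - set(fixed))
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fixed + dynamic, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


if __name__ == "__main__":
    sample = "\n".join([
        json.dumps({"key": "case-1", "target": "gpt", "metrics": {"tokens": 12},
                    "scores": [{"scorer": "contains", "value": 1.0},
                               {"scorer": "length", "value": 0.5}]}),
        json.dumps({"key": "case-2", "target": "gpt", "metadata": {"lang": "en"},
                    "scores": [{"scorer": "contains", "value": 0.0}]}),
    ])
    print(jsonl_to_csv(sample))
```

Columns missing from a given result are left empty, which keeps the header stable across heterogeneous rows; any grouping or aggregation is left to the consumer, as the note above says.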
Changed
- BREAKING (preset): the preset filter key is replaced by per-dimension
samples (glob on sample id). targets/samples/evals in [presets.NAME]
now glob-match and accept either a single string or a list. The cross-cutting
case-key substring stays available as the positional mira run [filter].
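The glob semantics described in this release (*, ?, [set], {a,b}, with a literal value remaining an exact match, and a selector given as either one string or a list) can be approximated in a few lines of Python. This is a sketch of the behaviour only, built on the standard fnmatch module plus a small brace expander; it is not the mira::glob_match implementation, and the select helper is a hypothetical name.

```python
from fnmatch import fnmatchcase


def expand_braces(pattern):
    """Expand {a,b} groups into plain patterns (nested braces unsupported)."""
    start = pattern.find("{")
    end = pattern.find("}", start + 1)
    if start == -1 or end == -1:
        return [pattern]
    head, body, tail = pattern[:start], pattern[start + 1:end], pattern[end + 1:]
    expanded = []
    for option in body.split(","):
        # Recurse so that later brace groups in the tail are expanded too.
        expanded.extend(expand_braces(head + option + tail))
    return expanded


def glob_match(pattern, value):
    """Case-sensitive match of value against a glob with brace alternatives."""
    return any(fnmatchcase(value, p) for p in expand_braces(pattern))


def select(patterns, values):
    """Filter values by one pattern string or a list of patterns."""
    if isinstance(patterns, str):
        patterns = [patterns]
    return [v for v in values if any(glob_match(p, v) for p in patterns)]


if __name__ == "__main__":
    targets = ["gpt-4o", "gpt-4o-mini", "claude-sonnet"]
    print(select("gpt-*", targets))
    print(select(["{gpt,claude}-*o*"], targets))
```

A pattern without wildcard characters only matches an identical string, which mirrors the "literal value stays an exact match" rule.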
Release v0.2.0
Added
- skills.sh: install the Mira agent skill into a Claude Code skills directory
so an agent can author and run evals. --global targets ~/.claude/skills/mira,
--local (the default) targets ./.claude/skills/mira. It copies from a local
checkout when present, else fetches from GitHub raw (--ref), so
curl -fsSL .../skills.sh | sh works on a box that only has the prebuilt
binary. Each run is a clean replace, so it also serves as the upgrade path.
- Native TypeScript SDK (sdks/typescript, mira-eval): author
eval studies in TypeScript/Node with no Rust dependency. It is a zero-runtime-dep
library over the protocol, with wire types and protocol metadata generated from
schema/v1/ (a self-contained codegen.mjs --check drift guard, the TS dual of
the Rust/Python guards), a serve() loop (incl. the execute/score split and
list_samples pagination), a parity authoring API, and conformance + behaviour
tests. Worked example: examples/greet-typescript. Publishes to npm as
mira-eval via OIDC trusted publishing (publish-typescript in publish.yml),
mirroring the Python PyPI flow.
- Named launchers in mira.toml: [launchers.NAME] saves a study invocation
(bin/example/cmd/uv/python/python3 + package/manifest_path),
selected with --launcher NAME. default_launcher makes a bare mira run
work; explicit launch flags override the named launcher, mirroring --preset.
- cargo binstall mira-cli support: [package.metadata.binstall] points binstall
at the prebuilt release tarballs, so the mira binary installs without a compile.
- Polyglot launcher flags: mira --uv/--python/--python3 SCRIPT
drive a non-Rust study directly (e.g. mira --python3 study.py run), replacing
the verbose --cmd "python3 study.py". --cmd still works for an arbitrary
command line.
- mira help --full now surfaces a GUIDES section (each docs/ guide with a
one-line scope, for progressive disclosure) and a link to the mira agent skill
in LINKS, so an agent can self-orient to the docs and skill in one read. A
drift guard keeps the guide list in sync with docs/README.md.
- Run folders, save-by-default, and resume. Every mira run / mira score now
saves a run folder under the results dir by default: <run_id>/ with
meta.json, report.json / report.html, and one cases/<key>/result.json per
finished case (written atomically as it lands). --dry-run opts out.
- mira run --resume <run_id> reopens an interrupted run's folder, skips the cases
already recorded under cases/, and runs only what's missing.
- mira report <run_id>: new subcommand that re-renders a saved run's reports
from its stored per-case results, with no study process and no re-execution.
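The run-folder behaviour above (one cases/<key>/result.json per finished case, written atomically, with resume skipping whatever is already recorded) follows a common pattern that can be sketched in Python. The function names, the use of a temp file plus os.replace, and the flat key layout are assumptions for illustration; the source does not describe how Mira implements the atomic write or the resume scan.

```python
import json
import os
import tempfile
from pathlib import Path


def write_result_atomically(run_dir, key, result):
    """Write cases/<key>/result.json via a temp file and an atomic rename."""
    case_dir = Path(run_dir) / "cases" / key
    case_dir.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the same directory so os.replace stays on one
    # filesystem, which is what makes the rename atomic.
    fd, tmp_path = tempfile.mkstemp(dir=case_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(result, handle)
    os.replace(tmp_path, case_dir / "result.json")


def pending_cases(run_dir, planned_keys):
    """Return the planned case keys that have no saved result yet."""
    cases_dir = Path(run_dir) / "cases"
    done = {path.parent.name for path in cases_dir.glob("*/result.json")}
    return [key for key in planned_keys if key not in done]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as run_dir:
        write_result_atomically(run_dir, "case-a", {"passed": True})
        # Only case-b is still missing, so a resume would run just that one.
        print(pending_cases(run_dir, ["case-a", "case-b"]))
```

Because a result file only appears once it is complete, an interrupted run never leaves a half-written case that a later resume would mistake for finished work.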
Changed
- The execution unit (one eval × sample × target × axis × trial) is now called a
case (was "cell"): Cell/CellSpec → Case/CaseSpec, run_cells → run_cases, etc.
The dataset-row builder .case(id, prompt) → .sample(id, prompt), and the
prebuilt-Sample adder .sample(Sample) → .add_sample(Sample).
The pub type Case = Sample alias is removed.
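To make the definition of a case concrete, the cross product eval × sample × target × axis × trial can be enumerated with a short Python sketch. The plan_cases name and the slash-separated key format are hypothetical, chosen only to show how the case count multiplies; Mira's real case keys may look different.

```python
from itertools import product


def plan_cases(evals, samples, targets, axis_values, trials):
    """Yield one illustrative case key per eval x sample x target x axis x trial."""
    combos = product(evals, samples, targets, axis_values, range(trials))
    for ev, sample, target, axis, trial in combos:
        # Hypothetical key layout, used here only for readability.
        yield f"{ev}/{sample}/{target}/{axis}/t{trial}"


if __name__ == "__main__":
    cases = list(plan_cases(["greet"], ["s1", "s2"], ["gpt", "claude"], ["temp=0"], 3))
    # 1 eval x 2 samples x 2 targets x 1 axis value x 3 trials
    print(len(cases))
    print(cases[0])
```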
Removed
- --checkpoint, --fresh, and --save on mira run / mira score, plus the
mira::session::Session type. The single-file checkpoint is superseded by the
always-saved run folder; resume is now explicit via --resume <run_id> (a fresh
run mints a new id and reuses nothing, so there is no silent stale-result reuse).
Configure the results dir via [results].dir in mira.toml (the --save <dir>
override is gone).
Release v0.1.0
Initial public release. The crates, the Python SDK, and the protocol all start at
this version.
Highlights
- Code-first eval framework: Eval = Dataset(Sample…) + Subject + [Scorer…] crossed with a provider-agnostic Target matrix, a broad built-in scorer vocabulary (text, tools, budgets, files, combinators, LLM-judge), and an in-process runner (#2).
- The eval protocol (1.0): newline-delimited JSON over stdio between the study and the host, with MAJOR.MINOR versioning, capability negotiation, and a machine-readable JSON Schema generated from the wire types (#16).
- Native Python SDK: author studies in pure-stdlib Python (no Rust dependency); wire types and protocol metadata are generated from the schema with a drift guard (#22, #25).
- Trials, pass@k, and seeds: first-class N-sampling for pass-rate and variance with an unbiased pass@k estimator and reproducible per-trial seeds (#24).
- Multimodal & interactive evals: typed multimodal content (input attachments + graded output) and simulated-user multi-turn dialogs folded into one transcript (#28).
- Provider-backed LLM judge + N/A semantics: LlmJudge scorers over OpenAI/Anthropic and a third "couldn't evaluate" state, so infra failures degrade to N/A instead of a false fail (#6, #8).
- Adaptive matrix concurrency: bounded, provider-aware throttling that multiplexes runs over one pipe and backs off on rate limits (#4).
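For readers unfamiliar with the term, the unbiased pass@k estimator in common use is 1 - C(n - c, k) / C(n, k), where n is the number of trials and c the number that passed. The Python sketch below implements that standard formula; whether Mira computes it in exactly this form is an assumption, and pass_at_k is a hypothetical name.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k trials drawn from n passes.

    n: total trials run, c: trials that passed, k: sample size (1 <= k <= n).
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) counts the k-subsets made up only of failing trials;
    # it is 0 whenever fewer than k trials failed.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    print(pass_at_k(10, 3, 1))
    print(pass_at_k(10, 3, 5))
```

With k = 1 the estimator reduces to the plain pass rate c / n, which is a quick sanity check on any implementation.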
What's Changed
- Targets, not models: rename ModelSpec→Target + --axis/--preset selection (#34) by @chaliy
- feat(protocol): reserve the study→host reverse-request channel seam (#32) by @chaliy
- feat(protocol): cursor-paginated sample listing (1.10) (#31) by @chaliy
- feat(protocol): promote multimodal output + capability params to the wire (1.11) (#30) by @chaliy
- feat(protocol): cancel an in-flight run by id (protocol 1.8) (#29) by @chaliy
- feat: multimodality, interactive multi-turn evals, and structured capability params (#28) by @chaliy
- feat(protocol): typed, correlated event/log notifications (1.9) (#27) by @chaliy
- feat(protocol): metadata columns for samples/models + report --group-by (#26) by @chaliy
- feat(sdks): generate protocol metadata for the Python SDK drift guard (#25) by @chaliy
- feat(protocol): trials/repetitions + seed with pass@k aggregation (#24) by @chaliy
- feat(protocol): structured RPC errors (protocol 1.5) (#23) by @chaliy
- feat(sdks): native Python SDK for authoring eval studies (#22) by @chaliy
- feat(protocol): make metadata open-ended (string → JSON) (#21) by @chaliy
- feat(cli): record environment metadata in saved runs (#20) by @chaliy
- feat(cli): add AI-friendly mira help --full and reword tagline (#18) by @chaliy
- feat(protocol): machine-readable JSON Schema generated from wire types (#16) by @chaliy
- feat(cli): --save run archive with run ids, timestamps, and mira.toml (#15) by @chaliy
- feat: split subject execution from scoring (execute/score, rescore) (#11) by @chaliy
- feat(metrics): extensible numeric metrics map + generic budget scorers (#10) by @chaliy
- feat: surface infrastructure errors as N/A (not failures), retryable (#8) by @chaliy
- feat(scorer): N/A score state + provider-backed LLM judge (#6) by @chaliy
- feat(exec): bounded, provider-aware, adaptive matrix concurrency (#4) by @chaliy
- feat: live progress bar and session-backed checkpoints for mira run (#3) by @chaliy
- Productionize the Mira eval-framework PoC into a published workspace (#2) by @chaliy
- chore(protocol): reset protocol version to the 1.0 baseline (#33) by @chaliy
- chore(just): add install recipe (#17) by @chaliy
- chore(ship): resolve addressed PR review comments (#13) by @chaliy
- chore(skills): adopt ship skill and split public/internal skill layout (#9) by @chaliy
- docs: finish Target/expected rename in docs and examples (follow-up to #34) (#35) by @chaliy
- docs: add docs index + public-docs spec, reconcile drift (#19) by @chaliy
- docs(contributing): document main branch-protection gate (#14) by @chaliy
- docs(readme): reframe as evals toolkit with overview diagram (#12) by @chaliy
- docs: surface agentic-trajectory eval as a headline strength (#7) by @chaliy
- docs: extensibility guide + custom-subject example (#5) by @chaliy