Releases: everruns/mira

Release v0.3.0

28 Jun 03:20
0830385

Added

  • Self-describing results. A RunResult (and the persisted
    cases/<key>/result.json) now carries the sample's input (the prompt turns
    sent) and expected (the reference value, when the dataset provides one), so a
    saved result can be read back without the original dataset. Both are optional
    on the wire — input omitted when empty, expected when absent.
  • Docs diagrams. Five new committed SVGs visualise the model, each in its
    topical guide: the end-to-end workflow (mira-workflow.svg — author →
    plan → execute → score → report) in getting-started.md,
    the entity hierarchy (mira-entities.svg — study ▸ eval ▸
    dataset/subject/scorers/targets/axes, expanded into cases · trials ·
    transcripts · scores) in authoring.md, the host ⇄
    study run lifecycle
    (mira-run-lifecycle.svg — the protocol sequence for one
    run) in how-it-works.md, the subject fan-in
    (mira-subjects.svg — the three subject shapes normalising into one
    Transcript) in subjects.md, and the scoring flow
    (mira-scoring.svg — transcript surfaces → scorers → case verdict) in
    scorers.md. Indexed in
    docs/README.md.
  • JSONL and CSV report formats (--format jsonl / --format csv) for
    un-aggregated, analysis-ready exports. jsonl writes one RunResult per line
    (lossless — the line-delimited dual of json); csv is long-format, one row
    per (case × score) with the case columns repeated and open-vocabulary
    metrics/metadata flattened into stable metric.*/meta.* columns. Both
    work anywhere --out/--format do (run, report, score); a --group-by
    view is intentionally not folded in — the consumer aggregates the rows.
  • Per-case wall-clock timeout: give up on a case after a budget of seconds,
    cancelling the in-flight run (best-effort cancel over the protocol) and
    recording it as a failed case. Set it on the CLI (mira run --timeout SECONDS,
    all targets), per target in mira.toml ([targets.LABEL].timeout), or as a
    preset default ([presets.NAME].timeout). Precedence, first set wins:
    --timeout > per-target > preset; unset ⇒ no limit. A timeout is non-retryable
    (retrying would burn the same budget) and counts as a target failure.
  • Glob case selection. --targets, --samples (new), and --evals (new)
    match the target label / sample id / eval name by glob (*, ?, [set],
    {a,b}); a literal value stays an exact match. --axis values are globbed
    too. A small dep-free matcher (mira::glob_match) backs both the host and the
    in-process Runner (Runner::samples(…), glob-aware Runner::targets(…)).
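
The timeout precedence described above (--timeout, then per-target, then preset, with unset meaning no limit) can be illustrated with a short Python sketch. It is not Mira's implementation: the function name resolve_timeout and its parameters are hypothetical and only mirror the order stated in the notes.

```python
from typing import Optional


def resolve_timeout(
    cli: Optional[float],
    per_target: Optional[float],
    preset: Optional[float],
) -> Optional[float]:
    """Return the first timeout that is set, or None for "no limit".

    Hypothetical helper mirroring the documented order:
    --timeout > [targets.LABEL].timeout > [presets.NAME].timeout.
    """
    for value in (cli, per_target, preset):
        if value is not None:
            return value
    return None  # unset everywhere: the case runs without a limit
```

For example, resolve_timeout(None, 30, 60) picks the per-target value of 30 seconds, because no CLI value was given.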

Changed

  • BREAKING (preset): the preset filter key is replaced by per-dimension
    samples (glob on sample id). targets/samples/evals in [presets.NAME]
    now glob-match and accept either a single string or a list. The cross-cutting
    case-key substring stays available as the positional mira run [filter].
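
As a rough illustration of "a single string or a list" combined with glob matching, the sketch below uses Python's standard fnmatch module. Several things here are assumptions: the helper names are hypothetical, a list is assumed to mean "match any of these patterns", and fnmatch covers only *, ? and [set], not the {a,b} alternation that Mira's own matcher (mira::glob_match) supports.

```python
from fnmatch import fnmatchcase
from typing import List, Union


def as_patterns(value: Union[str, List[str]]) -> List[str]:
    """Normalise a single string or a list of strings to a list."""
    return [value] if isinstance(value, str) else list(value)


def matches_any(name: str, value: Union[str, List[str]]) -> bool:
    """True if `name` matches at least one pattern (assumed "any" semantics).

    Approximation only: fnmatchcase handles *, ? and [set], but not {a,b}.
    """
    return any(fnmatchcase(name, pattern) for pattern in as_patterns(value))
```

A value without glob characters behaves as an exact match here: matches_any("math", "math") is true, while matches_any("math-001", "math") is false.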

Release v0.2.0

24 Jun 03:16
6c7202b

Added

  • skills.sh — install the Mira agent skill into a Claude Code skills directory
    so an agent can author and run evals. --global targets ~/.claude/skills/mira,
    --local (the default) targets ./.claude/skills/mira. It copies from a local
    checkout when present, else fetches from GitHub raw (--ref), so
    curl -fsSL .../skills.sh | sh works on a box that only has the prebuilt
    binary. Each run is a clean replace, so it also serves as the upgrade path.
  • Native TypeScript SDK (sdks/typescript, mira-eval) — author
    eval studies in TypeScript/Node with no Rust dependency: a zero-runtime-dep
    library over the protocol, with wire types and protocol metadata generated from
    schema/v1/ (a self-contained codegen.mjs --check drift guard, the TS dual of
    the Rust/Python guards), a serve() loop (incl. the execute/score split and
    list_samples pagination), a parity authoring API, and conformance + behaviour
    tests. Worked example: examples/greet-typescript. Publishes to npm as
    mira-eval via OIDC trusted publishing (publish-typescript in publish.yml),
    mirroring the Python PyPI flow.
  • Named launchers in mira.toml: [launchers.NAME] saves a study invocation
    (bin/example/cmd/uv/python/python3 + package/manifest_path),
    selected with --launcher NAME. default_launcher makes a bare mira run
    work; explicit launch flags override the named launcher, mirroring --preset.
  • cargo binstall mira-cli support: [package.metadata.binstall] points binstall
    at the prebuilt release tarballs, so the mira binary installs without a compile.
  • Polyglot launcher flags: mira --uv / --python / --python3 SCRIPT
    drive a non-Rust study directly (e.g. mira --python3 study.py run), replacing
    the verbose --cmd "python3 study.py". --cmd still works for an arbitrary
    command line.
  • mira help --full now surfaces a GUIDES section (each docs/ guide with a
    one-line scope, for progressive disclosure) and a link to the mira agent skill
    in LINKS, so an agent can self-orient to the docs and skill in one read. A
    drift guard keeps the guide list in sync with docs/README.md.
  • Run folders, save-by-default, and resume. Every mira run/mira score now
    saves a run folder under the results dir by default — <run_id>/ with
    meta.json, report.json/report.html, and one cases/<key>/result.json per
    finished case (written atomically as it lands). --dry-run opts out.
  • mira run --resume <run_id> reopens an interrupted run's folder, skips the cases
    already recorded under cases/, and runs only what's missing.
  • mira report <run_id> — new subcommand that re-renders a saved run's reports
    from its stored per-case results, with no study process and no re-execution.
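
The resume behaviour described above (skip cases already recorded under cases/, run only what is missing) can be sketched against the documented folder layout <run_id>/cases/<key>/result.json. This is a minimal illustration, not Mira's code: the function name pending_cases is hypothetical, and it assumes each case key is usable as a directory name.

```python
from pathlib import Path
from typing import Iterable, List


def pending_cases(run_dir: Path, planned_keys: Iterable[str]) -> List[str]:
    """Return the planned case keys that have no cases/<key>/result.json yet.

    Hypothetical helper: a case counts as finished only if its result file
    exists, so a directory without result.json is still treated as pending.
    """
    cases_dir = run_dir / "cases"
    return [
        key
        for key in planned_keys
        if not (cases_dir / key / "result.json").is_file()
    ]
```

Because the notes say each result.json is written atomically as the case lands, checking for the file's existence is a reasonable proxy for "this case finished" in such a sketch.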

Changed

  • The execution unit (one eval × sample × target × axis × trial) is now called a
    case (was "cell"): Cell/CellSpec → Case/CaseSpec, run_cells → run_cases, etc.
    The dataset-row builder .case(id, prompt) becomes .sample(id, prompt), and the
    prebuilt-Sample adder .sample(Sample) becomes .add_sample(Sample).
    The pub type Case = Sample alias is removed.

Removed

  • --checkpoint, --fresh, and --save on mira run/mira score, plus the
    mira::session::Session type. The single-file checkpoint is superseded by the
    always-saved run folder; resume is now explicit via --resume <run_id> (a fresh
    run mints a new id and reuses nothing, so there is no silent stale-result reuse).
    Configure the results dir via [results].dir in mira.toml (the --save <dir>
    override is gone).

Release v0.1.0

23 Jun 01:11
c12cc75

Initial public release. The crates, the Python SDK, and the protocol all start at
this version.

Highlights

  • Code-first eval framework: Eval = Dataset(Sample…) + Subject + [Scorer…] crossed with a provider-agnostic Target matrix, a broad built-in scorer vocabulary (text, tools, budgets, files, combinators, LLM-judge), and an in-process runner (#2).
  • The eval protocol (1.0) — newline-delimited JSON over stdio between the study and the host, with MAJOR.MINOR versioning, capability negotiation, and a machine-readable JSON Schema generated from the wire types (#16).
  • Native Python SDK — author studies in pure-stdlib Python (no Rust dependency); wire types and protocol metadata are generated from the schema with a drift guard (#22, #25).
  • Trials, pass@k, and seeds — first-class N-sampling for pass-rate and variance with an unbiased pass@k estimator and reproducible per-trial seeds (#24).
  • Multimodal & interactive evals — typed multimodal content (input attachments + graded output) and simulated-user multi-turn dialogs folded into one transcript (#28).
  • Provider-backed LLM judge + N/A semantics: LlmJudge scorers over OpenAI/Anthropic and a third "couldn't evaluate" state, so infra failures degrade to N/A instead of a false fail (#6, #8).
  • Adaptive matrix concurrency — bounded, provider-aware throttling that multiplexes runs over one pipe and backs off on rate limits (#4).
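
The notes mention an unbiased pass@k estimator without giving its formula. The sketch below assumes the commonly used combinatorial form, 1 - C(n - c, k) / C(n, k), for n trials of which c passed; Mira's actual implementation may differ, and the function name pass_at_k is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n trials with c passes.

    Computes 1 - C(n - c, k) / C(n, k): the probability that a random
    size-k subset of the n trials contains at least one passing trial.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k trials failed, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 this reduces to the plain pass rate c / n; larger k rewards a target that passes at least once within k attempts.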

What's Changed

  • Targets, not models: rename ModelSpec→Target + --axis/--preset selection (#34) by @chaliy
  • feat(protocol): reserve the study→host reverse-request channel seam (#32) by @chaliy
  • feat(protocol): cursor-paginated sample listing (1.10) (#31) by @chaliy
  • feat(protocol): promote multimodal output + capability params to the wire (1.11) (#30) by @chaliy
  • feat(protocol): cancel an in-flight run by id (protocol 1.8) (#29) by @chaliy
  • feat: multimodality, interactive multi-turn evals, and structured capability params (#28) by @chaliy
  • feat(protocol): typed, correlated event/log notifications (1.9) (#27) by @chaliy
  • feat(protocol): metadata columns for samples/models + report --group-by (#26) by @chaliy
  • feat(sdks): generate protocol metadata for the Python SDK drift guard (#25) by @chaliy
  • feat(protocol): trials/repetitions + seed with pass@k aggregation (#24) by @chaliy
  • feat(protocol): structured RPC errors (protocol 1.5) (#23) by @chaliy
  • feat(sdks): native Python SDK for authoring eval studies (#22) by @chaliy
  • feat(protocol): make metadata open-ended (string → JSON) (#21) by @chaliy
  • feat(cli): record environment metadata in saved runs (#20) by @chaliy
  • feat(cli): add AI-friendly mira help --full and reword tagline (#18) by @chaliy
  • feat(protocol): machine-readable JSON Schema generated from wire types (#16) by @chaliy
  • feat(cli): --save run archive with run ids, timestamps, and mira.toml (#15) by @chaliy
  • feat: split subject execution from scoring (execute/score, rescore) (#11) by @chaliy
  • feat(metrics): extensible numeric metrics map + generic budget scorers (#10) by @chaliy
  • feat: surface infrastructure errors as N/A (not failures), retryable (#8) by @chaliy
  • feat(scorer): N/A score state + provider-backed LLM judge (#6) by @chaliy
  • feat(exec): bounded, provider-aware, adaptive matrix concurrency (#4) by @chaliy
  • feat: live progress bar and session-backed checkpoints for mira run (#3) by @chaliy
  • Productionize the Mira eval-framework PoC into a published workspace (#2) by @chaliy
  • chore(protocol): reset protocol version to the 1.0 baseline (#33) by @chaliy
  • chore(just): add install recipe (#17) by @chaliy
  • chore(ship): resolve addressed PR review comments (#13) by @chaliy
  • chore(skills): adopt ship skill and split public/internal skill layout (#9) by @chaliy
  • docs: finish Target/expected rename in docs and examples (follow-up to #34) (#35) by @chaliy
  • docs: add docs index + public-docs spec, reconcile drift (#19) by @chaliy
  • docs(contributing): document main branch-protection gate (#14) by @chaliy
  • docs(readme): reframe as evals toolkit with overview diagram (#12) by @chaliy
  • docs: surface agentic-trajectory eval as a headline strength (#7) by @chaliy
  • docs: extensibility guide + custom-subject example (#5) by @chaliy