You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This commit was created on GitHub.com and signed with GitHub’s verified signature.
Added
Self-describing results. A RunResult (and the persisted cases/<key>/result.json) now carries the sample's input (the prompt turns
sent) and expected (the reference value, when the dataset provides one), so a
saved result can be read back without the original dataset. Both are optional
on the wire — input omitted when empty, expected when absent.
Docs diagrams. Five new committed SVGs visualise the model, each in its
topical guide: the end-to-end workflow (mira-workflow.svg — author →
plan → execute → score → report) in getting-started.md,
the entity hierarchy (mira-entities.svg — study ▸ eval ▸
dataset/subject/scorers/targets/axes, expanded into cases · trials ·
transcripts · scores) in authoring.md, the host ⇄
study run lifecycle (mira-run-lifecycle.svg — the protocol sequence for one
run) in how-it-works.md, the subject fan-in
(mira-subjects.svg — the three subject shapes normalising into one Transcript) in subjects.md, and the scoring flow
(mira-scoring.svg — transcript surfaces → scorers → case verdict) in scorers.md. Indexed in docs/README.md.
JSONL and CSV report formats (--format jsonl / --format csv) for
un-aggregated, analysis-ready exports. jsonl writes one RunResult per line
(lossless — the line-delimited dual of json); csv is long-format, one row
per (case × score) with the case columns repeated and open-vocabulary metrics/metadata flattened into stable metric.*/meta.* columns. Both
work anywhere --out/--format do (run, report, score); a --group-by
view is intentionally not folded in — the consumer aggregates the rows.
Per-case wall-clock timeout: give up on a case after a budget of seconds,
cancelling the in-flight run (best-effort cancel over the protocol) and
recording it as a failed case. Set it on the CLI (mira run --timeout SECONDS,
all targets), per target in mira.toml ([targets.LABEL].timeout), or as a
preset default ([presets.NAME].timeout). Precedence, first set wins: --timeout > per-target > preset; unset ⇒ no limit. A timeout is non-retryable
(retrying would burn the same budget) and counts as a target failure.
Glob case selection.--targets, --samples (new), and --evals (new)
match the target label / sample id / eval name by glob (*, ?, [set], {a,b}); a literal value stays an exact match. --axis values are globbed
too. A small dep-free matcher (mira::glob_match) backs both the host and the
in-process Runner (Runner::samples(…), glob-aware Runner::targets(…)).
Changed
BREAKING (preset): the preset filter key is replaced by per-dimension samples (glob on sample id). targets/samples/evals in [presets.NAME]
now glob-match and accept either a single string or a list. The cross-cutting
case-key substring stays available as the positional mira run [filter].