v0.7 - Operational Simplicity & Pipeline Maturity
Highlights
- 25+ new benchmark tasks spanning document, video, math, spatial, AGI, audio, and safety domains
- Unified video decode: single `read_video` entry point with TorchCodec backend (up to 3.58x faster), DALI GPU decode, and LRU caching
- Lance-backed video distribution: MINERVA videos in a single Lance table on Hugging Face
- YAML config-driven evaluation: `--config` replaces fragile CLI one-liners with validated, reproducible YAML files
- Reasoning tag stripping: pipeline-level `<think>` block removal for reasoning models, configurable via `--reasoning_tags`
- Safety & red-teaming baselines: JailbreakBench with ASR, refusal rate, toxicity, and over-refusal metrics
- Token efficiency metrics: per-sample input/output/reasoning token counts and run-level throughput
- Agentic task evaluation: `generate_until_agentic` output type with iterative tool-call loops and deterministic simulators
- Async OpenAI `message_format`: replaces `is_qwen3_vl` flag with extensible format system
- Flattened JSONL logs: cleaner output format for `generate_until` responses
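
To illustrate the reasoning tag stripping highlight above, here is a minimal sketch of the general technique: removing delimited reasoning blocks from a model response before scoring. It is not the lmms-eval implementation. The function name `strip_reasoning` and its `tags` parameter are hypothetical, and how `--reasoning_tags` is actually parsed is not specified here.

```python
import re

# Hypothetical helper for illustration only; not part of the lmms-eval API.
def strip_reasoning(text, tags=(("<think>", "</think>"),)):
    """Remove every block delimited by the given (open, close) tag pairs."""
    for open_tag, close_tag in tags:
        # Non-greedy match so that separate blocks are removed independently;
        # DOTALL lets a block span multiple lines.
        pattern = re.escape(open_tag) + r".*?" + re.escape(close_tag)
        text = re.sub(pattern, "", text, flags=re.DOTALL)
    return text.strip()


response = "<think>\n2 + 2 is 4.\n</think>\n\nThe answer is 4."
print(strip_reasoning(response))
```

In this sketch an unclosed opening tag is left untouched, since the pattern requires both delimiters. A real pipeline might handle truncated reasoning differently.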
New Model Backends
- NanoVLM — lightweight local inference backend
- Async multi-GPU HF — parallel inference across GPUs with HuggingFace Transformers
Full Release Notes
See the complete v0.7 release notes for detailed documentation of every feature, the migration guide, and the architecture decisions.
Install
pip install lmms-eval==0.7
# or
uv pip install lmms-eval==0.7
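
After installing, one way to confirm which version ended up in the active environment is to query the package metadata with the standard library. This is a small sketch, assuming Python 3.8+ and the distribution name `lmms-eval` taken from the install command above. The helper `installed_version` is a hypothetical name used only for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# Hypothetical helper for illustration; only standard-library calls are used.
def installed_version(dist_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


found = installed_version("lmms-eval")
if found is None:
    print("lmms-eval is not installed in this environment")
else:
    print(f"lmms-eval {found} is installed")
```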