Release v0.7 - Operational Simplicity & Pipeline Maturity · EvolvingLMMs-Lab/lmms-eval

v0.7 — Operational Simplicity & Pipeline Maturity

25+ new benchmark tasks spanning document, video, math, spatial, AGI, audio, and safety domains
Unified video decode — single read_video entry point with TorchCodec backend (up to 3.58x faster), DALI GPU decode, and LRU caching
Lance-backed video distribution — MINERVA videos in a single Lance table on Hugging Face
YAML config-driven evaluation — --config replaces fragile CLI one-liners with validated, reproducible YAML files
Reasoning tag stripping — pipeline-level <think> block removal for reasoning models, configurable via --reasoning_tags
Safety & red-teaming baselines — JailbreakBench with ASR, refusal rate, toxicity, and over-refusal metrics
Token efficiency metrics — per-sample input/output/reasoning token counts and run-level throughput
Agentic task evaluation — generate_until_agentic output type with iterative tool-call loops and deterministic simulators
Async OpenAI message_format — replaces is_qwen3_vl flag with extensible format system
Flattened JSONL logs — cleaner output format for generate_until responses

NanoVLM — lightweight local inference backend
Async multi-GPU HF — parallel inference across GPUs with HuggingFace Transformers

See the complete v0.7 release notes for detailed documentation of every feature, migration guide, and architecture decisions.

pip install lmms-eval==0.7
# or
uv pip install lmms-eval==0.7