v0.7 - Operational Simplicity & Pipeline Maturity

@Luodian Luodian released this 28 Feb 17:02
· 97 commits to main since this release
ecfb599

Highlights

  • 25+ new benchmark tasks spanning document, video, math, spatial, AGI, audio, and safety domains
  • Unified video decode — single read_video entry point with TorchCodec backend (up to 3.58x faster), DALI GPU decode, and LRU caching
  • Lance-backed video distribution — MINERVA videos in a single Lance table on Hugging Face
  • YAML config-driven evaluation: --config replaces fragile CLI one-liners with validated, reproducible YAML files
  • Reasoning tag stripping — pipeline-level <think> block removal for reasoning models, configurable via --reasoning_tags
  • Safety & red-teaming baselines — JailbreakBench with ASR, refusal rate, toxicity, and over-refusal metrics
  • Token efficiency metrics — per-sample input/output/reasoning token counts and run-level throughput
  • Agentic task evaluation: generate_until_agentic output type with iterative tool-call loops and deterministic simulators
  • Async OpenAI message_format — replaces is_qwen3_vl flag with extensible format system
  • Flattened JSONL logs — cleaner output format for generate_until responses
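
To illustrate what the reasoning tag stripping highlight describes, here is a minimal sketch of removing <think> blocks from a model response before scoring. It is not the lmms-eval implementation: the function name strip_reasoning and its tags parameter are hypothetical, and the real pipeline may handle unclosed or nested tags differently.

```python
import re


def strip_reasoning(text: str, tags=(("<think>", "</think>"),)) -> str:
    """Remove every block delimited by the given (open, close) tag pairs.

    Hypothetical helper for illustration; not part of lmms-eval.
    """
    for open_tag, close_tag in tags:
        # Non-greedy match so that separate blocks are removed independently;
        # DOTALL lets "." span the newlines inside a reasoning block.
        pattern = re.escape(open_tag) + r".*?" + re.escape(close_tag)
        text = re.sub(pattern, "", text, flags=re.DOTALL)
    return text.strip()


raw = "<think>2 + 2 is 4, so the option is B.</think>\nAnswer: B"
print(strip_reasoning(raw))  # prints "Answer: B"
```

Passing the tag pairs as a parameter mirrors the idea of a configurable --reasoning_tags option: a model that wraps its reasoning in different delimiters only needs a different pair, not a different code path.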

New Model Backends

  • NanoVLM — lightweight local inference backend
  • Async multi-GPU HF — parallel inference across GPUs with HuggingFace Transformers

Full Release Notes

See the complete v0.7 release notes for detailed documentation of every feature, the migration guide, and architecture decisions.

Install

pip install lmms-eval==0.7
# or
uv pip install lmms-eval==0.7