v0.6.1: Response Cache Fix & Cleanup
What's Changed
Bug Fixes
- Fix response cache returning identical results when temperature > 0 and repeat > 1. The legacy per-model JSONL cache (Layer 1) used only doc_id as the cache key without checking determinism. When running stochastic sampling with multiple repeats, all repeats silently returned the same cached response. This was a data corruption bug.
- Fix multi-GPU metric gather key ordering (#1089)
- Fix simple qwen3_vl inference when batch_size > 1 (#1090)
- Fix PyPI publish workflow: version auto-sync from git tag, version bump to 0.6.1
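The failure mode behind the first fix can be reproduced with a toy cache. This is a minimal sketch, not the lmms-eval code: LegacyCache, fake_sample, run_repeats, and the bypass_when_stochastic flag are hypothetical names invented for illustration. Only the idea that the legacy key was doc_id alone comes from the notes above.

```python
import random


class LegacyCache:
    """Toy stand-in for a cache keyed only by doc_id (hypothetical)."""

    def __init__(self):
        self._store = {}

    def get(self, doc_id):
        return self._store.get(doc_id)

    def put(self, doc_id, response):
        self._store[doc_id] = response


def fake_sample(doc_id, rng):
    """Pretend stochastic model call: a fresh random token on every call."""
    return f"doc{doc_id}-tok{rng.randrange(10**9)}"


def run_repeats(doc_id, repeats, temperature, cache, rng, bypass_when_stochastic):
    """Generate `repeats` responses for one document."""
    stochastic = temperature > 0
    # The fix amounts to skipping the cache entirely for stochastic requests.
    use_cache = not (bypass_when_stochastic and stochastic)
    outputs = []
    for _ in range(repeats):
        response = cache.get(doc_id) if use_cache else None
        if response is None:
            response = fake_sample(doc_id, rng)
            if use_cache:
                cache.put(doc_id, response)
        outputs.append(response)
    return outputs


rng = random.Random(0)
buggy = run_repeats(7, 4, 0.8, LegacyCache(), rng, bypass_when_stochastic=False)
fixed = run_repeats(7, 4, 0.8, LegacyCache(), rng, bypass_when_stochastic=True)
print(len(set(buggy)))  # 1: every repeat is the same cached string
print(len(set(fixed)))  # more than 1: each repeat is sampled independently
```

With the doc_id-only key, the first sampled response is written to the cache and every later repeat reads it back, so metrics that depend on sample diversity are computed over copies of one response.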
Features
- Response-level caching system (ResponseCache): new SQLite + JSONL write-ahead log architecture (--use_cache ./eval_cache). Determinism-aware: automatically bypasses the cache for temperature > 0, do_sample=True, or n > 1. Per-rank files for distributed safety. Crash recovery via JSONL replay.
- JSONL audit log records all responses: both deterministic and non-deterministic responses are logged to JSONL for real-time observability (tail -f rank0.jsonl). Each record includes a deterministic field. Only deterministic responses are stored in SQLite for cache reuse.
- SAM3 model + SA-Co/Gold benchmark (#1088)
- GitHub Actions PyPI publish workflow (#1087)
- Qwen3.5 runtime compatibility docs and examples (#1094)
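The caching behaviour described in the first two feature items can be sketched with the standard library alone. This is an illustration under stated assumptions, not the ResponseCache implementation: the class name TinyResponseCache, the string cache keys, the rank0.db file name, and the record layout are all hypothetical. Only the bypass conditions, the per-rank JSONL file, the deterministic field, and replay-based recovery are taken from the notes.

```python
import json
import os
import sqlite3
import tempfile


def is_deterministic(gen_kwargs):
    """Bypass rule from the notes: temperature > 0, do_sample=True, or n > 1."""
    return not (
        gen_kwargs.get("temperature", 0) > 0
        or gen_kwargs.get("do_sample", False)
        or gen_kwargs.get("n", 1) > 1
    )


class TinyResponseCache:
    """Illustrative only: one SQLite file plus one JSONL log per rank."""

    def __init__(self, cache_dir, rank=0):
        os.makedirs(cache_dir, exist_ok=True)
        self.jsonl_path = os.path.join(cache_dir, f"rank{rank}.jsonl")
        self.db = sqlite3.connect(os.path.join(cache_dir, f"rank{rank}.db"))
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS responses (key TEXT PRIMARY KEY, response TEXT)"
        )
        self._replay()

    def _store(self, key, response):
        self.db.execute(
            "INSERT OR REPLACE INTO responses VALUES (?, ?)", (key, response)
        )
        self.db.commit()

    def _replay(self):
        """Crash recovery: re-apply deterministic JSONL records to SQLite."""
        if not os.path.exists(self.jsonl_path):
            return
        with open(self.jsonl_path, encoding="utf-8") as f:
            for line in f:
                try:
                    rec = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip a torn final line left by a crash
                if rec["deterministic"]:
                    self._store(rec["key"], rec["response"])

    def get(self, key, gen_kwargs):
        if not is_deterministic(gen_kwargs):
            return None  # stochastic requests never read from the cache
        row = self.db.execute(
            "SELECT response FROM responses WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else None

    def put(self, key, gen_kwargs, response):
        det = is_deterministic(gen_kwargs)
        rec = {"key": key, "response": response, "deterministic": det}
        with open(self.jsonl_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(rec) + "\n")  # log first, then SQLite
        if det:
            self._store(key, response)

    def close(self):
        self.db.close()


cache_dir = tempfile.mkdtemp()
cache = TinyResponseCache(cache_dir)
cache.put("task/doc1", {"temperature": 0}, "greedy answer")
cache.put("task/doc1-s", {"temperature": 0.7}, "sampled answer")
hit = cache.get("task/doc1", {"temperature": 0})
miss = cache.get("task/doc1-s", {"temperature": 0.7})
cache.close()

# Simulate losing the SQLite file, then rebuilding it from the JSONL log.
os.remove(os.path.join(cache_dir, "rank0.db"))
recovered = TinyResponseCache(cache_dir)
replayed = recovered.get("task/doc1", {"temperature": 0})
recovered.close()
print(hit, miss, replayed)  # greedy answer None greedy answer
```

Writing the JSONL line before touching SQLite is what makes replay possible: the log always contains at least as much as the database, and the sampled response stays visible in the log without ever becoming a cache hit.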
Cleanup
- Remove dead code: CachingLMM, hash_args, and the SqliteDict import from api/model.py
- Remove buggy per-model JSONL cache: LMMS_EVAL_USE_CACHE env var, load_cache(), get_response_from_cache(), add_request_response_to_cache(), and calls in 4 models (vllm, vllm_generate, async_openai, longvila)
- Remove sqlitedict dependency from pyproject.toml
- Simplify CacheHook to a no-op stub (50+ models still reference self.cache_hook.add_partial(...))
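The reason for keeping CacheHook as a stub rather than deleting it can be shown with a short sketch. The constructor argument and the add_partial(attr, req, res) parameter names are assumptions, since the notes elide the arguments, and ToyModel is a hypothetical stand-in for the models that still call the hook.

```python
class CacheHook:
    """No-op stub: keeps old call sites working without caching anything."""

    def __init__(self, cachinglm=None):
        # The argument is accepted and ignored (assumed signature).
        pass

    def add_partial(self, attr, req, res):
        # Intentionally does nothing.
        return None


class ToyModel:
    """Hypothetical model that still calls the hook, as older models do."""

    def __init__(self):
        self.cache_hook = CacheHook(None)

    def generate_until(self, prompts):
        outputs = []
        for prompt in prompts:
            text = prompt.upper()  # stand-in for real inference
            self.cache_hook.add_partial("generate_until", prompt, text)
            outputs.append(text)
        return outputs


results = ToyModel().generate_until(["a", "b"])
print(results)  # ['A', 'B']
```

A stub like this lets the existing call sites keep running unchanged while the caching itself is handled elsewhere.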
Tests
- 34 cache tests covering: determinism detection, cache key collision prevention, hit/miss behavior, non-deterministic bypass with repeats, JSONL audit log observability, crash recovery via JSONL replay, multi-rank isolation and shard merging, model fingerprint isolation, stats accuracy across close/reopen, large batch sanity (1000 requests)
- Note: the loglikelihood end-to-end execute flow is not yet covered
Docs
- docs/caching.md rewritten for the new ResponseCache implementation
- Agent skill for lmms-eval (#1092)
Full Changelog: v0.6...v0.6.1