Release v0.6.1: Response Cache Fix & Cleanup · EvolvingLMMs-Lab/lmms-eval

What's Changed

Fix response cache returning identical results when temperature > 0 and repeat > 1 - The legacy per-model JSONL cache (Layer 1) used only doc_id as the cache key and never checked whether the request was deterministic. With stochastic sampling and multiple repeats, every repeat silently returned the same cached response. This was a data corruption bug.
Fix multi-GPU metric gather key ordering (#1089)
Fix simple qwen3_vl inference when batch_size > 1 (#1090)
Fix PyPI publish workflow: version auto-sync from git tag, version bump to 0.6.1
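
The response cache bug described above can be reproduced in miniature. The sketch below is not the lmms-eval code: `make_sampler`, `run_doc_id_only_cache`, and `run_determinism_aware` are hypothetical names, and the counter-based sampler merely stands in for stochastic decoding. It shows why a cache keyed only on doc_id collapses all repeats into one response, and how skipping the cache for temperature > 0 avoids that.

```python
import itertools


def make_sampler():
    """Stand-in for a stochastic model: every call yields a new sample."""
    counter = itertools.count()

    def sample(prompt):
        return f"{prompt}: sample {next(counter)}"

    return sample


def run_doc_id_only_cache(doc_ids, repeats, sample):
    """Buggy pattern: the cache key is doc_id alone, so repeats hit the cache."""
    cache = {}
    outputs = []
    for doc_id in doc_ids:
        for _ in range(repeats):
            if doc_id not in cache:
                cache[doc_id] = sample(f"doc {doc_id}")
            outputs.append(cache[doc_id])
    return outputs


def run_determinism_aware(doc_ids, repeats, sample, temperature):
    """Fixed pattern: only read and write the cache for deterministic runs."""
    deterministic = temperature == 0
    cache = {}
    outputs = []
    for doc_id in doc_ids:
        for _ in range(repeats):
            if deterministic and doc_id in cache:
                outputs.append(cache[doc_id])
                continue
            response = sample(f"doc {doc_id}")
            if deterministic:
                cache[doc_id] = response
            outputs.append(response)
    return outputs


if __name__ == "__main__":
    print(run_doc_id_only_cache([0], 3, make_sampler()))
    print(run_determinism_aware([0], 3, make_sampler(), temperature=0.7))
```

With three repeats, the first function returns the same string three times, while the second returns three distinct samples when temperature is 0.7.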

Response-level caching system (ResponseCache) - New SQLite + JSONL write-ahead log architecture, enabled with --use_cache ./eval_cache. Determinism-aware: the cache is bypassed automatically when temperature > 0, do_sample=True, or n > 1. Per-rank files for distributed safety. Crash recovery via JSONL replay.
JSONL audit log records all responses - Both deterministic and non-deterministic responses are logged to JSONL for real-time observability (tail -f rank0.jsonl). Each record includes a deterministic field. Only deterministic responses are stored in SQLite for cache reuse.
SAM3 model + SA-Co/Gold benchmark (#1088)
GitHub Actions PyPI publish workflow (#1087)
Qwen3.5 runtime compatibility docs and examples (#1094)
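
As a rough illustration of the determinism-aware behaviour described for ResponseCache, the sketch below shows one way such a check and a collision-resistant cache key could look. It is not the library's implementation: `is_deterministic` and `cache_key` are hypothetical helpers, the treatment of a missing temperature as 0 is an assumption, and the fields hashed into the key (model fingerprint, task name, doc_id, generation kwargs) are guesses at what a key would need to cover.

```python
import hashlib
import json


def is_deterministic(gen_kwargs: dict) -> bool:
    """Return False for any setting the notes list as a cache bypass.

    Assumption: a missing temperature means greedy decoding (0).
    """
    if gen_kwargs.get("temperature", 0) > 0:
        return False
    if gen_kwargs.get("do_sample", False):
        return False
    if gen_kwargs.get("n", 1) > 1:
        return False
    return True


def cache_key(model_fingerprint: str, task: str, doc_id: int, gen_kwargs: dict) -> str:
    """Hash everything that can change the response, not just doc_id."""
    payload = json.dumps(
        {
            "model": model_fingerprint,
            "task": task,
            "doc_id": doc_id,
            "gen_kwargs": gen_kwargs,
        },
        sort_keys=True,  # stable key regardless of dict insertion order
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    kwargs = {"temperature": 0, "max_new_tokens": 16}
    print(is_deterministic(kwargs))
    print(cache_key("model-a", "demo_task", 7, kwargs))
```

Sorting keys before hashing keeps the key stable across equivalent kwargs dictionaries, and including a model fingerprint keeps responses from different models from sharing an entry.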

Remove dead code: CachingLMM, hash_args, SqliteDict import from api/model.py
Remove buggy per-model JSONL cache: LMMS_EVAL_USE_CACHE env var, load_cache(), get_response_from_cache(), add_request_response_to_cache(), and calls in 4 models (vllm, vllm_generate, async_openai, longvila)
Remove sqlitedict dependency from pyproject.toml
Simplify CacheHook to a no-op stub (50+ models still reference self.cache_hook.add_partial(...))
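
A no-op stub of this kind might look like the sketch below. It is an illustration only, not the class shipped in the release: the permissive `*args, **kwargs` signatures are an assumption chosen so that any existing call shape keeps working.

```python
class CacheHook:
    """No-op stub: accepts the old calls and does nothing with them."""

    def __init__(self, *args, **kwargs):
        # Assumption: constructor arguments from older call sites are ignored.
        pass

    def add_partial(self, *args, **kwargs) -> None:
        # Kept so model code calling self.cache_hook.add_partial(...) does not break.
        return None


if __name__ == "__main__":
    hook = CacheHook()
    hook.add_partial("generate_until", ("context", {}), "response")
```

Keeping the method in place avoids touching every model file while the caching itself is handled elsewhere.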

34 cache tests covering: determinism detection, cache key collision prevention, hit/miss behavior, non-deterministic bypass with repeats, JSONL audit log observability, crash recovery via JSONL replay, multi-rank isolation and shard merging, model fingerprint isolation, stats accuracy across close/reopen, large batch sanity (1000 requests)
Note: the end-to-end loglikelihood execution flow is not yet covered
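
To make the "crash recovery via JSONL replay" scenario concrete, here is a minimal sketch of the idea using only the standard library. The record fields other than `deterministic`, the table layout, and the names `append_record`, `replay_log`, and `demo` are all hypothetical; the real ResponseCache may differ in every detail. The sketch assumes that a crash can leave a truncated final line in the log and that only deterministic records are restored into SQLite.

```python
import json
import os
import sqlite3
import tempfile


def append_record(log_path, key, response, deterministic):
    """Append one response to the JSONL log before touching SQLite."""
    record = {"key": key, "response": response, "deterministic": deterministic}
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def replay_log(log_path, conn):
    """Rebuild the SQLite table from the log; return the number of rows restored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS responses (key TEXT PRIMARY KEY, response TEXT)"
    )
    restored = 0
    if not os.path.exists(log_path):
        return restored
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip a line torn by a crash mid-write
            if not record.get("deterministic"):
                continue  # logged for observability only, never reused
            conn.execute(
                "INSERT OR REPLACE INTO responses (key, response) VALUES (?, ?)",
                (record["key"], record["response"]),
            )
            restored += 1
    conn.commit()
    return restored


def demo():
    """Write a small log with a torn last line, then replay it."""
    with tempfile.TemporaryDirectory() as tmp:
        log_path = os.path.join(tmp, "rank0.jsonl")
        append_record(log_path, "k1", "a cat", True)
        append_record(log_path, "k2", "a dog", False)
        with open(log_path, "a", encoding="utf-8") as f:
            f.write('{"key": "k3", "resp')  # simulated crash mid-write
        conn = sqlite3.connect(":memory:")
        restored = replay_log(log_path, conn)
        rows = conn.execute(
            "SELECT key, response FROM responses ORDER BY key"
        ).fetchall()
        conn.close()
        return restored, rows


if __name__ == "__main__":
    print(demo())
```

In this sketch the non-deterministic record and the torn line are both skipped, so only the one deterministic response ends up in the table.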

Full Changelog: v0.6...v0.6.1