v0.6.1: Response Cache Fix & Cleanup

@Luodian Luodian released this 19 Feb 07:42
· 119 commits to main since this release
ee31b5a

What's Changed

Bug Fixes

  • Fix response cache returning identical results when temperature > 0 and repeat > 1: the legacy per-model JSONL cache (Layer 1) keyed entries on doc_id alone and never checked whether the request was deterministic. With stochastic sampling and multiple repeats, every repeat silently received the same cached response, which corrupted the evaluation data.
  • Fix multi-GPU metric gather key ordering (#1089)
  • Fix simple qwen3_vl inference when batch_size > 1 (#1090)
  • Fix PyPI publish workflow: version auto-sync from git tag, version bump to 0.6.1
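
The first fix above is easier to see with a toy reproduction. The sketch below is not the removed lmms-eval code; `sample_response` and `run_repeats` are hypothetical names, and the "model" is just a seeded random number generator standing in for sampling at temperature > 0. It only shows why a cache keyed on doc_id alone collapses repeated stochastic samples into one.

```python
import random


def sample_response(doc_id: int, rng: random.Random) -> str:
    """Hypothetical stand-in for a stochastic model call (temperature > 0)."""
    return f"doc{doc_id}-sample-{rng.random():.12f}"


def run_repeats(doc_id: int, repeats: int, use_doc_id_cache: bool, seed: int = 0) -> list:
    """Draw `repeats` samples for one document, optionally through a doc_id-keyed cache."""
    rng = random.Random(seed)
    cache = {}  # keyed on doc_id only, mirroring the buggy behavior described above
    outputs = []
    for _ in range(repeats):
        if use_doc_id_cache and doc_id in cache:
            # Every repeat after the first is served from the cache.
            outputs.append(cache[doc_id])
            continue
        response = sample_response(doc_id, rng)
        if use_doc_id_cache:
            cache[doc_id] = response
        outputs.append(response)
    return outputs


buggy = run_repeats(doc_id=7, repeats=3, use_doc_id_cache=True)
bypassed = run_repeats(doc_id=7, repeats=3, use_doc_id_cache=False)
print(len(set(buggy)), len(set(bypassed)))
```

With the cache enabled, all three repeats are the same string; with the cache bypassed, each repeat is an independent sample, which is what a repeat > 1 run is meant to measure.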

Features

  • Response-level caching system (ResponseCache): new SQLite + JSONL write-ahead log architecture, enabled with --use_cache ./eval_cache. Determinism-aware: the cache is bypassed automatically when temperature > 0, do_sample=True, or n > 1. Each rank writes its own files for distributed safety, and a crash is recovered by replaying the JSONL log.
  • JSONL audit log records all responses - Both deterministic and non-deterministic responses are logged to JSONL for real-time observability (tail -f rank0.jsonl). Each record includes a deterministic field. Only deterministic responses are stored in SQLite for cache reuse.
  • SAM3 model + SA-Co/Gold benchmark (#1088)
  • GitHub Actions PyPI publish workflow (#1087)
  • Qwen3.5 runtime compatibility docs and examples (#1094)
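
The determinism-aware behavior described for ResponseCache can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the actual implementation: the class name `SketchResponseCache`, the `is_deterministic` helper, the table schema, the record fields other than `deterministic`, and the `rank0.db` file name are all hypothetical. Only the rules (bypass for temperature > 0, do_sample=True, n > 1; log everything to JSONL; store only deterministic responses in SQLite) come from the notes above.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def is_deterministic(gen_kwargs: dict) -> bool:
    """Assumed rule set: any sampling-related setting makes a request non-deterministic."""
    if (gen_kwargs.get("temperature") or 0) > 0:
        return False
    if gen_kwargs.get("do_sample", False):
        return False
    if (gen_kwargs.get("n") or 1) > 1:
        return False
    return True


class SketchResponseCache:
    """Hypothetical illustration only; not the real ResponseCache class."""

    def __init__(self, cache_dir: Path, rank: int = 0):
        cache_dir.mkdir(parents=True, exist_ok=True)
        # One JSONL log and one SQLite file per rank (file names are assumptions).
        self.jsonl_path = cache_dir / f"rank{rank}.jsonl"
        self.db = sqlite3.connect(str(cache_dir / f"rank{rank}.db"))
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS responses (key TEXT PRIMARY KEY, response TEXT)"
        )

    def get(self, key: str, gen_kwargs: dict):
        if not is_deterministic(gen_kwargs):
            return None  # bypass: stochastic requests are never served from cache
        row = self.db.execute(
            "SELECT response FROM responses WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else None

    def put(self, key: str, gen_kwargs: dict, response: str) -> None:
        deterministic = is_deterministic(gen_kwargs)
        # Every response goes to the JSONL audit log first.
        record = {"key": key, "response": response, "deterministic": deterministic}
        with self.jsonl_path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
        # Only deterministic responses are stored for reuse.
        if deterministic:
            self.db.execute(
                "INSERT OR REPLACE INTO responses VALUES (?, ?)", (key, response)
            )
            self.db.commit()


cache = SketchResponseCache(Path(tempfile.mkdtemp()))
greedy = {"temperature": 0, "do_sample": False}
sampled = {"temperature": 0.7, "do_sample": True}
cache.put("doc-1", greedy, "A")
cache.put("doc-2", sampled, "B")
hit = cache.get("doc-1", greedy)
miss = cache.get("doc-2", sampled)
logged = cache.jsonl_path.read_text(encoding="utf-8").splitlines()
```

In this sketch the greedy request is a cache hit on the second lookup, the sampled request always misses, and both appear in the JSONL log, which matches the split between observability and reuse described above.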

Cleanup

  • Remove dead code: CachingLMM, hash_args, SqliteDict import from api/model.py
  • Remove buggy per-model JSONL cache: LMMS_EVAL_USE_CACHE env var, load_cache(), get_response_from_cache(), add_request_response_to_cache(), and calls in 4 models (vllm, vllm_generate, async_openai, longvila)
  • Remove sqlitedict dependency from pyproject.toml
  • Simplify CacheHook to a no-op stub (50+ models still reference self.cache_hook.add_partial(...))
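
A no-op stub of this kind might look like the sketch below. The real CacheHook signature is not shown in these notes, so the variadic `add_partial(*args, **kwargs)` and the `ToyModel` class are assumptions made purely for illustration; the point is that existing call sites keep working while nothing is recorded.

```python
class CacheHook:
    """Sketch of a no-op stub; the actual class in lmms-eval may differ."""

    def add_partial(self, *args, **kwargs) -> None:
        # Accept any arguments and deliberately do nothing.
        return None


class ToyModel:
    """Hypothetical model that still calls the hook, as many models do."""

    def __init__(self):
        self.cache_hook = CacheHook()

    def generate_until(self, prompts):
        outputs = []
        for prompt in prompts:
            response = prompt.upper()  # placeholder for real inference
            # The call site is unchanged; the stub simply ignores it.
            self.cache_hook.add_partial("generate_until", prompt, response)
            outputs.append(response)
        return outputs


results = ToyModel().generate_until(["a", "b"])
```

Keeping the stub avoids touching every model file in the same release; the call sites can then be removed gradually.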

Tests

  • 34 cache tests covering: determinism detection, cache key collision prevention, hit/miss behavior, non-deterministic bypass with repeats, JSONL audit log observability, crash recovery via JSONL replay, multi-rank isolation and shard merging, model fingerprint isolation, stats accuracy across close/reopen, large batch sanity (1000 requests)
  • Note: loglikelihood end-to-end execute flow is not yet covered
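
One of the scenarios listed above, crash recovery via JSONL replay, can be illustrated with a test-style sketch. The `replay_jsonl` function, the record fields `key` and `response`, and the table schema are hypothetical; only the `deterministic` field and the idea of rebuilding the SQLite store from the log come from these notes. The sketch assumes a crash can leave a partially written last line, which is skipped.

```python
import json
import sqlite3


def replay_jsonl(lines, db: sqlite3.Connection) -> int:
    """Rebuild the response table from log lines; return how many rows were restored."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS responses (key TEXT PRIMARY KEY, response TEXT)"
    )
    restored = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # a crash mid-write can leave a truncated line
        if not isinstance(record, dict) or not record.get("deterministic"):
            continue  # non-deterministic responses are logged but never reused
        db.execute(
            "INSERT OR REPLACE INTO responses VALUES (?, ?)",
            (record["key"], record["response"]),
        )
        restored += 1
    db.commit()
    return restored


log_lines = [
    '{"key": "doc-1", "response": "A", "deterministic": true}',
    '{"key": "doc-2", "response": "B", "deterministic": false}',
    '{"key": "doc-3", "respon',  # simulated partial write at crash time
]
db = sqlite3.connect(":memory:")
restored = replay_jsonl(log_lines, db)
rows = db.execute("SELECT key, response FROM responses ORDER BY key").fetchall()
```

Here only the complete, deterministic record survives the replay; the sampled response and the truncated line are both ignored.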

Docs

  • docs/caching.md rewritten for the new ResponseCache implementation
  • Agent skill for lmms-eval (#1092)

Full Changelog: v0.6...v0.6.1