Compress transformer KV-caches at decode time on Apple Silicon — with a narrow, contract-validated runtime path.
Quick Start · How It Works · Presets · Model Support · Validation · Docs
TurboQuant is a research-stage KV-cache compression library that plugs into mlx-lm inference on Apple Silicon. It patches the upstream decode loop at import time and routes allowlisted model families through a compressed attention path — no per-model fork required.
Scope note: The machine-readable source of truth is `turboquant/contract.json`. This repository does not prove a current Apple runtime PASS without a published certification artifact or pinned manifest digest from the tagged `apple-runtime-cert` workflow.
Apple Silicon is required for runtime inference. All other platforms support static checks, linting, and contract validation only.
git clone https://github.com/dawsonblock/TURB0.git
cd TURB0
# Apple Silicon — full runtime
pip install -e '.[apple]'
# Any platform — static / dev work only
pip install -e '.[dev]'

from mlx_lm import load
from mlx_lm.models.cache import make_prompt_cache
from turboquant.config import TurboQuantConfig
from turboquant.integrations.mlx.upgrade import upgrade_cache_list
model, tokenizer = load("mlx-community/Llama-3.2-1B-Instruct-4bit")
cache = make_prompt_cache(model)
# ... run your prefill here ...
cfg = TurboQuantConfig.from_preset("paper_prod")
events = upgrade_cache_list(
    cache,
    k_start=64,
    config=cfg,
    model_family="llama",
)

The same path works for PolarQuant:
cfg = TurboQuantConfig.polarquant_exp(rotation="random_orthogonal")
events = upgrade_cache_list(cache, k_start=64, config=cfg, model_family="gemma")

The higher-level `mlx_lm.generate.generate(...)` wrapper delegates into the same machinery automatically once the patch layer is active.
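
Why a wrapper can pick up a patched function without being edited itself is easiest to see in miniature. The sketch below is not TurboQuant's `patch.py`; it uses a stand-in namespace (`upstream`) and hypothetical names (`apply_patch`, `_dense_step`) to show the general import-time monkeypatch pattern: the caller resolves the attribute at call time, so replacing the attribute once reroutes every later call.

```python
import types

# Stand-in for an upstream module (hypothetical; this is not mlx_lm).
upstream = types.SimpleNamespace()


def _dense_step(token):
    return f"dense:{token}"


def _generate(tokens):
    # Resolves `step` on the namespace at call time, like a module-global lookup.
    return [upstream.step(t) for t in tokens]


upstream.step = _dense_step
upstream.generate = _generate


def apply_patch(ns):
    """Wrap ns.step exactly once; repeated calls are no-ops."""
    if getattr(ns.step, "_patched", False):
        return
    original = ns.step

    def patched_step(token):
        # Delegate to the original, then reroute its result.
        return original(token).replace("dense", "compressed")

    patched_step._patched = True
    ns.step = patched_step


before = upstream.generate(["a", "b"])
apply_patch(upstream)
apply_patch(upstream)  # second call does nothing
after = upstream.generate(["a", "b"])
```

The idempotence guard matters for import-time patching, because a module can be imported from several places and the patch should not stack.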
TurboQuant patches three upstream symbols at import time — no vendored fork:
mlx_lm.models.cache.make_prompt_cache (patched)
mlx_lm.generate.generate_step (patched)
│
▼
upgrade_cache_list(...) ← canonical support-gated entry point
│
▼
TurboQuantKCache.update_and_fetch(...)
│
▼
TurboQuantKeysView
│
▼
mlx_lm.models.base.scaled_dot_product_attention (patched)
│
▼
turboquant_streaming_attention(...)
The attention fast path scores flat K-history slices from runtime-packed tensors and decodes V in chunks with an online softmax (log-sum-exp streaming reduction), avoiding a full dense V concatenation at every decode step.
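
The streaming reduction described above can be illustrated with a small pure-Python sketch. It is an assumption-laden illustration, not the body of `turboquant_streaming_attention(...)`: values are plain floats rather than packed tensors, there is a single query, and the function names are hypothetical. It shows that visiting V one chunk at a time with a running maximum and running normaliser gives the same output as the dense softmax-weighted sum.

```python
import math


def dense_attention(scores, values):
    """Reference: softmax(scores) applied to the full value history."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def streaming_attention(scores, values, chunk_size):
    """Same result, but values are consumed one chunk at a time."""
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(scores), chunk_size):
        s_chunk = scores[start:start + chunk_size]
        v_chunk = values[start:start + chunk_size]  # a real path would decode V here
        new_max = max(running_max, max(s_chunk))
        # Rescale what was accumulated under the old maximum (0.0 on the first chunk).
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(s_chunk, v_chunk):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]


scores = [0.5, -1.0, 2.0, 0.1, 3.0]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]
reference = dense_attention(scores, values)
chunked = streaming_attention(scores, values, chunk_size=2)
```

Only one chunk of V needs to be materialised at a time, which is the point of avoiding the dense concatenation.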
Contract summary:
- `upgrade_cache_list(...)` is the canonical, support-gated entry point.
- `TurboQuantKCache(...)` is internal/eval-only; it bypasses the model-family allowlist.
- `KVCache.to_turboquant()` is mentioned here as documentation shorthand for the cache-adapter upgrade path; it is not currently a shipped/runtime-available `KVCache` method. Use `upgrade_cache_list(...)` for supported upgrades.
- The decode path returns `events` but does not automatically persist `events.jsonl`.
- Cache state is persisted as `TurboQuantKVCache.state()` at `schema_version == 4`.
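
Because the decode path hands back `events` without writing `events.jsonl`, a caller that wants a log has to persist them. The following is a minimal sketch under a loud assumption: it treats each event as a JSON-serialisable dict, which this README does not state. The helper name `persist_events` and the example payload fields are hypothetical; if the real events are objects, they would need converting to dicts first.

```python
import json
import os
import tempfile


def persist_events(events, path):
    """Append one JSON object per line and return the number of lines written."""
    with open(path, "a", encoding="utf-8") as fh:
        for event in events:
            fh.write(json.dumps(event, sort_keys=True) + "\n")
    return len(events)


# Hypothetical payloads; the real event fields are not documented here.
example_events = [
    {"layer": 0, "upgraded": True},
    {"layer": 1, "upgraded": False},
]

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "events.jsonl")
    written = persist_events(example_events, out_path)
    with open(out_path, encoding="utf-8") as fh:
        round_trip = [json.loads(line) for line in fh]
```

Appending rather than overwriting keeps earlier upgrade events when the helper is called more than once per run.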
| Preset | Classification | K bits | V bits | Residual | Use when |
|---|---|---|---|---|---|
| `paper_prod` / `paper_prod_qjl` | paper-facing | 3 | 4 | QJL (1-bit) | Primary two-stage research path |
| `paper_mse` | paper-facing | 3 | 4 | none | Conservative scalar-only reference |
| `polarquant_exp` | supported, non-paper-facing | 3 | 4 | none | PolarQuant with family-scoped certification |
| `legacy_topk`, `balanced`, `max_quality` | compatibility-only | — | — | top-k | Loading historical configs only |
- `paper_prod` is a stable alias for `paper_prod_qjl`.
- `high_compression` is a legacy alias for the `paper_prod_qjl` family.
- `polarquant_exp` is a formally supported contract surface but is outside the paper-facing preset story.
- Generated preset math lives in docs/support_matrix.md.
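
The alias and classification rules above can be restated as a small lookup. This is a sketch that mirrors the table and notes in this section only; it is not the preset registry in `turboquant/config.py`, and the names `ALIASES`, `CLASSIFICATION`, and `classify` are hypothetical. Mapping `high_compression` directly to `paper_prod_qjl` is a simplifying assumption about what "the `paper_prod_qjl` family" means.

```python
# Alias -> canonical preset, as described in the notes above.
ALIASES = {
    "paper_prod": "paper_prod_qjl",
    "high_compression": "paper_prod_qjl",  # assumption: legacy alias resolves here
}

# Canonical preset -> classification, copied from the preset table.
CLASSIFICATION = {
    "paper_prod_qjl": "paper-facing",
    "paper_mse": "paper-facing",
    "polarquant_exp": "supported, non-paper-facing",
    "legacy_topk": "compatibility-only",
    "balanced": "compatibility-only",
    "max_quality": "compatibility-only",
}


def classify(name):
    """Return (canonical_name, classification) or raise KeyError if unknown."""
    canonical = ALIASES.get(name, name)
    if canonical not in CLASSIFICATION:
        raise KeyError(f"unknown preset: {name}")
    return canonical, CLASSIFICATION[canonical]


resolved = classify("paper_prod")
legacy = classify("balanced")
```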
| Family | Status | Evidence depth | Coverage |
|---|---|---|---|
| Llama | ✅ allowlisted | stronger | real-model smoke · PolarQuant runtime & quality · paper_mse batch guardrail · long-context stability · dense-vs-TQ benchmark sweeps |
| Gemma | ✅ allowlisted | narrower | real-model smoke · PolarQuant runtime & quality · dense-vs-TQ benchmark sweeps |
Families reachable through the patch layer are not automatically supported. Allowlist membership is a contract decision, not a side effect of patch reachability. See `turboquant/runtime/support.py` and `turboquant/contract.json`.
Gemma coverage is intentionally narrower: the conservative paper_mse batch quality guardrail is still Llama-scoped.
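
The distinction between "reachable" and "allowlisted" can be sketched as an explicit gate that consults contract data before any upgrade is attempted. Everything named here is an assumption for illustration: the real schema of `turboquant/contract.json` is not shown in this README, so the `allowlisted_families` key, `UnsupportedFamilyError`, and `require_supported` are hypothetical, and `"mistral"` is just an arbitrary name that is not in the table above.

```python
import json

# Hypothetical contract fragment; the real contract.json layout may differ.
CONTRACT_TEXT = '{"allowlisted_families": ["llama", "gemma"]}'


class UnsupportedFamilyError(ValueError):
    """Raised when a family is not on the contract allowlist."""


def require_supported(model_family, contract):
    """Return the normalised family name, or raise if it is not allowlisted."""
    allowlist = set(contract["allowlisted_families"])
    family = model_family.lower()
    if family not in allowlist:
        raise UnsupportedFamilyError(
            f"{model_family!r} is not allowlisted; supported: {sorted(allowlist)}"
        )
    return family


contract = json.loads(CONTRACT_TEXT)
ok_family = require_supported("Llama", contract)

try:
    require_supported("mistral", contract)
    rejected = False
except UnsupportedFamilyError:
    rejected = True
```

Failing loudly at the gate, rather than silently falling back to the dense path, keeps support claims tied to the contract instead of to whatever the patch layer happens to reach.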
make test-static # static unit test suite, no MLX required
make compile # bytecode check across all source + test modules

make test-structural # path-proof, cache roundtrip, streaming attention, no model weights
make test-path-proof # verify TQ path is exercised, not dense fallback
make test-smoke-llama # Llama smoke (TinyModel by default)
make test-smoke-gemma # Gemma smoke (TinyModel by default)
make test-long-context # long-context stability (TinyModel by default)
make test-mlx # full MLX suite

export TQ_TEST_LLAMA_MODEL="mlx-community/Llama-3.2-1B-Instruct-4bit"
export TQ_TEST_GEMMA_MODEL="mlx-community/gemma-2-2b-it-4bit"
make test-smoke-llama
make test-smoke-gemma
make test-long-context

export TQ_TEST_LLAMA_MODEL="mlx-community/Llama-3.2-1B-Instruct-4bit"
export TQ_TEST_GEMMA_MODEL="mlx-community/gemma-2-2b-it-4bit"
bash scripts/certify_apple_runtime.sh

Evidence rule: `artifacts/runtime-cert/` bundles are workflow outputs; built wheels and source distributions do not ship that directory. Static CI passing on Linux is not a runtime go/no-go: source and built snapshots do not, by themselves, prove a current Apple runtime PASS. Only a published certification artifact or pinned manifest digest from a tagged `apple-runtime-cert` workflow run proves a current PASS for both allowlisted families.
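
Checking a downloaded manifest against a pinned digest might look like the sketch below. It assumes SHA-256 and a `sha256:<hex>` pin format, neither of which is stated in this README, and the helper names `manifest_digest` and `matches_pin` are hypothetical. The sample manifest bytes are invented purely so the example runs.

```python
import hashlib
import hmac


def manifest_digest(manifest_bytes):
    """Digest of the raw manifest bytes (SHA-256 is an assumption)."""
    return "sha256:" + hashlib.sha256(manifest_bytes).hexdigest()


def matches_pin(manifest_bytes, pinned):
    """True only if the manifest hashes to exactly the pinned value."""
    return hmac.compare_digest(manifest_digest(manifest_bytes), pinned)


# Invented sample content; a real manifest comes from the certification workflow.
sample_manifest = b'{"families": ["llama", "gemma"], "result": "PASS"}'
pinned = manifest_digest(sample_manifest)

untouched_ok = matches_pin(sample_manifest, pinned)
tampered_ok = matches_pin(sample_manifest + b" ", pinned)
```

Hashing the raw bytes, not a re-serialised form, means any change to the manifest, including whitespace, invalidates the pin.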
TURB0/
├── turboquant/
│ ├── config.py # TurboQuantConfig — the runtime config API
│ ├── contract.json # machine-readable support contract
│ ├── patch.py # upstream mlx_lm patch bootstrap
│ ├── core/ # rotation, quantizer, QJL, PolarQuant
│ ├── runtime/ # attention fast path, support gate
│ ├── integrations/mlx/ # upgrade_cache_list, cache adapter
│ ├── eval/ # logit comparison helpers
│ └── kernels/ # experimental Metal kernel stubs
├── benchmarks/
│ ├── exploratory/ # micro-benchmarks and ablations
│ └── runtime_cert/ # certification benchmark scripts
├── tests/
│ ├── unit_static/ # contract / structural tests (no MLX)
│ ├── unit_mlx/ # unit tests requiring MLX
│ └── integration_mlx/ # full-path integration tests
├── scripts/ # certify_apple_runtime.sh, validate_local.sh
├── tools/ # dist verification, surface audit
└── docs/ # generated and hand-written documentation
| Document | Purpose |
|---|---|
| docs/architecture.md | Runtime path and component map |
| docs/theory.md | Paper-claim traceability and current evidence limits |
| docs/product_contract.md | Generated top-level product boundary |
| docs/support_matrix.md | Generated family and preset matrix |
| docs/supported-surface.md | Generated canonical vs secondary surface definitions |
| docs/preset_modes.md | Generated preset taxonomy |
| docs/runtime-certification.md | Certification scope, stages, and evidence contract |
| docs/validation-local.md | Local validation walkthrough |
| docs/benchmark_methodology.md | Benchmark publication and provenance rules |
| docs/benchmark_index.md | Generated index of benchmark surfaces and lane boundaries |
| docs/family_evidence_matrix.md | Release-gated vs research-only evidence split |
| docs/integration.md | Model-family wiring and PolarQuant integration details |
| docs/evaluation.md | Exploratory quality-evaluation guidance |
| docs/bit_budget_sweep.md | Research-only bit-budget sweep |
| docs/kv_paper_eval.md | Unified KV report command (fast-check vs heavy-offline tiers) |
| docs/vector_search.md | Research-only vector-search benchmark lane |
| docs/vendored-upstream-boundary.md | Upstream mlx_lm patch boundary |
- Run `make test-static` on any platform before opening a PR.
- On Apple Silicon, run `make test-mlx` and `make test-structural` before widening any runtime claims.
- If you change runtime-contract or evidence wording, update `turboquant/contract.json` and regenerate the derived docs.
- If you change the preset registry or classifications, update `turboquant/config.py` and `turboquant/contract.json` together, then regenerate the derived docs.
- If you add a model family or preset to the supported story, extend the certification surface before updating the README.