TurboQuant

Research-grade KV-cache compression for Apple Silicon MLX LLMs

Compress transformer KV-caches at decode time on Apple Silicon — with a narrow, contract-validated runtime path.

Quick Start · How It Works · Presets · Model Support · Validation · Docs

TurboQuant is a research-stage KV-cache compression library that plugs into mlx-lm inference on Apple Silicon. It patches the upstream decode loop at import time and routes allowlisted model families through a compressed attention path — no per-model fork required.

Scope note — The machine-readable source of truth is turboquant/contract.json. This repository does not prove a current Apple runtime PASS without a published certification artifact or pinned manifest digest from the tagged apple-runtime-cert workflow.

Quick Start

Apple Silicon is required for runtime inference. All other platforms support static checks, linting, and contract validation only.

git clone https://github.com/dawsonblock/TURB0.git
cd TURB0

# Apple Silicon — full runtime
pip install -e '.[apple]'

# Any platform — static / dev work only
pip install -e '.[dev]'

Upgrade a cache after prefill

from mlx_lm import load
from mlx_lm.models.cache import make_prompt_cache
from turboquant.config import TurboQuantConfig
from turboquant.integrations.mlx.upgrade import upgrade_cache_list

model, tokenizer = load("mlx-community/Llama-3.2-1B-Instruct-4bit")
cache = make_prompt_cache(model)

# ... run your prefill here ...

cfg = TurboQuantConfig.from_preset("paper_prod")
events = upgrade_cache_list(
    cache,
    k_start=64,
    config=cfg,
    model_family="llama",
)

The same path works for PolarQuant:

cfg = TurboQuantConfig.polarquant_exp(rotation="random_orthogonal")
events = upgrade_cache_list(cache, k_start=64, config=cfg, model_family="gemma")

The higher-level mlx_lm.generate.generate(...) wrapper delegates into the same machinery automatically once the patch layer is active.

How It Works

TurboQuant patches three upstream symbols at import time — no vendored fork:

mlx_lm.models.cache.make_prompt_cache   (patched)
mlx_lm.generate.generate_step           (patched)
         │
         ▼
upgrade_cache_list(...)          ← canonical support-gated entry point
         │
         ▼
TurboQuantKCache.update_and_fetch(...)
         │
         ▼
TurboQuantKeysView
         │
         ▼
mlx_lm.models.base.scaled_dot_product_attention   (patched)
         │
         ▼
turboquant_streaming_attention(...)

The attention fast path scores flat K-history slices from runtime-packed tensors and decodes V in chunks with an online softmax (log-sum-exp streaming reduction), avoiding a full dense V concatenation at every decode step.

Contract summary:

upgrade_cache_list(...) is the canonical, support-gated entry point.
TurboQuantKCache(...) is internal/eval-only — it bypasses the model-family allowlist.
KVCache.to_turboquant() is mentioned here as documentation shorthand for the cache-adapter upgrade path; it is not currently a shipped/runtime-available KVCache method. Use upgrade_cache_list(...) for supported upgrades.
The decode path returns events but does not automatically persist events.jsonl.
Cache state is persisted as TurboQuantKVCache.state() at schema_version == 4.

Presets

Preset	Classification	K bits	V bits	Residual	Use when
`paper_prod` / `paper_prod_qjl`	paper-facing	3	4	QJL (1-bit)	Primary two-stage research path
`paper_mse`	paper-facing	3	4	none	Conservative scalar-only reference
`polarquant_exp`	supported, non-paper-facing	3	4	none	PolarQuant with family-scoped certification
`legacy_topk`, `balanced`, `max_quality`	compatibility-only	—	—	top-k	Loading historical configs only

paper_prod is a stable alias for paper_prod_qjl.
high_compression is a legacy alias for the paper_prod_qjl family.
polarquant_exp is a formally supported contract surface but is outside the paper-facing preset story.
Generated preset math lives in docs/support_matrix.md.

Model Support

Family	Status	Evidence depth	Coverage
Llama	✅ allowlisted	stronger	real-model smoke · PolarQuant runtime & quality · `paper_mse` batch guardrail · long-context stability · dense-vs-TQ benchmark sweeps
Gemma	✅ allowlisted	narrower	real-model smoke · PolarQuant runtime & quality · dense-vs-TQ benchmark sweeps

Families reachable through the patch layer are not automatically supported. Allowlist membership is a contract decision, not a side effect of patch reachability. See turboquant/runtime/support.py and turboquant/contract.json.

Gemma coverage is intentionally narrower: the conservative paper_mse batch quality guardrail is still Llama-scoped.

Validation & Certification

Static checks (any platform)

make test-static        # static unit test suite, no MLX required
make compile            # bytecode check across all source + test modules

Apple Silicon runtime checks

make test-structural    # path-proof, cache roundtrip, streaming attention — no model weights
make test-path-proof    # verify TQ path is exercised, not dense fallback
make test-smoke-llama   # Llama smoke (TinyModel by default)
make test-smoke-gemma   # Gemma smoke (TinyModel by default)
make test-long-context  # long-context stability (TinyModel by default)
make test-mlx           # full MLX suite

Real-model smoke (Apple Silicon)

export TQ_TEST_LLAMA_MODEL="mlx-community/Llama-3.2-1B-Instruct-4bit"
export TQ_TEST_GEMMA_MODEL="mlx-community/gemma-2-2b-it-4bit"

make test-smoke-llama
make test-smoke-gemma
make test-long-context

Full certification bundle

export TQ_TEST_LLAMA_MODEL="mlx-community/Llama-3.2-1B-Instruct-4bit"
export TQ_TEST_GEMMA_MODEL="mlx-community/gemma-2-2b-it-4bit"
bash scripts/certify_apple_runtime.sh

Evidence rule — artifacts/runtime-cert/ bundles are workflow outputs; built wheels and source distributions do not ship that directory. Static CI passing on Linux is not a runtime go/no-go — source and built snapshots do not, by themselves, prove a current Apple runtime PASS. Only a published certification artifact or pinned manifest digest from a tagged apple-runtime-cert workflow run proves a current PASS for both allowlisted families.

Project Layout

TURB0/
├── turboquant/
│   ├── config.py                  # TurboQuantConfig — the runtime config API
│   ├── contract.json              # machine-readable support contract
│   ├── patch.py                   # upstream mlx_lm patch bootstrap
│   ├── core/                      # rotation, quantizer, QJL, PolarQuant
│   ├── runtime/                   # attention fast path, support gate
│   ├── integrations/mlx/          # upgrade_cache_list, cache adapter
│   ├── eval/                      # logit comparison helpers
│   └── kernels/                   # experimental Metal kernel stubs
├── benchmarks/
│   ├── exploratory/               # micro-benchmarks and ablations
│   └── runtime_cert/              # certification benchmark scripts
├── tests/
│   ├── unit_static/               # contract / structural tests (no MLX)
│   ├── unit_mlx/                  # unit tests requiring MLX
│   └── integration_mlx/          # full-path integration tests
├── scripts/                       # certify_apple_runtime.sh, validate_local.sh
├── tools/                         # dist verification, surface audit
└── docs/                          # generated and hand-written documentation

Documentation

Document	Purpose
docs/architecture.md	Runtime path and component map
docs/theory.md	Paper-claim traceability and current evidence limits
docs/product_contract.md	Generated top-level product boundary
docs/support_matrix.md	Generated family and preset matrix
docs/supported-surface.md	Generated canonical vs secondary surface definitions
docs/preset_modes.md	Generated preset taxonomy
docs/runtime-certification.md	Certification scope, stages, and evidence contract
docs/validation-local.md	Local validation walkthrough
docs/benchmark_methodology.md	Benchmark publication and provenance rules
docs/benchmark_index.md	Generated index of benchmark surfaces and lane boundaries
docs/family_evidence_matrix.md	Release-gated vs research-only evidence split
docs/integration.md	Model-family wiring and PolarQuant integration details
docs/evaluation.md	Exploratory quality-evaluation guidance
docs/bit_budget_sweep.md	Research-only bit-budget sweep
docs/kv_paper_eval.md	Unified KV report command (fast-check vs heavy-offline tiers)
docs/vector_search.md	Research-only vector-search benchmark lane
docs/vendored-upstream-boundary.md	Upstream `mlx_lm` patch boundary

Contributing

Run make test-static on any platform before opening a PR.
On Apple Silicon, run make test-mlx and make test-structural before widening any runtime claims.
If you change runtime-contract or evidence wording, update turboquant/contract.json and regenerate the derived docs.
If you change the preset registry or classifications, update turboquant/config.py and turboquant/contract.json together, then regenerate the derived docs.
If you add a model family or preset to the supported story, extend the certification surface before updating the README.

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 172 Commits
.github/workflows		.github/workflows
benchmarks		benchmarks
docs		docs
integrations/mlx		integrations/mlx
scripts		scripts
tests		tests
tools		tools
turboquant		turboquant
.gitignore		.gitignore
.markdownlint.json		.markdownlint.json
.python-version		.python-version
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
Makefile		Makefile
README.md		README.md
README_test.md		README_test.md
RELEASE_CANDIDATE_NOTES.md		RELEASE_CANDIDATE_NOTES.md
VENDORED_MLX_LM.md		VENDORED_MLX_LM.md
conftest.py		conftest.py
mypy		mypy
noxfile.py		noxfile.py
pyproject.toml		pyproject.toml
ruff		ruff
run_all.sh		run_all.sh
setup.cfg		setup.cfg

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TurboQuant

Research-grade KV-cache compression for Apple Silicon MLX LLMs

Quick Start

Upgrade a cache after prefill

How It Works

Presets

Model Support

Validation & Certification

Static checks (any platform)

Apple Silicon runtime checks

Real-model smoke (Apple Silicon)

Full certification bundle

Project Layout

Documentation

Contributing

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

TurboQuant

Research-grade KV-cache compression for Apple Silicon MLX LLMs

Quick Start

Upgrade a cache after prefill

How It Works

Presets

Model Support

Validation & Certification

Static checks (any platform)

Apple Silicon runtime checks

Real-model smoke (Apple Silicon)

Full certification bundle

Project Layout

Documentation

Contributing

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages