TurboQuant is a Rust library for research-grade vector quantization of LLM KV caches. It now includes three benchmark/evaluation paths:
- `synthetic`: model-shaped random vectors
- `trace`: exported per-head safetensors traces
- `real-model`: true end-to-end decoder inference on lightweight ONNX models with iterative past-key-value reuse
Current status as of 2026-03-25: alpha. Suitable for local research, benchmarking, and integration experiments. Not yet a production inference backend.
```toml
[dependencies]
turboquant = "0.1.1"
```

Rust 1.87.0+ is required for the default CPU path. The experimental `gpu` feature depends on the Burn/WGPU stack and may require a newer stable Rust toolchain.
Feature flags:
- `default`: scalar CPU path plus runtime-dispatched SIMD
- `gpu`: experimental Burn/WGPU batch kernels
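To illustrate what "runtime-dispatched SIMD" means here, a minimal sketch of the dispatch pattern (this is not the crate's actual kernel code; the AVX2 branch calls the scalar kernel so the sketch stays self-contained):

```rust
// Scalar fallback kernel: works on every target.
fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

// Runtime dispatch: probe CPU features once per call and pick a kernel.
fn dot_dispatch(a: &[f32], b: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // A real implementation would call a dedicated AVX2 kernel here;
            // the scalar kernel stands in for illustration.
            return dot_scalar(a, b);
        }
    }
    dot_scalar(a, b)
}

fn main() {
    let a = [1.0f32, 2.0, 3.0];
    let b = [4.0f32, 5.0, 6.0];
    println!("dot = {}", dot_dispatch(&a, &b)); // 1*4 + 2*5 + 3*6 = 32
}
```

Real implementations usually cache the detected feature set in a function pointer instead of probing on every call.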
| Type | Purpose | Notes |
|---|---|---|
| `TurboQuantMSE` | Reconstruction-oriented vector quantization | Unit-norm input contract |
| `TurboQuantProd` | Inner-product-oriented vector quantization | Requires `bit_width >= 2` |
| `BatchQuantizedMSE` / `BatchQuantizedProd` | Packed batch storage | Validate layout after deserialization |
| `QuantizedKVCache` / `MultiHeadKVCache` | Quantized KV cache helpers | Keys and values can be reconstructed |
| `KvTrace` | Trace loader for exported per-head workloads | Rejects invalid query positions |
| `RealModelRunner` | End-to-end ONNX decoder runner via ort / ONNX Runtime | CPU-oriented real-model path |
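The "packed batch storage" idea can be sketched with nibble packing: two 4-bit codes per byte, which halves storage compared with one byte per code. This is an illustrative layout, not the crate's actual on-disk format:

```rust
// Pack 4-bit codes (values 0..=15) two per byte: low nibble first.
fn pack4(codes: &[u8]) -> Vec<u8> {
    codes
        .chunks(2)
        .map(|c| (c[0] & 0x0F) | ((c.get(1).copied().unwrap_or(0) & 0x0F) << 4))
        .collect()
}

// Unpack back to one code per byte; `len` trims the padding nibble.
fn unpack4(packed: &[u8], len: usize) -> Vec<u8> {
    packed
        .iter()
        .flat_map(|&b| [b & 0x0F, b >> 4])
        .take(len)
        .collect()
}

fn main() {
    let codes = vec![1u8, 7, 15, 3, 9];
    let packed = pack4(&codes);
    assert_eq!(packed.len(), 3); // 5 nibbles fit in 3 bytes
    assert_eq!(unpack4(&packed, codes.len()), codes); // lossless roundtrip
}
```

A packed layout like this is also why the notes above suggest validating layout after deserialization: a truncated buffer silently drops trailing codes.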
The repository now has a true decoder loop for lightweight open-source models:
- load a tokenizer and ONNX decoder bundle
- run prompt prefill
- run iterative decoding with explicit `past_key_values`
- compare exact cache reuse vs quantized cache reuse in the actual decode loop
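The shape of that decode loop can be sketched with a stub step function standing in for the ONNX session (illustrative only; `run_step` and `StepOutput` are hypothetical names, not the crate's API):

```rust
struct StepOutput {
    next_token: u32,
    present_kv: Vec<f32>, // flattened stand-in for per-layer key/value state
}

// Stub "decoder": extends the cache by one entry per input token and
// deterministically picks the next token so the sketch is checkable.
fn run_step(tokens: &[u32], past_kv: &[f32]) -> StepOutput {
    let mut present_kv = past_kv.to_vec();
    present_kv.extend(tokens.iter().map(|&t| t as f32));
    StepOutput { next_token: tokens.last().unwrap() + 1, present_kv }
}

fn generate(prompt: &[u32], max_new_tokens: usize) -> Vec<u32> {
    // Prefill: one pass over the full prompt produces the initial cache.
    let mut out = run_step(prompt, &[]);
    let mut generated = vec![out.next_token];
    // Iterative decode: one token in per step, cache reused and extended.
    for _ in 1..max_new_tokens {
        let last = *generated.last().unwrap();
        out = run_step(&[last], &out.present_kv);
        generated.push(out.next_token);
    }
    generated
}

fn main() {
    println!("{:?}", generate(&[10, 11, 12], 4)); // [13, 14, 15, 16]
}
```

In the real path, `run_step` is an ONNX Runtime session call and `present_kv` is the set of `present.<layer>.{key,value}` tensors fed back as `past_key_values`.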
The real-model execution backend is ONNX Runtime on CPU via the Rust ort binding. Burn remains in the repository for optional WGPU batch quantization kernels, but it is not the primary path for full decoder inference.
Verified end-to-end on the Rust real-model path today:
- `distilgpt2`
- `HuggingFaceTB/SmolLM2-135M-Instruct`
The export helper also includes additional presets for experimentation, but only the verified models above should be treated as supported.
Other decoder-only models can work if their exported ONNX bundle exposes:
- `input_ids`
- optional `attention_mask`, `position_ids`, `cache_position`, and `use_cache_branch`
- `past_key_values.<layer>.{key,value}` inputs with matching `present.<layer>.{key,value}` outputs
- `logits`
The quantized real-model path quantizes the cache in the real decode loop, then reconstructs float tensors before feeding them back into ONNX Runtime for the next step. That means:
- KV storage metrics reflect the quantized cache representation
- generation quality reflects quantized-cache reuse
- ONNX Runtime still performs standard float attention math internally
This is true end-to-end model execution with quantized cache feedback, but it is not a custom quantized attention kernel inside the ONNX runtime.
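The quantize-then-reconstruct feedback can be sketched with a simple symmetric uniform quantizer (illustrative only; the crate's actual quantizers are the codebook-based types listed above):

```rust
// Quantize a float slice to signed low-bit codes plus one scale per vector.
fn quantize(v: &[f32], bits: u32) -> (Vec<i8>, f32) {
    let max_code = (1i32 << (bits - 1)) - 1; // e.g. 7 for 4-bit symmetric codes
    let amax = v.iter().fold(0.0f32, |m, x| m.max(x.abs()));
    let scale = if amax > 0.0 { amax / max_code as f32 } else { 1.0 };
    let codes = v.iter().map(|x| (x / scale).round() as i8).collect();
    (codes, scale)
}

// Reconstruct float tensors before the next decoder step sees them.
fn dequantize(codes: &[i8], scale: f32) -> Vec<f32> {
    codes.iter().map(|&c| c as f32 * scale).collect()
}

fn main() {
    let kv = [0.8f32, -0.2, 0.05, -0.7];
    let (codes, scale) = quantize(&kv, 4);
    let restored = dequantize(&codes, scale); // what ONNX Runtime receives
    let rmse = (kv.iter().zip(&restored)
        .map(|(a, b)| (a - b) * (a - b))
        .sum::<f32>() / kv.len() as f32)
        .sqrt();
    assert!(rmse < 0.06); // small reconstruction error at 4 bits
    println!("rmse = {rmse}");
}
```

Storage cost is measured on the `(codes, scale)` representation, while attention math downstream still runs on the reconstructed floats, matching the three bullets above.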
Pinned Python dependencies for the real-model scripts live in `scripts/requirements-real-model.txt`.
The Rust real-model path also pulls ONNX Runtime CPU binaries through the ort crate on first build.
Example setup:
```sh
python3 -m venv .venv
. .venv/bin/activate
pip install -r scripts/requirements-real-model.txt
```

Export a documented lightweight preset:
```sh
python3 scripts/export_hf_decoder_onnx.py \
  --preset distilgpt2 \
  --output-dir artifacts/distilgpt2-onnx
```

Or export the verified SmolLM2 preset:
```sh
python3 scripts/export_hf_decoder_onnx.py \
  --preset smollm2-135m-instruct \
  --output-dir artifacts/smollm2-135m-instruct-onnx
```

Or export an explicit model id:
```sh
python3 scripts/export_hf_decoder_onnx.py \
  --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --output-dir artifacts/tinyllama-onnx
```

The export helper targets `text-generation-with-past` and defaults to fp32, which is the current verified dtype on the CPU ONNX Runtime path.
Synthetic quick run:
```sh
cargo run --release --example benchmark -- --workload synthetic --quick
```

Trace run:
```sh
cargo run --release --example benchmark -- \
  --workload trace \
  --trace traces/example.safetensors
```

Real-model exact run:
```sh
cargo run --release --example benchmark -- \
  --workload real-model \
  --real-model-dir artifacts/distilgpt2-onnx \
  --prompt "Summarize the role of a KV cache in one sentence." \
  --real-eval-mode exact \
  --max-new-tokens 16
```

Real-model quantized run:
```sh
cargo run --release --example benchmark -- \
  --workload real-model \
  --real-model-dir artifacts/distilgpt2-onnx \
  --prompt "Summarize the role of a KV cache in one sentence." \
  --real-eval-mode quantized \
  --bits 4 \
  --real-key-strategy prod \
  --max-new-tokens 16
```

Real-model side-by-side comparison:
```sh
cargo run --release --example benchmark -- \
  --workload real-model \
  --real-model-dir artifacts/distilgpt2-onnx \
  --prompt "Summarize the role of a KV cache in one sentence." \
  --real-eval-mode compare \
  --bits 4 \
  --value-bits 4 \
  --real-key-strategy prod \
  --top-k 5 \
  --max-new-tokens 16
```

The CLI reports the workload source explicitly as `synthetic`, `trace`, or `real-model` to avoid confusing model-shaped workloads with true decoder runs.
For a fuller end-to-end workflow, use the orchestration helper:
```sh
python3 scripts/run_real_model_eval.py \
  --preset distilgpt2 \
  --bits 2 4 8 \
  --strategies prod mse
```

What it does:
- exports or reuses a real ONNX decoder bundle
- builds the Rust benchmark example once
- runs an exact baseline
- runs multiple exact-vs-quantized compare benchmarks
- writes raw JSON plus a markdown summary report under `artifacts/real-model-evals/`
Useful options:
```sh
python3 scripts/run_real_model_eval.py \
  --model-dir artifacts/distilgpt2-onnx \
  --prompts scripts/prompts/real_model_eval_prompts.jsonl \
  --max-prompts 6 \
  --max-new-tokens 24 \
  --top-k 5 \
  --bits 4 8 \
  --strategies prod
```

The default prompt suite lives at `scripts/prompts/real_model_eval_prompts.jsonl`.
The `real-model` mode can report:
- next-token logit RMSE
- top-k agreement
- token match rate and divergence rate
- reference-token cross-entropy / perplexity
- latency
- tokens/sec
- exact vs quantized KV memory usage
For compare mode, cross-entropy is computed against the exact run's generated token at each shared step. This is a distribution-drift metric, not a dataset perplexity claim.
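That per-step metric can be sketched as the cross-entropy of the quantized run's next-token logits against the token the exact run generated (an illustrative computation, not the CLI's actual code):

```rust
// -log softmax(logits)[target], via a numerically stable log-sum-exp.
fn cross_entropy(logits: &[f32], target: usize) -> f32 {
    let max = logits.iter().fold(f32::NEG_INFINITY, |m, &x| m.max(x));
    let lse = max + logits.iter().map(|&x| (x - max).exp()).sum::<f32>().ln();
    lse - logits[target]
}

fn main() {
    // Quantized run's logits at one decode step (toy 3-token vocabulary).
    let quantized_logits = [2.0f32, 0.5, -1.0];
    // Token the exact run actually generated at the same step.
    let exact_token = 0;
    let ce = cross_entropy(&quantized_logits, exact_token);
    // Averaging this over shared steps and exponentiating gives the reported
    // drift "perplexity" -- a distribution-drift number, not dataset perplexity.
    println!("cross-entropy = {ce:.4}, perplexity = {:.4}", ce.exp());
}
```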
The existing trace exporter is still available when you want per-head analysis rather than full-model decode:
```sh
python3 scripts/export_hf_kv.py \
  --model mistralai/Mistral-7B-Instruct-v0.3 \
  --input prompts.txt \
  --output traces/mistral_layer0_head0.safetensors \
  --layer 0 \
  --head 0
```

Then run:
```sh
cargo run --release --example benchmark -- \
  --workload trace \
  --trace traces/mistral_layer0_head0.safetensors
```

Development checks:

```sh
cargo fmt -- --check
cargo clippy --all-targets --all-features -- -D warnings
cargo test --all-features
cargo check --examples --all-features
cargo llvm-cov --workspace --all-features --summary-only
cargo audit
```

CI-safe:
```sh
cargo test --all-features
```

This includes a tiny local ONNX fixture that exercises the ONNX Runtime/tokenizer/KV-cache path without downloading external weights.
Manual heavier smoke test:
```sh
TURBOQUANT_REAL_MODEL_DIR=artifacts/real-model-bundles/distilgpt2 \
cargo test --all-features manual_exported_real_model_smoke_test -- --ignored --nocapture
```

- The real-model backend is CPU-oriented and currently uses ONNX Runtime, not Burn.
- Quantized real-model evaluation reconstructs float past tensors before the next ONNX Runtime step.
- The WGPU/Burn path remains experimental and is still primarily a batch-kernel benchmark surface.
- The crate still assumes unit-norm vectors for the core quantizer APIs; the real-model path handles raw KV tensors by storing norms separately.
- The verified real-model surface is currently limited to `distilgpt2` and `HuggingFaceTB/SmolLM2-135M-Instruct`.
- This repository does not provide production serving, observability, or deployment tooling.
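The norm-separated handling noted above can be sketched as follows (illustrative only; `split_norm` and `reconstruct` are hypothetical helpers, not the crate's API): the quantizer only ever sees a unit-norm direction, and the scalar norm travels alongside the codes.

```rust
// Split a raw KV vector into a unit-norm direction plus its stored norm.
fn split_norm(v: &[f32]) -> (Vec<f32>, f32) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm == 0.0 {
        return (v.to_vec(), 0.0); // zero vectors pass through unchanged
    }
    (v.iter().map(|x| x / norm).collect(), norm)
}

// Rescale the (possibly quantized and dequantized) direction by the norm.
fn reconstruct(unit: &[f32], norm: f32) -> Vec<f32> {
    unit.iter().map(|x| x * norm).collect()
}

fn main() {
    let raw = [3.0f32, 4.0];
    let (unit, norm) = split_norm(&raw);
    assert!((norm - 5.0).abs() < 1e-6); // norm stored separately
    let restored = reconstruct(&unit, norm);
    // Roundtrip within float tolerance (quantizing `unit` would add error).
    assert!((restored[0] - 3.0).abs() < 1e-5 && (restored[1] - 4.0).abs() < 1e-5);
}
```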
See CONTRIBUTING.md, ARCHITECTURE.md, and AGENTS.md.