AbdelStark/turboquant
TurboQuant

TurboQuant is a Rust library for research-grade vector quantization of LLM KV caches. It now includes three benchmark/evaluation paths:

  • synthetic: model-shaped random vectors
  • trace: exported per-head safetensors traces
  • real-model: true end-to-end decoder inference on lightweight ONNX models with iterative past-key-value reuse

Current status as of 2026-03-25: alpha. Suitable for local research, benchmarking, and integration experiments. Not yet a production inference backend.

Installation

[dependencies]
turboquant = "0.1.1"

Rust 1.87.0+ is required for the default CPU path. The experimental gpu feature depends on the Burn/WGPU stack and may require a newer stable Rust toolchain.

Feature flags:

  • default: scalar CPU path plus runtime-dispatched SIMD
  • gpu: experimental Burn/WGPU batch kernels
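For example, to opt into the experimental GPU path, enable the gpu feature in Cargo.toml (a minimal sketch using standard Cargo feature syntax):

```toml
[dependencies]
# default features give the scalar CPU path with runtime-dispatched SIMD;
# adding "gpu" pulls in the experimental Burn/WGPU batch kernels
turboquant = { version = "0.1.1", features = ["gpu"] }
```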

Core APIs

  • TurboQuantMSE: reconstruction-oriented vector quantization; unit-norm input contract
  • TurboQuantProd: inner-product-oriented vector quantization; requires bit_width >= 2
  • BatchQuantizedMSE / BatchQuantizedProd: packed batch storage; validate layout after deserialization
  • QuantizedKVCache / MultiHeadKVCache: quantized KV cache helpers; keys and values can be reconstructed
  • KvTrace: trace loader for exported per-head workloads; rejects invalid query positions
  • RealModelRunner: end-to-end ONNX decoder runner via ort / ONNX Runtime; CPU-oriented real-model path
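The unit-norm input contract, and the trick of storing a vector's norm separately so raw tensors can still be handled, can be illustrated with a small Python sketch of uniform scalar quantization. This is a conceptual illustration of the idea, not the crate's API:

```python
import math

def quantize_unit_norm(vec, bit_width):
    """Uniform scalar quantization of a unit-norm vector to bit_width bits
    per component, in the spirit of a reconstruction-oriented (MSE) quantizer.
    Conceptual sketch only, not the turboquant implementation."""
    levels = (1 << bit_width) - 1
    # components of a unit-norm vector lie in [-1, 1]
    return [round((x + 1.0) / 2.0 * levels) for x in vec]

def reconstruct(codes, bit_width):
    levels = (1 << bit_width) - 1
    return [c / levels * 2.0 - 1.0 for c in codes]

# normalize a raw vector, keeping the norm separately (the real-model path
# does something similar for raw KV tensors)
raw = [0.5, -1.25, 2.0, 0.25]
norm = math.sqrt(sum(x * x for x in raw))
unit = [x / norm for x in raw]

codes = quantize_unit_norm(unit, bit_width=4)
approx = [x * norm for x in reconstruct(codes, bit_width=4)]
mse = sum((a - b) ** 2 for a, b in zip(raw, approx)) / len(raw)
```

At 4 bits per component the reconstruction error is already small; the real quantizers trade this error against the bit width and the target objective (MSE vs inner product).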

Real-Model Support

The repository now has a true decoder loop for lightweight open-source models:

  • load a tokenizer and ONNX decoder bundle
  • run prompt prefill
  • run iterative decoding with explicit past_key_values
  • compare exact cache reuse vs quantized cache reuse in the actual decode loop

The real-model execution backend is ONNX Runtime on CPU via the Rust ort binding. Burn remains in the repository for optional WGPU batch quantization kernels, but it is not the primary path for full decoder inference.
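The shape of that loop can be sketched in Python with a toy stand-in for the ONNX decoder step. Here decoder_step and quantize_kv are illustrative names only, not the ort or turboquant API; the point is the prefill step, the iterative decode, and the quantize-then-reconstruct cache feedback:

```python
def decoder_step(token, past_kv):
    """Toy decoder step: appends one KV entry per call and derives the next
    token from the cache contents. A stand-in for an ONNX Runtime session."""
    new_kv = past_kv + [float(token)]
    next_token = int(sum(new_kv) + token) % 10
    return next_token, new_kv

def quantize_kv(kv, bits=4):
    # quantize each cache entry, then reconstruct floats before the next
    # step, mirroring the quantized-cache feedback described above
    levels = (1 << bits) - 1
    lo, hi = min(kv), max(kv)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((x - lo) / scale) for x in kv]
    return [lo + c * scale for c in codes]

def decode(prompt_token, steps, quantized):
    token, kv = decoder_step(prompt_token, [])    # prompt prefill
    out = [token]
    for _ in range(steps):                        # iterative decoding
        past = quantize_kv(kv) if quantized else kv
        token, kv = decoder_step(token, past)
        out.append(token)
    return out

exact = decode(3, steps=8, quantized=False)
quant = decode(3, steps=8, quantized=True)
match_rate = sum(a == b for a, b in zip(exact, quant)) / len(exact)
```

Running both branches over the same prompt and comparing the generated tokens is, in miniature, what the compare evaluation mode does.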

Supported Lightweight Models

Verified end-to-end on the Rust real-model path today:

  • distilgpt2
  • HuggingFaceTB/SmolLM2-135M-Instruct

The export helper also includes additional presets for experimentation, but only the verified models above should be treated as supported.

Other decoder-only models can work if their exported ONNX bundle exposes:

  • input_ids
  • optional attention_mask, position_ids, cache_position, use_cache_branch
  • past_key_values.<layer>.{key,value}
  • present.<layer>.{key,value}
  • logits
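As an illustration, a bundle's exported input/output names can be checked against this contract before attempting a run. This is a hedged sketch; check_decoder_io is a hypothetical helper, not part of the repository:

```python
import re

REQUIRED_INPUTS = {"input_ids"}
OPTIONAL_INPUTS = {"attention_mask", "position_ids",
                   "cache_position", "use_cache_branch"}
PAST_RE = re.compile(r"^past_key_values\.\d+\.(key|value)$")
PRESENT_RE = re.compile(r"^present\.\d+\.(key|value)$")

def check_decoder_io(inputs, outputs):
    """Return True if the exported decoder exposes the tensors needed for
    iterative past-key-value reuse. Illustrative helper, not repo code."""
    if REQUIRED_INPUTS - set(inputs):
        return False
    has_past = any(PAST_RE.match(n) for n in inputs)
    has_present = any(PRESENT_RE.match(n) for n in outputs)
    return has_past and has_present and "logits" in outputs

# e.g. names read from an exported bundle at load time
inputs = ["input_ids", "attention_mask",
          "past_key_values.0.key", "past_key_values.0.value"]
outputs = ["logits", "present.0.key", "present.0.value"]
ok = check_decoder_io(inputs, outputs)
```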

Important Honesty Note

The quantized real-model path quantizes the cache in the real decode loop, then reconstructs float tensors before feeding them back into ONNX Runtime for the next step. That means:

  • KV storage metrics reflect the quantized cache representation
  • generation quality reflects quantized-cache reuse
  • ONNX Runtime still performs standard float attention math internally

This is true end-to-end model execution with quantized cache feedback, but it is not a custom quantized attention kernel inside the ONNX runtime.

ONNX Export Workflow

Pinned Python dependencies for the real-model scripts live in scripts/requirements-real-model.txt.

The Rust real-model path also pulls ONNX Runtime CPU binaries through the ort crate on first build.

Example setup:

python3 -m venv .venv
. .venv/bin/activate
pip install -r scripts/requirements-real-model.txt

Export a documented lightweight preset:

python3 scripts/export_hf_decoder_onnx.py \
  --preset distilgpt2 \
  --output-dir artifacts/distilgpt2-onnx

Or export the verified SmolLM2 preset:

python3 scripts/export_hf_decoder_onnx.py \
  --preset smollm2-135m-instruct \
  --output-dir artifacts/smollm2-135m-instruct-onnx

Or export an explicit model id:

python3 scripts/export_hf_decoder_onnx.py \
  --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --output-dir artifacts/tinyllama-onnx

The export helper targets text-generation-with-past and defaults to fp32, which is the current verified dtype on the CPU ONNX Runtime path.

Benchmark CLI

Synthetic quick run:

cargo run --release --example benchmark -- --workload synthetic --quick

Trace run:

cargo run --release --example benchmark -- \
  --workload trace \
  --trace traces/example.safetensors

Real-model exact run:

cargo run --release --example benchmark -- \
  --workload real-model \
  --real-model-dir artifacts/distilgpt2-onnx \
  --prompt "Summarize the role of a KV cache in one sentence." \
  --real-eval-mode exact \
  --max-new-tokens 16

Real-model quantized run:

cargo run --release --example benchmark -- \
  --workload real-model \
  --real-model-dir artifacts/distilgpt2-onnx \
  --prompt "Summarize the role of a KV cache in one sentence." \
  --real-eval-mode quantized \
  --bits 4 \
  --real-key-strategy prod \
  --max-new-tokens 16

Real-model side-by-side comparison:

cargo run --release --example benchmark -- \
  --workload real-model \
  --real-model-dir artifacts/distilgpt2-onnx \
  --prompt "Summarize the role of a KV cache in one sentence." \
  --real-eval-mode compare \
  --bits 4 \
  --value-bits 4 \
  --real-key-strategy prod \
  --top-k 5 \
  --max-new-tokens 16

The CLI reports source explicitly as synthetic, trace, or real-model to avoid confusing model-shaped workloads with true decoder runs.

One-Command Real-Model Eval

For a fuller end-to-end workflow, use the orchestration helper:

python3 scripts/run_real_model_eval.py \
  --preset distilgpt2 \
  --bits 2 4 8 \
  --strategies prod mse

What it does:

  • exports or reuses a real ONNX decoder bundle
  • builds the Rust benchmark example once
  • runs an exact baseline
  • runs multiple exact-vs-quantized compare benchmarks
  • writes raw JSON plus a markdown summary report under artifacts/real-model-evals/

Useful options:

python3 scripts/run_real_model_eval.py \
  --model-dir artifacts/distilgpt2-onnx \
  --prompts scripts/prompts/real_model_eval_prompts.jsonl \
  --max-prompts 6 \
  --max-new-tokens 24 \
  --top-k 5 \
  --bits 4 8 \
  --strategies prod

The default prompt suite lives at scripts/prompts/real_model_eval_prompts.jsonl.

Real-Model Metrics

real-model mode can report:

  • next-token logit RMSE
  • top-k agreement
  • token match rate and divergence rate
  • reference-token cross-entropy / perplexity
  • latency
  • tokens/sec
  • exact vs quantized KV memory usage

For compare mode, cross-entropy is computed against the exact run's generated token at each shared step. This is a distribution-drift metric, not a dataset perplexity claim.
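Several of these per-step metrics can be sketched in a few lines of Python from one pair of next-token logit vectors. These are pure-Python stand-ins for what the benchmark reports, not its implementation:

```python
import math

def rmse(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def top_k(logits, k):
    # indices of the k largest logits
    return set(sorted(range(len(logits)), key=lambda i: -logits[i])[:k])

def cross_entropy(logits, reference_token):
    # negative log-probability of the exact run's token under the
    # quantized run's softmax distribution
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[reference_token]

exact     = [2.0, 0.5, -1.0, 0.1]   # exact-cache next-token logits
quantized = [1.9, 0.6, -0.8, 0.0]   # quantized-cache next-token logits

step_rmse = rmse(exact, quantized)
agreement = len(top_k(exact, 2) & top_k(quantized, 2)) / 2
ce = cross_entropy(quantized, reference_token=0)  # exact run picked token 0
perplexity = math.exp(ce)
```

Averaging these quantities over shared decode steps gives the drift metrics above; as noted, the perplexity here measures distribution drift against the exact run, not dataset perplexity.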

Trace Workflow

The existing trace exporter is still available when you want per-head analysis rather than full-model decode:

python3 scripts/export_hf_kv.py \
  --model mistralai/Mistral-7B-Instruct-v0.3 \
  --input prompts.txt \
  --output traces/mistral_layer0_head0.safetensors \
  --layer 0 \
  --head 0

Then run:

cargo run --release --example benchmark -- \
  --workload trace \
  --trace traces/mistral_layer0_head0.safetensors
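As noted in the API list, the trace loader rejects invalid query positions. Conceptually, the check looks like the following sketch (not the KvTrace implementation):

```python
def validate_query_positions(num_cached, query_positions):
    """A trace query must index a position inside the exported cache;
    anything out of range is rejected at load time. Illustrative only."""
    for t in query_positions:
        if t < 0 or t >= num_cached:
            raise ValueError(
                f"query position {t} outside trace of length {num_cached}")
    return True

ok = validate_query_positions(128, [0, 5, 127])
```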

Validation Commands

cargo fmt -- --check
cargo clippy --all-targets --all-features -- -D warnings
cargo test --all-features
cargo check --examples --all-features
cargo llvm-cov --workspace --all-features --summary-only
cargo audit

CI-Safe vs Manual Tests

CI-safe:

  • cargo test --all-features: includes a tiny local ONNX fixture that exercises the ONNX Runtime/tokenizer/KV-cache path without downloading external weights.

Manual heavier smoke test:

TURBOQUANT_REAL_MODEL_DIR=artifacts/real-model-bundles/distilgpt2 \
  cargo test --all-features manual_exported_real_model_smoke_test -- --ignored --nocapture

Limitations

  • The real-model backend is CPU-oriented and currently uses ONNX Runtime, not Burn.
  • Quantized real-model evaluation reconstructs float past tensors before the next ONNX Runtime step.
  • The WGPU/Burn path remains experimental and is still primarily a batch-kernel benchmark surface.
  • The crate still assumes unit-norm vectors for the core quantizer APIs; the real-model path handles raw KV tensors by storing norms separately.
  • The verified real-model surface is currently limited to distilgpt2 and HuggingFaceTB/SmolLM2-135M-Instruct.
  • This repository does not provide production serving, observability, or deployment tooling.

Contributing

See CONTRIBUTING.md, ARCHITECTURE.md, and AGENTS.md.

About

Rust implementation of Google's TurboQuant algorithm for vector quantization
