Rust-native vector quantization codec for embedding compression — CPU SIMD, optional GPU acceleration, and Python/TypeScript bindings.
Note
TinyQuant is a Rust-native vector quantization codec that compresses high-dimensional embedding vectors to low-bit representations while preserving cosine similarity rankings. It combines random orthogonal preconditioning with two-stage scalar quantization and optional FP16 residual correction to hit 8× compression at 4-bit with Pearson ρ ≈ 0.998 and 95% top-5 recall on real OpenAI embeddings.
- What it is: a Rust library (with Python and TypeScript bindings) that squeezes embedding vectors into 4-bit (or 2-bit) representations without losing retrieval quality, with optional wgpu GPU acceleration for batch workloads above 512 vectors.
- Who it's for: teams running cosine-similarity search on embeddings and paying for RAM or disk by the gigabyte.
- Headline number: 8× compression at 95% top-5 recall on 1536-dim OpenAI embeddings. 1 M vectors go from 5.7 GB to 732 MB.
On a benchmark of 335 real embeddings from OpenAI's
text-embedding-3-small (1536 dimensions), TinyQuant 4-bit achieves
8× compression with Pearson ρ = 0.998 and 95% top-5 recall —
reducing a 6 KB embedding to 768 bytes while preserving the similarity
rankings that drive retrieval quality.
| Method | Bytes/vec | Compression | Pearson ρ | Top-5 Recall |
|---|---|---|---|---|
| FP32 (baseline) | 6,144 | 1× | 1.0000 | 100% |
| FP16 | 3,072 | 2× | 1.0000 | 100% |
| uint8 scalar | 1,544 | 4× | 1.0000 | 100% |
| TinyQuant 4-bit | 768 | 8× | 0.9981 | 95% |
| TinyQuant 2-bit | 384 | 16× | 0.9643 | 85% |
| TinyQuant 4-bit + residual | 3,840 | 1.6× | 1.0000 | 100% |
For a corpus of 1 million 1536-dim vectors, TinyQuant 4-bit reduces storage from 5.7 GB to 732 MB with negligible loss in retrieval quality.
See the full benchmark report for methodology, all 9 methods compared, throughput measurements, and publication-quality plots.
Contents
TinyQuant is published on PyPI as tinyquant-cpu and imports as
tinyquant_cpu. The current release is a Rust-backed fat wheel — no
pure-Python fallback.
| I want to... | Install command |
|---|---|
| Python (Rust-backed, current) | pip install tinyquant-cpu |
| Python + PostgreSQL/pgvector support | pip install "tinyquant-cpu[pgvector]" |
| Rust native crate | cargo add tinyquant-core |
| TypeScript / Node / Bun | npm install tinyquant |
| Work on this repository | see the Development section below |
Tip
The [pgvector] extra pulls in psycopg[binary]>=3.1 for talking to a
live PostgreSQL database. Python 3.12+ is required; the Rust workspace
MSRV is 1.81, with the optional tinyquant-gpu-wgpu crate carved out
at 1.87 in its own CI lane.
TinyQuant ships the same codec / corpus / backend surface across three
languages, versioned in lockstep via rust/Cargo.toml
workspace.package.version. All bindings delegate math to the shared
tinyquant-core Rust crate — there is no per-language reimplementation.
| Language | Package | Install | Since |
|---|---|---|---|
| Python | tinyquant-cpu ( |
pip install tinyquant-cpu |
Phase 24 |
| Rust | tinyquant-core ( |
cargo add tinyquant-core |
Phase 22 |
| TypeScript | tinyquant ( |
npm install tinyquant |
Phase 25 |
All three packages guarantee byte-identical output on config_hash,
Codebook::to_bytes, and CompressedVector::to_bytes. See
COMPATIBILITY.md for the supported cross-package
version pairs.
import numpy as np
from tinyquant_cpu.codec import Codec, CodecConfig
from tinyquant_cpu.corpus import Corpus, CompressionPolicy
from tinyquant_cpu.backend import BruteForceBackend
# 1. Configure the codec: 4-bit quantization for 1536-dim vectors
config = CodecConfig(bit_width=4, dimension=1536, seed=42)
codec = Codec()
# 2. Train a codebook from representative vectors
training_vectors = np.random.default_rng(0).standard_normal((1000, 1536)).astype(np.float32)
codebook = codec.build_codebook(training_vectors, config)
# 3. Create a corpus that compresses on insert
corpus = Corpus("my-vectors", config, codebook, CompressionPolicy.COMPRESS)
for i, vec in enumerate(training_vectors):
corpus.insert(f"vec-{i}", vec)
# 4. Decompress and search
backend = BruteForceBackend()
backend.ingest(corpus.decompress_all())
results = backend.search(training_vectors[42], top_k=5)
for r in results:
print(f"{r.vector_id}: {r.score:.4f}")What just happened?
- Configure —
CodecConfig(bit_width=4, dimension=1536, seed=42)sets the bit width (4→ 8× compression), the vector dimension, and the RNG seed that controls the random rotation matrix. The seed makes the codec deterministic — same inputs always produce byte-identical output across all language bindings. - Train —
codec.build_codebook(training_vectors, config)fits a small codebook on a representative sample of your data. - Insert —
Corpus(..., CompressionPolicy.COMPRESS)creates a domain aggregate that compresses every vector on insert and tracks vector IDs. - Decompress —
corpus.decompress_all()produces(vector_id, fp32_vector)pairs. The Rust core runs these in parallel via Rayon. - Search —
BruteForceBackendperforms exact cosine search and returnsSearchResultobjects with IDs and scores. Swap forPgvectorAdapterin production, or use the GPU path for large corpora.
The problem. Naive scalar quantization crushes real embedding data because coordinate distributions are skewed: a handful of dimensions carry most of the signal and get mapped to the same bucket as noise.
The trick. Pre-multiplying each vector by a random orthogonal matrix (derived via QR decomposition of a Gaussian matrix) uniformizes the coordinate distribution without changing pairwise distances. After rotation, a single shared scalar quantizer works well across all dimensions. This is the core insight from TurboQuant and PolarQuant.
Two-stage refinement. An optional FP16 residual on top of the 4-bit coarse codebook gives you a separate point on the rate-distortion curve: 8× compression and ρ ≈ 0.998 without the residual; 1.6× compression and ρ = 1.000 with it enabled — useful for reranking stages.
Rust core with CPU and GPU paths. The codec runs through
tinyquant-core, which dispatches SIMD kernels at runtime (AVX2+FMA on
x86_64, NEON on aarch64) and parallelizes batch compression with Rayon.
For workloads exceeding the 512-vector threshold, the optional
tinyquant-gpu-wgpu crate offloads rotate/quantize/dequantize/residual
and corpus cosine search to WGSL compute shaders via wgpu, with lazy
pipeline caching to avoid per-call recompilation.
Backend-agnostic. The codec produces CompressedVector bytes; search
lives in a separate SearchBackend layer (BruteForceBackend for in-memory
exact search, PgvectorAdapter for PostgreSQL + pgvector, WgpuBackend for
GPU-accelerated corpus search), so you can plug TinyQuant into any retrieval
store without coupling storage to search.
Pick the config that matches your rate-distortion target:
| Config | Bytes/vec | Compression | ρ | Top-5 | When to use |
|---|---|---|---|---|---|
CodecConfig(bit_width=4) |
768 | 8× | 0.998 | 95% | Default balance |
CodecConfig(bit_width=2) |
384 | 16× | 0.964 | 85% | Aggressive, needs rerank |
CodecConfig(bit_width=4, residual_enabled=True) |
3,840 | 1.6× | 1.000 | 100% | Reranking / exact-match |
Single-vector compression
import numpy as np
from tinyquant_cpu.codec import Codec, CodecConfig
config = CodecConfig(bit_width=4, dimension=768, seed=42)
codec = Codec()
training_data = np.random.default_rng(0).standard_normal((1000, 768)).astype(np.float32)
codebook = codec.build_codebook(training_data, config)
vector = training_data[0]
compressed = codec.compress(vector, config, codebook)
print(f"Original: {vector.nbytes} bytes")
print(f"Compressed: {compressed.size_bytes} bytes")
print(f"Ratio: {vector.nbytes / compressed.size_bytes:.1f}x")
restored = codec.decompress(compressed, config, codebook)Batch compression (Rayon-parallel)
# Parallelized via Rayon in the Rust core — byte-identical to serial output
vectors = np.random.default_rng(0).standard_normal((10_000, 768)).astype(np.float32)
compressed_batch = codec.compress_batch(vectors, config, codebook)
restored_batch = codec.decompress_batch(compressed_batch, config, codebook)Tuning the rate–distortion tradeoff
# Maximum compression: 16x at 2-bit
config_2bit = CodecConfig(bit_width=2, dimension=768, seed=42, residual_enabled=False)
# Practical sweet spot: 8x at 4-bit (rho >= 0.998)
config_4bit = CodecConfig(bit_width=4, dimension=768, seed=42, residual_enabled=False)
# Near-perfect fidelity: 4-bit + FP16 residual correction (1.6x, rho = 1.000)
config_4bit_res = CodecConfig(bit_width=4, dimension=768, seed=42, residual_enabled=True)[!WARNING] 2-bit compression drops top-5 recall to ~85%. Only use it when a reranking stage (FP16 residual, cross-encoder, exact search) sits downstream to recover the missing signal.
Compression policies
A Corpus can store vectors in three modes:
from tinyquant_cpu.corpus import Corpus, CompressionPolicy
corpus_compressed = Corpus("c", config, codebook, CompressionPolicy.COMPRESS)
corpus_full = Corpus("p", config, codebook, CompressionPolicy.PASSTHROUGH)
corpus_fp16 = Corpus("h", config, codebook, CompressionPolicy.FP16)Policies let one corpus mix hot data (PASSTHROUGH), cold data (COMPRESS), and middle-tier data (FP16) without rebuilding the codec.
Binary serialization (TQCV format)
CompressedVector instances serialize to the TQCV versioned binary format
(70-byte header + LSB-first packed indices + optional FP16 residual),
suitable for disk, network, or database storage. Mmap corpus files are
available via the Rust tinyquant-io crate for zero-copy access.
from tinyquant_cpu.codec import CompressedVector
raw_bytes = compressed.to_bytes()
restored = CompressedVector.from_bytes(raw_bytes)PostgreSQL + pgvector backend
import psycopg
from tinyquant_cpu.backend.adapters.pgvector import PgvectorAdapter
adapter = PgvectorAdapter(
connection_factory=lambda: psycopg.connect("postgresql://user:pass@localhost/mydb"),
table_name="embeddings",
)
adapter.ingest(corpus.decompress_all())
results = adapter.search(query_vector, top_k=10)[!IMPORTANT] Requires PostgreSQL with the
pgvectorextension installed. CI runs these tests against a livepgvector/pgvector:pg17container via testcontainers.
GPU acceleration (Rust only — wgpu)
The tinyquant-gpu-wgpu crate provides a WgpuBackend that offloads
batch compress/decompress and corpus cosine search to WGSL compute shaders.
It is workspace-internal (publish = false) and selected automatically
when a batch exceeds GPU_BATCH_THRESHOLD (512 vectors).
use tinyquant_gpu_wgpu::{WgpuBackend, BackendPreference};
// Default adapter (auto-select highest-performance GPU)
let backend = WgpuBackend::new().await?;
// Or select a specific backend:
let backend = WgpuBackend::new_with_preference(BackendPreference::Vulkan).await?;
// Warm up pipeline cache explicitly (optional — lazy otherwise)
backend.load_pipelines().await;
// GPU corpus search
let state = backend.prepare_corpus_for_device(&corpus_vecs).await?;
let results = backend.cosine_topk(&state, &query_vec, top_k).await?;Available BackendPreference variants: Auto, Vulkan, Metal, Dx12,
HighPerformance, LowPower, Software.
- 8× compression at 4-bit without residuals (ρ = 0.998, 95% recall)
- 16× compression at 2-bit (ρ = 0.964, 85% recall)
- Perfect fidelity with optional FP16 residual correction (ρ = 1.000)
- Deterministic — same inputs produce byte-identical output across all language bindings and CPU architectures
- Rust-native core —
tinyquant-core; CPU SIMD dispatch (AVX2+FMA / NEON) viais_x86_feature_detected!/ ARMv8 base-ISA guarantee; Rayon parallel batch with determinism contract - Optional GPU acceleration —
tinyquant-gpu-wgpu; WGSL rotate/quantize/dequantize/residual and cosine-topk kernels; lazyCachedPipelines;BackendPreferenceadapter selection; auto-routes at ≥ 512 vectors - Multi-language — Python fat wheel (
tinyquant-cpu), TypeScript/Node (tinyquant), Rust native (tinyquant-core), C ABI (tinyquant-sys) - Pluggable backends —
BruteForceBackendfor in-process exact search;PgvectorAdapterfor PostgreSQL + pgvector;WgpuBackendfor GPU corpus search - Three compression policies — COMPRESS, PASSTHROUGH, FP16, mixable within a corpus
- TQCV serialization — versioned 70-byte header + LSB-first bit-pack + optional FP16 residual; mmap corpus files via
tinyquant-io - Calibration gates — Pearson ρ and mean recall-at-k measured against OpenAI calibration fixtures; Criterion benchmarks with 10% regression budget
- Fully typed —
py.typedmarker,mypy --strictclean, TypeScript strict mode - Apache-2.0 licensed
TinyQuant adapts ideas from published research into a clean-room implementation:
| Source | Year | Key contribution |
|---|---|---|
| TurboQuant | 2025 | Random rotation + scalar quantization, no per-block norms |
| PolarQuant | 2025 | QR-derived orthogonal preconditioning for coordinate uniformity |
| QJL | 2024 | Inner-product preservation bounds under aggressive quantization |
| Path | Purpose |
|---|---|
rust/crates/tinyquant-core/ |
Codec, corpus, backend trait, SIMD dispatch, Rayon parallel batch |
rust/crates/tinyquant-io/ |
TQCV serialization format and mmap corpus files |
rust/crates/tinyquant-gpu-wgpu/ |
Optional wgpu/WGSL GPU accelerator (publish = false, workspace-internal) |
rust/crates/tinyquant-py/ |
pyo3 Python extension — the engine behind tinyquant-cpu |
rust/crates/tinyquant-sys/ |
C ABI via cbindgen |
rust/crates/tinyquant-cli/ |
Standalone CLI binary |
rust/crates/tinyquant-js/ |
napi-rs TypeScript/Node bindings (tinyquant) |
rust/crates/tinyquant-bruteforce/ |
BruteForceBackend reference implementation |
rust/crates/tinyquant-pgvector/ |
PostgreSQL + pgvector ACL adapter |
rust/crates/tinyquant-bench/ |
Criterion benchmarks + calibration quality gates |
tests/reference/tinyquant_py_reference/ |
Pure-Python frozen oracle — differential test reference (not shipped) |
tests/parity/ |
Cross-implementation parity suite (pytest -m parity) |
tests/ |
Python unit, integration, E2E, architecture, and calibration suites |
experiments/ |
Benchmarks and empirical evaluations |
docs/ |
Obsidian wiki: design docs, research, SDLC plans, CI/CD specs |
git clone https://github.com/better-with-models/TinyQuant.git
cd TinyQuant
# Python dev dependencies
pip install pytest pytest-cov hypothesis numpy ruff mypy build
# Lint and format
ruff check . && ruff format --check .
# Strict type check
mypy --strict .
# Run the full Python suite
pytest --cov=tinyquant_py_reference
# Cross-impl parity (Python ↔ Rust)
pytest -m parity -v
# Rust: lint and test
cd rust
cargo clippy --workspace -- -D warnings
cargo test --workspaceThe Python test suite includes 289 tests covering unit, integration,
end-to-end, calibration, parity (cross-impl Python ↔ Rust), and
architecture-enforcement scenarios. Coverage is held above 90% by CI
(94% for the codec subpackage). Live PostgreSQL + pgvector tests run
against a Docker container in CI via testcontainers.
Tip
CI enforces three strict gates: ruff check / ruff format --check,
mypy --strict, and markdownlint-cli2 for all markdown outside docs/.
The docs/ vault uses Obsidian-flavored markdown under its own rules —
see AGENTS.md for the policy.
export OPENAI_API_KEY="your-key-here"
python experiments/quantization-benchmark/generate_embeddings.py
python experiments/quantization-benchmark/run_benchmark.py
python experiments/quantization-benchmark/generate_plots.pyThis fetches 335 embeddings via the OpenAI API, benchmarks 9 quantization
methods, and produces plots and JSON results in
experiments/quantization-benchmark/results/.
Contributions are welcome. The short version:
- Issues and design discussions — open a GitHub issue before starting non-trivial work so we can agree on scope.
- Follow the repo SDLC — architecture decisions, coding standards, and
pre-commit expectations live in
AGENTS.mdand thedocs/design/vault. ReadCLAUDE.mdif you're driving Claude Code or another LLM agent against this repo. - Run the full gate locally before pushing:
ruff check . && ruff format --check . && mypy --strict . && pytest --cov=tinyquant_cpu - Keep prose aligned — edits to the project tagline, elevator pitch, or
headline benchmark numbers must land in
README.md,.github/README.md,AGENTS.md, andCLAUDE.mdin the same commit.
Apache-2.0. See LICENSE.
- Benchmark Report — full empirical evaluation in CS-paper format
- CHANGELOG — release notes
- Design: Storage Codec Architecture
- Design: GPU Acceleration
- Research: Vector Quantization Paper Synthesis
- QA: Validation Plan
- CI Plan and CD Plan
