TinyQuant

Rust-native vector quantization codec for embedding compression — CPU SIMD, optional GPU acceleration, and Python/TypeScript bindings.

[!NOTE] TinyQuant is a Rust-native vector quantization codec that compresses high-dimensional embedding vectors to low-bit representations while preserving cosine similarity rankings. It combines random orthogonal preconditioning with two-stage scalar quantization and optional FP16 residual correction to hit 8× compression at 4-bit with Pearson ρ ≈ 0.998 and 95% top-5 recall on real OpenAI embeddings.

What it is: a Rust library (with Python and TypeScript bindings) that compresses embedding vectors into 4-bit (or 2-bit) representations with little loss in retrieval quality, with optional wgpu GPU acceleration for batch workloads above 512 vectors.
Who it's for: teams running cosine-similarity search on embeddings and paying for RAM or disk by the gigabyte.
Headline number: 8× compression at 95% top-5 recall on 1536-dim OpenAI embeddings. 1 million FP32 vectors go from 5.7 GB to 732 MB.

At a glance

On a benchmark of 335 real embeddings from OpenAI's text-embedding-3-small (1536 dimensions), TinyQuant 4-bit achieves 8× compression with Pearson ρ = 0.998 and 95% top-5 recall — reducing a 6 KB embedding to 768 bytes while preserving the similarity rankings that drive retrieval quality.

Method	Bytes/vec	Compression	Pearson ρ	Top-5 Recall
FP32 (baseline)	6,144	1×	1.0000	100%
FP16	3,072	2×	1.0000	100%
uint8 scalar	1,544	4×	1.0000	100%
TinyQuant 4-bit	768	8×	0.9981	95%
TinyQuant 2-bit	384	16×	0.9643	85%
TinyQuant 4-bit + residual	3,840	1.6×	1.0000	100%

For a corpus of 1 million 1536-dim vectors, TinyQuant 4-bit reduces storage from 5.7 GB to 732 MB (binary units: 5.7 GiB and 732 MiB) while retaining 95% top-5 recall on the benchmark above.
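The sizes in the table can be checked with simple arithmetic. The sketch below assumes that the per-vector figures exclude any serialization header and that the residual stores one FP16 value per coordinate (which is consistent with 3,840 = 768 + 3,072). `bytes_per_vector` is a hypothetical helper written for this example, not part of the TinyQuant API.

```python
def bytes_per_vector(dimension: int, bit_width: int, residual: bool = False) -> int:
    """Payload bytes for one vector: packed indices plus an optional FP16 residual."""
    packed = dimension * bit_width // 8          # bit_width bits per coordinate
    extra = 2 * dimension if residual else 0     # assumed: 2 bytes per coordinate for FP16
    return packed + extra


DIM = 1536
FP32_BYTES = 4 * DIM  # 6,144 bytes for the uncompressed baseline

for label, bits, residual in [
    ("4-bit", 4, False),
    ("2-bit", 2, False),
    ("4-bit + residual", 4, True),
]:
    size = bytes_per_vector(DIM, bits, residual)
    print(f"{label:18s} {size:5d} B/vec  {FP32_BYTES / size:4.1f}x")

# Corpus-level totals for 1 million vectors, in binary units
n = 1_000_000
print(f"FP32:  {n * FP32_BYTES / 2**30:.1f} GiB")
print(f"4-bit: {n * bytes_per_vector(DIM, 4) / 2**20:.0f} MiB")
```

Under these assumptions the script reproduces the 768, 384, and 3,840 byte rows, and the 5.7 GiB and 732 MiB corpus totals.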

See the full benchmark report for methodology, all 9 methods compared, throughput measurements, and publication-quality plots.

Contents

Installation
Language bindings
Quickstart
How it works
Recipes
Key properties
Research lineage
Repository layout
Development
Reproducing the benchmark
Contributing
License
Related documentation

Installation

TinyQuant is published on PyPI as tinyquant-cpu and imports as tinyquant_cpu. The current release is a Rust-backed fat wheel — no pure-Python fallback.

I want to...	Install command
Python (Rust-backed, current)	`pip install tinyquant-cpu`
Python + PostgreSQL/pgvector support	`pip install "tinyquant-cpu[pgvector]"`
Rust native crate	`cargo add tinyquant-core`
TypeScript / Node / Bun	`npm install tinyquant`
Work on this repository	see the Development section below

[!TIP] The [pgvector] extra pulls in psycopg[binary]>=3.1 for talking to a live PostgreSQL database. Python 3.12+ is required; the Rust workspace MSRV is 1.81, with the optional tinyquant-gpu-wgpu crate carved out at 1.87 in its own CI lane.

Language bindings

TinyQuant ships the same codec / corpus / backend surface across three languages, versioned in lockstep via rust/Cargo.toml workspace.package.version. All bindings delegate math to the shared tinyquant-core Rust crate — there is no per-language reimplementation.

Language	Package	Install	Since
Python	`tinyquant-cpu`	`pip install tinyquant-cpu`	Phase 24
Rust	`tinyquant-core`	`cargo add tinyquant-core`	Phase 22
TypeScript	`tinyquant`	`npm install tinyquant`	Phase 25

All three packages guarantee byte-identical output on config_hash, Codebook::to_bytes, and CompressedVector::to_bytes. See COMPATIBILITY.md for the supported cross-package version pairs.

Quickstart

import numpy as np
from tinyquant_cpu.codec import Codec, CodecConfig
from tinyquant_cpu.corpus import Corpus, CompressionPolicy
from tinyquant_cpu.backend import BruteForceBackend

# 1. Configure the codec: 4-bit quantization for 1536-dim vectors
config = CodecConfig(bit_width=4, dimension=1536, seed=42)
codec = Codec()

# 2. Train a codebook from representative vectors
training_vectors = np.random.default_rng(0).standard_normal((1000, 1536)).astype(np.float32)
codebook = codec.build_codebook(training_vectors, config)

# 3. Create a corpus that compresses on insert
corpus = Corpus("my-vectors", config, codebook, CompressionPolicy.COMPRESS)
for i, vec in enumerate(training_vectors):
    corpus.insert(f"vec-{i}", vec)

# 4. Decompress and search
backend = BruteForceBackend()
backend.ingest(corpus.decompress_all())
results = backend.search(training_vectors[42], top_k=5)
for r in results:
    print(f"{r.vector_id}: {r.score:.4f}")

What just happened?

Configure — CodecConfig(bit_width=4, dimension=1536, seed=42) sets the bit width (4 → 8× compression), the vector dimension, and the RNG seed that controls the random rotation matrix. The seed makes the codec deterministic — same inputs always produce byte-identical output across all language bindings.
Train — codec.build_codebook(training_vectors, config) fits a small codebook on a representative sample of your data.
Insert — Corpus(..., CompressionPolicy.COMPRESS) creates a domain aggregate that compresses every vector on insert and tracks vector IDs.
Decompress — corpus.decompress_all() produces (vector_id, fp32_vector) pairs. The Rust core runs these in parallel via Rayon.
Search — BruteForceBackend performs exact cosine search and returns SearchResult objects with IDs and scores. Swap for PgvectorAdapter in production, or use the GPU path for large corpora.
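As a rough illustration of what exact cosine search involves, the sketch below ranks corpus rows against a query with plain NumPy. It is not the BruteForceBackend implementation; `exact_cosine_search` is a hypothetical helper written for this example.

```python
import numpy as np


def exact_cosine_search(corpus: np.ndarray, query: np.ndarray, top_k: int = 5):
    """Return (row_index, score) pairs for the top_k rows most similar to query."""
    # Normalize rows and the query so a dot product equals cosine similarity
    corpus_unit = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    scores = corpus_unit @ query_unit
    # Sort by descending score and keep the best top_k
    best = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in best]


rng = np.random.default_rng(0)
vectors = rng.standard_normal((1000, 64)).astype(np.float32)
hits = exact_cosine_search(vectors, vectors[42], top_k=5)
print(hits[0])  # the query itself ranks first, with a score close to 1.0
```

Every query scans the whole corpus, which is why this approach suits small in-memory collections and why the text suggests other backends for larger ones.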

How it works

The problem. Naive scalar quantization loses much of the signal in real embedding data because coordinate distributions are skewed: a handful of dimensions carry most of the signal and get mapped to the same bucket as noise.

The trick. Pre-multiplying each vector by a random orthogonal matrix (derived via QR decomposition of a Gaussian matrix) uniformizes the coordinate distribution without changing pairwise distances. After rotation, a single shared scalar quantizer works well across all dimensions. This is the core insight from TurboQuant and PolarQuant.
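The rotation step can be sketched in a few lines of NumPy. This is a minimal illustration, not TinyQuant's code: `random_rotation` is a hypothetical helper, the sign normalization is a common convention added here rather than something the text specifies, and the "spiky" test vector is made up to show the effect.

```python
import numpy as np


def random_rotation(dimension: int, seed: int) -> np.ndarray:
    """Seeded random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    gaussian = rng.standard_normal((dimension, dimension))
    q, r = np.linalg.qr(gaussian)
    # Flip column signs so the diagonal of R is positive (assumed convention for this sketch)
    return q * np.sign(np.diag(r))


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


dim = 256
rotation = random_rotation(dim, seed=42)

# A "spiky" vector: almost all of the energy sits in one coordinate
x = np.full(dim, 0.01)
x[0] = 10.0
y = np.ones(dim)

rx, ry = rotation @ x, rotation @ y

print(f"cosine before: {cosine(x, y):.6f}, after: {cosine(rx, ry):.6f}")
print(f"largest |coordinate| before: {np.abs(x).max():.2f}, after: {np.abs(rx).max():.2f}")
```

The cosine is unchanged because an orthogonal matrix preserves inner products and norms, while the dominant coordinate is spread across all dimensions. Reusing the same seed reproduces the same matrix, which is what makes a seeded rotation deterministic.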

Two-stage refinement. An optional FP16 residual on top of the 4-bit coarse codebook gives you a separate point on the rate-distortion curve: 8× compression and ρ ≈ 0.998 without the residual; 1.6× compression and ρ = 1.000 with it enabled — useful for reranking stages.
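The idea of a coarse code plus a residual can be illustrated with a toy example. The sketch below uses a fixed uniform quantizer over an assumed range of [-3, 3] as a stand-in for the trained codebook, so the numbers it prints are not TinyQuant's; the function names are hypothetical.

```python
import numpy as np


def quantize_uniform(x: np.ndarray, bit_width: int, lo: float, hi: float) -> np.ndarray:
    """Map each coordinate to the nearest of 2**bit_width evenly spaced levels in [lo, hi]."""
    levels = 2**bit_width - 1
    scaled = (np.clip(x, lo, hi) - lo) / (hi - lo) * levels
    return np.rint(scaled).astype(np.uint8)


def dequantize_uniform(codes: np.ndarray, bit_width: int, lo: float, hi: float) -> np.ndarray:
    """Turn level indices back into approximate coordinate values."""
    levels = 2**bit_width - 1
    return codes.astype(np.float32) / levels * (hi - lo) + lo


rng = np.random.default_rng(0)
x = rng.standard_normal(1536).astype(np.float32)
lo, hi = -3.0, 3.0  # assumed clipping range for this sketch

# Stage 1: coarse 4-bit codes
codes = quantize_uniform(x, 4, lo, hi)
coarse = dequantize_uniform(codes, 4, lo, hi)

# Stage 2: keep what the coarse stage missed, stored as FP16
residual = (x - coarse).astype(np.float16)
refined = coarse + residual.astype(np.float32)

print(f"max error, coarse only:   {np.abs(x - coarse).max():.4f}")
print(f"max error, with residual: {np.abs(x - refined).max():.6f}")
```

The residual costs two extra bytes per coordinate, which is the trade the text describes: much lower reconstruction error in exchange for a lower compression ratio.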

Rust core with CPU and GPU paths. The codec runs through tinyquant-core, which dispatches SIMD kernels at runtime (AVX2+FMA on x86_64, NEON on aarch64) and parallelizes batch compression with Rayon. For workloads exceeding the 512-vector threshold, the optional tinyquant-gpu-wgpu crate offloads rotate/quantize/dequantize/residual and corpus cosine search to WGSL compute shaders via wgpu, with lazy pipeline caching to avoid per-call recompilation.

Backend-agnostic. The codec produces CompressedVector bytes; search lives in a separate SearchBackend layer (BruteForceBackend for in-memory exact search, PgvectorAdapter for PostgreSQL + pgvector, WgpuBackend for GPU-accelerated corpus search), so you can plug TinyQuant into any retrieval store without coupling storage to search.

Recipes

Pick the config that matches your rate-distortion target:

Config	Bytes/vec	Compression	ρ	Top-5	When to use
`CodecConfig(bit_width=4)`	768	8×	0.998	95%	Default balance
`CodecConfig(bit_width=2)`	384	16×	0.964	85%	Aggressive, needs rerank
`CodecConfig(bit_width=4, residual_enabled=True)`	3,840	1.6×	1.000	100%	Reranking / exact-match

Single-vector compression

import numpy as np
from tinyquant_cpu.codec import Codec, CodecConfig

config = CodecConfig(bit_width=4, dimension=768, seed=42)
codec = Codec()

training_data = np.random.default_rng(0).standard_normal((1000, 768)).astype(np.float32)
codebook = codec.build_codebook(training_data, config)

vector = training_data[0]
compressed = codec.compress(vector, config, codebook)
print(f"Original:   {vector.nbytes} bytes")
print(f"Compressed: {compressed.size_bytes} bytes")
print(f"Ratio:      {vector.nbytes / compressed.size_bytes:.1f}x")

restored = codec.decompress(compressed, config, codebook)

Batch compression (Rayon-parallel)

# Parallelized via Rayon in the Rust core — byte-identical to serial output
vectors = np.random.default_rng(0).standard_normal((10_000, 768)).astype(np.float32)
compressed_batch = codec.compress_batch(vectors, config, codebook)
restored_batch = codec.decompress_batch(compressed_batch, config, codebook)
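Parallel batch output can match serial output when each vector is compressed independently and results are collected in input order. The sketch below shows that property with Python threads and a toy per-vector codec; it says nothing about how the Rayon implementation is written, and `compress_one` is a hypothetical stand-in.

```python
from concurrent.futures import ThreadPoolExecutor


def compress_one(vector: list[float]) -> bytes:
    """Toy codec: clamp to [-1, 1] and store each coordinate as a 4-bit level, one per byte."""
    return bytes(round((min(max(v, -1.0), 1.0) + 1.0) / 2.0 * 15) for v in vector)


# Deterministic synthetic input: 1000 vectors of 8 coordinates in [-1, 1)
vectors = [[((i * 31 + j * 17) % 200) / 100.0 - 1.0 for j in range(8)] for i in range(1000)]

serial = [compress_one(v) for v in vectors]

# Executor.map yields results in input order, whichever worker finishes first
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(compress_one, vectors))

print(serial == parallel)
```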

Tuning the rate–distortion tradeoff

# Maximum compression: 16x at 2-bit
config_2bit = CodecConfig(bit_width=2, dimension=768, seed=42, residual_enabled=False)

# Practical sweet spot: 8x at 4-bit (rho ~ 0.998)
config_4bit = CodecConfig(bit_width=4, dimension=768, seed=42, residual_enabled=False)

# Near-perfect fidelity: 4-bit + FP16 residual correction (1.6x, rho = 1.000)
config_4bit_res = CodecConfig(bit_width=4, dimension=768, seed=42, residual_enabled=True)

[!WARNING] 2-bit compression drops top-5 recall to ~85%. Only use it when a reranking stage (FP16 residual, cross-encoder, exact search) sits downstream to recover the missing signal.

Compression policies

A Corpus can store vectors in three modes:

from tinyquant_cpu.corpus import Corpus, CompressionPolicy

corpus_compressed = Corpus("c", config, codebook, CompressionPolicy.COMPRESS)
corpus_full       = Corpus("p", config, codebook, CompressionPolicy.PASSTHROUGH)
corpus_fp16       = Corpus("h", config, codebook, CompressionPolicy.FP16)

Policies let one corpus mix hot data (PASSTHROUGH), cold data (COMPRESS), and middle-tier data (FP16) without rebuilding the codec.

Binary serialization (TQCV format)

CompressedVector instances serialize to the TQCV versioned binary format (70-byte header + LSB-first packed indices + optional FP16 residual), suitable for disk, network, or database storage. Mmap corpus files are available via the Rust tinyquant-io crate for zero-copy access.

from tinyquant_cpu.codec import CompressedVector

raw_bytes = compressed.to_bytes()
restored  = CompressedVector.from_bytes(raw_bytes)
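To make "LSB-first packed indices" concrete, here is one plausible reading for the 4-bit case: two indices per byte, with the earlier index in the low nibble. This is an assumption about the bit order, not a description of the TQCV layout, and it leaves the 70-byte header out entirely. Both function names are hypothetical.

```python
def pack_4bit_lsb_first(indices: list[int]) -> bytes:
    """Pack 4-bit indices two per byte; the earlier index takes the low nibble."""
    if any(not 0 <= i <= 15 for i in indices):
        raise ValueError("every index must fit in 4 bits")
    padded = indices + [0] * (len(indices) % 2)   # pad odd-length input with a zero index
    return bytes(lo | (hi << 4) for lo, hi in zip(padded[0::2], padded[1::2]))


def unpack_4bit_lsb_first(data: bytes, count: int) -> list[int]:
    """Inverse of pack_4bit_lsb_first; count drops any padding index."""
    out = []
    for byte in data:
        out.append(byte & 0x0F)   # low nibble was written first
        out.append(byte >> 4)     # high nibble was written second
    return out[:count]


indices = [1, 15, 7, 0, 9]
packed = pack_4bit_lsb_first(indices)
print(packed.hex())  # f10709
assert unpack_4bit_lsb_first(packed, len(indices)) == indices
```

With this packing, 1536 four-bit indices occupy 768 bytes, matching the per-vector size quoted for 4-bit compression.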

PostgreSQL + pgvector backend

import psycopg
from tinyquant_cpu.backend.adapters.pgvector import PgvectorAdapter

adapter = PgvectorAdapter(
    connection_factory=lambda: psycopg.connect("postgresql://user:pass@localhost/mydb"),
    table_name="embeddings",
)
# Reuses `corpus` and `training_vectors` from the Quickstart
adapter.ingest(corpus.decompress_all())
query_vector = training_vectors[42]
results = adapter.search(query_vector, top_k=10)

[!IMPORTANT] Requires PostgreSQL with the pgvector extension installed. CI runs these tests against a live pgvector/pgvector:pg17 container via testcontainers.

GPU acceleration (Rust only — wgpu)

The tinyquant-gpu-wgpu crate provides a WgpuBackend that offloads batch compress/decompress and corpus cosine search to WGSL compute shaders. It is workspace-internal (publish = false) and selected automatically when a batch exceeds GPU_BATCH_THRESHOLD (512 vectors).

use tinyquant_gpu_wgpu::{WgpuBackend, BackendPreference};

// Default adapter (auto-select highest-performance GPU)
let backend = WgpuBackend::new().await?;

// Or select a specific backend:
let backend = WgpuBackend::new_with_preference(BackendPreference::Vulkan).await?;

// Warm up pipeline cache explicitly (optional — lazy otherwise)
backend.load_pipelines().await;

// GPU corpus search
let state = backend.prepare_corpus_for_device(&corpus_vecs).await?;
let results = backend.cosine_topk(&state, &query_vec, top_k).await?;

Available BackendPreference variants: Auto, Vulkan, Metal, Dx12, HighPerformance, LowPower, Software.

Key properties

8× compression at 4-bit without residuals (ρ = 0.998, 95% recall)
16× compression at 2-bit (ρ = 0.964, 85% recall)
Near-perfect fidelity with optional FP16 residual correction (ρ = 1.000)
Deterministic — same inputs produce byte-identical output across all language bindings and CPU architectures
Rust-native core — tinyquant-core; CPU SIMD dispatch (AVX2+FMA / NEON) via is_x86_feature_detected! / ARMv8 base-ISA guarantee; Rayon parallel batch with determinism contract
Optional GPU acceleration — tinyquant-gpu-wgpu; WGSL rotate/quantize/dequantize/residual and cosine-topk kernels; lazy CachedPipelines; BackendPreference adapter selection; auto-routes at ≥ 512 vectors
Multi-language — Python fat wheel (tinyquant-cpu), TypeScript/Node (tinyquant), Rust native (tinyquant-core), C ABI (tinyquant-sys)
Pluggable backends — BruteForceBackend for in-process exact search; PgvectorAdapter for PostgreSQL + pgvector; WgpuBackend for GPU corpus search
Three compression policies — COMPRESS, PASSTHROUGH, FP16, mixable within a corpus
TQCV serialization — versioned 70-byte header + LSB-first bit-pack + optional FP16 residual; mmap corpus files via tinyquant-io
Calibration gates — Pearson ρ and mean recall-at-k measured against OpenAI calibration fixtures; Criterion benchmarks with 10% regression budget
Fully typed — py.typed marker, mypy --strict clean, TypeScript strict mode
Apache-2.0 licensed

Research lineage

TinyQuant adapts ideas from published research into a clean-room implementation:

Source	Year	Key contribution
TurboQuant	2025	Random rotation + scalar quantization, no per-block norms
PolarQuant	2025	QR-derived orthogonal preconditioning for coordinate uniformity
QJL	2024	Inner-product preservation bounds under aggressive quantization

Repository layout

Path	Purpose
`rust/crates/tinyquant-core/`	Codec, corpus, backend trait, SIMD dispatch, Rayon parallel batch
`rust/crates/tinyquant-io/`	TQCV serialization format and mmap corpus files
`rust/crates/tinyquant-gpu-wgpu/`	Optional wgpu/WGSL GPU accelerator (`publish = false`, workspace-internal)
`rust/crates/tinyquant-py/`	pyo3 Python extension — the engine behind `tinyquant-cpu`
`rust/crates/tinyquant-sys/`	C ABI via cbindgen
`rust/crates/tinyquant-cli/`	Standalone CLI binary
`rust/crates/tinyquant-js/`	napi-rs TypeScript/Node bindings (`tinyquant`)
`rust/crates/tinyquant-bruteforce/`	`BruteForceBackend` reference implementation
`rust/crates/tinyquant-pgvector/`	PostgreSQL + pgvector ACL adapter
`rust/crates/tinyquant-bench/`	Criterion benchmarks + calibration quality gates
`tests/reference/tinyquant_py_reference/`	Pure-Python frozen oracle — differential test reference (not shipped)
`tests/parity/`	Cross-implementation parity suite (`pytest -m parity`)
`tests/`	Python unit, integration, E2E, architecture, and calibration suites
`experiments/`	Benchmarks and empirical evaluations
`docs/`	Obsidian wiki: design docs, research, SDLC plans, CI/CD specs

Development

git clone https://github.com/better-with-models/TinyQuant.git
cd TinyQuant

# Python dev dependencies
pip install pytest pytest-cov hypothesis numpy ruff mypy build

# Lint and format
ruff check . && ruff format --check .

# Strict type check
mypy --strict .

# Run the full Python suite
pytest --cov=tinyquant_py_reference

# Cross-impl parity (Python ↔ Rust)
pytest -m parity -v

# Rust: lint and test
cd rust
cargo clippy --workspace -- -D warnings
cargo test --workspace

The Python test suite includes 289 tests covering unit, integration, end-to-end, calibration, parity (cross-impl Python ↔ Rust), and architecture-enforcement scenarios. Coverage is held above 90% by CI (94% for the codec subpackage). Live PostgreSQL + pgvector tests run against a Docker container in CI via testcontainers.

[!TIP] CI enforces three strict gates: ruff check / ruff format --check, mypy --strict, and markdownlint-cli2 for all markdown outside docs/. The docs/ vault uses Obsidian-flavored markdown under its own rules; see AGENTS.md for the policy.

Reproducing the benchmark

export OPENAI_API_KEY="your-key-here"
python experiments/quantization-benchmark/generate_embeddings.py
python experiments/quantization-benchmark/run_benchmark.py
python experiments/quantization-benchmark/generate_plots.py

This fetches 335 embeddings via the OpenAI API, benchmarks 9 quantization methods, and produces plots and JSON results in experiments/quantization-benchmark/results/.
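For readers who want to sanity-check the two headline metrics on their own data, the sketch below shows one way to compute them with NumPy. It assumes Pearson ρ is taken over pairwise cosine similarities and that top-5 recall is the mean overlap of each vector's five nearest neighbours before and after compression; the benchmark scripts may define them differently. The noisy `restored` array is a made-up stand-in for a lossy round trip.

```python
import numpy as np


def pairwise_cosine(vectors: np.ndarray) -> np.ndarray:
    """Cosine similarity between every pair of rows."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T


def pearson_rho(original: np.ndarray, restored: np.ndarray) -> float:
    """Pearson correlation of the off-diagonal pairwise similarities."""
    upper = np.triu_indices(len(original), k=1)   # each unordered pair once, no self-pairs
    a = pairwise_cosine(original)[upper]
    b = pairwise_cosine(restored)[upper]
    return float(np.corrcoef(a, b)[0, 1])


def topk_recall(original: np.ndarray, restored: np.ndarray, k: int = 5) -> float:
    """Mean overlap between each row's top-k neighbours before and after compression."""
    sims_a = pairwise_cosine(original)
    sims_b = pairwise_cosine(restored)
    np.fill_diagonal(sims_a, -np.inf)             # a vector is not its own neighbour
    np.fill_diagonal(sims_b, -np.inf)
    top_a = np.argsort(-sims_a, axis=1)[:, :k]
    top_b = np.argsort(-sims_b, axis=1)[:, :k]
    overlap = [len(set(ra) & set(rb)) for ra, rb in zip(top_a, top_b)]
    return float(np.mean(overlap)) / k


rng = np.random.default_rng(0)
original = rng.standard_normal((200, 64))
restored = original + 0.05 * rng.standard_normal(original.shape)  # stand-in for a lossy round trip

print(f"rho = {pearson_rho(original, restored):.4f}")
print(f"top-5 recall = {topk_recall(original, restored):.2%}")
```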

Contributing

Contributions are welcome. The short version:

Issues and design discussions — open a GitHub issue before starting non-trivial work so we can agree on scope.
Follow the repo SDLC — architecture decisions, coding standards, and pre-commit expectations live in AGENTS.md and the docs/design/ vault. Read CLAUDE.md if you're driving Claude Code or another LLM agent against this repo.
Run the full gate locally before pushing: ruff check . && ruff format --check . && mypy --strict . && pytest --cov=tinyquant_cpu
Keep prose aligned — edits to the project tagline, elevator pitch, or headline benchmark numbers must land in README.md, .github/README.md, AGENTS.md, and CLAUDE.md in the same commit.

License

Apache-2.0. See LICENSE.
