feat: add GPU backends, quantization, and search optimizations #166

Open
cluster2600 wants to merge 6 commits into alibaba:main from cluster2600:feat/gpu-quantization-search-optimizations

cluster2600 commented Feb 24, 2026

Summary

  • Add C++ Metal GPU backend for vector operations on Apple Silicon
  • Add FAISS GPU/CPU backends with unified accelerate module
  • Add Product Quantization (PQ), Optimized PQ (OPQ), and Scalar Quantization
  • Add pure Python HNSW index with FAISS fallback
  • Add optimized search functions (ADC, batched search, reranking)
  • Add Apple Silicon MPS backend via PyTorch

Changes

C++ Metal Backend

  • src/ailego/gpu/metal/zvec_metal.h — C API header
  • src/ailego/gpu/metal/zvec_metal.cc — Objective-C++ implementation
  • src/ailego/gpu/metal/zvec_metal.metal — Metal shaders (L2, IP, cosine, normalize, matmul, top-k)
  • src/ailego/gpu/metal/CMakeLists.txt — Metal compilation
  • tests/test_metal.cc — Google Test suite
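
For review purposes, the operations named for the Metal shaders (L2, inner product, cosine, normalize, top-k) can be cross-checked against a CPU reference. The following is a minimal NumPy sketch of what such kernels are usually expected to compute; it is not taken from this PR, and all function names are hypothetical.

```python
import numpy as np

def l2_squared(queries, database):
    # Squared L2 via the expansion ||q - x||^2 = ||q||^2 - 2 q.x + ||x||^2.
    q_norms = np.sum(queries ** 2, axis=1, keepdims=True)
    x_norms = np.sum(database ** 2, axis=1)
    dists = q_norms - 2.0 * (queries @ database.T) + x_norms
    return np.maximum(dists, 0.0)  # clamp tiny negatives from rounding

def inner_product(queries, database):
    return queries @ database.T

def normalize(vectors, eps=1e-12):
    # Row-wise L2 normalization; eps guards against zero vectors.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, eps)

def cosine_similarity(queries, database):
    return normalize(queries) @ normalize(database).T

def top_k_smallest(scores, k):
    # Partial selection of the k smallest per row, then sort only those k.
    idx = np.argpartition(scores, k - 1, axis=1)[:, :k]
    vals = np.take_along_axis(scores, idx, axis=1)
    order = np.argsort(vals, axis=1)
    return (np.take_along_axis(idx, order, axis=1),
            np.take_along_axis(vals, order, axis=1))
```

A GPU test could compare kernel output against these functions within a floating-point tolerance.
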

Python Backends

  • python/zvec/accelerate.py — Unified accelerator interface
  • python/zvec/backends/gpu.py — FAISS GPU backend
  • python/zvec/backends/detect.py — Hardware detection
  • python/zvec/backends/quantization.py — PQ encoder
  • python/zvec/backends/opq.py — OPQ encoder + Scalar Quantizer
  • python/zvec/backends/hnsw.py — Pure Python HNSW with FAISS fallback
  • python/zvec/backends/search.py — ADC, batch search, reranking
  • python/zvec/backends/apple_silicon.py — Apple Silicon MPS backend
  • python/zvec/backends/benchmark.py — Backend performance benchmarks
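
The PR does not show how hardware detection chooses a backend. One plausible shape for it, sketched under the assumption that it probes optional packages at runtime, is below. The function name, the returned labels, and the priority order are all hypothetical; only `faiss.get_num_gpus()` and `torch.backends.mps.is_available()` are real library calls.

```python
import importlib.util

def detect_backend():
    """Return a label for the best available backend (labels are illustrative)."""
    has_faiss = importlib.util.find_spec("faiss") is not None
    has_torch = importlib.util.find_spec("torch") is not None

    if has_faiss:
        try:
            import faiss
            # CPU-only FAISS builds report zero GPUs.
            get_num_gpus = getattr(faiss, "get_num_gpus", None)
            if get_num_gpus is not None and get_num_gpus() > 0:
                return "faiss-gpu"
        except ImportError:
            has_faiss = False

    if has_torch:
        try:
            import torch
            mps = getattr(torch.backends, "mps", None)
            if mps is not None and mps.is_available():
                return "mps"
        except ImportError:
            pass

    # Fall back to CPU FAISS if importable, else plain NumPy.
    return "faiss-cpu" if has_faiss else "numpy"
```

Probing with `find_spec` before importing keeps the optional dependencies optional: the function still returns a usable label on a machine with neither package installed.
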

Configuration

  • pyproject.toml: accelerate/gpu optional dependencies, per-file-ignores for backends

Docs

  • docs/METAL_CPP.md — Metal backend documentation

Context

Split from #157. Aligns with cluster2600#2 content.

Test plan

  • ruff lint and format pass
  • clang-format passes on all C++ and Metal files
  • CI builds succeed on all platforms
  • Metal tests pass on macOS (skip on Linux)

cluster2600 and others added 4 commits February 24, 2026 17:35
Add Metal Shading Language kernels for GPU-accelerated vector operations
on Apple Silicon, including L2 distance, inner product, cosine similarity,
vector normalization, matrix multiplication, and top-k selection.

Includes C API wrapper, CMakeLists.txt for Metal compilation, and
comprehensive Google Test suite.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add unified acceleration module supporting FAISS CPU and GPU backends
with automatic hardware detection. Includes backend benchmark suite
for performance comparison and realistic dataset benchmarks.

New files:
- python/zvec/accelerate.py: Unified accelerator interface
- python/zvec/backends/gpu.py: FAISS GPU backend
- python/zvec/backends/detect.py: Hardware detection
- python/zvec/backends/benchmark.py: Performance benchmarks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add Product Quantization (PQ) encoder, Optimized Product Quantization
(OPQ) with rotation learning, and Scalar Quantization (8/16-bit) for
efficient vector compression and approximate nearest neighbor search.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

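
To illustrate the 8-bit scalar quantization mentioned in the commit above, here is a minimal NumPy sketch. It assumes a per-dimension min/max range mapped uniformly onto 256 levels, which is one common scheme; the PR's actual quantizer may differ, and the function names are hypothetical.

```python
import numpy as np

def sq_train(vectors):
    # Learn a per-dimension offset and step size from the training data.
    lo = vectors.min(axis=0)
    hi = vectors.max(axis=0)
    # Constant dimensions get step 1.0 to avoid division by zero.
    scale = np.where(hi > lo, (hi - lo) / 255.0, 1.0)
    return lo, scale

def sq_encode(vectors, lo, scale):
    # Round to the nearest of 256 levels and store one byte per dimension.
    codes = np.rint((vectors - lo) / scale)
    return np.clip(codes, 0, 255).astype(np.uint8)

def sq_decode(codes, lo, scale):
    # Map each byte back to the value of its quantization level.
    return codes.astype(np.float32) * scale + lo
```

For values inside the trained range, the reconstruction error per dimension is bounded by half a step (`scale / 2`), and storage drops from 4 bytes to 1 byte per float32 component.
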
Add pure Python HNSW index with FAISS fallback, optimized search
functions (ADC, batched search, reranking), and Apple Silicon MPS
backend using PyTorch for GPU-accelerated vector operations on macOS.

Update pyproject.toml with accelerate/gpu optional dependencies and
per-file-ignores for backends.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

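
The ADC (asymmetric distance computation) and reranking steps named in the commit above can be sketched as follows. This is a generic NumPy illustration, not the code in `search.py`: the codebook layout `(m, k, dsub)` and all function names are assumptions.

```python
import numpy as np

def adc_table(query, codebooks):
    # codebooks: (m, k, dsub). Returns an (m, k) table of squared distances
    # from each query sub-vector to every centroid of that sub-quantizer.
    m, k, dsub = codebooks.shape
    sub_queries = query.reshape(m, 1, dsub)
    return np.sum((codebooks - sub_queries) ** 2, axis=2)

def adc_distances(table, codes):
    # codes: (n, m) integers. Each distance is a sum of m table lookups.
    m = table.shape[0]
    return table[np.arange(m), codes].sum(axis=1)

def decode(codes, codebooks):
    # Rebuild approximate vectors by concatenating the selected centroids.
    m = codebooks.shape[0]
    return codebooks[np.arange(m), codes].reshape(codes.shape[0], -1)

def search_with_rerank(query, codes, codebooks, raw, k, n_candidates):
    # Stage 1: cheap ADC ranking over all codes.
    approx = adc_distances(adc_table(query, codebooks), codes)
    candidates = np.argsort(approx)[:n_candidates]
    # Stage 2: exact distances on the shortlisted raw vectors only.
    exact = np.sum((raw[candidates] - query) ** 2, axis=1)
    return candidates[np.argsort(exact)[:k]]
```

The ADC result equals the exact squared distance between the query and the decoded vector, so the only approximation comes from quantization; reranking with raw vectors recovers accuracy on the shortlist.
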
cluster2600 added a commit to cluster2600/zvec that referenced this pull request Feb 25, 2026
Add header-only C++ implementations of Product Quantization (PQ) and
Optimized Product Quantization (OPQ), plus upgrade the Python OPQ
rotation from QR decomposition to SVD-based Orthogonal Procrustes.

C++ Product Quantizer (product_quantizer.h):
- k-means training with configurable m sub-quantizers and k centroids
- encode/decode with distortion measurement
- Header-only, depends only on <algorithm>, <cmath>, <vector>
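
The train/encode/decode flow described for the product quantizer can be sketched in NumPy as below. This is an illustrative stand-in for the C++ header, not a port of it: it assumes plain Lloyd's k-means per sub-space, at most 256 centroids so codes fit in one byte, and hypothetical function names.

```python
import numpy as np

def assign(data, centroids):
    # Index of the nearest centroid for every row of data.
    d = np.sum((data[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
    return np.argmin(d, axis=1)

def kmeans(data, k, iters=20, seed=0):
    # Lloyd's algorithm, initialized from k distinct data points.
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = assign(data, centroids)
        for c in range(k):
            members = data[labels == c]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centroids[c] = members.mean(axis=0)
    return centroids

def pq_train(vectors, m, k):
    # Split each vector into m sub-vectors; learn one codebook per sub-space.
    n, dim = vectors.shape
    assert dim % m == 0 and k <= 256
    subs = vectors.reshape(n, m, dim // m)
    return np.stack([kmeans(subs[:, j, :], k, seed=j) for j in range(m)])

def pq_encode(vectors, codebooks):
    m, k, dsub = codebooks.shape
    subs = vectors.reshape(len(vectors), m, dsub)
    codes = [assign(subs[:, j, :], codebooks[j]) for j in range(m)]
    return np.stack(codes, axis=1).astype(np.uint8)

def pq_decode(codes, codebooks):
    m = codebooks.shape[0]
    return codebooks[np.arange(m), codes].reshape(len(codes), -1)

def distortion(vectors, codebooks):
    # Mean squared reconstruction error per vector.
    recon = pq_decode(pq_encode(vectors, codebooks), codebooks)
    return float(np.mean(np.sum((vectors - recon) ** 2, axis=1)))
```

With `m` sub-quantizers and `k` centroids each, every vector is stored as `m` bytes, and `distortion` gives the quantity that OPQ later tries to reduce.
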

C++ OPQ (opq.h):
- SVD-based Procrustes rotation: R = V * U^T from SVD(X^T * Y)
- Self-contained Jacobi one-sided SVD solver (no LAPACK dependency)
- Iterative refinement of rotation + PQ codebooks

Python OPQ (_learn_rotation):
- Replace simplified QR decomposition with SVD Procrustes
- M = X^T @ decoded, U, _, Vt = svd(M), R = Vt.T @ U.T
- Produces orthogonal rotations (error ~4e-6)
- Benchmarked: ~1-10% reconstruction improvement over plain PQ
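
The Procrustes step quoted above can be written out as a short NumPy sketch. It assumes the rotation is applied to row vectors as `x @ R.T`, the convention under which `R = Vt.T @ U.T` minimizes the Frobenius error `||x @ R.T - decoded||`; the function name mirrors `_learn_rotation` but the body is illustrative only.

```python
import numpy as np

def learn_rotation(x, decoded):
    # Orthogonal Procrustes: find orthogonal R minimizing ||x @ R.T - decoded||_F.
    # x: (n, d) original vectors; decoded: (n, d) PQ reconstructions.
    m = x.T @ decoded            # (d, d) cross-covariance
    u, _, vt = np.linalg.svd(m)  # m = u @ diag(s) @ vt
    return vt.T @ u.T            # R = V U^T, orthogonal by construction
```

Because `R` is a product of two orthogonal matrices, it is orthogonal up to floating-point error regardless of the input. In an OPQ loop this step alternates with retraining the PQ codebooks on the rotated data.
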

Follow-up to alibaba#166 ("Future Work: sophisticated OPQ optimization").

Tested on:
- macOS: clang++ C++17 compilation + runtime tests
- Linux (Blackwell GPU): Python OPQ + cuVS CAGRA integration

Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>

cluster2600 added a commit to cluster2600/zvec that referenced this pull request Feb 25, 2026
Add persistent vector storage backed by RocksDB for GPU pipeline
integration, plus documentation for the Metal C++ backend.

VectorStorage (vector_storage.h):
- RocksDB column families: "vectors", "pq_codes", "metadata"
- Batch put/get for raw vectors and PQ codes
- load_all() streams vectors into contiguous GPU-ready float buffer
- Integrates with existing RocksdbContext wrapper

Documentation (docs/METAL_CPP.md):
- Architecture overview: RocksDB → load_all() → Metal GPU Buffers
- Complete kernel reference table (distance, utility kernels)
- Simdgroup optimization dispatch model
- C++ PQ/OPQ API examples
- RocksDB storage API examples

Follow-up to alibaba#166 ("Future Work: Integration with RocksDB storage").

Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>