Problem / Motivation
Zvec's current indexes rely mainly on in-memory structures to achieve low-latency nearest neighbor search. This works well for moderate-sized datasets that fit entirely in RAM, but an in-memory index becomes impractical as collections grow beyond what memory can hold.
Moreover, many real-world use cases involve infrequently accessed long-tail vectors where keeping all data in memory is wasteful. A disk-based indexing solution would enable cost-effective scaling by leveraging disk storage while maintaining acceptable query latency.
Proposed Solution
A disk-based index will be introduced into Zvec with the following key components:
1. On-Disk Vector Storage:
Raw vector data (in FP32 or FP16 format) will be stored persistently on disk. Only compressed representations (e.g., quantized centroids, graph links, or PQ codes) and metadata will be kept in memory. During search, relevant raw vectors are fetched from disk only when needed for final distance re-ranking.
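The refinement step described above could look roughly like the following. This is only a sketch under stated assumptions, not Zvec's actual design: the flat file layout (vector `i` at byte offset `i * dim * sizeof(float)`), the function names `ReadRawVector` and `RerankFromDisk`, and the use of squared L2 as the exact distance are all hypothetical. The candidate ids are assumed to come from the in-memory compressed stage.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: read vector `id` from a flat file of FP32 vectors.
// Assumes vector i starts at byte offset i * dim * sizeof(float).
inline bool ReadRawVector(std::ifstream& file, std::uint64_t id,
                          std::size_t dim, std::vector<float>& out) {
  assert(dim > 0);
  out.resize(dim);
  file.clear();  // reset any error state left by a previous failed read
  file.seekg(static_cast<std::streamoff>(id * dim * sizeof(float)));
  file.read(reinterpret_cast<char*>(out.data()),
            static_cast<std::streamsize>(dim * sizeof(float)));
  return static_cast<bool>(file);
}

// Re-rank candidate ids (assumed to come from the in-memory compressed stage)
// by exact squared L2 distance, using raw vectors fetched from disk.
// Returns up to top_k (distance, id) pairs, closest first.
inline std::vector<std::pair<float, std::uint64_t>> RerankFromDisk(
    const std::string& path, const std::vector<float>& query,
    const std::vector<std::uint64_t>& candidates, std::size_t top_k) {
  std::ifstream file(path, std::ios::binary);
  std::vector<std::pair<float, std::uint64_t>> scored;
  std::vector<float> raw;
  for (std::uint64_t id : candidates) {
    if (!ReadRawVector(file, id, query.size(), raw)) continue;  // skip unreadable ids
    float dist = 0.0f;
    for (std::size_t i = 0; i < raw.size(); ++i) {
      const float d = raw[i] - query[i];
      dist += d * d;
    }
    scored.emplace_back(dist, id);
  }
  std::sort(scored.begin(), scored.end());
  if (scored.size() > top_k) scored.resize(top_k);
  return scored;
}
```

Only the candidates that survive the in-memory stage trigger a disk read, so the number of I/O operations per query is bounded by the candidate list size rather than the collection size.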
2. Support for Mainstream Similarity Metrics:
The on-disk index will natively support common similarity functions including:
2.1. Cosine similarity
2.2. Inner product (dot product)
2.3. Euclidean (L2) distance
Distance computations will be performed accurately using the original (uncompressed) vectors retrieved from disk during the refinement stage.
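For reference, the three metrics listed above can be written as plain scalar loops over uncompressed FP32 vectors. The function names below are hypothetical and the code is an illustrative sketch, not Zvec's kernels; a real implementation would most likely use vectorized routines. Squared L2 is used here because it ranks neighbors in the same order as true Euclidean distance while avoiding a square root.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Inner (dot) product of two equal-length vectors.
inline float InnerProduct(const std::vector<float>& a,
                          const std::vector<float>& b) {
  assert(a.size() == b.size());
  float sum = 0.0f;
  for (std::size_t i = 0; i < a.size(); ++i) sum += a[i] * b[i];
  return sum;
}

// Squared Euclidean distance; same ordering as L2 distance.
inline float L2Squared(const std::vector<float>& a,
                       const std::vector<float>& b) {
  assert(a.size() == b.size());
  float sum = 0.0f;
  for (std::size_t i = 0; i < a.size(); ++i) {
    const float d = a[i] - b[i];
    sum += d * d;
  }
  return sum;
}

// Cosine similarity; returns 0 when either vector has zero norm.
inline float CosineSimilarity(const std::vector<float>& a,
                              const std::vector<float>& b) {
  const float norm =
      std::sqrt(InnerProduct(a, a)) * std::sqrt(InnerProduct(b, b));
  return norm > 0.0f ? InnerProduct(a, b) / norm : 0.0f;
}
```

If vectors are normalized to unit length at insertion time, cosine similarity reduces to the inner product, which is a common way to serve both metrics with one kernel.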
3. FP32 and FP16 Data Type Support:
Users can store vectors on disk in either 32-bit or 16-bit floating-point format. The system will handle type conversion and alignment transparently, so FP16 storage, which uses half the bytes per vector of FP32, reduces storage and I/O cost without sacrificing compatibility.
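As an illustration of the read-side type conversion, the sketch below decodes IEEE 754 binary16 values, stored as raw `uint16_t` bit patterns, into FP32 before distance computation. The names `HalfToFloat` and `DecodeHalfVector` are hypothetical, and the storage representation is an assumption; a production path would more likely rely on hardware conversion instructions or a compiler-provided half type where available.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Decode one IEEE 754 binary16 bit pattern into a float.
// Layout: 1 sign bit, 5 exponent bits (bias 15), 10 mantissa bits.
inline float HalfToFloat(std::uint16_t h) {
  const bool negative = (h & 0x8000u) != 0;
  const int exponent = (h >> 10) & 0x1F;
  const int mantissa = h & 0x3FF;
  float value;
  if (exponent == 0) {
    // Zero or subnormal: mantissa * 2^-24.
    value = std::ldexp(static_cast<float>(mantissa), -24);
  } else if (exponent == 31) {
    // Infinity (mantissa == 0) or NaN.
    value = mantissa == 0 ? std::numeric_limits<float>::infinity()
                          : std::numeric_limits<float>::quiet_NaN();
  } else {
    // Normal: (1024 + mantissa) * 2^(exponent - 15 - 10).
    value = std::ldexp(static_cast<float>(mantissa | 0x400), exponent - 25);
  }
  return negative ? -value : value;
}

// Decode a buffer of `dim` FP16 values (as read from disk) into FP32.
inline std::vector<float> DecodeHalfVector(const std::uint16_t* src,
                                           std::size_t dim) {
  assert(src != nullptr || dim == 0);
  std::vector<float> out(dim);
  for (std::size_t i = 0; i < dim; ++i) out[i] = HalfToFloat(src[i]);
  return out;
}
```

Every binary16 value is exactly representable in FP32, so decoding is lossless; precision is lost only once, when FP32 input is rounded to FP16 at write time.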
Alternatives Considered
No response
Affected Area
C++ Core (storage, indexing)