Commits:

- Add approximate nearest neighbor infrastructure to vec0: shared distance dispatch (`vec0_distance_full`), flat index type with parser, NEON-optimized cosine/Hamming for float32/int8, amalgamation script, and benchmark suite (`benchmarks-ann/`) with ground-truth generation and profiling tools. Remove unused `vec_npy_each`/`vec_static_blobs` code, fix missing `stdint.h` include.
- Add rescore index type: stores full-precision float vectors in a rowid-keyed shadow table, quantizes to int8 for fast initial scan, then rescores top candidates with original vectors. Includes config parser, shadow table management, insert/delete support, KNN integration, compile flag (`SQLITE_VEC_ENABLE_RESCORE`), fuzz targets, and tests.
This PR adds the first ANN index to `sqlite-vec`, called "rescore". A rescore index quantizes input vectors into int8/bit vectors, performs an oversampled KNN search on that smaller vector space, then re-ranks or "rescores" those distances with the full-sized vectors.
The `INDEXED BY rescore()` clause is the new syntax that signifies an index on the `headline_embedding` column. In the KNN search shown above, `sqlite-vec` will first perform a coarse KNN search on the binary quantized vectors of the `headline_embedding` column. It "oversamples" on `k * oversample`, meaning it will find the top `8 * 10` closest vectors based on hamming distance of the bit-quantized vectors in the index. Then, it will rescore those 80 vectors based on their full-sized float embeddings, and return the top 10 values.
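The oversample-then-rescore pipeline described above can be sketched in plain Python. This is a toy illustration with made-up names like `rescore_knn`, not sqlite-vec's actual C implementation:

```python
import math

def bit_quantize(vec):
    # 1 bit per dimension: positive components -> 1, everything else -> 0
    return [1 if x > 0 else 0 for x in vec]

def hamming(a, b):
    # number of differing bits between two bit vectors
    return sum(x != y for x, y in zip(a, b))

def l2(a, b):
    # exact euclidean distance on the full-precision vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rescore_knn(query, vectors, k, oversample):
    qbits = bit_quantize(query)
    # coarse pass: rank everything by hamming distance on the bit vectors,
    # keeping the top k * oversample candidates
    coarse = sorted(range(len(vectors)),
                    key=lambda i: hamming(qbits, bit_quantize(vectors[i])))
    candidates = coarse[: k * oversample]
    # rescore pass: re-rank only those candidates with exact float distances
    rescored = sorted(candidates, key=lambda i: l2(query, vectors[i]))
    return rescored[:k]
```

A real index would precompute and store the bit vectors rather than quantizing per query, but the two-pass shape is the same.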
Well-versed `sqlite-vec` users may notice that this is essentially the same thing as the old `sqlite-vec` Binary Quantization re-scoring method that the doc site recommends.
Besides being way less verbose, this new `rescore` index is slightly faster than the old manual method. Full-size vectors are stored in a key-value table instead of `vec0`'s chunked storage, meaning the re-scoring stage is a bunch of b-tree lookups instead of overflow-page nightmares. Granted, it's not a huge difference, maybe 10-15% faster reads, but it's something!
Benchmarks
First off, benchmarks are a notoriously difficult thing to get right:

- Some datasets and embedding models handle quantization better than others.
- Some models are specifically trained on quantizations, others are not.
- Some hardware may make the quantized lookup faster, others may not.
So take these benchmarks with a grain of salt, try it on your own data.
But for my use-case on my hardware (MacBook Pro M4), on a semantically diverse dataset of 1 million New York Times headlines embedded with `mixedbread-ai/mxbai-embed-large-v1`, I get:
| Configuration | Recall@10 | Relative speed |
|---|---|---|
| `vec0` (baseline full scan) | 1.0 | 1x |
| `rescore` int8, oversample=2 | 1.0 | >2x faster |
| `rescore` bit, oversample=8 | 0.988 | ~6x faster |
| `rescore` bit, oversample=4 | 0.962 | >6x faster |

The int8 rescore performs extremely well at preserving precision, with perfect recall at `K=10` queries while being more than twice as fast. Again, the `mixedbread-ai/mxbai-embed-large-v1` model was specifically trained on binary and int8 quantization, so your mileage may vary.
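For intuition on what "quantizes to int8" means, here is a minimal scalar-quantization sketch. The actual scheme in sqlite-vec may differ (for example, it may compute per-vector or per-dataset ranges instead of a fixed `[-1, 1]`):

```python
def int8_quantize(vec, lo=-1.0, hi=1.0):
    # Map the range [lo, hi] linearly onto [-128, 127],
    # clamping components that fall outside the range.
    out = []
    for x in vec:
        scaled = (x - lo) / (hi - lo) * 255.0 - 128.0
        out.append(int(max(-128, min(127, round(scaled)))))
    return out
```

Distances on these int8 vectors are cheap to compute (and SIMD-friendly, per the NEON work in this PR), which is what makes the coarse pass fast.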
The binary rescore index performs very well too: 0.988 recall at `oversample=8`, and a respectable 0.962 recall at the lower `oversample=4`. The speed benefits of hamming distance really show here: nearly 6 times faster for `os=8`, and more than 6 times faster for `os=4`!

You can choose a different oversample value at query time, so you can tune that for faster queries at the expense of recall.
Do note that both int8 and bit quantized rescores take up more disk space, since the index holds both the full-size vectors and the quantized vectors. The int8 vectors take `D` bytes of space per vector, while bit vectors take `D / 8` bytes of space.
Drawbacks
This rescore index is technically still a brute-force search, but instead of doing full scans on full-size floating point vectors, it brute-forces much smaller binary or int8 vectors. So as your vector database grows, query time with the `rescore` index will also grow, roughly linearly.
Also, recall will nearly entirely depend on your embedding model and dataset. If your embedding model was specifically trained on binary or integer quantization, I suspect you'll have a great time using `rescore`. If not, YMMV, but I've seen recall drop down to like 70-80%. Additionally, if your data isn't very semantically diverse, then quantization may muddle the results.
But in general, if your data and embedding model handle quantization well and you can take a ~10% accuracy hit, I've found the `rescore` index to be fantastic!