
What does the Lucene community think about dimensionality reduction for vectors, and should it be something the library does internally (at merge time perhaps)? #13403

Open
gautamworah96 opened this issue May 21, 2024 · 11 comments


@gautamworah96
Contributor

Description

I opened this issue as a discussion topic. With the advancements in int8- and int4-type vector storage, I believe Lucene takes the unquantized vectors as input, intelligently calculates the correct quantized values, and then indexes them.

Another technique that experimenters use to improve vector search is to reduce the number of dimensions. In practical terms, this translates to using PCA (Principal Component Analysis) or other techniques.

Should Lucene implement support for PCA or other dimensionality reduction techniques (or add a hook maybe) internally? Or can we rely on the user preprocessing their vectors and supplying them?

I am undecided on whether a "search" and "information retrieval" library should add advanced statistics functionality (if I may call PCA that).
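
For concreteness, here is a rough sketch of the "user preprocesses their vectors" option, assuming Apache Commons Math for the eigendecomposition and Lucene's `KnnFloatVectorField` for indexing. The field name, similarity function, and projection setup are illustrative, not a proposal for a Lucene API:

```java
import org.apache.commons.math3.linear.EigenDecomposition;
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.stat.correlation.Covariance;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.VectorSimilarityFunction;

// Hypothetical user-side preprocessing: project vectors onto the top-k
// principal components before handing them to Lucene as a KnnFloatVectorField.
final class PcaPreprocessor {
  private final RealMatrix components; // dims x k projection matrix
  private final double[] mean;

  PcaPreprocessor(double[][] trainingVectors, int k) {
    int dims = trainingVectors[0].length;
    mean = new double[dims];
    for (double[] v : trainingVectors) {
      for (int d = 0; d < dims; d++) {
        mean[d] += v[d] / trainingVectors.length;
      }
    }
    RealMatrix cov = new Covariance(trainingVectors).getCovarianceMatrix();
    EigenDecomposition eig = new EigenDecomposition(cov);
    // Assumption: eigenvectors of the symmetric covariance matrix come back in
    // descending eigenvalue order; sort the indices explicitly if unsure.
    components = eig.getV().getSubMatrix(0, dims - 1, 0, k - 1);
  }

  float[] project(double[] vector) {
    double[] centered = new double[vector.length];
    for (int d = 0; d < vector.length; d++) {
      centered[d] = vector[d] - mean[d];
    }
    double[] reduced = components.preMultiply(centered); // (1 x dims) times (dims x k)
    float[] out = new float[reduced.length];
    for (int i = 0; i < reduced.length; i++) {
      out[i] = (float) reduced[i];
    }
    return out;
  }

  Document toDocument(double[] vector) {
    Document doc = new Document();
    doc.add(new KnnFloatVectorField("embedding", project(vector),
        VectorSimilarityFunction.EUCLIDEAN));
    return doc;
  }
}
```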

@benwtrent
Member

Is PCA ever preferred for vector information retrieval over Product Quantization? I would expect the first stab at dimension reduction would be PQ not PCA.

Maybe a first step for PQ would be a format that allows statically setting the code books and applying them at index & search time?
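
To make the "statically set code books" idea concrete, here is a rough sketch of a product quantizer that takes externally supplied codebooks and applies them at index and search time. All class and parameter names are illustrative, not an existing Lucene API:

```java
// Illustrative product-quantization encoder with externally supplied ("static")
// codebooks: codebooks[m][c] is centroid c for subspace m.
final class PqEncoder {
  private final float[][][] codebooks; // [numSubspaces][numCentroids][subDim]
  private final int subDim;

  PqEncoder(float[][][] codebooks) {
    this.codebooks = codebooks;
    this.subDim = codebooks[0][0].length;
  }

  /** Encode a vector as one centroid id per subspace. */
  byte[] encode(float[] vector) {
    byte[] codes = new byte[codebooks.length];
    for (int m = 0; m < codebooks.length; m++) {
      int best = 0;
      float bestDist = Float.MAX_VALUE;
      for (int c = 0; c < codebooks[m].length; c++) {
        float dist = 0;
        for (int d = 0; d < subDim; d++) {
          float diff = vector[m * subDim + d] - codebooks[m][c][d];
          dist += diff * diff;
        }
        if (dist < bestDist) {
          bestDist = dist;
          best = c;
        }
      }
      codes[m] = (byte) best; // assumes <= 256 centroids per codebook
    }
    return codes;
  }

  /** Approximate squared distance between a raw query and a PQ-encoded vector. */
  float distance(float[] query, byte[] codes) {
    float dist = 0;
    for (int m = 0; m < codes.length; m++) {
      float[] centroid = codebooks[m][codes[m] & 0xFF];
      for (int d = 0; d < subDim; d++) {
        float diff = query[m * subDim + d] - centroid[d];
        dist += diff * diff;
      }
    }
    return dist;
  }
}
```

At search time a real implementation would precompute a per-query lookup table of subspace distances rather than looping over centroid components per comparison, but the brute-force form above keeps the idea visible.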

@gautamworah96
Contributor Author

I would expect the first stab at dimension reduction would be PQ not PCA.

Hmm. I would've expected the opposite? If the number of dimensions is reduced, you don't even need to quantize them?

@mikemccand
Member

+1 to explore these sorts of dimensionality-reduction compression techniques in Lucene! PQ indeed looks compelling (WARNING: PDF). I have no idea how PQ compares to PCA (and other techniques?) for approximate KNN search ...

Maybe a first step for PQ would be a format that allows statically setting the code books and applying them at index & search time?

Could Lucene compute the PQ codebooks itself at segment write (flush or merge) time? This is a wonderful aspect of Lucene's design: because every segment is write-once, and Lucene's Codec APIs allow Codec impls to traverse all values being written as many times as they want (i.e. it is NOT iterate-once), the Codec can tune quite precisely how to best encode this one field/segment. Doc values indexing uses this to great effect, e.g. recognizing that a given field has low cardinality and encoding it via lookup table, or that the field is very sparse, etc.

Then this could all be "under the hood" (just like our awesome scalar vector compression), just another form of Codec vector compression, simple for users to turn on, and maybe offering the right hyper-parameter tunables to trade off how much quantization happens against the impact on performance / recall.

Maybe Lucene's default Codec could offer either scalar compression (int8, int7, int4, maybe soon int1 or int0.5 heh), or PQ/PCA, or both (do they mix???).
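
For illustration, here is a rough sketch (plain Java collections, not the actual Codec APIs) of that write-once, multi-pass idea: a first pass over the segment's vectors trains per-subspace codebooks (a trivial sampling stand-in for real k-means), and a second pass encodes every vector with them, reusing the `PqEncoder` sketched above:

```java
import java.util.List;
import java.util.Random;

// Sketch of the "train at segment-write time" idea: because a segment's vectors
// can be traversed more than once before the segment is sealed, a codec could
// make one pass to train per-subspace codebooks and a second pass to encode.
final class SegmentPqWriter {
  static byte[][] writeSegment(List<float[]> segmentVectors, int numSubspaces, int numCentroids) {
    int dims = segmentVectors.get(0).length;
    int subDim = dims / numSubspaces;

    // Pass 1: "train" codebooks. Real code would run k-means per subspace;
    // here we just sample centroids from the data to keep the sketch short.
    Random random = new Random(42);
    float[][][] codebooks = new float[numSubspaces][numCentroids][subDim];
    for (int m = 0; m < numSubspaces; m++) {
      for (int c = 0; c < numCentroids; c++) {
        float[] sample = segmentVectors.get(random.nextInt(segmentVectors.size()));
        System.arraycopy(sample, m * subDim, codebooks[m][c], 0, subDim);
      }
    }

    // Pass 2: encode every vector in the segment with the trained codebooks.
    PqEncoder encoder = new PqEncoder(codebooks);
    byte[][] codes = new byte[segmentVectors.size()][];
    for (int i = 0; i < segmentVectors.size(); i++) {
      codes[i] = encoder.encode(segmentVectors.get(i));
    }
    return codes; // a real codec would also persist the codebooks alongside the codes
  }
}
```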

@benwtrent
Member

If the number of dimensions is reduced, you don't even need to quantize them?

Does PCA work in non-Euclidean spaces? Does it work on out-of-domain queries?

If PCA ends up being cheaper and better than PQ, I say let's do it. The reason I reach for PQ is that it's fairly standard right now. Maybe it is for a reason?

As far as doing PQ at merge goes, there needs to be experimentation here.

Having to redo code books on merge and re-quantize will increase merge time significantly, no matter how many threads we throw at it.

@navneet1v
Contributor

+1 on the idea of bringing dimensionality reduction techniques into Lucene. One problem I have seen with PQ, though, is that you need enough vectors to build the codebooks. Building a segment-level codebook can therefore be challenging, as a segment may only have a few documents in it.

Some ways to overcome the above problem:

  1. Build a global codebook outside of segments and let segments use it during flush and merging. I don't know how we could do this in Lucene; maybe via a new KNNVectorFormatReader that can supply it (based on my limited knowledge).

@mikemccand
Member

Maybe we could avoid quantization for small segments, and only trigger the code book training once a merged segment is big enough?

@mikemccand
Member

This also scales nicely: the small segments are not taking that much storage in the index, and then the larger segments (which hopefully have enough data to reliably train code books) do take a lot of storage which could be reduced by PQ/PCA/something.

@benwtrent
Member

I agree @mikemccand, we should not do any dim-reduction stuff until some threshold of vectors is reached.

I am not 100% convinced this scales nicely; well, it would scale more nicely than running PQ on tiny segments all the time :). We already have some expensive issues with HNSW merging (we are slowly working through fixing those).

When segments are merged that have different PQ code-books, we have to re-calculate and re-quantize everything unless we can do something clever.

Maybe it won't be too expensive, and the cost will be worth it if we can build the HNSW graph with the PQ, or maybe use the PQ information to bootstrap the HNSW graph build to make it cheaper.

I agree this is a worthy place for experimentation.

As an aside, this "wait to build the index" thing could also be done for HNSW. Tiny segments with quick flushes probably shouldn't even build HNSW graphs. Instead, they should probably store the float vectors flat (or the scalar quantized vectors flat as scalar quantizing is effectively linear in runtime). Then when a threshold is reached (it could be small, something like 1k, 10k?), we create the HNSW graphs.
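
A minimal sketch of this threshold idea, with an illustrative cutoff and no real Lucene internals: small segments keep flat vectors and answer queries with an exact linear scan, and only larger segments would pay for graph construction.

```java
import java.util.List;

// Illustration of the "wait to build the graph" idea for tiny segments: below a
// size threshold, store the vectors flat and answer queries with a brute-force
// scan; only larger segments pay the HNSW construction cost. The threshold and
// class names are illustrative.
final class ThresholdedSegmentVectors {
  private static final int GRAPH_THRESHOLD = 10_000; // e.g. somewhere in the 1k-10k range

  static int searchNearest(List<float[]> segmentVectors, float[] query) {
    if (segmentVectors.size() >= GRAPH_THRESHOLD) {
      // Large segment: an HNSW graph would be built at flush/merge time and
      // searched here; omitted to keep the sketch focused on the threshold.
      throw new UnsupportedOperationException("graph search not sketched");
    }
    // Tiny segment: flat vectors, exact linear scan (cheap at this size).
    int best = -1;
    float bestDist = Float.MAX_VALUE;
    for (int i = 0; i < segmentVectors.size(); i++) {
      float[] v = segmentVectors.get(i);
      float dist = 0;
      for (int d = 0; d < query.length; d++) {
        float diff = query[d] - v[d];
        dist += diff * diff;
      }
      if (dist < bestDist) {
        bestDist = dist;
        best = i;
      }
    }
    return best;
  }
}
```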

@mikemccand
Member

As an aside, this "wait to build the index" thing could also be done for HNSW. Tiny segments with quick flushes probably shouldn't even build HNSW graphs. Instead, they should probably store the float vectors flat (or the scalar quantized vectors flat as scalar quantizing is effectively linear in runtime). Then when a threshold is reached (it could be small, something like 1k, 10k?), we create the HNSW graphs.

Oooh -- +1 to explore this idea separately, as a precursor to enabling/exploring dimensionality reduction compression. Lucene's write-once segments really make such optimizations (different choices depending on a segment's size or the characteristics of the documents in each segment) possible and worthwhile! Maybe open a spinoff for this one?

@benwtrent
Member

Maybe open a spinoff for this one?

#13447

@msokolov
Contributor

msokolov commented Jun 6, 2024

I've been thinking about PCA a bit, and although it can be very useful in some settings, I'm not convinced it really belongs in Lucene; it should more likely be part of a pre-indexing stage if needed. My thinking is that whether PCA is useful or not (i.e. yields significant dimension reduction that would compress the index and speed searches) depends very heavily on the vector values in the data set. For naturally-occurring high-dimensional data like 256x256 images of faces (one early and successful application of PCA) we expect a high degree of redundancy in the data, and PCA is super useful -- in the early days, facial recognition databases used PCA to reduce dimensions from 65536 down to ~40 while retaining high recall/precision on matching tasks. I'm sure the sota is different now than it was when I was in school in the 90's, but PCA hasn't changed at least.

But my expectation for the synthetic vectors used for search today is that they would have been generated by some kind of ML process that will tend to produce vectors with less redundancy. If that's the typical case (we should test!) then I don't think we'd want to build in support for something complicated and probably expensive that wouldn't usually be useful. Maybe the next step is to try PCA on some typical datasets we see in use and see whether there would be any benefit to it?
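
One cheap way to run that test, assuming Apache Commons Math and an in-memory sample of the dataset: look at the eigenvalue spectrum of the covariance matrix and count how many principal components are needed to retain, say, 95% of the variance.

```java
import java.util.Arrays;
import org.apache.commons.math3.linear.EigenDecomposition;
import org.apache.commons.math3.stat.correlation.Covariance;

// A quick check of whether a dataset's vectors are redundant enough for PCA to
// pay off: count the principal components needed to retain most of the variance.
final class ExplainedVariance {
  static int componentsFor(double[][] vectors, double targetFraction) {
    double[] eigenvalues =
        new EigenDecomposition(new Covariance(vectors).getCovarianceMatrix())
            .getRealEigenvalues();
    Arrays.sort(eigenvalues); // ascending; walk from the end for the largest first
    double total = 0;
    for (double e : eigenvalues) {
      total += e;
    }
    double kept = 0;
    int components = 0;
    for (int i = eigenvalues.length - 1; i >= 0; i--) {
      kept += eigenvalues[i];
      components++;
      if (kept / total >= targetFraction) {
        break;
      }
    }
    return components;
  }
}
```

If that count comes back close to the original dimensionality, the dataset has little of the redundancy that made eigenfaces work, and PCA is unlikely to buy much.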
