
Add support for similarity-based vector searches #12679

Merged: 10 commits merged into apache:main on Dec 11, 2023

Conversation

@kaivalnp (Contributor)

Description

Background in #12579

Add support for getting "all vectors within a radius" as opposed to getting the "topK closest vectors" in the current system

Considerations

I've tried to keep this change minimal and non-invasive by not modifying any APIs and re-using existing HNSW graphs -- changing the graph traversal and result collection criteria to:

  1. Visit all nodes (reachable from the entry node in the last level) that are within an outer "traversal" radius
  2. Collect all nodes that are within an inner "result" radius
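To make the two criteria concrete, here is a minimal sketch of the modified best-first search loop -- not the actual patch; `candidates` (a max-heap of nodes ordered by similarity to the query), `graph`, `visited`, and `similarity(...)` are assumed helpers:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.search.ScoreDoc;

// Sketch only: `candidates` is assumed to be a max-heap of graph nodes keyed
// by similarity to the query, seeded with the entry node on the last level,
// and `visited` a bit set of nodes whose scores were already computed.
List<ScoreDoc> results = new ArrayList<>(); // plain list, no heap needed
while (candidates.size() > 0) {
  float score = candidates.topScore();
  int node = candidates.popNode();
  if (score < traversalThreshold) {
    break; // 1. best remaining candidate is outside the "traversal" radius
  }
  if (score >= resultThreshold) {
    results.add(new ScoreDoc(node, score)); // 2. inside the "result" radius
  }
  for (int neighbor : graph.neighborsOf(node)) {
    if (visited.getAndSet(neighbor) == false) {
      candidates.add(neighbor, similarity(query, neighbor));
    }
  }
}
```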

Advantages

  1. Queries that have a high number of "relevant" results will get all of those (not limited by topK)
  2. Conversely, arbitrary queries where many results are not "relevant" will not waste time in getting all topK (when some of them will be removed later)
  3. Results of HNSW searches need not be sorted - and we can store them in a plain list as opposed to min-max heaps (saving on heapify calls). Merging results from segments is also cheaper, where we just concatenate results as opposed to calculating the index-level topK

On a higher level, finding topK results needed HNSW searches to happen in #rewrite because of an interdependence of results between segments - where we want to find the index-level topK from multiple segment-level results. This is kind of against Lucene's concept of segments being independently searchable sub-indexes?

Moreover, we needed explicit concurrency (#12160) to perform these in parallel, and these shortcomings would be naturally overcome with the new objective of finding "all vectors within a radius" - inherently independent of results from another segment (so we can move searches to a more fitting place?)

Caveats

I could not find much precedent in using HNSW graphs this way (or even the radius-based search for that matter - please add links to existing work if someone is aware) and consequently marked all classes as @lucene.experimental

For now I have re-used lots of functionality from AbstractKnnVectorQuery to keep this minimal, but if the use-case is accepted more widely we can look into writing more suitable queries (as mentioned above briefly)

Next steps

Run benchmarks with this new query to see how it compares to the topK based search

@shubhamvishu (Contributor) commented Oct 14, 2023

Let's add some tests for these going forward?

@kaivalnp (Author)

Added now

@shubhamvishu (Contributor) commented Oct 14, 2023

Thanks for adding this @kaivalnp! The idea makes sense to me, and I'm looking forward to the benchmark results. I left some minor comments. Sharing some thoughts below:

  1. Is it right to call this a radius-based search? I understand we are treating the traversal threshold as the outer radius and the result threshold as the inner radius, but that doesn't sound very intuitive in the context of a graph. I would associate a radius more with the edges of the graph than with the dot-product score (at least when visually forming a mind map). I don't feel very strongly about it, but wanted to ask in case there is a more appropriate way to name this.
  2. The RnnFloatVectorQuery and RnnByteVectorQuery are almost the same. I understand that separate implementations for byte and float vectors are an existing convention(?), but this seems like a very good opportunity to use generics and have a single RnnVectorQuery. I don't know if generics would make things more complex or hide unknown caveats, but it looks like a good approach to me. Looking forward to everybody's thoughts on this.

@jpountz (Contributor) commented Oct 17, 2023

If I read correctly, this query ends up calling LeafReader#searchNearestNeighbors with k=Integer.MAX_VALUE, which will not only run in O(maxDoc) time but also use O(maxDoc) memory. I don't think we should do this.

In my opinion, there are two options: either we force this query to take a k parameter and make it only return the top k nearest neighbors that are also within the radius, or we make it always run in "exact" mode with a two-phase iterator that performs the similarity check in TwoPhaseIterator#matches(). We'd then need to prefix this query with Slow like other queries that work similarly.

@kaivalnp (Author)

If I read correctly, this query ends up calling LeafReader#searchNearestNeighbors with k=Integer.MAX_VALUE

No, we're calling the new API (from here) with a custom RnnCollector that performs score-based HNSW searches (as opposed to the old API that performs topK-based searches with k=Integer.MAX_VALUE)

The Integer.MAX_VALUE passed here is just used in two places: #exactSearch (to instantiate a priority queue of size k) and #mergeLeafResults (to request for the best-scoring k hits across all segment results). We're overriding both functions in our implementation of AbstractRnnVectorQuery (because we do not want to limit to topK results)

I think you're worried that we'll end up performing brute-force KNN on all documents in the segment, and then retain vectors above the threshold? What we instead aim to do is: starting from the entry node in the last level of HNSW graphs, we keep visiting candidates as long as they are above the traversalThreshold, all the while adding nodes above the resultThreshold as accepted results

This is not necessarily slower than normal HNSW searches, provided the traversalThreshold is chosen suitably

@jpountz (Contributor) commented Oct 17, 2023

Thanks for explaining, I had indeed overlooked how the Integer.MAX_VALUE was used. I'm still interested in figuring out whether we can have stronger guarantees on the worst-case memory usage of this query (I believe nothing prevents this list from growing unbounded if the threshold is high?). E.g. could we abort the approximate search if the list maintained by the RnnCollector grows too large, and fall back to an exact search that is based on a TwoPhaseIterator instead of eagerly collecting all matches into a list?

@kaivalnp (Author)

Thanks for the review @shubhamvishu! Addressed some of the comments above

Is it right to call it a radius-based search here?

I think of it as finding all results within a high-dimensional circle / sphere / equivalent, and radius-based search seems to capture that essence. That said, "threshold-based search" may be more appropriate (since a radius is tied to Euclidean distance, and may not be easy to relate to Cosine similarity or Dot Product)

No strong opinions here, looking for others' thoughts as well on more appropriate naming

The RnnFloatVectorQuery and RnnByteVectorQuery are almost the same

The problem here is that we'll have to generalize many other (unrelated to this change) internal classes. I'll keep this to a separate issue

@benwtrent (Member)

I think of it as finding all results within a high-dimensional circle / sphere / equivalent,

dot-product, cosine, etc. don't really follow that same idea, as you point out. I would prefer something like VectorSimilarityQuery.

E.g. could we abort the approximate search if the list maintained by the RnnCollector grows too large, and fall back to an exact search that is based on a TwoPhaseIterator instead of eagerly collecting all matches into a list?

I agree with @jpountz's concerns.

The topDocs collector gets a replay of the matched documents. We should put sane limits here and prevent folks from getting 100,000s of matches (int & float value arrays) via approximate search. It seems like having a huge number like that could cause issues.

@kaivalnp (Author)

Benchmarks

Using the vector file from https://home.apache.org/~sokolov/enwiki-20120502-lines-1k-100d.vec (enwiki dataset, unit vectors, 100 dimensions)

The setup was 1M doc vectors in a single HNSW graph with DOT_PRODUCT similarity, and 10K query vectors

The baseline for the new objective is "all vectors above a score threshold" for a given query (as opposed to the best-scoring topK vectors in the current system), and is used to compute recall in all subsequent runs.

Here are some statistics for the result counts in the new baseline:

| threshold | mean | stdDev | min | p25 | p50 | p75 | p90 | p99 | max |
|-----------|------|--------|-----|-----|-----|-----|-----|-----|-----|
| 0.95 | 71877.73 | 109177.23 | 0 | 222 | 7436 | 116567 | 259135 | 388113 | 483330 |
| 0.96 | 32155.63 | 57183.83 | 0 | 30 | 3524 | 36143 | 120700 | 235038 | 342959 |
| 0.97 | 8865.48 | 19006.24 | 0 | 1 | 816 | 5483 | 29966 | 92433 | 174163 |
| 0.98 | 1010.10 | 2423.03 | 0 | 0 | 46 | 873 | 3234 | 12175 | 40163 |
| 0.99 | 136.47 | 465.91 | 0 | 0 | 0 | 2 | 77 | 2296 | 2494 |

This is used to get an estimate of query - result count distribution for various threshold values, and also gauge the corresponding topK to use for comparison with the new radius-based vector search API

Here we will benchmark the new API against a high topK (+ filtering out results below the threshold after HNSW search)

K-NN Search (current system)

| maxConn | beamWidth | topK | threshold | mean | numVisited | latency | recall |
|---------|-----------|------|-----------|------|------------|---------|--------|
| 16 | 100 | 500 | 0.99 | 46.39 | 4086 | 1.465 | 0.34 |
| 16 | 100 | 1000 | 0.99 | 83.92 | 6890 | 2.600 | 0.61 |
| 16 | 100 | 2000 | 0.99 | 129.56 | 11727 | 4.746 | 0.95 |
| 16 | 200 | 500 | 0.99 | 46.39 | 4504 | 1.535 | 0.34 |
| 16 | 200 | 1000 | 0.99 | 83.92 | 7564 | 2.759 | 0.61 |
| 16 | 200 | 2000 | 0.99 | 129.56 | 12805 | 5.007 | 0.95 |
| 32 | 100 | 500 | 0.99 | 46.39 | 4940 | 1.644 | 0.34 |
| 32 | 100 | 1000 | 0.99 | 83.92 | 8271 | 2.944 | 0.61 |
| 32 | 100 | 2000 | 0.99 | 129.56 | 13937 | 5.335 | 0.95 |
| 32 | 200 | 500 | 0.99 | 46.39 | 5654 | 1.890 | 0.34 |
| 32 | 200 | 1000 | 0.99 | 83.92 | 9401 | 3.320 | 0.61 |
| 32 | 200 | 2000 | 0.99 | 129.56 | 15707 | 5.987 | 0.95 |
| 64 | 100 | 500 | 0.99 | 46.39 | 5241 | 1.736 | 0.34 |
| 64 | 100 | 1000 | 0.99 | 83.92 | 8766 | 3.091 | 0.61 |
| 64 | 100 | 2000 | 0.99 | 129.56 | 14736 | 5.567 | 0.95 |
| 64 | 200 | 500 | 0.99 | 46.39 | 6095 | 1.992 | 0.34 |
| 64 | 200 | 1000 | 0.99 | 83.92 | 10119 | 3.535 | 0.61 |
| 64 | 200 | 2000 | 0.99 | 129.56 | 16852 | 6.365 | 0.95 |

R-NN Search (new system)

| maxConn | beamWidth | traversalThreshold | threshold | mean | numVisited | latency | recall |
|---------|-----------|--------------------|-----------|------|------------|---------|--------|
| 16 | 100 | 0.99 | 0.99 | 94.03 | 256 | 0.129 | 0.69 |
| 16 | 100 | 0.98 | 0.99 | 95.18 | 5171 | 2.062 | 0.70 |
| 16 | 200 | 0.99 | 0.99 | 89.96 | 263 | 0.119 | 0.66 |
| 16 | 200 | 0.98 | 0.99 | 91.09 | 5497 | 2.207 | 0.67 |
| 32 | 100 | 0.99 | 0.99 | 109.17 | 295 | 0.135 | 0.80 |
| 32 | 100 | 0.98 | 0.99 | 110.89 | 6529 | 2.580 | 0.81 |
| 32 | 200 | 0.99 | 0.99 | 108.97 | 313 | 0.142 | 0.80 |
| 32 | 200 | 0.98 | 0.99 | 110.55 | 7145 | 2.861 | 0.81 |
| 64 | 100 | 0.99 | 0.99 | 133.61 | 314 | 0.152 | 0.98 |
| 64 | 100 | 0.98 | 0.99 | 135.74 | 7033 | 2.765 | 0.99 |
| 64 | 200 | 0.99 | 0.99 | 133.84 | 333 | 0.163 | 0.98 |
| 64 | 200 | 0.98 | 0.99 | 135.96 | 7833 | 3.121 | 1.00 |
  • mean is the average number of results above the threshold
  • numVisited is the average number of HNSW nodes visited per-query
  • The latency is measured in ms per-query

If the goal is to "get all vectors within a radius", then it looks like the new radius-based search API scales better than using a large topK and post-filtering the results later?

@kaivalnp (Author)

stronger guarantees on the worst-case memory usage

Totally agreed @jpountz! It is very easy to go wrong with the new API, especially if the user passes a low threshold (high radius -> low threshold). As we can see from the benchmarks above, the number of nodes to visit can jump very fast with a slight reduction in the traversalThreshold (see the mean column of the first table)

fall back to an exact search that is based on a TwoPhaseIterator

This makes sense to me. Something like a lazy-loading iterator, where we perform vector comparisons and determine whether a doc matches on #advance?

something like VectorSimilarityQuery

I like this, thanks for the suggestion @benwtrent!

@benwtrent (Member)

Something like a lazy-loading iterator, where we perform vector comparisons and determine whether a doc matches on #advance?

I think the thing to do, @kaivalnp, would be to mark the Collector as full by flagging "incomplete" (I think this is possible) once a threshold is reached. You can do this independently of a "max visit" limit, since we don't care about visiting a vector, we just care about adding it to the result set.
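For concreteness, a rough sketch of that idea -- the method names follow Lucene's KnnCollector interface, but the class and its maxResults cap are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.search.KnnCollector;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.TotalHits;

// Sketch only: caps the number of collected results, and flags the search
// incomplete via TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO once the cap
// is hit -- independently of any limit on the number of nodes visited.
class CappedSimilarityCollector implements KnnCollector {
  private final float resultSimilarity;
  private final int maxResults; // hypothetical cap, not a Lucene parameter
  private final List<ScoreDoc> hits = new ArrayList<>();
  private long visited;

  CappedSimilarityCollector(float resultSimilarity, int maxResults) {
    this.resultSimilarity = resultSimilarity;
    this.maxResults = maxResults;
  }

  @Override public boolean earlyTerminated() { return hits.size() >= maxResults; }
  @Override public void incVisitedCount(int count) { visited += count; }
  @Override public long visitedCount() { return visited; }
  @Override public long visitLimit() { return Long.MAX_VALUE; }
  @Override public int k() { return maxResults; }

  @Override
  public boolean collect(int docId, float similarity) {
    // Accept only hits at or above the result threshold
    return similarity >= resultSimilarity && hits.add(new ScoreDoc(docId, similarity));
  }

  @Override public float minCompetitiveSimilarity() { return resultSimilarity; }

  @Override
  public TopDocs topDocs() {
    TotalHits.Relation relation = earlyTerminated()
        ? TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO // "incomplete" marker
        : TotalHits.Relation.EQUAL_TO;
    return new TopDocs(new TotalHits(hits.size(), relation), hits.toArray(ScoreDoc[]::new));
  }
}
```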

@benwtrent (Member)

The results in #12679 (comment) are astounding! I will try to replicate with Lucene Util.

The numbers seem almost too good ;)

@kaivalnp (Author)

the Collector is full by flagging "incomplete" (I think this is possible) once a threshold is reached

Do you mean that we return incomplete results?

Instead, maybe we can:

  1. Ask for a sane limit on the number of nodes to visit from the user
  2. If this limit is reached (possibly when the supplied traversalThreshold is too low), then we break out of HNSW search
  3. Now instead of performing a greedy #exactSearch and collecting everything into a list, we return a TwoPhaseIterator where the #matches call performs the underlying dot product comparison and returns true or false based on whether the computed score is above the resultThreshold
  4. This way, we can perform an "exact search" lazily, and only compute vector similarity on required documents (for example: if this query is a child of some BooleanQuery, then the actual number of documents for which we'll need to compute similarity is greatly reduced). The worst case will still be an exact search on all documents

This "lazy-loading" works very well for our use case because the fact that a vector matches our query or not is independent of other vectors (unlike in K-NN, where given a query and an arbitrary doc vector, we cannot say whether the doc vector will be in the topK results of the query)

Is this what you had in mind earlier @jpountz?

I will try and replicate with Lucene Util.

Yes, I took inspiration from KnnGraphTester to write a local benchmark, but may have made some silly mistakes. It'll be good to get an independent set of benchmark results.

@kaivalnp kaivalnp changed the title Add support for radius-based vector searches Add support for similarity-based vector searches Oct 18, 2023
@kaivalnp kaivalnp closed this Oct 18, 2023
@kaivalnp kaivalnp deleted the radius-based-vector-search branch October 18, 2023 16:47
@kaivalnp kaivalnp restored the radius-based-vector-search branch October 18, 2023 16:48
@kaivalnp kaivalnp reopened this Oct 18, 2023
@kaivalnp (Author)

Sorry for the confusion, I tried renaming the branch from radius-based-vector-search to similarity-based-vector-search and the PR closed automatically. I guess I'm stuck with this branch name :(

@benwtrent (Member)

OK, I tried testing with KnnGraphTester.

I indexed 100_000 normalized Cohere vectors (768 dims).

With regular knn, recall@10:

| recall | latency | nDoc | fanout | maxConn | beamWidth | visited |
|--------|---------|------|--------|---------|-----------|---------|
| 0.771 | 0.13 | 100000 | 0 | 16 | 100 | 10 |
| 0.870 | 0.19 | 100000 | 10 | 16 | 100 | 20 |
| 0.953 | 0.42 | 100000 | 50 | 16 | 100 | 60 |
| 0.971 | 0.67 | 100000 | 100 | 16 | 100 | 110 |

I tried the similarity threshold and it's way worse.

| recall | latency | nDoc | resultSim | travSim | maxConn | beamWidth | visited |
|--------|---------|------|-----------|---------|---------|-----------|---------|
| 0.889 | 18.87 | 100000 | 0.89000 | 0.89500 | 16 | 100 | 6714 |
| 0.889 | 19.69 | 100000 | 0.89000 | 0.89000 | 16 | 100 | 10332 |

@benwtrent (Member)

@kaivalnp I see the issue with my test: you are specifically testing "post-filtering" on the top values, not just getting the top 10 via k. I understand my issue.

Could you post your testing code or something in a gist?

@kaivalnp (Author)

Thanks for running this @benwtrent!

I just had a couple of questions:

  1. What was your baseline in the test? If the baseline / goal is to "get the K-Nearest Neighbors", then the threshold-based search is not the best way to achieve it. I believe the true baseline should be all vectors above the score threshold
  2. If the baseline is KNN, we should at least post-filter results below the threshold. The threshold-based search will never be able to find results below the resultSimilarity, and it may not be a fair comparison

As I'm writing this, I see your comment. I'll post my setup in a while

@kaivalnp (Author)

Here is the gist of my benchmark: https://gist.github.com/kaivalnp/79808017ed7666214540213d1e2a21cf

I'm calculating the baseline / individual results as "count of vectors above the threshold"

Note that we do not need the actual vectors, because any vector with a score >= resultSimilarity is implicitly in the baseline. This simplifies the benchmark to just maintaining counts of vectors (as opposed to the actual vector IDs), and recall is calculated as the "ratio of total count of vectors found by KNN or RNN / total count of vectors in the baseline"

I had some other helper functions, mainly for calling these and formatting output, but kept the important functions in the gist (how I'm calculating the baseline, the KNN / RNN results, and the time taken)

@kaivalnp (Author)

Hi @benwtrent! Curious to hear if you've been able to reproduce the benchmark?

@benwtrent (Member)

@kaivalnp I have been busy doing other things. I hope to look into this in the next week or so.

@kaivalnp (Author)

Thank you! I'll try to incorporate earlier suggestions in the meantime

- Make use of inherent independence of segment-level results
- Do not greedily collect exact matches, return a lazy-loading iterator instead
@kaivalnp (Author)

Summary of new changes:

  1. Refactor into a more appropriate query

    • Move away from AbstractKnnVectorQuery to take advantage of inherent independence of segment-level results
    • KNN queries need to execute the core logic in #rewrite because of an inter-dependence of segment-level results (that is, given N segment-level hits we cannot determine whether they will appear in the index-level topK without knowing the results from other segments). This creates a requirement for custom concurrency for individual HNSW searches, which should ideally be parallel by default
    • We can move graph searches down to a more appropriate place (like #scorer) to take advantage of this
  2. Return a lazy-loading iterator instead of a greedy exact search (thanks @jpountz!)

    • Introduce a visitLimit on the number of nodes to traverse before stopping graph search - deeming it "too expensive". Once this is exhausted, return a lazy-loading iterator on all vectors (functionally equivalent to an exact search)
    • Unlike KNN queries, which need to traverse all vectors to determine which ones are present in the topK best-scoring ones, a similarity-based vector search can independently determine if a vector is a result or not (based on whether its similarity with the query is above a resultSimilarity)
    • Making use of this behavior, we can prevent a greedy exact search for collecting all matching docs into a list on heap, and determine if a vector is a match inside a FilteredDocIdSetIterator
    • This has a huge benefit when the query will be one of the clauses of a BooleanQuery (so other clauses will filter out non-matching docs and this query will only compute similarity scores with already filtered vectors). In the worst case, this will consider all vectors (same as exact search)
    • We also have useful information from graph search - mainly which hit was evaluated, and which hit was collected. This information can be re-used from the iterator: if a hit has been traversed, it will either be added to the results, or discarded. If it is present in the results, we simply lookup the score, otherwise mark it as rejected
    • If a vector has not been traversed in graph search, we compute its similarity score (so each query - document pair will only compute similarity scores once)
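A rough sketch of that match logic, with illustrative names (`visited` marks nodes scored during graph search, `collected` maps accepted docs to their scores, and `scoreLazily(doc)` computes a fresh similarity):

```java
// Illustrative sketch only -- names are hypothetical, not the actual patch.
boolean matches(int doc) throws IOException {
  if (visited.get(doc)) {
    // Already scored during graph search: either collected (reuse the score)
    // or evaluated and rejected (below resultSimilarity) -- never re-score.
    return collected.containsKey(doc);
  }
  // Never traversed during graph search: compute its similarity exactly once.
  return scoreLazily(doc) >= resultSimilarity;
}
```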

Please let me know if this approach makes sense?

@kaivalnp (Author) commented Nov 15, 2023

You still need to score the vectors to realize that they are in the iteration set or not

Right, I meant that we need not score all other vectors to determine if the vector itself is a "hit" or not (we just need its similarity score to be above the resultSimilarity) - as opposed to KNN where it's not a simple "filter" like you mentioned

we do all this work in approximateSearch (because we need to score the values) only to throw it away

I've tried to re-use some of this work to directly reject vectors that are above the traversalSimilarity but below the resultSimilarity (the ones that were already scored from HNSW search), without re-computing their scores

I wonder if we can extend this further: visited marks all the nodes for which we have computed scores during HNSW search. However, anything that is "visited but not collected" will not make it to the final results. We could do this by passing the visited variable back to the KnnCollector through a new method like setVisited(Bits)?

This is also usable in the current KNN-based search, wherever we fall back from approximateSearch to exactSearch. If the KnnCollector had information about what we have already scored in graph searches (but is not present in the results), we could avoid computing those similarity scores again in exactSearch, because we already know they are not present in the topK

Right now we score all vectors present in the filter, even if many of them are already scored and rejected in graph search

Here are some very rough changes to support this -- what do you think @benwtrent?
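In rough shape, the hook could look like this (hypothetical, not an existing Lucene API):

```java
import org.apache.lucene.util.Bits;

// Hypothetical hook, not an existing Lucene API: graph search hands its
// visited set back so an exact-search fallback can skip nodes that were
// already scored (and implicitly rejected) during traversal.
interface VisitAware {
  void setVisited(Bits visited);
}
```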

@kaivalnp (Author)

could you test on cohere with Max-inner product?

Thanks, the gist was really helpful and gave some files including normalized and un-normalized vectors. I assume that since you mentioned MAXIMUM_INNER_PRODUCT, you wanted the un-normalized vectors

I saw ~476k vectors of 768 dimensions there and indexed the first 400k in a single segment, while querying the next 10k, using the following command:

./gradlew :lucene:core:similarity-benchmark --args=" --vecPath=/home/kaivalnp/working/similarity-benchmark/cohere-768.vec --indexPath=/home/kaivalnp/working/similarity-benchmark/cohere-indexes --dim=768 --function=MAXIMUM_INNER_PRODUCT --numDocs=400000 --numQueries=10000 --topKs=5000,2500,1000,500,100 --topK-thresholds=300,305,310,315,320 --traversalSimilarities=295,300,305,310,315 --resultSimilarities=300,305,310,315,320"

KNN search

| maxConn | beamWidth | topK | threshold | count | numVisited | latency | recall |
|---------|-----------|------|-----------|-------|------------|---------|--------|
| 16 | 100 | 5000 | 300.00 | 1123.19 | 40056.44 | 98.96 | 0.89 |
| 16 | 100 | 2500 | 305.00 | 480.82 | 23258.29 | 54.91 | 0.83 |
| 16 | 100 | 1000 | 310.00 | 191.52 | 11249.93 | 26.12 | 0.73 |
| 16 | 100 | 500 | 315.00 | 83.21 | 6487.60 | 14.87 | 0.69 |
| 16 | 100 | 100 | 320.00 | 23.80 | 1832.45 | 4.00 | 0.43 |
| 16 | 200 | 5000 | 300.00 | 1126.33 | 44928.96 | 107.69 | 0.89 |
| 16 | 200 | 2500 | 305.00 | 482.17 | 26242.83 | 61.47 | 0.83 |
| 16 | 200 | 1000 | 310.00 | 192.13 | 12751.78 | 29.42 | 0.73 |
| 16 | 200 | 500 | 315.00 | 83.49 | 7360.26 | 16.67 | 0.70 |
| 16 | 200 | 100 | 320.00 | 23.89 | 2056.14 | 4.51 | 0.44 |
| 32 | 100 | 5000 | 300.00 | 1128.81 | 51636.98 | 122.67 | 0.89 |
| 32 | 100 | 2500 | 305.00 | 483.29 | 30892.01 | 72.01 | 0.84 |
| 32 | 100 | 1000 | 310.00 | 192.65 | 15424.38 | 35.12 | 0.73 |
| 32 | 100 | 500 | 315.00 | 83.72 | 9060.78 | 20.28 | 0.70 |
| 32 | 100 | 100 | 320.00 | 24.00 | 2606.37 | 5.70 | 0.44 |
| 32 | 200 | 5000 | 300.00 | 1130.18 | 61350.93 | 145.76 | 0.89 |
| 32 | 200 | 2500 | 305.00 | 483.95 | 37178.70 | 86.05 | 0.84 |
| 32 | 200 | 1000 | 310.00 | 192.99 | 18778.34 | 42.14 | 0.73 |
| 32 | 200 | 500 | 315.00 | 83.90 | 11083.97 | 24.54 | 0.70 |
| 32 | 200 | 100 | 320.00 | 24.08 | 3172.91 | 6.83 | 0.44 |
| 64 | 100 | 5000 | 300.00 | 1129.81 | 58389.13 | 138.14 | 0.89 |
| 64 | 100 | 2500 | 305.00 | 483.77 | 35567.55 | 81.62 | 0.84 |
| 64 | 100 | 1000 | 310.00 | 192.87 | 18093.55 | 40.34 | 0.73 |
| 64 | 100 | 500 | 315.00 | 83.84 | 10734.50 | 23.76 | 0.70 |
| 64 | 100 | 100 | 320.00 | 24.06 | 3122.13 | 6.77 | 0.44 |
| 64 | 200 | 5000 | 300.00 | 1130.78 | 72620.92 | 169.86 | 0.89 |
| 64 | 200 | 2500 | 305.00 | 484.24 | 45052.36 | 101.93 | 0.84 |
| 64 | 200 | 1000 | 310.00 | 193.16 | 23283.96 | 51.61 | 0.73 |
| 64 | 200 | 500 | 315.00 | 83.99 | 13908.95 | 30.44 | 0.70 |
| 64 | 200 | 100 | 320.00 | 24.13 | 4035.89 | 8.61 | 0.44 |

Similarity-based search

| maxConn | beamWidth | traversalSimilarity | resultSimilarity | count | numVisited | latency | recall |
|---------|-----------|---------------------|------------------|-------|------------|---------|--------|
| 16 | 100 | 295.00 | 300.00 | 1209.53 | 18270.70 | 44.38 | 0.95 |
| 16 | 100 | 300.00 | 305.00 | 538.00 | 8833.17 | 21.02 | 0.93 |
| 16 | 100 | 305.00 | 310.00 | 239.11 | 4249.13 | 9.97 | 0.91 |
| 16 | 100 | 310.00 | 315.00 | 105.02 | 2050.95 | 4.87 | 0.87 |
| 16 | 100 | 315.00 | 320.00 | 45.71 | 1028.26 | 2.35 | 0.83 |
| 16 | 200 | 295.00 | 300.00 | 1217.74 | 20335.62 | 49.38 | 0.96 |
| 16 | 200 | 300.00 | 305.00 | 542.19 | 9851.65 | 23.54 | 0.94 |
| 16 | 200 | 305.00 | 310.00 | 240.68 | 4726.50 | 11.04 | 0.91 |
| 16 | 200 | 310.00 | 315.00 | 106.02 | 2287.34 | 5.33 | 0.88 |
| 16 | 200 | 315.00 | 320.00 | 46.09 | 1139.68 | 2.60 | 0.84 |
| 32 | 100 | 295.00 | 300.00 | 1235.75 | 25159.18 | 59.94 | 0.98 |
| 32 | 100 | 300.00 | 305.00 | 554.76 | 12709.10 | 29.69 | 0.96 |
| 32 | 100 | 305.00 | 310.00 | 247.15 | 6275.45 | 14.46 | 0.94 |
| 32 | 100 | 310.00 | 315.00 | 108.95 | 3093.07 | 7.00 | 0.91 |
| 32 | 100 | 315.00 | 320.00 | 47.39 | 1544.48 | 3.47 | 0.86 |
| 32 | 200 | 295.00 | 300.00 | 1243.78 | 29690.87 | 70.66 | 0.98 |
| 32 | 200 | 300.00 | 305.00 | 558.98 | 15064.99 | 34.99 | 0.97 |
| 32 | 200 | 305.00 | 310.00 | 249.03 | 7442.06 | 17.09 | 0.95 |
| 32 | 200 | 310.00 | 315.00 | 110.01 | 3664.88 | 8.28 | 0.92 |
| 32 | 200 | 315.00 | 320.00 | 47.92 | 1826.35 | 4.06 | 0.87 |
| 64 | 100 | 295.00 | 300.00 | 1228.98 | 29028.54 | 68.77 | 0.97 |
| 64 | 100 | 300.00 | 305.00 | 549.09 | 14931.68 | 34.43 | 0.95 |
| 64 | 100 | 305.00 | 310.00 | 242.41 | 7417.15 | 16.89 | 0.92 |
| 64 | 100 | 310.00 | 315.00 | 105.26 | 3613.84 | 8.12 | 0.88 |
| 64 | 100 | 315.00 | 320.00 | 45.14 | 1794.89 | 4.02 | 0.82 |
| 64 | 200 | 295.00 | 300.00 | 1243.45 | 36266.02 | 85.05 | 0.98 |
| 64 | 200 | 300.00 | 305.00 | 557.47 | 18811.49 | 42.83 | 0.96 |
| 64 | 200 | 305.00 | 310.00 | 246.42 | 9377.28 | 21.11 | 0.94 |
| 64 | 200 | 310.00 | 315.00 | 107.09 | 4559.22 | 10.20 | 0.89 |
| 64 | 200 | 315.00 | 320.00 | 45.99 | 2249.22 | 4.99 | 0.84 |

If the goal is to "get all vectors above a similarity", then it looks like the new similarity-based search API scales better than using a large topK and post-filtering the results later

@benwtrent (Member) left a comment

Thank you for all the benchmarking due diligence @kaivalnp. From that standpoint, it looks good to me.

I do worry a bit about the post-filtering. It seems likely that in a restrictive search scenario we would do a bunch of searching to no avail. However, I don't know a good way around it.

What I would like to see are tests covering this query, its expected behavior and edge cases (no docs, no matching docs with a filter, do we fall back to exact ok, etc.).

Kaival Parikh added 2 commits November 30, 2023 21:20
- Set traversalSimilarity = resultSimilarity by default
- Continue graph search until better nodes are available
- Add filter to determine visitLimit for falling back to exact search
@benwtrent benwtrent self-requested a review November 30, 2023 21:28
@kaivalnp (Author)

Thanks @benwtrent! I also simplified the queries:

I realized that the API may be difficult to use in its current state (we were leaving two parameters - traversalSimilarity and visitLimit - up to the user to configure, which may be a large overhead)

I noticed from the above benchmarks that traversalSimilarity is good for tuning (it acts like the fanout equivalent of topK), but most users need not change it -- so we can keep it equal to resultSimilarity by default (while still allowing it to be configured whenever required)

Another issue encountered previously (and amplified by the above change) is that we stop graph search too early when the entry node is far from the query. To overcome this, can we continue the search as long as we keep finding better-scoring nodes (so we know there is a possibility of reaching nodes above resultSimilarity)?

For configuring visitLimit, it seems like the best option is to add a filter (as in AbstractKnnVectorQuery) - where we determine the visitLimit from the cost of the filter, and fall back to exact search over the filtered docs once this limit is reached
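To make the revised stopping rule concrete, a rough sketch (illustrative names, building on the traversal sketch at the top of this PR):

```java
// Sketch only: keep exploring while the best candidate keeps improving, so a
// far-away entry node can still climb toward the region above resultSimilarity.
float bestScoreSeen = Float.NEGATIVE_INFINITY;
while (candidates.size() > 0 && collector.earlyTerminated() == false) {
  float topScore = candidates.topScore();
  if (topScore < traversalSimilarity && topScore <= bestScoreSeen) {
    break; // below the traversal threshold and no longer improving: stop
  }
  bestScoreSeen = Math.max(bestScoreSeen, topScore);
  // ... pop the candidate, collect it if topScore >= resultSimilarity,
  // and push its unvisited neighbors, as in the earlier sketch
}
```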

Here is the benchmark setup and results with these changes (same range as before): https://gist.github.com/kaivalnp/07d6a96d22adfad4d3cd5924b13ed524

Also added some tests

I do worry a bit around the post-filtering. It seems likely in a restrictive search scenario, we would do a bunch of searching to no avail

Agreed, we do some work in graph search (like similarity computations, collecting results, etc) - which should be reusable from exact search

I had opened #12820 to discuss this issue (it also affects KNN queries) - perhaps we can include these similarity-based queries if we arrive at a solution there?

@benwtrent (Member) left a comment

@kaivalnp thank you so much for all the hard work here. I think this is ready to merge as an experimental query.

Could you add changes for Lucene 9.10? I can merge and backport.

It would also be good to update this branch with the latest main to make sure CI is still happy.

Comment on lines 171 to 173
* GITHUB#12679: Add support for similarity-based vector searches. Finds all vectors scoring above a `resultSimilarity`
while traversing the HNSW graph till better-scoring nodes are available, or the best candidate is below a score of
`traversalSimilarity` in the lowest level. (Aditya Prakash, Kaival Parikh)
@benwtrent (Member) commented Dec 7, 2023

Maybe add the vector query names?

@kaivalnp (Author)

Sorry, didn't get what you mean here?

@benwtrent (Member)

Add support for similarity-based vector searches

Well, what are the query names? :D

@kaivalnp (Author)

Ahh got it.. Updated now :)

@kaivalnp (Author) commented Dec 7, 2023

Thanks for all the help here @benwtrent !

Could you add changes for Lucene 9.10?

Added an entry under "New Features" (also added one of my teammates along with whom this change was designed)

@benwtrent benwtrent merged commit cd19598 into apache:main Dec 11, 2023
4 checks passed
benwtrent pushed a commit that referenced this pull request Dec 11, 2023
@epotyom (Contributor) commented Dec 11, 2023

I see random test failures that could be related to this change:

   >     java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 123
   >         at __randomizedtesting.SeedInfo.seed([119135B1F0803918:13366D71F5841AB]:0)
   >         at org.apache.lucene.codecs.simpletext.SimpleTextKnnVectorsReader$SimpleTextFloatVectorValues.vectorValue(SimpleTextKnnVectorsReader.java:346)
   >         at org.apache.lucene.search.VectorScorer$FloatVectorScorer.score(VectorScorer.java:120)
   >         at org.apache.lucene.search.AbstractVectorSimilarityQuery$VectorSimilarityScorer$2.match(AbstractVectorSimilarityQuery.java:259)
   >         at org.apache.lucene.search.FilteredDocIdSetIterator.nextDoc(FilteredDocIdSetIterator.java:64)
   >         at org.apache.lucene.search.Weight$DefaultBulkScorer.scoreRange(Weight.java:269)
   >         at org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:238)
   >         at org.apache.lucene.tests.search.AssertingBulkScorer.score(AssertingBulkScorer.java:101)
   >         at org.apache.lucene.search.TimeLimitingBulkScorer.score(TimeLimitingBulkScorer.java:82)
   >         at org.apache.lucene.search.BulkScorer.score(BulkScorer.java:38)
   >         at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:684)
   >         at org.apache.lucene.tests.search.AssertingIndexSearcher.search(AssertingIndexSearcher.java:79)
   >         at org.apache.lucene.search.IndexSearcher.lambda$search$2(IndexSearcher.java:636)
   >         at org.apache.lucene.search.TaskExecutor$TaskGroup.lambda$createTask$0(TaskExecutor.java:118)
   >         at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
   >         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
   >         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
   >         at java.base/java.lang.Thread.run(Thread.java:840)
  2> NOTE: reproduce with: gradlew test --tests TestFloatVectorSimilarityQuery.testRandomFilter -Dtests.seed=119135B1F0803918 -Dtests.locale=mer-Latn-KE -Dtests.timezone=Australia/South -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   >     java.lang.UnsupportedOperationException
   >         at __randomizedtesting.SeedInfo.seed([119135B1F0803918:19473B03FDAEADE7]:0)
   >         at org.apache.lucene.search.TestFloatVectorSimilarityQuery$1.createVectorScorer(TestFloatVectorSimilarityQuery.java:82)
   >         at org.apache.lucene.search.AbstractVectorSimilarityQuery$1.scorer(AbstractVectorSimilarityQuery.java:148)
   >         at org.apache.lucene.search.Weight.scorerSupplier(Weight.java:135)
   >         at org.apache.lucene.search.Weight.bulkScorer(Weight.java:167)
   >         at org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.cache(LRUQueryCache.java:708)
   >         at org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.bulkScorer(LRUQueryCache.java:927)
   >         at org.apache.lucene.tests.search.AssertingWeight.bulkScorer(AssertingWeight.java:122)
   >         at org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.cache(LRUQueryCache.java:708)
   >         at org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.bulkScorer(LRUQueryCache.java:927)
   >         at org.apache.lucene.tests.search.AssertingWeight.bulkScorer(AssertingWeight.java:122)
   >         at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:678)
   >         at org.apache.lucene.tests.search.AssertingIndexSearcher.search(AssertingIndexSearcher.java:79)
   >         at org.apache.lucene.search.IndexSearcher.lambda$search$2(IndexSearcher.java:636)
   >         at org.apache.lucene.search.TaskExecutor$TaskGroup.lambda$createTask$0(TaskExecutor.java:118)
   >         at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
   >         at org.apache.lucene.search.TaskExecutor$TaskGroup.invokeAll(TaskExecutor.java:153)
   >         at org.apache.lucene.search.TaskExecutor.invokeAll(TaskExecutor.java:76)
   >         at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:640)
   >         at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:607)
   >         at org.apache.lucene.search.IndexSearcher.count(IndexSearcher.java:423)
   >         at org.apache.lucene.search.BaseVectorSimilarityQueryTestCase.testApproximate(BaseVectorSimilarityQueryTestCase.java:460)
   >         at org.apache.lucene.search.TestFloatVectorSimilarityQuery.testApproximate(TestFloatVectorSimilarityQuery.java:26)
   >         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   >         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
   >         at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   >         at java.base/java.lang.reflect.Method.invoke(Method.java:568)
   >         at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
   >         at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
   >         at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
   >         at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
   >         at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
   >         at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
   >         at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
   >         at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
   >         at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
   >         at org.junit.rules.RunRules.evaluate(RunRules.java:20)
   >         at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >         at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
   >         at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
   >         at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
   >         at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
   >         at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
   >         at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
   >         at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
   >         at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
   >         at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >         at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
   >         at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
   >         at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
   >         at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >         at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >         at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
   >         at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
   >         at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
   >         at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
   >         at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
   >         at org.junit.rules.RunRules.evaluate(RunRules.java:20)
   >         at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >         at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
   >         at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
   >         at java.base/java.lang.Thread.run(Thread.java:840)
  2> NOTE: reproduce with: gradlew test --tests TestFloatVectorSimilarityQuery.testApproximate -Dtests.seed=119135B1F0803918 -Dtests.locale=mer-Latn-KE -Dtests.timezone=Australia/South -Dtests.asserts=true -Dtests.file.encoding=UTF-8

benwtrent pushed a commit that referenced this pull request Dec 13, 2023
Discovered in #12921, and introduced in #12679 

The first issue is that we weren't advancing the `VectorScorer` [here](https://github.com/apache/lucene/blob/cf13a9295052288b748ed8f279f05ee26f3bfd5f/lucene/core/src/java/org/apache/lucene/search/AbstractVectorSimilarityQuery.java#L257-L262) -- so it was still un-positioned while trying to compute the similarity score

Earlier in the PR, the underlying delegate of the `FilteredDocIdSetIterator` was `scorer.iterator()` (see [here](https://github.com/apache/lucene/blob/cad565439be512ac6e95a698007b1fc971173f00/lucene/core/src/java/org/apache/lucene/search/AbstractVectorSimilarityQuery.java#L107)) -- so we didn't need to explicitly advance it

Later, we decided to maintain parity to `AbstractKnnVectorQuery` and introduce filtering in `AbstractVectorSimilarityQuery` (see [this commit](5096790)) to determine the `visitLimit` of approximate search -- after which the underlying iterator changed to the accepted docs (see [here](https://github.com/apache/lucene/blob/5096790f281e477c529a7c8311aeb353ccdffdeb/lucene/core/src/java/org/apache/lucene/search/AbstractVectorSimilarityQuery.java#L255)) and I missed advancing the `VectorScorer` explicitly..

After doing so, we no longer get the original `java.lang.ArrayIndexOutOfBoundsException` -- but the `BaseVectorSimilarityQueryTestCase#testApproximate` starts failing because it falls back to exact search, as the limit of the prefilter is met during graph search

Relaxed the parameters of the test to fix this (making the filter less restrictive, and trying to visit a fewer number of nodes so that approximate search completes without hitting its limit)

Sorry for missing this earlier!
@kaivalnp kaivalnp deleted the radius-based-vector-search branch January 2, 2024 15:04
@junqiu-lei

Hi, do we have any scheduled release date for this exciting feature?

@kaivalnp (Author)

This feature will ship with Lucene 9.10

I'm not sure when that will be released, though I see ~2-4 months between previous minor versions

@alessandrobenedetti (Contributor)

Hi @kaivalnp, thanks for this contribution!

My question is: why do we have two thresholds, one for graph traversal (used to decide whether it's worth exploring a candidate neighbour) and one for results (used to accept or reject a result)?
The lower the traversal threshold, the longer the search, but the more likely you are to find all vectors within the accepted threshold.
The higher the traversal threshold, the shorter the search, but the less likely you are to find all vectors within the accepted threshold.
Is it a matter of tuning? Did I get it right?

@kaivalnp (Author)

Yes @alessandrobenedetti, that is correct -- a result may be missed if nodes along its path from the entry node score below the result threshold (though it can still be reached as long as they score above a traversal threshold <= result threshold)

This traversal threshold exists purely as a tunable recall v/s latency parameter, somewhat like the ef parameter in KNN-based search, where we request ef >= k results from the graph and retain the best-scoring k for better recall at higher latency
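For illustration, a hedged usage sketch of the merged query (constructor argument order per my reading of the 9.10 API; `target` and `searcher` are assumed to exist):

```java
import org.apache.lucene.search.FloatVectorSimilarityQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

// Usage sketch: a looser traversalSimilarity explores more of the graph
// (higher recall, higher latency); resultSimilarity alone decides which
// hits are returned. `target` (float[]) and `searcher` are assumed.
float traversalSimilarity = 0.94f; // tune downwards for recall, like ef/fanout
float resultSimilarity = 0.95f;    // only hits scoring >= this are accepted
Query query =
    new FloatVectorSimilarityQuery("vector", target, traversalSimilarity, resultSimilarity);
TopDocs hits = searcher.search(query, 10);
```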
