
Speedup concurrent multi-segment HNSW graph search 2 #12962

Merged

Conversation

mayya-sharipova
Contributor

A second implementation of #12794 using Queue instead of MaxScoreAccumulator.

Speed up concurrent multi-segment HNSW graph search by exchanging the global top scores collected so far across segments. These global top scores set the minimum threshold that candidates must pass to be considered, which allows earlier stopping for segments that don't have good candidates.
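The mechanism can be sketched roughly as follows (a minimal illustration; the class and method names are mine, not the PR's API — the PR exchanges whole heaps of scores via a queue, not a single float):

```java
import java.util.concurrent.atomic.AtomicReference;

// Minimal sketch of the idea: segment searches share the minimum similarity
// of the global top-k collected so far, and skip candidates that can't beat it.
class GlobalMinSimSketch {
    // global lower bound: the k-th best similarity seen across all segments
    static final AtomicReference<Float> globalMinSim =
        new AtomicReference<>(Float.NEGATIVE_INFINITY);

    // a candidate is worth expanding only if it can still enter the global top-k
    static boolean isCompetitive(float candidateSim) {
        return candidateSim > globalMinSim.get();
    }

    // a segment publishes its local k-th best score; the bound only tightens
    static void publishLocalMin(float localKthBest) {
        globalMinSim.accumulateAndGet(localKthBest, Math::max);
    }
}
```

A segment whose best remaining candidates score below the shared bound can terminate its graph walk early, which is where the QPS gains below come from.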

@mayya-sharipova
Contributor Author

mayya-sharipova commented Dec 21, 2023

Experiments:
Available processors: 10; thread pool size: 16
luceneutil tool

Search:

Summary:
As seen from the results below, this PR's queue-based implementation:

  • Compared with the baseline, always delivers higher QPS: 1-300% faster depending on k and the number of vectors, but comes with a 3-11% drop in recall relative to the single-segment recall.
  • Compared with the implementation based on MaxScoreAccumulator, delivers 1-83% higher QPS, but comes with a slight (1-6%) drop in recall.

I think the increase in QPS is worth the slight drop in recall, and we should go with this implementation.


Details:

1M vectors of 100 dims

k=10, fanout=90

|  | Avg visited nodes | QPS | Recall |
|---|---|---|---|
| Baseline Single segment | 980 | 2336 | 0.739 |
| Baseline 3 segments concurrent | 2627 | 2857 | 0.772 |
| Candidate1_with_min_score | 2458 | 2816 | 0.766 |
| Candidate2_with_queue | 2477 | 2865 | 0.767 |

k=100, fanout=900

|  | Avg visited nodes | QPS | Recall |
|---|---|---|---|
| Baseline Single segment | 6722 | 430 | 0.921 |
| Baseline 3 segments concurrent | 17595 | 438 | 0.949 |
| Candidate1_with_min_score | 16386 | 464 | 0.947 |
| Candidate2_with_queue | 13483 | 469 | 0.940 |

Candidate2_with_queue VS Baseline:

  • ${\color{green}Recall}$ is better than single segment
  • ${\color{green} QPS}$ are 0.3-7% better than multiple segments

Candidate2_with_queue VS Candidate1_with_min_score:

  • ${\color{red}Recall}$ is slightly worse
  • ${\color{green} QPS}$ are 1-2% better

10M vectors of 100 dims

k=10, fanout=90

|  | Avg visited nodes | QPS | Recall |
|---|---|---|---|
| Baseline Single segment | 1081 | 1798 | 0.634 |
| Baseline 13 segments concurrent | 11869 | 1371 | 0.680 |
| Candidate1_with_min_score | 7660 | 1845 | 0.600 |
| Candidate2_with_queue | 7942 | 1776 | 0.606 |

k=100, fanout=900

|  | Avg visited nodes | QPS | Recall |
|---|---|---|---|
| Baseline Single segment | 7213 | 272 | 0.824 |
| Baseline 13 segments concurrent | 78069 | 239 | 0.894 |
| Candidate1_with_min_score | 50649 | 301 | 0.868 |
| Candidate2_with_queue | 33139 | 361 | 0.834 |

k=100, fanout=9900

|  | Avg visited nodes | QPS | Recall |
|---|---|---|---|
| Baseline Single segment | 55168 | 36 | 0.916 |
| Baseline 13 segments concurrent | 521851 | 29 | 0.962 |
| Candidate1_with_min_score | 331252 | 34 | 0.954 |
| Candidate2_with_queue | 193082 | 55 | 0.938 |

Candidate2_with_queue VS Baseline:

  • ${\color{black}Recall}$ is slightly worse or better than single segment
  • ${\color{green} QPS}$ are 30-52% better than multiple segments

Candidate2_with_queue VS Candidate1_with_min_score:

  • ${\color{red}Recall}$ is slightly worse
  • ${\color{green} QPS}$ is 20-60% better (but 6% worse on small k)

10M vectors of 768 dims

k=10, fanout=90

|  | Avg visited nodes | QPS | Recall |
|---|---|---|---|
| Baseline Single segment | 1095 | 974 | 0.542 |
| Baseline 19 segments concurrent | 18091 | 569 | 0.541 |
| Candidate1_with_min_score | 10700 | 824 | 0.474 |
| Candidate2_with_queue | 11007 | 783 | 0.479 |

k=100, fanout=900

|  | Avg visited nodes | QPS | Recall |
|---|---|---|---|
| Baseline Single segment | 7118 | 152 | 0.688 |
| Baseline 19 segments concurrent | 120453 | 88 | 0.732 |
| Candidate1_with_min_score | 65778 | 139 | 0.693 |
| Candidate2_with_queue | 44775 | 183 | 0.650 |

k=100, fanout=9900

|  | Avg visited nodes | QPS | Recall |
|---|---|---|---|
| Baseline Single segment | 59308 | 19 | 0.797 |
| Baseline 19 segments concurrent | 818268 | 11 | 0.847 |
| Candidate1_with_min_score | 463464 | 18 | 0.828 |
| Candidate2_with_queue | 251612 | 33 | 0.779 |

Candidate2_with_queue VS Baseline:

  • ${\color{red}Recall}$ is slightly worse than single segment
  • ${\color{green} QPS}$ are 38-300% better than multiple segments

Candidate2_with_queue VS Candidate1_with_min_score:

  • ${\color{red}Recall}$ is slightly worse (but better on small k)
  • ${\color{green} QPS}$ is 31-83% better (but is 5% worse on small k)

@tveasey @jimczi @benwtrent what do you think?

@mayya-sharipova mayya-sharipova marked this pull request as draft December 22, 2023 14:59
@tveasey

tveasey commented Jan 2, 2024

IMO we shouldn't focus too much on recall, since the greediness of globally non-competitive search lets us tune this. My main concern is whether contention on the queue updates causes a slowdown. That aside, I think the queue is strictly better.

The search might wind up visiting fewer vertices with min-score sharing, because earlier decisions might by chance give it transiently better bounds, but this should be low probability, particularly when the search has to visit many vertices. And indeed these are the cases where we see big wins from using a queue.

There appears to be some evidence of contention. This is suggested by looking at the runtime vs expected runtime from vertices visited, e.g.

| scenario | QPS(score) / QPS(queue) | Visited(queue) / Visited(score) |
|---|---|---|
| n=10M, dim=100, k=100, fo=900 | 0.83 | 0.65 |
| n=10M, dim=768, k=100, fo=900 | 0.76 | 0.68 |

Note that the direction of this effect is consistent, but the size is not (fo = 900 shows the largest effect). However, all that said, we still get significant wins in performance, so my vote would be to use the queue and work on strategies for reducing contention; we had various ideas for ways to achieve this.

@mayya-sharipova
Contributor Author

mayya-sharipova commented Jan 12, 2024

I have also done experiments using the Cohere dataset; as seen below:

  • for the 10M docs dataset, the speedups with the proposed approach are 1.7-2.5x.
  • for the 10M docs dataset with k+fanout = 1000, QPS is close to the QPS of a single segment, while recall is better.

Cohere/wikipedia-22-12-en-embeddings

1M vectors

k=10, fanout=90

|  | Avg visited nodes | QPS | Recall |
|---|---|---|---|
| Baseline Single segment | 2033 | 1212 | 0.880 |
| Baseline 8 segments concurrent | 13927 | 807 | 0.974 |
| Candidate2_with_queue | 12241 | 888 | 0.963 |

k=100, fanout=900

|  | Avg visited nodes | QPS | Recall |
|---|---|---|---|
| Baseline Single segment | 13083 | 167 | 0.964 |
| Baseline 8 segments concurrent | 81824 | 124 | 0.997 |
| Candidate2_with_queue | 38929 | 193 | 0.985 |

10M vectors

k=10, fanout=90

|  | Avg visited nodes | QPS | Recall |
|---|---|---|---|
| Baseline Single segment | 2189 | 984 | 0.929 |
| Baseline 19 segments concurrent | 37656 | 316 | 0.951 |
| Candidate2_with_queue | 22041 | 472 | 0.932 |

k=100, fanout=900

|  | Avg visited nodes | QPS | Recall |
|---|---|---|---|
| Baseline Single segment | 15040 | 122 | 0.950 |
| Baseline 19 segments concurrent | 229945 | 48 | 0.990 |
| Candidate2_with_queue | 78740 | 115 | 0.972 |

Contributor

@jimczi jimczi left a comment


Thanks @mayya-sharipova for all the testing. I left some comments.
Some of the recalls for the single segment baseline seem quite off (0.477?). Are you sure that there was no issue during the testing?

```java
if (kResultsCollected) {
  // as we've collected k results, we can start doing periodic updates with the global queue
  if (firstKResultsCollected || (visitedCount & interval) == 0) {
    cachedGlobalMinSim = globalSimilarityQueue.offer(updatesQueue.getHeap());
```
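One detail worth noting about the snippet above: `(visitedCount & interval) == 0` is a cheap power-of-two modulus, and it only means "every interval + 1 visited nodes" when `interval` has the form 2^n − 1, e.g. 255, 511, or 1023 (a small sketch; the class name is mine):

```java
// (visitedCount & interval) == 0 is equivalent to
// visitedCount % (interval + 1) == 0 when interval = 2^n - 1,
// because such an interval is a bit mask of the low n bits.
class IntervalMaskSketch {
    static boolean timeToSync(int visitedCount, int interval) {
        return (visitedCount & interval) == 0;
    }
}
```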
Contributor


This way of updating the global queue periodically doesn't bring much if k is close to the size of the queue. For instance, if k is 1000, we only save 24 updates with this strategy. That's fine, I guess, especially considering that the interval is not only there to save concurrent updates to the queue but also to ensure that we are far enough into the walk of the graph.

Contributor Author


@jimczi Sorry, I did not understand this comment; can you please clarify it?

What does it mean that k is close to the size of the queue? The globalSimilarityQueue is of size k.

Also, I am not clear on how you derived the value of 24.

Contributor


Sorry, I meant if k is close to the interval (1024). In that case, delaying the update to the global queue doesn't save much.


@tveasey tveasey Jan 16, 2024


There is a subtlety: note that firstKResultsCollected is only true the first time we visit k nodes, so this is saying only refresh every 1024 iterations thereafter. The refresh schedule is k, 1024, 2048, .... (The || threw me initially.)

As per my comment above, 1024 seems infrequent to me. (Of course you may have tested smaller values and determined this to be a good choice.) If we think it is risky sharing information too early, I would be inclined to share on the schedule max(k, c1), max(k, c1) + c2, max(k, c1) + 2 * c2, ... with c1 > c2 and decouple the concerns.

Also, there may be issues with using information too early, in which case minCompetitiveSimilarity would need a check that visitedCount > max(k, c1) before it starts modifying minSim.

In any case, I can see results being rather sensitive to these choices so if you haven't done a parameter exploration it might be worth trying a few choices.
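The proposed schedule can be written down as a tiny helper (c1 and c2 are hypothetical tuning constants from the comment above, not values in the PR):

```java
// Sync the n-th time (n = 0, 1, 2, ...) once max(k, c1) + n * c2 nodes have
// been visited: c1 delays the first share, c2 controls the refresh frequency,
// so the two concerns are decoupled (with c1 > c2).
class RefreshScheduleSketch {
    static int syncPoint(int n, int k, int c1, int c2) {
        return Math.max(k, c1) + n * c2;
    }
}
```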

@mayya-sharipova
Contributor Author

mayya-sharipova commented Jan 15, 2024

@jimczi Thanks for your feedback.

Some of the recalls for the single segment baseline seem quite off (0.477?). Are you sure that there was no issue during the testing?

Sorry, I made a mistake. I've updated the results.

I will work on more experiments to address your other comments.

@mayya-sharipova
Contributor Author

mayya-sharipova commented Jan 16, 2024

I have done more experiments with different interval values:

Cohere dataset of 768 dims:

1M vectors, k=10, fanout=90

| Interval | Avg visited nodes | QPS | Recall |
|---|---|---|---|
| 255 | 10283 | 959 | 0.946 |
| 511 | 11455 | 912 | 0.956 |
| 1023 | 12241 | 888 | 0.963 |

1M vectors, k=100, fanout=900

| Interval | Avg visited nodes | QPS | Recall |
|---|---|---|---|
| 255 | 28175 | 286 | 0.978 |
| 511 | 32410 | 231 | 0.982 |
| 1023 | 38929 | 193 | 0.985 |

10M vectors, k=10, fanout=90

| Interval | Avg visited nodes | QPS | Recall |
|---|---|---|---|
| 255 | 18325 | 567 | 0.906 |
| 511 | 20108 | 518 | 0.919 |
| 1023 | 22041 | 472 | 0.932 |

10M vectors, k=100, fanout=900

| Interval | Avg visited nodes | QPS | Recall |
|---|---|---|---|
| 255 | 58543 | 157 | 0.958 |
| 511 | 66518 | 134 | 0.966 |
| 1023 | 78740 | 115 | 0.972 |

Moving from an interval of 1023 to an interval of 255 gives us:

  • on average, a 28% decrease in visited nodes and a 28% gain in QPS, at the expense of lower recall
  • recall with the 255 interval is mostly higher than the single-segment recall (except the case of 10M docs with k=10)

Dataset of 100 dims:

1M vectors, k=10, fanout=90

| Interval | Avg visited nodes | QPS | Recall |
|---|---|---|---|
| 255 | 2244 | 3021 | 0.753 |
| 511 | 2398 | 2898 | 0.762 |
| 1023 | 2477 | 2865 | 0.767 |

1M vectors, k=100, fanout=900

| Interval | Avg visited nodes | QPS | Recall |
|---|---|---|---|
| 255 | 10600 | 566 | 0.927 |
| 511 | 11882 | 520 | 0.933 |
| 1023 | 13483 | 469 | 0.940 |

10M vectors, k=10, fanout=90

| Interval | Avg visited nodes | QPS | Recall |
|---|---|---|---|
| 255 | 6562 | 1897 | 0.580 |
| 511 | 7332 | 1773 | 0.595 |
| 1023 | 7942 | 1776 | 0.606 |

10M vectors, k=100, fanout=900

| Interval | Avg visited nodes | QPS | Recall |
|---|---|---|---|
| 255 | 26801 | 424 | 0.821 |
| 511 | 29534 | 399 | 0.826 |
| 1023 | 33139 | 361 | 0.834 |

Moving from an interval of 1023 to an interval of 255 gives us:

  • on average, a 20% decrease in visited nodes and a 13% gain in QPS, at the expense of lower recall
  • recall with the 255 interval is higher than the single-segment recall for 1M docs, and slightly lower for 10M docs.

This makes an interval of 255 a reasonable choice. @jimczi @tveasey What do you think?

@mayya-sharipova
Contributor Author

mayya-sharipova commented Jan 17, 2024

It's also important to check the order of execution. For instance, what happens if all segments are executed serially (rather than in parallel): does it change the recall?

@jimczi The results are below. For a sequential run, this change also brings significant speedups:

  • on average, a 2.2x decrease in visited nodes and a 2.2x gain in QPS, at the expense of lower recall; gains grow with higher k (for k=100, up to 3x)
  • we can usually achieve recall comparable to the single-segment recall (except the case of 10M docs, k=10)

Sequential run cohere: 768 dims; interval: 255

1M vectors

k=10, fanout=90

|  | Avg visited nodes | QPS | Recall |
|---|---|---|---|
| Baseline Single segment | 2033 | 1212 | 0.880 |
| Baseline 8 segments sequential | 13927 | 175 | 0.974 |
| Candidate_with_queue sequential | 9407 | 269 | 0.923 |

k=100, fanout=900

|  | Avg visited nodes | QPS | Recall |
|---|---|---|---|
| Baseline Single segment | 13083 | 167 | 0.964 |
| Baseline 8 segments sequential | 81824 | 28 | 0.997 |
| Candidate_with_queue sequential | 36851 | 62 | 0.983 |

10M vectors

k=10, fanout=90

|  | Avg visited nodes | QPS | Recall |
|---|---|---|---|
| Baseline Single segment | 2189 | 984 | 0.929 |
| Baseline 19 segments sequential | 37656 | 59 | 0.951 |
| Candidate_with_queue sequential | 17748 | 121 | 0.884 |

k=100, fanout=900

|  | Avg visited nodes | QPS | Recall |
|---|---|---|---|
| Baseline Single segment | 15040 | 122 | 0.950 |
| Baseline 19 segments sequential | 229945 | 9 | 0.990 |
| Candidate_with_queue sequential | 77910 | 27 | 0.964 |

@tveasey

tveasey commented Jan 17, 2024

This makes an interval of 255 a reasonable choice.

I agree. This looks better to me. One thing I would be intrigued to try is the slight change in schedules as per this. Particularly, what happens if we delay using the information in minCompetitiveSimilarity. However, these results are already very good and we could push testing these out to a follow on PR.

One last thing I think we should consider is exploring the variance we get in recall as a result of this change. Specifically, if we were to run with some random waits in the different segment searches what is the variation in the recalls we see?

The danger is that we get unlucky in the ordering of searches and sometimes prune the segment searches which contain the true nearest neighbours more aggressively. On average this shouldn't happen, but if we also see low variation in recall for the 1M realistic vectors in such a test case, it would be reassuring.

@benwtrent
Member

benwtrent commented Jan 26, 2024

I ran my own experiment, which showed some interesting and frustrating results.

My idea was that gathering at least k documents for every segment, no matter its size, doesn't make much sense. So I added the dynamic k setting from @jimczi, which adjusts how much of the graph we explore based on the total number of vectors in the shard vs. how many vectors are in the segment.

```java
float v = (float) Math.log(sumVectorCount / (double) leafSize);
float filterWeightValue = 1 / v;
```

I then require only a minimum of k * filterWeightValue results per segment to be explored before reaching out to the global queue for non-competitive scores.
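As a worked example of the scaling above (the class is illustrative; note also that a real implementation would need to guard the edge case where a segment holds nearly all vectors, since ln(1) = 0):

```java
// Scale k by 1 / ln(totalVectors / segmentVectors): a segment holding a
// smaller share of the shard's vectors explores proportionally fewer results
// before consulting the global queue for non-competitive scores.
class DynamicKSketch {
    static int scaledK(int k, long sumVectorCount, long leafSize) {
        float v = (float) Math.log(sumVectorCount / (double) leafSize);
        float filterWeightValue = 1 / v;
        return Math.max(1, Math.round(k * filterWeightValue));
    }
}
```

For k = 100, a 1M-vector segment in a 10M shard explores about 100 / ln(10) ≈ 43 results; a 100k-vector segment explores about 100 / ln(100) ≈ 22.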

Additionally, I adjusted the indexing to randomly commit() every 500 docs or so. I indexed the first 10M docs of cohere-wiki and used max-inner product over the raw float32 vectors. This revealed some graph-building problems; I will include those results as well.

Max Connections: 16
Beamwidth: 250

This resulted in some weird graphs (which we should fix separately), but around 47 segments of non-uniform sizes in various tiers.

Graph

[plot image omitted]

The python code & raw data used:

```python
import matplotlib.pyplot as plt

# Add pareto lines: recall vs. average visited nodes for each strategy
plt.plot([1508, 1784, 1926, 2190, 2822], [0.856, 0.881, 0.886, 0.887, 0.887], marker='o', label='baseline_single')
plt.plot([3297, 3562, 3698, 5797, 14196], [0.840, 0.861, 0.866, 0.866, 0.866], marker='o', label='baseline_multi')
plt.plot([3050, 3345, 3503, 3739, 3963], [0.520, 0.846, 0.858, 0.866, 0.866], marker='o', label='candidate_dynamic_k')
plt.plot([3297, 3562, 3698, 5797, 14191], [0.840, 0.861, 0.866, 0.866, 0.866], marker='o', label='candidate_static_k')
# Add labels and title
plt.xlabel('Avg Visited')
plt.ylabel('Recall')
plt.title("Recall vs. Avg Visited for corpus-wiki-cohere.fvec 10M")
plt.legend()
plt.show()
```

Graph stats over the multiple segments. This shows the connectedness of layer 0 and the percentiles of the connectedness.

Many, many segments are extremely sparse, including some larger segments, which is not very nice :(

Leaf 0 has 5 layers
Leaf 0 has 1728501 documents
Graph size=1728501, Fanout min=1, mean=12.26, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1  16  22  28  32  32
Leaf 1 has 5 layers
Leaf 1 has 1743500 documents
Graph size=1743500, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 2 has 5 layers
Leaf 2 has 508000 documents
Graph size=508000, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 3 has 5 layers
Leaf 3 has 1147500 documents
Graph size=1147500, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 4 has 5 layers
Leaf 4 has 1196000 documents
Graph size=1196000, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 5 has 5 layers
Leaf 5 has 506500 documents
Graph size=506500, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 6 has 4 layers
Leaf 6 has 200500 documents
Graph size=200500, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 7 has 5 layers
Leaf 7 has 521500 documents
Graph size=521500, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 8 has 5 layers
Leaf 8 has 1088000 documents
Graph size=1088000, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 9 has 5 layers
Leaf 9 has 480500 documents
Graph size=480500, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 10 has 4 layers
Leaf 10 has 176000 documents
Graph size=176000, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 11 has 4 layers
Leaf 11 has 157500 documents
Graph size=157500, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 12 has 4 layers
Leaf 12 has 51500 documents
Graph size=51500, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 13 has 3 layers
Leaf 13 has 39000 documents
Graph size=39000, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 14 has 3 layers
Leaf 14 has 21500 documents
Graph size=21500, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 15 has 3 layers
Leaf 15 has 50000 documents
Graph size=50000, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 16 has 3 layers
Leaf 16 has 19000 documents
Graph size=19000, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 17 has 4 layers
Leaf 17 has 95500 documents
Graph size=95500, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 18 has 3 layers
Leaf 18 has 17000 documents
Graph size=17000, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 19 has 3 layers
Leaf 19 has 16500 documents
Graph size=16500, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 20 has 3 layers
Leaf 20 has 17500 documents
Graph size=17500, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 21 has 3 layers
Leaf 21 has 53000 documents
Graph size=53000, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 22 has 4 layers
Leaf 22 has 68500 documents
Graph size=68500, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 23 has 3 layers
Leaf 23 has 3500 documents
Graph size=3500, Fanout min=1, mean=1.01, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 24 has 3 layers
Leaf 24 has 10000 documents
Graph size=10000, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 25 has 3 layers
Leaf 25 has 15000 documents
Graph size=15000, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 26 has 3 layers
Leaf 26 has 11500 documents
Graph size=11500, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 27 has 3 layers
Leaf 27 has 10500 documents
Graph size=10500, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 28 has 3 layers
Leaf 28 has 5000 documents
Graph size=5000, Fanout min=1, mean=1.01, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 29 has 3 layers
Leaf 29 has 21500 documents
Graph size=21500, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 30 has 3 layers
Leaf 30 has 1000 documents
Graph size=1000, Fanout min=1, mean=1.03, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 31 has 3 layers
Leaf 31 has 1000 documents
Graph size=1000, Fanout min=1, mean=1.03, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 32 has 3 layers
Leaf 32 has 1000 documents
Graph size=1000, Fanout min=1, mean=1.03, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 33 has 3 layers
Leaf 33 has 1000 documents
Graph size=1000, Fanout min=1, mean=1.03, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 34 has 3 layers
Leaf 34 has 1500 documents
Graph size=1500, Fanout min=1, mean=1.02, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 35 has 3 layers
Leaf 35 has 1000 documents
Graph size=1000, Fanout min=1, mean=1.03, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 36 has 3 layers
Leaf 36 has 1000 documents
Graph size=1000, Fanout min=1, mean=1.03, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 37 has 3 layers
Leaf 37 has 500 documents
Graph size=500, Fanout min=1, mean=1.06, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 38 has 3 layers
Leaf 38 has 500 documents
Graph size=500, Fanout min=1, mean=1.06, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 39 has 3 layers
Leaf 39 has 1000 documents
Graph size=1000, Fanout min=1, mean=1.03, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 40 has 3 layers
Leaf 40 has 4000 documents
Graph size=4000, Fanout min=1, mean=1.01, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 41 has 3 layers
Leaf 41 has 1500 documents
Graph size=1500, Fanout min=1, mean=1.02, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 42 has 3 layers
Leaf 42 has 1000 documents
Graph size=1000, Fanout min=1, mean=1.03, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 43 has 3 layers
Leaf 43 has 1500 documents
Graph size=1500, Fanout min=1, mean=1.02, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 44 has 3 layers
Leaf 44 has 500 documents
Graph size=500, Fanout min=1, mean=1.06, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 45 has 3 layers
Leaf 45 has 500 documents
Graph size=500, Fanout min=1, mean=1.06, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Leaf 46 has 3 layers
Leaf 46 has 1499 documents
Graph size=1499, Fanout min=1, mean=1.02, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32

@benwtrent
Member

OK, I reran my tests, but over the first 500k docs. Randomly committing & merging with a 500MB buffer. The numbers are much saner. I wonder if I have a bug in my data and start sending garbage once I get above 1M vectors...

This shows that having the dynamic k based on segment size brings us closer to the baseline single-segment number of visited nodes. We are still at a different order of magnitude, but both static and dynamic k with the queue are much better (2-3x) than the baseline multi-segment search.

```python
import matplotlib.pyplot as plt

plt.plot([985, 1225, 1398], [0.993, 1.0, 1.0], marker='o', label='baseline_single')
plt.plot([37505, 57693, 72611], [1.0, 1.0, 1.0], marker='o', label='baseline_multi')
plt.plot([12099, 18411, 24452], [0.999, 1.0, 1.0], marker='o', label='candidate_static_k')
plt.plot([10078, 13369, 16769], [0.992, 0.999, 1.0], marker='o', label='candidate_dynamic_k')
plt.legend()
plt.show()
```

[plot image omitted]

@benwtrent
Member

benwtrent commented Jan 31, 2024

I fixed my data and ran with 1.5M cohere:

static_k is this PR; dynamic_k is this PR plus scaling the explored k by:

```java
float v = (float) Math.log(sumVectorCount / (double) leafSize);
float filterWeightValue = 1 / v;
```

Scaling the k is another nice incremental change. Up and to the left is best: we consistently get better recall while visiting fewer nodes.

```python
import matplotlib.pyplot as plt

plt.plot([2304, 2485, 3189, 4028, 5610, 16165], [0.875, 0.886, 0.916, 0.938, 0.960, 0.992], marker='o', label='baseline_single')
plt.plot([43015, 45946, 56864, 69149, 90772, 210727], [0.980, 0.982, 0.989, 0.992, 0.996, 1.000], marker='o', label='baseline_multi')
plt.plot([22706, 23749, 27142, 30893, 37162, 76032], [0.959, 0.962, 0.970, 0.976, 0.983, 0.996], marker='o', label='candidate_static_k')
plt.plot([16099, 17183, 20788, 24809, 31334, 64921], [0.937, 0.945, 0.962, 0.973, 0.983, 0.996], marker='o', label='candidate_dynamic_k')
plt.legend()
plt.show()
```

[plot image omitted]

EDIT: Here is 6.5M English Cohere (I randomly committed every 500 docs and ended up with 20-30 segments, which is hilarious): baseline multi-segment visits about 24x more docs than baseline single-segment, while dynamic-k visits only about 7x more.

```python
import matplotlib.pyplot as plt

plt.plot([2362, 2557, 3291, 4193, 5925, 18114], [0.845, 0.856, 0.886, 0.910, 0.935, 0.979], marker='o', label='baseline_single')
plt.plot([58236, 62131, 76589, 92736, 120898, 272212], [0.942, 0.946, 0.960, 0.970, 0.980, 0.994], marker='o', label='baseline_multi')
plt.plot([24758, 26242, 31826, 37229, 46271, 103811], [0.931, 0.936, 0.950, 0.960, 0.970, 0.988], marker='o', label='candidate_static_k')
plt.plot([16367, 17262, 20630, 24232, 31281, 70344], [0.897, 0.904, 0.926, 0.943, 0.959, 0.987], marker='o', label='candidate_dynamic_k')
plt.legend()
plt.show()
```

[plot image omitted]

@mayya-sharipova mayya-sharipova force-pushed the multi-search-graph-with-queue branch 2 times, most recently from 12ccd3a to 3ed93df Compare January 31, 2024 16:26
@mayya-sharipova
Contributor Author

mayya-sharipova commented Jan 31, 2024

@benwtrent Thanks for running additional tests. It looks like running with dynamic k can speed up searches, but I think we need to run more experiments on smaller-dims datasets as well. How about we leave this for the follow-up?

@jimczi @tveasey I've addressed your comments. Are we ok to merge as it is now?


The following is left for the follow-up work:

  1. Dynamic k based on segment size. Currently we gather at least k documents for every segment before starting to synchronize with the global queue. We should adjust k based on the segment size.
  2. Improve the synchronization schedule with the global queue. Currently we synchronize with the global queue every 255 visited docs. Can we improve the schedule as per this comment?

@benwtrent
Member

but I think we need to run more experiments on smaller dims datasets as well, how about we leave this for the follow up?

I am 100% fine with this. It was a crazy idea and it only gives us an incremental change. This PR gives a much better improvement as it is already.

@tveasey

tveasey commented Jan 31, 2024

@jimczi @tveasey I've addressed your comments. Are we ok to merge as it is now?

I'm happy

@mayya-sharipova
Contributor Author

mayya-sharipova commented Jan 31, 2024

I've re-run the tests with the latest changes on this PR (candidate) and the main branch (baseline):

I have also done experiments using the Cohere dataset; as seen below:

  • for the 10M docs dataset, the speedups with the proposed approach are 1.7-2.5x.
  • for the 10M docs dataset with k+fanout = 1000, QPS is close to the QPS of a single segment, while recall is better.

Cohere/wikipedia-22-12-en-embeddings

1M vectors

k=10, fanout=90

|  | Avg visited nodes | QPS | Recall |
|---|---|---|---|
| Baseline Single segment |  |  | 0.880 |
| Baseline 8 segments concurrent | 13927 | 815 | 0.974 |
| Candidate2_with_queue | 12670 | 859 | 0.964 |

k=100, fanout=900

|  | Avg visited nodes | QPS | Recall |
|---|---|---|---|
| Baseline Single segment |  |  | 0.964 |
| Baseline 8 segments concurrent | 81824 | 126 | 0.997 |
| Candidate2_with_queue | 62085 | 165 | 0.995 |

10M vectors

k=10, fanout=90

|  | Avg visited nodes | QPS | Recall |
|---|---|---|---|
| Baseline Single segment |  |  | 0.929 |
| Baseline 19 segments concurrent | 37656 | 271 | 0.951 |
| Candidate2_with_queue | 21921 | 443 | 0.927 |

k=100, fanout=900

|  | Avg visited nodes | QPS | Recall |
|---|---|---|---|
| Baseline Single segment |  |  | 0.950 |
| Baseline 19 segments concurrent | 229945 | 44 | 0.990 |
| Candidate2_with_queue | 101970 | 91 | 0.984 |

Contributor

@jimczi jimczi left a comment


Thanks @mayya-sharipova, it's getting close. I left more comments. Can you set the PR to ready for review in case others want to chime in?
@benwtrent can you look at the changes in the knn query with the KnnCollectorManager?

```java
@@ -79,24 +83,32 @@ public Query rewrite(IndexSearcher indexSearcher) throws IOException {
    filterWeight = null;
  }

  final boolean supportsConcurrency = indexSearcher.getSlices().length > 1;
```
Contributor


A slice can have multiple segments, so this should rather check the reader's leaves here; maybe call the boolean isMultiSegments?
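The distinction can be illustrated with a toy model (hypothetical counts, not Lucene API calls): a reader whose three segments are all searched in one slice is still multi-segment even though it has a single slice.

```java
// Toy model of the review point: slices group segments for concurrent
// execution, so counting slices under-detects multi-segment readers.
class SliceVsSegmentSketch {
    static boolean supportsConcurrency(int sliceCount) { return sliceCount > 1; }
    static boolean isMultiSegments(int leafCount) { return leafCount > 1; }
}
```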

Contributor Author


I copied how we do this for top docs collectors.

But you are right: because we see speedups even in a sequential run, your suggestion makes sense.

Member


because we see speedups even in sequential run

Do you mean speedups without concurrency, via sharing information? That is interesting; I wonder why that is.


It is because later graph searches are faster: they use information gathered in earlier ones. (Note, though, that the speedup is only relative to searching multiple graphs without sharing information, not relative to searching a single graph.)

Contributor

For the dynamic-k follow-up we can also look at the order of the segments. For instance, if we start with the largest first, we can be more aggressive on smaller segments.

@benwtrent benwtrent self-requested a review February 1, 2024 17:38
@mayya-sharipova mayya-sharipova marked this pull request as ready for review February 1, 2024 18:43

```java
   * @param visitedLimit the maximum number of nodes that the search is allowed to visit
   * @param parentBitSet the parent bitset, {@code null} if not applicable
   */
  public abstract C newCollector(int visitedLimit, BitSet parentBitSet) throws IOException;
```
Member

Suggested change:

```diff
-public abstract C newCollector(int visitedLimit, BitSet parentBitSet) throws IOException;
+public abstract C newCollector(int visitedLimit, LeafReaderContext context) throws IOException;
```

Also, I am not even sure visitedLimit should be there. It seems like something the manager should already know about (in this instance it's static), and we just need to know about the context (the context is for DiversifyingChildrenFloatKnnVectorQuery so that its collector manager can create the BitSet parentBitSet from its encapsulated BitSetProducer).

I also think this method could return null if collection is not applicable for that given leaf context.
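The suggested shape could look roughly like this (a sketch with stand-in types: LeafCtx and the String "collector" replace Lucene's LeafReaderContext and KnnCollector, which are not defined here):

```java
// Stand-in for LeafReaderContext: just exposes a segment ordinal.
interface LeafCtx {
    int ord();
}

@FunctionalInterface
interface CollectorManager<C> {
    // The manager owns the static configuration (k, parent filter, ...);
    // per-leaf state comes from the context. May return null when
    // collection is not applicable for the given leaf.
    C newCollector(int visitedLimit, LeafCtx context);
}

public class ManagerSketch {
    public static void main(String[] args) {
        CollectorManager<String> manager =
            (visitedLimit, ctx) -> ctx.ord() < 0 ? null : "collector-for-leaf-" + ctx.ord();
        System.out.println(manager.newCollector(1000, () -> 2));
    }
}
```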

Contributor Author

Addressed, except for visitedLimit: it changes per segment, because the cost of applying the filter differs per segment context.

```diff
@@ -123,7 +124,16 @@ protected TopDocs exactSearch(LeafReaderContext context, DocIdSetIterator accept
   }
 
   @Override
-  protected TopDocs approximateSearch(LeafReaderContext context, Bits acceptDocs, int visitedLimit)
+  protected KnnCollectorManager<?> getKnnCollectorManager(int k, boolean supportsConcurrency) {
+    return new DiversifyingNearestChildrenKnnCollectorManager(k);
```
Member

If we adjust the interface, this manager could know about BitSetProducer parentsFilter; and abstract that away from this query.

@mayya-sharipova
Contributor Author

mayya-sharipova commented Feb 2, 2024

@benwtrent @jimczi Thanks for your great feedback. With the latest changes I tried to address your comments; please check if we need something else.

- keep original LeafReader::searchNearestVectors functions
- Make KnnCollectorManager an interface
- Add MultiLeafTopKnnCollector
```diff
-import org.apache.lucene.search.TopDocs;
-import org.apache.lucene.search.TopDocsCollector;
-import org.apache.lucene.search.TotalHits;
+import org.apache.lucene.search.*;
```
Member

Don't think we want * here

Member

@benwtrent benwtrent left a comment

One minor thing, but this looks much better.

@mayya-sharipova mayya-sharipova merged commit d095ed0 into apache:main Feb 6, 2024
4 checks passed
@mayya-sharipova mayya-sharipova deleted the multi-search-graph-with-queue branch February 6, 2024 14:16
mayya-sharipova added a commit to mayya-sharipova/lucene that referenced this pull request Feb 7, 2024
Speedup concurrent multi-segment HNSW graph search by exchanging the global top candidates collected so far across segments. These global top candidates set the minimum threshold that new candidates need to pass to be considered. This allows earlier stopping for segments that don't have good candidates.
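The mechanism described above can be sketched with a toy version in plain Java (this is not Lucene's actual MultiLeafTopKnnCollector; the class and method names are illustrative): segments publish their best scores into a shared bounded min-heap, and later (or concurrent) segment searches skip candidates that cannot beat the current global k-th best score.

```java
import java.util.PriorityQueue;

// Toy shared structure holding the global top-k scores seen across segments.
class GlobalTopK {
    private final int k;
    private final PriorityQueue<Float> heap = new PriorityQueue<>(); // min-heap of the top-k scores

    GlobalTopK(int k) { this.k = k; }

    // Minimum score a new candidate must exceed to enter the global top-k.
    synchronized float minCompetitiveScore() {
        return heap.size() < k ? Float.NEGATIVE_INFINITY : heap.peek();
    }

    synchronized void offer(float score) {
        if (heap.size() < k) {
            heap.add(score);
        } else if (score > heap.peek()) {
            heap.poll();
            heap.add(score);
        }
    }
}

public class SharedThresholdDemo {
    // Toy per-segment "search": visit candidates, skipping non-competitive ones.
    static int searchSegment(float[] candidateScores, GlobalTopK global) {
        int visited = 0;
        for (float score : candidateScores) {
            if (score <= global.minCompetitiveScore()) {
                continue; // early skip: cannot enter the global top-k
            }
            visited++;
            global.offer(score);
        }
        return visited;
    }

    public static void main(String[] args) {
        GlobalTopK global = new GlobalTopK(2);
        int v1 = searchSegment(new float[] {0.9f, 0.8f, 0.1f}, global);
        // The second segment visits fewer nodes thanks to the scores
        // already gathered while searching the first one.
        int v2 = searchSegment(new float[] {0.5f, 0.3f, 0.95f}, global);
        System.out.println(v1 + " " + v2);
    }
}
```

A real HNSW search walks a graph rather than a flat list, but the pruning principle is the same: the shared threshold lets a segment without good candidates stop early.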
@jpountz
Contributor

jpountz commented Feb 23, 2024

FYI I pushed an annotation, this yielded a major speedup: http://people.apache.org/~mikemccand/lucenebench/VectorSearch.html.

@benwtrent
Member

So cool. We are now faster at 768 dimensions than we were on 100 dimensions.

⚡ ⚡ ⚡ ⚡ ⚡

@uschindler
Contributor

Is it called HNSW or HNWS? I just noticed the title of this PR and differing changes entries.

@msokolov
Contributor

HNSW stands for "hierarchical navigable small world" - that should make it easy to remember :)

@msokolov msokolov changed the title Speedup concurrent multi-segment HNWS graph search 2 Speedup concurrent multi-segment HNSW graph search 2 Feb 25, 2024