Optimize top-k counting for approximate queries #160

Closed
alexklibisz opened this issue Sep 20, 2020 · 36 comments
@alexklibisz
Owner

Currently the biggest bottleneck at query time is the countHits method in MatchHashesAndScoreQuery. This counts the number of times each doc in the segment matches one of the query vector's hashes. https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-lucene/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L54-L73

AFAIK, Lucene is generally optimized for a small number of terms (e.g. the words in a search query). Elastiknn, on the other hand, can require retrieving doc IDs for tens or hundreds of terms (the hashes of a query vector).

It seems the main thing worth exploring is using a different PostingsFormat, or potentially implementing a custom one. Maybe there's a way to optimize the storage/access pattern for checking a larger number of terms? Maybe there's a way to simply wrap an existing postings format and offer some helpers that cache the expensive method calls?

Some specific numbers:

When running the MatchHashesAndScoreQueryPerformanceSuite, with 100k indexed vectors and 5k search vectors, the VisualVM sampler reports spending ~92% of its time in the countHits method:

[screenshot: VisualVM sampler output]

When running the ContinuousBenchmark with the SIFT dataset (1M indexed vectors, 1k search vectors), the VisualVM sampler reports spending ~36% of the search thread time in countHits:

[screenshot: VisualVM sampler output]

@mikemccand

Hmm, why is Self time so high in your profiler output? What is countHits actually doing?

Is there any way to cluster multiple hashes into a single term during indexing? I.e. do certain subsets of hashes frequently co-occur at search time?

@alexklibisz
Owner Author

Hi @mikemccand, thanks for the reply. As a side note, I've found many of your articles very helpful!

Hmm, why is Self time so high in your profiler output? What is countHits actually doing?

To clarify a bit, the first screenshot is of a benchmark that uses Lucene by itself, i.e. no Elasticsearch involved. The second is from inside the search thread in the Elasticsearch JVM. The self time in the first case is proportionally higher, but in both cases it's > 50% of the method runtime.

The countHits method is here: https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-lucene/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L54-L73

The rough outline of countHits is:

  • Get an iterator for the terms at the field. There is a single field where the terms (hashes of a vector) are indexed.
  • Instantiate an ArrayHitCounter. This is basically allocating a short[] with one entry for every doc in the segment. In this benchmark there are 1M docs in the segment. Each time a doc matches one of the hashes, increment the value at its docId in that array. This is in lieu of using a hashmap (docId -> count), which I found to be at least twice as slow as using an array, even when using primitive collections. More on that here: More efficient counter for query hits #156
  • Iterate over the query vector hashes.
  • For each hash, seek it in the terms iterator, retrieve its postings (docs that match that term).
  • For each posting, call .increment(docId), which increments the array-based counter as described above.
  • After those two loops, return the counter. (A rough sketch of this follows below.)
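
A minimal sketch of that outline, with approximate names (the real method is in the linked MatchHashesAndScoreQuery; HitCounter/ArrayHitCounter are the project's counter classes):

```java
import java.io.IOException;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

// Simplified sketch of the outline above. `field` holds the indexed hashes and
// `queryHashes` are the query vector's hashes.
private HitCounter countHits(LeafReader reader) throws IOException {
  HitCounter counter = new ArrayHitCounter(reader.maxDoc()); // one slot per doc in the segment
  Terms terms = reader.terms(field);
  if (terms == null) return counter;                         // no hashes indexed in this segment
  TermsEnum termsEnum = terms.iterator();
  PostingsEnum docs = null;
  for (BytesRef hash : queryHashes) {                        // pre-sorted in unicode order
    if (termsEnum.seekExact(hash)) {
      docs = termsEnum.postings(docs, PostingsEnum.NONE);
      // Note: iterating up to docs.cost() mirrors the code at this point in the thread;
      // see the discussion below about iterating until NO_MORE_DOCS instead.
      for (long i = 0; i < docs.cost(); i++) {
        counter.increment(docs.nextDoc());
      }
    }
  }
  return counter;
}
```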

Perhaps it's taking a while to allocate the array-backed counter? I don't have a good sense for whether that should actually take such a significant amount of time. Even if it does, the only alternatives I can think of are even slower. I can try it out later with a dummy counter implementation that just throws away the counts to get a lower bound. Would appreciate any more input you have about this and #156!

Is there any way to cluster multiple hashes into a single term during indexing? I.e. do certain subsets of hashes frequently co-occur at search time?

That would be a smart optimization, but I'm not sure there's a way to do that without having the entire dataset up-front, which doesn't fit the model of the plugin (index/update/delete vectors just like you would index/update/delete traditional docs).

@alexklibisz
Owner Author

Some more notes-to-self for when I get back to this:

Here are the VisualVM hotspots from running the SIFT benchmark (1M stored vectors, 10k total queries) on a local ES JVM using the ArrayHitCounter:

[screenshot: VisualVM hotspots with the ArrayHitCounter]

Here are the VisualVM hotspots from the same configuration, but using a dummy counter that just returns the same count and maintains no internal state (i.e. no memory allocated/updated for counting):

[screenshot: VisualVM hotspots with the dummy counter]

I made sure to start the sampling right when it started running queries and stop right after it finished queries. Otherwise the percentages get skewed as the search thread continues running.

@mikemccand

Hi @mikemccand, thanks for the reply. As a side note, I've found many of your articles very helpful!

Thanks, I am glad to hear that :)

The countHits method looks fine, though you should not iterate to docs.cost() but rather iterate until you see NO_MORE_DOCS returned. In general the cost method is not guaranteed to be a precise count of how many hits this PostingsEnum will return, though for the default Codec that appears to be the case for now.
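
For illustration, the suggested pattern would look roughly like this (a sketch against the Lucene API, not the project's code; DocIdSetIterator is org.apache.lucene.search.DocIdSetIterator):

```java
// Iterate the postings until exhaustion rather than trusting cost() to be an exact hit count.
docs = termsEnum.postings(docs, PostingsEnum.NONE);
for (int docId = docs.nextDoc();
     docId != DocIdSetIterator.NO_MORE_DOCS;
     docId = docs.nextDoc()) {
  counter.increment(docId);
}
```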

You are reusing your TermsEnum and PostingsEnum which is great. You might want to pre-sort the Hash in unicode order ... it might give a tiny improvement since the TermsEnum can share more internal state on each seekExact.

Your countHits method is very similar to what Lucene's MultiTermQuery.rewrite does when there are many terms for the non-scoring case, except that method accumulates into a Lucene bitset rather than int[] since the Lucene Query is just concerned with knowing which documents matched, not how many terms matched each hit.

I can try it out later with a dummy counter implementation that just throws away the counts to get a lower bound.

Measuring just the iteration and no count incrementing is a good idea. Oh I see, you just concurrently did that, as I am here commenting, heh. Oddly, it looks like Self time increased with the dummy counter (28.7% to 31.4%)? And the overall time dropped a bit ... this is what I would expect: Lucene is working hard to iterate through all those matched docids.

That would be a smart optimization, but I'm not sure there's a way to do that without having the entire dataset up-front, which doesn't fit the model of the plugin (index/update/delete vectors just like you would index/update/delete traditional docs).

Yeah, it is not simple to implement, and would require that you have some way to tell which hashes frequently co-occur during indexing. It would also increase your index size, but then make queries faster. You might also index larger and larger co-occurrence sets, so that at search time you could "factor out" the largest set(s) that can apply to that query.

How large is each vector hash?

@alexklibisz
Owner Author

Thanks again for digging into this a bit.

The countHits method looks fine, though you should not iterate to docs.cost()

Good to know, I'll fix that.

You might want to pre-sort the Hash in unicode order ... it might give a tiny improvement since the TermsEnum can share more internal state on each seekExact.

I might already be doing this in the query constructor, here. The comparator just uses Java's Arrays.compareUnsigned, which, IIRC, is what Lucene uses for BytesRefs. Is that what you are suggesting or something different?

Your countHits method is very similar to what Lucene's MultiTermQuery.rewrite does when there are many terms for the non-scoring case, except that method accumulates into a Lucene bitset rather than int[] since the Lucene Query is just concerned with knowing which documents matched, not how many terms matched each hit.

I modeled the implementation after the TermsInSetQuery, which, similarly, only tracks whether a doc contained a matching term, not the actual count. The main difference between this custom query and prior art in Lucene, and probably the main bottleneck, is that I actually need to know how many matches there are in each doc. If you have any suggestions/tricks for very fast counting on the JVM, I'm all ears. As mentioned before, hashmaps, even from primitive collection libraries, seem to be about 2x slower.

How large is each vector hash?

With this benchmark config, there are 100 hashes per vector, 20 bytes each. I'd say that's a pretty typical setup.

@alexklibisz
Owner Author

Maybe there's a clever stopping criterion that avoids visiting all of the terms? I started reading about MaxScore and WAND scoring. Or maybe that's a dead end here?

@alexklibisz
Owner Author

Put together a working early-stopping solution today. Roughly the idea is:

  • Compute the total number of term hits up front by iterating over the terms enum once.
  • Iterate over the terms and docs for each term again.
  • Use a counter to track the hits for each doc and a heap to track the top k hit counts.
  • For every doc, decrement the total number of term hits by the number of hits in that doc.
  • Stop iterating when the kth largest count is larger than the remaining number of term hits. Even if all of the remaining hits were for a single doc, it still couldn't exceed the kth largest.

Seems to work on the benchmark discussed previously. The times are roughly equivalent, but there's some obvious room for optimization. The method spends more time managing the heap than it does accessing Lucene postings. I'm using a standard PriorityQueue<Integer>, and I bet one written with primitives would be faster. About 10% of the total runtime is spent computing the kth largest hit from the array-backed counter; I just have to change some internal classes and methods so I can use the kth largest from the heap instead of re-computing it.
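
Roughly, the early-stopping bookkeeping looks like this (a sketch of the idea above with illustrative names, not the final code):

```java
import java.util.PriorityQueue;

// Sketch of the bookkeeping described above.
final class EarlyStopping {
  private final int k;
  private long remainingTermHits;                                          // sum of docFreq over the query terms, computed up front
  private final PriorityQueue<Integer> topCounts = new PriorityQueue<>();  // min-heap of the top-k counts

  EarlyStopping(int k, long totalTermHits) {
    this.k = k;
    this.remainingTermHits = totalTermHits;
  }

  // Call once per posting with the doc's count after incrementing it.
  // Returns true when iteration can stop early.
  boolean offerAndCheck(int docCount) {
    remainingTermHits--;
    if (topCounts.size() < k) {
      topCounts.offer(docCount);
    } else if (docCount > topCounts.peek()) {
      topCounts.poll();
      topCounts.offer(docCount);
    }
    // Even if every remaining hit landed on a single doc, it could not exceed the kth largest.
    return topCounts.size() == k && topCounts.peek() > remainingTermHits;
  }
}
```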

@mikemccand

You might want to pre-sort the Hash in unicode order ... it might give a tiny improvement since the TermsEnum can share more internal state on each seekExact.

I might already be doing this in the query constructor, here. The comparator just uses Java's Arrays.compareUnsigned, which, IIRC, is what Lucene uses for BytesRefs. Is that what you are suggesting or something different?

compareUnsigned is perfect and matches Lucene's term sort order (Unicode sort order on utf-8 encoded byte[]).

Put together a working early-stopping solution today. Roughly the idea is:

That sounds promising! Do you take the docFreq (or maybe totalTermFreq) of terms into account? E.g., collecting all term + docFreq from the TermsEnum, then sort them in increasing docFreq order? That way your early termination would save the most work (by skipping the terms that had many docs)? Though, maybe your terms (vector hashes) are somewhat uniform and their docFreq do not vary much.
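
For illustration, that could look roughly like this (a sketch, assuming the query hashes are available as BytesRefs and a TermsEnum for the field is already open):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Look up each query hash's docFreq, then visit the terms from rarest to most common,
// so an early-termination pass does the cheap terms first and can skip the longest postings lists.
Map<BytesRef, Integer> docFreqs = new HashMap<>();
for (BytesRef hash : queryHashes) {
  if (termsEnum.seekExact(hash)) {
    docFreqs.put(hash, termsEnum.docFreq());
  }
}
List<BytesRef> byIncreasingDocFreq = new ArrayList<>(docFreqs.keySet());
byIncreasingDocFreq.sort(Comparator.comparingInt(docFreqs::get));
// Now seek and iterate postings in byIncreasingDocFreq order (no longer unicode order).
```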

Be sure to test on real vectors. Testing on random data leads to random conclusions!

@alexklibisz
Owner Author

That sounds promising! Do you take the docFreq (or maybe totalTermFreq) of terms into account? E.g., collecting all term + docFreq from the TermsEnum, then sort them in increasing docFreq order? That way your early termination would save the most work (by skipping the terms that had many docs)? Though, maybe your terms (vector hashes) are somewhat uniform and their docFreq do not vary much.

I thought about this but my understanding was that Lucene is optimized to seek over the terms in sorted order? Maybe the cost of unsorted seeks over terms is trivial compared to other things here?

@alexklibisz
Owner Author

alexklibisz commented Oct 1, 2020

I'm afraid the early-stopping method as I described it isn't going to work. Specifically, it's pretty easy to find a case where a single vector matches for multiple consecutive hash terms and fills up the heap with its incrementing match counts. You end up determining that the cutoff for match counts is an old match count for that doc and you early-stop, but then there's only one doc that exceeds that cutoff. Replacing a doc's previous count in the heap might solve this?
Edit: I got it to work by clearing the heap after each term. If the smallest count in the heap for a specific term exceeds the remaining possible counts, then you early stop. But this ends up still being slower than just counting all the docs. Setting this aside for a few days.

@mikemccand

That sounds promising! Do you take the docFreq (or maybe totalTermFreq) of terms into account? E.g., collecting all term + docFreq from the TermsEnum, then sort them in increasing docFreq order? That way your early termination would save the most work (by skipping the terms that had many docs)? Though, maybe your terms (vector hashes) are somewhat uniform and their docFreq do not vary much.

I thought about this but my understanding was that Lucene is optimized to seek over the terms in sorted order? Maybe the cost of unsorted seeks over terms is trivial compared to other things here?

You are right, Lucene would prefer that you seek the terms in Unicode order.

But it allows you to NOT do so, and the performance penalty might be lower than the gains you could see by early terminating sooner?

Just pure speculation ... I think you would need to test and see to know for sure.

@mikemccand

I'm afraid the early-stopping method as I described it isn't going to work. Specifically, it's pretty easy to find a case where a single vector matches for multiple consecutive hash terms and fills up the heap with its incrementing match counts. You end up determining that the cutoff for match counts is an old match count for that doc and you early-stop, but then there's only one doc that exceeds that cutoff. Replacing a doc's previous count in the heap might solve this?

Hmm, I see, your top-K heap allows the same doc to be added in multiple slots?

Edit: I got it to work by clearing the heap after each term. If the smallest count in the heap for a specific term exceeds the remaining possible counts, then you early stop. But this ends up still being slower than just counting all the docs. Setting this aside for a few days.

OK, sigh :)

I still think you should explore indexing "sets of commonly co-occurring hashes" if possible. If your query-time vectors are such that the same sets of hashes commonly co-occur (i.e. you can predict these common subsets offline, before indexing, with highish accuracy to future queries), the impact can be sizable. It'd make your index a bit larger, but allow simpler/faster queries at search time.

@alexklibisz
Owner Author

OK, sigh :)

I still think you should explore indexing "sets of commonly co-occurring hashes" if possible. If your query-time vectors are such that the same sets of hashes commonly co-occur (i.e. you can predict these common subsets offline, before indexing, with highish accuracy to future queries), the impact can be sizable. It'd make your index a bit larger, but allow simpler/faster queries at search time.

I agree. I'm trying to think through some ways that could work without requiring a huge sample of representative vectors up-front.

@alexklibisz
Owner Author

alexklibisz commented Oct 2, 2020

I'm trying to make precise what advantage indexing co-occurring hashes would have. If we assume that Lucene's retrieval speed is maxed out, the next obvious speedup is to somehow decrease the number of irrelevant docs that I'm iterating over, i.e. increase the "quality" of candidates retrieved for each term. Aggregating co-occurring hashes would definitely lead to fewer terms. Would it lead to higher-quality matches for each term? As the number of hashes in an aggregate increases, the probability of retrieving an irrelevant candidate decreases.

But that's also the exact same reasoning and procedure behind the k parameter in most of the LSH mappings (http://elastiknn.klibisz.com/api/#l2-lsh-mapping). The vector is already getting hashed L * k times, and the adjacent sets of k hashes are getting concatenated to form L hashes. So maybe I can get the same effect by increasing L and k, and my hunch is that the increased cost of computing more hashes is probably less than the decreased cost of Lucene iterating over irrelevant docs. Up until now I've been assuming (at least implicitly) that we should avoid a large number of terms, i.e. that L and k should be as low as possible. A bit rambly, but I needed to get thoughts down in writing.
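
For reference, the standard LSH amplification analysis behind this reasoning, with p the probability that a single sub-hash collides for a given pair of vectors:

```latex
% Probability that a pair of vectors collides on one concatenated hash of k sub-hashes:
P_{\text{hash}} = p^{k}
% Probability that the pair collides on at least one of the L concatenated hashes,
% i.e. that the candidate is retrieved at all:
P_{\text{retrieve}} = 1 - \left(1 - p^{k}\right)^{L}
```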

@mikemccand

Oh, I was proposing a purely functionality-neutral optimization, since indexing co-occurring hashes would result in fewer disjunctive terms at search time, and should make your searches run faster. But you're right, the scoring would be impacted, unless you could e.g. make a co-occurring hash that matches 7 hashes count as +7 at search time. I had not thought through the implications of changing the score ...

@alexklibisz
Owner Author

Yeah - really we should be able to know precisely how many times each vector matched. It's already an approximate algorithm, so introducing more approximation for performance's sake might not be worth it.

I'm thinking we've possibly hit the limit here. This effort is mostly motivated by trying to get the performance on par with some C/C++ in-memory approximate nearest neighbors solutions. Right now, for this benchmark dataset, the queries/second is 1 to 2 orders of magnitude lower than those alternatives. Of course there are other tradeoffs that make this a favorable approach in some ways.

But after trying out these options and getting some sanity checks on the current implementation (thanks again to @mikemccand!), I think I'm fine with the conclusion that we've hit the limit for now.

@alexklibisz
Owner Author

Since one of the major bottlenecks is reading the matched doc ids from the segment, the last thing I'm tempted to look into is whether there is some construct for caching access to the postings for a particular term in a segment. I don't see any existing construct like that in the Lucene library. I could hack it together myself pretty easily using Guava. But, ideally, it would also bust the cache when there are appends to a segment, and I'm not sure how I'd wire it up so that the cache is busted on appends.
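
A hacked-together version might look roughly like this (purely illustrative names; one instance per segment reader, and the cache-busting question above is exactly what's not handled here):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import org.apache.lucene.util.BytesRef;

// Illustrative only: memoize the doc ids matched by a term within one segment.
final class TermPostingsCache {
  private final Cache<BytesRef, int[]> cache =
      CacheBuilder.newBuilder().maximumSize(10_000).build();

  int[] docIds(BytesRef term, Callable<int[]> loadFromPostings) throws ExecutionException {
    // deepCopyOf because a TermsEnum may reuse the BytesRef it hands out.
    return cache.get(BytesRef.deepCopyOf(term), loadFromPostings);
  }
}
```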

@aecio

aecio commented Oct 7, 2020

I am wondering whether you considered using top-k aggregation algorithms like TA or NRA (see http://alumni.cs.ucr.edu/~skulhari/Top-k-Query.pdf) that avoid traversing the complete posting list (and actually counting the hits)? And how do they compare with the early-termination strategy that you experimented with? Also, did you consider using the WANDScorer in Lucene? I guess these algorithms also don't perform well in this case because the query has many terms?

@alexklibisz
Owner Author

Hi @aecio. Thanks for the link. I wasn't aware of those algorithms, so I'll read the slides, hopefully tonight.

I looked through the WANDScorer code. I didn't really see a way to apply it here. IIRC, it looked like it was designed to optimize the top-hits scoring for N individual TermQuerys, rather than one query with N terms. A couple months ago I was using a BooleanQuery with a TermQuery in a should clause for each of the terms. That was 2-3x slower compared to the current implementation, and re-scoring the vectors with exact similarities was a much more awkward arrangement.

@alexklibisz
Owner Author

@aecio The algorithms in those slides seem worth trying. It seems that for both Fagin's Algorithm and the Threshold Algorithm, I'd need to keep an array of PostingsEnums to "step" through the docs for each of the terms. I'll probably try the TA first. It's not clear to me how I'd implement the No-Random-Access algorithm. Specifically, I'm confused on how they define the lower and upper bounds.

@alexklibisz
Owner Author

Took a first pass at the Threshold Algorithm this morning: https://github.com/alexklibisz/elastiknn/blob/top-k-algos/elastiknn-lucene/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L51-L134
A few of the tests pass, but definitely not all. There are probably some mistakes in there.

@alexklibisz
Owner Author

alexklibisz commented Oct 9, 2020

I'm not seeing any practical improvement from using the Threshold Algorithm.
The kth greatest score is consistently ~2 OOM lower than the threshold. This makes some intuitive sense when you consider there are usually 300 to 600 query terms, k = 1000 to 10,000, and x_i is always 1. Extremely few of the docs will match all of the terms.
Here's the latest code: https://github.com/alexklibisz/elastiknn/blob/top-k-algos/elastiknn-lucene/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L56

It is nice to know that I can keep an iterator for each term and step through the sorted docs, effectively doing a document-first traversal rather than a term-first. That might be useful in revisiting my original idea for early-stopping.
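
That document-first traversal looks roughly like this (a sketch with illustrative names like sortedHashes and collect, not the branch's exact code):

```java
import java.util.ArrayList;
import java.util.List;

// Keep one PostingsEnum per query term and always advance whichever iterators sit on the
// smallest doc id, visiting docs in increasing order across all terms.
List<PostingsEnum> postingsList = new ArrayList<>();
for (BytesRef hash : sortedHashes) {
  if (termsEnum.seekExact(hash)) {
    postingsList.add(termsEnum.postings(null, PostingsEnum.NONE)); // no reuse: all enums stay live
  }
}
int n = postingsList.size();
int[] current = new int[n];
for (int i = 0; i < n; i++) current[i] = postingsList.get(i).nextDoc();

while (true) {
  int minDoc = DocIdSetIterator.NO_MORE_DOCS;
  for (int d : current) minDoc = Math.min(minDoc, d);
  if (minDoc == DocIdSetIterator.NO_MORE_DOCS) break;  // every term's postings are exhausted
  int count = 0;
  for (int i = 0; i < n; i++) {
    if (current[i] == minDoc) {
      count++;                                          // this term matched minDoc
      current[i] = postingsList.get(i).nextDoc();       // step that iterator forward
    }
  }
  collect(minDoc, count);                               // hypothetical helper, e.g. update a top-k heap
}
```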

@alexklibisz
Owner Author

alexklibisz commented Oct 11, 2020

I've been going through the various optimizations described in this fantastic paper: http://www.dcs.gla.ac.uk/~craigm/publications/fnt-efficient-query-processing.pdf

So far I've tried the TAAT Quit and Continue described on page 39, TAAT MaxScore on page 41, and DAAT MaxScore on page 47. TAAT Quit and Continue yields a significant speedup on the Fashion MNIST dataset. 0.7 recall was previously at ~190 queries/second, now at 260 queries/second. So far none of them have yielded much of a speedup for the pathological parameters on the SIFT dataset.

Edit: this was sadly a bug in how I was evaluating recall. The change made it so that there were duplicate docs returned. Those duplicates were very good, but a query for k=100 docs was usually only returning about 30 distinct docs. 🤦

@aecio

aecio commented Oct 11, 2020

Thanks for reporting the results of your experiments @alexklibisz!

The paper you mentioned is indeed excellent. If you're willing to dig deeper into this, the following paper is more recent and describes new techniques not included in that one: https://dl.acm.org/doi/10.1145/3397271.3401076. I'm not sure they would be directly applicable or improve your use case, but they have some plots showing that their algorithm becomes particularly more efficient as the number of terms in the query increases. They compare with several variants of these optimizations, including BlockMax MaxScore, BlockMax WAND, and others.

WRT the previous comments, can you clarify what you mean by "The kth greatest score is consistently ~2 OOM lower than the threshold". I'm not sure what OOM means in this context.

@alexklibisz
Owner Author

WRT the previous comments, can you clarify what you mean by "The kth greatest score is consistently ~2 OOM lower than the threshold". I'm not sure what OOM means in this context.

OOM = orders of magnitude. Not out-of-memory :)

@alexklibisz
Owner Author

@aecio I appreciate your tips. I'm checking out the paper you linked.

I'll also explain the "pathological parameters on the SIFT dataset" that I mentioned before.
Basically, there are two datasets in the ann-benchmarks suite which use euclidean distance: Fashion-MNIST and SIFT.
The distributions of the vector values are quite different, here's a histogram:
[histogram of vector value distributions for Fashion-MNIST and SIFT]
SIFT has far more zeros than Fashion-MNIST.
The Locality Sensitive Hashing model used by Elastiknn takes a width parameter w. Each of the vectors is projected onto a random scalar vector, and the line onto which they are projected is broken into segments of length w. Since there are so many zeros in the SIFT dataset, the projections (dot-products) are much smaller, so if the value of w is too large, you basically assign almost all of the vectors to a small handful of buckets. SIFT does best with w = 1 or 2. On the other hand, Fashion-MNIST has larger dot-products, so it does best with w = 6 or 7.
When you run SIFT with w >= 3 almost every term ends up matching 50-70% of the vectors in the corpus, forcing you to iterate over extremely long postings lists, which makes the searches extremely slow.
Unfortunately, the ann-benchmarks framework doesn't let you say "only use w=1 or 2 on this dataset and only use w = 6 or 7 on this other dataset." So it ends up wasting a bunch of time and/or timing-out on SIFT with w>=3.
That's a big part of what motivated this entire endeavor.
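
For reference, the role of w in the L2 LSH hash function is roughly the following (a simplified sketch of the standard scheme, not Elastiknn's exact implementation):

```java
// One L2 LSH hash: project v onto a random vector a, shift by b (uniform in [0, w)),
// and bucket by width w. Small dot-products (as with SIFT's many zeros) plus a large w
// means almost everything lands in the same few buckets.
static int l2LshBucket(float[] v, float[] a, float b, float w) {
  float dot = 0f;
  for (int i = 0; i < v.length; i++) dot += v[i] * a[i];
  return (int) Math.floor((dot + b) / w);
}
```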

alexklibisz changed the title from "Optimize countHits method" to "Optimize top-k counting for approximate queries" on Oct 12, 2020
@aecio

aecio commented Oct 12, 2020

Thanks for the details. It all makes sense now. :)

If the problem is iterating over the long postings lists, then a promising solution might be to use the BlockMax optimizations that allow you to skip whole blocks of the posting lists that have no competitive scores (as opposed to evaluating every single doc with nextDoc()). Although it seems the code can't be reused in your query, the WANDScorer implements this idea.

@alexklibisz
Owner Author

As I read more about the block-max strategies, I don't actually see it being a particularly useful optimization here. The reason I think this: Every vector ("document") has the exact same number of hashes ("terms"). So the term upper bound for any given term should be the same for every doc in the postings list for that term. Maybe I'm missing something, but the block-max strategies seem to be predicated on the idea that there are blocks of docs where the document upper-bound within that block is so low that you shouldn't even bother looking, so you skip the whole block.

@aecio

aecio commented Oct 12, 2020

I thought that because each hash (term) has a numeric value (frequency) associated with it, the score upper bounds would not be the same, because they would depend on the frequency value. But I may be missing something as well; I only have a shallow understanding of both your query and these query optimizations.

@alexklibisz
Owner Author

Ah, terms do technically have an associated frequency. It's only > 1 for one of the LSH models, Permutation LSH with repeating = true. However, repeating = false seems to outperform repeating = true, so I've been basically pretending every term appears exactly once for the sake of this problem. I've also considered, if I get a sufficient speedup here, just discarding that setting.

@alexklibisz
Owner Author

@aecio LMK if you're interested in trying out the code to try to implement some optimizations. I haven't yet documented the development setup, but it's fairly simple so I'd be happy to write up a development guide.

@alexklibisz
Owner Author

Some summary from looking at various top-k pruning optimizations over the long weekend:

  • This specific version of the top-k problem is at least a little atypical. We can virtually always assume that the contribution of a doc for a given term is equivalent to that term's upper bound. This is because all but one of the LSH models produce exactly the same number of hashes (terms) for each vector (doc) (i.e., each term in a doc occurs exactly once, and all docs have the same number of terms). From what I understand, this renders the block-max-style algorithms useless; all blocks in the postings list have the same max contribution, so you can't skip any of them.
  • The current implementation is basically exhaustive TAAT. To improve on that, I implemented the TAAT stop/continue, TAAT max-score, exhaustive DAAT, and DAAT max-score variants. All of them had at least plausible performance in terms of recall, so I don't think there were any flagrant errors in my implementation. But, for a segment with 60k docs, the cost of maintaining additional bookkeeping for these optimizations was large enough that the method was actually slower than exhaustive TAAT (in some cases as much as 10x slower), even though it was traversing fewer terms and/or documents. I didn't have time to do WAND, but that also seemed like the bookkeeping cost would be non-negligible. All of this code is on a branch, organized somewhat coherently, if anyone is interested.
  • The most expensive operations, according to VisualVM, are always self time and nextDoc.

@alexklibisz
Owner Author

I recently watched Adrien Grand's talk at ApacheCon 2021, where he spent some time covering sorted indices. I'm thinking this might be an interesting option for speeding up this query.

If I understand it correctly, this would look like:

  1. Update the LSH models to compute hashes such that lexical sorted order of the hashes corresponds to spatial proximity of the vectors.
  2. Sort the index by the field where the hash terms are stored. It seems like this is a pretty simple change in the index settings.
  3. Then Lucene would, in theory, have to traverse a much smaller subset of each segment.

This is also relevant re #278. I'll likely give it a try as part of implementing the big-ann-benchmarks submission.

@alexklibisz
Owner Author

alexklibisz commented Feb 27, 2023

I benchmarked a few primitive collections libraries, to see if any of them could be used to optimize the hit counting.

The libraries I tried (alongside a plain array and java.util.HashMap as baselines): Agrona, Eclipse Collections, and Speiger's Primitive Collections.

The benchmarks are on this branch: https://github.com/alexklibisz/elastiknn/tree/160-primitive-collections-benchmarks, specifically in this file: https://github.com/alexklibisz/elastiknn/blob/160-primitive-collections-benchmarks/elastiknn-micro-benchmarks/src/main/scala/com/klibisz/elastiknn/vectors/HitCounterBenchmarks.scala

Here are the results:

Benchmark                                                Mode  Cnt      Score      Error  Units
HitCounterBenchmarks.agronaInt2IntCounterMap            thrpt   10   2893.557 ± 1294.608  ops/s
HitCounterBenchmarks.arrayCountBaseline                 thrpt   10   5435.787 ±  124.854  ops/s
HitCounterBenchmarks.eclipseIntIntHashMapAddToValue     thrpt   10  12497.779 ±  293.243  ops/s
HitCounterBenchmarks.eclipseIntIntHashMapGetAndPut      thrpt   10   9182.423 ±  143.209  ops/s
HitCounterBenchmarks.eclipseIntShortHashMapAddToValue   thrpt   10  13919.468 ±  190.643  ops/s
HitCounterBenchmarks.eclipseIntShortHashMapGetAndPut    thrpt   10  11399.590 ±  161.827  ops/s
HitCounterBenchmarks.hashMapGetOrDefault                thrpt   10    491.672 ±   28.412  ops/s
HitCounterBenchmarks.speigerInt2IntLinkedOpenHashMap    thrpt   10   8192.194 ±  119.720  ops/s
HitCounterBenchmarks.speigerInt2IntOpenHashMap          thrpt   10  10443.230 ± 3885.601  ops/s
HitCounterBenchmarks.speigerInt2ShortLinkedOpenHashMap  thrpt   10   8738.557 ±  589.641  ops/s
HitCounterBenchmarks.speigerInt2ShortOpenHashMap        thrpt   10  12429.252 ±  567.338  ops/s

So the most promising option looks like the Eclipse IntShortHashMap, using the .addToValue method. It beats the arrayCountBaseline (basically the same as the existing ArrayHitCounter) by a factor of ~2.5x. And it should use less memory.
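
A counter based on that approach might look roughly like this (a sketch against the Eclipse Collections API, not a finished implementation):

```java
import org.eclipse.collections.impl.map.mutable.primitive.IntShortHashMap;

// Hit counter backed by Eclipse Collections' IntShortHashMap. addToValue increments
// the stored count (treating absent keys as 0) without boxing.
final class EclipseHitCounter {
  private final IntShortHashMap counts;

  EclipseHitCounter(int expectedHits) {
    this.counts = new IntShortHashMap(expectedHits);
  }

  void increment(int docId) {
    counts.addToValue(docId, (short) 1);
  }

  short get(int docId) {
    return counts.get(docId);  // returns 0 for docs that never matched
  }
}
```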

@alexklibisz
Owner Author

I took another pass at this today. Specifically, I tried the new-ish IntIntHashMap from Lucene 9.x to track the document counts.

I implemented a simple JMH benchmark here: https://github.com/alexklibisz/elastiknn/pull/598/files#diff-c9afec758d911ce41f3ef81c96227c75c85e36465b0248bec43dccbcd20ae655

The benchmark results are promising, particularly when I simulate large segment sizes. Here are the results on an AWS EC2 r6i.8xlarge instance with a simulated segment size of 60k, 2k hits, initial counter size 1k:

[info] Benchmark                                               Mode  Cnt      Score     Error  Units
[info] HitCounterBenchmarks.arrayCountBaseline                thrpt    5  36686.200 ± 539.850  ops/s
[info] HitCounterBenchmarks.eclipseIntShortHashMapAddToValue  thrpt    5  69033.519 ±  60.579  ops/s
[info] HitCounterBenchmarks.hashMapGetOrDefault               thrpt    5  26861.607 ±  66.100  ops/s
[info] HitCounterBenchmarks.luceneIntIntHashMap               thrpt    5  33082.150 ±  59.188  ops/s

The Lucene IntIntHashMap is ~10% slower than using a plain array, but should use less memory. The Eclipse IntShortHashMap is 2x faster than both of them. However, the Eclipse Collections JAR is ~10MB, which I'd prefer to avoid.

So I implemented a HitCounter based on the Lucene IntIntHashMap in this PR: #598

But the benchmark results got worse across the board. Again on the same instance:

[screenshot: benchmark results]

So I'm gonna sit on this for now. Maybe the Lucene IntIntHashMap would have a greater effect on a larger dataset. Maybe it's worthwhile to pull in the Eclipse Collections library, or maybe I can extract just that particular IntShortHashMap as a standalone file. For now I want to try a couple other ideas.

@alexklibisz
Owner Author

Closing in favor of #662
