Fix HNSW graph visitation limit bug #12413

Merged
merged 5 commits into apache:main from bugfix/filtered-hnsw on Jul 6, 2023

Conversation

benwtrent (Member) commented Jul 3, 2023

We have some weird behavior in the HNSW searcher when finding the candidate entry point for the zeroth layer.

While trying to find the best entry point to gather the full candidate set, we don't filter based on the acceptableOrds bitset. Consequently, if we exit the search early (before hitting the zeroth layer), the results that are returned may contain documents NOT within that bitset.

Luckily, since the results are marked as incomplete, the *VectorQuery logic switches back to an exact scan and throws away the results.

However, if any user called the leaf searcher directly, bypassing the query, they could run into this bug.
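
For direct callers, the safe pattern is to treat incomplete results as untrustworthy. Below is a minimal caller-side sketch of that check; KnnResult and searchGraph are hypothetical stand-ins for illustration, not Lucene's actual API.

```java
import java.util.List;

// Hypothetical result holder mirroring the "incomplete" flag described above.
record KnnResult(List<Integer> docs, boolean incomplete) {}

final class DirectLeafSearchExample {
  public static void main(String[] args) {
    KnnResult result = searchGraph(/* visitedLimit= */ 100);
    if (result.incomplete()) {
      // Before this fix, docs could contain ordinals outside the acceptableOrds
      // bitset, so a direct caller must not trust them. The *VectorQuery logic
      // handles this case by falling back to an exact scan.
      System.out.println("Hit the visitation limit; fall back to exact search");
    } else {
      System.out.println("Approximate results: " + result.docs());
    }
  }

  // Stand-in for the leaf searcher; pretends the visit limit was reached early.
  private static KnnResult searchGraph(int visitedLimit) {
    return new KnnResult(List.of(3, 7), true);
  }
}
```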

benwtrent requested a review from msokolov (July 3, 2023 21:36)
jpountz (Contributor) commented Jul 4, 2023

Intuitively, it sounds like a good approach to me not to take live docs into account when finding good entry points, as there could be nodes that are good entry points even though they are marked as deleted or don't match the filter. Should we consider never exiting before hitting the zeroth level instead?

benwtrent (Member, Author) commented

> Should we consider never exiting before hitting the zeroth level instead?

🤔

The idea is that if we cannot even get to the zeroth level before hitting the visitation limit, we shouldn't even bother going through the graph structure anymore as it will likely be slower than an exact match.

I think exiting with nothing and indicating incomplete makes the most sense to me, as it gives a clear indication that searching the graph structure shouldn't be done. If we ended early on the zeroth layer, we could return a "kNN" that is nowhere near the actual kNN because we exited early.

I don't know why that would be any better than exiting before reaching that layer.
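
To make the trade-off concrete, here is a hedged sketch of the early-exit behavior described above; all names are illustrative, not the merged Lucene code. Exhausting the visitation budget while still descending the upper layers returns nothing and flags the result incomplete, rather than surfacing whatever partial candidates happened to be collected.

```java
final class EarlyExitSketch {
  // Illustrative result: top docs, how many nodes were visited, and whether we bailed out.
  record GraphResult(int[] topDocs, int visitedCount, boolean incomplete) {}

  static GraphResult searchLevels(int numLevels, int visitedLimit) {
    int visited = 0;
    int entryPoint = 0; // start from the graph's top-level entry node
    // Descend levels numLevels-1 .. 1 looking for a good zeroth-level entry point.
    for (int level = numLevels - 1; level >= 1; level--) {
      visited += greedyStepCost(level);
      if (visited >= visitedLimit) {
        // Never reached level 0: return nothing and mark incomplete, so the
        // caller knows to abandon the graph search entirely.
        return new GraphResult(new int[0], visited, true);
      }
    }
    // Only the zeroth level gathers the full (acceptOrds-filtered) candidate set.
    return new GraphResult(new int[] {entryPoint}, visited, false);
  }

  // Stand-in for the number of nodes a greedy pass visits on one level.
  private static int greedyStepCost(int level) {
    return 10;
  }

  public static void main(String[] args) {
    GraphResult r = searchLevels(/* numLevels= */ 5, /* visitedLimit= */ 25);
    System.out.println("incomplete=" + r.incomplete() + ", visited=" + r.visitedCount());
  }
}
```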

jpountz (Contributor) commented Jul 4, 2023

Sorry, I commented too quickly, before understanding what your change was doing; at first I thought it was ignoring filtered-out ords on levels > 0. Your change makes sense to me now.

benwtrent (Member, Author) commented

OK, I reverted my minor optimizations and moved the method to be more in line with what Lucene did before.

Now I am getting exactly the same recall, and the weird bug is fixed where we could return partial results containing documents outside the acceptableOrds set.

I still kept it as a separate method to make it perfectly clear what the method is doing.

```java
visitedCount++;
if (friendSimilarity >= minAcceptedSimilarity) {
  candidates.add(friendOrd, friendSimilarity);
  if (results.insertWithOverflow(friendOrd, friendSimilarity) && results.size() >= 1) {
```
msokolov (Contributor) commented

it seems a little odd to preserve the ceremony of adding to a priority queue that will always be of length 1, although I suppose this preserves the idea of length > 1? Maybe we would want to do that?

benwtrent (Member, Author) replied

I agree @msokolov, but there is a weird edge case I am not 100% sure of. When I revert back to my commit here (22da1e4), my recall numbers change.

I can dig more into why that commit's solution is buggy and go back to something similar.

```java
visitedLimit -= results.visitedCount();

if (results.incomplete()) {
  results.setVisitedCount(numVisited);
```
msokolov (Contributor) commented

Wouldn't it be simpler to discard results here? There will never be more than one, right?

benwtrent (Member, Author) replied

@msokolov There shouldn't be, but commit 22da1e4 did exactly this: it kept track of only the single best candidate and result, yet my recall numbers were different.

benwtrent (Member, Author) commented

@msokolov found my bug 🤦 in the simplified version. I updated and removed the need for tracking candidates & results, since we only care about the best entry point found.
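
As a rough illustration of that simplification (hypothetical names and signatures, not the merged Lucene code), the upper-layer search can keep a single best (node, score) pair and greedily hop to any strictly better neighbor, with no candidate or result queues at all:

```java
// Minimal stand-ins for the real graph and scoring machinery.
interface Graph {
  int[] neighbors(int level, int node);
}

interface Scorer {
  float score(int node);
}

final class BestEntryPointSketch {
  // Greedily walk one level toward the query, keeping only the single best node.
  // (A real implementation would also track visited nodes and the visitation limit.)
  static int bestEntryPoint(int entryPoint, int level, Graph graph, Scorer scorer) {
    float bestScore = scorer.score(entryPoint);
    boolean foundBetter = true;
    // Keep searching the given level until we stop finding a better candidate entry point.
    while (foundBetter) {
      foundBetter = false;
      for (int friend : graph.neighbors(level, entryPoint)) {
        float score = scorer.score(friend);
        if (score > bestScore) { // strictly better, so the loop must terminate
          bestScore = score;
          entryPoint = friend;
          foundBetter = true;
        }
      }
    }
    return entryPoint;
  }

  public static void main(String[] args) {
    // Toy graph: node i's neighbors are i-1 and i+1, clamped to [0, 4];
    // the scorer prefers higher ordinals, so the walk should end at node 4.
    Graph graph = (level, node) -> new int[] {Math.max(0, node - 1), Math.min(4, node + 1)};
    Scorer scorer = node -> node;
    System.out.println(bestEntryPoint(0, 1, graph, scorer)); // prints 4
  }
}
```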

benwtrent requested a review from msokolov (July 6, 2023 15:14)
msokolov (Contributor) left a comment

I don't know what bug I found, but this LGTM

```java
foundBetter = true;
visited.set(currentEp);
// Keep searching the given level until we stop finding a better candidate entry point
while (foundBetter) {
```
msokolov (Contributor) commented

nit: with do/while you could avoid setting foundBetter = true above

benwtrent (Member, Author) replied

do/while is indeed valid. But daggum, it confuses the crap out of me; I wish Java didn't have it. If you don't mind, I will keep the while loop. I don't particularly mind if somebody changes it later.
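
For readers weighing the two forms, a self-contained comparison is below; step() is a hypothetical stand-in for one pass over the current entry point's neighbors.

```java
final class DoWhileSketch {
  private static int passesLeft = 3;

  // Stand-in for one pass over the current node's neighbors; reports whether
  // a better entry point was found.
  private static boolean step() {
    return --passesLeft > 0;
  }

  public static void main(String[] args) {
    // while form (as merged): requires priming foundBetter = true.
    boolean foundBetter = true;
    while (foundBetter) {
      foundBetter = step();
    }

    // do/while form (the suggested nit): no priming assignment needed.
    passesLeft = 3;
    do {
      foundBetter = step();
    } while (foundBetter);
  }
}
```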

benwtrent (Member, Author) commented

> I don't know what bug I found, but this LGTM

Commas are important! I just meant to tell you I found my own bug, not that YOU found my bug.

benwtrent merged commit 8611530 into apache:main on Jul 6, 2023
4 checks passed
benwtrent deleted the bugfix/filtered-hnsw branch (July 6, 2023 19:46)
benwtrent added a commit that referenced this pull request Jul 6, 2023
zhaih added this to the 9.8.0 milestone on Sep 20, 2023