Fix HNSW graph visitation limit bug #12413

Merged
merged 5 commits into apache:main from bugfix/filtered-hnsw on Jul 6, 2023

Conversation

benwtrent (Member) commented Jul 3, 2023

We have some weird behavior in the HNSW searcher when finding the candidate entry point for the zeroth layer.

While trying to find the best entry point to gather the full candidate set, we don't filter based on the acceptableOrds bitset. Consequently, if we exit the search early (before hitting the zeroth layer), the results that are returned may contain documents NOT within that bitset.

Luckily, since the results are marked as incomplete, the *VectorQuery logic switches back to an exact scan and throws away the results.

However, if any user called the leaf searcher directly, bypassing the query, they could run into this bug.
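
For direct callers, the safe pattern is to treat incomplete results as untrustworthy. Below is a minimal caller-side sketch of that check; KnnResult and searchGraph are hypothetical stand-ins for illustration, not Lucene's actual API.

```java
import java.util.List;

// Hypothetical result holder mirroring the "incomplete" flag described above.
record KnnResult(List<Integer> docs, boolean incomplete) {}

final class DirectLeafSearchExample {
  public static void main(String[] args) {
    KnnResult result = searchGraph(/* visitedLimit= */ 100);
    if (result.incomplete()) {
      // Before this fix, docs could contain ordinals outside the acceptableOrds
      // bitset, so a direct caller must not trust them. The *VectorQuery logic
      // handles this case by falling back to an exact scan.
      System.out.println("Hit the visitation limit; fall back to exact search");
    } else {
      System.out.println("Approximate results: " + result.docs());
    }
  }

  // Stand-in for the leaf searcher; pretends the visit limit was reached early.
  private static KnnResult searchGraph(int visitedLimit) {
    return new KnnResult(List.of(3, 7), true);
  }
}
```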

benwtrent requested a review from msokolov (July 3, 2023 21:36)
jpountz (Contributor) commented Jul 4, 2023

Intuitively, it sounds like a good approach to me not to take live docs into account when finding good entry points, as there could be nodes that are good entry points even though they are marked as deleted or don't match the filter. Should we consider never exiting before hitting the zeroth level instead?

benwtrent (Member, Author) commented

> Should we consider never exiting before hitting the zeroth level instead?

🤔

The idea is that if we cannot even get to the zeroth level before hitting the visitation limit, we shouldn't even bother going through the graph structure anymore as it will likely be slower than an exact match.

I think exiting with nothing and indicating incomplete makes the most sense to me, as it gives a clear indication that searching the graph structure shouldn't be done. If we ended early on the zeroth layer, we could return a "kNN" that is nowhere near the actual kNN because we exited early.

I don't know why that would be any better than exiting before reaching that layer.
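
To make the trade-off concrete, here is a hedged sketch of the early-exit behavior described above; all names are illustrative, not the merged Lucene code. Exhausting the visitation budget while still descending the upper layers returns nothing and flags the result incomplete, rather than surfacing whatever partial candidates happened to be collected.

```java
final class EarlyExitSketch {
  // Illustrative result: top docs, how many nodes were visited, and whether we bailed out.
  record GraphResult(int[] topDocs, int visitedCount, boolean incomplete) {}

  static GraphResult searchLevels(int numLevels, int visitedLimit) {
    int visited = 0;
    int entryPoint = 0; // start from the graph's top-level entry node
    // Descend levels numLevels-1 .. 1 looking for a good zeroth-level entry point.
    for (int level = numLevels - 1; level >= 1; level--) {
      visited += greedyStepCost(level);
      if (visited >= visitedLimit) {
        // Never reached level 0: return nothing and mark incomplete, so the
        // caller knows to abandon the graph search entirely.
        return new GraphResult(new int[0], visited, true);
      }
    }
    // Only the zeroth level gathers the full (acceptOrds-filtered) candidate set.
    return new GraphResult(new int[] {entryPoint}, visited, false);
  }

  // Stand-in for the number of nodes a greedy pass visits on one level.
  private static int greedyStepCost(int level) {
    return 10;
  }

  public static void main(String[] args) {
    GraphResult r = searchLevels(/* numLevels= */ 5, /* visitedLimit= */ 25);
    System.out.println("incomplete=" + r.incomplete() + ", visited=" + r.visitedCount());
  }
}
```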

jpountz (Contributor) commented Jul 4, 2023

Sorry, I commented too quickly, before understanding what your change was doing; at first I thought it was ignoring filtered-out ords on levels > 0. Your change makes sense to me now.

benwtrent (Member, Author) commented

OK, I reverted my minor optimizations and moved the method to be more in line with what Lucene did before.

Now I am getting exactly the same recall, and the weird bug is fixed where we could return partial results containing documents outside the acceptableOrds set.

I still kept it as a separate method to make it perfectly clear what the method is doing.

```java
visitedCount++;
if (friendSimilarity >= minAcceptedSimilarity) {
  candidates.add(friendOrd, friendSimilarity);
  if (results.insertWithOverflow(friendOrd, friendSimilarity) && results.size() >= 1) {
```
msokolov (Contributor) commented

it seems a little odd to preserve the ceremony of adding to a priority queue that will always be of length 1, although I suppose this preserves the idea of length > 1? Maybe we would want to do that?

benwtrent (Member, Author) replied

I agree @msokolov, but there is a weird edge case I am not 100% sure of. When I revert back to my commit here (22da1e4), my recall numbers change.

I can dig more into why that commit's solution is buggy and go back to something similar.

```java
visitedLimit -= results.visitedCount();

if (results.incomplete()) {
  results.setVisitedCount(numVisited);
```
msokolov (Contributor) commented

Wouldn't it be simpler to discard results here? There will never be more than one, right?

benwtrent (Member, Author) replied

@msokolov There shouldn't be, but commit 22da1e4 did exactly this: it kept track of only the single best candidate and result, yet my recall numbers were different.

benwtrent (Member, Author) commented

@msokolov found my bug 🤦 in the simplified version. I updated and removed the need for tracking candidates & results, since we only care about the best entry point found.
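
As a rough illustration of that simplification (hypothetical names and signatures, not the merged Lucene code), the upper-layer search can keep a single best (node, score) pair and greedily hop to any strictly better neighbor, with no candidate or result queues at all:

```java
// Minimal stand-ins for the real graph and scoring machinery.
interface Graph {
  int[] neighbors(int level, int node);
}

interface Scorer {
  float score(int node);
}

final class BestEntryPointSketch {
  // Greedily walk one level toward the query, keeping only the single best node.
  // (A real implementation would also track visited nodes and the visitation limit.)
  static int bestEntryPoint(int entryPoint, int level, Graph graph, Scorer scorer) {
    float bestScore = scorer.score(entryPoint);
    boolean foundBetter = true;
    // Keep searching the given level until we stop finding a better candidate entry point.
    while (foundBetter) {
      foundBetter = false;
      for (int friend : graph.neighbors(level, entryPoint)) {
        float score = scorer.score(friend);
        if (score > bestScore) { // strictly better, so the loop must terminate
          bestScore = score;
          entryPoint = friend;
          foundBetter = true;
        }
      }
    }
    return entryPoint;
  }

  public static void main(String[] args) {
    // Toy graph: node i's neighbors are i-1 and i+1, clamped to [0, 4];
    // the scorer prefers higher ordinals, so the walk should end at node 4.
    Graph graph = (level, node) -> new int[] {Math.max(0, node - 1), Math.min(4, node + 1)};
    Scorer scorer = node -> node;
    System.out.println(bestEntryPoint(0, 1, graph, scorer)); // prints 4
  }
}
```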

benwtrent requested a review from msokolov (July 6, 2023 15:14)
msokolov (Contributor) left a comment

I don't know what bug I found, but this LGTM

```java
foundBetter = true;
visited.set(currentEp);
// Keep searching the given level until we stop finding a better candidate entry point
while (foundBetter) {
```
msokolov (Contributor) commented

nit: with do/while you could avoid setting foundBetter = true above

benwtrent (Member, Author) replied

do/while is indeed valid. But daggum, it confuses the crap out of me; I wish Java didn't have it. If you don't mind, I will keep the while loop. I don't particularly mind if somebody changes it later.
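
For readers weighing the two forms, a self-contained comparison is below; step() is a hypothetical stand-in for one pass over the current entry point's neighbors.

```java
final class DoWhileSketch {
  private static int passesLeft = 3;

  // Stand-in for one pass over the current node's neighbors; reports whether
  // a better entry point was found.
  private static boolean step() {
    return --passesLeft > 0;
  }

  public static void main(String[] args) {
    // while form (as merged): requires priming foundBetter = true.
    boolean foundBetter = true;
    while (foundBetter) {
      foundBetter = step();
    }

    // do/while form (the suggested nit): no priming assignment needed.
    passesLeft = 3;
    do {
      foundBetter = step();
    } while (foundBetter);
  }
}
```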

benwtrent (Member, Author) commented

> I don't know what bug I found, but this LGTM

Commas are important! I just meant to tell you I found my own bug, not that YOU found my bug.

benwtrent merged commit 8611530 into apache:main on Jul 6, 2023
4 checks passed
benwtrent deleted the bugfix/filtered-hnsw branch (July 6, 2023 19:46)
benwtrent added a commit that referenced this pull request Jul 6, 2023
zhaih added this to the 9.8.0 milestone on Sep 20, 2023