Utilize exact kNN search when gathering k > numVectors in a segment #12806

Merged
merged 4 commits into from
Nov 15, 2023

benwtrent
Member

When k >= numVectors is requested, it doesn't make sense to go through the HNSW graph. Even without a user-supplied filter, we should not explore the HNSW graph if it contains no more than k vectors.

One scenario where we may still explore the graph even though k >= numVectors is when not every document has a vector and there are deleted docs. Even so, this commit significantly improves things.
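
For readers skimming the thread, here is a minimal sketch of the idea in the description, not the code from this PR. The class `ExactKnnSketch` and its methods are hypothetical, it uses plain arrays instead of Lucene readers, and it assumes dot product as the similarity. It shows why an exact scan is enough when k is at least the number of vectors: every vector ends up in the result anyway, so a graph traversal cannot do better.

```java
import java.util.PriorityQueue;

// Hypothetical illustration only; none of these names exist in Lucene.
class ExactKnnSketch {

  static float dotProduct(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /** True when a brute-force scan should replace graph traversal. */
  static boolean shouldUseExactSearch(int k, int numVectors) {
    return k >= numVectors;
  }

  /** Scores every vector and returns up to k doc IDs, best score first. */
  static int[] exactSearch(float[][] vectors, float[] query, int k) {
    float[] scores = new float[vectors.length];
    for (int doc = 0; doc < vectors.length; doc++) {
      scores[doc] = dotProduct(vectors[doc], query);
    }
    // Min-heap on score: the head is always the weakest hit collected so far.
    PriorityQueue<Integer> heap =
        new PriorityQueue<>((a, b) -> Float.compare(scores[a], scores[b]));
    for (int doc = 0; doc < vectors.length; doc++) {
      heap.add(doc);
      if (heap.size() > k) {
        heap.poll(); // evict the weakest hit once more than k are held
      }
    }
    // Polling yields the weakest first, so fill the result from the back.
    int[] result = new int[heap.size()];
    for (int i = result.length - 1; i >= 0; i--) {
      result[i] = heap.poll();
    }
    return result;
  }
}
```

With three vectors and k = 5, `shouldUseExactSearch(5, 3)` is true and `exactSearch` simply returns all three doc IDs ordered by score.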

@@ -110,6 +110,12 @@ private TopDocs getLeafResults(LeafReaderContext ctx, Weight filterWeight) throw
    int maxDoc = ctx.reader().maxDoc();

    if (filterWeight == null) {
      int cost = liveDocs == null ? maxDoc : liveDocs.length();
Member Author

I am not sure what liveDocs.length() actually represents. Here I am assuming that if there are liveDocs, its length might be smaller than maxDoc.

Contributor

liveDocs.length() is always equal to maxDoc
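
To illustrate the distinction, here is a small sketch using a hypothetical `SimpleBits` class that only mirrors the `get(int)` / `length()` shape of Lucene's `Bits` interface. It is an assumption-laden stand-in, not Lucene code. The point is that `length()` reports the number of slots (one per doc ID, deleted or not), while the number of live documents has to be counted separately.

```java
// Hypothetical illustration only; SimpleBits is not a Lucene class.
class LiveDocsSketch {

  /** Stand-in that mirrors the get/length shape of a live-docs bit set. */
  static final class SimpleBits {
    private final boolean[] bits;

    SimpleBits(boolean[] bits) {
      this.bits = bits;
    }

    boolean get(int index) {
      return bits[index];
    }

    /** One slot per doc ID, whether the doc is live or deleted. */
    int length() {
      return bits.length;
    }
  }

  /** Counts the documents that are still live (bit set to true). */
  static int countLive(SimpleBits liveDocs) {
    int live = 0;
    for (int i = 0; i < liveDocs.length(); i++) {
      if (liveDocs.get(i)) {
        live++;
      }
    }
    return live;
  }
}
```

In Lucene itself, the live document count of a reader is available from `numDocs()`, so no manual counting is needed there.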

Comment on lines 181 to 182
DocIdSetIterator iterator =
    ConjunctionUtils.intersectIterators(List.of(acceptIterator, vectorScorer.iterator()));
Member Author

Intersecting the vector values iterator with the user-provided filter seems like the way this should always have worked.

Do we want to remove the fieldExists query logic when a user provides a pre-filter?
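
As a rough sketch of what such an intersection does, assuming both sides yield doc IDs in increasing order: the hypothetical `IntersectionSketch.intersect` below walks two sorted arrays in step and keeps only the IDs present in both. It is not how `ConjunctionUtils` is implemented. It only shows why scoring afterwards touches just the documents that pass the filter and also have a vector.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; sorted arrays stand in for doc ID iterators.
class IntersectionSketch {

  /** Returns the doc IDs present in both sorted inputs, in increasing order. */
  static List<Integer> intersect(int[] acceptedDocs, int[] docsWithVectors) {
    List<Integer> result = new ArrayList<>();
    int i = 0;
    int j = 0;
    while (i < acceptedDocs.length && j < docsWithVectors.length) {
      if (acceptedDocs[i] == docsWithVectors[j]) {
        result.add(acceptedDocs[i]); // passes the filter and has a vector
        i++;
        j++;
      } else if (acceptedDocs[i] < docsWithVectors[j]) {
        i++; // accepted doc has no vector, skip it
      } else {
        j++; // vector doc is rejected by the filter, skip it
      }
    }
    return result;
  }
}
```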

   * @param ctx the leaf reader context
   * @return the number of vectors in the given leaf.
   */
  protected abstract int numVectorsInLeaf(LeafReaderContext ctx) throws IOException;
Member Author

This is cheap because we store the count per field. This way we have an accurate count even when not every document has a vector.

@@ -779,6 +781,16 @@ Directory getIndexStore(
doc.add(getKnnVectorField(field, contents[i], vectorSimilarityFunction));
doc.add(new StringField("id", "id" + i, Field.Store.YES));
writer.addDocument(doc);
if (randomBoolean()) {
Member Author

Adding docs in between vector docs to ensure our sparse vector reader is adequately exercised. Simple improvement in coverage here.

Contributor

@jpountz jpountz left a comment

The idea makes sense to me. What is less clear to me is whether this logic belongs in the Query or in the vector reader: should searchNearestNeighbors implicitly do a linear scan when k is greater than size()?

@benwtrent
Member Author

> The idea makes sense to me. What is less clear to me is whether this logic belongs in the Query or in the vector reader: should searchNearestNeighbors implicitly do a linear scan when k is greater than size()?

I like that idea, we should do that.

Contributor

@jpountz jpountz left a comment

Thanks, I find that it also looks nicer, and it saves one IndexInput.clone() as well.

    new OrdinalTranslatedKnnCollector(knnCollector, scorer::ordToDoc),
    getGraph(fieldEntry),
    scorer.getAcceptOrds(acceptDocs));
return;
Contributor

nit: from a style perspective, I'd rather not return here and have an else block instead

@benwtrent benwtrent merged commit 05a336e into apache:main Nov 15, 2023
4 checks passed
@benwtrent benwtrent deleted the feature/knn-exact-on-small-segments branch November 15, 2023 17:56
benwtrent added a commit that referenced this pull request Nov 15, 2023
…12806)

mccullocht added a commit to mccullocht/lucene that referenced this pull request Feb 27, 2024
As of apache#12806 the hnsw codec has implemented a more complete version of this logic
that may trigger without a pre-filter query.

Reference apache#12505