Utilize exact kNN search when gathering k > numVectors in a segment #12806
Conversation
@@ -110,6 +110,12 @@ private TopDocs getLeafResults(LeafReaderContext ctx, Weight filterWeight) throw
  int maxDoc = ctx.reader().maxDoc();

  if (filterWeight == null) {
    int cost = liveDocs == null ? maxDoc : liveDocs.length();
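The guard being added can be sketched as a tiny decision helper (illustrative stand-in, not the PR's actual code): when `k` is at least the number of candidate documents in the segment, graph traversal cannot visit fewer vectors than a brute-force pass, so exact search wins outright.

```java
// Illustrative stand-in for the new short-circuit (not Lucene source):
// if k covers every candidate doc in the segment, skip the HNSW graph
// and run the exact (brute-force) kNN scan instead.
public final class ExactSearchGuard {
    // liveDocsLength is null when the segment has no liveDocs at all.
    public static boolean useExactSearch(int k, Integer liveDocsLength, int maxDoc) {
        int cost = (liveDocsLength == null) ? maxDoc : liveDocsLength;
        return k >= cost;
    }

    public static void main(String[] args) {
        // k larger than the whole segment: exact scan.
        System.out.println(useExactSearch(100, null, 50)); // true
        // typical case, k much smaller than the segment: graph search.
        System.out.println(useExactSearch(10, null, 50));  // false
    }
}
```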
I am fairly ignorant around what `liveDocs.length()` means. Here I am assuming if there are `liveDocs`, this iterator might actually be smaller than `maxDoc`.
`liveDocs.length()` is always equal to `maxDoc`.
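To make the reviewer's point concrete, here is a toy model of the liveDocs contract (not Lucene's `Bits` implementation): `length()` spans the whole doc-id space of the segment, so it always equals `maxDoc`; deletions only change what `get(docId)` returns.

```java
import java.util.Arrays;

// Toy model of liveDocs: length() covers every doc id in the segment, so it
// equals maxDoc even after deletions; only get(doc) reflects deleted docs.
public final class LiveDocsModel {
    private final boolean[] live;

    public LiveDocsModel(int maxDoc) {
        live = new boolean[maxDoc];
        Arrays.fill(live, true);
    }

    public void delete(int doc) { live[doc] = false; }
    public int length() { return live.length; }        // always maxDoc
    public boolean get(int doc) { return live[doc]; }  // false once deleted

    public static void main(String[] args) {
        LiveDocsModel liveDocs = new LiveDocsModel(8);
        liveDocs.delete(3);
        System.out.println(liveDocs.length()); // 8: unchanged by the deletion
        System.out.println(liveDocs.get(3));   // false: doc 3 is deleted
    }
}
```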
DocIdSetIterator iterator =
    ConjunctionUtils.intersectIterators(List.of(acceptIterator, vectorScorer.iterator()));
An intersection of the vector values and the user-provided filter seemed like the way it should have always been. Do we want to remove the `fieldExists` query logic when a user provides a pre-filter?
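What the intersection buys can be illustrated with a simplified two-pointer merge over sorted doc-id arrays (a stand-in, not Lucene's iterator machinery): exact scoring only visits docs that both pass the filter and actually have a vector.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of intersecting two doc-id iterators: both inputs are
// sorted ascending, and only ids present in both streams are emitted.
public final class IntersectSketch {
    public static List<Integer> intersect(int[] acceptDocs, int[] vectorDocs) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < acceptDocs.length && j < vectorDocs.length) {
            if (acceptDocs[i] == vectorDocs[j]) {
                out.add(acceptDocs[i]); // passes the filter AND has a vector
                i++;
                j++;
            } else if (acceptDocs[i] < vectorDocs[j]) {
                i++; // filtered-in doc without a vector: skip
            } else {
                j++; // vector doc rejected by the filter: skip
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // docs 4 and 9 are the only ones in both streams
        System.out.println(intersect(new int[] {1, 4, 7, 9}, new int[] {4, 5, 9}));
    }
}
```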
 * @param ctx the leaf reader context
 * @return the number of vectors in the given leaf.
 */
protected abstract int numVectorsInLeaf(LeafReaderContext ctx) throws IOException;
This is cheap; we store it per field. This way we have a good count even when not every document has a vector.
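A toy model of why this is cheap (hypothetical names, not the codec's real layout): the vector count is recorded per field when the segment is written, so answering `numVectorsInLeaf` is a metadata lookup rather than a scan, and it stays accurate when only some documents carry a vector.

```java
import java.util.Map;

// Hypothetical per-segment metadata: the codec records how many documents
// hold a vector for each field at flush time, so the count is a map lookup.
public final class PerFieldVectorCounts {
    private final Map<String, Integer> vectorCountByField; // written once per segment

    public PerFieldVectorCounts(Map<String, Integer> counts) {
        vectorCountByField = counts;
    }

    public int numVectorsInLeaf(String field) {
        return vectorCountByField.getOrDefault(field, 0);
    }

    public static void main(String[] args) {
        // 10 docs in the segment, but only 4 carry a vector in "emb":
        // the per-field count reflects that, not maxDoc.
        PerFieldVectorCounts meta = new PerFieldVectorCounts(Map.of("emb", 4));
        System.out.println(meta.numVectorsInLeaf("emb"));     // 4
        System.out.println(meta.numVectorsInLeaf("missing")); // 0
    }
}
```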
@@ -779,6 +781,16 @@ Directory getIndexStore(
      doc.add(getKnnVectorField(field, contents[i], vectorSimilarityFunction));
      doc.add(new StringField("id", "id" + i, Field.Store.YES));
      writer.addDocument(doc);
      if (randomBoolean()) {
Adding docs in between vector docs to ensure our sparse vector reader is adequately exercised. Simple improvement in coverage here.
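The shape of that test change, modeled without an actual `IndexWriter` (the helper names here are invented for illustration): every vector doc is still written, but coin flips like the PR's `randomBoolean()` interleave vector-less filler docs, leaving gaps in the doc-id space that force the sparse reader path.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of the coverage improvement: randomly interleave docs without a
// vector field so doc ids of vector docs become sparse (gappy), exercising
// the sparse vector reader instead of the dense fast path.
public final class SparseIndexSketch {
    public static List<Integer> buildDocIds(int numVectorDocs, Random random) {
        List<Integer> docIdsWithVectors = new ArrayList<>();
        int docId = 0;
        for (int i = 0; i < numVectorDocs; i++) {
            docIdsWithVectors.add(docId++); // doc carrying a vector field
            if (random.nextBoolean()) {
                docId++; // filler doc without a vector, consuming a doc id
            }
        }
        return docIdsWithVectors;
    }

    public static void main(String[] args) {
        List<Integer> ids = buildDocIds(100, new Random(42));
        System.out.println(ids.size()); // 100: every vector doc is present
        // ids are strictly increasing but may skip values (the sparse case)
        boolean increasing = true;
        for (int i = 1; i < ids.size(); i++) {
            increasing &= ids.get(i) > ids.get(i - 1);
        }
        System.out.println(increasing);
    }
}
```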
The idea makes sense to me; what is less clear to me is whether this logic belongs to the Query or to the vector reader: should `searchNearestNeighbors` implicitly do a linear scan when `k` is greater than `size()`?
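The alternative the reviewer raises, reduced to a toy in-memory reader (not Lucene code): `searchNearestNeighbors` itself checks `k` against `size()` and degrades to returning every ordinal, so callers never traverse the graph pointlessly.

```java
import java.util.ArrayList;
import java.util.List;

// Toy reader-side fallback: if k >= size(), a linear pass over all vectors
// is already exact, so skip graph traversal and return every ordinal.
public final class ReaderSideFallback {
    private final float[][] vectors; // one stored vector per ordinal

    public ReaderSideFallback(float[][] vectors) { this.vectors = vectors; }

    public int size() { return vectors.length; }

    public List<Integer> searchNearestNeighbors(float[] query, int k) {
        if (k >= size()) {
            List<Integer> allOrds = new ArrayList<>();
            for (int ord = 0; ord < size(); ord++) {
                allOrds.add(ord); // exact by construction: every vector matches
            }
            return allOrds;
        }
        // An approximate HNSW traversal would run here; elided in this sketch.
        throw new UnsupportedOperationException("graph search elided");
    }

    public static void main(String[] args) {
        ReaderSideFallback reader =
            new ReaderSideFallback(new float[][] {{0f, 1f}, {1f, 0f}, {1f, 1f}});
        // k (5) exceeds size() (3): the fallback returns all three ordinals.
        System.out.println(reader.searchNearestNeighbors(new float[] {0f, 0f}, 5).size()); // 3
    }
}
```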
I like that idea, we should do that.
Thanks, I find that it also looks nicer, and it saves one `IndexInput.clone()` as well.
    new OrdinalTranslatedKnnCollector(knnCollector, scorer::ordToDoc),
    getGraph(fieldEntry),
    scorer.getAcceptOrds(acceptDocs));
return;
nit: from a style perspective, I'd rather not `return` here and have an `else` block instead.
…12806) When requesting k >= numVectors, it doesn't make sense to go through the HNSW graph. Even without a user-supplied filter, we should not explore the HNSW graph if it contains fewer than k vectors. One scenario where we may still explore the graph when k >= numVectors is when not every document has a vector and there are deleted docs. But this commit significantly improves things regardless.
As of apache#12806 the hnsw codec has implemented a more complete version of this logic that may trigger without a pre-filter query. Reference apache#12505