Utilize exact kNN search when gathering k > numVectors in a segment #12806
Conversation
@@ -110,6 +110,12 @@ private TopDocs getLeafResults(LeafReaderContext ctx, Weight filterWeight) throw
  int maxDoc = ctx.reader().maxDoc();

  if (filterWeight == null) {
    int cost = liveDocs == null ? maxDoc : liveDocs.length();
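The guard being added can be sketched as a tiny decision helper (illustrative stand-in, not the PR's actual code): when `k` is at least the number of candidate documents in the segment, graph traversal cannot visit fewer vectors than a brute-force pass, so exact search wins outright.

```java
// Illustrative stand-in for the new short-circuit (not Lucene source):
// if k covers every candidate doc in the segment, skip the HNSW graph
// and run the exact (brute-force) kNN scan instead.
public final class ExactSearchGuard {
    // liveDocsLength is null when the segment has no liveDocs at all.
    public static boolean useExactSearch(int k, Integer liveDocsLength, int maxDoc) {
        int cost = (liveDocsLength == null) ? maxDoc : liveDocsLength;
        return k >= cost;
    }

    public static void main(String[] args) {
        // k larger than the whole segment: exact scan.
        System.out.println(useExactSearch(100, null, 50)); // true
        // typical case, k much smaller than the segment: graph search.
        System.out.println(useExactSearch(10, null, 50));  // false
    }
}
```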
I am fairly ignorant around what `liveDocs.length()` means. Here I am assuming if there are `liveDocs`, this iterator might actually be smaller than `maxDoc`.
`liveDocs.length()` is always equal to `maxDoc`.
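To make the reviewer's point concrete, here is a toy model of the liveDocs contract (not Lucene's `Bits` implementation): `length()` spans the whole doc-id space of the segment, so it always equals `maxDoc`; deletions only change what `get(docId)` returns.

```java
import java.util.Arrays;

// Toy model of liveDocs: length() covers every doc id in the segment, so it
// equals maxDoc even after deletions; only get(doc) reflects deleted docs.
public final class LiveDocsModel {
    private final boolean[] live;

    public LiveDocsModel(int maxDoc) {
        live = new boolean[maxDoc];
        Arrays.fill(live, true);
    }

    public void delete(int doc) { live[doc] = false; }
    public int length() { return live.length; }        // always maxDoc
    public boolean get(int doc) { return live[doc]; }  // false once deleted

    public static void main(String[] args) {
        LiveDocsModel liveDocs = new LiveDocsModel(8);
        liveDocs.delete(3);
        System.out.println(liveDocs.length()); // 8: unchanged by the deletion
        System.out.println(liveDocs.get(3));   // false: doc 3 is deleted
    }
}
```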
DocIdSetIterator iterator =
    ConjunctionUtils.intersectIterators(List.of(acceptIterator, vectorScorer.iterator()));
An intersection of the vector values and the user-provided filter seemed like the way it should have always been. Do we want to remove the `fieldExists` query logic when a user provides a pre-filter?
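What the intersection buys can be illustrated with a simplified two-pointer merge over sorted doc-id arrays (a stand-in, not Lucene's iterator machinery): exact scoring only visits docs that both pass the filter and actually have a vector.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of intersecting two doc-id iterators: both inputs are
// sorted ascending, and only ids present in both streams are emitted.
public final class IntersectSketch {
    public static List<Integer> intersect(int[] acceptDocs, int[] vectorDocs) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < acceptDocs.length && j < vectorDocs.length) {
            if (acceptDocs[i] == vectorDocs[j]) {
                out.add(acceptDocs[i]); // passes the filter AND has a vector
                i++;
                j++;
            } else if (acceptDocs[i] < vectorDocs[j]) {
                i++; // filtered-in doc without a vector: skip
            } else {
                j++; // vector doc rejected by the filter: skip
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // docs 4 and 9 are the only ones in both streams
        System.out.println(intersect(new int[] {1, 4, 7, 9}, new int[] {4, 5, 9}));
    }
}
```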
 * @param ctx the leaf reader context
 * @return the number of vectors in the given leaf.
 */
protected abstract int numVectorsInLeaf(LeafReaderContext ctx) throws IOException;
This is cheap; we store it per field. This way we have a good count even when not every document has a vector.
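A toy model of why this is cheap (hypothetical names, not the codec's real layout): the vector count is recorded per field when the segment is written, so answering `numVectorsInLeaf` is a metadata lookup rather than a scan, and it stays accurate when only some documents carry a vector.

```java
import java.util.Map;

// Hypothetical per-segment metadata: the codec records how many documents
// hold a vector for each field at flush time, so the count is a map lookup.
public final class PerFieldVectorCounts {
    private final Map<String, Integer> vectorCountByField; // written once per segment

    public PerFieldVectorCounts(Map<String, Integer> counts) {
        vectorCountByField = counts;
    }

    public int numVectorsInLeaf(String field) {
        return vectorCountByField.getOrDefault(field, 0);
    }

    public static void main(String[] args) {
        // 10 docs in the segment, but only 4 carry a vector in "emb":
        // the per-field count reflects that, not maxDoc.
        PerFieldVectorCounts meta = new PerFieldVectorCounts(Map.of("emb", 4));
        System.out.println(meta.numVectorsInLeaf("emb"));     // 4
        System.out.println(meta.numVectorsInLeaf("missing")); // 0
    }
}
```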
@@ -779,6 +781,16 @@ Directory getIndexStore(
      doc.add(getKnnVectorField(field, contents[i], vectorSimilarityFunction));
      doc.add(new StringField("id", "id" + i, Field.Store.YES));
      writer.addDocument(doc);
      if (randomBoolean()) {
Adding docs in between vector docs to ensure our sparse vector reader is adequately exercised. Simple improvement in coverage here.
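The shape of that test change, modeled without an actual `IndexWriter` (the helper names here are invented for illustration): every vector doc is still written, but coin flips like the PR's `randomBoolean()` interleave vector-less filler docs, leaving gaps in the doc-id space that force the sparse reader path.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of the coverage improvement: randomly interleave docs without a
// vector field so doc ids of vector docs become sparse (gappy), exercising
// the sparse vector reader instead of the dense fast path.
public final class SparseIndexSketch {
    public static List<Integer> buildDocIds(int numVectorDocs, Random random) {
        List<Integer> docIdsWithVectors = new ArrayList<>();
        int docId = 0;
        for (int i = 0; i < numVectorDocs; i++) {
            docIdsWithVectors.add(docId++); // doc carrying a vector field
            if (random.nextBoolean()) {
                docId++; // filler doc without a vector, consuming a doc id
            }
        }
        return docIdsWithVectors;
    }

    public static void main(String[] args) {
        List<Integer> ids = buildDocIds(100, new Random(42));
        System.out.println(ids.size()); // 100: every vector doc is present
        // ids are strictly increasing but may skip values (the sparse case)
        boolean increasing = true;
        for (int i = 1; i < ids.size(); i++) {
            increasing &= ids.get(i) > ids.get(i - 1);
        }
        System.out.println(increasing);
    }
}
```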
The idea makes sense to me; what is less clear to me is whether this logic belongs to the Query or to the vector reader: should `searchNearestNeighbors` implicitly do a linear scan when `k` is greater than `size()`?
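The alternative the reviewer raises, reduced to a toy in-memory reader (not Lucene code): `searchNearestNeighbors` itself checks `k` against `size()` and degrades to returning every ordinal, so callers never traverse the graph pointlessly.

```java
import java.util.ArrayList;
import java.util.List;

// Toy reader-side fallback: if k >= size(), a linear pass over all vectors
// is already exact, so skip graph traversal and return every ordinal.
public final class ReaderSideFallback {
    private final float[][] vectors; // one stored vector per ordinal

    public ReaderSideFallback(float[][] vectors) { this.vectors = vectors; }

    public int size() { return vectors.length; }

    public List<Integer> searchNearestNeighbors(float[] query, int k) {
        if (k >= size()) {
            List<Integer> allOrds = new ArrayList<>();
            for (int ord = 0; ord < size(); ord++) {
                allOrds.add(ord); // exact by construction: every vector matches
            }
            return allOrds;
        }
        // An approximate HNSW traversal would run here; elided in this sketch.
        throw new UnsupportedOperationException("graph search elided");
    }

    public static void main(String[] args) {
        ReaderSideFallback reader =
            new ReaderSideFallback(new float[][] {{0f, 1f}, {1f, 0f}, {1f, 1f}});
        // k (5) exceeds size() (3): the fallback returns all three ordinals.
        System.out.println(reader.searchNearestNeighbors(new float[] {0f, 0f}, 5).size()); // 3
    }
}
```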
I like that idea, we should do that.
Thanks, I find that it also looks nicer, and it saves one `IndexInput.clone()` as well.
    new OrdinalTranslatedKnnCollector(knnCollector, scorer::ordToDoc),
    getGraph(fieldEntry),
    scorer.getAcceptOrds(acceptDocs));
return;
nit: from a style perspective, I'd rather not `return` here and have an `else` block instead.
…12806) When requesting k >= numVectors, it doesn't make sense to go through the HNSW graph. Even without a user-supplied filter, we should not explore the HNSW graph if it contains fewer than k vectors. One scenario where we may still explore the graph when k >= numVectors is when not every document has a vector and there are deleted docs. But this commit significantly improves things regardless.
As of apache#12806 the hnsw codec has implemented a more complete version of this logic that may trigger without a pre-filter query. Reference apache#12505