LUCENE-10040: Handle deletions in nearest vector search #239
Conversation
This PR extends VectorReader#search to take a parameter specifying the live docs. LeafReader#searchNearestVectors then always returns the k nearest undeleted docs. To implement this, the HNSW algorithm will only add a candidate to the result set if it is a live doc. The graph search still visits and traverses deleted docs as it gathers candidates.
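The filtering behavior described above can be sketched in plain Java. This is a minimal illustration, not Lucene's actual classes: a local `Bits` interface stands in for the live-docs bitset, and a simple candidate loop shows how deleted docs are still visited while only live docs are collected.

```java
import java.util.ArrayList;
import java.util.List;

public class LiveDocsFilterSketch {
    // Stand-in for org.apache.lucene.util.Bits
    interface Bits { boolean get(int index); }

    /** Collect up to k live candidates; deleted docs are visited but never returned. */
    static List<Integer> collectLive(int[] candidateOrds, Bits liveDocs, int k) {
        List<Integer> results = new ArrayList<>();
        for (int ord : candidateOrds) { // the traversal still sees every node
            // null liveDocs means "no deletions", matching getLiveDocs()'s contract
            if (liveDocs == null || liveDocs.get(ord)) {
                results.add(ord);
                if (results.size() == k) break;
            }
        }
        return results;
    }

    public static void main(String[] args) {
        int[] candidates = {0, 1, 2, 3, 4};
        Bits live = ord -> ord % 2 == 0; // docs 1 and 3 are "deleted"
        List<Integer> top = collectLive(candidates, live, 2);
        if (!top.equals(List.of(0, 2))) throw new AssertionError(top);
        System.out.println("top=" + top);
    }
}
```

The key point is that deletions do not prune the graph traversal itself; they only filter which candidates may enter the result set.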
import org.apache.lucene.util.TestUtil;

public class TestLucene90HnswVectorsFormat extends BaseKnnVectorsFormatTestCase {
  @Override
  protected Codec getCodec() {
    return TestUtil.getDefaultCodec();
  }

  public void testSearchWithDeletions() throws IOException {
This is not a perfect place for this test; I could move it to a test like TestKnnVectorQuery once the higher-level API is merged.
It's a good test; fine to have it here, but there is also TestKnnGraph which is not tied to any specific class.
How about also having a simple test for deletions where we delete all the documents and make sure none are returned, yet TopDocs.totalHits.value > 0?
👍
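The all-deleted case suggested above can be sketched without Lucene. This is an illustrative simulation, not the real test: a local `Bits` interface and a tiny search loop show the expected contract, that when every doc is deleted no hits come back, while the visited-node count (what Lucene surfaces as TopDocs.totalHits.value) stays positive because the graph search still traverses deleted nodes.

```java
import java.util.ArrayList;
import java.util.List;

public class AllDeletedSketch {
    interface Bits { boolean get(int index); }

    record Result(List<Integer> hits, long visited) {}

    static Result search(int[] ords, Bits liveDocs, int k) {
        List<Integer> hits = new ArrayList<>();
        long visited = 0;
        for (int ord : ords) {
            visited++; // deleted docs still count as visited
            if ((liveDocs == null || liveDocs.get(ord)) && hits.size() < k) {
                hits.add(ord);
            }
        }
        return new Result(hits, visited);
    }

    public static void main(String[] args) {
        // every document deleted: no hits, but visited count is positive
        Result r = search(new int[] {0, 1, 2}, ord -> false, 2);
        if (!r.hits().isEmpty() || r.visited() == 0) throw new AssertionError(r);
        System.out.println("hits=" + r.hits().size() + " visited=" + r.visited());
    }
}
```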
  float score = similarityFunction.compare(query, vectors.vectorValue(friendOrd));
- if (results.insertWithOverflow(friendOrd, score)) {
+ if (results.size() < topK || bound.check(score) == false) {
Should the condition `results.size() < topK` instead be `results.size() <= numSeed`, to imitate the previous behaviour of `results.insertWithOverflow(friendOrd, score)`?
yes I think so. They're generally the same now in the normal path, but unless/until we eliminate the distinction there are a few cases where they differ (tiny index, mostly) and we should use numSeed here.
Good point, I'll update this!
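The insertion guard under discussion can be sketched with a plain `PriorityQueue` standing in for Lucene's NeighborQueue (names and capacity semantics here are illustrative): a neighbor is accepted while the queue holds fewer than numSeed entries, or when its score beats the current worst kept score.

```java
import java.util.PriorityQueue;

public class InsertGuardSketch {
    public static void main(String[] args) {
        int numSeed = 3;
        // min-heap: the head is the worst (lowest) score currently kept
        PriorityQueue<Float> results = new PriorityQueue<>();
        float[] scores = {0.5f, 0.9f, 0.1f, 0.7f, 0.2f};
        for (float score : scores) {
            // insert while under capacity, or when score improves on the worst kept
            if (results.size() < numSeed || score > results.peek()) {
                results.add(score);
                if (results.size() > numSeed) {
                    results.poll(); // evict the worst, keeping exactly numSeed entries
                }
            }
        }
        if (results.size() != 3 || results.peek() != 0.5f) throw new AssertionError(results);
        System.out.println("worst kept=" + results.peek());
    }
}
```

This mirrors why the bound should use numSeed rather than topK: the queue's overflow capacity, not the final result count, decides admission.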
- results.add(entryPoint, similarityFunction.compare(query, vectors.vectorValue(entryPoint)));
+ float score = similarityFunction.compare(query, vectors.vectorValue(entryPoint));
+ candidates.add(entryPoint, score);
+ if (acceptOrds == null || acceptOrds.get(entryPoint)) {
Maybe cleaner to require non-null. The caller can supply a `MatchAllBits` if that's what they need.
I tried both ways and liked this approach because it matches the API for `BulkScorer#score` and the return value of `getLiveDocs`. For me this consistency was worth it, even though in isolation it's slightly less clean.
OK, I agree it's nice to be able to pass the `getLiveDocs` return value directly.
@@ -42,13 +42,6 @@
    }
  }

- NeighborQueue copy(boolean reversed) {
this was unused I guess?
It was previously used in `HnswGraph#search`, but this PR removed that.
@@ -136,7 +136,7 @@ public void setInfoStream(InfoStream infoStream) {
  void addGraphNode(float[] value) throws IOException {
    NeighborQueue candidates =
        HnswGraph.search(
-           value, beamWidth, beamWidth, vectorValues, similarityFunction, hnsw, random);
+           value, beamWidth, beamWidth, vectorValues, similarityFunction, hnsw, null, random);
Maybe add a comment to the effect that we don't expect to see any deleted documents here?
👍
public void testSearchWithDeletions() throws IOException {
  Directory dir = newDirectory();
  IndexWriterConfig cfg = new IndexWriterConfig();
  IndexWriter w = new IndexWriter(dir, cfg);
I like to use try-with-resources in these tests; then the directory, reader and writer get auto-closed when exiting the scope, and in case there is a failure you don't get messy nastygrams from the test framework about unclosed resources.
I learned a new word 'nastygram'!
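The try-with-resources pattern suggested above can be shown with a toy `AutoCloseable` standing in for Lucene's Directory, IndexWriter, and IndexReader (the class names below are illustrative): all three resources are closed in reverse declaration order even if the body throws, which is what prevents the test framework's unclosed-resource complaints.

```java
import java.util.ArrayList;
import java.util.List;

public class TryWithResourcesSketch {
    static List<String> closed = new ArrayList<>();

    // Toy resource that records when it is closed
    record Resource(String name) implements AutoCloseable {
        @Override public void close() { closed.add(name); }
    }

    public static void main(String[] args) {
        try (Resource dir = new Resource("dir");
             Resource writer = new Resource("writer");
             Resource reader = new Resource("reader")) {
            // index, search, assert... an exception here still closes everything
        }
        // closed in reverse order of declaration
        if (!closed.equals(List.of("reader", "writer", "dir"))) throw new AssertionError(closed);
        System.out.println(closed);
    }
}
```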
  d.add(new KnnVectorField("vector", randomVector(dim)));
  docIndex++;
} else {
  d.add(new StringField("other", "value", Field.Store.NO));
Maybe use "id" for these too so they sometimes get deleted?
This is a good idea, it actually helped catch a test bug.
Thanks, looks good!
@jtibshirani Thanks Julie, LGTM as well.
@@ -220,7 +220,7 @@ public final TopDocs searchNearestVectors(String field, float[] target, int k)
    return null;
  }

- return getVectorReader().search(field, target, k);
+ return getVectorReader().search(field, target, k, getLiveDocs());
We probably need to change the signature of `searchNearestVectors` to take a `Bits acceptDocs` instead; otherwise it would be impossible to wrap this reader in order to change the set of live documents.
To check I understand, you're thinking of a wrapper like `FilterLeafReader`, specifically a use case like `SoftDeletesFilterLeafReader`?
Yes indeed.
That makes sense, and since this is a pretty low-level API, the extra parameter and the requirement to pass in live docs seem okay. This ensures it's possible to wrap a LeafReader and modify its live docs, as we do in SoftDeletesFilterLeafReader.
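The wrapping scenario can be sketched with local stand-in interfaces (this is not Lucene's actual `FilterLeafReader` API, just an illustration of the design point): when the search path consults `getLiveDocs()` through the reader, a wrapper can narrow the live docs, for example to apply soft deletes, without the search code changing.

```java
public class WrappedLiveDocsSketch {
    interface Bits { boolean get(int index); }

    // Minimal stand-in for the reader surface relevant here
    interface Reader { Bits getLiveDocs(); }

    /** Wraps a reader and intersects its live docs with extra (soft-delete) bits. */
    record SoftDeletesReader(Reader in, Bits softLive) implements Reader {
        @Override public Bits getLiveDocs() {
            Bits hard = in.getLiveDocs();
            // a doc is live only if it survives both hard and soft deletes
            return ord -> (hard == null || hard.get(ord)) && softLive.get(ord);
        }
    }

    public static void main(String[] args) {
        Reader base = () -> ord -> ord != 1;                           // doc 1 hard-deleted
        Reader wrapped = new SoftDeletesReader(base, ord -> ord != 2); // doc 2 soft-deleted
        Bits live = wrapped.getLiveDocs();
        if (live.get(1) || live.get(2) || !live.get(0)) throw new AssertionError();
        System.out.println("live(0)=" + live.get(0));
    }
}
```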
I tried modifying my usual ann-benchmark to randomly delete 25% of documents after force merging. As expected, it's slower, but not by a huge amount (sift-128-euclidean, M=16, efSearch=500).
If you don't have objections, I'd like to merge now and then work on some final benchmarks -- ideally we could make sure luceneutil covers ANN with deletions.
If we set numSeed = 10, this test fails sometimes because it may mark expected result docs (from 0 to 9) as deleted, so they don't end up being retrieved, resulting in low recall.
- set numSeed to 10 to ensure 10 results are returned
- add a startIndex parameter to createRandomAcceptOrds so that documents before startIndex are never deleted
- use a startIndex of 10 for createRandomAcceptOrds

Relates to #239
Hi @jtibshirani / @msokolov, I had a quick doubt about the deletion implementation. From the discussion and changes I understand that, for now, we have only marked the docs as deleted via the Bits value and ignored those documents while selecting our candidates.
Another unrelated doubt:
@harishankar-gopalan sorry for the slow response! Your overall understanding is right. In Lucene, deletions are handled by marking a document as deleted using a 'tombstone'. The index structures are not actually updated (this includes the HNSW graph). In response to your questions...
For segment merges, we combine the vectors across all segments and build a new graph from scratch. We make sure to skip over deleted documents during this merge, so they have no effect on the time it takes to build the graph. This merge process is quite expensive and we're brainstorming ways of making it faster (#11354).
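The merge behavior described above can be sketched in a few lines. This is an illustrative simplification, not Lucene's merge code: vectors from each segment are concatenated, deleted ordinals are skipped, and the survivors receive fresh ordinals before a new graph would be built over them from scratch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MergeSkipDeletedSketch {
    interface Bits { boolean get(int index); }

    /** Concatenate per-segment vectors, skipping deleted ordinals. */
    static List<float[]> mergeLiveVectors(List<float[][]> segments, List<Bits> liveDocs) {
        List<float[]> merged = new ArrayList<>();
        for (int s = 0; s < segments.size(); s++) {
            float[][] vecs = segments.get(s);
            Bits live = liveDocs.get(s);
            for (int ord = 0; ord < vecs.length; ord++) {
                if (live == null || live.get(ord)) {
                    merged.add(vecs[ord]); // new ordinal = merged.size() - 1
                }
            }
        }
        return merged; // a new HNSW graph would be built over only these vectors
    }

    public static void main(String[] args) {
        List<float[][]> segs = List.of(
            new float[][] {{1f}, {2f}},  // segment 0
            new float[][] {{3f}, {4f}}); // segment 1
        // segment 0: only ord 0 live; segment 1: no deletions (null)
        List<Bits> live = Arrays.asList(ord -> ord == 0, null);
        List<float[]> merged = mergeLiveVectors(segs, live);
        if (merged.size() != 3 || merged.get(0)[0] != 1f) throw new AssertionError();
        System.out.println("merged=" + merged.size());
    }
}
```

Because deleted vectors never enter the merged set, they add nothing to the cost of building the new graph, which is the point made above.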
This PR extends LeafReader#searchNearestVectors to take a parameter specifying
the live docs. The method then always returns the k nearest undeleted docs.
To implement this, the HNSW algorithm will only add a candidate to the result
set if it is a live doc. The graph search still visits and traverses deleted
docs as it gathers candidates.