
LUCENE-10040: Handle deletions in nearest vector search #239

Merged: 11 commits merged into apache:main on Aug 16, 2021

Conversation

jtibshirani (Member) commented Aug 9, 2021

This PR extends VectorReader#search and LeafReader#searchNearestVectors to take a parameter specifying
the live docs. LeafReader#searchNearestVectors then always returns the k nearest
undeleted docs.

To implement this, the HNSW algorithm only adds a candidate to the result
set if it is a live doc. The graph search still visits and traverses deleted
docs as it gathers candidates.
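To make the mechanics concrete, here is a minimal sketch of such a search loop (not the actual Lucene code; graph.neighborsOf is a hypothetical accessor, while the other names follow the diffs below). Deleted nodes are still traversed so the graph stays connected, but only live nodes enter the result queue:

  while (candidates.size() > 0) {
    int node = candidates.pop();
    for (int friendOrd : graph.neighborsOf(node)) { // hypothetical accessor
      float score = similarityFunction.compare(query, vectors.vectorValue(friendOrd));
      candidates.add(friendOrd, score); // always traverse, even through deleted docs
      if (acceptOrds == null || acceptOrds.get(friendOrd)) {
        results.insertWithOverflow(friendOrd, score); // only live docs become results
      }
    }
  }

Skipping deleted nodes during traversal (rather than only in the results) would disconnect regions of the graph and hurt recall.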
import org.apache.lucene.util.TestUtil;

public class TestLucene90HnswVectorsFormat extends BaseKnnVectorsFormatTestCase {
  @Override
  protected Codec getCodec() {
    return TestUtil.getDefaultCodec();
  }

  public void testSearchWithDeletions() throws IOException {
jtibshirani (Member Author):

This is not a perfect place for this test; I could move it to something like TestKnnVectorQuery once the higher-level API is merged.

Contributor:

It's a good test; fine to have it here, but there is also TestKnnGraph which is not tied to any specific class.

How about also adding a simple deletions test where we delete all the documents and make sure none are returned, yet TopDocs.totalHits.value > 0?
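For illustration, the suggested all-deleted case could look roughly like this (a hedged sketch; it assumes the searchNearestVectors(field, target, k) signature shown in a diff later in this thread, and a writer w over a "vector" field as in the test above):

  // Delete every document but don't merge, so the graph nodes remain.
  w.deleteDocuments(new MatchAllDocsQuery());
  try (IndexReader reader = DirectoryReader.open(w)) {
    LeafReader leaf = reader.leaves().get(0).reader();
    TopDocs results = leaf.searchNearestVectors("vector", randomVector(dim), 5);
    assertEquals(0, results.scoreDocs.length); // nothing live to return
    assertTrue(results.totalHits.value > 0);   // but nodes were still visited
  }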

jtibshirani (Member Author):

👍

      float score = similarityFunction.compare(query, vectors.vectorValue(friendOrd));
-     if (results.insertWithOverflow(friendOrd, score)) {
+     if (results.size() < topK || bound.check(score) == false) {
Contributor:

Should the condition results.size() < topK instead be results.size() <= numSeed, to imitate the previous behaviour of results.insertWithOverflow(friendOrd, score)?

Contributor:

Yes, I think so. They're generally the same in the normal path now, but unless/until we eliminate the distinction there are a few cases where they differ (mostly tiny indexes), and we should use numSeed here.

jtibshirani (Member Author):

Good point, I'll update this!
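For reference, the suggestion amounts to the following equivalence (a sketch using the names from the diff above): insertWithOverflow on a queue of capacity numSeed accepts an element exactly when the queue is not yet full or the new score beats the current worst entry, so the explicit test should mirror the queue's capacity, not topK:

  // Equivalent to the old results.insertWithOverflow(friendOrd, score) on a
  // queue of capacity numSeed (which can differ from topK on tiny indexes):
  if (results.size() < numSeed || bound.check(score) == false) {
    results.insertWithOverflow(friendOrd, score);
  }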

-     results.add(entryPoint, similarityFunction.compare(query, vectors.vectorValue(entryPoint)));
+     float score = similarityFunction.compare(query, vectors.vectorValue(entryPoint));
+     candidates.add(entryPoint, score);
+     if (acceptOrds == null || acceptOrds.get(entryPoint)) {
Contributor:

Maybe it would be cleaner to require non-null; the caller can supply a MatchAllBits if that's what they need.

jtibshirani (Member Author):

I tried both ways and liked this approach because it matches the API for BulkScorer#score and the return value for getLiveDocs. For me this consistency was worth it, even though in isolation it's slightly less clean.

Contributor:

OK, I agree it's nice to be able to pass the getLiveDocs return value directly.
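For context, LeafReader#getLiveDocs returns null when a segment has no deletions, so a null-tolerant search API lets callers forward its result unchanged (a sketch; doVectorSearch is a hypothetical stand-in for the search entry point):

  Bits liveDocs = leafReader.getLiveDocs(); // null means every doc is live
  // No need to wrap with new Bits.MatchAllBits(leafReader.maxDoc()):
  doVectorSearch(query, liveDocs);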

@@ -42,13 +42,6 @@
    }
  }

- NeighborQueue copy(boolean reversed) {
Contributor:

this was unused I guess?

jtibshirani (Member Author):

It was previously used in HnswGraph#search, but this PR removed that usage.

@@ -136,7 +136,7 @@ public void setInfoStream(InfoStream infoStream) {
  void addGraphNode(float[] value) throws IOException {
    NeighborQueue candidates =
        HnswGraph.search(
-           value, beamWidth, beamWidth, vectorValues, similarityFunction, hnsw, random);
+           value, beamWidth, beamWidth, vectorValues, similarityFunction, hnsw, null, random);
Contributor:

Maybe add a comment to the effect that we don't expect to see any deleted documents here?

jtibshirani (Member Author):

👍

  public void testSearchWithDeletions() throws IOException {
    Directory dir = newDirectory();
    IndexWriterConfig cfg = new IndexWriterConfig();
    IndexWriter w = new IndexWriter(dir, cfg);
Contributor:

I like to use try-with-resources in these tests; then the directory, reader and writer get auto-closed when exiting the scope, and in case there is a failure you don't get messy nastygrams from the test framework about unclosed resources.

jtibshirani (Member Author):

I learned a new word 'nastygram'!
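For reference, the suggested pattern looks roughly like this (a sketch; newDirectory() and friends come from Lucene's test framework):

  try (Directory dir = newDirectory();
      IndexWriter w = new IndexWriter(dir, new IndexWriterConfig())) {
    // ... index documents and apply deletions ...
    try (IndexReader reader = DirectoryReader.open(w)) {
      // ... run the search and assertions ...
    }
  } // everything is closed even when an assertion fails mid-test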

        d.add(new KnnVectorField("vector", randomVector(dim)));
        docIndex++;
      } else {
        d.add(new StringField("other", "value", Field.Store.NO));
Contributor:

Maybe use "id" for these too so they sometimes get deleted?

jtibshirani (Member Author):

This is a good idea; it actually helped catch a test bug.

msokolov (Contributor) left a comment:

Thanks, looks good!

mayya-sharipova (Contributor) left a comment:

@jtibshirani Thanks Julie, LGTM as well.

@@ -220,7 +220,7 @@ public final TopDocs searchNearestVectors(String field, float[] target, int k)
      return null;
    }

-   return getVectorReader().search(field, target, k);
+   return getVectorReader().search(field, target, k, getLiveDocs());
Contributor:

We probably need to change the signature of searchNearestVectors to take a Bits acceptDocs instead; otherwise it would be impossible to wrap this reader in order to change the set of live documents.

jtibshirani (Member Author):

To check I understand, you're thinking of a wrapper like FilterLeafReader, specifically a use case like SoftDeletesFilterLeafReader?

Contributor:

Yes indeed.

jtibshirani (Member Author):

That makes sense, and this is a pretty low-level API, so the extra parameter/requirement to pass in live docs seems okay.

This ensures it's possible to wrap a LeafReader and modify its live docs, as we
do in SoftDeletesFilterLeafReader.
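A sketch of the wrapper use case being discussed: a FilterLeafReader subclass can substitute its own live docs, which only works when the search path consults getLiveDocs() (or takes the Bits explicitly) instead of reading the segment's deletes directly. This is a simplified, hypothetical wrapper, not SoftDeletesFilterLeafReader itself:

  class CustomLiveDocsReader extends FilterLeafReader {
    private final Bits customLiveDocs;

    CustomLiveDocsReader(LeafReader in, Bits customLiveDocs) {
      super(in);
      this.customLiveDocs = customLiveDocs;
    }

    @Override
    public Bits getLiveDocs() {
      return customLiveDocs; // searches should see these, not the segment's deletes
    }

    @Override
    public CacheHelper getCoreCacheHelper() {
      return in.getCoreCacheHelper();
    }

    @Override
    public CacheHelper getReaderCacheHelper() {
      return null; // live docs differ from the wrapped reader's
    }
  }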
jtibshirani (Member Author) commented:

I tried modifying my usual ann-benchmark to randomly delete 25% of documents after force merging. As expected, it's slower but not by a huge amount.

sift-128-euclidean, M=16, efSearch=500

Approach                   Baseline QPS    Deletions QPS
LuceneHnsw(n_cands=10)     9045.574        7871.898
LuceneHnsw(n_cands=50)     3789.169        3035.740
LuceneHnsw(n_cands=100)    2333.714        1769.330
LuceneHnsw(n_cands=500)     579.850         448.190
LuceneHnsw(n_cands=800)     386.105         299.636                      

If you don't have objections, I'd like to merge now and then work on some final benchmarks -- ideally we could make sure luceneutil covers ANN with deletions.

jtibshirani merged commit 6993fb9 into apache:main on Aug 16, 2021
jtibshirani deleted the hnsw-deletions branch on August 16, 2021 at 14:44
mayya-sharipova added a commit to mayya-sharipova/lucene that referenced this pull request Sep 2, 2021
If we set numSeed = 10, this test sometimes fails because it may mark the
expected result docs (0 to 9) as deleted, so they don't end up being
retrieved, resulting in low recall.

- set numSeed to 10 to ensure 10 results are returned
- add a startIndex parameter to createRandomAcceptOrds that allows
  documents before startIndex to be NOT deleted
- use a startIndex equal to 10 for createRandomAcceptOrds

Relates to apache#239
mayya-sharipova added a commit that referenced this pull request Sep 6, 2021
(Same commit message as above.)

Relates to #239
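For illustration, the createRandomAcceptOrds change described in these commits amounts to something like the following (a hypothetical reconstruction; the real helper and the random() source live in Lucene's test framework):

  // Ordinals below startIndex are always accepted, so the expected top
  // results can never be randomly masked out of the test.
  static Bits createRandomAcceptOrds(int startIndex, int length) {
    FixedBitSet bits = new FixedBitSet(length);
    bits.set(0, startIndex); // keep the expected results live
    for (int i = startIndex; i < length; i++) {
      if (random().nextBoolean()) {
        bits.set(i);
      }
    }
    return bits;
  }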
harishankar-gopalan commented:

Hi @jtibshirani / @msokolov, I had a quick doubt about the deletion implementation. From the discussion and changes, I understand that for now we have only marked the docs as deleted via the Bits value and ignored those documents while selecting our candidates.
With that basic understanding, my doubts are:

  1. Is the search space being expanded to meet the top-k criteria?
  2. Is there any active work going on to prune these "marked-for-delete" nodes in the graph, i.e. implementing the version of delete suggested by the Weaviate author(s) here?

Another unrelated doubt:

  1. How are segment merges handled in general? I suppose the graphs of the corresponding segments would also have to be merged. In the same context, do deleted nodes have any effect on merge operations, in terms of high time complexity in some edge cases that users should be aware of?

jtibshirani (Member Author) commented:

@harishankar-gopalan sorry for the slow response! Your overall understanding is right. In Lucene, deletions are handled by marking a document as deleted using a 'tombstone'. The index structures themselves are not actually updated (this includes the HNSW graph).

In response to your questions...

  1. Yes, when there are deletions, we make sure to expand the search space to retrieve the top k. We use a strategy similar to the one many vector search engines use for kNN with filtering: during the HNSW search, we exclude deleted nodes from the final candidate set, but deleted nodes are still used for traversing the graph.
  2. No, there is no work in that direction. This is because Lucene segments are never updated after they are first written (except under rare circumstances). Lucene's immutable segment model is core to its design, and it's the reason we use 'tombstones' instead of modifying the graph in place.

For segment merges, we combine the vectors across all segments and build a new graph from scratch. We make sure to skip over deleted documents during this merge, so they have no effect on the time it takes to build the graph. This merge process is quite expensive, and we're brainstorming ways of making it faster (#11354).
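Roughly, the merge behavior described above looks like this (a simplified, hypothetical sketch, not the actual merge code; MergeSourceSegment and vectorValue(doc) are stand-in names, while addGraphNode mirrors the builder method shown in the diff earlier):

  // Rebuild the graph from scratch using only live documents; deleted docs
  // from the source segments never enter the merged graph.
  for (MergeSourceSegment segment : segmentsToMerge) { // hypothetical type
    Bits liveDocs = segment.getLiveDocs();
    for (int doc = 0; doc < segment.maxDoc(); doc++) {
      if (liveDocs == null || liveDocs.get(doc)) {
        graphBuilder.addGraphNode(segment.vectorValue(doc));
      }
    }
  }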
