LUCENE-10040: Handle deletions in nearest vector search #239
Conversation
This PR extends VectorReader#search to take a parameter specifying the live docs. LeafReader#searchNearestVectors then always returns the k nearest undeleted docs. To implement this, the HNSW algorithm will only add a candidate to the result set if it is a live doc. The graph search still visits and traverses deleted docs as it gathers candidates.
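The filtering behavior described above can be sketched in plain Java. This is a minimal illustration, not Lucene's actual classes: a local `Bits` interface stands in for the live-docs bitset, and a simple candidate loop shows how deleted docs are still visited while only live docs are collected.

```java
import java.util.ArrayList;
import java.util.List;

public class LiveDocsFilterSketch {
    // Stand-in for org.apache.lucene.util.Bits
    interface Bits { boolean get(int index); }

    /** Collect up to k live candidates; deleted docs are visited but never returned. */
    static List<Integer> collectLive(int[] candidateOrds, Bits liveDocs, int k) {
        List<Integer> results = new ArrayList<>();
        for (int ord : candidateOrds) { // the traversal still sees every node
            // null liveDocs means "no deletions", matching getLiveDocs()'s contract
            if (liveDocs == null || liveDocs.get(ord)) {
                results.add(ord);
                if (results.size() == k) break;
            }
        }
        return results;
    }

    public static void main(String[] args) {
        int[] candidates = {0, 1, 2, 3, 4};
        Bits live = ord -> ord % 2 == 0; // docs 1 and 3 are "deleted"
        List<Integer> top = collectLive(candidates, live, 2);
        if (!top.equals(List.of(0, 2))) throw new AssertionError(top);
        System.out.println("top=" + top);
    }
}
```

The key point is that deletions do not prune the graph traversal itself; they only filter which candidates may enter the result set.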
import org.apache.lucene.util.TestUtil;

public class TestLucene90HnswVectorsFormat extends BaseKnnVectorsFormatTestCase {
  @Override
  protected Codec getCodec() {
    return TestUtil.getDefaultCodec();
  }

  public void testSearchWithDeletions() throws IOException {
This is not a perfect place for this test; I could move it to a test like TestKnnVectorQuery once the higher-level API is merged.
It's a good test; fine to have it here, but there is also TestKnnGraph which is not tied to any specific class.
How about also having a simple test for deletions where we delete all the documents and make sure none are returned, yet TopDocs.totalHits.value > 0?
👍
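The all-deleted case suggested above can be sketched without Lucene. This is an illustrative simulation, not the real test: a local `Bits` interface and a tiny search loop show the expected contract, that when every doc is deleted no hits come back, while the visited-node count (what Lucene surfaces as TopDocs.totalHits.value) stays positive because the graph search still traverses deleted nodes.

```java
import java.util.ArrayList;
import java.util.List;

public class AllDeletedSketch {
    interface Bits { boolean get(int index); }

    record Result(List<Integer> hits, long visited) {}

    static Result search(int[] ords, Bits liveDocs, int k) {
        List<Integer> hits = new ArrayList<>();
        long visited = 0;
        for (int ord : ords) {
            visited++; // deleted docs still count as visited
            if ((liveDocs == null || liveDocs.get(ord)) && hits.size() < k) {
                hits.add(ord);
            }
        }
        return new Result(hits, visited);
    }

    public static void main(String[] args) {
        // every document deleted: no hits, but visited count is positive
        Result r = search(new int[] {0, 1, 2}, ord -> false, 2);
        if (!r.hits().isEmpty() || r.visited() == 0) throw new AssertionError(r);
        System.out.println("hits=" + r.hits().size() + " visited=" + r.visited());
    }
}
```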
  float score = similarityFunction.compare(query, vectors.vectorValue(friendOrd));
- if (results.insertWithOverflow(friendOrd, score)) {
+ if (results.size() < topK || bound.check(score) == false) {
Should the condition `results.size() < topK` instead be `results.size() <= numSeed`, to imitate the previous behaviour of `results.insertWithOverflow(friendOrd, score)`?
yes I think so. They're generally the same now in the normal path, but unless/until we eliminate the distinction there are a few cases where they differ (tiny index, mostly) and we should use numSeed here.
Good point, I'll update this!
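The insertion guard under discussion can be sketched with a plain `PriorityQueue` standing in for Lucene's NeighborQueue (names and capacity semantics here are illustrative): a neighbor is accepted while the queue holds fewer than numSeed entries, or when its score beats the current worst kept score.

```java
import java.util.PriorityQueue;

public class InsertGuardSketch {
    public static void main(String[] args) {
        int numSeed = 3;
        // min-heap: the head is the worst (lowest) score currently kept
        PriorityQueue<Float> results = new PriorityQueue<>();
        float[] scores = {0.5f, 0.9f, 0.1f, 0.7f, 0.2f};
        for (float score : scores) {
            // insert while under capacity, or when score improves on the worst kept
            if (results.size() < numSeed || score > results.peek()) {
                results.add(score);
                if (results.size() > numSeed) {
                    results.poll(); // evict the worst, keeping exactly numSeed entries
                }
            }
        }
        if (results.size() != 3 || results.peek() != 0.5f) throw new AssertionError(results);
        System.out.println("worst kept=" + results.peek());
    }
}
```

This mirrors why the bound should use numSeed rather than topK: the queue's overflow capacity, not the final result count, decides admission.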
- results.add(entryPoint, similarityFunction.compare(query, vectors.vectorValue(entryPoint)));
+ float score = similarityFunction.compare(query, vectors.vectorValue(entryPoint));
+ candidates.add(entryPoint, score);
+ if (acceptOrds == null || acceptOrds.get(entryPoint)) {
Maybe cleaner to require non-null. The caller can supply a `MatchAllBits` if that's what they need.
I tried both ways and liked this approach because it matches the API for `BulkScorer#score` and the return value of `getLiveDocs`. For me this consistency was worth it, even though in isolation it's slightly less clean.
OK, I agree it's nice to be able to pass the `getLiveDocs` return value directly.
@@ -42,13 +42,6 @@
    }
  }

- NeighborQueue copy(boolean reversed) {
this was unused I guess?
It was previously used in `HnswGraph#search`, but this PR removed that.
@@ -136,7 +136,7 @@ public void setInfoStream(InfoStream infoStream) {
  void addGraphNode(float[] value) throws IOException {
    NeighborQueue candidates =
        HnswGraph.search(
-           value, beamWidth, beamWidth, vectorValues, similarityFunction, hnsw, random);
+           value, beamWidth, beamWidth, vectorValues, similarityFunction, hnsw, null, random);
Maybe add a comment to the effect that we don't expect to see any deleted documents here?
👍
public void testSearchWithDeletions() throws IOException {
  Directory dir = newDirectory();
  IndexWriterConfig cfg = new IndexWriterConfig();
  IndexWriter w = new IndexWriter(dir, cfg);
I like to use try-with-resources in these tests; then the directory, reader and writer get auto-closed when exiting the scope, and in case there is a failure you don't get messy nastygrams from the test framework about unclosed resources.
I learned a new word 'nastygram'!
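The try-with-resources pattern suggested above can be shown with a toy `AutoCloseable` standing in for Lucene's Directory, IndexWriter, and IndexReader (the class names below are illustrative): all three resources are closed in reverse declaration order even if the body throws, which is what prevents the test framework's unclosed-resource complaints.

```java
import java.util.ArrayList;
import java.util.List;

public class TryWithResourcesSketch {
    static List<String> closed = new ArrayList<>();

    // Toy resource that records when it is closed
    record Resource(String name) implements AutoCloseable {
        @Override public void close() { closed.add(name); }
    }

    public static void main(String[] args) {
        try (Resource dir = new Resource("dir");
             Resource writer = new Resource("writer");
             Resource reader = new Resource("reader")) {
            // index, search, assert... an exception here still closes everything
        }
        // closed in reverse order of declaration
        if (!closed.equals(List.of("reader", "writer", "dir"))) throw new AssertionError(closed);
        System.out.println(closed);
    }
}
```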
  d.add(new KnnVectorField("vector", randomVector(dim)));
  docIndex++;
} else {
  d.add(new StringField("other", "value", Field.Store.NO));
Maybe use "id" for these too so they sometimes get deleted?
This is a good idea, it actually helped catch a test bug.
Thanks, looks good!
@jtibshirani Thanks Julie, LGTM as well.
@@ -220,7 +220,7 @@ public final TopDocs searchNearestVectors(String field, float[] target, int k)
    return null;
  }

- return getVectorReader().search(field, target, k);
+ return getVectorReader().search(field, target, k, getLiveDocs());
We probably need to change the signature of `searchNearestVectors` to take a `Bits acceptDocs` instead; otherwise it would be impossible to wrap this reader in order to change the set of live documents.
To check I understand, you're thinking of a wrapper like `FilterLeafReader`, specifically a use case like `SoftDeletesFilterLeafReader`?
Yes indeed.
That makes sense, and since this is a pretty low-level API, the extra parameter and the requirement to pass in live docs seem okay. This ensures it's possible to wrap a LeafReader and modify its live docs, as we do in SoftDeletesFilterLeafReader.
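The wrapping scenario can be sketched with local stand-in interfaces (this is not Lucene's actual `FilterLeafReader` API, just an illustration of the design point): when the search path consults `getLiveDocs()` through the reader, a wrapper can narrow the live docs, for example to apply soft deletes, without the search code changing.

```java
public class WrappedLiveDocsSketch {
    interface Bits { boolean get(int index); }

    // Minimal stand-in for the reader surface relevant here
    interface Reader { Bits getLiveDocs(); }

    /** Wraps a reader and intersects its live docs with extra (soft-delete) bits. */
    record SoftDeletesReader(Reader in, Bits softLive) implements Reader {
        @Override public Bits getLiveDocs() {
            Bits hard = in.getLiveDocs();
            // a doc is live only if it survives both hard and soft deletes
            return ord -> (hard == null || hard.get(ord)) && softLive.get(ord);
        }
    }

    public static void main(String[] args) {
        Reader base = () -> ord -> ord != 1;                           // doc 1 hard-deleted
        Reader wrapped = new SoftDeletesReader(base, ord -> ord != 2); // doc 2 soft-deleted
        Bits live = wrapped.getLiveDocs();
        if (live.get(1) || live.get(2) || !live.get(0)) throw new AssertionError();
        System.out.println("live(0)=" + live.get(0));
    }
}
```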
I tried modifying my usual ann-benchmark to randomly delete 25% of documents after force merging. As expected, it's slower, but not by a huge amount (sift-128-euclidean, M=16, efSearch=500).
If you don't have objections, I'd like to merge now and then work on some final benchmarks -- ideally we could make sure luceneutil covers ANN with deletions.
If we set numSeed = 10, this test fails sometimes because it may mark expected result docs (from 0 to 9) as deleted, so they don't end up being retrieved, resulting in low recall.
- set numSeed to 10 to ensure 10 results are returned
- add a startIndex parameter to createRandomAcceptOrds so that documents before startIndex are never deleted
- use a startIndex of 10 for createRandomAcceptOrds

Relates to #239
Hi @jtibshirani / @msokolov, I had a quick doubt about the deletion implementation. From the discussion and changes I understand that, for now, we have only marked the docs as deleted via the Bits value and ignored those documents while selecting our candidates.
Another unrelated doubt:
@harishankar-gopalan sorry for the slow response! Your overall understanding is right. In Lucene, deletions are handled by marking a document as deleted using a 'tombstone'. The index structures are not actually updated (this includes the HNSW graph). In response to your questions...
For segment merges, we combine the vectors across all segments and build a new graph from scratch. We make sure to skip over deleted documents during this merge, so they have no effect on the time it takes to build the graph. This merge process is quite expensive and we're brainstorming ways of making it faster (#11354).
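The merge behavior described above can be sketched in a few lines. This is an illustrative simplification, not Lucene's merge code: vectors from each segment are concatenated, deleted ordinals are skipped, and the survivors receive fresh ordinals before a new graph would be built over them from scratch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MergeSkipDeletedSketch {
    interface Bits { boolean get(int index); }

    /** Concatenate per-segment vectors, skipping deleted ordinals. */
    static List<float[]> mergeLiveVectors(List<float[][]> segments, List<Bits> liveDocs) {
        List<float[]> merged = new ArrayList<>();
        for (int s = 0; s < segments.size(); s++) {
            float[][] vecs = segments.get(s);
            Bits live = liveDocs.get(s);
            for (int ord = 0; ord < vecs.length; ord++) {
                if (live == null || live.get(ord)) {
                    merged.add(vecs[ord]); // new ordinal = merged.size() - 1
                }
            }
        }
        return merged; // a new HNSW graph would be built over only these vectors
    }

    public static void main(String[] args) {
        List<float[][]> segs = List.of(
            new float[][] {{1f}, {2f}},  // segment 0
            new float[][] {{3f}, {4f}}); // segment 1
        // segment 0: only ord 0 live; segment 1: no deletions (null)
        List<Bits> live = Arrays.asList(ord -> ord == 0, null);
        List<float[]> merged = mergeLiveVectors(segs, live);
        if (merged.size() != 3 || merged.get(0)[0] != 1f) throw new AssertionError();
        System.out.println("merged=" + merged.size());
    }
}
```

Because deleted vectors never enter the merged set, they add nothing to the cost of building the new graph, which is the point made above.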
This PR extends LeafReader#searchNearestVectors to take a parameter specifying
the live docs. The method then always returns the k nearest undeleted docs.
To implement this, the HNSW algorithm will only add a candidate to the result
set if it is a live doc. The graph search still visits and traverses deleted
docs as it gathers candidates.