
Reuse BitSet when there are deleted documents in the index instead of creating new BitSet #12857

Open · wants to merge 1 commit into base: main

Conversation

@Pulkitg64 (Contributor) commented Nov 30, 2023

Description

Fixes issue: #12414

Before this change, whenever there were deletions in the index we created a new BitSet from the matched docs and the live docs. Building that BitSet requires iterating over all matched docs, which is a time-consuming, linear-time operation. This is not necessary when the iterator is already a BitSetIterator.

With this change, the matching docs and live docs are wrapped in a single Bits instance that can be passed directly to the HNSW search. During HNSW search, when a node is explored, we check whether the doc is accepted by consulting the bits of both matchedDocs and liveDocs, which is a constant-time operation. The cost of the new acceptDocs (int cost = acceptDocs.length()) is not exactly accurate, but it gives an upper bound, since it does not take the live-docs count into account.
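For reference, here is a simplified sketch of the pre-change flow described above, loosely modeled on the createBitSet logic in Lucene's AbstractKnnVectorQuery (simplified, not the exact source): when there are no deletions the filter's BitSet is reused directly, otherwise every matching live doc is re-collected into a fresh BitSet, which is the linear-time step this PR tries to avoid.

```java
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.FilteredDocIdSetIterator;
import org.apache.lucene.util.BitSet;
import org.apache.lucene.util.BitSetIterator;
import org.apache.lucene.util.Bits;

final class PreChangeFilterSketch {
  // Pre-change behaviour: reuse the filter's BitSet only when there are no
  // deletions; otherwise walk every matching doc and re-collect the live ones
  // into a new BitSet.
  static BitSet createAcceptedDocs(DocIdSetIterator iterator, Bits liveDocs, int maxDoc)
      throws IOException {
    if (liveDocs == null && iterator instanceof BitSetIterator) {
      // No deletions: the filter's BitSet can be passed along as-is.
      return ((BitSetIterator) iterator).getBitSet();
    }
    // Deletions present: iterate all matching docs, keeping only live ones.
    FilteredDocIdSetIterator liveOnly =
        new FilteredDocIdSetIterator(iterator) {
          @Override
          protected boolean match(int doc) {
            return liveDocs == null || liveDocs.get(doc);
          }
        };
    return BitSet.of(liveOnly, maxDoc); // linear in the number of matching docs
  }
}
```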

@shubhamvishu (Contributor) left a comment

Thanks for working on this @Pulkitg64. I went through the change, but I didn't understand how we are not reusing the bitset in the current approach. We do wrap the BitSetIterator with a FilteredDocIdSetIterator when there are deleted docs, which would eventually use the bitset to advance the inner iterator (see this).

Instead, in this PR we are unnecessarily wrapping the BitSetIterator with a FilteredDocIdSetIterator even when there are no deletions, which is just added overhead and is avoided in the current code. Additionally, I find the current approach (separate function) cleaner too. WDYT?

Comment on lines +126 to +137
Bits acceptDocs =
new Bits() {
@Override
public boolean get(int index) {
return liveDocs != null ? liveDocs.get(index) & bitSet.get(index) : bitSet.get(index);
}

@Override
public int length() {
return bitSet.cardinality();
}
};
Contributor:

Why are we creating this when we could directly do int cost = bitSet.cardinality();

Contributor (author):

Here we are implementing the Bits interface, which is why we need to override both functions.

@Pulkitg64 (Contributor, author)

Thanks @shubhamvishu for taking a look.

> I went through the change, but I didn't understand how we are not reusing the bitset in the current approach. We do wrap the BitSetIterator with a FilteredDocIdSetIterator when there are deleted docs, which would eventually use the bitset to advance the inner iterator (see this).

Sorry! I think I should have used a different title for this PR. The part of the current approach I am trying to optimize is the case where the iterator is a BitSetIterator and the live docs are not null. In the current approach we create a new BitSet that takes live docs into consideration, but this BitSet creation is a linear-time process, because building the bitset requires iterating over all matched docs. That creation is not required: we can wrap both the matched-docs bitset and the live-docs bitset in a single Bits instance, which is then used directly during approximate search. So instead of creating a new BitSet, we compute at runtime whether a document is valid for searching, which saves the time spent creating the new BitSet.

@kaivalnp (Contributor) left a comment

Interesting! So instead of greedily collecting all matching + live docs into a BitSet, we're saving on the filter collection step at the cost of running #approximateSearch with an upper bound of visitLimit.

Can you run some benchmarks for different filters to measure this tradeoff?

Comment on lines +133 to +136
@Override
public int length() {
return bitSet.cardinality();
}
Contributor:

length() is more like the maximum doc you can request, so this should be bitSet.length()?

new Bits() {
@Override
public boolean get(int index) {
return liveDocs != null ? liveDocs.get(index) & bitSet.get(index) : bitSet.get(index);
Contributor:

nit: Can we make this cleaner by returning (liveDocs == null || liveDocs.get(index)) && bitSet.get(index)?

? ((BitSetIterator) iterator).getBitSet()
: BitSet.of(iterator, maxDoc);
Bits acceptDocs =
new Bits() {
Contributor:

Perhaps we can wrap the BitSet in a new Bits only for the case we're trying to optimize (when iterator instanceof BitSetIterator) -- not changing the BitSet.of flow like @shubhamvishu also mentioned?

}
};

int cost = acceptDocs.length();
Contributor:

The cost here determines what limit to set for #approximateSearch

If we use acceptDocs.length(), this will be equal to maxDoc (and we will always complete graph search without falling back to exact search, even when we want to...)

Perhaps this should be acceptDocs.cardinality()?

@shubhamvishu (Contributor)

@kaivalnp We could use acceptDocs.cardinality() when it's a BitSetIterator to get the upper bound, which might still include some deleted docs, but that could still sometimes change the decision of whether to go for exact search or not. We don't know how many of those docs are live; we only know the number of deletes in the segment (not the intersection of the two). One thing that might be tried is a heuristic that adds a penalty to the cost based on the number of deletes in the segment (i.e. ctx.reader().numDeletedDocs() / ctx.reader().maxDoc()). For example, if there are 10% deletes, we could decrease the cost by 10% or maybe 5%. This might help in cases where we miss falling back to exact search, though it would need thorough benchmarking to see what works best.

On a separate note, I'm wondering whether there is some use case where we don't need to know this cost upfront and could, for instance, go directly to approximate search. Currently this optimization only kicks in when the iterator is a BitSetIterator, but if it's possible to skip this cost step, or to get the cost from some other heuristic/approximation, then we could make it completely lazily evaluated using DISI#advance(docid) for those use cases. @msokolov @benwtrent maybe you could share your thoughts on this?
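As a rough illustration of the delete-ratio heuristic suggested above, here is a minimal sketch, assuming the filter's matches are available as a BitSet and ctx is the segment's LeafReaderContext (the class and method names are illustrative, not from the PR):

```java
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.util.BitSet;

final class CostHeuristicSketch {
  // Approximate the number of matching live docs by scaling the filter's match
  // count by the segment's live-doc ratio; purely a heuristic, not an exact count.
  static int adjustedCost(BitSet matchingDocs, LeafReaderContext ctx) {
    double liveRatio =
        1.0 - (double) ctx.reader().numDeletedDocs() / ctx.reader().maxDoc();
    return (int) (matchingDocs.cardinality() * liveRatio);
  }
}
```

Whether a flat penalty like this avoids enough missed exact-search fallbacks would need the benchmarking mentioned above.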

@benwtrent (Member)

Is our goal memory usage or speed?

We could use FixedBitSet#intersectionCount and keep from having to create a new bit set that is the intersection.

I am honestly not sure the implementation here is any faster than just creating the bit set upfront and checking it. During search, you now have to check two bitsets instead of one.

If the filter happens to be < number of docs visited in a typical search, your implementation here seems like it would be slower.
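A minimal sketch of the FixedBitSet#intersectionCount idea mentioned above, assuming both the filter result and the live docs are available as FixedBitSets of the same length (the variable and class names are hypothetical):

```java
import org.apache.lucene.util.FixedBitSet;

final class IntersectionCountSketch {
  // Count how many filter matches are still live without materializing the
  // intersected bit set itself.
  static long liveMatchCount(FixedBitSet matchingDocs, FixedBitSet liveDocBits) {
    return FixedBitSet.intersectionCount(matchingDocs, liveDocBits);
  }
}
```

This would give an exact count of live matches for the cost estimate without allocating a new bit set, though it still does a pass over the underlying words of both sets.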

@benwtrent (Member)

Broad feedback: any "optimizations" without benchmarking aren't optimizations, they are just guesses.

I am curious to see if this helps CPU usage in any way. I could see it helping memory usage.

github-actions bot commented Jan 8, 2024

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

github-actions bot added the Stale label Jan 8, 2024
@mikemccand (Member)

I don't fully understand this change, but it looks like it is stalled on proving it shows lower CPU and/or heap/GC load?

Could we benchmark this change using luceneutil? It's able to create vector indices that have X% deletions and then run KNNByte/FloatVectorQuery...

github-actions bot removed the Stale label May 11, 2024
@Pulkitg64 (Contributor, author)

Thanks @mikemccand for the pointers. Will try to run benchmarks on this change.

github-actions bot commented

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

github-actions bot added the Stale label May 28, 2024