
Reuse BitSet when there are deleted documents in the index instead of creating new BitSet #12857

Open · wants to merge 1 commit into base: main

Conversation

@Pulkitg64 (Contributor) commented Nov 30, 2023

Description

Fixes issue: #12414

Before this change, whenever there were deletions in the index we created a new BitSet from the matched docs and the live docs. Building that BitSet requires iterating over all matched docs, which is a time-consuming, linear-time operation. This is not necessary when the iterator is already a BitSetIterator.

With this change, the matching docs and live docs are wrapped in a single Bits instance that can be passed directly to the HNSW search. During HNSW search, when a node is explored, we check whether the doc is accepted by consulting the bits of both matchedDocs and liveDocs, which is a constant-time operation. The cost of the new acceptDocs (int cost = acceptDocs.length()) is not exactly accurate, but it gives an upper bound, since it does not take the live-docs count into account.
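For reference, here is a simplified sketch of the pre-change flow described above, loosely modeled on the createBitSet logic in Lucene's AbstractKnnVectorQuery (simplified, not the exact source): when there are no deletions the filter's BitSet is reused directly, otherwise every matching live doc is re-collected into a fresh BitSet, which is the linear-time step this PR tries to avoid.

```java
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.FilteredDocIdSetIterator;
import org.apache.lucene.util.BitSet;
import org.apache.lucene.util.BitSetIterator;
import org.apache.lucene.util.Bits;

final class PreChangeFilterSketch {
  // Pre-change behaviour: reuse the filter's BitSet only when there are no
  // deletions; otherwise walk every matching doc and re-collect the live ones
  // into a new BitSet.
  static BitSet createAcceptedDocs(DocIdSetIterator iterator, Bits liveDocs, int maxDoc)
      throws IOException {
    if (liveDocs == null && iterator instanceof BitSetIterator) {
      // No deletions: the filter's BitSet can be passed along as-is.
      return ((BitSetIterator) iterator).getBitSet();
    }
    // Deletions present: iterate all matching docs, keeping only live ones.
    FilteredDocIdSetIterator liveOnly =
        new FilteredDocIdSetIterator(iterator) {
          @Override
          protected boolean match(int doc) {
            return liveDocs == null || liveDocs.get(doc);
          }
        };
    return BitSet.of(liveOnly, maxDoc); // linear in the number of matching docs
  }
}
```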

@shubhamvishu (Contributor) left a comment

Thanks for working on this @Pulkitg64. I went through the change, but I didn't understand how we are not reusing the bitset in the current approach. We do wrap the BitSetIterator with a FilteredDocIdSetIterator when there are deleted docs, which would eventually use the bitset to advance the inner iterator (see this).

Instead, in this PR we are unnecessarily wrapping the BitSetIterator with a FilteredDocIdSetIterator even when there are no deletions, which is just added overhead and is avoided in the current code. Additionally, I find the current approach (separate function) cleaner too. WDYT?

Comment on lines +126 to +137
Bits acceptDocs =
new Bits() {
@Override
public boolean get(int index) {
return liveDocs != null ? liveDocs.get(index) & bitSet.get(index) : bitSet.get(index);
}

@Override
public int length() {
return bitSet.cardinality();
}
};
Contributor:

Why are we creating this when we could directly do int cost = bitSet.cardinality();

Contributor (author):

Here we are implementing the Bits interface, which is why we need to override both functions.

@Pulkitg64 (Contributor, author)

Thanks @shubhamvishu for taking a look.

> I went through the change, but I didn't understand how we are not reusing the bitset in the current approach. We do wrap the BitSetIterator with a FilteredDocIdSetIterator when there are deleted docs, which would eventually use the bitset to advance the inner iterator (see this).

Sorry! I think I should have used a different title for this PR. The part of the current approach I am trying to optimize is the case where the iterator is a BitSetIterator and the live docs are not null. In the current approach we create a new BitSet that takes live docs into consideration, but this BitSet creation is a linear-time process, because building the bitset requires iterating over all matched docs. That creation is not required: we can wrap both the matched-docs bitset and the live-docs bitset in a single Bits instance, which is then used directly during approximate search. So instead of creating a new BitSet, we compute at runtime whether a document is valid for searching, which saves the time spent creating the new BitSet.

@kaivalnp (Contributor) left a comment

Interesting! So instead of greedily collecting all matching + live docs into a BitSet, we're saving on the filter collection step at the cost of running #approximateSearch with an upper bound of visitLimit.

Can you run some benchmarks for different filters to measure this tradeoff?

Comment on lines +133 to +136
@Override
public int length() {
return bitSet.cardinality();
}
Contributor:

length() is more like the maximum doc you can request, so this should be bitSet.length()?

new Bits() {
@Override
public boolean get(int index) {
return liveDocs != null ? liveDocs.get(index) & bitSet.get(index) : bitSet.get(index);
Contributor:

nit: Can we make this cleaner by returning (liveDocs == null || liveDocs.get(index)) && bitSet.get(index)?

? ((BitSetIterator) iterator).getBitSet()
: BitSet.of(iterator, maxDoc);
Bits acceptDocs =
new Bits() {
Contributor:

Perhaps we can wrap the BitSet in a new Bits only for the case we're trying to optimize (when iterator instanceof BitSetIterator) -- not changing the BitSet.of flow like @shubhamvishu also mentioned?

}
};

int cost = acceptDocs.length();
Contributor:

The cost here determines what limit to set for #approximateSearch

If we use acceptDocs.length(), this will be equal to maxDoc (and we will always complete graph search without falling back to exact search, even when we want to...)

Perhaps this should be acceptDocs.cardinality()?

@shubhamvishu (Contributor)

@kaivalnp We could use acceptDocs.cardinality() when it's a BitSetIterator to get the upper bound, which might still include some deleted docs, but that could still sometimes change the decision of whether to go for exact search or not. We don't know how many of those docs are live; we only know the number of deletes in the segment (not the intersection of the two). One thing that might be tried is a heuristic that adds a penalty to the cost based on the number of deletes in the segment (i.e. ctx.reader().numDeletedDocs() / ctx.reader().maxDoc()). For example, if there are 10% deletes, we could decrease the cost by 10% or maybe 5%. This might help in cases where we miss falling back to exact search, though it would need thorough benchmarking to see what works best.

On a separate note, I'm wondering whether there is some use case where we don't need to know this cost upfront and could, for instance, go directly to approximate search. Currently this optimization only kicks in when the iterator is a BitSetIterator, but if it's possible to skip this cost step, or to get the cost from some other heuristic/approximation, then we could make it completely lazily evaluated using DISI#advance(docid) for those use cases. @msokolov @benwtrent maybe you could share your thoughts on this?
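As a rough illustration of the delete-ratio heuristic suggested above, here is a minimal sketch, assuming the filter's matches are available as a BitSet and ctx is the segment's LeafReaderContext (the class and method names are illustrative, not from the PR):

```java
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.util.BitSet;

final class CostHeuristicSketch {
  // Approximate the number of matching live docs by scaling the filter's match
  // count by the segment's live-doc ratio; purely a heuristic, not an exact count.
  static int adjustedCost(BitSet matchingDocs, LeafReaderContext ctx) {
    double liveRatio =
        1.0 - (double) ctx.reader().numDeletedDocs() / ctx.reader().maxDoc();
    return (int) (matchingDocs.cardinality() * liveRatio);
  }
}
```

Whether a flat penalty like this avoids enough missed exact-search fallbacks would need the benchmarking mentioned above.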

@benwtrent (Member)

Is our goal memory usage or speed?

We could use FixedBitSet#intersectionCount and keep from having to create a new bit set that is the intersection.

I am honestly not sure the implementation here is any faster than just creating the bit set upfront and checking it. During search, you now have to check two bitsets instead of one.

If the filter happens to be < number of docs visited in a typical search, your implementation here seems like it would be slower.
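A minimal sketch of the FixedBitSet#intersectionCount idea mentioned above, assuming both the filter result and the live docs are available as FixedBitSets of the same length (the variable and class names are hypothetical):

```java
import org.apache.lucene.util.FixedBitSet;

final class IntersectionCountSketch {
  // Count how many filter matches are still live without materializing the
  // intersected bit set itself.
  static long liveMatchCount(FixedBitSet matchingDocs, FixedBitSet liveDocBits) {
    return FixedBitSet.intersectionCount(matchingDocs, liveDocBits);
  }
}
```

This would give an exact count of live matches for the cost estimate without allocating a new bit set, though it still does a pass over the underlying words of both sets.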

@benwtrent (Member)

Broad feedback: any "optimizations" without benchmarking aren't optimizations, they are just guesses.

I am curious to see if this helps CPU usage in any way. I could see it helping memory usage.

github-actions bot commented Jan 8, 2024

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

github-actions bot added the Stale label Jan 8, 2024
@mikemccand (Member)

I don't fully understand this change, but it looks like it is stalled on proving it shows lower CPU and/or heap/GC load?

Could we benchmark this change using luceneutil? It's able to create vector indices that have X% deletions and then run KNNByte/FloatVectorQuery...

github-actions bot removed the Stale label May 11, 2024
@Pulkitg64 (Contributor, author)

Thanks @mikemccand for the pointers. Will try to run benchmarks on this change.

github-actions bot commented

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

github-actions bot added the Stale label May 28, 2024