Description
Background
We are currently working on a feature in OpenSearch to support context-aware segments. It maintains multiple IndexWriter instances within a shard, one per group, so that related data is collocated in the same segment or group of segments. The design is detailed in the following RFCs and LLD:
Current Use Case
With context-aware segments, writes within a shard are routed to the respective group-specific IndexWriter instances. To keep versioning consistent across writers during update operations, we hard-delete the previous document version in the parent (accumulating) IndexWriter whenever a new version is added to a group-specific writer.
Problem Description
Today, with only soft deletes enabled, OpenSearch's DocRep recovery uses SegmentReader.hardLiveDocs to query live docs from segments that contain hard deletes (which may have been introduced when the IndexWriter hit non-aborting exceptions). The number of live docs is derived efficiently as:
segmentReader.maxDoc() - segmentReader.getSegmentInfo().getDelCount()
However, once both soft and hard deletes are performed on a context-aware Lucene index, this calculation breaks down, because segmentReader.getSegmentInfo().getDelCount() no longer provides an accurate count of deleted documents in a segment. Based on Lucene's unit tests for mixed deletes, the only reliable way to get the live doc count is to iterate through hardLiveDocs and count the set bits.
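To make the two approaches concrete, here is a minimal sketch of the constant-time subtraction next to the linear fallback scan. It is illustrative only: the nested `Bits` interface is a local stand-in that mirrors the `get(int)` and `length()` methods of Lucene's `org.apache.lucene.util.Bits`, and `LiveDocCounter`, `countLive`, `countLiveFast`, and `fromBitSet` are hypothetical names, not Lucene or OpenSearch APIs. It also assumes that a null `hardLiveDocs` means the segment has no hard deletes.

```java
import java.util.BitSet;

final class LiveDocCounter {

    /** Local stand-in mirroring get(int)/length() of org.apache.lucene.util.Bits. */
    interface Bits {
        boolean get(int index);
        int length();
    }

    /** O(1) shortcut: only correct when delCount counts exactly the cleared bits. */
    static int countLiveFast(int maxDoc, int delCount) {
        return maxDoc - delCount;
    }

    /** O(maxDoc) fallback: visit every doc ID and count the bits that are still set. */
    static int countLive(Bits hardLiveDocs, int maxDoc) {
        if (hardLiveDocs == null) {
            return maxDoc; // assumption: null means no hard deletes in this segment
        }
        int live = 0;
        for (int docId = 0; docId < hardLiveDocs.length(); docId++) {
            if (hardLiveDocs.get(docId)) {
                live++;
            }
        }
        return live;
    }

    /** Hypothetical helper: wraps a java.util.BitSet so the sketch runs without Lucene. */
    static Bits fromBitSet(BitSet bits, int length) {
        return new Bits() {
            @Override public boolean get(int index) { return bits.get(index); }
            @Override public int length() { return length; }
        };
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet(8);
        bits.set(0, 8);   // 8 docs, all live
        bits.clear(2);    // hard-delete doc 2
        bits.clear(5);    // hard-delete doc 5
        Bits hardLiveDocs = fromBitSet(bits, 8);

        System.out.println("scan: " + countLive(hardLiveDocs, 8));
        // If the recorded delCount were stale (say 1), the shortcut would disagree with the scan.
        System.out.println("fast: " + countLiveFast(8, 1));
    }
}
```

The scan touches every doc ID in the segment, so its cost grows with maxDoc rather than with the number of deletes, which is the source of the recovery-time concern.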
Performance Impact
This iterative count is linear in the segment's maxDoc, so it is computationally expensive for large segments and can potentially cause significant performance regressions during shard recovery.
Ask from this issue
Is there a more optimized, direct way to retrieve the count of live documents from a SegmentReader's hardLiveDocs when a segment has undergone both hard and soft deletes?