
Efficient way to calculate hardLiveDocs count of a SegmentReader when both hard and soft deletes are present #15352

@RS146BIJAY

Background

We are working on a feature in OpenSearch, context aware segments, which maintains multiple IndexWriter instances within a shard, one per group, so that related data is colocated in the same segment or group of segments. The design is detailed in the following RFCs and LLD:

Current Use Case

With context aware segments, writes within a shard are routed to the respective group-specific IndexWriter instances. To keep versioning consistent across writers during an update, we hard delete the previous document version in the parent (accumulating) IndexWriter whenever a new version is added to a group-specific writer.

Problem Description

Today, with only soft deletes enabled, OpenSearch's DocRep recovery uses SegmentReader.hardLiveDocs to query live docs from segments that contain hard deletes (which can be introduced when an IndexWriter hits a non-aborting exception). The number of live docs is derived efficiently as:

segmentReader.maxDoc() - segmentReader.getSegmentInfo().getDelCount()

However, when both soft and hard deletes are performed on a Lucene index with context aware segments enabled, the above calculation breaks down, because segmentReader.getSegmentInfo().getDelCount() no longer provides an accurate delete count for the segment. Based on Lucene's unit tests for mixed deletes, the only reliable way to get the live doc count is to iterate through hardLiveDocs and count the set bits.
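
For reference, the iterative fallback described above looks roughly like the sketch below. To keep it self-contained it does not depend on Lucene: the nested Bits interface is a local stand-in that mirrors the get(int) and length() methods of org.apache.lucene.util.Bits, and the class name HardLiveDocsCount plus the helpers countLive and of are hypothetical names used only for illustration. It also assumes that a null hardLiveDocs means the segment has no hard deletes.

```java
import java.util.BitSet;

public final class HardLiveDocsCount {

    /** Local stand-in mirroring get(int) and length() of Lucene's Bits (assumption for this sketch). */
    interface Bits {
        boolean get(int index);
        int length();
    }

    /** Counts live docs by visiting every bit, so the cost grows linearly with maxDoc. */
    static int countLive(Bits hardLiveDocs, int maxDoc) {
        if (hardLiveDocs == null) {
            // Assumption: a null bitset means no hard deletes, so every doc is live.
            return maxDoc;
        }
        int live = 0;
        for (int i = 0; i < hardLiveDocs.length(); i++) {
            if (hardLiveDocs.get(i)) {
                live++;
            }
        }
        return live;
    }

    /** Hypothetical helper: exposes a java.util.BitSet through the stand-in interface. */
    static Bits of(BitSet bits, int maxDoc) {
        return new Bits() {
            @Override
            public boolean get(int index) {
                return bits.get(index);
            }

            @Override
            public int length() {
                return maxDoc;
            }
        };
    }

    public static void main(String[] args) {
        int maxDoc = 8;
        BitSet live = new BitSet(maxDoc);
        live.set(0, maxDoc); // all 8 docs start live
        live.clear(2);       // hard delete doc 2
        live.clear(5);       // hard delete doc 5
        System.out.println(countLive(of(live, maxDoc), maxDoc)); // prints 6
        System.out.println(countLive(null, maxDoc));             // prints 8
    }
}
```

In the real code path the argument would be the Bits returned for the segment's hard live docs; the loop itself is what we would like to avoid during recovery.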

Performance Impact

This iterative count visits every document in the segment, so it is expensive for large segments and can cause significant performance regressions during shard recovery.

Ask from this issue

Is there a more optimized, direct way to retrieve the count of live documents from a SegmentReader's hardLiveDocs when a segment has undergone both hard and soft deletes?
