
High heap usage due to snapshot post-deletion cleanup #108278

Open
DaveCTurner opened this issue May 4, 2024 · 2 comments
Labels
>bug · :Distributed/Snapshot/Restore (Anything directly related to the `_snapshot/*` APIs) · Team:Distributed (Meta label for distributed team)

Comments

@DaveCTurner
Contributor

When deleting a snapshot we accumulate in memory a list of all the blobs that can be deleted after the repository update is committed. Each blob name takes only ~80B of heap, but there can be very many blobs (the number is theoretically unbounded). I've seen ~100M blobs pending deletion in practice, which adds up to several GiBs of heap in total. We should find a way to track this work with bounded heap usage.
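As a quick sanity check of the figures above (the 80 B/name and 100M-name values are taken from the report), a minimal sketch of the heap arithmetic:

```java
// Back-of-envelope heap estimate for the pending-deletes list.
// The inputs (~80 bytes of heap per blob name, ~100M names) come from the
// issue report; this just multiplies them out into GiB.
class HeapEstimate {
    static double gib(long blobCount, long bytesPerName) {
        return (double) blobCount * bytesPerName / (1L << 30);
    }

    public static void main(String[] args) {
        // ~7.45 GiB for 100M names at 80 B each
        System.out.println(gib(100_000_000L, 80L));
    }
}
```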

@DaveCTurner DaveCTurner added >bug :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs labels May 4, 2024
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Meta label for distributed team label May 4, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@DaveCTurner
Contributor Author

As a quick improvement, I think we could accumulate the blob names in memory using a (compressed) BytesStreamOutput rather than each one being a separate String object. Each name should have ~17 bytes of entropy (16B for the UUID plus a little overhead) so that's a ~4.7× memory saving right away vs the 80-bytes-per-name we have at the moment.
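A minimal sketch of this first idea, using the JDK's `Deflater`/`Inflater` streams over a `ByteArrayOutputStream` as a stand-in for a compressed `BytesStreamOutput` (the class name, the newline delimiter, and the API shape here are illustrative assumptions, not Elasticsearch's actual implementation):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Accumulate blob names into a single compressed byte buffer instead of
// holding one String object (with its ~80B of heap overhead) per name.
class CompressedNameBuffer {
    private final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    private final DeflaterOutputStream deflater = new DeflaterOutputStream(bytes);

    void add(String blobName) throws IOException {
        // Newline-delimited UTF-8; assumes blob names never contain '\n'.
        deflater.write((blobName + "\n").getBytes(StandardCharsets.UTF_8));
    }

    long compressedSize() throws IOException {
        deflater.finish(); // idempotent once the stream is finished
        return bytes.size();
    }

    List<String> readBack() throws IOException {
        deflater.finish();
        List<String> names = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new InflaterInputStream(new ByteArrayInputStream(bytes.toByteArray())),
                StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                names.add(line);
            }
        }
        return names;
    }
}
```

Since the names are mostly hex UUIDs (~4 bits of entropy per character) plus heavily repeated path prefixes, deflate should get close to the ~17B-per-name entropy floor mentioned above.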

As a slightly-less-quick (but still fairly quick) improvement that achieves O(1) memory usage: whenever such a BytesStreamOutput gets large enough we could spill its contents out to a blob in the blob store and drop it from memory, then read it back in later on after the new RepositoryData is committed and we're processing those deletes. That introduces some complexity around cleaning up those blobs after a master failover, but it seems surmountable.
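The spill-over idea could be sketched as below, with a temp file standing in for a blob written to the blob store (the class name `SpillingNameList`, the threshold parameter, and the file format are all hypothetical; real code would also need the master-failover cleanup discussed above):

```java
import java.io.BufferedReader;
import java.io.Closeable;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Names accumulate in a bounded in-memory buffer; once it exceeds the
// threshold, the buffer is written out as a "blob" (temp file here) and
// dropped from heap, giving O(1) heap usage. After the new RepositoryData
// is committed, forEach streams every name back without ever holding the
// full list in memory.
class SpillingNameList implements Closeable {
    private final int spillThresholdBytes;
    private final List<Path> spilledBlobs = new ArrayList<>();
    private StringBuilder buffer = new StringBuilder();

    SpillingNameList(int spillThresholdBytes) {
        this.spillThresholdBytes = spillThresholdBytes;
    }

    void add(String blobName) throws IOException {
        buffer.append(blobName).append('\n');
        if (buffer.length() >= spillThresholdBytes) {
            spill();
        }
    }

    private void spill() throws IOException {
        Path blob = Files.createTempFile("pending-deletes-", ".tmp");
        Files.writeString(blob, buffer.toString(), StandardCharsets.UTF_8);
        spilledBlobs.add(blob);
        buffer = new StringBuilder(); // drop spilled contents from heap
    }

    void forEach(Consumer<String> action) throws IOException {
        if (buffer.length() > 0) {
            spill();
        }
        for (Path blob : spilledBlobs) {
            try (BufferedReader r = Files.newBufferedReader(blob, StandardCharsets.UTF_8)) {
                String line;
                while ((line = r.readLine()) != null) {
                    action.accept(line);
                }
            }
        }
    }

    @Override
    public void close() throws IOException {
        for (Path blob : spilledBlobs) {
            Files.deleteIfExists(blob);
        }
    }
}
```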

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue May 14, 2024
Encapsulates this component of the snapshot deletion process so we can
follow up with some optimizations in isolation.

Relates elastic#108278
elasticsearchmachine pushed a commit that referenced this issue May 14, 2024
Encapsulates this component of the snapshot deletion process so we can
follow up with some optimizations in isolation.

Relates #108278