
High heap usage due to snapshot post-deletion cleanup #108278

Open
DaveCTurner opened this issue May 4, 2024 · 2 comments
Labels
>bug · :Distributed/Snapshot/Restore (Anything directly related to the `_snapshot/*` APIs) · Team:Distributed (Meta label for distributed team)

Comments

@DaveCTurner
Contributor

When deleting a snapshot we accumulate in memory a list of all the blobs that can be deleted after the repository update is committed. Each blob name takes only ~80B of heap, but there can be very many blobs (the number is theoretically unbounded). I've seen ~100M blobs pending deletion in practice, which adds up to several GiBs of heap in total. We should find a way to track this work with bounded heap usage.
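As a quick sanity check of the figures above (the 80 B/name and 100M-name values are taken from the report), a minimal sketch of the heap arithmetic:

```java
// Back-of-envelope heap estimate for the pending-deletes list.
// The inputs (~80 bytes of heap per blob name, ~100M names) come from the
// issue report; this just multiplies them out into GiB.
class HeapEstimate {
    static double gib(long blobCount, long bytesPerName) {
        return (double) blobCount * bytesPerName / (1L << 30);
    }

    public static void main(String[] args) {
        // ~7.45 GiB for 100M names at 80 B each
        System.out.println(gib(100_000_000L, 80L));
    }
}
```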

@DaveCTurner DaveCTurner added >bug :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs labels May 4, 2024
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Meta label for distributed team label May 4, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@DaveCTurner
Contributor Author

As a quick improvement, I think we could accumulate the blob names in memory using a (compressed) BytesStreamOutput rather than each one being a separate String object. Each name should have ~17 bytes of entropy (16B for the UUID plus a little overhead) so that's a ~4.7× memory saving right away vs the 80-bytes-per-name we have at the moment.
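A minimal sketch of this first idea, using the JDK's `Deflater`/`Inflater` streams over a `ByteArrayOutputStream` as a stand-in for a compressed `BytesStreamOutput` (the class name, the newline delimiter, and the API shape here are illustrative assumptions, not Elasticsearch's actual implementation):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Accumulate blob names into a single compressed byte buffer instead of
// holding one String object (with its ~80B of heap overhead) per name.
class CompressedNameBuffer {
    private final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    private final DeflaterOutputStream deflater = new DeflaterOutputStream(bytes);

    void add(String blobName) throws IOException {
        // Newline-delimited UTF-8; assumes blob names never contain '\n'.
        deflater.write((blobName + "\n").getBytes(StandardCharsets.UTF_8));
    }

    long compressedSize() throws IOException {
        deflater.finish(); // idempotent once the stream is finished
        return bytes.size();
    }

    List<String> readBack() throws IOException {
        deflater.finish();
        List<String> names = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new InflaterInputStream(new ByteArrayInputStream(bytes.toByteArray())),
                StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                names.add(line);
            }
        }
        return names;
    }
}
```

Since the names are mostly hex UUIDs (~4 bits of entropy per character) plus heavily repeated path prefixes, deflate should get close to the ~17B-per-name entropy floor mentioned above.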

As a slightly-less-quick (but still fairly quick) improvement that achieves O(1) memory usage: whenever such a BytesStreamOutput gets large enough we could spill its contents out to a blob in the blob store and drop it from memory, then read it back in later on after the new RepositoryData is committed and we're processing those deletes. That introduces some complexity around cleaning up those blobs after a master failover, but it seems surmountable.
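The spill-over idea could be sketched as below, with a temp file standing in for a blob written to the blob store (the class name `SpillingNameList`, the threshold parameter, and the file format are all hypothetical; real code would also need the master-failover cleanup discussed above):

```java
import java.io.BufferedReader;
import java.io.Closeable;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Names accumulate in a bounded in-memory buffer; once it exceeds the
// threshold, the buffer is written out as a "blob" (temp file here) and
// dropped from heap, giving O(1) heap usage. After the new RepositoryData
// is committed, forEach streams every name back without ever holding the
// full list in memory.
class SpillingNameList implements Closeable {
    private final int spillThresholdBytes;
    private final List<Path> spilledBlobs = new ArrayList<>();
    private StringBuilder buffer = new StringBuilder();

    SpillingNameList(int spillThresholdBytes) {
        this.spillThresholdBytes = spillThresholdBytes;
    }

    void add(String blobName) throws IOException {
        buffer.append(blobName).append('\n');
        if (buffer.length() >= spillThresholdBytes) {
            spill();
        }
    }

    private void spill() throws IOException {
        Path blob = Files.createTempFile("pending-deletes-", ".tmp");
        Files.writeString(blob, buffer.toString(), StandardCharsets.UTF_8);
        spilledBlobs.add(blob);
        buffer = new StringBuilder(); // drop spilled contents from heap
    }

    void forEach(Consumer<String> action) throws IOException {
        if (buffer.length() > 0) {
            spill();
        }
        for (Path blob : spilledBlobs) {
            try (BufferedReader r = Files.newBufferedReader(blob, StandardCharsets.UTF_8)) {
                String line;
                while ((line = r.readLine()) != null) {
                    action.accept(line);
                }
            }
        }
    }

    @Override
    public void close() throws IOException {
        for (Path blob : spilledBlobs) {
            Files.deleteIfExists(blob);
        }
    }
}
```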

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue May 14, 2024
Encapsulates this component of the snapshot deletion process so we can
follow up with some optimizations in isolation.

Relates elastic#108278
elasticsearchmachine pushed a commit that referenced this issue May 14, 2024
Encapsulates this component of the snapshot deletion process so we can
follow up with some optimizations in isolation.

Relates #108278