os/bluestore: Force Onode cache to trim when quota changes #35171

Closed
wants to merge 1 commit

Conversation

@aclamk (Contributor) commented May 21, 2020

Solves the case where there is a fixed set of Onodes and all operations only touch existing ones.
Without this fix, there is no trigger to actually drop any cached Onodes.

The problem was detected during BlueStore compression tests, where all objects were created before being filled with data.
Initially there were 10000 objects, each taking ~100 bytes of Onode space.
When each object grew to 32MB, each used ~1.1MB of Onode (+ Extents + Blobs + SharedBlobs) metadata, for a total of ~12GB.

Each OnodeCacheShard could separately be triggered to trim() by adding a dummy RADOS object, but that only works for the affected shard.
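
For illustration, a minimal, self-contained sketch of the failure mode described above (hypothetical code, not BlueStore internals): trim() only runs on insert, so with a fixed object set whose entries grow in place, the quota is never enforced.

```cpp
#include <cstdint>
#include <iostream>
#include <list>

struct Entry { uint64_t bytes = 100; };   // starts at ~100 bytes of metadata

struct Cache {
  std::list<Entry> lru;
  uint64_t max_items = 0;

  void set_max(uint64_t m) { max_items = m; }  // before this PR: quota change alone did not trim

  void insert(const Entry& e) {
    lru.push_back(e);
    trim();                                    // trim only runs on insert
  }

  void trim() {
    while (lru.size() > max_items) lru.pop_front();
  }

  uint64_t bytes() const {
    uint64_t total = 0;
    for (const auto& e : lru) total += e.bytes;
    return total;
  }
};

int main() {
  Cache c;
  c.set_max(10000);
  for (int i = 0; i < 10000; ++i) c.insert(Entry{});  // fixed set of objects
  for (auto& e : c.lru) e.bytes = 1100000;             // metadata grows in place to ~1.1 MB
  // No new inserts ever happen, so trim() is never triggered again and the
  // cache keeps holding ~11 GB even if set_max() is later lowered.
  std::cout << (c.bytes() >> 20) << " MiB of cached metadata\n";
}
```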

Solves the case where there is a fixed set of Onodes and all operations only operate on existing ones.
Without this fix, there is no trigger to actually drop any cached Onodes.

Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
```diff
@@ -3926,6 +3926,7 @@ void BlueStore::MempoolThread::_resize_shards(bool interval_stats)

   for (auto i : store->onode_cache_shards) {
     i->set_max(max_shard_onodes);
+    i->trim();
   }
   for (auto i : store->buffer_cache_shards) {
     i->set_max(max_shard_buffer);
```
Member:

should we do the same for buffer_cache_shards?

Contributor Author (aclamk):

It is also missing for buffers.
Whether we should copy this solution there depends on whether a genuinely running cluster can exist that does not read/write, or whether some buffer shards can be dormant for some reason.

Member:

I don't think we need this for buffers unless the size of a given buffer can change behind the buffer cache's back (which it shouldn't). We exactly track how many bytes are in the cache and know how big a cached buffer is.

The onode case is special because caching an onode can result in a variable amount of required memory that can change behind the cache's back depending on the number of extents/blobs/etc. That's why the onode cache trims based on the number of items instead of the number of bytes, and we have the whole wonky periodic average metadata size calculation that calls set_max to change the maximum number of onodes to cache.
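
A hedged sketch of the kind of calculation described above (names and numbers are illustrative, not the actual BlueStore code): the cache budget is expressed in bytes, so a per-shard onode count has to be derived from a periodically sampled average bytes-per-cached-onode and then pushed to each shard via set_max().

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct OnodeShardSketch {
  uint64_t max_onodes = 0;
  void set_max(uint64_t m) { max_onodes = m; }
};

// Derive a per-shard onode limit from a byte budget and the sampled average
// cost (in bytes) of keeping one onode cached.
uint64_t per_shard_limit(uint64_t meta_budget_bytes,
                         uint64_t cached_meta_bytes,
                         uint64_t cached_onodes,
                         size_t   num_shards) {
  uint64_t avg = cached_onodes ? cached_meta_bytes / cached_onodes : 1;
  avg = std::max<uint64_t>(avg, 1);
  uint64_t total_onodes = meta_budget_bytes / avg;
  return std::max<uint64_t>(total_onodes / num_shards, 1);
}

int main() {
  std::vector<OnodeShardSketch> shards(8);
  // With ~1.1 MB of metadata per cached onode and a 4 GiB metadata budget,
  // the computed limit collapses to a few hundred onodes per shard; without
  // a trim on quota change, nothing enforces that smaller limit.
  uint64_t limit = per_shard_limit(4ULL << 30, 11ULL << 30, 10000, shards.size());
  for (auto& s : shards) s.set_max(limit);
}
```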

@markhpc (Member) commented May 21, 2020

Hi Adam,

Let me think about this a little bit. Can we trigger a trim on growth rather than in the loop here? One of the goals of the refactor last summer was to try to avoid temporary growth between MempoolThread cycles.

Edit: Also, 1.1MB of Onode metadata for a 32MB object is insane (to the point of being broken). We shouldn't have ~3% space overhead for extents/blobs/sharedblobs. Any idea how much of that 1.1MB was used for extents vs blobs vs sharedblobs?

@markhpc (Member) left a comment:

Ok, so thinking about this some more:

  1. imho one of the primary issues here is that the average metadata memory consumption resulting from caching an onode can change (apparently dramatically) behind the onode cache's back. Normally this change should be small and you'll hit a new trim cycle from a cache insert before it gets too bad, but Adam mentioned that he's seeing a case where we may cache 1MB+ of metadata for a 32MB object. That combined with no new cache inserts (because all onodes are already cached) leads to this divergent behavior.

  2. This PR is probably good enough to solve the issue nearly 100% of the time (the only cases I can think of are extremely rapid and extreme onode size growth, but normally that shouldn't happen). I would like to consider moving the trim() call into set_max() itself though (and possibly adding some logic to avoid doing the trim if max hasn't significantly diverged); a rough sketch of that idea follows this list. Then it's just a question of when we run set_max. Currently we do it in a loop in the mempool thread, but perhaps in the future we'll want to do it on object modification (or after a certain number of bytes have been written/deleted/etc). If we potentially have to trim after changing the max value, let's abstract that away.

  3. fixing 2) should more or less fix the memory growth issue, but if we have a huge divergence in the size of object metadata (some taking 100-1000s of bytes and others taking MBs) we're going to really start thrashing the onode cache. We'll trade extreme memory growth for terrible performance with potentially huge trims and lots of cache misses for onodes with little associated metadata (Better than excessive memory growth, but still not good). The memory growth was only a side effect imho of what appears to be the bigger problem (huge metadata memory consumption associated with some cached onodes).
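
A rough sketch of the idea in point 2 (hypothetical; the names and the 10% threshold are made up, this is not the actual patch): set_max() records the new quota and trims immediately only when the quota has dropped noticeably, so callers no longer need to remember to trim themselves.

```cpp
#include <cstdint>

struct OnodeCacheShardSketch {
  uint64_t max = 0;

  void set_max(uint64_t new_max) {
    // Only trim immediately if the quota shrank by more than ~10% (threshold
    // is arbitrary here); otherwise keep relying on the trim done on insert.
    bool shrank_significantly = new_max < max && (max - new_max) * 10 > max;
    max = new_max;
    if (shrank_significantly) {
      trim();
    }
  }

  void trim() {
    // Evict LRU entries until the cached item count is back under 'max';
    // eviction details are omitted in this sketch.
  }
};

int main() {
  OnodeCacheShardSketch shard;
  shard.set_max(10000);  // initial quota: no previous max, no trim
  shard.set_max(400);    // quota collapses (e.g. onodes grew): trims immediately
}
```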

Edit: One last thought: If we think Crimson is going to take a while and we do see that there is a legitimate big divergence in the amount of metadata per object depending on some criteria, we may want to rethink the caches a little bit to take the number of blobs/extents/sharedblobs into account.

That's my take!

```diff
@@ -3926,6 +3926,7 @@ void BlueStore::MempoolThread::_resize_shards(bool interval_stats)

   for (auto i : store->onode_cache_shards) {
     i->set_max(max_shard_onodes);
+    i->trim();
```
Member:

Let's move this into set_max itself (and possibly add logic to only do the trim if the max value has dropped by some threshold, but this could be a little tricky).

Contributor Author (aclamk):

@markhpc Do you think this should apply to all caches that inherit from CacheShard, or just to OnodeCacheShard?

@markhpc (Member) commented May 26, 2020

Potentially related tracker issue: https://tracker.ceph.com/issues/45706

stale bot commented Jul 29, 2020

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@stale stale bot added the stale label Jul 29, 2020
@tchaikov (Contributor):

@aclamk ping?

@stale stale bot removed the stale label Aug 17, 2020
stale bot commented Oct 17, 2020

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@stale stale bot added the stale label Oct 17, 2020
@tchaikov (Contributor):

@aclamk ping?

@djgalloway djgalloway changed the base branch from master to main July 9, 2022 00:00
@djgalloway djgalloway requested a review from a team as a code owner July 9, 2022 00:00
@github-actions github-actions bot removed the stale label Jul 15, 2022
github-actions bot commented:

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Sep 13, 2022
github-actions bot commented:

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@github-actions github-actions bot closed this Oct 28, 2022