os/bluestore: Force Onode cache to trim when quota changes #35171
Conversation
Solves the case where there is a fixed set of Onodes and all operations operate only on existing ones. Without this fix, there is no trigger to actually drop some cached Onodes. Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
```diff
@@ -3926,6 +3926,7 @@ void BlueStore::MempoolThread::_resize_shards(bool interval_stats)
     for (auto i : store->onode_cache_shards) {
       i->set_max(max_shard_onodes);
+      i->trim();
     }
     for (auto i : store->buffer_cache_shards) {
       i->set_max(max_shard_buffer);
```
should we do the same for buffer_cache_shards?
It is also missing for buffers.
Whether we should copy this solution there depends on whether a cluster can realistically be running without doing any reads or writes, or whether some buffer shards can be dormant for some reason.
I don't think we need this for buffers unless the size of a given buffer can change behind the buffer cache's back (which it shouldn't). We exactly track how many bytes are in the cache and know how big a cached buffer is.
The onode case is special because caching an onode can result in a variable amount of required memory that can change behind the cache's back depending on the number of extents/blobs/etc. That's why the onode cache trims based on the number of items instead of the number of bytes, and we have the whole wonky periodic average metadata size calculation that calls set_max to change the maximum number of onodes to cache.
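The mechanism described here can be sketched roughly as follows. This is an illustrative sketch only, not Ceph's actual code; the function name and signature are hypothetical. The point is that the onode cache's budget is an item count, so the mempool thread has to translate a byte target into an item count via the observed average metadata bytes per cached onode:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical sketch of the periodic budget calculation described above.
// The onode cache is limited by item count, so a byte target is converted
// into an item count using the observed average metadata footprint per
// cached onode (onode + extents + blobs + shared blobs).
uint64_t max_shard_onodes_for(uint64_t shard_mem_target,
                              uint64_t mapped_onode_bytes,
                              uint64_t cached_onodes)
{
  uint64_t avg = cached_onodes ? mapped_onode_bytes / cached_onodes : 1;
  return shard_mem_target / std::max<uint64_t>(avg, 1);
}
```

If the average grows after the last set_max() call and no new inserts happen, nothing recomputes or enforces the budget — which is exactly the gap this PR is trying to close.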
Hi Adam, let me think about this a little bit. Can we trigger a trim on growth rather than in the loop here? One of the goals of the refactor last summer was to try to avoid temporary growth between MempoolThread cycles.

Edit: Also, 1.1MB of Onode for a 32MB object is insane (to the point of being broken)! We shouldn't have ~3% space overhead for extents/blobs/sharedblobs. Any idea how much of that 1.1MB was used for extents vs blobs vs sharedblobs?
Ok, so thinking about this some more:

1. imho, one of the primary issues here is that the average metadata memory consumption resulting from caching an onode can change (apparently dramatically) behind the onode cache's back. Normally this change should be small and you'll hit a new trim cycle from a cache insert before it gets too bad, but Adam mentioned that he's seeing a case where we may cache 1MB+ of metadata for a 32MB object. That, combined with no new cache inserts (because all onodes are already cached), leads to this divergent behavior.
2. This PR is probably good enough to solve the issue nearly 100% of the time (the only failure cases I can think of involve extremely rapid and extreme onode size growth, but normally that shouldn't happen). I would like to consider moving the trim() call into set_max() itself though (and possibly adding some logic to avoid doing the trim if max hasn't significantly diverged). Then it's just a question of when we run set_max. Currently we do it in a loop in the mempool thread, but perhaps in the future we'll want to do it on object modification (or after a certain number of bytes have been written/deleted/etc). If we may potentially have to trim after changing the max value, let's abstract it away.
3. Fixing 2) should more or less fix the memory growth issue, but if we have a huge divergence in the size of object metadata (some objects taking 100-1000s of bytes and others taking MBs) we're going to really start thrashing the onode cache. We'll trade extreme memory growth for terrible performance, with potentially huge trims and lots of cache misses for onodes with little associated metadata (better than excessive memory growth, but still not good). The memory growth is, imho, only a side effect of what appears to be the bigger problem: huge metadata memory consumption associated with some cached onodes.

Edit: One last thought: if we think Crimson is going to take a while and we do see a legitimately big divergence in the amount of metadata per object depending on some criteria, we may want to rethink the caches a little bit to take the number of blobs/extents/sharedblobs into account.

That's my take!
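The "move trim() into set_max() with a divergence threshold" idea could be sketched like this. Everything here is hypothetical (class shape, member names, and the 1/8 threshold are illustrative choices, not Ceph's actual code):

```cpp
#include <cstdint>

// Sketch: fold trim() into set_max() so callers can't forget it, and only
// pay for a trim when the budget actually shrank by a meaningful fraction.
// The 1/8 threshold is an arbitrary illustrative value.
class OnodeCacheShard {
  uint64_t max_ = 0;
  uint64_t num_ = 0;   // currently cached onodes
public:
  uint64_t trims = 0;  // trim counter, for observability in this sketch

  void insert() { ++num_; maybe_trim(); }

  void set_max(uint64_t new_max) {
    // Trim only if the budget dropped by more than ~1/8th, to avoid
    // thrashing when the periodic average-size estimate jitters.
    bool shrank_significantly = new_max + new_max / 8 < max_;
    max_ = new_max;
    if (shrank_significantly) maybe_trim();
  }

  uint64_t size() const { return num_; }

private:
  void maybe_trim() {
    if (num_ > max_) { num_ = max_; ++trims; }  // evict down to budget
  }
};
```

The tricky part Mark alludes to is choosing the threshold: relative thresholds behave badly at very small budgets, and any skipped trim means memory temporarily stays above target until the next significant shrink or insert.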
```diff
@@ -3926,6 +3926,7 @@ void BlueStore::MempoolThread::_resize_shards(bool interval_stats)
     for (auto i : store->onode_cache_shards) {
       i->set_max(max_shard_onodes);
+      i->trim();
```
Let's move this into set_max itself (and possibly add logic to only do the trim if the max value has dropped by some threshold, but this could be a little tricky).
@markhpc Do you think this should apply to all caches that inherit from CacheShard, or just to OnodeCacheShard?
Potentially related tracker issue: https://tracker.ceph.com/issues/45706
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
@aclamk ping?
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
@aclamk ping?
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!
Solves the case where there is a fixed set of Onodes and all operations operate only on existing ones.
Without this fix, there is no trigger to actually drop some cached Onodes.
The problem was detected during BlueStore compression tests, where all objects were created before being filled with data.
Initially there were 10,000 objects, each taking ~100 bytes of Onode space.
When each object grew to 32MB, each used ~1.1MB of Onode(+Extents+Blobs+SharedBlobs) memory, for a total of ~12GB.
Each OnodeCacheShard could separately be triggered to trim() by adding a dummy rados object, but that only works for the affected shard.
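The failure mode above can be modeled with a toy sketch (all names and numbers are illustrative, matching the figures in this description rather than real BlueStore internals): per-onode metadata grows behind the cache's back, and since the cache counts items rather than bytes and no inserts occur, nothing ever triggers a trim.

```cpp
#include <cstdint>
#include <vector>

// Toy model of the reported scenario: a fixed population of onodes is
// cached once, then each object's metadata footprint grows from ~100 bytes
// to ~1.1 MB as data is written. With no new inserts, no trim fires, so
// the cache's resident bytes grow far past any intended target.
struct ToyCache {
  std::vector<uint64_t> onode_bytes;  // per-onode metadata footprint
  uint64_t bytes() const {
    uint64_t total = 0;
    for (auto b : onode_bytes) total += b;
    return total;
  }
};

uint64_t simulate() {
  ToyCache c;
  c.onode_bytes.assign(10000, 100);    // 10,000 onodes, ~100 B each
  // Objects are filled with data; extents/blobs grow behind the cache's
  // back. No insert happens, so nothing calls trim().
  for (auto& b : c.onode_bytes) b = 1100000;
  return c.bytes();  // ~11 GB in this toy arithmetic (the report saw ~12 GB)
}
```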