os/bluestore: Force Onode cache to trim when quota changes #35171
Conversation
Solves the case where there is a fixed set of Onodes and all operations operate only on existing ones. Without this fix, there is no trigger to actually drop some cached Onodes. Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
```diff
@@ -3926,6 +3926,7 @@ void BlueStore::MempoolThread::_resize_shards(bool interval_stats)
     for (auto i : store->onode_cache_shards) {
       i->set_max(max_shard_onodes);
+      i->trim();
     }
     for (auto i : store->buffer_cache_shards) {
       i->set_max(max_shard_buffer);
```
should we do the same for buffer_cache_shards?
It is also missing for buffers.
Whether we should copy this solution there depends on whether a cluster can realistically be running without doing any reads or writes, or whether some buffer shards can be dormant for some reason.
I don't think we need this for buffers unless the size of a given buffer can change behind the buffer cache's back (which it shouldn't). We exactly track how many bytes are in the cache and know how big a cached buffer is.
The onode case is special because caching an onode can result in a variable amount of required memory that can change behind the cache's back depending on the number of extents/blobs/etc. That's why the onode cache trims based on the number of items instead of the number of bytes, and we have the whole wonky periodic average metadata size calculation that calls set_max to change the maximum number of onodes to cache.
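The mechanism described here can be sketched roughly as follows. This is an illustrative sketch only, not Ceph's actual code; the function name and signature are hypothetical. The point is that the onode cache's budget is an item count, so the mempool thread has to translate a byte target into an item count via the observed average metadata bytes per cached onode:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical sketch of the periodic budget calculation described above.
// The onode cache is limited by item count, so a byte target is converted
// into an item count using the observed average metadata footprint per
// cached onode (onode + extents + blobs + shared blobs).
uint64_t max_shard_onodes_for(uint64_t shard_mem_target,
                              uint64_t mapped_onode_bytes,
                              uint64_t cached_onodes)
{
  uint64_t avg = cached_onodes ? mapped_onode_bytes / cached_onodes : 1;
  return shard_mem_target / std::max<uint64_t>(avg, 1);
}
```

If the average grows after the last set_max() call and no new inserts happen, nothing recomputes or enforces the budget — which is exactly the gap this PR is trying to close.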
Hi Adam, let me think about this a little bit. Can we trigger a trim on growth rather than in the loop here? One of the goals of the refactor last summer was to try to avoid temporary growth between MempoolThread cycles.

Edit: Also, 1.1MB of Onode for a 32MB object is insane (to the point of being broken)! We shouldn't have ~3% space overhead for extents/blobs/sharedblobs. Any idea how much of that 1.1MB was used for extents vs blobs vs sharedblobs?
Ok, so thinking about this some more:

1. imho, one of the primary issues here is that the average metadata memory consumption resulting from caching an onode can change (apparently dramatically) behind the onode cache's back. Normally this change should be small and you'll hit a new trim cycle from a cache insert before it gets too bad, but Adam mentioned that he's seeing a case where we may cache 1MB+ of metadata for a 32MB object. That, combined with no new cache inserts (because all onodes are already cached), leads to this divergent behavior.
2. This PR is probably good enough to solve the issue nearly 100% of the time (the only failure cases I can think of involve extremely rapid and extreme onode size growth, but normally that shouldn't happen). I would like to consider moving the trim() call into set_max() itself though (and possibly adding some logic to avoid doing the trim if max hasn't significantly diverged). Then it's just a question of when we run set_max. Currently we do it in a loop in the mempool thread, but perhaps in the future we'll want to do it on object modification (or after a certain number of bytes have been written/deleted/etc). If we may potentially have to trim after changing the max value, let's abstract it away.
3. Fixing 2) should more or less fix the memory growth issue, but if we have a huge divergence in the size of object metadata (some objects taking 100-1000s of bytes and others taking MBs) we're going to really start thrashing the onode cache. We'll trade extreme memory growth for terrible performance, with potentially huge trims and lots of cache misses for onodes with little associated metadata (better than excessive memory growth, but still not good). The memory growth is, imho, only a side effect of what appears to be the bigger problem: huge metadata memory consumption associated with some cached onodes.

Edit: One last thought: if we think Crimson is going to take a while and we do see a legitimately big divergence in the amount of metadata per object depending on some criteria, we may want to rethink the caches a little bit to take the number of blobs/extents/sharedblobs into account.

That's my take!
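The "move trim() into set_max() with a divergence threshold" idea could be sketched like this. Everything here is hypothetical (class shape, member names, and the 1/8 threshold are illustrative choices, not Ceph's actual code):

```cpp
#include <cstdint>

// Sketch: fold trim() into set_max() so callers can't forget it, and only
// pay for a trim when the budget actually shrank by a meaningful fraction.
// The 1/8 threshold is an arbitrary illustrative value.
class OnodeCacheShard {
  uint64_t max_ = 0;
  uint64_t num_ = 0;   // currently cached onodes
public:
  uint64_t trims = 0;  // trim counter, for observability in this sketch

  void insert() { ++num_; maybe_trim(); }

  void set_max(uint64_t new_max) {
    // Trim only if the budget dropped by more than ~1/8th, to avoid
    // thrashing when the periodic average-size estimate jitters.
    bool shrank_significantly = new_max + new_max / 8 < max_;
    max_ = new_max;
    if (shrank_significantly) maybe_trim();
  }

  uint64_t size() const { return num_; }

private:
  void maybe_trim() {
    if (num_ > max_) { num_ = max_; ++trims; }  // evict down to budget
  }
};
```

The tricky part Mark alludes to is choosing the threshold: relative thresholds behave badly at very small budgets, and any skipped trim means memory temporarily stays above target until the next significant shrink or insert.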
```diff
@@ -3926,6 +3926,7 @@ void BlueStore::MempoolThread::_resize_shards(bool interval_stats)
     for (auto i : store->onode_cache_shards) {
       i->set_max(max_shard_onodes);
+      i->trim();
```
Let's move this into set_max itself (and possibly add logic to only do the trim if the max value has dropped by some threshold, but this could be a little tricky).
@markhpc Do you think this should apply to all caches that inherit from CacheShard, or just to OnodeCacheShard?
Potentially related tracker issue: https://tracker.ceph.com/issues/45706
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
@aclamk ping?
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
@aclamk ping?
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!
Solves the case where there is a fixed set of Onodes and all operations operate only on existing ones.
Without this fix, there is no trigger to actually drop some cached Onodes.
The problem was detected during BlueStore compression tests, where all objects were created before being filled with data.
Initially there were 10,000 objects, each taking ~100 bytes of Onode space.
When each object grew to 32MB, each used ~1.1MB of Onode(+Extents+Blobs+SharedBlobs) memory, for a total of ~12GB.
Each OnodeCacheShard could separately be triggered to trim() by adding a dummy rados object, but that only works for the affected shard.
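The failure mode above can be modeled with a toy sketch (all names and numbers are illustrative, matching the figures in this description rather than real BlueStore internals): per-onode metadata grows behind the cache's back, and since the cache counts items rather than bytes and no inserts occur, nothing ever triggers a trim.

```cpp
#include <cstdint>
#include <vector>

// Toy model of the reported scenario: a fixed population of onodes is
// cached once, then each object's metadata footprint grows from ~100 bytes
// to ~1.1 MB as data is written. With no new inserts, no trim fires, so
// the cache's resident bytes grow far past any intended target.
struct ToyCache {
  std::vector<uint64_t> onode_bytes;  // per-onode metadata footprint
  uint64_t bytes() const {
    uint64_t total = 0;
    for (auto b : onode_bytes) total += b;
    return total;
  }
};

uint64_t simulate() {
  ToyCache c;
  c.onode_bytes.assign(10000, 100);    // 10,000 onodes, ~100 B each
  // Objects are filled with data; extents/blobs grow behind the cache's
  // back. No insert happens, so nothing calls trim().
  for (auto& b : c.onode_bytes) b = 1100000;
  return c.bytes();  // ~11 GB in this toy arithmetic (the report saw ~12 GB)
}
```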