os/bluestore: do not assert if BlueFS rebalance is unable to allocate sufficient space #18494
Conversation
_insert_free(off, len);
  return false;
}
return true;
I would reverse the return value so that 'true' means the function handled it, and 'false' means the interval_set should make its usual insertion.
will do
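For illustration, a minimal sketch of the reversed convention being suggested (hypothetical names and threshold, not the PR's actual code): the callback returns `true` when it has fully handled the freed range and `false` when the interval_set should fall through to its usual insertion.

```cpp
#include <cstdint>

// Hypothetical threshold; purely illustrative.
static constexpr uint64_t kMinHandledLen = 0x10000;

// Returns true when the freed range was handled here (so the caller skips
// its default insertion) and false when the interval_set should insert the
// range as usual -- the opposite of the snippet above.
bool try_handle_free(uint64_t off, uint64_t len)
{
  (void)off;  // a real implementation would place [off, off+len) into a bin
  if (len >= kMinHandledLen) {
    // e.g. _insert_free(off, len) as in the snippet above
    return true;
  }
  return false;
}
```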
Shouldn't the callers requesting large bins handle an allocation that is composed of lots of little pieces? (1) I thought that's what we were doing already :/, and (2) bluefs shouldn't ever see fragmentation because all of its allocations are 1MB, right? Oh, you're probably thinking of the freespace balance method allocating space to give to bluefs? That would break, yeah. Have you looked at the AVLAllocator? See #18187. I wonder if we should focus our efforts there as it is a bit more elegant than stupid...
@liewegas: WRT handling large segmented allocations at the client side - I'm not sure, but IMO that's the last resort. Generally it's better to avoid complicated stuff at the client side whenever possible. On the other hand, we already support similar behavior for BlueStore allocations with pextents...
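As a rough sketch of the client-side handling mentioned here (the `Alloc` interface and names below are stand-ins, not the real Allocator API from the tree), a caller could accept the requested total as a list of smaller pieces, similar to how BlueStore already tracks pextents:

```cpp
#include <cstdint>
#include <vector>

// Stand-in extent and allocator types; not the real Allocator API.
struct Extent { uint64_t offset, length; };

struct Alloc {
  // Returns the number of bytes actually allocated (0 on failure) and
  // appends the pieces it produced to *out.
  virtual uint64_t allocate_some(uint64_t want, uint64_t alloc_unit,
                                 std::vector<Extent>* out) = 0;
  virtual ~Alloc() = default;
};

// Gather up to `want` bytes from possibly fragmented free space, accepting
// many small pieces instead of requiring one contiguous extent.
uint64_t allocate_fragmented(Alloc& a, uint64_t want, uint64_t alloc_unit,
                             std::vector<Extent>* out)
{
  uint64_t got = 0;
  while (got < want) {
    uint64_t r = a.allocate_some(want - got, alloc_unit, out);
    if (r == 0)
      break;              // free space exhausted; return what we have
    got += r;
  }
  return got;
}
```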
I wonder if a simpler fix is to be less aggressive about the hint. If the extent it finds is beyond some distance from the hint, it can fall back to best-fit? The reason we switched to stupid was because it was faster. IIRC this was especially true for HDD, I suspect in part because the hint state meant it did mostly sequential allocations for write workloads.
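A minimal sketch of that fallback idea, assuming hypothetical lookup callbacks rather than the real StupidAllocator internals: accept the hint-based candidate only if it lands within some distance of the hint, otherwise switch to best-fit.

```cpp
#include <cstdint>
#include <functional>
#include <optional>

// Hypothetical candidate type and lookup callbacks standing in for the
// allocator's internal searches.
struct Candidate { uint64_t offset, length; };
using Lookup = std::function<std::optional<Candidate>(uint64_t want)>;

std::optional<Candidate> allocate_with_hint(
    uint64_t hint, uint64_t want, uint64_t max_hint_distance,
    const Lookup& find_near_hint, const Lookup& find_best_fit)
{
  if (auto c = find_near_hint(want)) {
    uint64_t dist = c->offset > hint ? c->offset - hint : hint - c->offset;
    if (dist <= max_hint_distance)
      return c;                 // close enough: keep the layout sequential
  }
  return find_best_fit(want);   // too far from the hint: prefer best-fit
}
```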
@liewegas - and what do you think about rewinding the hint on extent release? It's even simpler and should resolve the issue?
Yeah, it would resolve it, but I have a feeling it would pretty dramatically reduce HDD sequentiality. It seems like resetting the hint every N seconds or something would also work. (The hint is useless once the head has moved, so actually resetting it on a read would almost work too, if we got the timing right.)
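A tiny sketch of the time-based reset variant (the class, names, and `std::chrono` plumbing are purely illustrative): the wrapper simply forgets the hint once it is older than a configured interval.

```cpp
#include <chrono>
#include <cstdint>

// Illustrative wrapper: the hint expires after a fixed interval, so stale
// hints cannot keep dragging allocations toward one region of the device.
class TimedHint {
public:
  explicit TimedHint(std::chrono::seconds ttl) : ttl_(ttl) {}

  void set(uint64_t offset) {
    hint_ = offset;
    stamp_ = std::chrono::steady_clock::now();
    valid_ = true;
  }

  // Returns 0 ("no hint") once the hint has expired.
  uint64_t get() {
    if (valid_ && std::chrono::steady_clock::now() - stamp_ > ttl_)
      valid_ = false;
    return valid_ ? hint_ : 0;
  }

private:
  std::chrono::seconds ttl_;
  std::chrono::steady_clock::time_point stamp_{};
  uint64_t hint_ = 0;
  bool valid_ = false;
};
```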
Got it. One more concern about that sequentiality stuff though: IMO it works well only as long as we can effectively locate long enough contiguous extents (e.g. on a clean store, or when hints point to the remnants of such a store). Once you fragment the space (or hints stop working this way, e.g. when we reset or rewind them ;) ) - one will probably lose most of the benefit from that functionality. Generally the current approach (and perhaps all other simple hint tricks) might lead to sequential access performance degradation over time.
Yeah, I think we really need to come up with a straw-man policy here instead of kludging with this greedy one. Something like:
Or, have a 'weight' associated with the hint...
The dual index in the AVL allocator is great for best-fit, but the binning in stupid is nice for quickly finding extents that are big enough. I think which we use depends on what policy we target...
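For context, a generic sketch of the binning idea (not the actual StupidAllocator code): free extents live in power-of-two size bins, so a request only needs to scan bins that can hold big-enough extents, starting from the bin that matches the requested size.

```cpp
#include <array>
#include <cstdint>
#include <map>
#include <optional>

// Generic sketch, not the actual StupidAllocator code.
struct FreeExtent { uint64_t offset, length; };

struct BinnedFree {
  static constexpr int kBins = 10;
  // One map per power-of-two size class, keyed by extent length.
  std::array<std::multimap<uint64_t, uint64_t>, kBins> bins; // length -> offset

  static int bin_of(uint64_t len) {
    int b = 0;
    while ((1ull << (b + 1)) <= len && b + 1 < kBins)
      ++b;
    return b;
  }

  void insert(uint64_t off, uint64_t len) {
    bins[bin_of(len)].emplace(len, off);
  }

  // Find any extent with length >= want, scanning only bins that can hold one.
  std::optional<FreeExtent> find_big_enough(uint64_t want) {
    for (int b = bin_of(want); b < kBins; ++b) {
      auto it = bins[b].lower_bound(want);
      if (it != bins[b].end())
        return FreeExtent{it->second, it->first};
    }
    return std::nullopt;
  }
};
```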
@ifed01: Good to have you back! 👍
Force-pushed from af73ad9 to 619cc7e
Force-pushed from 619cc7e to 3b792f6
@liewegas - reworked to implement a partial but simple fix - do not assert on insufficient allocation during rebalance. IMO this error isn't that critical and can be ignored. Adding a test case to reproduce the issue too. Mind reviewing?
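A sketch of the direction described here, with an invented callback rather than the actual BlueStore rebalance code: instead of asserting when the full gift cannot be allocated, log the shortfall and carry on with whatever was obtained.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Illustrative piece type and callback; not the actual BlueStore code.
struct Piece { uint64_t offset, length; };

// Try to gift `want` bytes to BlueFS; on a shortfall, warn and return what
// was actually obtained instead of asserting.
uint64_t gift_to_bluefs(uint64_t want,
                        uint64_t (*allocate_cb)(uint64_t, std::vector<Piece>*),
                        std::vector<Piece>* out)
{
  uint64_t got = allocate_cb(want, out);
  if (got < want) {
    // Previously this was a hard assert(got == want); the shortfall is not
    // fatal -- the gift can simply be smaller this round and retried later.
    std::cerr << "bluefs rebalance: allocated 0x" << std::hex << got
              << " of requested 0x" << want << std::dec << "\n";
  }
  return got;
}
```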
…_free Signed-off-by: Igor Fedotov <ifedotov@suse.com>
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
rebalance Signed-off-by: Igor Fedotov <ifedotov@suse.com>
Force-pushed from 3b792f6 to 64abc7b
The failures are either caused by #19117 or tracked by http://tracker.ceph.com/issues/9356.
This test case was introduced by ceph#18494 to verify allocation failure handling while gifting during bluefs rebalance. Now it looks outdated as there is no periodic gifting any more. Fixes: https://tracker.ceph.com/issues/45788 Signed-off-by: Igor Fedotov <ifedotov@suse.com>
Under some (many?) conditions StupidAllocator can use free extents from bins greater than necessary, which results in high fragmentation. Two related issues were fixed:
The scenario above (or similar variations) looks pretty repetitive, and draining the topmost bin first seems to be quite a common case.
As a result one can face an assert when the allocator is unable to return 1MB-long extent(s) for BlueFS while the reported free space is much larger.
A partial fix is to remove the assert on insufficient allocation during BlueFS rebalance.
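To make the failure mode concrete, a small worked example with invented numbers: total free space looks plentiful, yet no single extent can satisfy a contiguous 1MB request for BlueFS, which is exactly the situation that used to trip the assert.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

int main()
{
  const uint64_t want = 1ull << 20;                        // 1 MB contiguous request
  std::vector<uint64_t> free_extents(200, 512ull << 10);   // 200 x 512 KB free

  uint64_t total = 0, largest = 0;
  for (uint64_t len : free_extents) {
    total += len;
    if (len > largest)
      largest = len;
  }

  // Total free is 100 MB, but the largest extent is only 512 KB < 1 MB, so a
  // contiguous 1 MB allocation fails; the old code asserted in this case.
  std::cout << "total free: " << (total >> 20) << " MB, largest extent: "
            << (largest >> 10) << " KB\n";
  std::cout << (largest >= want ? "allocation ok" : "allocation fails") << "\n";
}
```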