
os/bluestore: do not assert if BlueFS rebalance is unable to allocate sufficient space #18494

Merged
merged 3 commits into ceph:master from ifed01:wip-stupidalloc-fix2 on Feb 9, 2018

Conversation

ifed01
Contributor

@ifed01 ifed01 commented Oct 23, 2017

Under some (many?) conditions StupidAllocator can use free extents from bins greater than necessary, which results in high fragmentation. Two related issues were fixed:

  1. (minor) init_rm_free doesn't demote the extent to a lower bin.
  2. The allocation hint preserved internally in StupidAllocator might result in an allocation from a bin higher than is really needed. E.g. the following sequence causes that on a clean store:
  • step 1: allocate small extents (len=0x1000) multiple times. This effectively advances the hint to the offset following the highest allocated extent.
  • step 2: release some of the allocated extents.
  • step 3: allocate small extent(s) again - due to the preserved hint and the lower_bound(hint) call when searching bins upward, one will skip the lower bin(s), as they contain only offsets below the hint value. Hence we allocate from the highest bin instead of reusing the short extents that are available, and fragment the space this way. Besides, this increases the hint again, causing any subsequent allocation to drain the highest bin first over and over.
    The scenario above (or similar variations) looks pretty repetitive, and draining the topmost bin first seems to be quite a common case.
    As a result one can face an assert when the allocator is unable to return 1MB-long extent(s) for BlueFS while the reported free space is much, much larger.
    The partial fix is to remove the assert on insufficient allocation during BlueFS rebalance.
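A minimal sketch of the bin-skipping effect described above; the container layout, names and signature are made up for illustration and are not the actual StupidAllocator internals:

```cpp
// Simplified, hypothetical model: bins[i] maps offset -> length of free extents
// whose size falls into bin i; larger i means larger extents. With a persistent
// hint, lower_bound(hint) ignores every free extent below the hint, so lower
// bins get skipped and allocations keep draining the highest bin.
#include <cstdint>
#include <map>
#include <optional>
#include <vector>

struct Extent { uint64_t off; uint64_t len; };

std::optional<Extent> allocate_with_hint(
    const std::vector<std::map<uint64_t, uint64_t>>& bins,
    unsigned start_bin, uint64_t want, uint64_t hint)
{
  for (unsigned b = start_bin; b < bins.size(); ++b) {
    auto it = bins[b].lower_bound(hint);  // only consider extents at/after the hint
    if (it == bins[b].end())
      continue;  // bin may still hold free extents, but all below the hint: skipped
    if (it->second >= want)
      return Extent{it->first, want};     // found, possibly in a needlessly high bin
  }
  return std::nullopt;  // nothing at/after the hint in any bin
}
```

With a hint pointing past the offsets stored in the lower bins, every call falls through to the topmost bin even though plenty of small free extents exist below the hint.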

_insert_free(off, len);
return false;
}
return true;
Member

I would reverse the return value so that 'true' means the function handled it, and 'false' means the interval_set should make its usual insertion.

Contributor Author

will do

@liewegas
Member

Shouldn't the callers requesting large bins handle an allocation that is composed of lots of little pieces? (1) I thought that's what we were doing already :/, and (2) bluefs shouldn't ever see fragmentation because all of its allocations are 1MB, right? Oh, you're probably thinking of the freespace balance method allocating space to give to bluefs? That would break, yeah.

Have you looked at the AVLAllocator? See #18187. I wonder if we should focus our efforts there as it is a bit more elegant than stupid...

@ifed01
Contributor Author

ifed01 commented Oct 23, 2017

@liewegas: WRT handling large segmented allocations at the client side - I'm not sure, but IMO that's the last resort. Generally it's better to avoid complicated stuff at the client side whenever possible. On the other hand - we already support similar behavior for BlueStore allocations with pextents...
As for 2) - yes, I observed that for free space rebalance only. This is probably not an issue when ALL allocations/releases are fixed-size.
Will take a look at #18187, but I'm afraid StupidAllocator needs this fix anyway, since it might be a pretty long story to bring AVL into action: validation under different use cases, benchmarking, etc...
I was quite surprised you reverted to StupidAllocator after living with the BitMap one for so long.....

@liewegas
Member

I wonder if a simpler fix is to be less aggressive about the hint. If the extent it finds is beyond some distance from the hint, it can fall back to best-fit?

The reason we switched to stupid was because it was faster. IIRC this was especially true for HDD, I suspect in part because the hint state meant it did mostly sequential allocations for write workloads.

@ifed01
Contributor Author

ifed01 commented Oct 24, 2017

@liewegas - and what do you think about rewinding the hint on extent release? It's even simpler and should resolve the issue?

@liewegas
Member

Yeah, it would resolve it, but I have a feeling it would pretty dramatically reduce HDD sequentiality. It seems like resetting the hint every N seconds or something would also work. (The hint is useless once the head has moved, so actually resetting it on a read would almost work too, if we got the timing right.)
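For illustration only, a hedged sketch of the two options being discussed (rewinding the hint on release vs. expiring it after some interval); the names and types below are invented and are not part of this PR:

```cpp
// Hypothetical helpers, not existing Ceph code.
#include <chrono>
#include <cstdint>

struct Hint {
  uint64_t offset = 0;
  std::chrono::steady_clock::time_point set_at{};
};

// Rewind: whenever space is released below the current hint, pull the hint
// back so the next lower_bound(hint) search can reach the freed extent again.
void rewind_on_release(Hint& h, uint64_t freed_off) {
  if (freed_off < h.offset)
    h.offset = freed_off;
}

// Expiry: treat the hint as stale after a fixed time-to-live; a false return
// would mean "ignore/reset the hint and fall back to best-fit".
bool hint_is_fresh(const Hint& h, std::chrono::seconds ttl) {
  return std::chrono::steady_clock::now() - h.set_at < ttl;
}
```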

@ifed01
Contributor Author

ifed01 commented Oct 24, 2017

Got it. One more concern about that sequentiality stuff though - IMO it works well only as long as we can effectively locate long enough contiguous extents (e.g. on a clean store, or when hints point to the remnants of such a store). Once you fragment the space (or the hints stop working this way, e.g. when we reset or rewind them ;) ) - one will probably lose most of the benefit from that functionality. Generally the current approach (and perhaps all other simple hint tricks) might lead to sequential access performance degradation over time.

@liewegas
Member

Yeah, I think we really need to come up with a straw-man policy here instead of kludging with this greedy one. Something like:

  • if no hint, best-fit
  • look for an extent near the hint
    • if it's not close, ignore hint and fall back to best-fit
  • if hint is stale (count allocations/uses?) clear it

Or, have a 'weight' associated with the hint...

  • decrement weight on each hint use
  • candidate extent score inversely related to distance from hint and inversely related to oversized-ness
  • only use the hint if the candidate extent's score is < weight
    so that the further away from the hint, the less willing we are to use an oversized extent.
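One possible reading of that weight scheme, as a rough sketch; the names, scoring formula and units below are invented for illustration and are not code from this PR:

```cpp
// Hypothetical helper: distance from the hint and oversized-ness both count
// against a candidate, and the hint is honoured only while that cost stays
// under the remaining weight, which is spent a little on every hint use.
#include <cstdint>

struct HintState {
  uint64_t hint = 0;
  int64_t weight = 0;  // replenished whenever the hint is (re)set
};

bool use_hinted_extent(HintState& h, uint64_t cand_off, uint64_t cand_len,
                       uint64_t want_len)
{
  if (h.weight <= 0 || want_len == 0)
    return false;  // stale hint (or bad input): fall back to best-fit
  uint64_t distance = cand_off > h.hint ? cand_off - h.hint : h.hint - cand_off;
  uint64_t oversize = cand_len > want_len ? cand_len - want_len : 0;
  // Illustrative cost only: grows with distance from the hint and with how
  // much bigger the candidate is than what was asked for.
  int64_t cost = static_cast<int64_t>((distance + oversize) / want_len);
  if (cost >= h.weight)
    return false;  // too far / too oversized for the weight we have left
  --h.weight;      // each hint use spends a bit of the weight
  return true;
}
```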

@liewegas
Member

The dual index in the AVL allocator is great for best-fit, but the binning in stupid is nice for quickly finding extents that are big enough. I think which we use depends on what policy we target...

@markhpc
Member

markhpc commented Oct 26, 2017

@ifed01: Good to have you back! 👍

@ifed01
Contributor Author

ifed01 commented Dec 21, 2017

@ifed01 ifed01 changed the title from "os/bluestore: fix excessive fragmentation in StupidAllocator" to "os/bluestore: do not assert if BlueFS rebalance is unable to allocate sufficient space" on Feb 6, 2018
@ifed01
Contributor Author

ifed01 commented Feb 6, 2018

@liewegas - reworked to implement a partial but simple fix - do not assert on insufficient allocation during rebalance. IMO this error isn't that critical and can be ignored. Added a test case to reproduce the issue too. Mind reviewing?

ifed01 and others added 3 commits February 6, 2018 22:51
…_free

Signed-off-by: Igor Fedotov <ifedotov@suse.com>
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
rebalance

Signed-off-by: Igor Fedotov <ifedotov@suse.com>
@tchaikov
Contributor

tchaikov commented Feb 9, 2018

The failures are either caused by #19117 or tracked by http://tracker.ceph.com/issues/9356.

@tchaikov tchaikov merged commit 7894961 into ceph:master Feb 9, 2018
@ifed01 ifed01 deleted the wip-stupidalloc-fix2 branch February 9, 2018 07:24
ifed01 added a commit to ifed01/ceph that referenced this pull request Jun 9, 2020
This test case was introduced by ceph#18494
to verify allocation failure handling while gifting during bluefs rebalance.
Now it looks outdated as there is no periodic gifting any more.

Fixes: https://tracker.ceph.com/issues/45788

Signed-off-by: Igor Fedotov <ifedotov@suse.com>
fabrizio8 pushed a commit to rhcs-dashboard/ceph that referenced this pull request Jun 15, 2020
agayev pushed a commit to agayev/ceph that referenced this pull request Jun 23, 2020
smithfarm pushed a commit to smithfarm/ceph that referenced this pull request Jul 11, 2020
ideepika pushed a commit to ideepika/ceph that referenced this pull request Aug 7, 2020
ideepika pushed a commit to ceph/ceph-ci that referenced this pull request Sep 3, 2020
smithfarm pushed a commit to smithfarm/ceph that referenced this pull request Oct 26, 2020
ideepika pushed a commit to ceph/ceph-ci that referenced this pull request Nov 12, 2020