
os/bluestore: do not assert if BlueFS rebalance is unable to allocate sufficient space #18494

Merged
merged 3 commits into ceph:master from ifed01:wip-stupidalloc-fix2 on Feb 9, 2018

Conversation

ifed01
Contributor

@ifed01 ifed01 commented Oct 23, 2017

Under some (many?) conditions StupidAllocator can use free extents from bins greater than necessary, which results in high fragmentation. Two related issues were fixed:

  1. (minor) init_rm_free doesn't demote the extent to a lower bin.
  2. The allocation hint preserved internally in StupidAllocator might result in an allocation from a bin higher than is really needed. E.g. the following sequence causes that on a clean store:
  • step 1: allocate small extents (len=0x1000) multiple times. This effectively advances the hint to the offset following the highest allocated extent.
  • step 2: release some of the allocated extents.
  • step 3: allocate small extent(s) again - due to the preserved hint and the lower_bound(hint) call when searching bins upward, one will skip the lower bin(s), as they contain only offsets below the hint value. Hence we allocate from the highest bin instead of reusing the short extents that are available, and fragment the space this way. Besides, this increases the hint again, causing any subsequent allocation to drain the highest bin first over and over.
    The scenario above (or similar variations) looks pretty repetitive, and draining the topmost bin first seems to be quite a common case.
    As a result one can face an assert when the allocator is unable to return 1MB-long extent(s) for BlueFS while the reported free space is much, much larger.
    The partial fix is to remove the assert on insufficient allocation during BlueFS rebalance.
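A minimal sketch of the bin-skipping effect described above; the container layout, names and signature are made up for illustration and are not the actual StupidAllocator internals:

```cpp
// Simplified, hypothetical model: bins[i] maps offset -> length of free extents
// whose size falls into bin i; larger i means larger extents. With a persistent
// hint, lower_bound(hint) ignores every free extent below the hint, so lower
// bins get skipped and allocations keep draining the highest bin.
#include <cstdint>
#include <map>
#include <optional>
#include <vector>

struct Extent { uint64_t off; uint64_t len; };

std::optional<Extent> allocate_with_hint(
    const std::vector<std::map<uint64_t, uint64_t>>& bins,
    unsigned start_bin, uint64_t want, uint64_t hint)
{
  for (unsigned b = start_bin; b < bins.size(); ++b) {
    auto it = bins[b].lower_bound(hint);  // only consider extents at/after the hint
    if (it == bins[b].end())
      continue;  // bin may still hold free extents, but all below the hint: skipped
    if (it->second >= want)
      return Extent{it->first, want};     // found, possibly in a needlessly high bin
  }
  return std::nullopt;  // nothing at/after the hint in any bin
}
```

With a hint pointing past the offsets stored in the lower bins, every call falls through to the topmost bin even though plenty of small free extents exist below the hint.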

_insert_free(off, len);
return false;
}
return true;
Member

I would reverse the return value so that 'true' means the function handled it, and 'false' means the interval_set should make its usual insertion.

Contributor Author

will do

@liewegas
Member

Shouldn't the callers requesting large bins handle an allocation that is composed of lots of little pieces? (1) I thought that's what we were doing already :/, and (2) bluefs shouldn't ever see fragmentation because all of its allocations are 1MB, right? Oh, you're probably thinking of the freespace balance method allocating space to give to bluefs? That would break, yeah.

Have you looked at the AVLAllocator? See #18187. I wonder if we should focus our efforts there as it is a bit more elegant than stupid...

@ifed01
Contributor Author

ifed01 commented Oct 23, 2017

@liewegas: WRT handling large segmented allocations at the client side - I'm not sure, but IMO that's the last resort. Generally it's better to avoid complicated stuff at the client side whenever possible. On the other hand - we already support similar behavior for BlueStore allocations with pextents...
As for 2) - yes, I observed that for free space rebalance only. This is probably not an issue when ALL allocations/releases are fixed-size.
Will take a look at #18187, but I'm afraid StupidAllocator needs this fix anyway, since it might be a pretty long story to bring AVL into action: validation under different use cases, benchmarking, etc...
I was quite surprised you reverted to StupidAllocator after living with the BitMap one for so long.....

@liewegas
Member

I wonder if a simpler fix is to be less aggressive about the hint. If the extent it finds is beyond some distance from the hint, it can fall back to best-fit?

The reason we switched to stupid was because it was faster. IIRC this was especially true for HDD, I suspect in part because the hint state meant it did mostly sequential allocations for write workloads.

@ifed01
Contributor Author

ifed01 commented Oct 24, 2017

@liewegas - and what do you think about rewinding the hint on extent release? It's even simpler and should resolve the issue?

@liewegas
Member

Yeah, it would resolve it, but I have a feeling it would pretty dramatically reduce HDD sequentiality. It seems like resetting the hint every N seconds or something would also work. (The hint is useless once the head has moved, so actually resetting it on a read would almost work too, if we got the timing right.)
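For illustration only, a hedged sketch of the two options being discussed (rewinding the hint on release vs. expiring it after some interval); the names and types below are invented and are not part of this PR:

```cpp
// Hypothetical helpers, not existing Ceph code.
#include <chrono>
#include <cstdint>

struct Hint {
  uint64_t offset = 0;
  std::chrono::steady_clock::time_point set_at{};
};

// Rewind: whenever space is released below the current hint, pull the hint
// back so the next lower_bound(hint) search can reach the freed extent again.
void rewind_on_release(Hint& h, uint64_t freed_off) {
  if (freed_off < h.offset)
    h.offset = freed_off;
}

// Expiry: treat the hint as stale after a fixed time-to-live; a false return
// would mean "ignore/reset the hint and fall back to best-fit".
bool hint_is_fresh(const Hint& h, std::chrono::seconds ttl) {
  return std::chrono::steady_clock::now() - h.set_at < ttl;
}
```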

@ifed01
Contributor Author

ifed01 commented Oct 24, 2017

Got it. One more concern about that sequentiality stuff though - IMO it works well only as long as we can effectively locate long enough contiguous extents (e.g. on a clean store, or when hints point to the remnants of such a store). Once you fragment the space (or the hints stop working this way, e.g. when we reset or rewind them ;) ) - one will probably lose most of the benefit from that functionality. Generally the current approach (and perhaps all other simple hint tricks) might lead to sequential access performance degradation over time.

@liewegas
Member

Yeah, I think we really need to come up with a straw-man policy here instead of kludging with this greedy one. Something like:

  • if no hint, best-fit
  • look for an extent near the hint
    • if it's not close, ignore hint and fall back to best-fit
  • if hint is stale (count allocations/uses?) clear it

Or, have a 'weight' associated with the hint...

  • decrement weight on each hint use
  • candidate extent score inversely related to distance from hint and inversely related to oversized-ness
  • only use the hint if the candidate extent's score is < weight
    so that the further away from the hint, the less willing we are to use an oversized extent.
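One possible reading of that weight scheme, as a rough sketch; the names, scoring formula and units below are invented for illustration and are not code from this PR:

```cpp
// Hypothetical helper: distance from the hint and oversized-ness both count
// against a candidate, and the hint is honoured only while that cost stays
// under the remaining weight, which is spent a little on every hint use.
#include <cstdint>

struct HintState {
  uint64_t hint = 0;
  int64_t weight = 0;  // replenished whenever the hint is (re)set
};

bool use_hinted_extent(HintState& h, uint64_t cand_off, uint64_t cand_len,
                       uint64_t want_len)
{
  if (h.weight <= 0 || want_len == 0)
    return false;  // stale hint (or bad input): fall back to best-fit
  uint64_t distance = cand_off > h.hint ? cand_off - h.hint : h.hint - cand_off;
  uint64_t oversize = cand_len > want_len ? cand_len - want_len : 0;
  // Illustrative cost only: grows with distance from the hint and with how
  // much bigger the candidate is than what was asked for.
  int64_t cost = static_cast<int64_t>((distance + oversize) / want_len);
  if (cost >= h.weight)
    return false;  // too far / too oversized for the weight we have left
  --h.weight;      // each hint use spends a bit of the weight
  return true;
}
```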

@liewegas
Member

The dual index in the AVL allocator is great for best-fit, but the binning in stupid is nice for quickly finding extents that are big enough. I think which we use depends on what policy we target...

@markhpc
Member

markhpc commented Oct 26, 2017

@ifed01: Good to have you back! 👍

@ifed01
Contributor Author

ifed01 commented Dec 21, 2017

@ifed01 ifed01 changed the title from "os/bluestore: fix excessive fragmentation in StupidAllocator" to "os/bluestore: do not assert if BlueFS rebalance is unable to allocate sufficient space" on Feb 6, 2018
@ifed01
Contributor Author

ifed01 commented Feb 6, 2018

@liewegas - reworked to implement a partial but simple fix - do not assert on insufficient allocation during rebalance. IMO this error isn't that critical and can be ignored. Added a test case to reproduce the issue too. Mind reviewing?

ifed01 and others added 3 commits February 6, 2018 22:51
…_free

Signed-off-by: Igor Fedotov <ifedotov@suse.com>
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
rebalance

Signed-off-by: Igor Fedotov <ifedotov@suse.com>
@tchaikov
Contributor

tchaikov commented Feb 9, 2018

The failures are either caused by #19117 or tracked by http://tracker.ceph.com/issues/9356.

@tchaikov tchaikov merged commit 7894961 into ceph:master Feb 9, 2018
@ifed01 ifed01 deleted the wip-stupidalloc-fix2 branch February 9, 2018 07:24
ifed01 added a commit to ifed01/ceph that referenced this pull request Jun 9, 2020
This test case was introduced by ceph#18494
to verify allocation failure handling while gifting during bluefs rebalance.
Now it looks outdated as there is no periodic gifting any more.

Fixes: https://tracker.ceph.com/issues/45788

Signed-off-by: Igor Fedotov <ifedotov@suse.com>
fabrizio8 pushed a commit to rhcs-dashboard/ceph that referenced this pull request Jun 15, 2020
agayev pushed a commit to agayev/ceph that referenced this pull request Jun 23, 2020
smithfarm pushed a commit to smithfarm/ceph that referenced this pull request Jul 11, 2020
ideepika pushed a commit to ideepika/ceph that referenced this pull request Aug 7, 2020
ideepika pushed a commit to ceph/ceph-ci that referenced this pull request Sep 3, 2020
smithfarm pushed a commit to smithfarm/ceph that referenced this pull request Oct 26, 2020
ideepika pushed a commit to ceph/ceph-ci that referenced this pull request Nov 12, 2020