New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
os/bluestore: introduce hybrid_btree2 allocator #52489
base: main
Are you sure you want to change the base?
Conversation
11549b0
to
c44b1c2
Compare
hi @ifed01, i found a error, i report here first:
|
I add some debug log, as below:
file ino 39, it want need 1048576, unit 1048576, but btree2 allocate twice, the first time alloc offset and len is all not align 1048576. it cause to extents verify error after restart osd.
|
I found the initial position of this error is
this leads to product a btree node with a size of less than 1M in bucket 8, and then we have not align |
@YiteGu - thanks a lot for the investigation. Indeed the new allocator doesn't follow the requirement to allocate unit size aligned extents any more. From now on unit size parameter determines new extent's length only. And extent's location to be aligned with block device unit (=4K). Hence I'll have to get rid off the relevant verification at BlueFS::_check_allocations() method. Will do that shortly |
I got it. :) |
@YiteGu - I've just realized that all the releases back to Quincy have already been switched to 4K alignment requirement in BlueFS::_verify_alloc_granularity() method. Hence they shouldn't report the above error. I presume you're trying new allocator on top of pacific release prior to 16.2.14, aren't you? Please confirm. |
Yes, your presume is right. |
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved |
c44b1c2
to
1bf7ef2
Compare
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved |
1bf7ef2
to
c403558
Compare
jenkins test make check |
jenkins test api |
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved |
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Intended for the forthcoming major update with new allocator implementation. Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Refactor hybrid allocator in a way to permit alternative hybrid allocator implementations using the same code base. Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Test case depends strongly on avl allocator results. Hence sticking to this allocator explicitly. Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
#define dout_context (T::get_context()) | ||
#define dout_subsys ceph_subsys_bluestore | ||
#undef dout_prefix | ||
#define dout_prefix *_dout << (std::string(this->get_type()) + "::").c_str() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we need to degrade std::string
to c_str ?
want, unit, max_alloc_size, hint, extents); | ||
ceph_assert(res >= 0); // all the errors to be handled inside | ||
// __allocate_or_rollback() | ||
if ((uint64_t)res < want) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its either res == 0 or res == want.
So, how about
ceph_assert(res==0 || res == want); if (res == 0) {
and there is no need for confusing 'want-resbelow, nor
res += res2`.
if (primary) { | ||
res = T::_allocate(want, unit, max_alloc_size, hint, extents); | ||
} else if (bmap_alloc) { | ||
res = bmap_alloc->allocate(want, unit, max_alloc_size, hint, extents); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't it be better ceph_assert(bmap_alloc)
?
__allocate_or_release
is not an interface at any point, and I see its integrated in fallback logic.
} else if (bmap_alloc) { | ||
res = bmap_alloc->allocate(want, unit, max_alloc_size, hint, extents); | ||
} | ||
if (res < 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the case that res < 0
it would be r==-ENOSPC.
I think you had in mind res < want
.
} | ||
if (res < 0) { | ||
// got a failure, release already allocated | ||
PExtentVector local_extents; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for optimizing for successful case.
} | ||
|
||
template <typename T> | ||
int64_t HybridAllocatorBase<T>::__allocate_or_rollback( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure what special meaning is behind double underscore __
that appeared here.
* [0, base] | ||
* (base, base*factor*2] | ||
* (base*factor*2, base*factor*4] | ||
* (base*factor*4, base*factor*8] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correction:
* [0, base]
* (base, base*2^factor]
* (base*2^factor, base*2^(factor*2)]
* (base*2^(factor*2), base*2^(factor*3]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the work is ready to go.
Minor non-critical comments.
Note: I did not review logic of btree allocator. I assume it is the copy from avl.
@aclamk Is it worth doing some preliminary performance sanity checks at this point? |
/* | ||
* return whether x is aligned with (align) | ||
* eg, p2aligned(1200, 0x400) ==> false | ||
* eg, p2aligned(1024, 0x400) ==> true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very confusing to use decimal vs hex here.
#include "BitmapAllocator.h" | ||
|
||
class HybridAllocator : public AvlAllocator { | ||
template <typename T> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
template <typename T> | |
template <typename PrimaryAllocator> |
|
||
template <typename T> | ||
void HybridAllocatorBase<T>::init_rm_free(uint64_t offset, uint64_t length) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In non-corrupted hybrid allocator state there is always a gap between chunk free in primary (avl/btree/btree2) and chunk free seconary (bitmap).
1 2 3 4 5
0 8 f0 8 f0 8 f0 8 f0 8 f0
rrrrrrrrr bbbbb bb bb b b b rrrrrrrrrrrrrrrrrr rrrrrrrrrrrrrrrrrrrr b rrrrrrrrrr
r - ranged, b - bitmap
It is not possible to have: rrrrrrbbbbbb
because bitmap part will be moved to range,
and they would form continuous rrrrrrrrrrrr
.
So, there are only 2 kinds of valid init_rm_free(off,len) ranges:
a) [off-len) is completely inside some range rrrrrrrrr
b) bitmap has all set in range [off-len) bbbbbbb
I claim that logic of _try_remove_from_tree can be much simpler.
<< std::dec | ||
<< dendl; | ||
ceph_assert(size); | ||
if (!bmap_alloc) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main, if not only goal of hybrid was to save memory.
It is suboptimal that when we are consuming too much memory with ranged allocator, we create bitmap and consume even more.
We should have a strategy, that if we pass over some threshold, we create bitmap, but then move ranges to bitmap until we are again under memory limit.
} | ||
|
||
void Btree2Allocator::_remove_from_tree(uint64_t start, uint64_t size) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once we create the Bitmap side of hybrid, there is no path when we decide that it is no longer needed?
This patch introduces new bluestore allocator: hybrid_btree2 intended primarily to lower disk fragmentation happening to aged OSD volumes.
The key features of the new allocator includes:
As a result new allocator generally shows much better performance, less RAM footprint, lower free space fragmentation and a bit higher data-at-rest fragmentation.
Sample numbers for hybrid and hybrid_btree allocators from allocator's benchmarking (
unittest_alloc_bench --gtest_filter="*test_alloc_bench2_90_500_x2/*"
):hybrid:
Executed in 209.750353
Avail 13106 MB Fragmentation:0.91834
latest iteration data-at-rest fragmentation ratio (allocated extents/allocation requests): 1.1089
hybrid_btree2:
Executed in 62.844530
Avail 13107 MB Fragmentation:0.639356
latest iteration data-at-rest fragmentation ratio (allocated extents/allocation requests): 1.6903
Some presentation slides can be found at https://docs.google.com/presentation/d/1TBLRvY7AF-K-DGijJifMnyUx_Oil2Mb-vIqp2TpMGQY/edit?usp=sharing
Signed-off-by: Igor Fedotov igor.fedotov@croit.io
Contribution Guidelines
To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "pacific"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
Checklist
Show available Jenkins commands
jenkins retest this please
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test dashboard cephadm
jenkins test api
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox
jenkins test windows