os/bluestore/AvlAllocator: introduce bluestore_avl_alloc_ff_max_* options #41615
@aclamk Hi Adam, how do we (you) profile the allocator in general? i think i should test the performance characteristics of this changeset and paste the results, to help the reviewers and myself understand this change better.
@tchaikov - very simple allocator benchmarking is available through unittest_alloc_bench
@ifed01 thank you, Igor. will test the change using the tool and paste the result.
@tchaikov Also
@tchaikov - Kefu, I can't see an implementation of switching to best-fit mode. Returning -1 from block_picker isn't enough for that IMO. Nor is there any logic to bypass these new caps in best-fit mode. I presume we should search through the whole extent tree in that mode. Without such means the allocator would apparently return ENOSPC much more often...
unittest_alloc_bench
unittest_alloc_aging
@ifed01 regarding #30897 (review), i think we can switch to a b-tree instead of a binary tree such as an AVL tree or red-black tree, since b-tree nodes have much less overhead from the memory footprint perspective.
jenkins test make check |
start = _pick_block_after(cursor, size, unit);
dout(20) << __func__ << " first fit=" << start << " size=" << size << dendl;
}
if (start == -1ULL) {
With this modification we gave up first-fit mode for all follow-up attempts with reduced sizes when original request size hasn't been satisfied. Is this intentional? Do you think we shouldn't try first fit mode any more when attempting to allocate with reduced size? I'm pretty uncertain myself on whether this is correct or not though.
@ifed01 no, it's not intentional. i need to ponder over this a little bit.
"With this modification we gave up first-fit mode for all follow-up attempts with reduced sizes when original request size hasn't been satisfied."
@ifed01 this is not true. for instance, assuming
- we need to offer a 67KB chunk,
- the unit is 1KB,
- the maximum contiguous chunk in the system is 64KB, so max_size is 64KB,
- the maximum contiguous chunk in the area pointed to by the cursor for 64K-aligned allocations is 32KB,
- the maximum contiguous chunk in the area pointed to by the cursor for 32K-aligned allocations is 32KB,
- the maximum contiguous chunk in the area pointed to by the cursor for 1K-aligned allocations is 5KB,
- the total free space is 2MB, so the free percentage is higher than range_size_alloc_free_pct,
- range_size_alloc_threshold is 1KB, as we always want to use first-fit if possible,
where the "area" is limited by the bluestore_avl_alloc_ff_max_search_count and bluestore_avl_alloc_ff_max_search_bytes settings.
so we have the following call sequence
- _allocate(67KB) // greater than max_size
  - best_fit returns 64KB
- _allocate(3KB) // less than max_size
  - first_fit after 1KB cursor returns 3KB
- fin
but i agree that there are quite a few cases where we could have allocated chunks in first_fit mode if we allowed it to allocate smaller chunks. if we use the algorithm proposed by this PR, we have
- _allocate(128KB) // greater than max_size
  - best_fit returns 64KB
- _allocate(64KB) // less than or equal to max_size
  - first_fit after 1KB cursor fails
  - best_fit returns 64KB
- fin
but if we allowed first-fit to try harder by allocating smaller chunks:
- _allocate(128KB) // greater than max_size
  - best_fit returns 64KB
- _allocate(64KB) // less than or equal to max_size
  - first_fit after 1KB cursor fails, but it managed to allocate 32KB in the 32KB-aligned area.
- _allocate(32KB)
  - first_fit after 32KB cursor returns 32KB, because it is allowed to move further in a new search. but this chunk is not located next to the previous 32KB chunk, otherwise the previous first_fit call would have returned 64KB.
so,
- this allocator does allow switching back to first-fit mode even if the previous call ended up with a reduced-size chunk
- allowing first-fit to try harder by allocating a smaller chunk when it cannot fulfill the required size does not help with clumping together adjacent allocation requests or the sub-requests belonging to the same original request.
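The first two call sequences above can be reproduced with a toy model. This is only a sketch under simplifying assumptions, not the AvlAllocator code: sizes are in KB, alignment is ignored (unit = 1KB), the first-fit search does not wrap around, only the search-count cap is modelled, and every name (`ExtentMap`, `first_fit`, `best_fit`, `allocate`) is hypothetical.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Toy free-space map: extent start -> extent length, both in KB (hypothetical).
using ExtentMap = std::map<uint64_t, uint64_t>;

struct Piece { uint64_t offset; uint64_t length; std::string mode; };

constexpr uint64_t kNone = UINT64_MAX;

// First-fit from `cursor`, visiting at most `max_count` extents (0 = unlimited).
uint64_t first_fit(const ExtentMap& fm, uint64_t cursor, uint64_t size,
                   uint64_t max_count) {
  uint64_t seen = 0;
  for (auto it = fm.lower_bound(cursor); it != fm.end(); ++it) {
    if (max_count && ++seen > max_count) break;  // search cap reached
    if (it->second >= size) return it->first;
  }
  return kNone;
}

// Best-fit: the smallest extent that can still hold `size`.
uint64_t best_fit(const ExtentMap& fm, uint64_t size) {
  uint64_t best = kNone, best_len = UINT64_MAX;
  for (const auto& kv : fm) {
    if (kv.second >= size && kv.second < best_len) {
      best = kv.first;
      best_len = kv.second;
    }
  }
  return best;
}

// Carve `size` KB off the front of the extent starting at `start`.
void take(ExtentMap& fm, uint64_t start, uint64_t size) {
  const uint64_t len = fm.at(start);
  fm.erase(start);
  if (len > size) fm[start + size] = len - size;
}

// Split `want` into sub-requests, each capped at the largest free extent.
std::vector<Piece> allocate(ExtentMap& fm, uint64_t want, uint64_t max_count) {
  std::vector<Piece> out;
  uint64_t cursor = 0;
  while (want > 0) {
    uint64_t max_size = 0;
    for (const auto& kv : fm) max_size = std::max(max_size, kv.second);
    if (max_size == 0) break;  // nothing left to hand out
    const bool force_best = want > max_size;  // request exceeds max_size
    const uint64_t size = std::min(want, max_size);
    uint64_t start = force_best ? kNone : first_fit(fm, cursor, size, max_count);
    std::string mode = "first_fit";
    if (start == kNone) {
      start = best_fit(fm, size);  // always succeeds: size <= max_size
      mode = "best_fit";
    } else {
      cursor = start + size;  // only first-fit advances the cursor
    }
    take(fm, start, size);
    out.push_back({start, size, mode});
    want -= size;
  }
  return out;
}
```

With free extents of 5KB and 32KB near the cursor and one 64KB extent further away, a 67KB request with a search cap of 2 comes back as a 64KB best_fit piece followed by a 3KB first_fit piece. Adding a second distant 64KB extent and asking for 128KB yields two best_fit pieces, because the capped first-fit search gives up before reaching it.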
@tchaikov - not sure I got all the points above but it looks like we're talking about different scenarios.
Mine looks like the following (based on the recent case I made a fix for in #41369):
- _allocate(4M) is invoked
- max_size = 4M and there is only one 4M chunk in the pool; it's unaligned, but plenty of e.g. 2MB chunks are available. Hence first_fit mode is chosen at the first step, but the _block_picker_after() call returns -1 due to the 4M chunk's misalignment.
  So far there is no logic difference between your patch and the preceding implementation (os/bluestore: fix unexpected ENOSPC in Avl/Hybrid allocators. #41369).
- Then the requested size is dropped to 2MB and _allocate retries with a 2MB chunk size.
  The prior implementation did that in first-fit mode too, which provided geo locality (via cursor usage) for 2MB allocations.
  With your patch this reattempt is going to be in best-fit mode unconditionally, which looks like a sort of regression.
Again not sure it's that important...
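To make the misalignment point concrete, here is a minimal sketch of an aligned-fit check. It assumes a power-of-two alignment, and the helper names (`align_up`, `fits_aligned`) and the 64KiB unit are illustrative rather than taken from the patch.

```cpp
#include <cstdint>

constexpr uint64_t KiB = 1024;
constexpr uint64_t MiB = 1024 * KiB;

// Round x up to the next multiple of align (align must be a power of two).
constexpr uint64_t align_up(uint64_t x, uint64_t align) {
  return (x + align - 1) & ~(align - 1);
}

// Can a free extent [start, start + len) hold `size` bytes beginning at an
// offset that is a multiple of `align`?
constexpr bool fits_aligned(uint64_t start, uint64_t len, uint64_t size,
                            uint64_t align) {
  const uint64_t offset = align_up(start, align);
  return offset + size <= start + len;
}
```

Under these assumptions a 4MiB extent starting at 4KiB cannot serve a 4MiB request aligned to 64KiB, because rounding the start up leaves less than 4MiB, while the same extent still serves a 2MiB request.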
- _allocate(4M) is invoked
- first_fit mode is chosen at first step but _block_picker_after() call returns -1
- "Then requested size is dropped to 2MB"
i don't think the requested size is dropped here. in my change, if first_fit mode fails, _allocate(4M) just switches over to best_fit mode without changing the requested size. and in your case, since we have non-aligned 4M chunks in the pool, best_fit just returns the first found 4M chunk to the caller.
Overall LGTM, just one open question on whether to switch back and forth to first-fit mode when attempting smaller allocations
a spin-off of this change is #41690. as it is relatively orthogonal to this change, i put it in a separate PR.
src/os/bluestore/AvlAllocator.cc
uint64_t *cursor,
uint64_t size,
uint64_t align)
uint64_t AvlAllocator::_pick_block_after(uint64_t *cursor,
+1 for splitting _block_picker into _pick_block_after and _pick_block_fits
I like the addition of the limited-search logic. The next step should possibly be making the max_search_count and max_search_bytes values depend on the kind of device (ssd/hdd).
The results shown:
- reduced fragmentation of allocated elements
- increased fragmentation of entire device
are consistent with my intuition that an early escape from "first-fit" to "best-fit" will preserve object continuity at the cost of global fragmentation.
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved |
changelog
LGTM.
nit: @tchaikov - for convenience you might want to add special treatment (no limits) for the new parameters when they are set to zero, so that one can easily revert to the original behavior.
…h_count so AvlAllocator can switch from the first-fit mode to best-fit mode without walking through the whole space map tree. in a highly fragmented system, iterating the whole tree could hurt the performance of a fast storage system a lot. the idea comes from openzfs's metaslab allocator. Signed-off-by: Kefu Chai <kchai@redhat.com>
…h_bytes so AvlAllocator can switch from the first-fit mode to best-fit mode without walking through the whole space map tree. in a highly fragmented system, iterating the whole tree could hurt the performance of a fast storage system a lot. the idea comes from openzfs's metaslab allocator. Signed-off-by: Kefu Chai <kchai@redhat.com>
@ifed01 implemented the logic for disabling these options if they are 0.
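As a rough illustration of a capped first-fit search in which a value of 0 disables a cap, consider the sketch below. It is not the code from this PR: the `std::map` stands in for the offset-sorted range tree, and `FreeMap`, `SearchLimits`, `round_up` and `pick_block_after` are hypothetical names chosen for the example.

```cpp
#include <cstdint>
#include <map>
#include <optional>

// Hypothetical stand-in for the offset-sorted range tree:
// key = extent start, value = extent length (both in bytes).
using FreeMap = std::map<uint64_t, uint64_t>;

struct SearchLimits {
  uint64_t max_count = 0;  // 0 means "no limit" (assumed semantics)
  uint64_t max_bytes = 0;  // 0 means "no limit" (assumed semantics)
};

// Round x up to a multiple of align; align must be a power of two.
inline uint64_t round_up(uint64_t x, uint64_t align) {
  return (x + align - 1) & ~(align - 1);
}

// First-fit scan starting at `cursor`. Returns std::nullopt once either cap
// is exceeded (or the map is exhausted), so that the caller can fall back to
// a best-fit lookup instead of walking the whole tree.
std::optional<uint64_t> pick_block_after(const FreeMap& free_map,
                                         uint64_t cursor, uint64_t size,
                                         uint64_t align,
                                         const SearchLimits& limits) {
  uint64_t visited = 0;
  for (auto it = free_map.lower_bound(cursor); it != free_map.end(); ++it) {
    // Cap on the number of extents examined.
    if (limits.max_count && ++visited > limits.max_count) return std::nullopt;
    // Cap on the distance travelled from the cursor.
    if (limits.max_bytes && it->first - cursor > limits.max_bytes)
      return std::nullopt;
    const uint64_t start = round_up(it->first, align);
    const uint64_t end = it->first + it->second;
    if (start + size <= end) return start;  // aligned request fits here
  }
  return std::nullopt;
}
```

Passing a default-constructed `SearchLimits` reproduces the uncapped behaviour, which is the "easy revert" the review asked for. Tight caps make the search give up early even when a suitable extent exists further along.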
-7> 2021-06-24T12:45:59.890+0000 7f006b229700 20 bluestore.OnodeSpace(0x559b0a4d9220 in 0x559b0a486000) _remove 0#3:f8150445:::smithi08940054-5745:head#
-6> 2021-06-24T12:45:59.890+0000 7f006b229700 20 _unpin_and_rm 0x559b0a486000 0#3:f8a47234:::smithi08940054-4423:head#
-5> 2021-06-24T12:45:59.890+0000 7f006b229700 20 bluestore.OnodeSpace(0x559b0a4d9220 in 0x559b0a486000) _remove 0#3:f8a47234:::smithi08940054-4423:head#
-4> 2021-06-24T12:45:59.890+0000 7f006b229700 20 _unpin_and_rm 0x559b0a486000 0#3:f82d7537:::smithi08940054-1090:head#
-3> 2021-06-24T12:45:59.890+0000 7f006b229700 20 bluestore.OnodeSpace(0x559b0a4d9220 in 0x559b0a486000) _remove 0#3:f82d7537:::smithi08940054-1090:head#
-2> 2021-06-24T12:45:59.890+0000 7f006b229700 20 bluestore(/var/lib/ceph/osd/ceph-0) _kv_finalize_thread sleep
-1> 2021-06-24T12:45:59.890+0000 7f005ca0c700 10 bluestore.OnodeSpace(0x559b0a4d8c80 in 0x559b0a486000) clear 0
0> 2021-06-24T12:45:59.891+0000 7f005ca0c700 -1 *** Caught signal (Segmentation fault) **
in thread 7f005ca0c700 thread_name:tp_osd_tp
not sure if it's related; need to rerun the test without this change.
this change makes AvlAllocator a more updated port of ZFS's metaslab dynamic fit allocator. https://github.com/openzfs/zfs/blob/master/module/zfs/metaslab.c#L1647