
os/bluestore/AvlAllocator: introduce bluestore_avl_alloc_ff_max_* options #41615

Merged
merged 2 commits into from Jun 25, 2021

Conversation

tchaikov
Contributor

@tchaikov tchaikov commented Jun 1, 2021

This change makes AvlAllocator a closer, more up-to-date port of ZFS's metaslab dynamic fit allocator: https://github.com/openzfs/zfs/blob/master/module/zfs/metaslab.c#L1647
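For readers unfamiliar with the ZFS scheme, here is a minimal, hypothetical sketch of a capped first-fit search in the spirit of this change. This is not Ceph's actual code; `Extent`, `pick_block_after`, and the cap parameters merely stand in for the new bluestore_avl_alloc_ff_max_search_count / bluestore_avl_alloc_ff_max_search_bytes options:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative free extent; the real allocator keeps these in an AVL tree.
struct Extent { uint64_t start, length; };

// Scan extents at or after `cursor` for the first one that fits `size`,
// but examine at most `max_search_count` extents spanning at most
// `max_search_bytes`.  Returns UINT64_MAX (i.e. -1ULL) when the capped
// search is exhausted, signalling the caller to fall back to best-fit.
uint64_t pick_block_after(const std::vector<Extent>& extents,
                          uint64_t cursor, uint64_t size,
                          uint64_t max_search_count,
                          uint64_t max_search_bytes)
{
  uint64_t searched_count = 0;
  uint64_t searched_bytes = 0;
  for (const auto& e : extents) {      // assumed sorted by offset
    if (e.start < cursor)
      continue;                        // resume where the last search left off
    if (e.length >= size)
      return e.start;                  // first fit found
    ++searched_count;
    searched_bytes += e.length;
    if (searched_count > max_search_count || searched_bytes > max_search_bytes)
      return UINT64_MAX;               // caps hit: give up, switch to best-fit
  }
  return UINT64_MAX;                   // reached the end without a fit
}
```

The point of the caps is that on a highly fragmented device the unbounded scan is the expensive part; bailing out early trades a little locality for bounded search time.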

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

@tchaikov
Contributor Author

tchaikov commented Jun 1, 2021

@aclamk Hi Adam, how do we (you) profile the allocator in general? I think I should test the performance characteristics of this changeset and paste the results to help the reviewers and myself understand this change better.

@tchaikov tchaikov requested a review from markhpc June 1, 2021 13:29
@ifed01
Contributor

ifed01 commented Jun 1, 2021

@aclamk Hi Adam, how do we (you) profile the allocator in general? I think I should test the performance characteristics of this changeset and paste the results to help the reviewers and myself understand this change better.

@tchaikov - very simple allocator benchmarking is available through unittest_alloc_bench

@tchaikov
Contributor Author

tchaikov commented Jun 1, 2021

@ifed01 thank you, Igor. I will test the change using the tool and paste the results.

@tchaikov tchaikov self-assigned this Jun 1, 2021
@aclamk
Contributor

aclamk commented Jun 1, 2021

@tchaikov Also unittest_alloc_aging can hint at longer-term allocator reuse. But of course, this is synthetic.

@ifed01
Contributor

ifed01 commented Jun 1, 2021

@tchaikov - Kefu, I can't see an implementation of switching to best-fit mode. Returning -1 from block_picker isn't enough for that, IMO. Nor is there logic to bypass these new caps in best-fit mode. I presume we should search through the whole extent tree in that mode. The allocator would apparently return ENOSPC much more often without these means...

@tchaikov
Contributor Author

tchaikov commented Jun 2, 2021

unittest_alloc_bench

| test | master | this PR |
| --- | --- | --- |
| AllocTest.test_alloc_bench_seq/3 | 675382 ms | 668600 ms |
| AllocTest.test_alloc_bench/3 | 16132 ms | 15628 ms |
| AllocTest.test_alloc_bench_90_300/3 | 38048 ms | 38053 ms |
| AllocTest.test_alloc_bench_50_300/3 | 37818 ms | 37725 ms |
| AllocTest.mempoolAccounting/3 | 112 ms | 113 ms |

unittest_alloc_aging

| metric | master | this PR |
| --- | --- | --- |
| fragmented allocs | 0.727767% | 0% |
| #frags | 2.00015 | 2 |
| free_score | 0.756594 | 0.672269 |
| time | 62716.6 ms | 49466.1 ms |

@tchaikov
Contributor Author

tchaikov commented Jun 2, 2021

@ifed01 indeed, I missed an important part of this algorithm. Updated and repushed.

@aclamk thank you, Adam. Now I recall that I was using this test back in #30897!

@tchaikov
Contributor Author

tchaikov commented Jun 2, 2021

@ifed01 regarding #30897 (review), I think we can switch to a B-tree instead of a binary tree like an AVL tree or red-black tree, as B-tree nodes have much less overhead from a memory-footprint perspective.
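A rough back-of-the-envelope sketch of the memory argument, under assumed (not measured) node layouts: a binary tree pays its link overhead per element, while a B-tree amortizes child pointers over a whole node's worth of elements.

```cpp
#include <cassert>
#include <cstdint>

// Typical red-black/AVL node: parent/left/right pointers plus a
// color/balance word, paid once per element.
constexpr uint64_t bintree_overhead_per_elem() {
  return 3 * sizeof(void*) + 8;
}

// A B-tree node with the given fanout amortizes roughly (fanout + 1)
// child pointers over ~fanout elements.
constexpr uint64_t btree_overhead_per_elem(uint64_t fanout) {
  return ((fanout + 1) * sizeof(void*)) / fanout;
}
```

With 8-byte pointers and a fanout of 16 this works out to roughly 32 bytes versus 8 bytes of per-element overhead, which is the footprint win being suggested.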

@ifed01
Contributor

ifed01 commented Jun 2, 2021

jenkins test make check

@tchaikov tchaikov removed their assignment Jun 2, 2021
@tchaikov
Contributor Author

tchaikov commented Jun 2, 2021

jenkins test make check

```cpp
  start = _pick_block_after(cursor, size, unit);
  dout(20) << __func__ << " first fit=" << start << " size=" << size << dendl;
}
if (start == -1ULL) {
```
Contributor

With this modification we gave up first-fit mode for all follow-up attempts with reduced sizes when original request size hasn't been satisfied. Is this intentional? Do you think we shouldn't try first fit mode any more when attempting to allocate with reduced size? I'm pretty uncertain myself on whether this is correct or not though.

Contributor Author

@ifed01 no, it's not intentional. i need to ponder over this a little bit.

Contributor Author

With this modification we gave up first-fit mode for all follow-up attempts with reduced sizes when original request size hasn't been satisfied.

@ifed01 this is not true. For instance, assume:

  • we need to allocate a 67 KB chunk, and
  • the unit is 1 KB,
  • the maximum contiguous chunk in the system is 64 KB, so max_size is 64 KB,
  • the maximum contiguous chunk in the area pointed to by the cursor for 64 KB aligned allocations is 32 KB,
  • the maximum contiguous chunk in the area pointed to by the cursor for 32 KB aligned allocations is 32 KB,
  • the maximum contiguous chunk in the area pointed to by the cursor for 1 KB aligned allocations is 5 KB,
  • the total free space is 2 MB, so the free percentage is higher than range_size_alloc_free_pct, and
  • range_size_alloc_threshold is 1 KB, as we always want to use first-fit if possible,

where the "area" is limited by the bluestore_avl_alloc_ff_max_search_count and bluestore_avl_alloc_ff_max_search_bytes settings.

so we have the following call sequence:

  1. _allocate(67KB) // greater than max_size
    1. best_fit returns 64 KB
  2. _allocate(3KB) // less than max_size
    1. first_fit after 1KB cursor returns 3KB
  3. fin
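The walk-through above amounts to carving the request into chunks capped by max_size and retrying with the remainder. A toy sketch of just that splitting (a hypothetical helper, not the allocator's actual code path):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Split a request into chunk sizes, each capped by the largest
// contiguous chunk available (max_size).  This models only the size
// reduction between _allocate() attempts, not extent selection.
std::vector<uint64_t> split_request(uint64_t want, uint64_t max_size) {
  std::vector<uint64_t> chunks;
  while (want > 0) {
    uint64_t got = std::min(want, max_size);  // e.g. best_fit returns 64 KB
    chunks.push_back(got);
    want -= got;                              // retry with the remainder
  }
  return chunks;
}
```

For the 67 KB example this yields a 64 KB chunk followed by a 3 KB chunk, matching steps 1-3 above.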

but I agree that there are quite a few cases where we could have allocated chunks in first_fit mode if we allowed it to allocate smaller chunks. With the algorithm proposed by this PR, we have

  1. _allocate(128KB) // greater than max_size
    1. best_fit returns 64 KB
  2. _allocate(64KB) // less-or-equal than max_size
    1. first_fit after 1KB cursor fails
    2. best_fit returns 64 KB
  3. fin

but if we allowed first-fit to try harder by allocating smaller chunks:

  1. _allocate(128KB) // greater than max_size
    1. best_fit returns 64 KB
  2. _allocate(64KB) // less-or-equal than max_size
    1. first_fit after 1KB cursor fails, but it managed to allocate 32KB in 32KB aligned area.
  3. _allocate(32KB)
    1. first_fit after 32KB cursor returns 32KB, because it is allowed to move further in a new search. But this chunk is not located next to the previous 32KB chunk; otherwise the previous first_fit call would have returned 64KB.

so,

  1. this allocator does allow switching back to first-fit mode even if the previous call ended up with a reduced-size chunk, and
  2. allowing first-fit to try harder by allocating smaller chunks when it cannot fulfill the requested size does not help with clumping adjacent allocation requests, or sub-requests belonging to the same original request.

Contributor

@tchaikov - not sure I got all the points above, but it looks like we're talking about different scenarios.
Mine looks like the following (based on the recent case I made a fix for in #41369):

  1. _allocate(4M) is invoked
  2. max_size = 4M and there is only one 4M chunk in the pool; it's unaligned, but plenty of e.g. 2MB chunks are available. Hence first_fit mode is chosen at the first step, but the _block_picker_after() call returns -1 due to the 4M chunk's misalignment.
    So far there is no logic difference between your patch and the preceding implementation (os/bluestore: fix unexpected ENOSPC in Avl/Hybrid allocators. #41369).
  3. Then the requested size is dropped to 2MB and _allocate retries with a 2MB chunk size.
    The prior implementation did that in first-fit mode too, which provided locality (via cursor usage) for 2MB allocations.
    With your patch this reattempt is going to be in best-fit mode unconditionally, which looks like a sort of regression.
    Again, not sure it's that important...

Contributor Author

@ifed01

  1. _allocate(4M) is invoked
  2. first_fit mode is chosen at first step but _block_picker_after() call returns -1
  3. "Then requested size is dropped to 2MB"

I don't think the requested size is dropped here. In my change, if first_fit mode fails, _allocate(4M) just switches over to best_fit mode without changing the requested size. And in your case, since we have a non-aligned 4M chunk in the pool, best_fit just returns the first found 4M chunk to the caller.
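A hedged sketch of the fallback being described: when the capped first-fit search gives up, best-fit is retried at the same size over the whole free map. The flat std::map free list and all names here are illustrative only; the real allocator keeps extents indexed both by offset and by size in AVL trees.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Toy free list: offset -> length.
using FreeMap = std::map<uint64_t, uint64_t>;

// Capped first-fit: scan at most `max_search` extents in offset order.
uint64_t first_fit(const FreeMap& fm, uint64_t size, unsigned max_search) {
  unsigned n = 0;
  for (const auto& [off, len] : fm) {
    if (len >= size)
      return off;
    if (++n >= max_search)
      break;                     // search caps exhausted
  }
  return UINT64_MAX;
}

// Best-fit: examine everything, take the smallest extent that fits.
uint64_t best_fit(const FreeMap& fm, uint64_t size) {
  uint64_t best_off = UINT64_MAX, best_len = UINT64_MAX;
  for (const auto& [off, len] : fm) {
    if (len >= size && len < best_len) {
      best_off = off;
      best_len = len;
    }
  }
  return best_off;
}

// The point of this comment thread: on first-fit failure, retry in
// best-fit mode at the SAME size instead of shrinking the request.
uint64_t allocate_chunk(const FreeMap& fm, uint64_t size, unsigned max_search) {
  uint64_t start = first_fit(fm, size, max_search);
  if (start == UINT64_MAX)
    start = best_fit(fm, size);
  return start;
}
```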

@ifed01
Contributor

ifed01 commented Jun 2, 2021

Overall LGTM; just one open question on whether to switch back and forth to first-fit mode when attempting smaller allocations.

@tchaikov tchaikov self-assigned this Jun 3, 2021
@tchaikov
Contributor Author

tchaikov commented Jun 4, 2021

A spin-off of this change is #41690. As it is relatively orthogonal to this change, I put it in a separate PR.

```cpp
uint64_t AvlAllocator::_pick_block_after(uint64_t *cursor,
                                         uint64_t size,
                                         uint64_t align)
```
Contributor

+1 for splitting _block_picker into _pick_block_after and _pick_block_fits

@aclamk aclamk left a comment
Contributor

I like the addition of the limited-search logic. A possible next step would be to make the max_search_count and max_search_bytes values dependent on the kind of device (SSD/HDD).

The results shown:

  • reduced fragmentation of allocated elements
  • increased fragmentation of the entire device

are consistent with my intuition that an early escape from "first-fit" to "best-fit" preserves object continuity at the cost of global fragmentation.
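Sketching the device-dependent-tuning suggestion above: the device classes and cap values below are invented purely for illustration, not proposed defaults.

```cpp
#include <cassert>
#include <cstdint>

enum class DeviceKind { hdd, ssd };

struct SearchCaps {
  uint64_t max_search_count;
  uint64_t max_search_bytes;
};

// Hypothetical policy: seeks are cheap on SSDs, so give up on first-fit
// sooner; on HDDs locality matters more, so search harder before
// switching to best-fit.  All numbers are made up for the example.
SearchCaps caps_for(DeviceKind kind) {
  if (kind == DeviceKind::hdd)
    return SearchCaps{400, 64ull << 20};
  return SearchCaps{100, 16ull << 20};
}
```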

@tchaikov
Contributor Author

@ifed01 regarding #30897 (review), I think we can switch to a B-tree instead of a binary tree like an AVL tree or red-black tree, as B-tree nodes have much less overhead from a memory-footprint perspective.

#41828

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@tchaikov
Contributor Author

changelog

  • rebased against master.

@ifed01 ifed01 left a comment
Contributor

LGTM.
nit: @tchaikov - for the sake of convenience you might want to add special treatment (no limits) for the new parameters when they are equal to zero, so that one can easily revert to the original behavior.
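The zero-means-no-limit treatment suggested here (and implemented later in the thread) could look roughly like this sketch; the helper name is made up:

```cpp
#include <cassert>
#include <cstdint>

// A cap of 0 disables the limit entirely, restoring the original
// unbounded first-fit search behavior.
inline bool cap_exceeded(uint64_t searched, uint64_t cap) {
  return cap != 0 && searched >= cap;
}
```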

…h_count

so AvlAllocator can switch from first-fit mode to best-fit mode
without walking through the whole space-map tree. in a
highly fragmented system, iterating over the whole tree could badly
hurt the performance of a fast storage system.

the idea comes from openzfs's metaslab allocator.

Signed-off-by: Kefu Chai <kchai@redhat.com>
…h_bytes

so AvlAllocator can switch from first-fit mode to best-fit mode
without walking through the whole space-map tree. in a
highly fragmented system, iterating over the whole tree could badly
hurt the performance of a fast storage system.

the idea comes from openzfs's metaslab allocator.

Signed-off-by: Kefu Chai <kchai@redhat.com>
@tchaikov
Contributor Author

@ifed01 implemented the logic for disabling these options when they are 0.

@tchaikov tchaikov left a comment
Contributor Author

    -7> 2021-06-24T12:45:59.890+0000 7f006b229700 20 bluestore.OnodeSpace(0x559b0a4d9220 in 0x559b0a486000) _remove 0#3:f8150445:::smithi08940054-5745:head#
    -6> 2021-06-24T12:45:59.890+0000 7f006b229700 20 _unpin_and_rm 0x559b0a486000 0#3:f8a47234:::smithi08940054-4423:head#
    -5> 2021-06-24T12:45:59.890+0000 7f006b229700 20 bluestore.OnodeSpace(0x559b0a4d9220 in 0x559b0a486000) _remove 0#3:f8a47234:::smithi08940054-4423:head#
    -4> 2021-06-24T12:45:59.890+0000 7f006b229700 20 _unpin_and_rm 0x559b0a486000 0#3:f82d7537:::smithi08940054-1090:head#
    -3> 2021-06-24T12:45:59.890+0000 7f006b229700 20 bluestore.OnodeSpace(0x559b0a4d9220 in 0x559b0a486000) _remove 0#3:f82d7537:::smithi08940054-1090:head#
    -2> 2021-06-24T12:45:59.890+0000 7f006b229700 20 bluestore(/var/lib/ceph/osd/ceph-0) _kv_finalize_thread sleep
    -1> 2021-06-24T12:45:59.890+0000 7f005ca0c700 10 bluestore.OnodeSpace(0x559b0a4d8c80 in 0x559b0a486000) clear 0
     0> 2021-06-24T12:45:59.891+0000 7f005ca0c700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f005ca0c700 thread_name:tp_osd_tp

https://pulpito.ceph.com/kchai-2021-06-24_12:12:02-rados-wip-kefu-testing-2021-06-23-2306-distro-basic-smithi/6188928/

Not sure if it's related; I need to rerun the test without this change.
