
os/bluestore/AvlAllocator: introduce bluestore_avl_alloc_ff_max_* options #41615

Merged
merged 2 commits into from Jun 25, 2021

Conversation

tchaikov
Contributor

@tchaikov tchaikov commented Jun 1, 2021

This change makes AvlAllocator a closer, more up-to-date port of ZFS's metaslab dynamic fit allocator: https://github.com/openzfs/zfs/blob/master/module/zfs/metaslab.c#L1647
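For readers unfamiliar with the ZFS scheme, here is a minimal, hypothetical sketch of a capped first-fit search in the spirit of this change. This is not Ceph's actual code; `Extent`, `pick_block_after`, and the cap parameters merely stand in for the new bluestore_avl_alloc_ff_max_search_count / bluestore_avl_alloc_ff_max_search_bytes options:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative free extent; the real allocator keeps these in an AVL tree.
struct Extent { uint64_t start, length; };

// Scan extents at or after `cursor` for the first one that fits `size`,
// but examine at most `max_search_count` extents spanning at most
// `max_search_bytes`.  Returns UINT64_MAX (i.e. -1ULL) when the capped
// search is exhausted, signalling the caller to fall back to best-fit.
uint64_t pick_block_after(const std::vector<Extent>& extents,
                          uint64_t cursor, uint64_t size,
                          uint64_t max_search_count,
                          uint64_t max_search_bytes)
{
  uint64_t searched_count = 0;
  uint64_t searched_bytes = 0;
  for (const auto& e : extents) {      // assumed sorted by offset
    if (e.start < cursor)
      continue;                        // resume where the last search left off
    if (e.length >= size)
      return e.start;                  // first fit found
    ++searched_count;
    searched_bytes += e.length;
    if (searched_count > max_search_count || searched_bytes > max_search_bytes)
      return UINT64_MAX;               // caps hit: give up, switch to best-fit
  }
  return UINT64_MAX;                   // reached the end without a fit
}
```

The point of the caps is that on a highly fragmented device the unbounded scan is the expensive part; bailing out early trades a little locality for bounded search time.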

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

@tchaikov
Contributor Author

tchaikov commented Jun 1, 2021

@aclamk Hi Adam, how do we (you) profile the allocator in general? I think I should test the performance characteristics of this changeset and paste the results to help the reviewers and myself understand this change better.

@tchaikov tchaikov requested a review from markhpc June 1, 2021 13:29
@ifed01
Contributor

ifed01 commented Jun 1, 2021

@aclamk Hi Adam, how do we (you) profile the allocator in general? I think I should test the performance characteristics of this changeset and paste the results to help the reviewers and myself understand this change better.

@tchaikov - very simple allocator benchmarking is available through unittest_alloc_bench

@tchaikov
Contributor Author

tchaikov commented Jun 1, 2021

@ifed01 thank you, Igor. I will test the change using the tool and paste the results.

@tchaikov tchaikov self-assigned this Jun 1, 2021
@aclamk
Contributor

aclamk commented Jun 1, 2021

@tchaikov Also unittest_alloc_aging can hint at longer-term allocator reuse. But of course, this is synthetic.

@ifed01
Contributor

ifed01 commented Jun 1, 2021

@tchaikov - Kefu, I can't see an implementation of switching to best-fit mode. Returning -1 from block_picker isn't enough for that, IMO. Nor is there logic to bypass these new caps in best-fit mode. I presume we should search through the whole extent tree in that mode. The allocator would apparently return ENOSPC much more often without these means...

@tchaikov
Contributor Author

tchaikov commented Jun 2, 2021

unittest_alloc_bench

| test | master | this PR |
| --- | --- | --- |
| AllocTest.test_alloc_bench_seq/3 | 675382 ms | 668600 ms |
| AllocTest.test_alloc_bench/3 | 16132 ms | 15628 ms |
| AllocTest.test_alloc_bench_90_300/3 | 38048 ms | 38053 ms |
| AllocTest.test_alloc_bench_50_300/3 | 37818 ms | 37725 ms |
| AllocTest.mempoolAccounting/3 | 112 ms | 113 ms |

unittest_alloc_aging

| metric | master | this PR |
| --- | --- | --- |
| fragmented allocs | 0.727767% | 0% |
| #frags | 2.00015 | 2 |
| free_score | 0.756594 | 0.672269 |
| time | 62716.6 ms | 49466.1 ms |

@tchaikov
Contributor Author

tchaikov commented Jun 2, 2021

@ifed01 indeed, I missed an important part of this algorithm. Updated and repushed.

@aclamk thank you, Adam. Now I recall that I was using this test back in #30897!

@tchaikov
Contributor Author

tchaikov commented Jun 2, 2021

@ifed01 regarding #30897 (review), I think we can switch to a B-tree instead of a binary tree like an AVL tree or red-black tree, as B-tree nodes have much less overhead from a memory-footprint perspective.
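A rough back-of-the-envelope sketch of the memory argument, under assumed (not measured) node layouts: a binary tree pays its link overhead per element, while a B-tree amortizes child pointers over a whole node's worth of elements.

```cpp
#include <cassert>
#include <cstdint>

// Typical red-black/AVL node: parent/left/right pointers plus a
// color/balance word, paid once per element.
constexpr uint64_t bintree_overhead_per_elem() {
  return 3 * sizeof(void*) + 8;
}

// A B-tree node with the given fanout amortizes roughly (fanout + 1)
// child pointers over ~fanout elements.
constexpr uint64_t btree_overhead_per_elem(uint64_t fanout) {
  return ((fanout + 1) * sizeof(void*)) / fanout;
}
```

With 8-byte pointers and a fanout of 16 this works out to roughly 32 bytes versus 8 bytes of per-element overhead, which is the footprint win being suggested.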

@ifed01
Contributor

ifed01 commented Jun 2, 2021

jenkins test make check

@tchaikov tchaikov removed their assignment Jun 2, 2021
@tchaikov
Contributor Author

tchaikov commented Jun 2, 2021

jenkins test make check

```cpp
  start = _pick_block_after(cursor, size, unit);
  dout(20) << __func__ << " first fit=" << start << " size=" << size << dendl;
}
if (start == -1ULL) {
```
Contributor

With this modification we gave up first-fit mode for all follow-up attempts with reduced sizes when original request size hasn't been satisfied. Is this intentional? Do you think we shouldn't try first fit mode any more when attempting to allocate with reduced size? I'm pretty uncertain myself on whether this is correct or not though.

Contributor Author

@ifed01 no, it's not intentional. i need to ponder over this a little bit.

Contributor Author

With this modification we gave up first-fit mode for all follow-up attempts with reduced sizes when original request size hasn't been satisfied.

@ifed01 this is not true. For instance, assume:

  • we need to allocate a 67 KB chunk, and
  • the unit is 1 KB,
  • the maximum contiguous chunk in the system is 64 KB, so max_size is 64 KB,
  • the maximum contiguous chunk in the area pointed to by the cursor for 64 KB aligned allocations is 32 KB,
  • the maximum contiguous chunk in the area pointed to by the cursor for 32 KB aligned allocations is 32 KB,
  • the maximum contiguous chunk in the area pointed to by the cursor for 1 KB aligned allocations is 5 KB,
  • the total free space is 2 MB, so the free percentage is higher than range_size_alloc_free_pct, and
  • range_size_alloc_threshold is 1 KB, as we always want to use first-fit if possible,

where the "area" is limited by the bluestore_avl_alloc_ff_max_search_count and bluestore_avl_alloc_ff_max_search_bytes settings.

so we have the following call sequence:

  1. _allocate(67KB) // greater than max_size
    1. best_fit returns 64 KB
  2. _allocate(3KB) // less than max_size
    1. first_fit after 1KB cursor returns 3KB
  3. fin
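The walk-through above amounts to carving the request into chunks capped by max_size and retrying with the remainder. A toy sketch of just that splitting (a hypothetical helper, not the allocator's actual code path):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Split a request into chunk sizes, each capped by the largest
// contiguous chunk available (max_size).  This models only the size
// reduction between _allocate() attempts, not extent selection.
std::vector<uint64_t> split_request(uint64_t want, uint64_t max_size) {
  std::vector<uint64_t> chunks;
  while (want > 0) {
    uint64_t got = std::min(want, max_size);  // e.g. best_fit returns 64 KB
    chunks.push_back(got);
    want -= got;                              // retry with the remainder
  }
  return chunks;
}
```

For the 67 KB example this yields a 64 KB chunk followed by a 3 KB chunk, matching steps 1-3 above.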

but I agree that there are quite a few cases where we could have allocated chunks in first_fit mode if we allowed it to allocate smaller chunks. With the algorithm proposed by this PR, we have

  1. _allocate(128KB) // greater than max_size
    1. best_fit returns 64 KB
  2. _allocate(64KB) // less-or-equal than max_size
    1. first_fit after 1KB cursor fails
    2. best_fit returns 64 KB
  3. fin

but if we allowed first-fit to try harder by allocating smaller chunks:

  1. _allocate(128KB) // greater than max_size
    1. best_fit returns 64 KB
  2. _allocate(64KB) // less-or-equal than max_size
    1. first_fit after 1KB cursor fails, but it managed to allocate 32KB in 32KB aligned area.
  3. _allocate(32KB)
    1. first_fit after 32KB cursor returns 32KB, because it is allowed to move further in a new search. But this chunk is not located next to the previous 32KB chunk; otherwise the previous first_fit call would have returned 64KB.

so,

  1. this allocator does allow switching back to first-fit mode even if the previous call ended up with a reduced-size chunk, and
  2. allowing first-fit to try harder by allocating smaller chunks when it cannot fulfill the requested size does not help with clumping adjacent allocation requests, or sub-requests belonging to the same original request.

Contributor

@tchaikov - not sure I got all the points above, but it looks like we're talking about different scenarios.
Mine looks like the following (based on the recent case I made a fix for in #41369):

  1. _allocate(4M) is invoked
  2. max_size = 4M and there is only one 4M chunk in the pool; it's unaligned, but plenty of e.g. 2MB chunks are available. Hence first_fit mode is chosen at the first step, but the _block_picker_after() call returns -1 due to the 4M chunk's misalignment.
    So far there is no logic difference between your patch and the preceding implementation (os/bluestore: fix unexpected ENOSPC in Avl/Hybrid allocators. #41369).
  3. Then the requested size is dropped to 2MB and _allocate retries with a 2MB chunk size.
    The prior implementation did that in first-fit mode too, which provided locality (via cursor usage) for 2MB allocations.
    With your patch this reattempt is going to be in best-fit mode unconditionally, which looks like a sort of regression.
    Again, not sure it's that important...

Contributor Author

@ifed01

  1. _allocate(4M) is invoked
  2. first_fit mode is chosen at first step but _block_picker_after() call returns -1
  3. "Then requested size is dropped to 2MB"

I don't think the requested size is dropped here. In my change, if first_fit mode fails, _allocate(4M) just switches over to best_fit mode without changing the requested size. And in your case, since we have a non-aligned 4M chunk in the pool, best_fit just returns the first found 4M chunk to the caller.
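A hedged sketch of the fallback being described: when the capped first-fit search gives up, best-fit is retried at the same size over the whole free map. The flat std::map free list and all names here are illustrative only; the real allocator keeps extents indexed both by offset and by size in AVL trees.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Toy free list: offset -> length.
using FreeMap = std::map<uint64_t, uint64_t>;

// Capped first-fit: scan at most `max_search` extents in offset order.
uint64_t first_fit(const FreeMap& fm, uint64_t size, unsigned max_search) {
  unsigned n = 0;
  for (const auto& [off, len] : fm) {
    if (len >= size)
      return off;
    if (++n >= max_search)
      break;                     // search caps exhausted
  }
  return UINT64_MAX;
}

// Best-fit: examine everything, take the smallest extent that fits.
uint64_t best_fit(const FreeMap& fm, uint64_t size) {
  uint64_t best_off = UINT64_MAX, best_len = UINT64_MAX;
  for (const auto& [off, len] : fm) {
    if (len >= size && len < best_len) {
      best_off = off;
      best_len = len;
    }
  }
  return best_off;
}

// The point of this comment thread: on first-fit failure, retry in
// best-fit mode at the SAME size instead of shrinking the request.
uint64_t allocate_chunk(const FreeMap& fm, uint64_t size, unsigned max_search) {
  uint64_t start = first_fit(fm, size, max_search);
  if (start == UINT64_MAX)
    start = best_fit(fm, size);
  return start;
}
```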

@ifed01
Contributor

ifed01 commented Jun 2, 2021

Overall LGTM; just one open question on whether to switch back and forth to first-fit mode when attempting smaller allocations.

@tchaikov tchaikov self-assigned this Jun 3, 2021
@tchaikov
Contributor Author

tchaikov commented Jun 4, 2021

A spin-off of this change is #41690. As it is relatively orthogonal to this change, I put it in a separate PR.

```cpp
uint64_t AvlAllocator::_pick_block_after(uint64_t *cursor,
                                         uint64_t size,
                                         uint64_t align)
```
Contributor

+1 for splitting _block_picker into _pick_block_after and _pick_block_fits

@aclamk aclamk left a comment
Contributor

I like the addition of the limited-search logic. A possible next step would be to make the max_search_count and max_search_bytes values dependent on the kind of device (SSD/HDD).

The results shown:

  • reduced fragmentation of allocated elements
  • increased fragmentation of the entire device

are consistent with my intuition that an early escape from "first-fit" to "best-fit" preserves object continuity at the cost of global fragmentation.
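Sketching the device-dependent-tuning suggestion above: the device classes and cap values below are invented purely for illustration, not proposed defaults.

```cpp
#include <cassert>
#include <cstdint>

enum class DeviceKind { hdd, ssd };

struct SearchCaps {
  uint64_t max_search_count;
  uint64_t max_search_bytes;
};

// Hypothetical policy: seeks are cheap on SSDs, so give up on first-fit
// sooner; on HDDs locality matters more, so search harder before
// switching to best-fit.  All numbers are made up for the example.
SearchCaps caps_for(DeviceKind kind) {
  if (kind == DeviceKind::hdd)
    return SearchCaps{400, 64ull << 20};
  return SearchCaps{100, 16ull << 20};
}
```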

@tchaikov
Contributor Author

@ifed01 regarding #30897 (review), I think we can switch to a B-tree instead of a binary tree like an AVL tree or red-black tree, as B-tree nodes have much less overhead from a memory-footprint perspective.

#41828

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@tchaikov
Contributor Author

changelog

  • rebased against master.

@ifed01 ifed01 left a comment
Contributor

LGTM.
nit: @tchaikov - for the sake of convenience you might want to add special treatment (no limits) for the new parameters when they are equal to zero, so that one can easily revert to the original behavior.
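The zero-means-no-limit treatment suggested here (and implemented later in the thread) could look roughly like this sketch; the helper name is made up:

```cpp
#include <cassert>
#include <cstdint>

// A cap of 0 disables the limit entirely, restoring the original
// unbounded first-fit search behavior.
inline bool cap_exceeded(uint64_t searched, uint64_t cap) {
  return cap != 0 && searched >= cap;
}
```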

…h_count

so AvlAllocator can switch from first-fit mode to best-fit mode
without walking through the whole space-map tree. in a
highly fragmented system, iterating over the whole tree could badly
hurt the performance of a fast storage system.

the idea comes from openzfs's metaslab allocator.

Signed-off-by: Kefu Chai <kchai@redhat.com>
…h_bytes

so AvlAllocator can switch from first-fit mode to best-fit mode
without walking through the whole space-map tree. in a
highly fragmented system, iterating over the whole tree could badly
hurt the performance of a fast storage system.

the idea comes from openzfs's metaslab allocator.

Signed-off-by: Kefu Chai <kchai@redhat.com>
@tchaikov
Contributor Author

@ifed01 implemented the logic for disabling these options when they are 0.

@tchaikov tchaikov left a comment
Contributor Author

    -7> 2021-06-24T12:45:59.890+0000 7f006b229700 20 bluestore.OnodeSpace(0x559b0a4d9220 in 0x559b0a486000) _remove 0#3:f8150445:::smithi08940054-5745:head#
    -6> 2021-06-24T12:45:59.890+0000 7f006b229700 20 _unpin_and_rm 0x559b0a486000 0#3:f8a47234:::smithi08940054-4423:head#
    -5> 2021-06-24T12:45:59.890+0000 7f006b229700 20 bluestore.OnodeSpace(0x559b0a4d9220 in 0x559b0a486000) _remove 0#3:f8a47234:::smithi08940054-4423:head#
    -4> 2021-06-24T12:45:59.890+0000 7f006b229700 20 _unpin_and_rm 0x559b0a486000 0#3:f82d7537:::smithi08940054-1090:head#
    -3> 2021-06-24T12:45:59.890+0000 7f006b229700 20 bluestore.OnodeSpace(0x559b0a4d9220 in 0x559b0a486000) _remove 0#3:f82d7537:::smithi08940054-1090:head#
    -2> 2021-06-24T12:45:59.890+0000 7f006b229700 20 bluestore(/var/lib/ceph/osd/ceph-0) _kv_finalize_thread sleep
    -1> 2021-06-24T12:45:59.890+0000 7f005ca0c700 10 bluestore.OnodeSpace(0x559b0a4d8c80 in 0x559b0a486000) clear 0
     0> 2021-06-24T12:45:59.891+0000 7f005ca0c700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f005ca0c700 thread_name:tp_osd_tp

https://pulpito.ceph.com/kchai-2021-06-24_12:12:02-rados-wip-kefu-testing-2021-06-23-2306-distro-basic-smithi/6188928/

Not sure if it's related; I need to rerun the test without this change.
