BlueStore: Remove Allocations from RocksDB #39871
Conversation
'make check' fails during umount in one case:
There are some things to clean up still - assuming you're still working on that and looking at other areas
src/os/bluestore/BlueStore.cc
Outdated
extent_t *p_curr = buffer;
const extent_t *p_end = buffer + MAX_EXTENTS_IN_BUFFER;
allocator_image_header header(s_format_version, s_serial);
memcpy((byte*)p_curr, (byte*)&header, sizeof(header));
Rather than copying the raw struct, allocator_image_header should implement a DENC() method that calls denc() on each relevant data member - see bluestore_onode_t::DENC() for an example, as well as include/denc.h.
This makes the format compatible across endianness (helpful for analyzing a disk image from a different architecture) and allows us to easily version it if we need to add more fields later.
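To illustrate the point about field-by-field encoding, here is a minimal, self-contained sketch (not Ceph's actual denc.h machinery - the struct name and helpers are hypothetical): each field is written in a fixed little-endian byte order, and the version field comes first so future fields can be appended and gated on it during decode. A raw memcpy of the struct would instead bake in host endianness and padding.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for allocator_image_header; illustrative only.
struct allocator_image_header_sketch {
  uint32_t format_version;
  uint64_t serial;
};

// Encode helpers: always little-endian, regardless of host byte order.
inline void put_le32(std::vector<uint8_t>& out, uint32_t v) {
  for (int i = 0; i < 4; ++i) out.push_back((v >> (8 * i)) & 0xff);
}
inline void put_le64(std::vector<uint8_t>& out, uint64_t v) {
  for (int i = 0; i < 8; ++i) out.push_back((v >> (8 * i)) & 0xff);
}

inline void encode_header(const allocator_image_header_sketch& h,
                          std::vector<uint8_t>& out) {
  put_le32(out, h.format_version);  // version first, so a decoder can
  put_le64(out, h.serial);          // branch on it before reading new fields
}

inline uint32_t get_le32(const uint8_t* p) {
  uint32_t v = 0;
  for (int i = 0; i < 4; ++i) v |= uint32_t(p[i]) << (8 * i);
  return v;
}
inline uint64_t get_le64(const uint8_t* p) {
  uint64_t v = 0;
  for (int i = 0; i < 8; ++i) v |= uint64_t(p[i]) << (8 * i);
  return v;
}

inline allocator_image_header_sketch
decode_header(const std::vector<uint8_t>& in) {
  allocator_image_header_sketch h;
  h.format_version = get_le32(in.data());
  h.serial = get_le64(in.data() + 4);
  return h;
}
```

Ceph's DENC() does this (and more: bounds checking, feature bits) for you; the sketch only shows why the on-disk bytes become architecture-independent.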
Just ran briefly through part of the PR; will proceed later.
src/os/bluestore/BlueStore.cc
Outdated
@@ -8729,6 +9824,8 @@ int BlueStore::_fsck_on_open(BlueStore::FSCKDepth depth, bool repair)
  }

  dout(1) << __func__ << " checking freelist vs allocated" << dendl;
  // skip freelist vs allocated compare when we have Null fm
  if (!fm->is_null_manager())
You might want to modify this verification by comparing the allocator's allocated extents vs. the list of actual allocations from onodes, bluefs, etc., and repair if needed by updating the allocator. Then there would be no need to bypass the testing stuff in store_test.
Do you mean build an allocation-map from Onodes and then compare it with the allocation-map read from file?
It is easy to add this test as I already have the code (invoked from bluestore-tool), but it can be time consuming (minutes...).
If you think it is worth the time I can add it.
As for fixing - once we find the first error I can set a flag to copy the allocator generated from Onodes to the shared-allocator (will take a few extra seconds).
Yeah, that's what I am suggesting. Actually such a check wouldn't take much time - fsck reads all the objects anyway. If one uses a bitmap to track free/allocated extents it should be fast enough...
Maybe this can be done later via a follow-up PR though.
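The bitmap idea mentioned above can be sketched roughly as follows - a self-contained illustration, not BlueStore's actual fsck code (the class and method names are hypothetical): one bit per allocation unit, set while walking onode/bluefs allocations, so duplicate references are detected for free and the final bitmap can be compared against the allocator.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical fsck-side bitmap: true = allocated. Assumes offsets and
// lengths are aligned to the allocation-unit size.
class AllocBitmap {
  std::vector<bool> bits_;
  uint64_t au_size_;  // allocation-unit size in bytes
public:
  AllocBitmap(uint64_t device_size, uint64_t au_size)
    : bits_(device_size / au_size, false), au_size_(au_size) {}

  // Mark [offset, offset+len) allocated; returns how many allocation
  // units were already set (i.e. duplicate references worth flagging).
  uint64_t mark_allocated(uint64_t offset, uint64_t len) {
    uint64_t dups = 0;
    for (uint64_t au = offset / au_size_;
         au < (offset + len) / au_size_; ++au) {
      if (bits_[au]) ++dups;
      bits_[au] = true;
    }
    return dups;
  }

  bool is_allocated(uint64_t offset) const {
    return bits_[offset / au_size_];
  }
};
```

Since fsck already reads every object, the extra cost is only the bit flips plus one linear pass to diff the bitmap against the allocator's free map.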
src/test/objectstore/store_test.cc
Outdated
ASSERT_EQ(bstore->fsck(false), 0);

if (bstore->has_null_fm() == false) {
This might be preserved if fsck implements allocation verification for the new scheme..
good idea
src/os/bluestore/BlueStore.cc
Outdated
@@ -3108,6 +3109,23 @@ unsigned BlueStore::ExtentMap::decode_some(bufferlist& bl)
  return num;
}

//-------------------------------------------------------------------------
void BlueStore::ExtentMap::add_shard_info_to_onode(bufferlist v, int shard_id)
IMO all you need in this func is to call decode_some, which will load extents and blobs into the extent map. You're not going to use this onode later anyway...
void BlueStore::ExtentMap::add_shard_info_to_onode(bufferlist v)
{
decode_some(v);
}
Or even make the next step forward... decode_some builds a temporary array of the blobs this onode uses. Hence you can refactor it (or make a clone) which will load such an array for you. As a result you wouldn't need all the machinery to build the list of unique physical extents which is currently performed in read_allocation_from_single_onode.
Generally one should distinguish 3 types of blobs:
- regular blobs, which are attached to a specific onode only (i.e. aren't shared among multiple onodes) and are serialized with the first referencing extent. decode_some is good at loading them into a unique set (aka vector blobs).
- spanning blobs - similarly to the above they aren't shared, but they're serialized separately from the extent map. Not sure you're loading them at all...
- shared blobs - blobs which are shared among multiple onodes (due to onode cloning) and hence serialized separately. I need to double check where (when) physical extents are serialized (IIRC that happens outside of onode (de)serialization). But the key point is that one might find such blobs while reading multiple onodes - hence there is a need to handle that properly, i.e. avoid pextent duplication. I presume this issue is hidden by that sorted_extents_t structure you maintain, but in fact it isn't enough, since generally there is no guarantee that all the onodes referencing the same shared blob put their allocations into a single sorted_extents_t instance (or bunch/commit - don't know how to name that portion which is committed when one reaches the memory cap).
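The shared-blob concern above can be illustrated with a small, self-contained sketch (names like sbid and the struct layout are hypothetical, not BlueStore's actual types): record each shared blob's identifier once in a set, so that when several onodes reference the same shared blob, its physical extents are contributed to the rebuilt allocation map exactly once, regardless of which batch each onode lands in.

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

// Illustrative physical extent; not bluestore_pextent_t.
struct PExtent { uint64_t offset, length; };

// Hypothetical scan state carried across all onodes/batches.
struct SharedBlobScan {
  std::unordered_set<uint64_t> seen_sbids;  // shared-blob ids seen so far
  std::vector<PExtent> collected;           // extents for the allocator

  // Returns true if the blob's extents were taken, false if this shared
  // blob was already contributed by an earlier onode.
  bool take_shared_blob(uint64_t sbid, const std::vector<PExtent>& pextents) {
    if (!seen_sbids.insert(sbid).second)
      return false;  // duplicate reference: skip, avoiding pextent dups
    collected.insert(collected.end(), pextents.begin(), pextents.end());
    return true;
  }
};
```

The key property is that the seen-set outlives any per-batch structure, so deduplication holds even when referencing onodes fall into different commit batches.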
There are 3 places handling duplicated p_extents:
First, at read_allocation_from_single_onode() we store the extents in a temporary map, skipping duplicates.
Second, the sorted_extent_t structure attempts to remove duplication, but extremely large allocations (i.e. more than 300M physical extents after the first filter) will not fit in a single batch and we will still pass duplicates to the allocator.
Our last line of defense as I see it is the allocator itself, which I assume can handle duplication.
Are you saying that the allocator can't handle duplication?
I can add another pass after building the allocator to read all its entries - sort, merge and remove duplicates.
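The sort/merge/dedup pass proposed above could look roughly like this - a self-contained sketch, not the PR's actual code: sort the collected extents by offset, then fold overlapping or adjacent ones into a single extent, which subsumes exact duplicates as a special case of overlap.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative extent type; not bluestore's.
struct Extent { uint64_t offset, length; };

// Sort by offset, then merge overlapping/adjacent extents so that
// duplicates never reach the allocator.
inline std::vector<Extent> sort_merge_dedup(std::vector<Extent> v) {
  std::sort(v.begin(), v.end(),
            [](const Extent& a, const Extent& b) {
              return a.offset < b.offset;
            });
  std::vector<Extent> out;
  for (const Extent& e : v) {
    if (!out.empty() &&
        e.offset <= out.back().offset + out.back().length) {
      // Overlap or adjacency: extend the previous extent. An exact
      // duplicate extends it by zero, i.e. is silently absorbed.
      uint64_t end = std::max(out.back().offset + out.back().length,
                              e.offset + e.length);
      out.back().length = end - out.back().offset;
    } else {
      out.push_back(e);
    }
  }
  return out;
}
```

One full sort over hundreds of millions of extents is the dominant cost here, which is why doing it once at the end beats deduplicating inside every batch.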
The spanning blobs comment is worrisome, as it means I will break the allocation state in the recovery flow.
Can you please schedule a meeting with me, you and Adam to get to the bottom of this?
> ...300M physical extents after the first filter) will not fit in a single batch and we will still pass duplication to the allocator. Our last line of defense as I see it is the allocator itself which I assume can handle duplication. Are you saying that the allocator can't handle duplication?

No, it's not mandatory for allocators to handle them (some implementations might assert, though) - that's additional performance overhead, and useless when using the free list manager.
Looks like the failing 'make check' is relevant
1860456 to 8c1f5c7 (compare)
With the latest code we are running about 5X faster than fsck, which means that customers doing fsck after a crash will only see about a 20% slowdown in startup time caused by the allocation-table rebuild.
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved
https://pulpito.ceph.com/nojha-2021-07-21_17:56:04-rados-wip-39871-distro-basic-smithi/ - the valgrind issues are related
I reviewed the failures and can't see how any of those issues is related to my PR
Did you see the following? /a/nojha-2021-07-21_17:56:04-rados-wip-39871-distro-basic-smithi/6284822/remote/smithi168/log/valgrind
src/os/bluestore/BlueStore.cc
Outdated
if (ret != 0) {
  derr << __func__ << "Failed open_for_write with error-code " << ret << dendl;
  return -1;
}
unique_ptr<BlueFS::FileWriter> p_handle(p_temp_handle);

//auto deleter = [](BlueFS::FileWriter* fw) { bluefs->close_writer(fw); delete fw;};
can remove commented out lines
looks good to me, can follow-up with further improvements in subsequent PRs
Valgrind issues, crashes, and some tests had been running for more than 8 hours before being terminated.
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved
Looks good!
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved
Currently BlueStore keeps its allocation info inside RocksDB. BlueStore commits all allocation information (alloc/release) into RocksDB (column-family B) before the client write is performed, causing a delay in the write path and adding significant load to the CPU/memory/disk. Committing all state into RocksDB allows Ceph to survive failures without losing the allocation state.

The new code skips the RocksDB updates at allocation time and instead performs a full destage of the allocator object with all the OSD allocation state in a single step during umount(). This results in a 25% increase in IOPS and reduced latency in small random-write workloads, but exposes the system to losing allocation info in failure cases where we don't call umount. We added code to perform a full allocation-map rebuild from information stored inside the ONodes, which is used in failure cases. When we perform a graceful shutdown there is no need for recovery and we simply read the allocation-map from a flat file where it was stored during umount() (in fact this mode is faster and shaves a few seconds from boot time, since reading a flat file is faster than iterating over RocksDB).

Open Issues:
There is a bug in the src/stop.sh script killing ceph without invoking umount(), which means anyone using it will always invoke the recovery path. Adam Kupczyk is fixing this issue in a separate PR. A simple workaround is to add a call to 'killall -15 ceph-osd' before calling src/stop.sh.
Fast-Shutdown and Ceph Suicide (done when the system underperforms) stop the system without a proper drain and a call to umount. This will trigger a full recovery, which can be long (3 minutes in my testing, but your mileage may vary).

We plan on adding a follow-up PR doing the following in Fast-Shutdown and Ceph Suicide:
- Block the OSD queues from accepting any new requests
- Delete all items in the queue which we didn't start yet
- Drain all in-flight tasks
- Call umount (and destage the allocation-map)
- If the drain didn't complete within a predefined time limit (say 3 minutes) -> kill the OSD

Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>

Commits:
- create allocator from on-disk onodes and BlueFS inodes
- change allocator + add stat counters + report illegal physical-extents
- compare allocator after rebuild from ONodes
- prevent collection from being open twice
- removed FSCK repo check for null-fm
- Bug-Fix: don't add BlueFS allocation to shared allocator
- add configuration option to commit to No-Column-B
- Only invalidate allocation file after opening rocksdb in read-write mode
- fix tests not to expect failure in cases unapplicable to null-allocator
- accept non-existing allocation file and don't fail the invaladtion as it could happen legally
- don't commit to null-fm when db is opened in repair-mode
- add a reverse mechanism from null_fm to real_fm (using RocksDB)
- Using Ceph encode/decode, adding more info to header/trailer, add crc protection
- Code cleanup
- some changes requested by Adam (cleanup and style changes)

Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>
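The destage/restore cycle described in the PR description can be sketched with a minimal, self-contained example - the file layout, names, and the toy checksum below are illustrative assumptions, not BlueStore's actual allocator-image format (which uses Ceph encode/decode with header/trailer and CRC protection): at umount the free-extent list is written to a flat file with a count and checksum; at mount it is read back and verified, and a checksum mismatch would mean falling back to the ONode-based rebuild.

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Illustrative free extent; not bluestore's type.
struct FreeExtent { uint64_t offset, length; };

// Toy checksum standing in for the real CRC protection.
inline uint64_t sum64(const std::vector<FreeExtent>& v) {
  uint64_t s = 0;
  for (const FreeExtent& e : v) s += e.offset * 1000003 + e.length;
  return s;
}

// umount path: destage the allocation map to a flat file in one step.
inline bool destage(const std::string& path,
                    const std::vector<FreeExtent>& v) {
  std::ofstream f(path, std::ios::binary | std::ios::trunc);
  uint64_t n = v.size(), crc = sum64(v);
  f.write(reinterpret_cast<const char*>(&n), sizeof(n));
  f.write(reinterpret_cast<const char*>(v.data()), n * sizeof(FreeExtent));
  f.write(reinterpret_cast<const char*>(&crc), sizeof(crc));
  return f.good();
}

// mount path: read the map back; false means the caller must rebuild
// the allocation map from the ONodes instead (the recovery path).
inline bool restore(const std::string& path, std::vector<FreeExtent>& v) {
  std::ifstream f(path, std::ios::binary);
  uint64_t n = 0, crc = 0;
  if (!f.read(reinterpret_cast<char*>(&n), sizeof(n))) return false;
  v.resize(n);
  if (!f.read(reinterpret_cast<char*>(v.data()),
              n * sizeof(FreeExtent))) return false;
  if (!f.read(reinterpret_cast<char*>(&crc), sizeof(crc))) return false;
  return crc == sum64(v);
}
```

This is why the graceful-shutdown path is cheap (one sequential file read) while any shutdown that skips umount() forces the much slower rebuild.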
http://pulpito.front.sepia.ceph.com/benhanokh-2021-08-04_06:12:22-rados-wip_gbenhano_ncbz-distro-basic-smithi/ No related failures, issue exposed by this PR is being tracked in https://tracker.ceph.com/issues/52138
👍
This effectively enables having 4K allocation units for BlueFS. But it doesn't turn it on by default for the sake of performance. Using a main device which lacks enough free large continuous extents might do the trick though.

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
(cherry picked from commit 001b08d)
Conflicts:
- src/os/bluestore/BlueFS.cc (trivial, no ceph#39871)
- src/os/bluestore/BlueStore.cc (trivial, no commits for zoned support)
- src/test/objectstore/test_bluefs.cc (trivial, no ceph#45883)