BlueStore: Remove Allocations from RocksDB #39871

Merged
merged 1 commit into ceph:master on Aug 11, 2021

Conversation

@benhanokh (Contributor):

BlueStore: Remove Allocations from RocksDB

Currently BlueStore keeps its allocation info inside RocksDB.
BlueStore commits all allocation information (alloc/release) into RocksDB (column-family B) before the client write completes, causing a delay in the write path and adding significant load to the CPU/memory/disk.
Committing all state into RocksDB allows Ceph to survive failures without losing the allocation state.

The new code skips the RocksDB updates at allocation time and instead performs a full destage of the allocator object, with the entire OSD allocation state, in a single step during umount().
This results in a ~25% increase in IOPS and reduced latency in small random-write workloads, but exposes the system to losing allocation info in failure cases where umount() is not called.
We added code to perform a full allocation-map rebuild from information stored inside the onodes, which is used in those failure cases.
When we perform a graceful shutdown there is no need for recovery and we simply read the allocation map from a flat file where it was stored during umount() (in fact this mode is faster and shaves a few seconds off boot time, since reading a flat file is faster than iterating over RocksDB).
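
The following is a minimal sketch of that destage/recover decision; the types and function names are illustrative placeholders, not the actual BlueStore symbols.

```cpp
// Illustrative sketch of the new flow: destage on clean umount(), rebuild on crash.
#include <cstdint>
#include <optional>
#include <vector>

struct Extent { uint64_t offset = 0; uint64_t length = 0; };

// Flat-file snapshot of the allocator, written once during a clean umount()
// instead of committing every alloc/release to RocksDB on the write path.
struct AllocatorImage { std::vector<Extent> free_extents; };

AllocatorImage destage_allocator(const std::vector<Extent>& free_extents) {
  return AllocatorImage{free_extents};   // the real code also adds header/trailer + CRC
}

// On mount: prefer the snapshot; fall back to a rebuild from onodes after a crash.
std::vector<Extent> load_allocation_map(const std::optional<AllocatorImage>& img) {
  if (img)
    return img->free_extents;            // fast path: graceful shutdown
  std::vector<Extent> rebuilt;
  // slow path: iterate every onode / BlueFS inode and reconstruct the free map
  return rebuilt;
}
```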

Open Issues:

  1. There is a bug in the src/stop.sh script: it kills ceph without invoking umount(), which means anyone using it will always hit the recovery path.
     Adam Kupczyk is fixing this issue in a separate PR.
     A simple workaround is to run 'killall -15 ceph-osd' before calling src/stop.sh.

  2. Fast-Shutdown and Ceph Suicide (done when the system underperforms) stop the system without a proper drain and without a call to umount().
     This will trigger a full recovery, which can be long (3 minutes in my testing, but your mileage may vary).
     We plan a follow-up PR that makes Fast-Shutdown and Ceph Suicide do the following (sketched after this list):

  • Block the OSD queues from accepting any new requests
  • Delete all queued items that we haven't started yet
  • Drain all in-flight tasks
  • Call umount() (and destage the allocation map)
  • If the drain doesn't complete within a predefined time limit (say 3 minutes), kill the OSD
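
A hedged sketch of that planned sequence; every helper below is a placeholder stub for illustration, not an actual OSD function.

```cpp
#include <chrono>
#include <cstdlib>

// Placeholder stubs standing in for the real OSD operations.
void block_new_requests()   { /* stop the OSD queues from accepting new requests */ }
void drop_unstarted_items() { /* discard queued work that has not started yet */ }
bool drain_inflight(std::chrono::seconds /*timeout*/) { return true; /* wait for in-flight ops */ }
void umount_and_destage()   { /* clean umount(): destage the allocation map */ }

void fast_shutdown() {
  block_new_requests();                          // 1. block the OSD queues
  drop_unstarted_items();                        // 2. delete items we didn't start yet
  if (drain_inflight(std::chrono::minutes(3)))   // 3. drain all in-flight tasks
    umount_and_destage();                        // 4. umount + destage the allocation map
  else
    std::abort();                                // 5. drain timed out -> kill the OSD
}
```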

@jdurgin (Member) commented Mar 5, 2021:

'make check' fails during umount in one case:

 in thread 7fee545152c0 thread_name:ceph-osd
 ceph version Development (no_version) quincy (dev)
 1: ceph-osd(+0x3208792) [0x56268d330792]
 2: /lib/x86_64-linux-gnu/libpthread.so.0(+0x128a0) [0x7fee5386d8a0]
 3: (BlueFS::_flush_and_sync_log(std::unique_lock<ceph::mutex_debug_detail::mutex_debug_impl<false> >&, unsigned long, unsigned long)+0x1ea2) [0x56268d2c99bc]
 4: (BlueFS::sync_metadata(bool)+0x270) [0x56268d2cfbd6]
 5: (BlueFS::umount(bool)+0x122) [0x56268d2b6670]
 6: (BlueStore::_close_bluefs(bool)+0x2a) [0x56268d11ce56]
 7: (BlueStore::_close_db(bool)+0x80) [0x56268d1205b6]
 8: (BlueStore::_open_db_and_around(bool, bool)+0x68d) [0x56268d11da87]
 9: (BlueStore::_mount()+0x4a4) [0x56268d12a73a]
 10: (BlueStore::mount()+0x18) [0x56268d1aff42]
 11: (OSD::init()+0x4d3) [0x56268c83cbdb]
 12: main()
 13: __libc_start_main()
 14: _start()

@jdurgin (Member) commented Mar 5, 2021:

There are some things to clean up still - assuming you're still working on that and looking at other areas.

extent_t *p_curr = buffer;
const extent_t *p_end = buffer + MAX_EXTENTS_IN_BUFFER;
allocator_image_header header(s_format_version, s_serial);
memcpy((byte*)p_curr, (byte*)&header, sizeof(header));
Member:
rather than copy the raw struct, allocator_image_header should implement a DENC() method that calls denc() on each relevant data member - see bluestore_onode_t::DENC() for an example, as well as include/denc.h.

This makes the format compatible with different endianness (helpful for analyzing a disk image from a different architecture) and allows us to easily version it if we need to add more fields later.
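
For illustration, a DENC-based version of the header might look roughly like the sketch below, following the pattern in include/denc.h; the member names (format_version, serial) are guesses based on the constructor call above, not the actual fields.

```cpp
#include "include/denc.h"   // Ceph denc framework

struct allocator_image_header {
  uint32_t format_version = 0;   // assumed member, mirroring s_format_version
  uint64_t serial = 0;           // assumed member, mirroring s_serial

  DENC(allocator_image_header, v, p) {
    DENC_START(1, 1, p);         // struct version/compat, leaves room for new fields
    denc(v.format_version, p);
    denc(v.serial, p);
    DENC_FINISH(p);
  }
};
WRITE_CLASS_DENC(allocator_image_header)
```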

@ifed01 (Contributor) left a comment:

Just ran briefly through part of the PR; will proceed later.

@@ -8729,6 +9824,8 @@ int BlueStore::_fsck_on_open(BlueStore::FSCKDepth depth, bool repair)
}

dout(1) << __func__ << " checking freelist vs allocated" << dendl;
// skip freelist vs allocated compare when we have Null fm
if (!fm->is_null_manager())
Contributor:

You might want to modify this verification by comparing the allocator's allocated extents against the list of actual allocations from onodes, BlueFS, etc., and repair if needed by updating the allocator. Then there would be no need to bypass the testing stuff in store_test.

Contributor (Author):
Do you mean building an allocation map from onodes and then comparing it with the allocation map read from file?
It is easy to add this test as I already have the code (invoked from the bluestore-tool), but it can be time consuming (minutes...).
If you think it is worth the time I can add it.

As for fixing - once we find the first error I can set a flag to copy the allocator generated from onodes to the shared allocator (this will take a few extra seconds).

Contributor:

yeah, that's what I am suggesting. Actually such a check wouldn't take much time - fsck reads all the objects anyway. If one uses a bitmap to track free/allocated extents it should be fast enough...
It may be done later via a follow-up PR though.
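
A rough sketch of such a check, assuming a plain bitmap with one bit per allocation unit; all types and names here are illustrative, not BlueStore's.

```cpp
#include <cstdint>
#include <vector>

struct Extent { uint64_t offset = 0; uint64_t length = 0; };

// One bit per allocation unit; marked while fsck iterates onodes/BlueFS anyway.
class AllocBitmap {
  std::vector<bool> bits;
  uint64_t au;   // allocation unit size, e.g. 4096
public:
  AllocBitmap(uint64_t device_size, uint64_t alloc_unit)
    : bits(device_size / alloc_unit, false), au(alloc_unit) {}
  void mark(const Extent& e) {
    if (e.length == 0) return;
    for (uint64_t i = e.offset / au; i <= (e.offset + e.length - 1) / au; ++i)
      bits[i] = true;
  }
  bool is_marked(uint64_t au_index) const { return bits[au_index]; }
  uint64_t unit() const { return au; }
};

// Final linear pass: anything the allocator considers allocated but no
// onode/BlueFS extent referenced is a mismatch worth repairing.
size_t count_mismatches(const AllocBitmap& referenced,
                        const std::vector<Extent>& allocator_allocated) {
  size_t mismatches = 0;
  const uint64_t au = referenced.unit();
  for (const auto& e : allocator_allocated) {
    if (e.length == 0) continue;
    for (uint64_t i = e.offset / au; i <= (e.offset + e.length - 1) / au; ++i)
      if (!referenced.is_marked(i))
        ++mismatches;
  }
  return mismatches;
}
```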

ASSERT_EQ(bstore->fsck(false), 0);


if (bstore->has_null_fm() == false) {
Contributor:

This might be preserved if fsck implements allocation verification for the new scheme.

Contributor (Author):

good idea

@@ -3108,6 +3109,23 @@ unsigned BlueStore::ExtentMap::decode_some(bufferlist& bl)
return num;
}

//-------------------------------------------------------------------------
void BlueStore::ExtentMap::add_shard_info_to_onode(bufferlist v, int shard_id)
Contributor:
IMO all you need in this func is to call decode_some, which will load the extents and blobs into the extent map. You're not going to use this onode later anyway...

void BlueStore::ExtentMap::add_shard_info_to_onode(bufferlist v)
{
decode_some(v);
}

Contributor:

Or even take the next step forward... decode_some builds a temporary array of the blobs this onode uses. Hence you can refactor it (or make a clone) which will load such an array for you. As a result you wouldn't need all the machinery for building the list of unique physical extents which is currently performed in read_allocation_from_single_onode.

Contributor:

Generally one should distinguish 3 types of blobs:

  1. regular blobs, which are attached to a specific onode only (i.e. aren't shared among multiple onodes) and are serialized with the first referencing extent. decode_some is good at loading them into a unique set (aka vector blobs).
  2. spanning blobs - similarly to the above they aren't shared, but they're serialized separately from the extent map. Not sure you're loading them at all...
  3. shared blobs - blobs which are shared among multiple onodes (due to onode cloning) and hence serialized separately. I need to double check where (when) physical extents are serialized (IIRC that happens outside of onode (de)serialization). But the key point is that one might find such blobs while reading multiple onodes - hence there is a need to handle that properly, i.e. avoid pextent duplication. I presume this issue is hidden by the sorted_extents_t structure you maintain, but in fact that isn't enough, since generally there is no guarantee that all the onodes referencing the same shared blob put their allocations into a single sorted_extents_t instance (or bunch/commit - I don't know how to name the portion which is committed when one reaches the memory cap).
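
One way to avoid the pextent duplication described in item 3 is to track which shared blobs have already been visited across the whole scan, not per batch; a minimal sketch with placeholder types, not BlueStore's actual structures.

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

struct Pextent { uint64_t offset = 0; uint64_t length = 0; };
struct SharedBlobRef { uint64_t sbid = 0; std::vector<Pextent> extents; };

// seen_sbids must live across all onodes/batches, otherwise two onodes that
// reference the same shared blob in different batches still duplicate extents.
void collect_shared_blob_extents(const std::vector<SharedBlobRef>& refs,
                                 std::unordered_set<uint64_t>& seen_sbids,
                                 std::vector<Pextent>& out) {
  for (const auto& ref : refs) {
    if (!seen_sbids.insert(ref.sbid).second)
      continue;                       // this shared blob was already accounted for
    out.insert(out.end(), ref.extents.begin(), ref.extents.end());
  }
}
```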

Contributor (Author):

There are 3 places handling duplicated p_extents:
First, in read_allocation_from_single_onode() we store the extents in a temporary map, skipping duplicates.
Second, the sorted_extent_t structure attempts to remove duplicates, but extremely large allocations (i.e. more than 300M physical extents after the first filter) will not fit in a single batch and we will still pass duplicates to the allocator.
Our last line of defense as I see it is the allocator itself, which I assume can handle duplicates.

Are you saying that the allocator can't handle duplicates?
I can add another pass after building the allocator to read all its entries - sort, merge and remove duplicates.
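
A minimal sketch of that sort/merge/dedup pass (illustrative types only, not the actual allocator API).

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Extent { uint64_t offset = 0; uint64_t length = 0; };

// Sort by offset, then merge duplicates and overlapping/adjacent ranges
// before feeding the result back to the allocator.
std::vector<Extent> sort_merge_dedup(std::vector<Extent> v) {
  std::sort(v.begin(), v.end(),
            [](const Extent& a, const Extent& b) { return a.offset < b.offset; });
  std::vector<Extent> out;
  for (const auto& e : v) {
    if (!out.empty() && e.offset <= out.back().offset + out.back().length) {
      uint64_t end = std::max(out.back().offset + out.back().length,
                              e.offset + e.length);
      out.back().length = end - out.back().offset;   // extend instead of duplicating
    } else {
      out.push_back(e);
    }
  }
  return out;
}
```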

Contributor (Author):

The spanning blobs comment is worrisome, as that means I will break the allocation state in the recovery flow.
Can you please schedule a meeting with me, you and Adam to get to the bottom of this?

@ifed01 (Contributor) commented Mar 11, 2021:

> 300M physical extents after the first filter) will not fit in a single batch and we will still pass duplicates to the allocator.
> Our last line of defense as I see it is the allocator itself, which I assume can handle duplicates.
>
> Are you saying that the allocator can't handle duplicates?

No, it's not mandatory for allocators to handle them (some implementations might assert, though) - that's additional performance overhead and useless when using the freelist manager.

@ifed01 (Contributor) commented Mar 11, 2021:

Looks like the failing 'make check' is relevant.

@neha-ojha changed the title from "BlueStore: Remove Allocations from RocksDB" to "[WIP] BlueStore: Remove Allocations from RocksDB" on Mar 25, 2021
@benhanokh force-pushed the no_column_b branch 5 times, most recently from 1860456 to 8c1f5c7, on April 19, 2021 10:48
@benhanokh (Contributor, Author):

With the latest code, recovery runs about 5X faster than fsck, which means customers doing fsck after a crash will only see about a 20% slowdown in startup time caused by the allocation-table rebuild.
We tested with the following setup:
onode_count = 3,750,660
shard_count = 21,101,838
Onode-extent_count = 501,489,874
Continuous extents_count = 3,619,347
Recovery completed in 220 seconds

@jdurgin (Member) commented Apr 20, 2021

@benhanokh changed the title from "[WIP] BlueStore: Remove Allocations from RocksDB" to "BlueStore: Remove Allocations from RocksDB" on Apr 21, 2021
@benhanokh (Contributor, Author):

Code is ready for review @aclamk @markhpc @ifed01 @jdurgin

@github-actions (bot):

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@neha-ojha (Member):

https://pulpito.ceph.com/nojha-2021-07-21_17:56:04-rados-wip-39871-distro-basic-smithi/ - the valgrind issues are related

@benhanokh (Contributor, Author):

> https://pulpito.ceph.com/nojha-2021-07-21_17:56:04-rados-wip-39871-distro-basic-smithi/ - the valgrind issues are related

I reviewed the failures and can't see how any of those issues is related to my PR

@neha-ojha (Member):

> https://pulpito.ceph.com/nojha-2021-07-21_17:56:04-rados-wip-39871-distro-basic-smithi/ - the valgrind issues are related
>
> I reviewed the failures and can't see how any of those issues is related to my PR

Did you see the following?

/a/nojha-2021-07-21_17:56:04-rados-wip-39871-distro-basic-smithi/6284822/remote/smithi168/log/valgrind

<error>
  <unique>0x5728d</unique>
  <tid>1</tid>
  <kind>Leak_DefinitelyLost</kind>
  <xwhat>
    <text>176 bytes in 1 blocks are definitely lost in loss record 18 of 33</text>
    <leakedbytes>176</leakedbytes>
    <leakedblocks>1</leakedblocks>
  </xwhat>
  <stack>
    <frame>
      <ip>0x4C31C93</ip>
      <obj>/usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so</obj>
      <fn>operator new[](unsigned long)</fn>
      <dir>/builddir/build/BUILD/valgrind-3.16.0/coregrind/m_replacemalloc</dir>
      <file>vg_replace_malloc.c</file>
      <line>431</line>
    </frame>
    <frame>
      <ip>0xDD02DD</ip>
      <obj>/usr/bin/ceph-osd</obj>
      <fn>BlueFS::_create_writer(boost::intrusive_ptr&lt;BlueFS::File&gt;)</fn>
      <dir>/usr/src/debug/ceph-17.0.0-6217.gb866a216.el8.x86_64/src/os/bluestore</dir>
      <file>BlueFS.cc</file>
      <line>3287</line>
    </frame>
    <frame>
      <ip>0xDE157F</ip>
      <obj>/usr/bin/ceph-osd</obj>
      <fn>BlueFS::open_for_write(std::basic_string_view&lt;char, std::char_traits&lt;char&gt; &gt;, std::basic_string_view&lt;char, std::char_traits&lt;char&gt; &gt;, BlueFS::FileWriter**, bool)</fn>
      <dir>/usr/src/debug/ceph-17.0.0-6217.gb866a216.el8.x86_64/src/os/bluestore</dir>
      <file>BlueFS.cc</file>
      <line>3264</line>
    </frame>
    <frame>
      <ip>0xCF57C2</ip>
      <obj>/usr/bin/ceph-osd</obj>
      <fn>BlueStore::store_allocator(Allocator*)</fn>
      <dir>/usr/src/debug/ceph-17.0.0-6217.gb866a216.el8.x86_64/src/os/bluestore</dir>
      <file>BlueStore.cc</file>
      <line>17096</line>
    </frame>
    <frame>
      <ip>0xD3D34E</ip>
      <obj>/usr/bin/ceph-osd</obj>
      <fn>BlueStore::umount()</fn>
      <dir>/usr/src/debug/ceph-17.0.0-6217.gb866a216.el8.x86_64/src/os/bluestore</dir>
      <file>BlueStore.cc</file>
      <line>7233</line>
    </frame>
...

if (ret != 0) {
  derr << __func__ << "Failed open_for_write with error-code " << ret << dendl;
  return -1;
}
unique_ptr<BlueFS::FileWriter> p_handle(p_temp_handle);

//auto deleter = [](BlueFS::FileWriter* fw) { bluefs->close_writer(fw); delete fw;};
Member:

can remove commented out lines
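
For reference, a hedged sketch of what wiring up that custom deleter could look like, so the FileWriter flagged by valgrind is both closed in BlueFS and freed when the handle goes out of scope. This assumes BlueFS::close_writer() does not free the object itself, as the commented-out lambda implies; the include path and helper names are illustrative.

```cpp
#include <memory>
#include "os/bluestore/BlueFS.h"   // in-tree header declaring BlueFS::FileWriter

struct FileWriterCloser {
  BlueFS* bluefs = nullptr;
  void operator()(BlueFS::FileWriter* fw) const {
    if (fw && bluefs) {
      bluefs->close_writer(fw);    // release BlueFS-side state for this writer
      delete fw;
    }
  }
};
using FileWriterRef = std::unique_ptr<BlueFS::FileWriter, FileWriterCloser>;

// usage: FileWriterRef p_handle(p_temp_handle, FileWriterCloser{bluefs});
```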

@jdurgin (Member) left a comment:

looks good to me, can follow up with further improvements in subsequent PRs

@tchaikov (Contributor)

@github-actions (bot) commented Aug 2, 2021:

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@aclamk self-requested a review on August 3, 2021 10:09
@aclamk (Contributor) left a comment:

Looks good!

@github-actions (bot):

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

Currently BlueStore keeps its allocation info inside RocksDB.
BlueStore commits all allocation information (alloc/release) into RocksDB (column-family B) before the client write completes, causing a delay in the write path and adding significant load to the CPU/memory/disk.
Committing all state into RocksDB allows Ceph to survive failures without losing the allocation state.

The new code skips the RocksDB updates at allocation time and instead performs a full destage of the allocator object, with the entire OSD allocation state, in a single step during umount().
This results in a ~25% increase in IOPS and reduced latency in small random-write workloads, but exposes the system to losing allocation info in failure cases where umount() is not called.
We added code to perform a full allocation-map rebuild from information stored inside the onodes, which is used in those failure cases.
When we perform a graceful shutdown there is no need for recovery and we simply read the allocation map from a flat file where it was stored during umount() (in fact this mode is faster and shaves a few seconds off boot time, since reading a flat file is faster than iterating over RocksDB).

Open Issues:

There is a bug in the src/stop.sh script: it kills ceph without invoking umount(), which means anyone using it will always hit the recovery path.
Adam Kupczyk is fixing this issue in a separate PR.
A simple workaround is to run 'killall -15 ceph-osd' before calling src/stop.sh.

Fast-Shutdown and Ceph Suicide (done when the system underperforms) stop the system without a proper drain and without a call to umount().
This will trigger a full recovery, which can be long (3 minutes in my testing, but your mileage may vary).
We plan a follow-up PR that makes Fast-Shutdown and Ceph Suicide do the following:

Block the OSD queues from accepting any new requests
Delete all queued items that we haven't started yet
Drain all in-flight tasks
Call umount() (and destage the allocation map)
If the drain doesn't complete within a predefined time limit (say 3 minutes), kill the OSD
Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>

create allocator from on-disk onodes and BlueFS inodes
change allocator + add stat counters + report illegal physical-extents
compare allocator after rebuild from ONodes
prevent collection from being open twice
removed FSCK repo check for null-fm
Bug-Fix: don't add BlueFS allocation to shared allocator
add configuration option to commit to No-Column-B
Only invalidate allocation file after opening rocksdb in read-write mode
fix tests not to expect failure in cases inapplicable to null-allocator
accept non-existing allocation file and don't fail the invalidation as it could happen legally
don't commit to null-fm when db is opened in repair-mode
add a reverse mechanism from null_fm to real_fm (using RocksDB)
Using Ceph encode/decode, adding more info to header/trailer, add crc protection
Code cleanup

some changes requested by Adam (cleanup and style changes)

Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>
@neha-ojha merged commit 94239d4 into ceph:master on Aug 11, 2021
@xiexingguo (Member):

👍

ifed01 added a commit to ifed01/ceph that referenced this pull request Jun 27, 2023
This effectively enables having 4K allocation units for BlueFS.
But it doesn't turn it on by default for the sake of performance.
Using main device which lacks enough free large continuous extents
might do the trick though.

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
(cherry picked from commit 001b08d)

 Conflicts:
	src/os/bluestore/BlueFS.cc
(trivial, no ceph#39871)
	src/os/bluestore/BlueStore.cc
(trivial, no commits for zoned support)
	src/test/objectstore/test_bluefs.cc
(trivial, no ceph#45883)