BlueStore: Remove Allocations from RocksDB #39871
Conversation
'make check' fails during umount in one case:
There are some things to clean up still - assuming you're still working on that and looking at other areas
src/os/bluestore/BlueStore.cc
Outdated
extent_t *p_curr = buffer;
const extent_t *p_end = buffer + MAX_EXTENTS_IN_BUFFER;
allocator_image_header header(s_format_version, s_serial);
memcpy((byte*)p_curr, (byte*)&header, sizeof(header));
Rather than copying the raw struct, allocator_image_header should implement a DENC() method that calls denc() on each relevant data member - see bluestore_onode_t::DENC() for an example, as well as include/denc.h.
This makes the format compatible across endianness (helpful for analyzing a disk image from a different architecture) and allows us to easily version it if we need to add more fields later.
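To illustrate the point about field-by-field encoding, here is a minimal, self-contained sketch (not Ceph's actual denc.h machinery - the struct name and helpers are hypothetical): each field is written in a fixed little-endian byte order, and the version field comes first so future fields can be appended and gated on it during decode. A raw memcpy of the struct would instead bake in host endianness and padding.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for allocator_image_header; illustrative only.
struct allocator_image_header_sketch {
  uint32_t format_version;
  uint64_t serial;
};

// Encode helpers: always little-endian, regardless of host byte order.
inline void put_le32(std::vector<uint8_t>& out, uint32_t v) {
  for (int i = 0; i < 4; ++i) out.push_back((v >> (8 * i)) & 0xff);
}
inline void put_le64(std::vector<uint8_t>& out, uint64_t v) {
  for (int i = 0; i < 8; ++i) out.push_back((v >> (8 * i)) & 0xff);
}

inline void encode_header(const allocator_image_header_sketch& h,
                          std::vector<uint8_t>& out) {
  put_le32(out, h.format_version);  // version first, so a decoder can
  put_le64(out, h.serial);          // branch on it before reading new fields
}

inline uint32_t get_le32(const uint8_t* p) {
  uint32_t v = 0;
  for (int i = 0; i < 4; ++i) v |= uint32_t(p[i]) << (8 * i);
  return v;
}
inline uint64_t get_le64(const uint8_t* p) {
  uint64_t v = 0;
  for (int i = 0; i < 8; ++i) v |= uint64_t(p[i]) << (8 * i);
  return v;
}

inline allocator_image_header_sketch
decode_header(const std::vector<uint8_t>& in) {
  allocator_image_header_sketch h;
  h.format_version = get_le32(in.data());
  h.serial = get_le64(in.data() + 4);
  return h;
}
```

Ceph's DENC() does this (and more: bounds checking, feature bits) for you; the sketch only shows why the on-disk bytes become architecture-independent.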
Just ran briefly through part of the PR; will proceed later.
src/os/bluestore/BlueStore.cc
Outdated
@@ -8729,6 +9824,8 @@ int BlueStore::_fsck_on_open(BlueStore::FSCKDepth depth, bool repair)
  }

  dout(1) << __func__ << " checking freelist vs allocated" << dendl;
  // skip freelist vs allocated compare when we have Null fm
  if (!fm->is_null_manager())
You might want to modify this verification by comparing the allocator's allocated extents vs. the list of actual allocations from onodes, bluefs, etc., and repair if needed by updating the allocator. Then there would be no need to bypass the testing stuff in store_test.
Do you mean build an allocation-map from Onodes and then compare it with the allocation-map read from file?
It is easy to add this test as I already have the code (invoked from bluestore-tool), but it can be time consuming (minutes...).
If you think it is worth the time I can add it.
As for fixing - once we find the first error I can set a flag to copy the allocator generated from Onodes to the shared-allocator (will take a few extra seconds).
Yeah, that's what I am suggesting. Actually such a check wouldn't take much time - fsck reads all the objects anyway. If one uses a bitmap to track free/allocated extents it should be fast enough...
Maybe this can be done later via a follow-up PR though.
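The bitmap idea mentioned above can be sketched roughly as follows - a self-contained illustration, not BlueStore's actual fsck code (the class and method names are hypothetical): one bit per allocation unit, set while walking onode/bluefs allocations, so duplicate references are detected for free and the final bitmap can be compared against the allocator.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical fsck-side bitmap: true = allocated. Assumes offsets and
// lengths are aligned to the allocation-unit size.
class AllocBitmap {
  std::vector<bool> bits_;
  uint64_t au_size_;  // allocation-unit size in bytes
public:
  AllocBitmap(uint64_t device_size, uint64_t au_size)
    : bits_(device_size / au_size, false), au_size_(au_size) {}

  // Mark [offset, offset+len) allocated; returns how many allocation
  // units were already set (i.e. duplicate references worth flagging).
  uint64_t mark_allocated(uint64_t offset, uint64_t len) {
    uint64_t dups = 0;
    for (uint64_t au = offset / au_size_;
         au < (offset + len) / au_size_; ++au) {
      if (bits_[au]) ++dups;
      bits_[au] = true;
    }
    return dups;
  }

  bool is_allocated(uint64_t offset) const {
    return bits_[offset / au_size_];
  }
};
```

Since fsck already reads every object, the extra cost is only the bit flips plus one linear pass to diff the bitmap against the allocator's free map.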
src/test/objectstore/store_test.cc
Outdated
ASSERT_EQ(bstore->fsck(false), 0);

if (bstore->has_null_fm() == false) {
This might be preserved if fsck implements allocation verification for the new scheme..
good idea
src/os/bluestore/BlueStore.cc
Outdated
@@ -3108,6 +3109,23 @@ unsigned BlueStore::ExtentMap::decode_some(bufferlist& bl)
  return num;
}

//-------------------------------------------------------------------------
void BlueStore::ExtentMap::add_shard_info_to_onode(bufferlist v, int shard_id)
IMO all you need in this func is to call decode_some, which will load extents and blobs into the extent map. You're not going to use this onode later anyway...
void BlueStore::ExtentMap::add_shard_info_to_onode(bufferlist v)
{
decode_some(v);
}
Or even make the next step forward... decode_some builds a temporary array of the blobs this onode uses. Hence you can refactor it (or make a clone) which will load such an array for you. As a result you wouldn't need all the machinery to build the list of unique physical extents which is currently performed in read_allocation_from_single_onode.
Generally one should distinguish 3 types of blobs:
- regular blobs, which are attached to a specific onode only (i.e. aren't shared among multiple onodes) and are serialized with the first referencing extent. decode_some is good at loading them into a unique set (aka vector blobs).
- spanning blobs - similarly to the above they aren't shared, but they're serialized separately from the extent map. Not sure you're loading them at all...
- shared blobs - blobs which are shared among multiple onodes (due to onode cloning) and hence serialized separately. I need to double check where (when) physical extents are serialized (IIRC that happens outside of onode (de)serialization). But the key point is that one might find such blobs while reading multiple onodes - hence there is a need to handle that properly, i.e. avoid pextent duplication. I presume this issue is hidden by that sorted_extents_t structure you maintain, but in fact it isn't enough, since generally there is no guarantee that all the onodes referencing the same shared blob put their allocations into a single sorted_extents_t instance (or bunch/commit - don't know how to name that portion which is committed when one reaches the memory cap).
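The shared-blob concern above can be illustrated with a small, self-contained sketch (names like sbid and the struct layout are hypothetical, not BlueStore's actual types): record each shared blob's identifier once in a set, so that when several onodes reference the same shared blob, its physical extents are contributed to the rebuilt allocation map exactly once, regardless of which batch each onode lands in.

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

// Illustrative physical extent; not bluestore_pextent_t.
struct PExtent { uint64_t offset, length; };

// Hypothetical scan state carried across all onodes/batches.
struct SharedBlobScan {
  std::unordered_set<uint64_t> seen_sbids;  // shared-blob ids seen so far
  std::vector<PExtent> collected;           // extents for the allocator

  // Returns true if the blob's extents were taken, false if this shared
  // blob was already contributed by an earlier onode.
  bool take_shared_blob(uint64_t sbid, const std::vector<PExtent>& pextents) {
    if (!seen_sbids.insert(sbid).second)
      return false;  // duplicate reference: skip, avoiding pextent dups
    collected.insert(collected.end(), pextents.begin(), pextents.end());
    return true;
  }
};
```

The key property is that the seen-set outlives any per-batch structure, so deduplication holds even when referencing onodes fall into different commit batches.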
There are 3 places handling duplicated p_extents:
First, at read_allocation_from_single_onode() we store the extents in a temporary map, skipping duplicates.
Second, the sorted_extent_t structure attempts to remove duplication, but extremely large allocations (i.e. more than 300M physical extents after the first filter) will not fit in a single batch and we will still pass duplicates to the allocator.
Our last line of defense as I see it is the allocator itself, which I assume can handle duplication.
Are you saying that the allocator can't handle duplication?
I can add another pass after building the allocator to read all its entries - sort, merge and remove duplicates.
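The sort/merge/dedup pass proposed above could look roughly like this - a self-contained sketch, not the PR's actual code: sort the collected extents by offset, then fold overlapping or adjacent ones into a single extent, which subsumes exact duplicates as a special case of overlap.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative extent type; not bluestore's.
struct Extent { uint64_t offset, length; };

// Sort by offset, then merge overlapping/adjacent extents so that
// duplicates never reach the allocator.
inline std::vector<Extent> sort_merge_dedup(std::vector<Extent> v) {
  std::sort(v.begin(), v.end(),
            [](const Extent& a, const Extent& b) {
              return a.offset < b.offset;
            });
  std::vector<Extent> out;
  for (const Extent& e : v) {
    if (!out.empty() &&
        e.offset <= out.back().offset + out.back().length) {
      // Overlap or adjacency: extend the previous extent. An exact
      // duplicate extends it by zero, i.e. is silently absorbed.
      uint64_t end = std::max(out.back().offset + out.back().length,
                              e.offset + e.length);
      out.back().length = end - out.back().offset;
    } else {
      out.push_back(e);
    }
  }
  return out;
}
```

One full sort over hundreds of millions of extents is the dominant cost here, which is why doing it once at the end beats deduplicating inside every batch.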
The spanning blobs comment is worrisome, as it means I will break the allocation state in the recovery flow.
Can you please schedule a meeting with me, you and Adam to get to the bottom of this?
> ...300M physical extents after the first filter) will not fit in a single batch and we will still pass duplication to the allocator. Our last line of defense as I see it is the allocator itself which I assume can handle duplication. Are you saying that the allocator can't handle duplication?

No, it's not mandatory for allocators to handle them (some implementations might assert, though) - that's additional performance overhead, and useless when using the free list manager.
Looks like the failing 'make check' is relevant
1860456 to 8c1f5c7 (compare)
With the latest code we are running about 5X faster than fsck, which means that customers doing fsck after a crash will only see about a 20% slowdown in startup time caused by the allocation-table rebuild.
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved
https://pulpito.ceph.com/nojha-2021-07-21_17:56:04-rados-wip-39871-distro-basic-smithi/ - the valgrind issues are related
I reviewed the failures and can't see how any of those issues is related to my PR
Did you see the following? /a/nojha-2021-07-21_17:56:04-rados-wip-39871-distro-basic-smithi/6284822/remote/smithi168/log/valgrind
src/os/bluestore/BlueStore.cc
Outdated
if (ret != 0) {
  derr << __func__ << "Failed open_for_write with error-code " << ret << dendl;
  return -1;
}
unique_ptr<BlueFS::FileWriter> p_handle(p_temp_handle);

//auto deleter = [](BlueFS::FileWriter* fw) { bluefs->close_writer(fw); delete fw;};
can remove commented out lines
looks good to me, can follow-up with further improvements in subsequent PRs
Valgrind issues, crashes, and some tests had been running for more than 8 hours before being terminated.
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved
Looks good!
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved
Currently BlueStore keeps its allocation info inside RocksDB. BlueStore commits all allocation information (alloc/release) into RocksDB (column-family B) before the client write is performed, causing a delay in the write path and adding significant load to the CPU/memory/disk. Committing all state into RocksDB allows Ceph to survive failures without losing the allocation state.

The new code skips the RocksDB updates at allocation time and instead performs a full destage of the allocator object with all the OSD allocation state in a single step during umount(). This results in a 25% increase in IOPS and reduced latency in small random-write workloads, but exposes the system to losing allocation info in failure cases where we don't call umount. We added code to perform a full allocation-map rebuild from information stored inside the ONodes, which is used in failure cases. When we perform a graceful shutdown there is no need for recovery and we simply read the allocation-map from a flat file where it was stored during umount() (in fact this mode is faster and shaves a few seconds from boot time, since reading a flat file is faster than iterating over RocksDB).

Open Issues:
There is a bug in the src/stop.sh script killing ceph without invoking umount(), which means anyone using it will always invoke the recovery path. Adam Kupczyk is fixing this issue in a separate PR. A simple workaround is to add a call to 'killall -15 ceph-osd' before calling src/stop.sh.
Fast-Shutdown and Ceph Suicide (done when the system underperforms) stop the system without a proper drain and a call to umount. This will trigger a full recovery, which can be long (3 minutes in my testing, but your mileage may vary).

We plan on adding a follow-up PR doing the following in Fast-Shutdown and Ceph Suicide:
- Block the OSD queues from accepting any new requests
- Delete all items in the queue which we didn't start yet
- Drain all in-flight tasks
- Call umount (and destage the allocation-map)
- If the drain didn't complete within a predefined time limit (say 3 minutes) -> kill the OSD

Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>

Commits:
- create allocator from on-disk onodes and BlueFS inodes
- change allocator + add stat counters + report illegal physical-extents
- compare allocator after rebuild from ONodes
- prevent collection from being open twice
- removed FSCK repo check for null-fm
- Bug-Fix: don't add BlueFS allocation to shared allocator
- add configuration option to commit to No-Column-B
- Only invalidate allocation file after opening rocksdb in read-write mode
- fix tests not to expect failure in cases unapplicable to null-allocator
- accept non-existing allocation file and don't fail the invaladtion as it could happen legally
- don't commit to null-fm when db is opened in repair-mode
- add a reverse mechanism from null_fm to real_fm (using RocksDB)
- Using Ceph encode/decode, adding more info to header/trailer, add crc protection
- Code cleanup
- some changes requested by Adam (cleanup and style changes)

Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>
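The destage/restore cycle described in the PR description can be sketched with a minimal, self-contained example - the file layout, names, and the toy checksum below are illustrative assumptions, not BlueStore's actual allocator-image format (which uses Ceph encode/decode with header/trailer and CRC protection): at umount the free-extent list is written to a flat file with a count and checksum; at mount it is read back and verified, and a checksum mismatch would mean falling back to the ONode-based rebuild.

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Illustrative free extent; not bluestore's type.
struct FreeExtent { uint64_t offset, length; };

// Toy checksum standing in for the real CRC protection.
inline uint64_t sum64(const std::vector<FreeExtent>& v) {
  uint64_t s = 0;
  for (const FreeExtent& e : v) s += e.offset * 1000003 + e.length;
  return s;
}

// umount path: destage the allocation map to a flat file in one step.
inline bool destage(const std::string& path,
                    const std::vector<FreeExtent>& v) {
  std::ofstream f(path, std::ios::binary | std::ios::trunc);
  uint64_t n = v.size(), crc = sum64(v);
  f.write(reinterpret_cast<const char*>(&n), sizeof(n));
  f.write(reinterpret_cast<const char*>(v.data()), n * sizeof(FreeExtent));
  f.write(reinterpret_cast<const char*>(&crc), sizeof(crc));
  return f.good();
}

// mount path: read the map back; false means the caller must rebuild
// the allocation map from the ONodes instead (the recovery path).
inline bool restore(const std::string& path, std::vector<FreeExtent>& v) {
  std::ifstream f(path, std::ios::binary);
  uint64_t n = 0, crc = 0;
  if (!f.read(reinterpret_cast<char*>(&n), sizeof(n))) return false;
  v.resize(n);
  if (!f.read(reinterpret_cast<char*>(v.data()),
              n * sizeof(FreeExtent))) return false;
  if (!f.read(reinterpret_cast<char*>(&crc), sizeof(crc))) return false;
  return crc == sum64(v);
}
```

This is why the graceful-shutdown path is cheap (one sequential file read) while any shutdown that skips umount() forces the much slower rebuild.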
http://pulpito.front.sepia.ceph.com/benhanokh-2021-08-04_06:12:22-rados-wip_gbenhano_ncbz-distro-basic-smithi/ No related failures, issue exposed by this PR is being tracked in https://tracker.ceph.com/issues/52138
👍
This effectively enables having 4K allocation units for BlueFS. But it doesn't turn it on by default for the sake of performance. Using a main device which lacks enough free large continuous extents might do the trick though.

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
(cherry picked from commit 001b08d)
Conflicts:
- src/os/bluestore/BlueFS.cc (trivial, no ceph#39871)
- src/os/bluestore/BlueStore.cc (trivial, no commits for zoned support)
- src/test/objectstore/test_bluefs.cc (trivial, no ceph#45883)