BlueStore: Remove Allocations from RocksDB #39871

Merged
merged 1 commit on Aug 11, 2021

Commits on Aug 11, 2021

  1. BlueStore: Remove Allocations from RocksDB

    Currently BlueStore keeps its allocation info inside RocksDB.
    BlueStore commits all allocation information (alloc/release) into RocksDB (column-family B) before the client write completes, causing a delay in the write path and adding significant CPU/memory/disk load.
    Committing all state into RocksDB allows Ceph to survive failures without losing the allocation state.
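    
    For context, a minimal self-contained C++ sketch of the pre-change behavior (the KVTransaction type, key layout, and function names are invented for illustration and are not the BlueStore API): every allocation record is folded into the same key-value transaction that carries the client write, so the write cannot be acknowledged until that batch is committed.
    
    #include <cstdint>
    #include <string>
    #include <utility>
    #include <vector>
    
    // Hypothetical stand-in for a RocksDB write batch / transaction.
    struct KVTransaction {
      // (column-family/key, value) pairs queued for a single atomic commit
      std::vector<std::pair<std::string, std::string>> ops;
      void put(const std::string& cf, const std::string& key, const std::string& val) {
        ops.emplace_back(cf + "/" + key, val);
      }
    };
    
    // Pre-change write path (conceptual): allocation state rides in the same
    // commit as the object metadata, so every client write pays for it.
    void write_object_old_path(KVTransaction& txn,
                               uint64_t alloc_offset, uint64_t alloc_len,
                               const std::string& onode_key,
                               const std::string& onode_val) {
      // Column-family "B" held the allocation (freelist) records.
      txn.put("B", std::to_string(alloc_offset), std::to_string(alloc_len));
      // Object metadata goes into the very same transaction.
      txn.put("O", onode_key, onode_val);
      // The whole batch must be committed before the write is acknowledged.
    }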
    
    The new code skips the RocksDB updates at allocation time and instead performs a full destage of the allocator object, with all of the OSD's allocation state, in a single step during umount().
    This results in a 25% increase in IOPS and reduced latency in small random-write workloads, but it exposes the system to losing allocation info in failure cases where umount() is not called.
    We added code to perform a full allocation-map rebuild from the information stored inside the ONodes, which is used in those failure cases.
    When we perform a graceful shutdown there is no need for recovery, and we simply read the allocation-map from the flat file where it was stored during umount(). In fact this mode is faster and shaves a few seconds off boot time, since reading a flat file is faster than iterating over RocksDB.
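    
    A rough, self-contained C++ sketch of the resulting mount-time decision (all types and function names below are invented for illustration, not the actual BlueStore code): use the flat allocation file written at umount() when it is present and valid, otherwise fall back to the full rebuild from the on-disk ONodes.
    
    #include <cstdint>
    #include <fstream>
    #include <string>
    #include <vector>
    
    struct Extent { uint64_t offset = 0, length = 0; };
    
    // Invented placeholder for the in-memory allocator (free-space map).
    struct Allocator {
      std::vector<Extent> free_extents;
    };
    
    // Graceful path: the allocation-map was destaged to a flat file at umount().
    // Returns false if the file is missing or empty, in which case the caller
    // must rebuild (the real file also carries header/trailer and CRC checks).
    bool load_allocation_file(const std::string& path, Allocator& alloc) {
      std::ifstream in(path, std::ios::binary);
      if (!in) return false;
      Extent e;
      while (in.read(reinterpret_cast<char*>(&e), sizeof(e))) {
        alloc.free_extents.push_back(e);
      }
      return !alloc.free_extents.empty();
    }
    
    // Failure path: start from "everything free" and walk every on-disk ONode
    // (and BlueFS inode), removing the extents they reference. Stubbed here.
    void rebuild_from_onodes(Allocator& alloc) {
      alloc.free_extents = { {0, /*device size*/ 1ull << 40} };
      // ... scan ONodes and carve out their allocated extents ...
    }
    
    void init_allocation_map(const std::string& alloc_file, Allocator& alloc) {
      // Prefer the flat file written during a clean shutdown: reading it is
      // faster than iterating over RocksDB or scanning all ONodes.
      if (load_allocation_file(alloc_file, alloc)) return;
      // File missing/invalid (crash, fast-shutdown, stale copy): full recovery.
      rebuild_from_onodes(alloc);
    }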
    
    Open Issues:
    
    There is a bug in the src/stop.sh script: it kills ceph without invoking umount(), which means anyone using it will always take the recovery path.
    Adam Kupczyk is fixing this issue in a separate PR.
    A simple workaround is to add a call to 'killall -15 ceph-osd' before calling src/stop.sh.
    
    Fast-Shutdown and Ceph Suicide (done when the system underperforms) stop the system without a proper drain and without a call to umount().
    This will trigger a full recovery, which can be long (3 minutes in my testing, but your mileage may vary).
    We plan to add a follow-up PR doing the following in Fast-Shutdown and Ceph Suicide (a rough sketch of this flow follows the list):
    
    Block the OSD queues from accepting any new requests
    Delete all queued items we haven't started yet
    Drain all in-flight tasks
    Call umount() (and destage the allocation-map)
    If the drain doesn't complete within a predefined time limit (say 3 minutes) -> kill the OSD
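    
    A minimal self-contained C++ sketch of that planned flow (the queue structure and function names are invented for illustration; this is not the follow-up PR's code). The recovery path is only taken when the drain overruns the limit and the OSD has to be killed.
    
    #include <atomic>
    #include <chrono>
    #include <queue>
    #include <thread>
    
    // Invented stand-in for the OSD's operation queues.
    struct OSDQueues {
      std::atomic<bool> accept_new{true};
      std::queue<int>   pending;          // requests we have not started yet
      std::atomic<int>  in_flight{0};     // tasks already being executed
    };
    
    // Invented stub: umount() path that destages the allocation-map to the flat file.
    void umount_and_destage() {}
    
    // Returns true if the OSD shut down cleanly, false if the caller should kill it.
    bool fast_shutdown(OSDQueues& q, std::chrono::seconds time_limit) {
      q.accept_new = false;                       // 1. block new requests
      q.pending = {};                             // 2. drop items never started
      auto deadline = std::chrono::steady_clock::now() + time_limit;
      while (q.in_flight.load() > 0) {            // 3. drain in-flight tasks
        if (std::chrono::steady_clock::now() > deadline) {
          return false;                           // 5. drain timed out -> kill the OSD
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
      }
      umount_and_destage();                       // 4. clean umount, no recovery needed
      return true;
    }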
    Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>
    
    create allocator from on-disk onodes and BlueFS inodes
    change allocator + add stat counters + report illegal physical-extents
    compare allocator after rebuild from ONodes
    prevent collection from being open twice
    removed FSCK repo check for null-fm
    Bug-Fix: don't add BlueFS allocation to shared allocator
    add configuration option to commit to No-Column-B
    Only invalidate allocation file after opening rocksdb in read-write mode
    fix tests not to expect failure in cases inapplicable to null-allocator
    accept a non-existing allocation file and don't fail the invalidation, as it could happen legally
    don't commit to null-fm when db is opened in repair-mode
    add a reverse mechanism from null_fm to real_fm (using RocksDB)
    Using Ceph encode/decode, adding more info to header/trailer, add crc protection
    Code cleanup
    
    some changes requested by Adam (cleanup and style changes)
    
    Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>
    benhanokh committed Aug 11, 2021
    Commit: 272160a