BlueStore: Remove Allocations from RocksDB #39871

Merged
merged 1 commit on Aug 11, 2021

Commits on Aug 11, 2021

  1. BlueStore: Remove Allocations from RocksDB

    Currently BlueStore keeps its allocation info inside RocksDB.
    BlueStore commits all allocation information (alloc/release) into RocksDB (column-family B) before the client write completes, causing a delay in the write path and adding significant CPU/memory/disk load.
    Committing all state into RocksDB allows Ceph to survive failures without losing the allocation state.
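    
    For context, a minimal self-contained C++ sketch of the pre-change behavior (the KVTransaction type, key layout, and function names are invented for illustration and are not the BlueStore API): every allocation record is folded into the same key-value transaction that carries the client write, so the write cannot be acknowledged until that batch is committed.
    
    #include <cstdint>
    #include <string>
    #include <utility>
    #include <vector>
    
    // Hypothetical stand-in for a RocksDB write batch / transaction.
    struct KVTransaction {
      // (column-family/key, value) pairs queued for a single atomic commit
      std::vector<std::pair<std::string, std::string>> ops;
      void put(const std::string& cf, const std::string& key, const std::string& val) {
        ops.emplace_back(cf + "/" + key, val);
      }
    };
    
    // Pre-change write path (conceptual): allocation state rides in the same
    // commit as the object metadata, so every client write pays for it.
    void write_object_old_path(KVTransaction& txn,
                               uint64_t alloc_offset, uint64_t alloc_len,
                               const std::string& onode_key,
                               const std::string& onode_val) {
      // Column-family "B" held the allocation (freelist) records.
      txn.put("B", std::to_string(alloc_offset), std::to_string(alloc_len));
      // Object metadata goes into the very same transaction.
      txn.put("O", onode_key, onode_val);
      // The whole batch must be committed before the write is acknowledged.
    }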
    
    The new code skips the RocksDB updates at allocation time and instead performs a full destage of the allocator object, with all of the OSD's allocation state, in a single step during umount().
    This results in a 25% increase in IOPS and reduced latency in small random-write workloads, but it exposes the system to losing allocation info in failure cases where umount() is not called.
    We added code to perform a full allocation-map rebuild from the information stored inside the ONodes, which is used in those failure cases.
    When we perform a graceful shutdown there is no need for recovery, and we simply read the allocation-map from the flat file where it was stored during umount(). In fact this mode is faster and shaves a few seconds off boot time, since reading a flat file is faster than iterating over RocksDB.
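    
    A rough, self-contained C++ sketch of the resulting mount-time decision (all types and function names below are invented for illustration, not the actual BlueStore code): use the flat allocation file written at umount() when it is present and valid, otherwise fall back to the full rebuild from the on-disk ONodes.
    
    #include <cstdint>
    #include <fstream>
    #include <string>
    #include <vector>
    
    struct Extent { uint64_t offset = 0, length = 0; };
    
    // Invented placeholder for the in-memory allocator (free-space map).
    struct Allocator {
      std::vector<Extent> free_extents;
    };
    
    // Graceful path: the allocation-map was destaged to a flat file at umount().
    // Returns false if the file is missing or empty, in which case the caller
    // must rebuild (the real file also carries header/trailer and CRC checks).
    bool load_allocation_file(const std::string& path, Allocator& alloc) {
      std::ifstream in(path, std::ios::binary);
      if (!in) return false;
      Extent e;
      while (in.read(reinterpret_cast<char*>(&e), sizeof(e))) {
        alloc.free_extents.push_back(e);
      }
      return !alloc.free_extents.empty();
    }
    
    // Failure path: start from "everything free" and walk every on-disk ONode
    // (and BlueFS inode), removing the extents they reference. Stubbed here.
    void rebuild_from_onodes(Allocator& alloc) {
      alloc.free_extents = { {0, /*device size*/ 1ull << 40} };
      // ... scan ONodes and carve out their allocated extents ...
    }
    
    void init_allocation_map(const std::string& alloc_file, Allocator& alloc) {
      // Prefer the flat file written during a clean shutdown: reading it is
      // faster than iterating over RocksDB or scanning all ONodes.
      if (load_allocation_file(alloc_file, alloc)) return;
      // File missing/invalid (crash, fast-shutdown, stale copy): full recovery.
      rebuild_from_onodes(alloc);
    }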
    
    Open Issues:
    
    There is a bug in the src/stop.sh script: it kills ceph without invoking umount(), which means anyone using it will always take the recovery path.
    Adam Kupczyk is fixing this issue in a separate PR.
    A simple workaround is to add a call to 'killall -15 ceph-osd' before calling src/stop.sh.
    
    Fast-Shutdown and Ceph Suicide (done when the system underperforms) stop the system without a proper drain and without a call to umount().
    This will trigger a full recovery, which can be long (3 minutes in my testing, but your mileage may vary).
    We plan to add a follow-up PR doing the following in Fast-Shutdown and Ceph Suicide (a rough sketch of this flow follows the list):
    
    Block the OSD queues from accepting any new requests
    Delete all queued items we haven't started yet
    Drain all in-flight tasks
    Call umount() (and destage the allocation-map)
    If the drain doesn't complete within a predefined time limit (say 3 minutes) -> kill the OSD
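    
    A minimal self-contained C++ sketch of that planned flow (the queue structure and function names are invented for illustration; this is not the follow-up PR's code). The recovery path is only taken when the drain overruns the limit and the OSD has to be killed.
    
    #include <atomic>
    #include <chrono>
    #include <queue>
    #include <thread>
    
    // Invented stand-in for the OSD's operation queues.
    struct OSDQueues {
      std::atomic<bool> accept_new{true};
      std::queue<int>   pending;          // requests we have not started yet
      std::atomic<int>  in_flight{0};     // tasks already being executed
    };
    
    // Invented stub: umount() path that destages the allocation-map to the flat file.
    void umount_and_destage() {}
    
    // Returns true if the OSD shut down cleanly, false if the caller should kill it.
    bool fast_shutdown(OSDQueues& q, std::chrono::seconds time_limit) {
      q.accept_new = false;                       // 1. block new requests
      q.pending = {};                             // 2. drop items never started
      auto deadline = std::chrono::steady_clock::now() + time_limit;
      while (q.in_flight.load() > 0) {            // 3. drain in-flight tasks
        if (std::chrono::steady_clock::now() > deadline) {
          return false;                           // 5. drain timed out -> kill the OSD
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
      }
      umount_and_destage();                       // 4. clean umount, no recovery needed
      return true;
    }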
    Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>
    
    create allocator from on-disk onodes and BlueFS inodes
    change allocator + add stat counters + report illegal physical-extents
    compare allocator after rebuild from ONodes
    prevent collection from being open twice
    removed FSCK repo check for null-fm
    Bug-Fix: don't add BlueFS allocation to shared allocator
    add configuration option to commit to No-Column-B
    Only invalidate allocation file after opening rocksdb in read-write mode
    fix tests not to expect failure in cases inapplicable to null-allocator
    accept a non-existing allocation file and don't fail the invalidation, as it could happen legally
    don't commit to null-fm when db is opened in repair-mode
    add a reverse mechanism from null_fm to real_fm (using RocksDB)
    Using Ceph encode/decode, adding more info to header/trailer, add crc protection
    Code cleanup
    
    some changes requested by Adam (cleanup and style changes)
    
    Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>
    benhanokh committed Aug 11, 2021
    Commit: 272160a