pacific: os/bluestore: cumulative bluefs backport #52212

ifed01 · 2023-06-27T16:19:42Z

This includes:

os/bluestore: BlueFS fine grain locking
(os/bluestore: BlueFS fine grain locking #43794)
os/bluestore/bluefs: Fix improper vselector tracking in _flush_special()
os/bluestore/bluefs: Fix improper vselector tracking in _flush_special() #44955
os/bluestore: enable 4K allocation unit for BlueFS
os/bluestore: enable 4K allocation unit for BlueFS #48854
os/bluestore/bluefs: Fix sync compaction
4fd98ce

Signed-off-by: Igor Fedotov igor.fedotov@croit.io

Contribution Guidelines

To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "pacific"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

Checklist

Tracker (select at least one)
- References tracker ticket
- Very recent bug; references commit where it was introduced
- New feature (ticket optional)
- Doc update (no ticket needed)
- Code cleanup (no ticket needed)
Component impact
- Affects Dashboard, opened tracker ticket
- Affects Orchestrator, opened tracker ticket
- No impact that needs to be tracked
Documentation (select at least one)
- Updates relevant documentation
- No doc update is appropriate
Tests (select at least one)
- Includes unit test(s)
- Includes integration test(s)
- Includes bug reproducer
- No tests

Show available Jenkins commands

jenkins retest this please
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test dashboard cephadm
jenkins test api
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox
jenkins test windows

This refactor prepares flush_and_sync_log and compact_log_async for fine-grain locks in BlueFS. There is no new logic introduced, but refactor is accompanied by some new comments. Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit 05703cc)

This refactor prepares _flush for fine-grain locks in BlueFS. Introduced _flush_special, a flush dedicated to bluefs special files (ino=1) and (ino=0). Function _flush no longer accepts these special files. Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit edebf9f)

…ocks Splits bluefs lock into log, dirty, dirs, file and writer locks. This breaks severe locking issues, and makes bluefs more multithreaded. Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit e74474d) Conflicts: src/os/bluestore/BlueFS.cc (misordered backports, lacking ceph#48171 in the source commit)

Reorganize BlueFS state variables into separate domains: 1) log, 2) dirty, 3) nodes. Each has separate lock. This change is intended to make it easier to control which locks need to be held when specific elements are modified. Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit b661fa2) Conflicts: src/os/bluestore/BlueFS.cc (misordered backports, lacking ceph#48171 in the source commit)

Splits seq into seq_live and seq_stable. Cleans up log sequencing. Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit 80ed7b3)

This is modification that only changes names of functions, so tracking of potential deadlocks is simpler. All internal functions start with _. Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit dc78769)

Modified File.dirty_seq to capture dirty.seq_stable instead of 0. This is used to distunguish files already serialized for compact_log_async_dump_metadata() function. Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit 6590a30)

Extracted _maybe_compact_log outside of file lock. The sequence FL could deadlock with LNF that is executed in _async_dump_metadata. Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit c967c57)

Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit eac1807)

Rearranged locks in preallocate to avoid possible deadlock with compact_log_async_dump_metadata_NF. Cycle was: L->N rename/mkdir N->F compact_log_async_dump_metadata_NF F->L preallocate Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit 49316ab)

Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit d23f0b1)

Usually sequence of locking is 1) FileWriter 2) File. In _compact_log_async_LD_NF_D it was in reversed order. No real deadlock was possible, but lockdep complained. Bonus: Improved lock dependency graph. Fixes: https://tracker.ceph.com/issues/52939 Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit 7b7945d)

BlueFS fine grain lock refactor did break accounting in volume selection module. It caused ceph_test_objectstore to fail on SpilloverFixedTest test. Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit 41eb537)

It was possible to skip capture of file that was recently modified. New procedure under one log.lock flushed pending files and captures state. It is much less interruptible then I had hoped for but I cannot now do it better. Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit a296989)

Moves vselector size tracking outside _flush_special(). Function _compact_log_async...() updated sizes twice. Problem could not be solved by making second modification of size just update, as it will possibly disrupt vselector consistency check (_vselector_check()). Feature to track vselector consistency relies on the fact that either log.lock or nodes.lock are taken when the check is performed. Which is not true for _compact_log_async...(). Now _flush_special does not update vselector sizes by itself but leaves the update to the caller. Fixes: https://tracker.ceph.com/issues/54248 Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit 4bc0f61)

Fix bluefs volume selector in device_migrate_to_existing. Fix bluefs volume selector in _rewrite_log_and_layout_sync_LNF_LD. Fixes: https://tracker.ceph.com/issues/54248 Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit 3813416)

During phase of log fnode switch from new_log to actual log it is necessary to hold lock. Added that locking. Modified procedure of transporting extents from old_log to new_log. Now new_log gets additional extents, instead of removing from old_log; this shortens time when we need to hold log.lock. Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit f2acf52) Conflicts: src/os/bluestore/BlueFS.cc ( - Lacking ceph#41557 - misc stuff with vselector )

Fixed problem that fnode in _consume_dirty could be broken if during fsync() some other thread was writing to the file. Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit 49b0ee4)

Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit 9709b14)

Extract updating of num files and log size from _update_logger_stats and put it exactly where modification happens. It allows to escape problem of taking nodes.lock and log.lock. Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit 2815043)

Moved pending_release to struct dirty {}. Restructured BlueFS::open_for_write to modify pending_release under dirty.lock. Now all pending_release modifications are under dirty.lock. Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit 2b36f27)

Added comments. Removed comments. Fixed lock-tracking suffixes to functions stemming from change _compact_log_dump_metadata_N -> _compact_log_dump_metadata_NF Removed extra lock_fnode_print. Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit fd9ef8b)

Add function tags and comments related to locks. Fixed locking graph documentation. Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit a091ea7) Conflicts: src/os/bluestore/BlueFS.h (Lacking backport for ceph#41557)

Signed-off-by: Igor Fedotov <ifedotov@croit.io> (cherry picked from commit bd20741)

We can reuse _compact_log_dump_metadata_NF() instead Signed-off-by: Igor Fedotov <igor.fedotov@croit.io> (cherry picked from commit 285df4b)

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io> (cherry picked from commit 0fc0ced)

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io> (cherry picked from commit 05478fc)

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io> (cherry picked from commit 0bfc42a)

+ minor refactoring. Signed-off-by: Igor Fedotov <igor.fedotov@croit.io> (cherry picked from commit 0af2858)

Fixes problem with sync compaction (_rewrite_log_and_layout_sync). There was a problem with not updating log_seq after compacting log. It cause to stop _replay log right after first transaction. ... 20 bluefs _replay 0x0: op_dir_create sharding ... 20 bluefs _replay 0x0: op_dir_link sharding/def to 21 ... 20 bluefs _replay 0x0: op_jump_seq 1025 ... 10 bluefs _read h 0x555557c46400 0x1000~1000 from file(ino 1 size 0x1000 mtime 0.000000 allocated 410000 alloc_commit 410000 extents [1:0x1540000~410000]) ... 20 bluefs _read left 0xff000 len 0x1000 ... 20 bluefs _read got 4096 ... 10 bluefs _replay 0x1000: stop: seq 1025 != expected 1026 This is a product of bluefs fine grain locks refactor. Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit 2f8e370) Conflicts: src/test/objectstore/test_bluefs.cc (cherry picked from commit 4fd98ce)

ifed01 · 2023-06-28T11:13:58Z

jenkins test make check

ifed01 · 2023-06-28T11:14:14Z

jenkins test make check

aclamk · 2023-08-07T09:05:19Z

options/global.yaml.in dragged in by mistake.

When using bluefs_shared_alloc_size one might get a long-lasting state when that large chunks are not available any more and fallback to shared device min alloc size occurs. The introduced cooldown is intended to prevent repetitive allocation attempts with bluefs_shared_alloc_size for a while. The rationale is to eliminate performance penalty these failing attempts might cause. Signed-off-by: Igor Fedotov <igor.fedotov@croit.io> (cherry picked from commit e52bcc8) Conflicts: src/common/options/global.yaml.in (legacy options declarations, no yamls in pacific)

ifed01 · 2023-08-07T09:26:04Z

options/global.yaml.in dragged in by mistake.

Fixed

aclamk

I was thinking that the use of UPDATE_INC op in BlueFS will be a blocker for this backport.
I completely forgot that we already merged #48915, which I believe was a mistake - we broke downgrading in the same version line.

But since it is already done, I think we can extent Pacific logic with this cumulative set of features.

aclamk · 2023-08-07T13:47:44Z

jenkins test api

YiteGu · 2023-08-09T02:07:16Z

I was thinking that the use of UPDATE_INC op in BlueFS will be a blocker for this backport. I completely forgot that we already merged #48915, which I believe was a mistake - we broke downgrading in the same version line.

But since it is already done, I think we can extent Pacific logic with this cumulative set of features.

Yes, we released #48915, but didn't seem to tell the user that can not downgrading. I only found out after the upgrade was completed.

YiteGu · 2023-08-10T11:59:45Z

Hi, @ifed01 Why do we need bluefs_failed_shared_alloc_cooldown? Isn't it good for bluefs to directly use 4k alloc units?

ifed01 · 2023-08-10T16:53:18Z

Hi, @ifed01 Why do we need bluefs_failed_shared_alloc_cooldown? Isn't it good for bluefs to directly use 4k alloc units?

Using 4K chunks is potentially less effective in terms of metadata size - file allocation map will generally contain more entries which takes more RAM and might impact the performance. Hence we still trying to use large chunks if possible and rollback to them if this failed at some point

YiteGu · 2023-08-11T02:09:20Z

Hi, @ifed01 Why do we need bluefs_failed_shared_alloc_cooldown? Isn't it good for bluefs to directly use 4k alloc units?

Using 4K chunks is potentially less effective in terms of metadata size - file allocation map will generally contain more entries which takes more RAM and might impact the performance. Hence we still trying to use large chunks if possible and rollback to them if this failed at some point

thanks, I got it.

ljflores · 2023-08-15T20:26:32Z

Rados approved: https://tracker.ceph.com/projects/rados/wiki/PACIFIC#httpstrellocomcTw3vOIOj1819-wip-yuri2-testing-2023-08-08-0755-pacific

aclamk and others added 30 commits June 27, 2023 13:51

os/bluestore/bluefs: Clean up log seq progression

ad8fbf0

Splits seq into seq_live and seq_stable. Cleans up log sequencing. Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit 80ed7b3)

os/bluestore/bluefs: Weaken locks in append_try_flush

10dbded

Extracted _maybe_compact_log outside of file lock. The sequence FL could deadlock with LNF that is executed in _async_dump_metadata. Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit c967c57)

os/bluestore/bluefs: Weaken locks in open_for_write

c30442a

Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit eac1807)

os/bluestore/bluefs: Removed unused fields

e9e78d0

Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit d23f0b1)

os/bluestore/bluefs: Fix problem with access to fnode in _consume_dirty

0b2aac5

Fixed problem that fnode in _consume_dirty could be broken if during fsync() some other thread was writing to the file. Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit 49b0ee4)

os/bluestore/bluefs: Review comments, misc changes.

bb83aac

Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit 9709b14)

os/bluestore: unify allocation functions' signature at BlueFS.

0022812

Signed-off-by: Igor Fedotov <ifedotov@croit.io> (cherry picked from commit bd20741)

os/bluestore: get rid off BlueFS::_compact_log_async_dump_metadata_NF()

65e1de1

We can reuse _compact_log_dump_metadata_NF() instead Signed-off-by: Igor Fedotov <igor.fedotov@croit.io> (cherry picked from commit 285df4b)

os/bluestore: simplify and cleanup BlueFS::_compact_log_async_...()

4bd3c67

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io> (cherry picked from commit 0fc0ced)

os/bluestore: introduce method to estimate BlueFS transaction size

4f48d9c

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io> (cherry picked from commit 05478fc)

os/bluestore: increment Bluefs::super.version at _write_super

681903d

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io> (cherry picked from commit 0bfc42a)

os/bluestore: introduce bluefs_fnode_t::swap method

7f8204e

+ minor refactoring. Signed-off-by: Igor Fedotov <igor.fedotov@croit.io> (cherry picked from commit 0af2858)

github-actions bot added common core tests labels Jun 27, 2023

github-actions bot added this to the pacific milestone Jun 27, 2023

ifed01 added backport pacific and removed core tests common bluestore labels Jun 27, 2023

ifed01 force-pushed the wip-ifed-bluefs-cumulative-backports-pac branch from b88bb13 to c627275 Compare August 7, 2023 09:25

github-actions bot added bluestore common core tests labels Aug 7, 2023

aclamk added the needs-qa label Aug 7, 2023

aclamk approved these changes Aug 7, 2023

View reviewed changes

yuriw added the wip-yuri2-testing label Aug 8, 2023

YiteGu approved these changes Aug 11, 2023

View reviewed changes

yuriw merged commit c216ab5 into ceph:pacific Aug 15, 2023
8 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pacific: os/bluestore: cumulative bluefs backport #52212

pacific: os/bluestore: cumulative bluefs backport #52212

ifed01 commented Jun 27, 2023 •

edited

ifed01 commented Jun 28, 2023

ifed01 commented Jun 28, 2023

aclamk commented Aug 7, 2023

ifed01 commented Aug 7, 2023

aclamk left a comment

aclamk commented Aug 7, 2023

YiteGu commented Aug 9, 2023

YiteGu commented Aug 10, 2023

ifed01 commented Aug 10, 2023

YiteGu commented Aug 11, 2023

ljflores commented Aug 15, 2023

pacific: os/bluestore: cumulative bluefs backport #52212

pacific: os/bluestore: cumulative bluefs backport #52212

Conversation

ifed01 commented Jun 27, 2023 • edited

Contribution Guidelines

Checklist

ifed01 commented Jun 28, 2023

ifed01 commented Jun 28, 2023

aclamk commented Aug 7, 2023

ifed01 commented Aug 7, 2023

aclamk left a comment

Choose a reason for hiding this comment

aclamk commented Aug 7, 2023

YiteGu commented Aug 9, 2023

YiteGu commented Aug 10, 2023

ifed01 commented Aug 10, 2023

YiteGu commented Aug 11, 2023

ljflores commented Aug 15, 2023

ifed01 commented Jun 27, 2023 •

edited