New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pacific: os/bluestore: cumulative bluefs backport #52212
pacific: os/bluestore: cumulative bluefs backport #52212
Conversation
This refactor prepares flush_and_sync_log and compact_log_async for fine-grain locks in BlueFS. There is no new logic introduced, but refactor is accompanied by some new comments. Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit 05703cc)
This refactor prepares _flush for fine-grain locks in BlueFS. Introduced _flush_special, a flush dedicated to bluefs special files (ino=1) and (ino=0). Function _flush no longer accepts these special files. Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit edebf9f)
…ocks Splits bluefs lock into log, dirty, dirs, file and writer locks. This breaks severe locking issues, and makes bluefs more multithreaded. Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit e74474d) Conflicts: src/os/bluestore/BlueFS.cc (misordered backports, lacking ceph#48171 in the source commit)
Reorganize BlueFS state variables into separate domains: 1) log, 2) dirty, 3) nodes. Each has separate lock. This change is intended to make it easier to control which locks need to be held when specific elements are modified. Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit b661fa2) Conflicts: src/os/bluestore/BlueFS.cc (misordered backports, lacking ceph#48171 in the source commit)
Splits seq into seq_live and seq_stable. Cleans up log sequencing. Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit 80ed7b3)
This is modification that only changes names of functions, so tracking of potential deadlocks is simpler. All internal functions start with _. Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit dc78769)
Modified File.dirty_seq to capture dirty.seq_stable instead of 0. This is used to distunguish files already serialized for compact_log_async_dump_metadata() function. Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit 6590a30)
Extracted _maybe_compact_log outside of file lock. The sequence FL could deadlock with LNF that is executed in _async_dump_metadata. Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit c967c57)
Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit eac1807)
Rearranged locks in preallocate to avoid possible deadlock with compact_log_async_dump_metadata_NF. Cycle was: L->N rename/mkdir N->F compact_log_async_dump_metadata_NF F->L preallocate Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit 49316ab)
Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit d23f0b1)
Usually sequence of locking is 1) FileWriter 2) File. In _compact_log_async_LD_NF_D it was in reversed order. No real deadlock was possible, but lockdep complained. Bonus: Improved lock dependency graph. Fixes: https://tracker.ceph.com/issues/52939 Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit 7b7945d)
BlueFS fine grain lock refactor did break accounting in volume selection module. It caused ceph_test_objectstore to fail on SpilloverFixedTest test. Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit 41eb537)
It was possible to skip capture of file that was recently modified. New procedure under one log.lock flushed pending files and captures state. It is much less interruptible then I had hoped for but I cannot now do it better. Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit a296989)
Moves vselector size tracking outside _flush_special(). Function _compact_log_async...() updated sizes twice. Problem could not be solved by making second modification of size just update, as it will possibly disrupt vselector consistency check (_vselector_check()). Feature to track vselector consistency relies on the fact that either log.lock or nodes.lock are taken when the check is performed. Which is not true for _compact_log_async...(). Now _flush_special does not update vselector sizes by itself but leaves the update to the caller. Fixes: https://tracker.ceph.com/issues/54248 Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit 4bc0f61)
Fix bluefs volume selector in device_migrate_to_existing. Fix bluefs volume selector in _rewrite_log_and_layout_sync_LNF_LD. Fixes: https://tracker.ceph.com/issues/54248 Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit 3813416)
During phase of log fnode switch from new_log to actual log it is necessary to hold lock. Added that locking. Modified procedure of transporting extents from old_log to new_log. Now new_log gets additional extents, instead of removing from old_log; this shortens time when we need to hold log.lock. Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit f2acf52) Conflicts: src/os/bluestore/BlueFS.cc ( - Lacking ceph#41557 - misc stuff with vselector )
Fixed problem that fnode in _consume_dirty could be broken if during fsync() some other thread was writing to the file. Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit 49b0ee4)
Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit 9709b14)
Extract updating of num files and log size from _update_logger_stats and put it exactly where modification happens. It allows to escape problem of taking nodes.lock and log.lock. Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit 2815043)
Moved pending_release to struct dirty {}. Restructured BlueFS::open_for_write to modify pending_release under dirty.lock. Now all pending_release modifications are under dirty.lock. Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit 2b36f27)
Added comments. Removed comments. Fixed lock-tracking suffixes to functions stemming from change _compact_log_dump_metadata_N -> _compact_log_dump_metadata_NF Removed extra lock_fnode_print. Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit fd9ef8b)
Add function tags and comments related to locks. Fixed locking graph documentation. Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit a091ea7) Conflicts: src/os/bluestore/BlueFS.h (Lacking backport for ceph#41557)
Signed-off-by: Igor Fedotov <ifedotov@croit.io> (cherry picked from commit bd20741)
We can reuse _compact_log_dump_metadata_NF() instead Signed-off-by: Igor Fedotov <igor.fedotov@croit.io> (cherry picked from commit 285df4b)
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io> (cherry picked from commit 0fc0ced)
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io> (cherry picked from commit 05478fc)
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io> (cherry picked from commit 0bfc42a)
+ minor refactoring. Signed-off-by: Igor Fedotov <igor.fedotov@croit.io> (cherry picked from commit 0af2858)
Fixes problem with sync compaction (_rewrite_log_and_layout_sync). There was a problem with not updating log_seq after compacting log. It cause to stop _replay log right after first transaction. ... 20 bluefs _replay 0x0: op_dir_create sharding ... 20 bluefs _replay 0x0: op_dir_link sharding/def to 21 ... 20 bluefs _replay 0x0: op_jump_seq 1025 ... 10 bluefs _read h 0x555557c46400 0x1000~1000 from file(ino 1 size 0x1000 mtime 0.000000 allocated 410000 alloc_commit 410000 extents [1:0x1540000~410000]) ... 20 bluefs _read left 0xff000 len 0x1000 ... 20 bluefs _read got 4096 ... 10 bluefs _replay 0x1000: stop: seq 1025 != expected 1026 This is a product of bluefs fine grain locks refactor. Signed-off-by: Adam Kupczyk <akupczyk@redhat.com> (cherry picked from commit 2f8e370) Conflicts: src/test/objectstore/test_bluefs.cc (cherry picked from commit 4fd98ce)
jenkins test make check |
1 similar comment
jenkins test make check |
|
When using bluefs_shared_alloc_size one might get a long-lasting state when that large chunks are not available any more and fallback to shared device min alloc size occurs. The introduced cooldown is intended to prevent repetitive allocation attempts with bluefs_shared_alloc_size for a while. The rationale is to eliminate performance penalty these failing attempts might cause. Signed-off-by: Igor Fedotov <igor.fedotov@croit.io> (cherry picked from commit e52bcc8) Conflicts: src/common/options/global.yaml.in (legacy options declarations, no yamls in pacific)
b88bb13
to
c627275
Compare
Fixed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking that the use of UPDATE_INC op in BlueFS will be a blocker for this backport.
I completely forgot that we already merged #48915, which I believe was a mistake - we broke downgrading in the same version line.
But since it is already done, I think we can extent Pacific logic with this cumulative set of features.
jenkins test api |
Yes, we released #48915, but didn't seem to tell the user that can not downgrading. I only found out after the upgrade was completed. |
Hi, @ifed01 Why do we need bluefs_failed_shared_alloc_cooldown? Isn't it good for bluefs to directly use 4k alloc units? |
Using 4K chunks is potentially less effective in terms of metadata size - file allocation map will generally contain more entries which takes more RAM and might impact the performance. Hence we still trying to use large chunks if possible and rollback to them if this failed at some point |
thanks, I got it. |
This includes:
(os/bluestore: BlueFS fine grain locking #43794)
os/bluestore/bluefs: Fix improper vselector tracking in _flush_special() #44955
os/bluestore: enable 4K allocation unit for BlueFS #48854
4fd98ce
Signed-off-by: Igor Fedotov igor.fedotov@croit.io
Contribution Guidelines
To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "pacific"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
Checklist
Show available Jenkins commands
jenkins retest this please
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test dashboard cephadm
jenkins test api
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox
jenkins test windows