Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pacific: os/bluestore: cumulative bluefs backport #52212

Merged
merged 38 commits into from Aug 15, 2023

Conversation

ifed01
Copy link
Contributor

@ifed01 ifed01 commented Jun 27, 2023

This includes:

Signed-off-by: Igor Fedotov igor.fedotov@croit.io

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

aclamk and others added 30 commits June 27, 2023 13:51
This refactor prepares flush_and_sync_log and compact_log_async for fine-grain locks in BlueFS.
There is no new logic introduced, but refactor is accompanied by some new comments.

Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
(cherry picked from commit 05703cc)
This refactor prepares _flush for fine-grain locks in BlueFS.
Introduced _flush_special, a flush dedicated to bluefs special files (ino=1) and (ino=0).
Function _flush no longer accepts these special files.

Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
(cherry picked from commit edebf9f)
…ocks

Splits bluefs lock into log, dirty, dirs, file and writer locks.
This breaks severe locking issues, and makes bluefs more multithreaded.

Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
(cherry picked from commit e74474d)

 Conflicts:
	src/os/bluestore/BlueFS.cc
 (misordered backports, lacking ceph#48171
  in the source commit)
Reorganize BlueFS state variables into separate domains: 1) log, 2) dirty, 3) nodes.
Each has separate lock. This change is intended to make it easier to control which locks
need to be held when specific elements are modified.

Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
(cherry picked from commit b661fa2)

 Conflicts:
	src/os/bluestore/BlueFS.cc
 (misordered backports, lacking ceph#48171 in the source commit)
Splits seq into seq_live and seq_stable. Cleans up log sequencing.

Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
(cherry picked from commit 80ed7b3)
This is modification that only changes names of functions, so tracking of potential deadlocks is simpler.
All internal functions start with _.

Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
(cherry picked from commit dc78769)
Modified File.dirty_seq to capture dirty.seq_stable instead of 0. This is used to distunguish
files already serialized for compact_log_async_dump_metadata() function.

Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
(cherry picked from commit 6590a30)
Extracted _maybe_compact_log outside of file lock.
The sequence FL could deadlock with LNF that is executed in _async_dump_metadata.

Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
(cherry picked from commit c967c57)
Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
(cherry picked from commit eac1807)
Rearranged locks in preallocate to avoid possible deadlock with
compact_log_async_dump_metadata_NF.
Cycle was:
L->N rename/mkdir
N->F compact_log_async_dump_metadata_NF
F->L preallocate

Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
(cherry picked from commit 49316ab)
Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
(cherry picked from commit d23f0b1)
Usually sequence of locking is 1) FileWriter 2) File.
In _compact_log_async_LD_NF_D it was in reversed order.
No real deadlock was possible, but lockdep complained.

Bonus: Improved lock dependency graph.

Fixes: https://tracker.ceph.com/issues/52939
Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
(cherry picked from commit 7b7945d)
BlueFS fine grain lock refactor did break accounting in volume selection module.
It caused ceph_test_objectstore to fail on SpilloverFixedTest test.

Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
(cherry picked from commit 41eb537)
It was possible to skip capture of file that was recently modified.
New procedure under one log.lock flushed pending files and captures state.
It is much less interruptible then I had hoped for but I cannot now do it better.

Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
(cherry picked from commit a296989)
Moves vselector size tracking outside _flush_special().
Function _compact_log_async...() updated sizes twice.
Problem could not be solved by making second modification of size just update,
as it will possibly disrupt vselector consistency check (_vselector_check()).
Feature to track vselector consistency relies on the fact that either log.lock or nodes.lock
are taken when the check is performed. Which is not true for _compact_log_async...().

Now _flush_special does not update vselector sizes by itself but leaves the update to
the caller.

Fixes: https://tracker.ceph.com/issues/54248

Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
(cherry picked from commit 4bc0f61)
Fix bluefs volume selector in device_migrate_to_existing.
Fix bluefs volume selector in _rewrite_log_and_layout_sync_LNF_LD.

Fixes: https://tracker.ceph.com/issues/54248

Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
(cherry picked from commit 3813416)
During phase of log fnode switch from new_log to actual log it is necessary to hold lock.
Added that locking.
Modified procedure of transporting extents from old_log to new_log.
Now new_log gets additional extents, instead of removing from old_log; this shortens time
when we need to hold log.lock.

Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
(cherry picked from commit f2acf52)

 Conflicts:
	src/os/bluestore/BlueFS.cc
(
- Lacking ceph#41557
- misc stuff with vselector
)
Fixed problem that fnode in _consume_dirty could be broken if during fsync()
some other thread was writing to the file.

Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
(cherry picked from commit 49b0ee4)
Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
(cherry picked from commit 9709b14)
Extract updating of num files and log size from _update_logger_stats
and put it exactly where modification happens.
It allows to escape problem of taking nodes.lock and log.lock.

Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
(cherry picked from commit 2815043)
Moved pending_release to struct dirty {}.
Restructured BlueFS::open_for_write to modify pending_release under dirty.lock.
Now all pending_release modifications are under dirty.lock.

Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
(cherry picked from commit 2b36f27)
Added comments.
Removed comments.
Fixed lock-tracking suffixes to functions stemming from change
_compact_log_dump_metadata_N -> _compact_log_dump_metadata_NF
Removed extra lock_fnode_print.

Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
(cherry picked from commit fd9ef8b)
Add function tags and comments related to locks.
Fixed locking graph documentation.

Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
(cherry picked from commit a091ea7)

 Conflicts:
	src/os/bluestore/BlueFS.h
 (Lacking backport for ceph#41557)
Signed-off-by: Igor Fedotov <ifedotov@croit.io>
(cherry picked from commit bd20741)
We can reuse _compact_log_dump_metadata_NF() instead

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
(cherry picked from commit 285df4b)
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
(cherry picked from commit 0fc0ced)
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
(cherry picked from commit 05478fc)
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
(cherry picked from commit 0bfc42a)
+ minor refactoring.

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
(cherry picked from commit 0af2858)
Fixes problem with sync compaction (_rewrite_log_and_layout_sync).
There was a problem with not updating log_seq after compacting log.

It cause to stop _replay log right after first transaction.

... 20 bluefs _replay 0x0:  op_dir_create sharding
... 20 bluefs _replay 0x0:  op_dir_link  sharding/def to 21
... 20 bluefs _replay 0x0:  op_jump_seq 1025
... 10 bluefs _read h 0x555557c46400 0x1000~1000 from file(ino 1 size 0x1000 mtime 0.000000 allocated 410000 alloc_commit 410000 extents [1:0x1540000~410000])
... 20 bluefs _read left 0xff000 len 0x1000
... 20 bluefs _read got 4096
... 10 bluefs _replay 0x1000: stop: seq 1025 != expected 1026

This is a product of bluefs fine grain locks refactor.

Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
(cherry picked from commit 2f8e370)

Conflicts:
	src/test/objectstore/test_bluefs.cc
(cherry picked from commit 4fd98ce)
@ifed01
Copy link
Contributor Author

ifed01 commented Jun 28, 2023

jenkins test make check

1 similar comment
@ifed01
Copy link
Contributor Author

ifed01 commented Jun 28, 2023

jenkins test make check

@aclamk
Copy link
Contributor

aclamk commented Aug 7, 2023

options/global.yaml.in dragged in by mistake.

When using bluefs_shared_alloc_size one might get a long-lasting state when
that large chunks are not available any more and fallback to shared
device min alloc size occurs. The introduced cooldown is intended to
prevent repetitive allocation attempts with bluefs_shared_alloc_size for
a while. The rationale is to eliminate performance penalty these failing
attempts might cause.

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
(cherry picked from commit e52bcc8)

 Conflicts:
	src/common/options/global.yaml.in
 (legacy options declarations, no yamls in pacific)
@ifed01
Copy link
Contributor Author

ifed01 commented Aug 7, 2023

options/global.yaml.in dragged in by mistake.

Fixed

@aclamk aclamk added the needs-qa label Aug 7, 2023
Copy link
Contributor

@aclamk aclamk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking that the use of UPDATE_INC op in BlueFS will be a blocker for this backport.
I completely forgot that we already merged #48915, which I believe was a mistake - we broke downgrading in the same version line.

But since it is already done, I think we can extent Pacific logic with this cumulative set of features.

@aclamk
Copy link
Contributor

aclamk commented Aug 7, 2023

jenkins test api

@YiteGu
Copy link
Contributor

YiteGu commented Aug 9, 2023

I was thinking that the use of UPDATE_INC op in BlueFS will be a blocker for this backport. I completely forgot that we already merged #48915, which I believe was a mistake - we broke downgrading in the same version line.

But since it is already done, I think we can extent Pacific logic with this cumulative set of features.

Yes, we released #48915, but didn't seem to tell the user that can not downgrading. I only found out after the upgrade was completed.

@YiteGu
Copy link
Contributor

YiteGu commented Aug 10, 2023

Hi, @ifed01 Why do we need bluefs_failed_shared_alloc_cooldown? Isn't it good for bluefs to directly use 4k alloc units?

@ifed01
Copy link
Contributor Author

ifed01 commented Aug 10, 2023

Hi, @ifed01 Why do we need bluefs_failed_shared_alloc_cooldown? Isn't it good for bluefs to directly use 4k alloc units?

Using 4K chunks is potentially less effective in terms of metadata size - file allocation map will generally contain more entries which takes more RAM and might impact the performance. Hence we still trying to use large chunks if possible and rollback to them if this failed at some point

@YiteGu
Copy link
Contributor

YiteGu commented Aug 11, 2023

Hi, @ifed01 Why do we need bluefs_failed_shared_alloc_cooldown? Isn't it good for bluefs to directly use 4k alloc units?

Using 4K chunks is potentially less effective in terms of metadata size - file allocation map will generally contain more entries which takes more RAM and might impact the performance. Hence we still trying to use large chunks if possible and rollback to them if this failed at some point

thanks, I got it.

@ljflores
Copy link
Contributor

@yuriw yuriw merged commit c216ab5 into ceph:pacific Aug 15, 2023
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants