os/bluestore: enable 4K allocation unit for BlueFS #48854
Conversation
Signed-off-by: Igor Fedotov <ifedotov@croit.io>
We can reuse _compact_log_dump_metadata_NF() instead Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
+ minor refactoring. Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Force-pushed from 7b09b5a to 6248340 (compare)
Splendid work!
Some comments on possible improvements, but the logic makes sense.
* that jumps the log write position to the new extent. At this point, the
* old extent(s) won't be written to, and reflect everything to compact.
* New events will be written to the new region that we'll keep.
* ASYNC LOG COMPACTION
I think we could cut the backward-size dependency if we split op_update_inc(ino1,)
between the STARTER and META_DUMP blocks.
SUPERBLOCK:
- contains ino.1 extents of STARTER
STARTER (seq 1):
- op_init
- op_update_inc(ino.1, extents of META_DUMP)
- op_jump(2, sizeof(STARTER))
- unused space
META_DUMP (seq 2):
- ... dump_metadata
- op_update_inc(ino.1, extents of LOG_CONTINUATION)
- op_jump(LOG_CONT.seq, sizeof(STARTER) + sizeof(META_DUMP))
- unused space
LOG_CONT (seq cont):
- the continuation of previous log
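The layout proposed above can be illustrated with a small sketch (plain Python, not Ceph code; block and op names follow the pseudocode above): each block's op_update_inc appends the extents of the *next* region, so no block needs to know the size of anything written after it, and replay in sequence order reconstructs the full log fnode.

```python
# Hypothetical replay walk over the proposed layout (illustrative only).
# Each block is (seq, ops); op_update_inc carries the extents of the
# region that follows it, so extents accumulate incrementally.

def replay(blocks):
    """Return the log extents accumulated while replaying blocks in order."""
    log_extents = []
    for seq, ops in blocks:
        for op, payload in ops:
            if op == "op_update_inc":
                log_extents.extend(payload)  # extents of the next region
            # op_init / op_jump / metadata ops omitted for brevity
    return log_extents

starter = (1, [("op_init", None),
               ("op_update_inc", ["META_DUMP extents"]),
               ("op_jump", 2)])
meta_dump = (2, [("op_update_inc", ["LOG_CONT extents"]),
                 ("op_jump", "cont")])

print(replay([starter, meta_dump]))
# -> ['META_DUMP extents', 'LOG_CONT extents']
```

The point of the split is visible here: STARTER only ever references META_DUMP's extents, so its size is fixed before META_DUMP is written.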
The rationale is to have the initial log fnode after compaction small enough to fit into the 4K superblock. Without that, compacted metadata might require an fnode longer than 4K, which goes beyond the existing 4K superblock; BlueFS asserts in this case for now. Hence the resulting log allocation disposition is:
- superblock (4K) keeps the initial log fnode, which refers to: op_init, op_update_inc(log), op_jump(next seq)
- updated log fnode built from superblock + the above op_update_inc refers to: compacted meta (a bunch of op_update and others)
  - more op_update_inc(log) to follow if the log is extended

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
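Why the fnode must stay small can be seen with a back-of-envelope calculation. The per-extent encoding cost and header overhead below are assumptions for illustration, not actual BlueFS encoding sizes:

```python
# Rough capacity estimate for a log fnode stored in a 4 KiB superblock.
# HEADER_OVERHEAD and EXTENT_ENCODED_SIZE are assumed values, not the
# real BlueFS on-disk encoding.
SUPERBLOCK_SIZE = 4096     # bytes
HEADER_OVERHEAD = 256      # assumed fixed fnode/superblock overhead
EXTENT_ENCODED_SIZE = 16   # assumed cost per encoded extent

max_extents = (SUPERBLOCK_SIZE - HEADER_OVERHEAD) // EXTENT_ENCODED_SIZE
print(max_extents)  # 240 with these assumptions
```

With only a few hundred extent slots available under any plausible encoding, a compacted log whose fnode lists all metadata extents directly can easily overflow the superblock, which is why the indirection through op_update_inc is needed.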
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
This includes a finer position specification during replay and logging the read size in hex. Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
This effectively enables 4K allocation units for BlueFS, but it doesn't turn them on by default for the sake of performance. Using a main device which lacks enough large free contiguous extents might do the trick though. Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
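As a hedged illustration of opting in (the option name bluefs_shared_alloc_size appears in this PR's commit messages; the 4096 value and the [osd] placement are assumptions about how one would exercise the new support, not a recommended production setting):

```ini
# ceph.conf -- illustrative only
[osd]
# Shrink the BlueFS allocation unit on the shared device to 4 KiB.
# Smaller units reduce wasted space but may cost performance, which is
# why this PR does not change the default.
bluefs_shared_alloc_size = 4096
```

In practice the fallback path described in this PR kicks in automatically when larger contiguous extents are unavailable, so explicitly lowering the value is mainly useful for testing.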
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
When using bluefs_shared_alloc_size one might get into a long-lasting state where chunks of that size are no longer available and a fallback to the shared device's min alloc size occurs. The introduced cooldown is intended to prevent repetitive allocation attempts with bluefs_shared_alloc_size for a while. The rationale is to eliminate the performance penalty these failing attempts might cause. Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
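The cooldown idea can be sketched as follows (plain Python, illustrative only; class, method, and field names are invented for this sketch and are not the Ceph implementation): after a failed large-unit allocation, skip further large-unit attempts until the cooldown period has elapsed.

```python
import time

class SharedAllocPolicy:
    """Hypothetical sketch of a cooldown for failed large-unit allocations."""

    def __init__(self, cooldown_secs=60.0):
        self.cooldown_secs = cooldown_secs
        self.last_failure = None  # timestamp of the last failed attempt

    def should_try_large_unit(self, now=None):
        """True if a bluefs_shared_alloc_size-sized attempt is worth making."""
        now = time.monotonic() if now is None else now
        if self.last_failure is None:
            return True
        return (now - self.last_failure) >= self.cooldown_secs

    def record_failure(self, now=None):
        """Start (or restart) the cooldown after a failed large allocation."""
        self.last_failure = time.monotonic() if now is None else now

policy = SharedAllocPolicy(cooldown_secs=60.0)
print(policy.should_try_large_unit(now=0.0))   # True: no failures yet
policy.record_failure(now=0.0)
print(policy.should_try_large_unit(now=10.0))  # False: still cooling down
print(policy.should_try_large_unit(now=61.0))  # True: cooldown elapsed
```

During the cooldown window the allocator would go straight to the shared device's min alloc size, avoiding the repeated doomed attempts the commit message describes.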
Force-pushed from 6248340 to e52bcc8 (compare)
jenkins test make check
jenkins test make check
jenkins test make check
We've been hitting this issue on Pacific while increasing @ifed01 Are you planning on backporting this PR to Pacific? Any chance it would make it into 16.2.11?
Alright, thanks for the info! I tried to cherry-pick your commits on top of I'm trying to compile a version of I'm assuming that with this fix, we would manage to bring the OSD back up, it would allocate what it needs on BlueFS using the 4KiB blocks, and then even if we downgraded to the mainline Pacific version, it would still be able to start as long as it doesn't need to allocate more space on BlueFS. Is that correct, or wouldn't it be able to cope with the smaller blocks?
I tried to do a backport yesterday, but it looks like this needs some additional non-trivial PRs to be backported first for Pacific. Hence I postponed that till this one is merged into the main branch.
I wouldn't recommend downgrading to Ceph versions which don't support 4K BlueFS once you have brought that support in.
Rados suite review: https://pulpito.ceph.com/?branch=wip-yuri2-testing-2023-01-23-0928 Failures, unrelated: Details:
hi @ifed01, I would like to know whether the problem this PR fixes will appear in Ceph 15.2.x?
@ifed01 thank you very much for your reply. If I use a Ceph 15 release, this bug will always exist, and I can only adapt this PR myself.
yes, that's true. Or upgrade to Pacific. And the bug is not that frequent though...
thank you 😀 |
@NUABO - just realized that Pacific lacks 4K bluefs support as well. You'll need the Quincy release to get it onboard. Sorry for the misleading info.
thanks for your friendly reminder (:
Fixes: https://tracker.ceph.com/issues/53466
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Contribution Guidelines
To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "pacific"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.