
os/bluestore: enable 4K allocation unit for BlueFS #48854

Merged
merged 13 commits into ceph:main on Jan 25, 2023

Conversation

ifed01
Contributor

@ifed01 ifed01 commented Nov 11, 2022

Fixes: https://tracker.ceph.com/issues/53466
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

Signed-off-by: Igor Fedotov <ifedotov@croit.io>
We can reuse _compact_log_dump_metadata_NF() instead

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
+ minor refactoring.

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Contributor

@aclamk aclamk left a comment


Splendid work!
Some comments on possible improvements, but the logic makes sense.

* that jumps the log write position to the new extent. At this point, the
* old extent(s) won't be written to, and reflect everything to compact.
* New events will be written to the new region that we'll keep.
* ASYNC LOG COMPACTION
Contributor


I think we could cut the backward-size dependency if we split op_update_inc(ino1,) between the STARTER and META_DUMP blocks.

  SUPERBLOCK:
  - contains ino.1 extents of STARTER

  STARTER (seq 1):
  - op_init
  - op_update_inc(ino.1, extents of META_DUMP)
  - op_jump(2, sizeof(STARTER))
  - unused space

  META_DUMP (seq 2):
  - ... dump_metadata
  - op_update_inc(ino.1, extents of LOG_CONTINUATION)
  - op_jump(LOG_CONT.seq, sizeof(STARTER) + sizeof(META_DUMP))
  - unused space

  LOG_CONT (seq cont):
  - the continuation of previous log

The rationale is to keep the initial log fnode after compaction small
enough to fit into the 4K superblock. Without that, the compacted metadata
might require an fnode longer than 4K, which goes beyond the existing 4K
superblock; BlueFS asserts in this case for now.
Hence the resulting log allocation disposition is like:
- superblock (4K) keeps the initial log fnode, which refers to:
  op_init, op_update_inc(log), op_jump(next seq)
- the updated log fnode, built from the superblock plus the above op_update_inc, refers to:
  compacted meta (a bunch of op_update and others)
- *
- more op_update_inc(log) to follow if the log is extended
- *

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
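
For readers following the commit message above, a rough illustrative sketch of the resulting log chain (not the actual BlueFS code or on-disk encoding; the op names come from the commit message, the rest is simplified):

  superblock (4K)
  - initial log fnode -> extents holding the STARTER region

  STARTER region
  - op_init
  - op_update_inc(log)   // extends the log fnode with the extents holding the compacted meta
  - op_jump(next seq)

  COMPACTED META and continuation
  - the compacted metadata dump (op_update and friends)
  - further op_update_inc(log) entries whenever the log is extended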
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
This includes a finer position specification during replay
and logging the read size in hex.

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
This effectively enables 4K allocation units for BlueFS, but it doesn't turn
them on by default for the sake of performance. Using a main device that
lacks enough large free contiguous extents might trigger them, though.

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
When using bluefs_shared_alloc_size, one might get into a long-lasting state
where chunks of that size are no longer available and a fallback to the shared
device's min alloc size occurs. The introduced cooldown is intended to prevent
repetitive allocation attempts with bluefs_shared_alloc_size for a while. The
rationale is to eliminate the performance penalty these failing attempts might
cause.

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
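
To make the cooldown idea concrete, here is a minimal sketch of such a guard. This is not the actual BlueFS implementation; the struct, field names, and the 600-second period are assumptions for illustration only.

  // Sketch only: a guard that suppresses bluefs_shared_alloc_size-sized
  // allocation attempts for a while after one of them fails.
  #include <chrono>

  struct SharedAllocCooldown {
    using clock = std::chrono::steady_clock;
    std::chrono::seconds period{600};   // assumed cooldown length
    clock::time_point last_failure{};   // default (epoch) value means "no failure yet"

    // May we retry an allocation with bluefs_shared_alloc_size right now?
    bool may_try_large_unit() const {
      return last_failure == clock::time_point{} ||
             clock::now() - last_failure >= period;
    }

    // Record that a large-unit allocation failed and we fell back to the
    // shared device's min alloc size.
    void record_failure() { last_failure = clock::now(); }
  };

Before each shared-device allocation, such a check would let BlueFS skip straight to the fallback size while the cooldown is active, avoiding the repeated failing attempts the commit message describes.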
@ifed01
Contributor Author

ifed01 commented Nov 17, 2022

jenkins test make check

@ifed01
Contributor Author

ifed01 commented Nov 17, 2022

jenkins test make check

@ifed01
Contributor Author

ifed01 commented Nov 17, 2022

jenkins test make check

@BenoitKnecht
Contributor

We've been hitting this issue on Pacific while increasing pg_num in one of our pools, causing several OSDs to crash, some with as little as 70% usage.

@ifed01 Are you planning on backporting this PR to Pacific? Any chance it would make it into 16.2.11?

@ifed01
Contributor Author

ifed01 commented Jan 11, 2023

Yes, backports are planned, but I doubt this will get into 16.2.11, which is to be released soon.

@BenoitKnecht
Contributor

Alright, thanks for the info!

I tried to cherry-pick your commits on top of pacific, but I'm really not sure I managed to resolve conflicts correctly, so if you're already working on a backport on a different branch, I'd be very much interested in checking it out. 🙂

I'm trying to compile a version of ceph-osd with this fix in case we get crashes again, because otherwise the only workaround would be to copy the DB to a separate device, which is rather cumbersome.

I'm assuming that with this fix, we would manage to bring the OSD back up, it would allocate what it needs on BlueFS using 4KiB blocks, and then even if we downgraded to the mainline Pacific version, it would still be able to start as long as it doesn't need to allocate more space on BlueFS. Is that correct, or would it not be able to cope with the smaller blocks?

@ifed01
Contributor Author

ifed01 commented Jan 12, 2023

I tried to do a backport yesterday, but it looks like some additional non-trivial PRs need to be backported to Pacific first. Hence I've postponed that until this one is merged into the main branch.

I wouldn't recommend downgrading to Ceph versions that don't support 4K BlueFS once you've brought that support in.
Instead of all the tricks with unofficial backports, you might want to set bluefs_shared_alloc_size to 32K to recover broken OSDs. We've used that workaround a few times and it looks safe enough (which isn't the case when setting it to 4K!). But generally this won't fix the issue permanently, and you could face it again after a while if space fragmentation goes on...
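
For anyone considering that workaround, a hedged sketch of how it might be applied on a non-cephadm deployment (osd.12 is a placeholder OSD id; verify the option name and value against your release before relying on it):

  # Lower the BlueFS allocation unit on the shared main device to 32K so the
  # OSD can still find usable extents on a fragmented device, then restart it.
  ceph config set osd.12 bluefs_shared_alloc_size 32768
  systemctl restart ceph-osd@12

As noted above, this is a mitigation rather than a permanent fix, and going below 32K is not considered safe on releases without 4K BlueFS support.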

@ljflores
Contributor

Rados suite review: https://pulpito.ceph.com/?branch=wip-yuri2-testing-2023-01-23-0928

Failures, unrelated:
1. https://tracker.ceph.com/issues/58585 -- new tracker
2. https://tracker.ceph.com/issues/58256 -- fix merged to latest main
3. https://tracker.ceph.com/issues/58475
4. https://tracker.ceph.com/issues/57754 -- closed
5. https://tracker.ceph.com/issues/57546 -- fix is in testing

Details:
1. rook: failed to pull kubelet image - Ceph - Orchestrator
2. ObjectStore/StoreTestSpecificAUSize.SpilloverTest/2: Expected: (logger->get(l_bluefs_slow_used_bytes)) >= (16 * 1024 * 1024), actual: 0 vs 16777216 - Ceph - RADOS
3. test_dashboard_e2e.sh: Conflicting peer dependency: postcss@8.4.21 - Ceph - Mgr - Dashboard
4. test_envlibrados_for_rocksdb.sh: update-alternatives: error: alternative path /usr/bin/gcc-11 doesn't exist - Ceph - RADOS
5. rados/thrash-erasure-code: wait_for_recovery timeout due to "active+clean+remapped+laggy" pgs - Ceph - RADOS

@yuriw yuriw merged commit c0309cf into ceph:main Jan 25, 2023
@NUABO
Contributor

NUABO commented May 29, 2023

Hi @ifed01, I would like to know whether the problem this PR fixes can also appear in Ceph 15.2.x?

@ifed01
Contributor Author

ifed01 commented May 29, 2023

Hi @NUABO - no, there are no plans to backport this fix to Octopus, as it has reached end-of-life now.

@NUABO
Contributor

NUABO commented May 29, 2023

@ifed01 Thank you very much for your reply. If I stay on Ceph 15, this bug will always exist, so I can only adapt this PR myself.

@ifed01
Contributor Author

ifed01 commented May 29, 2023

Yes, that's true. Or upgrade to Pacific. The bug is not that frequent, though...

@NUABO
Contributor

NUABO commented May 29, 2023

thank you 😀

@ifed01
Contributor Author

ifed01 commented May 29, 2023

@NUABO - I just realized that Pacific lacks 4K BlueFS support as well. You'll need the Quincy release to get it on board. Sorry for misleading you.

@NUABO
Contributor

NUABO commented May 29, 2023

thanks for your friendly reminder (:
