
Ensure a majority of mon services remain available before entering maintenance mode #565

Merged
UtkarshBhatthere merged 1 commit into canonical:main from chanchiwai-ray:SOLENG-1082/#534
Aug 13, 2025

Conversation

@chanchiwai-ray
Contributor

@chanchiwai-ray commented Jun 4, 2025

Description

Supersedes #557 (the author of that PR is on a long break).

Fixes: #534

Type of change


  • Bug fix (non-breaking change which fixes an issue)
  • Clean code (code refactor, test updates)
  • Documentation update (change to documentation)

How has this been tested?

Locally, on a 3-node MicroCeph deployment on LXD VMs.

Before

root@microceph-0:~# microceph status
MicroCeph deployment summary:
- microceph-0 (10.180.54.6)
  Services: mds, mgr, mon, osd
  Disks: 3
- microceph-1 (10.180.54.82)
  Services: mds, mgr, mon, osd
  Disks: 3
- microceph-2 (10.180.54.165)
  Services: mds, mgr, mon, osd
  Disks: 3

root@microceph-0:~# microceph cluster maintenance enter microceph-0
Error: failed to enter maintenance mode: error bringing node 'microceph-0' into maintenance: maintenance operations failed: [(need at least 3 mon, 1 mds, and 1 mgr services in the cluster besides those in node 'microceph-0')]

After

root@microceph-0:~# microceph cluster maintenance enter microceph-0
Check if osds.[1 2 3] in node 'microceph-0' are ok-to-stop. (succeeded)
Check if there are at least a majority of mon services, 1 mds service, and 1 mgr service in the cluster besides those in node 'microceph-0' (succeeded)
Run `ceph osd set noout`. (succeeded)
Assert osd has 'noout' flag set. (succeeded)

Contributor checklist

Please check that you have:

  • self-reviewed the code in this PR
  • added code comments, particularly in less straightforward areas
  • checked and added or updated relevant documentation
  • checked and added or updated relevant release notes
  • added tests to verify effectiveness of this change

Comment thread microceph/ceph/remove.go
return fmt.Errorf("Need at least 3 mon, 1 mds, and 1 mgr besides %v", name)

// check if the remaining non OSD services is enough to maintain a healthy cluster
err = EnsureNonOsdSvcEnough(services, name, 3, 1, 1)
Contributor Author


To avoid introducing a breaking change, I kept the same numbers as before.
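
For illustration, here is a minimal sketch of the kind of fixed-threshold check kept on the removal path. checkNonOsdSvcCounts is a hypothetical stand-in for the quoted EnsureNonOsdSvcEnough call; this is not the actual microceph code:

package main

import "fmt"

// checkNonOsdSvcCounts is a hypothetical stand-in for EnsureNonOsdSvcEnough:
// given the mon/mds/mgr counts that remain once the named node is excluded,
// it refuses the operation if any count falls below its minimum.
func checkNonOsdSvcCounts(mon, mds, mgr, minMon, minMds, minMgr int, node string) error {
	if mon < minMon || mds < minMds || mgr < minMgr {
		return fmt.Errorf("need at least %d mon, %d mds, and %d mgr besides %q",
			minMon, minMds, minMgr, node)
	}
	return nil
}

func main() {
	// With the historical minimums of 3 mon, 1 mds, 1 mgr, excluding one node
	// of a 3-node cluster leaves only 2 mons, so the operation is refused.
	if err := checkNonOsdSvcCounts(2, 1, 1, 3, 1, 1, "microceph-0"); err != nil {
		fmt.Println(err)
	}
}

Keeping the minimums at 3/1/1 here preserves the pre-existing behaviour on node removal, while the maintenance path switches to the majority-based criterion this PR introduces.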

Comment thread microceph/ceph/operations.go Outdated
Comment thread microceph/ceph/services.go Outdated
Comment thread microceph/ceph/services.go Outdated
@chanchiwai-ray
Contributor Author

@UtkarshBhatthere any chance you could take a look again, and maybe re-trigger the timed-out tests? Thanks a lot!

- Check if OSDs on the node are ``ok-to-stop`` to ensure sufficient redundancy to tolerate the loss
of OSDs on the node.
- Check if the number of running services is greater than the minimum (3 MON, 1 MDS, 1 MGR)
- Check if the number of running services is greater than the minimum (majority of MON, 1 MDS, 1 MGR)
Contributor


Is this criteria modification done so that a minimally functional Ceph cluster remains available when a particular node goes into maintenance?

Contributor


Asking because one MDS daemon may not be able to sustain FS I/O requests if more than one FS volume is being consumed (one MDS per FS volume).

Contributor Author


No, this modification only focuses on the mon services. But I agree that a proper check could be implemented for MDS / MGR too; would bug reports be sufficient for now?

Contributor


Yeah, a bug will help keep track of the status. Something like "ensure FS (MDS) availability during maintenance".
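
For illustration, the majority criterion discussed in this thread can be sketched as follows. monMajority and canEnterMaintenance are assumed names for this sketch, not the actual microceph implementation:

package main

import "fmt"

// monMajority returns how many mon services must stay up for the monitor
// quorum to survive: strictly more than half of the mons in the cluster.
func monMajority(totalMons int) int {
	return totalMons/2 + 1
}

// canEnterMaintenance reports whether taking the node's services offline still
// leaves a majority of mons, at least one mds, and at least one mgr elsewhere.
func canEnterMaintenance(totalMons, monsOnNode, remainingMds, remainingMgr int) bool {
	remainingMons := totalMons - monsOnNode
	return remainingMons >= monMajority(totalMons) && remainingMds >= 1 && remainingMgr >= 1
}

func main() {
	// Three mons, one per node: stopping one node leaves 2 of 3 mons,
	// still a majority, so entering maintenance is allowed.
	fmt.Println(canEnterMaintenance(3, 1, 1, 1)) // true

	// With only two mons in total, stopping one leaves 1, which is below
	// the majority of 2, so the pre-flight check would refuse to proceed.
	fmt.Println(canEnterMaintenance(2, 1, 1, 1)) // false
}

Requiring strictly more than half of the mons mirrors the quorum rule Ceph monitors rely on, which is why a 3-node deployment can take one node into maintenance while a 2-mon cluster could not.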

Comment thread microceph/ceph/monitor.go Outdated
Comment thread microceph/ceph/operations.go
Contributor

@UtkarshBhatthere left a comment


Sorry for the delay in review. I have added some comments and suggestions. Please let me know if they are not applicable as is.

@UtkarshBhatthere
Contributor

@chanchiwai-ray there is a suspicious timeout occurring in maintenance mode functional tests:

2025-07-07T17:14:52.703+0000 7f32535ab6c0  0 monclient(hunting): authenticate timed out after 300
[errno 110] RADOS timed out (error connecting to the cluster)

@chanchiwai-ray
Contributor Author

@chanchiwai-ray there is a suspicious timeout occurring in maintenance mode functional tests:

2025-07-07T17:14:52.703+0000 7f32535ab6c0  0 monclient(hunting): authenticate timed out after 300
[errno 110] RADOS timed out (error connecting to the cluster)

This is fixed in the latest commit: the problem seems to be that the ceph mon was intentionally disabled on the particular node running microceph.ceph health.

Fixes: canonical#534

Co-authored-by: Chi Wai Chan <chiwai.chan@canonical.com>
Signed-off-by: Chi Wai Chan <chiwai.chan@canonical.com>
@UtkarshBhatthere UtkarshBhatthere merged commit 04cf6ca into canonical:main Aug 13, 2025
39 of 40 checks passed

Development

Successfully merging this pull request may close these issues.

[Adjustment] maintenance pre-flight check adjustment on CheckNonOsdSvcEnoughOps
