Ensure a majority of mon services remain available before entering maintenance mode #565
return fmt.Errorf("Need at least 3 mon, 1 mds, and 1 mgr besides %v", name)

// check if the remaining non OSD services is enough to maintain a healthy cluster
err = EnsureNonOsdSvcEnough(services, name, 3, 1, 1)
To avoid introducing a breaking change, I kept the same numbers as before.
efe5ad4 to 0b84b84
@UtkarshBhatthere any chance you could take another look, and maybe re-trigger the timed-out tests? Thanks a lot!
- Check if OSDs on the node are ``ok-to-stop`` to ensure sufficient redundancy to tolerate the loss
  of OSDs on the node.
- Check if the number of running services is greater than the minimum (3 MON, 1 MDS, 1 MGR)
- Check if the number of running services is greater than the minimum (majority of MON, 1 MDS, 1 MGR)
Is this criteria change made so that a minimal functional Ceph cluster remains available when a particular node goes into maintenance?
Asking because 1 MDS daemon may not be able to sustain FS IO requests if more than one FS volume is being consumed (one MDS per FS volume).
No, this modification only focuses on the mon services. But I agree that a proper check can be implemented for MDS / MGR too. Would bug reports be sufficient for now?
Yeah, a bug will help keep track of the status. Something like "ensure FS (MDS) availability during maintenance".
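For readers following this thread, the "majority of MON" criterion can be illustrated with a minimal Go sketch. This is not the code from this PR: `monMajority` and `ensureMonMajority` are hypothetical names, and the sketch assumes "majority" means strictly more than half of the MON services currently in the cluster.

```go
package main

import "fmt"

// monMajority returns the smallest number of MON services that is still a
// strict majority of total. Hypothetical helper, not part of MicroCeph.
func monMajority(total int) int {
	return total/2 + 1
}

// ensureMonMajority returns an error if stopping monsOnNode of the totalMons
// MON services would leave fewer than a majority running.
// Hypothetical helper; the real check in this PR may differ.
func ensureMonMajority(totalMons, monsOnNode int) error {
	remaining := totalMons - monsOnNode
	need := monMajority(totalMons)
	if remaining < need {
		return fmt.Errorf("need at least %d of %d mon services running, only %d would remain",
			need, totalMons, remaining)
	}
	return nil
}

func main() {
	// 3 MONs, one on the node entering maintenance: 2 remain, a majority of 3.
	fmt.Println(ensureMonMajority(3, 1))
	// 2 MONs, one on the node entering maintenance: 1 remains, not a majority of 2.
	fmt.Println(ensureMonMajority(2, 1))
}
```

Under these assumptions a 3-node deployment can put one node into maintenance, while a 2-node deployment cannot, since one remaining MON is not a majority of two.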
UtkarshBhatthere left a comment
Sorry for the delay in review. I have added some comments and suggestions. Please let me know if they are not applicable as is.
@chanchiwai-ray there is a suspicious timeout occurring in maintenance mode functional tests:
aa4d997 to 1ca6419
This is fixed in the latest commit: the problem seems to be that the ceph mon was intentionally disabled for that particular node running
Fixes: canonical#534
Co-authored-by: Chi Wai Chan <chiwai.chan@canonical.com>
Signed-off-by: Chi Wai Chan <chiwai.chan@canonical.com>
94b0d44 to 4134a83
Description
Supersedes #557 (the author of that PR is on a long break).
Fixes: #534
Type of change
Delete options that are not relevant.
How has this been tested?
Locally, on a 3-node MicroCeph deployment on LXD VMs.
Before
After
Contributor checklist
Please check that you have: