fix: Add waitready command to verify cluster ready#683

Open
johnramsden wants to merge 7 commits into canonical:main from johnramsden:john/CEPH-1590-wait-ready

Conversation

@johnramsden
Member

@johnramsden johnramsden commented Feb 27, 2026

Description

When an operator attempts to do something before the cluster is up, they can receive unexpected failures because bootstrap is not finished or microcluster is not yet available. This can be particularly problematic in CI or scripting.

Add a waitready subcommand (similar to lxd waitready: https://manpages.debian.org/unstable/lxd/lxd.waitready.1)

To confirm the cluster is up, we check that the microcluster daemon is ready and that Ceph is ready (ceph -s).

On failure (for example, if the cluster has not been bootstrapped yet) we get a message like the following:

microceph waitready --timeout 30
Error: ceph not ready: timed out waiting for Ceph to become ready: context deadline exceeded

Running the following, waitready should wait for the bootstrap to complete before status runs, and status should then succeed:

sudo microceph cluster bootstrap &
sudo microceph waitready
sudo microceph status
[1] 35966
MicroCeph deployment summary:
- microceph (10.56.203.112)
  Services: mds, mgr, mon
  Disks: 0

Also add --storage flag:

When --storage is passed, after daemon and monitor readiness, poll until enough OSDs are up to satisfy pool replication requirements.

The required count is max(pool.Size) across all pools, falling back to osd_pool_default_size if no pools exist.
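That count logic can be sketched as below. The Pool type and function names are hypothetical stand-ins for whatever shape GetOSDPools actually returns; the fallback argument corresponds to the osd_pool_default_size config value.

```go
package main

import "fmt"

// Pool mirrors the fields we need from GetOSDPools (hypothetical shape).
type Pool struct {
	Name string
	Size int // replication factor of the pool
}

// requiredOSDCount returns max(pool.Size) across all pools, falling back
// to defaultSize (osd_pool_default_size) when no pools exist yet.
func requiredOSDCount(pools []Pool, defaultSize int) int {
	if len(pools) == 0 {
		return defaultSize
	}
	required := 0
	for _, p := range pools {
		if p.Size > required {
			required = p.Size
		}
	}
	return required
}

func main() {
	// No pools yet: fall back to osd_pool_default_size.
	fmt.Println(requiredOSDCount(nil, 3)) // 3
	// Existing pools: take the largest replication size.
	fmt.Println(requiredOSDCount([]Pool{{"rbd", 3}, {"cephfs_meta", 2}}, 3)) // 3
}
```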

Update GetOSDPools to accept a context, allowing us to reuse this functionality

Fixes #653

Type of change

  • Bug fix (non-breaking change which fixes an issue)

How has this been tested?

Added tests demonstrating waiting and timeout prior to bootstrap, and waiting succeeding post bootstrap.

Contributor checklist

Please check that you have:

  • self-reviewed the code in this PR
  • added code comments, particularly in less straightforward areas
  • checked and added or updated relevant documentation
  • checked and added or updated relevant release notes
  • added tests to verify effectiveness of this change

@johnramsden
Member Author

One note: I'm not sure whether ceph -s is completely sufficient, or whether there's anything else we want to wait on.

Collaborator

@sabaini sabaini left a comment


Hey @johnramsden thank you, lgtm in general, two comments inline

}

ctx := context.Background()
if c.flagTimeout > 0 {
Collaborator


Minor nit: should we be erroring out if operators pass in a negative timeout value?

Member Author


Good catch. I set it to unsigned and modified the error message to include a bit of context. I think it's nicer to have the type safety of an unsigned value, but it means we receive a somewhat messy message:

Error: invalid argument "-1" for "--timeout" flag: strconv.ParseUint: parsing "-1": invalid syntax: timeout must be a positive number of seconds

The alternative is to switch back to an integer and have a nicer message that is just

Error: invalid argument "-1" for "--timeout" flag: timeout must be a positive number of seconds

But I think this tradeoff is worth it for the type safety.

Ensures that we do not wait on ceph -s indefinitely

Signed-off-by: John Ramsden <john.ramsden@canonical.com>
The value should be non-negative, so using an unsigned value is more correct and gives us the expected error.

Set custom parsing error:

Error: invalid argument "-1" for "--timeout" flag: strconv.ParseUint: parsing "-1": invalid syntax: timeout must be a positive number of seconds

Rather than the default:

Error: invalid argument "-1" for "--timeout" flag: strconv.ParseUint: parsing "-1": invalid syntax

Signed-off-by: John Ramsden <john.ramsden@canonical.com>
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
@johnramsden johnramsden requested a review from sabaini March 2, 2026 21:44
@UtkarshBhatthere
Contributor

One note I have is I'm not sure if ceph -s is completely sufficient and if there's anything else we want to wait on

@johnramsden I can see an operator wanting to wait for two things: 1. Ceph cluster bootstrap (ceph -s works) and 2. storage ready (OSDs enrolled, at least 1 for rf==1, or 3 otherwise)

Contributor

@UtkarshBhatthere UtkarshBhatthere left a comment


Overall lgtm. That being said, imo a --storage flag would make this even better. Since microceph has opinions on what a minimum storage setup should be, it should wait until enough OSDs to spawn a storage pool are available (i.e. if rf==1 -> one OSD, and if rf==3 -> 3 OSDs). I don't know what the criteria should be for EC pools.

Collaborator

@sabaini sabaini left a comment


Hey @johnramsden thanks for the update, some nits below

johnramsden and others added 2 commits March 3, 2026 10:32
Leave better comment mentioning 'or zero'
Connect all the relevant interfaces
Leave a comment regarding bootstrap has not happened

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: John Ramsden <john.ramsden@canonical.com>
When --storage is passed, after daemon and monitor readiness, poll until enough OSDs are up to satisfy pool replication requirements.

The required count is max(pool.Size) across all pools, falling back to osd_pool_default_size if no pools exist.

Update GetOSDPools to accept a context allowing us to reuse functionality

Signed-off-by: John Ramsden <john.ramsden@canonical.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
@johnramsden
Member Author

Overall lgtm. That being said, imo a --storage flag would make this even better. Since microceph has opinions on what a minimum storage setup should be, it should wait until enough OSDs to spawn a storage pool are available (i.e. if rf==1 -> one OSD, and if rf==3 -> 3 OSDs). I don't know what the criteria should be for EC pools.

Thanks, this is a great suggestion. I added it; I would appreciate it if you could review.

I did touch a different part of the codebase that I thought I could reuse, just adding a context. I do not believe there should be any implications, other than a timeout potentially occurring if a client disconnects.

Collaborator

@sabaini sabaini left a comment


I'm +1, interested to hear @UtkarshBhatthere's thoughts

Move getRequiredOSDCount() inside the polling loop so it retries

Signed-off-by: John Ramsden <john.ramsden@canonical.com>

Development

Successfully merging this pull request may close these issues.

microceph wait-ready a way to wait for the microceph cluster untill it's up

3 participants