mgr/cephadm: improve offline host handling, mostly around upgrade #48592
Conversation
We won't be able to complete the upgrade if there are offline hosts anyway, so we might as well abort immediately. Signed-off-by: Adam King <adking@redhat.com>
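A rough sketch of that pre-flight check (the function name and error text are illustrative assumptions, not the merged cephadm code; only `OrchestratorError` corresponds to a real orchestrator class, stubbed out here so the snippet is self-contained):

```python
class OrchestratorError(Exception):
    """Stand-in for the orchestrator module's real base error type."""


def assert_no_offline_hosts(offline_hosts: set) -> None:
    """Refuse to start an upgrade while any host is already known to be offline."""
    if offline_hosts:
        raise OrchestratorError(
            "Upgrade aborted - the following hosts are offline: "
            + ", ".join(sorted(offline_hosts))
        )


# Example: assert_no_offline_hosts({'host2'}) raises with a clear message
# instead of letting the upgrade fail later with a vague exception.
```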
…ling The idea is to be able to know elsewhere that the OrchestratorError we are looking at is specifically one raised due to a failure to connect to a host. This can hopefully allow for some more precise error handling. Signed-off-by: Adam King <adking@redhat.com>
Right now failures to connect to a host during the upgrade result in a "failed due to an unexpected exception" error. We can do a bit better than that. Fixes: https://tracker.ceph.com/issues/57891 Signed-off-by: Adam King <adking@redhat.com>
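A hedged sketch of how these two commits fit together: a dedicated OrchestratorError subclass marks connection failures, and the upgrade path catches it to report something more useful than "failed due to an unexpected exception". The class, attribute, and function names below are assumptions for illustration, not the actual cephadm internals:

```python
import logging

logger = logging.getLogger(__name__)


class OrchestratorError(Exception):
    """Stand-in for the orchestrator module's base error type."""


class HostConnectionError(OrchestratorError):
    """Illustrative subclass raised when a host cannot be reached over SSH."""

    def __init__(self, message: str, hostname: str, addr: str) -> None:
        super().__init__(message)
        self.hostname = hostname
        self.addr = addr


def upgrade_step(do_upgrade) -> None:
    """Run one upgrade step, reporting connection failures distinctly."""
    try:
        do_upgrade()
    except HostConnectionError as e:
        # a connection failure now gets a targeted, actionable message ...
        logger.error("Upgrade: failed to connect to host %s at addr (%s): %s",
                     e.hostname, e.addr, e)
    except Exception:
        # ... while anything else still falls back to the generic handling
        logger.exception("Upgrade: failed due to an unexpected exception")
```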
…ption Otherwise we'll return a message about the host not being found and to check 'ceph orch host ls' when the actual problem is the host being offline. Signed-off-by: Adam King <adking@redhat.com>
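Illustrative only, assuming the module tracks a set of known hosts and a set of offline hosts: the point is to tell an unreachable host apart from a host cephadm has never heard of before suggesting 'ceph orch host ls'.

```python
def describe_host_problem(hostname: str, known_hosts: set, offline_hosts: set) -> str:
    """Distinguish an unreachable host from a host that is not in the cluster at all."""
    if hostname in offline_hosts:
        return (f"Host {hostname} is offline; bring it back online "
                f"(or remove it from the cluster) before retrying")
    if hostname not in known_hosts:
        return f"Host {hostname} not found; check 'ceph orch host ls' for known hosts"
    return f"Host {hostname} is known and online"
```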
Signed-off-by: Adam King <adking@redhat.com>
The health error is no longer valid if the upgrade has been removed. If the issue is still present, we'll hit it again. Fixes: https://tracker.ceph.com/issues/57891 Signed-off-by: Adam King <adking@redhat.com>
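A minimal sketch of that cleanup, assuming the upgrade-related warnings are stored in a health-check mapping; the check names used here are illustrative, not necessarily the exact names in cephadm:

```python
def clear_upgrade_health_checks(health_checks: dict) -> None:
    """Drop upgrade-related warnings once the upgrade itself has been removed."""
    for check in ('UPGRADE_EXCEPTION', 'UPGRADE_OFFLINE_HOST'):
        health_checks.pop(check, None)  # a new upgrade will re-raise it if still relevant
```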
Otherwise, it could remain in the list and cephadm could think there is an offline host in the cluster when said host has actually already been removed. Signed-off-by: Adam King <adking@redhat.com>
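A minimal sketch, assuming an `offline_hosts` set tracked alongside the host inventory (both names are assumptions): forgetting the offline status when the host is removed keeps a stale entry from making cephadm think the cluster still has an offline host.

```python
def remove_host(hostname: str, inventory: dict, offline_hosts: set) -> None:
    """Remove a host and forget its offline status so no stale entry lingers."""
    inventory.pop(hostname, None)
    offline_hosts.discard(hostname)  # drop the stale offline marker, if any
```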
jenkins retest this please
jenkins retest this please
LGTM
Quite a few failures, but I think I've pinned all of them down to either a known issue or a failure caused by a specific PR in the run. That being said, we should be able to merge the PRs that are not the two causing the failures and are not related to the cephadm upgrade or ingress service, since those tests were affected by the failures.

11 failed jobs
so confident it's not caused by the PRs in the run. Will investigate further later
10 Dead Jobs
I think most of the PRs in the run are okay for merging despite the raw failure count.
Fixes: https://tracker.ceph.com/issues/57891
Contribution Guidelines
To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "pacific"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
Checklist
Show available Jenkins commands
jenkins retest this please
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test dashboard cephadm
jenkins test api
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox
jenkins test windows