
mgr/cephadm: improve offline host handling, mostly around upgrade #48592

Merged
merged 7 commits on Jan 13, 2023

Conversation

@adk3798 (Contributor) commented Oct 21, 2022

Fixes: https://tracker.ceph.com/issues/57891

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

We won't be able to complete the upgrade if there are offline
hosts anyway, so we might as well abort immediately.

Signed-off-by: Adam King <adking@redhat.com>
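
To illustrate the behavior this commit describes, here is a minimal sketch of such a pre-flight check, assuming a hypothetical `upgrade_start` entry point and an `offline_hosts` set of unreachable hostnames; this is an illustration, not the actual diff.

```python
# Sketch only: abort an upgrade request up front when any host is offline.
# Class, method, and message names here are assumptions for illustration.

class OrchestratorError(Exception):
    """Stand-in for the orchestrator module's error type."""

class Upgrade:
    def __init__(self, offline_hosts: set) -> None:
        self.offline_hosts = offline_hosts  # hostnames cephadm could not reach

    def upgrade_start(self, image: str) -> str:
        if self.offline_hosts:
            # Fail right away rather than starting an upgrade that is
            # guaranteed to stall on the unreachable host(s).
            raise OrchestratorError(
                "Upgrade aborted - the following hosts are offline: "
                + ", ".join(sorted(self.offline_hosts)))
        return f"Initiating upgrade to {image}"

# Example: with an offline host present, the request fails immediately.
try:
    Upgrade(offline_hosts={"host2"}).upgrade_start("quay.io/ceph/ceph:v18")
except OrchestratorError as e:
    print(e)
```
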
…ling

The idea is to be able to know elsewhere that the OrchestratorError
we are looking at is specifically one raised due to a failure to
connect to a host. This can hopefully allow for some more
precise error handling.

Signed-off-by: Adam King <adking@redhat.com>
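
A minimal sketch of that idea, assuming a hypothetical subclass name (`OrchestratorConnectionError`) rather than whatever class the PR actually adds:

```python
# Sketch: a dedicated OrchestratorError subclass marks errors that come
# specifically from failing to connect to a host, so callers can tell the
# two cases apart. Names are assumptions, not necessarily the PR's code.

class OrchestratorError(Exception):
    """Stand-in for the orchestrator module's base error type."""

class OrchestratorConnectionError(OrchestratorError):
    """Raised only when cephadm cannot connect to a host."""

def classify(err: OrchestratorError) -> str:
    # Elsewhere in the code we can now distinguish connection failures
    # from any other orchestrator error and handle them more precisely.
    if isinstance(err, OrchestratorConnectionError):
        return "host-connection failure"
    return "other orchestrator error"

print(classify(OrchestratorConnectionError("Failed to connect to host host1")))
print(classify(OrchestratorError("something else went wrong")))
```
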
Right now failures to connect to a host during the upgrade result
in a "failed due to an unexpected exception" error. We can do a bit
better than that.

Fixes: https://tracker.ceph.com/issues/57891

Signed-off-by: Adam King <adking@redhat.com>
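
Roughly, the upgrade path can then catch that specific error and report something actionable; here is a sketch under the same assumed class name (not the actual diff):

```python
# Sketch: the upgrade loop catches the connection-specific error separately,
# so the reported failure names the unreachable host instead of the generic
# "failed due to an unexpected exception" message. Names are illustrative.

class OrchestratorError(Exception):
    pass

class OrchestratorConnectionError(OrchestratorError):
    """Assumed name for the connection-failure subclass from the previous commit."""

def _connect(host: str) -> None:
    # Stand-in for the real SSH/agent connection attempt.
    raise OrchestratorConnectionError(f"Failed to connect to host {host}")

def upgrade_daemon_on_host(host: str) -> str:
    try:
        _connect(host)
        return "ok"
    except OrchestratorConnectionError as e:
        # Targeted, actionable message for the offline-host case.
        return (f"Upgrade paused: {e}. Verify the host is online and "
                "reachable, then resume the upgrade.")
    except Exception as e:
        # All other failures keep the previous generic handling.
        return f"Upgrade: failed due to an unexpected exception: {e}"

print(upgrade_daemon_on_host("host1"))
```
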
…ption

Otherwise we'll return a message saying the host was not found
and to check 'ceph orch host ls', when the actual problem is that
the host is offline.

Signed-off-by: Adam King <adking@redhat.com>
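
For illustration, a sketch of a host check that distinguishes the two cases, with hypothetical names and messages:

```python
# Sketch: a host that is known but offline gets an offline-specific error
# instead of the generic "not found, check 'ceph orch host ls'" message.
# Names and message text are assumptions for illustration.

class OrchestratorError(Exception):
    pass

class OrchestratorConnectionError(OrchestratorError):
    """Assumed connection-failure subclass from the earlier commit."""

def check_host(hostname: str, inventory: set, offline_hosts: set) -> None:
    if hostname in offline_hosts:
        # Previously this case fell through to the "not found" branch below.
        raise OrchestratorConnectionError(
            f"Failed to connect to host {hostname}; the host appears to be offline")
    if hostname not in inventory:
        raise OrchestratorError(
            f"Host {hostname} not found. Use 'ceph orch host ls' to list managed hosts.")

# Example: host2 is managed but unreachable, so it is reported as offline.
try:
    check_host("host2", inventory={"host1", "host2"}, offline_hosts={"host2"})
except OrchestratorError as e:
    print(e)
```
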
Signed-off-by: Adam King <adking@redhat.com>
The health error is no longer valid if the upgrade has been
removed. If the issue is still present, we'll hit it again.

Fixes: https://tracker.ceph.com/issues/57891

Signed-off-by: Adam King <adking@redhat.com>
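
A sketch of that cleanup, assuming a hypothetical health-check name and helper method:

```python
# Sketch: when the upgrade is stopped/removed, clear the stale upgrade
# health error. The check name (UPGRADE_OFFLINE_HOST) and the helper method
# below are assumptions used only for illustration.

class Mgr:
    def __init__(self) -> None:
        self.health_checks: dict = {}

    def remove_health_warning(self, name: str) -> None:
        # Stand-in for a mgr helper that drops a named health check and
        # republishes the remaining ones.
        self.health_checks.pop(name, None)

def upgrade_stop(mgr: Mgr) -> str:
    # The warning described a failure of an upgrade that no longer exists;
    # if the underlying problem is still there, the next upgrade attempt
    # will raise it again, so clearing it here is safe.
    mgr.remove_health_warning("UPGRADE_OFFLINE_HOST")  # assumed check name
    return "Stopped upgrade"

mgr = Mgr()
mgr.health_checks["UPGRADE_OFFLINE_HOST"] = {"severity": "error", "summary": "..."}
print(upgrade_stop(mgr), mgr.health_checks)
```
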
Otherwise, it could remain in the list and cephadm could think
there is an offline host in the cluster when said host has
actually already been removed.

Signed-off-by: Adam King <adking@redhat.com>
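
A sketch of the idea, with hypothetical attribute and method names:

```python
# Sketch: keep the offline-hosts set in sync with host removal, so a host
# that has been removed from the cluster no longer counts as "offline".
# Attribute and method names are assumptions, not the actual diff.

class HostTracker:
    def __init__(self) -> None:
        self.inventory: set = set()
        self.offline_hosts: set = set()

    def remove_host(self, hostname: str) -> None:
        self.inventory.discard(hostname)
        # Without this, the removed host could linger in offline_hosts and
        # keep tripping offline-host checks (e.g. the upgrade pre-flight check).
        self.offline_hosts.discard(hostname)

tracker = HostTracker()
tracker.inventory.update({"host1", "host2"})
tracker.offline_hosts.add("host2")
tracker.remove_host("host2")
print(tracker.offline_hosts)  # set() - no stale offline entry
```
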
@adk3798 (Contributor, Author) commented Dec 7, 2022

jenkins retest this please

@adk3798 (Contributor, Author) commented Jan 9, 2023

jenkins retest this please

@rkachach (Contributor) left a comment

LGTM

@adk3798 (Contributor, Author) commented Jan 13, 2023

https://pulpito.ceph.com/adking-2023-01-12_12:11:23-orch:cephadm-wip-adk-testing-2023-01-11-2322-distro-default-smithi/

Quite a few failures, but I think I've pinned all of them down to either a known issue or a failure caused by a specific PR in the run. That being said, we should be able to merge the PRs that are neither of the two causing failures nor related to cephadm upgrade or the ingress service, since those are the tests that were affected.

11 failed jobs

  • 4 mds upgrade sequence jobs failing with timeout expired in wait_until_healthy. This was caused by an issue in the prometheus module that, at the time of testing, existed in the latest pacific devel container builds (which those tests use as an upgrade start point). The issue should go away with the next pacific devel container build, as the fix has already been merged into pacific.
  • 3 test_nfs task failures: Test failure: test_non_existent_cluster (tasks.cephfs.test_nfs.TestNFS). This was caused by one of the PRs in the run adjusting the return code of an nfs module function (not this PR).
  • 2 failed staggered upgrade tests. From what I can tell, the staggered upgrade itself actually went fine. For whatever reason the ceph versions JSON output didn't contain an "osd" section with the versions of the OSDs, which the test checks for. I ran two instances of the test on the main branch and the issue reproduced: https://pulpito.ceph.com/adking-2023-01-13_05:56:45-orch:cephadm-main-distro-default-smithi/
{
    "mon": {
        "ceph version 18.0.0-1762-gcb17f286 (cb17f286272f7ae9dbdf8117ca7b077b0a5cf650) reef (dev)": 3
    },
    "rgw": {
        "ceph version 16.2.0 (0c2054e95bcd9b30fdd908a79ac1d8bbc3394442) pacific (stable)": 2
    },
    "overall": {
        "ceph version 16.2.0 (0c2054e95bcd9b30fdd908a79ac1d8bbc3394442) pacific (stable)": 2,
        "ceph version 18.0.0-1762-gcb17f286 (cb17f286272f7ae9dbdf8117ca7b077b0a5cf650) reef (dev)": 3
    }
}

so I'm confident it's not caused by the PRs in the run. I will investigate further later.

10 dead jobs

  • 6 failures due to a known issue with the satellite server during RHEL tests: 'The DMI UUID of this host (00000000-0000-0000-0000-0CC47AD93798) matches other registered hosts:
  • 4 failures due to what seems to be a mismatch between the keepalived and haproxy configs, caused by another PR in the run (not this one).

I think most of the PRs in the run are okay for merging despite the raw failure count.
