
mgr/cephadm: improve offline host handling, mostly around upgrade #48592

Merged
merged 7 commits on Jan 13, 2023

Conversation

@adk3798 (Contributor) commented Oct 21, 2022

Fixes: https://tracker.ceph.com/issues/57891

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

We won't be able to complete the upgrade if there are offline
hosts anyway, so we might as well abort immediately.

Signed-off-by: Adam King <adking@redhat.com>
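
To illustrate the behavior this commit describes, here is a minimal sketch of such a pre-flight check, assuming a hypothetical `upgrade_start` entry point and an `offline_hosts` set of unreachable hostnames; this is an illustration, not the actual diff.

```python
# Sketch only: abort an upgrade request up front when any host is offline.
# Class, method, and message names here are assumptions for illustration.

class OrchestratorError(Exception):
    """Stand-in for the orchestrator module's error type."""

class Upgrade:
    def __init__(self, offline_hosts: set) -> None:
        self.offline_hosts = offline_hosts  # hostnames cephadm could not reach

    def upgrade_start(self, image: str) -> str:
        if self.offline_hosts:
            # Fail right away rather than starting an upgrade that is
            # guaranteed to stall on the unreachable host(s).
            raise OrchestratorError(
                "Upgrade aborted - the following hosts are offline: "
                + ", ".join(sorted(self.offline_hosts)))
        return f"Initiating upgrade to {image}"

# Example: with an offline host present, the request fails immediately.
try:
    Upgrade(offline_hosts={"host2"}).upgrade_start("quay.io/ceph/ceph:v18")
except OrchestratorError as e:
    print(e)
```
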
…ling

The idea is to be able to know elsewhere that the OrchestratorError
we are looking at is specifically one raised due to a failure to
connect to a host. This can hopefully allow for some more
precise error handling.

Signed-off-by: Adam King <adking@redhat.com>
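
A minimal sketch of that idea, assuming a hypothetical subclass name (`OrchestratorConnectionError`) rather than whatever class the PR actually adds:

```python
# Sketch: a dedicated OrchestratorError subclass marks errors that come
# specifically from failing to connect to a host, so callers can tell the
# two cases apart. Names are assumptions, not necessarily the PR's code.

class OrchestratorError(Exception):
    """Stand-in for the orchestrator module's base error type."""

class OrchestratorConnectionError(OrchestratorError):
    """Raised only when cephadm cannot connect to a host."""

def classify(err: OrchestratorError) -> str:
    # Elsewhere in the code we can now distinguish connection failures
    # from any other orchestrator error and handle them more precisely.
    if isinstance(err, OrchestratorConnectionError):
        return "host-connection failure"
    return "other orchestrator error"

print(classify(OrchestratorConnectionError("Failed to connect to host host1")))
print(classify(OrchestratorError("something else went wrong")))
```
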
Right now failures to connect to a host during the upgrade result
in a "failed due to an unexpected exception" error. We can do a bit
better than that.

Fixes: https://tracker.ceph.com/issues/57891

Signed-off-by: Adam King <adking@redhat.com>
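
Roughly, the upgrade path can then catch that specific error and report something actionable; here is a sketch under the same assumed class name (not the actual diff):

```python
# Sketch: the upgrade loop catches the connection-specific error separately,
# so the reported failure names the unreachable host instead of the generic
# "failed due to an unexpected exception" message. Names are illustrative.

class OrchestratorError(Exception):
    pass

class OrchestratorConnectionError(OrchestratorError):
    """Assumed name for the connection-failure subclass from the previous commit."""

def _connect(host: str) -> None:
    # Stand-in for the real SSH/agent connection attempt.
    raise OrchestratorConnectionError(f"Failed to connect to host {host}")

def upgrade_daemon_on_host(host: str) -> str:
    try:
        _connect(host)
        return "ok"
    except OrchestratorConnectionError as e:
        # Targeted, actionable message for the offline-host case.
        return (f"Upgrade paused: {e}. Verify the host is online and "
                "reachable, then resume the upgrade.")
    except Exception as e:
        # All other failures keep the previous generic handling.
        return f"Upgrade: failed due to an unexpected exception: {e}"

print(upgrade_daemon_on_host("host1"))
```
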
…ption

Otherwise we'll return a message saying the host was not found
and to check 'ceph orch host ls', when the actual problem is that
the host is offline.

Signed-off-by: Adam King <adking@redhat.com>
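
For illustration, a sketch of a host check that distinguishes the two cases, with hypothetical names and messages:

```python
# Sketch: a host that is known but offline gets an offline-specific error
# instead of the generic "not found, check 'ceph orch host ls'" message.
# Names and message text are assumptions for illustration.

class OrchestratorError(Exception):
    pass

class OrchestratorConnectionError(OrchestratorError):
    """Assumed connection-failure subclass from the earlier commit."""

def check_host(hostname: str, inventory: set, offline_hosts: set) -> None:
    if hostname in offline_hosts:
        # Previously this case fell through to the "not found" branch below.
        raise OrchestratorConnectionError(
            f"Failed to connect to host {hostname}; the host appears to be offline")
    if hostname not in inventory:
        raise OrchestratorError(
            f"Host {hostname} not found. Use 'ceph orch host ls' to list managed hosts.")

# Example: host2 is managed but unreachable, so it is reported as offline.
try:
    check_host("host2", inventory={"host1", "host2"}, offline_hosts={"host2"})
except OrchestratorError as e:
    print(e)
```
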
Signed-off-by: Adam King <adking@redhat.com>
The health error is no longer valid if the upgrade has been
removed. If the issue is still present, we'll hit it again.

Fixes: https://tracker.ceph.com/issues/57891

Signed-off-by: Adam King <adking@redhat.com>
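
A sketch of that cleanup, assuming a hypothetical health-check name and helper method:

```python
# Sketch: when the upgrade is stopped/removed, clear the stale upgrade
# health error. The check name (UPGRADE_OFFLINE_HOST) and the helper method
# below are assumptions used only for illustration.

class Mgr:
    def __init__(self) -> None:
        self.health_checks: dict = {}

    def remove_health_warning(self, name: str) -> None:
        # Stand-in for a mgr helper that drops a named health check and
        # republishes the remaining ones.
        self.health_checks.pop(name, None)

def upgrade_stop(mgr: Mgr) -> str:
    # The warning described a failure of an upgrade that no longer exists;
    # if the underlying problem is still there, the next upgrade attempt
    # will raise it again, so clearing it here is safe.
    mgr.remove_health_warning("UPGRADE_OFFLINE_HOST")  # assumed check name
    return "Stopped upgrade"

mgr = Mgr()
mgr.health_checks["UPGRADE_OFFLINE_HOST"] = {"severity": "error", "summary": "..."}
print(upgrade_stop(mgr), mgr.health_checks)
```
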
Otherwise, it could remain in the list and cephadm could think
there is an offline host in the cluster when said host has
actually already been removed.

Signed-off-by: Adam King <adking@redhat.com>
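
A sketch of the idea, with hypothetical attribute and method names:

```python
# Sketch: keep the offline-hosts set in sync with host removal, so a host
# that has been removed from the cluster no longer counts as "offline".
# Attribute and method names are assumptions, not the actual diff.

class HostTracker:
    def __init__(self) -> None:
        self.inventory: set = set()
        self.offline_hosts: set = set()

    def remove_host(self, hostname: str) -> None:
        self.inventory.discard(hostname)
        # Without this, the removed host could linger in offline_hosts and
        # keep tripping offline-host checks (e.g. the upgrade pre-flight check).
        self.offline_hosts.discard(hostname)

tracker = HostTracker()
tracker.inventory.update({"host1", "host2"})
tracker.offline_hosts.add("host2")
tracker.remove_host("host2")
print(tracker.offline_hosts)  # set() - no stale offline entry
```
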
@adk3798 (Contributor, Author) commented Dec 7, 2022

jenkins retest this please

@adk3798 (Contributor, Author) commented Jan 9, 2023

jenkins retest this please

@rkachach (Contributor) left a comment

LGTM

@adk3798 (Contributor, Author) commented Jan 13, 2023

https://pulpito.ceph.com/adking-2023-01-12_12:11:23-orch:cephadm-wip-adk-testing-2023-01-11-2322-distro-default-smithi/

Quite a few failures, but I think I've pinned all of them down to either a known issue or a failure caused by a specific PR in the run. That being said, we should be able to merge the PRs that are neither of the two causing failures nor related to cephadm upgrade or the ingress service, since those are the tests that were affected.

11 failed jobs

  • 4 mds upgrade sequence jobs failing with timeout expired in wait_until_healthy. This was caused by an issue in the prometheus module that, at the time of testing, existed in the latest pacific devel container builds (which those tests use as an upgrade start point). The issue should go away with the next pacific devel container build, as the fix has already been merged into pacific.
  • 3 test_nfs task failures: Test failure: test_non_existent_cluster (tasks.cephfs.test_nfs.TestNFS). This was caused by one of the PRs in the run adjusting the return code of an nfs module function (not this PR).
  • 2 failed staggered upgrade tests. From what I can tell, the staggered upgrade itself actually went fine. For whatever reason the ceph versions JSON output didn't contain an "osd" section with the versions of the OSDs, which the test checks for. I ran two instances of the test on the main branch and the issue reproduced: https://pulpito.ceph.com/adking-2023-01-13_05:56:45-orch:cephadm-main-distro-default-smithi/
{
    "mon": {
        "ceph version 18.0.0-1762-gcb17f286 (cb17f286272f7ae9dbdf8117ca7b077b0a5cf650) reef (dev)": 3
    },
    "rgw": {
        "ceph version 16.2.0 (0c2054e95bcd9b30fdd908a79ac1d8bbc3394442) pacific (stable)": 2
    },
    "overall": {
        "ceph version 16.2.0 (0c2054e95bcd9b30fdd908a79ac1d8bbc3394442) pacific (stable)": 2,
        "ceph version 18.0.0-1762-gcb17f286 (cb17f286272f7ae9dbdf8117ca7b077b0a5cf650) reef (dev)": 3
    }
}

so I'm confident it's not caused by the PRs in the run. I will investigate further later.

10 dead jobs

  • 6 failures due to a known issue with the satellite server during RHEL tests: 'The DMI UUID of this host (00000000-0000-0000-0000-0CC47AD93798) matches other registered hosts:
  • 4 failures due to what seems to be a mismatch between the keepalived and haproxy configs, caused by another PR in the run (not this one).

I think most of the PRs in the run are okay for merging despite the raw failure count.
