mgr/cephadm: make scheduler able to accommodate offline/maintenance hosts #42690

Merged
merged 1 commit into from Aug 18, 2021


@adk3798 adk3798 commented Aug 5, 2021

Fixes: https://tracker.ceph.com/issues/51027

Signed-off-by: Adam King <adking@redhat.com>

For reference, this PR is marked as fixing that tracker because removing daemons from offline hosts was what caused the issue. As part of the pre-remove step for mons, we remove them from the monmap. Any host that was rebooted could be marked offline, at which point cephadm would try to remove the mon. It would complete the pre-remove step of removing the mon from the monmap but then fail to remove the daemon itself, because it could not ssh into the host. When the host came back up, the mon would be stuck in a stopped state because it had already been removed from the monmap. By not allowing cephadm to remove any daemons from offline hosts, this issue should be avoided.

Also, right now the name I'm using for hosts in maintenance or offline mode, on which we do not want to add or remove daemons, is "unreachable hosts", but if anyone has a better idea for that name I'll change it.
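
To make the idea concrete, here is a minimal sketch of how a scheduler could leave "unreachable hosts" alone. It is not the code from this PR: `Daemon`, `plan_changes`, and their parameters are hypothetical names chosen for illustration, and the real cephadm scheduler tracks considerably more state.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Daemon:
    # Hypothetical stand-in for a daemon record; not cephadm's real type.
    daemon_type: str
    hostname: str


def plan_changes(
    desired_hosts: List[str],
    existing: List[Daemon],
    unreachable_hosts: Set[str],
) -> Tuple[List[str], List[Daemon]]:
    """Return (hosts to add a daemon on, daemons to remove).

    Hosts in `unreachable_hosts` (offline or in maintenance) are never
    touched: nothing is added there and nothing is removed from there.
    """
    desired = set(desired_hosts)
    occupied = {d.hostname for d in existing}

    # Only add on hosts that are wanted, empty, and reachable.
    to_add = [
        h for h in desired_hosts
        if h not in occupied and h not in unreachable_hosts
    ]
    # Only remove daemons that are unwanted AND live on a reachable host,
    # so a pre-remove step never runs for a daemon we cannot actually stop.
    to_remove = [
        d for d in existing
        if d.hostname not in desired and d.hostname not in unreachable_hosts
    ]
    return to_add, to_remove


# Example: host1 rebooted and dropped out of the desired placement.
mons = [Daemon("mon", "host1"), Daemon("mon", "host2")]
to_add, to_remove = plan_changes(["host2", "host3"], mons, {"host1"})
# The mon on host1 is left alone; host3 gets a new mon.
```

With an empty `unreachable_hosts` set, the same call would schedule the mon on host1 for removal, which is the situation that led to the stuck mon described above.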

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug


@adk3798 adk3798 requested a review from a team as a code owner August 5, 2021 20:20

@cfsnyder cfsnyder left a comment


We also ran into this issue of losing our mon when a host rebooted. I was working on a fix myself, until I noticed that you already had a PR open here. Changes look good to me!

@cfsnyder

@ceph/orchestrators if possible, it would be nice to get this into the next Pacific release. Having monitors go down during host reboots in a way that requires manual intervention to bring them back up seems like a relatively high priority. I'll handle the backport PR if we can get this reviewed and merged.

@sebastian-philipp sebastian-philipp added the wip-swagner-testing My Teuthology tests label Aug 17, 2021
@sebastian-philipp
