Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cephadm: allow redeploy of daemons in error state if container running #39385

Merged
merged 1 commit into from Feb 13, 2021

Conversation

adk3798
Copy link
Contributor

@adk3798 adk3798 commented Feb 9, 2021

Signed-off-by: Adam King adking@redhat.com

Right now any daemon that is in an error state, has its container running and has ports to check cannot be redeployed as the binary will still consider it an initial deployment and check the ports which will be in use by the container. In situations like
those described here https://tracker.ceph.com/issues/49013 where the error state is caused by a non-updated unit.run file a redeploy could fix the daemon. However, the redeploy would be blocked due to the ports being in use and the user would be required to manually go stop the services or fix the unit.run files. With this change, a situation like this could be fixed by simply redeploying the daemons using orchestrator commands instead.

Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

src/cephadm/cephadm Outdated Show resolved Hide resolved
src/cephadm/cephadm Outdated Show resolved Hide resolved
(_, state, _) = check_unit(ctx, unit_name)
if state == 'running':
if state == 'running' or is_container_running(ctx, container_name):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check_unit returns (enabled, state, installed). What about simplifying this to

    (_, _, installed) = check_unit(ctx, unit_name)
    if installed:

My thinking is, independent of the state, we're redeploying the daemon, if it is already installed. wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds right. I'm going to test this and if it works I'll swap to this change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sebastian-philipp This seems like it should work in most situations. However, it is possible to be in a state where this check passes and it considers this a redeploy but the port needed is actually taken. For example, stop the node-exporter service on a host, use the nc tool to listen to node-exporter's port nc -l localhost 9100 and then go and run a ceph orch redeploy node-exporter command. The node-exporter service will fail due to the port being taken but the user will never get the error message specifying this is the issue because the port wasn't checked. By checking for the container being up I think we can be sure node-exporter is holding the port it needs and redeploy can be done. Now, all of this seems like a pretty unlikely scenario so I'm fine still going with this change if you think this is acceptable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you think this is acceptable.

by now, you're the expert in this part of the code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally prefer checking the container. I think it's a more accurate view of whether we're actually redeploying or not.

@sebastian-philipp
Copy link
Contributor

Does it make sense to add Fixes: https://tracker.ceph.com/issues/49013 to the commit?

@adk3798
Copy link
Contributor Author

adk3798 commented Feb 10, 2021

Does it make sense to add Fixes: https://tracker.ceph.com/issues/49013 to the commit?

I want an additional change that makes sure all daemons were deployed by a mgr running the code that we're upgrading to at the end of an upgrade first. This PR helps deal with that issue or similar issues if they come up but I don't consider this a fix for the upgrade problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants