mgr/cephadm: The command of 'ceph orch daemon restart mgr.xxx' may ca… #41002

Merged

strenuous-life

…se mgr daemon loop to restart

Scenario:
The mgr daemon is active. After the restart command is executed, the entry "scheduled_daemon_actions": {"mgr.cephqa08.cpp.zzbm.qianxin-inc.cn.mqwkha": "restart"} may be saved to config-key.
The mgr daemon then restarts before rm_scheduled_daemon_action is called, so the scheduled action is still present when the daemon comes back up, and the mgr daemon restarts in a loop forever.

Signed-off-by: jianglong01 <jianglong01@qianxin.com>
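
The loop described in the scenario can be illustrated with a toy model. This is only a sketch and not cephadm code: a plain dict stands in for the config-key store, and `boot_and_process`, `count_restarts`, and the signature given to `rm_scheduled_daemon_action` here are hypothetical helpers invented for the illustration.

```python
# Toy model of the restart loop. A dict stands in for the config-key store;
# none of these helpers are real cephadm APIs.

def rm_scheduled_daemon_action(store, daemon):
    """Drop a pending action for `daemon` (hypothetical signature)."""
    store["scheduled_daemon_actions"].pop(daemon, None)


def boot_and_process(store, self_name):
    """Model one lifetime of the active mgr.

    Returns True if the mgr restarted itself, which ends the process
    before the cleanup call below can run for its own entry.
    """
    for daemon, action in list(store["scheduled_daemon_actions"].items()):
        if action == "restart" and daemon == self_name:
            # The process dies here, so the entry stays in the store.
            return True
        # Any other daemon: the action is carried out, then cleared.
        rm_scheduled_daemon_action(store, daemon)
    return False


def count_restarts(store, self_name, max_boots=5):
    """Boot the mgr up to `max_boots` times and count self-restarts."""
    restarts = 0
    for _ in range(max_boots):
        if not boot_and_process(store, self_name):
            break
        restarts += 1
    return restarts


if __name__ == "__main__":
    store = {"scheduled_daemon_actions": {"mgr.a": "restart"}}
    print(count_restarts(store, "mgr.a"), store)
```

In this model the entry for the active mgr is never removed, so the number of restarts is limited only by the `max_boots` cap added to keep the example finite.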

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug


@@ -467,6 +467,9 @@ def save_host(self, host: str) -> None:
         if host in self.last_etc_ceph_ceph_conf:
             j['last_etc_ceph_ceph_conf'] = datetime_to_str(self.last_etc_ceph_ceph_conf[host])
         if host in self.scheduled_daemon_actions:
+            for dict_key in list(self.scheduled_daemon_actions[host].keys()):
+                if dict_key.startswith('mgr.'):
+                    self.scheduled_daemon_actions[host].pop(dict_key)
Contributor

We actually had a similar issue with redeploy: the active mgr went down when it redeployed itself, then did the same thing when it came back up, looping forever because the scheduled action was never cleared. In that case, we dealt with it by implementing a fail over system. The idea is that if you have mgr.a and mgr.b, mgr.a is active, and you tell it to redeploy mgr.a, then mgr.a just fails over to mgr.b, which can safely redeploy mgr.a and clear the scheduled action with no issues. We don't support redeploying the mgr if there are no standby mgr daemons. You can see the fail over here: https://github.com/ceph/ceph/blob/master/src/pybind/mgr/cephadm/module.py#L1694-L1697, and the restriction requiring a standby mgr here: https://github.com/ceph/ceph/blob/master/src/pybind/mgr/cephadm/module.py#L1746-L1749 and here: https://github.com/ceph/ceph/blob/master/src/pybind/mgr/cephadm/module.py#L1769-L1772.

The change you currently have here actually breaks that system. If you try to redeploy mgr.a now then when it fails over to mgr.b, mgr.b will clear the scheduled redeploy without the redeploy of mgr.a ever happening.
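
To make that concern concrete, here is a rough sketch of how stripping 'mgr.' keys while saving a host loses a pending mgr action. It is a toy model under stated assumptions: the loop mirrors the 'mgr.'-stripping lines in the diff above, but the function signature and the dict used as the persistent store are hypothetical, and the real ordering of calls in cephadm may differ.

```python
# Toy model: what happens to a scheduled mgr action when save_host strips
# every key starting with 'mgr.'. The store is a plain dict, not config-key.

def save_host(store, host, scheduled_daemon_actions):
    """Persist the scheduled actions for `host` (hypothetical signature)."""
    if host in scheduled_daemon_actions:
        # Same shape as the loop in the diff above.
        for dict_key in list(scheduled_daemon_actions[host].keys()):
            if dict_key.startswith('mgr.'):
                scheduled_daemon_actions[host].pop(dict_key)
        store[host] = dict(scheduled_daemon_actions[host])


if __name__ == "__main__":
    store = {}
    in_memory = {"host1": {"mgr.a": "redeploy", "osd.3": "restart"}}
    save_host(store, "host1", in_memory)
    print(store, in_memory)
```

In this model, whichever mgr saves the host state drops the pending mgr.a action, so nothing is left for a standby to act on after a fail over.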

I think restart should be able to work with the exact same solution as redeploy where it fails over. I think that would be a preferred solution to what you currently have. As long as you don't have any strong objections to doing it that way, could you change this PR to use a fail over for restarts the same as is currently done with redeploys?
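
The fail over approach described in this comment can be sketched roughly as follows. This is an illustrative model only, assuming a single active mgr plus a list of standbys; `Cluster`, `schedule`, `fail_over`, and `process_scheduled` are hypothetical names and do not correspond to the functions in module.py.

```python
# Toy model of "fail over, then let the new active mgr act on the old one".
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Cluster:
    active: str
    standbys: List[str]
    scheduled: Dict[str, str] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)


def schedule(cluster: Cluster, daemon: str, action: str) -> None:
    """Record an action, refusing it if the only mgr would act on itself."""
    if daemon == cluster.active and not cluster.standbys:
        raise RuntimeError(f"cannot {action} {daemon}: no standby mgr")
    cluster.scheduled[daemon] = action


def fail_over(cluster: Cluster) -> None:
    """Promote the first standby; the old active mgr becomes a standby."""
    if not cluster.standbys:
        raise RuntimeError("no standby mgr to fail over to")
    old_active = cluster.active
    cluster.active = cluster.standbys.pop(0)
    cluster.standbys.append(old_active)


def process_scheduled(cluster: Cluster) -> None:
    """One pass of the active mgr over the pending actions."""
    for daemon, action in list(cluster.scheduled.items()):
        if daemon == cluster.active:
            # Never act on ourselves: hand over and leave the entry in place
            # so the new active mgr handles it on its own pass.
            fail_over(cluster)
            return
        cluster.log.append(f"{cluster.active} performs {action} on {daemon}")
        del cluster.scheduled[daemon]  # stands in for clearing the action


if __name__ == "__main__":
    c = Cluster(active="mgr.a", standbys=["mgr.b"])
    schedule(c, "mgr.a", "restart")
    process_scheduled(c)  # mgr.a fails over to mgr.b
    process_scheduled(c)  # mgr.b restarts mgr.a and clears the action
    print(c.active, c.log, c.scheduled)
```

The key design point in the sketch is that the entry is cleared only by a mgr other than the target, so the target never goes down with its own action still pending.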

Author

That is a good way to solve the issue. But if I have only one mgr, mgr.a, the command 'orch daemon restart mgr.a' will be unavailable.

Contributor

That is on purpose for now. We don't want to have the mgr restarting/redeploying itself. It's something we may look into more in the future, but currently we don't consider it a safe operation to have the only mgr do that.

Contributor

adk3798 left a comment

This LGTM. Can you squash the commits?

…se mgr daemon loop to restart

Scenario:
The mgr daemon is active. After the restart command is executed, the entry "scheduled_daemon_actions": {"mgr.xxx": "restart"} may be saved to config-key.
The mgr daemon then restarts before rm_scheduled_daemon_action is called, so the scheduled action is still present when the daemon comes back up, and the mgr daemon restarts in a loop forever.

Fix the mgr infinite restart issue by using the same solution as 'ceph orch daemon redeploy'.

Signed-off-by: jianglong01 <jianglong01@qianxin.com>
@tchaikov
Contributor

jenkins test make check
