mgr/cephadm: The command of 'ceph orch daemon restart mgr.xxx' may ca… #41002
src/pybind/mgr/cephadm/inventory.py
@@ -467,6 +467,9 @@ def save_host(self, host: str) -> None:
         if host in self.last_etc_ceph_ceph_conf:
             j['last_etc_ceph_ceph_conf'] = datetime_to_str(self.last_etc_ceph_ceph_conf[host])
         if host in self.scheduled_daemon_actions:
+            for dict_key in list(self.scheduled_daemon_actions[host].keys()):
+                if dict_key.startswith('mgr.'):
+                    self.scheduled_daemon_actions[host].pop(dict_key)
We actually had a similar issue with redeploy: when the active mgr redeployed itself, it went down, came back up, and repeated the same redeploy indefinitely because the scheduled action was never cleared. We dealt with that by implementing a failover system. The idea is that if you have mgr.a and mgr.b, mgr.a is active, and you tell it to redeploy mgr.a, then mgr.a simply fails over to mgr.b, which can safely redeploy mgr.a and clear the scheduled action with no issues. We don't support redeploying the mgr if there are no standby mgr daemons. You can see the failover here https://github.com/ceph/ceph/blob/master/src/pybind/mgr/cephadm/module.py#L1694-L1697 and the restriction requiring a standby mgr here https://github.com/ceph/ceph/blob/master/src/pybind/mgr/cephadm/module.py#L1746-L1749 and here https://github.com/ceph/ceph/blob/master/src/pybind/mgr/cephadm/module.py#L1769-L1772.
The change you currently have here actually breaks that system. If you try to redeploy mgr.a now then when it fails over to mgr.b, mgr.b will clear the scheduled redeploy without the redeploy of mgr.a ever happening.
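To make the concern concrete, here is a minimal, self-contained sketch of how stripping `mgr.*` entries at save time can lose a pending redeploy. It does not use the cephadm code: the `store` dict, the `strip_mgr` flag, and this simplified `save_host`/`load_host` pair are hypothetical stand-ins for the persisted host cache.

```python
import json

# Hypothetical stand-in for the persisted config-key store.
store = {}

def save_host(host, scheduled_daemon_actions, strip_mgr):
    """Persist a host's scheduled actions; optionally drop mgr.* entries first."""
    actions = dict(scheduled_daemon_actions.get(host, {}))
    if strip_mgr:
        # Mirrors the idea of the proposed change: never persist mgr actions.
        actions = {k: v for k, v in actions.items() if not k.startswith('mgr.')}
    store[host] = json.dumps({'scheduled_daemon_actions': actions})

def load_host(host):
    """What the newly active mgr would read back after a failover."""
    return json.loads(store[host])['scheduled_daemon_actions']

scheduled = {'host1': {'mgr.a': 'redeploy', 'osd.0': 'restart'}}

save_host('host1', scheduled, strip_mgr=False)
before = load_host('host1')   # the redeploy of mgr.a survives the failover

save_host('host1', scheduled, strip_mgr=True)
after = load_host('host1')    # the redeploy of mgr.a is gone before anyone ran it

print(before)
print(after)
```

In this toy model the `osd.0` action is kept in both cases, while the `mgr.a` redeploy disappears as soon as the stripping variant saves the host.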
I think restart should be able to use exactly the same solution as redeploy, where the active mgr fails over. I would prefer that to what you currently have. Unless you have strong objections to doing it that way, could you change this PR to use a failover for restarts, the same as is currently done for redeploys?
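The failover approach described in this comment could be sketched roughly as follows. This is an in-memory toy model, not the cephadm implementation: `ToyCluster`, `request`, `fail_over`, and `process_scheduled` are hypothetical names chosen for illustration, and the standby check only approximates the restriction mentioned above.

```python
from typing import Dict, List

class ToyCluster:
    """Hypothetical in-memory model of mgr daemons; not the cephadm API."""

    def __init__(self, active: str, standbys: List[str]) -> None:
        self.active = active
        self.standbys = list(standbys)
        self.scheduled: Dict[str, str] = {}   # daemon name -> pending action
        self.log: List[str] = []

    def fail_over(self) -> None:
        """Promote the first standby; the old active becomes a standby."""
        new_active = self.standbys.pop(0)
        self.standbys.append(self.active)
        self.active = new_active

    def request(self, daemon: str, action: str) -> None:
        """Schedule an action; fail over first if it targets the active mgr."""
        if daemon == self.active:
            if not self.standbys:
                # Assumed restriction: the only mgr may not act on itself.
                raise RuntimeError(f'cannot {action} {daemon}: no standby mgr')
            self.scheduled[daemon] = action
            self.fail_over()
        else:
            self.scheduled[daemon] = action
        self.process_scheduled()

    def process_scheduled(self) -> None:
        """Run by the active mgr: perform each pending action, then clear it."""
        for daemon, action in list(self.scheduled.items()):
            if daemon == self.active:
                continue  # the active mgr never acts on itself
            self.log.append(f'{self.active} performs {action} on {daemon}')
            del self.scheduled[daemon]

cluster = ToyCluster(active='mgr.a', standbys=['mgr.b'])
cluster.request('mgr.a', 'restart')
print(cluster.active, cluster.log, cluster.scheduled)
```

Because the action is always carried out by a mgr other than its target, the cleanup step runs to completion and the scheduled entry does not linger.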
That is a good way to solve the issue. But if I have only one mgr, mgr.a, the command 'orch daemon restart mgr.a' will be unavailable.
That is on purpose for now. We don't want to have the mgr restarting/redeploying itself. It's something we may look into more in the future, but currently we don't consider it a safe operation to have the only mgr do that.
This LGTM. Can you squash the commits?
…se mgr daemon loop to restart
Scenario:
The mgr daemon is active. After the restart command is executed, "scheduled_daemon_actions": {"mgr.xxx": "restart"} may be saved to the config-key store.
The mgr daemon then restarts before rm_scheduled_daemon_action is called, so the scheduled action is never cleared and the mgr daemon restarts in a loop forever.
Fix the mgr infinite restart issue by using the same solution as 'ceph orch daemon redeploy'.
Signed-off-by: jianglong01 <jianglong01@qianxin.com>
51402ea to cc5b77e
jenkins test make check
…se mgr daemon loop to restart
Scenario:
The mgr daemon is active. After the restart command is executed, "scheduled_daemon_actions": {"mgr.cephqa08.cpp.zzbm.qianxin-inc.cn.mqwkha": "restart"} may be saved to the config-key store.
The mgr daemon then restarts before rm_scheduled_daemon_action is called, so the scheduled action is never cleared and the mgr daemon restarts in a loop forever.
Signed-off-by: jianglong01 <jianglong01@qianxin.com>
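The loop described above can be reproduced in a small simulation. This is only a sketch under stated assumptions: `ProcessExit`, `restart_daemon`, `run_once`, and `boot_loop` are hypothetical names, the `persisted` dict stands in for the config-key store, and `max_boots` is an artificial cap so the example terminates.

```python
class ProcessExit(Exception):
    """Stands in for the mgr process being killed by its own restart."""

def restart_daemon(name, self_name):
    # Restarting ourselves ends the current process immediately.
    if name == self_name:
        raise ProcessExit(name)

def run_once(persisted, self_name):
    """One pass over the scheduled actions, as the active mgr would do it."""
    for name, action in list(persisted.items()):
        if action == 'restart':
            restart_daemon(name, self_name)  # exits here on a self-restart
            persisted.pop(name)              # cleanup; never reached in that case

def boot_loop(persisted, self_name, max_boots=5):
    """Count how many times the mgr boots before the actions are drained."""
    boots = 0
    while boots < max_boots:
        boots += 1
        try:
            run_once(persisted, self_name)
            break  # finished without the process dying
        except ProcessExit:
            continue  # process came back up and reads the same persisted state
    return boots

# Active mgr restarts itself: the entry is never cleared, so it loops to the cap.
looping = {'mgr.xxx': 'restart'}
boots_self = boot_loop(looping, self_name='mgr.xxx')

# A different mgr is active (after a failover): one pass clears the entry.
drained = {'mgr.xxx': 'restart'}
boots_other = boot_loop(drained, self_name='mgr.yyy')

print(boots_self, looping)
print(boots_other, drained)
```

The second case corresponds to the failover approach: once another mgr performs the restart, the cleanup line is reached and the persisted entry is removed.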