squid: mgr/cephadm: catch CancelledError in asyncio timeout handler #56317

Merged
merged 1 commit into from Mar 27, 2024

@adk3798 adk3798 commented Mar 19, 2024

backport tracker: https://tracker.ceph.com/issues/64628


backport of #55620
parent tracker: https://tracker.ceph.com/issues/64473

this backport was staged using ceph-backport.sh version 16.0.0.6848
find the latest version at https://github.com/ceph/ceph/blob/main/src/script/ceph-backport.sh

Specifically, catch concurrent.futures.CancelledError. At least on
Python 3.9, this error can be raised when certain commands
run asynchronously fail. If it is not caught, the whole
cephadm module crashes with a traceback like

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/utils.py", line 94, in do_work
    return f(*arg)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 267, in refresh
    r = self._refresh_facts(host)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 370, in _refresh_facts
    val = self.mgr.wait_async(self._run_cephadm_json(
  File "/usr/share/ceph/mgr/cephadm/module.py", line 671, in wait_async
    return self.event_loop.get_result(coro, timeout)
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 64, in get_result
    return future.result(timeout)
  File "/lib64/python3.9/concurrent/futures/_base.py", line 444, in result
    raise CancelledError()
concurrent.futures._base.CancelledError

Fixes: https://tracker.ceph.com/issues/64473

Signed-off-by: Adam King <adking@redhat.com>
(cherry picked from commit 9c34973)
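
For readers unfamiliar with this failure mode, here is a minimal sketch of the pattern the commit message describes. It is not the code from this PR. The names `CommandError`, `timeout_handler`, `start_loop`, `wait_async`, `succeeds`, and `gets_cancelled` are hypothetical and exist only for illustration. The sketch assumes a coroutine submitted to an event loop running in another thread, as the traceback suggests, and shows that a cancelled task surfaces as `concurrent.futures.CancelledError` from `future.result()`, which the handler converts into an error the caller can deal with.

```python
import asyncio
import concurrent.futures
import threading
from contextlib import contextmanager


class CommandError(Exception):
    """Hypothetical error type that the caller is prepared to handle."""


@contextmanager
def timeout_handler(cmd: str, timeout: float):
    """Illustrative handler: translate timeout and cancellation into CommandError."""
    try:
        yield
    except concurrent.futures.TimeoutError:
        raise CommandError(f"{cmd!r} timed out after {timeout}s")
    except concurrent.futures.CancelledError:
        # Without this clause the CancelledError would propagate to the caller.
        raise CommandError(f"{cmd!r} was cancelled before it finished")


def start_loop() -> asyncio.AbstractEventLoop:
    """Run an event loop in a background daemon thread."""
    loop = asyncio.new_event_loop()
    threading.Thread(target=loop.run_forever, daemon=True).start()
    return loop


def wait_async(loop, coro, cmd: str, timeout: float):
    """Submit a coroutine to the loop and block on its result."""
    future = asyncio.run_coroutine_threadsafe(coro, loop)
    with timeout_handler(cmd, timeout):
        return future.result(timeout)


async def succeeds() -> str:
    return "ok"


async def gets_cancelled() -> str:
    # Simulates an asynchronous command whose task ends up cancelled.
    raise asyncio.CancelledError()
```

Calling `wait_async(loop, gets_cancelled(), "some-cmd", 5)` on a loop from `start_loop()` raises `CommandError` rather than letting `concurrent.futures.CancelledError` escape to the caller.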
@adk3798 adk3798 requested a review from a team as a code owner March 19, 2024 20:30
@adk3798 adk3798 added this to the squid milestone Mar 19, 2024
adk3798 commented Mar 27, 2024

https://pulpito.ceph.com/adking-2024-03-20_17:29:20-orch:cephadm-wip-adk4-testing-2024-03-20-0737-squid-distro-default-smithi/

Mostly "in cluster log" failures. Beyond that, the failures were:

  • mds_upgrade_sequence, known to fail currently
  • test_cephadm task, hit https://tracker.ceph.com/issues/65155, known issue
  • staggered upgrade with agent enabled, known issue
  • random thrash test failed with Error response from daemon: Cannot kill container: ceph-d3b78eec-e712-11ee-95c9-87774f69a715-osd.4: No such container: ceph-d3b78eec-e712-11ee-95c9-87774f69a715-osd.4 while trying to docker kill that container. Doesn't seem related to any of the PRs in the run to me.

Nothing there to block merging PRs in the run.

@adk3798 adk3798 merged commit b1827f8 into ceph:squid Mar 27, 2024
11 of 12 checks passed