reef: mgr/cephadm: catch CancelledError in asyncio timeout handler #56103

Merged
adk3798 merged 1 commit into ceph:reef on Mar 15, 2024


adk3798
Contributor

@adk3798 adk3798 commented Mar 10, 2024

backport tracker: https://tracker.ceph.com/issues/64629


backport of #55620
parent tracker: https://tracker.ceph.com/issues/64473

this backport was staged using ceph-backport.sh version 16.0.0.6848
find the latest version at https://github.com/ceph/ceph/blob/main/src/script/ceph-backport.sh

@adk3798 adk3798 requested a review from a team as a code owner March 10, 2024 20:47
@adk3798 adk3798 added this to the reef milestone Mar 10, 2024
@guits
Contributor

guits commented Mar 11, 2024

jenkins test api

@guits
Contributor

guits commented Mar 11, 2024

jenkins test make check

1 similar comment
@adk3798
Contributor Author

adk3798 commented Mar 12, 2024

jenkins test make check

@adk3798
Contributor Author

adk3798 commented Mar 12, 2024

jenkins test dashboard cephadm

@adk3798
Contributor Author

adk3798 commented Mar 13, 2024

https://pulpito.ceph.com/adking-2024-03-11_12:07:34-orch:cephadm-wip-adk3-testing-2024-03-11-0143-reef-distro-default-smithi/

reruns: https://pulpito.ceph.com/adking-2024-03-11_17:13:18-orch:cephadm-wip-adk3-testing-2024-03-11-0143-reef-distro-default-smithi/

After reruns:

  • 4 failed instances of mds_upgrade_sequence, known issue
  • test_cephadm fails with "Error: Container release squid != cephadm release reef", known issue
  • mgr-nfs-upgrade test fails with "rcu_sched detected stalls on CPUs/tasks:" in syslog, known issue

Nothing to block merging

Specifically, concurrent.futures.CancelledError. At least on
python 3.9, this error can be raised when certain commands
being run asynchronously fail. Not catching this results in
the whole cephadm module crashing with something like

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/utils.py", line 94, in do_work
    return f(*arg)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 267, in refresh
    r = self._refresh_facts(host)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 370, in _refresh_facts
    val = self.mgr.wait_async(self._run_cephadm_json(
  File "/usr/share/ceph/mgr/cephadm/module.py", line 671, in wait_async
    return self.event_loop.get_result(coro, timeout)
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 64, in get_result
    return future.result(timeout)
  File "/lib64/python3.9/concurrent/futures/_base.py", line 444, in result
    raise CancelledError()
concurrent.futures._base.CancelledError

Fixes: https://tracker.ceph.com/issues/64473

Signed-off-by: Adam King <adking@redhat.com>
(cherry picked from commit 9c34973)
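
For readers unfamiliar with the failure mode, here is a minimal sketch of how `future.result(timeout)` can raise `concurrent.futures.CancelledError` and one way a caller might handle it. This is not the patch itself: `CommandCancelled`, `start_loop_thread`, `cancelled_command`, and this standalone `get_result` are hypothetical names made up for illustration, and the real handling in cephadm may differ.

```python
import asyncio
import concurrent.futures
import threading
from typing import Any, Coroutine, Optional


class CommandCancelled(Exception):
    """Hypothetical error type for this sketch; not a name used by cephadm."""


def start_loop_thread() -> asyncio.AbstractEventLoop:
    # Run an event loop in a background thread, so that synchronous code
    # can hand coroutines to it and block on the result.
    loop = asyncio.new_event_loop()
    threading.Thread(target=loop.run_forever, daemon=True).start()
    return loop


def get_result(loop: asyncio.AbstractEventLoop,
               coro: Coroutine[Any, Any, Any],
               timeout: Optional[float] = None) -> Any:
    # Schedule the coroutine on the loop thread and wait for it here.
    future = asyncio.run_coroutine_threadsafe(coro, loop)
    try:
        return future.result(timeout)
    except concurrent.futures.TimeoutError:
        # Stop the still-running task before reporting the timeout.
        future.cancel()
        raise
    except concurrent.futures.CancelledError as e:
        # If the task was cancelled, result() raises CancelledError.
        # Converting it to an ordinary error lets callers handle it
        # instead of letting it propagate unexpectedly.
        raise CommandCancelled("async command was cancelled") from e


async def cancelled_command() -> None:
    # Stand-in for an async command that ends up cancelled.
    raise asyncio.CancelledError()
```

With this wrapper, calling `get_result(loop, cancelled_command(), timeout=5)` raises `CommandCancelled`, which the caller can catch like any other command failure.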
@adk3798 adk3798 merged commit de043fd into ceph:reef Mar 15, 2024
11 checks passed
2 participants