quincy: mgr/cephadm: catch CancelledError in asyncio timeout handler #56086

Merged 1 commit on Mar 18, 2024

@adk3798 adk3798 commented Mar 10, 2024

backport tracker: https://tracker.ceph.com/issues/64630


backport of #55620
parent tracker: https://tracker.ceph.com/issues/64473

this backport was staged using ceph-backport.sh version 16.0.0.6848
find the latest version at https://github.com/ceph/ceph/blob/main/src/script/ceph-backport.sh

@adk3798 adk3798 requested a review from a team as a code owner March 10, 2024 19:14
@adk3798 adk3798 added this to the quincy milestone Mar 10, 2024
adk3798 commented Mar 13, 2024

jenkins test make check

adk3798 commented Mar 13, 2024

jenkins test dashboard cephadm

Specifically, concurrent.futures.CancelledError. At least on
Python 3.9, this error can be raised when certain commands
being run asynchronously fail. Not catching it results in
the whole cephadm module crashing with a traceback like

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/utils.py", line 94, in do_work
    return f(*arg)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 267, in refresh
    r = self._refresh_facts(host)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 370, in _refresh_facts
    val = self.mgr.wait_async(self._run_cephadm_json(
  File "/usr/share/ceph/mgr/cephadm/module.py", line 671, in wait_async
    return self.event_loop.get_result(coro, timeout)
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 64, in get_result
    return future.result(timeout)
  File "/lib64/python3.9/concurrent/futures/_base.py", line 444, in result
    raise CancelledError()
concurrent.futures._base.CancelledError

Fixes: https://tracker.ceph.com/issues/64473

Signed-off-by: Adam King <adking@redhat.com>
(cherry picked from commit 9c34973)
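
For readers unfamiliar with this failure mode, here is a minimal sketch of the idea described in the commit message: waiting on a future tied to a background event loop and catching `concurrent.futures.CancelledError` alongside the timeout case, so the caller sees an ordinary error instead of an unhandled exception. This is not the actual patch. `AsyncCommandError`, `start_background_loop`, `slow_command`, `demo`, and this standalone `get_result` are hypothetical names used for illustration only.

```python
import asyncio
import concurrent.futures
import threading


class AsyncCommandError(Exception):
    """Hypothetical error type; stands in for whatever the caller reports."""


def start_background_loop():
    """Start an asyncio event loop in a daemon thread and return it."""
    loop = asyncio.new_event_loop()
    threading.Thread(target=loop.run_forever, daemon=True).start()
    return loop


def get_result(future, timeout=None):
    """Wait on a concurrent.futures.Future, translating timeout and cancellation."""
    try:
        return future.result(timeout)
    except concurrent.futures.TimeoutError:
        future.cancel()
        raise AsyncCommandError("command timed out")
    except concurrent.futures.CancelledError:
        # Without this clause the CancelledError would propagate to the caller.
        raise AsyncCommandError("command was cancelled")


async def slow_command():
    """Placeholder for a long-running asynchronous command."""
    await asyncio.sleep(60)
    return "done"


def demo():
    loop = start_background_loop()
    future = asyncio.run_coroutine_threadsafe(slow_command(), loop)
    future.cancel()  # simulate the failure mode: the future ends up cancelled
    try:
        get_result(future, timeout=5)
    except AsyncCommandError as e:
        return str(e)
    return "no error"
```

Calling `demo()` exercises the cancelled path. Passing a future that never completes with a short timeout exercises the timeout path instead.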
adk3798 commented Mar 13, 2024

https://pulpito.ceph.com/adking-2024-03-11_04:43:35-orch:cephadm-wip-adk2-testing-2024-03-10-1540-quincy-distro-default-smithi/

reruns: https://pulpito.ceph.com/adking-2024-03-11_12:11:09-orch:cephadm-wip-adk2-testing-2024-03-10-1540-quincy-distro-default-smithi/

After reruns, 2 failures remain:

  • The test_non_existent_cluster failure is caused by a change to the ceph nfs cluster info command that makes it return a non-zero return code when the cluster does not exist. The change itself was backported, but the change to make the test expect that return code was not.
  • 'rcu: INFO: rcu_sched detected stalls on CPUs/tasks: ' in syslog in the mgr-nfs-upgrade test, which is a known issue.

Nothing here blocks merging.

adk3798 commented Mar 15, 2024

jenkins retest this please

@adk3798 adk3798 merged commit 7a59308 into ceph:quincy Mar 18, 2024
10 of 11 checks passed