mgr/cephadm: set HEALTH warnings during apply phase in serve #43376

Merged: 7 commits, Oct 11, 2021

Daniel-Pivonka

rebase and minor changes of #42565

Fixes: https://tracker.ceph.com/issues/44414
Signed-off-by: Daniel Pivonka <dpivonka@redhat.com>

…pers in module.py

Fixes: https://tracker.ceph.com/issues/44414
Signed-off-by: Melissa Li <li.melissa.kun@gmail.com>
…lth_warning` helper

Fixes: https://tracker.ceph.com/issues/44414
Signed-off-by: Melissa Li <li.melissa.kun@gmail.com>
…ing` and `remove_health_warning` helpers

Fixes: https://tracker.ceph.com/issues/44414
Signed-off-by: Melissa Li <li.melissa.kun@gmail.com>
…_warning` and `remove_health_warning` helpers

Fixes: https://tracker.ceph.com/issues/44414
Signed-off-by: Melissa Li <li.melissa.kun@gmail.com>
…ng` helper

Fixes: https://tracker.ceph.com/issues/44414
Signed-off-by: Melissa Li <li.melissa.kun@gmail.com>
…invalid config options and failures to set options

Fixes: https://tracker.ceph.com/issues/44414
Signed-off-by: Melissa Li <li.melissa.kun@gmail.com>
…mon place failures in serve

Fixes: https://tracker.ceph.com/issues/44414
Signed-off-by: Melissa Li <li.melissa.kun@gmail.com>
@adk3798
Contributor

adk3798 commented Oct 11, 2021

http://pulpito.front.sepia.ceph.com/adking-2021-10-10_15:19:57-rados:cephadm-wip-adk-testing-2021-10-09-1501-distro-basic-smithi/

The failures are known issues: cgroups, and health warnings not being cleared quickly enough with the agent.

@neha-ojha
Member

@Daniel-Pivonka, are failures like https://pulpito.ceph.com/nojha-2021-10-28_22:29:54-upgrade:octopus-x-master-distro-basic-smithi/6465645/ expected in octopus-to-master upgrade tests at the moment? Essentially, the failure is caused by:

2021-10-28T23:36:02.855 INFO:teuthology.orchestra.run.smithi086.stdout:{"status":"HEALTH_WARN","checks":{"CEPHADM_APPLY_SPEC_FAIL":{"severity":"HEALTH_WARN","summary":{"message":"Failed to apply 4 service(s): alertmanager,grafana,node-exporter,prometheus","count":4},"muted":false}},"mutes":[]}
...
2021-10-28T23:36:03.552 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_c56135d151713269e811ede3163c9743c2e269de/teuthology/run_tasks.py", line 91, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_c56135d151713269e811ede3163c9743c2e269de/teuthology/run_tasks.py", line 70, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/github.com_neha-ojha_ceph_0ec838099064db9e5bbe31f9573445689339c4aa/qa/tasks/ceph.py", line 1469, in healthy
    manager.wait_until_healthy(timeout=300)
  File "/home/teuthworker/src/github.com_neha-ojha_ceph_0ec838099064db9e5bbe31f9573445689339c4aa/qa/tasks/ceph_manager.py", line 3141, in wait_until_healthy
    'timeout expired in wait_until_healthy'
AssertionError: timeout expired in wait_until_healthy
2021-10-28T23:36:03.644 ERROR:teuthology.run_tasks: Sentry event: https://sentry.ceph.com/organizations/ceph/?query=1751b45979ba479a81bcd55c27d3f600
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_c56135d151713269e811ede3163c9743c2e269de/teuthology/run_tasks.py", line 91, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_c56135d151713269e811ede3163c9743c2e269de/teuthology/run_tasks.py", line 70, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/github.com_neha-ojha_ceph_0ec838099064db9e5bbe31f9573445689339c4aa/qa/tasks/ceph.py", line 1469, in healthy
    manager.wait_until_healthy(timeout=300)
  File "/home/teuthworker/src/github.com_neha-ojha_ceph_0ec838099064db9e5bbe31f9573445689339c4aa/qa/tasks/ceph_manager.py", line 3141, in wait_until_healthy
    'timeout expired in wait_until_healthy'
AssertionError: timeout expired in wait_until_healthy
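
For anyone triaging similar runs, the health report in the log above can be unpacked roughly like this. It is only a sketch: the JSON is copied from the log line, while `unmuted_checks` and `failed_services` are hypothetical helper names, not part of Ceph or teuthology. It also assumes the summary message keeps the "...: a,b,c" shape seen here.

```python
import json

# Health report copied verbatim from the teuthology log line above.
HEALTH_JSON = (
    '{"status":"HEALTH_WARN","checks":{"CEPHADM_APPLY_SPEC_FAIL":'
    '{"severity":"HEALTH_WARN","summary":{"message":'
    '"Failed to apply 4 service(s): alertmanager,grafana,node-exporter,prometheus",'
    '"count":4},"muted":false}},"mutes":[]}'
)


def unmuted_checks(report):
    """Hypothetical helper: map check name -> summary message for unmuted checks."""
    return {
        name: check["summary"]["message"]
        for name, check in report.get("checks", {}).items()
        if not check.get("muted", False)
    }


def failed_services(message):
    """Hypothetical helper: split the comma-separated list after the first ': '."""
    _, _, names = message.partition(": ")
    return [name for name in names.split(",") if name]


report = json.loads(HEALTH_JSON)
checks = unmuted_checks(report)
services = failed_services(checks["CEPHADM_APPLY_SPEC_FAIL"])
print(report["status"], services)
```

As long as `CEPHADM_APPLY_SPEC_FAIL` stays raised and unmuted, the status remains `HEALTH_WARN`, which is why `wait_until_healthy(timeout=300)` in the traceback runs out of time.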

@sebastian-philipp
Contributor

Given that we never upgraded the alertmanager in this run, I think the upgrade test has been broken for a while now 😢

@Daniel-Pivonka
Author

Definitely a problem. I'm also seeing this when manually testing an upgrade from octopus to master :( Looking into a solution.

@neha-ojha
Member

> Definitely a problem. I'm also seeing this when manually testing an upgrade from octopus to master :( Looking into a solution.

thanks @Daniel-Pivonka @sebastian-philipp
@yuriw FYI, regarding https://tracker.ceph.com/issues/53097

@sebastian-philipp
Contributor

I dropped a86d0db from the Pacific backport #43728, as it breaks the upgrade from Octopus.
