mgr/prometheus: fix orch check to prevent Prometheus from crashing #55149

rkachach · 2024-01-11T16:15:08Z

Fixes: https://tracker.ceph.com/issues/63992

The previous code was calling get_orch_status during prometheus module startup. As consequence and since the modules loading order by the mgr is random (sometimes it used alphabetical order, but not always), this can crash in case the orchestrator is not available yet. In summary, the previous code suffers from a race condition between loading prometheus and the orchestrator module so prometheus module could potentially crash during the startup.

Changes purpose is to remove the initialiaztion of the modify_instance_id variable from the stratup and move it to the get_metadata_and_osd_status which is called both periodically by the metrics collection thread or when some clients performs a get on /metrics endpoint.

Related PR: #52191
Related issues: rook/rook#13527

Contribution Guidelines

To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

Tracker (select at least one)
- References tracker ticket
- Very recent bug; references commit where it was introduced
- New feature (ticket optional)
- Doc update (no ticket needed)
- Code cleanup (no ticket needed)
Component impact
- Affects Dashboard, opened tracker ticket
- Affects Orchestrator, opened tracker ticket
- No impact that needs to be tracked
Documentation (select at least one)
- Updates relevant documentation
- No doc update is appropriate
Tests (select at least one)
- Includes unit test(s)
- Includes integration test(s)
- Includes bug reproducer
- No tests

Show available Jenkins commands

jenkins retest this please
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test dashboard cephadm
jenkins test api
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox
jenkins test windows
jenkins test rook e2e

avanthakkar

LGTM! The changes worked for me while testing on both orchestrator rook & cephadm. The only issue remaining is the value for instance_id label for the rgw metrics where we see mismatch in values of that label exposed by Prometheus module & ceph-exporter , which I think would involve ceph-exporter changes too. IMO it should be dealt in a seperate PR and having a tracker issue for the same.
Wdyt? @rkachach
Based on this I'm approving PR as it solved the issue mentioned, thanks @rkachach

rkachach · 2024-02-06T17:06:42Z

jenkins test api

adk3798

I'm mostly okay with this as long as it doesn't bring back the dashboard issue that sparked changes here in the first place. There are a couple related failures in make check that are being caused by this PR though (see review comments)

src/pybind/mgr/prometheus/module.py

Fixes: https://tracker.ceph.com/issues/63992 Signed-off-by: Redouane Kachach <rkachach@redhat.com>

rkachach · 2024-02-08T07:22:13Z

LGTM! The changes worked for me while testing on both orchestrator rook & cephadm. The only issue remaining is the value for instance_id label for the rgw metrics where we see mismatch in values of that label exposed by Prometheus module & ceph-exporter , which I think would involve ceph-exporter changes too. IMO it should be dealt in a seperate PR and having a tracker issue for the same. Wdyt? @rkachach Based on this I'm approving PR as it solved the issue mentioned, thanks @rkachach

I agree on fixing the instance_id issue for RGW in a separate PR. Let's merge and then backport this PR to reef to fix the metrics there as well.

rkachach requested review from avanthakkar and a team January 11, 2024 16:15

github-actions bot added monitoring pybind labels Jan 11, 2024

rkachach requested a review from jmolmo January 11, 2024 16:15

rkachach force-pushed the fix_issue_63992 branch 2 times, most recently from 1bafbf7 to e52cd13 Compare January 12, 2024 11:35

avanthakkar added the wip-athakkar-testing label Jan 22, 2024

rkachach added the rook label Jan 25, 2024

avanthakkar approved these changes Jan 25, 2024

View reviewed changes

jmolmo approved these changes Jan 31, 2024

View reviewed changes

rkachach added this to the v18.2.2 milestone Feb 1, 2024

rkachach marked this pull request as ready for review February 6, 2024 17:05

rkachach changed the title ~~[WIP] mgr/prometheus: fixing orch check to avoid crashing during startup~~ mgr/prometheus: fixing orch check to avoid crashing during startup Feb 6, 2024

adk3798 reviewed Feb 6, 2024

View reviewed changes

src/pybind/mgr/prometheus/module.py Outdated Show resolved Hide resolved

src/pybind/mgr/prometheus/module.py Show resolved Hide resolved

rkachach force-pushed the fix_issue_63992 branch 2 times, most recently from fda49e9 to e917789 Compare February 7, 2024 13:32

ceph deleted a comment from avanthakkar Feb 7, 2024

mgr/prometheus: fix orch check to prevent Prometheus from crashing

6d550ff

Fixes: https://tracker.ceph.com/issues/63992 Signed-off-by: Redouane Kachach <rkachach@redhat.com>

rkachach force-pushed the fix_issue_63992 branch from e5add8b to 6d550ff Compare February 7, 2024 15:41

rkachach changed the title ~~mgr/prometheus: fixing orch check to avoid crashing during startup~~ mgr/prometheus: fix orch check to prevent Prometheus from crashing Feb 7, 2024

adk3798 merged commit 583a355 into ceph:main Feb 7, 2024
13 checks passed

avanthakkar mentioned this pull request Feb 8, 2024

reef: mgr/prometheus: fix orch check to prevent Prometheus crash #55491

Merged

14 tasks

barrettMCW mentioned this pull request Feb 16, 2024

Manager prometheus module failed to construct - ImportError: Module not found rook/rook#13527

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mgr/prometheus: fix orch check to prevent Prometheus from crashing #55149

mgr/prometheus: fix orch check to prevent Prometheus from crashing #55149

rkachach commented Jan 11, 2024 •

edited

avanthakkar left a comment

rkachach commented Feb 6, 2024

adk3798 left a comment

rkachach commented Feb 8, 2024

mgr/prometheus: fix orch check to prevent Prometheus from crashing #55149

mgr/prometheus: fix orch check to prevent Prometheus from crashing #55149

Conversation

rkachach commented Jan 11, 2024 • edited

Contribution Guidelines

Checklist

avanthakkar left a comment

Choose a reason for hiding this comment

rkachach commented Feb 6, 2024

adk3798 left a comment

Choose a reason for hiding this comment

rkachach commented Feb 8, 2024

rkachach commented Jan 11, 2024 •

edited