Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mgr/prometheus: fix orch check to prevent Prometheus from crashing #55149

Merged
merged 1 commit into from Feb 7, 2024

Conversation

rkachach
Copy link
Contributor

@rkachach rkachach commented Jan 11, 2024

Fixes: https://tracker.ceph.com/issues/63992

The previous code was calling get_orch_status during prometheus module startup. As consequence and since the modules loading order by the mgr is random (sometimes it used alphabetical order, but not always), this can crash in case the orchestrator is not available yet. In summary, the previous code suffers from a race condition between loading prometheus and the orchestrator module so prometheus module could potentially crash during the startup.

Changes purpose is to remove the initialiaztion of the modify_instance_id variable from the stratup and move it to the get_metadata_and_osd_status which is called both periodically by the metrics collection thread or when some clients performs a get on /metrics endpoint.

Related PR: #52191
Related issues: rook/rook#13527

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

Copy link
Contributor

@avanthakkar avanthakkar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! The changes worked for me while testing on both orchestrator rook & cephadm. The only issue remaining is the value for instance_id label for the rgw metrics where we see mismatch in values of that label exposed by Prometheus module & ceph-exporter , which I think would involve ceph-exporter changes too. IMO it should be dealt in a seperate PR and having a tracker issue for the same.
Wdyt? @rkachach
Based on this I'm approving PR as it solved the issue mentioned, thanks @rkachach

@rkachach rkachach added this to the v18.2.2 milestone Feb 1, 2024
@rkachach rkachach marked this pull request as ready for review February 6, 2024 17:05
@rkachach rkachach changed the title [WIP] mgr/prometheus: fixing orch check to avoid crashing during startup mgr/prometheus: fixing orch check to avoid crashing during startup Feb 6, 2024
@rkachach
Copy link
Contributor Author

rkachach commented Feb 6, 2024

jenkins test api

Copy link
Contributor

@adk3798 adk3798 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm mostly okay with this as long as it doesn't bring back the dashboard issue that sparked changes here in the first place. There are a couple related failures in make check that are being caused by this PR though (see review comments)

src/pybind/mgr/prometheus/module.py Outdated Show resolved Hide resolved
src/pybind/mgr/prometheus/module.py Show resolved Hide resolved
@rkachach rkachach force-pushed the fix_issue_63992 branch 2 times, most recently from fda49e9 to e917789 Compare February 7, 2024 13:32
@ceph ceph deleted a comment from avanthakkar Feb 7, 2024
@rkachach rkachach changed the title mgr/prometheus: fixing orch check to avoid crashing during startup mgr/prometheus: fix orch check to prevent Prometheus from crashing Feb 7, 2024
@adk3798 adk3798 merged commit 583a355 into ceph:main Feb 7, 2024
13 checks passed
@rkachach
Copy link
Contributor Author

rkachach commented Feb 8, 2024

LGTM! The changes worked for me while testing on both orchestrator rook & cephadm. The only issue remaining is the value for instance_id label for the rgw metrics where we see mismatch in values of that label exposed by Prometheus module & ceph-exporter , which I think would involve ceph-exporter changes too. IMO it should be dealt in a seperate PR and having a tracker issue for the same. Wdyt? @rkachach Based on this I'm approving PR as it solved the issue mentioned, thanks @rkachach

I agree on fixing the instance_id issue for RGW in a separate PR. Let's merge and then backport this PR to reef to fix the metrics there as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants