mgr/prometheus: fix orch check to prevent Prometheus from crashing #55149
1bafbf7 to e52cd13
LGTM! The changes worked for me while testing on both orchestrators, rook and cephadm. The only remaining issue is the value of the `instance_id` label for the rgw metrics, where the values exposed by the Prometheus module and ceph-exporter do not match. I think fixing that would involve ceph-exporter changes too. IMO it should be dealt with in a separate PR, with a tracker issue for it.
Wdyt? @rkachach
Based on this I'm approving the PR as it solves the issue mentioned, thanks @rkachach
jenkins test api
I'm mostly okay with this as long as it doesn't bring back the dashboard issue that sparked the changes here in the first place. There are a couple of related failures in make check that are caused by this PR, though (see review comments).
fda49e9 to e917789
Fixes: https://tracker.ceph.com/issues/63992
Signed-off-by: Redouane Kachach <rkachach@redhat.com>
e5add8b to 6d550ff
I agree on fixing the instance_id issue for RGW in a separate PR. Let's merge and then backport this PR to reef to fix the metrics there as well.
Fixes: https://tracker.ceph.com/issues/63992
The previous code called `get_orch_status` during prometheus module startup. Because the mgr loads modules in no guaranteed order (sometimes alphabetical, but not always), this call can crash when the orchestrator is not available yet. In short, the previous code has a race condition between loading the prometheus module and the orchestrator module, so the prometheus module could crash during startup.

This change removes the initialization of the `modify_instance_id` variable from startup and moves it to `get_metadata_and_osd_status`, which is called both periodically by the metrics collection thread and whenever a client performs a GET on the `/metrics` endpoint.

Related PR: #52191
Related issues: rook/rook#13527
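
The shape of the change described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the code in this PR: `FakeMgr`, `MetricsModule`, `OrchestratorUnavailable`, and `_refresh_modify_instance_id` are hypothetical names, and the meaning of `modify_instance_id` is simplified to "the orchestrator reported itself available". The point it shows is that nothing touches the orchestrator in `__init__`, so module load order no longer matters.

```python
class OrchestratorUnavailable(Exception):
    """Hypothetical error: the orchestrator module is not loaded yet."""


class FakeMgr:
    """Hypothetical stand-in for the mgr; this is not the real ceph-mgr API."""

    def __init__(self):
        self.orch_ready = False

    def get_orch_status(self):
        # Simulates the startup race: fails until the orchestrator is loaded.
        if not self.orch_ready:
            raise OrchestratorUnavailable("orchestrator not loaded yet")
        return {"available": True}


class MetricsModule:
    """Hypothetical, heavily simplified metrics module."""

    def __init__(self, mgr):
        self.mgr = mgr
        # No orchestrator call at startup; start from a safe default.
        self.modify_instance_id = False

    def _refresh_modify_instance_id(self):
        # Re-evaluated on every collection, so a late orchestrator is picked up.
        try:
            status = self.mgr.get_orch_status()
            self.modify_instance_id = bool(status.get("available"))
        except OrchestratorUnavailable:
            self.modify_instance_id = False

    def get_metadata_and_osd_status(self):
        # Called by the periodic collection thread and by /metrics requests.
        self._refresh_modify_instance_id()
        return {"modify_instance_id": self.modify_instance_id}
```

In this sketch, constructing `MetricsModule` before the orchestrator is ready does not raise, and the flag is corrected on the first collection that runs after the orchestrator becomes available.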
Contribution Guidelines
To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an `x` between the brackets: `[x]`. Spaces and capitalization matter when checking off items this way.
Checklist
Show available Jenkins commands
jenkins retest this please
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test dashboard cephadm
jenkins test api
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox
jenkins test windows
jenkins test rook e2e