mgr/prometheus: Skip bogus entries #20456
Conversation
Force-pushed from 828426b to 19f4930
I have added a commit that fixes the This bug can best be seen with the pg_active count, where almost all the entries contain
Force-pushed from 14b20a6 to a9e0db8
I have added a commit that implements
The pg_count and OSD flag commits seem fine.
@jan--f AFAIK, it is an implementation detail from when ceph is still discovering the osd. The osd itself won't be skipped; it just won't show up twice, as it does when it finally gets the address and updates the metric with it. The old '-' address is still in the prometheus server db, so the osd will show up twice when you query it. This also means that you can't get a reliable osd count or (osd, address) pairs from prometheus just by doing something like count(ceph_osd_metadata). You need to manually work around this by filtering out the '-' addresses, which is not ideal.
@jan--f That won't help. The prometheus server will show the last available value for a metric if it was ever created. Your commit would only fix range requests, since the timestamp of the no-longer-available metric would no longer be updated when prometheus scrapes the exported data. I guess these bogus entries will eventually time out after 15 days, or whatever your data retention interval is, but that is a rather long time window. I think it is better to not create these bogus entries in the first place than to have to ignore them later.
Ok, I'm not terribly passionate about when a particular Generally, keep in mind though that you'll always want to use range queries with prometheus (and, say, Grafana always does). I.e. if you want to use the metadata metric for counting OSDs, you'd always use a fairly recent time window. Otherwise you might count OSDs that once existed or are currently down/out.
src/pybind/mgr/prometheus/module.py
Outdated
osd_devices = self.get('osd_map_crush')['devices']
for osd in osd_map['osds']:
    id_ = osd['osd']
    p_addr = osd['public_addr'].split(':')[0]
    c_addr = osd['cluster_addr'].split(':')[0]
    if p_addr == "-" or c_addr == "-":
        continue
Let's add a log message here that osd_$id was skipped due to missing address.
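The skip-plus-log approach suggested above could look roughly like the sketch below. This is not the actual module code; the function name `collect_osd_metadata` and its return shape are hypothetical, and only the filtering/logging pattern reflects what the patch does:

```python
import logging

log = logging.getLogger(__name__)

def collect_osd_metadata(osd_map):
    """Yield (id, public_addr, cluster_addr) for OSDs with real addresses.

    OSDs that are still being discovered report '-' as their address;
    these are skipped (with a log message) so they never become metrics.
    """
    for osd in osd_map['osds']:
        id_ = osd['osd']
        p_addr = osd['public_addr'].split(':')[0]
        c_addr = osd['cluster_addr'].split(':')[0]
        if p_addr == '-' or c_addr == '-':
            # Note that osd_$id was skipped due to a missing address.
            log.info("Skipping osd.%s: missing address", id_)
            continue
        yield id_, p_addr, c_addr
```
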
Force-pushed from a9e0db8 to e76ec2a
The osd data can contain bogus '-' entries, skip these when populating osd metadata and disk occupation. Signed-off-by: Boris Ranto <branto@redhat.com>
Currently, the pg_* counts are not computed properly. We split the current state by the '+' sign but do not add the pg count to the already-found pg count. Instead, we overwrite any existing pg count with the new count. This patch fixes it by adding all the pg counts together across all the states. It also introduces a new pg_total metric that shows the total count of PGs. Signed-off-by: Boris Ranto <branto@redhat.com>
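The accumulation described in this commit message can be sketched as follows. This is an illustration, not the module's actual code; the function name `count_pg_states` and the input shape (a mapping of combined state strings to PG counts) are assumptions:

```python
from collections import defaultdict

def count_pg_states(pg_summary):
    """Sum PG counts per state token, accumulating rather than overwriting.

    A PG state like 'active+clean' contributes its count to both the
    'active' and 'clean' totals. A 'total' entry counts all PGs once.
    """
    counts = defaultdict(int)
    total = 0
    for state, num in pg_summary.items():
        total += num
        for token in state.split('+'):
            counts[token] += num  # add to any existing count, don't overwrite
    counts['total'] = total
    return dict(counts)
```

For example, with 10 PGs in 'active+clean' and 2 in 'active+degraded', 'active' correctly sums to 12 instead of being overwritten by the last state seen.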
Signed-off-by: Boris Ranto <branto@redhat.com>
Force-pushed from e76ec2a to aae7a21
Yeah, you are right that it is safer to use range queries for this kind of thing. I'd still prefer it if we did not show the metrics for the OSDs while we are not able to populate them properly, though. I have updated the commit to include a message that the osd is being skipped.
The failures in the first run were caused by sepia issues.