Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mgr/prometheus: Skip bogus entries #20456

Merged
merged 3 commits into from Feb 24, 2018
Merged

mgr/prometheus: Skip bogus entries #20456

merged 3 commits into from Feb 24, 2018

Conversation

b-ranto
Copy link
Contributor

@b-ranto b-ranto commented Feb 16, 2018

The osd data can contain bogus '-' entries, skip these when populating
osd metadata and disk occupation.

Signed-off-by: Boris Ranto branto@redhat.com

@b-ranto
Copy link
Contributor Author

b-ranto commented Feb 16, 2018

I have added a commit that fixes the pg_* counts. Currently, the code will only expose the last value in the summary. With this fix, it will add up these values to get the correct pg_* count.

This bug can be best seen with pg_active count where almost all the entries contain active in the summary but only the latest summary line is being shown as pg_active.

@b-ranto b-ranto requested a review from jan--f February 16, 2018 17:56
@batrick batrick added the mgr label Feb 16, 2018
@b-ranto b-ranto force-pushed the wip-mgr-prom branch 2 times, most recently from 14b20a6 to a9e0db8 Compare February 17, 2018 04:07
@b-ranto
Copy link
Contributor Author

b-ranto commented Feb 17, 2018

I have added a commit that implements ceph_flag_<flag> metrics to get the OSD flags info.

@jan--f
Copy link
Contributor

jan--f commented Feb 17, 2018

The pg_count and OSD flag commits seem fine.
I'm not sure its the best course to skip the metadata metrics in case the addresses are not there. Any idea why the OSD addresses are -? In any case I'd rather have these entries then skip the metadata altogether, especially if/when we add more metadata fields. Or am I missing somehting?

@b-ranto
Copy link
Contributor Author

b-ranto commented Feb 17, 2018

@jan--f afaik, it is an implementation detail from when ceph is still discovering the osd. The osd itself won't be skipped, it just won't show up twice as it does when it finally gets the address and updates the metric with the address. The old '-' address is still in the prometheus server db so it will show the osd twice when you are querying it -- this also means that you can't get realiable osd count or (osd, address) pair from prometheus just by doing something like count(ceph_osd_metadata). You need to manually work around this by removing the '-' addresses which is not ideal.

@jan--f
Copy link
Contributor

jan--f commented Feb 18, 2018

@b-ranto Ahh I see. This is actually a different bug. I have a fix for it here. Will do some more testing on that and make it into a PR next week. So feel free to drop 19f4930.

@b-ranto
Copy link
Contributor Author

b-ranto commented Feb 19, 2018

@jan--f That won't help. The prometheus server will show the last available value for a metric if it was ever created. Your commit will only fix range requests since the time stamp of the no-longer-available metric will no longer be updated when prometheus scrapes the exported data. I guess these bogus entries will eventually time out after 15 days or whatever your data retention interval is but that is a rather long time window.

I think it is better to just ignore these bogus entries than actually create them in the first place.

@jan--f
Copy link
Contributor

jan--f commented Feb 19, 2018

Ok I'm not terribly passionate about when a particular osd_metadata metrics shows up, I thought as early as possible would be good.

Generally keep in mind though that you'll always want to use range queries with prometheus (and, say, Grafana always does). I.e. if you want to use the metadata metric for counting OSDs, you'd always use fairly recent time window. Otherwise you might count OSDs that have existed or are currently down/out.

osd_devices = self.get('osd_map_crush')['devices']
for osd in osd_map['osds']:
id_ = osd['osd']
p_addr = osd['public_addr'].split(':')[0]
c_addr = osd['cluster_addr'].split(':')[0]
if p_addr == "-" or c_addr == "-":
continue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a log message here that osd_$id was skipped due to missing address.

The osd data can contain bogus '-' entries, skip these when populating
osd metadata and disk occupation.

Signed-off-by: Boris Ranto <branto@redhat.com>
Currently, the pg_* counts are not computed properly. We split the
current state by '+' sign but do not add the pg count to the already
found pg count. Instead, we overwrite any existing pg count with the new
count. This patch fixes it by adding all the pg counts together for all
the states.

It also introduces a new pg_total metric for pg_total that shows the
total count of PGs.

Signed-off-by: Boris Ranto <branto@redhat.com>
Signed-off-by: Boris Ranto <branto@redhat.com>
@b-ranto
Copy link
Contributor Author

b-ranto commented Feb 19, 2018

Yeah, you are right that it is safer to use the range time queries for this kind of things, I'd still prefer it if we did not show the metrics for the OSDs while we are not able to populate them properly, though. I have updated the commit to include the message that the osd is being skipped.

@tchaikov tchaikov merged commit 4a00e32 into ceph:master Feb 24, 2018
@b-ranto b-ranto deleted the wip-mgr-prom branch February 24, 2018 10:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants