mgr/prometheus: Skip bogus entries #20456
Conversation
Force-pushed from 828426b to 19f4930
I have added a commit that fixes the This bug can best be seen with the pg_active count, where almost all the entries contain
Force-pushed from 14b20a6 to a9e0db8
I have added a commit that implements
The pg_count and OSD flag commits seem fine.
@jan--f AFAIK, it is an implementation detail from when ceph is still discovering the osd. The osd itself won't be skipped; it just won't show up twice, as it does when it finally gets the address and updates the metric with it. The old '-' address is still in the prometheus server db, so the osd will show up twice when you query it. This also means that you can't get a reliable osd count or (osd, address) pairs from prometheus just by doing something like count(ceph_osd_metadata). You need to manually work around this by filtering out the '-' addresses, which is not ideal.
@jan--f That won't help. The prometheus server will show the last available value for a metric if it was ever created. Your commit would only fix range requests, since the timestamp of the no-longer-available metric would no longer be updated when prometheus scrapes the exported data. I guess these bogus entries will eventually time out after 15 days, or whatever your data retention interval is, but that is a rather long time window. I think it is better to not create these bogus entries in the first place than to have to ignore them later.
Ok, I'm not terribly passionate about when a particular Generally, keep in mind though that you'll always want to use range queries with prometheus (and, say, Grafana always does). I.e. if you want to use the metadata metric for counting OSDs, you'd always use a fairly recent time window. Otherwise you might count OSDs that once existed or are currently down/out.
src/pybind/mgr/prometheus/module.py
Outdated
osd_devices = self.get('osd_map_crush')['devices']
for osd in osd_map['osds']:
    id_ = osd['osd']
    p_addr = osd['public_addr'].split(':')[0]
    c_addr = osd['cluster_addr'].split(':')[0]
    if p_addr == "-" or c_addr == "-":
        continue
Let's add a log message here that osd_$id was skipped due to missing address.
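The skip-plus-log approach suggested above could look roughly like the sketch below. This is not the actual module code; the function name `collect_osd_metadata` and its return shape are hypothetical, and only the filtering/logging pattern reflects what the patch does:

```python
import logging

log = logging.getLogger(__name__)

def collect_osd_metadata(osd_map):
    """Yield (id, public_addr, cluster_addr) for OSDs with real addresses.

    OSDs that are still being discovered report '-' as their address;
    these are skipped (with a log message) so they never become metrics.
    """
    for osd in osd_map['osds']:
        id_ = osd['osd']
        p_addr = osd['public_addr'].split(':')[0]
        c_addr = osd['cluster_addr'].split(':')[0]
        if p_addr == '-' or c_addr == '-':
            # Note that osd_$id was skipped due to a missing address.
            log.info("Skipping osd.%s: missing address", id_)
            continue
        yield id_, p_addr, c_addr
```
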
Force-pushed from a9e0db8 to e76ec2a
The osd data can contain bogus '-' entries, skip these when populating osd metadata and disk occupation. Signed-off-by: Boris Ranto <branto@redhat.com>
Currently, the pg_* counts are not computed properly. We split the current state by the '+' sign but do not add the pg count to the already-found pg count. Instead, we overwrite any existing pg count with the new count. This patch fixes it by adding all the pg counts together across all the states. It also introduces a new pg_total metric that shows the total count of PGs. Signed-off-by: Boris Ranto <branto@redhat.com>
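The accumulation described in this commit message can be sketched as follows. This is an illustration, not the module's actual code; the function name `count_pg_states` and the input shape (a mapping of combined state strings to PG counts) are assumptions:

```python
from collections import defaultdict

def count_pg_states(pg_summary):
    """Sum PG counts per state token, accumulating rather than overwriting.

    A PG state like 'active+clean' contributes its count to both the
    'active' and 'clean' totals. A 'total' entry counts all PGs once.
    """
    counts = defaultdict(int)
    total = 0
    for state, num in pg_summary.items():
        total += num
        for token in state.split('+'):
            counts[token] += num  # add to any existing count, don't overwrite
    counts['total'] = total
    return dict(counts)
```

For example, with 10 PGs in 'active+clean' and 2 in 'active+degraded', 'active' correctly sums to 12 instead of being overwritten by the last state seen.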
Signed-off-by: Boris Ranto <branto@redhat.com>
Force-pushed from e76ec2a to aae7a21
Yeah, you are right that it is safer to use range queries for this kind of thing. I'd still prefer it if we did not show the metrics for the OSDs while we are not able to populate them properly, though. I have updated the commit to include a message that the osd is being skipped.
The failures in the first run were caused by sepia issues.