exporter: use only counter dump/schema commands for extacting counters #50718
| else if (daemon_name.find("rbd-mirror") != std::string::npos) { | ||
| std::regex re( | ||
| "^rbd_mirror_image_([^/]+)/(?:(?:([^/]+)/" | ||
| ")?)(.*)\\.(replay(?:_bytes|_latency)?)$"); | ||
| std::smatch match; | ||
| if (std::regex_search(daemon_name, match, re) == true) { | ||
| new_metric_name = "ceph_rbd_mirror_image_" + match.str(4); | ||
| labels["pool"] = quote(match.str(1)); | ||
| labels["namespace"] = quote(match.str(2)); | ||
| labels["image"] = quote(match.str(3)); |
Are we also dropping this from the mgr/prometheus? Otherwise, there'll be duplicate entries, won't there?
Additionally, rather than introducing this transformation code in the ceph-exporter, I'd expect this to be provided by the daemons themselves. If not, we're trading a language (Python) which is super-friendly for data processing for bare C++ (tbh, this code is not complicated at all, but since we are introducing labeled perf-counters, why not go all the way?).
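To make the transformation under discussion concrete, here is a minimal standalone sketch of what the regex in the diff above extracts from a per-image counter name. Only the pattern is copied from the diff; the `ImageCounter` struct, the `parse_rbd_mirror_counter` helper, and the sample counter names are hypothetical and are not part of the Ceph source.

```cpp
#include <optional>
#include <regex>
#include <string>

// Hypothetical container for the pieces the diff stores in labels.
struct ImageCounter {
  std::string pool;
  std::string ns;      // empty when the name carries no namespace
  std::string image;
  std::string metric;  // fixed metric name, e.g. "ceph_rbd_mirror_image_replay"
};

// Illustrative helper: applies the pattern from the diff to a counter name.
std::optional<ImageCounter> parse_rbd_mirror_counter(const std::string& name) {
  static const std::regex re(
      "^rbd_mirror_image_([^/]+)/(?:(?:([^/]+)/)?)(.*)\\."
      "(replay(?:_bytes|_latency)?)$");
  std::smatch match;
  if (!std::regex_search(name, match, re)) {
    return std::nullopt;  // not a per-image rbd-mirror counter
  }
  // Group 2 (namespace) is optional; match.str(2) is "" when it is absent.
  return ImageCounter{match.str(1), match.str(2), match.str(3),
                      "ceph_rbd_mirror_image_" + match.str(4)};
}
```

With an assumed name such as `rbd_mirror_image_mypool/myns/myimage.replay_bytes`, the pool, namespace, and image end up as separate fields and the metric name no longer carries them.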
> Are we also dropping this from the mgr/prometheus? Otherwise, there'll be duplicate entries, won't there?
They're already removed.
> Additionally, rather than introducing this transformation code in the ceph-exporter, I'd expect this to be provided by the daemons themselves. If not, we're trading a language (Python) which is super-friendly for data processing for bare C++ (tbh, this code is not complicated at all, but since we are introducing labeled perf-counters, why not go all the way?).
| // Add fixed name metrics from existing ones that have details in their names | ||
| // that should be in labels (not in name). For backward compatibility, | ||
| // a new fixed name metric is created (instead of replacing) and details are put | ||
| // in new labels. Intended for RGW sync perf. counters but extendable as required. | ||
| // See: https://tracker.ceph.com/issues/45311 |
Thanks for the comment! It's really helpful.
| std::regex re("^data_sync_from_(.*)\\."); | ||
| std::smatch match; | ||
| if (std::regex_search(metric_name, match, re) == true) { | ||
| new_metric_name = std::regex_replace(metric_name, re, "from_([^.]*)', 'from_zone"); | ||
| labels["source_zone"] = quote(match.str(1)); | ||
| return {labels, new_metric_name}; | ||
| } |
Same thing here: why not providing the labeled perf-counter right from the rgw daemon?
Basically, this just addresses #49248 (comment): it adds support for renaming the rgw multi-site metrics and adding labels to them. Other daemons can make use of it as well, e.g. any rbd metrics that require renaming and moving details from the name into labels. So for now it's just rgw; in the future there may be other daemons that need this functionality.
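As a rough illustration of the rename-and-label step described here, the sketch below moves the source zone out of an rgw sync metric name and into a label. It is not the code from the PR: the replacement string in the diff appears to have been carried over from a Python `re.sub` call, so this sketch uses a separate `from_[^.]*` pattern instead. The helper name, the `Labels` alias, the use of `[^.]*` rather than `.*`, and the sample metric names are all assumptions, and the `quote()` call from the diff is omitted.

```cpp
#include <map>
#include <regex>
#include <string>
#include <utility>

using Labels = std::map<std::string, std::string>;

// Hypothetical helper: returns the extra labels and the fixed metric name.
std::pair<Labels, std::string> rename_rgw_sync_metric(
    const std::string& metric_name) {
  Labels labels;
  std::string new_name = metric_name;  // unchanged unless the pattern matches
  static const std::regex zone_re("^data_sync_from_([^.]*)\\.");
  std::smatch match;
  if (std::regex_search(metric_name, match, zone_re)) {
    // Replace the zone-specific part of the name with a fixed token...
    static const std::regex from_re("from_[^.]*");
    new_name = std::regex_replace(metric_name, from_re, "from_zone",
                                  std::regex_constants::format_first_only);
    // ...and carry the zone in a label instead (quoting omitted here).
    labels["source_zone"] = match.str(1);
  }
  return {labels, new_name};
}
```

For an assumed input of `data_sync_from_zone1.fetch_bytes`, this yields the name `data_sync_from_zone.fetch_bytes` with `source_zone` set to `zone1`; names that do not match are returned unchanged with no labels.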
Fixes: https://tracker.ceph.com/issues/59191 Signed-off-by: Avan Thakkar <athakkar@redhat.com> Ceph exporter no longer requires the output of perf dump/schema, as the ``counter dump`` command returns both labeled and unlabeled perf counters, which the exporter can fetch and export. Removed the ``exporter_get_labeled_counters`` config option, as the exporter will now export all counters, labeled or unlabeled. The fix also adds support for renaming the rgw multi-site metrics and adding labels to them, similar to what the prometheus module does.
c81e3b9 to 51a8990
The PR is closed but I'm commenting anyway since I had some comments pending -- if you are tagging people for review, give them at least a couple of days to review before merging.
| else if (daemon_name.find("rbd-mirror") != std::string::npos) { | ||
| std::regex re( | ||
| "^rbd_mirror_image_([^/]+)/(?:(?:([^/]+)/" | ||
| ")?)(.*)\\.(replay(?:_bytes|_latency)?)$"); |
I think this is dead code -- rbd-mirror daemon doesn't report any non-labeled per-image counters anymore (starting in reef).
| if (daemon_name.find("radosgw") != std::string::npos) { | ||
| std::size_t pos = daemon_name.find_last_of('.'); | ||
| std::string tmp = daemon_name.substr(pos+1); | ||
| labels["instance_id"] = quote(tmp); |
I know nothing about why both "rgw" and "radosgw" branches are needed but this seems wrong to me. Given radosgw.<instance_id>.asok string, wouldn't the above three lines always produce asok string for the value of the label?
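The concern can be checked with a small standalone sketch. The first function mirrors the three lines from the diff; the second is a hypothetical alternative that drops a trailing `.asok` before taking the suffix. Whether `daemon_name` actually includes the `.asok` extension at that point is the reviewer's premise, not something this thread confirms, and the sample name `radosgw.8000.asok` is made up.

```cpp
#include <string>

// Mirrors the lines in the diff: everything after the last '.'.
// If there is no '.', find_last_of returns npos, npos + 1 wraps to 0,
// and the whole string is returned.
std::string suffix_after_last_dot(const std::string& daemon_name) {
  std::size_t pos = daemon_name.find_last_of('.');
  return daemon_name.substr(pos + 1);
}

// Hypothetical alternative: strip a trailing ".asok" first, then take the
// part after the last remaining '.'.
std::string instance_id_without_asok(std::string daemon_name) {
  const std::string ext = ".asok";
  if (daemon_name.size() >= ext.size() &&
      daemon_name.compare(daemon_name.size() - ext.size(), ext.size(),
                          ext) == 0) {
    daemon_name.erase(daemon_name.size() - ext.size());
  }
  return suffix_after_last_dot(daemon_name);
}
```

Under that premise, `suffix_after_last_dot("radosgw.8000.asok")` returns `asok`, which is the behavior the comment questions, while `instance_id_without_asok` returns `8000`.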
| json_object labeled_perf_group_object = labeled_perf.value().as_object(); | ||
| auto counters = labeled_perf_group_object["counters"].as_object(); | ||
| auto counters_labels = labeled_perf_group_object["labels"].as_object(); | ||
| auto labeled_perf_group_counters = counter_dump[labeled_perf_group].as_object()["counters"].as_object(); |
All these variable names that start with "labeled" are confusing because counter dump includes both labeled and unlabeled counters. I would remove this prefix throughout.
| json_object counter_schema = boost::json::parse(counter_schema_response).as_object(); | ||
|
|
||
| for (auto &labeled_perf : counter_schema) { | ||
| std::string labeled_perf_group = {labeled_perf.key().begin(), labeled_perf.key().end()}; |
This and a few other lines here are over 80 columns.
|
|
||
| for(auto &label: counters_labels) { | ||
| std::string label_key = {label.key().begin(), label.key().end()}; | ||
| labels[label_key] = quote(label.value().as_string().c_str()); |
Is c_str needed given that quote takes std::string?
| new_metric_name = metric_name; | ||
|
|
||
| std::regex re("^data_sync_from_(.*)\\."); | ||
| std::smatch match; |
indentation
|
This PR might have introduced a bug in 17.2.6. Just upgraded from 17.2.5 to 17.2.6 and prometheus is complaining about an ‘invalid metric type “counter”’ from one of the exporters. Grabbed the metrics file and ran it through promtool check metrics and it pointed to with the message |
|
@szinn This PR is only merged in main, it hasn't been backported anywhere. |
|
@idryomov Right - the issue appears in 17.2.6 now. The exporter is generating a metric with a '+' in its name, which is invalid and causes Prometheus to mark the exporter pod as TargetDown. This may not be the PR that caused the bug, but the issue is with 17.2.6. |
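For context on why a '+' is rejected: the classic Prometheus data model restricts metric names to the pattern `[a-zA-Z_:][a-zA-Z0-9_:]*`. A minimal check against that rule might look like the sketch below; the function name is hypothetical and the sample names in the note are only illustrative of the counters mentioned in this thread.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `name` satisfies the classic Prometheus
// metric-name rule [a-zA-Z_:][a-zA-Z0-9_:]*.
bool is_valid_prometheus_metric_name(const std::string& name) {
  static const std::regex re("^[a-zA-Z_:][a-zA-Z0-9_:]*$");
  return std::regex_match(name, re);
}
```

A name ending in '+' such as `ceph_mds_mem_cap+` fails this check, while `ceph_mds_mem_cap_plus` passes.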
|
Understood. I don't think anything changed around |
|
Looking at my prometheus data, I see ceph_mds_mem_cap, ceph_mds_mem_cap_plus and ceph_mds_mem_cap_minus so perhaps it is just getting surfaced in .6 as you mentioned |
|
Are you seeing both |
|
Here's what I see on 17.2.6. It would appear that _minus now shows as _ and _plus now shows as +. The _plus and _minus were perhaps from 17.2.5. I just did a search in the prometheus graph for the prefix. |
|
@szinn this specific PR is not backported to 17.2.6, but the new ceph-exporter is there (I'd assume that disabled by default). I see that the sanitizing code in the ceph/src/pybind/mgr/prometheus/module.py Lines 329 to 338 in d7ff0d1
But there's no such code in the @avanthakkar, @pereman2: could you plz confirm what's the status of |
|
@avanthakkar @pereman2 @epuertat Thanks for looking into this - even though it doesn't appear to impact functionality, it does mean that metrics aren't being pulled from one exporter which would stop any kind of alerting of issues off of that exporter. That would be a serious issue |
@epuertat What do you mean by the new ceph-exporter? ceph-exporter was there in 17.2.5 and not much changed under
Agreed... ceph/src/exporter/DaemonMetricCollector.cc Lines 148 to 149 in d7ff0d1
|
|
Probably the right thing to do is to produce the perf counter with the right name in each daemon. Obviously, we can fix the problem in Ceph exporter (and continue making different assumptions about how to replace each invalid character in the perf counter name), but if we had fixed the problem in the mds daemon the first time it arose, we would not be running into it again now. In the case of mds (from what I found looking at the offending counter), we have four counters with the same problem: Line 3514 in 094206c
In fact, the use of the signs "+" or "-" in the counters has a different meaning depending on the counter. @vshankar Do you think it is possible to change the names of these counters (any possible regression?), or should we continue doing the "syntactic" replacement in the new ceph exporter? |
|
@jmolmo Putting aside the choice between carrying over massaging code from
What is new about ceph-exporter or how metrics are surfaced in 17.2.6? And, whatever it is, it doesn't appear to be disabled by default? |
I need to check. Will get back asap. |
|
@jmolmo Prometheus is only one way to expose perf-counters, but Ceph supports others: InfluxDB/Telegraf/Zabbix. Each one could have different requirements for metric names, so I don't think that each Ceph daemon individually should deal with that. IMHO it falls to the adaptation layer (ceph-exporter) to sanitize each perf-counter name to make it suitable for each monitoring framework. A different question is doing complicated metric renamings, like the one here. I think that those transformations should be provided by the daemons themselves. |
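A sanitizing step in the adaptation layer could be sketched roughly as follows. This is an assumption-laden illustration, not the ceph-exporter code: the helper name is hypothetical, and the replacement rules ('+' becomes `_plus`, a trailing '-' becomes `_minus`, any other disallowed character becomes '_') are chosen to match the `_plus`/`_minus` names reported earlier in this thread.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: rewrites a perf-counter name so that it only uses
// characters allowed in classic Prometheus metric names.
std::string sanitize_metric_name(const std::string& raw) {
  std::string out;
  out.reserve(raw.size() + 8);
  for (std::size_t i = 0; i < raw.size(); ++i) {
    const char c = raw[i];
    const bool last = (i + 1 == raw.size());
    if (c == '+') {
      out += "_plus";   // e.g. "cap+" -> "cap_plus"
    } else if (c == '-' && last) {
      out += "_minus";  // only a trailing '-' is spelled out
    } else if (std::isalnum(static_cast<unsigned char>(c)) || c == '_' ||
               c == ':') {
      out += c;         // already a legal character
    } else {
      out += '_';       // '.', '/', inner '-', whitespace, ...
    }
  }
  return out;
}
```

Keeping this in one place in the exporter means each daemon can keep its native counter names, and other monitoring backends can apply their own rules.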
|
@szinn we have discussed this topic in today's CLT meeting:
Cause
The specific PRs that introduced this issue were:
Those PRs allowed Cephadm/Rook to deploy the new (17.2.5)
Impact
The impact would be:
Workaround
Next Actions
Retrospective
CC: @adk3798 @avanthakkar @idryomov @jmolmo @nizamial09 @yuriw |
|
Thanks @epuertat for the detailed clarifications. |
|
Thanks for the writeup as well - This did show up when I upgraded to 17.2.6 from 17.2.5 via rook-ceph. I did try the ceph orch command and it indicated there was no orchestrator configured. |
Thanks for the heads up, @szinn. I completely ignored that Rook was already deploying this service. I'll update the notes above. Thanks again! The |
|
Yes, as of 17.2.6, rook is deploying Ceph-exporter and the service monitor instance to go with it. Unfortunately deleting the service monitor and Ceph-exporter won't work since the rook-operator will just put them back again. Will likely have to wait for either a fix from @travisn or wait until Ceph-exporter is corrected to sanitize all metric names before I can upgrade. |
The expected ceph version for exporter is changed temporarily because some regression was detected in Ceph version 17.2.6 which is summarised here ceph/ceph#50718 (comment). Thus, disabling ceph-exporter for now until all the regression are fixed. Signed-off-by: avanthakkar <avanjohn@gmail.com>
Fixes: https://tracker.ceph.com/issues/59191
Signed-off-by: Avan Thakkar <athakkar@redhat.com>
Ceph exporter no longer requires the output of perf dump/schema, as the ``counter dump`` command returns both labeled and unlabeled perf counters, which the exporter can fetch and export. Removed the ``exporter_get_labeled_counters`` config option, as the exporter will now export all counters, labeled or unlabeled.
The fix also adds support for renaming the rgw multi-site metrics and adding labels to them, similar to what the prometheus module does.