
mgr/prometheus: add pool crush rule and crush rule metadata #42216

Closed
wants to merge 2 commits

Conversation

clwluvw
Contributor

@clwluvw clwluvw commented Jul 7, 2021

  • Add ceph_crush_rule_metadata metric
  • Add crush_rule label for ceph_pool_metadata metric

Fixes: https://tracker.ceph.com/issues/51563
Signed-off-by: Seena Fallah seenafallah@gmail.com

@k0ste
Contributor

k0ste commented Jul 7, 2021

@clwluvw, I think crush_rule is missing the class field 😏

@clwluvw
Contributor Author

clwluvw commented Jul 7, 2021

> @clwluvw, I think crush_rule is missing the class field 😏

@k0ste Sorry, I don't get what you mean. Do you mean the crush_rule device class, or...?

@k0ste
Contributor

k0ste commented Jul 7, 2021

@clwluvw, yes, I mean the crush device class, because:

  • 2 pools (one for hdd and one for nvme)
  • one crush rule
  • different media

Without a device_class label we can't tell the difference between these two pools.

@clwluvw
Contributor Author

clwluvw commented Jul 7, 2021

@k0ste How are you assigning one crush_rule to two different pools with different device classes? 🤔

@k0ste
Contributor

k0ste commented Jul 7, 2021

@clwluvw oops, I over-abstracted. Anyway, it would be nice to know which device_class serves a pool, for alerting. Case: pool > 60% - alert to add new hardware. Which media type should we add? The operator needs to check which device_class is in the pool's crush rule 🧐

@clwluvw clwluvw force-pushed the prometheus-crush-rule branch 2 times, most recently from 547604d to 69d9aeb on July 7, 2021 16:01
@clwluvw
Contributor Author

clwluvw commented Jul 7, 2021

@k0ste I've added it as a take_step

@clwluvw
Contributor Author

clwluvw commented Jul 10, 2021

@epuertat Can you please help to review this?

@epuertat
Member

epuertat commented Jul 13, 2021

@clwluvw probably @tchaikov is your person here!

That said, are we sure we really want to report the CRUSH rules every 10-15 seconds? Is this a metric that can be somehow displayed in Grafana? And do you need time-series of that? I think that's more of a metadata thing and you want to know the current value, not the last 30 days' progression, don't you?

@p-se @aaSharma14 fyi.

@clwluvw
Contributor Author

clwluvw commented Jul 13, 2021

> @clwluvw probably @tchaikov is your person here!
>
> That said, are we sure we really want to report the CRUSH rules every 10-15 seconds? Is this a metric that can be somehow displayed in Grafana? And do you need time-series of that? I think that's more of a metadata thing and you want to know the current value, not the last 30 days' progression, don't you?
>
> @p-se @aaSharma14 fyi.

Yes, correct: it's just a metadata metric, like ceph_osd_metadata and the other metadata metrics, and it helps build graphs with more metadata.

@epuertat
Member

@clwluvw could you please 'draft' what you are trying to get here? For pseudo-static information or data whose temporal progression is not relevant, I'd rather move it to the Ceph-Dashboard (landing page or pools page).

@k0ste
Contributor

k0ste commented Jul 14, 2021

> @clwluvw could you please 'draft' what you are trying to get here? For pseudo-static information or data whose temporal progression is not relevant, I'd rather move it to the Ceph-Dashboard (landing page or pools page).

Actually we need this in Prometheus, not in the dashboard.
If the prometheus module could fetch the crush rules not every 10-15 seconds but, say, once an hour, that would of course be preferable.

@clwluvw
Contributor Author

clwluvw commented Jul 14, 2021

> @clwluvw could you please 'draft' what you are trying to get here? For pseudo-static information or data whose temporal progression is not relevant, I'd rather move it to the Ceph-Dashboard (landing page or pools page).

@epuertat I'm trying to get crush rule info into the Prometheus metrics as an info metric, so it can be joined to the pool metrics, for example, to get more information on them.
This metric is like ceph_osd_metadata or ceph_pool_metadata; you can see examples of their use in alerts, where they help make things more human-readable.
It's a common pattern in Prometheus for metadata, and it's already used in the Ceph exporter too.
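For example (a sketch, assuming the existing ceph_pool_stored metric and the crush_rule label this PR adds to ceph_pool_metadata), the usual info-metric join would look like this:

# pool usage, labelled with the pool name and its crush rule
ceph_pool_stored
  * on (pool_id) group_left (name, crush_rule) ceph_pool_metadata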

@clwluvw
Contributor Author

clwluvw commented Jul 23, 2021

ping @epuertat :)

Contributor

@p-se p-se left a comment


I'm getting a Traceback.

Traceback (most recent call last):
  File "/home/user/src/ceph/src/pybind/mgr/prometheus/module.py", line 232, in collect
    data = self.mod.collect()
  File "/home/user/src/ceph/src/pybind/mgr/mgr_util.py", line 753, in wrapper
    result = f(*args, **kwargs)
  File "/home/user/src/ceph/src/pybind/mgr/prometheus/module.py", line 1238, in collect
    self.get_metadata_and_osd_status()
  File "/home/user/src/ceph/src/pybind/mgr/mgr_util.py", line 753, in wrapper
    result = f(*args, **kwargs)
  File "/home/user/src/ceph/src/pybind/mgr/prometheus/module.py", line 888, in get_metadata_and_osd_status
    rule['ruleset'],
KeyError: 'ruleset'

It appears that ruleset equals rule_id in newer Ceph versions (and I don't get it returned in the CLI on the master branch either):

ceph/src/crush/CrushWrapper.cc

Lines 3000 to 3015 in ade690b

encode(crush->rules[i]->len, bl);
/*
 * legacy crush_rule_mask was
 *
 * struct crush_rule_mask {
 *   __u8 ruleset;
 *   __u8 type;
 *   __u8 min_size;
 *   __u8 max_size;
 * };
 *
 * encode ruleset=ruleid, and min/max of 1/100
 */
encode((__u8)i, bl);  // ruleset == ruleid
encode(crush->rules[i]->type, bl);

ceph/src/crush/CrushWrapper.cc

Lines 3145 to 3149 in ade690b

__u8 ruleset;  // ignore + discard
decode(ruleset, blp);
if (ruleset != i) {
  throw ::ceph::buffer::malformed_input("crush ruleset_id != rule_id; encoding is too old");
}

You might want to simply take out ruleset or set it to rule_id, but as I'm not a C++ developer, someone else might want to add a suggestion here.

Other than that, the PR looks okay to me. It does look like the data for the metadata metric has always been retrieved alongside other data; the PR just puts it into a metric, so performance-wise this shouldn't make a difference.

Signed-off-by: Seena Fallah <seenafallah@gmail.com>
Signed-off-by: Seena Fallah <seenafallah@gmail.com>
@clwluvw
Contributor Author

clwluvw commented Jul 23, 2021

@p-se Thanks for your review. Yes, you are right: I tested with Pacific v16.2.4, and this change wasn't backported to the pacific branch. As it isn't even printed in the CLI in recent versions, I've removed it from the exporter too.

@epuertat
Member

@clwluvw my main concern about this change is that it exposes no metrics at all (the metric value is constantly set to 1), so the metric is used just to convey almost static metadata:

ceph_crush_rule_metadata{rule_id="0",name="replicated_rule",type="1",take_step="default"} 1.0
ceph_crush_rule_metadata{rule_id="1",name="erasure-code",type="3",take_step="default"} 1.0

Yes, I know there are other metadata-only (label-only) metrics (e.g. ceph_pool_metadata). IMHO that's a mistake: the metadata should always be attached to a metric (instead of performing join-like queries), but at least that metadata is used from the Grafana dashboards. Here I don't see any use of this new metric. Can you please elaborate on what this is for?

@clwluvw
Contributor Author

clwluvw commented Jul 24, 2021

@epuertat This helps when using pool metrics based on their crush rule. For example, you may want to know the pool usage, pool IOPS, or pool throughput based on its crush rule (which in most scenarios maps to a device class), or you may want to create a dashboard for your SSD tier and see just the pools that use the SSD disks (crush rule).
Without this metric, we can't build those expressions.
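For example, something along these lines could restrict pool write IOPS to an SSD tier (just a sketch: it assumes the existing ceph_pool_wr counter, that the new crush_rule label on ceph_pool_metadata holds the rule id, and that take_step references an ssd device class, e.g. a value like default~ssd):

# write IOPS of the pools whose CRUSH rule takes from an ssd class
rate(ceph_pool_wr[5m])
  * on (pool_id) group_left (name, crush_rule) ceph_pool_metadata
  * on (crush_rule) group_left (take_step)
    label_replace(ceph_crush_rule_metadata{take_step=~".*ssd.*"},
                  "crush_rule", "$1", "rule_id", "(.+)")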

@epuertat
Member

> @epuertat This helps when using pool metrics based on their crush rule. For example, you may want to know the pool usage, pool IOPS, or pool throughput based on its crush rule (which in most scenarios maps to a device class), or you may want to create a dashboard for your SSD tier and see just the pools that use the SSD disks (crush rule).
> Without this metric, we can't build those expressions.

@clwluvw and can't we simply label the pools with the labels you want to use for the queries? PromQL and most NoSQL/timeseries DBs are extremely efficient when the data is modeled based on the queries. In this very case: pool{rule_name="replicated_rule"}.

What I wouldn't do in any case is converting the Prometheus exporter in a new read-only API gateway for exposing all Ceph internal metadata. That's beyond the purpose of monitoring, and also Ceph provides multiple ways to do that (CLI, Dashboard UI, RESTful API, etc).

@k0ste
Contributor

k0ste commented Jul 26, 2021

> @clwluvw and can't we simply label the pools with the labels you want to use for the queries? PromQL and most NoSQL/timeseries DBs are extremely efficient when the data is modeled based on the queries. In this very case: pool{rule_name="replicated_rule"}

We need device_class, not the rule name 🙂

@epuertat
Member

> @clwluvw and can't we simply label the pools with the labels you want to use for the queries? PromQL and most NoSQL/timeseries DBs are extremely efficient when the data is modeled based on the queries. In this very case: pool{rule_name="replicated_rule"}

> We need device_class, not the rule name

I don't see any reference to the device class on this PR.

@clwluvw
Contributor Author

clwluvw commented Jul 27, 2021

@epuertat Yes, we could put the crush rule metadata directly into the pool metadata metric, but I think making this metadata a dedicated metric helps a lot, because others may need to use it in their own way, for example with the device class (take the take_step and label_replace it as they need) or with other labels. And I don't think it's really harmful: it has few labels, and it doesn't request any extra data per scrape, because osd_map_crush is already requested once for osd_devices and we just export more data from it.
Also, in my opinion, metadata looks better as a dedicated metric :D
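As a sketch of that label_replace idea (assuming take_step encodes the device class after a separator, e.g. a value like default~ssd; the exact format depends on the rule):

# derive a device_class label from the rule's take step
label_replace(ceph_crush_rule_metadata, "device_class", "$1", "take_step", ".*~(.+)")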

@epuertat
Member

@clwluvw "a dedicated metric can help a lot because maybe others needed to use it" doesn't sound like a real use case to me. Prometheus is not a general purpose SQL database or API where you just expose things just in case someone needs them... For that you have the Dashboard API or the CLI. Prometheus is a time-series database specifically designed for monitoring/alerting purposes.

@clwluvw
Contributor Author

clwluvw commented Jul 29, 2021

@epuertat Yes, over time the crush rules can change. Also, you can see in this doc, for example, that metadata metrics are an established pattern (https://prometheus.io/docs/practices/naming/ - foobar_build_info).
Also, do you think adding the crush rule metadata as labels on the pool metric would sound good?
I don't think storing these kinds of metrics is a big deal for Prometheus, because I don't think it adds any real extra load on the Prometheus database.
Also, it's not just for Grafana dashboards; it can be used for alerting purposes too.
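For instance, an alert expression could carry the pool's crush rule alongside its fullness (a sketch assuming the existing ceph_pool_percent_used ratio metric and the crush_rule label from this PR):

# pools above 60% full, labelled with their name and crush rule
(ceph_pool_percent_used > 0.60)
  * on (pool_id) group_left (name, crush_rule) ceph_pool_metadata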

@epuertat
Member

@clwluvw I've been talking about this change with @p-se offline. I'll try to sum up my thinking one last time:

  • First of all, it's not so true that adding new metrics is harmless; in fact it's one of the basic scaling factors for Prometheus. I can tell you about Ceph users facing OOM events in the Prometheus container (with 8 GB RAM allocated) on large clusters.
  • This metric may be harmless, but as long as we are allowing any metadata to be considered a monitoring metric... On what grounds can we block the next one if we allowed this one? BTW, in the monitoring world there's this rule of not adding metrics that you wouldn't check if you were woken up at 2 am to deal with a system outage. You haven't yet shown that relevance here.
  • This metric has no direct use for Ceph right now. There's no new Grafana dashboard using it or an alerting rule (you mentioned one, but there's no trace of it here). So this is almost the same as adding dead code, with one exception: dead code doesn't have any performance impact, and this does (however small).

@clwluvw
Contributor Author

clwluvw commented Jul 29, 2021

> • This metric may be harmless, but as long as we are allowing any metadata to be considered a monitoring metric... On what grounds can we block the next one if we allowed this one? BTW, in the monitoring world there's this rule of not adding metrics that you wouldn't check if you were woken up at 2 am to deal with a system outage. You haven't yet shown that relevance here.

@epuertat It's a metric to be used in statistics panels. Why shouldn't we have this? Metrics aren't just for alerting; they can be used in statistics panels, as many metrics in Ceph and other standard Prometheus exporters are. I don't get your point here, sorry :(

> • This metric has no direct use for Ceph right now. There's no new Grafana dashboard using it or an alerting rule (you mentioned one, but there's no trace of it here). So this is almost the same as adding dead code, with one exception: dead code doesn't have any performance impact, and this does (however small).

I already told you the purpose: it can be used (at least I want to use it) in monitoring, to view pool metrics based on their crush rule (which would be the device class in many cases, because the crush rule's take_step matches the device class), and this really helps to build better monitoring dashboards. If not, can you tell me how I can get a list of my pools that are based on my ssd device class right now? I have a dashboard for my SSD tier to see statistics (IOPS, throughput, ...); the pool metrics carry those values too, and I want to list the pools in that device class as well, but right now I don't have any metric to filter my pools by device class!
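Concretely, the list I'm after would be something like this (assuming again that the crush_rule label holds the rule id and that take_step references the device class):

# pools served by CRUSH rules that take from an ssd class
ceph_pool_metadata
  * on (crush_rule) group_left (take_step)
    label_replace(ceph_crush_rule_metadata{take_step=~".*ssd.*"},
                  "crush_rule", "$1", "rule_id", "(.+)")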

@epuertat
Member

@clwluvw I don't think the PR, as it is, adds value to the whole Ceph community (maybe just to your private use case): there are no visible outcomes from this, no docs, no examples, no tests, nothing.

That said I'm not blocking this: let someone else jump in and give their approval to this.

Contributor

@pcuzner pcuzner left a comment


@clwluvw I'm struggling to see how this helps with alerts. We'll fire an alert based on POOL_NEARFULL or POOL_FULL - they're the things that get you out of bed, right? The workflow beyond that is to look at the pool and determine the action.

So how can we use this metric to improve the escalation?

The code runs osd crush dump at every refresh, and ideally we need to keep latency down. I haven't profiled your code, but just comparing osd crush dump between a 3-OSD cluster and one with 34 OSDs, I see the execution time go from 0.4s to 0.8s.

Before we could merge this I think we need to understand the impact the call would have on larger clusters.

Make sense?

@clwluvw
Contributor Author

clwluvw commented Oct 14, 2021

> @clwluvw I'm struggling to see how this helps with alerts. We'll fire an alert based on POOL_NEARFULL or POOL_FULL - they're the things that get you out of bed, right? The workflow beyond that is to look at the pool and determine the action.
>
> So how can we use this metric to improve the escalation?
>
> The code runs osd crush dump at every refresh, and ideally we need to keep latency down. I haven't profiled your code, but just comparing osd crush dump between a 3-OSD cluster and one with 34 OSDs, I see the execution time go from 0.4s to 0.8s.
>
> Before we could merge this I think we need to understand the impact the call would have on larger clusters.
>
> Make sense?

@pcuzner It's not only for alerting purposes; as I said in my previous comments, it would be used for monitoring purposes, like many other metrics in Prometheus that exist purely for monitoring!

Did you run your perf test without this change, too? I don't think this change issues any additional queries!

@stale

stale bot commented Jan 9, 2022

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@stale stale bot added the stale label Jan 9, 2022
@clwluvw
Contributor Author

clwluvw commented Feb 9, 2022

unstale please

@stale stale bot removed the stale label Feb 9, 2022
@djgalloway djgalloway changed the base branch from master to main July 3, 2022 00:00
@github-actions

github-actions bot commented Sep 1, 2022

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Sep 1, 2022
@github-actions

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@github-actions github-actions bot closed this Nov 28, 2022