
mgr/prometheus: add pool crush rule and crush rule metadata #42216

Closed
wants to merge 2 commits

Conversation

clwluvw
Contributor

@clwluvw clwluvw commented Jul 7, 2021

  • Add ceph_crush_rule_metadata metric
  • Add crush_rule label for ceph_pool_metadata metric

Fixes: https://tracker.ceph.com/issues/51563
Signed-off-by: Seena Fallah seenafallah@gmail.com

@k0ste
Contributor

k0ste commented Jul 7, 2021

@clwluvw, I think crush_rule is missing the class field 😏

@clwluvw
Contributor Author

clwluvw commented Jul 7, 2021

> @clwluvw, I think crush_rule is missing the class field 😏

@k0ste Sorry, I don't get what you mean. Do you mean the crush_rule device class, or...?

@k0ste
Contributor

k0ste commented Jul 7, 2021

@clwluvw, yes, I mean the crush device class, because:

  • 2 pools (one for hdd and one for nvme)
  • one crush rule
  • different media

Without a device_class label we can't tell the difference between these two pools.

@clwluvw
Contributor Author

clwluvw commented Jul 7, 2021

@k0ste How are you assigning one crush_rule to two different pools with different device classes? 🤔

@k0ste
Contributor

k0ste commented Jul 7, 2021

@clwluvw oops, I over-abstracted. Anyway, it would be nice to know which device_class serves a pool, for alerting. Case: pool > 60% - alert to add new hardware. Which media type should we add? The operator needs to check which device_class is in the pool's crush rule 🧐

@clwluvw clwluvw force-pushed the prometheus-crush-rule branch 2 times, most recently from 547604d to 69d9aeb on July 7, 2021 16:01
@clwluvw
Contributor Author

clwluvw commented Jul 7, 2021

@k0ste I've added it as a take_step

@clwluvw
Contributor Author

clwluvw commented Jul 10, 2021

@epuertat Can you please help to review this?

@epuertat
Member

epuertat commented Jul 13, 2021

@clwluvw probably @tchaikov is your person here!

That said, are we sure we really want to report the CRUSH rules every 10-15 seconds? Is this a metric that can be somehow displayed in Grafana? And do you need time-series of that? I think that's more of a metadata thing and you want to know the current value, not the last 30 days' progression, don't you?

@p-se @aaSharma14 fyi.

@clwluvw
Contributor Author

clwluvw commented Jul 13, 2021

> @clwluvw probably @tchaikov is your person here!
>
> That said, are we sure we really want to report the CRUSH rules every 10-15 seconds? Is this a metric that can be somehow displayed in Grafana? And do you need time-series of that? I think that's more of a metadata thing and you want to know the current value, not the last 30 days' progression, don't you?
>
> @p-se @aaSharma14 fyi.

Yes, correct: it's just a metadata metric, like ceph_osd_metadata and the other metadata metrics, and it helps build graphs with more metadata.

@epuertat
Member

@clwluvw could you please 'draft' what you are trying to get here? For pseudo-static information or data whose temporal progression is not relevant, I'd rather move it to the Ceph-Dashboard (landing page or pools page).

@k0ste
Contributor

k0ste commented Jul 14, 2021

> @clwluvw could you please 'draft' what you are trying to get here? For pseudo-static information or data whose temporal progression is not relevant, I'd rather move it to the Ceph-Dashboard (landing page or pools page).

Actually we need this in Prometheus, not in the dashboard.
If the prometheus module could fetch the crush rules not every 10-15 seconds but, say, once an hour, that would of course be preferable.

@clwluvw
Contributor Author

clwluvw commented Jul 14, 2021

> @clwluvw could you please 'draft' what you are trying to get here? For pseudo-static information or data whose temporal progression is not relevant, I'd rather move it to the Ceph-Dashboard (landing page or pools page).

@epuertat I'm trying to get crush rule info into the Prometheus metrics as an info metric, so it can be joined to the pool metrics, for example, to get more information on them.
This metric is like ceph_osd_metadata or ceph_pool_metadata; you can see examples of their use in alerts, where they help make things more human-readable.
It's a common pattern in Prometheus for metadata, and it's already used in the Ceph exporter too.
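For example (a sketch, assuming the existing ceph_pool_stored metric and the crush_rule label this PR adds to ceph_pool_metadata), the usual info-metric join would look like this:

# pool usage, labelled with the pool name and its crush rule
ceph_pool_stored
  * on (pool_id) group_left (name, crush_rule) ceph_pool_metadata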

@clwluvw
Contributor Author

clwluvw commented Jul 23, 2021

ping @epuertat :)

Contributor

@p-se p-se left a comment


I'm getting a Traceback.

Traceback (most recent call last):
  File "/home/user/src/ceph/src/pybind/mgr/prometheus/module.py", line 232, in collect
    data = self.mod.collect()
  File "/home/user/src/ceph/src/pybind/mgr/mgr_util.py", line 753, in wrapper
    result = f(*args, **kwargs)
  File "/home/user/src/ceph/src/pybind/mgr/prometheus/module.py", line 1238, in collect
    self.get_metadata_and_osd_status()
  File "/home/user/src/ceph/src/pybind/mgr/mgr_util.py", line 753, in wrapper
    result = f(*args, **kwargs)
  File "/home/user/src/ceph/src/pybind/mgr/prometheus/module.py", line 888, in get_metadata_and_osd_status
    rule['ruleset'],
KeyError: 'ruleset'

It appears that ruleset equals rule_id in newer Ceph versions (and I don't get it returned in the CLI on the master branch either):

ceph/src/crush/CrushWrapper.cc

Lines 3000 to 3015 in ade690b

encode(crush->rules[i]->len, bl);
/*
 * legacy crush_rule_mask was
 *
 * struct crush_rule_mask {
 *   __u8 ruleset;
 *   __u8 type;
 *   __u8 min_size;
 *   __u8 max_size;
 * };
 *
 * encode ruleset=ruleid, and min/max of 1/100
 */
encode((__u8)i, bl);  // ruleset == ruleid
encode(crush->rules[i]->type, bl);

ceph/src/crush/CrushWrapper.cc

Lines 3145 to 3149 in ade690b

__u8 ruleset;  // ignore + discard
decode(ruleset, blp);
if (ruleset != i) {
  throw ::ceph::buffer::malformed_input("crush ruleset_id != rule_id; encoding is too old");
}

You might want to simply take out ruleset or set it to rule_id, but as I'm not a C++ developer, someone else might want to add a suggestion here.

Other than that, the PR looks okay to me. It does look like the data for the metadata metric has always been retrieved alongside other data; the PR just puts it into a metric, so performance-wise this shouldn't make a difference.

Signed-off-by: Seena Fallah <seenafallah@gmail.com>
Signed-off-by: Seena Fallah <seenafallah@gmail.com>
@clwluvw
Contributor Author

clwluvw commented Jul 23, 2021

@p-se Thanks for your review. Yes, you are right: I tested with Pacific v16.2.4, and this change wasn't backported to the pacific branch. As it isn't even printed in the CLI in recent versions, I've removed it from the exporter too.

@epuertat
Member

@clwluvw my main concern about this change is that it exposes no metrics at all (the metric value is constantly set to 1), so the metric is used just to convey almost static metadata:

ceph_crush_rule_metadata{rule_id="0",name="replicated_rule",type="1",take_step="default"} 1.0
ceph_crush_rule_metadata{rule_id="1",name="erasure-code",type="3",take_step="default"} 1.0

Yes, I know there are other metadata-only (label-only) metrics (e.g. ceph_pool_metadata). IMHO that's a mistake: the metadata should always be attached to a metric (instead of performing join-like queries), but at least that metadata is used from the Grafana dashboards. Here I don't see any use of this new metric. Can you please elaborate on what this is for?

@clwluvw
Contributor Author

clwluvw commented Jul 24, 2021

@epuertat This helps when using pool metrics based on their crush rule. For example, you may want to know the pool usage, pool IOPS, or pool throughput based on its crush rule (which in most scenarios maps to a device class), or you may want to create a dashboard for your SSD tier and see just the pools that use the SSD disks (crush rule).
Without this metric, we can't build those expressions.
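For example, something along these lines could restrict pool write IOPS to an SSD tier (just a sketch: it assumes the existing ceph_pool_wr counter, that the new crush_rule label on ceph_pool_metadata holds the rule id, and that take_step references an ssd device class, e.g. a value like default~ssd):

# write IOPS of the pools whose CRUSH rule takes from an ssd class
rate(ceph_pool_wr[5m])
  * on (pool_id) group_left (name, crush_rule) ceph_pool_metadata
  * on (crush_rule) group_left (take_step)
    label_replace(ceph_crush_rule_metadata{take_step=~".*ssd.*"},
                  "crush_rule", "$1", "rule_id", "(.+)")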

@epuertat
Member

> @epuertat This helps when using pool metrics based on their crush rule. For example, you may want to know the pool usage, pool IOPS, or pool throughput based on its crush rule (which in most scenarios maps to a device class), or you may want to create a dashboard for your SSD tier and see just the pools that use the SSD disks (crush rule).
> Without this metric, we can't build those expressions.

@clwluvw and can't we simply label the pools with the labels you want to use for the queries? PromQL and most NoSQL/timeseries DBs are extremely efficient when the data is modeled based on the queries. In this very case: pool{rule_name="replicated_rule"}.

What I wouldn't do in any case is converting the Prometheus exporter in a new read-only API gateway for exposing all Ceph internal metadata. That's beyond the purpose of monitoring, and also Ceph provides multiple ways to do that (CLI, Dashboard UI, RESTful API, etc).

@k0ste
Contributor

k0ste commented Jul 26, 2021

> @clwluvw and can't we simply label the pools with the labels you want to use for the queries? PromQL and most NoSQL/timeseries DBs are extremely efficient when the data is modeled based on the queries. In this very case: pool{rule_name="replicated_rule"}

We need device_class, not the rule name 🙂

@epuertat
Member

> @clwluvw and can't we simply label the pools with the labels you want to use for the queries? PromQL and most NoSQL/timeseries DBs are extremely efficient when the data is modeled based on the queries. In this very case: pool{rule_name="replicated_rule"}

> We need device_class, not the rule name

I don't see any reference to the device class on this PR.

@clwluvw
Contributor Author

clwluvw commented Jul 27, 2021

@epuertat Yes, we could put the crush rule metadata directly into the pool metadata metric, but I think making this metadata a dedicated metric helps a lot, because others may need to use it in their own way, for example with the device class (take the take_step and label_replace it as they need) or with other labels. And I don't think it's really harmful: it has few labels, and it doesn't request any extra data per scrape, because osd_map_crush is already requested once for osd_devices and we just export more data from it.
Also, in my opinion, metadata looks better as a dedicated metric :D
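As a sketch of that label_replace idea (assuming take_step encodes the device class after a separator, e.g. a value like default~ssd; the exact format depends on the rule):

# derive a device_class label from the rule's take step
label_replace(ceph_crush_rule_metadata, "device_class", "$1", "take_step", ".*~(.+)")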

@epuertat
Member

@clwluvw "a dedicated metric can help a lot because maybe others needed to use it" doesn't sound like a real use case to me. Prometheus is not a general purpose SQL database or API where you just expose things just in case someone needs them... For that you have the Dashboard API or the CLI. Prometheus is a time-series database specifically designed for monitoring/alerting purposes.

@clwluvw
Contributor Author

clwluvw commented Jul 29, 2021

@epuertat Yes, over time the crush rules can change. Also, you can see in this doc, for example, that metadata metrics are an established pattern (https://prometheus.io/docs/practices/naming/ - foobar_build_info).
Also, do you think adding the crush rule metadata as labels on the pool metric would sound good?
I don't think storing these kinds of metrics is a big deal for Prometheus, because I don't think it adds any real extra load on the Prometheus database.
Also, it's not just for Grafana dashboards; it can be used for alerting purposes too.
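For instance, an alert expression could carry the pool's crush rule alongside its fullness (a sketch assuming the existing ceph_pool_percent_used ratio metric and the crush_rule label from this PR):

# pools above 60% full, labelled with their name and crush rule
(ceph_pool_percent_used > 0.60)
  * on (pool_id) group_left (name, crush_rule) ceph_pool_metadata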

@epuertat
Member

@clwluvw I've been talking about this change with @p-se offline. I'll try to sum up my thinking one last time:

  • First of all, it's not so true that adding new metrics is harmless; in fact it's one of the basic scaling factors for Prometheus. I can tell you about Ceph users facing OOM events in the Prometheus container (with 8 GB RAM allocated) on large clusters.
  • This metric may be harmless, but as long as we are allowing any metadata to be considered a monitoring metric... On what grounds can we block the next one if we allowed this one? BTW, in the monitoring world there's this rule of not adding metrics that you wouldn't check if you were woken up at 2 am to deal with a system outage. You haven't yet shown that relevance here.
  • This metric has no direct use for Ceph right now. There's no new Grafana dashboard using it or an alerting rule (you mentioned one, but there's no trace of it here). So this is almost the same as adding dead code, with one exception: dead code doesn't have any performance impact, and this does (however small).

@clwluvw
Contributor Author

clwluvw commented Jul 29, 2021

> • This metric may be harmless, but as long as we are allowing any metadata to be considered a monitoring metric... On what grounds can we block the next one if we allowed this one? BTW, in the monitoring world there's this rule of not adding metrics that you wouldn't check if you were woken up at 2 am to deal with a system outage. You haven't yet shown that relevance here.

@epuertat It's a metric to be used in statistics panels. Why shouldn't we have this? Metrics aren't just for alerting; they can be used in statistics panels, as many metrics in Ceph and other standard Prometheus exporters are. I don't get your point here, sorry :(

> • This metric has no direct use for Ceph right now. There's no new Grafana dashboard using it or an alerting rule (you mentioned one, but there's no trace of it here). So this is almost the same as adding dead code, with one exception: dead code doesn't have any performance impact, and this does (however small).

I already told you the purpose: it can be used (at least I want to use it) in monitoring, to view pool metrics based on their crush rule (which would be the device class in many cases, because the crush rule's take_step matches the device class), and this really helps to build better monitoring dashboards. If not, can you tell me how I can get a list of my pools that are based on my ssd device class right now? I have a dashboard for my SSD tier to see statistics (IOPS, throughput, ...); the pool metrics carry those values too, and I want to list the pools in that device class as well, but right now I don't have any metric to filter my pools by device class!
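Concretely, the list I'm after would be something like this (assuming again that the crush_rule label holds the rule id and that take_step references the device class):

# pools served by CRUSH rules that take from an ssd class
ceph_pool_metadata
  * on (crush_rule) group_left (take_step)
    label_replace(ceph_crush_rule_metadata{take_step=~".*ssd.*"},
                  "crush_rule", "$1", "rule_id", "(.+)")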

@epuertat
Member

@clwluvw I don't think the PR, as it is, adds value to the whole Ceph community (maybe just to your private use case): there are no visible outcomes from this, no docs, no examples, no tests, nothing.

That said I'm not blocking this: let someone else jump in and give their approval to this.

Contributor

@pcuzner pcuzner left a comment


@clwluvw I'm struggling to see how this helps with alerts. We'll fire an alert based on POOL_NEARFULL or POOL_FULL - they're the things that get you out of bed, right? The workflow beyond that is to look at the pool and determine the action.

So how can we use this metric to improve the escalation?

The code runs osd crush dump at every refresh, and ideally we need to keep latency down. I haven't profiled your code, but just comparing osd crush dump between a 3-OSD cluster and one with 34 OSDs, I see the execution time go from 0.4s to 0.8s.

Before we could merge this I think we need to understand the impact the call would have on larger clusters.

Make sense?

@clwluvw
Contributor Author

clwluvw commented Oct 14, 2021

> @clwluvw I'm struggling to see how this helps with alerts. We'll fire an alert based on POOL_NEARFULL or POOL_FULL - they're the things that get you out of bed, right? The workflow beyond that is to look at the pool and determine the action.
>
> So how can we use this metric to improve the escalation?
>
> The code runs osd crush dump at every refresh, and ideally we need to keep latency down. I haven't profiled your code, but just comparing osd crush dump between a 3-OSD cluster and one with 34 OSDs, I see the execution time go from 0.4s to 0.8s.
>
> Before we could merge this I think we need to understand the impact the call would have on larger clusters.
>
> Make sense?

@pcuzner It's not only for alerting purposes; as I said in my previous comments, it would be used for monitoring purposes, like many other metrics in Prometheus that exist purely for monitoring!

Did you run your perf test without this change, too? I don't think this change issues any additional queries!

@stale

stale bot commented Jan 9, 2022

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@stale stale bot added the stale label Jan 9, 2022
@clwluvw
Contributor Author

clwluvw commented Feb 9, 2022

unstale please

@stale stale bot removed the stale label Feb 9, 2022
@djgalloway djgalloway changed the base branch from master to main July 3, 2022 00:00
@github-actions

github-actions bot commented Sep 1, 2022

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Sep 1, 2022
@github-actions

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@github-actions github-actions bot closed this Nov 28, 2022