mgr: Fix ceph_daemon label in ceph_rgw_* metrics #43707
Conversation
See https://tracker.ceph.com/issues/51120#note-7 for why this is still a draft.
Force-pushed from 507b8e0 to 2ab7463.
Force-pushed from d3b7c95 to 6a2087c.
Shouldn't this PR address all existing queries that will have to change (monitoring/grafana/dashboards/)?
@pereman2 That's a good point, I can definitely add that.
Force-pushed from 6a2087c to ebae43d.
Force-pushed from ebae43d to 9a51a5f.
Force-pushed from 9a51a5f to b9b3098.
I've updated the dashboards and added tests to cover the modified panels, so this PR should be ready for another review.

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved.
The `DaemonStateCollection` used to always contain the daemon name in its `DaemonKey`, but since ceph#40220 (or more specifically afc3375), the RadosGW registers with its instance ID instead (`rados.get_instance_id()`). As a result, the `ceph_rgw_*` metrics returned by `ceph-mgr` through the `prometheus` module have their `ceph_daemon` label include that ID instead of the daemon name, e.g.

```
ceph_rgw_req{ceph_daemon="rgw.127202"}
```

instead of

```
ceph_rgw_req{ceph_daemon="rgw.my-hostname.rgw0"}
```

This commit adds the daemon name from `state->metadata["id"]`, if available, as `service.name` in the JSON document returned by `dump_server()`.

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
Now that the RadosGW returns its instance ID instead of its daemon name, replace the `ceph_daemon` label with an `instance_id` label on the `rgw` metrics.

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
In order to get the `ceph_daemon` label for `rgw` metrics corresponding to the value before ceph#40220, we need to add the `instance_id` label to the `ceph_rgw_metadata` metric. This way, the old `ceph_daemon` label can be added to any `ceph_rgw_*` metric using the following PromQL query, for instance:

```
ceph_rgw_req * on (instance_id) group_left(ceph_daemon) ceph_rgw_metadata
```

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
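For readers less familiar with PromQL vector matching: `* on (instance_id) group_left(ceph_daemon)` is a many-to-one join where each left-hand sample is matched to the single `ceph_rgw_metadata` series with the same `instance_id`, and the `ceph_daemon` label is copied across. A minimal Python sketch of that semantics (hypothetical sample data, not Prometheus internals):

```python
# Sketch of PromQL's `* on (instance_id) group_left(ceph_daemon)`:
# a many-to-one join copying one label from the right-hand metric.

def group_left_join(samples, metadata, on="instance_id", copy="ceph_daemon"):
    # Lookup from the join label to the label we want to copy over.
    lookup = {m[on]: m[copy] for m in metadata}
    joined = []
    for labels, value in samples:
        enriched = dict(labels)
        enriched[copy] = lookup[enriched[on]]
        # Multiplying by ceph_rgw_metadata (always 1) leaves values unchanged.
        joined.append((enriched, value * 1))
    return joined

# Hypothetical series mirroring the commit message above.
reqs = [({"__name__": "ceph_rgw_req", "instance_id": "127202"}, 42.0)]
meta = [{"instance_id": "127202", "ceph_daemon": "rgw.my-hostname.rgw0"}]

for labels, value in group_left_join(reqs, meta):
    print(labels["ceph_daemon"], value)  # rgw.my-hostname.rgw0 42.0
```

The multiplication in the real query is only there to trigger the join; since `ceph_rgw_metadata` is always `1`, the metric values pass through unchanged.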
With the `ceph_daemon` label now replaced by `instance_id` on all `ceph_rgw_*` metrics, we need to update the Grafana dashboards to get that label back from `ceph_rgw_metadata`, using this type of construct:

```
ceph_rgw_req * on (instance_id) group_left(ceph_daemon) ceph_rgw_metadata
```

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
Some of the expressions modified in c402903 were not covered by any tests, especially those in the `radosgw-detail.json` dashboard. This commit fills in those gaps.

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
Force-pushed from b9b3098 to 2daaa05.
Hi @pereman2, no worries, and thanks for the review! It's very strange, I just tried again on a `vstart` cluster:

```
$ RGW=1 ../src/vstart.sh --debug --new -x --localhost --bluestore
$ ./bin/ceph mgr module enable prometheus
$ curl -s localhost:9283/metrics | grep ^ceph_rgw_
ceph_rgw_metadata{ceph_daemon="rgw.8000",hostname="5e7ace7cdc93",ceph_version="ceph version Development (no_version) quincy (dev)",instance_id="4425"} 1.0
ceph_rgw_cache_hit{instance_id="4425"} 0.0
ceph_rgw_cache_miss{instance_id="4425"} 10.0
ceph_rgw_failed_req{instance_id="4425"} 0.0
ceph_rgw_gc_retire_object{instance_id="4425"} 0.0
ceph_rgw_get{instance_id="4425"} 0.0
ceph_rgw_get_b{instance_id="4425"} 0.0
ceph_rgw_get_initial_lat_sum{instance_id="4425"} 0.0
ceph_rgw_get_initial_lat_count{instance_id="4425"} 0.0
ceph_rgw_keystone_token_cache_hit{instance_id="4425"} 0.0
ceph_rgw_keystone_token_cache_miss{instance_id="4425"} 0.0
ceph_rgw_lc_abort_mpu{instance_id="4425"} 0.0
ceph_rgw_lc_expire_current{instance_id="4425"} 0.0
ceph_rgw_lc_expire_dm{instance_id="4425"} 0.0
ceph_rgw_lc_expire_noncurrent{instance_id="4425"} 0.0
ceph_rgw_lc_transition_current{instance_id="4425"} 0.0
ceph_rgw_lc_transition_noncurrent{instance_id="4425"} 0.0
ceph_rgw_pubsub_event_lost{instance_id="4425"} 0.0
ceph_rgw_pubsub_event_triggered{instance_id="4425"} 0.0
ceph_rgw_pubsub_events{instance_id="4425"} 0.0
ceph_rgw_pubsub_missing_conf{instance_id="4425"} 0.0
ceph_rgw_pubsub_push_failed{instance_id="4425"} 0.0
ceph_rgw_pubsub_push_ok{instance_id="4425"} 0.0
ceph_rgw_pubsub_push_pending{instance_id="4425"} 0.0
ceph_rgw_pubsub_store_fail{instance_id="4425"} 0.0
ceph_rgw_pubsub_store_ok{instance_id="4425"} 0.0
ceph_rgw_put{instance_id="4425"} 0.0
ceph_rgw_put_b{instance_id="4425"} 0.0
ceph_rgw_put_initial_lat_sum{instance_id="4425"} 0.0
ceph_rgw_put_initial_lat_count{instance_id="4425"} 0.0
ceph_rgw_qactive{instance_id="4425"} 0.0
ceph_rgw_qlen{instance_id="4425"} 0.0
ceph_rgw_req{instance_id="4425"} 0.0
```

Are you sure you were running a version of …?
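As a side note, a quick way to check which label names each series in such a dump carries is to parse the exposition text; here is a rough regex-based sketch over lines like the ones above (illustrative only, not the official Prometheus parser):

```python
import re

# Rough parser for Prometheus text-exposition lines of the form:
#   metric_name{label="value",...} <number>
LINE_RE = re.compile(r'^(?P<name>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

def label_names(line):
    """Return (metric_name, set of label names) for one exposition line."""
    m = LINE_RE.match(line)
    if not m:
        return None, set()
    return m.group("name"), {k for k, _ in LABEL_RE.findall(m.group("labels"))}

name, labels = label_names('ceph_rgw_req{instance_id="4425"} 0.0')
print(name, sorted(labels))  # ceph_rgw_req ['instance_id']

name, labels = label_names(
    'ceph_rgw_metadata{ceph_daemon="rgw.8000",hostname="5e7ace7cdc93",'
    'instance_id="4425"} 1.0')
print("ceph_daemon" in labels)  # True
```

Running this over the output above confirms that `ceph_daemon` now only appears on `ceph_rgw_metadata`, while all other `ceph_rgw_*` series carry just `instance_id`.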
Recompiled and worked :P, LGTM. Thanks @BenoitKnecht
Has anyone else had a chance to review this?

@aaSharma14 @epuertat ping
LGTM!
jenkins test make check arm64
Great work @BenoitKnecht! This is overlapping with #44059, but given this is fixing a bug, I'd rather prioritize this over that one. @MrFreezeex are you ok with having this one merged first? I hope this doesn't mean too much rebasing for you...
```diff
@@ -494,7 +494,7 @@ local addStyle(alias, colorMode, colors, dateFormat, decimals, mappingType, patt
   'Latencies are shown stacked, without a yaxis to provide a visual indication of GET latency imbalance across RGW hosts',
   's',
   'short',
-  'label_replace(rate(ceph_rgw_get_initial_lat_sum[30s]),"rgw_host","$1","ceph_daemon","rgw.(.*)") / \nlabel_replace(rate(ceph_rgw_get_initial_lat_count[30s]),"rgw_host","$1","ceph_daemon","rgw.(.*)")',
+  'label_replace(\n  rate(ceph_rgw_get_initial_lat_sum[30s]) /\n  rate(ceph_rgw_get_initial_lat_count[30s]) *\n  on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata,\n"rgw_host", "$1", "ceph_daemon", "rgw.(.*)")',
```
Let's better use jsonnet verbatims (`|||`) instead of explicit line breaks:

```diff
-  'label_replace(\n  rate(ceph_rgw_get_initial_lat_sum[30s]) /\n  rate(ceph_rgw_get_initial_lat_count[30s]) *\n  on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata,\n"rgw_host", "$1", "ceph_daemon", "rgw.(.*)")',
+  |||
+    label_replace(
+      rate(ceph_rgw_get_initial_lat_sum[30s]) /
+      rate(ceph_rgw_get_initial_lat_count[30s]) *
+      on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata,
+      "rgw_host", "$1", "ceph_daemon", "rgw.(.*)")
+  |||,
```
```diff
@@ -520,7 +520,7 @@ local addStyle(alias, colorMode, colors, dateFormat, decimals, mappingType, patt
   'Total bytes transferred in/out through get/put operations, by radosgw instance',
   'bytes',
   'short',
-  'sum by(rgw_host) (\n  (label_replace(rate(ceph_rgw_get_b[30s]), "rgw_host","$1","ceph_daemon","rgw.(.*)")) + \n  (label_replace(rate(ceph_rgw_put_b[30s]), "rgw_host","$1","ceph_daemon","rgw.(.*)"))\n)',
+  'label_replace(sum by (instance_id) (\n  rate(ceph_rgw_get_b[30s]) + \n  rate(ceph_rgw_put_b[30s])\n) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata, "rgw_host", "$1", "ceph_daemon", "rgw.(.*)")',
```
Same here, with verbatim magic (since we are refactoring this, let's take the chance to clean this up a bit).
```diff
@@ -529,7 +529,7 @@ local addStyle(alias, colorMode, colors, dateFormat, decimals, mappingType, patt
   'Latencies are shown stacked, without a yaxis to provide a visual indication of PUT latency imbalance across RGW hosts',
   's',
   'short',
-  'label_replace(rate(ceph_rgw_put_initial_lat_sum[30s]),"rgw_host","$1","ceph_daemon","rgw.(.*)") / \nlabel_replace(rate(ceph_rgw_put_initial_lat_count[30s]),"rgw_host","$1","ceph_daemon","rgw.(.*)")',
+  'label_replace(\n  rate(ceph_rgw_put_initial_lat_sum[30s]) /\n  rate(ceph_rgw_put_initial_lat_count[30s]) *\n  on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata,\n"rgw_host", "$1", "ceph_daemon", "rgw.(.*)")',
```
And here ;-)
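For readers unfamiliar with `label_replace`: it matches the given regex against the whole value of the source label and, on a full match, writes the expansion of the replacement string into the destination label. A small Python approximation of the `rgw_host` extraction used in these panels (illustrative only, and handling just the `$1` capture used here):

```python
import re

def label_replace(labels, dst, repl, src, regex):
    """Approximate PromQL label_replace(): the regex is anchored at both
    ends of the source label value; on a match, dst is set from repl."""
    m = re.fullmatch(regex, labels.get(src, ""))
    out = dict(labels)
    if m:
        # Simplification: only translate the `$1` capture reference.
        out[dst] = m.expand(repl.replace("$1", r"\1"))
    return out

labels = {"ceph_daemon": "rgw.my-hostname.rgw0", "instance_id": "127202"}
out = label_replace(labels, "rgw_host", "$1", "ceph_daemon", "rgw.(.*)")
print(out["rgw_host"])  # my-hostname.rgw0
```

Non-matching series pass through unchanged, which is why the dashboards can safely apply this to the joined result: only series whose `ceph_daemon` starts with `rgw.` gain an `rgw_host` label.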
```gherkin
    And variable `rgw_servers` is `rgw.foo`
    Then Grafana panel `$rgw_servers GET/PUT Latencies` with legend `GET {{ceph_daemon}}` shows:
      | metrics                                         | values             |
      | {ceph_daemon="rgw.foo", instance_id="58892247"} | 2.5000000000000004 |
```
oh my IEEE 754 🙈
```gherkin
  Scenario: "Test $rgw_servers GET/PUT Latencies - GET"
    Given the following series:
      | metrics                                                                                                                   | values    |
      | ceph_rgw_get_initial_lat_sum{instance="127.0.0.1", instance_id="58892247", job="ceph"}                                    | 10 50 100 |
      | ceph_rgw_get_initial_lat_count{instance="127.0.0.1", instance_id="58892247", job="ceph"}                                  | 20 60 80  |
      | ceph_rgw_metadata{ceph_daemon="rgw.foo", hostname="localhost", instance="127.0.0.1", instance_id="58892247", job="ceph"}  | 1 1 1     |
    When interval is `30s`
    And variable `rgw_servers` is `rgw.foo`
    Then Grafana panel `$rgw_servers GET/PUT Latencies` with legend `GET {{ceph_daemon}}` shows:
      | metrics                                         | values             |
      | {ceph_daemon="rgw.foo", instance_id="58892247"} | 2.5000000000000004 |
```
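The `2.5000000000000004` expected above is ordinary float64 rounding: the panel divides two per-second rates that are not exactly representable in binary, so the mathematically exact 2.5 picks up one ULP of error. A quick illustration, with the rates computed by hand from the last two samples of the series above (sum goes 50 to 100, count goes 60 to 80, over 30 seconds; Prometheus's actual `rate()` extrapolation may compute the intermediates slightly differently):

```python
# Hand-computed per-second rates from the series above:
# sum increases by 50 and count by 20 over a 30s window.
rate_sum = 50 / 30    # 5/3 is not exactly representable in float64
rate_count = 20 / 30  # neither is 2/3
latency = rate_sum / rate_count
print(latency)        # 2.5000000000000004, one ULP above the exact 2.5
```

The rounding errors of the two intermediates push the quotient just past the halfway point between 2.5 and the next representable double, hence the trailing `…04` that the test fixture has to encode verbatim.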
Awesome work!
Yes sure, you can merge this before. It's always a pain to rebase but I did it already a couple of times so I can do it once more, no worries.

Thank you @MrFreezeex! Will merge this then.
Thanks @epuertat for the review and subsequent merge. I didn't know the … And thanks @MrFreezeex for accepting the burden of having to rebase your own PR. :)

Could this be flagged for backport to Pacific? It would be great to have those metric labels be consistent again.
@BenoitKnecht the pacific backport tracker is already there: https://tracker.ceph.com/issues/54118 (you just need to run the script).
Ah that's great, thanks! I thought some specific label was needed on this PR for a bot to do the backport, but the script worked just fine.
The `DaemonStateCollection` used to always contain the daemon name in its `DaemonKey`, but since #40220 (or more specifically afc3375), the RadosGW registers with its instance ID instead (`rados.get_instance_id()`).

As a result, the `ceph_rgw_*` metrics returned by `ceph-mgr` through the `prometheus` module have their `ceph_daemon` label include that ID instead of the daemon name, e.g.

```
ceph_rgw_req{ceph_daemon="rgw.127202"}
```

instead of

```
ceph_rgw_req{ceph_daemon="rgw.my-hostname.rgw0"}
```

This PR replaces the `ceph_daemon` label with an `instance_id` label on all `ceph_rgw_*` metrics. The old `ceph_daemon` label can then be added back to any `ceph_rgw_*` metric using the following PromQL query, for instance:

```
ceph_rgw_req * on (instance_id) group_left(ceph_daemon) ceph_rgw_metadata
```

Fixes: https://tracker.ceph.com/issues/51120

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>