
mgr: Fix ceph_daemon label in ceph_rgw_* metrics #43707

Merged: 5 commits into ceph:master on Feb 2, 2022

Conversation

@BenoitKnecht (Contributor) commented on Oct 28, 2021:

The `DaemonStateCollection` used to always contain the daemon name in its
`DaemonKey`, but since #40220 (or more specifically
afc3375), the RadosGW registers with its
instance ID instead (`rados.get_instance_id()`).

As a result, the `ceph_rgw_*` metrics returned by `ceph-mgr` through the
`prometheus` module have their `ceph_daemon` label include that ID instead of
the daemon name, e.g.

```
ceph_rgw_req{ceph_daemon="rgw.127202"}
```

instead of

```
ceph_rgw_req{ceph_daemon="rgw.my-hostname.rgw0"}
```

This PR replaces the `ceph_daemon` label with an `instance_id` label on all
`ceph_rgw_*` metrics. The old `ceph_daemon` label can then be added back to any
`ceph_rgw_*` metric using the following PromQL query, for instance:

```
ceph_rgw_req * on (instance_id) group_left(ceph_daemon) ceph_rgw_metadata
```
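Other labels carried by `ceph_rgw_metadata` (such as `hostname`, visible in the sample output further down in this thread) can be pulled in the same way; a sketch using the same metrics:

```
ceph_rgw_req * on (instance_id) group_left(ceph_daemon, hostname) ceph_rgw_metadata
```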

Fixes: https://tracker.ceph.com/issues/51120
Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug


@BenoitKnecht changed the title from mgr/ActivePyModules: Use metadata id in dump_server() to Draft: mgr/ActivePyModules: Use metadata id in dump_server() on Oct 28, 2021
@BenoitKnecht marked this pull request as draft on October 28, 2021
@BenoitKnecht (Contributor, Author):

See https://tracker.ceph.com/issues/51120#note-7 for why this is still a draft.

@p-se self-requested a review on November 11, 2021
@BenoitKnecht force-pushed the ceph-mgr-service-id branch 2 times, most recently from d3b7c95 to 6a2087c, on November 12, 2021
@BenoitKnecht changed the title from Draft: mgr/ActivePyModules: Use metadata id in dump_server() to mgr: Fix ceph_daemon label in ceph_rgw_* metrics on Nov 12, 2021
@BenoitKnecht marked this pull request as ready for review on November 12, 2021
@pereman2 (Contributor) left a comment:

Shouldn't this PR address all the existing queries that will have to change (monitoring/grafana/dashboards/)?

@BenoitKnecht (Contributor, Author):

@pereman2 That's a good point, I can definitely add that.

@BenoitKnecht (Contributor, Author):

I've updated the dashboards and added tests to cover the modified panels, so this PR should be ready for another review.

@github-actions:

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved.

(The five commit messages on this branch follow.)

The `DaemonStateCollection` used to always contain the daemon name in its
`DaemonKey`, but since ceph#40220 (or more specifically
afc3375), the RadosGW registers with its
instance ID instead (`rados.get_instance_id()`).

As a result, the `ceph_rgw_*` metrics returned by `ceph-mgr` through the
`prometheus` module have their `ceph_daemon` label include that ID instead of
the daemon name, e.g.

```
ceph_rgw_req{ceph_daemon="rgw.127202"}
```

instead of

```
ceph_rgw_req{ceph_daemon="rgw.my-hostname.rgw0"}
```

This commit adds the daemon name from `state->metadata["id"]`, if available, as
`service.name` in the JSON document returned by `dump_server()`.

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
Now that the RadosGW returns its instance ID instead of its daemon name,
replace the `ceph_daemon` label with an `instance_id` label on the `rgw`
metrics.

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
In order to recover the `ceph_daemon` label for `rgw` metrics with the value it
had before ceph#40220, we need to add the `instance_id` label to the
`ceph_rgw_metadata` metric.

This way, the old `ceph_daemon` label can be added to any `ceph_rgw_*` metric
using the following PromQL query, for instance:

```
ceph_rgw_req * on (instance_id) group_left(ceph_daemon) ceph_rgw_metadata
```

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
With the `ceph_daemon` label now replaced by `instance_id` on all `ceph_rgw_*`
metrics, we need to update the Grafana dashboards to get that label back from
`ceph_rgw_metadata` using this type of construct:

```
ceph_rgw_req * on (instance_id) group_left(ceph_daemon) ceph_rgw_metadata
```
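Panels that need an `rgw_host` label additionally wrap the join in `label_replace`, along these lines (a sketch; `ceph_rgw_req` stands in for whichever metric a given panel plots, and the actual panel expressions appear in the review comments below):

```
label_replace(
  rate(ceph_rgw_req[30s])
    * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata,
  "rgw_host", "$1", "ceph_daemon", "rgw.(.*)"
)
```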

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
Some of the expressions modified in c402903 were not covered by any tests,
especially those in the `radosgw-detail.json` dashboard.

This commit fills in those gaps.

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
@BenoitKnecht (Contributor, Author):

`make check` (arm64) seems to be failing for some unrelated reason.

@pereman2 (Contributor) left a comment:

Hey, sorry for the delay. I've been testing this, and I was looking at `ceph_rgw_metadata` and `ceph_rgw_req`, but the `ceph_daemon` label wasn't filled in on either of them.

[screenshot of the query results]

@BenoitKnecht (Contributor, Author):

Hi @pereman2, no worries, and thanks for the review!

It's very strange, I just tried again on a vstart cluster, and I get the expected labels:

```
$ RGW=1 ../src/vstart.sh --debug --new -x --localhost --bluestore
$ ./bin/ceph mgr module enable prometheus
$ curl -s localhost:9283/metrics | grep ^ceph_rgw_
ceph_rgw_metadata{ceph_daemon="rgw.8000",hostname="5e7ace7cdc93",ceph_version="ceph version Development (no_version) quincy (dev)",instance_id="4425"} 1.0
ceph_rgw_cache_hit{instance_id="4425"} 0.0
ceph_rgw_cache_miss{instance_id="4425"} 10.0
ceph_rgw_failed_req{instance_id="4425"} 0.0
ceph_rgw_gc_retire_object{instance_id="4425"} 0.0
ceph_rgw_get{instance_id="4425"} 0.0
ceph_rgw_get_b{instance_id="4425"} 0.0
ceph_rgw_get_initial_lat_sum{instance_id="4425"} 0.0
ceph_rgw_get_initial_lat_count{instance_id="4425"} 0.0
ceph_rgw_keystone_token_cache_hit{instance_id="4425"} 0.0
ceph_rgw_keystone_token_cache_miss{instance_id="4425"} 0.0
ceph_rgw_lc_abort_mpu{instance_id="4425"} 0.0
ceph_rgw_lc_expire_current{instance_id="4425"} 0.0
ceph_rgw_lc_expire_dm{instance_id="4425"} 0.0
ceph_rgw_lc_expire_noncurrent{instance_id="4425"} 0.0
ceph_rgw_lc_transition_current{instance_id="4425"} 0.0
ceph_rgw_lc_transition_noncurrent{instance_id="4425"} 0.0
ceph_rgw_pubsub_event_lost{instance_id="4425"} 0.0
ceph_rgw_pubsub_event_triggered{instance_id="4425"} 0.0
ceph_rgw_pubsub_events{instance_id="4425"} 0.0
ceph_rgw_pubsub_missing_conf{instance_id="4425"} 0.0
ceph_rgw_pubsub_push_failed{instance_id="4425"} 0.0
ceph_rgw_pubsub_push_ok{instance_id="4425"} 0.0
ceph_rgw_pubsub_push_pending{instance_id="4425"} 0.0
ceph_rgw_pubsub_store_fail{instance_id="4425"} 0.0
ceph_rgw_pubsub_store_ok{instance_id="4425"} 0.0
ceph_rgw_put{instance_id="4425"} 0.0
ceph_rgw_put_b{instance_id="4425"} 0.0
ceph_rgw_put_initial_lat_sum{instance_id="4425"} 0.0
ceph_rgw_put_initial_lat_count{instance_id="4425"} 0.0
ceph_rgw_qactive{instance_id="4425"} 0.0
ceph_rgw_qlen{instance_id="4425"} 0.0
ceph_rgw_req{instance_id="4425"} 0.0
```

Are you sure you were running a version of ceph-mgr compiled from this branch?

@pereman2 (Contributor) left a comment:

Recompiled and it worked :P LGTM, thanks @BenoitKnecht!

@BenoitKnecht (Contributor, Author):

Has anyone else had a chance to review this?

@pereman2 (Contributor) commented on Feb 1, 2022:

@aaSharma14 @epuertat ping

@aaSharma14 (Contributor) left a comment:

LGTM!

@aaSharma14 (Contributor):

`jenkins test make check arm64`

@pereman2 added the needs-quincy-backport label (backport required for quincy) on Feb 2, 2022
@epuertat (Member) left a comment:

Great work @BenoitKnecht! This is overlapping with #44059, but given this is fixing a bug, I'd rather prioritize this one over that one. @MrFreezeex are you ok with having this one merged first? I hope this doesn't mean too much rebasing for you...

```
@@ -494,7 +494,7 @@ local addStyle(alias, colorMode, colors, dateFormat, decimals, mappingType, patt
   'Latencies are shown stacked, without a yaxis to provide a visual indication of GET latency imbalance across RGW hosts',
   's',
   'short',
-  'label_replace(rate(ceph_rgw_get_initial_lat_sum[30s]),"rgw_host","$1","ceph_daemon","rgw.(.*)") / \nlabel_replace(rate(ceph_rgw_get_initial_lat_count[30s]),"rgw_host","$1","ceph_daemon","rgw.(.*)")',
+  'label_replace(\n rate(ceph_rgw_get_initial_lat_sum[30s]) /\n rate(ceph_rgw_get_initial_lat_count[30s]) *\n on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata,\n"rgw_host", "$1", "ceph_daemon", "rgw.(.*)")',
```
@epuertat (Member):

Let's better use jsonnet verbatims (`|||`) instead of explicit line breaks:

Suggested change:

```
-'label_replace(\n rate(ceph_rgw_get_initial_lat_sum[30s]) /\n rate(ceph_rgw_get_initial_lat_count[30s]) *\n on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata,\n"rgw_host", "$1", "ceph_daemon", "rgw.(.*)")',
+|||
+  label_replace(
+    rate(ceph_rgw_get_initial_lat_sum[30s]) /
+    rate(ceph_rgw_get_initial_lat_count[30s]) *
+    on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata,
+    "rgw_host", "$1", "ceph_daemon", "rgw.(.*)")
+|||,
```

```
@@ -520,7 +520,7 @@ local addStyle(alias, colorMode, colors, dateFormat, decimals, mappingType, patt
   'Total bytes transferred in/out through get/put operations, by radosgw instance',
   'bytes',
   'short',
-  'sum by(rgw_host) (\n (label_replace(rate(ceph_rgw_get_b[30s]), "rgw_host","$1","ceph_daemon","rgw.(.*)")) + \n (label_replace(rate(ceph_rgw_put_b[30s]), "rgw_host","$1","ceph_daemon","rgw.(.*)"))\n)',
+  'label_replace(sum by (instance_id) (\n rate(ceph_rgw_get_b[30s]) + \n rate(ceph_rgw_put_b[30s])\n) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata, "rgw_host", "$1", "ceph_daemon", "rgw.(.*)")',
```
@epuertat (Member):

Same here, with verbatim magic (since we are refactoring this, let's take the chance to clean this up a bit).
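Applied to this panel, the rewrite might look along these lines (a sketch of the same `|||` transformation, not the reviewer's exact text):

```
|||
  label_replace(sum by (instance_id) (
    rate(ceph_rgw_get_b[30s]) +
    rate(ceph_rgw_put_b[30s])
  ) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata,
  "rgw_host", "$1", "ceph_daemon", "rgw.(.*)")
|||,
```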

```
@@ -529,7 +529,7 @@ local addStyle(alias, colorMode, colors, dateFormat, decimals, mappingType, patt
   'Latencies are shown stacked, without a yaxis to provide a visual indication of PUT latency imbalance across RGW hosts',
   's',
   'short',
-  'label_replace(rate(ceph_rgw_put_initial_lat_sum[30s]),"rgw_host","$1","ceph_daemon","rgw.(.*)") / \nlabel_replace(rate(ceph_rgw_put_initial_lat_count[30s]),"rgw_host","$1","ceph_daemon","rgw.(.*)")',
+  'label_replace(\n rate(ceph_rgw_put_initial_lat_sum[30s]) /\n rate(ceph_rgw_put_initial_lat_count[30s]) *\n on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata,\n"rgw_host", "$1", "ceph_daemon", "rgw.(.*)")',
```
@epuertat (Member):

And here ;-)

```
  And variable `rgw_servers` is `rgw.foo`
  Then Grafana panel `$rgw_servers GET/PUT Latencies` with legend `GET {{ceph_daemon}}` shows:
    | metrics                                         | values             |
    | {ceph_daemon="rgw.foo", instance_id="58892247"} | 2.5000000000000004 |
```
@epuertat (Member):

oh my IEEE 754 🙈
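(For reference, the panel divides `rate(ceph_rgw_get_initial_lat_sum[30s])` by `rate(ceph_rgw_get_initial_lat_count[30s])`; over the last two samples of the series above that is (100 - 50) / (80 - 60) = 2.5, and the trailing ...0004 is binary floating-point rounding noise.)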

Comment on lines +3 to +13:

```
Scenario: "Test $rgw_servers GET/PUT Latencies - GET"
  Given the following series:
    | metrics                                                                                                                  | values    |
    | ceph_rgw_get_initial_lat_sum{instance="127.0.0.1", instance_id="58892247", job="ceph"}                                   | 10 50 100 |
    | ceph_rgw_get_initial_lat_count{instance="127.0.0.1", instance_id="58892247", job="ceph"}                                 | 20 60 80  |
    | ceph_rgw_metadata{ceph_daemon="rgw.foo", hostname="localhost", instance="127.0.0.1", instance_id="58892247", job="ceph"} | 1 1 1     |
  When interval is `30s`
  And variable `rgw_servers` is `rgw.foo`
  Then Grafana panel `$rgw_servers GET/PUT Latencies` with legend `GET {{ceph_daemon}}` shows:
    | metrics                                         | values             |
    | {ceph_daemon="rgw.foo", instance_id="58892247"} | 2.5000000000000004 |
```
@epuertat (Member):

Awesome work!

@MrFreezeex (Member):

> @MrFreezeex are you ok with having this one merged first? I hope this doesn't mean too much rebasing for you...

Yes, sure, you can merge this one first. It's always a pain to rebase, but I've already done it a couple of times, so I can do it once more, no worries.

@epuertat (Member) commented on Feb 2, 2022:

Thank you @MrFreezeex! Will merge this then.

@epuertat merged commit c47ace9 into ceph:master on Feb 2, 2022
@BenoitKnecht (Contributor, Author):

Thanks @epuertat for the review and subsequent merge. I didn't know about the `|||` syntax in jsonnet; I'll keep it in mind for the future.

And thanks @MrFreezeex for accepting the burden of having to rebase your own PR. :)

Could this be flagged for backport to Pacific? It would be great to have those metric labels be consistent again.

@epuertat (Member) commented on Feb 3, 2022:

@BenoitKnecht the Pacific backport tracker is already there: https://tracker.ceph.com/issues/54118 (you just need to run `src/script/ceph-backport.sh 54118`). I'll take care of backporting this to Quincy in a batch PR.

@BenoitKnecht (Contributor, Author):

Ah, that's great, thanks! I thought some specific label was needed on this PR for a bot to do the backport, but the script worked just fine.

@epuertat removed the needs-quincy-backport label (backport required for quincy) on Feb 4, 2022