Scale Set Metrics ADR #2568

nikola-jokic · 2023-05-08T13:51:08Z

Propose a solution for exposing metrics by the gha-runner-scale-set and gha-runner-scale-set-controller

TingluoHuang · 2023-05-08T16:59:46Z

docs/adrs/2023-05-08-exposing-metrics.md

+
+### Metrics exposed by the controller
+
+To get a better understanding about health and workings of the cluster


I assume each metric will be tagged with the RunnerScaleSet id/name/url, etc.?

I used the labels you assigned in the POC, which are great! We can always add or remove some. That is why I said I'm not sure we actually need to include them in the ADR: changing labels or metrics would then require an update to the ADR. But I wanted to show the basic structure, then trim it down if needed or extend it if you think we should.

TingluoHuang · 2023-05-08T17:01:15Z

docs/adrs/2023-05-08-exposing-metrics.md

+  ephemeral runner pod after multiple retries, it will set the state of the
+  `EphemeralRunner` to failed. Since the controller can not recover from this
+  state, it can be useful to set Prometheus alerts to catch this issue quickly.
+


Do we need metrics at the EphemeralRunnerSet and AutoscalingRunnerSet level?

They are just controllers that are up. I'm not sure what added value they would bring, but we can add them.
EphemeralRunnerSet already exposes the number of pending, failed and running ephemeral runners. I'd start from here and then extend them if there is a need for more metrics ☺️
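
For illustration only, here is a minimal Python sketch of how per-phase counts like the ones described above could be derived and used for the alerting case mentioned in the ADR excerpt. The phase names, the `count_by_phase` and `should_alert` helpers, and the threshold are assumptions made for this example; they are not taken from the ARC code base.

```python
from collections import Counter

# Hypothetical phase names; the thread only mentions pending, running and failed.
PHASES = ("pending", "running", "failed")


def count_by_phase(runner_phases):
    """Return one gauge value per phase, including zeros for absent phases."""
    counts = Counter(phase.lower() for phase in runner_phases)
    return {phase: counts.get(phase, 0) for phase in PHASES}


def should_alert(gauges, max_failed=0):
    """Return True when more runners have failed than the tolerated threshold."""
    return gauges["failed"] > max_failed


gauges = count_by_phase(["Running", "Pending", "Failed", "Running"])
print(gauges)
print(should_alert(gauges))
```

In a real deployment the threshold check would more likely live in a Prometheus alerting rule than in application code; the function above only shows the condition being evaluated.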

TingluoHuang · 2023-05-08T17:02:28Z

docs/adrs/2023-05-08-exposing-metrics.md

+service, it can expose actions service related data through metrics. In
+particular:
+
+- `available_jobs` - Number of jobs with `runs-on` matching the runner scale set name. Jobs are not yet assigned but are acquired by the runner scale set.


`available_jobs` is for jobs that have not been acquired yet.

Sorry, mistake in copying...

TingluoHuang · 2023-05-08T17:04:38Z

docs/adrs/2023-05-08-exposing-metrics.md

+
+Controller metrics belong to the `github_runner_scale_set_controller` subsystem,
+so the names are going to have `github_runner_scale_set_controller` prefix.
+


Should we also include a section documenting all the labels we will add to each metric?

Not sure. I kind of wanted to have them displayed here in case we need them. Maybe they should not even be in the ADR, in case we want to extend the metrics. I just included them for comments ☺️

It would be useful as we create documentation to have the set of labels we apply for each metric.

In addition, I'd love it if we could have the following:

- The ability to configure which labels are added to metrics via command-line flags, and
- The ability to add labels that are useful but can cause Prometheus cardinality issues (such as runner ID, job ID, and any other IDs that differ across job instances, i.e. high cardinality).

Whether or not we document every supported label in the ADR, I'd prefer that the ADR clearly state that it is meant to support these useful-but-dangerous labels and, at the same time, provide a way to toggle which labels are added. Together, these would let users label metrics with various IDs for easier debugging and monitoring in relatively small-scale ARC deployments, and turn those ID labels off in large-scale deployments.

This old thread might provide more context about that #2176 (comment)
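
To make the request above concrete, here is a minimal Python sketch of a label allow-list driven by a command-line flag. Everything named here is hypothetical: the `--metrics-labels` flag, the label names, and the helper functions are assumptions for illustration, not existing ARC options.

```python
import argparse

# Hypothetical label names; the thread does not fix the actual set.
DEFAULT_LABELS = {"name", "namespace", "repository"}
HIGH_CARDINALITY_LABELS = {"runner_id", "job_id"}


def parse_enabled_labels(argv):
    """Parse a hypothetical --metrics-labels flag into a set of label names."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--metrics-labels",
        default=",".join(sorted(DEFAULT_LABELS)),
        help="comma-separated list of labels to attach to metrics",
    )
    args = parser.parse_args(argv)
    enabled = {part.strip() for part in args.metrics_labels.split(",") if part.strip()}
    # Reject labels that are neither default nor known high-cardinality ones.
    unknown = enabled - DEFAULT_LABELS - HIGH_CARDINALITY_LABELS
    if unknown:
        raise ValueError(f"unsupported labels: {sorted(unknown)}")
    return enabled


def filter_labels(labels, enabled):
    """Keep only the labels the operator has enabled."""
    return {key: value for key, value in labels.items() if key in enabled}


all_labels = {"name": "my-scale-set", "namespace": "arc", "job_id": "12345"}
print(filter_labels(all_labels, parse_enabled_labels([])))
print(filter_labels(all_labels, parse_enabled_labels(["--metrics-labels", "name,job_id"])))
```

With this shape, high-cardinality labels stay off unless an operator opts in, which matches the small-scale versus large-scale trade-off described in the comment.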


Link- · 2023-05-09T11:24:57Z

docs/adrs/2023-05-08-exposing-metrics.md

+### Metric names
+
+Listener metrics belong to the `github_runner_scale_set` subsystem, so the names
+are going to have the `github_runner_scale_set_` prefix.


Will we cross any label name limits with this prefix? Also, why not `gh_runner_scale_set` and `gh_runner_scale_set_controller`, to stay consistent with the chart names?

Also: https://prometheus.io/docs/practices/naming/#metric-names
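
As a rough aid for the naming question, the Python sketch below joins a subsystem prefix and a metric name and checks the result against the metric-name pattern from the Prometheus data model documentation, `[a-zA-Z_:][a-zA-Z0-9_:]*`. The `full_metric_name` helper is hypothetical and only illustrates the check; it is not how the controller builds its names.

```python
import re

# Metric-name pattern from the Prometheus data model documentation.
METRIC_NAME_RE = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")


def full_metric_name(subsystem, name):
    """Join subsystem and metric name with an underscore and validate the result."""
    full = f"{subsystem}_{name}"
    if not METRIC_NAME_RE.match(full):
        raise ValueError(f"invalid Prometheus metric name: {full!r}")
    return full


listener_metric = full_metric_name("github_runner_scale_set", "available_jobs")
print(listener_metric, len(listener_metric))

# A chart-style name with hyphens cannot be used verbatim as a prefix.
try:
    full_metric_name("gha-runner-scale-set", "available_jobs")
except ValueError as err:
    print(err)
```

The second call shows one practical constraint: chart names containing hyphens would need to be converted to underscores before they could serve as a metric prefix.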


Link- · 2023-05-11T09:54:57Z

docs/adrs/2023-05-08-exposing-metrics.md

+  of all listeners. This is not a big problem but is something to point out.
+- Managing requests/limits can be tricky.
+
+### Use a Prometheus Pushgateway


Suggested change

-### Use a Prometheus Pushgateway
+### Option 3: Use a Prometheus Pushgateway


Link- · 2023-05-11T09:58:43Z

docs/adrs/2023-05-08-exposing-metrics.md

+  be applied across all `AutoscalingRunnerSets`, it is difficult to inherit this
+  configuration by applying helm charts.
+
+### Create an aggregator service


I am personally highly in favour of this approach, especially since this is the same way it was implemented in the legacy modes.


Link- · 2023-05-18T13:26:13Z

We've done enough iterations on this; let's ship it 🚀 Thank you for drafting a great doc.

Scale Set Metrics ADR

167c8e3

nikola-jokic requested a review from Link- May 8, 2023 13:51

nikola-jokic requested review from mumoshu, toast-gear and a team as code owners May 8, 2023 13:51

TingluoHuang reviewed May 8, 2023

chrispat previously approved these changes May 8, 2023

Link- added the gha-runner-scale-set label (Related to the gha-runner-scale-set mode) May 9, 2023

Link- reviewed May 9, 2023 (2 reviews)

Add 3 approaches of implementing metrics

0e39f91

nikola-jokic dismissed chrispat’s stale review via 0e39f91 May 10, 2023 12:55

Link- reviewed May 11, 2023 (9 reviews)

nikola-jokic and others added 2 commits May 15, 2023 10:33

Apply suggestions from code review

2cffc29

Co-authored-by: Bassem Dghaidi <568794+Link-@users.noreply.github.com>

Include decision on metrics server

5ffaaa6

nikola-jokic requested a review from Link- May 18, 2023 12:46

Link- approved these changes May 18, 2023


nikola-jokic merged commit 91c8991 into master May 18, 2023
7 checks passed

nikola-jokic deleted the nikola-jokic/metrics-adr branch May 18, 2023 13:37




