Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scale Set Metrics ADR #2568

Merged
merged 4 commits into from
May 18, 2023
Merged

Scale Set Metrics ADR #2568

merged 4 commits into from
May 18, 2023

Conversation

nikola-jokic
Copy link
Member

Propose a solution for exposing metrics by the gha-runner-scale-set and gha-runner-scale-set-controller

@nikola-jokic nikola-jokic requested a review from Link- May 8, 2023 13:51
@nikola-jokic nikola-jokic requested review from mumoshu, toast-gear and a team as code owners May 8, 2023 13:51

### Metrics exposed by the controller

To get a better understanding about health and workings of the cluster
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i assume each metrics will be tagged with the RunnerScleSet id/name/url etc?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used labels that you assigned in POC which are great! We can always add or delete ones. That is the reason I said I'm not sure if we actually need to include them in the ADR. Changing labels or metrics would require an update on the ADR. But I wanted to display the basic structure and then trim it down if we need or extend it if you think we should

ephemeral runner pod after multiple retries, it will set the state of the
`EphemeralRunner` to failed. Since the controller can not recover from this
state, it can be useful to set Prometheus alerts to catch this issue quickly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need metrics for EpheneralRunnerSet and AutoScalingRunnerSet level?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are just controllers that are up. Not sure what added value will they bring but we can
EphemeralRunnerSet is exposing pending, failed and running ephemeral runners. I'd start from here and then extend them if there is a need for more metrics ☺️

service, it can expose actions service related data through metrics. In
particular:

- `available_jobs` - Number of jobs with `runs-on` matching the runner scale set name. Jobs are not yet assigned but are acquired by the runner scale set.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

available_jobs is for not acquired job.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, mistake in copying...


Controller metrics belong to the `github_runner_scale_set_controller` subsystem,
so the names are going to have `github_runner_scale_set_controller` prefix.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we also include a section to doc all labels we will add to each metric?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, I kind of wanted to have them displayed here in case we need them. Maybe it should not even be in the ADR in case we want to extend metrics. I just included it for comments ☺️

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be useful as we create documentation to have the set of labels we apply for each metric.

Copy link
Collaborator

@mumoshu mumoshu May 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition, I'd love it if we could have the following:

  • An ability to configure which labels to be added to metrics via command-line flags, and
  • An ability to add useful but can-be-source-of-prometheus-cardinality-issue labels (like runner id, job id, and any kind of "ids" that differ across instances of jobs = high cardinality).

Regardless of if we document every supported label in ADR, I do prefer it if the ADR clearly says that it is supposed to support dangerous-and-useful labels, and at the same time, it is going to have the ability to toggle which labels to be added. Those two would enable users to label metrics with various IDs for easier debugging and monitoring for relatively small-scale ARC deployments and to turn those ID labels off for large-scale deployments.

This old thread might provide more context about that #2176 (comment)

chrispat
chrispat previously approved these changes May 8, 2023
@Link- Link- added the gha-runner-scale-set Related to the gha-runner-scale-set mode label May 9, 2023
### Metric names

Listener metrics belong to the `github_runner_scale_set` subsystem, so the names
are going to have the `github_runner_scale_set_` prefix.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will we cross any label name limits with this prefix? Also why not gh_runner_scale_set and gh_runner_scale_set_controller to stay consistent with the chart names?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

of all listeners. This is not a big problem but is something to point out.
- Managing requests/limits can be tricky.

### Use a Prometheus Pushgateway
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
### Use a Prometheus Pushgateway
### Option 3: Use a Prometheus Pushgateway

be applied across all `AutoscalingRunnerSets`, it is difficult to inherit this
configuration by applying helm charts.

### Create an aggregator service
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am personally highly in favour of this approach, especially since this is the same way it was implemented in the legacy modes

nikola-jokic and others added 2 commits May 15, 2023 10:33
Co-authored-by: Bassem Dghaidi <568794+Link-@users.noreply.github.com>
@nikola-jokic nikola-jokic requested a review from Link- May 18, 2023 12:46
@Link-
Copy link
Collaborator

Link- commented May 18, 2023

We've done enough iterations on this, let's ship it 🚀 thank you for drafting a great doc.

@nikola-jokic nikola-jokic merged commit 91c8991 into master May 18, 2023
7 checks passed
@nikola-jokic nikola-jokic deleted the nikola-jokic/metrics-adr branch May 18, 2023 13:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
gha-runner-scale-set Related to the gha-runner-scale-set mode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants