Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

ws-manager: Replace backup/restore success with total metric #11158

Merged
merged 1 commit into from
Jul 6, 2022

Conversation

ArthurSens
Copy link
Contributor

Description

Hey Workspace-team!

We noticed that you want to level up your game towards success-criteria-related metrics, and there is momentum in the company to start adopting SLOs to get the right metrics.

We from the platform-team are being a little more proactive regarding implementing SLOs and while trying to get the first example up and running, we noticed that your metrics for workspace backups and workspace restores are a bit out of best practices for SLOs with Prometheus 馃槵

When implementing success ratio SLOs, we usually look for metrics that represent "total amount of requests" and "amount of failed requests". So the calculation looks like this: backup error ratio = amount of backup with errors / total amount of backups

Would it be fine if we change the current metrics to be more SLO-friendly?

PS: I also simplified the nested ifs by using WithLabelValues instead of GetMetricWithLabelValues as well

Related Issue(s)

Fixes #

How to test

Release Notes

NONE

@ArthurSens ArthurSens requested a review from a team July 5, 2022 18:56
@github-actions github-actions bot added the team: workspace Issue belongs to the Workspace team label Jul 5, 2022
Copy link
Contributor

@sagor999 sagor999 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you!!

/hold
adding hold, can you please double check that we are not using that metric in any of grafana dashboards? As in that case they will become broken. I don't think we do, but if you could double check that would be appreciated. Feel free to remove the hold.
鉂わ笍

@ArthurSens
Copy link
Contributor Author

ArthurSens commented Jul 5, 2022

I've looked at Workspace success criteria, and couldn't find anything 馃

Signed-off-by: ArthurSens <arthursens2005@gmail.com>
Copy link
Contributor

@jenting jenting left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you 鉂わ笍

@jenting
Copy link
Contributor

jenting commented Jul 6, 2022

We only use the workspace_backups_failure_total in the ws-daemon alert rules.

/unhold

@roboquat roboquat merged commit 87a1e34 into main Jul 6, 2022
@roboquat roboquat deleted the as/ws-backup-total-metrics branch July 6, 2022 07:57
@roboquat roboquat added deployed: workspace Workspace team change is running in production deployed Change is completely running in production labels Jul 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
deployed: workspace Workspace team change is running in production deployed Change is completely running in production release-note-none size/M team: workspace Issue belongs to the Workspace team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants