MCM does not reset the failed_machines gauge once the machine is deleted #476
Labels
area/metering
Metering related
effort/1w
Effort for issue is around 1 week
kind/bug
Bug
lifecycle/rotten
Nobody worked on this for 12 months (final aging stage)
needs/planning
Needs (more) planning with other MCM maintainers
priority/4
Priority (lower number equals higher priority)
Comments
@sebbonnet Thank you for your contribution.
cc @ggaurav10
@rfranzke commented on Apr 29
gardener-robot added the lifecycle/stale label on Oct 16, 2020
gardener-robot added lifecycle/rotten and removed lifecycle/stale labels on Dec 16, 2020
prashanth26 added effort/1w and priority/5 labels on Mar 30, 2021
himanshu-kun added priority/4 and needs/planning and removed priority/5 and lifecycle/rotten labels on Feb 14, 2023
gardener-robot added the lifecycle/stale label on Oct 25, 2023
gardener-robot added lifecycle/rotten and removed lifecycle/stale labels on Jul 3, 2024
What happened:
When a machine is no longer reachable, MCM correctly reports the machine as unhealthy and creates the corresponding failed_machines gauge metric. However, when the machine is actually deleted, that metric is not cleared and it keeps being reported.
This is a problem for alerting rules based on the failed_machines metric, as it never gets reset to 0.
It may be easier to change this metric to a counter, so you don't have to worry about resetting its value, as keeping track of the failed machines may not be trivial or wanted.
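For illustration, here is a minimal sketch of the alternative to a counter: dropping the gauge series when the machine goes away. It uses only the standard library and models the labelled gauge as a map, so the type failedMachinesGauge and its methods are hypothetical and are not MCM's actual metrics code. With the Prometheus Go client, the analogous step would be calling DeleteLabelValues on the GaugeVec when the machine is deleted.

```go
package main

import (
	"fmt"
	"sync"
)

// failedMachinesGauge is a hypothetical stand-in for a labelled gauge,
// keyed by machine name. It only models the bookkeeping discussed here.
type failedMachinesGauge struct {
	mu     sync.Mutex
	series map[string]float64
}

func newFailedMachinesGauge() *failedMachinesGauge {
	return &failedMachinesGauge{series: make(map[string]float64)}
}

// MarkFailed exposes a series with value 1 for the given machine.
func (g *failedMachinesGauge) MarkFailed(machine string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.series[machine] = 1
}

// Forget removes the series for a machine that was deleted or recovered,
// so it stops being reported on the next scrape.
func (g *failedMachinesGauge) Forget(machine string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	delete(g.series, machine)
}

// Count returns the number of exposed series, which is what a
// count(...) query over the metric would observe.
func (g *failedMachinesGauge) Count() int {
	g.mu.Lock()
	defer g.mu.Unlock()
	return len(g.series)
}

func main() {
	g := newFailedMachinesGauge()
	g.MarkFailed("machine-a")
	g.MarkFailed("machine-b")
	fmt.Println(g.Count()) // 2

	// The machine object is deleted: drop its series as well.
	g.Forget("machine-a")
	fmt.Println(g.Count()) // 1
}
```

The point of the sketch is that the delete path has to touch the metric too. Without the Forget call, the series for machine-a would be exported forever, which matches the behaviour described above.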
What you expected to happen:
The failed_machines gauge metric should be reset once the machine is deleted (or recovers).
How to reproduce it (as minimally and precisely as possible):
count(mcm_machine_deployment_failed_machines)
failed_machines
Anything else we need to know:
Environment:
eu.gcr.io/gardener-project/gardener/machine-controller-manager:v0.26.3
Interpretation/Solutions
num_failed_machines: the number of machines that are currently in the Failed phase
mcm_machine_deployment_operation_failed gauge metric: for the machines whose last operation failed
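To make the difference between the two interpretations concrete, here is a rough sketch that derives both values from the current list of machines on every update. The Machine struct, its fields, and the two helper functions are hypothetical and only stand in for whatever MCM tracks internally. Because the values are recomputed from the machines that still exist, a deleted machine drops out of both without any explicit reset.

```go
package main

import "fmt"

// Machine is a hypothetical, trimmed-down view of a machine object.
type Machine struct {
	Name                string
	Phase               string // e.g. "Running", "Failed"
	LastOperationFailed bool
}

// countFailedPhase returns how many machines are currently in the
// Failed phase (the num_failed_machines interpretation).
func countFailedPhase(machines []Machine) int {
	n := 0
	for _, m := range machines {
		if m.Phase == "Failed" {
			n++
		}
	}
	return n
}

// countLastOperationFailed returns how many machines have a failed last
// operation, regardless of their current phase.
func countLastOperationFailed(machines []Machine) int {
	n := 0
	for _, m := range machines {
		if m.LastOperationFailed {
			n++
		}
	}
	return n
}

func main() {
	machines := []Machine{
		{Name: "machine-a", Phase: "Failed", LastOperationFailed: true},
		{Name: "machine-b", Phase: "Running", LastOperationFailed: true},
		{Name: "machine-c", Phase: "Running", LastOperationFailed: false},
	}
	fmt.Println(countFailedPhase(machines))         // 1
	fmt.Println(countLastOperationFailed(machines)) // 2

	// machine-a is deleted, so it is no longer part of the list.
	machines = machines[1:]
	fmt.Println(countFailedPhase(machines))         // 0
	fmt.Println(countLastOperationFailed(machines)) // 1
}
```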