
MCM does not reset the failed_machines gauge once the machine is deleted #476

Open
sebbonnet opened this issue Jun 18, 2020 · 3 comments
Labels
area/metering Metering related effort/1w Effort for issue is around 1 week kind/bug Bug lifecycle/rotten Nobody worked on this for 12 months (final aging stage) needs/planning Needs (more) planning with other MCM maintainers priority/4 Priority (lower number equals higher priority)

Comments

Contributor

sebbonnet commented Jun 18, 2020

What happened:
When a machine is no longer reachable, MCM correctly reports the machine as unhealthy and creates the corresponding failed_machines gauge metric. However, when the machine is actually deleted, that metric is not cleared and keeps being reported.
This is problematic for alerting rules based on the failed_machines metric, as it never gets reset to 0.
It may be easier to change this metric to a counter, so you don't have to worry about resetting its value, as keeping track of the failed machines may not be trivial or wanted.

I0616 23:12:14.072380       1 event.go:255] Event(v1.ObjectReference{Kind:"MachineSet", Namespace:"machine-controller-manager", Name:"mcm-immutable-node-az-b-6cb7bb5d9", UID:"31b3fc43-a668-11ea-b0fa-0a1dc5f423b0", APIVersion:"machine.sapcloud.io/v1alpha1", ResourceVersion:"371959992", FieldPath:""}): type: 'Normal' reason: 'SuccessfulDelete' Deleted machine: mcm-immutable-node-az-b-6cb7bb5d9-7zgk9
E0616 23:12:12.968331       1 machine.go:931] Machine mcm-immutable-node-az-b-6cb7bb5d9-7zgk9 is not healthy since 10m0s minutes. Changing status to failed. Node Conditions: [...]  {Type:Ready Status:Unknown LastHeartbeatTime:2020-06-16 22:59:55 +0000 UTC LastTransitionTime:2020-06-16 23:01:18 +0000 UTC Reason:NodeStatusUnknown Message:Kubelet stopped posting node status.}
W0616 23:01:38.218468       1 machine.go:729] Machine mcm-immutable-node-az-b-6cb7bb5d9-7zgk9 is unhealthy - changing MachineState to Unknown
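The behaviour above follows from how Prometheus gauge vectors work: each labelled series lives until it is explicitly deleted. A minimal stand-in sketch of the missing step (the type and method names here are hypothetical; MCM itself uses prometheus/client_golang, where the equivalent call is GaugeVec.DeleteLabelValues):

```go
package main

import "fmt"

// failedMachines is a minimal stand-in for a Prometheus GaugeVec keyed by
// machine name: one series per label set, which persists until deleted.
type failedMachines struct {
	series map[string]float64
}

func newFailedMachines() *failedMachines {
	return &failedMachines{series: map[string]float64{}}
}

// Set records the machine as failed (the part MCM already does).
func (g *failedMachines) Set(machine string, v float64) { g.series[machine] = v }

// Delete drops the series for the machine. This is the step this issue asks
// for: without it, a deleted machine keeps being reported as failed.
func (g *failedMachines) Delete(machine string) { delete(g.series, machine) }

// Count mimics count(mcm_machine_deployment_failed_machines).
func (g *failedMachines) Count() int { return len(g.series) }

func main() {
	g := newFailedMachines()
	g.Set("mcm-immutable-node-az-b-6cb7bb5d9-7zgk9", 1)
	fmt.Println(g.Count()) // 1: machine reported as failed

	// Machine is deleted; unless the series is deleted too, Count stays 1.
	g.Delete("mcm-immutable-node-az-b-6cb7bb5d9-7zgk9")
	fmt.Println(g.Count()) // 0: gauge cleared after deletion
}
```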

What you expected to happen:
The failed_machines gauge metric should be reset once the machine is deleted (or recovers).

How to reproduce it (as minimally and precisely as possible):

  • Try creating a machine that doesn't bootstrap and watch the number of failed machines keep going up: count(mcm_machine_deployment_failed_machines)
  • Create a machine, then delete the underlying VM and observe the value of failed_machines
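To illustrate the alerting problem, a rule on this metric (hypothetical rule name and threshold, shown here only as a sketch) would keep firing indefinitely after the failed machine is gone, because the gauge series is never removed:

```yaml
# Hypothetical Prometheus alerting rule; once a machine fails, this alert
# never resolves because the failed_machines series is not cleared on
# machine deletion.
groups:
  - name: mcm
    rules:
      - alert: MCMMachinesFailed
        expr: count(mcm_machine_deployment_failed_machines) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "One or more MCM machines are in Failed state"
```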

Anything else we need to know:

Environment:

  • Kubernetes 1.12
  • image: eu.gcr.io/gardener-project/gardener/machine-controller-manager:v0.26.3

Interpretation/Solutions

  • We need a new gauge metric named num_failed_machines which reports the number of machines currently in the Failed phase
  • mcm_machine_deployment_operation_failed gauge metric -> reports the machines whose last operation failed
  • Since these are gauges, the values won't accumulate, but the underlying slice needs to be cleared once the machine is deleted
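The proposed num_failed_machines gauge could avoid per-machine bookkeeping entirely by being recomputed from the current machine list on every sync, so deleted machines simply drop out. A stdlib-only sketch of that approach (machine names and the phase constants are illustrative, not MCM's actual types):

```go
package main

import "fmt"

// phase is a stand-in for MCM's machine phase; names are illustrative.
type phase string

const (
	phaseRunning phase = "Running"
	phaseFailed  phase = "Failed"
)

// numFailedMachines recomputes the proposed num_failed_machines gauge from
// the current set of machines on each sync. Because the value is derived
// rather than incrementally maintained, there is no stale series to clear
// when a machine is deleted.
func numFailedMachines(phases map[string]phase) int {
	n := 0
	for _, p := range phases {
		if p == phaseFailed {
			n++
		}
	}
	return n
}

func main() {
	machines := map[string]phase{
		"worker-z1-a": phaseRunning,
		"worker-z1-b": phaseFailed,
	}
	fmt.Println(numFailedMachines(machines)) // 1

	delete(machines, "worker-z1-b") // machine deleted
	fmt.Println(numFailedMachines(machines)) // 0: gauge drops automatically
}
```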
@sebbonnet sebbonnet added the kind/bug Bug label Jun 18, 2020
@gardener-robot

@sebbonnet Thank you for your contribution.

@hardikdr
Member

cc @ggaurav10

@hardikdr
Member

@rfranzke commented on Apr 29

What happened:
MCM does not update the .status.failedMachines of the MachineDeployment after the .status.lastOperation of the Machine changes (e.g., from Failed to Processing once the credentials have been fixed):

  status:
    availableReplicas: 2
    conditions:
    - lastTransitionTime: "2020-04-29T06:51:38Z"
      lastUpdateTime: "2020-04-29T06:51:38Z"
      message: Deployment does not have minimum availability.
      reason: MinimumReplicasUnavailable
      status: "False"
      type: Available
    failedMachines:
    - lastOperation:
        description: 'Failed to list VMs while deleting the machine "shoot--foo--bar-cpu-worker-z1-5cdcb46f64-pxzp5"
          AuthFailure: AWS was not able to validate the provided access credentials
          status code: 401, request id: 6e99231c-654e-4b05-8801-310e3532b4e9'
        lastUpdateTime: "2020-04-29T06:53:33Z"
        state: Failed
        type: Delete
      name: shoot--foo--bar-cpu-worker-z1-5cdcb46f64-pxzp5
      ownerRef: shoot--foo--bar-cpu-worker-z1-5cdcb46f64
    observedGeneration: 2
  spec:
    class:
      kind: AWSMachineClass
      name: shoot--foo--bar-cpu-worker-z1-ff76e
    nodeTemplate:
      metadata:
        creationTimestamp: null
        labels:
          node.kubernetes.io/role: node
          worker.garden.sapcloud.io/group: cpu-worker
          worker.gardener.cloud/pool: cpu-worker
      spec: {}
    providerID: aws:///eu-west-1/i-05f4737c3ef646f89
  status:
    currentStatus:
      lastUpdateTime: "2020-04-29T07:41:44Z"
      phase: Pending
      timeoutActive: true
    lastOperation:
      description: Creating machine on cloud provider
      lastUpdateTime: "2020-04-29T07:41:44Z"
      state: Processing
      type: Create
    node: ip-10-250-9-55.eu-west-1.compute.internal
(compare the timestamps)

What you expected to happen:
The .status.failedMachines is properly updated when the .status.lastOperation of Machine objects changes.

@prashanth26 prashanth26 added the area/metering Metering related label Aug 16, 2020
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Oct 16, 2020
@gardener-robot gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Dec 16, 2020
@prashanth26 prashanth26 added effort/1w Effort for issue is around 1 week priority/5 Priority (lower number equals higher priority) labels Mar 30, 2021
@himanshu-kun himanshu-kun added priority/4 Priority (lower number equals higher priority) needs/planning Needs (more) planning with other MCM maintainers and removed priority/5 Priority (lower number equals higher priority) lifecycle/rotten Nobody worked on this for 12 months (final aging stage) labels Feb 14, 2023
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Oct 25, 2023
@gardener-robot gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Jul 3, 2024