
MCM does not reset the failed_machines gauge once the machine is deleted #476

Open
sebbonnet opened this issue Jun 18, 2020 · 3 comments
Labels
area/metering Metering related effort/1w Effort for issue is around 1 week kind/bug Bug lifecycle/rotten Nobody worked on this for 12 months (final aging stage) needs/planning Needs (more) planning with other MCM maintainers priority/4 Priority (lower number equals higher priority)

Comments

Contributor

sebbonnet commented Jun 18, 2020

What happened:
When a machine is no longer reachable, MCM correctly reports the machine as unhealthy and creates the corresponding failed_machines gauge metric. However, when the machine is actually deleted, that metric is not cleared and keeps being reported.
This is problematic for alerting rules based on the failed_machines metric, as it never gets reset to 0.
It may be easier to change this metric to a counter, so you don't have to worry about resetting its value, as keeping track of the failed machines may not be trivial or wanted.

I0616 23:12:14.072380       1 event.go:255] Event(v1.ObjectReference{Kind:"MachineSet", Namespace:"machine-controller-manager", Name:"mcm-immutable-node-az-b-6cb7bb5d9", UID:"31b3fc43-a668-11ea-b0fa-0a1dc5f423b0", APIVersion:"machine.sapcloud.io/v1alpha1", ResourceVersion:"371959992", FieldPath:""}): type: 'Normal' reason: 'SuccessfulDelete' Deleted machine: mcm-immutable-node-az-b-6cb7bb5d9-7zgk9
E0616 23:12:12.968331       1 machine.go:931] Machine mcm-immutable-node-az-b-6cb7bb5d9-7zgk9 is not healthy since 10m0s minutes. Changing status to failed. Node Conditions: [...]  {Type:Ready Status:Unknown LastHeartbeatTime:2020-06-16 22:59:55 +0000 UTC LastTransitionTime:2020-06-16 23:01:18 +0000 UTC Reason:NodeStatusUnknown Message:Kubelet stopped posting node status.}
W0616 23:01:38.218468       1 machine.go:729] Machine mcm-immutable-node-az-b-6cb7bb5d9-7zgk9 is unhealthy - changing MachineState to Unknown
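The behaviour above follows from how Prometheus gauge vectors work: each labelled series lives until it is explicitly deleted. A minimal stand-in sketch of the missing step (the type and method names here are hypothetical; MCM itself uses prometheus/client_golang, where the equivalent call is GaugeVec.DeleteLabelValues):

```go
package main

import "fmt"

// failedMachines is a minimal stand-in for a Prometheus GaugeVec keyed by
// machine name: one series per label set, which persists until deleted.
type failedMachines struct {
	series map[string]float64
}

func newFailedMachines() *failedMachines {
	return &failedMachines{series: map[string]float64{}}
}

// Set records the machine as failed (the part MCM already does).
func (g *failedMachines) Set(machine string, v float64) { g.series[machine] = v }

// Delete drops the series for the machine. This is the step this issue asks
// for: without it, a deleted machine keeps being reported as failed.
func (g *failedMachines) Delete(machine string) { delete(g.series, machine) }

// Count mimics count(mcm_machine_deployment_failed_machines).
func (g *failedMachines) Count() int { return len(g.series) }

func main() {
	g := newFailedMachines()
	g.Set("mcm-immutable-node-az-b-6cb7bb5d9-7zgk9", 1)
	fmt.Println(g.Count()) // 1: machine reported as failed

	// Machine is deleted; unless the series is deleted too, Count stays 1.
	g.Delete("mcm-immutable-node-az-b-6cb7bb5d9-7zgk9")
	fmt.Println(g.Count()) // 0: gauge cleared after deletion
}
```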

What you expected to happen:
The failed_machines gauge metric should be reset once the machine is deleted (or recovers).

How to reproduce it (as minimally and precisely as possible):

  • Try creating a machine that doesn't bootstrap and watch the number of failed machines keep going up: count(mcm_machine_deployment_failed_machines)
  • Create a machine, then delete the underlying VM and observe the value of failed_machines
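To illustrate the alerting problem, a rule on this metric (hypothetical rule name and threshold, shown here only as a sketch) would keep firing indefinitely after the failed machine is gone, because the gauge series is never removed:

```yaml
# Hypothetical Prometheus alerting rule; once a machine fails, this alert
# never resolves because the failed_machines series is not cleared on
# machine deletion.
groups:
  - name: mcm
    rules:
      - alert: MCMMachinesFailed
        expr: count(mcm_machine_deployment_failed_machines) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "One or more MCM machines are in Failed state"
```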

Anything else we need to know:

Environment:

  • Kubernetes 1.12
  • image: eu.gcr.io/gardener-project/gardener/machine-controller-manager:v0.26.3

Interpretation/Solutions

  • We need a new gauge metric named num_failed_machines which reports the number of machines currently in the Failed phase
  • mcm_machine_deployment_operation_failed gauge metric -> reports the machines whose last operation failed
  • Since these are gauges, the values won't accumulate, but the underlying slice needs to be cleared once the machine is deleted
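The proposed num_failed_machines gauge could avoid per-machine bookkeeping entirely by being recomputed from the current machine list on every sync, so deleted machines simply drop out. A stdlib-only sketch of that approach (machine names and the phase constants are illustrative, not MCM's actual types):

```go
package main

import "fmt"

// phase is a stand-in for MCM's machine phase; names are illustrative.
type phase string

const (
	phaseRunning phase = "Running"
	phaseFailed  phase = "Failed"
)

// numFailedMachines recomputes the proposed num_failed_machines gauge from
// the current set of machines on each sync. Because the value is derived
// rather than incrementally maintained, there is no stale series to clear
// when a machine is deleted.
func numFailedMachines(phases map[string]phase) int {
	n := 0
	for _, p := range phases {
		if p == phaseFailed {
			n++
		}
	}
	return n
}

func main() {
	machines := map[string]phase{
		"worker-z1-a": phaseRunning,
		"worker-z1-b": phaseFailed,
	}
	fmt.Println(numFailedMachines(machines)) // 1

	delete(machines, "worker-z1-b") // machine deleted
	fmt.Println(numFailedMachines(machines)) // 0: gauge drops automatically
}
```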
@sebbonnet sebbonnet added the kind/bug Bug label Jun 18, 2020
@gardener-robot

@sebbonnet Thank you for your contribution.

@hardikdr
Member

cc @ggaurav10

@hardikdr
Member

@rfranzke commented on Apr 29

What happened:
MCM does not update the .status.failedMachines of the MachineDeployment after the .status.lastOperation of the Machine changes (e.g., from Failed to Processing once the credentials have been fixed):

  status:
    availableReplicas: 2
    conditions:
    - lastTransitionTime: "2020-04-29T06:51:38Z"
      lastUpdateTime: "2020-04-29T06:51:38Z"
      message: Deployment does not have minimum availability.
      reason: MinimumReplicasUnavailable
      status: "False"
      type: Available
    failedMachines:
    - lastOperation:
        description: 'Failed to list VMs while deleting the machine "shoot--foo--bar-cpu-worker-z1-5cdcb46f64-pxzp5"
          AuthFailure: AWS was not able to validate the provided access credentials
          status code: 401, request id: 6e99231c-654e-4b05-8801-310e3532b4e9'
        lastUpdateTime: "2020-04-29T06:53:33Z"
        state: Failed
        type: Delete
      name: shoot--foo--bar-cpu-worker-z1-5cdcb46f64-pxzp5
      ownerRef: shoot--foo--bar-cpu-worker-z1-5cdcb46f64
    observedGeneration: 2
  spec:
    class:
      kind: AWSMachineClass
      name: shoot--foo--bar-cpu-worker-z1-ff76e
    nodeTemplate:
      metadata:
        creationTimestamp: null
        labels:
          node.kubernetes.io/role: node
          worker.garden.sapcloud.io/group: cpu-worker
          worker.gardener.cloud/pool: cpu-worker
      spec: {}
    providerID: aws:///eu-west-1/i-05f4737c3ef646f89
  status:
    currentStatus:
      lastUpdateTime: "2020-04-29T07:41:44Z"
      phase: Pending
      timeoutActive: true
    lastOperation:
      description: Creating machine on cloud provider
      lastUpdateTime: "2020-04-29T07:41:44Z"
      state: Processing
      type: Create
    node: ip-10-250-9-55.eu-west-1.compute.internal
(compare the timestamps)

What you expected to happen:
The .status.failedMachines is properly updated when the .status.lastOperation of Machine objects changes.

@prashanth26 prashanth26 added the area/metering Metering related label Aug 16, 2020
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Oct 16, 2020
@gardener-robot gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Dec 16, 2020
@prashanth26 prashanth26 added effort/1w Effort for issue is around 1 week priority/5 Priority (lower number equals higher priority) labels Mar 30, 2021
@himanshu-kun himanshu-kun added priority/4 Priority (lower number equals higher priority) needs/planning Needs (more) planning with other MCM maintainers and removed priority/5 Priority (lower number equals higher priority) lifecycle/rotten Nobody worked on this for 12 months (final aging stage) labels Feb 14, 2023
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Oct 25, 2023
@gardener-robot gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Jul 3, 2024