Provide remedy controller for replacing/repairing nodes with lost IP address in vCenter #180

MartinWeindel · 2021-10-19T12:40:23Z

How to categorize this issue?

/area robustness
/kind enhancement
/priority 3
/platform vsphere

What would you like to be added:
Handle nodes that lose their IP address in vCenter and, as a consequence, in the status of the Kubernetes node object.
Either restart such nodes, find some way to repair them, or at least move the calico-typha-deploy-... pod to another node.
This seems to be a task for a remedy controller.
The solution to resolve the root cause would probably be a fix in vSphere/vCenter, but it is unclear how long we have to wait for that.

Why is this needed:
Sporadically, worker nodes lose their IP address in vCenter. In such a situation, the cloud-controller-manager can no longer provide the IP address for the Kubernetes node object.

Good case:

$ kubectl get node shoot--garden--vsphere-rld1-cpu-worker-z1-6b8c9-tlkbp -oyaml
apiVersion: v1
kind: Node
metadata:
  name: shoot--garden--vsphere-rld1-cpu-worker-z1-6b8c9-tlkbp
spec:
  podCIDR: 10.231.133.0/24
  podCIDRs:
  - 10.231.133.0/24
  providerID: vsphere://4207d8f1-1a6c-2609-e284-a25789a9583c
status:
  addresses:
  - address: shoot--garden--vsphere-rld1-cpu-worker-z1-6b8c9-tlkbp
    type: Hostname
  - address: 10.230.130.54
    type: InternalIP
  - address: 10.230.130.54
    type: ExternalIP
  allocatable:
  …

Bad case:

$ kubectl get node shoot--garden--vsphere-rld1-cpu-worker-z1-6b8c9-pt694 -oyaml
apiVersion: v1
kind: Node
metadata:
  name: shoot--garden--vsphere-rld1-cpu-worker-z1-6b8c9-pt694
  …
spec:
  podCIDR: 10.231.139.0/24
  podCIDRs:
  - 10.231.139.0/24
  providerID: vsphere://4207c3da-d8ba-d016-6c26-e3636471ecc1
status:
  addresses:
  - address: shoot--garden--vsphere-rld1-cpu-worker-z1-6b8c9-pt694
    type: Hostname
  allocatable:
  …

In this case, pods in the kube-system namespace that run on the host network of the node lose their IP address. The node itself still has the IP address assigned by DHCP.

This issue can break any cluster: if a node running the calico-typha-deploy-... pod loses its IP address in the node object, no calico-node pod starts successfully anymore.
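The difference between the good and the bad case is the missing `InternalIP` entry in `status.addresses`. As a rough sketch of how a remedy controller might detect affected nodes, the following Python helpers inspect Node objects that have already been parsed from JSON. The function names `has_internal_ip` and `nodes_missing_internal_ip` are hypothetical and not part of any existing controller. What the remedy should then do (restart, repair, or move pods) is deliberately left open.

```python
from typing import Any, Dict, List


def has_internal_ip(node: Dict[str, Any]) -> bool:
    """Return True if the Node object reports at least one non-empty InternalIP."""
    # status.addresses may be absent or null on a broken node object.
    addresses = node.get("status", {}).get("addresses") or []
    return any(
        a.get("type") == "InternalIP" and bool(a.get("address"))
        for a in addresses
    )


def nodes_missing_internal_ip(node_list: Dict[str, Any]) -> List[str]:
    """Return the names of all nodes in a NodeList that have no InternalIP."""
    return [
        n["metadata"]["name"]
        for n in node_list.get("items", [])
        if not has_internal_ip(n)
    ]
```

The input could be obtained, for example, by loading the output of `kubectl get nodes -o json` with `json.load` and passing the resulting dictionary to `nodes_missing_internal_ip`.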


MartinWeindel · 2021-12-08T15:56:15Z

Resolved by updating the open-vm-tools package in the OS.

MartinWeindel added the kind/enhancement label Oct 19, 2021

gardener-robot added the area/robustness, platform/vsphere, and priority/3 labels Oct 19, 2021

MartinWeindel closed this as completed Dec 8, 2021
