This repository has been archived by the owner on Feb 12, 2024. It is now read-only.

Provide remedy controller for replacing/repairing nodes with lost IP address in vCenter #180

Closed
MartinWeindel opened this issue Oct 19, 2021 · 1 comment
Labels
area/robustness Robustness, reliability, resilience related kind/enhancement Enhancement, improvement, extension platform/vsphere VMware vSphere platform/infrastructure priority/3 Priority (lower number equals higher priority)

Comments

@MartinWeindel
Contributor

How to categorize this issue?

/area robustness
/kind enhancement
/priority 3
/platform vsphere

What would you like to be added:
Handle nodes that lose their IP address in vCenter and, as a consequence, in the status of the Kubernetes node object.
Either restart such nodes, find some way to repair them, or at least move the calico-typha-deploy-... pod to another node.
This seems to be a task for a remedy controller.
The solution to resolve the root cause would probably be a fix in vSphere/vCenter, but it is unclear how long we have to wait for that.

Why is this needed:
Sporadically, worker nodes lose their IP address in vCenter. In that situation, the cloud-controller-manager can no longer provide the IP address for the Kubernetes node object.

Good case:

$ kubectl get node shoot--garden--vsphere-rld1-cpu-worker-z1-6b8c9-tlkbp -oyaml
apiVersion: v1
kind: Node
metadata:
  name: shoot--garden--vsphere-rld1-cpu-worker-z1-6b8c9-tlkbp
spec:
  podCIDR: 10.231.133.0/24
  podCIDRs:
  - 10.231.133.0/24
  providerID: vsphere://4207d8f1-1a6c-2609-e284-a25789a9583c
status:
  addresses:
  - address: shoot--garden--vsphere-rld1-cpu-worker-z1-6b8c9-tlkbp
    type: Hostname
  - address: 10.230.130.54
    type: InternalIP
  - address: 10.230.130.54
    type: ExternalIP
  allocatable:
  …

Bad case:

$ kubectl get node shoot--garden--vsphere-rld1-cpu-worker-z1-6b8c9-pt694 -oyaml
apiVersion: v1
kind: Node
metadata:
  name: shoot--garden--vsphere-rld1-cpu-worker-z1-6b8c9-pt694
  …
spec:
  podCIDR: 10.231.139.0/24
  podCIDRs:
  - 10.231.139.0/24
  providerID: vsphere://4207c3da-d8ba-d016-6c26-e3636471ecc1
status:
  addresses:
  - address: shoot--garden--vsphere-rld1-cpu-worker-z1-6b8c9-pt694
    type: Hostname
  allocatable:
  …

In this case, pods in the kube-system namespace that run on the node's host network lose their IP address. The node itself still has the IP address assigned by DHCP.

This issue can break any cluster: if the node running the calico-typha-deploy-... pod loses its IP address in the node object, no calico-node pod starts successfully anymore.
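As a starting point for a remedy controller, the difference between the good and the bad case above can be detected mechanically: the affected node has no `InternalIP` entry in `status.addresses`. The following is only a minimal sketch under that assumption. The function names `nodes_missing_internal_ip` and `fetch_nodes` are hypothetical, and the sketch only detects affected nodes; it does not restart or repair anything.

```python
import json
import subprocess


def nodes_missing_internal_ip(node_list):
    """Return the names of nodes whose status.addresses has no InternalIP entry.

    node_list is the parsed output of `kubectl get nodes -o json`.
    """
    missing = []
    for node in node_list.get("items", []):
        # status.addresses may be absent or null on a broken node object.
        addresses = node.get("status", {}).get("addresses") or []
        address_types = {entry.get("type") for entry in addresses}
        if "InternalIP" not in address_types:
            missing.append(node["metadata"]["name"])
    return missing


def fetch_nodes():
    """Load all node objects via kubectl.

    Assumes kubectl is installed and the current kubeconfig points at the shoot cluster.
    """
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Calling `nodes_missing_internal_ip(fetch_nodes())` against a cluster would list the candidates for a restart or for moving the calico-typha-deploy-... pod. A real controller would more likely watch node objects through the Kubernetes API than poll with kubectl.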

@MartinWeindel MartinWeindel added the kind/enhancement Enhancement, improvement, extension label Oct 19, 2021
@gardener-robot gardener-robot added area/robustness Robustness, reliability, resilience related platform/vsphere VMware vSphere platform/infrastructure priority/3 Priority (lower number equals higher priority) labels Oct 19, 2021
@MartinWeindel
Contributor Author

Resolved by updating the open-vm-tools package in the OS.
