
Bug: kube-proxy was trying to access K8s apiserver through LB URL after removing the LB #1226

Closed
JKBGIT1 opened this issue Feb 14, 2024 · 3 comments · Fixed by #1366
Labels: bug (Something isn't working), groomed (Task that everybody agrees to pass the gatekeeper)

Comments

JKBGIT1 (Contributor) commented Feb 14, 2024

Current Behaviour

The InputManifest contained some control-plane nodes and two static nodes. One static node was used as a worker node in the K8s cluster, and the other one was used as a LoadBalancer. The worker node couldn't start any pod because the systemd-resolved service wasn't running there.
We removed the LoadBalancer from the InputManifest. Then we started the systemd-resolved service on the worker node. After the start, the first pods on the worker node were created (kube-proxy and node-local); however, the Cilium agent on that worker node kept restarting because it couldn't connect to the Kubernetes apiserver. The kube-proxy pod on that worker node logged the following errors:

E0214 12:29:32.000604       1 node.go:152] Failed to retrieve node info: Get "https://lb-syd.worldofpotter.eu:6443/api/v1/nodes/syd01-1": x509: certificate is valid for control01-hetzner-hel1-iqpc8bt-1, kubernetes, kubernetes.default, kubernetes.default.svc, kubernetes.default.svc.cluster.local, loadbalancer.worldofpotter.eu, not lb-syd.worldofpotter.eu
E0214 12:30:11.396240       1 reflector.go:140] vendor/k8s.io/client-go/informers/factory.go:150: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://lb-syd.worldofpotter.eu:6443/api/v1/nodes?fieldSelector=metadata.name%3Dsyd01-1&limit=500&resourceVersion=0": x509: certificate is valid for control03-hetzner-nbg1-z88etn3-1, kubernetes, kubernetes.default, kubernetes.default.svc, kubernetes.default.svc.cluster.local, loadbalancer.worldofpotter.eu, not lb-syd.worldofpotter.eu

To me, it looks like kube-proxy still tried to connect to the Kubernetes apiserver through the LoadBalancer URL, even though the LoadBalancer had been removed.

Expected Behaviour

A healed worker node should recognize that the LoadBalancer isn't in use anymore and, most importantly, it should be able to connect to the Kubernetes apiserver.

Steps To Reproduce

  1. Create an InputManifest with two static nodes: one as a K8s worker node and the other as a LoadBalancer. The number and type of the K8s master nodes shouldn't matter, but FYI, we had 3 dynamic master nodes.
  2. Make sure the systemd-resolved service isn't running on the static worker node.
  3. Build the cluster.
  4. Remove the static LoadBalancer from the InputManifest.
  5. Start the systemd-resolved service on the static worker node.
  6. Check the logs of the kube-proxy pod running on that worker node (see the command sketch after this list).
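
For step 6, one way to locate and inspect that kube-proxy pod (the pod name below is a placeholder for whatever name your cluster assigns):

kubectl -n kube-system get pods -o wide | grep kube-proxy    # find the kube-proxy pod scheduled on the static worker node
kubectl -n kube-system logs <kube-proxy-pod-on-worker-node>  # look for apiserver connection errors like the ones above
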
JKBGIT1 added the bug label on Feb 14, 2024
bernardhalas (Member) commented Feb 16, 2024

  1. Let's see if we can reproduce the problem, but without steps 2 and 5 (stopping and starting systemd-resolved).
  2. Can we confirm that the problem only happens with the last static API LB? What if we remove the last VM API LB?

@JKBGIT1 could you please check this so we can talk about it again during the next grooming.

JKBGIT1 (Contributor, Author) commented Feb 20, 2024

The problem isn't completely mirrored without steps 2 and 5 (stopping and starting systemd-resolved). The difference is that after you remove the LB, all pods stay in the Running state, but the kube-proxy pods (on both master and worker nodes) log the following errors and warnings:

W0220 09:15:50.444218   	1 reflector.go:424] vendor/k8s.io/client-go/informers/factory.go:150: failed to list *v1.Node: Get "https://w1rnnbr0k7g8rj93p.platform2.gcp.e2e.claudie.io:6443/api/v1/nodes?fieldSelector=metadata.name%3Dcompute-gcp-1-eu-qit2860-1&resourceVersion=3379": dial tcp: lookup w1rnnbr0k7g8rj93p.platform2.gcp.e2e.claudie.io on 169.254.169.254:53: no such host
E0220 09:15:50.444332   	1 reflector.go:140] vendor/k8s.io/client-go/informers/factory.go:150: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://w1rnnbr0k7g8rj93p.platform2.gcp.e2e.claudie.io:6443/api/v1/nodes?fieldSelector=metadata.name%3Dcompute-gcp-1-eu-qit2860-1&resourceVersion=3379": dial tcp: lookup w1rnnbr0k7g8rj93p.platform2.gcp.e2e.claudie.io on 169.254.169.254:53: no such host
W0220 09:16:20.069375   	1 reflector.go:424] vendor/k8s.io/client-go/informers/factory.go:150: failed to list *v1.Service: Get "https://w1rnnbr0k7g8rj93p.platform2.gcp.e2e.claudie.io:6443/api/v1/services?labelSelector=%21service.kubernetes.io%2Fheadless%2C%21service.kubernetes.io%2Fservice-proxy-name&resourceVersion=3434": dial tcp: lookup w1rnnbr0k7g8rj93p.platform2.gcp.e2e.claudie.io on 169.254.169.254:53: no such host
E0220 09:16:20.069451   	1 reflector.go:140] vendor/k8s.io/client-go/informers/factory.go:150: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://w1rnnbr0k7g8rj93p.platform2.gcp.e2e.claudie.io:6443/api/v1/services?labelSelector=%21service.kubernetes.io%2Fheadless%2C%21service.kubernetes.io%2Fservice-proxy-name&resourceVersion=3434": dial tcp: lookup w1rnnbr0k7g8rj93p.platform2.gcp.e2e.claudie.io on 169.254.169.254:53: no such host
W0220 09:16:24.858667   	1 reflector.go:424] vendor/k8s.io/client-go/informers/factory.go:150: failed to list *v1.EndpointSlice: Get "https://w1rnnbr0k7g8rj93p.platform2.gcp.e2e.claudie.io:6443/apis/discovery.k8s.io/v1/endpointslices?labelSelector=%21service.kubernetes.io%2Fheadless%2C%21service.kubernetes.io%2Fservice-proxy-name&resourceVersion=3369": dial tcp: lookup w1rnnbr0k7g8rj93p.platform2.gcp.e2e.claudie.io on 169.254.169.254:53: no such host
E0220 09:16:24.858739   	1 reflector.go:140] vendor/k8s.io/client-go/informers/factory.go:150: Failed to watch *v1.EndpointSlice: failed to list *v1.EndpointSlice: Get "https://w1rnnbr0k7g8rj93p.platform2.gcp.e2e.claudie.io:6443/apis/discovery.k8s.io/v1/endpointslices?labelSelector=%21service.kubernetes.io%2Fheadless%2C%21service.kubernetes.io%2F

As you can see, kube-proxy still uses the LoadBalancer's URL even though the LB was removed.

Can we confirm that the problem only happens with the last static API LB? What if we remove the last VM API LB?

The problem described above in this comment happens in both cases.
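
One quick way to confirm which endpoint kube-proxy is actually configured with (this just reads the kubeconfig embedded in the kube-proxy ConfigMap of a default kubeadm setup):

kubectl -n kube-system get cm kube-proxy -o jsonpath='{.data.kubeconfig\.conf}' | grep server: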

Despire added the groomed label on Feb 23, 2024
cloudziu self-assigned this on Feb 28, 2024
cloudziu (Contributor) commented

My findings on this issue.
After a cluster is created and the control-plane API endpoint changes (adding a loadbalancer configuration / removing the loadbalancer), kube-eleven does not patch the kube-proxy ConfigMap with the new apiEndpoint:

kubectl -n kube-system get cm kube-proxy -o yaml

apiVersion: v1
data:
  config.conf: |-

    . . . [TRIMMED]

  kubeconfig.conf: |-
    apiVersion: v1
    kind: Config
    clusters:
    - cluster:
        certificate-authority: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        server: https://API_ENDPOINT_ADDRESS:6443 # <----- Value over here
      name: default
    contexts:
    - context:
        cluster: default
        namespace: default
        user: default
      name: default
    current-context: default
    users:
    - name: default
      user:
        tokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
kind: ConfigMap
metadata:
  labels:
    app: kube-proxy
  name: kube-proxy
  namespace: kube-system
  resourceVersion: "235"
  uid: 38ac488d-8dcb-4fb6-a78b-62d479232394

To fix this manually, I had to update the apiEndpoint in the ConfigMap and do a rollout restart of kube-proxy.
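
A rough sketch of that manual fix, assuming the new control-plane endpoint is NEW_API_ENDPOINT (a placeholder for whatever URL the cluster should use now):

kubectl -n kube-system edit cm kube-proxy                     # change the server: line under kubeconfig.conf to https://NEW_API_ENDPOINT:6443
kubectl -n kube-system rollout restart daemonset kube-proxy   # restart kube-proxy so it picks up the updated ConfigMap
kubectl -n kube-system rollout status daemonset kube-proxy    # wait for the restarted pods to become Ready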

cloudziu removed their assignment on Mar 14, 2024