
Bug: kube-proxy was trying to access K8s apiserver through LB URL after removing the LB #1226

Closed
JKBGIT1 opened this issue Feb 14, 2024 · 3 comments · Fixed by #1366
Labels: bug (Something isn't working), groomed (Task that everybody agrees to pass the gatekeeper)

Comments

JKBGIT1 (Contributor) commented Feb 14, 2024

Current Behaviour

The InputManifest contained some control-plane nodes and two static nodes. One static node was used as a worker node in the K8s cluster, and the other one was used as a LoadBalancer. The worker node couldn't start any pod because the systemd-resolved service wasn't running there.
We removed the LoadBalancer from the InputManifest. Then we started the systemd-resolved service on the worker node. After the start, the first pods on the worker node were created (kube-proxy and node-local); however, the Cilium agent on that worker node kept restarting because it couldn't connect to the Kubernetes apiserver. The kube-proxy pod on that worker node logged the following errors:

E0214 12:29:32.000604       1 node.go:152] Failed to retrieve node info: Get "https://lb-syd.worldofpotter.eu:6443/api/v1/nodes/syd01-1": x509: certificate is valid for control01-hetzner-hel1-iqpc8bt-1, kubernetes, kubernetes.default, kubernetes.default.svc, kubernetes.default.svc.cluster.local, loadbalancer.worldofpotter.eu, not lb-syd.worldofpotter.eu
E0214 12:30:11.396240       1 reflector.go:140] vendor/k8s.io/client-go/informers/factory.go:150: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://lb-syd.worldofpotter.eu:6443/api/v1/nodes?fieldSelector=metadata.name%3Dsyd01-1&limit=500&resourceVersion=0": x509: certificate is valid for control03-hetzner-nbg1-z88etn3-1, kubernetes, kubernetes.default, kubernetes.default.svc, kubernetes.default.svc.cluster.local, loadbalancer.worldofpotter.eu, not lb-syd.worldofpotter.eu

To me, it looks like kube-proxy still tried to connect to the Kubernetes apiserver through the LoadBalancer URL, even though the LoadBalancer had been removed.

Expected Behaviour

A healed worker node should recognize that the LoadBalancer isn't in use anymore and, most importantly, it should be able to connect to the Kubernetes apiserver.

Steps To Reproduce

  1. Create an InputManifest with two static nodes: one as a K8s worker node and the other as a LoadBalancer. The number and type of the K8s master nodes shouldn't matter, but FYI, we had 3 dynamic master nodes.
  2. Make sure the systemd-resolved service isn't running on the static worker node.
  3. Build the cluster.
  4. Remove the static LoadBalancer from the InputManifest.
  5. Start the systemd-resolved service on the static worker node.
  6. Check the logs of the kube-proxy pod running on that worker node (see the command sketch after this list).
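
For step 6, one way to locate and inspect that kube-proxy pod (the pod name below is a placeholder for whatever name your cluster assigns):

kubectl -n kube-system get pods -o wide | grep kube-proxy    # find the kube-proxy pod scheduled on the static worker node
kubectl -n kube-system logs <kube-proxy-pod-on-worker-node>  # look for apiserver connection errors like the ones above
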
JKBGIT1 added the bug label on Feb 14, 2024
bernardhalas (Member) commented Feb 16, 2024

  1. Let's see if we can reproduce the problem, but without steps 2 and 5 (stopping and starting systemd-resolved).
  2. Can we confirm that the problem only happens with the last static API LB? What if we remove the last VM API LB?

@JKBGIT1 could you please check this so we can talk about it again during the next grooming.

JKBGIT1 (Contributor, Author) commented Feb 20, 2024

The problem isn't completely mirrored without steps 2 and 5 (stopping and starting systemd-resolved). The difference is that after you remove the LB, all pods stay in the Running state, but the kube-proxy pods (on both master and worker nodes) log the following errors and warnings:

W0220 09:15:50.444218   	1 reflector.go:424] vendor/k8s.io/client-go/informers/factory.go:150: failed to list *v1.Node: Get "https://w1rnnbr0k7g8rj93p.platform2.gcp.e2e.claudie.io:6443/api/v1/nodes?fieldSelector=metadata.name%3Dcompute-gcp-1-eu-qit2860-1&resourceVersion=3379": dial tcp: lookup w1rnnbr0k7g8rj93p.platform2.gcp.e2e.claudie.io on 169.254.169.254:53: no such host
E0220 09:15:50.444332   	1 reflector.go:140] vendor/k8s.io/client-go/informers/factory.go:150: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://w1rnnbr0k7g8rj93p.platform2.gcp.e2e.claudie.io:6443/api/v1/nodes?fieldSelector=metadata.name%3Dcompute-gcp-1-eu-qit2860-1&resourceVersion=3379": dial tcp: lookup w1rnnbr0k7g8rj93p.platform2.gcp.e2e.claudie.io on 169.254.169.254:53: no such host
W0220 09:16:20.069375   	1 reflector.go:424] vendor/k8s.io/client-go/informers/factory.go:150: failed to list *v1.Service: Get "https://w1rnnbr0k7g8rj93p.platform2.gcp.e2e.claudie.io:6443/api/v1/services?labelSelector=%21service.kubernetes.io%2Fheadless%2C%21service.kubernetes.io%2Fservice-proxy-name&resourceVersion=3434": dial tcp: lookup w1rnnbr0k7g8rj93p.platform2.gcp.e2e.claudie.io on 169.254.169.254:53: no such host
E0220 09:16:20.069451   	1 reflector.go:140] vendor/k8s.io/client-go/informers/factory.go:150: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://w1rnnbr0k7g8rj93p.platform2.gcp.e2e.claudie.io:6443/api/v1/services?labelSelector=%21service.kubernetes.io%2Fheadless%2C%21service.kubernetes.io%2Fservice-proxy-name&resourceVersion=3434": dial tcp: lookup w1rnnbr0k7g8rj93p.platform2.gcp.e2e.claudie.io on 169.254.169.254:53: no such host
W0220 09:16:24.858667   	1 reflector.go:424] vendor/k8s.io/client-go/informers/factory.go:150: failed to list *v1.EndpointSlice: Get "https://w1rnnbr0k7g8rj93p.platform2.gcp.e2e.claudie.io:6443/apis/discovery.k8s.io/v1/endpointslices?labelSelector=%21service.kubernetes.io%2Fheadless%2C%21service.kubernetes.io%2Fservice-proxy-name&resourceVersion=3369": dial tcp: lookup w1rnnbr0k7g8rj93p.platform2.gcp.e2e.claudie.io on 169.254.169.254:53: no such host
E0220 09:16:24.858739   	1 reflector.go:140] vendor/k8s.io/client-go/informers/factory.go:150: Failed to watch *v1.EndpointSlice: failed to list *v1.EndpointSlice: Get "https://w1rnnbr0k7g8rj93p.platform2.gcp.e2e.claudie.io:6443/apis/discovery.k8s.io/v1/endpointslices?labelSelector=%21service.kubernetes.io%2Fheadless%2C%21service.kubernetes.io%2F

As you can see, kube-proxy still uses the LoadBalancer's URL even though the LB was removed.

Can we confirm that the problem only happens with the last static API LB? What if we remove the last VM API LB?

The problem described above in this comment happens in both cases.
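
One quick way to confirm which endpoint kube-proxy is actually configured with (this just reads the kubeconfig embedded in the kube-proxy ConfigMap of a default kubeadm setup):

kubectl -n kube-system get cm kube-proxy -o jsonpath='{.data.kubeconfig\.conf}' | grep server: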

Despire added the groomed label on Feb 23, 2024
cloudziu self-assigned this on Feb 28, 2024
cloudziu (Contributor) commented

My findings on this issue.
After a cluster is created and the control-plane API endpoint changes (adding a loadbalancer configuration / removing the loadbalancer), kube-eleven does not patch the kube-proxy ConfigMap with the new apiEndpoint:

kubectl -n kube-system get cm kube-proxy -o yaml

apiVersion: v1
data:
  config.conf: |-

    . . . [TRIMMED]

  kubeconfig.conf: |-
    apiVersion: v1
    kind: Config
    clusters:
    - cluster:
        certificate-authority: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        server: https://API_ENDPOINT_ADDRESS:6443 # <----- Value over here
      name: default
    contexts:
    - context:
        cluster: default
        namespace: default
        user: default
      name: default
    current-context: default
    users:
    - name: default
      user:
        tokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
kind: ConfigMap
metadata:
  labels:
    app: kube-proxy
  name: kube-proxy
  namespace: kube-system
  resourceVersion: "235"
  uid: 38ac488d-8dcb-4fb6-a78b-62d479232394

To fix this manually, I had to update the apiEndpoint in the ConfigMap and do a rollout restart of kube-proxy.
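
A rough sketch of that manual fix, assuming the new control-plane endpoint is NEW_API_ENDPOINT (a placeholder for whatever URL the cluster should use now):

kubectl -n kube-system edit cm kube-proxy                     # change the server: line under kubeconfig.conf to https://NEW_API_ENDPOINT:6443
kubectl -n kube-system rollout restart daemonset kube-proxy   # restart kube-proxy so it picks up the updated ConfigMap
kubectl -n kube-system rollout status daemonset kube-proxy    # wait for the restarted pods to become Ready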

cloudziu removed their assignment on Mar 14, 2024