
Cilium v1.14.2 with Kubernetes v1.28 is unstable #27982

Closed
2 tasks done
dghubble opened this issue Sep 7, 2023 · 33 comments
Labels
info-completed The GH issue has received a reply from the author kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. sig/agent Cilium agent related. sig/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers. stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale.

Comments

@dghubble
Contributor

dghubble commented Sep 7, 2023

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

Starting in Cilium v1.14.0 on Kubernetes v1.28.1, Cilium agents can lose connection to kube-apiserver when using kube-proxy and the kubernetes service ClusterIP. This looks closely related to #27900.

Cilium supports hybrid modes in which it coexists with kube-proxy while taking over some or all of kube-proxy's responsibilities (there are reasons one might not wish to remove kube-proxy). Cilium v1.14 removed the kube-proxy-replacement partial mode, changing the setting to either true or false. But something else appears to have changed as well:

Consider a cluster with a kube-proxy daemonset. kube-proxy uses ipvs to load balance the default kubernetes service ClusterIP to a kube-apiserver endpoint.

kubectl get service
NAME              TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                      AGE
kubernetes        ClusterIP   10.3.0.1       <none>        443/TCP                      26h

kubectl get endpoints
NAME              ENDPOINTS                                         AGE
kubernetes      10.0.8.71:6443                                    26h

By default, Cilium agents respect the standard KUBERNETES_SERVICE_HOST environment variable (10.3.0.1 here), which usually works fine.

level=info msg="Establishing connection to apiserver" host="https://10.3.0.1:443" subsys=k8s-client
level=info msg="Establishing connection to apiserver" host="https://10.3.0.1:443" subsys=k8s-client
level=error msg="Unable to contact k8s api-server" error="Get \"https://10.3.0.1:443/api/v1/namespaces/kube-system\": dial tcp 10.3.0.1:443: connect: operation not permitted" ipAddr="https://10.3.0.1:443" subsys=k8s-client

But I've noticed there is a (yet unknown) sequence of events whereby connectivity to the kubernetes service Cluster IP breaks on certain nodes. This can happen after days of otherwise running normally. I think it's related to node restarts because I see it more on spot instances. The result is that the Cilium agent on those nodes crashloops, unable to reach the apiserver.

Workaround

The workaround is updating the Cilium agent to use an explicit kube-apiserver IP address or DNS record via the KUBERNETES_SERVICE_HOST environment variable, but this should not be necessary and is undesirable. Workloads (including the Cilium agent) on clusters with kube-proxy should be able to use in-cluster service discovery.
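For reference, a minimal sketch of this workaround when Cilium is installed via Helm (the hostname below is a placeholder; adjust the release name, namespace, and values to your install):

# Point the agents at an explicit apiserver address instead of the
# in-cluster ClusterIP; hostname and port here are placeholders
helm upgrade cilium cilium/cilium --namespace kube-system --reuse-values \
  --set k8sServiceHost=apiserver.example.internal \
  --set k8sServicePort=6443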

I suspect the wrinkle here is that Cilium itself can interact with Kubernetes Service mappings. That or something about Kubernetes v1.28 itself.

Scope

I've observed this with KubeProxyReplacement false (enabling the individual features) and with KubeProxyReplacement true.

kube-proxy-replacement:  "false"
bpf-lb-sock: "true"
bpf-lb-external-clusterip: "true"
enable-node-port: "true"
enable-health-check-nodeport: "false"
enable-external-ips: "true"
enable-host-port: "true"

And with KubeProxyReplacement true

kube-proxy-replacement:  "true"

The choice of mode does not appear to matter for the issue or the fix.

Cilium Version

Cilium v1.14.0, v1.14.1

Kernel Version

Linux ip-10-0-11-132 6.4.7-200.fc38.aarch64 #1 SMP PREEMPT_DYNAMIC Thu Jul 27 20:22:11 UTC 2023 aarch64 GNU/Linux

Kubernetes Version

Kubernetes v1.28.1

Sysdump

No response

Relevant log output

No response

Anything else?

The fallback here should be kube-proxy's IPVS mode, which does program the right LVS rules:

ipvsadm -L -n
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.3.0.1:443 rr
  -> 10.0.8.71:6443               Masq    1      0          0
  ...

Code of Conduct

  • I agree to follow this project's Code of Conduct
@dghubble dghubble added kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. needs/triage This issue requires triaging to establish severity and next steps. labels Sep 7, 2023
@dghubble dghubble changed the title from "Cilium v1.14 with Kubernetes v1.28 is unstable" to "Cilium v1.14.1 with Kubernetes v1.28 is unstable" Sep 7, 2023
@dghubble
Contributor Author

dghubble commented Sep 7, 2023

When this occurs on a node, the kubernetes Service cluster IP can't be used from the host, which aligns with what Pods see.

curl https://10.3.0.1
hangs

Once Cilium agents start (using the workaround above), the same curl command works from hosts. Seems like Cilium having run on the node in the past interferes with IPVS functionality.
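For anyone trying to reproduce, a rough diagnostic sketch (assuming the in-pod cilium CLI and the ClusterIP from above; the pod name is a placeholder):

# Check host connectivity to the kubernetes ClusterIP (hangs when broken)
curl -k --connect-timeout 5 https://10.3.0.1/healthz

# Compare what Kubernetes advertises with what Cilium has programmed
kubectl get endpoints kubernetes
kubectl -n kube-system exec <cilium-pod> -- cilium service list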

@youngnick
Contributor

Thanks for this issue @dghubble, and especially for the great investigation.

Cilium 1.14 actually only supports up to Kubernetes 1.27 - the client upgrade has only been merged into main at this point. It is odd that there are many more compatibility issues than usual (see #27900 and #27965 for other examples), so we'll see if we can do something to resolve all of these. It feels like, at the very least, we should call out that there are known issues with Cilium 1.14.x and Kubernetes 1.28, but I'll talk to some other folks and see what the consensus is.

@youngnick youngnick added the need-more-info More information is required to further debug or fix the issue. label Sep 8, 2023
@Silvest89

This actually happens after a node reboot. I am using a k3s cluster with kured to have automatic reboots after updates. After the reboot, the node can no longer reach kube-dns, so it cannot reach anything.

@youngnick
Contributor

It looks like there may be two issues in play here: An upstream issue (kubernetes/kubernetes#120247), and possibly #27848. The upstream issue fix is in and will be included in Kubernetes 1.28.2, due out soon, and the other investigation is ongoing at #27848.

@aanm
Member

aanm commented Sep 13, 2023

@dghubble @Silvest89 would you be able to test it again with Kubernetes 1.28.2? Thank you

@zhurkin

zhurkin commented Sep 14, 2023

@dghubble @Silvest89 would you be able to test it again with Kubernetes 1.28.2? Thank you
I don't know if this information will be useful.

cicd-kub-control-01:/home/icce# cilium status
Cilium:             1 errors, 3 warnings
Operator:           OK
Envoy DaemonSet:    disabled (using embedded mode)
Hubble Relay:       disabled
ClusterMesh:        disabled

DaemonSet              cilium             Desired: 3, Unavailable: 3/3
Deployment             cilium-operator    Desired: 1, Ready: 1/1, Available: 1/1
Containers:            cilium             Pending: 3
                       cilium-operator    Running: 1
Cluster Pods:          0/2 managed by Cilium
Helm chart version:    1.15.0-pre.0
Image versions         cilium             quay.io/cilium/cilium:v1.15.0-pre.0: 3
                       cilium-operator    quay.io/cilium/operator-generic:v1.15.0-pre.0: 1
Errors:                cilium             cilium          3 pods of DaemonSet cilium are not ready
Warnings:              cilium             cilium-9ch5r    pod is pending
                       cilium             cilium-f8zgf    pod is pending
                       cilium             cilium-j6kql    pod is pending

cicd-kub-control-01:/home/icce# kubectl get nodes

NAME                  STATUS   ROLES           AGE     VERSION
cicd-kub-control-01   Ready    control-plane   3d1h    v1.28.2
cicd-kub-control-02   Ready    control-plane   2d23h   v1.28.2
cicd-kub-control-03   Ready    control-plane   2d22h   v1.28.2
cicd-kub-control-01:/home/icce# kubectl get po -A | grep -e cilium -e core
kube-system   cilium-9ch5r                       0/1   Init:CreateContainerError   0   98s
kube-system   cilium-f8zgf                       0/1   Init:CreateContainerError   0   98s
kube-system   cilium-j6kql                       0/1   Init:CreateContainerError   0   98s
kube-system   cilium-operator-756dfd6d4d-nfxk5   1/1   Running                     0   98s
kube-system   coredns-5dd5756b68-brfj8           0/1   Pending                     0   2d16h
kube-system   coredns-5dd5756b68-ql7mw           0/1   Pending                     0   2d16h

cicd-kub-control-01:/home/icce# kubectl -n kube-system logs cilium-9ch5r
Defaulted container "cilium-agent" out of: cilium-agent, config (init), mount-cgroup (init), apply-sysctl-overwrites (init), mount-bpf-fs (init), clean-cilium-state (init), install-cni-binaries (init)
Error from server (BadRequest): container "cilium-agent" in pod "cilium-9ch5r" is waiting to start: PodInitializing

@dghubble
Contributor Author

Preliminarily, on a Kubernetes v1.28.2 cluster, I've not been able to reproduce the issue. Restarting nodes, Cilium can reach the apiserver just fine, which I suspected was the trigger before. I observed the original issue in real production clusters though, after several days of use, so I'll have more confidence in a few days.

@github-actions github-actions bot added info-completed The GH issue has received a reply from the author and removed need-more-info More information is required to further debug or fix the issue. labels Sep 14, 2023
@julianwiedmann
Member

Preliminarily, on a Kubernetes v1.28.2 cluster, I've not been able to reproduce the issue. Restarting nodes, Cilium can reach the apiserver just fine, which I suspected was the trigger before. I observed the original issue in real production clusters though, after several days of use, so I'll have more confidence in a few days.

Thank you for the feedback! Let's leave it in need-more-info then until you have full confidence.

@julianwiedmann julianwiedmann added need-more-info More information is required to further debug or fix the issue. and removed info-completed The GH issue has received a reply from the author labels Sep 15, 2023
@aojea
Contributor

aojea commented Sep 18, 2023

This seems a duplicate of #27900, should we close it @aanm @julianwiedmann ?

@dghubble
Contributor Author

I've seen this occur once on a new cluster with Kubernetes v1.28.2 and Cilium v1.14.2. Most clusters have been fine since those upgrades.

Is there anything specific I should be collecting to confirm it's the same issue? Unfortunately, I usually have to apply mitigations ASAP and can't afford to leave clusters in this broken state for long.

@github-actions github-actions bot added info-completed The GH issue has received a reply from the author and removed need-more-info More information is required to further debug or fix the issue. labels Sep 18, 2023
@Silvest89

@aanm @dghubble
Upgraded my cluster from 1.27.6 to 1.28.2. Everything seems to be working smoothly =]

@dghubble dghubble changed the title from "Cilium v1.14.1 with Kubernetes v1.28 is unstable" to "Cilium v1.14.2 with Kubernetes v1.28 is unstable" Sep 23, 2023
@dghubble
Contributor Author

dghubble commented Sep 23, 2023

This issue can still happen. I've had to explicitly set KUBERNETES_SERVICE_HOST to an external DNS record so that Cilium can reliably find the apiserver. This should not be required; in-cluster kube-proxy should be sufficient.

level=info msg="Establishing connection to apiserver" host="https://10.3.0.1:443" subsys=k8s-client
level=error msg="Unable to contact k8s api-server" error="Get \"https://10.3.0.1:443/api/v1/namespaces/kube-system\": dial tcp 10.3.0.1:443: connect: operation not permitted" ipAddr="https://10.3.0.1:443" subsys=k8s-client
level=error msg="Start hook failed" error="Get \"https://10.3.0.1:443/api/v1/namespaces/kube-system\": dial tcp 10.3.0.1:443: connect: operation not permitted" function="client.(*compositeClientset).onStart" subsys=hive

This can take days of real-world usage to become evident. Fresh clusters looked fine, but they're not fine.
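For anyone else hitting this, the mitigation I apply is roughly the following (a sketch; the DNS name is a placeholder and it assumes the stock cilium DaemonSet in kube-system):

# Override the apiserver address for the agents (placeholder values)
kubectl -n kube-system set env daemonset/cilium \
  KUBERNETES_SERVICE_HOST=apiserver.example.internal \
  KUBERNETES_SERVICE_PORT=6443
kubectl -n kube-system rollout status daemonset/cilium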

@lmb lmb added sig/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers. sig/agent Cilium agent related. and removed needs/triage This issue requires triaging to establish severity and next steps. labels Sep 26, 2023
@aanm
Member

aanm commented Sep 26, 2023

@dghubble that could be related to an issue with kube-proxy rather than Cilium itself. As you pointed out, connecting directly to an external DNS record works, but connecting via the ClusterIP, for which kube-proxy does the service translation, does not.

@dghubble
Contributor Author

@squeed 👋🏻 long time! Yeah, my suspicion is that it's related to the overlapping responsibility kube-proxy and Cilium have for managing the apiserver's own Kubernetes Service traffic. Having socket-lb optionally exclude the apiserver itself could be helpful. The odd part is this was never an issue before; I'm not sure if something changed here with the shifting kube-proxy modes. At the next Kubernetes patch release, I should have an opportunity to test this again and capture logs (or sooner if I find time).

@tedli The workaround to this issue is giving Cilium explicit IP addresses for the apiserver (undesired). If you're seeing issues in that case, you're probably describing a separate issue.

tommyp1ckles added commits that referenced this issue Oct 17 and Oct 18, 2023
K8s v1.28.0 causes the following regression: #27982.

Most noticeably, this has been causing k8s conformance test failures.

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
michi-covalent pushed a commit that referenced this issue Oct 18, 2023
K8s v1.28.0 causes the following regression: #27982.

Most noticeably, this has been causing k8s conformance test failures.

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
@dghubble
Contributor Author

I saw it recur on one node today! A Cilium agent pod was unable to reach kube-apiserver. My usual workaround is to modify the Cilium DaemonSet to explicitly set a KUBERNETES_SERVICE_HOST, but this time I tried rebooting the host, which does mitigate the issue. This seems to support the theory that stale BPF rules are left behind, preventing the new Cilium Pod from reaching kube-apiserver via kube-proxy as it normally can.

Bugtool: https://storage.googleapis.com/dghubble/bugtool.tar.gz (too big for GitHub)

@squeed
Contributor

squeed commented Oct 20, 2023

So, looking at the bugtool, I see the following set of backends for the apiserver service:

10.3.0.11:443        0.0.0.0:0 (16) (0) [ClusterIP]   (ignore this bit)
                     10.2.1.160:443 (16) (2)          
                     10.2.2.35:443 (16) (1)

As of the time of the bugtool, are those correct? My theory is that cilium is missing changes to the default/kubernetes service, and once that happens it can't recover. (This is why you should not use the apiserver ClusterIP with socket-lb).
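One way to check that theory the next time it happens (a rough sketch; the pod name is a placeholder and it assumes the in-pod cilium CLI) is to compare what Kubernetes currently advertises with what the datapath has programmed:

# Current apiserver endpoints according to Kubernetes
kubectl get endpointslices -l kubernetes.io/service-name=kubernetes

# Backends Cilium has programmed for the kubernetes ClusterIP (address from this cluster)
kubectl -n kube-system exec <cilium-pod> -- cilium bpf lb list | grep -A2 10.3.0.1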

Separately, #25169 may also be relevant to this.

@dghubble
Contributor Author

dghubble commented Oct 20, 2023

Yeah, this looks odd. Those two backend IPs are from the Pod CIDR range (10.2.0.0/16). But the apiserver runs on controller/master node(s) with host networking (10.0.4.0/22 in this case), as a static pod. Kubernetes and kube-proxy see this as I'd expect; there is just one apiserver/master in this cluster:

kubectl get endpoints kubernetes
NAME         ENDPOINTS        AGE
kubernetes   10.0.4.26:6443   29d

I'm no longer sure what the Pod IPs Cilium was seeing correspond to. Cilium now shows the right backend:

ID   Frontend             Service Type   Backend                          
...
2    10.3.0.1:443         ClusterIP      1 => 10.0.4.26:6443 (active)

kubectl get pods -n kube-system -o wide
NAME                                          READY   STATUS    RESTARTS      AGE   IP          NODE                           NOMINATED NODE   READINESS GATES
kube-apiserver-magnesium.region.dghubble.io   1/1     Running   3 (33h ago)   29d   10.0.4.26   magnesium.region.dghubble.io   <none>           <none>

Interesting that the apiserver restarted 33h ago, but maybe that's a coincidence. And only the Cilium agent on one node got into this bad state. The prior apiserver logs show it clearing the kubernetes endpoints:

W1019 06:10:07.698343       1 lease.go:263] Resetting endpoints for master service "kubernetes" to []

@dghubble
Contributor Author

dghubble commented Oct 20, 2023

Btw, I've preferred Cilium using in-cluster discovery (i.e. 10.3.0.1 via kube-proxy) in a Kubernetes distro because it's platform agnostic. Giving Cilium an IP hardcodes a value (and to support multi-master I'd need to create VIPs in a platform-agnostic way on behalf of users), and giving Cilium a DNS record pointing to the apiserver is hard to do in a platform-agnostic way (e.g. public vs private clusters use different FQDNs depending on AWS private endpoints, Azure Private Link, etc). Though that may be the way we go. The Cilium docs just say to provide the real apiserver address somehow.

In theory, Cilium could just read the kubernetes endpoints directly and do its own load balancing (since client-go doesn't load balance across multiple IPs), similar to what kube-proxy provides. Distros wouldn't have to decide how to give Cilium a "real" apiserver address.

@squeed
Contributor

squeed commented Oct 23, 2023

@dghubble makes perfect sense; the ultimate solution may be to add a flag disabling socket-lb for host-netns processes (or perhaps just the Cilium agent). Then Cilium would receive load balancing from kube-proxy, and pods would use socket-lb. WDYT?

@dghubble
Contributor Author

I suspect that would fix this situation. Did Cilium 1.13 with partial kube-proxy previously work this way? It's odd this became a problem so recently.

dghubble added a commit to poseidon/typhoon that referenced this issue Oct 29, 2023
* With Cilium v1.14, Cilium's kube-proxy partial mode changed to
either be enabled or disabled (not partial). This sometimes leaves
Cilium (and the host) unable to reach the kube-apiserver via the
in-cluster Kubernetes Service IP, until the host is rebooted
* As a workaround, configure Cilium to rely on external DNS resolvers
to find the IP address of the apiserver. This is less portable
and less "clean" than using in-cluster discovery, but also what
Cilium wants users to do. Revert this when the upstream issue
cilium/cilium#27982 is resolved
@Silvest89

@dghubble I've been running my cluster for more than a month now, haven't run into any issues. Cluster bootstrapped using k3s on Hetzner Cloud

@squeed
Contributor

squeed commented Nov 14, 2023

I believe 1.14 brings changes to the socket-lb, but that’s not my area of expertise. @aditighag, any pointers?

@n-able-consulting

n-able-consulting commented Dec 5, 2023

I was fixated on this issue for a day before I found this thread.
I am using k8s 1.28.4 and Cilium 1.14.4, with my own scripted provisioning. Suddenly it breaks when enabling OIDC...
I run on bare metal. I have multiple interfaces to different VLANs, so I can use the 'internal' interface in my configuration. Technically, it's no big deal.

I have a problem with what this means for a highly available setup, because I then have to address the 'outside' load balancer for in-cluster traffic. I do not like it. Also, I cannot imagine this is the way it is meant to be.

We build in PKI and an overlay to secure all communication, and then we break it open to talk over the 'outside' infrastructure network instead of using the built-in overlay Kubernetes service. I can hardly believe this.

@n-able-consulting

The issue is older: Cilium 1.13.9 also breaks when configuring OIDC on the api-server...

@n-able-consulting

I can confirm it is a 1.28 issue. Provisioning k8s 1.27, Cilium does not break after configuring OIDC. By breaking I refer to the familiar issue: Unable to contact k8s api-server / Forbidden 10.2.0.1:6443.
It works when setting k8sServiceHost in 1.28, which, as I stated above, in my opinion punches a hole in the k8s security plane... and as such is not a solution.
If not clear yet, I do not like this Cilium feature.


github-actions bot commented Feb 4, 2024

This issue has been automatically marked as stale because it has not
had recent activity. It will be closed if no further activity occurs.

@github-actions github-actions bot added the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Feb 4, 2024

This issue has not seen any activity since it was marked stale.
Closing.

@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Feb 19, 2024
@dghubble
Contributor Author

I ultimately had to adapt our Kubernetes distro to tell Cilium the DNS name resolving to any of the apiservers. The approach and resolver varies based on the cloud provider.

It's a shame; Cilium used to support the in-cluster kubernetes ClusterIP, but now it effectively relies on an external resolver.
