Coredns fails connecting to kube-api via kubernetes service #27900
Comments
Thanks for this detailed issue @azzid. Unfortunately Cilium 1.14 only supports Kubernetes 1.27 - it looks like there might be something in the 1.28 upgrade that's not working properly. Could you try either doing Cilium 1.14 on Kubernetes 1.27, or Cilium …
Is there a spelling mistake there? 1.14 on 1.28 is what I'm already failing at, right? ;-)
🤦 Yes - Cilium 1.14 on Kubernetes 1.27 is what I meant. Thanks!
Forced a downgrade (basically …)
Thereafter …
Okay, so it's probably something about coredns and Cilium 1.14 on Kubernetes 1.28 - thanks!
Oh, to be clear: were you using Kubernetes 1.28 before Cilium 1.14, or does this problem only occur with Cilium 1.14 and Kubernetes 1.28? Do you know what happens with Cilium 1.13 and Kube 1.28? (Trying to figure out whether it's only the Kubernetes version, or whether we changed something in 1.14 that causes it.)
That's unfortunately a bit unclear. I was using something older before doing the Kubernetes upgrade - so after the nodes were on 1.28 I upgraded Cilium to 1.14 - but I don't know:
Tried upgrading to 1.15 on 1.27:
Not only was I unsuccessful, it also seems to have successfully broken even the older versions:
Probably due to me doing something wrong, causing the kube-proxy replacement to not be enabled:
Manually setting …
After which another upgrade was performed:
Status looks good:
but the proxy replacement is disabled again:
Re-enabled it:
but 1.15 still seems to work poorly.
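For context on the kube-proxy replacement setting being toggled above: in the Cilium Helm chart it is controlled through values along the following lines. This is only a sketch - the API server address is a placeholder, and note that Cilium 1.14 changed the option from `strict`/`partial`/`disabled` to a boolean:

```yaml
# Sketch of Cilium Helm values enabling kube-proxy replacement.
# kubeProxyReplacement became a boolean in Cilium 1.14 (formerly "strict" etc.).
kubeProxyReplacement: true
# When kube-proxy is absent, Cilium must reach the API server directly:
k8sServiceHost: 192.168.0.10   # placeholder control-plane address
k8sServicePort: 6443
```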
I've installed minikube with k8s 1.28.1 and Cilium 1.14.1 and the connectivity tests have passed:
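A setup like the one described can be sketched roughly as follows. The exact flags are assumptions, not the commands actually used; `--cni=false` leaves CNI installation to the Cilium CLI:

```shell
# Hypothetical reproduction of the passing minikube setup.
minikube start --kubernetes-version=v1.28.1 --cni=false
cilium install --version 1.14.1   # cilium-cli installs the CNI into the cluster
cilium connectivity test          # reportedly passes on this combination
```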
Since yesterday my version soup looks like:
Everything looks a bit broken
not only coredns but cilium itself is also having trouble with the kube-api service:
Since it works in …, how does one go about flushing whatever configuration might be left behind from the earlier attempts? I'd prefer to get a grasp on what's going on rather than restarting with … I've found `cilium cleanup` in the docs, but I stumble on using it.
I tried running it in a cilium-agent pod yesterday, but while the cilium command there was aware of it, it refused to do anything as the cilium service was running.
I have exactly the same problem. Pods are failing to connect to the apiserver - …
Starting to investigate the problem :) So I have a cluster with 4 nodes; the first 3 are control-plane nodes. I've created a test daemonset and am trying to launch it:
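A test DaemonSet of the kind described might look like this. This is only a sketch - the image, name, and probe command are assumptions, not the commenter's actual manifest:

```yaml
# Hypothetical DaemonSet that probes the in-cluster kubernetes service from every node.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: api-probe
spec:
  selector:
    matchLabels:
      app: api-probe
  template:
    metadata:
      labels:
        app: api-probe
    spec:
      containers:
      - name: probe
        image: curlimages/curl
        command: ["sh", "-c"]
        args:
        # /healthz answers without auth; a timeout here indicates the service path is broken
        - while true; do curl -ksm5 https://kubernetes.default.svc/healthz; sleep 10; done
```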
I managed to get … I did a …
which got the configmap right from the start:
It also got all the pods into working shape.
@azzid can you clarify which steps you used for the working scenario and which for the non-working scenario? Thank you
I'll try to reproduce again. What should trigger the problem is upgrading to 1.28.
tl;dr - upgrade k8s from 1.27 to 1.28 and coredns won't be able to start after a cluster reboot.
Reproduction
I have come to realize that I'm using the now deprecated Google repos - after switching to …
But even with the upgraded … I upgraded all cluster nodes to 1.28:
everything seemed fine:
so I did a reboot (…):
versions after upgrade
note that only kubernetes has been upgraded - cilium has remained untouched.
Sorry for being a bit verbose - just trying to be transparent so any user errors on my part will be apparent.
Cc @ti-mo wrt the cgroup links
@brb, will re-run the experiment in approx. 4 hours! Thanks!
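The upgrade path described above (kubeadm from 1.27 to 1.28, then a reboot) roughly corresponds to the standard kubeadm procedure. A sketch for a Debian/Ubuntu control-plane node - package versions and the absence of drain/hold steps are simplifying assumptions:

```shell
# Hypothetical kubeadm minor-version upgrade (run on a control-plane node).
apt-get update && apt-get install -y kubeadm=1.28.1-*
kubeadm upgrade plan
kubeadm upgrade apply v1.28.1
apt-get install -y kubelet=1.28.1-* kubectl=1.28.1-*
systemctl restart kubelet
reboot   # the reboot is what reportedly triggers the coredns failure
```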
Trying to confirm my reproducer is related to #27848 by installing Cilium 1.13.6 and then restarting all the Kind nodes… and it still looks like that fails for me. So that's a bit of a head-scratcher, as commentary in the other issue seems to correlate with the cgroup attachment findings from @brb here.
Reproducer procedure:
…
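A Kind-based reproducer of this shape might look roughly like the following. This is a sketch under assumptions (node image tag, restart mechanism), not the exact procedure used:

```shell
# Hypothetical Kind reproducer: install Cilium 1.13.6, "reboot" the nodes, check coredns.
kind create cluster --image kindest/node:v1.28.0
cilium install --version 1.13.6
cilium status --wait
docker restart $(docker ps -q --filter name=kind)    # restart all Kind node containers
kubectl -n kube-system get pods -l k8s-app=kube-dns  # does coredns recover?
```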
From my setup:
and
@azzid There seems to be a regression introduced in k8s v1.28 that involves app containers getting started before init containers - kubernetes/kubernetes#120247. If you are running with cgroup v2 and the Cilium socket-lb config enabled, the bpf_sock programs can end up attached to the wrong cgroup root. As an alternative/workaround, you can mount the cgroup2 fs on the host - see this note in the guide: …
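One way to see where the socket-lb programs actually ended up is to inspect cgroup attachments with bpftool on an affected node (a sketch; requires bpftool and root):

```shell
# List BPF programs attached under the cgroup hierarchy.
bpftool cgroup tree /sys/fs/cgroup
# With socket-lb working, Cilium's connect/sendmsg-type programs should be
# attached at the cgroup root Cilium uses, not under a container-scoped cgroup.
```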
/cc @aojea Which k8s versions will the regression fix be available in? |
My reproducer is probably an entirely different issue than initially reported - same symptoms, but in all likelihood a different problem, maybe even a Kind-specific one. I'm going to retest now using my local k3s home lab and avoid the complication of cluster nodes running as containers sharing a common kernel.
1.28.2, which is supposed to be released next week.
2023-09-13 to be even more specific. |
I'm not entirely sure I understand what workaround you're proposing. My nodes are running cgroup v2:
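A quick way to confirm the cgroup mode is to check the filesystem type mounted at `/sys/fs/cgroup` (a generic check, not Cilium-specific):

```shell
# Prints "cgroup2fs" on a pure cgroup v2 (unified) host; "tmpfs" on hybrid/v1 setups.
stat -fc %T /sys/fs/cgroup
```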
And they all have a …
I can't find any … There is, however, a …
Manually mounting seems like a no-go:
Pointing the configmap to the existing mount point and restarting the pods does not seem to improve the situation:
The change to the configmap seems to have been picked up by the pods:
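For reference, the knobs involved when pointing Cilium at a pre-existing cgroup2 mount are exposed as Helm values. A sketch using names from the Cilium chart - the path is a placeholder and must match an actual cgroup2 mountpoint on the node:

```yaml
# Sketch: disable Cilium's own cgroup auto-mount and reuse the host's mount.
cgroup:
  autoMount:
    enabled: false
  hostRoot: /sys/fs/cgroup   # placeholder; must be a cgroup2 mountpoint on the host
```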
Tried upgrading to …
Upgrading alone does not seem to fix the DNS:
Neither does deleting the old pods:
But a cluster reboot after the upgrade seems to get everything in shape:
For completeness, the bpf stuff post-upgrade:
Thanks for testing it, @azzid. I had the same problem discussed here on a bare-metal cluster of two nodes, and likewise everything stopped after rebooting the control plane. I updated Kubernetes to 1.28.2, then rebooted, and all was back to normal.
So what is the solution here? I'm seeing the same issue on 1.27.6 and my pod CIDR is 192.168.0.0/16. coredns gives me the same output as well. Is the solution just to update kubeadm?
I'm still experiencing the same... RHEL 9 |
We observed this behaviour on Kubernetes 1.28.8 as well. Upgrading to Kubernetes 1.29.x fixed it.
@kvaster I ended up moving to Calico after a couple of days of frustration... 😢
@kvaster Do you mean a wrong cgroup root for the bpf_sock attachment? |
Might be unrelated, but I had the exact same error.
Environment: …
👋 @brb
Is there eventually a resolution for the above issue?
Is there an existing issue for this?
What happened?
As initially reported here, I'm unable to get DNS working because coredns fails to connect to the kubernetes API - I think it might be a regression since upgrading to $latest.
Whole post follows as copy-paste:
Cluster information:
Kubernetes version: v1.28.1
Cloud being used: bare-metal
Installation method: kubeadm
Host OS: Ubuntu 22.04.3 LTS
CNI and version: cilium 1.14.1
CRI and version: containerd 1.6.22
Today, after upgrading to 1.28.1, I realized that my test cluster is unable to get `coredns` ready:
Upon inspecting the logs, there seems to be some connectivity issue between coredns and kube-api:
`cilium connectivity test` seems to run into the same issue:
Accessing the kube-api from outside the cluster works fine - as demonstrated by `kubectl` working. 😉
Cilium status seems OK.
There are no network policies I can find to blame.
There are endpoints which I believe should be implicitly targeted by the service:
I don’t believe I have any funny business in the coredns config:
There is a service running in the container - but it does not seem to hold any data, probably due to not being able to connect to the api:
I can access the api from the pod on the external ip, but not the service ip:
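The two checks can be reproduced from a throwaway pod along these lines. A sketch: `10.96.0.1` and `192.168.0.10:6443` are placeholders for the `kubernetes` service ClusterIP and the external API server address, not the cluster's real values:

```shell
# Hypothetical probes from inside the cluster; replace the IPs with your own.
kubectl run api-check --rm -it --restart=Never --image=curlimages/curl --command -- \
  sh -c 'curl -ksm5 https://10.96.0.1/version;          # service IP: reportedly times out
         curl -ksm5 https://192.168.0.10:6443/version'  # external IP: reportedly works
```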
What am I missing?
Cilium Version
cilium-cli: v0.15.7 compiled with go1.21.0 on linux/amd64
cilium image (default): v1.14.1
cilium image (stable): v1.14.1
cilium image (running): 1.14.1
Kernel Version
6.4.11-200.fc38.x86_64
Kubernetes Version
Client Version: v1.28.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.1
Sysdump
cilium-sysdump-20230903-172137.zip
Relevant log output
Anything else?
https://discuss.kubernetes.io/t/coredns-fails-connecting-to-kube-api-via-kubernetes-service/
Code of Conduct