Connectivity issues in Azure #12113
I think this could be very much related to #11428, but the original report concerned only …
I'm using Hubble UI as a test right now, and I tried to relocate the deployment to … The frontend then logs:

```
{"name":"frontend","hostname":"hubble-ui-7d4fb6fb6c-n4cm7","pid":18,"req_id":"8247f0ed-f585-4c78-9642-2059227c2a03","user":"admin@localhost","level":50,"err":{"message":"Can't fetch namespaces via k8s api: Error: connect ETIMEDOUT 10.0.0.1:443","locations":[{"line":4,"column":7}],"path":["viewer","clusters"],"extensions":{"code":"INTERNAL_SERVER_ERROR"}},"msg":"","time":"2020-06-16T18:10:04.386Z","v":0}
```

So somehow there is an API server connectivity issue as well.
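A hedged sketch of checking API server reachability from inside the affected namespace (the throwaway pod name and `curlimages/curl` image are illustrative, not from the report above): a TLS/HTTP error such as 401/403 would mean the network path works, while a connect timeout reproduces the problem.

```bash
# Sketch: 10.0.0.1:443 is the in-cluster kubernetes service address from the
# error above; run a throwaway pod in kube-system and try to reach it.
kubectl -n kube-system run api-check --image=curlimages/curl --restart=Never --rm -it \
  --command -- curl -sk -m 10 https://10.0.0.1:443/version
```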
I was not able to
So it looks like only pods on the same node are reachable. Here is what …
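A hedged sketch of one way to verify that, assuming the client pod's image ships `ping` (the pod names below are placeholders, not from this report):

```bash
# Sketch: CLIENT_POD, POD_SAME and POD_OTHER are placeholders; POD_SAME runs
# on the same node as CLIENT_POD, POD_OTHER runs on a different node.
kubectl get pods -o wide   # shows each pod's IP and the node it runs on
SAME_IP=$(kubectl get pod POD_SAME -o jsonpath='{.status.podIP}')
OTHER_IP=$(kubectl get pod POD_OTHER -o jsonpath='{.status.podIP}')
kubectl exec CLIENT_POD -- ping -c 3 "$SAME_IP"    # works in this scenario
kubectl exec CLIENT_POD -- ping -c 3 "$OTHER_IP"   # times out per the report
```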
Here is the sysdump:
More general info:
Deployed another cluster to reproduce the issue @errordeveloper was observing. I ran the following command to see each node's routing table (note: the Azure CNI DaemonSet has not been touched):
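A hedged sketch of one way to get that per-node view, assuming a default Cilium install where the agent pods (label `k8s-app=cilium`) run in the host network namespace:

```bash
# Sketch: the Cilium agent runs with hostNetwork, so "ip route" inside it
# shows the node's routing table. Labels/namespace assume a default install.
for pod in $(kubectl -n kube-system get pods -l k8s-app=cilium \
    -o jsonpath='{.items[*].metadata.name}'); do
  node=$(kubectl -n kube-system get pod "$pod" -o jsonpath='{.spec.nodeName}')
  echo "=== $node ==="
  kubectl -n kube-system exec "$pod" -- ip route
done
```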
As you can see, there are 3 nodes in this cluster. Only one of them has the … You can see in C's routing table the … Anyway, my goal was to test communication between all the nodes, and my finding is that pod-to-pod communication between A and B works fine. Any communication between C and (A|B) ends with ICMP timing out (this is also where the unreachable-host error comes from). This is very likely the root cause of the issue. What we still don't know is why node C has …
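A hedged sketch of another way to run that kind of all-nodes connectivity test (not necessarily the method used above): the Cilium agent's own health probes cover node-to-node and endpoint paths, so broken paths show up as errors.

```bash
# Sketch: ask one Cilium agent for its health view; failing node-to-node or
# endpoint probes are listed with the error they hit.
CILIUM_POD=$(kubectl -n kube-system get pods -l k8s-app=cilium \
    -o jsonpath='{.items[0].metadata.name}')
kubectl -n kube-system exec "$CILIUM_POD" -- cilium-health status
```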
Still having some connectivity issues. Had to restart unmanaged pods twice to get
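For reference, a hedged sketch of one way to restart pods not yet managed by Cilium, assuming (as in Cilium's documentation) that the non-hostNetwork pods are the ones that need re-creation:

```bash
# Sketch: delete all pods that are NOT on the host network so they are
# re-created and picked up by the Cilium CNI. Review the list before running.
kubectl get pods --all-namespaces \
  -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,HOSTNETWORK:.spec.hostNetwork \
  --no-headers=true \
  | grep '<none>' \
  | awk '{print "-n "$1" "$2}' \
  | xargs -L 1 -r kubectl delete pod
```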
I've seen similar connectivity issues with both CNI chaining and pure Cilium Azure IPAM. I've traced the root cause back to the default … Note: the versions of … Currently gathering all the findings in these issues and trying to reproduce them, so we can work on a more robust solution. Will link into the overarching issue.
Another thing to add here: we might have to add …
Also ran into weird connectivity behavior when trying out the new AKS container image, … Once I moved the …
Another potential issue I've discovered that leads to unreachable pods is addressed in #14105. By touching … The leftover …
@dctrwatson Thanks for the report! Will create a separate issue about this. The failure is likely unrelated to the kernel version it's running on, but rather due to the leftover state described above. Will troubleshoot this in isolation after some other fixes have gone in, although restarting tunnelfront is required anyway to get visibility into its traffic.
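For completeness, a hedged sketch of restarting tunnelfront on AKS, assuming it is deployed as the usual `tunnelfront` Deployment in `kube-system` (names can differ between AKS versions, so check what actually exists first):

```bash
# Sketch: find the tunnel component, then force a fresh tunnelfront pod.
kubectl -n kube-system get deploy | grep -i tunnel
kubectl -n kube-system rollout restart deployment/tunnelfront
```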
There is something wrong with DNS in Azure, not very clear what it is yet - more details to follow.
One way it manifests itself is that pods deployed in `kube-system`, such as Hubble UI, fail to resolve `$KUBERNETES_SERVICE_HOST`. It turns out that in AKS the value of `KUBERNETES_SERVICE_HOST` gets set to something like `ilya-test--ilya-test-1-da2a1f-9923c925.hcp.westeurope.azmk8s.io` for pods in `kube-system`, and to the more traditional service IP in all other namespaces.

Quite crucially, it appears that quite a few of the connectivity test pods are not reaching ready state at all:
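A hedged way to compare what `KUBERNETES_SERVICE_HOST` is injected as in `kube-system` versus another namespace (the throwaway pod name and `busybox` image are illustrative, not from the original report):

```bash
# Sketch: run a short-lived pod in each namespace and print / resolve the
# injected KUBERNETES_SERVICE_HOST to see how it differs per namespace.
for ns in kube-system default; do
  echo "=== $ns ==="
  kubectl -n "$ns" run env-check --image=busybox --restart=Never --rm -it -- \
    sh -c 'echo "$KUBERNETES_SERVICE_HOST"; nslookup "$KUBERNETES_SERVICE_HOST"'
done
```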