Connectivity issues in Azure #12113

Closed
errordeveloper opened this issue Jun 16, 2020 · 12 comments · Fixed by #14452
@errordeveloper (Contributor)

There is something wrong with DNS in Azure; it's not yet clear what exactly - more details to follow.

One way it manifests itself is that pods deployed in kube-system, such as Hubble UI, fail to resolve $KUBERNETES_SERVICE_HOST. It turns out that in AKS the value of KUBERNETES_SERVICE_HOST is set to something like ilya-test--ilya-test-1-da2a1f-9923c925.hcp.westeurope.azmk8s.io for pods in kube-system, and to the more traditional service IP in all other namespaces.
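
For reference, the difference can be checked directly with something along these lines (a rough sketch; the pod name and busybox image are arbitrary choices):

$ kubectl run env-check --rm -i --restart=Never --image=busybox -n kube-system -- env | grep KUBERNETES_SERVICE_HOST
$ kubectl run env-check --rm -i --restart=Never --image=busybox -n default -- env | grep KUBERNETES_SERVICE_HOST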

Quite crucially, it appears that quite a few of the connectivity test pods are not reaching the ready state at all:

echo-a-558b9b6dc4-pmsh8                                  1/1     Running            0          5h42m
echo-b-59d5ff8b98-r4hx8                                  1/1     Running            0          5h42m
echo-b-host-f4bd98474-rbpgz                              1/1     Running            0          5h42m
host-to-b-multi-node-clusterip-7bb8b4f964-qgsf6          1/1     Running            52         5h42m
host-to-b-multi-node-headless-5c5676647b-56xbt           1/1     Running            50         5h42m
pod-to-a-646cccc5df-t8blr                                1/1     Running            101        5h42m
pod-to-a-allowed-cnp-56f4cfd999-fnppn                    0/1     CrashLoopBackOff   99         5h42m
pod-to-a-external-1111-7c5c99c6d9-gnmfk                  1/1     Running            0          5h42m
pod-to-a-l3-denied-cnp-5dc8d69b7f-q4nvb                  1/1     Running            0          5h42m
pod-to-b-intra-node-b9454c7c6-sc9lq                      0/1     CrashLoopBackOff   99         5h42m
pod-to-b-intra-node-nodeport-6cc56666dc-tmqt9            0/1     CrashLoopBackOff   100        5h42m
pod-to-b-multi-node-clusterip-754d5ff9d-9gzwg            0/1     CrashLoopBackOff   99         5h42m
pod-to-b-multi-node-headless-7876749b84-sz4zz            1/1     Running            46         5h42m
pod-to-b-multi-node-nodeport-6d8fc65c99-ld8hv            0/1     CrashLoopBackOff   99         5h42m
pod-to-external-fqdn-allow-google-cnp-6478db9cd9-d74xk   0/1     CrashLoopBackOff   99         5h42m
errordeveloper added the area/azure (Impacts Azure based IPAM) label on Jun 16, 2020
@errordeveloper (Contributor, Author)

I think this could be very much related to #11428, but the original report concerned only pod-to-external-fqdn-allow-google-cnp, so I think there is more going on.

@errordeveloper (Contributor, Author)

I'm using Hubble UI as a test right now, and I tried relocating the deployment to the default namespace to rule out the DNS issue. Now I'm seeing this in the logs:

{"name":"frontend","hostname":"hubble-ui-7d4fb6fb6c-n4cm7","pid":18,"req_id":"8247f0ed-f585-4c78-9642-2059227c2a03","user":"admin@localhost","level":50,"err":{"message":"Can't fetch namespaces via k8s api: Error: connect ETIMEDOUT 10.0.0.1:443","locations":[{"line":4,"column":7}],"path":["viewer","clusters"],"extensions":{"code":"INTERNAL_SERVER_ERROR"}},"msg":"","time":"2020-06-16T18:10:04.386Z","v":0}

So there is somehow an API server connectivity issue as well.

@errordeveloper (Contributor, Author) commented Jun 16, 2020

I was not able to kubectl run a container image and install all the tools needed, as package repo connectivity was broken, so I created an image of my own for this.

$ kubectl run -ti --image=errordeveloper/alpine-net-debug test-1 -n kube-system -- sh -l
If you don't see a command prompt, try pressing enter.
test-1:/# ping -c2 1.1.1.1
PING 1.1.1.1 (1.1.1.1) 56(84) bytes of data.
From 172.17.0.1 icmp_seq=1 Destination Host Unreachable
From 172.17.0.1 icmp_seq=2 Destination Host Unreachable

--- 1.1.1.1 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1030ms
pipe 2
test-1:/# ping -c2 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
From 172.17.0.1 icmp_seq=1 Destination Host Unreachable
From 172.17.0.1 icmp_seq=2 Destination Host Unreachable

--- 8.8.8.8 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1002ms

test-1:/# cat /etc/resolv.conf 
nameserver 10.0.0.10
search kube-system.svc.cluster.local svc.cluster.local cluster.local xdbzucfb4y2ubdbavguxct3msh.ax.internal.cloudapp.net
options ndots:5
test-1:/# dig google.com

test-1:/# dig +short google.com
test-1:/# dig +short google.com
;; connection timed out; no servers could be reached
test-1:/# dig +short azure.microsoft.com
;; connection timed out; no servers could be reached
test-1:/# echo $KUBERNETES_SERVICE_HOST
ilya-test--ilya-test-1-da2a1f-9923c925.hcp.westeurope.azmk8s.io
test-1:/# dig +short $KUBERNETES_SERVICE_HOST
test-1:/# dig +short $KUBERNETES_SERVICE_HOST
test-1:/# dig +short kubernetes.default
test-1:/# curl https://10.0.0.1 # Kubernetes service IP
curl: (28) Failed to connect to 10.0.0.1 port 443: Operation timed out
test-1:/# ping -c2 10.240.0.64 # CoreDNS pod IP on another node
PING 10.240.0.64 (10.240.0.64) 56(84) bytes of data.
From 172.17.0.1 icmp_seq=1 Destination Host Unreachable
From 172.17.0.1 icmp_seq=2 Destination Host Unreachable

--- 10.240.0.64 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1013ms
pipe 2
test-1:/# ping -c2 10.240.0.76  # CoreDNS pod IP on the same node as test pod
PING 10.240.0.76 (10.240.0.76) 56(84) bytes of data.
64 bytes from 10.240.0.76: icmp_seq=1 ttl=63 time=0.244 ms
64 bytes from 10.240.0.76: icmp_seq=2 ttl=63 time=0.071 ms

--- 10.240.0.76 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 0.071/0.157/0.244/0.086 ms
test-1:/# ping -c2 10.240.0.4 # private node IP of a remote node
PING 10.240.0.4 (10.240.0.4) 56(84) bytes of data.
From 172.17.0.1 icmp_seq=1 Destination Host Unreachable
From 172.17.0.1 icmp_seq=2 Destination Host Unreachable

--- 10.240.0.4 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1001ms

test-1:/# ping -c2 10.240.0.35 # private node IP of a remote node
PING 10.240.0.35 (10.240.0.35) 56(84) bytes of data.
From 172.17.0.1 icmp_seq=1 Destination Host Unreachable
From 172.17.0.1 icmp_seq=2 Destination Host Unreachable

--- 10.240.0.35 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1012ms
pipe 2
test-1:/# ping -c2 10.240.0.66 # private node IP of the local node
PING 10.240.0.66 (10.240.0.66) 56(84) bytes of data.
64 bytes from 10.240.0.66: icmp_seq=1 ttl=64 time=0.120 ms
64 bytes from 10.240.0.66: icmp_seq=2 ttl=64 time=0.080 ms

--- 10.240.0.66 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1012ms
rtt min/avg/max/mdev = 0.080/0.100/0.120/0.020 ms
test-1:/# 

So it looks like only pods on the same node are reachable.

Here is what cilium-health reports:

$ kubectl exec -ti -n kube-system cilium-q99b5 -- cilium-health status --probe
Probe time:   2020-06-16T18:35:54Z
Nodes:
  aks-nodepool1-24840118-vmss000002 (localhost):
    Host connectivity to 10.240.0.66:
      ICMP to stack:   OK, RTT=1.115504ms
      HTTP to agent:   OK, RTT=183.4µs
    Endpoint connectivity to 10.240.0.95:
      ICMP to stack:   OK, RTT=313.001µs
      HTTP to agent:   OK, RTT=285.101µs
  aks-nodepool1-24840118-vmss000000:
    Host connectivity to 10.240.0.4:
      ICMP to stack:   OK, RTT=1.160904ms
      HTTP to agent:   OK, RTT=970.703µs
    Endpoint connectivity to 10.240.0.15:
      ICMP to stack:   OK, RTT=1.543006ms
      HTTP to agent:   OK, RTT=865.104µs
  aks-nodepool1-24840118-vmss000001:
    Host connectivity to 10.240.0.35:
      ICMP to stack:   OK, RTT=1.410105ms
      HTTP to agent:   OK, RTT=705.902µs
    Endpoint connectivity to 10.240.0.38:
      ICMP to stack:   OK, RTT=1.494305ms
      HTTP to agent:   OK, RTT=1.069204ms
$ kubectl exec -ti -n kube-system cilium-8q27l -- cilium-health status --probe
Probe time:   2020-06-16T18:36:25Z
Nodes:
  aks-nodepool1-24840118-vmss000000 (localhost):
    Host connectivity to 10.240.0.4:
      ICMP to stack:   OK, RTT=639.104µs
      HTTP to agent:   OK, RTT=496.804µs
    Endpoint connectivity to 10.240.0.15:
      ICMP to stack:   OK, RTT=592.404µs
      HTTP to agent:   OK, RTT=609.505µs
  aks-nodepool1-24840118-vmss000001:
    Host connectivity to 10.240.0.35:
      ICMP to stack:   OK, RTT=1.794313ms
      HTTP to agent:   OK, RTT=853.406µs
    Endpoint connectivity to 10.240.0.38:
      ICMP to stack:   OK, RTT=1.823113ms
      HTTP to agent:   OK, RTT=1.077109ms
  aks-nodepool1-24840118-vmss000002:
    Host connectivity to 10.240.0.66:
      ICMP to stack:   OK, RTT=1.721013ms
      HTTP to agent:   OK, RTT=646.805µs
    Endpoint connectivity to 10.240.0.95:
      ICMP to stack:   OK, RTT=1.953914ms
      HTTP to agent:   OK, RTT=904.207µs
$ kubectl exec -ti -n kube-system cilium-ttb5g -- cilium-health status --probe
Probe time:   2020-06-16T18:36:37Z
Nodes:
  aks-nodepool1-24840118-vmss000001 (localhost):
    Host connectivity to 10.240.0.35:
      ICMP to stack:   OK, RTT=254.501µs
      HTTP to agent:   OK, RTT=245.801µs
    Endpoint connectivity to 10.240.0.38:
      ICMP to stack:   OK, RTT=283.601µs
      HTTP to agent:   OK, RTT=270.401µs
  aks-nodepool1-24840118-vmss000000:
    Host connectivity to 10.240.0.4:
      ICMP to stack:   OK, RTT=1.142304ms
      HTTP to agent:   OK, RTT=1.010304ms
    Endpoint connectivity to 10.240.0.15:
      ICMP to stack:   OK, RTT=1.173504ms
      HTTP to agent:   OK, RTT=1.225004ms
  aks-nodepool1-24840118-vmss000002:
    Host connectivity to 10.240.0.66:
      ICMP to stack:   OK, RTT=1.097704ms
      HTTP to agent:   OK, RTT=610.002µs
    Endpoint connectivity to 10.240.0.95:
      ICMP to stack:   OK, RTT=1.152604ms
      HTTP to agent:   OK, RTT=977.904µs
$

@errordeveloper (Contributor, Author) commented Jun 16, 2020

Here is the sysdump: cilium-sysdump-20200616-194701.zip

More general info:

@christarazi (Member) commented Jun 17, 2020

Deployed another cluster to reproduce the issue @errordeveloper was observing. I ran the following command to see each node's routing table:

(Note the Azure CNI DS has not been touched).

❯ ./contrib/k8s/k8s-cilium-exec.sh bash -c "ip r show table main && hostname && echo"
default via 10.240.0.1 dev eth0
10.240.0.0/16 dev eth0 proto kernel scope link src 10.240.0.35
10.240.0.37 dev lxcdf3c1a24b218 scope link
10.240.0.45 dev lxcf8fa2a919766 scope link
10.240.0.55 dev lxc87bf1643715f scope link
10.240.0.56 dev lxc_health scope link
10.240.0.62 dev lxce6ef6a5e72b3 scope link
10.240.0.65 dev lxc6a0641eb0d0d scope link
168.63.129.16 via 10.240.0.1 dev eth0
169.254.169.254 via 10.240.0.1 dev eth0
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
aks-nodepool1-26249792-vmss000001

default via 10.240.0.1 dev eth0
10.240.0.0/16 dev eth0 proto kernel scope link src 10.240.0.4
10.240.0.6 dev lxc_health scope link
10.240.0.8 dev lxc97df74ed6c39 scope link
10.240.0.10 dev lxc077cf2fcd176 scope link
10.240.0.11 dev lxca8b7289fe1f6 scope link
10.240.0.22 dev lxc1e0b3888a2ed scope link
10.240.0.33 dev lxc458594a67dd7 scope link
168.63.129.16 via 10.240.0.1 dev eth0
169.254.169.254 via 10.240.0.1 dev eth0
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
aks-nodepool1-26249792-vmss000000

Unable to use a TTY - input is not a terminal or the right kind of file
default via 10.240.0.1 dev azure0
10.240.0.0/16 dev azure0 proto kernel scope link src 10.240.0.66
10.240.0.71 dev lxc165937946b54 scope link
10.240.0.73 dev lxc9ac55f96650f scope link
10.240.0.75 dev lxc3482a9784d06 scope link
10.240.0.81 dev lxc_health scope link
10.240.0.85 dev lxcc2aca6fe1928 scope link
10.240.0.88 dev lxc650b49ced0b8 scope link
10.240.0.92 dev lxcb5057f5a759e scope link
10.240.0.96 dev lxcc31671d0163a scope link
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
aks-nodepool1-26249792-vmss000002

As you can see, there are 3 nodes in this cluster. Only one of them has the azure0 device: the ...000002 node, let's call it C. The other nodes are A and B.

You can see in C's routing table that the azure0 device is used for the default route as well as for routing 10.240.0.0/16. What's interesting is that on nodes A and B, eth0 handles that CIDR and is also the device on the default route. So something is definitely borked with node C.

Anyway, my goal was to test communication between all the nodes. My finding is that pod-to-pod communication between A and B works fine, while any communication between C and A or B ends up with ICMP timing out (this is also where the unreachable-host errors come from). This is very likely the root cause of the issue. What we still don't know is why node C has azure0 and the others don't. Hopefully that's helpful. A quick way to check which device each node uses is sketched below.
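
A rough sketch for checking this on all nodes at once, reusing the same helper script (not verified against this exact cluster):

$ ./contrib/k8s/k8s-cilium-exec.sh bash -c "hostname; ip route show default; ip -o link show azure0 2>/dev/null || echo 'no azure0'"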

@errordeveloper (Contributor, Author)

The docs weren't marked as beta; I added that in #12108. I've spoken to @tgraf, and since this is a beta feature, it doesn't have to be a release blocker.

@jrajahalme (Member)

Still having some connectivity issues. I had to restart unmanaged pods twice to get kubectl exec to the cilium pods working. That seemed to fix the pod-to-external-fqdn-allow-google-cnp failure (either no connectivity to DNS, or a DNS failure). The remaining issues are with pod-to-a-allowed-cnp (verified no policy drops; the SYN/ACK never gets back to the source pod) and pod-to-b-multi-node-headless (no policy enforcement; the SYN/ACK never gets back to the source pod). A sketch of how such a drop check can be run is included after the pod listing below.

$ kubectl get pods --all-namespaces -o wide
NAMESPACE     NAME                                                     READY   STATUS             RESTARTS   AGE    IP            NODE                                NOMINATED NODE   READINESS GATES
cilium-test   echo-a-76c5d9bd76-mdhlw                                  1/1     Running            0          28m    10.240.0.62   aks-nodepool1-22410938-vmss000001   <none>           <none>
cilium-test   echo-b-795c4b4f76-vd9hv                                  1/1     Running            0          39m    10.240.0.8    aks-nodepool1-22410938-vmss000000   <none>           <none>
cilium-test   echo-b-host-6b7fc94b7c-8hwd5                             1/1     Running            0          65m    10.240.0.4    aks-nodepool1-22410938-vmss000000   <none>           <none>
cilium-test   host-to-b-multi-node-clusterip-85476cd779-k5r4l          1/1     Running            12         65m    10.240.0.35   aks-nodepool1-22410938-vmss000001   <none>           <none>
cilium-test   host-to-b-multi-node-headless-dc6c44cb5-wzsgb            1/1     Running            12         65m    10.240.0.35   aks-nodepool1-22410938-vmss000001   <none>           <none>
cilium-test   pod-to-a-79546bc469-gxdpd                                1/1     Running            0          39m    10.240.0.26   aks-nodepool1-22410938-vmss000000   <none>           <none>
cilium-test   pod-to-a-allowed-cnp-58b7f7fb8f-dlbdh                    0/1     CrashLoopBackOff   13         38m    10.240.0.17   aks-nodepool1-22410938-vmss000000   <none>           <none>
cilium-test   pod-to-a-denied-cnp-6967cb6f7f-4g4fq                     1/1     Running            0          37m    10.240.0.23   aks-nodepool1-22410938-vmss000000   <none>           <none>
cilium-test   pod-to-b-intra-node-nodeport-9b487cf89-mh5rx             1/1     Running            0          37m    10.240.0.11   aks-nodepool1-22410938-vmss000000   <none>           <none>
cilium-test   pod-to-b-multi-node-clusterip-7db5dfdcf7-4q2zc           1/1     Running            0          36m    10.240.0.55   aks-nodepool1-22410938-vmss000001   <none>           <none>
cilium-test   pod-to-b-multi-node-headless-7d44b85d69-z5ldz            0/1     CrashLoopBackOff   13         36m    10.240.0.47   aks-nodepool1-22410938-vmss000001   <none>           <none>
cilium-test   pod-to-b-multi-node-nodeport-7ffc76db7c-vqjcx            1/1     Running            1          36m    10.240.0.52   aks-nodepool1-22410938-vmss000001   <none>           <none>
cilium-test   pod-to-external-1111-d56f47579-hzkq7                     1/1     Running            0          36m    10.240.0.7    aks-nodepool1-22410938-vmss000000   <none>           <none>
cilium-test   pod-to-external-fqdn-allow-google-cnp-78986f4bcf-jc9xc   1/1     Running            0          35m    10.240.0.27   aks-nodepool1-22410938-vmss000000   <none>           <none>
kube-system   azure-cni-networkmonitor-jxmgn                           1/1     Running            0          164m   10.240.0.35   aks-nodepool1-22410938-vmss000001   <none>           <none>
kube-system   azure-cni-networkmonitor-tms59                           1/1     Running            0          165m   10.240.0.4    aks-nodepool1-22410938-vmss000000   <none>           <none>
kube-system   azure-ip-masq-agent-2kzj8                                1/1     Running            0          165m   10.240.0.4    aks-nodepool1-22410938-vmss000000   <none>           <none>
kube-system   azure-ip-masq-agent-9jtfq                                1/1     Running            0          164m   10.240.0.35   aks-nodepool1-22410938-vmss000001   <none>           <none>
kube-system   cilium-node-init-6sds9                                   1/1     Running            0          72m    10.240.0.35   aks-nodepool1-22410938-vmss000001   <none>           <none>
kube-system   cilium-node-init-wstkb                                   1/1     Running            0          72m    10.240.0.4    aks-nodepool1-22410938-vmss000000   <none>           <none>
kube-system   cilium-operator-6655dcd688-c5qbm                         1/1     Running            0          72m    10.240.0.4    aks-nodepool1-22410938-vmss000000   <none>           <none>
kube-system   cilium-operator-6655dcd688-k5ws2                         1/1     Running            0          72m    10.240.0.35   aks-nodepool1-22410938-vmss000001   <none>           <none>
kube-system   cilium-wjp44                                             1/1     Running            0          72m    10.240.0.35   aks-nodepool1-22410938-vmss000001   <none>           <none>
kube-system   cilium-wspxs                                             1/1     Running            0          72m    10.240.0.4    aks-nodepool1-22410938-vmss000000   <none>           <none>
kube-system   coredns-869cb84759-l4nf2                                 1/1     Running            0          35m    10.240.0.34   aks-nodepool1-22410938-vmss000000   <none>           <none>
kube-system   coredns-869cb84759-nt5s4                                 1/1     Running            0          35m    10.240.0.51   aks-nodepool1-22410938-vmss000001   <none>           <none>
kube-system   coredns-autoscaler-5b867494f-kkjfm                       1/1     Running            0          35m    10.240.0.63   aks-nodepool1-22410938-vmss000001   <none>           <none>
kube-system   dashboard-metrics-scraper-5ddb5bf5c8-f6hbl               1/1     Running            0          35m    10.240.0.43   aks-nodepool1-22410938-vmss000001   <none>           <none>
kube-system   kube-proxy-p2txn                                         1/1     Running            0          164m   10.240.0.35   aks-nodepool1-22410938-vmss000001   <none>           <none>
kube-system   kube-proxy-rcbtt                                         1/1     Running            0          165m   10.240.0.4    aks-nodepool1-22410938-vmss000000   <none>           <none>
kube-system   kubernetes-dashboard-5596bdb9f-g2bj9                     1/1     Running            0          35m    10.240.0.44   aks-nodepool1-22410938-vmss000001   <none>           <none>
kube-system   metrics-server-5f4c878d8-8rxt2                           1/1     Running            0          35m    10.240.0.42   aks-nodepool1-22410938-vmss000001   <none>           <none>
kube-system   tunnelfront-787b4b7fc-gz7kv                              1/1     Running            0          35m    10.240.0.38   aks-nodepool1-22410938-vmss000001   <none>           <none>
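
A rough sketch of how the drop check mentioned above can be run, using one of the cilium agent pods from the listing (which agent pod to pick depends on where the failing client pod is scheduled):

$ kubectl -n kube-system exec -ti cilium-wspxs -- cilium monitor --type drop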

@ti-mo (Contributor) commented Nov 18, 2020

I've seen similar connectivity issues with both CNI chaining and pure Cilium Azure IPAM. I've traced the root cause back to the default azure-vnet CNI plugin installing ebtables rules in the host netns. When Azure IPAM is enabled, azure-vnet is taken out of the active CNI chain completely, so CNI DEL events are no longer handled by azure-vnet, and these ebtables rules are not cleaned up when pods are removed. In my case, I also had some dangling routes for the affected addresses pointing to azure0.

Note: the versions of ebtables, ebtables-legacy and/or ebtables-nft (as well as their -save commands) that we ship with Cilium are incompatible with the current AKS kernel (4.15). You might need to SSH into the host and run ebtables-save there, or the nat and broute tables won't show up. Alternatively, ebtables-legacy -L -t nat (and -t broute) could work, but make sure it's the ioctl version, not the one that uses netlink.
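
For anyone inspecting this on a node, a rough sketch (run on the AKS host itself, e.g. over SSH, assuming the ioctl-based legacy tool is present there):

$ ebtables-legacy -t nat -L
$ ebtables-legacy -t broute -L
$ ip route show dev azure0    # look for dangling per-pod routes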

I'm currently gathering all the findings in these issues and trying to reproduce them, so we can work on a more watertight solution. I will link this into the overarching issue.

@ti-mo (Contributor) commented Nov 19, 2020

Another thing to add here: we might have to add tunnelfront to the list of Pods to recreate post-install, since all apiserver -> kubelet communication seems to pass through this service. I've had trouble running kubectl exec against Pods scheduled on nodes other than the one tunnelfront was scheduled on. The apiserver could only initiate new connections to the node where tunnelfront was running.
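
Assuming tunnelfront runs as a regular Deployment in kube-system (as the pod listings above suggest), recreating it could look something like:

$ kubectl -n kube-system rollout restart deployment tunnelfront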

@dctrwatson (Contributor)

Also ran into weird connectivity behavior when trying out the new AKS node image, AKSUbuntu-1804-2020.10.28, which is the new default in 1.18. The kernel it uses is 5.4.0-1031-azure, so most (all?) eBPF features are enabled by default.

Once I moved the tunnelfront pod back to a node running AKSUbuntu-1604-2020.09.23 / 4.15.0-1096-azure, kubectl exec/logs worked normally again too.

@ti-mo (Contributor) commented Nov 20, 2020

Another potential issue I've discovered that leads to unreachable Pods is addressed in #14105. By touching /var/run/azure-vnet.json, we trigger a behavioural change in azure-vnet that leads to static ARP entries not being removed on pod delete.
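
On an affected node, such leftover static entries should show up as permanent neighbours; a quick check (assuming they sit on azure0) might look like:

$ ip neigh show dev azure0 nud permanent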

The issue of leftover ebtables rules not being cleaned up is currently isolated to running with Azure IPAM and will be addressed separately.

@ti-mo (Contributor) commented Nov 20, 2020

Once I moved the tunnelfront pod back to a node running AKSUbuntu-1604-2020.09.23 / 4.15.0-1096-azure, kubectl exec/logs worked normally again too.

@dctrwatson Thanks for the report! I will create a separate issue about this. The failure is likely unrelated to the kernel version it's running on and is more likely due to the leftover state described above. I will troubleshoot this in isolation after some other fixes have gone in, although restarting tunnelfront is required anyway to get visibility into its traffic.
