Calico deletes routes when using CNI v1.7.0 #1159

Closed
kgtw opened this issue Aug 22, 2020 · 15 comments
Labels: calico (Calico integration issue), priority/P0 (Highest priority. Someone needs to actively work on this.)

Comments

@kgtw (Contributor) commented Aug 22, 2020

We recently applied the 1.7.0 manifests in one of our clusters and are seeing that new pods get host-side interfaces created with no IP address (no route) associated with them.

$ kubectl get pod -o wide | grep debug
shell-node-debug     1/1     Running   0     6s     172.30.164.149   ip-172-30-165-219.eu-central-1.compute.internal   <none>   <none>
$ tail /var/log/aws-routed-eni/plugin.log
{"level":"info","ts":"2020-08-22T11:41:26.166Z","caller":"routed-eni-cni-plugin/cni.go:117","msg":"Received CNI add request: ContainerID(7864f0bbebd4eb32871d4a42414e7e216f4614203d59c3ec1d7fbbe532975064) Netns(/proc/17287/ns/net) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=kube-system;K8S_POD_NAME=shell-node-debug;K8S_POD_INFRA_CONTAINER_ID=7864f0bbebd4eb32871d4a42414e7e216f4614203d59c3ec1d7fbbe532975064) Path(/opt/cni/bin/) argsStdinData({\"cniVersion\":\"0.3.1\",\"mtu\":\"9001\",\"name\":\"aws-cni\",\"pluginLogFile\":\"/var/log/aws-routed-eni/plugin.log\",\"pluginLogLevel\":\"Debug\",\"type\":\"aws-cni\",\"vethPrefix\":\"eni\"})"}
{"level":"debug","ts":"2020-08-22T11:41:26.166Z","caller":"routed-eni-cni-plugin/cni.go:117","msg":"MTU value set is 9001:"}
{"level":"info","ts":"2020-08-22T11:41:26.168Z","caller":"routed-eni-cni-plugin/cni.go:117","msg":"Received add network response for container 7864f0bbebd4eb32871d4a42414e7e216f4614203d59c3ec1d7fbbe532975064 interface eth0: Success:true IPv4Addr:\"172.30.164.149\" UseExternalSNAT:true VPCcidrs:\"172.30.160.0/20\" "}
{"level":"debug","ts":"2020-08-22T11:41:26.168Z","caller":"routed-eni-cni-plugin/cni.go:194","msg":"SetupNS: hostVethName=eni7c64954e155, contVethName=eth0, netnsPath=/proc/17287/ns/net, table=0, mtu=9001"}
{"level":"debug","ts":"2020-08-22T11:41:26.194Z","caller":"driver/driver.go:184","msg":"setupVeth network: disabled IPv6 RA and ICMP redirects on eni7c64954e155"}
{"level":"debug","ts":"2020-08-22T11:41:26.194Z","caller":"driver/driver.go:178","msg":"Setup host route outgoing hostVeth, LinkIndex 12"}
{"level":"debug","ts":"2020-08-22T11:41:26.194Z","caller":"driver/driver.go:178","msg":"Successfully set host route to be 172.30.164.149/0"}
{"level":"info","ts":"2020-08-22T11:41:26.194Z","caller":"driver/driver.go:178","msg":"Added toContainer rule for 172.30.164.149/32"}

Logs from the AWS CNI indicate it created the new interface eni7c64954e155 for IP address 172.30.164.149/32; however, inspecting the routing table and the interface shows that it has not been set up correctly.

$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         172.30.164.1    0.0.0.0         UG    100    0        0 ens5
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
172.30.164.0    0.0.0.0         255.255.252.0   U     0      0        0 ens5
172.30.164.1    0.0.0.0         255.255.255.255 UH    100    0        0 ens5
172.30.166.133  0.0.0.0         255.255.255.255 UH    0      0        0 enic9a8a93f307
172.30.166.147  0.0.0.0         255.255.255.255 UH    0      0        0 enic91275e76d3
172.30.167.107  0.0.0.0         255.255.255.255 UH    0      0        0 eniae804caef6a
172.30.167.144  0.0.0.0         255.255.255.255 UH    0      0        0 eni9ae85d9b9d4
$ ifconfig eni7c64954e155
eni7c64954e155: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9001
        ether 6a:e9:93:af:0b:2e  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 19  bytes 5187 (5.1 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
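
For comparison, a healthy pod on these nodes shows up on the host as a /32 route via its eni* veth plus a priority-512 "to container" ip rule (visible in the route -n and ip rule output elsewhere in this issue). A rough sketch of the equivalent commands for the pod above, with the route form and rule priority assumed from that output rather than taken from the plugin source:

# expected host-side plumbing for 172.30.164.149 (sketch, assumptions noted above)
$ ip route add 172.30.164.149/32 dev eni7c64954e155 scope link   # per-pod host route (missing from route -n above)
$ ip rule add to 172.30.164.149/32 lookup main priority 512      # "toContainer" rule (reported added in plugin.log)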

AWS CNI Config:

            - name: AWS_VPC_K8S_CNI_LOGLEVEL
              value: DEBUG
            - name: AWS_VPC_K8S_CNI_LOG_FILE
              value: stdout
            - name: AWS_VPC_K8S_CNI_EXTERNALSNAT
              value: "true"
            - name: AWS_VPC_CNI_NODE_PORT_SUPPORT
              value: "true"
            - name: AWS_VPC_ENI_MTU
              value: "9001"
            - name: AWS_VPC_K8S_CNI_CONFIGURE_RPFILTER
              value: "true"
            - name: AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG
              value: "false"
            - name: MY_NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: WARM_IP_TARGET
              value: "10"

Kubernetes: 1.17.9
AWS CNI: 1.7.0

@SaranBalaji90 (Contributor) commented Aug 22, 2020

Thank you for reporting the issue @kgtw. Is it possible to run the CNI log collection tool and send us the files?

I'm wondering whether there was a delete request from kubelet for this pod, or whether two pods (running and terminating?) got the same IP and deleting one of them caused the IP to be cleaned up. Logs from ipamd and the plugin will help us dig deeper here.

@kgtw (Contributor, Author) commented Aug 22, 2020

Hi @SaranBalaji90, I've re-applied the manifests and manually deployed a test pod shell-node-debug (which hadn't previously existed) onto a newly created node; these are the relevant logs for that action.

Unfortunately we cannot provide the entire debug output from that script due to company policies, but here are the relevant logs for the time period when the debug pod was created.

plugin.log

{"level":"info","ts":"2020-08-22T16:47:47.359Z","caller":"routed-eni-cni-plugin/cni.go:117","msg":"Received CNI add request: ContainerID(a0366eae458ff21b8a0efe8190f673744fe83e65e524d5fa8d6c93b2881239f0) Netns(/proc/23799/ns/net) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=kube-system;K8S_POD_NAME=shell-node-debug;K8S_POD_INFRA_CONTAINER_ID=a0366eae458ff21b8a0efe8190f673744fe83e65e524d5fa8d6c93b2881239f0) Path(/opt/cni/bin/) argsStdinData({\"cniVersion\":\"0.3.1\",\"mtu\":\"9001\",\"name\":\"aws-cni\",\"pluginLogFile\":\"/var/log/aws-routed-eni/plugin.log\",\"pluginLogLevel\":\"Debug\",\"type\":\"aws-cni\",\"vethPrefix\":\"eni\"})"}
{"level":"debug","ts":"2020-08-22T16:47:47.359Z","caller":"routed-eni-cni-plugin/cni.go:117","msg":"MTU value set is 9001:"}
{"level":"info","ts":"2020-08-22T16:47:47.361Z","caller":"routed-eni-cni-plugin/cni.go:117","msg":"Received add network response for container a0366eae458ff21b8a0efe8190f673744fe83e65e524d5fa8d6c93b2881239f0 interface eth0: Success:true IPv4Addr:\"172.30.167.51\" UseExternalSNAT:true VPCcidrs:\"172.30.160.0/20\" "}
{"level":"debug","ts":"2020-08-22T16:47:47.361Z","caller":"routed-eni-cni-plugin/cni.go:194","msg":"SetupNS: hostVethName=eni51add10f095, contVethName=eth0, netnsPath=/proc/23799/ns/net, table=0, mtu=9001"}
{"level":"debug","ts":"2020-08-22T16:47:47.417Z","caller":"driver/driver.go:184","msg":"setupVeth network: disabled IPv6 RA and ICMP redirects on eni51add10f095"}
{"level":"debug","ts":"2020-08-22T16:47:47.417Z","caller":"driver/driver.go:178","msg":"Setup host route outgoing hostVeth, LinkIndex 12"}
{"level":"debug","ts":"2020-08-22T16:47:47.417Z","caller":"driver/driver.go:178","msg":"Successfully set host route to be 172.30.167.51/0"}
{"level":"info","ts":"2020-08-22T16:47:47.417Z","caller":"driver/driver.go:178","msg":"Added toContainer rule for 172.30.167.51/32"}

ipamd.log

{"level":"info","ts":"2020-08-22T16:47:47.360Z","caller":"rpc/rpc.pb.go:486","msg":"Received AddNetwork for NS /proc/23799/ns/net, Sandbox a0366eae458ff21b8a0efe8190f673744fe83e65e524d5fa8d6c93b2881239f0, ifname eth0"}
{"level":"debug","ts":"2020-08-22T16:47:47.360Z","caller":"rpc/rpc.pb.go:486","msg":"AddNetworkRequest: ClientVersion:\"v1.7.0\" K8S_POD_NAME:\"shell-node-debug\" K8S_POD_NAMESPACE:\"kube-system\" K8S_POD_INFRA_CONTAINER_ID:\"a0366eae458ff21b8a0efe8190f673744fe83e65e524d5fa8d6c93b2881239f0\" ContainerID:\"a0366eae458ff21b8a0efe8190f673744fe83e65e524d5fa8d6c93b2881239f0\" IfName:\"eth0\" NetworkName:\"aws-cni\" Netns:\"/proc/23799/ns/net\" "}
{"level":"debug","ts":"2020-08-22T16:47:47.360Z","caller":"ipamd/rpc_handler.go:142","msg":"AssignIPv4Address: IP address pool stats: total: 15, assigned 5"}
{"level":"info","ts":"2020-08-22T16:47:47.360Z","caller":"datastore/data_store.go:499","msg":"AssignPodIPv4Address: Assign IP 172.30.167.51 to sandbox aws-cni/a0366eae458ff21b8a0efe8190f673744fe83e65e524d5fa8d6c93b2881239f0/eth0"}
{"level":"debug","ts":"2020-08-22T16:47:47.360Z","caller":"rpc/rpc.pb.go:486","msg":"VPC CIDR 172.30.160.0/20"}
{"level":"info","ts":"2020-08-22T16:47:47.361Z","caller":"rpc/rpc.pb.go:486","msg":"Send AddNetworkReply: IPv4Addr 172.30.167.51, DeviceNumber: 0, err: <nil>"}
{"level":"debug","ts":"2020-08-22T16:47:48.929Z","caller":"ipamd/ipamd.go:889","msg":"Current warm IP stats: target: 10, total: 15, assigned: 6, available: 9, short: 1, over 0"}
{"level":"debug","ts":"2020-08-22T16:47:48.929Z","caller":"ipamd/ipamd.go:510","msg":"Starting to increase IP pool size"}
{"level":"debug","ts":"2020-08-22T16:47:48.929Z","caller":"ipamd/ipamd.go:628","msg":"Current warm IP stats: target: 10, total: 15, assigned: 6, available: 9, short: 1, over 0"}
{"level":"debug","ts":"2020-08-22T16:47:48.929Z","caller":"ipamd/ipamd.go:729","msg":"Current warm IP stats: target: 10, total: 15, assigned: 6, available: 9, short: 1, over 0"}
{"level":"debug","ts":"2020-08-22T16:47:48.929Z","caller":"ipamd/ipamd.go:735","msg":"Found ENI eni-0fc3792c7cd8ef3cc that has less than the maximum number of IP addresses allocated: cur=15, max=29"}
{"level":"info","ts":"2020-08-22T16:47:48.929Z","caller":"ipamd/ipamd.go:739","msg":"Trying to allocate 14 IP addresses on ENI eni-0fc3792c7cd8ef3cc"}
{"level":"info","ts":"2020-08-22T16:47:49.404Z","caller":"ipamd/ipamd.go:739","msg":"Allocated 14 private IP addresses"}
{"level":"info","ts":"2020-08-22T16:47:49.507Z","caller":"ipamd/ipamd.go:793","msg":"Added ENI(eni-0fc3792c7cd8ef3cc)'s IP 172.30.166.124 to datastore"}
{"level":"info","ts":"2020-08-22T16:47:49.507Z","caller":"ipamd/ipamd.go:793","msg":"Added ENI(eni-0fc3792c7cd8ef3cc)'s IP 172.30.165.248 to datastore"}
{"level":"info","ts":"2020-08-22T16:47:49.507Z","caller":"ipamd/ipamd.go:793","msg":"Added ENI(eni-0fc3792c7cd8ef3cc)'s IP 172.30.166.214 to datastore"}
{"level":"info","ts":"2020-08-22T16:47:49.507Z","caller":"ipamd/ipamd.go:793","msg":"Added ENI(eni-0fc3792c7cd8ef3cc)'s IP 172.30.167.246 to datastore"}
{"level":"info","ts":"2020-08-22T16:47:49.507Z","caller":"ipamd/ipamd.go:793","msg":"Added ENI(eni-0fc3792c7cd8ef3cc)'s IP 172.30.166.18 to datastore"}
{"level":"info","ts":"2020-08-22T16:47:49.507Z","caller":"ipamd/ipamd.go:793","msg":"Added ENI(eni-0fc3792c7cd8ef3cc)'s IP 172.30.165.210 to datastore"}
{"level":"info","ts":"2020-08-22T16:47:49.507Z","caller":"ipamd/ipamd.go:793","msg":"Added ENI(eni-0fc3792c7cd8ef3cc)'s IP 172.30.166.13 to datastore"}
{"level":"info","ts":"2020-08-22T16:47:49.507Z","caller":"ipamd/ipamd.go:793","msg":"Added ENI(eni-0fc3792c7cd8ef3cc)'s IP 172.30.166.104 to datastore"}
{"level":"info","ts":"2020-08-22T16:47:49.507Z","caller":"ipamd/ipamd.go:793","msg":"Added ENI(eni-0fc3792c7cd8ef3cc)'s IP 172.30.167.169 to datastore"}
{"level":"info","ts":"2020-08-22T16:47:49.507Z","caller":"ipamd/ipamd.go:793","msg":"Added ENI(eni-0fc3792c7cd8ef3cc)'s IP 172.30.166.70 to datastore"}
{"level":"info","ts":"2020-08-22T16:47:49.507Z","caller":"ipamd/ipamd.go:793","msg":"Added ENI(eni-0fc3792c7cd8ef3cc)'s IP 172.30.166.6 to datastore"}
{"level":"info","ts":"2020-08-22T16:47:49.507Z","caller":"ipamd/ipamd.go:793","msg":"Added ENI(eni-0fc3792c7cd8ef3cc)'s IP 172.30.167.229 to datastore"}
{"level":"info","ts":"2020-08-22T16:47:49.507Z","caller":"ipamd/ipamd.go:793","msg":"Added ENI(eni-0fc3792c7cd8ef3cc)'s IP 172.30.166.2 to datastore"}
{"level":"info","ts":"2020-08-22T16:47:49.507Z","caller":"ipamd/ipamd.go:793","msg":"Added ENI(eni-0fc3792c7cd8ef3cc)'s IP 172.30.165.64 to datastore"}
{"level":"debug","ts":"2020-08-22T16:47:49.507Z","caller":"ipamd/ipamd.go:755","msg":"IP Address Pool stats: total: 29, assigned: 6"}
{"level":"debug","ts":"2020-08-22T16:47:49.507Z","caller":"ipamd/ipamd.go:645","msg":"Successfully increased IP pool, total: 29, used: 6"}
{"level":"debug","ts":"2020-08-22T16:47:49.507Z","caller":"ipamd/ipamd.go:668","msg":"IP pool stats: total = 29, used = 6, c.maxIPsPerENI = 29"}
{"level":"debug","ts":"2020-08-22T16:47:49.507Z","caller":"ipamd/ipamd.go:919","msg":"Current warm IP stats: target: 10, total: 29, assigned: 6, available: 23, short: 0, over 13"}
{"level":"debug","ts":"2020-08-22T16:47:49.508Z","caller":"datastore/data_store.go:675","msg":"ENI eni-0fc3792c7cd8ef3cc cannot be deleted because it is primary"}
{"level":"debug","ts":"2020-08-22T16:47:54.508Z","caller":"ipamd/ipamd.go:889","msg":"Current warm IP stats: target: 10, total: 29, assigned: 6, available: 23, short: 0, over 13"}
{"level":"debug","ts":"2020-08-22T16:47:54.508Z","caller":"ipamd/ipamd.go:907","msg":"Current warm IP stats: target: 10, total: 29, assigned: 6, available: 23, short: 0, over 13"}
{"level":"debug","ts":"2020-08-22T16:47:54.508Z","caller":"ipamd/ipamd.go:512","msg":"Starting to decrease IP pool"}
{"level":"debug","ts":"2020-08-22T16:47:54.508Z","caller":"ipamd/ipamd.go:567","msg":"Current warm IP stats: target: 10, total: 29, assigned: 6, available: 23, short: 0, over 13"}
{"level":"debug","ts":"2020-08-22T16:47:54.508Z","caller":"ipamd/ipamd.go:616","msg":"Current warm IP stats: target: 10, total: 29, assigned: 6, available: 23, short: 0, over 13"}
{"level":"info","ts":"2020-08-22T16:47:54.508Z","caller":"ipamd/ipamd.go:585","msg":"Deleted ENI(eni-0fc3792c7cd8ef3cc)'s IP 172.30.167.108 from datastore"}
{"level":"info","ts":"2020-08-22T16:47:54.508Z","caller":"ipamd/ipamd.go:585","msg":"Deleted ENI(eni-0fc3792c7cd8ef3cc)'s IP 172.30.166.124 from datastore"}
{"level":"info","ts":"2020-08-22T16:47:54.508Z","caller":"ipamd/ipamd.go:585","msg":"Deleted ENI(eni-0fc3792c7cd8ef3cc)'s IP 172.30.167.246 from datastore"}
{"level":"info","ts":"2020-08-22T16:47:54.508Z","caller":"ipamd/ipamd.go:585","msg":"Deleted ENI(eni-0fc3792c7cd8ef3cc)'s IP 172.30.167.169 from datastore"}
{"level":"info","ts":"2020-08-22T16:47:54.508Z","caller":"ipamd/ipamd.go:585","msg":"Deleted ENI(eni-0fc3792c7cd8ef3cc)'s IP 172.30.164.252 from datastore"}
{"level":"info","ts":"2020-08-22T16:47:54.508Z","caller":"ipamd/ipamd.go:585","msg":"Deleted ENI(eni-0fc3792c7cd8ef3cc)'s IP 172.30.164.184 from datastore"}
{"level":"info","ts":"2020-08-22T16:47:54.508Z","caller":"ipamd/ipamd.go:585","msg":"Deleted ENI(eni-0fc3792c7cd8ef3cc)'s IP 172.30.167.81 from datastore"}
{"level":"info","ts":"2020-08-22T16:47:54.508Z","caller":"ipamd/ipamd.go:585","msg":"Deleted ENI(eni-0fc3792c7cd8ef3cc)'s IP 172.30.166.103 from datastore"}
{"level":"info","ts":"2020-08-22T16:47:54.508Z","caller":"ipamd/ipamd.go:585","msg":"Deleted ENI(eni-0fc3792c7cd8ef3cc)'s IP 172.30.167.32 from datastore"}
{"level":"info","ts":"2020-08-22T16:47:54.508Z","caller":"ipamd/ipamd.go:585","msg":"Deleted ENI(eni-0fc3792c7cd8ef3cc)'s IP 172.30.166.2 from datastore"}
{"level":"info","ts":"2020-08-22T16:47:54.508Z","caller":"ipamd/ipamd.go:585","msg":"Deleted ENI(eni-0fc3792c7cd8ef3cc)'s IP 172.30.165.64 from datastore"}
{"level":"info","ts":"2020-08-22T16:47:54.508Z","caller":"ipamd/ipamd.go:585","msg":"Deleted ENI(eni-0fc3792c7cd8ef3cc)'s IP 172.30.165.69 from datastore"}
{"level":"info","ts":"2020-08-22T16:47:54.508Z","caller":"ipamd/ipamd.go:585","msg":"Deleted ENI(eni-0fc3792c7cd8ef3cc)'s IP 172.30.166.70 from datastore"}
{"level":"info","ts":"2020-08-22T16:47:54.508Z","caller":"ipamd/ipamd.go:596","msg":"Trying to unassign the following IPs [172.30.167.108 172.30.166.124 172.30.167.246 172.30.167.169 172.30.164.252 172.30.164.184 172.30.167.81 172.30.166.103 172.30.167.32 172.30.166.2 172.30.165.64 172.30.165.69 172.30.166.70] from ENI eni-0fc3792c7cd8ef3cc"}
{"level":"debug","ts":"2020-08-22T16:47:54.870Z","caller":"ipamd/ipamd.go:534","msg":"Successfully decreased IP pool by removing IPs [172.30.167.108 172.30.166.124 172.30.167.246 172.30.167.169 172.30.164.252 172.30.164.184 172.30.167.81 172.30.166.103 172.30.167.32 172.30.166.2 172.30.165.64 172.30.165.69 172.30.166.70] from ENI eni-0fc3792c7cd8ef3cc"}
{"level":"debug","ts":"2020-08-22T16:47:54.870Z","caller":"ipamd/ipamd.go:512","msg":"Successfully decreased IP pool"}
{"level":"debug","ts":"2020-08-22T16:47:54.870Z","caller":"ipamd/ipamd.go:540","msg":"IP pool stats: total = 16, used = 6, c.maxIPsPerENI = 29"}
{"level":"debug","ts":"2020-08-22T16:47:54.870Z","caller":"ipamd/ipamd.go:919","msg":"Current warm IP stats: target: 10, total: 16, assigned: 6, available: 10, short: 0, over 0"}
{"level":"debug","ts":"2020-08-22T16:47:54.870Z","caller":"datastore/data_store.go:675","msg":"ENI eni-0fc3792c7cd8ef3cc cannot be deleted because it is primary"}

kubelet

$ journalctl -fu kubelet --since "2020-08-22 16:42:34" --until "2020-08-22 16:43:34"
-- Logs begin at Sat 2020-08-22 16:41:38 UTC. --
Aug 22 16:42:34 ip-172-30-166-186 kubelet[2379]: I0822 16:42:34.007286    2379 kube_docker_client.go:345] Stop pulling image "REDACTED/jaegertracing/jaeger-agent:1.9.0": "Status: Downloaded newer image for REDACTED/jaegertracing/jaeger-agent:1.9.0"
Aug 22 16:42:34 ip-172-30-166-186 kubelet[2379]: {"level":"info","ts":"2020-08-22T16:42:34.229Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"CNI Plugin version: v1.7.0 ..."}
Aug 22 16:42:34 ip-172-30-166-186 kubelet[2379]: I0822 16:42:34.876949    2379 kubelet.go:1952] SyncLoop (PLEG): "node-problem-detector-l8dsn_node-problem-detector(25d93733-77ca-4292-a14f-3a1c8a700b85)", event: &pleg.PodLifecycleEvent{ID:"25d93733-77ca-4292-a14f-3a1c8a700b85", Type:"ContainerStarted", Data:"ed4ec97abc79740a8a4f228f2a91df917f1802576803049c643d1b5442b7d423"}
Aug 22 16:42:34 ip-172-30-166-186 kubelet[2379]: I0822 16:42:34.883229    2379 kubelet.go:1952] SyncLoop (PLEG): "kured-2455s_kured(e8901744-5eba-4c50-a823-0a2a161c8cbb)", event: &pleg.PodLifecycleEvent{ID:"e8901744-5eba-4c50-a823-0a2a161c8cbb", Type:"ContainerStarted", Data:"e7453255bfae231d6bfb290f603436206f51e2736bcbef13f74b4bc47687686c"}
Aug 22 16:42:34 ip-172-30-166-186 kubelet[2379]: I0822 16:42:34.886649    2379 kubelet.go:1952] SyncLoop (PLEG): "jaeger-agent-jl5r4_tracing(6296834f-a475-453d-ac0f-4394bce4d54f)", event: &pleg.PodLifecycleEvent{ID:"6296834f-a475-453d-ac0f-4394bce4d54f", Type:"ContainerStarted", Data:"863be82ef98ae7c97ded728b14f6925bbd6c3c6c4312c4431f35656d95ad9aae"}
Aug 22 16:42:34 ip-172-30-166-186 kubelet[2379]: I0822 16:42:34.891773    2379 kubelet.go:1952] SyncLoop (PLEG): "cluster-overprovisioning-56ccddc5bc-m8wn4_kube-system(914518fb-d282-480d-b021-e61a002b4e9c)", event: &pleg.PodLifecycleEvent{ID:"914518fb-d282-480d-b021-e61a002b4e9c", Type:"ContainerStarted", Data:"eb12dd0745269c029ce538503cf7f57c3281f75fb9fa650a0ca847d00069acfe"}
Aug 22 16:42:35 ip-172-30-166-186 kubelet[2379]: W0822 16:42:35.004286    2379 volume_linux.go:45] Setting volume ownership for /var/lib/kubelet/pods/6296834f-a475-453d-ac0f-4394bce4d54f/volumes/kubernetes.io~secret/jaeger-agent-token-fbr6b and fsGroup set. If the volume has a lot of files then setting volume ownership could be slow, see https://github.com/kubernetes/kubernetes/issues/69699
Aug 22 16:42:35 ip-172-30-166-186 kubelet[2379]: I0822 16:42:35.741063    2379 kube_docker_client.go:345] Stop pulling image "REDACTED/fluentbit-s3:1.4.2": "Status: Downloaded newer image for REDACTED/fluentbit-s3:1.4.2"
Aug 22 16:42:36 ip-172-30-166-186 kubelet[2379]: W0822 16:42:36.006111    2379 volume_linux.go:45] Setting volume ownership for /var/lib/kubelet/pods/6296834f-a475-453d-ac0f-4394bce4d54f/volumes/kubernetes.io~secret/jaeger-agent-token-fbr6b and fsGroup set. If the volume has a lot of files then setting volume ownership could be slow, see https://github.com/kubernetes/kubernetes/issues/69699
Aug 22 16:42:36 ip-172-30-166-186 kubelet[2379]: I0822 16:42:36.650462    2379 kube_docker_client.go:345] Stop pulling image "REDACTED/pause:3.2": "Status: Downloaded newer image for REDACTED/pause:3.2"
Aug 22 16:42:37 ip-172-30-166-186 kubelet[2379]: I0822 16:42:37.922919    2379 kubelet.go:1952] SyncLoop (PLEG): "cluster-overprovisioning-56ccddc5bc-m8wn4_kube-system(914518fb-d282-480d-b021-e61a002b4e9c)", event: &pleg.PodLifecycleEvent{ID:"914518fb-d282-480d-b021-e61a002b4e9c", Type:"ContainerStarted", Data:"3aa25d0d3119c7b7073a5ed23eea0d5bbad8d5c1dc615510d4b0ad0508a56d14"}
Aug 22 16:42:37 ip-172-30-166-186 kubelet[2379]: I0822 16:42:37.930020    2379 kubelet.go:1952] SyncLoop (PLEG): "fluentbit-fdpwt_kube-logs(3ad66c50-3ce5-419c-8685-5b8949991344)", event: &pleg.PodLifecycleEvent{ID:"3ad66c50-3ce5-419c-8685-5b8949991344", Type:"ContainerStarted", Data:"e10fb3610a6049bce22481f787e083eae6e3e36acb86f3f4e4a86e870d463831"}
Aug 22 16:42:38 ip-172-30-166-186 kubelet[2379]: {"level":"info","ts":"2020-08-22T16:42:38.109Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"CNI Plugin version: v1.7.0 ..."}
Aug 22 16:42:38 ip-172-30-166-186 kubelet[2379]: W0822 16:42:38.110013    2379 volume_linux.go:45] Setting volume ownership for /var/lib/kubelet/pods/3ad66c50-3ce5-419c-8685-5b8949991344/volumes/kubernetes.io~secret/fluentbit-token-d25cp and fsGroup set. If the volume has a lot of files then setting volume ownership could be slow, see https://github.com/kubernetes/kubernetes/issues/69699
Aug 22 16:42:38 ip-172-30-166-186 kubelet[2379]: W0822 16:42:38.110305    2379 volume_linux.go:45] Setting volume ownership for /var/lib/kubelet/pods/3ad66c50-3ce5-419c-8685-5b8949991344/volumes/kubernetes.io~configmap/fluentbit-config and fsGroup set. If the volume has a lot of files then setting volume ownership could be slow, see https://github.com/kubernetes/kubernetes/issues/69699
Aug 22 16:42:38 ip-172-30-166-186 kubelet[2379]: I0822 16:42:38.689641    2379 setters.go:73] Using node IP: "172.30.166.186"
Aug 22 16:42:39 ip-172-30-166-186 kubelet[2379]: W0822 16:42:39.113297    2379 volume_linux.go:45] Setting volume ownership for /var/lib/kubelet/pods/3ad66c50-3ce5-419c-8685-5b8949991344/volumes/kubernetes.io~secret/fluentbit-token-d25cp and fsGroup set. If the volume has a lot of files then setting volume ownership could be slow, see https://github.com/kubernetes/kubernetes/issues/69699
Aug 22 16:42:39 ip-172-30-166-186 kubelet[2379]: W0822 16:42:39.114164    2379 volume_linux.go:45] Setting volume ownership for /var/lib/kubelet/pods/3ad66c50-3ce5-419c-8685-5b8949991344/volumes/kubernetes.io~configmap/fluentbit-config and fsGroup set. If the volume has a lot of files then setting volume ownership could be slow, see https://github.com/kubernetes/kubernetes/issues/69699
Aug 22 16:42:43 ip-172-30-166-186 kubelet[2379]: {"level":"info","ts":"2020-08-22T16:42:43.119Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"CNI Plugin version: v1.7.0 ..."}
Aug 22 16:42:48 ip-172-30-166-186 kubelet[2379]: {"level":"info","ts":"2020-08-22T16:42:48.129Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"CNI Plugin version: v1.7.0 ..."}
Aug 22 16:42:48 ip-172-30-166-186 kubelet[2379]: I0822 16:42:48.694827    2379 setters.go:73] Using node IP: "172.30.166.186"
Aug 22 16:42:53 ip-172-30-166-186 kubelet[2379]: {"level":"info","ts":"2020-08-22T16:42:53.138Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"CNI Plugin version: v1.7.0 ..."}
Aug 22 16:42:58 ip-172-30-166-186 kubelet[2379]: {"level":"info","ts":"2020-08-22T16:42:58.181Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"CNI Plugin version: v1.7.0 ..."}
Aug 22 16:42:58 ip-172-30-166-186 kubelet[2379]: I0822 16:42:58.698566    2379 setters.go:73] Using node IP: "172.30.166.186"
Aug 22 16:43:00 ip-172-30-166-186 kubelet[2379]: I0822 16:43:00.202296    2379 kubelet.go:1952] SyncLoop (PLEG): "spire-agent-dhtdv_spire(28662c40-f3f9-460b-96ff-d401c95fa729)", event: &pleg.PodLifecycleEvent{ID:"28662c40-f3f9-460b-96ff-d401c95fa729", Type:"ContainerDied", Data:"c6ae074a1a293c03a8bcf4583fc7ab96547793035ab755fcf544e873bc32b71d"}
Aug 22 16:43:00 ip-172-30-166-186 kubelet[2379]: I0822 16:43:00.262341    2379 kube_docker_client.go:345] Stop pulling image "REDACTED/spiffe-io/wait-for-it@sha256:d9bdc931e4404237d2fb0ba84db5ece88b236c40eeca570d786ee54fd243f4ae": "Status: Image is up to date for REDACTED/spiffe-io/wait-for-it@sha256:d9bdc931e4404237d2fb0ba84db5ece88b236c40eeca570d786ee54fd243f4ae"
Aug 22 16:43:00 ip-172-30-166-186 kubelet[2379]: W0822 16:43:00.359464    2379 volume_linux.go:45] Setting volume ownership for /var/lib/kubelet/pods/28662c40-f3f9-460b-96ff-d401c95fa729/volumes/kubernetes.io~configmap/spire-bundle and fsGroup set. If the volume has a lot of files then setting volume ownership could be slow, see https://github.com/kubernetes/kubernetes/issues/69699
Aug 22 16:43:00 ip-172-30-166-186 kubelet[2379]: W0822 16:43:00.359479    2379 volume_linux.go:45] Setting volume ownership for /var/lib/kubelet/pods/28662c40-f3f9-460b-96ff-d401c95fa729/volumes/kubernetes.io~configmap/spire-agent-config and fsGroup set. If the volume has a lot of files then setting volume ownership could be slow, see https://github.com/kubernetes/kubernetes/issues/69699
Aug 22 16:43:00 ip-172-30-166-186 kubelet[2379]: W0822 16:43:00.359588    2379 volume_linux.go:45] Setting volume ownership for /var/lib/kubelet/pods/28662c40-f3f9-460b-96ff-d401c95fa729/volumes/kubernetes.io~secret/spire-agent-token-tgj5p and fsGroup set. If the volume has a lot of files then setting volume ownership could be slow, see https://github.com/kubernetes/kubernetes/issues/69699
Aug 22 16:43:00 ip-172-30-166-186 kubelet[2379]: W0822 16:43:00.359654    2379 volume_linux.go:45] Setting volume ownership for /var/lib/kubelet/pods/28662c40-f3f9-460b-96ff-d401c95fa729/volumes/kubernetes.io~projected/spire-agent and fsGroup set. If the volume has a lot of files then setting volume ownership could be slow, see https://github.com/kubernetes/kubernetes/issues/69699
Aug 22 16:43:01 ip-172-30-166-186 kubelet[2379]: I0822 16:43:01.206631    2379 kubelet.go:1952] SyncLoop (PLEG): "spire-agent-dhtdv_spire(28662c40-f3f9-460b-96ff-d401c95fa729)", event: &pleg.PodLifecycleEvent{ID:"28662c40-f3f9-460b-96ff-d401c95fa729", Type:"ContainerStarted", Data:"8fc986cb87ee3eb07ddfb9a958c6e78b51c752c091ccba31f100cb38e24c04a5"}
Aug 22 16:43:01 ip-172-30-166-186 kubelet[2379]: W0822 16:43:01.361978    2379 volume_linux.go:45] Setting volume ownership for /var/lib/kubelet/pods/28662c40-f3f9-460b-96ff-d401c95fa729/volumes/kubernetes.io~projected/spire-agent and fsGroup set. If the volume has a lot of files then setting volume ownership could be slow, see https://github.com/kubernetes/kubernetes/issues/69699
Aug 22 16:43:01 ip-172-30-166-186 kubelet[2379]: W0822 16:43:01.362010    2379 volume_linux.go:45] Setting volume ownership for /var/lib/kubelet/pods/28662c40-f3f9-460b-96ff-d401c95fa729/volumes/kubernetes.io~configmap/spire-bundle and fsGroup set. If the volume has a lot of files then setting volume ownership could be slow, see https://github.com/kubernetes/kubernetes/issues/69699
Aug 22 16:43:01 ip-172-30-166-186 kubelet[2379]: W0822 16:43:01.362025    2379 volume_linux.go:45] Setting volume ownership for /var/lib/kubelet/pods/28662c40-f3f9-460b-96ff-d401c95fa729/volumes/kubernetes.io~configmap/spire-agent-config and fsGroup set. If the volume has a lot of files then setting volume ownership could be slow, see https://github.com/kubernetes/kubernetes/issues/69699
Aug 22 16:43:01 ip-172-30-166-186 kubelet[2379]: W0822 16:43:01.362070    2379 volume_linux.go:45] Setting volume ownership for /var/lib/kubelet/pods/28662c40-f3f9-460b-96ff-d401c95fa729/volumes/kubernetes.io~secret/spire-agent-token-tgj5p and fsGroup set. If the volume has a lot of files then setting volume ownership could be slow, see https://github.com/kubernetes/kubernetes/issues/69699
Aug 22 16:43:03 ip-172-30-166-186 kubelet[2379]: {"level":"info","ts":"2020-08-22T16:43:03.190Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"CNI Plugin version: v1.7.0 ..."}
Aug 22 16:43:05 ip-172-30-166-186 kubelet[2379]: I0822 16:43:05.263316    2379 kubelet.go:1952] SyncLoop (PLEG): "kured-2455s_kured(e8901744-5eba-4c50-a823-0a2a161c8cbb)", event: &pleg.PodLifecycleEvent{ID:"e8901744-5eba-4c50-a823-0a2a161c8cbb", Type:"ContainerDied", Data:"e7453255bfae231d6bfb290f603436206f51e2736bcbef13f74b4bc47687686c"}
Aug 22 16:43:06 ip-172-30-166-186 kubelet[2379]: I0822 16:43:06.279378    2379 kubelet.go:1952] SyncLoop (PLEG): "kured-2455s_kured(e8901744-5eba-4c50-a823-0a2a161c8cbb)", event: &pleg.PodLifecycleEvent{ID:"e8901744-5eba-4c50-a823-0a2a161c8cbb", Type:"ContainerStarted", Data:"bff65f4e8d7125e9c57a4f5baefa525c4f3878826f485f813d675d1259217c4d"}
Aug 22 16:43:08 ip-172-30-166-186 kubelet[2379]: {"level":"info","ts":"2020-08-22T16:43:08.200Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"CNI Plugin version: v1.7.0 ..."}
Aug 22 16:43:08 ip-172-30-166-186 kubelet[2379]: I0822 16:43:08.710533    2379 setters.go:73] Using node IP: "172.30.166.186"
Aug 22 16:43:13 ip-172-30-166-186 kubelet[2379]: {"level":"info","ts":"2020-08-22T16:43:13.210Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"CNI Plugin version: v1.7.0 ..."}
Aug 22 16:43:14 ip-172-30-166-186 kubelet[2379]: I0822 16:43:14.648917    2379 prober.go:116] Liveness probe for "fluentbit-fdpwt_kube-logs(3ad66c50-3ce5-419c-8685-5b8949991344):fluentbit" failed (failure): Get http://172.30.165.255:2020/: dial tcp 172.30.165.255:2020: connect: no route to host
Aug 22 16:43:16 ip-172-30-166-186 kubelet[2379]: W0822 16:43:16.297252    2379 volume_linux.go:45] Setting volume ownership for /var/lib/kubelet/pods/ed16d36e-c7bd-4b8a-b698-0a94c68b7477/volumes/kubernetes.io~secret/node-exporter-token-k788f and fsGroup set. If the volume has a lot of files then setting volume ownership could be slow, see https://github.com/kubernetes/kubernetes/issues/69699
Aug 22 16:43:18 ip-172-30-166-186 kubelet[2379]: {"level":"info","ts":"2020-08-22T16:43:18.220Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"CNI Plugin version: v1.7.0 ..."}
Aug 22 16:43:18 ip-172-30-166-186 kubelet[2379]: I0822 16:43:18.715397    2379 setters.go:73] Using node IP: "172.30.166.186"
Aug 22 16:43:22 ip-172-30-166-186 kubelet[2379]: I0822 16:43:22.872870    2379 prober.go:116] Liveness probe for "fluentbit-fdpwt_kube-logs(3ad66c50-3ce5-419c-8685-5b8949991344):fluentbit" failed (failure): Get http://172.30.165.255:2020/: dial tcp 172.30.165.255:2020: connect: no route to host
Aug 22 16:43:23 ip-172-30-166-186 kubelet[2379]: {"level":"info","ts":"2020-08-22T16:43:23.230Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"CNI Plugin version: v1.7.0 ..."}
Aug 22 16:43:28 ip-172-30-166-186 kubelet[2379]: {"level":"info","ts":"2020-08-22T16:43:28.240Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"CNI Plugin version: v1.7.0 ..."}
Aug 22 16:43:28 ip-172-30-166-186 kubelet[2379]: I0822 16:43:28.719027    2379 setters.go:73] Using node IP: "172.30.166.186"
Aug 22 16:43:32 ip-172-30-166-186 kubelet[2379]: I0822 16:43:32.561275    2379 kubelet.go:1952] SyncLoop (PLEG): "spire-agent-dhtdv_spire(28662c40-f3f9-460b-96ff-d401c95fa729)", event: &pleg.PodLifecycleEvent{ID:"28662c40-f3f9-460b-96ff-d401c95fa729", Type:"ContainerDied", Data:"8fc986cb87ee3eb07ddfb9a958c6e78b51c752c091ccba31f100cb38e24c04a5"}
Aug 22 16:43:32 ip-172-30-166-186 kubelet[2379]: E0822 16:43:32.571248    2379 remote_runtime.go:261] RemoveContainer "c6ae074a1a293c03a8bcf4583fc7ab96547793035ab755fcf544e873bc32b71d" from runtime service failed: rpc error: code = Unknown desc = failed to remove container "c6ae074a1a293c03a8bcf4583fc7ab96547793035ab755fcf544e873bc32b71d": Error response from daemon: removal of container c6ae074a1a293c03a8bcf4583fc7ab96547793035ab755fcf544e873bc32b71d is already in progress
Aug 22 16:43:32 ip-172-30-166-186 kubelet[2379]: E0822 16:43:32.571360    2379 kuberuntime_container.go:671] failed to remove pod init container "init": rpc error: code = Unknown desc = failed to remove container "c6ae074a1a293c03a8bcf4583fc7ab96547793035ab755fcf544e873bc32b71d": Error response from daemon: removal of container c6ae074a1a293c03a8bcf4583fc7ab96547793035ab755fcf544e873bc32b71d is already in progress; Skipping pod "spire-agent-dhtdv_spire(28662c40-f3f9-460b-96ff-d401c95fa729)"
Aug 22 16:43:32 ip-172-30-166-186 kubelet[2379]: E0822 16:43:32.571571    2379 pod_workers.go:191] Error syncing pod 28662c40-f3f9-460b-96ff-d401c95fa729 ("spire-agent-dhtdv_spire(28662c40-f3f9-460b-96ff-d401c95fa729)"), skipping: failed to "StartContainer" for "init" with CrashLoopBackOff: "back-off 10s restarting failed container=init pod=spire-agent-dhtdv_spire(28662c40-f3f9-460b-96ff-d401c95fa729)"
Aug 22 16:43:32 ip-172-30-166-186 kubelet[2379]: W0822 16:43:32.733670    2379 volume_linux.go:45] Setting volume ownership for /var/lib/kubelet/pods/28662c40-f3f9-460b-96ff-d401c95fa729/volumes/kubernetes.io~projected/spire-agent and fsGroup set. If the volume has a lot of files then setting volume ownership could be slow, see https://github.com/kubernetes/kubernetes/issues/69699
Aug 22 16:43:32 ip-172-30-166-186 kubelet[2379]: W0822 16:43:32.733782    2379 volume_linux.go:45] Setting volume ownership for /var/lib/kubelet/pods/28662c40-f3f9-460b-96ff-d401c95fa729/volumes/kubernetes.io~configmap/spire-agent-config and fsGroup set. If the volume has a lot of files then setting volume ownership could be slow, see https://github.com/kubernetes/kubernetes/issues/69699
Aug 22 16:43:32 ip-172-30-166-186 kubelet[2379]: W0822 16:43:32.733892    2379 volume_linux.go:45] Setting volume ownership for /var/lib/kubelet/pods/28662c40-f3f9-460b-96ff-d401c95fa729/volumes/kubernetes.io~configmap/spire-bundle and fsGroup set. If the volume has a lot of files then setting volume ownership could be slow, see https://github.com/kubernetes/kubernetes/issues/69699
Aug 22 16:43:32 ip-172-30-166-186 kubelet[2379]: W0822 16:43:32.733953    2379 volume_linux.go:45] Setting volume ownership for /var/lib/kubelet/pods/28662c40-f3f9-460b-96ff-d401c95fa729/volumes/kubernetes.io~secret/spire-agent-token-tgj5p and fsGroup set. If the volume has a lot of files then setting volume ownership could be slow, see https://github.com/kubernetes/kubernetes/issues/69699
Aug 22 16:43:33 ip-172-30-166-186 kubelet[2379]: {"level":"info","ts":"2020-08-22T16:43:33.249Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"CNI Plugin version: v1.7.0 ..."}
Aug 22 16:43:33 ip-172-30-166-186 kubelet[2379]: I0822 16:43:33.880905    2379 prober.go:116] Liveness probe for "fluentbit-fdpwt_kube-logs(3ad66c50-3ce5-419c-8685-5b8949991344):fluentbit" failed (failure): Get http://172.30.165.255:2020/: dial tcp 172.30.165.255:2020: connect: no route to host
Aug 22 16:43:33 ip-172-30-166-186 kubelet[2379]: I0822 16:43:33.880963    2379 kubelet.go:1985] SyncLoop (container unhealthy): "fluentbit-fdpwt_kube-logs(3ad66c50-3ce5-419c-8685-5b8949991344)"
Aug 22 16:43:33 ip-172-30-166-186 kubelet[2379]: I0822 16:43:33.881401    2379 kuberuntime_manager.go:630] Container "fluentbit" ({"docker" "e10fb3610a6049bce22481f787e083eae6e3e36acb86f3f4e4a86e870d463831"}) of pod fluentbit-fdpwt_kube-logs(3ad66c50-3ce5-419c-8685-5b8949991344): Container fluentbit failed liveness probe, will be restarted
Aug 22 16:43:33 ip-172-30-166-186 kubelet[2379]: I0822 16:43:33.881443    2379 kuberuntime_container.go:600] Killing container "docker://e10fb3610a6049bce22481f787e083eae6e3e36acb86f3f4e4a86e870d463831" with 20 second grace period

Docker Version

Client: Docker Engine - Community
 Version:           19.03.12
 API version:       1.39
 Go version:        go1.13.10
 Git commit:        48a66213fe
 Built:             Mon Jun 22 15:45:36 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.6
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.8
  Git commit:       481bc77
  Built:            Sat May  4 01:59:36 2019
  OS/Arch:          linux/amd64
  Experimental:     false
$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         172.30.164.1    0.0.0.0         UG    100    0        0 ens5
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
172.30.164.0    0.0.0.0         255.255.252.0   U     0      0        0 ens5
172.30.164.1    0.0.0.0         255.255.255.255 UH    100    0        0 ens5
$ ifconfig eni51add10f095
eni51add10f095: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9001
        ether 2a:5c:2d:80:46:8a  txqueuelen 0  (Ethernet)
        RX packets 2  bytes 148 (148.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 20  bytes 5460 (5.4 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

You can see from the kubelet log that the daemonsets which had pods scheduled on this new node are also in a failing state. Let me know if you need anything further.

@mogren mogren removed the bug label Aug 22, 2020
@mogren (Contributor) commented Aug 22, 2020

Hi @kgtw, we just set up some new Ubuntu clusters and tested with v1.7.0. Using the same config as you pasted above, pods came up fine, so we could not reproduce the issue. The ipamd logs above also do not show any errors, and eni51add10f095, for example, does get the IP 172.30.167.51 from the CNI, and plugin.log shows that it gets set up under /proc/23799/ns/net on ens5. This happened around "2020-08-22T16:47:54.870Z", but the kubelet logs end 4 minutes before that, so it's hard to tell if the pod started up correctly. If the pod failed to start after getting the IP, kubelet will call the CNI with a Delete command, and the IP and routes will be cleaned up again.

If you run ip rule and ip route, does that show the pod IP?
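
For example, using the pod IP from the plugin.log above:

$ ip rule show | grep 172.30.167.51
$ ip route show | grep 172.30.167.51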

It would help us a lot to debug this if you could provide us with the complete logs from the /opt/cni/bin/aws-cni-support.sh script. (Email to mogren@amazon.com, or through a support ticket to AWS)

If you apply the v1.6.3 config to the node, does it recover?

@kgtw (Contributor, Author) commented Aug 22, 2020

Hi @mogren, thanks for the speedy reply :)

Re-applying the 1.6.3 manifests does appear to recover the cluster, although I did terminate most of the nodes to speed it up.

I have attached the debug information in a support ticket, ID 7303365231.

Logs were taken from a newly created node; once it registered as Ready within k8s, I manually deployed a debug pod named shell-node-debug at 2020-08-22T21:42:21.635Z.

This pod has the IP address 172.30.163.152 / eni8e97e305193 associated with it. I did manage to see the rule, but no route for it.

$ ip route
default via 172.30.160.1 dev ens5 proto dhcp src 172.30.163.15 metric 100
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
172.30.160.0/22 dev ens5 proto kernel scope link src 172.30.163.15
172.30.160.1 dev ens5 proto dhcp scope link src 172.30.163.15 metric 100

$ ip rule
0:	from all lookup local
512:	from all to 172.30.163.154 lookup main
512:	from all to 172.30.163.121 lookup main
512:	from all to 172.30.160.79 lookup main
512:	from all to 172.30.161.176 lookup main
512:	from all to 172.30.160.103 lookup main
512:	from all to 172.30.160.120 lookup main
512:	from all to 172.30.161.59 lookup main
512:	from all to 172.30.163.152 lookup main
1024:	from all fwmark 0x80/0x80 lookup main
32766:	from all lookup main
32767:	from all lookup default

Hope this helps.

@SaranBalaji90 (Contributor)

@kgtw I'm guessing you upgraded by applying the YAML file, right?

@mogren mogren added the calico Calico integration issue label Aug 23, 2020
@mogren (Contributor) commented Aug 23, 2020

@kgtw Thanks for providing the logs. It seems like the routes get added correctly, but later they get removed by Calico. We are still investigating why this happens.

@kgtw (Contributor, Author) commented Aug 23, 2020

@SaranBalaji90 - Yes, although we have copied them so we can add additional labels/tolerations, etc. I have double-checked what we have and it all looks correct.

@mogren Yes, I am seeing this as well now, after adding the debug pod shell-node-debug to a newly created node.

172.30.170.74 / eni467ad56930c

ipamd.log

{"level":"info","ts":"2020-08-23T09:16:25.912Z","caller":"rpc/rpc.pb.go:486","msg":"Received AddNetwork for NS /proc/12312/ns/net, Sandbox 2c35bc93533a5a984f5e71d8ec408cdca96e1f13beff1f7e293b3842ba8c74db, ifname eth0"}
{"level":"debug","ts":"2020-08-23T09:16:25.912Z","caller":"rpc/rpc.pb.go:486","msg":"AddNetworkRequest: ClientVersion:\"v1.7.0\" K8S_POD_NAME:\"shell-node-debug\" K8S_POD_NAMESPACE:\"kube-system\" K8S_POD_INFRA_CONTAINER_ID:\"2c35bc93533a5a984f5e71d8ec408cdca96e1f13beff1f7e293b3842ba8c74db\" ContainerID:\"2c35bc93533a5a984f5e71d8ec408cdca96e1f13beff1f7e293b3842ba8c74db\" IfName:\"eth0\" NetworkName:\"aws-cni\" Netns:\"/proc/12312/ns/net\" "}
{"level":"debug","ts":"2020-08-23T09:16:25.912Z","caller":"ipamd/rpc_handler.go:142","msg":"AssignIPv4Address: IP address pool stats: total: 14, assigned 6"}
{"level":"info","ts":"2020-08-23T09:16:25.912Z","caller":"datastore/data_store.go:499","msg":"AssignPodIPv4Address: Assign IP 172.30.170.74 to sandbox aws-cni/2c35bc93533a5a984f5e71d8ec408cdca96e1f13beff1f7e293b3842ba8c74db/eth0"}
{"level":"debug","ts":"2020-08-23T09:16:25.912Z","caller":"rpc/rpc.pb.go:486","msg":"VPC CIDR 172.30.160.0/20"}
{"level":"info","ts":"2020-08-23T09:16:25.912Z","caller":"rpc/rpc.pb.go:486","msg":"Send AddNetworkReply: IPv4Addr 172.30.170.74, DeviceNumber: 0, err: <nil>"}

plugin.log

{"level":"info","ts":"2020-08-23T09:16:25.912Z","caller":"routed-eni-cni-plugin/cni.go:117","msg":"Received add network response for container 2c35bc93533a5a984f5e71d8ec408cdca96e1f13beff1f7e293b3842ba8c74db interface eth0: Success:true IPv4Addr:\"172.30.170.74\" UseExternalSNAT:true VPCcidrs:\"172.30.160.0/20\" "}
{"level":"debug","ts":"2020-08-23T09:16:25.912Z","caller":"routed-eni-cni-plugin/cni.go:194","msg":"SetupNS: hostVethName=eni467ad56930c, contVethName=eth0, netnsPath=/proc/12312/ns/net, table=0, mtu=9001"}
{"level":"debug","ts":"2020-08-23T09:16:25.988Z","caller":"driver/driver.go:184","msg":"setupVeth network: disabled IPv6 RA and ICMP redirects on eni467ad56930c"}
{"level":"debug","ts":"2020-08-23T09:16:25.989Z","caller":"driver/driver.go:178","msg":"Setup host route outgoing hostVeth, LinkIndex 13"}
{"level":"debug","ts":"2020-08-23T09:16:25.989Z","caller":"driver/driver.go:178","msg":"Successfully set host route to be 172.30.170.74/0"}
{"level":"info","ts":"2020-08-23T09:16:25.990Z","caller":"driver/driver.go:178","msg":"Added toContainer rule for 172.30.170.74/32"}

calico-node

2020-08-23 09:16:25.988 [INFO][58] int_dataplane.go 778: Linux interface addrs changed. addrs=set.mapSet{} ifaceName="eni467ad56930c"
2020-08-23 09:16:25.988 [INFO][58] int_dataplane.go 743: Linux interface state changed. ifIndex=13 ifaceName="eni467ad56930c" state="up"
2020-08-23 09:16:25.989 [INFO][58] int_dataplane.go 1067: Received interface addresses update msg=&intdataplane.ifaceAddrsUpdate{Name:"eni467ad56930c", Addrs:set.mapSet{}}
2020-08-23 09:16:25.989 [INFO][58] hostip_mgr.go 84: Interface addrs changed. update=&intdataplane.ifaceAddrsUpdate{Name:"eni467ad56930c", Addrs:set.mapSet{}}
2020-08-23 09:16:25.989 [INFO][58] int_dataplane.go 1162: Applying dataplane updates
2020-08-23 09:16:25.989 [INFO][58] route_table.go 556: Syncing routes: found unexpected route; ignoring due to grace period. dest=172.30.171.35/32 ifaceName="eni325bbee54ae" ipVersion=0x4
2020-08-23 09:16:25.989 [INFO][58] route_table.go 364: Interface in cleanup grace period, will retry after. ifaceName="eni325bbee54ae" ipVersion=0x4
2020-08-23 09:16:25.989 [INFO][58] route_table.go 556: Syncing routes: found unexpected route; ignoring due to grace period. dest=172.30.169.5/32 ifaceName="enif7638c71565" ipVersion=0x4
2020-08-23 09:16:25.989 [INFO][58] route_table.go 364: Interface in cleanup grace period, will retry after. ifaceName="enif7638c71565" ipVersion=0x4
2020-08-23 09:16:25.990 [INFO][58] int_dataplane.go 1176: Finished applying updates to dataplane. msecToApply=0.7474839999999999
2020-08-23 09:16:25.990 [INFO][58] int_dataplane.go 1049: Received interface update msg=&intdataplane.ifaceUpdate{Name:"eni467ad56930c", State:"up", Index:13}
2020-08-23 09:16:25.990 [INFO][58] int_dataplane.go 1162: Applying dataplane updates
2020-08-23 09:16:25.990 [INFO][58] endpoint_mgr.go 342: Workload interface came up, marking for reconfiguration. ifaceName="eni467ad56930c"
2020-08-23 09:16:25.990 [INFO][58] endpoint_mgr.go 1043: Applying /proc/sys configuration to interface. ifaceName="eni467ad56930c"
2020-08-23 09:16:25.990 [INFO][58] route_table.go 556: Syncing routes: found unexpected route; ignoring due to grace period. dest=172.30.170.74/32 ifaceName="eni467ad56930c" ipVersion=0x4
2020-08-23 09:16:25.990 [INFO][58] route_table.go 364: Interface in cleanup grace period, will retry after. ifaceName="eni467ad56930c" ipVersion=0x4
2020-08-23 09:16:25.991 [INFO][58] route_table.go 556: Syncing routes: found unexpected route; ignoring due to grace period. dest=172.30.171.35/32 ifaceName="eni325bbee54ae" ipVersion=0x4
2020-08-23 09:16:25.991 [INFO][58] route_table.go 364: Interface in cleanup grace period, will retry after. ifaceName="eni325bbee54ae" ipVersion=0x4
2020-08-23 09:16:25.991 [INFO][58] route_table.go 556: Syncing routes: found unexpected route; ignoring due to grace period. dest=172.30.169.5/32 ifaceName="enif7638c71565" ipVersion=0x4
2020-08-23 09:16:25.991 [INFO][58] route_table.go 364: Interface in cleanup grace period, will retry after. ifaceName="enif7638c71565" ipVersion=0x4
2020-08-23 09:16:25.991 [INFO][58] int_dataplane.go 1176: Finished applying updates to dataplane. msecToApply=1.098275

.
.
.

2020-08-23 09:16:36.695 [INFO][58] table.go 810: Invalidating dataplane cache ipVersion=0x4 reason="post update" table="filter"
2020-08-23 09:16:36.695 [INFO][58] route_table.go 561: Syncing routes: removing old route. dest=172.30.170.74/32 ifaceName="eni467ad56930c" ipVersion=0x4 routeProblems=[]string{"unexpected route"}
2020-08-23 09:16:36.696 [INFO][58] conntrack.go 78: Removing conntrack flows ip=172.30.170.74
2020-08-23 09:16:36.697 [INFO][58] table.go 497: Loading current iptables state and checking it is correct. ipVersion=0x4 table="filter"
2020-08-23 09:16:36.706 [INFO][58] int_dataplane.go 1176: Finished applying updates to dataplane. msecToApply=10.334442

We can see that calico-node is removing the route because it considers it unexpected for some reason, and it does the same for the routes of all other pods on the node.

I did notice that we are running a newer version of Calico, v3.15.0, whereas your 1.7.0 release uses v3.13.4. I tested with both versions of Calico running in the cluster; both show the same log events.

One thing to note: while we do have Calico running in the cluster, we don't have any network policies applied in this cluster.

To summarise:

  • cni 1.6.3 works with calico 3.13.4 & 3.15.0
  • cni 1.7.0 doesn't work with calico 3.13.4 or 3.15.0

Reverting the CNI to 1.6.3, leaving the Calico deployment unchanged, and then deleting pods/nodes fixes the issue.


Update: I just noticed that Calico wasn't printing meaningful debug information. With debug on, I noticed these two lines from Felix which stood out for a new test pod. It might be nothing though...

2020-08-23 10:31:11.112 [DEBUG][64] felix/update_filter.go 103: Route update route={Ifindex: 37 Dst: 172.30.171.53/32 Src: <nil> Gw: <nil> Flags: [] Table: 254}
2020-08-23 10:31:11.112 [DEBUG][64] felix/update_filter.go 105: Ignoring non-local route. route={Ifindex: 37 Dst: 172.30.171.53/32 Src: <nil> Gw: <nil> Flags: [] Table: 254}

Afterwards we start seeing the log entries about unexpected routes.

2020-08-23 10:31:11.215 [INFO][64] felix/route_table.go 856: Syncing routes: found unexpected route; ignoring due to grace period. dest=172.30.171.53/32 ifaceName="eniac706274002" ifaceRegex="^eni.*" ipVersion=0x4

@mogren (Contributor) commented Aug 23, 2020

@kgtw Thanks for reporting and debugging this issue. Can we update the title of this ticket to something like "Calico deletes routes when using CNI v1.7.0" to make it easier to find for other people that run into this? We will try to get to the root cause of why Calico can't see the changes with v1.7.0 ASAP.

@kgtw kgtw changed the title interface created with no ip-address / routes Calico deletes routes when using CNI v1.7.0 Aug 23, 2020
@mogren mogren added priority/P0 Highest priority. Someone needs to actively work on this. and removed needs investigation labels Aug 25, 2020
@jayanthvn (Contributor)

@kgtw - Thanks again for reporting the issue. #1166 is merged to master, so closing this issue now.

@kgtw (Contributor, Author) commented Aug 25, 2020

Thanks for the speedy fix @jayanthvn - I'll let you know if we still have issues with this once a new release has been made.

@jayanthvn (Contributor)

Yes please. Thanks :)

@mogren (Contributor) commented Aug 25, 2020

The PR has all the details but, in short, Calico assumes that the ENI (host veth) name is generated by using this exact formula:

h := sha1.New()
h.Write([]byte(fmt.Sprintf("%s.%s", namespace, podname)))
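
For context, here is a minimal, self-contained sketch of that naming scheme. The "eni" prefix and the 11-character truncation of the hex digest are assumptions inferred from the interface names in this issue (e.g. eni467ad56930c), not taken from Calico's source:

package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// expectedVethName sketches how Calico derives the host-side interface name it
// expects for a workload: hash "<namespace>.<podname>" and keep a short prefix
// of the hex digest. The "eni" prefix and the truncation length are assumptions
// based on the interface names in this issue.
func expectedVethName(namespace, podname string) string {
	h := sha1.New()
	h.Write([]byte(fmt.Sprintf("%s.%s", namespace, podname)))
	return "eni" + hex.EncodeToString(h.Sum(nil))[:11]
}

func main() {
	fmt.Println(expectedVethName("kube-system", "shell-node-debug"))
}

If the CNI derives the host veth name some other way (as v1.7.0 apparently did), the name Calico computes won't match the interface that actually carries the pod's /32 route, which would explain why Felix treats that route as "unexpected" and removes it once the cleanup grace period expires.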

@mogren (Contributor) commented Aug 26, 2020

@kgtw v1.7.1 is out, please let us know if this fixes the issue for you.

@kgtw (Contributor, Author) commented Aug 28, 2020

@mogren - Yes, this did fix the issue. Thanks!

@jayanthvn (Contributor)

@kgtw - Thanks for verifying this :)
