flannel not creating cni or veth interfaces on one node #1039

Open
sgmacdougall opened this Issue Sep 18, 2018 · 10 comments


sgmacdougall commented Sep 18, 2018

I have a Kubernetes environment created manually in vSphere with three worker nodes. I've installed flannel using the YAML from here:

https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml

Except that I changed the backend from vxlan to host-gw. Vxlan didn't work in my environment, probably because I don't have distributed switching configured on the ESXi hosts.

Expected Behavior

Each node should have a cni0 interface with an IP address derived from the pod CIDR as well as several veth interfaces. Routing tables should update to reflect the routes to the pod CIDRs on the other nodes. Here's output from node #2:

ifconfig
cni0 Link encap:Ethernet HWaddr ae:85:74:b8:f1:4e
inet addr:10.244.1.1 Bcast:0.0.0.0 Mask:255.255.255.0
inet6 addr: fe80::ac85:74ff:feb8:f14e/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:308224 errors:0 dropped:0 overruns:0 frame:0
TX packets:296430 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:24686474 (24.6 MB) TX bytes:1385790869 (1.3 GB)

ens160 Link encap:Ethernet HWaddr 00:50:56:aa:5f:2d
inet addr:10.180.11.195 Bcast:10.180.11.255 Mask:255.255.255.0
inet6 addr: fe80::250:56ff:feaa:5f2d/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:1250386 errors:0 dropped:175 overruns:0 frame:0
TX packets:418707 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:1545089936 (1.5 GB) TX bytes:40886026 (40.8 MB)

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:194 errors:0 dropped:0 overruns:0 frame:0
TX packets:194 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1
RX bytes:15726 (15.7 KB) TX bytes:15726 (15.7 KB)

veth04e19e3b Link encap:Ethernet HWaddr 02:28:ed:e3:64:37
inet6 addr: fe80::28:edff:fee3:6437/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:71015 errors:0 dropped:0 overruns:0 frame:0
TX packets:75305 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:6143154 (6.1 MB) TX bytes:6847736 (6.8 MB)

veth40da2910 Link encap:Ethernet HWaddr da:d2:a8:33:90:0b
inet6 addr: fe80::d8d2:a8ff:fe33:900b/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:237209 errors:0 dropped:0 overruns:0 frame:0
TX packets:221156 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:22858456 (22.8 MB) TX bytes:1378945435 (1.3 GB)

Kernel IP routing table
Destination Gateway Genmask Flags MSS Window irtt Iface
0.0.0.0 10.180.11.1 0.0.0.0 UG 0 0 0 ens160
10.180.11.0 0.0.0.0 255.255.255.0 U 0 0 0 ens160
10.244.0.0 10.180.11.194 255.255.255.0 UG 0 0 0 ens160
10.244.1.0 0.0.0.0 255.255.255.0 U 0 0 0 cni0
10.244.2.0 10.180.11.196 255.255.255.0 UG 0 0 0 ens160

And here's node 3:

ifconfig
cni0 Link encap:Ethernet HWaddr c6:af:b6:bd:19:df
inet addr:10.244.2.1 Bcast:0.0.0.0 Mask:255.255.255.0
inet6 addr: fe80::c4af:b6ff:febd:19df/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:276735 errors:0 dropped:0 overruns:0 frame:0
TX packets:314344 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:43575778 (43.5 MB) TX bytes:63961863 (63.9 MB)

ens160 Link encap:Ethernet HWaddr 00:50:56:aa:ab:b4
inet addr:10.180.11.196 Bcast:10.180.11.255 Mask:255.255.255.0
inet6 addr: fe80::250:56ff:feaa:abb4/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:569322 errors:0 dropped:158 overruns:0 frame:0
TX packets:284413 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:494184506 (494.1 MB) TX bytes:346410379 (346.4 MB)

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:194 errors:0 dropped:0 overruns:0 frame:0
TX packets:194 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1
RX bytes:15726 (15.7 KB) TX bytes:15726 (15.7 KB)

veth64075cea Link encap:Ethernet HWaddr 3a:64:8e:da:96:eb
inet6 addr: fe80::3864:8eff:feda:96eb/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:78998 errors:0 dropped:0 overruns:0 frame:0
TX packets:89833 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:15257080 (15.2 MB) TX bytes:10282634 (10.2 MB)

veth9564535e Link encap:Ethernet HWaddr e6:44:18:02:cd:5f
inet6 addr: fe80::e444:18ff:fe02:cd5f/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:197737 errors:0 dropped:0 overruns:0 frame:0
TX packets:224543 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:32192988 (32.1 MB) TX bytes:53681601 (53.6 MB)

Kernel IP routing table
Destination Gateway Genmask Flags MSS Window irtt Iface
0.0.0.0 10.180.11.1 0.0.0.0 UG 0 0 0 ens160
10.180.11.0 0.0.0.0 255.255.255.0 U 0 0 0 ens160
10.244.0.0 10.180.11.194 255.255.255.0 UG 0 0 0 ens160
10.244.1.0 10.180.11.195 255.255.255.0 UG 0 0 0 ens160
10.244.2.0 0.0.0.0 255.255.255.0 U 0 0 0 cni0

Current Behavior

Node one does not have cni0 or veth interfaces; however, its routing table has the routes to the other nodes:

ens160 Link encap:Ethernet HWaddr 00:50:56:aa:f4:35
inet addr:10.180.11.194 Bcast:10.180.11.255 Mask:255.255.255.0
inet6 addr: fe80::250:56ff:feaa:f435/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:477072 errors:0 dropped:168 overruns:0 frame:0
TX packets:212542 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:452413052 (452.4 MB) TX bytes:286003380 (286.0 MB)

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:194 errors:0 dropped:0 overruns:0 frame:0
TX packets:194 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1
RX bytes:15726 (15.7 KB) TX bytes:15726 (15.7 KB)

Kernel IP routing table
Destination Gateway Genmask Flags MSS Window irtt Iface
0.0.0.0 10.180.11.1 0.0.0.0 UG 0 0 0 ens160
10.180.11.0 0.0.0.0 255.255.255.0 U 0 0 0 ens160
10.244.1.0 10.180.11.195 255.255.255.0 UG 0 0 0 ens160
10.244.2.0 10.180.11.196 255.255.255.0 UG 0 0 0 ens160

Possible Solution

Unknown

Steps to Reproduce (for bugs)

  1. kubelet.service on node 1:

[Unit]
Description=Kubernetes Kubelet
Documentation=https://github.com/kubernetes/kubernetes
After=cri-containerd.service
Requires=containerd.service

[Service]
ExecStart=/usr/local/bin/kubelet
--node-ip=10.180.11.194
--allow-privileged=true
--anonymous-auth=false
--authorization-mode=Webhook
--client-ca-file=/var/lib/kubernetes/ca.pem
--cloud-provider=
--cluster-dns=10.32.0.10
--cluster-domain=cluster.local
--container-runtime=remote
--container-runtime-endpoint=unix:///var/run/containerd/containerd.sock
--network-plugin=cni
--pod-cidr=10.244.0.0/24
--image-pull-progress-deadline=2m
--kubeconfig=/var/lib/kubelet/kubeconfig
--register-node=true
--runtime-request-timeout=15m
--tls-cert-file=/var/lib/kubelet/10.180.11.194.pem
--tls-private-key-file=/var/lib/kubelet/10.180.11.194-key.pem
--v=2
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

  2. kubelet.service on node 2:

[Unit]
Description=Kubernetes Kubelet
Documentation=https://github.com/kubernetes/kubernetes
After=cri-containerd.service
Requires=containerd.service

[Service]
ExecStart=/usr/local/bin/kubelet
--node-ip=10.180.11.195
--allow-privileged=true
--anonymous-auth=false
--authorization-mode=Webhook
--client-ca-file=/var/lib/kubernetes/ca.pem
--cloud-provider=
--cluster-dns=10.32.0.10
--cluster-domain=cluster.local
--container-runtime=remote
--container-runtime-endpoint=unix:///var/run/containerd/containerd.sock
--image-pull-progress-deadline=2m
--kubeconfig=/var/lib/kubelet/kubeconfig
--network-plugin=cni
--pod-cidr=10.244.1.0/24
--register-node=true
--runtime-request-timeout=15m
--tls-cert-file=/var/lib/kubelet/10.180.11.195.pem
--tls-private-key-file=/var/lib/kubelet/10.180.11.195-key.pem
--v=2
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

  3. kubelet.service on node 3:

[Unit]
Description=Kubernetes Kubelet
Documentation=https://github.com/kubernetes/kubernetes
After=cri-containerd.service
Requires=containerd.service

[Service]
ExecStart=/usr/local/bin/kubelet
--node-ip=10.180.11.196
--allow-privileged=true
--anonymous-auth=false
--authorization-mode=Webhook
--client-ca-file=/var/lib/kubernetes/ca.pem
--cloud-provider=
--cluster-dns=10.32.0.10
--cluster-domain=cluster.local
--container-runtime=remote
--container-runtime-endpoint=unix:///var/run/containerd/containerd.sock
--image-pull-progress-deadline=2m
--kubeconfig=/var/lib/kubelet/kubeconfig
--network-plugin=cni
--pod-cidr=10.244.2.0/24
--register-node=true
--runtime-request-timeout=15m
--tls-cert-file=/var/lib/kubelet/10.180.11.196.pem
--tls-private-key-file=/var/lib/kubelet/10.180.11.196-key.pem
--v=2
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

  4. kube-controller-manager.service on the master:

[Unit]
Description=Kubernetes Controller Manager
Documentation=https://github.com/kubernetes/kubernetes

[Service]
ExecStart=/usr/local/bin/kube-controller-manager
--address=0.0.0.0
--cluster-cidr=10.244.0.0/16
--cluster-name=kubernetes
--cluster-signing-cert-file=/var/lib/kubernetes/ca.pem
--cluster-signing-key-file=/var/lib/kubernetes/ca-key.pem
--leader-elect=true
--master=http://127.0.0.1:8080
--root-ca-file=/var/lib/kubernetes/ca.pem
--service-account-private-key-file=/var/lib/kubernetes/ca-key.pem
--service-cluster-ip-range=10.32.0.0/24
--v=2
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

  5. The PodCIDR values weren't picked up from the --pod-cidr setting in the kubelet.service files, so I manually added them using:

kubectl patch node <NODE_NAME> -p '{"spec":{"podCIDR":""}}'

Here's the output from the kubectl get nodes -o jsonpath='{.items[*].spec.podCIDR}' command showing that the PodCIDRs are correct now:

kubectl get nodes -o jsonpath='{.items[*].spec.podCIDR}'
10.244.0.0/24 10.244.1.0/24 10.244.2.0/24
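
The pattern, one patch per node matching that node's --pod-cidr flag, looked roughly like this (the node names below are placeholders):

kubectl patch node worker-1 -p '{"spec":{"podCIDR":"10.244.0.0/24"}}'
kubectl patch node worker-2 -p '{"spec":{"podCIDR":"10.244.1.0/24"}}'
kubectl patch node worker-3 -p '{"spec":{"podCIDR":"10.244.2.0/24"}}'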

  6. Updated kube-flannel.yml to use host-gw:

net-conf.json: |
  {
    "Network": "10.244.0.0/16",
    "Backend": {
      "Type": "host-gw"
    }
  }
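
Note that flannel only reads net-conf.json at startup, so after changing the backend the ConfigMap has to be re-applied and the running flannel pods recreated. Roughly (label selector as in the stock manifest):

kubectl apply -f kube-flannel.yml
kubectl -n kube-system delete pods -l app=flannel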

  7. Here's the /etc/cni/net.d/10-flannel.conflist which flannel created. It's identical on each node:

{
  "name": "cbr0",
  "plugins": [
    {
      "type": "flannel",
      "delegate": {
        "hairpinMode": true,
        "isDefaultGateway": true
      }
    },
    {
      "type": "portmap",
      "capabilities": {
        "portMappings": true
      }
    }
  ]
}

  8. And here's the subnet.env on the three nodes:

/run/flannel # cat subnet.env
FLANNEL_NETWORK=10.244.0.0/16
FLANNEL_SUBNET=10.244.0.1/24
FLANNEL_MTU=1500
FLANNEL_IPMASQ=true

/run/flannel # cat subnet.env
FLANNEL_NETWORK=10.244.0.0/16
FLANNEL_SUBNET=10.244.1.1/24
FLANNEL_MTU=1500
FLANNEL_IPMASQ=true

/run/flannel # cat subnet.env
FLANNEL_NETWORK=10.244.0.0/16
FLANNEL_SUBNET=10.244.2.1/24
FLANNEL_MTU=1500
FLANNEL_IPMASQ=true
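
For completeness, the leases can also be cross-checked from the API side, since with the kube subnet manager flannel annotates each node object. A sketch (label selector and annotation prefix are the ones used by flannel v0.10 and the stock manifest, and may vary by version):

kubectl -n kube-system get pods -l app=flannel -o wide
kubectl describe nodes | grep flannel.alpha.coreos.com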

Context

Trying to enable networking between the three workers.

Your Environment

  • Flannel version: flannel:v0.10.0-amd64
  • Backend used (e.g. vxlan or udp): host-gw
  • Etcd version: etcdctl version: 3.3.9
  • Kubernetes version (if used): v1.11.3
  • Operating System and version: Ubuntu 16.04.5 LTS
  • Link to your project (optional):

sgmacdougall commented Sep 18, 2018

I'd like to check etcd to see what the podCIDRs look like in there, but I don't know how to do that or if it's even possible.
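
As far as I understand, with the kube-flannel manifest flannel gets its leases from the Kubernetes API rather than from etcd directly, and Kubernetes stores node objects in etcd v3 as protobuf, so the readable place to look is the node spec. A sketch:

kubectl get nodes -o custom-columns=NAME:.metadata.name,PODCIDR:.spec.podCIDR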


MrEcco commented Oct 18, 2018

I do not know why this is happening, but I was able to reproduce this bug. Experimentally, I discovered that there is no /etc/cni/net.d/* configuration on the bad nodes.
A possible solution:
Copy /etc/cni/net.d/* from the master and manually place it on the bad nodes. The config is applied immediately and you can test the inter-cluster network.
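
A minimal sketch of that workaround (the hostname is a placeholder):

scp /etc/cni/net.d/10-flannel.conflist root@bad-node:/etc/cni/net.d/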


lanoxx commented Nov 5, 2018

I have one master and one worker node. For me, the cni0 interface is missing on the master node, while it is being created on the worker node. Flannel is running on both nodes and reports no errors but I cannot get any network traffic across the nodes using the overlay IPs because of the missing cni0 interface on the master node.


MrEcco commented Nov 5, 2018

Every time flannel does not work on a node, I run this (on the node):

mkdir -p /etc/cni/net.d
cd /etc/cni/net.d
# This is the zipped CNI config which should have been deployed by the flannel pod,
# but wasn't, for an unknown reason
cat << EOF | openssl base64 -d | xz -d > 10-flannel.conflist
/Td6WFoAAATm1rRGAgAhARwAAAAQz1jM4AEKAJFdAD2CgBccLouJnyT/6A8zPtZS
xLRFcjIbx3pn6UV/UpoPAEjPLRmPz8u5fwxtKGvSxeMWHNVeyJ2Vpb491DXaBjHk
hP/DcMJyv+4mJL330vZDjgFq9OUqbVG0Nx6n6BAMRfhEYAqrhEcyjIQJVsTAgWVi
ODNmTWnAm3vdSjAtesWbiM+PR2FP/IK0cGdsy1VvzDQAAAAAXN3PLZF7zbAAAa0B
iwIAAAkaa2KxxGf7AgAAAAAEWVo=
EOF

After this I see the flannel pods in Creating status:

watch -n1 kubectl get pods --all-namespaces

lanoxx commented Nov 5, 2018

@MrEcco This configuration is present under /etc/cni/net.d/10-flannel.conflist on my master node but still there is no cni0 interface.


lanoxx commented Nov 5, 2018

I just noticed that /var/lib/cni/ does not exist on my master node. Shouldn't that be created by flannel?


MrEcco commented Nov 5, 2018

It should.
I have this in a working cluster:

root@kube-master:/var/lib/cni# find .
./flannel
./flannel/<64_hex_symbols>
./flannel/<other_64_hex_symbols>
./networks
./networks/cbr0
./networks/cbr0/10.244.0.4
./networks/cbr0/last_reserved_ip.0
./networks/cbr0/10.244.0.5

Are you sure you turned off SELinux? Maybe you use custom iptables policies? Or is this a problem with the connection between datacenters? After #1039 (comment), do you see the flannel pods in the kube-system namespace? Are the nodes resolvable by their hostnames?
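
A sketch of those checks on a node (the remote hostname is a placeholder):

getenforce 2>/dev/null || echo "selinux not installed"
sudo iptables -S | head
getent hosts other-node-hostname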


lanoxx commented Nov 6, 2018

I am running this on Ubuntu 18.04, which does not have SELinux installed or enabled by default. I also did not add any iptables policies myself.

I can see that for each node a flannel pod is running:

ubuntu@ip-172-33-1-142:~$ kubectl get pods -n kube-system -o wide
NAME                                  READY   STATUS     RESTARTS   IP             NODE
kube-flannel-ds-amd64-knnmh           1/1     Running    0          172.33.1.142   ip-172-33-1-142   
kube-flannel-ds-amd64-vqp2v           1/1     Running    0          172.33.1.188   ip-172-33-1-188   
kube-flannel-ds-msgdj                 1/1     Running    0          172.33.1.188   ip-172-33-1-188   
kube-flannel-ds-xhjwk                 1/1     Running    0          172.33.1.142   ip-172-33-1-142   

The master (.142) and worker (.188) nodes can ping each other by IP and also by hostname.

On the master node there is no cni folder under /var/lib:

# on master node:
ubuntu@ip-172-33-1-142:~$ cd /var/lib/cni
-bash: cd: /var/lib/cni: No such file or directory

On the worker node the folder exists and has the flannel and networks subfolders, as in your find . output.


lanoxx commented Nov 7, 2018

I made some progress on this today. I had only one pod running on the master and it was configured with hostNetwork: true. As soon as I set this to hostNetwork: false and redeployed the pod, flannel started to create the cni0 interface.
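
For reference, the field in question (a minimal pod spec sketch; the name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  hostNetwork: false   # with true, the pod shares the node's network namespace, so no CNI call is made and no veth/cni0 gets created for it
  containers:
  - name: test
    image: nginx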

Now I have a cni0 interface on my master node, but I am unable to communicate across nodes using the overlay network.

My master has 10.244.0.0/24 while my worker node has 10.244.1.0/24. I can ping pods from my master node using the master's overlay subnet (e.g. 10.244.0.x) and I can ping pods from my worker node using the worker node's overlay subnet (e.g. 10.244.1.x). But I cannot get any traffic (e.g. pings or even HTTP) across the overlay network. So I cannot reach a pod's HTTP server on the worker node from my master node using the overlay IP of the pod.
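
One way to narrow this down with the vxlan backend (a sketch; the interface name and pod subnet are from my setup and may differ in yours):

sudo tcpdump -ni eth0 udp port 8472   # run on the worker while pinging a pod IP in 10.244.1.0/24 from the master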


lanoxx commented Nov 7, 2018

Solved that final issue too: port 8472 (UDP), which is needed for VXLAN, was not open in my AWS security group.
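
The fix was a rule along these lines (a sketch; the security group ID and source CIDR are placeholders):

aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol udp --port 8472 --cidr 172.33.1.0/24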
