Pod cannot ping each other in multi-host scenario - failed to add vxlanRoute (XXX -> X.Y.0.0): invalid argument #844

Closed
senwangrockets opened this issue Oct 18, 2017 · 21 comments

Comments

@senwangrockets

senwangrockets commented Oct 18, 2017

Pods on different hosts cannot ping each other.
The flannel logs are below:

I1018 17:58:53.498781       1 main.go:470] Determining IP address of default interface
I1018 17:58:53.499196       1 main.go:483] Using interface with name eth0 and address 172.28.249.156
I1018 17:58:53.499243       1 main.go:500] Defaulting external address to interface address (172.28.249.156)
I1018 17:58:53.517275       1 kube.go:130] Waiting 10m0s for node controller to sync
I1018 17:58:53.517332       1 kube.go:283] Starting kube subnet manager
I1018 17:58:54.517591       1 kube.go:137] Node controller sync successful
I1018 17:58:54.517652       1 main.go:235] Created subnet manager: Kubernetes Subnet Manager - scarif-admin-2
I1018 17:58:54.517661       1 main.go:238] Installing signal handlers
I1018 17:58:54.517821       1 main.go:348] Found network config - Backend type: vxlan
I1018 17:58:54.517912       1 vxlan.go:119] VXLAN config: VNI=1 Port=0 GBP=false DirectRouting=false
I1018 17:58:54.573370       1 main.go:295] Wrote subnet file to /run/flannel/subnet.env
I1018 17:58:54.573408       1 main.go:299] Running backend.
I1018 17:58:54.573427       1 main.go:317] Waiting for all goroutines to exit
I1018 17:58:54.573496       1 vxlan_network.go:56] watching for new subnet leases
E1018 17:58:54.573780       1 vxlan_network.go:158] failed to add vxlanRoute (172.16.0.0/24 -> 172.16.0.0): invalid argument
I1018 17:58:54.577620       1 ipmasq.go:75] Some iptables rules are missing; deleting and recreating rules
I1018 17:58:54.577673       1 ipmasq.go:97] Deleting iptables rule: -s 172.16.0.0/16 -d 172.16.0.0/16 -j RETURN
I1018 17:58:54.579324       1 ipmasq.go:97] Deleting iptables rule: -s 172.16.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE
I1018 17:58:54.580870       1 ipmasq.go:97] Deleting iptables rule: ! -s 172.16.0.0/16 -d 172.16.1.0/24 -j RETURN
I1018 17:58:54.582349       1 ipmasq.go:97] Deleting iptables rule: ! -s 172.16.0.0/16 -d 172.16.0.0/16 -j MASQUERADE
I1018 17:58:54.583900       1 ipmasq.go:85] Adding iptables rule: -s 172.16.0.0/16 -d 172.16.0.0/16 -j RETURN
I1018 17:58:54.587553       1 ipmasq.go:85] Adding iptables rule: -s 172.16.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE
I1018 17:58:54.591290       1 ipmasq.go:85] Adding iptables rule: ! -s 172.16.0.0/16 -d 172.16.1.0/24 -j RETURN
I1018 17:58:54.595032       1 ipmasq.go:85] Adding iptables rule: ! -s 172.16.0.0/16 -d 172.16.0.0/16 -j MASQUERADE

Your Environment

  • Flannel version: 0.9
  • Backend used (e.g. vxlan or udp): vxlan
  • Etcd version:
  • Kubernetes version (if used): 1.8
  • Operating System and version: CentOS 7.3, Docker 17.06
@senwangrockets
Author

What I think is interesting is this line:

E1018 17:58:54.573780 1 vxlan_network.go:158] failed to add vxlanRoute (172.16.0.0/24 -> 172.16.0.0): invalid argument

@tomdee
Contributor

tomdee commented Oct 20, 2017

Yes, that line is the smoking gun. What other nodes do you have? Can you output the flannel annotation you have on your nodes (something like kubectl get nodes -o yaml |grep flannel.alpha).

Somehow, I think one of your nodes has a PublicIP of 172.16.0.0, which it shouldn't have. The 172.16/16 range should be reserved for the vxlan network.
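
For anyone who wants to automate the check suggested above, here is a minimal sketch. It assumes you have saved the output of kubectl get nodes -o yaml |grep flannel.alpha as text and that you know the flannel network CIDR. The function name and the sample data (including the 172.16.0.0 public-ip entry) are hypothetical and only illustrate the idea.

```python
import ipaddress
import re
from typing import List

# Matches lines such as: flannel.alpha.coreos.com/public-ip: 10.1.130.165
PUBLIC_IP = re.compile(r'flannel\.alpha\.coreos\.com/public-ip:\s*"?([0-9.]+)')


def public_ips_inside_network(grep_output: str, flannel_network: str) -> List[str]:
    """Return every public-ip annotation value that lies inside flannel_network."""
    network = ipaddress.ip_network(flannel_network)
    return [
        ip for ip in PUBLIC_IP.findall(grep_output)
        if ipaddress.ip_address(ip) in network
    ]


# Hypothetical sample: the second public-ip is made up to show a bad annotation.
sample = """\
      flannel.alpha.coreos.com/backend-type: vxlan
      flannel.alpha.coreos.com/public-ip: 172.28.249.156
      flannel.alpha.coreos.com/backend-type: vxlan
      flannel.alpha.coreos.com/public-ip: 172.16.0.0
"""

print(public_ips_inside_network(sample, "172.16.0.0/16"))  # ['172.16.0.0']
```

An empty list means no node advertises a public IP inside the pod network, in which case the cause is probably something else.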

@camflan

camflan commented Oct 24, 2017

I have a similar issue, same versions of flannel, k8s. Using vxlan, flannel is up and running, no errors in the logs (not even the error above).

kubeadm 1.8.1
k8s 1.8.0
flannel 0.9
ubuntu 16.04
docker 17.03ce

I've tried combinations of k8s as far back as 1.6 and flannel as far back as 0.8, all with the same results.

I'm able to connect pod <-> pod and host <-> pod as long as the pods are on that host. All hosts can communicate with each other without issues. I've spent almost a month fiddling with iptables, routes, etc and cannot figure this out. I'm seeing traffic via tcpdump on the cni0 bridge, but my pods aren't getting it. IIRC, last night I was using iptstate and was seeing udp traffic on the bridge when I expected tcp. Maybe this is the issue? It's also possible I was seeing something else...

Should I open another ticket, or piggyback on this one?

@jhorwit2

It seems I'm running into the same issue.

I1026 22:38:06.797811     208 vxlan_network.go:56] watching for new subnet leases
I1026 22:38:06.800429     208 ipmasq.go:75] Some iptables rules are missing; deleting and recreating rules
I1026 22:38:06.800450     208 ipmasq.go:97] Deleting iptables rule: -s 172.17.0.0/16 -d 172.17.0.0/16 -j RETURN
I1026 22:38:06.801507     208 ipmasq.go:97] Deleting iptables rule: -s 172.17.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE
I1026 22:38:06.802527     208 ipmasq.go:97] Deleting iptables rule: ! -s 172.17.0.0/16 -d 172.17.9.0/24 -j RETURN
I1026 22:38:06.803535     208 ipmasq.go:97] Deleting iptables rule: ! -s 172.17.0.0/16 -d 172.17.0.0/16 -j MASQUERADE
I1026 22:38:06.804543     208 ipmasq.go:85] Adding iptables rule: -s 172.17.0.0/16 -d 172.17.0.0/16 -j RETURN
I1026 22:38:06.806706     208 ipmasq.go:85] Adding iptables rule: -s 172.17.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE
I1026 22:38:06.808932     208 ipmasq.go:85] Adding iptables rule: ! -s 172.17.0.0/16 -d 172.17.9.0/24 -j RETURN
I1026 22:38:06.811148     208 ipmasq.go:85] Adding iptables rule: ! -s 172.17.0.0/16 -d 172.17.0.0/16 -j MASQUERADE
E1026 22:38:11.064786     208 vxlan_network.go:158] failed to add vxlanRoute (172.17.0.0/24 -> 172.17.0.0): invalid argument
E1027 02:51:24.265565     208 vxlan_network.go:158] failed to add vxlanRoute (172.17.0.0/24 -> 172.17.0.0): invalid argument

@tomdee none of my nodes have that as the public ip annotation (they're all correct).

@jhorwit2

jhorwit2 commented Oct 27, 2017

I don't see a route for 172.17.0.0/24 on any of my hosts.

$ ip route
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
172.17.1.0/24 via 172.17.1.0 dev flannel.1 onlink
172.17.2.0/24 via 172.17.2.0 dev flannel.1 onlink
172.17.3.0/24 via 172.17.3.0 dev flannel.1 onlink
172.17.4.0/24 via 172.17.4.0 dev flannel.1 onlink
172.17.5.0/24 via 172.17.5.0 dev flannel.1 onlink
172.17.6.0/24 via 172.17.6.0 dev flannel.1 onlink
172.17.7.0/24 via 172.17.7.0 dev flannel.1 onlink
172.17.8.0/24 via 172.17.8.0 dev flannel.1 onlink
172.17.9.2 dev cali299270d87b6 scope link
172.17.9.3 dev calib63aee49779 scope link
172.17.9.4 dev cali12d4a061371 scope link
$ arp -a
...
? (172.17.0.0) at <incomplete> on flannel.1
...

Flannel logs

I1027 12:53:29.439503     166 vxlan_network.go:138] adding subnet: 172.17.0.0/24 PublicIP: 10.65.27.18 VtepMAC: 46:ee:d0:82:55:a4
I1027 12:53:29.439524     166 device.go:179] calling AddARP: 172.17.0.0, 46:ee:d0:82:55:a4
I1027 12:53:29.439591     166 device.go:156] calling AddFDB: <hostip>, 46:ee:d0:82:55:a4
E1027 12:53:29.439668     166 vxlan_network.go:158] failed to add vxlanRoute (172.17.0.0/24 -> 172.17.0.0): invalid argument
I1027 12:53:29.439706     166 device.go:190] calling DelARP: 172.17.0.0, 46:ee:d0:82:55:a4
I1027 12:53:29.439751     166 device.go:168] calling DelFDB: <hostip>, 46:ee:d0:82:55:a4
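
A rough sketch of how the log line and the ip route output above can be cross-checked: it extracts the gateway from the "failed to add vxlanRoute" message and lists directly connected routes on other devices that already cover that gateway. The function name and the parsing rules are assumptions for illustration, not anything flannel itself does.

```python
import ipaddress
import re

# Extracts subnet and gateway from: failed to add vxlanRoute (A.B.C.D/N -> A.B.C.D)
ROUTE_ERROR = re.compile(r"failed to add vxlanRoute \(([0-9./]+) -> ([0-9.]+)\)")


def covering_routes(error_line, ip_route_output, vxlan_dev="flannel.1"):
    """List (prefix, device) pairs of on-link routes that contain the failed gateway."""
    match = ROUTE_ERROR.search(error_line)
    if match is None:
        return []
    gateway = ipaddress.ip_address(match.group(2))
    hits = []
    for line in ip_route_output.splitlines():
        fields = line.split()
        # Only consider "<prefix> dev <name> ..." routes on devices other than the vxlan one.
        if len(fields) < 3 or fields[1] != "dev" or fields[2] == vxlan_dev:
            continue
        try:
            network = ipaddress.ip_network(fields[0], strict=False)
        except ValueError:
            continue  # skip entries such as "default"
        if gateway in network:
            hits.append((fields[0], fields[2]))
    return hits


error = "E1027 12:53:29.439668 166 vxlan_network.go:158] failed to add vxlanRoute (172.17.0.0/24 -> 172.17.0.0): invalid argument"
routes = """\
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
172.17.1.0/24 via 172.17.1.0 dev flannel.1 onlink
172.17.9.2 dev cali299270d87b6 scope link
"""

print(covering_routes(error, routes))  # [('172.17.0.0/16', 'docker0')]
```

With the data from this comment, the only hit is the docker0 route, which suggests looking at whether the Docker bridge and the flannel network share a range.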

@DominicDV

DominicDV commented Oct 27, 2017

I had this error too when transitioning from 1.7.5 to 1.8.2.
A reboot solved it for me.
(For completeness: prior to this I deleted the fstab swap entry, because kubelet requires that the system doesn't swap. Not sure if this is related.)

@tomdee tomdee changed the title Pod cannot ping each other in multi-host scenario Pod cannot ping each other in multi-host scenario - failed to add vxlanRoute (XXX -> X.Y.0.0): invalid argument Nov 4, 2017
@tomdee
Contributor

tomdee commented Nov 4, 2017

@camflan please open a different issue. I suspect you just need "iptables -P FORWARD ACCEPT"

@tomdee
Contributor

tomdee commented Nov 4, 2017

@jhorwit2 @senwangrockets I think the problem could be that you have the same IP range configured for your Docker bridge as you do for flannel. If you're using kubeadm, did you specify --pod-network-cidr 10.244.0.0/16?
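
A minimal sketch of the overlap check described above, using Python's standard ipaddress module. The function name is hypothetical; the addresses are taken from the ip route and iptables output posted earlier in this thread.

```python
import ipaddress


def describe_conflict(docker_bridge_cidr, flannel_network_cidr):
    """Report whether the Docker bridge range and the flannel network overlap."""
    # strict=False lets us pass an interface address such as 172.17.0.1/16.
    docker_net = ipaddress.ip_network(docker_bridge_cidr, strict=False)
    flannel_net = ipaddress.ip_network(flannel_network_cidr, strict=False)
    if docker_net.overlaps(flannel_net):
        return f"conflict: {docker_net} overlaps {flannel_net}"
    return f"ok: {docker_net} and {flannel_net} are disjoint"


# docker0 is 172.17.0.1/16 in the output above; flannel was also using 172.17.0.0/16.
print(describe_conflict("172.17.0.1/16", "172.17.0.0/16"))
# With the 10.244.0.0/16 pod network the two ranges no longer collide.
print(describe_conflict("172.17.0.1/16", "10.244.0.0/16"))
```

If the first form of the message is printed for your own values, either the Docker bridge range or the pod network CIDR needs to change.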

@jhorwit2

jhorwit2 commented Nov 4, 2017

@tomdee that was my issue. Sorry I forgot to post after I realized that.

@tomdee tomdee closed this as completed Nov 8, 2017
@kumarganesh2814

@tomdee
Hi Tom,

I initialised my cluster with the same kubeadm command:
kubeadm init --pod-network-cidr 10.244.0.0/16
But I still see errors in the flannel pods:

E1210 07:10:45.198903 1 vxlan_network.go:158] failed to add vxlanRoute (10.244.2.0/24 -> 10.244.2.0): invalid argument

I have a 4-host cluster. Two of the hosts work fine, but the other two fail to schedule containers.

Pods on those hosts are always stuck in the "ContainerCreating" state.

The errors I see are:

Dec 10 01:39:14 kongapi-poc-db1 kubelet: E1210 01:39:14.554032   58034 cni.go:250] Error while adding to cni network: "cni0" already has an IP address different from 10.244.3.1/24
Dec 10 01:39:14 kongapi-poc-db1 kernel: cni0: port 1(veth7b12c96f) entered disabled state
Dec 10 01:39:14 kongapi-poc-db1 kernel: device veth7b12c96f left promiscuous mode
Dec 10 01:39:14 kongapi-poc-db1 kernel: cni0: port 1(veth7b12c96f) entered disabled state
Dec 10 01:39:14 kongapi-poc-db1 NetworkManager[702]: <info>  [1512898754.6477] device (veth7b12c96f): released from master device cni0
Dec 10 01:39:14 kongapi-poc-db1 kubelet: E1210 01:39:14.655974   58034 remote_runtime.go:92] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "tomcat-d6b5b9647-prq9w_tomcat" network: "cni0" already has an IP address different from 10.244.3.1/24
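
The kubelet error above says cni0 carries an address that differs from the one flannel leased to the node. A small sketch of that comparison follows. It assumes the subnet file (/run/flannel/subnet.env, as named in the flannel logs earlier in the thread) contains KEY=VALUE lines including FLANNEL_SUBNET; the sample file contents and the stale 10.244.1.1/24 address are hypothetical.

```python
import ipaddress


def parse_subnet_env(text):
    """Parse KEY=VALUE lines (assumed format of flannel's subnet.env) into a dict."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def cni0_matches_lease(cni0_cidr, subnet_env_text):
    """True if the cni0 address/prefix equals FLANNEL_SUBNET from subnet.env."""
    expected = parse_subnet_env(subnet_env_text).get("FLANNEL_SUBNET")
    if expected is None:
        return False
    return ipaddress.ip_interface(cni0_cidr) == ipaddress.ip_interface(expected)


# Hypothetical file contents matching the 10.244.3.1/24 lease from the error message.
sample_env = """\
FLANNEL_NETWORK=10.244.0.0/16
FLANNEL_SUBNET=10.244.3.1/24
"""

print(cni0_matches_lease("10.244.3.1/24", sample_env))  # True
print(cni0_matches_lease("10.244.1.1/24", sample_env))  # False: stale bridge address
```

The cni0 address itself can be read with ip addr show cni0 on the affected host.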

@eroji

eroji commented Feb 9, 2018

Having the same problem. I have 4 nodes: 2 masters and 2 workers. The .167 and .168 nodes are the workers, and .167 is the one that's having issues adding the route.

Output of: kubectl get nodes -o yaml |grep flannel.alpha

      flannel.alpha.coreos.com/backend-data: '{"VtepMAC":"d2:28:18:cd:1d:82"}'
      flannel.alpha.coreos.com/backend-type: vxlan
      flannel.alpha.coreos.com/kube-subnet-manager: "true"
      flannel.alpha.coreos.com/public-ip: 10.1.130.165
      flannel.alpha.coreos.com/backend-data: '{"VtepMAC":"b6:67:12:1c:d9:c4"}'
      flannel.alpha.coreos.com/backend-type: vxlan
      flannel.alpha.coreos.com/kube-subnet-manager: "true"
      flannel.alpha.coreos.com/public-ip: 10.1.130.166
      flannel.alpha.coreos.com/backend-data: '{"VtepMAC":"aa:e0:31:6e:d1:ef"}'
      flannel.alpha.coreos.com/backend-type: vxlan
      flannel.alpha.coreos.com/kube-subnet-manager: "true"
      flannel.alpha.coreos.com/public-ip: 10.1.130.167
      flannel.alpha.coreos.com/backend-data: '{"VtepMAC":"16:13:d5:7c:c5:e2"}'
      flannel.alpha.coreos.com/backend-type: vxlan
      flannel.alpha.coreos.com/kube-subnet-manager: "true"
      flannel.alpha.coreos.com/public-ip: 10.1.130.168

@BSWANG
Contributor

BSWANG commented Mar 1, 2018

Are the invalid gateway addresses treated as multicast addresses by Linux?
The subnet allocation in flannel skips the multicast addresses: https://github.com/coreos/flannel/blob/master/subnet/config.go#L86-L93. But when flannel uses the podCidr allocated by the "controller manager", the first subnet is not skipped.

@tomdee

@nabheet

nabheet commented Dec 3, 2018

Quoting @kumarganesh2814's comment above (kubeadm init --pod-network-cidr 10.244.0.0/16, followed by: "cni0" already has an IP address different from 10.244.3.1/24):

I am not sure if this will help, but you might want to delete all the network/bridge devices before initializing k8s again. I had a similar issue, and destroying the VMs and creating new ones resolved it. However, our issues might not be the same.

After reading the flannel documentation, it was not obvious to me that flannel works with only one CIDR. After the change things are much better, although I now have other issues.

@leogoing

@senwangrockets @kumarganesh2814, I have the same problem. Have you solved it?

@Voxis

Voxis commented Oct 31, 2020

I had the same problem; here is how I resolved it. I have a setup with 1 master and 2 worker nodes, all of them VMs. They have fixed IPs and hostnames on my local area network. The master and one worker node are OK; the other worker node has this problem.

When I see something like this: vxlan_network.go:158] failed to add vxlanRoute (10.244.2.0/24 -> 10.244.2.0): invalid argument, I log onto that machine and check the IP address of cni0, since it could be a different address. You could delete the interface and let the cluster regenerate it. In my case, though, I realized the flannel.1 interface had not been created.

So I deleted the node, manually deleted the associated pods from the master, ran kubeadm reset on the problematic worker node, and rejoined, but flannel.1 never appeared. In the end, I deleted the node from the master and did a reset again. After I restarted the VM and joined the master as normal, flannel.1 appeared. I then created a deployment from the master, and cni0 and the veth interfaces appeared on the worker node.

TL;DR (not sure whether it will work for you): delete the worker node from the master, run kubeadm reset on the worker node, clean up, restart the VM, and join the master as normal.

@Queetinliu

I also faced this problem. In my case it was because the hosts could not reach each other over the network interface flanneld was using. I switched to another network interface and that solved it.

@rthamrin

Mine is weird: one node has flannel.alpha.coreos.com/public-ip: 10.0.3.15. This is my master, and now my master cannot ping the flannel interfaces on the other nodes. What actually happened here, and how do I edit the flannel.alpha annotation on my master?

kubectl get nodes -o yaml |grep flannel.alpha

      flannel.alpha.coreos.com/backend-data: '{"VNI":1,"VtepMAC":"16:cb:5c:78:57:cb"}'
      flannel.alpha.coreos.com/backend-type: vxlan
      flannel.alpha.coreos.com/kube-subnet-manager: "true"
      flannel.alpha.coreos.com/public-ip: 192.168.14.3
      flannel.alpha.coreos.com/backend-data: '{"VNI":1,"VtepMAC":"7e:1e:e8:f6:8f:77"}'
      flannel.alpha.coreos.com/backend-type: vxlan
      flannel.alpha.coreos.com/kube-subnet-manager: "true"
      flannel.alpha.coreos.com/public-ip: 192.168.14.4
      flannel.alpha.coreos.com/backend-data: '{"VNI":1,"VtepMAC":"06:cd:6a:ba:6b:54"}'
      flannel.alpha.coreos.com/backend-type: vxlan
      flannel.alpha.coreos.com/kube-subnet-manager: "true"
      flannel.alpha.coreos.com/public-ip: 10.0.3.15
      flannel.alpha.coreos.com/backend-data: '{"VNI":1,"VtepMAC":"96:71:0e:48:52:4d"}'
      flannel.alpha.coreos.com/backend-type: vxlan
      flannel.alpha.coreos.com/kube-subnet-manager: "true"
      flannel.alpha.coreos.com/public-ip: 192.168.14.2

@dale1202

Check whether flannel.1's IP conflicts with docker0's IP. If it does, change the subnet's IP range.

@rthamrin

"Check whether flannel.1's IP conflicts with docker0's IP. If it does, change the subnet's IP range."

Sorry, who is your answer addressed to?

@dale1202

@rthamrin I was answering this question: "failed to add vxlanRoute (10.244.2.0/24 -> 10.244.2.0): invalid argument"

@stale

stale bot commented Jan 25, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jan 25, 2023
@stale stale bot closed this as completed Feb 15, 2023