join existing flannel network #435

Closed
mmkonrad opened this issue May 11, 2016 · 8 comments

mmkonrad commented May 11, 2016

Hello everyone,

I am working on two Ubuntu 14.04 AWS EC2 instances.
I have set up an etcd cluster as described here (even though it is old): https://xelatex.github.io/2015/10/10/Flannel-for-Docker-Overlay-Network/

The cluster works: raft succeeded, and when I set a key on one of the two nodes I can also retrieve it on the other one. I then set the flannel config:
etcdctl set /coreos.com/network/config '{"Network": "10.43.116.192/27", "Backend": { "Type": "vxlan"}} '

As mentioned, I can set it on one node and get it on the other one.

Here comes the issue:
I am able to start flannel as described in that post on my first node. But when I start it on the other one I receive the following error message:

I0511 14:23:14.371975 01809 main.go:120] Installing signal handlers
I0511 14:23:14.372081 01809 manager.go:133] Determining IP address of default interface
I0511 14:23:14.372248 01809 manager.go:163] Using 10.43.116.155 as external interface
I0511 14:23:14.372278 01809 manager.go:164] Using 10.43.116.155 as external endpoint
I0511 14:23:14.374031 01809 local_manager.go:179] Picking subnet in range 10.43.116.208 ... 10.43.116.208
E0511 14:23:14.374070 01809 network.go:106] failed to register network: failed to acquire lease: out of subnets

And when I add the --etcd-endpoints=http://10.43.116.137:4001 flag I receive the following output:

I0511 14:25:00.112768 01823 main.go:120] Installing signal handlers
I0511 14:25:00.112896 01823 manager.go:133] Determining IP address of default interface
I0511 14:25:00.113079 01823 manager.go:163] Using 10.43.116.155 as external interface
I0511 14:25:00.113113 01823 manager.go:164] Using 10.43.116.155 as external endpoint
E0511 14:25:00.113877 01823 network.go:106] failed to retrieve network config: client: etcd cluster is unavailable or misconfigured

And with port 2379 I get the same error as without the flag.

What am I doing wrong? Could this be related to the fact that I only have Linux kernel 3.13 instead of 3.16, which then interferes with vxlan? If so, why can I successfully set everything up as described in the guide on a single node?

Annotations:

  • I am aware of your aws-vpc backend. But I am just a minion myself and therefore don't have the required policies for my aws-role (e.g. ec2:CreateRoute)
Contributor

tomdee commented May 11, 2016

If you followed those instructions, then you're running the latest master code, right?

Your SubnetLen will be /28 so I'm assuming your first host grabbed the 10.43.116.192/28 subnet and from the logs, it looks like the 2nd host tried to get 10.43.116.208/28 but then failed to register it. Could you add a dump of (the relevant part of) your etcd so we can see if this is the case?

I'm a little concerned that there could be multiple distinct etcd servers running here. It's odd that you're not able to connect to etcd from the second host.

Author

mmkonrad commented May 12, 2016

@tomdee Since I used https://github.com/coreos/flannel.git I assumed it would be version 0.5.5.

In fact there are two etcd processes. Following the guide, I started an etcd process on each node. They found each other and raft succeeded. etcdctl member list then produces this output:

586dbf8b73b61d39: name=node1 peerURLs=http://10.43.116.137:2380 clientURLs=http://10.43.116.137:2379 isLeader=true
752ee4714249b716: name=node2 peerURLs=http://10.43.116.155:2380 clientURLs=http://10.43.116.155:2379 isLeader=false

What confuses me is the SubnetLen. I configured 10.43.116.192/27 as the network and assumed that the 27 bits would then be used to describe the networks. But I guess that's me lacking knowledge about networking, subnets and address spaces. I chose this network because all my nodes are within a VPC, in a subnet configured as 10.43.116.128/26.

Nevertheless, node1, which starts flannel first, has the following ifconfig output:

flannel.1 Link encap:Ethernet  HWaddr 5a:3e:21:8c:91:28
          inet addr:10.43.116.208  Bcast:0.0.0.0  Mask:255.255.255.224
          inet6 addr: fe80::583e:21ff:fe8c:9128/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:8951  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:8 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

with flannel output:

I0512 08:08:31.286710 01728 main.go:120] Installing signal handlers
I0512 08:08:31.286820 01728 manager.go:133] Determining IP address of default interface
I0512 08:08:31.286992 01728 manager.go:163] Using 10.43.116.137 as external interface
I0512 08:08:31.287019 01728 manager.go:164] Using 10.43.116.137 as external endpoint
I0512 08:08:31.294691 01728 local_manager.go:179] Picking subnet in range 10.43.116.208 ... 10.43.116.208
I0512 08:08:31.298252 01728 manager.go:246] Lease acquired: 10.43.116.208/28
I0512 08:08:31.298424 01728 network.go:58] Watching for L3 misses
I0512 08:08:31.298444 01728 network.go:66] Watching for new subnet leases

UPDATE: I see that the flannel config 10.43.116.192/27 matches Mask:255.255.255.224, which is why node1 picked 10.43.116.208.
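
For readers following the address arithmetic, here is a minimal sketch using Python's standard `ipaddress` module. It only restates values already given in the thread (the VPC subnet, the flannel `Network`, node1's address and the `flannel.1` address); the variable names are illustrative and not part of flannel.

```python
import ipaddress

# Values copied from the thread; the variable names are illustrative only.
vpc_subnet = ipaddress.ip_network("10.43.116.128/26")       # VPC subnet the hosts live in
flannel_network = ipaddress.ip_network("10.43.116.192/27")  # "Network" in the flannel config
node1 = ipaddress.ip_address("10.43.116.137")               # host address of node1
flannel_if = ipaddress.ip_address("10.43.116.208")          # address shown on flannel.1

# A /27 corresponds to the netmask that ifconfig reports for flannel.1.
print(flannel_network.netmask)        # 255.255.255.224
print(flannel_network.num_addresses)  # 32

# The flannel network sits directly after the VPC subnet and does not overlap it.
print(vpc_subnet.overlaps(flannel_network))  # False

# The host address belongs to the VPC subnet, the flannel.1 address to the flannel network.
print(node1 in vpc_subnet, flannel_if in flannel_network)  # True True
```

So the mask on `flannel.1` reflects the whole flannel network, while the lease itself (10.43.116.208/28 in the log) is a smaller block inside it.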

But I still don't get why I cannot join with node2.

@tomdee what dump exactly do you want? A real dump of all keys/values within etcd, or the logs?

@mmkonrad
Author

any idea?

Author

mmkonrad commented May 18, 2016

UPDATE:

In the meantime I have been reading the Docker Cookbook, whose networking chapter (chapter 3) contains a section (3.13) contributed by Eugene Yakubovich, who is (afaik) mainly responsible for flannel. He proposes using the --iface=IP_ADDRESS and --ip-masq flags. I applied these flags and changed the setup as proposed, so that only the master node runs etcd.

SETUP:

  • 2 aws ec2 ubuntu-14.04 instances
  • both are in a vpc and a subnet (flannel config is chosen so it does not conflict with vpc/subnet)
  • both are behind a proxy (is set in the /etc/default/docker file)
  • each having docker 1.11
  • each having latest flannel version
  • master has etcd-2.3.3

on master:

  • stop docker
  • start etcd with following settings:
etcd -name node1 \
-listen-peer-urls http://0.0.0.0:2380   \
-listen-client-urls http://0.0.0.0:2379,http://127.0.0.1:4001   \
-initial-advertise-peer-urls http://10.43.116.137:2380 \
-initial-cluster node1=http://10.43.116.137:2380  \
-initial-cluster-state new  \
-initial-cluster-token etcd-cluster   \
-advertise-client-urls http://10.43.116.137:2379
  • set network config as: etcdctl set /coreos.com/network/config '{"Network": "10.43.116.192/26", "Backend": { "Type": "vxlan"}}'
  • run flannel with: sudo ./bin/flanneld --iface=10.43.116.137 --ip-masq, which produces the following output:
I0518 07:43:21.062926 01884 main.go:120] Installing signal handlers
I0518 07:43:21.063175 01884 manager.go:163] Using 10.43.116.137 as external interface
I0518 07:43:21.063237 01884 manager.go:164] Using 10.43.116.137 as external endpoint
I0518 07:43:21.073439 01884 local_manager.go:179] Picking subnet in range 10.43.116.224 ... 10.43.116.224
I0518 07:43:21.074995 01884 ipmasq.go:47] Adding iptables rule: -s 10.43.116.192/26 -d 10.43.116.192/26 -j ACCEPT
I0518 07:43:21.078817 01884 ipmasq.go:47] Adding iptables rule: -s 10.43.116.192/26 ! -d 224.0.0.0/4 -j MASQUERADE
I0518 07:43:21.083156 01884 ipmasq.go:47] Adding iptables rule: ! -s 10.43.116.192/26 -d 10.43.116.192/26 -j MASQUERADE
I0518 07:43:21.085813 01884 manager.go:246] Lease acquired: 10.43.116.224/27
I0518 07:43:21.085936 01884 network.go:58] Watching for L3 misses
I0518 07:43:21.086011 01884 network.go:66] Watching for new subnet leases
  • run following commands for docker:
service docker stop
source /run/flannel/subnet.env
sudo ifconfig docker0 ${FLANNEL_SUBNET}
sudo docker daemon --bip=${FLANNEL_SUBNET} --mtu=${FLANNEL_MTU} &

which produces the following output:

INFO[0000] New containerd process, pid: 1930
WARN[0000] containerd: low RLIMIT_NOFILE changing to max  current=1024 max=4096
INFO[0001] [graphdriver] using prior storage driver "aufs"
INFO[0001] Graph migration to content-addressability took 0.00 seconds
INFO[0001] Firewalld running: false
WARN[0001] Your kernel does not support swap memory limit.
WARN[0001] mountpoint for pids not found
INFO[0001] Loading containers: start.
...
INFO[0001] Loading containers: done.
INFO[0001] Daemon has completed initialization
INFO[0001] Docker daemon                                 commit=5604cbe graphdriver=aufs version=1.11.1
INFO[0001] API listen on /var/run/docker.sock

...So far so good. Let's go to the second node..

on worker node:

  • also stop docker
  • NOT running etcd (since master already has it running..)
  • instead run flannel as recommended (pointing to master):
    sudo ./bin/flanneld --etcd-endpoints=http://10.43.116.137:2379 --iface=10.43.116.155 --ip-masq

which results again in the known output:

I0518 07:49:38.420397 01805 main.go:120] Installing signal handlers
I0518 07:49:38.420642 01805 manager.go:163] Using 10.43.116.155 as external interface
I0518 07:49:38.420675 01805 manager.go:164] Using 10.43.116.155 as external endpoint
I0518 07:49:38.430341 01805 local_manager.go:179] Picking subnet in range 10.43.116.224 ...10.43.116.224
E0518 07:49:38.430367 01805 network.go:106] failed to register network: failed to acquire lease: out of subnets

Any idea what's wrong here?

Could the proxy settings be part of the problem? I don't think so. If that was the case docker wouldn't be able to pull images from the hub....

Author

mmkonrad commented May 30, 2016

Also tried it with a 10.0.0.0/24 Subnet...with the exact same results

Contributor

tomdee commented Jun 8, 2016

You're setting the network to a small range (/26) - people typically use a /16 for the network.

Since you're not setting SubnetLen, it's going to default to /27, per the docs:

SubnetLen (integer): The size of the subnet allocated to each host. Defaults to 24 (i.e. /24) unless the Network was configured to be smaller than a /24 in which case it is one less than the network.
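
As a minimal sketch of the quoted rule (not flannel's actual code), the default can be worked out with Python's `ipaddress` module. The function names are hypothetical. Treating a network of exactly /24 as "halved" too is an assumption; it is consistent with the 10.0.0.0/24 attempt reported earlier in the thread giving the same result.

```python
import ipaddress

def default_subnet_len(network: str) -> int:
    """Default per-host prefix length, following the rule quoted above.

    Assumption: a network of exactly /24 is also split in half.
    """
    net = ipaddress.ip_network(network)
    if net.prefixlen < 24:
        return 24                 # large network: each host gets a /24
    return net.prefixlen + 1      # small network: each host gets half of it

def host_subnets(network: str) -> list:
    """All per-host subnets the network divides into at the default length."""
    net = ipaddress.ip_network(network)
    return [str(s) for s in net.subnets(new_prefix=default_subnet_len(network))]

print(default_subnet_len("10.43.116.192/27"))  # 28, matches "Lease acquired: 10.43.116.208/28"
print(default_subnet_len("10.43.116.192/26"))  # 27, matches "Lease acquired: 10.43.116.224/27"
print(host_subnets("10.43.116.192/27"))        # ['10.43.116.192/28', '10.43.116.208/28']
```

With either of the small networks tried in this thread, the network divides into only two per-host subnets.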

I suspect there's an off-by-one bug in the subnet allocation logic when the Network is only twice the size of the SubnetLen. It only allows a single host to be allocated.

Although this looks like a bug, I'm not sure how quickly it will be fixed given that this is a slightly odd scenario. Could you try using a larger network and/or smaller per-host subnets (i.e. a larger SubnetLen value)?
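
A hedged sketch of what such a config could look like, built with Python's `json` module so the quoting stays valid. The `Network`, `SubnetLen` and `Backend` keys are the ones used and quoted in this thread; the 10.1.0.0/16 range and the helper name `flannel_config` are hypothetical placeholders, and the range would need to be one that does not collide with the VPC.

```python
import json

def flannel_config(network: str, subnet_len: int, backend: str = "vxlan") -> str:
    """Serialize a flannel network config as JSON (helper name is hypothetical)."""
    return json.dumps({
        "Network": network,          # overall overlay range, e.g. a /16
        "SubnetLen": subnet_len,     # prefix length leased to each host
        "Backend": {"Type": backend},
    })

# 10.1.0.0/16 is a placeholder; pick a range that does not overlap your VPC.
config = flannel_config("10.1.0.0/16", 24)

# Print the etcdctl command in the same form used earlier in the thread.
print(f"etcdctl set /coreos.com/network/config '{config}'")
```

Setting SubnetLen explicitly avoids relying on the default and leaves room for many hosts.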

tomdee added the kind/bug label Jun 8, 2016
Author

mmkonrad commented Jun 8, 2016

NOOOOOOOO......it worked...6 weeks of headache are more or less over.
I would be interested in the source of the bug, but my active phase on this project will end really soon.
Nevertheless thank you very much for this hint.

In the meantime I/we managed to create a cluster with a Docker overlay network and Docker Swarm. It helped a lot in understanding more of the processes within such a cluster and the setup, but of course our solution is not as sophisticated as a k8s solution.

For me this issue could be closed. Do you want to leave it open until the bug is solved?

Contributor

tomdee commented Jul 14, 2016

This code is relevant

if cfg.SubnetMin == ip.IP4(0) {
        // skip over the first subnet otherwise it causes problems. e.g.
        // if Network is 10.100.0.0/16, having an interface with 10.0.0.0
        // makes ping think it's a broadcast address (not sure why)
        cfg.SubnetMin = cfg.Network.IP + subnetSize
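
To illustrate the effect of skipping the first subnet, here is a rough Python sketch, not a port of flannel's code. The lower bound follows the snippet above (`Network.IP + subnetSize`). The upper bound (the last subnet in the network) is an assumption, chosen because it reproduces the "Picking subnet in range" log lines in this thread. The function name `lease_range` is hypothetical.

```python
import ipaddress

def lease_range(network: str, subnet_len: int):
    """Return (first, last, count) of subnet base addresses available for leases."""
    net = ipaddress.ip_network(network)
    subnet_size = 2 ** (32 - subnet_len)  # addresses per host subnet (IPv4)

    # Lower bound: skip the first subnet, as in the snippet above.
    subnet_min = net.network_address + subnet_size
    # Upper bound: ASSUMED to be the last subnet inside the network.
    subnet_max = net.broadcast_address + 1 - subnet_size

    count = (int(subnet_max) - int(subnet_min)) // subnet_size + 1
    return str(subnet_min), str(subnet_max), count

# First setup: Network /27, default SubnetLen 28.
print(lease_range("10.43.116.192/27", 28))  # ('10.43.116.208', '10.43.116.208', 1)
# Second setup: Network /26, default SubnetLen 27.
print(lease_range("10.43.116.192/26", 27))  # ('10.43.116.224', '10.43.116.224', 1)
```

Under these assumptions both setups leave exactly one leasable subnet, so the first host takes it and the second host reports "out of subnets".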

tomdee closed this as completed Sep 22, 2017