docker run failed on overlay network since "failed to set link up: transport endpoint is not connected" #1247

Closed
BSWANG opened this issue Jun 12, 2016 · 10 comments · Fixed by #1279

BSWANG commented Jun 12, 2016

docker version:

1.11.2 on ubuntu 14.04

docker network ls:

NETWORK ID          NAME                 DRIVER
6a1fa16db582        bridge               bridge
0aa8655e6fbc        host                 host
35873009d789        multi-host-network   overlay
f9705d542f03        none                 null

uname -a:

Linux c55c4e23c9d714361aa1be6b7a97e7c63-node1 3.13.0-32-generic #57-Ubuntu SMP Tue Jul 15 03:51:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

When I start a container on the overlay network, the Docker daemon returns an error:

root@c55c4e23c9d714361aa1be6b7a97e7c63-node1:~# docker run -it --net multi-host-network busybox
docker: Error response from daemon: subnet sandbox join failed for "172.19.0.0/16": vxlan interface creation failed for subnet "172.19.0.0/16": failed to set link up: transport endpoint is not connected.

I still can't start the container after restarting the Docker daemon service.
I have hit this error many times on 1.11.x, but I don't know how to reproduce it reliably. How can I work around it without restarting the host?

BSWANG commented Jun 12, 2016

@vikstrous I saw that you ran into this issue in docker/docker#22486. Have you hit it again on 1.11.2, and how did you work around it?

@vikstrous

Yes. I have seen it on 1.11.2 but probably only once or twice. It might be some kind of race condition. I don't have a workaround other than to restart the machine. @mavenugo @mrjana might be able to provide more info.

BSWANG commented Jun 13, 2016

@vikstrous Running ip link del vx-xxxxx-xxxx and then restarting the Docker daemon works around it. But why does the "transport endpoint is not connected" error occur so frequently with the 1.11.x Docker daemon?

BSWANG commented Jun 13, 2016

10: ov-000100-35873: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default
    link/ether 2e:95:32:2f:f5:5c brd ff:ff:ff:ff:ff:ff
11: vx-000100-35873: <BROADCAST,MULTICAST> mtu 1500 qdisc noop master ov-000100-35873 state DOWN mode DEFAULT group default
    link/ether 2e:95:32:2f:f5:5c brd ff:ff:ff:ff:ff:ff

This is the ip link output when I hit the error. The vxlan interface has the same MAC address as the bridge. Could that be why the vxlan interface can't be brought up?

BSWANG commented Jun 14, 2016

#1100

@mavenugo

@BSWANG honestly, I don't know exactly what is causing this issue either. I saw it once in @vikstrous's testbed, and yes, restarting the daemon solved the problem. Also, thanks for pointing out that the same MAC address is used for the bridge and the vxlan interface.

@mrjana @sanimej wdyt ?

BSWANG commented Jun 20, 2016

@mavenugo The "same mac-address used for the bridge and vxlan interface" is caused by the Linux bridge taking the lowest MAC address among its attached interfaces, so it should not be the root cause of this issue.

BSWANG commented Jun 20, 2016

Finally, I added retry code to osl/AddInterface:

        // Bring the interface up, retrying a few times on failure.
        const maxSetupRetry = 3
        // err is declared outside the loop so the final check can report it.
        var err error
        for retryCount := 0; retryCount < maxSetupRetry; retryCount++ {
            if err = netlink.LinkSetUp(iface); err == nil {
                break
            }
            log.Debugf("failed to set link up (attempt %d): %v", retryCount+1, err)
            time.Sleep(1 * time.Second)
        }
        if err != nil {
            return fmt.Errorf("failed to set link up: %v", err)
        }

I have not seen this issue again since. The vxlan interface that failed came up successfully on the second retry.

mrjana commented Jun 20, 2016

"same mac-address used for the bridge and vxlan interface"

Yeah, the same mac-address on the bridge and a bridge port is not an issue. It's always like that. The bridge inherits the lowest mac-address among its bridge ports, because the bridge ports themselves don't need their own mac.

The issue happens because in older kernels, when a vxlan interface is created, the socket creation is queued to a worker thread, which actually creates the socket. This needs to happen before we bring up the link on the vxlan interface. If by chance the worker thread hasn't finished creating the socket by the time we do link up, the kernel checks whether the socket exists and, if it does not, returns ENOTCONN. This was a kernel bug that got fixed in later kernels. That is why retrying with a timer fixes the issue.

BSWANG commented Jun 21, 2016

@mrjana Thanks, now I understand the root cause of this issue. 👍
