No networking in some docker containers when spawned at high rate #1638

Closed
rawouter opened this Issue Oct 31, 2016 · 4 comments

@rawouter

Issue Report

Bug

When spawning Docker containers at a high rate in bridge network mode, some containers come up without network connectivity. This was originally reported in moby/moby#27808, where I was asked to file the bug against CoreOS.

CoreOS Version

$ cat /etc/os-release
NAME=CoreOS
ID=coreos
VERSION=1192.2.0
VERSION_ID=1192.2.0
BUILD_ID=2016-10-21-0026
PRETTY_NAME="CoreOS 1192.2.0 (MoreOS)"
ANSI_COLOR="1;32"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"

Environment

VM running on VMware

Steps to reproduce the issue:

  1. Create a network with a /16 address space in bridge mode (a sketch of the create command follows)
  2. Spawn containers at a high rate and try to access the network (ping)
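
For reference, the taskers network used below would have been created with something like the following. This is a sketch, not the exact command used; the subnet is inferred from the container addresses in the output (173.77.x.x with mask 255.255.0.0, gateway 173.77.0.1):

docker network create --driver bridge --subnet 173.77.0.0/16 taskers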

I can reproduce the issue with the following script; some container images work better than others, but all fail at some point. Also, I cannot reproduce the issue in host network mode.

for num in {1..300}
do
        docker run --network taskers --rm  ubuntu:14.04 sh -c "ping -c 1 173.36.21.105; arp -n; ifconfig" | tee output_$num.txt &
done

Describe the results you received:

About 1% of the containers fail, coming up without networking.
Here are some command outputs captured by the script above:

PING 173.36.21.105 (173.36.21.105) 56(84) bytes of data.
From 173.77.2.18 icmp_seq=1 Destination Host Unreachable

--- 173.36.21.105 ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

Address                  HWtype  HWaddress           Flags Mask            Iface
173.77.0.1                       (incomplete)                              eth0

eth0      Link encap:Ethernet  HWaddr 02:42:ad:4d:02:12
          inet addr:173.77.2.18  Bcast:0.0.0.0  Mask:255.255.0.0
          inet6 addr: fe80::42:adff:fe4d:212/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:5 errors:0 dropped:0 overruns:0 frame:0
          TX packets:14 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:438 (438.0 B)  TX bytes:1072 (1.0 KB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:1 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1
          RX bytes:112 (112.0 B)  TX bytes:112 (112.0 B)


Describe the results you expected:

No network failure, ping should go through, arp should resolve.

Additional information you deem important (e.g. issue happens only occasionally):

The issue only occurs for about 1% of the pods when the system is under load, spawning and deleting lots of containers.
It doesn't occur in host network mode.
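
Since each run writes its output to output_$num.txt, one way to count the failures across a batch is to grep for the unreachable-host error. A minimal sketch, assuming the file naming from the repro script above:

grep -l "Destination Host Unreachable" output_*.txt | wc -l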

Output of docker version:

$ docker version
Client:
 Version:      1.12.1
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   7a86f89
 Built:
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.1
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   7a86f89
 Built:
 OS/Arch:      linux/amd64

Output of docker info:

$ docker info
Containers: 284
 Running: 113
 Paused: 0
 Stopped: 171
Images: 30
Server Version: 1.12.1
Storage Driver: overlay
 Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: null host bridge overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: selinux
Kernel Version: 4.7.3-coreos-r1
Operating System: CoreOS 1192.2.0 (MoreOS)
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 31.43 GiB
Name: pink-node-01
ID: D6TN:CC4O:MEWN:5IY4:ZOIO:OGKZ:AWPG:ZN5B:7EM6:J3JM:U6G2:6E3M
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Insecure Registries:
 127.0.0.0/8
@dm0-

dm0- commented Oct 31, 2016 (Member)
I've run this a few times to ping the gateway with the default bridge network driver, and there have been no ping failures.

for ((i=0; i<200; i++)) ; do (docker run --rm busybox ping -c 1 172.17.0.1 &> fail.$i && rm -f fail.$i) & done

How did you set up your custom network? Can you confirm whether the failed containers are attached to the bridge? (For example, ip link should show e.g. master docker0 on the veth interface.)
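
One way to check this from a failed container: eth0 is shown as eth0@ifNNNN, where NNNN is the interface index of the host-side veth peer, so you can look that peer up on the host and see whether it is enslaved to the bridge. A sketch (not CoreOS-specific):

# Inside the failed container: print the ifindex of eth0's host-side peer
cat /sys/class/net/eth0/iflink

# On the host: each properly attached veth line should contain
# "master docker0" (or "master br-<id>" for a custom network)
ip -o link show type veth | grep master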


@rawouter

rawouter commented Oct 31, 2016
As it happens, the node I used for the repro had just rebooted, and I couldn't reproduce the issue easily. I had to re-run the scripts several times (maybe 5,000 container runs) before hitting the first failures.

Nevertheless, here is ip link from a container that failed:

PING 173.36.21.105 (173.36.21.105) 56(84) bytes of data.
From 173.77.4.23 icmp_seq=1 Destination Host Unreachable

--- 173.36.21.105 ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms


1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
8274: eth0@if8275: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
    link/ether 02:42:ad:4d:04:17 brd ff:ff:ff:ff:ff:ff

Here is one that worked:

PING 173.36.21.105 (173.36.21.105) 56(84) bytes of data.
64 bytes from 173.36.21.105: icmp_seq=1 ttl=61 time=3.71 ms

--- 173.36.21.105 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 3.716/3.716/3.716/0.000 ms


8192: eth0@if8193: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
    link/ether 02:42:ad:4d:03:ed brd ff:ff:ff:ff:ff:ff
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

@dm0-

dm0- commented Nov 2, 2016 (Member)
Okay, I eventually managed to reproduce this with a fresh CoreOS system. However, when using an image built with a fix I've already proposed to systemd, I could not reproduce the issue (after spawning nearly 20,000 containers). We are waiting on upstream to decide on the configuration option to use in systemd/systemd#4228, but we can backport it to fix this issue once a decision is made.


@crawford crawford added this to the CoreOS Alpha 1263.0.0 milestone Dec 2, 2016

@dm0-

dm0- commented Dec 3, 2016 (Member)
The proposed option was merged in upstream systemd, and we are going to backport it to our current systemd versions at coreos/systemd#73.
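
For reference, the underlying race is systemd-networkd touching the veth interfaces Docker creates, so the class of fix is telling networkd to leave them unmanaged. A hedged sketch of what such a drop-in might look like, assuming the merged setting is networkd's Unmanaged= link option (see systemd/systemd#4228 for the final form; the file name here is hypothetical):

# /etc/systemd/network/50-docker-veth.network
[Match]
Driver=veth

[Link]
Unmanaged=yes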

