Packets lost when scaling #1350

Closed
Nurza opened this Issue Aug 9, 2016 · 4 comments

@Nurza
Nurza commented Aug 9, 2016 edited

Hello, I have an issue when I scale a service in Docker swarm mode (one node is enough to reproduce the bug).

Containers: 49
 Running: 20
 Paused: 0
 Stopped: 29
Images: 3
Server Version: 1.12.0
Storage Driver: devicemapper
 Pool Name: docker-253:1-652915-pool
 Pool Blocksize: 65.54 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 549.3 MB
 Data Space Total: 107.4 GB
 Data Space Available: 19.07 GB
 Metadata Space Used: 3.756 MB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.144 GB
 Thin Pool Minimum Free Space: 10.74 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 WARNING: Usage of loopback devices is strongly discouraged for production use. Use `--storage-opt dm.thinpooldev` to specify a custom block storage device.
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.110 (2015-10-30)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: overlay null host bridge
Swarm: active
 NodeID: 0ihvr3th16retkaj2uof4iozp
 Is Manager: true
 ClusterID: cqq5euvmbx5sw8b2fb7c0d9rl
 Managers: 1
 Nodes: 1
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot interval: 10000
  Heartbeat tick: 1
  Election tick: 3
 Dispatcher:
  Heartbeat period: 5 seconds
 CA configuration:
  Expiry duration: 3 months
 Node Address: 178.62.211.166
Runtimes: runc
Default Runtime: runc
Security Options: apparmor seccomp
Kernel Version: 4.4.0-31-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 488.5 MiB
Name: swarm-1
ID: 7E2C:JQFD:KSPM:UV3U:TSUG:GTIX:TDWX:ANHW:BOHO:Z63G:3V5P:Q5NA
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Username: nurza
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
 127.0.0.0/8

To reproduce the bug:

Create a service. Here it is just an nc server that returns the container hostname on port 8080.

docker service create -p 8080:8080 --name=nc nurza/nc bash -c 'hostname>hostname && while true ; do nc -l 8080 < hostname ; done'

On the same system, launch a curl loop.

while true ; do curl 127.0.0.1:8080 ; sleep 0.1 ; done

Then scale the service to a high number, such as 30:

docker service scale nc=30

For approximately 30 seconds, lines like these appear in the loop output:

548ce9a70d35
bfdc0cc701e1
8bed39426672
7880a6e5167d
curl: (7) Failed to connect to 127.0.0.1 port 8080: Connection refused
c5d5aead5781
127888f96883
e062990c1f2c
9bef031116a5
d81d97c7a679
3143312230b4
043c0ee0b059
93828ebfcb02
719495543f82
548ce9a70d35
bfdc0cc701e1
8bed39426672
7880a6e5167d
curl: (7) Failed to connect to 127.0.0.1 port 8080: Connection refused
d2657ab2594c
7a02ef70551c
36627dd0e197

The same happens when you scale down:

docker service scale nc=2

I get these lines in the loop:

93828ebfcb02
719495543f82
548ce9a70d35
bfdc0cc701e1
curl: (7) Failed to connect to 127.0.0.1 port 8080: Connection refused
curl: (7) Failed to connect to 127.0.0.1 port 8080: Connection refused
61d8762584f5
7a02ef70551c
curl: (7) Failed to connect to 127.0.0.1 port 8080: Connection refused
curl: (7) Failed to connect to 127.0.0.1 port 8080: Connection refused
c5d5aead5781
curl: (7) Failed to connect to 127.0.0.1 port 8080: Connection refused
9bef031116a5
curl: (7) Failed to connect to 127.0.0.1 port 8080: Connection refused
c5d5aead5781
9bef031116a5
c5d5aead5781
9bef031116a5
c5d5aead5781
9bef031116a5
c5d5aead5781
9bef031116a5
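
To put a number on the failures in the loop output above, the lines can be piped through a small counter. This is only a sketch: `summarize_curl_log` is a hypothetical helper name, and it assumes every non-empty line that is not a "Connection refused" error is a successful reply (a container hostname).

```shell
# summarize_curl_log: read curl loop output on stdin and count
# successful replies versus refused connections.
# Hypothetical helper, not part of Docker or curl.
summarize_curl_log() {
  awk '
    /Connection refused/ { refused++; next }  # curl error (7) lines
    NF > 0               { ok++ }             # any other non-empty line counts as a reply
    END { printf "ok=%d refused=%d\n", ok, refused }
  '
}

# Example (bounded loop so the summary is printed at the end):
#   for i in $(seq 1 300); do curl -sS 127.0.0.1:8080 2>&1; sleep 0.1; done | summarize_curl_log
```

The `2>&1` in the example matters because curl writes its error messages to stderr.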

In conclusion: scaling a service results in dropped network packets.

Am I the only one with this problem? Thank you.

EDIT: sometimes when I scale down, the service's network freezes for about 1 minute.

@dperny
Member
dperny commented Aug 9, 2016

Related to #1328

@dongluochen
Contributor

@Nurza I repeated your test and can confirm the issue. As @dperny says, it is likely related to endpoints being added to or removed from the load balancer without knowledge of task changes.

> EDIT: sometime when I scale down, the service's network freeze during 1min.

I think the curl TCP connection may get stuck in a waiting state and take about 1 minute to give up. If you start another curl alongside it during this period, it succeeds. So it is not a network freeze but a connection state problem.
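
One way to check for a stuck connection is to count the connections on the port by TCP state. The sketch below parses `netstat -an` style output; `count_tcp_states` is a hypothetical helper name, and it assumes the usual Linux column layout (local address in column 4, foreign address in column 5, state in column 6).

```shell
# count_tcp_states PORT: read `netstat -an` output on stdin and print,
# per TCP state, how many sockets have PORT as local or foreign port.
# Hypothetical helper; assumes the Linux netstat column layout.
count_tcp_states() (
  port=$1
  awk -v port="$port" '
    $1 ~ /^tcp/ && ($4 ~ (":" port "$") || $5 ~ (":" port "$")) { states[$6]++ }
    END { for (s in states) print s, states[s] }
  ' | sort
)

# Example:
#   netstat -an | count_tcp_states 8080
```

A connection that stays in SYN_SENT across several runs would point to the waiting state described above rather than to a frozen network.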

@dongluochen
Contributor

The TCP connection is in the SYN_SENT state. On my system it retries 6 times, taking around 45 seconds to give up on the connection.

ubuntu@ip-172-19-241-147:~$ netstat -an | grep -i 8080
tcp        0      1 127.0.0.1:45334         127.0.0.1:8080          SYN_SENT
tcp6       0      0 :::8080                 :::*                    LISTEN
ubuntu@ip-172-19-241-147:~$ date
Tue Aug  9 18:27:39 UTC 2016

ubuntu@ip-172-19-241-147:~$ date
Tue Aug  9 18:28:23 UTC 2016
ubuntu@ip-172-19-241-147:~$ netstat -an | grep -i 8080
tcp        0      1 127.0.0.1:45334         127.0.0.1:8080          SYN_SENT
tcp6       0      0 :::8080                 :::*                    LISTEN

ubuntu@ip-172-19-241-147:~$ more /proc/sys/net/ipv4/tcp_syn_retries
6
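
For a rough estimate of how long a connect attempt can stay in SYN_SENT for a given `tcp_syn_retries` value, the backoff can be summed by hand. This is a sketch under assumptions that are not stated in this thread: an initial retransmission timeout of 1 second that doubles after every retry. `syn_connect_timeout` is a hypothetical helper name. Under those assumptions 6 retries adds up to more than the roughly 45 seconds reported above, so the real timing on a given system may differ.

```shell
# syn_connect_timeout RETRIES [INITIAL_RTO_SECONDS]
# Sum the waits after the initial SYN and after each retransmission,
# assuming the timeout starts at INITIAL_RTO_SECONDS (default 1, an
# assumption) and doubles each time.
syn_connect_timeout() (
  retries=$1
  rto=${2:-1}
  total=0
  i=0
  while [ "$i" -le "$retries" ]; do
    total=$((total + rto))
    rto=$((rto * 2))
    i=$((i + 1))
  done
  echo "$total"
)

# Example:
#   syn_connect_timeout "$(cat /proc/sys/net/ipv4/tcp_syn_retries)"
```
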
@mrjana
Contributor
mrjana commented Aug 9, 2016

@Nurza @dperny @dongluochen This issue is not in swarmkit. It has been fixed in docker/libnetwork#1370 which will be available in the next patch release of docker engine. Since it is not an issue in swarmkit and since there are already a bunch of issues open for this in docker/docker I am going to close this issue. Please feel free to continue the discussion here if necessary.

@mrjana mrjana closed this Aug 9, 2016