
Packets lost when scaling #1350

Closed
Nurza opened this Issue Aug 9, 2016 · 4 comments

@Nurza

Nurza commented Aug 9, 2016

Hello, I have an issue when I try to scale a service in Docker Swarm (one node is enough to reproduce the bug). Here is my `docker info` output:

Containers: 49
 Running: 20
 Paused: 0
 Stopped: 29
Images: 3
Server Version: 1.12.0
Storage Driver: devicemapper
 Pool Name: docker-253:1-652915-pool
 Pool Blocksize: 65.54 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 549.3 MB
 Data Space Total: 107.4 GB
 Data Space Available: 19.07 GB
 Metadata Space Used: 3.756 MB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.144 GB
 Thin Pool Minimum Free Space: 10.74 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 WARNING: Usage of loopback devices is strongly discouraged for production use. Use `--storage-opt dm.thinpooldev` to specify a custom block storage device.
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.110 (2015-10-30)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: overlay null host bridge
Swarm: active
 NodeID: 0ihvr3th16retkaj2uof4iozp
 Is Manager: true
 ClusterID: cqq5euvmbx5sw8b2fb7c0d9rl
 Managers: 1
 Nodes: 1
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot interval: 10000
  Heartbeat tick: 1
  Election tick: 3
 Dispatcher:
  Heartbeat period: 5 seconds
 CA configuration:
  Expiry duration: 3 months
 Node Address: 178.62.211.166
Runtimes: runc
Default Runtime: runc
Security Options: apparmor seccomp
Kernel Version: 4.4.0-31-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 488.5 MiB
Name: swarm-1
ID: 7E2C:JQFD:KSPM:UV3U:TSUG:GTIX:TDWX:ANHW:BOHO:Z63G:3V5P:Q5NA
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Username: nurza
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
 127.0.0.0/8

To reproduce the bug:

Create a service; here it's just an nc server that returns the container hostname on port 8080:

docker service create -p 8080:8080 --name=nc nurza/nc bash -c 'hostname>hostname && while true ; do nc -l 8080 < hostname ; done'
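
Before starting the loop, you can check that the single replica is up (just a sanity check, not part of the original steps):

docker service ls   # REPLICAS should show 1/1 before you start the curl loop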

On the same system, launch a curl loop.

while true ; do curl 127.0.0.1:8080 ; sleep 0.1 ; done

Then scale the service to a high number of replicas, e.g. 30:

docker service scale nc=30

In the loop, lines like these will appear for approximately 30 seconds:

548ce9a70d35
bfdc0cc701e1
8bed39426672
7880a6e5167d
curl: (7) Failed to connect to 127.0.0.1 port 8080: Connection refused
c5d5aead5781
127888f96883
e062990c1f2c
9bef031116a5
d81d97c7a679
3143312230b4
043c0ee0b059
93828ebfcb02
719495543f82
548ce9a70d35
bfdc0cc701e1
8bed39426672
7880a6e5167d
curl: (7) Failed to connect to 127.0.0.1 port 8080: Connection refused
d2657ab2594c
7a02ef70551c
36627dd0e197

The same thing happens when you scale down:

docker service scale nc=2

I get these lines in the loop:

93828ebfcb02
719495543f82
548ce9a70d35
bfdc0cc701e1
curl: (7) Failed to connect to 127.0.0.1 port 8080: Connection refused
curl: (7) Failed to connect to 127.0.0.1 port 8080: Connection refused
61d8762584f5
7a02ef70551c
curl: (7) Failed to connect to 127.0.0.1 port 8080: Connection refused
curl: (7) Failed to connect to 127.0.0.1 port 8080: Connection refused
c5d5aead5781
curl: (7) Failed to connect to 127.0.0.1 port 8080: Connection refused
9bef031116a5
curl: (7) Failed to connect to 127.0.0.1 port 8080: Connection refused
c5d5aead5781
9bef031116a5
c5d5aead5781
9bef031116a5
c5d5aead5781
9bef031116a5
c5d5aead5781
9bef031116a5

In conclusion: scaling a service results in network packets being dropped.

Am I the only one with this problem? Thank you.

EDIT: Sometimes when I scale down, the service's network freezes for about 1 minute.
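
For anyone who wants to quantify the drop instead of eyeballing the loop output, here is a rough sketch that counts successful and failed requests (the 300 iterations and 1-second connect timeout are arbitrary; run docker service scale nc=30 in another terminal while it runs):

ok=0; fail=0
for i in $(seq 1 300); do
  # --connect-timeout keeps a single stuck connection from stalling the whole loop
  if curl -s --connect-timeout 1 127.0.0.1:8080 > /dev/null; then
    ok=$((ok+1))
  else
    fail=$((fail+1))
  fi
  sleep 0.1
done
echo "ok=$ok fail=$fail"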

@dperny


Member

dperny commented Aug 9, 2016

Related to #1328

@dongluochen


Contributor

dongluochen commented Aug 9, 2016

@Nurza I repeated your test and can confirm the issue. As @dperny says, it's likely related to endpoints being added to/removed from the load balancer without knowledge of the task change.

> EDIT: Sometimes when I scale down, the service's network freezes for about 1 minute.

I think the curl TCP connection might get into a waiting state and take about 1 minute to give up. During this period, if you start another curl on the side, it succeeds. So it's not a network freeze, but rather a connection state problem.
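
A quick way to see this is to run the side-by-side test described above (same address and port as the repro; the 2-second timeout is arbitrary):

curl 127.0.0.1:8080 &                     # may hang in SYN_SENT right after a scale-down
sleep 2
curl --connect-timeout 2 127.0.0.1:8080   # a fresh connection started alongside it typically succeeds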

@dongluochen

Contributor

dongluochen commented Aug 9, 2016

The TCP connection is in the SYN_SENT state. On my system it retries 6 times, taking around 45 seconds to give up on the connection.

ubuntu@ip-172-19-241-147:~$ netstat -an | grep -i 8080
tcp        0      1 127.0.0.1:45334         127.0.0.1:8080          SYN_SENT
tcp6       0      0 :::8080                 :::*                    LISTEN
ubuntu@ip-172-19-241-147:~$ date
Tue Aug  9 18:27:39 UTC 2016

ubuntu@ip-172-19-241-147:~$ date
Tue Aug  9 18:28:23 UTC 2016
ubuntu@ip-172-19-241-147:~$ netstat -an | grep -i 8080
tcp        0      1 127.0.0.1:45334         127.0.0.1:8080          SYN_SENT
tcp6       0      0 :::8080                 :::*                    LISTEN

ubuntu@ip-172-19-241-147:~$ more /proc/sys/net/ipv4/tcp_syn_retries
6
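
If the long hang gets in the way of testing, the client-side wait can be bounded; this is only a workaround sketch (requires root for the sysctl, and the values are illustrative), not a fix for the load-balancer issue itself:

sysctl -w net.ipv4.tcp_syn_retries=3      # fewer SYN retransmissions before connect() gives up
curl --connect-timeout 2 127.0.0.1:8080   # or bound the wait per request instead of system-wide
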
@mrjana


Contributor

mrjana commented Aug 9, 2016

@Nurza @dperny @dongluochen This issue is not in swarmkit. It has been fixed in docker/libnetwork#1370, which will be available in the next patch release of the Docker engine. Since it is not an issue in swarmkit, and since there are already a bunch of issues open for this in docker/docker, I am going to close this issue. Please feel free to continue the discussion here if necessary.

@mrjana mrjana closed this Aug 9, 2016
