
1.12 RC3: service vip does not respond correctly after service scale down(new replicas < old replicas) #24531

Closed
ligc opened this issue Jul 12, 2016 · 8 comments
Labels: area/networking, area/swarm, kind/bug, priority/P1, version/1.12


ligc commented Jul 12, 2016

Output of docker version:

Client:
 Version:      1.12.0-rc3
 API version:  1.24
 Go version:   go1.6.2
 Git commit:   91e29e8
 Built:        Sat Jul  2 00:38:44 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.0-rc3
 API version:  1.24
 Go version:   go1.6.2
 Git commit:   91e29e8
 Built:        Sat Jul  2 00:38:44 2016
 OS/Arch:      linux/amd64

Output of docker info:

Containers: 3
 Running: 2
 Paused: 0
 Stopped: 1
Images: 4
Server Version: 1.12.0-rc3
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 27
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge null host overlay
Swarm: active
 NodeID: 39fgxkny289ba0i1qieulw3nv
 IsManager: Yes
 Managers: 1
 Nodes: 5
 CACertHash: sha256:7a77ad2daa473e47c4852430eb1749d83f1ad6cb6c6f2aa591414b647a082117
Runtimes: runc
Default Runtime: runc
Security Options: apparmor seccomp
Kernel Version: 4.4.0-22-generic
Operating System: Ubuntu 16.04 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 3.859 GiB
Name: x5
ID: AR3X:GBNY:KZBT:2CZN:GYRK:YDKJ:X7FR:SUSS:ANTP:4WKC:RPLM:3DZK
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
 127.0.0.0/8

Additional environment details (AWS, VirtualBox, physical, etc.):

The swarm has 5 nodes:

root@x5:~# docker node ls
ID                           HOSTNAME  MEMBERSHIP  STATUS  AVAILABILITY  MANAGER STATUS
312szma4ks4b3ifip4ypjo7qw    x6        Accepted    Ready   Active        
39fgxkny289ba0i1qieulw3nv *  x5        Accepted    Ready   Active        Leader
5l0sn100g63bm4dkth7yyq6br    x8        Accepted    Ready   Active        
aebe3jn24kgnv3cnz5kjsz5r7    x9        Accepted    Ready   Active        
bwfgfyh765i1b167lssvhtwgv    x7        Accepted    Ready   Active        
root@x5:~# 

Two services deployed in the swarm

root@x5:~# docker service ls
ID            NAME        REPLICAS  IMAGE                                                   COMMAND
7q1dowt321x1  httpserver  5/5       liguangcheng/ubuntu-16.04-x86_64-apache2                
9p1l1vztb936  httpclient  5/5       liguangcheng/ubuntu-16.04-x86_64-apache2-benchmark-new  
root@x5:~# docker service tasks httpclient
ID                         NAME          SERVICE     IMAGE                                                   LAST STATE       DESIRED STATE  NODE
eovak0k3z42706huq4bc2lstm  httpclient.1  httpclient  liguangcheng/ubuntu-16.04-x86_64-apache2-benchmark-new  Running 2 hours  Running        x9
1spgt4jyzwqsbth61ad6sxy8v  httpclient.2  httpclient  liguangcheng/ubuntu-16.04-x86_64-apache2-benchmark-new  Running 2 hours  Running        x5
67iw42pmub8tqun0howes37ls  httpclient.3  httpclient  liguangcheng/ubuntu-16.04-x86_64-apache2-benchmark-new  Running 2 hours  Running        x8
bw7buxw380bv25c7wtr03w0i6  httpclient.4  httpclient  liguangcheng/ubuntu-16.04-x86_64-apache2-benchmark-new  Running 2 hours  Running        x6
6b724w9hmei8e089qeaohblyv  httpclient.5  httpclient  liguangcheng/ubuntu-16.04-x86_64-apache2-benchmark-new  Running 2 hours  Running        x7
root@x5:~# docker service tasks httpserver
ID                         NAME          SERVICE     IMAGE                                     LAST STATE             DESIRED STATE  NODE
88vvmtyh8mzk5z5cy2rzpgdua  httpserver.1  httpserver  liguangcheng/ubuntu-16.04-x86_64-apache2  Running About an hour  Running        x5
7ymumzqq9fkpkykfwkiwi38un  httpserver.2  httpserver  liguangcheng/ubuntu-16.04-x86_64-apache2  Running 4 minutes      Running        x8
bu05te6fkuonxfq50teuq2nnt  httpserver.3  httpserver  liguangcheng/ubuntu-16.04-x86_64-apache2  Running 2 hours        Running        x7
98213q98i8em1i5zcz8y9p63l  httpserver.4  httpserver  liguangcheng/ubuntu-16.04-x86_64-apache2  Running 2 hours        Running        x9
9uqe2gjaihsbqj99aq6df4b55  httpserver.5  httpserver  liguangcheng/ubuntu-16.04-x86_64-apache2  Running 2 hours        Running        x6
root@x5:~# 

Steps to reproduce the issue:

  1. Init the swarm with command docker swarm init --listen-addr 10.0.189.5:2377
  2. Join the nodes into the swarm with command docker swarm join 10.0.189.5:2377
  3. Deploy the httpclient service with command docker service create --replicas 5 --publish 22 --name httpclient liguangcheng/ubuntu-16.04-x86_64-apache2-benchmark-new
  4. Deploy the httpserver service with command docker service create --replicas 5 --publish 80 --name httpserver liguangcheng/ubuntu-16.04-x86_64-apache2
  5. Run the Apache benchmark tool ab in all the httpclient containers against the httpserver virtual IP (10.255.0.14 in my case); the command is ab -c 10 -n 100 http://10.255.0.14:80/
  6. Scale down the httpserver with command docker service scale httpserver=4
  7. Rerun the Apache benchmark tool ab in all the httpclient containers against the httpserver virtual IP; the command is ab -c 10 -n 100 http://10.255.0.14:80/ (see the consolidated script below)
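For reference, a minimal sketch that chains these steps together, assuming a bash shell on the manager node; the manager address (10.0.189.5), the image names, and the VIP (10.255.0.14) are taken from this environment and will differ elsewhere:

#!/bin/bash
# Reproduction sketch; the values below come from this cluster and are assumptions anywhere else.
set -e

# Steps 1-2: init the swarm on the manager; run the join command on each worker node.
docker swarm init --listen-addr 10.0.189.5:2377
# on each worker: docker swarm join 10.0.189.5:2377

# Steps 3-4: deploy the client and server services.
docker service create --replicas 5 --publish 22 --name httpclient liguangcheng/ubuntu-16.04-x86_64-apache2-benchmark-new
docker service create --replicas 5 --publish 80 --name httpserver liguangcheng/ubuntu-16.04-x86_64-apache2

# Step 5: run ab -c 10 -n 100 http://10.255.0.14:80/ inside every httpclient container.

# Step 6: scale the httpserver service down.
docker service scale httpserver=4

# Step 7: rerun the benchmark; on some nodes the VIP no longer responds.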

Describe the results you received:

After the httpserver service is scaled down, 10.255.0.14:80 does not respond on some nodes; in the run below, the containers on nodes x6, x8, and x9 could not connect to the service VIP (a verification sketch follows the output).

[root@ligc ipvs]# cat go
#!/bin/bash
xdsh x5-x9 "docker exec \`docker ps | grep httpclient | awk '{print \$1}'\` ab -c $1 -n $2 http://10.255.0.14:80/" | tee test.result
[root@ligc ipvs]# ./go 10 100
x7: This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
x7: Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
x7: Licensed to The Apache Software Foundation, http://www.apache.org/
x7: 
x7: Benchmarking 10.255.0.14 (be patient).....done
x7: 
x7: 
x7: Server Software:        Apache/2.4.18
x7: Server Hostname:        10.255.0.14
x7: Server Port:            80
x7: 
x7: Document Path:          /
x7: Document Length:        11321 bytes
x7: 
x7: Concurrency Level:      10
x7: Time taken for tests:   0.083 seconds
x7: Complete requests:      100
x7: Failed requests:        0
x7: Total transferred:      1159500 bytes
x7: HTML transferred:       1132100 bytes
x7: Requests per second:    1211.12 [#/sec] (mean)
x7: Time per request:       8.257 [ms] (mean)
x7: Time per request:       0.826 [ms] (mean, across all concurrent requests)
x7: Transfer rate:          13713.84 [Kbytes/sec] received
x7: 
x7: Connection Times (ms)
x7:               min  mean[+/-sd] median   max
x7: Connect:        1    3   2.3      2      13
x7: Processing:     1    5   4.3      3      31
x7: Waiting:        1    3   2.3      3      18
x7: Total:          3    8   4.9      6      35
x7: 
x7: Percentage of the requests served within a certain time (ms)
x7:   50%      6
x7:   66%      7
x7:   75%      9
x7:   80%     10
x7:   90%     14
x7:   95%     18
x7:   98%     24
x7:   99%     35
x7:  100%     35 (longest request)
x5: This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
x5: Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
x5: Licensed to The Apache Software Foundation, http://www.apache.org/
x5: 
x5: Benchmarking 10.255.0.14 (be patient).....done
x5: 
x5: 
x5: Server Software:        Apache/2.4.18
x5: Server Hostname:        10.255.0.14
x5: Server Port:            80
x5: 
x5: Document Path:          /
x5: Document Length:        11321 bytes
x5: 
x5: Concurrency Level:      10
x5: Time taken for tests:   0.093 seconds
x5: Complete requests:      100
x5: Failed requests:        0
x5: Total transferred:      1159500 bytes
x5: HTML transferred:       1132100 bytes
x5: Requests per second:    1080.40 [#/sec] (mean)
x5: Time per request:       9.256 [ms] (mean)
x5: Time per request:       0.926 [ms] (mean, across all concurrent requests)
x5: Transfer rate:          12233.67 [Kbytes/sec] received
x5: 
x5: Connection Times (ms)
x5:               min  mean[+/-sd] median   max
x5: Connect:        0    4   4.6      2      22
x5: Processing:     1    4   2.9      4      17
x5: Waiting:        0    4   2.4      3      14
x5: Total:          1    8   6.1      7      31
x5: 
x5: Percentage of the requests served within a certain time (ms)
x5:   50%      7
x5:   66%      8
x5:   75%     10
x5:   80%     10
x5:   90%     16
x5:   95%     23
x5:   98%     28
x5:   99%     31
x5:  100%     31 (longest request)
x8: This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
x8: Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
x8: Licensed to The Apache Software Foundation, http://www.apache.org/
x8: 
x8: Benchmarking 10.255.0.14 (be patient)...Total of 38 requests completed
x8: apr_pollset_poll: The timeout specified has expired (70007)
x6: apr_pollset_poll: The timeout specified has expired (70007)
x6: This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
x6: Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
x6: Licensed to The Apache Software Foundation, http://www.apache.org/
x6: 
x6: Benchmarking 10.255.0.14 (be patient)...Total of 40 requests completed
x9: This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
x9: Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
x9: Licensed to The Apache Software Foundation, http://www.apache.org/
x9: 
x9: Benchmarking 10.255.0.14 (be patient)...Total of 36 requests completed
x9: apr_pollset_poll: The timeout specified has expired (70007)
[root@ligc ipvs]# 
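One way to narrow this down (a sketch, not part of the original report): confirm the VIP that swarm assigned to the httpserver service and probe it from the local httpclient container. The VIP value, the grep-based container lookup, and the availability of curl inside the image are assumptions from this setup:

#!/bin/bash
# Debugging sketch; the VIP and the "grep httpclient" lookup are assumptions from this cluster.

# Show the virtual IPs swarm assigned to the httpserver service.
docker service inspect --format '{{json .Endpoint.VirtualIPs}}' httpserver

# From the local httpclient container, check whether the VIP answers on port 80
# (assumes curl is installed in the image).
CID=$(docker ps | grep httpclient | awk '{print $1}')
docker exec "$CID" curl -s -o /dev/null -w '%{http_code}\n' --max-time 5 http://10.255.0.14:80/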

Describe the results you expected:
The service VIP should keep working regardless of whether the service is scaled up or down.

Additional information you deem important (e.g. issue happens only occasionally):

@ligc ligc changed the title 1.12 RC3: service vip does not respond correctly after service scale in(new replicas < old replicas) 1.12 RC3: service vip does not respond correctly after service scale down(new replicas < old replicas) Jul 12, 2016
@thaJeztah thaJeztah added kind/bug Bugs are bugs. The cause may or may not be known at triage time so debugging may be needed. area/networking labels Jul 12, 2016
@thaJeztah thaJeztah added this to the 1.12.0 milestone Jul 12, 2016
@thaJeztah (Member)

/cc @mavenugo


cohenaj194 commented Jul 13, 2016

I have the same issue, only it extends to --mode global containers as well. For example, if I create the following service, none of the containers will be reachable through port 30000:

docker network create -d overlay mynet
docker service create --mode global --name nginx --network mynet -p 30000:80 nginx

I have also tried creating a regular nginx service with the following:

docker network create -d overlay mynet
docker service create --name nginx --replicas 5 -p 30000:80/tcp --network mynet nginx

With this setup the containers are only reachable occasionally through port 30000, and only through some nodes instead of all of them. It also doesn't seem to work at all unless there are two containers on each node, and the containers are occasionally reachable through port 80 for no reason I can determine.

When I can access my containers it is usually only through one node, and that node is not load balancing across the containers. When curling localhost:30000 on the node that does work, only the contents of one container are returned, and only on every other curl attempt (see the per-node check sketch after the node list below).

For reference: I am working in a private openstack cloud, ubuntu 14.04, on 6 machines distributed. 3 manager nodes that are drained and 3 client nodes that are active.

HOSTNAME      MEMBERSHIP  STATUS  AVAILABILITY  MANAGER STATUS
client2 Accepted    Ready   Active        
consul0 Accepted    Ready   Drain         Leader
consul1 Accepted    Ready   Drain         Reachable
client0 Accepted    Ready   Active        
client1 Accepted    Ready   Active        
consul2 Accepted    Ready   Drain         Reachable
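A small sketch for checking the published port from every node; the hostnames, the port, and passwordless ssh to each node are assumptions from this environment:

#!/bin/bash
# Routing-mesh check: port 30000 should answer on every node ("000" means no response).
for node in client0 client1 client2 consul0 consul1 consul2; do
  code=$(ssh "$node" "curl -s -o /dev/null -w '%{http_code}' --max-time 5 http://localhost:30000/")
  echo "$node: $code"
done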

@thaJeztah thaJeztah added the priority/P2 Normal priority: default priority applied. label Jul 13, 2016

cohenaj194 commented Jul 14, 2016

My issue turned out to be caused by my swarm managers running Docker 1.12.0-rc3 while my clients were on 1.12.0-rc4. After updating my Docker versions and recreating my cluster, the overlay networking issue was resolved.
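A quick way to spot this kind of mismatch (a sketch; the hostnames and ssh access are assumptions from this environment):

#!/bin/bash
# Print the server-side Docker version on every node so mismatches stand out.
for node in consul0 consul1 consul2 client0 client1 client2; do
  echo -n "$node: "
  ssh "$node" docker version --format '{{.Server.Version}}'
done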

@eungjun-yi (Contributor)

I have the same issue with rc4.

@aluzzardi (Member)

@ligc This has been fixed in master and will be in the next (1.12.2) release.

Could you try with master and confirm?

@icecrime icecrime added priority/P1 Important: P1 issues are a top priority and a must-have for the next release. and removed priority/P2 Normal priority: default priority applied. labels Sep 19, 2016
@icecrime (Contributor)

@aluzzardi Can you please point to the PR that fixed it on master? Is it an all-encompassing vendoring change, or a localized patch?


mrjana commented Sep 20, 2016

This issue should have been fixed in 1.12.1 itself. The PR that fixed this is moby/libnetwork#1370.


vieux commented Sep 20, 2016

As @mrjana said it is fixed in 1.12.1, closing.

@vieux vieux closed this as completed Sep 20, 2016