
Swarm service, with overlay network, fails to remove all containers #26244

Closed
urlund opened this issue Sep 1, 2016 · 26 comments
Labels: area/networking, area/swarm, exp/expert, kind/bug, version/1.12

urlund commented Sep 1, 2016

Output of docker version:

Client:
 Version:      1.12.1
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   23cf638
 Built:        Thu Aug 18 05:02:53 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.1
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   23cf638
 Built:        Thu Aug 18 05:02:53 2016
 OS/Arch:      linux/amd64

Output of docker info:

Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 4
Server Version: 1.12.1
Storage Driver: overlay2
 Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: null bridge host overlay
Swarm: active
 NodeID: 0phf7j5medrlnam24hcki16db
 Is Manager: true
 ClusterID: 0ol98k8pblzgb1xpqeuuqn27f
 Managers: 1
 Nodes: 3
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: 10.20.21.1
Runtimes: runc
Default Runtime: runc
Security Options:
Kernel Version: 4.6.0-0.bpo.1-amd64
Operating System: Debian GNU/Linux 8 (jessie)
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 3.911 GiB
Name: mc-docker1
ID: RG6N:UCJN:AT5K:AYYN:ZPH6:6IAN:ARJZ:TWAD:ZFTZ:BSYC:A4AN:TTP3
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No memory limit support
WARNING: No swap limit support
WARNING: No kernel memory limit support
WARNING: No oom kill disable support
Insecure Registries:
 127.0.0.0/8

Steps to reproduce the issue:

  1. docker network create -d overlay sleeper_nw
  2. docker service create --name sleeper --replicas 20 --network sleeper_nw urlund/sleeper
  3. docker service rm sleeper

Describe the results you received:
Output from docker ps -a:

Node 1:

CONTAINER ID        IMAGE                   COMMAND             CREATED             STATUS                       PORTS               NAMES
9b79d94bcebb        urlund/sleeper:latest   "/run.sh"           41 seconds ago      Exited (137) 7 seconds ago                       sleeper.20.5shcg82g0p2sn8hkqidx6yuim
239a00862bc4        urlund/sleeper:latest   "/run.sh"           41 seconds ago      Exited (137) 7 seconds ago                       sleeper.2.7vm1a1u41fwpf3mjrviphsfyo
274ccd13294e        urlund/sleeper:latest   "/run.sh"           41 seconds ago      Exited (137) 7 seconds ago                       sleeper.18.1qzllrci7j6rzpjdvk01fxip7
7837971e8e4d        urlund/sleeper:latest   "/run.sh"           41 seconds ago      Exited (137) 7 seconds ago                       sleeper.6.dzrkydszgnvd4mzyy37fyxarc

Node 2:

CONTAINER ID        IMAGE                   COMMAND             CREATED             STATUS                        PORTS               NAMES
854ef7a322b7        urlund/sleeper:latest   "/run.sh"           44 seconds ago      Exited (137) 10 seconds ago                       sleeper.19.609h64iauf7oa1i95o0qi7qqk
fcfcf0208ebf        urlund/sleeper:latest   "/run.sh"           44 seconds ago      Exited (137) 10 seconds ago                       sleeper.15.dpxzl9lysf6koofzhld6o8hnq

Node 3:

CONTAINER ID        IMAGE                   COMMAND             CREATED             STATUS                        PORTS               NAMES
32518f0c9ce2        urlund/sleeper:latest   "/run.sh"           45 seconds ago      Exited (137) 12 seconds ago                       sleeper.9.1k6snzf3xnjjfayoq3fhfvdtu
bf15e1d18a8a        urlund/sleeper:latest   "/run.sh"           45 seconds ago      Exited (137) 12 seconds ago                       sleeper.5.bqf8dn47yqcosd45uk0skzvyj
fc406aef2f9c        urlund/sleeper:latest   "/run.sh"           45 seconds ago      Exited (137) 12 seconds ago                       sleeper.12.68hlgm025wtur6xa57v8gfi8i
d5a9359d120e        urlund/sleeper:latest   "/run.sh"           45 seconds ago      Exited (137) 12 seconds ago                       sleeper.7.0ql6whgya5753ckk12nahqx5u
46035a959fbe        urlund/sleeper:latest   "/run.sh"           45 seconds ago      Exited (137) 12 seconds ago                       sleeper.10.e5z07uzu46z9c051iusx6qpqf

Describe the results you expected:
I would expect all containers to be removed.

Additional information you deem important (e.g. issue happens only occasionally):
You should not be able to reproduce this without creating/configuring an overlay network.

@icecrime added the exp/expert, kind/bug, area/networking, area/swarm and version/1.12 labels on Sep 1, 2016
@mrjana mrjana self-assigned this Sep 2, 2016
mrjana (Contributor) commented Sep 2, 2016

@urlund Yes, this is a bug: multiple goroutines race to delete the network, only one of them wins, and the others fail to delete the network and, because of that, fail to remove their containers. Will fix it. Other than leaving some stale containers behind, it does not cause any other functional issues.
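
To make the race concrete, here is a toy Go sketch of the pattern being described; the names (store, deleteNetwork, errNoSuchNetwork) are illustrative and not Docker code. Several task shutdowns race to delete the shared network, only one delete can succeed, and the losers see a "not found" error that, before the fix, aborted the rest of their container cleanup:

    package main

    import (
        "errors"
        "fmt"
        "sync"
    )

    var errNoSuchNetwork = errors.New("no such network")

    // store stands in for the daemon's network store.
    type store struct {
        mu       sync.Mutex
        networks map[string]bool
    }

    func (s *store) deleteNetwork(id string) error {
        s.mu.Lock()
        defer s.mu.Unlock()
        if !s.networks[id] {
            return errNoSuchNetwork // the losers of the race end up here
        }
        delete(s.networks, id)
        return nil
    }

    func main() {
        s := &store{networks: map[string]bool{"sleeper_nw": true}}
        var wg sync.WaitGroup
        for task := 0; task < 5; task++ {
            wg.Add(1)
            go func(task int) {
                defer wg.Done()
                err := s.deleteNetwork("sleeper_nw")
                // The essence of the eventual fix: "not found" means some
                // other task already deleted the network, which is the
                // desired end state, so continue cleanup instead of failing.
                if err != nil && !errors.Is(err, errNoSuchNetwork) {
                    fmt.Printf("task %d: aborting cleanup: %v\n", task, err)
                    return
                }
                fmt.Printf("task %d: network gone, removing container\n", task)
            }(task)
        }
        wg.Wait()
    }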

mostolog commented Sep 5, 2016

Having a similar issue here: 1 swarm node, 1 service, 3 replicas, overlay network... and service rm always leaves one exited task that can be seen with docker ps -a.

Is there anything we could do/test for you?

xiaods (Contributor) commented Sep 5, 2016

@mrjana could you please point out where the problem should be fixed?

jmzwcn (Contributor) commented Sep 7, 2016

@xiaods, as mrjana said, this issue comes from the swarmkit and libnetwork side. I will investigate the code and share my findings with you. Thanks!

mostolog commented Sep 7, 2016

I don't know if this is somehow useful, but with just one swarm node running just one service, the leftover containers always seem to be half the number of replicas.

xiaods (Contributor) commented Sep 8, 2016

@jmzwcn cool, waiting for your result.

jmzwcn (Contributor) commented Sep 9, 2016

mostolog commented Sep 9, 2016

Are the logs produced by lines like log.G(ctx).WithError(err).Errorf("failed to list tasks") visible somewhere, e.g. when running docker in verbose mode or via docker logs?

I could dump some log traces if you need them.

cpuguy83 (Member) commented Sep 9, 2016

@mostolog docker writes its logs to stderr, so where the logs end up depends on which init system is in use and how it is set up.

If you are using systemd, the logs are in journald.
If you are using upstart, typically the logs would be in /var/log/upstart/docker.log.
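
(On a typical systemd setup, for example, journalctl -u docker.service will show the daemon logs.)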

mostolog commented Sep 9, 2016

Supposed to be anonymized. I just ran 4 commands:

  1. service create without overlay network
  2. service rm (all containers are gone)
  3. service create with overlay network
  4. service rm (one dead container left)

log.txt attached
Hope it helps.

jmzwcn (Contributor) commented Sep 13, 2016

Actually, when we use a built-in network (ingress [overlay] or null [bridge]), this issue is gone. I will investigate further.

xiaods (Contributor) commented Sep 21, 2016

@jmzwcn waiting for your confirmation.

mrjana (Contributor) commented Sep 22, 2016

@jmzwcn if you want to take care of fixing this issue, it should be fixed here: https://github.com/docker/docker/blob/master/daemon/cluster/executor/container/adapter.go#L136.

In addition to ignoring ActiveEndpointsError, it also needs to ignore the NoSuchNetwork error defined here: https://github.com/docker/libnetwork/blob/master/error.go#L8. This is because, when deleting multiple tasks that are connected to the same network, the network may be removed before all tasks have completely shut down. If the network is no longer there, there is no need to fail the executor, since removing it was the intention to begin with.
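
For illustration, the shape of that change could look roughly like the sketch below. This is a fragment, not a self-contained program: the loop, c.container.networks() and c.backend.DeleteManagedNetwork() are assumed from the adapter.go code linked above as of this discussion, and the merged patch may differ:

    // Sketch of removeNetworks in daemon/cluster/executor/container/adapter.go,
    // extended per the suggestion above; not the merged patch itself.
    func (c *containerAdapter) removeNetworks(ctx context.Context) error {
        for _, nid := range c.container.networks() {
            if err := c.backend.DeleteManagedNetwork(nid); err != nil {
                switch err.(type) {
                case *libnetwork.ActiveEndpointsError:
                    // Other tasks still hold endpoints on the network;
                    // the last task to shut down will remove it.
                    continue
                case libnetwork.ErrNoSuchNetwork:
                    // A sibling task already removed the network, which is
                    // the intended end state, so don't fail the executor.
                    continue
                default:
                    log.G(ctx).Errorf("network %s remove failed: %v", nid, err)
                    return err
                }
            }
        }
        return nil
    }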

jmzwcn (Contributor) commented Sep 22, 2016

Great, I think that's exactly it. I will verify it and submit a PR.

xiaods (Contributor) commented Sep 22, 2016

            // Treat a missing network as already removed and keep going.
            // (libnetwork.ErrNoSuchNetwork is a value type, so assert on
            // the value, not a pointer.)
            if _, ok := err.(libnetwork.ErrNoSuchNetwork); ok {
                continue
            }

xiaods (Contributor) commented Sep 25, 2016

@jmzwcn I have followed @mrjana's hints and reproduced the steps in the comments above; the docker logs report:

time="2016-09-25T08:36:55.472486240Z" level=error msg="network sleeper_nw remove failed: network sleeper_nw not found" module=taskmanager task.id=dcc7wujc41mkagd992va0f61o 
time="2016-09-25T08:36:55.472542896Z" level=error msg="remove task failed" error="network sleeper_nw not found" module=taskmanager task.id=dcc7wujc41mkagd992va0f61o 
time="2016-09-25T08:36:55.505292889Z" level=error msg="network sleeper_nw remove failed: network sleeper_nw not found" module=taskmanager task.id=cm48umnyg4zvmaff8lodnypfd 
time="2016-09-25T08:36:55.505292927Z" level=error msg="remove task failed" error="network sleeper_nw not found" module=taskmanager task.id=cm48umnyg4zvmaff8lodnypfd 

I can confirm the issue is caused by ErrNoSuchNetwork. Please send a PR ASAP, thanks a lot.

jmzwcn (Contributor) commented Sep 26, 2016

Yes, I have verified using a local dev binary with the fix (also including UnknownNetworkError), and the issue is gone as expected.

I will create a PR soon.

jmzwcn (Contributor) commented Sep 26, 2016

I couldn't find an appropriate place for a test, so I directly submitted a PR here. Please let me know if there is any problem. @xiaods @urlund @mrjana @mostolog @cpuguy83

jmzwcn (Contributor) commented Sep 30, 2016

Uploaded the latest binary (without UnknownNetworkError) here; could anybody help verify it too?

mostolog commented

@jmzwcn .deb? 😁

jmzwcn (Contributor) commented Sep 30, 2016

The dev build takes too long; I'm not sure a deb can be completed before I leave the office. 😁

xiaods (Contributor) commented Sep 30, 2016

1.12.2-rc1 does not fix it either; please have a try with the new testing binary.

jmzwcn (Contributor) commented Oct 3, 2016

All debs have been uploaded here; please try the one for your distribution. Thanks!

mostolog commented Oct 3, 2016

Just ran the previous 4 commands:

  1. service create without overlay network
  2. service rm (all containers are gone)
  3. service create with overlay network
  4. service rm (all containers are gone)

Tested on debian-jessie on a single-node swarm cluster.
Thanks @jmzwcn !!!

xiaods (Contributor) commented Oct 7, 2016

Waiting for the patch to be merged; then this can be closed.

jmzwcn (Contributor) commented Oct 11, 2016

@xiaods as @aboch said in the PR, whether it gets merged into 1.12.x will be decided by @mrjana.
