
Swarm service, with overlay network, fails to remove all containers #26244

Closed
urlund opened this issue Sep 1, 2016 · 26 comments
Labels: area/networking, area/swarm, exp/expert, kind/bug, version/1.12

urlund commented Sep 1, 2016

Output of docker version:

Client:
 Version:      1.12.1
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   23cf638
 Built:        Thu Aug 18 05:02:53 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.1
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   23cf638
 Built:        Thu Aug 18 05:02:53 2016
 OS/Arch:      linux/amd64

Output of docker info:

Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 4
Server Version: 1.12.1
Storage Driver: overlay2
 Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: null bridge host overlay
Swarm: active
 NodeID: 0phf7j5medrlnam24hcki16db
 Is Manager: true
 ClusterID: 0ol98k8pblzgb1xpqeuuqn27f
 Managers: 1
 Nodes: 3
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: 10.20.21.1
Runtimes: runc
Default Runtime: runc
Security Options:
Kernel Version: 4.6.0-0.bpo.1-amd64
Operating System: Debian GNU/Linux 8 (jessie)
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 3.911 GiB
Name: mc-docker1
ID: RG6N:UCJN:AT5K:AYYN:ZPH6:6IAN:ARJZ:TWAD:ZFTZ:BSYC:A4AN:TTP3
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No memory limit support
WARNING: No swap limit support
WARNING: No kernel memory limit support
WARNING: No oom kill disable support
Insecure Registries:
 127.0.0.0/8

Steps to reproduce the issue:

  1. docker network create -d overlay sleeper_nw
  2. docker service create --name sleeper --replicas 20 --network sleeper_nw urlund/sleeper
  3. docker service rm sleeper

Describe the results you received:
Output from docker ps -a:

Node 1:

CONTAINER ID        IMAGE                   COMMAND             CREATED             STATUS                       PORTS               NAMES
9b79d94bcebb        urlund/sleeper:latest   "/run.sh"           41 seconds ago      Exited (137) 7 seconds ago                       sleeper.20.5shcg82g0p2sn8hkqidx6yuim
239a00862bc4        urlund/sleeper:latest   "/run.sh"           41 seconds ago      Exited (137) 7 seconds ago                       sleeper.2.7vm1a1u41fwpf3mjrviphsfyo
274ccd13294e        urlund/sleeper:latest   "/run.sh"           41 seconds ago      Exited (137) 7 seconds ago                       sleeper.18.1qzllrci7j6rzpjdvk01fxip7
7837971e8e4d        urlund/sleeper:latest   "/run.sh"           41 seconds ago      Exited (137) 7 seconds ago                       sleeper.6.dzrkydszgnvd4mzyy37fyxarc

Node 2:

CONTAINER ID        IMAGE                   COMMAND             CREATED             STATUS                        PORTS               NAMES
854ef7a322b7        urlund/sleeper:latest   "/run.sh"           44 seconds ago      Exited (137) 10 seconds ago                       sleeper.19.609h64iauf7oa1i95o0qi7qqk
fcfcf0208ebf        urlund/sleeper:latest   "/run.sh"           44 seconds ago      Exited (137) 10 seconds ago                       sleeper.15.dpxzl9lysf6koofzhld6o8hnq

Node 3:

CONTAINER ID        IMAGE                   COMMAND             CREATED             STATUS                        PORTS               NAMES
32518f0c9ce2        urlund/sleeper:latest   "/run.sh"           45 seconds ago      Exited (137) 12 seconds ago                       sleeper.9.1k6snzf3xnjjfayoq3fhfvdtu
bf15e1d18a8a        urlund/sleeper:latest   "/run.sh"           45 seconds ago      Exited (137) 12 seconds ago                       sleeper.5.bqf8dn47yqcosd45uk0skzvyj
fc406aef2f9c        urlund/sleeper:latest   "/run.sh"           45 seconds ago      Exited (137) 12 seconds ago                       sleeper.12.68hlgm025wtur6xa57v8gfi8i
d5a9359d120e        urlund/sleeper:latest   "/run.sh"           45 seconds ago      Exited (137) 12 seconds ago                       sleeper.7.0ql6whgya5753ckk12nahqx5u
46035a959fbe        urlund/sleeper:latest   "/run.sh"           45 seconds ago      Exited (137) 12 seconds ago                       sleeper.10.e5z07uzu46z9c051iusx6qpqf

Describe the results you expected:
I would expect all containers to be removed.

Additional information you deem important (e.g. issue happens only occasionally):
You should not be able to reproduce this without creating/configuring an overlay network.

@icecrime added the exp/expert, kind/bug, area/networking, area/swarm and version/1.12 labels on Sep 1, 2016
@mrjana mrjana self-assigned this Sep 2, 2016
mrjana (Contributor) commented Sep 2, 2016

@urlund Yes, this is a bug: multiple goroutines race to delete the network, only one of them wins, and the others fail to delete the network and, because of that, fail to remove their containers. Will fix it. Other than leaving some stale containers behind, it does not cause any other functional issues.
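
To make the race concrete, here is a toy Go sketch of the pattern being described; the names (store, deleteNetwork, errNoSuchNetwork) are illustrative and not Docker code. Several task shutdowns race to delete the shared network, only one delete can succeed, and the losers see a "not found" error that, before the fix, aborted the rest of their container cleanup:

    package main

    import (
        "errors"
        "fmt"
        "sync"
    )

    var errNoSuchNetwork = errors.New("no such network")

    // store stands in for the daemon's network store.
    type store struct {
        mu       sync.Mutex
        networks map[string]bool
    }

    func (s *store) deleteNetwork(id string) error {
        s.mu.Lock()
        defer s.mu.Unlock()
        if !s.networks[id] {
            return errNoSuchNetwork // the losers of the race end up here
        }
        delete(s.networks, id)
        return nil
    }

    func main() {
        s := &store{networks: map[string]bool{"sleeper_nw": true}}
        var wg sync.WaitGroup
        for task := 0; task < 5; task++ {
            wg.Add(1)
            go func(task int) {
                defer wg.Done()
                err := s.deleteNetwork("sleeper_nw")
                // The essence of the eventual fix: "not found" means some
                // other task already deleted the network, which is the
                // desired end state, so continue cleanup instead of failing.
                if err != nil && !errors.Is(err, errNoSuchNetwork) {
                    fmt.Printf("task %d: aborting cleanup: %v\n", task, err)
                    return
                }
                fmt.Printf("task %d: network gone, removing container\n", task)
            }(task)
        }
        wg.Wait()
    }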

mostolog commented Sep 5, 2016

Having a similar issue here: 1 swarm node, 1 service, 3 replicas, overlay network... and service rm always leaves one exited task that can be seen with docker ps -a.

Is there anything we could do/test for you?

xiaods (Contributor) commented Sep 5, 2016

@mrjana could you please point out where the problem should be fixed?

jmzwcn (Contributor) commented Sep 7, 2016

@xiaods, as mrjana said, this issue comes from the swarmkit and libnetwork side. I will investigate the code and share my findings with you. Thanks!

mostolog commented Sep 7, 2016

I don't know if this is somehow useful, but with just one swarm node running just one service, the leftover containers always seem to be half the number of replicas.

xiaods (Contributor) commented Sep 8, 2016

@jmzwcn cool, waiting for your result.

jmzwcn (Contributor) commented Sep 9, 2016

mostolog commented Sep 9, 2016

Are the logs produced by lines like log.G(ctx).WithError(err).Errorf("failed to list tasks") visible somewhere, e.g. when running docker in verbose mode or via docker logs?

I could dump some log traces if you need them.

cpuguy83 (Member) commented Sep 9, 2016

@mostolog docker writes its logs to stderr, so where the logs end up depends on which init system is in use and how it is set up.

If you are using systemd, the logs are in journald.
If you are using upstart, typically the logs would be in /var/log/upstart/docker.log.
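
(On a typical systemd setup, for example, journalctl -u docker.service will show the daemon logs.)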

mostolog commented Sep 9, 2016

Supposed to be anonymized. I just ran 4 commands:

  1. service create without overlay network
  2. service rm (all containers are gone)
  3. service create with overlay network
  4. service rm (one dead container left)

log.txt attached
Hope it helps.

jmzwcn (Contributor) commented Sep 13, 2016

Actually, when we use a built-in network (ingress [overlay] or null [bridge]), this issue is gone. I will investigate further.

xiaods (Contributor) commented Sep 21, 2016

@jmzwcn waiting for your confirmation.

mrjana (Contributor) commented Sep 22, 2016

@jmzwcn if you want to take care of fixing this issue, it should be fixed here: https://github.com/docker/docker/blob/master/daemon/cluster/executor/container/adapter.go#L136.

In addition to ignoring ActiveEndpointsError, it also needs to ignore the NoSuchNetwork error defined here: https://github.com/docker/libnetwork/blob/master/error.go#L8. This is because, when deleting multiple tasks that are connected to the same network, the network may be removed before all tasks have completely shut down. If the network is no longer there, there is no need to fail the executor, since removing it was the intention to begin with.
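
For illustration, the shape of that change could look roughly like the sketch below. This is a fragment, not a self-contained program: the loop, c.container.networks() and c.backend.DeleteManagedNetwork() are assumed from the adapter.go code linked above as of this discussion, and the merged patch may differ:

    // Sketch of removeNetworks in daemon/cluster/executor/container/adapter.go,
    // extended per the suggestion above; not the merged patch itself.
    func (c *containerAdapter) removeNetworks(ctx context.Context) error {
        for _, nid := range c.container.networks() {
            if err := c.backend.DeleteManagedNetwork(nid); err != nil {
                switch err.(type) {
                case *libnetwork.ActiveEndpointsError:
                    // Other tasks still hold endpoints on the network;
                    // the last task to shut down will remove it.
                    continue
                case libnetwork.ErrNoSuchNetwork:
                    // A sibling task already removed the network, which is
                    // the intended end state, so don't fail the executor.
                    continue
                default:
                    log.G(ctx).Errorf("network %s remove failed: %v", nid, err)
                    return err
                }
            }
        }
        return nil
    }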

jmzwcn (Contributor) commented Sep 22, 2016

Great, I think that's exactly it. I will verify it and submit a PR.

xiaods (Contributor) commented Sep 22, 2016

            // Treat a missing network as already removed and keep going.
            // (libnetwork.ErrNoSuchNetwork is a value type, so assert on
            // the value, not a pointer.)
            if _, ok := err.(libnetwork.ErrNoSuchNetwork); ok {
                continue
            }

xiaods (Contributor) commented Sep 25, 2016

@jmzwcn I have followed @mrjana's hints and reproduced the steps in the comments above; the docker logs report:

time="2016-09-25T08:36:55.472486240Z" level=error msg="network sleeper_nw remove failed: network sleeper_nw not found" module=taskmanager task.id=dcc7wujc41mkagd992va0f61o 
time="2016-09-25T08:36:55.472542896Z" level=error msg="remove task failed" error="network sleeper_nw not found" module=taskmanager task.id=dcc7wujc41mkagd992va0f61o 
time="2016-09-25T08:36:55.505292889Z" level=error msg="network sleeper_nw remove failed: network sleeper_nw not found" module=taskmanager task.id=cm48umnyg4zvmaff8lodnypfd 
time="2016-09-25T08:36:55.505292927Z" level=error msg="remove task failed" error="network sleeper_nw not found" module=taskmanager task.id=cm48umnyg4zvmaff8lodnypfd 

I can confirm the issue is caused by ErrNoSuchNetwork. Please send a PR ASAP, thanks a lot.

jmzwcn (Contributor) commented Sep 26, 2016

Yes, I have verified using a local dev binary with the fix (also including UnknownNetworkError), and the issue is gone as expected.

I will create a PR soon.

jmzwcn (Contributor) commented Sep 26, 2016

I couldn't find an appropriate place for a test, so I directly submitted a PR here. Please let me know if there is any problem. @xiaods @urlund @mrjana @mostolog @cpuguy83

jmzwcn (Contributor) commented Sep 30, 2016

Uploaded the latest binary (without UnknownNetworkError) here; could anybody help verify it too?

mostolog commented

@jmzwcn .deb? 😁

jmzwcn (Contributor) commented Sep 30, 2016

The dev build takes too long; I'm not sure a deb can be completed before I leave the office. 😁

xiaods (Contributor) commented Sep 30, 2016

1.12.2-rc1 does not fix it either; please have a try with the new testing binary.

jmzwcn (Contributor) commented Oct 3, 2016

All debs have been uploaded here; please try the one for your distribution. Thanks!

mostolog commented Oct 3, 2016

Just ran the previous 4 commands:

  1. service create without overlay network
  2. service rm (all containers are gone)
  3. service create with overlay network
  4. service rm (all containers are gone)

Tested on debian-jessie on a single-node swarm cluster.
Thanks @jmzwcn !!!

xiaods (Contributor) commented Oct 7, 2016

Waiting for the patch to be merged; then this can be closed.

jmzwcn (Contributor) commented Oct 11, 2016

@xiaods as @aboch said in the PR, whether it gets merged into 1.12.x will be decided by @mrjana.
