docker service rm sometimes leaves orphan containers #24244

Closed
ghost opened this issue Jul 1, 2016 · 12 comments
Assignees: stevvooe
Labels: area/swarm, kind/bug, priority/P2, version/master
Milestone: 1.12.0

Comments

@ghost commented Jul 1, 2016

Summary: when I run docker service rm xxx, where xxx is a service with 5 replicas spread across two nodes, only one replica gets removed on each node. The remaining three replicas get stopped, but not removed.

Output of docker version:

ubuntu@ip-192-168-33-20:~$ docker version
Client:
 Version:      1.12.0-dev
 API version:  1.25
 Go version:   go1.6.2
 Git commit:   691bbea
 Built:        Fri Jul  1 17:24:51 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.0-dev
 API version:  1.25
 Go version:   go1.6.2
 Git commit:   691bbea
 Built:        Fri Jul  1 17:24:51 2016
 OS/Arch:      linux/amd64

Output of docker info:

ubuntu@ip-192-168-33-20:~$ docker info
Containers: 3
 Running: 0
 Paused: 0
 Stopped: 3
Images: 3
Server Version: 1.12.0-dev
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 43
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge host null overlay
Swarm: active
 NodeID: 6jjvdt26ht5goi1eqdl0vbmb7
 IsManager: No
Runtimes: runc
Default Runtime: runc
Security Options: apparmor seccomp
Kernel Version: 4.2.0-41-generic
Operating System: Ubuntu 15.10
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 991.1 MiB
Name: ip-192-168-33-20
ID: RRIP:NAJO:QAPS:HJB2:4RS2:FRPZ:LTBM:W4CZ:IXIB:6PNU:LY7N:NTMR
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 19
 Goroutines: 60
 System Time: 2016-07-01T22:41:20.925823885Z
 EventsListeners: 0
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
 127.0.0.0/8

Additional environment details (AWS, VirtualBox, physical, etc.):

Two node swarm on AWS

Steps to reproduce the issue:

  1. Start with a working 2-node swarm; manager = 192.168.33.10 and worker = 192.168.33.20

  2. Start a service with 5 replicas: docker service create --replicas 5 --name frontend --env QUEUE_HOSTNAME=redis --env OPTION_A=Kats --env OPTION_B=Doggies --network mynetwork --publish 80/tcp mikegoelzer/s2_frontend:latest

  3. Observe that in this case 2 containers get scheduled to 192.168.33.10, and 3 to 192.168.33.20:

ubuntu@ip-192-168-33-10:~$ docker ps
CONTAINER ID        IMAGE                            COMMAND                  CREATED             STATUS              PORTS               NAMES
ec052f0efbde        mikegoelzer/s2_frontend:latest   "python app.py"          3 hours ago         Up 3 hours          80/tcp              frontend.5.6xafdlj24uh4usev1vbaodsnc
6ebda47c63f0        mikegoelzer/s2_frontend:latest   "python app.py"          3 hours ago         Up 3 hours          80/tcp              frontend.1.bvat96kuyy5xokehxvinas5pw

and

ubuntu@ip-192-168-33-20:~$ docker ps
CONTAINER ID        IMAGE                            COMMAND             CREATED             STATUS              PORTS               NAMES
5cf10ee5f606        mikegoelzer/s2_frontend:latest   "python app.py"     3 hours ago         Up 3 hours          80/tcp              frontend.2.0bf6r1kx711vxaouu0o9u5gul
f63e47c854e9        mikegoelzer/s2_frontend:latest   "python app.py"     3 hours ago         Up 3 hours          80/tcp              frontend.3.7pqp83961dzydonllkbf4nvqi
09119685c9ca        mikegoelzer/s2_frontend:latest   "python app.py"     3 hours ago         Up 3 hours          80/tcp              frontend.4.44crtadceuot29rfoa2v1i676

  4. Now do docker service rm frontend

  5. Wait several minutes to ensure the system has had adequate time to converge

  6. Run docker ps -a on both machines and observe the following:

ubuntu@ip-192-168-33-10:~$ docker ps -a
CONTAINER ID        IMAGE                            COMMAND             CREATED             STATUS                      PORTS               NAMES
ec052f0efbde        mikegoelzer/s2_frontend:latest   "python app.py"     3 hours ago         Exited (0) 49 seconds ago                       frontend.5.6xafdlj24uh4usev1vbaodsnc

and

ubuntu@ip-192-168-33-20:~$ docker ps -a
CONTAINER ID        IMAGE                            COMMAND             CREATED             STATUS                        PORTS               NAMES
f63e47c854e9        mikegoelzer/s2_frontend:latest   "python app.py"     4 hours ago         Exited (137) 14 minutes ago                       frontend.3.7pqp83961dzydonllkbf4nvqi
09119685c9ca        mikegoelzer/s2_frontend:latest   "python app.py"     4 hours ago         Exited (137) 14 minutes ago                       frontend.4.44crtadceuot29rfoa2v1i676

Describe the results you received:
6ebda47c63f0 and 5cf10ee5f606 were stopped and removed. The other three containers were stopped but not removed.

Describe the results you expected:
I expect all 5 containers to be stopped and removed.

Additional information you deem important (e.g. issue happens only occasionally):
None

@aaronlehmann

ping @stevvooe @tonistiigi

The orchestrator should be removing all associated tasks when the service is removed. I think the issue is most likely at the agent level.
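
As a rough illustration of that split of responsibility, the sketch below models a hypothetical orchestrator deleting every task of a removed service, and an agent reconciling its containers against the tasks it is still assigned. All type and function names here are invented for illustration and are not swarmkit's actual API; the report above suggests the second half (the agent-side cleanup) is what sometimes fails.

package main

import "fmt"

// Task is a hypothetical scheduled unit that belongs to a service.
type Task struct {
    ID        string
    ServiceID string
}

// removeService models the orchestrator side: removing a service should
// delete every task that belongs to it, which the dispatcher then reflects
// in a shrinking assignment set sent to each agent.
func removeService(tasks map[string]Task, serviceID string) {
    for id, t := range tasks {
        if t.ServiceID == serviceID {
            delete(tasks, id)
        }
    }
}

// reconcile models the agent side: any container whose task is no longer in
// the assignment set must be stopped and removed. The bug reported here is
// that this cleanup does not always happen for every replica.
func reconcile(assigned map[string]Task, containers map[string]string) {
    for taskID, containerID := range containers {
        if _, ok := assigned[taskID]; !ok {
            fmt.Println("stop and remove", containerID)
            delete(containers, taskID)
        }
    }
}

func main() {
    tasks := map[string]Task{
        "t1": {ID: "t1", ServiceID: "frontend"},
        "t2": {ID: "t2", ServiceID: "frontend"},
    }
    containers := map[string]string{"t1": "c1", "t2": "c2"}
    removeService(tasks, "frontend")
    reconcile(tasks, containers) // should remove both containers
}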

@tonistiigi

Just noticed that our test for this only checks active containers: https://github.com/docker/docker-1.12-integration/blob/1.12-integration/integration-cli/docker_api_swarm_test.go#L225. That needs to be updated.
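
For reference, a check along the lines described here has to look at all containers, not just running ones. The standalone sketch below uses plain os/exec and docker ps -a with a name filter rather than the integration-cli helpers in the linked test; the service name "frontend" is only an example.

package main

import (
    "fmt"
    "os/exec"
    "strings"
)

// remainingContainers lists the IDs of all containers (running or exited)
// whose name matches the given service name prefix.
func remainingContainers(service string) ([]string, error) {
    out, err := exec.Command("docker", "ps", "-aq", "--filter", "name="+service).Output()
    if err != nil {
        return nil, err
    }
    return strings.Fields(string(out)), nil
}

func main() {
    ids, err := remainingContainers("frontend")
    if err != nil {
        fmt.Println("docker ps failed:", err)
        return
    }
    if len(ids) != 0 {
        fmt.Printf("FAIL: %d orphan container(s) left after service rm: %v\n", len(ids), ids)
        return
    }
    fmt.Println("OK: no containers left for the service")
}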

@thaJeztah added this to the 1.12.0 milestone Jul 1, 2016
@stevvooe commented Jul 2, 2016

Logs? My guess is that an error escaped during removal and we just failed out.

@thaJeztah added the kind/bug and priority/P2 labels Jul 2, 2016
@tonistiigi

@mgoelzer I tried to reproduce this but couldn't. Does it appear only with specific images or a particular networking setup? If you can reproduce it, please post the daemon logs.

@stevvooe commented Jul 7, 2016

Without logs, I am not sure if this is the same issue, but I've reproduced this in rc3.

After generating these examples and removing the service, there are orphan containers:

$ docker ps -a -f name=angry
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
c4683915c083        alpine:latest       "sh"                20 hours ago        Created                                 angry_volhard.8.158923tb8caod0qe8l29393un
eb063366c99a        alpine:latest       "sh"                20 hours ago        Created                                 angry_volhard.4.cd1jwcz7m8tbj0o4tuhpkj89w
4b17aa0280d5        alpine:latest       "sh"                20 hours ago        Created                                 angry_volhard.6.dxm3enk1xqvnz77kni8f4lpfo

We can see this is a bug in the agent, since it receives a zero-length assignment set:

time="2016-07-07T02:17:12.840678263Z" level=debug msg="(*worker).Assign" len(tasks)=5 module=agent
time="2016-07-07T02:17:12.841325218Z" level=debug msg=assigned module=agent task.desiredstate=SHUTDOWN task.id=eao1hthvmca9do23p1zq740rd
time="2016-07-07T02:17:12.841878234Z" level=debug msg=assigned module=agent task.desiredstate=SHUTDOWN task.id=dzfoijc0e2yw56rqimqbihfob
time="2016-07-07T02:17:12.842323922Z" level=debug msg=assigned module=agent task.desiredstate=SHUTDOWN task.id=e8co44ynek75xzxqp2m69ya6h
time="2016-07-07T02:17:12.842748483Z" level=debug msg=assigned module=agent task.desiredstate=SHUTDOWN task.id=ey808aaq9y7uuhdb05m7q7s8g
time="2016-07-07T02:17:12.843165886Z" level=debug msg=assigned module=agent task.desiredstate=SHUTDOWN task.id=e4gc1y8o2lg6tv8zu4tx5xesq
time="2016-07-07T02:17:12.847570754Z" level=debug msg="(*worker).Assign" len(tasks)=4 module=agent
time="2016-07-07T02:17:12.848010114Z" level=debug msg=assigned module=agent task.desiredstate=SHUTDOWN task.id=e8co44ynek75xzxqp2m69ya6h
time="2016-07-07T02:17:12.848383331Z" level=debug msg=assigned module=agent task.desiredstate=SHUTDOWN task.id=ey808aaq9y7uuhdb05m7q7s8g
time="2016-07-07T02:17:12.848889004Z" level=debug msg=assigned module=agent task.desiredstate=SHUTDOWN task.id=e4gc1y8o2lg6tv8zu4tx5xesq
time="2016-07-07T02:17:12.849477471Z" level=debug msg=assigned module=agent task.desiredstate=SHUTDOWN task.id=eao1hthvmca9do23p1zq740rd
time="2016-07-07T02:17:12.853341242Z" level=debug msg="(*worker).Assign" len(tasks)=3 module=agent
time="2016-07-07T02:17:12.853996649Z" level=debug msg=assigned module=agent task.desiredstate=SHUTDOWN task.id=ey808aaq9y7uuhdb05m7q7s8g
time="2016-07-07T02:17:12.854522246Z" level=debug msg=assigned module=agent task.desiredstate=SHUTDOWN task.id=eao1hthvmca9do23p1zq740rd
time="2016-07-07T02:17:12.855985998Z" level=debug msg=assigned module=agent task.desiredstate=SHUTDOWN task.id=e8co44ynek75xzxqp2m69ya6h
time="2016-07-07T02:17:12.857859799Z" level=debug msg="(*worker).Assign" len(tasks)=2 module=agent
time="2016-07-07T02:17:12.858316904Z" level=debug msg=assigned module=agent task.desiredstate=SHUTDOWN task.id=eao1hthvmca9do23p1zq740rd
time="2016-07-07T02:17:12.858814482Z" level=debug msg=assigned module=agent task.desiredstate=SHUTDOWN task.id=ey808aaq9y7uuhdb05m7q7s8g
time="2016-07-07T02:17:12.876622577Z" level=debug msg="(*worker).Assign" len(tasks)=1 module=agent
time="2016-07-07T02:17:12.877513298Z" level=debug msg=assigned module=agent task.desiredstate=SHUTDOWN task.id=ey808aaq9y7uuhdb05m7q7s8g
time="2016-07-07T02:17:12.880530782Z" level=debug msg="(*worker).Assign" len(tasks)=0 module=agent
time="2016-07-07T02:17:12.921916215Z" level=debug msg="Increasing token expiration to: 60 seconds"
time="2016-07-07T02:17:13.134765261Z" level=debug msg="Pulling ref from V2 registry: alpine:latest"
time="2016-07-07T02:17:13.136058894Z" level=debug msg="pull progress map[status:Pulling from library/alpine id:latest]"
time="2016-07-07T02:17:13.136918198Z" level=debug msg="pull progress map[status:Digest: sha256:3dcdb92d7432d56604d4545cbd324b14e647b313626d99b889d0626de158f73a id:latest]"
time="2016-07-07T02:17:13.137394238Z" level=debug msg="pull progress map[status:Status: Image is up to date for alpine:latest id:latest]"
time="2016-07-07T02:17:13.191278185Z" level=debug msg="container mounted via layerStore: /var/lib/docker/aufs/mnt/4602f992138863483606b474e0d17776e4b45ad357eee2b98c35804e2e8b2116"
time="2016-07-07T02:17:13.226260160Z" level=debug msg="state changed" module=taskmanager state.desired=RUNNING state.transition="PREPARING->READY" task.id=cd1jwcz7m8tbj0o4tuhpkj89w
time="2016-07-07T02:17:13.228080658Z" level=debug msg="(*Agent).UpdateTaskStatus" module=agent task.id=cd1jwcz7m8tbj0o4tuhpkj89w
time="2016-07-07T02:17:13.229743960Z" level=warning msg="cannot find target task in store" method="(*Dispatcher).UpdateTaskStatus" node.id=1mml4ok4j88yixp1sttsbblw6 node.session=a2no8i7cqpb4e4n6tv6onf3hl task.id=cd1jwcz7m8tbj0o4tuhpkj89w
time="2016-07-07T02:17:13.230336112Z" level=debug msg="task status reported" module=agent
time="2016-07-07T02:17:13.235782665Z" level=error msg="task unavailable" method="(*Dispatcher).processTaskUpdates" module=dispatcher task.id=cd1jwcz7m8tbj0o4tuhpkj89w
time="2016-07-07T02:17:13.471820057Z" level=debug msg="Pulling ref from V2 registry: alpine:latest"
time="2016-07-07T02:17:13.472562320Z" level=debug msg="pull progress map[status:Pulling from library/alpine id:latest]"
time="2016-07-07T02:17:13.473125498Z" level=debug msg="pull progress map[status:Digest: sha256:3dcdb92d7432d56604d4545cbd324b14e647b313626d99b889d0626de158f73a id:latest]"
time="2016-07-07T02:17:13.473561482Z" level=debug msg="pull progress map[id:latest status:Status: Image is up to date for alpine:latest]"
time="2016-07-07T02:17:13.520004817Z" level=debug msg="container mounted via layerStore: /var/lib/docker/aufs/mnt/4f63ca0beba83351ebca5d553490d94cfa3f0508a748e20968e61abcc1529afe"
time="2016-07-07T02:17:13.556612793Z" level=debug msg="state changed" module=taskmanager state.desired=RUNNING state.transition="PREPARING->READY" task.id=dxm3enk1xqvnz77kni8f4lpfo
time="2016-07-07T02:17:13.557552436Z" level=debug msg="(*Agent).UpdateTaskStatus" module=agent task.id=dxm3enk1xqvnz77kni8f4lpfo
time="2016-07-07T02:17:13.560054352Z" level=warning msg="cannot find target task in store" method="(*Dispatcher).UpdateTaskStatus" node.id=1mml4ok4j88yixp1sttsbblw6 node.session=a2no8i7cqpb4e4n6tv6onf3hl task.id=dxm3enk1xqvnz77kni8f4lpfo
time="2016-07-07T02:17:13.560669873Z" level=debug msg="task status reported" module=agent
time="2016-07-07T02:17:13.641956929Z" level=error msg="task unavailable" method="(*Dispatcher).processTaskUpdates" module=dispatcher task.id=dxm3enk1xqvnz77kni8f4lpfo
time="2016-07-07T02:17:14.428810365Z" level=debug msg="Calling GET /v1.24/containers/json?all=1"

A few things to note:

  1. We don't see any removal errors.
  2. The errors present are not in the removal chain, as far as I can tell, so they aren't relevant.

I was unable to get a goroutine dump but my guess is there is a deadlock on remove.
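
To make the "deadlock on remove" hypothesis concrete, here is a minimal, purely illustrative Go program; the type and method names are invented and this is not swarmkit code. remove() holds a lock while waiting for the task's run loop to finish, but the run loop needs that same lock to record a status transition, so neither side makes progress and the container is never removed.

package main

import (
    "errors"
    "fmt"
    "sync"
    "time"
)

// taskController is a toy stand-in for a per-task controller, invented only
// to illustrate the hypothesized lock-ordering problem.
type taskController struct {
    mu    sync.Mutex
    state string
    done  chan struct{}
}

// run records a status transition under the lock and then signals completion.
func (c *taskController) run() {
    c.mu.Lock() // blocks while remove() holds the lock below
    c.state = "READY"
    c.mu.Unlock()
    close(c.done)
}

// remove holds the lock while waiting for the run loop to finish; since the
// run loop needs that same lock, the wait can never succeed. The timeout is
// only here so the demo terminates instead of hanging.
func (c *taskController) remove() error {
    c.mu.Lock()
    defer c.mu.Unlock()
    select {
    case <-c.done:
        return nil
    case <-time.After(2 * time.Second):
        return errors.New("remove timed out: likely deadlock")
    }
}

func main() {
    c := &taskController{done: make(chan struct{})}
    go func() {
        time.Sleep(100 * time.Millisecond) // let remove() take the lock first
        c.run()
    }()
    fmt.Println(c.remove()) // prints the timeout error; nothing gets removed
}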

@stevvooe self-assigned this Jul 7, 2016
@ghost changed the title from "docker service rm removes only 1 replica per node" to "docker service rm sometimes leaves orphan containers" Jul 8, 2016
@ghost commented Jul 8, 2016

Edited the title to reflect the commonalities between the reports from @stevvooe and me.

@jinuxstyle

I found a race condition between task creation and removal which could cause orphan containers, though I am not sure whether the race is the root cause of this issue. Please take a look at PR moby/swarmkit#1152.
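
To show the shape of such a race (this is only a toy model based on the description above, not the code touched by moby/swarmkit#1152), the sketch below has a removal arrive while the container for the same task is still being created: the removal finds nothing to delete, and the container that appears a moment later is never cleaned up.

package main

import (
    "fmt"
    "sync"
    "time"
)

// worker is a toy model of an agent tracking containers per task ID. The
// names and structure are invented for illustration.
type worker struct {
    mu         sync.Mutex
    containers map[string]string // taskID -> containerID
}

// create simulates a slow container creation that only registers the
// container once it exists.
func (w *worker) create(taskID string) {
    time.Sleep(50 * time.Millisecond) // container creation in flight
    w.mu.Lock()
    w.containers[taskID] = "container-for-" + taskID
    w.mu.Unlock()
}

// remove drops the task if it is tracked; if the removal races ahead of the
// creation it silently does nothing, and the container created afterwards
// becomes an orphan.
func (w *worker) remove(taskID string) {
    w.mu.Lock()
    defer w.mu.Unlock()
    if _, ok := w.containers[taskID]; !ok {
        return // race window: nothing to remove *yet*
    }
    delete(w.containers, taskID)
}

func main() {
    w := &worker{containers: map[string]string{}}
    var wg sync.WaitGroup
    wg.Add(1)
    go func() { defer wg.Done(); w.create("t1") }()
    w.remove("t1") // removal arrives while creation is still in progress
    wg.Wait()
    fmt.Println("orphans:", w.containers) // map[t1:container-for-t1]
}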

@stevvooe

@jinuxstyle Do you have a reliable reproduction? We aren't able to reproduce this easily and don't see the errors that should be displayed in the logs.

@jinuxstyle

Yes, I have a script that reproduces it 100% of the time. It needs a somewhat stressful workload to enlarge the race window. I will upload the script to a gist; hold on for a moment.

@jinuxstyle

Posted it here: https://gist.github.com/jinuxstyle/07f4decb25bfcf6d5388c0231c593b7a. To reproduce the issue, give the [loops] option a value greater than or equal to 40, according to my tests. The script uses busybox as the image, which is more lightweight. If you change the image to mikegoelzer/s2_frontend, you can reproduce it within 10 loops, as I tested.
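
For readers who prefer not to fetch the gist, the sketch below is a rough Go equivalent of that kind of stress loop. Only "busybox as the image" and "at least 40 loops" come from the comment above; the service name, replica count, container command, and sleep intervals are assumptions.

package main

import (
    "fmt"
    "os/exec"
    "strings"
    "time"
)

// run shells out to the docker CLI and returns its combined output.
func run(args ...string) string {
    out, err := exec.Command("docker", args...).CombinedOutput()
    if err != nil {
        fmt.Printf("docker %s failed: %v\n%s", strings.Join(args, " "), err, out)
    }
    return string(out)
}

func main() {
    const loops = 40 // per the comment above, >= 40 loops with busybox
    for i := 0; i < loops; i++ {
        run("service", "create", "--name", "stress", "--replicas", "5", "busybox", "sleep", "3600")
        time.Sleep(5 * time.Second) // give the tasks time to start
        run("service", "rm", "stress")
        time.Sleep(5 * time.Second) // give the agents time to converge

        leftovers := strings.Fields(run("ps", "-aq", "--filter", "name=stress"))
        if len(leftovers) > 0 {
            fmt.Printf("loop %d: %d orphan container(s): %v\n", i, len(leftovers), leftovers)
            return
        }
    }
    fmt.Println("no orphans observed")
}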

@stevvooe

@jinuxstyle @mgoelzer I've characterized the cause of this in #24858 (and with considerable help from @jinuxstyle 🙇).

@stevvooe

@mgoelzer For the most part, we have mitigated this for swarm mode. Let's go ahead and close this.

#24858 is still an active bug affecting swarmkit.
