docker service rm sometimes leaves orphan containers #24244

Closed
ghost opened this issue Jul 1, 2016 · 12 comments
Assignees: stevvooe
Labels: area/swarm, kind/bug, priority/P2, version/master
Milestone: 1.12.0

Comments

@ghost commented Jul 1, 2016

Summary: when I run docker service rm xxx, where xxx is a service with 5 replicas spread across two nodes, only one replica gets removed on each node. The remaining three replicas get stopped, but not removed.

Output of docker version:

ubuntu@ip-192-168-33-20:~$ docker version
Client:
 Version:      1.12.0-dev
 API version:  1.25
 Go version:   go1.6.2
 Git commit:   691bbea
 Built:        Fri Jul  1 17:24:51 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.0-dev
 API version:  1.25
 Go version:   go1.6.2
 Git commit:   691bbea
 Built:        Fri Jul  1 17:24:51 2016
 OS/Arch:      linux/amd64

Output of docker info:

ubuntu@ip-192-168-33-20:~$ docker info
Containers: 3
 Running: 0
 Paused: 0
 Stopped: 3
Images: 3
Server Version: 1.12.0-dev
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 43
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge host null overlay
Swarm: active
 NodeID: 6jjvdt26ht5goi1eqdl0vbmb7
 IsManager: No
Runtimes: runc
Default Runtime: runc
Security Options: apparmor seccomp
Kernel Version: 4.2.0-41-generic
Operating System: Ubuntu 15.10
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 991.1 MiB
Name: ip-192-168-33-20
ID: RRIP:NAJO:QAPS:HJB2:4RS2:FRPZ:LTBM:W4CZ:IXIB:6PNU:LY7N:NTMR
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 19
 Goroutines: 60
 System Time: 2016-07-01T22:41:20.925823885Z
 EventsListeners: 0
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
 127.0.0.0/8

Additional environment details (AWS, VirtualBox, physical, etc.):

Two node swarm on AWS

Steps to reproduce the issue:

  1. Start with a working 2-node swarm; manager = 192.168.33.10 and worker = 192.168.33.20

  2. Start a service with 5 replicas: docker service create --replicas 5 --name frontend --env QUEUE_HOSTNAME=redis --env OPTION_A=Kats --env OPTION_B=Doggies --network mynetwork --publish 80/tcp mikegoelzer/s2_frontend:latest

  3. Observe that in this case 2 containers get scheduled to 192.168.33.10, and 3 to 192.168.33.20:

ubuntu@ip-192-168-33-10:~$ docker ps
CONTAINER ID        IMAGE                            COMMAND                  CREATED             STATUS              PORTS               NAMES
ec052f0efbde        mikegoelzer/s2_frontend:latest   "python app.py"          3 hours ago         Up 3 hours          80/tcp              frontend.5.6xafdlj24uh4usev1vbaodsnc
6ebda47c63f0        mikegoelzer/s2_frontend:latest   "python app.py"          3 hours ago         Up 3 hours          80/tcp              frontend.1.bvat96kuyy5xokehxvinas5pw

and

ubuntu@ip-192-168-33-20:~$ docker ps
CONTAINER ID        IMAGE                            COMMAND             CREATED             STATUS              PORTS               NAMES
5cf10ee5f606        mikegoelzer/s2_frontend:latest   "python app.py"     3 hours ago         Up 3 hours          80/tcp              frontend.2.0bf6r1kx711vxaouu0o9u5gul
f63e47c854e9        mikegoelzer/s2_frontend:latest   "python app.py"     3 hours ago         Up 3 hours          80/tcp              frontend.3.7pqp83961dzydonllkbf4nvqi
09119685c9ca        mikegoelzer/s2_frontend:latest   "python app.py"     3 hours ago         Up 3 hours          80/tcp              frontend.4.44crtadceuot29rfoa2v1i676

  4. Now do docker service rm frontend

  5. Wait several minutes to ensure the system has had adequate time to converge

  6. Run docker ps -a on both machines and observe the following:

ubuntu@ip-192-168-33-10:~$ docker ps -a
CONTAINER ID        IMAGE                            COMMAND             CREATED             STATUS                      PORTS               NAMES
ec052f0efbde        mikegoelzer/s2_frontend:latest   "python app.py"     3 hours ago         Exited (0) 49 seconds ago                       frontend.5.6xafdlj24uh4usev1vbaodsnc

and

ubuntu@ip-192-168-33-20:~$ docker ps -a
CONTAINER ID        IMAGE                            COMMAND             CREATED             STATUS                        PORTS               NAMES
f63e47c854e9        mikegoelzer/s2_frontend:latest   "python app.py"     4 hours ago         Exited (137) 14 minutes ago                       frontend.3.7pqp83961dzydonllkbf4nvqi
09119685c9ca        mikegoelzer/s2_frontend:latest   "python app.py"     4 hours ago         Exited (137) 14 minutes ago                       frontend.4.44crtadceuot29rfoa2v1i676

Describe the results you received:
6ebda47c63f0 and 5cf10ee5f606 were stopped and removed. The other three containers were stopped but not removed.

Describe the results you expected:
I expect all 5 containers to be stopped and removed.

Additional information you deem important (e.g. issue happens only occasionally):
None

@aaronlehmann

ping @stevvooe @tonistiigi

The orchestrator should be removing all associated tasks when the service is removed. I think the issue is most likely at the agent level.
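
As a rough illustration of that split of responsibility, the sketch below models a hypothetical orchestrator deleting every task of a removed service, and an agent reconciling its containers against the tasks it is still assigned. All type and function names here are invented for illustration and are not swarmkit's actual API; the report above suggests the second half (the agent-side cleanup) is what sometimes fails.

package main

import "fmt"

// Task is a hypothetical scheduled unit that belongs to a service.
type Task struct {
    ID        string
    ServiceID string
}

// removeService models the orchestrator side: removing a service should
// delete every task that belongs to it, which the dispatcher then reflects
// in a shrinking assignment set sent to each agent.
func removeService(tasks map[string]Task, serviceID string) {
    for id, t := range tasks {
        if t.ServiceID == serviceID {
            delete(tasks, id)
        }
    }
}

// reconcile models the agent side: any container whose task is no longer in
// the assignment set must be stopped and removed. The bug reported here is
// that this cleanup does not always happen for every replica.
func reconcile(assigned map[string]Task, containers map[string]string) {
    for taskID, containerID := range containers {
        if _, ok := assigned[taskID]; !ok {
            fmt.Println("stop and remove", containerID)
            delete(containers, taskID)
        }
    }
}

func main() {
    tasks := map[string]Task{
        "t1": {ID: "t1", ServiceID: "frontend"},
        "t2": {ID: "t2", ServiceID: "frontend"},
    }
    containers := map[string]string{"t1": "c1", "t2": "c2"}
    removeService(tasks, "frontend")
    reconcile(tasks, containers) // should remove both containers
}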

@tonistiigi

Just noticed that our test for this only checks active containers: https://github.com/docker/docker-1.12-integration/blob/1.12-integration/integration-cli/docker_api_swarm_test.go#L225. That needs to be updated.
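
For reference, a check along the lines described here has to look at all containers, not just running ones. The standalone sketch below uses plain os/exec and docker ps -a with a name filter rather than the integration-cli helpers in the linked test; the service name "frontend" is only an example.

package main

import (
    "fmt"
    "os/exec"
    "strings"
)

// remainingContainers lists the IDs of all containers (running or exited)
// whose name matches the given service name prefix.
func remainingContainers(service string) ([]string, error) {
    out, err := exec.Command("docker", "ps", "-aq", "--filter", "name="+service).Output()
    if err != nil {
        return nil, err
    }
    return strings.Fields(string(out)), nil
}

func main() {
    ids, err := remainingContainers("frontend")
    if err != nil {
        fmt.Println("docker ps failed:", err)
        return
    }
    if len(ids) != 0 {
        fmt.Printf("FAIL: %d orphan container(s) left after service rm: %v\n", len(ids), ids)
        return
    }
    fmt.Println("OK: no containers left for the service")
}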

@thaJeztah added this to the 1.12.0 milestone Jul 1, 2016
@stevvooe commented Jul 2, 2016

Logs? My guess is that an error escaped during removal and we just failed out.

@thaJeztah added the kind/bug and priority/P2 labels Jul 2, 2016
@tonistiigi

@mgoelzer I tried to reproduce this but couldn't. Does it appear only with specific images or a particular networking setup? If you can reproduce it, please post the daemon logs.

@stevvooe commented Jul 7, 2016

Without logs, I am not sure if this is the same issue, but I've reproduced this in rc3.

After generating these examples and removing the service, there are orphan containers:

$ docker ps -a -f name=angry
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
c4683915c083        alpine:latest       "sh"                20 hours ago        Created                                 angry_volhard.8.158923tb8caod0qe8l29393un
eb063366c99a        alpine:latest       "sh"                20 hours ago        Created                                 angry_volhard.4.cd1jwcz7m8tbj0o4tuhpkj89w
4b17aa0280d5        alpine:latest       "sh"                20 hours ago        Created                                 angry_volhard.6.dxm3enk1xqvnz77kni8f4lpfo

We can see this is a bug in the agent, since it receives a zero-length assignment set:

time="2016-07-07T02:17:12.840678263Z" level=debug msg="(*worker).Assign" len(tasks)=5 module=agent
time="2016-07-07T02:17:12.841325218Z" level=debug msg=assigned module=agent task.desiredstate=SHUTDOWN task.id=eao1hthvmca9do23p1zq740rd
time="2016-07-07T02:17:12.841878234Z" level=debug msg=assigned module=agent task.desiredstate=SHUTDOWN task.id=dzfoijc0e2yw56rqimqbihfob
time="2016-07-07T02:17:12.842323922Z" level=debug msg=assigned module=agent task.desiredstate=SHUTDOWN task.id=e8co44ynek75xzxqp2m69ya6h
time="2016-07-07T02:17:12.842748483Z" level=debug msg=assigned module=agent task.desiredstate=SHUTDOWN task.id=ey808aaq9y7uuhdb05m7q7s8g
time="2016-07-07T02:17:12.843165886Z" level=debug msg=assigned module=agent task.desiredstate=SHUTDOWN task.id=e4gc1y8o2lg6tv8zu4tx5xesq
time="2016-07-07T02:17:12.847570754Z" level=debug msg="(*worker).Assign" len(tasks)=4 module=agent
time="2016-07-07T02:17:12.848010114Z" level=debug msg=assigned module=agent task.desiredstate=SHUTDOWN task.id=e8co44ynek75xzxqp2m69ya6h
time="2016-07-07T02:17:12.848383331Z" level=debug msg=assigned module=agent task.desiredstate=SHUTDOWN task.id=ey808aaq9y7uuhdb05m7q7s8g
time="2016-07-07T02:17:12.848889004Z" level=debug msg=assigned module=agent task.desiredstate=SHUTDOWN task.id=e4gc1y8o2lg6tv8zu4tx5xesq
time="2016-07-07T02:17:12.849477471Z" level=debug msg=assigned module=agent task.desiredstate=SHUTDOWN task.id=eao1hthvmca9do23p1zq740rd
time="2016-07-07T02:17:12.853341242Z" level=debug msg="(*worker).Assign" len(tasks)=3 module=agent
time="2016-07-07T02:17:12.853996649Z" level=debug msg=assigned module=agent task.desiredstate=SHUTDOWN task.id=ey808aaq9y7uuhdb05m7q7s8g
time="2016-07-07T02:17:12.854522246Z" level=debug msg=assigned module=agent task.desiredstate=SHUTDOWN task.id=eao1hthvmca9do23p1zq740rd
time="2016-07-07T02:17:12.855985998Z" level=debug msg=assigned module=agent task.desiredstate=SHUTDOWN task.id=e8co44ynek75xzxqp2m69ya6h
time="2016-07-07T02:17:12.857859799Z" level=debug msg="(*worker).Assign" len(tasks)=2 module=agent
time="2016-07-07T02:17:12.858316904Z" level=debug msg=assigned module=agent task.desiredstate=SHUTDOWN task.id=eao1hthvmca9do23p1zq740rd
time="2016-07-07T02:17:12.858814482Z" level=debug msg=assigned module=agent task.desiredstate=SHUTDOWN task.id=ey808aaq9y7uuhdb05m7q7s8g
time="2016-07-07T02:17:12.876622577Z" level=debug msg="(*worker).Assign" len(tasks)=1 module=agent
time="2016-07-07T02:17:12.877513298Z" level=debug msg=assigned module=agent task.desiredstate=SHUTDOWN task.id=ey808aaq9y7uuhdb05m7q7s8g
time="2016-07-07T02:17:12.880530782Z" level=debug msg="(*worker).Assign" len(tasks)=0 module=agent
time="2016-07-07T02:17:12.921916215Z" level=debug msg="Increasing token expiration to: 60 seconds"
time="2016-07-07T02:17:13.134765261Z" level=debug msg="Pulling ref from V2 registry: alpine:latest"
time="2016-07-07T02:17:13.136058894Z" level=debug msg="pull progress map[status:Pulling from library/alpine id:latest]"
time="2016-07-07T02:17:13.136918198Z" level=debug msg="pull progress map[status:Digest: sha256:3dcdb92d7432d56604d4545cbd324b14e647b313626d99b889d0626de158f73a id:latest]"
time="2016-07-07T02:17:13.137394238Z" level=debug msg="pull progress map[status:Status: Image is up to date for alpine:latest id:latest]"
time="2016-07-07T02:17:13.191278185Z" level=debug msg="container mounted via layerStore: /var/lib/docker/aufs/mnt/4602f992138863483606b474e0d17776e4b45ad357eee2b98c35804e2e8b2116"
time="2016-07-07T02:17:13.226260160Z" level=debug msg="state changed" module=taskmanager state.desired=RUNNING state.transition="PREPARING->READY" task.id=cd1jwcz7m8tbj0o4tuhpkj89w
time="2016-07-07T02:17:13.228080658Z" level=debug msg="(*Agent).UpdateTaskStatus" module=agent task.id=cd1jwcz7m8tbj0o4tuhpkj89w
time="2016-07-07T02:17:13.229743960Z" level=warning msg="cannot find target task in store" method="(*Dispatcher).UpdateTaskStatus" node.id=1mml4ok4j88yixp1sttsbblw6 node.session=a2no8i7cqpb4e4n6tv6onf3hl task.id=cd1jwcz7m8tbj0o4tuhpkj89w
time="2016-07-07T02:17:13.230336112Z" level=debug msg="task status reported" module=agent
time="2016-07-07T02:17:13.235782665Z" level=error msg="task unavailable" method="(*Dispatcher).processTaskUpdates" module=dispatcher task.id=cd1jwcz7m8tbj0o4tuhpkj89w
time="2016-07-07T02:17:13.471820057Z" level=debug msg="Pulling ref from V2 registry: alpine:latest"
time="2016-07-07T02:17:13.472562320Z" level=debug msg="pull progress map[status:Pulling from library/alpine id:latest]"
time="2016-07-07T02:17:13.473125498Z" level=debug msg="pull progress map[status:Digest: sha256:3dcdb92d7432d56604d4545cbd324b14e647b313626d99b889d0626de158f73a id:latest]"
time="2016-07-07T02:17:13.473561482Z" level=debug msg="pull progress map[id:latest status:Status: Image is up to date for alpine:latest]"
time="2016-07-07T02:17:13.520004817Z" level=debug msg="container mounted via layerStore: /var/lib/docker/aufs/mnt/4f63ca0beba83351ebca5d553490d94cfa3f0508a748e20968e61abcc1529afe"
time="2016-07-07T02:17:13.556612793Z" level=debug msg="state changed" module=taskmanager state.desired=RUNNING state.transition="PREPARING->READY" task.id=dxm3enk1xqvnz77kni8f4lpfo
time="2016-07-07T02:17:13.557552436Z" level=debug msg="(*Agent).UpdateTaskStatus" module=agent task.id=dxm3enk1xqvnz77kni8f4lpfo
time="2016-07-07T02:17:13.560054352Z" level=warning msg="cannot find target task in store" method="(*Dispatcher).UpdateTaskStatus" node.id=1mml4ok4j88yixp1sttsbblw6 node.session=a2no8i7cqpb4e4n6tv6onf3hl task.id=dxm3enk1xqvnz77kni8f4lpfo
time="2016-07-07T02:17:13.560669873Z" level=debug msg="task status reported" module=agent
time="2016-07-07T02:17:13.641956929Z" level=error msg="task unavailable" method="(*Dispatcher).processTaskUpdates" module=dispatcher task.id=dxm3enk1xqvnz77kni8f4lpfo
time="2016-07-07T02:17:14.428810365Z" level=debug msg="Calling GET /v1.24/containers/json?all=1"

A few things to note:

  1. We don't see any removal errors.
  2. The errors present are not in the removal chain, as far as I can tell, so they aren't relevant.

I was unable to get a goroutine dump but my guess is there is a deadlock on remove.
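
To make the "deadlock on remove" hypothesis concrete, here is a minimal, purely illustrative Go program; the type and method names are invented and this is not swarmkit code. remove() holds a lock while waiting for the task's run loop to finish, but the run loop needs that same lock to record a status transition, so neither side makes progress and the container is never removed.

package main

import (
    "errors"
    "fmt"
    "sync"
    "time"
)

// taskController is a toy stand-in for a per-task controller, invented only
// to illustrate the hypothesized lock-ordering problem.
type taskController struct {
    mu    sync.Mutex
    state string
    done  chan struct{}
}

// run records a status transition under the lock and then signals completion.
func (c *taskController) run() {
    c.mu.Lock() // blocks while remove() holds the lock below
    c.state = "READY"
    c.mu.Unlock()
    close(c.done)
}

// remove holds the lock while waiting for the run loop to finish; since the
// run loop needs that same lock, the wait can never succeed. The timeout is
// only here so the demo terminates instead of hanging.
func (c *taskController) remove() error {
    c.mu.Lock()
    defer c.mu.Unlock()
    select {
    case <-c.done:
        return nil
    case <-time.After(2 * time.Second):
        return errors.New("remove timed out: likely deadlock")
    }
}

func main() {
    c := &taskController{done: make(chan struct{})}
    go func() {
        time.Sleep(100 * time.Millisecond) // let remove() take the lock first
        c.run()
    }()
    fmt.Println(c.remove()) // prints the timeout error; nothing gets removed
}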

@stevvooe self-assigned this Jul 7, 2016
@ghost changed the title from "docker service rm removes only 1 replica per node" to "docker service rm sometimes leaves orphan containers" Jul 8, 2016
@ghost commented Jul 8, 2016

Edited the title to reflect the commonalities between the reports from @stevvooe and me.

@jinuxstyle

I found a race condition between task creation and removal which could cause orphan containers, though I am not sure whether the race is the root cause of this issue. Please take a look at PR moby/swarmkit#1152.
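
To show the shape of such a race (this is only a toy model based on the description above, not the code touched by moby/swarmkit#1152), the sketch below has a removal arrive while the container for the same task is still being created: the removal finds nothing to delete, and the container that appears a moment later is never cleaned up.

package main

import (
    "fmt"
    "sync"
    "time"
)

// worker is a toy model of an agent tracking containers per task ID. The
// names and structure are invented for illustration.
type worker struct {
    mu         sync.Mutex
    containers map[string]string // taskID -> containerID
}

// create simulates a slow container creation that only registers the
// container once it exists.
func (w *worker) create(taskID string) {
    time.Sleep(50 * time.Millisecond) // container creation in flight
    w.mu.Lock()
    w.containers[taskID] = "container-for-" + taskID
    w.mu.Unlock()
}

// remove drops the task if it is tracked; if the removal races ahead of the
// creation it silently does nothing, and the container created afterwards
// becomes an orphan.
func (w *worker) remove(taskID string) {
    w.mu.Lock()
    defer w.mu.Unlock()
    if _, ok := w.containers[taskID]; !ok {
        return // race window: nothing to remove *yet*
    }
    delete(w.containers, taskID)
}

func main() {
    w := &worker{containers: map[string]string{}}
    var wg sync.WaitGroup
    wg.Add(1)
    go func() { defer wg.Done(); w.create("t1") }()
    w.remove("t1") // removal arrives while creation is still in progress
    wg.Wait()
    fmt.Println("orphans:", w.containers) // map[t1:container-for-t1]
}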

@stevvooe

@jinuxstyle Do you have a reliable reproduction? We aren't able to reproduce this easily and don't see the errors that should be displayed in the logs.

@jinuxstyle

Yes, I have a script that reproduces it 100% of the time. It needs a somewhat stressful workload to enlarge the race window. I will upload the script to a gist; hold on for a moment.

@jinuxstyle

Posted it here: https://gist.github.com/jinuxstyle/07f4decb25bfcf6d5388c0231c593b7a. To reproduce the issue, give the [loops] option a value greater than or equal to 40, according to my tests. The script uses busybox as the image, which is more lightweight. If you change the image to mikegoelzer/s2_frontend, you can reproduce it within 10 loops, as I tested.
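
For readers who prefer not to fetch the gist, the sketch below is a rough Go equivalent of that kind of stress loop. Only "busybox as the image" and "at least 40 loops" come from the comment above; the service name, replica count, container command, and sleep intervals are assumptions.

package main

import (
    "fmt"
    "os/exec"
    "strings"
    "time"
)

// run shells out to the docker CLI and returns its combined output.
func run(args ...string) string {
    out, err := exec.Command("docker", args...).CombinedOutput()
    if err != nil {
        fmt.Printf("docker %s failed: %v\n%s", strings.Join(args, " "), err, out)
    }
    return string(out)
}

func main() {
    const loops = 40 // per the comment above, >= 40 loops with busybox
    for i := 0; i < loops; i++ {
        run("service", "create", "--name", "stress", "--replicas", "5", "busybox", "sleep", "3600")
        time.Sleep(5 * time.Second) // give the tasks time to start
        run("service", "rm", "stress")
        time.Sleep(5 * time.Second) // give the agents time to converge

        leftovers := strings.Fields(run("ps", "-aq", "--filter", "name=stress"))
        if len(leftovers) > 0 {
            fmt.Printf("loop %d: %d orphan container(s): %v\n", i, len(leftovers), leftovers)
            return
        }
    }
    fmt.Println("no orphans observed")
}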

@stevvooe

@jinuxstyle @mgoelzer I've characterized the cause of this in #24858 (and with considerable help from @jinuxstyle 🙇).

@stevvooe

@mgoelzer For the most part, we have mitigated this for swarm mode. Let's go ahead and close this.

#24858 is still an active bug affecting swarmkit.
