docker service rm sometimes leaves orphan containers #24244

Comments
ping @stevvooe @tonistiigi The orchestrator should be removing all associated tasks when the service is removed. I think the issue is most likely at the agent level.
Just noticed that our test for this only checks active containers: https://github.com/docker/docker-1.12-integration/blob/1.12-integration/integration-cli/docker_api_swarm_test.go#L225. That needs to be updated.
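For reference, a command-line equivalent of the stricter check (the service name frontend is just the one used in this report): counting only running containers can report zero even when stopped-but-unremoved containers are left behind, so the check should count all containers.

docker ps --filter "name=frontend" -q | wc -l      # running containers only, roughly what the current test checks
docker ps -a --filter "name=frontend" -q | wc -l   # all containers, including stopped ones; should reach 0 after docker service rm frontend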
Logs? My guess is that an error escaped during removal and we just failed out.
@mgoelzer I tried to reproduce but couldn't. Does this appear only with specific images or a specific networking setup? If you can reproduce, please post daemon logs.
Without logs, I am not sure if this is the same issue, but I've reproduced this in rc3. After generating these examples and removing the service, there are orphan containers:
We can see this is a bug in the agent, since it receives a zero-length assignment set:
A few things to note:
I was unable to get a goroutine dump, but my guess is that there is a deadlock on remove.
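If anyone can still reproduce this, a goroutine dump would help confirm or rule out the deadlock theory. As far as I know, the daemon dumps its goroutine stack traces to the daemon log when it receives SIGUSR1 (details may vary by version and packaging), e.g.:

sudo kill -USR1 "$(pidof dockerd)"   # then check the daemon log, e.g. journalctl -u docker on systemd hosts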
Title changed from "docker service rm removes only 1 replica per node" to "docker service rm sometimes leaves orphan containers".
Edited the title to reflect the commonalities between the reports from @stevvooe and me.
I found a race condition between task creation and removal, which could cause orphan containers. But I am not sure whether the race is the root cause of this issue. Please take a look at PR moby/swarmkit#1152.
@jinuxstyle Do you have a reliable reproduction? We aren't able to reproduce this easily and don't see the errors that should be displayed in the logs.
Yes, I have a script that can reproduce it 100% of the time. It needs a kind of stress test to enlarge the race window. I will upload the script to a gist. Hold on for a moment.
Posted it here: https://gist.github.com/jinuxstyle/07f4decb25bfcf6d5388c0231c593b7a. To reproduce the issue, give the [loops] option a value greater than or equal to 40, according to my tests. The script uses busybox as the image, which is more lightweight. If you change the image to mikegoelzer/s2_frontend, you can reproduce it within 10 loops, as I tested.
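For anyone who doesn't want to fetch the gist, the general shape of such a stress reproduction is just a loop that creates and removes a replicated service and then looks for leftover containers. The sketch below is hypothetical, not the gist script itself; the loop count and images come from the comment above.

loops=${1:-40}
for i in $(seq 1 "$loops"); do
  docker service create --replicas 5 --name stress busybox sleep 1000
  sleep 5                        # let the tasks start
  docker service rm stress
  sleep 5                        # give removal time to converge
  docker ps -a --filter "name=stress" --format '{{.ID}} {{.Status}}'   # anything printed here is an orphan
done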
@jinuxstyle @mgoelzer I've characterized the cause of this in #24858 (and with considerable help from @jinuxstyle 🙇).
@mgoelzer For the most part, we have mitigated this for swarm mode. Let's go ahead and close this. #24858 is still an active bug affecting swarmkit.
Summary: when I do docker service rm xxx, where xxx is a service with 5 replicas spread across two nodes, I find that only one replica gets removed on each node. The remaining three replicas get stopped, but not removed.

Output of docker version:

Output of docker info:

Additional environment details (AWS, VirtualBox, physical, etc.):
Two-node swarm on AWS
Steps to reproduce the issue:

1. Start with a working 2-node swarm; manager = 192.168.33.10 and worker = 192.168.33.20
2. Start a service with 5 replicas: docker service create --replicas 5 --name frontend --env QUEUE_HOSTNAME=redis --env OPTION_A=Kats --env OPTION_B=Doggies --network mynetwork --publish 80/tcp mikegoelzer/s2_frontend:latest
3. Observe that in this case 2 containers get scheduled to 192.168.33.10, and 3 to 192.168.33.20
4. Now do docker service rm frontend
5. Wait several minutes to ensure the system has had adequate time to converge
6. Run docker ps -a on both machines and observe the following:
Describe the results you received:
6ebda47c63f0 and 5cf10ee5f606 were stopped and removed. The other three containers were stopped but not removed.

Describe the results you expected:
I expect all 5 containers to be stopped and removed.
Additional information you deem important (e.g. issue happens only occasionally):
None
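For convenience, the reproduction steps above condensed into a single sequence (all values taken directly from the steps; run on the manager, 192.168.33.10):

docker service create --replicas 5 --name frontend --env QUEUE_HOSTNAME=redis --env OPTION_A=Kats --env OPTION_B=Doggies --network mynetwork --publish 80/tcp mikegoelzer/s2_frontend:latest
docker service rm frontend
# wait several minutes, then on both nodes:
docker ps -a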