latent docker worker randomly dies for no apparent reason #3800

Open
dragon512 opened this Issue Dec 4, 2017 · 2 comments

Contributor

dragon512 commented Dec 4, 2017

Docker workers are dying or being killed for unknown reasons.

The only message logged is that the Docker worker disconnected for an unknown reason. This issue causes rebuilds, wastes time, and leaves the master in a bad state (it will not shut down).

Contributor

dragon512 commented Dec 4, 2017

I found the issue. It is in how Buildbot generates the Docker container name and in the way the Docker Python library filters by name. It shows up when Docker workers are named in a general form like:

docker-worker-1
...
docker-worker-20

Suppose docker-worker-1 is being started to do a build while docker-worker-11 is already running. The code looks up running workers and filters on a name of "buildbot". The filter incorrectly returns docker-worker-11 in the list (it seems that anything that starts with the requested name matches). The code then force-stops these workers, killing good builds. The fix I have submitted moves the hash value to the end of the name to avoid the possibility of an odd match.
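To illustrate the collision described above, here is a minimal sketch in plain Python. It does not call Docker at all: `name_filter` is a hypothetical stand-in that assumes the name filter matches any container whose name contains the requested string, and both naming helpers and the `MASTER_HASH` value are assumptions made up for illustration, not the exact format Buildbot uses.

```python
def name_filter(running, pattern):
    """Hypothetical stand-in for the name filter.

    Assumption: a container matches when its name contains the pattern,
    which covers the "starts with" behaviour reported in this issue.
    """
    return [name for name in running if pattern in name]


def name_hash_first(master_hash, worker):
    # Assumed old-style layout: hash before the worker name.
    return f"buildbot{master_hash}-{worker}"


def name_hash_last(master_hash, worker):
    # Assumed new-style layout: hash moved to the end, as in the fix.
    return f"buildbot-{worker}-{master_hash}"


MASTER_HASH = "abc123"  # made-up value for illustration

# Old layout: starting docker-worker-1 while docker-worker-11 is running.
running_old = [name_hash_first(MASTER_HASH, "docker-worker-11")]
stale_old = name_filter(running_old, name_hash_first(MASTER_HASH, "docker-worker-1"))

# New layout: the trailing "-<hash>" stops "worker-1" from matching "worker-11".
running_new = [name_hash_last(MASTER_HASH, "docker-worker-11")]
stale_new = name_filter(running_new, name_hash_last(MASTER_HASH, "docker-worker-1"))

print("hash first, would be force-stopped:", stale_old)
print("hash last, would be force-stopped:", stale_new)
```

With the hash first, the name for docker-worker-1 is a prefix of the name for docker-worker-11, so the healthy container ends up in the cleanup list. With the hash last, the name ends in a suffix that the longer worker name does not share, so the lookup comes back empty.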

@dragon512 dragon512 changed the title from latent docker worker randomly die for no reason to latent docker worker randomly dies for no apparent reason Dec 4, 2017

Contributor

dragon512 commented Dec 4, 2017

This should be fixed in #3759.
