Builds failing with docker errors #381

Closed
lox opened this issue Feb 12, 2018 · 8 comments · Fixed by #385

Comments

@lox
Contributor

lox commented Feb 12, 2018

It looks like we've had a regression on #266 sometime after 2.3.0 where builds are occasionally failing with docker connection errors.

E.g. "transport: dial unix /var/run/docker/containerd/docker-containerd.sock: connect: connection refused"

My suspicion is that these are related to a race condition where we configure docker on boot and then restart it. We merged #377 to address this and will be looking to put out a release soon.
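For anyone trying to reproduce the race, here is a minimal sketch of a readiness check that polls a Unix socket until it accepts connections. It is not what the stack does on boot; the function name `wait_for_unix_socket` and the timeout and interval values are assumptions made for illustration. Only the socket path in the usage note comes from the error above.

```python
import socket
import time


def wait_for_unix_socket(path, timeout=30.0, interval=0.5):
    """Return True once the Unix socket at `path` accepts a connection.

    Returns False if no connection succeeds within `timeout` seconds.
    The defaults are illustrative assumptions, not values used by the stack.
    """
    deadline = time.monotonic() + timeout
    while True:
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            sock.connect(path)
            return True
        except OSError:
            # Covers "connection refused" as well as a socket file that
            # does not exist yet because the daemon is still starting.
            pass
        finally:
            sock.close()
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling `wait_for_unix_socket("/var/run/docker/containerd/docker-containerd.sock")` after restarting docker would show whether containerd's socket comes up within the timeout, which may help confirm or rule out the race.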

@gugahoi
Contributor

gugahoi commented Feb 15, 2018

We had this happen again today. I still have the agent handy if anything is needed, but from a brief inspection this looks related to containerd not being ready when docker is starting up. Is docker responsible for invoking containerd with docker-containerd?

Also, is it possible that this is related to moby/moby#36173?

@lox
Contributor Author

lox commented Feb 15, 2018

A speculative fix for this went out in https://github.com/buildkite/elastic-ci-stack-for-aws/releases/tag/v2.3.4. Are you on that, @gugahoi?

@lox
Contributor Author

lox commented Feb 15, 2018

Yup, that moby bug definitely looks possible. There is a recurring class of docker bugs around race conditions when docker's daemons restart, which we try to mitigate by being very careful about how we restart them on stack bootup.

@lox
Contributor Author

lox commented Feb 24, 2018

Keeping this open until we confirm it's fixed.

@lox
Contributor Author

lox commented Feb 24, 2018

My other suspicion is the incredibly old version of Upstart on Amazon Linux. We might be seeing a variant of moby/moby#6647.

@gugahoi
Contributor

gugahoi commented Feb 27, 2018

Is it possible to fail the agent when docker ps is still unresponsive after 5 tries here?

It seems buildkite simply ignores the fact that docker ps is not responding and moves on. Would it be better for the agent not to register for builds in that case? Maybe reboot, or do a more forceful sudo service docker restart, or something along those lines?
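To make the suggestion concrete, here is a rough sketch of a check that gives up once docker ps has failed 5 times in a row. It is only an illustration, not the stack's actual boot logic. The function name `docker_is_responsive` and the delay and timeout values are assumptions; only the docker ps command and the 5 tries come from this thread.

```python
import subprocess
import time


def docker_is_responsive(command=("docker", "ps"), attempts=5, delay=2.0, timeout=10.0):
    """Run `command` up to `attempts` times and return True on the first success.

    Returns False if every attempt fails, hangs past `timeout` seconds,
    or the binary cannot be executed. `delay` and `timeout` are assumed values.
    """
    for attempt in range(1, attempts + 1):
        try:
            result = subprocess.run(
                list(command),
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
            if result.returncode == 0:
                return True
        except (OSError, subprocess.TimeoutExpired):
            # A missing binary or a hung daemon both count as a failed attempt.
            pass
        if attempt < attempts:
            time.sleep(delay)
    return False
```

A boot script could call `docker_is_responsive()` and, when it returns False, exit non-zero before the agent starts (or trigger a reboot or a docker restart), so that an instance without a working docker never registers for builds.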

@ankurd1

ankurd1 commented Mar 19, 2018

Hey, we are running stack 2.3.5 and seeing a similar error.

An EC2 instance was started by the stack and didn't have docker running. Buildkite kept sending jobs to it and all the jobs kept failing.

I have the machine still running so I can get you any logs that you need to debug. We've been seeing such errors more and more and people have started blaming the CI team for flakiness :(

Would really appreciate it if you guys could look into this! Thanks!

@lox
Contributor Author

lox commented Jul 28, 2019

Haven't seen this in a long while, closing this out.

lox closed this as completed on Jul 28, 2019

3 participants