Builds failing with docker errors #381
Comments
We've had this happen again today. I have the agent handy if anything is needed, but from a brief inspection this is related to Also, is it possible this is somewhat related: moby/moby#36173?
A speculative fix went out for this in https://github.com/buildkite/elastic-ci-stack-for-aws/releases/tag/v2.3.4. Are you on that, @gugahoi?
Yup, that moby bug definitely looks possible. There is a recurring class of Docker bugs around race conditions when the Docker daemon restarts, which we try to mitigate by being very careful about how we do that on stack bootup.
Keeping this open until we confirm it's fixed.
My other suspicion is the incredibly old version of Upstart on Amazon Linux. We might be seeing a variant of moby/moby#6647.
Is it possible we can fail the agent when It seems Buildkite is simply ignoring the fact that
Hey, we are running stack 2.3.5 and seeing a similar error. An EC2 instance was started by the stack and didn't have Docker running. Buildkite kept sending jobs to it and all the jobs kept failing. I still have the machine running, so I can get you any logs that you need to debug. We've been seeing such errors more and more, and people have started blaming the CI team for flakiness :( We would really appreciate it if you could look into this! Thanks!
Haven't seen this in a long while, closing this out.
It looks like we've had a regression on #266 sometime after 2.3.0 where builds are occasionally failing with docker connection errors.
For example:
"transport: dial unix /var/run/docker/containerd/docker-containerd.sock: connect: connection refused"
My suspicion is that these are related to a race condition where we configure docker on boot and then restart it. We merged #377 to address this and will be looking to put out a release soon.
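For anyone wanting to guard against this in the meantime, here is a minimal sketch of the kind of readiness check that could run after Docker is restarted on boot and before any jobs are accepted. It is not what #377 or the stack actually does; the function names, retry count, and delay are all hypothetical, and it only checks that a Unix socket accepts connections, not that the daemon is fully healthy.

```python
import socket
import time


def socket_accepts_connections(path: str, timeout: float = 1.0) -> bool:
    """Return True if something is listening on the Unix socket at `path`."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
        return True
    except OSError:
        # Covers a missing socket file, "connection refused", and timeouts.
        return False
    finally:
        sock.close()


def wait_for_socket(path: str, attempts: int = 30, delay: float = 1.0) -> bool:
    """Poll `path` until it accepts a connection or `attempts` runs out.

    The defaults are illustrative assumptions, not values taken from the stack.
    """
    for attempt in range(attempts):
        if socket_accepts_connections(path):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # give the daemon time to finish restarting
    return False
```

A boot script could call `wait_for_socket("/var/run/docker.sock")` (the default Docker socket path, assumed here) and exit non-zero when it returns `False`, so the instance fails loudly instead of picking up jobs that will all hit "connection refused".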