Jobs stuck in Runnable #64

Closed

lewfish opened this issue Jul 6, 2017 · 9 comments

lewfish (Contributor) commented Jul 6, 2017

We've noticed that sometimes jobs get stuck in a runnable state on Batch. I just logged into the instance for such a job and found that the ecs-agent is not running as it is supposed to. (See http://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-agent-introspection.html)

[ec2-user@ip-172-31-45-73 ecs]$ curl http://localhost:51678/v1/metadata
curl: (7) Failed to connect to localhost port 51678: Connection refused
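
In case it's useful, a rough way to find the jobs stuck in RUNNABLE and the container instances behind them with the AWS CLI (the queue and cluster names below are placeholders, not our actual ones):

# Jobs sitting in RUNNABLE on a given Batch queue
aws batch list-jobs --job-queue <job-queue> --job-status RUNNABLE

# Batch schedules onto an ECS cluster it manages; list that cluster's container instances
aws ecs list-container-instances --cluster <batch-managed-cluster>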

I also looked at the ecs-init log, which contains error messages that I don't currently understand.

[ec2-user@ip-172-31-45-73 ecs]$ pwd
/var/log/ecs
[ec2-user@ip-172-31-45-73 ecs]$ cat ecs-init.log.2017-07-06-20
2017-07-06T20:21:28Z [INFO] Network error connecting to docker, backing off for '1.14777941s', error: dial unix /var/run/docker.sock: connect: no such file or directory
2017-07-06T20:21:29Z [INFO] Network error connecting to docker, backing off for '2.282153551s', error: dial unix /var/run/docker.sock: connect: no such file or directory
2017-07-06T20:21:31Z [INFO] Network error connecting to docker, backing off for '4.466145821s', error: dial unix /var/run/docker.sock: connect: no such file or directory
2017-07-06T20:21:36Z [INFO] Network error connecting to docker, backing off for '5.235010051s', error: dial unix /var/run/docker.sock: connect: no such file or directory
2017-07-06T20:21:41Z [INFO] Network error connecting to docker, backing off for '5.287113937s', error: dial unix /var/run/docker.sock: connect: no such file or directory
2017-07-06T20:21:46Z [ERROR] dial unix /var/run/docker.sock: connect: no such file or directory
2017-07-06T20:21:46Z [INFO] Network error connecting to docker, backing off for '1.14777941s', error: dial unix /var/run/docker.sock: connect: no such file or directory
2017-07-06T20:21:48Z [INFO] Network error connecting to docker, backing off for '2.282153551s', error: dial unix /var/run/docker.sock: connect: no such file or directory
2017-07-06T20:22:17Z [INFO] post-stop
2017-07-06T20:22:17Z [INFO] Cleaning up the credentials endpoint setup for Amazon EC2 Container Service Agent
2017-07-06T20:22:17Z [ERROR] Error performing action 'delete' for credentials proxy endpoint route: exit status 1; raw output: iptables: No chain/target/match by that name.

2017-07-06T20:22:17Z [ERROR] Error performing action 'delete' for credentials proxy endpoint route: exit status 1; raw output: iptables: No chain/target/match by that name.
lewfish (Contributor, Author) commented Jul 6, 2017

As a workaround, you can run sudo start ecs on the instance to get it to progress out of the runnable state.
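
A minimal sketch of the check-and-restart, assuming the ECS-optimized Amazon Linux AMI (which manages the agent through upstart's ecs job; other AMIs may use a different init system):

sudo status ecs                           # reports something like "ecs stop/waiting" when the agent is down
sudo start ecs                            # start the agent so it re-registers with the cluster
curl http://localhost:51678/v1/metadata   # the introspection endpoint should respond again once the agent is up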

lossyrob (Contributor) commented Jul 6, 2017

Tagging @azavea/operations in case it is of interest.

hectcastro commented

Hm. It is possible that the root cause is buried in another log (possibly the Docker agent log given this agent's inability to connect).

There has also been a flurry of updates to the ECS agent. Are we on a current version (one that supports Docker 17)?
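
If it helps, a few places to look, assuming the default locations on the ECS-optimized Amazon Linux AMI:

cat /var/log/docker                                      # Docker daemon log; may explain why docker.sock never appeared
rpm -q ecs-init                                          # version of the ecs-init package baked into the AMI
docker inspect --format '{{.Config.Image}}' ecs-agent    # agent image/tag, if the agent container exists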

lewfish (Contributor, Author) commented Jul 7, 2017

   __|  __|  __|
   _|  (   \__ \   Amazon ECS-Optimized Amazon Linux AMI 2017.03.g
 ____|\___|____/
[ec2-user@ip-172-31-38-99 ~]$ docker --version
Docker version 1.12.6, build 7392c3b/1.12.6

lossyrob (Contributor) commented Jul 7, 2017

This is one of the problems of running off of a pre-baked AMI instead of the official ECS-optimized AMI with cloud-init installing everything else. If we could do the latter, we could simply bump the ECS AMI ID to the latest version; what we should do now is re-bake our custom AMI off of the latest ECS AMI.
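
For the re-bake, a rough way to look up the latest ECS-optimized AMI to base it on (the region here is an assumption; adjust as needed and double-check against the ECS documentation):

aws ec2 describe-images \
  --owners amazon \
  --filters "Name=name,Values=amzn-ami-*-amazon-ecs-optimized" \
  --query 'sort_by(Images, &CreationDate)[-1].{Id: ImageId, Name: Name}' \
  --region us-east-1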

hectcastro commented

We can probably take a stab at putting something more reproducible together if you can point us to the current steps.

lewfish (Contributor, Author) commented Jul 11, 2017

The AMI we are using for Batch instances was generated as follows:

lewfish (Contributor, Author) commented Jul 11, 2017

The above process was based on the recommendations in http://docs.aws.amazon.com/batch/latest/userguide/batch-gpu-ami.html, although we are not using their recipe for exposing the GPU to the container. Instead, we are using the recipe in https://blog.cloudsight.ai/deep-learning-image-recognition-using-gpus-in-amazon-ecs-docker-containers-5bdb1956f30e. We might want to go with the officially recommended way.
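
For context, exposing the GPU without nvidia-docker generally comes down to mapping the NVIDIA device nodes (and the driver libraries) into the container by hand; a rough sketch of what that looks like with plain docker run (the image name and library path are placeholders and depend on how the drivers were installed on the host):

docker run \
  --device /dev/nvidia0 \
  --device /dev/nvidiactl \
  --device /dev/nvidia-uvm \
  -v /usr/lib/nvidia:/usr/lib/nvidia:ro \
  <gpu-image> nvidia-smi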

hectcastro commented

I created #74 to capture the task. I agree that the recommended way makes sense to chase. Please comment in that issue's thread with any additional tweaks we may need.

lewfish closed this as completed Sep 29, 2017