Jobs stuck in Runnable #64
Comments
As a workaround, you can run |
Tagging @azavea/operations in case it is of interest. |
Hm. It is possible that the root cause is buried in another log (possibly the Docker agent log, given this agent's inability to connect). There has also been a flurry of updates to the ECS agent. Are we on a current version (one that supports Docker 17)? |
|
This is one of the problems of running off a pre-baked AMI instead of the official ECS AMI with cloud-init installing everything else. If we could do the latter, we could simply bump the ECS AMI ID to the latest version; what we should do now is re-bake our custom AMI off of the latest ECS AMI. |
We can probably take a stab at putting something more reproducible together if you can point us to the current steps. |
The AMI we are using for Batch instances was generated as follows:
|
The above process was based on the recommendations in http://docs.aws.amazon.com/batch/latest/userguide/batch-gpu-ami.html, although we are not using their recipe for exposing the GPU to the container. Instead, we are using the recipe in https://blog.cloudsight.ai/deep-learning-image-recognition-using-gpus-in-amazon-ecs-docker-containers-5bdb1956f30e. We might want to go with the officially recommended way. |
I created #74 to capture the task. I agree that the recommended way makes sense to chase. Please comment in that issue's thread with any additional tweaks we may need. |
We've noticed that sometimes jobs get stuck in a runnable state on Batch. I just logged into the instance for such a job and found that the ecs-agent is not running as it is supposed to. (See http://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-agent-introspection.html)
I also looked at the ecs-agent log, which contains error messages that I don't yet understand.
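For anyone who wants to repeat the check on a stuck instance, here is a minimal Python sketch of it. It assumes the agent's introspection API is on its documented default, `http://localhost:51678/v1/metadata`, and that the response carries the `Cluster`, `ContainerInstanceArn`, and `Version` fields described on the introspection page linked above. The function names `parse_agent_metadata` and `check_agent` are made up for illustration and are not part of any AWS tooling.

```python
import json
import re
import urllib.error
import urllib.request

# Assumed default introspection endpoint of the ECS container agent.
INTROSPECTION_URL = "http://localhost:51678/v1/metadata"


def parse_agent_metadata(body):
    """Pull cluster, instance ARN, and agent version out of a metadata response."""
    data = json.loads(body)
    # The Version field is free text, e.g. "Amazon ECS Agent - v1.14.3 (15de319)".
    match = re.search(r"v(\d+)\.(\d+)\.(\d+)", data.get("Version", ""))
    version = tuple(int(part) for part in match.groups()) if match else None
    return {
        "cluster": data.get("Cluster"),
        "instance_arn": data.get("ContainerInstanceArn"),
        "version": version,
    }


def check_agent(url=INTROSPECTION_URL, timeout=2.0):
    """Return parsed metadata, or None if the agent does not answer."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return parse_agent_metadata(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON body: treat the agent as down.
        return None


if __name__ == "__main__":
    info = check_agent()
    if info is None:
        print("ecs-agent is not responding on the introspection endpoint")
    else:
        print("ecs-agent is up:", info)
```

Run on the instance itself, a `None` result would match what I saw here (the agent is not answering). If it does answer, the parsed version tuple makes it easy to compare against whichever agent release we decide we need.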