
ECS hosts lose ability to start new tasks (inactivity time exceeded timeout / No such container in logs) #547

Closed
jhovell opened this issue Oct 3, 2016 · 6 comments



jhovell commented Oct 3, 2016

Version: Amazon ECS-Optimized Amazon Linux AMI 2016.09.i

Periodically, hosts in our cluster become unable to launch new tasks. SSH'ing into the instance shows that:

Simple commands like docker ps will hang for 10-20 seconds but then return.

A command like curl localhost:8888/myService, which would normally return 200/OK, gives curl: (56) Recv failure: Connection reset by peer. Not surprisingly, ELB health checks fail every few minutes and the ECS agent dutifully restarts each task. It's worth noting that existing containers continue to run and operate healthily (we normally run into this issue when trying to deploy a new version of a service).

The ECS agent log has no obvious errors, though periodically this is seen:

2016-10-02T22:20:38Z [INFO] Error retrieving stats for container 519e5a44ef67374fe3e919c52ce2181152622fa050cead30eaa21d636b81c618: inactivity time exceeded timeout
2016-10-02T22:20:39Z [INFO] Error retrieving stats for container 927eabd2e765d863237144c81d908a202a907c1cdc5f366cd58f2855fa53d35b: inactivity time exceeded timeout
2016-10-02T22:20:39Z [INFO] Error retrieving stats for container 6939679480b1f7384659262dd8b82d185d3bceb4cc21d81f5889dfc91034fd3a: inactivity time exceeded timeout
2016-10-02T22:20:40Z [INFO] Error retrieving stats for container 2e842968bf9da21f2760b0786c7bbe47583401d206a81f033281a6ec2efd6f35: inactivity time exceeded timeout
2016-10-02T22:20:44Z [INFO] Error retrieving stats for container 927eabd2e765d863237144c81d908a202a907c1cdc5f366cd58f2855fa53d35b: inactivity time exceeded timeout
2016-10-02T22:20:44Z [INFO] Error retrieving stats for container 6939679480b1f7384659262dd8b82d185d3bceb4cc21d81f5889dfc91034fd3a: inactivity time exceeded timeout
2016-10-02T22:20:45Z [INFO] Error retrieving stats for container 2e842968bf9da21f2760b0786c7bbe47583401d206a81f033281a6ec2efd6f35: inactivity time exceeded timeout
2016-10-02T22:20:49Z [INFO] Error retrieving stats for container 927eabd2e765d863237144c81d908a202a907c1cdc5f366cd58f2855fa53d35b: inactivity time exceeded timeout
2016-10-02T22:20:49Z [INFO] Error retrieving stats for container 6939679480b1f7384659262dd8b82d185d3bceb4cc21d81f5889dfc91034fd3a: inactivity time exceeded timeout
2016-10-02T22:20:50Z [INFO] Error retrieving stats for container 2e842968bf9da21f2760b0786c7bbe47583401d206a81f033281a6ec2efd6f35: inactivity time exceeded timeout
2016-10-02T22:20:52Z [INFO] Saving state! module="statemanager"
2016-10-02T22:20:52Z [INFO] Adding event module="eventhandler" change="ContainerChange: arn:aws:ecs:us-west-2:account-number:task/3d59c409-4c19-41e2-91b1-ebbd4bd529dc my-app -> STOPPED, Exit 137, , Known Sent: RUNNING"
2016-10-02T22:20:52Z [INFO] Error retrieving stats for container 519e5a44ef67374fe3e919c52ce2181152622fa050cead30eaa21d636b81c618: context canceled
2016-10-02T22:20:52Z [INFO] Container 519e5a44ef67374fe3e919c52ce2181152622fa050cead30eaa21d636b81c618 is terminal, stopping stats collection

The Docker logs contain:

time="2016-09-30T22:03:33.283518865Z" level=warning msg="container 29eb26eaa79b30325e77907b9e7f6e38accc6768dae40b76dc910aaa02d39056 restart canceled" 
time="2016-09-30T22:12:23.305386802Z" level=error msg="Handler for GET /v1.17/containers/5a334ffb1ea58f35f447897d4b9382452931804be1db1c2932e0a3c6ba1b4d04/json returned error: No such container: 5a334ffb1ea58f35f447897d4b9382452931804be1db1c2932e0a3c6ba1b4d04" 
time="2016-09-30T22:12:31.329678098Z" level=error msg="Handler for GET /v1.17/containers/aa0c0ea54d80bf16add8e2f9cff5d918c0fd60265176de686780f86ab0caef73/json returned error: No such container: aa0c0ea54d80bf16add8e2f9cff5d918c0fd60265176de686780f86ab0caef73" 
time="2016-09-30T22:13:34.060959954Z" level=error msg="Handler for GET /v1.17/containers/29eb26eaa79b30325e77907b9e7f6e38accc6768dae40b76dc910aaa02d39056/json returned error: No such container: 29eb26eaa79b30325e77907b9e7f6e38accc6768dae40b76dc910aaa02d39056" 
time="2016-09-30T22:42:23.545132110Z" level=info msg="Container 52c26d2b6913e55ed648fed03b74236ec6bc62621f40552810dbbd353ae1cd6b failed to exit within 30 seconds of signal 15 - using the force" 
time="2016-09-30T22:42:23.680673733Z" level=warning msg="container 52c26d2b6913e55ed648fed03b74236ec6bc62621f40552810dbbd353ae1cd6b restart canceled" 
time="2016-09-30T22:42:31.918588398Z" level=info msg="Container c721bb77155b183342ebfbdd8abf0d1d04ef90b561520c4e91f932c118fe3844 failed to exit within 30 seconds of signal 15 - using the force" 
time="2016-09-30T22:42:32.054743992Z" level=warning msg="container c721bb77155b183342ebfbdd8abf0d1d04ef90b561520c4e91f932c118fe3844 restart canceled" 
time="2016-09-30T22:52:24.039730246Z" level=error msg="Handler for GET /v1.17/containers/52c26d2b6913e55ed648fed03b74236ec6bc62621f40552810dbbd353ae1cd6b/json returned error: No such container: 52c26d2b6913e55ed648fed03b74236ec6bc62621f40552810dbbd353ae1cd6b" 
time="2016-09-30T22:52:32.421783700Z" level=error msg="Handler for GET /v1.17/containers/c721bb77155b183342ebfbdd8abf0d1d04ef90b561520c4e91f932c118fe3844/json returned error: No such container: c721bb77155b183342ebfbdd8abf0d1d04ef90b561520c4e91f932c118fe3844" 
time="2016-09-30T23:07:20.395705823Z" level=error msg="Handler for GET /v1.17/containers/187680afd7344c50b08ce84827b4b44dc7de3f54d68c8c564343d3a7a2fb3655/json returned error: No such container: 187680afd7344c50b08ce84827b4b44dc7de3f54d68c8c564343d3a7a2fb3655" 
time="2016-09-30T23:07:26.120185655Z" level=error msg="Handler for GET /v1.17/containers/7fc0304ae5ba5115c50a1cdccf9e3758c709bc7a56697b762ea1763653d70032/json returned error: No such container: 7fc0304ae5ba5115c50a1cdccf9e3758c709bc7a56697b762ea1763653d70032" 
time="2016-09-30T23:07:36.715949055Z" level=error msg="Handler for GET /v1.17/containers/51ae919c7d79902ceccb0e1925adcb3de768e90f0d77b30ef0ae17b50909bd36/json returned error: No such container: 51ae919c7d79902ceccb0e1925adcb3de768e90f0d77b30ef0ae17b50909bd36" 
time="2016-09-30T23:07:47.851121976Z" level=error msg="Handler for GET /v1.17/containers/4cab3a90ebeec024e58eaf05299bc876dd40b211da185e36f797dc87dd47867e/json returned error: No such container: 4cab3a90ebeec024e58eaf05299bc876dd40b211da185e36f797dc87dd47867e" 
time="2016-09-30T23:07:58.898021046Z" level=error msg="Handler for GET /v1.17/containers/60ac61e027e7b305e7b910eb603488186d11c74898d00572277fe79717b58e56/json returned error: No such container: 60ac61e027e7b305e7b910eb603488186d11c74898d00572277fe79717b58e56" 
time="2016-09-30T23:08:11.114405684Z" level=error msg="Handler for GET /v1.17/containers/e434b8d5fcaecb28662341a73035464c4cc20bf343d44f38b12e0fffbe6f893f/json returned error: No such container: e434b8d5fcaecb28662341a73035464c4cc20bf343d44f38b12e0fffbe6f893f" 

also occasionally:

time="2016-09-30T23:10:54.374592410Z" level=error msg="collecting stats for 903ec74fd6a40950d80e10dbdb9bfa4110291ed052af585fd633799f55f47b55: rpc error: code = 2 desc = \"open /run/runc/903ec74fd6a40950d80e10dbdb9bfa4110291ed052af585fd633799f55f47b55/state.json: no such file or directory\"" 

This is happening about once per day across roughly 20 ECS hosts and is difficult to detect or to recover from automatically.

I can't tell whether this is a duplicate of an existing issue; almost all similar issues here say they were resolved in version 1.12.2 of the agent, which we are running, so maybe the problem still persists? Happy to send the entire Docker log files and ECS agent files.

@richardpen

@jhovell The "Error retrieving stats for container xxxx: inactivity time exceeded timeout" messages are caused by Docker responding slowly (as you saw, docker ps takes 10-20 seconds). The "No such container" error can occur once per container when the agent has received the Docker stop event but has not yet stopped collecting metrics for it. Neither of these errors is the reason the ELB health checks fail.

When Docker becomes slow, deployments take longer to finish; as you said, this problem occurs when you try to deploy a new version of a service. The ELB health check then fails because the service is not ready yet.

Currently we don't support detecting this condition. One thing you can do today is configure your ELB health check to wait longer before marking the service unhealthy.
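
For example, with a classic ELB, something along these lines relaxes the health check (a sketch only; the load balancer name, health check target, and threshold values are placeholders to adjust for your service):

# allow up to 10 failed checks, 30 seconds apart, before marking the instance unhealthy
aws elb configure-health-check \
  --load-balancer-name my-service-elb \
  --health-check Target=HTTP:8888/myService,Interval=30,Timeout=10,UnhealthyThreshold=10,HealthyThreshold=2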

Thanks


jhovell commented Oct 5, 2016

@richardpen what would or could be the cause of the slowness? It isn't necessarily an ECS agent issue; it could be an underlying Docker issue or an issue with my app in general. It's hard for me to debug since I don't necessarily know all the internals of the ECS-Optimized AMI.

Normally an app might take 20-30 seconds to boot, and the ELB will not fail a task for at least 2 minutes. Increasing the ELB timeout to get around this transient issue doesn't seem like a very good solution, right? I'm guessing there is some underlying problem with the agent/Docker/ECS... I'm just not sure where to start diagnosing.


jhovell commented Oct 5, 2016

After examining this in a bit more detail, I think this is actually a duplicate of #537.

It looks like an ECS service got stuck in a restart loop, and all the container restarts exceeded the 66 IOPS/sec budget on our 22 GB gp2 volumes and drained our burst balance to zero. Then, when another service (or services) was updated to a newer image, the volume could no longer burst, so the container did not start in time for the ELB health check to pass. Those services got stuck in a restart loop as well, keeping the EBS credit balance at roughly zero and degrading every service on the host that needs more than very light disk access.
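
To make the arithmetic concrete, here is a rough sketch of the gp2 burst-bucket math in shell (the 500 sustained IOPS figure is an assumption for illustration; the 5.4 million credit bucket and 3 IOPS per GiB accrual rate come from the EBS docs linked below):

BASELINE=$((22 * 3))   # a 22 GiB gp2 volume accrues 3 IOPS/GiB = 66 IOPS of credits per second
BUCKET=5400000         # initial/maximum I/O credit balance of a gp2 volume
SUSTAINED=500          # assumed sustained IOPS while containers are restart-looping
echo "seconds to drain the burst bucket: $(( BUCKET / (SUSTAINED - BASELINE) ))"
# ~12,400 seconds, i.e. roughly 3.5 hours until the volume is throttled to its 66 IOPS baseline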

In my case I will likely address this by adding CloudWatch alarms that fire if the EBS read/write IOPS budget is exceeded for a lengthy period of time, such as an hour, and that trigger downstream actions to deal with the issue.
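
For instance, a single-volume sketch along these lines (the alarm name, volume ID, and SNS topic ARN are placeholders; a matching alarm on VolumeReadOps would cover the read side):

# alarm if write ops on the Docker volume exceed the 66 IOPS baseline
# (66 IOPS * 300 s = 19,800 ops per 5-minute period) for 12 consecutive periods (~1 hour)
aws cloudwatch put-metric-alarm \
  --alarm-name ecs-docker-volume-write-iops-over-baseline \
  --namespace AWS/EBS \
  --metric-name VolumeWriteOps \
  --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
  --statistic Sum \
  --period 300 \
  --evaluation-periods 12 \
  --threshold 19800 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-west-2:123456789012:ops-alerts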

Adding this here in case it helps anyone else, as I imagine this is a common failure mode for anyone using the ECS-Optimized AMI who accidentally deploys a task definition that cannot or does not pass ELB health checks.

EBS docs on IO credit balances:
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html

AWS forum thread on a similar topic:
https://forums.aws.amazon.com/thread.jspa?threadID=164462

@kiranmeduri (Contributor)

Thanks @jhovell for root-causing this. Yes, you are right that this is one of the common failure modes we observe in ECS. If this becomes a problem, we recommend moving to an SSD volume with provisioned IOPS. We will improve our documentation around this so that it is clearer. Thanks again.
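
For an existing instance, one option is to change the Docker data volume's type in place. A sketch only: the volume ID and IOPS figure are placeholders, and in-place modification requires an instance/volume combination that supports Elastic Volumes (otherwise stop the instance or detach the volume first).

# convert the gp2 Docker data volume to io1 with a fixed 200 provisioned IOPS
aws ec2 modify-volume \
  --volume-id vol-0123456789abcdef0 \
  --volume-type io1 \
  --iops 200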


gkrizek commented Mar 29, 2017

@jhovell Do you know what triggers the restart loop in the first place?

We have been experiencing the same thing out of nowhere, and I can't find the root cause of these events. It's very strange because we have three clusters that are basically identical (Prod, Staging, Dev). All clusters have 3 hosts running t2.larges. We had 20 tasks running on Prod and 12 each on Staging and Dev. Staging and Dev started failing across the board yesterday, seemingly out of nowhere. The only way to fix it was to terminate all the hosts and let new ones be created.

I initially thought that the hosts were overcommitted on memory, but that wasn't the case. After some googling I looked at what you had said, and you were correct in our case: the burst credits were at 0 and the server was bricked (the agent wasn't connected, we couldn't SSH, etc.). This seems like a big problem that happens out of nowhere, and I would prefer to find a solution that prevents it rather than creating alarms that handle it after the fact.

Is PIOPs the only recommended way of fixing this?


gkrizek commented Mar 30, 2017

As an update to my situation: switching to PIOPS fixed our problems. I provisioned double what we were getting from the regular gp2 volume (so essentially 6 IOPS per GB on the new volume). This seems like the simplest and most cost-effective way to resolve the issue. We only switched to PIOPS for the Docker volume when using the ECS-Optimized AMI; the root partitions are still gp2.
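
For reference, a sketch of what that looks like when launching instances (the AMI ID and IOPS value are placeholders; on the ECS-Optimized AMIs of this era the Docker data volume is typically mapped at /dev/xvdcz, but verify the device name for your AMI version):

# launch with the 22 GiB Docker data volume as io1 at roughly 6 IOPS per GB
aws ec2 run-instances \
  --image-id ami-xxxxxxxx \
  --instance-type t2.large \
  --block-device-mappings '[{"DeviceName":"/dev/xvdcz","Ebs":{"VolumeSize":22,"VolumeType":"io1","Iops":132,"DeleteOnTermination":true}}]'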

I think the cause of our issue was a mix of high-I/O Docker applications that triggered restart loops, and it just spiraled out of control from there. Still monitoring, but PIOPS seems like a quick and easy solution.
