
ecs-agent running but not connected after a while, generates AGENT errors trying to start tasks #316

Closed
workybee opened this issue Feb 19, 2016 · 31 comments

Comments

@workybee

Running 2015.09.f with ecs-agent version 1.8.0, we try to start about 100 tasks every 15 minutes.
This configuration seemed to run fine for a few hours after bringing up a fresh cluster of 4 m4.large instances. Then we start getting AGENT errors:
2016-02-19 09:11:46,172 [13767] ERROR ECS - Failed running task prod_new_collect_ads_-u_304671576282352 - reason [{u'reason': u'AGENT', u'arn': u'arn:aws:ecs:us-east-1:860733554218:container-instance/181f09d8-b9af-4062-9728-25a64c2288d0'}, {u'reason': u'AGENT', u'arn': u'arn:aws:ecs:us-east-1:860733554218:container-instance/716ac35d-b6c9-4610-b3f0-19c5385f272c'}, {u'reason': u'AGENT', u'arn': u'arn:aws:ecs:us-east-1:860733554218:container-instance/d27f9931-a69f-4a32-92d0-8a9889ad1412'}, {u'reason': u'AGENT', u'arn': u'arn:aws:ecs:us-east-1:860733554218:container-instance/fdae64b7-583a-4da3-baf1-9cf9e905e3c5'}]
Indeed the console shows "Agent Connected" as false, but on the instances "docker ps" shows the agent still running (as the only thing running!).
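The "Agent Connected" flag can also be checked from the CLI rather than the console; a minimal sketch, assuming the AWS CLI is configured and that my-cluster and the container-instance ID are placeholders for your own values:

# Check whether ECS believes the agent on a container instance is connected (placeholders shown)
aws ecs describe-container-instances \
    --cluster my-cluster \
    --container-instances <container-instance-id> \
    --query 'containerInstances[].agentConnected'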

At the suggestion of @samuelkarp in #313, the ECS agent is configured to produce debug output.
It is also configured with:
ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=30m
to help manage disk space usage.
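For reference, a minimal sketch of the corresponding /etc/ecs/ecs.config on the ECS-optimized AMI; the cluster name is a placeholder, and the other two entries are the settings described above:

# /etc/ecs/ecs.config (cluster name is a placeholder)
ECS_CLUSTER=my-cluster
ECS_LOGLEVEL=debug
ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=30m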

The last several error lines from one of the instances:

/var/log/ecs/ecs-agent.log.2016-02-19-02:2016-02-19T02:57:23Z [WARN] Error getting cpu stats module="stats" err="No data in the queue" container="&{DockerID:1ac2aaabb6a98f201ef4489d0345a35a4370bfc757e3d6421d0ff91a6979a013}"
/var/log/ecs/ecs-agent.log.2016-02-19-02:2016-02-19T02:57:23Z [WARN] Error getting instance metrics module="tcs client" err="No task metrics to report"
/var/log/ecs/ecs-agent.log.2016-02-19-02:2016-02-19T02:59:32Z [DEBUG] Error getting stats module="stats" error="open /var/lib/docker/execdriver/native/1aee3569aa72bb5a09942b83c9aad54dd116e41101889555e6ebe3dfe5a0f604/state.json: no such file or directory" contianer="&{containerMetadata:0xc20817e460 ctx:0xc208320800 cancel:0x4fd120 statePath:/var/lib/docker/execdriver/native/1aee3569aa72bb5a09942b83c9aad54dd116e41101889555e6ebe3dfe5a0f604 statsQueue:0xc2083207c0 statsCollector:0xcb4ab0}"
/var/log/ecs/ecs-agent.log.2016-02-19-03:2016-02-19T03:00:07Z [DEBUG] Error getting stats module="stats" error="open /var/lib/docker/execdriver/native/4c3bf47e8edf4463f3abfed65505351e58c89acaec43b0782cb377fe6175cb41/state.json: no such file or directory" contianer="&{containerMetadata:0xc208203a30 ctx:0xc2083785c0 cancel:0x4fd120 statePath:/var/lib/docker/execdriver/native/4c3bf47e8edf4463f3abfed65505351e58c89acaec43b0782cb377fe6175cb41 statsQueue:0xc208378580 statsCollector:0xcb4ab0}"
/var/log/ecs/ecs-agent.log.2016-02-19-03:2016-02-19T03:00:20Z [ERROR] Error getting message from ws backend module="ws client" err="unexpected EOF"
 /var/log/ecs/ecs-agent.log.2016-02-19-03:2016-02-19T03:00:20Z [INFO] Error from acs; backing off module="acs handler" err="unexpected EOF"
/var/log/ecs/ecs-agent.log.2016-02-19-07:2016-02-19T07:23:05Z [ERROR] Error getting message from ws backend module="ws client" err="unexpected EOF"
 /var/log/ecs/ecs-agent.log.2016-02-19-07:2016-02-19T07:23:05Z [INFO] Error from tcs; backing off module="tcs handler" err="unexpected EOF"
/var/log/ecs/ecs-agent.log.2016-02-19-13:2016-02-19T13:28:48Z [ERROR] Error getting message from ws backend module="ws client" err="unexpected EOF"
/var/log/ecs/ecs-agent.log.2016-02-19-13:2016-02-19T13:28:48Z [INFO] Error from tcs; backing off module="tcs handler" err="unexpected EOF"
@samuelkarp
Contributor

@workybee Are you just grepping for "error"? If so, you're excluding a bunch of context that would help us actually dig into what the Agent is doing.

Since it sounds like Docker is being responsive now (docker ps is working?), can you run docker kill -s USR1 ecs-agent and post the resulting stack dump here (you'll find it in the log file in /var/log/ecs)?
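For anyone else following along, a rough sketch of capturing that stack dump; the dump lands in the agent's current log file under /var/log/ecs, whose exact name depends on the hour:

# Ask the agent to dump its goroutine stacks, then compress the newest log file for attachment
docker kill -s USR1 ecs-agent
sleep 2
latest_log=$(ls -t /var/log/ecs/ecs-agent.log.* | head -1)
gzip -c "$latest_log" > ecs-agent_stacktrace_$(date -u +%Y-%m-%dT%H%M).log.gz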

@workybee
Author

Yes, I was just grepping for "Error".
Sorry it took so long to get back to you; many fires today... ;-)

Attached is the result of "docker kill -s USR1 ecs-agent" on one of the instances:

ecs-agent_stacktrace_2016-02-19T2138.log.gz

@samuelkarp
Contributor

@workybee This looks like the same thing I was seeing in @tschutte's log in #313. Can you let me know how many containers are listed in the data file? Something like this:

sudo yum -y install jq
sudo cat /var/lib/ecs/data/ecs_agent_data.json | jq '.Data.TaskEngine.Tasks[].Containers[].Name' | wc -l
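A couple of related counts that may be useful, assuming the same layout of the agent's state file as in the command above:

# Number of tasks vs. number of containers tracked in the agent's state file
sudo jq '.Data.TaskEngine.Tasks | length' /var/lib/ecs/data/ecs_agent_data.json
sudo jq '[.Data.TaskEngine.Tasks[].Containers[]] | length' /var/lib/ecs/data/ecs_agent_data.json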

@workybee
Author

There's a definite similarity, but my case might be more amenable to your lock-held theory. ;-)
After the yum install of jq mentioned above:

[ec2-user@ip-172-30-1-173 ~]$ sudo cat /var/lib/ecs/data/ecs_agent_data.json | jq '.Data.TaskEngine.Tasks[].Containers[].Name' | wc -l
41

And certainly my attempts at "docker ps" or similar sometimes take "forever" just after we fire off multiple start-task calls (a total of about 100 every 15 minutes). Things seem to work for a while, then get wedged. (See #309 and #313 for more context.)

@andreausu

We're having the same issue after upgrading to 2015.09.f.
We updated our nodes to the new AMI on Friday. Today (after a couple of days) we're seeing issues deploying new code due to the disconnected agent (even though the agent container is indeed up), and the logs show these lines over and over:

2016-02-22T10:33:04Z [INFO] Creating poll dialer module="ws client" host="ecs-t-1.eu-west-1.amazonaws.com"
2016-02-22T10:33:05Z [WARN] Error creating a websocket client module="ws client" err="websocket: bad handshake"
2016-02-22T10:33:05Z [ERROR] Error connecting to TCS: {"AccessDeniedException":"Forbidden"}, websocket: bad handshake module="tcs handler"
2016-02-22T10:33:05Z [INFO] Error from tcs; backing off module="tcs handler" err="{"AccessDeniedException":"Forbidden"}, websocket: bad handshake"

We're running fewer than 10 containers on these nodes, and we didn't start or stop tasks during these two days.

We've fixed it for now by doing a stop ecs and start ecs.
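A minimal sketch of that restart on the ECS-optimized AMI, plus a quick local check that the agent is answering again (port 51678 is the agent's introspection API):

# Restart the agent via ecs-init, then confirm it responds locally
sudo stop ecs
sudo start ecs
curl -s http://127.0.0.1:51678/v1/metadata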

@ejholmes
Contributor

We're seeing the same problem after upgrading to 1.8.0, but we were also seeing it on 1.6.0 (although with more frequency now).

@samuelkarp
Contributor

At this point, we believe this is the cause. We've just merged a fix to be included in our next release.

@jpignata

@samuelkarp Awesome, thank you. Is there a workaround here prior to the release of your next version? This is happening multiple times a day on our clusters and causing myriad problems. Is downgrading to the previous version the recommendation?

@mmontagna

@jpignata We downgraded to version 1.7 and haven't had any problems. For what it's worth, periodically killing the 1.8 agent containers unsticks the service and didn't seem to cause any problems; we did that for a bit, but I'm not sure it's a good long-term idea.
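A very blunt sketch of that stopgap for anyone who wants to automate it until the fix ships; the cron file name is hypothetical, and restarting the agent on a timer is, as noted, probably not a good long-term idea:

# /etc/cron.d/restart-ecs-agent (hypothetical): restart the agent container every 4 hours
0 */4 * * * root docker restart ecs-agent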

@ejholmes
Contributor

Also downgraded to 1.6 for the time being.

@bilalaslamseattle

@samuelkarp we deployed the dev branch to one of our clusters and it did not fix the problem. ECS is still reporting periodic AGENT disconnects :(

@samuelkarp
Contributor

@bilalaslamseattle Are they the same symptoms as before? Can you provide logs at debug level (ECS_LOGLEVEL=debug) and the stack trace on a disconnected Agent (docker kill -s USR1 ecs-agent)?

@bilalaslamseattle

@samuelkarp roger that, I'll get these over to you.

@jwerak

jwerak commented Feb 25, 2016

@samuelkarp answering here for @bilalaslamseattle.

Is this the info you are looking for? https://gist.github.com/veverjak/3607a9dd1f141e31d5a8

Also, how does one set ECS_LOGLEVEL?
Are env vars from /etc/init/ecs.conf passed to docker run? It doesn't seem so. Is there a simple way to pass an env var to Docker from ecs-init, or do I have to stop the service and run the ecs-agent container manually with the docker run CLI?

@samuelkarp
Contributor

@veverjak Yes, that's what I'm looking for. Is that from an Agent built with the dev branch (and the fix) or from the 1.8.0 version?

If you're using our ECS-optimized AMI, you can set ECS_LOGLEVEL in /etc/ecs/ecs.config, then run sudo stop ecs ; sudo start ecs to restart the Agent.

@jwerak

jwerak commented Feb 25, 2016

@samuelkarp yes, we are running the latest dev branch.
I am not seeing the errors now (they appear irregularly), so I will switch on debug mode and send you logs when the issue appears. Or are logs without the error useful as well?

@samuelkarp
Contributor

There may not be lines that say "error", but logs and stack trace information will be very useful if you run into the deadlock/state-drift issue. In our testing so far we haven't reproduced the deadlock when building from the dev branch.

@jwerak

jwerak commented Feb 25, 2016

@samuelkarp
Got it, here is the stack trace from a disconnected instance: https://gist.github.com/veverjak/73e05734c5c29a2257f4

And here are the debug logs: https://gist.github.com/veverjak/31327365a43115e71fb9

@aaithal
Contributor

aaithal commented Feb 26, 2016

Hi @veverjak,

Apologies for asking you to confirm this again, but I looked up the information about the container instance on which you are facing this issue, and it seems to have a different agentHash than the one on the "dev" branch. It should have been computed as *305353a, corresponding to the latest commit, but I see that it is set to *fd7b600, which does not correspond to the latest commit.

Could you please confirm that the agent you are using has the fix from the "dev" branch merged in?

@jwerak

jwerak commented Feb 26, 2016

Hi @aaithal,

No problem; we are actually building from this repo: https://github.com/appuri/amazon-ecs-agent/commits/build-agent

@bilalaslamseattle

@aaithal thanks for digging into this. This bug sort of makes ECS unusable: deploys fail, etc. Badness all around. Can the ECS team fast-track this, please?

@nicolai86

👍 We downgraded as well, since this is breaking ECS for us very regularly.

@juanrhenals
Contributor

A fix for this deadlock issue has been released as part of Agent v1.8.1. We also released a new AMI with the v1.8.1 agent bundled.

If you continue to run into issues or have new information, please reopen this issue or open a new one.
For now, I don't think we have any additional action items for this issue on our end.
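If you'd rather update the agent in place than roll your instances to the new AMI, the agent on the ECS-optimized AMI can also be updated per container instance; a sketch, assuming the AWS CLI is configured and that my-cluster and the container-instance ID are placeholders for your own values:

# Request an in-place agent update on one container instance (placeholders shown)
aws ecs update-container-agent \
    --cluster my-cluster \
    --container-instance <container-instance-id>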

@workybee
Author

workybee commented Mar 4, 2016

We refreshed our cluster with the new AMI (2015.09.g) and ecs-agent v1.8.1, and it has been running smoothly ever since (about 24 hours), whereas we used to need to restart the ecs-agent every 30 minutes.
Thanks all for the fix.

@jwerak

jwerak commented Mar 7, 2016

@juanrhenals we are running ecs-agent v1.8.1 but not on the latest AMI - could that be the cause?

@bilalaslamseattle

@juanrhenals we are still seeing this issue. Like @veverjak, we are not on the latest AMI. Could that be the cause?

@aaithal
Contributor

aaithal commented Mar 8, 2016

@veverjak Our tests prior to releasing the new version did not reproduce the agent deadlock with the fix. Could you please open a new issue to debug the issues that you are seeing with the agent? We were unable to detect the deadlock from the previous set of logs that you provided. We would like to know the following information from your instance to help us debug this better:

  1. ECS Agent logs at debug level (ECS_LOGLEVEL=debug) and the stack trace on a disconnected Agent (docker kill -s USR1 ecs-agent)
  2. ECS Agent Config, if you have any non-default customizations
  3. Output of curl 127.0.0.1:51678/v1/metadata on the EC2 Instance where you see the deadlock

Feel free to send those to aithal at amazon.com if you do not want to share the files on GitHub.
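A rough collection sketch for gathering items 1-3 above into one archive; the file names and the append to ecs.config are assumptions, adjust as needed:

# 1. Turn on debug logging and restart the agent
echo 'ECS_LOGLEVEL=debug' | sudo tee -a /etc/ecs/ecs.config
sudo stop ecs && sudo start ecs
# 2. Once the agent is disconnected again, dump its goroutine stacks into the log
docker kill -s USR1 ecs-agent
# 3. Capture the introspection metadata and bundle everything up
curl -s http://127.0.0.1:51678/v1/metadata > /tmp/ecs-metadata.json
sudo tar czf /tmp/ecs-debug-bundle.tar.gz /var/log/ecs /etc/ecs/ecs.config /tmp/ecs-metadata.json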

Also, are you still using https://github.com/appuri/amazon-ecs-agent/commits/build-agent to build the agent? I took a shot at building this image, but it seems there are some unresolved merge conflicts, which would cause build failures.

@aaithal
Contributor

aaithal commented Mar 8, 2016

@bilalaslamseattle Could you please confirm whether you are seeing these issues on ECS Agent v1.8.1? I'd also encourage you to create a new issue with more details about the failure, as mentioned in my previous post.

@jwerak

jwerak commented Mar 8, 2016

@aaithal thanks, I will gather that info once we experience a disconnect again and create a ticket for it.
Also, sorry about the agent build with the merge conflict; I've cleaned it up and we are now on top of the 1.8.1 branch here - https://github.com/appuri/amazon-ecs-agent

@jhovell

jhovell commented Jun 2, 2016

@samuelkarp I think I am experiencing this issue with ECS Agent 1.9.0 / ECS AMI 2016.03b. I have an instance that dumps the following to the log:

2016-06-02T11:08:47Z [INFO] Redundant container state change for task my-app:177 arn:aws:ecs:us-west-2:ecs-account-id:task/abd127ff-729c-47cf-b6d4-7d80055fb99a, Status: (RUNNING->RUNNING) Containers: [my-app (RUNNING->RUNNING),]: my-app(ecr-account-id.dkr.ecr.us-east-1.amazonaws.com/my-app:my-docker-tag) (RUNNING->RUNNING) to RUNNING, but already RUNNING
2016-06-02T11:13:41Z [INFO] Creating poll dialer module="ws client" host="ecs-t-2.us-west-2.amazonaws.com"
2016-06-02T11:13:41Z [WARN] Error getting cpu stats module="stats" err="No data in the queue" container="&{DockerID:7f5c4e32286e416897a9399cad5b1d2006adbc2d24cd5c1d08ab7fee5ab9c050}"
2016-06-02T11:13:41Z [WARN] Error getting instance metrics module="tcs client" err="No task metrics to report"
2016-06-02T11:18:47Z [INFO] Redundant container state change for task my-app:177 arn:aws:ecs:us-west-2:ecs-account-id:task/abd127ff-729c-47cf-b6d4-7d80055fb99a, Status: (RUNNING->RUNNING) Containers: [my-app (RUNNING->RUNNING),]: my-app(ecr-account-id.dkr.ecr.us-east-1.amazonaws.com/my-app:my-docker-tag) (RUNNING->RUNNING) to RUNNING, but already RUNNING
2016-06-02T11:27:41Z [INFO] Creating poll dialer module="ws client" host="ecs-t-2.us-west-2.amazonaws.com"
2016-06-02T11:27:41Z [WARN] Error getting cpu stats module="stats" err="No data in the queue" container="&{DockerID:7f5c4e32286e416897a9399cad5b1d2006adbc2d24cd5c1d08ab7fee5ab9c050}"
2016-06-02T11:27:41Z [WARN] Error getting instance metrics module="tcs client" err="No task metrics to report"
2016-06-02T11:28:47Z [INFO] Redundant container state change for task my-app:177 arn:aws:ecs:us-west-2:ecs-account-id:task/abd127ff-729c-47cf-b6d4-7d80055fb99a, Status: (RUNNING->RUNNING) Containers: [my-app (RUNNING->RUNNING),]: my-app(ecr-account-id.dkr.ecr.us-east-1.amazonaws.com/my-app:my-docker-tag) (RUNNING->RUNNING) to RUNNING, but already RUNNING
2016-06-02T11:38:47Z [INFO] Redundant container state change for task my-app:177 arn:aws:ecs:us-west-2:ecs-account-id:task/abd127ff-729c-47cf-b6d4-7d80055fb99a, Status: (RUNNING->RUNNING) Containers: [my-app (RUNNING->RUNNING),]: my-app(ecr-account-id.dkr.ecr.us-east-1.amazonaws.com/my-app:my-docker-tag) (RUNNING->RUNNING) to RUNNING, but already RUNNING
2016-06-02T11:45:41Z [INFO] Creating poll dialer module="ws client" host="ecs-t-2.us-west-2.amazonaws.com"
2016-06-02T11:45:41Z [WARN] Error getting cpu stats module="stats" err="No data in the queue" container="&{DockerID:7f5c4e32286e416897a9399cad5b1d2006adbc2d24cd5c1d08ab7fee5ab9c050}"
2016-06-02T11:45:41Z [WARN] Error getting instance metrics module="tcs client" err="No task metrics to report"
2016-06-02T11:48:47Z [INFO] Redundant container state change for task my-app:177 arn:aws:ecs:us-west-2:ecs-account-id:task/abd127ff-729c-47cf-b6d4-7d80055fb99a, Status: (RUNNING->RUNNING) Containers: [my-app (RUNNING->RUNNING),]: my-app(ecr-account-id.dkr.ecr.us-east-1.amazonaws.com/my-app:my-docker-tag) (RUNNING->RUNNING) to RUNNING, but already RUNNING
2016-06-02T11:55:41Z [INFO] Error from tcs; backing off module="tcs handler" err="unexpected EOF"
2016-06-02T11:55:42Z [INFO] Creating poll dialer module="ws client" host="ecs-t-2.us-west-2.amazonaws.com"
2016-06-02T11:58:47Z [INFO] Redundant container state change for task my-app:177 arn:aws:ecs:us-west-2:ecs-account-id:task/abd127ff-729c-47cf-b6d4-7d80055fb99a, Status: (RUNNING->RUNNING) Containers: [my-app (RUNNING->RUNNING),]: my-app(ecr-account-id.dkr.ecr.us-east-1.amazonaws.com/my-app:my-docker-tag) (RUNNING->RUNNING) to RUNNING, but already RUNNING

The instance is still running a my-app container (as shown by the log), but the agent is disconnected from the cluster. I could enable debug logging and restart the agent (not sure if that will fix the problem) and post debug logs if that helps.

Is there anything else this could be?

@samuelkarp
Contributor

@jhovell Yes, debug logs would definitely help. If you're still experiencing problems, please open a new issue.
