
ecs-agent running but not connected after a while, generates AGENT errors trying to start tasks #316

Closed
workybee opened this issue Feb 19, 2016 · 31 comments

Comments

@workybee

Running 2015.09.f with ecs-agent version 1.8.0, we try to start about 100 tasks every 15 minutes.
This configuration seemed to run fine for a few hours after bringing up a fresh cluster of 4 m4.large instances. Then we start getting AGENT errors:
2016-02-19 09:11:46,172 [13767] ERROR ECS - Failed running task prod_new_collect_ads_-u_304671576282352 - reason [{u'reason': u'AGENT', u'arn': u'arn:aws:ecs:us-east-1:860733554218:container-instance/181f09d8-b9af-4062-9728-25a64c2288d0'}, {u'reason': u'AGENT', u'arn': u'arn:aws:ecs:us-east-1:860733554218:container-instance/716ac35d-b6c9-4610-b3f0-19c5385f272c'}, {u'reason': u'AGENT', u'arn': u'arn:aws:ecs:us-east-1:860733554218:container-instance/d27f9931-a69f-4a32-92d0-8a9889ad1412'}, {u'reason': u'AGENT', u'arn': u'arn:aws:ecs:us-east-1:860733554218:container-instance/fdae64b7-583a-4da3-baf1-9cf9e905e3c5'}]
Indeed the console shows "Agent Connected" as false, but on the instances "docker ps" shows the agent still running (as the only thing running!).
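The "Agent Connected" flag can also be checked from the CLI rather than the console; a minimal sketch, assuming the AWS CLI is configured and that my-cluster and the container-instance ID are placeholders for your own values:

# Check whether ECS believes the agent on a container instance is connected (placeholders shown)
aws ecs describe-container-instances \
    --cluster my-cluster \
    --container-instances <container-instance-id> \
    --query 'containerInstances[].agentConnected'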

At the suggestion of @samuelkarp in #313, the ECS agent is configured to produce debug output.
It is also configured with:
ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=30m
to help manage disk space usage.
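For reference, a minimal sketch of the corresponding /etc/ecs/ecs.config on the ECS-optimized AMI; the cluster name is a placeholder, and the other two entries are the settings described above:

# /etc/ecs/ecs.config (cluster name is a placeholder)
ECS_CLUSTER=my-cluster
ECS_LOGLEVEL=debug
ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=30m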

The last several error lines from one of the instances:

/var/log/ecs/ecs-agent.log.2016-02-19-02:2016-02-19T02:57:23Z [WARN] Error getting cpu stats module="stats" err="No data in the queue" container="&{DockerID:1ac2aaabb6a98f201ef4489d0345a35a4370bfc757e3d6421d0ff91a6979a013}"
/var/log/ecs/ecs-agent.log.2016-02-19-02:2016-02-19T02:57:23Z [WARN] Error getting instance metrics module="tcs client" err="No task metrics to report"
/var/log/ecs/ecs-agent.log.2016-02-19-02:2016-02-19T02:59:32Z [DEBUG] Error getting stats module="stats" error="open /var/lib/docker/execdriver/native/1aee3569aa72bb5a09942b83c9aad54dd116e41101889555e6ebe3dfe5a0f604/state.json: no such file or directory" contianer="&{containerMetadata:0xc20817e460 ctx:0xc208320800 cancel:0x4fd120 statePath:/var/lib/docker/execdriver/native/1aee3569aa72bb5a09942b83c9aad54dd116e41101889555e6ebe3dfe5a0f604 statsQueue:0xc2083207c0 statsCollector:0xcb4ab0}"
/var/log/ecs/ecs-agent.log.2016-02-19-03:2016-02-19T03:00:07Z [DEBUG] Error getting stats module="stats" error="open /var/lib/docker/execdriver/native/4c3bf47e8edf4463f3abfed65505351e58c89acaec43b0782cb377fe6175cb41/state.json: no such file or directory" contianer="&{containerMetadata:0xc208203a30 ctx:0xc2083785c0 cancel:0x4fd120 statePath:/var/lib/docker/execdriver/native/4c3bf47e8edf4463f3abfed65505351e58c89acaec43b0782cb377fe6175cb41 statsQueue:0xc208378580 statsCollector:0xcb4ab0}"
/var/log/ecs/ecs-agent.log.2016-02-19-03:2016-02-19T03:00:20Z [ERROR] Error getting message from ws backend module="ws client" err="unexpected EOF"
 /var/log/ecs/ecs-agent.log.2016-02-19-03:2016-02-19T03:00:20Z [INFO] Error from acs; backing off module="acs handler" err="unexpected EOF"
/var/log/ecs/ecs-agent.log.2016-02-19-07:2016-02-19T07:23:05Z [ERROR] Error getting message from ws backend module="ws client" err="unexpected EOF"
 /var/log/ecs/ecs-agent.log.2016-02-19-07:2016-02-19T07:23:05Z [INFO] Error from tcs; backing off module="tcs handler" err="unexpected EOF"
/var/log/ecs/ecs-agent.log.2016-02-19-13:2016-02-19T13:28:48Z [ERROR] Error getting message from ws backend module="ws client" err="unexpected EOF"
/var/log/ecs/ecs-agent.log.2016-02-19-13:2016-02-19T13:28:48Z [INFO] Error from tcs; backing off module="tcs handler" err="unexpected EOF"
@samuelkarp
Contributor

@workybee Are you just grepping for "error"? If so, you're excluding a bunch of context that would help us actually dig into what the Agent is doing.

Since it sounds like Docker is being responsive now (docker ps is working?), can you run docker kill -s USR1 ecs-agent and post the resulting stack dump here (you'll find it in the log file in /var/log/ecs)?
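For anyone else following along, a rough sketch of capturing that stack dump; the dump lands in the agent's current log file under /var/log/ecs, whose exact name depends on the hour:

# Ask the agent to dump its goroutine stacks, then compress the newest log file for attachment
docker kill -s USR1 ecs-agent
sleep 2
latest_log=$(ls -t /var/log/ecs/ecs-agent.log.* | head -1)
gzip -c "$latest_log" > ecs-agent_stacktrace_$(date -u +%Y-%m-%dT%H%M).log.gz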

@workybee
Author

Yes, I was just grepping for "Error".
Sorry it took so long to get back to you; many fires today... ;-)

Attached is the result of "docker kill -s USR1 ecs-agent" on one of the instances:

ecs-agent_stacktrace_2016-02-19T2138.log.gz

@samuelkarp
Contributor

@workybee This looks like the same thing I was seeing in @tschutte's log in #313. Can you let me know how many containers are listed in the data file? Something like this:

sudo yum -y install jq
sudo cat /var/lib/ecs/data/ecs_agent_data.json | jq '.Data.TaskEngine.Tasks[].Containers[].Name' | wc -l
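A couple of related counts that may be useful, assuming the same layout of the agent's state file as in the command above:

# Number of tasks vs. number of containers tracked in the agent's state file
sudo jq '.Data.TaskEngine.Tasks | length' /var/lib/ecs/data/ecs_agent_data.json
sudo jq '[.Data.TaskEngine.Tasks[].Containers[]] | length' /var/lib/ecs/data/ecs_agent_data.json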

@workybee
Author

There's a definite similarity, but my case might be more amenable to your lock-held theory. ;-)
After the yum install of jq mentioned above:

[ec2-user@ip-172-30-1-173 ~]$ sudo cat /var/lib/ecs/data/ecs_agent_data.json | jq '.Data.TaskEngine.Tasks[].Containers[].Name' | wc -l
41

And certainly my attempts at "docker ps" or similar sometimes take "forever" just after we fire off multiple start-task calls (a total of about 100 every 15 minutes). Things seem to work for a while, then get wedged. (See #309 and #313 for more context.)

@andreausu

We're having the same issue after upgrading to 2015.09.f.
We updated our nodes to the new AMI on Friday. Today (after a couple of days) we're seeing issues deploying new code due to the disconnected agent (even though the agent container is indeed up), and the logs show these lines over and over:

2016-02-22T10:33:04Z [INFO] Creating poll dialer module="ws client" host="ecs-t-1.eu-west-1.amazonaws.com"
2016-02-22T10:33:05Z [WARN] Error creating a websocket client module="ws client" err="websocket: bad handshake"
2016-02-22T10:33:05Z [ERROR] Error connecting to TCS: {"AccessDeniedException":"Forbidden"}, websocket: bad handshake module="tcs handler"
2016-02-22T10:33:05Z [INFO] Error from tcs; backing off module="tcs handler" err="{"AccessDeniedException":"Forbidden"}, websocket: bad handshake"

We're running fewer than 10 containers on these nodes, and we didn't start or stop tasks during these two days.

We've fixed it for now by doing a stop ecs and start ecs.
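A minimal sketch of that restart on the ECS-optimized AMI, plus a quick local check that the agent is answering again (port 51678 is the agent's introspection API):

# Restart the agent via ecs-init, then confirm it responds locally
sudo stop ecs
sudo start ecs
curl -s http://127.0.0.1:51678/v1/metadata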

@ejholmes
Contributor

We're seeing the same problem after upgrading to 1.8.0, but we were also seeing it on 1.6.0 (although with more frequency now).

@samuelkarp
Contributor

At this point, we believe this is the cause. We've just merged a fix to be included in our next release.

@jpignata

@samuelkarp Awesome, thank you. Is there a workaround here prior to the release of your next version? This is happening multiple times a day on our clusters and causing myriad problems. Is downgrading to the previous version the recommendation?

@mmontagna

@jpignata We downgraded to version 1.7 and haven't had any problems. For what it's worth, periodically killing the 1.8 agent containers unsticks the service and didn't seem to cause any problems; we did that for a bit, but I'm not sure it's a good long-term idea.
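A very blunt sketch of that stopgap for anyone who wants to automate it until the fix ships; the cron file name is hypothetical, and restarting the agent on a timer is, as noted, probably not a good long-term idea:

# /etc/cron.d/restart-ecs-agent (hypothetical): restart the agent container every 4 hours
0 */4 * * * root docker restart ecs-agent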

@ejholmes
Contributor

Also downgraded to 1.6 for the time being.

@bilalaslamseattle

@samuelkarp we deployed the dev branch to one of our clusters and it did not fix the problem. ECS is still reporting periodic AGENT disconnects :(

@samuelkarp
Contributor

@bilalaslamseattle Are they the same symptoms as before? Can you provide logs at debug level (ECS_LOGLEVEL=debug) and the stack trace on a disconnected Agent (docker kill -s USR1 ecs-agent)?

@bilalaslamseattle

@samuelkarp roger that, I'll get these over to you.

@jwerak

jwerak commented Feb 25, 2016

@samuelkarp answering here for @bilalaslamseattle.

Is this the info you are looking for? https://gist.github.com/veverjak/3607a9dd1f141e31d5a8

Also, how does one set ECS_LOGLEVEL?
Are env vars from /etc/init/ecs.conf passed to docker run? It doesn't seem so. Is there a simple way to pass an env var to Docker from ecs-init, or do I have to stop the service and run the ecs-agent container manually with the docker run CLI?

@samuelkarp
Contributor

@veverjak Yes, that's what I'm looking for. Is that from an Agent built with the dev branch (and the fix) or from the 1.8.0 version?

If you're using our ECS-optimized AMI, you can set ECS_LOGLEVEL in /etc/ecs/ecs.config, then run sudo stop ecs ; sudo start ecs to restart the Agent.

@jwerak

jwerak commented Feb 25, 2016

@samuelkarp yes, we are running the latest dev branch.
I am not seeing the errors now (they appear irregularly), so I will switch on debug mode and send you logs when the issue appears. Or are logs without the error useful as well?

@samuelkarp
Contributor

There may not be lines that say "error", but logs and stack trace information will be very useful if you run into the deadlock/state-drift issue. In our testing so far we haven't reproduced the deadlock when building from the dev branch.

@jwerak

jwerak commented Feb 25, 2016

@samuelkarp
Got it, here is the stack trace from a disconnected instance: https://gist.github.com/veverjak/73e05734c5c29a2257f4

And here are the debug logs: https://gist.github.com/veverjak/31327365a43115e71fb9

@aaithal
Contributor

aaithal commented Feb 26, 2016

Hi @veverjak,

Apologies for asking you to confirm this again, but I looked up the information about the container instance on which you are facing this issue, and it seems to have a different agentHash than the one on the "dev" branch. It should have been computed as *305353a, corresponding to the latest commit, but I see that it is set to *fd7b600, which does not correspond to the latest commit.

Could you please confirm that the agent you are using has the fix from the "dev" branch merged in?

@jwerak

jwerak commented Feb 26, 2016

Hi @aaithal,

No problem; we are actually building from this repo: https://github.com/appuri/amazon-ecs-agent/commits/build-agent

@bilalaslamseattle

@aaithal thanks for digging into this. This bug sort of makes ECS unusable: deploys fail, etc. Badness all around. Can the ECS team fast-track this, please?

@nicolai86

👍 We downgraded as well, since this is breaking ECS for us very regularly.

@juanrhenals
Contributor

A fix for this deadlock issue has been released as part of Agent v1.8.1. We also released a new AMI with the v1.8.1 agent bundled.

If you continue to run into issues or have new information, please reopen this issue or open a new one.
For now, I don't think we have any additional action items for this issue on our end.
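If you'd rather update the agent in place than roll your instances to the new AMI, the agent on the ECS-optimized AMI can also be updated per container instance; a sketch, assuming the AWS CLI is configured and that my-cluster and the container-instance ID are placeholders for your own values:

# Request an in-place agent update on one container instance (placeholders shown)
aws ecs update-container-agent \
    --cluster my-cluster \
    --container-instance <container-instance-id>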

@workybee
Author

workybee commented Mar 4, 2016

We refreshed our cluster with the new AMI (2015.09.g) and ecs-agent v1.8.1, and it has been running smoothly ever since (about 24 hours), whereas we used to need to restart the ecs-agent every 30 minutes.
Thanks all for the fix.

@jwerak

jwerak commented Mar 7, 2016

@juanrhenals we are running ecs-agent v1.8.1 but not on the latest AMI - could that be the cause?

@bilalaslamseattle

@juanrhenals we are still seeing this issue. Like @veverjak, we are not on the latest AMI. Could that be the cause?

@aaithal
Contributor

aaithal commented Mar 8, 2016

@veverjak Our tests prior to releasing the new version did not reproduce the agent deadlock with the fix. Could you please open a new issue to debug the issues that you are seeing with the agent? We were unable to detect the deadlock from the previous set of logs that you provided. We would like to know the following information from your instance to help us debug this better:

  1. ECS Agent logs at debug level (ECS_LOGLEVEL=debug) and the stack trace on a disconnected Agent (docker kill -s USR1 ecs-agent)
  2. ECS Agent Config, if you have any non-default customizations
  3. Output of curl 127.0.0.1:51678/v1/metadata on the EC2 Instance where you see the deadlock

Feel free to send those to aithal at amazon.com if you do not want to share the files on GitHub.
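A rough collection sketch for gathering items 1-3 above into one archive; the file names and the append to ecs.config are assumptions, adjust as needed:

# 1. Turn on debug logging and restart the agent
echo 'ECS_LOGLEVEL=debug' | sudo tee -a /etc/ecs/ecs.config
sudo stop ecs && sudo start ecs
# 2. Once the agent is disconnected again, dump its goroutine stacks into the log
docker kill -s USR1 ecs-agent
# 3. Capture the introspection metadata and bundle everything up
curl -s http://127.0.0.1:51678/v1/metadata > /tmp/ecs-metadata.json
sudo tar czf /tmp/ecs-debug-bundle.tar.gz /var/log/ecs /etc/ecs/ecs.config /tmp/ecs-metadata.json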

Also, are you still using https://github.com/appuri/amazon-ecs-agent/commits/build-agent to build the agent? I took a shot at building this image, but it seems there are some unresolved merge conflicts, which would cause build failures.

@aaithal
Contributor

aaithal commented Mar 8, 2016

@bilalaslamseattle Could you please confirm whether you are seeing these issues on ECS Agent v1.8.1? I'd also encourage you to create a new issue with more details about the failure, as mentioned in my previous post.

@jwerak

jwerak commented Mar 8, 2016

@aaithal thanks, I will gather that info once we experience a disconnect again and create a ticket for it.
Also, sorry about the agent build with the merge conflict; I've cleaned it up and we are now on top of the 1.8.1 branch here - https://github.com/appuri/amazon-ecs-agent

@jhovell

jhovell commented Jun 2, 2016

@samuelkarp I think I am experiencing this issue with ECS Agent 1.9.0 / ECS AMI 2016.03b. I have an instance that dumps the following to the log:

2016-06-02T11:08:47Z [INFO] Redundant container state change for task my-app:177 arn:aws:ecs:us-west-2:ecs-account-id:task/abd127ff-729c-47cf-b6d4-7d80055fb99a, Status: (RUNNING->RUNNING) Containers: [my-app (RUNNING->RUNNING),]: my-app(ecr-account-id.dkr.ecr.us-east-1.amazonaws.com/my-app:my-docker-tag) (RUNNING->RUNNING) to RUNNING, but already RUNNING
2016-06-02T11:13:41Z [INFO] Creating poll dialer module="ws client" host="ecs-t-2.us-west-2.amazonaws.com"
2016-06-02T11:13:41Z [WARN] Error getting cpu stats module="stats" err="No data in the queue" container="&{DockerID:7f5c4e32286e416897a9399cad5b1d2006adbc2d24cd5c1d08ab7fee5ab9c050}"
2016-06-02T11:13:41Z [WARN] Error getting instance metrics module="tcs client" err="No task metrics to report"
2016-06-02T11:18:47Z [INFO] Redundant container state change for task my-app:177 arn:aws:ecs:us-west-2:ecs-account-id:task/abd127ff-729c-47cf-b6d4-7d80055fb99a, Status: (RUNNING->RUNNING) Containers: [my-app (RUNNING->RUNNING),]: my-app(ecr-account-id.dkr.ecr.us-east-1.amazonaws.com/my-app:my-docker-tag) (RUNNING->RUNNING) to RUNNING, but already RUNNING
2016-06-02T11:27:41Z [INFO] Creating poll dialer module="ws client" host="ecs-t-2.us-west-2.amazonaws.com"
2016-06-02T11:27:41Z [WARN] Error getting cpu stats module="stats" err="No data in the queue" container="&{DockerID:7f5c4e32286e416897a9399cad5b1d2006adbc2d24cd5c1d08ab7fee5ab9c050}"
2016-06-02T11:27:41Z [WARN] Error getting instance metrics module="tcs client" err="No task metrics to report"
2016-06-02T11:28:47Z [INFO] Redundant container state change for task my-app:177 arn:aws:ecs:us-west-2:ecs-account-id:task/abd127ff-729c-47cf-b6d4-7d80055fb99a, Status: (RUNNING->RUNNING) Containers: [my-app (RUNNING->RUNNING),]: my-app(ecr-account-id.dkr.ecr.us-east-1.amazonaws.com/my-app:my-docker-tag) (RUNNING->RUNNING) to RUNNING, but already RUNNING
2016-06-02T11:38:47Z [INFO] Redundant container state change for task my-app:177 arn:aws:ecs:us-west-2:ecs-account-id:task/abd127ff-729c-47cf-b6d4-7d80055fb99a, Status: (RUNNING->RUNNING) Containers: [my-app (RUNNING->RUNNING),]: my-app(ecr-account-id.dkr.ecr.us-east-1.amazonaws.com/my-app:my-docker-tag) (RUNNING->RUNNING) to RUNNING, but already RUNNING
2016-06-02T11:45:41Z [INFO] Creating poll dialer module="ws client" host="ecs-t-2.us-west-2.amazonaws.com"
2016-06-02T11:45:41Z [WARN] Error getting cpu stats module="stats" err="No data in the queue" container="&{DockerID:7f5c4e32286e416897a9399cad5b1d2006adbc2d24cd5c1d08ab7fee5ab9c050}"
2016-06-02T11:45:41Z [WARN] Error getting instance metrics module="tcs client" err="No task metrics to report"
2016-06-02T11:48:47Z [INFO] Redundant container state change for task my-app:177 arn:aws:ecs:us-west-2:ecs-account-id:task/abd127ff-729c-47cf-b6d4-7d80055fb99a, Status: (RUNNING->RUNNING) Containers: [my-app (RUNNING->RUNNING),]: my-app(ecr-account-id.dkr.ecr.us-east-1.amazonaws.com/my-app:my-docker-tag) (RUNNING->RUNNING) to RUNNING, but already RUNNING
2016-06-02T11:55:41Z [INFO] Error from tcs; backing off module="tcs handler" err="unexpected EOF"
2016-06-02T11:55:42Z [INFO] Creating poll dialer module="ws client" host="ecs-t-2.us-west-2.amazonaws.com"
2016-06-02T11:58:47Z [INFO] Redundant container state change for task my-app:177 arn:aws:ecs:us-west-2:ecs-account-id:task/abd127ff-729c-47cf-b6d4-7d80055fb99a, Status: (RUNNING->RUNNING) Containers: [my-app (RUNNING->RUNNING),]: my-app(ecr-account-id.dkr.ecr.us-east-1.amazonaws.com/my-app:my-docker-tag) (RUNNING->RUNNING) to RUNNING, but already RUNNING

The instance is still running a my-app container (as shown by the log), but the agent is disconnected from the cluster. I could enable debug logging and restart the agent (not sure if that will fix the problem) and post debug logs if that helps.

Is there anything else this could be?

@samuelkarp
Contributor

@jhovell Yes, debug logs would definitely help. If you're still experiencing problems, please open a new issue.
