ecs-agent running but not connected after a while, generates AGENT errors trying to start tasks #316
Comments
@workybee Are you just grepping for "error"? If so, you're excluding a bunch of context that would help us actually dig into what the Agent is doing. Since it sounds like Docker is being responsive now (
Yes, I was just grepping for Error. Attached is the result of "docker kill -s USR1 ecs-agent" on one of the instances:
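For anyone reproducing that step: the agent is expected to dump goroutine stack traces when it receives SIGUSR1, which is what the command above exercises. A minimal sketch of collecting that output, assuming the default log paths on the ECS-optimized AMI (timestamped file suffixes vary):

```bash
# Trigger the agent's stack dump (the command @workybee ran above).
docker kill -s USR1 ecs-agent

# The dump lands in the agent's log files on the ECS-optimized AMI.
sudo tail -n 200 /var/log/ecs/ecs-agent.log*
```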
@workybee This looks like the same thing I was seeing in @tschutte's log in #313. Can you let me know how many containers are listed in the data file? Something like this:

```bash
sudo yum -y install jq
sudo cat /var/lib/ecs/data/ecs_agent_data.json | jq '.Data.TaskEngine.Tasks[].Containers[].Name' | wc -l
```
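A small variant of the same idea, if task count is also of interest (field names assumed from the jq path above):

```bash
# Count tasks (rather than containers) in the agent's saved state file.
sudo jq '.Data.TaskEngine.Tasks | length' /var/lib/ecs/data/ecs_agent_data.json
```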
Definite similarity, but my case might be more amenable to your lock-held theory. ;-)
Certainly my attempts at "docker ps" or similar sometimes take "forever" just after we fire off multiple start tasks (a total of about 100 every 15 minutes). Things seem to work for a while, then get wedged. (See #309 and #313 for more context.)
We're having the same issue after upgrading to 2015.09.f.
We're running fewer than 10 containers on these nodes, and we didn't start or stop tasks during these 2 days. We've fixed it for now by doing a
We're seeing the same problem after upgrading to 1.8.0, but we were also seeing the same issue on 1.6.0 (though with more frequency now).
@samuelkarp Awesome, thank you. Is there a workaround here prior to the release of your next version? This is happening multiple times a day on our clusters and causing myriad problems. Is downgrading to the previous version the recommendation?
@jpignata We downgraded to version 1.7 and haven't had any problems. For what it's worth, periodically killing the 1.8 containers unsticks the service and didn't seem to cause any problems; we did that for a bit, but I'm not sure it's a good long-term idea.
Also downgraded to 1.6 for the time being.
@samuelkarp we deployed the dev branch to one of our clusters and it did not fix the problem. ECS is still reporting periodic AGENT disconnects :(
@bilalaslamseattle Are they the same symptoms as before? Can you provide logs at debug level (ECS_LOGLEVEL=debug)?
@samuelkarp roger that, I'll get these over to you.
@samuelkarp answering here for @bilalaslamseattle. Is this the info you are looking for? https://gist.github.com/veverjak/3607a9dd1f141e31d5a8 Also, how does one set ECS_LOGLEVEL?
@veverjak Yes, that's what I'm looking for. Is that from an Agent built with the dev branch (and the fix) or from the 1.8.0 version? If you're using our ECS-optimized AMI, you can set ECS_LOGLEVEL=debug in /etc/ecs/ecs.config.
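In concrete terms, that amounts to a one-line config change plus an agent restart; a minimal sketch, assuming the upstart-managed "ecs" service on the ECS-optimized AMI of this era:

```bash
# Turn on debug logging (documented agent configuration variable).
echo "ECS_LOGLEVEL=debug" | sudo tee -a /etc/ecs/ecs.config

# Restart the agent so the setting takes effect; ecs-init manages
# the agent as the "ecs" upstart service on this AMI.
sudo stop ecs && sudo start ecs
```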
@samuelkarp yes, we are running the latest dev branch.
There may not be lines that say "error", but logs and stack-trace information will be very useful if you run into the deadlock/state-drift issue. In our testing so far we haven't reproduced the deadlock when building from the dev branch.
@samuelkarp and here are the debug logs: https://gist.github.com/veverjak/31327365a43115e71fb9
Hi @veverjak, apologies for asking you to confirm this again, but I looked up the information about the container instance on which you are facing this issue and it seems like it has a different agent version. Could you please confirm that the agent you are using does have the fix from the "dev" branch merged in?
Hi @aaithal, no problem, we are building from this repo actually - https://github.com/appuri/amazon-ecs-agent/commits/build-agent
@aaithal thanks for digging into this. This bug sort of makes ECS unusable ... deploys fail, etc. etc. Badness all around. Can the ECS team fast-track this please?
👍 we downgraded as well since this is breaking ECS for us very regularly.
A fix for this deadlock issue has been released as part of Agent v1.8.1. We also released a new AMI with the v1.8.1 agent bundled. If you continue to run into issues or have new information, please reopen this issue or open a new one.
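For instances that cannot simply be replaced with the new AMI, ECS can also update the agent in place; a hedged sketch using the standard CLI call (the cluster name and instance identifier below are placeholders):

```bash
# Request an in-place agent update on a container instance; works on the
# ECS-optimized AMI, where ecs-init performs the update.
aws ecs update-container-agent \
  --cluster my-cluster \
  --container-instance my-container-instance-id
```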
Refreshed our cluster with the new AMI (2015.09.g) and ecs-agent v1.8.1, and we have been running smoothly ever since (about 24 hours), whereas we used to need to restart the ecs-agent every 30 minutes.
@juanrhenals we are running ecs-agent
@juanrhenals We are still seeing this issue. Like @veverjak, we are not on the latest AMI. Could this be a cause?
@veverjak Our tests prior to releasing the new version did not reproduce the agent deadlock with the fix. Could you please open a new issue to debug the issues that you are seeing with the agent? We were unable to detect the deadlock from the previous set of logs that you provided. We would like to know the following information from your instance to help us debug this better:
Feel free to send those to aithal at amazon.com if you do not want to share the files on GitHub. Also, are you still using https://github.com/appuri/amazon-ecs-agent/commits/build-agent to build the agent? I took a shot at building this image, but it seems like there are some unresolved merge conflicts, which would cause build failures.
@bilalaslamseattle Could you please confirm whether you are seeing these issues on ECS Agent
@aaithal thanks, I will gather that info once we experience a disconnect again and create a ticket for that.
@samuelkarp I think I am experiencing this issue with ECS Agent 1.9.0 / ECS AMI 2016.03.b. I have an instance that dumps the following to the log:
The instance is still running a my-app container (as shown by the log), but the agent is disconnected from the cluster. I could enable debug logging and restart the agent (not sure if that will fix the problem) and post the debug logs if that helps. Anything else this could be?
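One quick way to confirm what the console is reporting, a sketch assuming placeholder cluster and container-instance identifiers:

```bash
# Check agent connectivity and version as ECS sees them.
aws ecs describe-container-instances \
  --cluster my-cluster \
  --container-instances my-container-instance-id \
  --query 'containerInstances[].[agentConnected,versionInfo.agentVersion]'
```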
@jhovell Yes, debug logs definitely help. If you're still experiencing problems, please open a new issue.
Running 2015.09.f with ecs-agent version 1.8.0, we try to start about 100 tasks every 15 minutes.
This configuration seemed to run fine for a few hours after bringing up a fresh cluster of 4 m4.large instances. Then we start getting AGENT errors:
```
2016-02-19 09:11:46,172 [13767] ERROR ECS - Failed running task prod_new_collect_ads_-u_304671576282352 - reason [{u'reason': u'AGENT', u'arn': u'arn:aws:ecs:us-east-1:860733554218:container-instance/181f09d8-b9af-4062-9728-25a64c2288d0'}, {u'reason': u'AGENT', u'arn': u'arn:aws:ecs:us-east-1:860733554218:container-instance/716ac35d-b6c9-4610-b3f0-19c5385f272c'}, {u'reason': u'AGENT', u'arn': u'arn:aws:ecs:us-east-1:860733554218:container-instance/d27f9931-a69f-4a32-92d0-8a9889ad1412'}, {u'reason': u'AGENT', u'arn': u'arn:aws:ecs:us-east-1:860733554218:container-instance/fdae64b7-583a-4da3-baf1-9cf9e905e3c5'}]
```
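That failures list is what the ECS API returns when the agent on a target instance is disconnected; a hedged sketch of spotting it at start-task time from the shell (cluster, task-definition, and instance names below are placeholders):

```bash
# Start a task and print any container instances that were rejected with
# reason "AGENT" (i.e. the agent on that instance is not connected).
aws ecs start-task \
  --cluster my-cluster \
  --task-definition my-task-def \
  --container-instances my-container-instance-id \
  | jq -r '.failures[] | select(.reason == "AGENT") | .arn'
```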
Indeed the console shows "Agent Connected" false, but on the instances a "docker ps" shows it still running (as the only thing running!).
At the suggestion of @samuelkarp in #313, ecs is configured to produce debug output.
Also configured with:

```
ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=30m
```

to help manage disk space usage.
The last several Error lines from one of the instances: