AWS console way out of sync with instance state #313
Comments
You could use the
Could you please share the status(
Both status columns in the console showed up as "Running".
@aaithal Sorry, forgot to respond to the first part of your comment above:
I spoke too soon. @aaithal or anyone else have ideas? A sample of describe-tasks output:
and for a PENDING task:
OK, so the ecs-agent, though still running, had lost connectivity with whatever in AWS keeps track of things (the ECS Cluster page showed "Agent Connected": "false").
Not sure how relevant these are.
I just experienced similar behavior after upgrading my instances to v1.8.0. I set the desiredCount on a number of my services to 0, and the console shows there are no more running tasks, but after logging into the EC2 instances I can clearly see the tasks still running. The only relevant logs are the ones @workybee posted (I also was not running in DEBUG mode). It seems there is a bug in the 1.8.0 version of the agent that is causing a disconnect between what the AWS API reports and what is actually running on the instances. I'll try to reproduce this issue in a testing environment with DEBUG on and see if I can't get back to you with additional logs, but this is very troubling...
I have managed to duplicate this on my cluster. What I thought was the issue I referenced in #281 actually looks to be more related to this issue. When a service deploy is initiated, it appears the agent does not stop the running container, yet the UI shows the new task in a pending state. If I do nothing more than manually stop and start the ecs-agent, the deploy continues as normal. Note I just updated to the latest AMI (2015.09.f) yesterday from 2015.09.e, which did not have the issue. The timeline is as follows: I've attached a debug log to the issue and have scrubbed most identifying info. If someone on the ECS team wants the raw log so that they can cross-reference my running ECS cluster, let me know.
@tschutte This looks like the most interesting part to me:
If this happens again, can you run |
So after about 1 hour things locked up again, with the AWS console thinking many tasks are still running, but CPU metrics, docker ps, and raw ps all saying nothing of interest is still running.
Can you unset
2015.09.f fixes a bug with the Agent that might help (and also includes the fix for the glibc vulnerability), so I would recommend updating.
After something less than 53 minutes of uptime, things have frozen up in an inconsistent state again. The console thinks there are 8 PENDING tasks and 48 RUNNING; docker ps and ps directly on the instances show nothing except the ecs-agent running.
So I am getting logs now; haven't seen things lock up yet ;-)
This new combo of 2015.09.f and ecs-agent version 1.8.0 running with debug enabled seems to be staying in sync between the AWS console and what is actually happening on the cluster instances.
@samuelkarp Thanks. I'll see if I can catch one last service update failure and obtain the output of docker kill. Note there was nothing of interest in dmesg or the docker logs other than the standard "GET" messages. My team is getting frustrated with the frequently failing service updates, so I will also be rolling back to a version of the AMI which is more stable for me. I'll attempt to replicate the issue again in a personal test environment. If you think this issue is not related, I'll move out of this issue and create my own.
@samuelkarp Here is an agent log with the output of the kill command. Note the service update fired at around 14:17:51 according to the ECS UI, but it reported an AGENT error:
I finally ran the docker kill at 14:37. Let me know if you want me to continue in this issue or move to a different one.
@tschutte Thank you, that output is super helpful. Can you let me know how many containers you had running on this host? (You should be able to see this in the data file located at
TL;DR: I think Docker might be responding very slowly and causing the Agent to hang. I will keep investigating to either confirm or deny.
@samuelkarp There are a total of 3 containers on each host: 1 for the ecs agent, and 2 as part of my task definition with a shared volume between them. I think I have also reproduced this issue using a couple of small sample containers that just loop in a while loop. I'm working now to button up the task definition and will upload a test container to Docker Hub so you can hopefully observe this behavior for yourself. Stay tuned. I did go digging through some of the older logs in the hour or two preceding the error and noticed something that I think correlates to the stuck lock. I found a case each time where the log entry "Writing docker event to the associated task" was missing the corresponding "Wrote docker event to the associated task" (from docker_task_engine.go line 382).
@tschutte That's 2 in the task definition, but how many times has that task been run on the host? Each time a task is run, it creates new containers. I'm interested in the total number of known containers, which you'll find in that data file. Something like this should show it:

```
sudo yum -y install jq
sudo cat /var/lib/ecs/data/ecs_agent_data.json | jq '.Data.TaskEngine.Tasks[].Containers[].Name' | wc -l
```

What you're noting about the log lines is indeed relevant; that's one of the residual effects (a channel blocked because there's no active reader) of the lock being held a long time.
@samuelkarp sorry, I misunderstood the question. The last remaining QA host I have reports a value of 12.
ok, I have a basic task definition which seems to reliably recreate the issue. I did find I have to wait past the task cleanup wait duration (I set it low here for testing). Basically: launch a task, wait 12-15 minutes, then update the task definition and watch the service enter the weird state where the UI shows no tasks running but the task is still present on the host via "docker ps". Sample userdata (launched on a barebones 2015.09.f AMI): https://gist.github.com/tschutte/3182549229dc0ca7563c
@tschutte If it's only 12, that may invalidate my hypothesis (I saw goroutines blocked for around 700 minutes, implying that there would be something on the order of 1400 containers each taking 30 seconds to timeout). I'll try your reproduction task today.
Nope, that environment only deploys 3-4 times a day max, times the 2-3 containers. These are very low volume hosts overall, which is why I'm pleased my reproduction task worked on a fairly vanilla EC2 instance launched today. Hopefully you can reproduce it as well.
@samuelkarp In response to your theory three comments ago: the
I have a different theory after glancing through those logs. First of all, I sorted all the goroutines by how long they'd been hanging to try to figure out when this might have started;
The bits that matter are
So, how'd this happen? Well, my guess is that if you get a docker event for a task after the engine task manager has decided to delete that task, you get a deadlock.
Frankly, reading this code, I'm surprised it doesn't reliably deadlock here with a large number of containers. It looks like in my usual test environment (mmm, overlay), removes end up being so quick that this doesn't happen much in practice. There might be further intricacies as well.

However, after being surprised it doesn't deadlock easily, I threw a 10-container task at it, waited the cleanup duration (I set mine to 60s to speed it along), and immediately deadlocked on removal. That repro works reliably for me with the ECS optimized AMI (3/3 tries), and works slightly less reliably on an AMI with a sane FS configuration. If it doesn't work for you, just make

You can identify that what I describe is the root cause of some breakage by looking at debug output and verifying that a

I'll give a go at writing a test that properly models this problem, and then write a fix.
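To make the shape of that failure concrete, here is a minimal, self-contained Go sketch of the pattern described above. This is not the agent's actual code; the names `taskManager`, `handleEvent`, and `cleanup` are stand-ins chosen to mirror the discussion, with `processTasks` standing in for the lock of the same name:

```go
package main

import (
	"sync"
	"time"
)

type taskManager struct {
	processTasks sync.Mutex  // stand-in for the agent's processTasks lock
	events       chan string // unbuffered, like a typical event channel
}

// handleEvent models the buggy event path: the lock is held for the
// whole channel send, so it is only released once somebody reads.
func (m *taskManager) handleEvent(ev string) {
	m.processTasks.Lock()
	defer m.processTasks.Unlock()
	m.events <- ev // blocks forever: the only reader is stuck below
}

// cleanup models task removal: after the slow container removes, it
// needs the processTasks lock before it would ever read events again.
func (m *taskManager) cleanup() {
	time.Sleep(100 * time.Millisecond) // stand-in for slow docker removes
	m.processTasks.Lock()              // blocks: handleEvent holds the lock
	defer m.processTasks.Unlock()
	<-m.events // never reached
}

func main() {
	m := &taskManager{events: make(chan string)}
	go m.cleanup()
	m.handleEvent("container stopped") // never returns
}
```

Run as-is, the Go runtime aborts with "fatal error: all goroutines are asleep - deadlock!" — the toy equivalent of the agent hanging while the console drifts out of sync.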
I finally got around to reproducing this in a test environment. It sounds like @euank is already working on a fix for this issue, but in case you need additional data I have attached my ecs-agent.log with debugging enabled. In the meantime (while we wait for @euank's fix), would you recommend downgrading the agent/AMI, or is there perhaps another workaround we can use to avoid this issue?
@euank's theory is supported by our experience of even more frequent lock-ups when we tried removing old tasks ourselves, before ecs-agent version 1.8.0 supported:
ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=30m
A semi-workaround we are trying:
Relates to aws#313
This fixes a deadlock that could occur during the period of time between when a container waited for cleanup and when the actual cleanup finished (removes and state management changes). During that period, the task_manager would not accept any new events and, immediately afterwards, it needed to take the processTasks lock to complete removing itself. Due to another issue, the scope of that lock when sending events was overly broad, leading to a deadlock.
This fixes that deadlock by appropriately reducing the scope of the processTasks lock when handling events. It also speeds up the processing of multiple tasks by continuing to read task events during cleanup actions. Prior to this, while a task's containers were being removed, no other task could process new events.
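For contrast, here is the narrowed lock scope applied to the same toy model from the earlier sketch — again an illustration of the approach the commit message describes, not the actual patch:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type taskManager struct {
	processTasks sync.Mutex
	events       chan string
}

// handleEvent with the narrowed scope: shared state is touched under
// the lock, but the lock is released *before* the channel send.
func (m *taskManager) handleEvent(ev string) {
	m.processTasks.Lock()
	// ...update shared state under the lock...
	m.processTasks.Unlock()
	m.events <- ev // may block, but no longer while holding the lock
}

// cleanup can now take the lock, finish removing the task, and then
// drain the pending event.
func (m *taskManager) cleanup() {
	time.Sleep(100 * time.Millisecond) // stand-in for slow docker removes
	m.processTasks.Lock()              // succeeds: handleEvent released it
	m.processTasks.Unlock()
	fmt.Println("removed task; got event:", <-m.events)
}

func main() {
	m := &taskManager{events: make(chan string)}
	done := make(chan struct{})
	go func() { m.cleanup(); close(done) }()
	m.handleEvent("container stopped")
	<-done // wait for cleanup to finish before exiting
}
```

The design point is simply that a mutex must never be held across a blocking channel send when the reader of that channel may itself need the mutex.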
Question that is hopefully not out of line: In a case such as this where there is a major bug that disrupts core functionality, would it make sense to re-publish the previous version (v1.7.1 in this case) with the "latest" tag, so that launch configs and autoscaling groups that are not using version pinning remain in a functional state? |
I'm experiencing this issue since I upgraded from 1.7.1 to 1.8.0. |
We were also experiencing this issue after upgrading to v1.8.0. Almost all cluster machines reported the ECS agent to be disconnected which caused containers that terminated for some reason not to be restarted on another machine. I consider this a bug that caused disruption of production environments. We worked around this by downgrading to v1.7.1 again. However, I'd really like to see a v1.8.1 release with the fix merged in (#320). |
We are also having the same issue with ECS agent disconnection. We are downgrading our agent's version to 1.7.1 until this issue is fixed. |
The fix for this deadlock issue has been released as part of Agent v1.8.1. We also released a new AMI with the v1.8.1 agent bundled. If you continue to run into issues or have new information, please do reopen this issue or open a new one.
After thinking about it a little, I do have one more question related to this issue: basically, what caused it? The code that was patched wasn't introduced in 1.8.0 (and the changes for 1.8.0 were actually pretty small, based on
Why did downgrading to v1.7.1 help at all? The things I see as possibilities here:
If it's fixed, it's fixed, and it's not vital to understand all the details of "whyyyyy", but it'd still be cool to know! I have no plan to investigate further, but if you do I'd like to hear the results.
We have a theory that this was a side-effect of this commit, which was intended to fix losing container IDs in case of inspect errors. With this fix, we are no longer missing container IDs. But the state transitions for those same containers during container clean-up lead to the deadlock. You can test it by reverting that commit in the v1.8.0 agent and using the task definition that you posted, with the cleanup time set to 1 minute.
The fix for this deadlock issue has been released as part of Agent v1.8.1. AWS has also released a new AMI with the v1.8.1 agent bundled: http://docs.aws.amazon.com/AmazonECS/latest/developerguide/container_agent_versions.html#ecs-optimized-ami-agent-versions
@n1t1nv3rma Seems like the Amazon yum repo hasn't been updated with this yet? Following the update procedure here: http://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-agent-update.html#procedure_update_ecs-init I get:
If I update manually using the ECS console I get 1.8.1.
@alexmac,
I don't believe the |
@ChrisRut I'm seeing the following exception when trying the above one-liner for agent v1.8.0
@nikhilo, I think you meant that for me. Sorry, that one-liner assumes you are running the "ECS Optimized AMI".
* optimization: disable ipv6 router advertisements
* sysctl package unit tests
* generated engine mocks and update unit tests
* fix gocyclo check
This is a follow-up to aws#313.
Some customers might have already disabled ipv6 entirely, which would mean that the ipv6 accept_ra sysctl key wouldn't exist. In this case, the sysctl command would fail and the agent would exit. The -e flag tells sysctl not to fail if the config key doesn't exist. From the sysctl man page:
-e Use this option to ignore errors about unknown keys.
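As an illustration of that behavior, here is a minimal Go sketch of invoking sysctl with -e so a missing key is tolerated. This is not ecs-init's actual source, and the exact sysctl key name is an assumption inferred from the commit text:

```go
package main

import (
	"log"
	"os/exec"
)

func main() {
	// -e tells sysctl to ignore errors about unknown keys, so this no
	// longer fails on hosts where ipv6 (and its accept_ra key) is absent.
	// The key name here is an assumption based on the commit message.
	cmd := exec.Command("sysctl", "-e", "-w", "net.ipv6.conf.all.accept_ra=0")
	if out, err := cmd.CombinedOutput(); err != nil {
		log.Fatalf("sysctl failed: %v: %s", err, out)
	}
	log.Print("router advertisements disabled (or key absent and ignored)")
}
```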
The AWS console "Task" tab shows ~48 tasks, but instances have only 3.
My naive understanding is that the ecs-agent is what the AWS console uses to know what is happening on the instances, hence the query here.
A "docker ps -a" on all the instances shows nothing related to a task mentioned on the console.
I am doing an hourly purge of metadata (i.e. "docker ps -a -q | xargs docker rm" on each instance) to help keep the disk footprint low and prevent docker from locking up completely.
I suspect this may be causing the issue, but how can I do that without the console (and hence AWS's notion of memory reservation, etc.) getting all messed up?
And/Or how to re-sync things?