AWS console way out of sync with instance state #313

Closed
workybee opened this Issue Feb 16, 2016 · 39 comments

@workybee

The AWS console "Tasks" tab shows ~48 tasks, but the instances have only 3.
My naive understanding is that the ecs-agent is what the AWS console uses to know what is happening on the instances, hence the question here.

A "docker ps -a" on all the instances shows nothing related to a task mentioned on the console.

I am doing an hourly purge of metadata (i.e. "docker ps -a -q | xargs docker rm" on each instance) to help keep the disk footprint low and prevent docker from locking up completely.
I suspect this may be causing the issue, but how do I do that cleanup without the console (and hence AWS's notion of memory reservation etc.) getting all messed up?

And/Or how to re-sync things?
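For what it's worth, `docker rm` without `-f` refuses to remove running containers, so the purge above only ever clears stopped ones (with error noise). A variant that selects exited containers explicitly is sketched below; the awk filter is purely illustrative, reading the `docker ps -a` listing on stdin so the selection logic can be checked without a Docker daemon:

```shell
# Select IDs of exited containers from a `docker ps -a` listing read on
# stdin; the header row (NR == 1) is skipped. In practice:
#   docker ps -a | awk 'NR > 1 && /Exited/ { print $1 }' | xargs -r docker rm
awk 'NR > 1 && /Exited/ { print $1 }'
```

This keeps the purge from ever naming a running task or the ecs-agent container itself.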

@workybee


workybee Feb 17, 2016

@aaithal - perhaps you have some comment, relates a bit to the space issue you helped me with in issue #309


@aaithal


aaithal Feb 17, 2016

Member

You could use the stop-task API from the aws cli or the SDK to stop tasks.

The AWS console "Task" tab shows ~48 tasks

Could you please share the status (Last status and Desired status) of the tasks that are not running on the instance?
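With ~48 stray tasks, stopping them one at a time is tedious; `list-tasks` can feed `stop-task` in a loop. A dry-run sketch (the cluster name `my-cluster` is a placeholder, and the loop only prints the commands rather than running them):

```shell
# Dry run: print one stop-task invocation per task ARN read on stdin.
# Feed it with something like:
#   aws ecs list-tasks --cluster my-cluster --query 'taskArns[]' --output text | tr '\t' '\n'
# and drop the leading `echo` once the printed list looks right.
while read -r arn; do
  echo aws ecs stop-task --cluster my-cluster --task "$arn"
done
```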


@workybee


workybee Feb 17, 2016

Both status columns in the console showed up as "Running".
I upgraded the ecs-agent to version 1.8.0 and things look better so far.
Thank you for the help.


@workybee


workybee Feb 17, 2016

@aaithal Sorry, forgot to respond to the first part of your comment above:
Pretty sure I tried something like "ecs stop-task --task ffffffff-uuuu-nnnn-kkkk-yyyyggggoooo" with no success.
Choosing "Stop All" from the "Tasks" tab on the "Cluster" page did seem to clear them out though.
(task-id changed for fun, and in case it leaks anything of import)


@workybee


workybee Feb 18, 2016

I spoke too soon. @aaithal or anyone else have ideas?
Seems when I finally upgraded all 4 instances to ecs-agent 1.8.0, everything broke.
This morning I find the console (and aws cli) thinks some 40 tasks are associated with the cluster: 27 running, 13 in pending state. Looking on the actual cluster instances - nothing is running at all except the ecs-agent, according to docker and ps, and the CPU usage has been flat all night, confirming it is not doing much if anything. ;-)

A sample of describe-tasks output:

aws ecs describe-tasks --cluster macc --tasks 010ebe32-99a6-42f3-ac85-61440d6af71f
{
    "failures": [],
    "tasks": [
        {
            "taskArn": "arn:aws:ecs:us-east-1:860733554218:task/010ebe32-99a6-42f3-ac85-61440d6af71f",
            "overrides": {
                "containerOverrides": [
                    {
                        "name": "django_mentad"
                    }
                ]
            },
            "lastStatus": "RUNNING",
            "containerInstanceArn": "arn:aws:ecs:us-east-1:860733554218:container-instance/54659fea-2db6-43a3-9044-540f1712b13f",
            "createdAt": 1455762905.788,
            "clusterArn": "arn:aws:ecs:us-east-1:860733554218:cluster/macc",
            "startedAt": 1455762907.83,
            "desiredStatus": "RUNNING",
            "taskDefinitionArn": "arn:aws:ecs:us-east-1:860733554218:task-definition/prod_new_collect_ads_-u_262317968:4",
            "containers": [
                {
                    "containerArn": "arn:aws:ecs:us-east-1:860733554218:container/f33092df-3534-4ca2-b2d3-13a53ddbcffd",
                    "taskArn": "arn:aws:ecs:us-east-1:860733554218:task/010ebe32-99a6-42f3-ac85-61440d6af71f",
                    "lastStatus": "RUNNING",
                    "name": "django_mentad",
                    "networkBindings": []
                }
            ]
        }
    ]
}

and for a PENDING task:

aws ecs describe-tasks --cluster macc --tasks 21629223-5b44-4d2d-bcb3-34fe2a955974
{
    "failures": [],
    "tasks": [
        {
            "taskArn": "arn:aws:ecs:us-east-1:860733554218:task/21629223-5b44-4d2d-bcb3-34fe2a955974",
            "overrides": {
                "containerOverrides": [
                    {
                        "name": "django_mentad"
                    }
                ]
            },
            "lastStatus": "PENDING",
            "containerInstanceArn": "arn:aws:ecs:us-east-1:860733554218:container-instance/7a6bbe7b-ec41-4f34-ab5c-6e6ad7c2f714",
            "createdAt": 1455761670.714,
            "clusterArn": "arn:aws:ecs:us-east-1:860733554218:cluster/macc",
            "desiredStatus": "RUNNING",
            "taskDefinitionArn": "arn:aws:ecs:us-east-1:860733554218:task-definition/prod_new_collect_ads_-u_115973711880080:6",
            "containers": [
                {
                    "containerArn": "arn:aws:ecs:us-east-1:860733554218:container/1cfb409d-d38b-4c87-b80b-4f8203d9f393",
                    "taskArn": "arn:aws:ecs:us-east-1:860733554218:task/21629223-5b44-4d2d-bcb3-34fe2a955974",
                    "lastStatus": "PENDING",
                    "name": "django_mentad"
                }
            ]
        }
    ]
}
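One way to eyeball how far `lastStatus` and `desiredStatus` have diverged across the whole cluster is a quick tally over `describe-tasks` JSON like the output above. A crude grep-based sketch (note it counts the per-container `lastStatus` entries as well as the task-level ones; jq would be cleaner if installed):

```shell
# Tally lastStatus values from `aws ecs describe-tasks ...` JSON on stdin,
# e.g.  aws ecs describe-tasks --cluster macc --tasks ... | <this pipeline>
grep -o '"lastStatus": "[A-Z]*"' | sort | uniq -c
```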

@workybee


workybee Feb 18, 2016

OK, so the ecs-agent, though still running, had lost connectivity with whatever in AWS keeps track of things (the ECS Cluster page showed "Agent Connected": "false").
Running "sudo stop ecs; sleep 10; sudo start ecs" on each instance seemed to get things flowing again.
Which logs should I look at? (I was not running the agent in DEBUG mode... ECS_LOGFILE is not set.)
On each instance I found lots of "Error" lines like:

[...]
ec2-user@172.30.1.152: /var/log/ecs/ecs-agent.log.2016-02-18-05:2016-02-18T05:16:27Z [INFO] Error from tcs; backing off module="tcs handler" err="unexpected EOF"
ec2-user@172.30.1.152: /var/log/ecs/ecs-agent.log.2016-02-18-11:2016-02-18T11:25:30Z [ERROR] Error getting message from ws backend module="ws client" err="unexpected EOF"
ec2-user@172.30.1.152: /var/log/ecs/ecs-agent.log.2016-02-18-11:2016-02-18T11:25:30Z [INFO] Error from tcs; backing off module="tcs handler" err="unexpected EOF"
ec2-user@172.30.1.152: /var/log/ecs/ecs-agent.log.2016-02-18-16:2016-02-18T16:56:33Z [ERROR] Error getting message from ws backend module="ws client" err="unexpected EOF"
ec2-user@172.30.1.152: /var/log/ecs/ecs-agent.log.2016-02-18-16:2016-02-18T16:56:33Z [INFO] Error from tcs; backing off module="tcs handler" err="unexpected EOF"
ec2-user@172.30.1.152: /var/log/ecs/ecs-agent.log.2016-02-18-18:2016-02-18T18:42:53Z [CRITICAL] Error saving state before final shutdown module="TerminationHandler" err="Multiple error:
ec2-user@172.30.1.152: /var/log/ecs/ecs-agent.log.2016-02-18-18:2016-02-18T18:51:25Z [WARN] Error getting cpu stats module="stats" err="No data in the queue" container="&{DockerID:47f2fcd76a16af1974a904fea1c0717813c17c1489a7193557f0477d174c0026}"

Not sure how relevant these are.
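To surface just the higher-severity entries from the rotated agent logs (default log directory on the ECS-optimized AMI, per the paths above), a grep along these lines works; the bracketed severity tags match the format of the lines above:

```shell
# Print only ERROR/CRITICAL entries across all rotated agent logs,
# without the per-file name prefix (-h).
grep -h -E '\[(ERROR|CRITICAL)\]' /var/log/ecs/ecs-agent.log.*
```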


@ChrisRut


ChrisRut Feb 18, 2016

I just experienced similar behavior after upgrading my instances to v1.8.0. I set a number of my clusters' desiredCount to 0, and the console shows there are no more running tasks, but after logging into the EC2 instances I can clearly see the tasks running. The only relevant logs are the ones @workybee posted (I also was not running in DEBUG mode). It seems there is a bug in the 1.8.0 version of the agent that is causing a disconnect between what the AWS API has and what is actually running on the instances. I'll try to reproduce this issue in a testing environment with DEBUG on and see if I can't get back to you with additional logs, but this is very troubling...


@samuelkarp


samuelkarp Feb 18, 2016

Member

@workybee Do you have more complete logs? The lines you've posted are not relevant; nothing that mentions tcs relates to task delivery.

@ChrisRut If you're able to reproduce with debug logs, that would be very helpful.


@tschutte


tschutte Feb 18, 2016

I have managed to duplicate this on my cluster. What I thought was the issue I referenced in #281 actually looks to be more related to this issue. When a service deploy is initiated, it appears the agent does not stop the running container; however, the UI shows the new task in a pending state. If I do nothing more than manually stop and start the ecs agent, the deploy continues as normal. Note I just updated to the latest AMI (2015.09.f) yesterday from 2015.09.e, which did not have the issue.

The timeline is as follows:
21:17:22 Updated the service (via a CloudFormation stack update)
21:27:58 Got tired of waiting and manually restarted the ecs agent
Deploy finished.

I've attached a debug log to the issue, and have scrubbed most identifying info. If someone on the ECS team wants the raw log so that they can cross reference to my running ECS cluster let me know.
ecs-agent.log.2016-02-18-21.txt


@samuelkarp


samuelkarp Feb 18, 2016

Member

@tschutte This looks like the most interesting part to me:

2016-02-18T21:27:58Z [CRITICAL] Error saving state before final shutdown module="TerminationHandler" err="Multiple error:
    0: Timed out waiting for TaskEngine to settle"

If this happens again, can you run docker kill -s USR1 ecs-agent before restarting it? It'd also be good to get the Docker daemon logs and anything suspicious in the output of dmesg. It's unclear if what you're seeing is the same as this issue or not.


@workybee


workybee Feb 18, 2016

So after about 1 hour things locked up again, with the AWS console thinking many tasks are still running, but CPU metrics, docker, and raw ps all saying nothing of interest is running.
Am now trying to get DEBUG logs.
Have added ECS_LOGFILE=/ecs-agent.log to the end of /etc/ecs/ecs.config.
When I restarted the ecs-agent ("sudo stop ecs ; sleep 5 ; sudo start ecs" on each instance) I don't seem to get anything into the log file, but all of a sudden things start flowing again.
Happy to try other efforts, recommendations? Should I upgrade from 2015.09.e to 2015.09.f?


@samuelkarp


samuelkarp Feb 19, 2016

Member

Can you unset ECS_LOGFILE and just change ECS_LOGLEVEL=debug? Logs should be mounted by default into /var/log/ecs so you should see them there without messing with ECS_LOGFILE.

2015.09.f fixes a bug with the Agent that might help (and also includes the fix for the glibc vulnerability), so I would recommend updating.
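Concretely, the config would end up looking something like this (a sketch of /etc/ecs/ecs.config with the ECS_LOGFILE line dropped; the ECS_CLUSTER value is just a placeholder for whatever is already in the file):

```shell
# /etc/ecs/ecs.config
ECS_CLUSTER=my-cluster
ECS_LOGLEVEL=debug
```

After editing, restart the agent ("sudo stop ecs; sudo start ecs") and the debug logs should appear under /var/log/ecs.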


@workybee


workybee Feb 19, 2016

After something less than 53 minutes of uptime, things have frozen up in an inconsistent state again. The console thinks there are 8 PENDING tasks and 48 RUNNING; docker ps and ps directly on the instances show nothing except the ecs-agent running.
Will upgrade to 2015.09.f, try ECS_LOGLEVEL=debug, and post what that produces in an hour or so.


@workybee


workybee Feb 19, 2016

So am getting logs now, haven't seen things lock up yet ;-)
Took a while to get new instances properly configured, so still time... will be happy if things keep going for another hour (or at least start to feel better).


@workybee


workybee Feb 19, 2016

This new combo of 2015.09.f and ecs-agent version 1.8.0 running with debug enabled seems to be staying in sync between the AWS console and what is actually happening on the cluster instances.
Can hope it keeps going through the night - but I'll not be looking hard till tomorrow. ;-)
Thanks for the suggestions today. @samuelkarp


@tschutte


tschutte Feb 19, 2016

@samuelkarp Thanks. I'll see if I can view one last service update failure and obtain the output of docker kill. Note there was nothing of interest in dmesg or the docker logs other than the standard "GET" messages.

My team is getting frustrated with the frequent failing service updates, so I will also be rolling back to a version of the AMI which is more stable for me. I'll attempt to replicate the issue again in a personal test environment. If you think this issue is not related, I'll move out of this issue and create my own.


@tschutte


tschutte Feb 19, 2016

@samuelkarp Here is an agent log with the output of the kill command. Note the service update fired at around 14:17:51 according to the ECS UI, but it reported an AGENT error:

service MyService-1OXQ7AOIF1ZZT was unable to place a task because no container instance met all of its requirements. The closest matching container-instance 4ac6193c-8165-409f-9ab3-b769637b2749 encountered error "AGENT". For more information, see the Troubleshooting section.

I finally ran the docker kill at 14:37. Let me know if you want me to continue in this issue or move to a different one.
ecs-agent.log.2016-02-19-14.txt


@samuelkarp


samuelkarp Feb 19, 2016

Member

@tschutte Thank you, that output is super helpful. Can you let me know how many containers you had running on this host? (You should be able to see this in the data file located at /var/lib/ecs/data/ecs_agent_data.json.) Here's what I think is happening:

  • The engine package inside the Agent maintains state and attempts to prevent concurrency problems by locking the state whenever it's read or modified (this lock). In the output you provided, I see tons of goroutines waiting to acquire a read lock (which indicates that the write lock is held) and residual effects elsewhere (channels waiting to send/receive that are blocked because the engine is waiting on the lock already).
  • While normally we expect to receive state changes from the Docker daemon via the event stream, we also have a reconciliation mechanism to periodically inspect each of the containers under management and ensure that we have the state correctly (this method). That method acquires the state lock before iterating through all known containers.
  • Each time we inspect a container, we expect Docker to return a result within 30 seconds.
  • I'm guessing (and I'll keep investigating to confirm) that Docker is responding extremely slowly (potentially taking the full 30 seconds) and thus that method is holding the lock for 30 seconds multiplied by the total number of known containers (this includes new containers, running containers, and exited containers that have not yet been removed).

TL;DR: I think Docker might be responding very slowly and causing the Agent to hang. I will keep investigating to either confirm or deny.


@tschutte


tschutte Feb 19, 2016

@samuelkarp There are a total of 3 containers on each host, 1 for the ecs agent, and 2 as part of my task definition with a shared volume between them. I think I have also reproduced this issue using a couple of small sample containers that just loop in a while loop. I'm working now to button up the task definition and will upload a test container to docker hub so you can hopefully observe this behavior for yourself. Stay tuned.

I did go digging through some of the older logs in the hour or two preceding the error and noticed something that I think correlates to the stuck lock. I found a case each time where the log entry "Writing docker event to the associated task" was missing the corresponding "Wrote docker event to the associated task" (from docker_task_engine.go line 382).


@samuelkarp


samuelkarp Feb 19, 2016

Member

@tschutte That's 2 in the task definition, but how many times has that task been run on the host? Each time a task is run, it creates new containers. I'm interested in the total number of known containers, which you'll find in that data file. Something like this should show it:

sudo yum -y install jq
sudo cat /var/lib/ecs/data/ecs_agent_data.json | jq '.Data.TaskEngine.Tasks[].Containers[].Name' | wc -l

What you're noting about the log lines is indeed relevant; that's one of the residual effects (channel blocked because there's no active reader) of the lock being held a long time.


@tschutte


tschutte Feb 19, 2016

@samuelkarp sorry, I misunderstood the question. The last remaining qa host I have reports a value of 12.


@tschutte


tschutte Feb 19, 2016

ok, I have a basic task definition which seems to reliably recreate the issue. I did find I have to wait past the task cleanup wait duration (I set it low here for testing). Basically: launch a task, wait 12-15 minutes, then update the task definition and watch the service enter the weird state, where the UI shows no tasks running but the task is still present on the host via "docker ps".

sample userdata (launched on a barebones 2015.09.f ami) - https://gist.github.com/tschutte/3182549229dc0ca7563c
sample task definition (using 2 containers with a shared volume; I have no idea if this is or is not contributing to the problem) - https://gist.github.com/tschutte/3cc9db05da951dea5af4

ok, I have a basic task definition which seems to reliably recreate the issue. I did find I have to wait past the task cleanup wait duration (I set it low here for testing). Basically launch a task, wait 12-15 minutes, then update the task definition and watch the service enter the weird state, where the UI shows no tasks running, but the task is still present on the host via "docker ps".

sample userdata (launched on a barebones 2015.09.f ami) - https://gist.github.com/tschutte/3182549229dc0ca7563c
sample task defintion (using 2 tasks with a shared volume, I have no idea if this is or is not contributing to the problem) - https://gist.github.com/tschutte/3cc9db05da951dea5af4

@samuelkarp

Member

samuelkarp commented Feb 19, 2016

@tschutte If it's only 12, that may invalidate my hypothesis (I saw goroutines blocked for around 700 minutes, implying there would be something on the order of 1400 containers each taking 30 seconds to time out). I'll try your reproduction task today.

@tschutte

tschutte commented Feb 19, 2016

Nope, that environment only deploys 3-4 times a day max, times the 2-3 containers. These are very low volume hosts overall, which is why I'm pleased my reproduction task worked on a fairly vanilla EC2 instance launched today. Hopefully you can reproduce it too.

@euank

Contributor

euank commented Feb 22, 2016

@samuelkarp In response to your theory three comments ago: the synchronizeState method you link is only called during the engine's Init, so it shouldn't matter.

I have a different theory after glancing through those logs.

First of all, I sorted all the goroutines by how long they'd been hanging to try to figure out when this might have started; grep -Eo "[0-9]+ minutes" | sort -nr | head (skipping the goroutines from within a minute of launch, over 1000 minutes ago) gives us 849 minutes. The goroutines from then:

goroutine 43 [chan send, 849 minutes]:
github.com/aws/amazon-ecs-agent/agent/engine.func·012()
    /go/src/github.com/aws/amazon-ecs-agent/agent/engine/docker_container_engine.go:636 +0x44b
created by github.com/aws/amazon-ecs-agent/agent/engine.(*dockerGoClient).ContainerEvents
    /go/src/github.com/aws/amazon-ecs-agent/agent/engine/docker_container_engine.go:638 +0x4e4

goroutine 47 [semacquire, 849 minutes]:
sync.(*RWMutex).Lock(0xc2080622b8)
    /usr/src/go/src/sync/rwmutex.go:87 +0xa4
github.com/aws/amazon-ecs-agent/agent/engine.(*managedTask).cleanupTask(0xc208297140, 0x9d29229e000)
    /go/src/github.com/aws/amazon-ecs-agent/agent/engine/task_manager.go:441 +0x467
github.com/aws/amazon-ecs-agent/agent/engine.(*managedTask).overseeTask(0xc208297140)
    /go/src/github.com/aws/amazon-ecs-agent/agent/engine/task_manager.go:164 +0xd13
created by github.com/aws/amazon-ecs-agent/agent/engine.(*DockerTaskEngine).startTask
    /go/src/github.com/aws/amazon-ecs-agent/agent/engine/docker_task_engine.go:317 +0x199

goroutine 48 [chan send, 849 minutes]:
github.com/aws/amazon-ecs-agent/agent/engine.(*DockerTaskEngine).handleDockerEvents(0xc208062240, 0x7f54409a5c08, 0xc208270f80)
    /go/src/github.com/aws/amazon-ecs-agent/agent/engine/docker_task_engine.go:383 +0x819
created by github.com/aws/amazon-ecs-agent/agent/engine.(*DockerTaskEngine).Init
    /go/src/github.com/aws/amazon-ecs-agent/agent/engine/docker_task_engine.go:131 +0x17a

The bits that matter are docker_task_engine.go:383 (if you're on that line of code, you're holding the processTasks read lock) and task_manager.go:441 (waiting to acquire the processTasks write lock). 💥 deadlock.

So, how'd this happen? My guess: if a docker event arrives for a task just as the engine's task manager has decided to delete that task, you get a deadlock.

  1. The handleDockerEvents code holds the lock for both the task lookup and the message sending. If it only held it for the task lookup, this deadlock couldn't happen. Still, that's not a problem as long as the taskManager is reading events as soon as they come in.
  2. The taskManager, during cleanup, stops listening to events for the duration of sweepTask, which is a slow, serial operation. That gives a nice window during which a new event can cause a deadlock.

Frankly, reading this code, I'm surprised it doesn't reliably deadlock with a large number of containers. It looks like in my usual test environment (mmm, overlay), removes end up being so quick that this doesn't happen much in practice. There might be further intricacies as well.

However, after being surprised it doesn't deadlock easily, I threw a 10-container task at it, waited the cleanup duration (I set mine to 60s to speed things along), and immediately deadlocked on removal.

That repro works reliably for me with the ECS optimized AMI (3/3 tries), and works slightly less reliably on an AMI with a sane FS configuration. If it doesn't work for you, just make docker rm slower (without slowing down docker inspect too much, as that's needed to emit an event and trigger this deadlock).

You can confirm that what I describe is the root cause of some breakage by looking at debug output and verifying that a "Writing docker event to the associated task" debug log appears but "Wrote docker event" doesn't appear shortly after, with "Removing container" activity happening for the task around then.

I'll give a go at writing a test that properly models this problem, and then write a fix.


@ChrisRut

ChrisRut commented Feb 22, 2016

@samuelkarp,

I finally got around to reproducing this in a test environment. It sounds like @euank is already working on a fix, but in case you need additional data I have attached my ecs-agent.log with debugging enabled.

313_ecs-agent.log.txt

In the meantime (while we wait for @euank's fix), would you recommend downgrading the agent/AMI, or is there perhaps another workaround we can use to avoid this issue?

@workybee

workybee commented Feb 22, 2016

@euank's theory is supported by our experience: we saw even more frequent lock-ups when we removed old tasks ourselves, before ecs-agent 1.8.0 added support for ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=30m.

A semi-workaround we are trying: a cron job that stops/starts ecs every 30 minutes (with potentially bad side effects we haven't thought of or seen yet, so use at your own risk ;-)
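For reference, that combination of settings might look like the fragments below. The file paths, the cron file name, and the upstart job name ("ecs") are assumptions about the ECS optimized (Amazon Linux 1) AMI; verify against your own hosts before using this.

```shell
# /etc/ecs/ecs.config -- have the agent clean up stopped tasks after 30 minutes
ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=30m

# /etc/cron.d/restart-ecs-agent (hypothetical file name) -- restart the agent
# every 30 minutes; the AMI manages the agent with upstart, hence stop/start.
*/30 * * * * root stop ecs; start ecs
```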

euank added a commit to euank/amazon-ecs-agent that referenced this issue Feb 22, 2016

Discard task events during cleanup period
Relates to #313

This fixes a deadlock that could occur during the period of time between
when a container waited for cleanup and when the actual cleanup finished
(removes and state management changes).

During that period of time, the task_manager would not accept any new
events and, immediately after that period of time, the task_manager
needed to take the processTasks lock to complete removing itself.

Due to another issue, the scope of that lock when sending events was
overly broad, leading to a deadlock.

This fixes that deadlock by appropriately reducing the scope of the
processTasks lock when handling events.

It also speeds up the processing of multiple tasks by continuing to read
task events during the cleanup actions. Prior to this, while a task's
containers were being removed, no other task could process new events.

@CameronGo

CameronGo commented Feb 24, 2016

A question that is hopefully not out of line: in a case such as this, where a major bug disrupts core functionality, would it make sense to re-publish the previous version (v1.7.1 in this case) with the "latest" tag, so that launch configs and autoscaling groups that are not using version pinning remain in a functional state?

@jeekajoo

jeekajoo commented Feb 25, 2016

I've been experiencing this issue since I upgraded from 1.7.1 to 1.8.0. Stopping and starting the agent fixes it, but I suspect that's only temporary and that it's better to downgrade.

jwerak added a commit to appuri/amazon-ecs-agent that referenced this issue Feb 25, 2016

Discard task events during cleanup period
Relates to #313
@seiffert

Contributor

seiffert commented Feb 29, 2016

We were also experiencing this issue after upgrading to v1.8.0. Almost all cluster machines reported the ECS agent as disconnected, which meant that containers that terminated for some reason were not restarted on another machine. I consider this a bug that caused disruption of production environments. We worked around it by downgrading to v1.7.1. However, I'd really like to see a v1.8.1 release with the fix merged in (#320).

@apsey

apsey commented Mar 1, 2016

We are also seeing the same ECS agent disconnection issue. We are downgrading our agents to 1.7.1 until this is fixed.

@juanrhenals

Contributor

juanrhenals commented Mar 3, 2016

The fix for this deadlock has been released as part of Agent v1.8.1. We also released a new AMI with the v1.8.1 agent bundled.

If you continue to run into issues or have new information, please reopen this or file a new issue. For now, I don't think we have any additional action items for this issue on our end.

@juanrhenals juanrhenals closed this Mar 3, 2016

@euank

Contributor

euank commented Mar 3, 2016

After thinking about it a little, I do have one more question related to this issue: what actually caused it?

The code that was patched wasn't introduced in 1.8.0 (and the changes for 1.8.0 were actually pretty small, based on git diff v1.7.1..v1.8.0 -- . ':!/agent/Godeps/_workspace/' ':!*_test.go' ':!*/service.go' ':!*/api.go' ':!*_mocks.go').

So why did downgrading to v1.7.1 help at all? The possibilities I see:

  • the AWS SDK upgrade somehow changed timing enough that this became a problem
  • AMI differences related to storage/1.9 between the 1.7.1 and 1.8.0 AMIs triggered it
  • some sort of strange packaging/build issue
  • ???

If it's fixed, it's fixed, and it's not vital to understand all the details of "whyyyyy", but it'd still be cool to know!

I have no plan to investigate further, but if you do I'd like to hear the results.

@aaithal

Member

aaithal commented Mar 3, 2016

We have a theory that this was a side effect of this commit, which was intended to fix losing container IDs in case of inspect errors. With that fix, we no longer miss container IDs, but the resulting state transitions during container cleanup lead to the deadlock. You can test this by reverting that commit in the v1.8.0 agent and using the task definition that you posted, with the cleanup time set to 1 minute.

@n1t1nv3rma

n1t1nv3rma commented Mar 3, 2016

The fix for this deadlock has been released as part of Agent v1.8.1. AWS has also released a new AMI with the v1.8.1 agent bundled. http://docs.aws.amazon.com/AmazonECS/latest/developerguide/container_agent_versions.html#ecs-optimized-ami-agent-versions

@alexmac

alexmac commented Mar 3, 2016

@n1t1nv3rma It seems like the Amazon yum repo hasn't been updated with this yet? Following the update procedure here: http://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-agent-update.html#procedure_update_ecs-init

I get:

sudo yum install -y ecs-init
Loaded plugins: priorities, update-motd, upgrade-helper
27 packages excluded due to repository priority protections
Package ecs-init-1.8.0-1.amzn1.x86_64 already installed and latest version
Nothing to do

If I update manually using the ECS console I get 1.8.1.

@ChrisRut

ChrisRut commented Mar 3, 2016

@alexmac,
You should be able to update your entire fleet using the following bash one-liner with the aws-cli and jq:

for cluster in $(aws ecs list-clusters | jq -r .clusterArns[] | cut -d/ -f2); do
  echo "--- CLUSTER: ${cluster} ---"
  for instance in $(aws ecs list-container-instances --cluster $cluster | jq -r .containerInstanceArns[] | cut -d/ -f2); do
    echo "--- INSTANCE: ${instance} ---"
    aws ecs update-container-agent --cluster ${cluster} --container-instance ${instance}
    sleep 1
  done
done

I don't believe the ecs-init package needs to be updated in order to avoid the deadlocks.

@nikhilo

nikhilo commented Mar 3, 2016

@ChrisRut I'm seeing the following error when trying the above one-liner on agents at v1.8.0:

A client error (NoUpdateAvailableException) occurred when calling the UpdateContainerAgent operation: There is no update available for your container agent.

Note: we are running a custom CentOS 7 AMI.

@ChrisRut

ChrisRut commented Mar 3, 2016

@nikhilo, I think you meant that for me. Sorry, that one-liner assumes you are running the "ECS Optimized AMI".
