Slower starting and stopping of tasks with v1.1 agent #92
We're having performance problems with v1.1 of the ECS agent. We're using ECS for force12.io which is a demo of container autoscaling / prioritization. The demo starts and stops tasks based on a random metric that changes every 5 seconds.
Our live site is using the v1.0 agent and usually keeps up with the metric. Our staging site is running the v1.1 agent and is noticeably slower and doesn't keep up. Otherwise the 2 environments are identical.
The delay occurs after tasks have been stopped and a new task is started. It seems to be a delay in the agent receiving the task from the scheduler, rather than the agent itself taking a long time to act on it.
I can reproduce the problem on a container instance running the v1.1 agent by stopping the running tasks and then starting a new one.
For the final task there is a delay of 20-25 seconds before the `POST /v1.17/images/create` message appears in the agent logs. Doing the same test with the v1.0 agent, the message appears within 2 seconds.
We're running CoreOS stable (ami-ea657582) with this cloud-config data.
Thanks for reporting the issue! v1.1.0 of the Agent fixes a correctness bug present in previous versions related to starting and stopping tasks. Specifically, v1.1.0 orders stops before starts to avoid resource conflicts caused by stopped tasks lingering while new ones start (e.g. a stopped task with a still-running container bound to a host port that a new task attempts to use). In practice, this can add a delay of up to about 30 seconds (the timeout used for the `docker stop` command) when an old task is stopped before a new one starts.
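For illustration, here is a minimal Go sketch of the ordering described above, where starts wait behind pending stops and a stop can take the full `docker stop` timeout when a container doesn't exit on its own. This is not the Agent's actual implementation; `stopTask` and the channel names are made up.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const stopTimeout = 30 * time.Second // docker stop's default grace period

// stopTask mimics "docker stop": wait up to stopTimeout for the container
// to exit on its own, then give up and force-kill it.
func stopTask(id string, exited <-chan struct{}, wg *sync.WaitGroup) {
	defer wg.Done()
	select {
	case <-exited:
		fmt.Printf("%s exited cleanly\n", id)
	case <-time.After(stopTimeout):
		fmt.Printf("%s did not exit within %v; force-killing\n", id, stopTimeout)
	}
}

func main() {
	var pendingStops sync.WaitGroup

	// A container that ignores SIGTERM never signals on this channel,
	// so its stop takes the full 30-second timeout.
	neverExits := make(chan struct{})
	pendingStops.Add(1)
	go stopTask("old-task", neverExits, &pendingStops)

	// Equivalent of "Waiting for any previous stops to complete":
	// the new task cannot start until all pending stops finish.
	pendingStops.Wait()
	fmt.Println("starting new-task") // delayed by up to stopTimeout
}
```

If the old container exits quickly (i.e. the `exited` channel fires), the new task starts almost immediately; the 20-30 second delay only appears when the stop has to run out the clock.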
Based on your description, I believe this is what you’re observing. To confirm, you can look at your `ecs-agent.log` and find lines like "Waiting for any previous stops to complete".
Since the Agent performs a `docker stop`, one way to improve performance here is for your containers to listen for and respond to the SIGTERM signal that `docker stop` sends. If a container exits promptly on SIGTERM, Docker doesn't have to wait out the full timeout before escalating to SIGKILL, and the next task can start sooner.
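As an example, here is a minimal sketch of a Go entrypoint that traps SIGTERM and shuts down promptly. This is illustrative only; a long-running service in any language can do the equivalent.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	// Buffered channel so the signal isn't dropped if we're not ready.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)

	fmt.Println("service running; waiting for work or a stop signal")
	sig := <-sigs

	// Flush buffers, close connections, etc., then exit quickly so
	// docker stop never has to wait out its grace period.
	fmt.Printf("received %v; shutting down cleanly\n", sig)
	os.Exit(0)
}
```

One caveat: if your image's CMD or ENTRYPOINT uses the shell form, PID 1 inside the container is `/bin/sh`, which won't forward SIGTERM to your process; the exec (JSON array) form avoids that.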
A configurable timeout might be useful, but if we can get our containers to stop cleanly I don't think we need it for our current use case.
I'll look into this tomorrow, check for that log message, and then update the issue.