Slower starting and stopping of tasks with v1.1 agent #92

Closed
rossf7 opened this Issue May 26, 2015 · 5 comments

rossf7 commented May 26, 2015

We're having performance problems with v1.1 of the ECS agent. We're using ECS for force12.io, a demo of container autoscaling/prioritization. The demo starts and stops tasks based on a random metric that changes every 5 seconds.

Our live site uses the v1.0 agent and usually keeps up with the metric. Our staging site runs the v1.1 agent and is noticeably slower; it doesn't keep up. Otherwise the two environments are identical.

The delay occurs after tasks have been stopped and a new task is started. It seems to be a delay in the agent receiving the task from the scheduler rather than the agent taking a long time to start it.

I can reproduce the problem on a container instance with the v1.1 agent by:

  • starting 4 tasks
  • stopping 3 tasks
  • starting another task

For the final task there is a delay of 20-25 seconds before the POST /v1.17/images/create message appears in the agent logs. Doing the same test with a v1.0 agent, the message appears within 2 seconds.
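A hedged sketch of those repro steps with the AWS CLI; the cluster and task definition names here are placeholders, not the actual force12.io setup:

```sh
# Hypothetical repro: "demo-cluster" and "demo-task" are made-up names.
# Start 4 tasks.
for i in 1 2 3 4; do
  aws ecs run-task --cluster demo-cluster --task-definition demo-task --count 1
done

# Stop 3 of them (task ARNs come from list-tasks).
aws ecs list-tasks --cluster demo-cluster --query 'taskArns[0:3]' --output text \
  | tr '\t' '\n' \
  | while read -r arn; do
      aws ecs stop-task --cluster demo-cluster --task "$arn"
    done

# Start one more task, then watch the agent logs for the
# POST /v1.17/images/create line to measure the delay.
aws ecs run-task --cluster demo-cluster --task-definition demo-task --count 1
```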

We're running CoreOS stable (ami-ea657582) with this cloud-config data.

samuelkarp commented May 26, 2015

Hi @rossf7,

Thanks for reporting the issue! v1.1.0 of the Agent fixes a correctness bug present in previous versions related to starting and stopping tasks. Specifically, v1.1.0 attempts to order stops before starts to avoid resource conflicts caused by stopped tasks lingering when new ones are started (e.g. a stopped task with a still-running container bound to a host port which a new task attempts to use). In practice this can lead to a delay of about 30 seconds (the timeout used for the docker stop command) when stopping an old task before starting a new one.

Based on your description, I believe this is what you’re observing. To confirm, you can look at your ecs-agent.log and find lines like "Waiting for any previous stops to complete".

Since the Agent performs a docker stop command, one way to improve performance here is for your containers to listen for and respond to SIGTERM; this allows the containers to stop more quickly and reduces the delay. Other possible approaches are adding a configurable timeout to the Agent (so you could shorten it to less than 30 seconds) or performing Agent-side resource accounting and enforcing an ordering only when a conflict actually exists. Please let us know if either of those is an enhancement you'd be interested in seeing.

Thanks,
Sam
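A quick way to see the timeout behavior Sam describes (a sketch; note that plain docker stop defaults to a 10-second timeout, while the Agent uses 30 seconds):

```sh
# A busybox loop never installs a SIGTERM handler, and PID 1 ignores
# signals it has no handler for, so `docker stop` waits out the timeout
# and then sends SIGKILL.
docker run -d --name stop-demo busybox sh -c 'while true; do sleep 1; done'

# -t 30 mirrors the Agent's 30-second timeout; this takes the full time.
time docker stop -t 30 stop-demo

# 137 = 128 + 9: the container died from SIGKILL, not a clean exit.
docker inspect -f '{{.State.ExitCode}}' stop-demo
```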

rossf7 commented May 26, 2015

Hi Sam,
Thanks, yes, it makes sense that we're hitting a timeout on the docker stop. Our demo containers are the busybox image running in an infinite loop, as we just needed something simple. But it's very possible we're not responding to SIGTERM correctly. Running docker ps -a shows a lot of the stopped containers with Exited (137) as the status, which I think is related.

A configurable timeout might be useful, but if we can get our containers to stop cleanly, I don't think we'll need it for our current use case.

I'll look into this tomorrow, check for that log message, and then update the issue.

Thanks

Ross
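For the log check, something along these lines should work (the agent container name is assumed from the standard CoreOS setup; adjust for your cloud-config):

```sh
# Hypothetical check: look for the stop-ordering delay in the agent logs.
# On CoreOS the agent runs as a container, so `docker logs` can read them.
docker logs ecs-agent 2>&1 | grep "Waiting for any previous stops to complete"

# List stopped containers to see the Exited (137) statuses;
# 137 = 128 + 9, i.e. killed with SIGKILL after the stop timeout.
docker ps -a --filter status=exited
```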

samuelkarp commented May 26, 2015

Hi Ross,

Please let us know whether that solves your problem. If you need an example of responding to SIGTERM in busybox, you can look at this script I wrote.

Thanks,
Sam
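The linked script isn't reproduced in the thread, but the idea looks roughly like this sketch (not Sam's actual script) for a busybox sh entrypoint:

```sh
#!/bin/sh
# Sketch of a SIGTERM-aware busybox entrypoint (hypothetical, not the
# script Sam linked). Installing a trap means PID 1 actually receives
# TERM/INT, so `docker stop` ends the container immediately instead of
# waiting out the timeout and sending SIGKILL.
trap 'echo "caught signal, exiting"; exit 0' TERM INT

# Keep the workload in a background `sleep` and block on `wait`:
# `wait` is interruptible, so the trap fires as soon as a signal lands.
while true; do
  sleep 1 &
  wait $!
done
```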

rossf7 commented Jun 1, 2015

Hi Sam,
Based on your script, we're now properly handling SIGINT and SIGTERM. Our demo containers are exiting cleanly, without the 30-second timeout. Thanks for your help with this; it's much appreciated.

Cheers

Ross

rossf7 closed this Jun 1, 2015

samuelkarp commented Jun 1, 2015

Great, I'm glad I was able to help!
