Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change reconcile/container update order on init and waitForHostResources/emitCurrentStatus order #3747

Conversation

prateekchaudhry
Copy link
Contributor

@prateekchaudhry prateekchaudhry commented Jun 11, 2023

Summary

This PR fixes a bug figured out in task resource accounting on Agent Restarts. The issue is when Agent is down. a container stops (repro by docker stop <container_id>) and agent comes back up, task resources are not released by host resource manager.

Problems in current orders of calls are :

  • reconcileHostResources (see Remove task serialization and use host resource manager for task resources #3723) is being called after stopped containers are accounted for in task engine. So if a container stops while the agent is stopped, reconcileHostResources may not pre-'consume' resources for a task, because it's overseeTask would still need to be run for cleanup if it's status has changed to stop. As a result, reconcileHostResources should be called before filterTasksToStartUnsafe which updates the container and task statuses. This order is updated with comments.
  • In overseeTask, emitCurrentStatus also does an emitTaskEvent call - which releases 'consumed' resources in host resource manager (see Management of host resources in summary in Remove task serialization and use host resource manager for task resources #3723). And waitForHostResources() again re-allocates resources, leading to persistent accounted for resources in host resource manager in these cases (when tasks stop when agent is down). Changing it's order after waitForHostResources() call and updating comments.

This PR also makes a change in host_resource_manager and ToHostResources in task.go to dereference ports from []*string to []string during logging for more understandable debug logging. This change results in outputting actual port values instead of lvalues of the ports which might not be always relevant. See PORTS_TCP and PORTS_UDP after the change here
[Debug] logger=structured msg="Consumed resources after task consume call" CPU=512 MEMORY=768 PORTS_TCP=[22 23] PORTS_UDP=[1000 1001] GPU= 1 taskArn="arn:aws:ecs:us-east-1:<aws_account_id>:task/cluster-name/11111"

Related Containers Roadmap Issue

aws/containers-roadmap#325

Testing

With debug logs, verified agent restarts function properly and resources are accounted and released properly for following scenarios :

  • Tasks are running while agent restarts
  • Tasks stop while agent restarts - agent picks up stopped tasks, emits change of status and host resource manager releases resources
  • Instance reboots - agent picks up stopped tasks, emits change of status and host resource manager releases resources

Description for the changelog

Fix Agent restarts and ports logging with task resource accounting

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Yiyuanzzz
Yiyuanzzz previously approved these changes Jun 12, 2023
// Wait here until enough resources are available on host for the task to progress
// - Waits until host resource manager succesfully 'consume's task resources and returns
// - For tasks which have crossed this stage before (on agent restarts), resources are pre-consumed - returns immediately
// (resources are later 'release'd on Stopped task emitTaskEvent call)
mtask.waitForHostResources()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can add a small optimization here to skip invoking waitForHostResources if task known status is not STOPPED? Since we know in that case the host resources will be released immediately anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes let me add that. Although for this particular 'restart case', resources would be detected as 'STOPPED', and waitForHostResources behavior is resources are pre-consumed - returns immediately i.e. without queueing.

yinyic
yinyic previously approved these changes Jun 13, 2023
singholt
singholt previously approved these changes Jun 13, 2023
Yiyuanzzz
Yiyuanzzz previously approved these changes Jun 13, 2023
yinyic
yinyic previously approved these changes Jun 13, 2023
@prateekchaudhry prateekchaudhry merged commit abf9bd1 into aws:feature/task-resource-accounting Jun 13, 2023
35 checks passed
prateekchaudhry added a commit that referenced this pull request Jun 21, 2023
prateekchaudhry added a commit that referenced this pull request Jun 22, 2023
@prateekchaudhry prateekchaudhry mentioned this pull request Jun 22, 2023
sparrc added a commit to sparrc/amazon-ecs-agent that referenced this pull request Jun 30, 2023
…stResources/emitCurrentStatus order (aws#3747)"

This reverts commit 9e77f6f.
sparrc added a commit that referenced this pull request Jul 5, 2023
…stResources/emitCurrentStatus order (#3747)"

This reverts commit 9e77f6f.
prateekchaudhry added a commit that referenced this pull request Jul 12, 2023
prateekchaudhry added a commit to prateekchaudhry/amazon-ecs-agent that referenced this pull request Jul 12, 2023
…aitForHostResources/emitCurrentStatus order (aws#3747)""

This reverts commit 60a3f42.
prateekchaudhry added a commit that referenced this pull request Jul 12, 2023
* Revert "Revert "host resource manager initialization""

This reverts commit dafb967.

* Revert "Revert "Add method to get host resources reserved for a task (#3706)""

This reverts commit 8d824db.

* Revert "Revert "Add host resource manager methods (#3700)""

This reverts commit bec1303.

* Revert "Revert "Remove task serialization and use host resource manager for task resources (#3723)""

This reverts commit cb54139.

* Revert "Revert "add integ tests for task accounting (#3741)""

This reverts commit 61ad010.

* Revert "Revert "Change reconcile/container update order on init and waitForHostResources/emitCurrentStatus order (#3747)""

This reverts commit 60a3f42.

* Revert "Revert "Dont consume host resources for tasks getting STOPPED while waiting in waitingTasksQueue (#3750)""

This reverts commit 8943792.
Realmonia pushed a commit that referenced this pull request Jul 20, 2023
* Revert reverted changes for task resource accounting (#3796)

* Revert "Revert "host resource manager initialization""

This reverts commit dafb967.

* Revert "Revert "Add method to get host resources reserved for a task (#3706)""

This reverts commit 8d824db.

* Revert "Revert "Add host resource manager methods (#3700)""

This reverts commit bec1303.

* Revert "Revert "Remove task serialization and use host resource manager for task resources (#3723)""

This reverts commit cb54139.

* Revert "Revert "add integ tests for task accounting (#3741)""

This reverts commit 61ad010.

* Revert "Revert "Change reconcile/container update order on init and waitForHostResources/emitCurrentStatus order (#3747)""

This reverts commit 60a3f42.

* Revert "Revert "Dont consume host resources for tasks getting STOPPED while waiting in waitingTasksQueue (#3750)""

This reverts commit 8943792.

* fix memory resource accounting for multiple containers in single task (#3782)

* fix memory resource accounting for multiple containers

* change unit tests for multiple containers, add unit test for awsvpc
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants