
Ensure task status is reported before cleanup #705

Merged
merged 9 commits into aws:dev
Feb 14, 2017

Conversation

samuelkarp
Contributor

@samuelkarp samuelkarp commented Feb 10, 2017

Summary

There's a rare bug that can occur when the following situation happens:

  • Extremely high task launch/task stop rate in the cluster leading to throttles on SubmitTaskStateChange and SubmitContainerStateChange
  • Very low ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION
  • The agent reconnects to ECS after the task has been cleaned up locally but before the ECS API was successfully notified (due to throttles).

This can lead to the agent becoming extremely confused and internal state getting corrupted, possibly leading to even more calls to Submit*StateChange and attempts to pull whatever images are referenced in the corrupted task.

This change forces the agent to wait until the status has been properly submitted before starting cleanup, and to abort cleanup if the task status never gets submitted successfully.

Implementation details

  • Gave the api.TaskStateChange and api.ContainerStateChange structs real pointers to the api.Task and api.Container rather than pointers to fields of those respective structs
  • Modified managedTask.cleanupTask to check the SentStatus field and wait if it's not api.TaskStopped (a rough sketch of this idea follows this list)
  • Added some unit tests
  • Fixed a bunch of warnings from gometalinter
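
The core of the cleanupTask change, reduced to a minimal self-contained sketch (this is not the agent's actual code; the type names, wait interval, and attempt cap below are illustrative):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// taskStatus stands in for api.TaskStatus; only the values this sketch needs.
type taskStatus int

const (
	taskRunning taskStatus = iota
	taskStopped
)

// task stands in for api.Task: SentStatus is guarded by a mutex because the
// state-change submission path and the cleanup path touch it from different
// goroutines.
type task struct {
	mu         sync.Mutex
	sentStatus taskStatus
}

func (t *task) GetSentStatus() taskStatus {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.sentStatus
}

func (t *task) SetSentStatus(s taskStatus) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.sentStatus = s
}

// cleanupTask sketches the new behavior: block until the stopped status has
// been reported, and abort cleanup entirely if it never is.
func cleanupTask(t *task, interval time.Duration, maxAttempts int) {
	stoppedSent := make(chan bool)
	go func() {
		for i := 0; i < maxAttempts; i++ {
			if t.GetSentStatus() >= taskStopped {
				stoppedSent <- true
				return
			}
			fmt.Printf("blocking cleanup until stopped is reported (%d/%d)\n", i+1, maxAttempts)
			time.Sleep(interval)
		}
		stoppedSent <- false
	}()

	if <-stoppedSent {
		fmt.Println("stopped status reported; cleaning up task resources")
	} else {
		fmt.Println("stopped status never reported; aborting cleanup")
	}
}

func main() {
	t := &task{sentStatus: taskRunning}

	// Simulate SubmitTaskStateChange eventually succeeding after throttling.
	go func() {
		time.Sleep(30 * time.Millisecond)
		t.SetSentStatus(taskStopped)
	}()

	cleanupTask(t, 10*time.Millisecond, 10)
}
```

In the real change the polling loop lives in managedTask and uses an injectable clock (see the _time discussion further down); the sketch only shows the shape of the wait-then-cleanup-or-abort logic.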

Testing

  • Builds on Linux (make release)
  • Builds on Windows (go build -o amazon-ecs-agent.exe ./agent)
  • Unit tests on Linux (make test) pass
  • Unit tests on Windows (go test -timeout=25s ./agent/...) pass
  • Integration tests on Linux (make run-integ-tests) pass
  • Integration tests on Windows (.\scripts\run-integ-tests.ps1) pass
  • Functional tests on Linux (make run-functional-tests) pass
  • Functional tests on Windows (.\scripts\run-functional-tests.ps1) pass
  • Created high load on an instance in a cluster with artificially low throttles

New tests cover the changes: yes

Description for the changelog

Bug - Fixed a bug where throttles on state change reporting could lead to corrupted state

Licensing

This contribution is under the terms of the Apache 2.0 License: yes (Amazon employee)

@samuelkarp samuelkarp added this to the 1.14.1 milestone Feb 10, 2017
@@ -28,10 +28,10 @@ import (
)

func contEvent(arn string) api.ContainerStateChange {
- return api.ContainerStateChange{TaskArn: arn, ContainerName: "containerName", Status: api.ContainerRunning}
+ return api.ContainerStateChange{TaskArn: arn, ContainerName: "containerName", Status: api.ContainerRunning, Container: &api.Container{}}
Contributor

Why is this an empty struct and not nil?

Also, a nit: Breaking this across multiple lines would be even nicer.

Contributor Author

Container.GetSentStatus() and Container.SetSentStatus() are called without nil-checks.
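
For illustration, here is why a nil Container would be a problem: a field write through a nil pointer receiver panics. This is a toy reproduction, not agent code; the type and method below only mimic the api.Container accessors.

```go
package main

import "fmt"

// Container mimics api.Container just enough to show the failure mode.
type Container struct {
	sentStatus int
}

// SetSentStatus writes a field; with a nil receiver this dereference panics.
func (c *Container) SetSentStatus(status int) {
	c.sentStatus = status
}

func main() {
	defer func() {
		// Prints: recovered: runtime error: invalid memory address or nil pointer dereference
		fmt.Println("recovered:", recover())
	}()
	var c *Container // nil, as it would be if the state change carried no Container
	c.SetSentStatus(1)
}
```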

for !mtask.waitEvent(cleanupTimeBool) {
}
stoppedSentBool := make(chan bool)
go func() {
Contributor

@aaithal aaithal Feb 10, 2017

Can this be broken out into a named method?

Contributor Author

Done

@aaithal
Contributor

aaithal commented Feb 10, 2017

Looks super neat overall. Words can't express my joy about the TaskEngineState interface and locks around SentStatus. I have some minor comments/questions. Also, can you ensure that all the edited files have the copyright year 2017 in them? I think you might have missed some.
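
For readers without the diff at hand, a hypothetical sketch of what an engine-state abstraction of this kind can look like; the method set and placeholder types below are illustrative, not the agent's actual TaskEngineState:

```go
package dockerstate

// Task and Container are placeholders for api.Task and api.Container.
type Task struct{ Arn string }
type Container struct{ DockerID string }

// TaskEngineState hides the engine's internal maps behind accessor methods so
// that callers never reach into shared state directly; a concrete
// implementation can then guard that state with a single lock.
type TaskEngineState interface {
	TaskByArn(arn string) (*Task, bool)
	ContainerByID(dockerID string) (*Container, bool)
	AddTask(task *Task)
	RemoveTask(task *Task)
	AllTasks() []*Task
}
```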

@samuelkarp
Contributor Author

@aaithal I've updated the copyright on all the files I touched.

@aaithal
Contributor

aaithal commented Feb 10, 2017

@samuelkarp the new commit lgtm.

Prior to this change, a race condition exists between reporting task
status and task cleanup in the agent.  If reporting task status takes an
excessively long time and ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION is set
very short, the containers and associated task metadata could be removed
before the ECS backend is informed that the task has stopped.  When the
task is especially short-lived, it's also possible that the cleanup could
occur before the ECS backend is even informed that the task is running.
In this situation, it's possible for the ECS backend to re-transmit the
task to the agent (assuming that it hadn't started yet) and the
de-duplication logic in the agent can break.

With this change, we ensure that the task status is reported prior to
cleanup actually starting.
@samuelkarp samuelkarp merged commit 61371b4 into aws:dev Feb 14, 2017
for !mtask.waitEvent(cleanupTimeBool) {
}
stoppedSentBool := make(chan bool)
go func() {
Contributor

For newly written goroutines, should we start enforcing a rule of always passing a Context, so that they support simple cancellation?

Contributor Author

There's no context wired into most of our codebase, since most of it was written before context existed. We could add that to the managedTask, but that's outside the scope of my change here.

I don't think that always passing context is a hard rule that we should enforce. I think that whether or not we should use context depends on what the code is doing.

Contributor

Without a context, there is no way to stop this potentially long-running new goroutine (which could run for up to 72 hours in the worst case).

If there is another kind of state mismatch between the agent and the backend, where the backend thinks this instance can launch new tasks while the agent is still holding those long-running cleanup goroutines, is it possible that the agent could run out of memory?

Contributor

I meant "backend service" keep on starting a new task, and these task get stuck in "cleanup" state for 72 hours..., eventually will agent run out of memory?

Contributor Author

This is the desired behavior. A successful submission of task state will result in this goroutine exiting. Unsuccessful submissions will delay cleanup until success, the timeout, or 72 hours, whichever comes first. There is no use case for stopping the goroutine other than this.
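
As an aside, the context-based variant the reviewer is suggesting would look roughly like the sketch below. This is not what was merged; the function and parameter names are made up, and the checkSent callback stands in for reading the task's SentStatus.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitForStoppedSent polls checkSent until it reports true, waiting interval
// between attempts, and gives up early if the context is cancelled.
func waitForStoppedSent(ctx context.Context, checkSent func() bool, interval time.Duration) bool {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if checkSent() {
			return true
		}
		select {
		case <-ctx.Done():
			return false // cancelled (e.g. agent shutdown); stop waiting
		case <-ticker.C:
		}
	}
}

func main() {
	// A short timeout and a check that never succeeds, to show the cancellation path.
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	sent := waitForStoppedSent(ctx, func() bool { return false }, 10*time.Millisecond)
	fmt.Println("stopped status reported:", sent) // false
}
```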

Contributor

@petderek petderek left a comment

Neat. I had a few mostly curious questions — I don't think anything needs to change now.

// express or implied. See the License for the specific language governing
// permissions and limitations under the License.

package dockerstate
Contributor

Naming? It may just be me, but when I saw this referenced in other bits of code, I assumed it was part of Docker's library.

Contributor Author

Yes, I agree that this name is bad. I didn't want to change it as part of this PR though.


// Allow Task cleanup to occur
- time.Sleep(2 * time.Second)
+ time.Sleep(5 * time.Second)
Contributor

Why time.Sleep? Is there an event we can listen for instead of setting an arbitrary time?

Is there a best practice around testing these sorts of interactions in Go?

Contributor Author

Event-based would be better, but we'd actually still want a timeout on the event. Part of what these tests ensure is that cleanup happens within the expected time.
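
For what it's worth, an event-driven test wait that still keeps a timeout bound could be sketched as follows. The channel wiring is hypothetical: the real test would need the cleanup path (or a mock observing it) to signal the channel.

```go
package main

import (
	"fmt"
	"time"
)

// waitForCleanup blocks until cleanupDone is signalled or the timeout fires,
// so the test still asserts that cleanup happens within the expected time.
func waitForCleanup(cleanupDone <-chan struct{}, timeout time.Duration) error {
	select {
	case <-cleanupDone:
		return nil
	case <-time.After(timeout):
		return fmt.Errorf("cleanup did not happen within %v", timeout)
	}
}

func main() {
	done := make(chan struct{})
	go func() {
		time.Sleep(20 * time.Millisecond) // simulated cleanup work
		close(done)
	}()
	fmt.Println(waitForCleanup(done, time.Second)) // <nil>
}
```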

break
}
seelog.Warnf("Blocking cleanup for task %v until the task has been reported stopped. SentStatus: %v (%d/%d)", mtask, sentStatus, i+1, _maxStoppedWaitTimes)
mtask._time.Sleep(_stoppedSentWaitInterval)
Contributor

  • Why are you using _time as a member of mtask instead of just using time.Sleep directly?
  • Why not use exponential backoff with the same 72 hour cap?

Contributor Author

  • _time is a ttime.Time implementation that can be swapped out for tests. In unit tests, we inject a mock here that lets us verify and control how the code interacts with time (see the sketch after this list).
  • I didn't feel like exponential backoff was necessary or particularly desirable here; the only thing this does is check the value of a variable. A constant delay is also a bit easier to read.
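
A minimal sketch of the injectable-clock idea from the first bullet, assuming a Clock interface with just a Sleep method (the real ttime.Time interface and its generated mock differ):

```go
package main

import (
	"fmt"
	"time"
)

// Clock abstracts time so production code sleeps for real while tests inject
// a fake that records calls instead of waiting.
type Clock interface {
	Sleep(d time.Duration)
}

type realClock struct{}

func (realClock) Sleep(d time.Duration) { time.Sleep(d) }

// fakeClock records requested sleeps so a unit test can assert on the waiting
// behavior without actually slowing down.
type fakeClock struct {
	slept []time.Duration
}

func (f *fakeClock) Sleep(d time.Duration) { f.slept = append(f.slept, d) }

// waitABit is a toy consumer that depends on Clock rather than on the time
// package directly.
func waitABit(c Clock) {
	for i := 0; i < 3; i++ {
		c.Sleep(100 * time.Millisecond)
	}
}

func main() {
	fc := &fakeClock{}
	waitABit(fc)
	fmt.Println("sleeps requested:", fc.slept) // [100ms 100ms 100ms]
}
```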

@adnxn adnxn mentioned this pull request Mar 6, 2017