Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix type assertion bug #910

Merged
merged 2 commits into from Aug 2, 2017
Merged

Fix type assertion bug #910

merged 2 commits into from Aug 2, 2017

Conversation

nmeyerhans
Copy link
Contributor

@nmeyerhans nmeyerhans commented Aug 2, 2017

A type assertion of event.Error assumed that the variable being tested was a
reference, when we actually weren't consistent in how we assigned that
variable. This caused the assertion to fail in places where it should have
passed, ultimately causing us to leak containers in the case where docker
returned an API error in response to the 'stop' call.

Summary

Fix a bug in which containers can be leaked in the event that docker returns an error in response to a stop API call.

Implementation details

This is a minimal fix that ensures that we're performing assertions of the same type that we're actually assigning. It additionally makes a minor change to log output to add some additional detail.

Testing

Using a custom build of Docker to force the stop API call to return a 500 error. Verify that without this change, agent logs Error while pulling container; will try to run anyways and leaks a container. (Leak, in this case, means that it has indicated to the ECS service that the task has successfully transitioned to STOPPED even though at least one associated container was still running, and no further efforts would be made to terminate the container.)

With the change, Agent exhibits the expected behavior, which is that it retries the stop call indefinitely and does not signal that the task has transitioned to STOPPED.

  • Builds on Linux (make release)
  • Builds on Windows (go build -out amazon-ecs-agent.exe ./agent)
  • Unit tests on Linux (make test) pass
  • Unit tests on Windows (go test -timeout=25s ./agent/...) pass
  • Integration tests on Linux (make run-integ-tests) pass
  • Integration tests on Windows (.\scripts\run-integ-tests.ps1) pass
  • Functional tests on Linux (make run-functional-tests) pass
  • Functional tests on Windows (.\scripts\run-functional-tests.ps1) pass

New tests cover the changes:

Description for the changelog

Bug -- Fix a situation in which containers may be falsely reported as STOPPED in the case of a Docker "stop" API failure.

Licensing

This contribution is under the terms of the Apache 2.0 License: yes

Copy link
Member

@aaithal aaithal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine for now. But, we have to find a better way of doing this in the future. We can have all Cannot*Container errors or just all custom error types implement an interface that extends the error interface with an additional GetErrorName() method and use that to check the error instead of being inconsistent with pointer/non-pointer assertions.

Copy link
Member

@jahkeup jahkeup left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

I agree with @aaithal that we may need to rethink our error wrappers to give them clearer meaning and remove the chance of the sort of inappropriate type assertion that bit us here. We should do an audit of these assertions and ensure we don't have another one of these bugs hiding in the code.

@nmeyerhans
Copy link
Contributor Author

nmeyerhans commented Aug 2, 2017

@aaithal we already have an engineError interface for our custom types. CannotStopContainerError is a special case, as it's the only error type that defines additional methods beyond what are already specified by engineError.

What we really need here is a new interface type for the CannotStopContainerError, then type assertion will work a bit more like it's expected to. This should eliminate the possibility of future bugs introduced by mixing pointers with direct references.

@@ -254,7 +254,7 @@ func (mtask *managedTask) handleContainerChange(containerChange dockerContainerC
}
// If docker returned a transient error while trying to stop a container,
// reset the known status to the current status and return
cannotStopContainerError, ok := event.Error.(*CannotStopContainerError)
cannotStopContainerError, ok := event.Error.(CannotStopContainerError)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this fixes it for now, this doesn't prevent the same problem from happening again. I would much prefer if we defined an interface for this and type-asserted to the interface. Interface type-assertions don't matter whether you have a struct or a pointer since either can satisfy the interface. See here.

Something like this:

type maybeUnretriableError interface {
	IsUnretriableError() bool
}

You should be able to just define this above handleContainerChange and use it here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty much exactly what I'm doing... 🥇

Except that I'm taking the opportunity to change that method name!

Copy link
Contributor

@samuelkarp samuelkarp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM if you fix the godoc 😄

@@ -154,15 +159,19 @@ func (err CannotStopContainerError) ErrorName() string {
return "CannotStopContainerError"
}

func (err CannotStopContainerError) IsUnretriableError() bool {
// When stopping a container, most errors that we can get should be
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have this in a godoc location, but it's not godoc format. Change the beginning to read "IsRetriableError returns ..."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🦑 🐙 🌊 🐋 🐟 🌊 🐚

Copy link
Member

@aaithal aaithal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please address Sam's godoc comment before merging this.

Noah Meyerhans added 2 commits August 2, 2017 11:48
… status.

A type assertion of event.Error assumed that the variable being tested was a
reference, when we actually weren't consistent in how we assigned that
variable. This caused the assertion to fail in places where it should have
passed, ultimately causing us to falsely report containers as stopped in the
case where docker returned an API error in response to the 'stop' call.

To fix this, we define a new 'cannotStopContainerError' interface and perform
the type assertion on the interface.
…ntainerError.IsUnretryableError

The only place this method was used was in a negated test. Negating negatives
requires unnecessarily mental work on the part of the reader. IsRetryableError
is quicker to understand.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants