Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

get container exit code from event if we can't get it from inspect. #2940

Merged
merged 1 commit into from Jul 12, 2021

Conversation

fenxiong
Copy link
Contributor

@fenxiong fenxiong commented Jul 6, 2021

Summary

Currently, when we received a container die event from docker, but was not able to inspect the container (e.g. due to inspect timeout), the container's exit code is left as null, because we only get it from the inspect output, and the output is empty when we fail to inspect the container. However, the exit code is also in the container die event itself (as an attribute, ref: https://github.com/moby/moby/blob/master/daemon/monitor.go#L61, didn't trace docker history but i think it has been there for a long time), and we are logging it, but just not using it.

This change sets the exit code from the docker event in case of inspect failure, so that the exit code of the container is still reported in this case and available for customer to see when describing task.

Implementation details

Set the exit code when receiving a container die event. This is done in the docker client at the place where we handle the docker event.

Testing

added unit test. also manually tested by: added some test code to return inspect timeout for exited container https://gist.github.com/fenxiong/9afdeeeedab05177b6d425c5db3eb3b9 to repro the scenario (since the timeout isn't easy to repro), then run a simple task:

{
    "family": "test-sleep-exit",
    "networkMode": "bridge",
    "containerDefinitions": [
        {
            "name": "container_1",
            "image": "public.ecr.aws/x1v8g9p0/busybox:latest",
            "essential": true,
            "memory": 32,
            "command": ["sh", "-c", "sleep 5 && exit 42"]
        }
    ]
}

Without the fix, the container's exit code is left as null:

xx:files fenxiong$ aws ecs describe-tasks --cluster test-exit --tasks xx | grep exitCode
xx:files fenxiong$ aws ecs describe-tasks --cluster test-exit --tasks xx | grep reason
                    "reason": "CannotInspectContainerError: Could not transition to inspecting; timed out after waiting 30s",

With the fix, the container's exit code is correctly reported in describe-tasks:

xx:files fenxiong$ aws ecs describe-tasks --cluster test-exit --tasks xx | grep exitCode
                    "exitCode": 42,
xx:files fenxiong$ aws ecs describe-tasks --cluster test-exit --tasks xx | grep reason
                    "reason": "CannotInspectContainerError: Could not transition to inspecting; timed out after waiting 30s",

New tests cover the changes: yes

Description for the changelog

Enhancement: get container's exit code from docker event in case we receive a container die event, but fail to inspect the container. Previously the container's exit code was left as null in this case.

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

In case when we received a container die event but was not able to inspect the container (e.g. due to timeout), we will use the exit code from the event, so that the exit code of the container is still reported and available for customer to see from describing task.
@fenxiong fenxiong marked this pull request as ready for review July 6, 2021 22:20
@fenxiong fenxiong requested a review from a team July 6, 2021 22:20
@@ -1015,6 +1019,10 @@ func (dg *dockerGoClient) handleContainerEvents(ctx context.Context,
}

metadata := dg.containerMetadata(ctx, containerID)
// In case when we received a container die event but was not able to inspect the container (e.g. due to timeout),
// we will use the exit code from the event, so that the exit code of the container is still reported and
// available for customer to see from describing task.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if exit code is set here, do we still need the inspect later?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, because we need to get the exitAt timestamp https://github.com/aws/amazon-ecs-agent/blob/master/agent/api/container/container.go#L310. this is shown in the metadata, and is only available by inspecting the container (see https://github.com/moby/moby/blob/master/daemon/monitor.go#L48, the exit code is stored in event but exitAt is not). otherwise i would have remove the inspect 😞

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it. if inspect fails, do we still show the error in task stopped reason even when container exits fine?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes we will show the error in the reason field. see the Testing section in pr description for an example

@fenxiong fenxiong added this to the 1.54.1 milestone Jul 7, 2021
@fenxiong fenxiong merged commit 7117905 into aws:dev Jul 12, 2021
@fenxiong fenxiong deleted the inspect branch July 12, 2021 21:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants