Enable container health check #1141

richardpen · 2017-12-12T20:12:56Z

Summary

Enable docker container health check in the agent and report container health metrics to backend

Implementation details

Testing

Builds on Linux (make release)
Builds on Windows (go build -o amazon-ecs-agent.exe ./agent)
Unit tests on Linux (make test) pass
Unit tests on Windows (go test -timeout=25s ./agent/...) pass
Integration tests on Linux (make run-integ-tests) pass
Integration tests on Windows (.\scripts\run-integ-tests.ps1) pass
Functional tests on Linux (make run-functional-tests) pass
Functional tests on Windows (.\scripts\run-functional-tests.ps1) pass

New tests cover the changes:

Description for the changelog

Licensing

This contribution is under the terms of the Apache 2.0 License:

aaithal

I still need to review changes to stats engine and the tcs handler. Publishing first list of comments so that you have an initial set of changes to work with. Can you also create this PR against a new "container-health" branch in the repo?

aaithal · 2017-12-13T17:33:25Z

agent/api/containerstatus.go

+	case ContainerUnhealthy:
+		return "UNHEALTHY"
+	default:
+		return "UNKNOWN"


When do you expect this to be the case?

Is this a valid backend status? That is, what will the health of the container/task be if we report it as "UNKNOWN"?

Please address this comment.

When do you expect this to be the case?

When the docker inspect API call fails (e.g., on a timeout), the agent isn't able to get the health status of the container.

Is this a valid backend status? That is, what will the health of the container/task be if we report it as "UNKNOWN"?

Yes, the backend expects 'UNKNOWN' when the agent can't get the container health status. In that case the container health status will be reported as unknown.

aaithal · 2017-12-13T17:34:37Z

agent/api/containerevent.go

+package api
+
+// DockerEventsType represents the type of docker events
+type DockerEventsType int


nit: can you please rename this to DockerEventType ?

aaithal · 2017-12-13T17:35:34Z

agent/api/task.go

+	if container.HealthCheckType == dockerHealthCheckType {
+		// configure the docker health check config if it's set
+		healthConfig := &docker.HealthConfig{}
+		err := json.Unmarshal([]byte(*container.DockerConfig.HealthCheck), healthConfig)


please use aws.StringValue() here and everywhere else where you're dereferencing pointers directly.
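
For context, a minimal sketch of why the nil-safe helper is preferred over a bare `*ptr`. The local `stringValue` function is a stand-in that mirrors what `aws.StringValue` does (return the empty string for a nil pointer), and `dockerConfig` is a hypothetical, trimmed-down struct used only for illustration.

```go
package main

import "fmt"

// stringValue is a local stand-in for aws.StringValue: it returns the
// pointed-to string, or "" when the pointer is nil.
func stringValue(s *string) string {
	if s == nil {
		return ""
	}
	return *s
}

// dockerConfig is a hypothetical, reduced version of the agent's DockerConfig.
type dockerConfig struct {
	HealthCheck *string
}

func main() {
	var cfg dockerConfig // HealthCheck is nil here

	// Writing *cfg.HealthCheck at this point would panic with a nil
	// pointer dereference; the helper yields "" instead.
	fmt.Printf("%q\n", stringValue(cfg.HealthCheck))

	raw := `{"Test":["command"]}`
	cfg.HealthCheck = &raw
	fmt.Printf("%q\n", stringValue(cfg.HealthCheck))
}
```

With the helper, an empty result can be handled as an ordinary validation error rather than crashing the agent.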

aaithal · 2017-12-13T17:37:07Z

agent/api/task_test.go

+				Name:            "c1",
+				HealthCheckType: dockerHealthCheckType,
+				DockerConfig: DockerConfig{
+					HealthCheck: aws.String(`{"Test":["command"],"Interval":5000000000,"Timeout":4000000000,"StartPeriod":60000000000,"Retries":5}`),


minor: Can you please split this into a multi-line string to make it more readable?
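
One way to do this is a raw string literal spread over several lines, since JSON ignores whitespace between tokens. The sketch below assumes a hypothetical `healthConfig` struct whose field names match the keys in the test string; it is not the docker client's type.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// healthConfig is a hypothetical stand-in whose fields match the JSON keys
// used in the test fixture above.
type healthConfig struct {
	Test        []string
	Interval    time.Duration
	Timeout     time.Duration
	StartPeriod time.Duration
	Retries     int
}

// healthCheckJSON is the same payload as the one-line fixture, split
// across lines with a raw string literal for readability.
const healthCheckJSON = `{
	"Test": ["command"],
	"Interval": 5000000000,
	"Timeout": 4000000000,
	"StartPeriod": 60000000000,
	"Retries": 5
}`

// parseHealthConfig decodes the JSON payload into a healthConfig.
func parseHealthConfig(raw string) (*healthConfig, error) {
	cfg := &healthConfig{}
	if err := json.Unmarshal([]byte(raw), cfg); err != nil {
		return nil, err
	}
	return cfg, nil
}

func main() {
	cfg, err := parseHealthConfig(healthCheckJSON)
	if err != nil {
		panic(err)
	}
	fmt.Println(cfg.Test, cfg.Interval, cfg.Timeout, cfg.StartPeriod, cfg.Retries)
}
```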

aaithal · 2017-12-13T17:38:53Z

agent/app/agent_capability.go

@@ -99,6 +100,11 @@ func (agent *ecsAgent) capabilities() ([]*ecs.Attribute, error) {
 		capabilities = appendNameOnlyAttribute(capabilities, attributePrefix+"execution-role-ecr-pull")
 	}

+	if _, ok := supportedVersions[dockerclient.Version_1_29]; ok {
+		// StartPeriod was added in API 1.29


minor: make this more explanatory. "Docker health check start period was added in .."

Also, it feels like we're ignoring a large number of container instances which are running docker version >= 1.12 and <= 17.05. I think we should instead check if version >= 1.24

aaithal · 2017-12-13T17:51:34Z

agent/engine/docker_container_engine.go

@@ -802,6 +822,10 @@ func (dg *dockerGoClient) ContainerEvents(ctx context.Context) (<-chan DockerCon
 				// mean the container dies (non-init processes). If the container also
 				// dies, you see a "die" status as well; we'll update suitably there
 				continue
+			case "health_status: healthy":


just double checking here that the event's Status is really "health_status: healthy" and this is not a typo/bug. Should this be just "health_status"?

yes, the event's status is "health_status: healthy"

aaithal · 2017-12-13T17:52:03Z

agent/engine/docker_events_buffer.go

@@ -23,7 +23,7 @@ const (
 	containerTypeEvent = "container"
 )

-var containerEvents = []string{"create", "start", "stop", "die", "restart", "oom"}
+var containerEvents = []string{"create", "start", "stop", "die", "restart", "oom", "health_status: unhealthy", "health_status: healthy"}


minor: please split this into multiple lines

aaithal · 2017-12-13T18:03:21Z

agent/app/agent.go

-		go tcshandler.StartMetricsSession(telemetrySessionParams)
-	}
+	// Start metrics session in a go routine
+	go tcshandler.StartMetricsSession(telemetrySessionParams)


This is problematic because today, disabling metrics also means disabling the metrics collection engine that gets initialized (stats engine's init) from tcs handler. Customers might choose to do that because they want a lighter cpu/memory footprint for the ECS Agent. If agent.cfg.DisableMetrics is true, we should disable both metrics collection and reporting. I think with this change, we're only disabling the reporting part. We should also disable metrics collection.

please address this. not sure if it was addressed offline though.

This has already been addressed. Disabling metrics won't affect the health check part.

aaithal · 2017-12-13T18:06:12Z

agent/engine/docker_task_engine.go

 	engine.processTasks.RLock()
 	managedTask, ok := engine.managedTasks[task.Arn]
 	// hold the lock until the message is sent so we don't send on a closed channel
 	defer engine.processTasks.RUnlock()
 	if !ok {
-		log.Crit("Could not find managed task corresponding to a docker event", "event", event, "task", task)
-		return true
+		seelog.Criticalf("Could not find managed task corresponding to a docker event, event: %s, task: %s", event, task)


I don't think we need the whole task here (or in other log lines you've changed). Can you use task.Arn instead?

aaithal · 2017-12-13T18:06:49Z

agent/engine/types.go

@@ -63,6 +65,8 @@ type DockerContainerMetadata struct {
 	StartedAt time.Time
 	// FinishedAt is the timestamp of container stop
 	FinishedAt time.Time
+	//Health contains the result of a container health check


lint: // Health

aaithal · 2017-12-13T18:19:16Z

agent/stats/engine.go

 func (engine *DockerStatsEngine) isIdle() bool {
 	engine.lock.RLock()
 	defer engine.lock.RUnlock()

 	return len(engine.tasksToContainers) == 0
 }

+func (engine *DockerStatsEngine) getTaskHealth(taskARN string) *ecstcs.TaskHealth {


This should have the 'Unsafe' suffix.

aaithal · 2017-12-13T18:20:24Z

agent/stats/engine.go

+		if taskHealth == nil {
+			continue
+		}
+		taskHealths = append(taskHealths, engine.getTaskHealth(taskARN))


you can just use taskHealth here, instead of an additional getTaskHealth call

aaithal · 2017-12-13T19:36:34Z

agent/tcs/client/client.go

+// copyHealthMetadata performs a deep copy of HealthMetadata object
+func copyHealthMetadata(metadata *ecstcs.HealthMetadata, fin bool) *ecstcs.HealthMetadata {
+	return &ecstcs.HealthMetadata{
+		Cluster:           aws.String(*metadata.Cluster),


I think you can just do Cluster: metadata.Cluster here (same for other string fields in this block).

No, metadata.xx is a pointer, but we want a deep copy here, so that the actual value can't be changed elsewhere.

sorry, missed that. please use aws.StringValue here as well. Avoid doing pointer dereferencing as much as possible.
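
A minimal sketch of the deep copy with nil-safe dereferencing, under stated assumptions: `healthMetadata` is a hypothetical two-field reduction of `ecstcs.HealthMetadata` (the `Fin` field is inferred from the `fin bool` parameter), and `stringValue`/`newString` are local stand-ins for `aws.StringValue`/`aws.String`.

```go
package main

import "fmt"

// healthMetadata is a hypothetical, reduced stand-in for ecstcs.HealthMetadata.
type healthMetadata struct {
	Cluster *string
	Fin     *bool
}

// stringValue stands in for aws.StringValue (nil-safe dereference).
func stringValue(s *string) string {
	if s == nil {
		return ""
	}
	return *s
}

// newString stands in for aws.String: it returns a pointer to a fresh copy.
func newString(s string) *string { return &s }

// copyHealthMetadata returns a copy that shares no pointers with the input.
// Note that a nil Cluster becomes a pointer to "" in the copy.
func copyHealthMetadata(m *healthMetadata, fin bool) *healthMetadata {
	return &healthMetadata{
		Cluster: newString(stringValue(m.Cluster)),
		Fin:     &fin,
	}
}

func main() {
	cluster := "default"
	orig := &healthMetadata{Cluster: &cluster}
	dup := copyHealthMetadata(orig, true)

	// Mutating the original after the copy leaves the copy untouched.
	*orig.Cluster = "changed"
	fmt.Println(*dup.Cluster, *dup.Fin)
}
```

Assigning `Cluster: metadata.Cluster` would instead share the pointer, so a later write through either struct would be visible in both.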

This is still an outstanding item

aaithal · 2017-12-13T19:46:37Z

agent/tcs/client/client.go

+		// create a request if the number of task reaches the maximum
+		if (i+1)%tasksInMessage == 0 {
+			requestMetadata := copyHealthMetadata(metadata, (i+1) == numOfTasks)
+			requests = append(requests, ecstcs.NewPublishHealthMetricsRequest(requestMetadata, copyTaskHealthMetrics(taskHealths)))


please split this into multiple lines so that it's easier to read

aaithal · 2017-12-13T19:47:11Z

agent/tcs/client/client.go

+	numOfTasks := len(taskHealthMetrics)
+	for i, taskHealth := range taskHealthMetrics {
+		taskHealths = append(taskHealths, taskHealth)
+		// create a request if the number of task reaches the maximum


minor: ".. reaches the maximum page size"

aaithal · 2017-12-15T19:15:47Z

agent/api/container.go

-	Config     *string `json:"config"`
-	HostConfig *string `json:"hostConfig"`
-	Version    *string `json:"version"`
+	Config      *string `json:"config"`


minor: this is a good opportunity to document these fields

aaithal · 2017-12-15T19:16:26Z

agent/api/container.go

@@ -65,11 +81,12 @@ type Container struct {
 	Overrides              ContainerOverrides          `json:"overrides"`
 	DockerConfig           DockerConfig                `json:"dockerConfig"`
 	RegistryAuthentication *RegistryAuthenticationData `json:"registryAuthentication"`
-
+	HealthCheckType        string                      `json:"healthCheckType,omitempty"`


minor: // HealthCheckType is ..

aaithal · 2017-12-15T19:18:34Z

agent/api/container.go

+	copyHealth := c.Health
+
+	if c.Health.Since != nil {
+		copyHealth.Since = aws.Time(*c.Health.Since)


please avoid doing pointer dereferencing this way if possible.

@aaithal: why exactly do you advise against this?

aaithal · 2017-12-15T19:19:00Z

agent/api/containerevent.go

+
+const (
+	// ContainerStatusEvent represents the container status change events from docker
+	ContainerStatusEvent DockerEventType = iota


Can you also document which events fall into this bucket here?

aaithal · 2017-12-15T19:19:27Z

agent/api/containerevent.go

+const (
+	// ContainerStatusEvent represents the container status change events from docker
+	ContainerStatusEvent DockerEventType = iota
+	// ContainerHealthEvent represents the container health status event from docker


Same as above. Please document all the event types that fall into this bucket.

aaithal · 2017-12-15T20:06:58Z

agent/stats/engine.go

+	}
+
+	if len(taskHealths) == 0 {
+		return nil, nil, EmptyMetricsError


please return a different error here. EmptyMetricsError is not the same as EmptyHealthStatusError for example
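
A sketch of what separate sentinel errors could look like, so callers can tell "no resource metrics" apart from "no health metrics". The variable names, messages, and the `taskHealthMetrics` helper are illustrative assumptions, not the agent's actual definitions.

```go
package main

import (
	"errors"
	"fmt"
)

// Two distinct sentinel errors (hypothetical names and messages).
var (
	errEmptyMetrics      = errors.New("stats engine: no task metrics to report")
	errEmptyHealthStatus = errors.New("stats engine: no task health metrics to report")
)

// taskHealthMetrics returns the health-specific sentinel when there is
// nothing to publish.
func taskHealthMetrics(healths []string) ([]string, error) {
	if len(healths) == 0 {
		return nil, errEmptyHealthStatus
	}
	return healths, nil
}

func main() {
	_, err := taskHealthMetrics(nil)
	// The caller can branch on exactly which condition occurred.
	fmt.Println(errors.Is(err, errEmptyHealthStatus), errors.Is(err, errEmptyMetrics))
}
```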

aaithal · 2017-12-15T20:11:26Z

agent/stats/engine.go

@@ -445,6 +563,17 @@ func (engine *DockerStatsEngine) doRemoveContainerUnsafe(container *StatsContain
 		delete(engine.tasksToDefinitions, taskArn)
 		seelog.Debugf("Deleted task from tasks, arn: %s", taskArn)
 	}
+
+	// Remove the container from health container watch list
+	if _, ok := engine.tasksToHealthCheckContainers[taskArn][dockerID]; !ok {


I don't think you need this. delete is a no-op for non-existent keys

This check makes sure the log below is correct, otherwise the log message will be misleading.
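
Both points can be seen in a small sketch: `delete` on a missing key (or on a nil inner map) is safe, while the existence check lets the caller decide whether a "removed" log line would be accurate. The `removeContainer` helper and the map shape are simplified assumptions.

```go
package main

import "fmt"

// removeContainer deletes dockerID from the watch list and reports whether
// it was actually present, so the caller only logs a removal that happened.
func removeContainer(watch map[string]map[string]bool, taskARN, dockerID string) bool {
	// Indexing a missing task yields a nil inner map; reading from it is safe.
	if _, ok := watch[taskARN][dockerID]; !ok {
		return false
	}
	delete(watch[taskARN], dockerID)
	if len(watch[taskARN]) == 0 {
		delete(watch, taskARN)
	}
	return true
}

func main() {
	watch := map[string]map[string]bool{"task1": {"c1": true}}

	// delete on a key that does not exist is a no-op and does not panic.
	delete(watch["missing-task"], "c9")

	fmt.Println(removeContainer(watch, "task1", "c2")) // container not watched
	fmt.Println(removeContainer(watch, "task1", "c1")) // container was watched
	fmt.Println(len(watch))
}
```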

aaithal · 2017-12-15T20:12:03Z

agent/stats/types.go

@@ -38,7 +38,7 @@ type UsageStats struct {

 // ContainerMetadata contains meta-data information for a container.
 type ContainerMetadata struct {
-	DockerID string `json:"-"`
+	DockerID string


why did you change this?
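
For readers following the question: the `json:"-"` tag makes `encoding/json` skip the field entirely, so removing it causes `DockerID` to appear in serialized output. A small sketch with two hypothetical structs that differ only in that tag:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// withTag hides DockerID from JSON output.
type withTag struct {
	DockerID string `json:"-"`
	Name     string `json:"name"`
}

// withoutTag serializes DockerID under its Go field name.
type withoutTag struct {
	DockerID string
	Name     string `json:"name"`
}

// mustMarshal encodes v as JSON and panics on failure.
func mustMarshal(v interface{}) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	fmt.Println(mustMarshal(withTag{DockerID: "abc", Name: "c1"}))
	fmt.Println(mustMarshal(withoutTag{DockerID: "abc", Name: "c1"}))
}
```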

aaithal · 2017-12-15T20:15:22Z

agent/tcs/client/client.go

+	for i, taskHealth := range taskHealthMetrics {
+		taskHealths = append(taskHealths, taskHealth)
+		// create a request if the number of task reaches the maximum page size
+		if (i+1)%tasksInMessage == 0 {


Please define a new const for tasksInHealthMessage and rename tasksInMessage to tasksInMetricsMessage. Also add a TODO for determining the proper value for tasksInHealthMessage. Because of changes in payload structure, we'd have differing values for these I think
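
A sketch of the paging logic with its own constant, assuming a hypothetical `splitIntoPages` helper and plain string ARNs in place of the real task health structs. The value 10 is carried over from the existing constant and is only a placeholder.

```go
package main

import "fmt"

// tasksInHealthMessage is the maximum number of tasks per health message.
// TODO: determine the proper value; 10 is a placeholder assumption.
const tasksInHealthMessage = 10

// splitIntoPages groups task ARNs into pages of at most pageSize entries.
// The final page holds the remainder and may be shorter.
func splitIntoPages(taskARNs []string, pageSize int) [][]string {
	if pageSize <= 0 {
		return nil
	}
	var pages [][]string
	for start := 0; start < len(taskARNs); start += pageSize {
		end := start + pageSize
		if end > len(taskARNs) {
			end = len(taskARNs)
		}
		pages = append(pages, taskARNs[start:end])
	}
	return pages
}

func main() {
	var arns []string
	for i := 0; i < 25; i++ {
		arns = append(arns, fmt.Sprintf("task-%d", i))
	}
	for i, page := range splitIntoPages(arns, tasksInHealthMessage) {
		fmt.Printf("page %d: %d tasks\n", i, len(page))
	}
}
```

Slicing by page avoids the modulo check inside the loop and handles the trailing partial page without a separate flush step.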

aaithal · 2017-12-15T20:15:50Z

agent/tcs/client/client.go

+// copyHealthMetadata performs a deep copy of HealthMetadata object
+func copyHealthMetadata(metadata *ecstcs.HealthMetadata, fin bool) *ecstcs.HealthMetadata {
+	return &ecstcs.HealthMetadata{
+		Cluster:           aws.String(*metadata.Cluster),


This is still an outstanding item

adnxn

took a first pass, ill take another look soon.

adnxn · 2017-12-15T23:34:14Z

agent/api/container.go

@@ -32,15 +33,30 @@ const (
 	// that specifies that the log driver should be authenticated using the
 	// execution role
 	awslogsAuthExecutionRole = "ExecutionRole"
+
+	dockerHealthCheckType = "DOCKER"


nit: docs for dockerHealthCheckType

adnxn · 2017-12-15T23:40:40Z

agent/api/container.go

+	copyHealth := c.Health
+
+	if c.Health.Since != nil {
+		copyHealth.Since = aws.Time(*c.Health.Since)


@aaithal: why exactly do you advise against this?

adnxn · 2017-12-15T23:48:30Z

agent/api/container_test.go

+	assert.Equal(t, health.Output, "test")
+	assert.NotEmpty(t, health.Since)
+
+	// set the health status again shouldn't update the timestamp


so i'm clear, the timestamp shouldn't change cause the Status hasn't changed ya?

yes, the timestamp records when the status last changed.
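
The behaviour being tested can be sketched as follows. The `container` and `healthStatus` types, the string statuses, and the explicit `now` parameter are simplifying assumptions made so the example is deterministic; the agent's own types differ.

```go
package main

import (
	"fmt"
	"time"
)

// healthStatus is a hypothetical, reduced health record.
type healthStatus struct {
	Status string
	Since  *time.Time // when Status last changed
}

type container struct {
	health healthStatus
}

// setHealthStatus updates Since only when the status actually changes.
func (c *container) setHealthStatus(status string, now time.Time) {
	if c.health.Status == status {
		return // same status reported again: keep the original timestamp
	}
	c.health.Status = status
	c.health.Since = &now
}

// at builds a fixed timestamp for the example.
func at(sec int64) time.Time { return time.Unix(sec, 0) }

func main() {
	c := &container{}
	c.setHealthStatus("HEALTHY", at(100))
	c.setHealthStatus("HEALTHY", at(200)) // no change, timestamp kept
	fmt.Println(c.health.Status, c.health.Since.Unix())

	c.setHealthStatus("UNHEALTHY", at(300)) // change, timestamp updated
	fmt.Println(c.health.Status, c.health.Since.Unix())
}
```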

adnxn · 2017-12-18T17:37:59Z

agent/api/containerevent.go

+	case ContainerHealthEvent:
+		return "ContainerHealthChangeEvent"
+	default:
+		return "UNKNOWN"


should make this something similar to "DockerEventType: UNKNOWN"?

On a second thought, I think UNKNOWN should be enough, as in logs we will log like "event: UNKNOWN" which will give you the context.

adnxn · 2017-12-18T17:46:58Z

agent/api/containerstatus.go

+		return nil
+	}
+
+	return errors.New("Unrecognized container health status: " + string(b))


this doesn't follow error string format convention

@aaithal do you mean the string should start with a lowercase letter?

adnxn · 2017-12-18T18:21:40Z

agent/engine/docker_container_engine.go

+	logLength := len(dockerContainer.State.Health.Log)
+	if logLength != 0 {
+		// Only save the last log from the health check
+		health.Output = dockerContainer.State.Health.Log[logLength-1].Output


is this only the last line from the health check log? will this have enough context? also - is this the only way to get visibility into why the health check command may have failed?

Is this only the last line from the health check log? will this have enough context?

Yes, since the log is the health check command output, and if the command failed the output should be the same. So I think the last line is enough.

is this the only way to get visibility into why the health check command may have failed?

You can also get the failure reason from the exit code. The output is the health check command output, which is more understandable than the exit code.

Why would you only have the last line? Also: I don't see any limit here to the length of the line. We should probably limit this in terms of bytes and maybe select the last N bytes instead of the last line.

adnxn · 2017-12-18T18:45:38Z

agent/engine/docker_task_engine.go


 	managedTask.dockerMessages <- dockerContainerChange{container: cont.Container, event: event}
-	log.Debug("Wrote docker event to the associated task", "task", task, "event", event)
-	return true
+	seelog.Debugf("Wrote docker event to the associated task: %s, event: %s", task.Arn, event)


nit: are we sure we need both Writing/Wrote debug lines here? this may add a lot of noise to our debug logs and they're already pretty noisy.

sharanyad

Initial pass. some nits.

sharanyad · 2017-12-19T23:57:23Z

agent/api/container.go

+	// HealthCheckType is the mechnism to use for the container health check
+	// currently it only supports 'DOCKER'
+	HealthCheckType string `json:"healthCheckType,omitempty"`
+	// Health contains the health check information of


nit : health check information of ... incomplete?

sharanyad · 2017-12-20T00:37:49Z

agent/tcs/client/client.go

@@ -221,6 +232,86 @@ func (cs *clientServer) metricsToPublishMetricRequests() ([]*ecstcs.PublishMetri
 	return requests, nil
 }

+// publishHelathMetrics send the container health information to backend
+func (cs *clientServer) publishHelathMetrics() {


nit: publishHealthMetrics()

sharanyad · 2017-12-20T00:45:27Z

agent/api/containerstatus.go

+	ContainerHealthUnknown ContainerHealthStatus = iota
+	// ContainerHealthy represents the container health check returned healthy
+	ContainerHealthy
+	// ContainerUnhealthy represents the container health check failed


This comment reads like the health check is Unknown. Should it be // ContainerUnhealthy represents the container health check when returned unhealthy ?

sharanyad · 2017-12-20T00:50:26Z

agent/app/agent_capability.go

@@ -99,6 +100,11 @@ func (agent *ecsAgent) capabilities() ([]*ecs.Attribute, error) {
 		capabilities = appendNameOnlyAttribute(capabilities, attributePrefix+"execution-role-ecr-pull")
 	}

+	if _, ok := supportedVersions[dockerclient.Version_1_24]; ok {


Should we also check for remote api 1.29 here?

Docker health check was available from 1.24 and afterwards. StartPeriod was added in 1.29. We only check the 1.24 here, and backend will check 1.29 capability if the StartPeriod is specified.

sharanyad · 2017-12-21T20:07:07Z

agent/stats/engine.go

+
+const (
+	// statsMetricsMap represents the map 'tasksToContainers'
+	statsMetricsMap statsEngineMapType = iota


Can this be statsTaskMetricsMap and the health one something like statsTaskHealthMetricsMap since we might have different health check types?

sharanyad · 2017-12-21T20:09:28Z

agent/stats/engine.go

+	}
+}
+
+// synchronize go through all the containers on the instance to synchronize the state on agent start


nit: synchronizeState goes..

aaithal

I mostly have minor comments.

aaithal · 2017-12-22T00:00:21Z

agent/api/containerstatus.go

@@ -121,7 +150,7 @@ func (cs *ContainerStatus) UnmarshalJSON(b []byte) error {
 	}
 	if b[0] != '"' || b[len(b)-1] != '"' {
 		*cs = ContainerStatusNone
-		return errors.New("ContainerStatus must be a string or null; Got " + string(b))
+		return errors.New("containerStatus must be a string or null; Got " + string(b))


Can you reformat this as: "container status unmarshal: status must be a string or null.."

aaithal · 2017-12-22T00:00:42Z

agent/api/containerstatus.go

@@ -137,7 +166,7 @@ func (cs *ContainerStatus) UnmarshalJSON(b []byte) error {
 	stat, ok := containerStatusMap[strStatus]
 	if !ok {
 		*cs = ContainerStatusNone
-		return errors.New("Unrecognized ContainerStatus")
+		return errors.New("unrecognized ContainerStatus")


Can you reformat this as: "container status unmarshal: unrecognized status"

aaithal · 2017-12-22T00:01:25Z

agent/api/containerstatus.go

+
+	if b[0] != '"' || b[len(b)-1] != '"' {
+		*healthStatus = ContainerHealthUnknown
+		return errors.New("containerHealthStatus must be a string or null; Got " + string(b))


Same here. Please reformat this as per go standard for error strings:
"container health status: ..."

aaithal · 2017-12-22T00:01:57Z

agent/api/containerstatus.go

+		return nil
+	}
+
+	return errors.New("unrecognized container health status: " + string(b))


Same here too: "container health status: unrecognized status..."
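
The reason for the lowercase, "context: detail" style is that error strings usually end up embedded in other messages. A short sketch, with a hypothetical `loadContainer` caller wrapping the error:

```go
package main

import (
	"errors"
	"fmt"
)

// errUnrecognizedStatus follows the convention: lowercase, no trailing
// punctuation, prefixed with the operation that failed.
var errUnrecognizedStatus = errors.New("container health status unmarshal: unrecognized status")

// loadContainer is a hypothetical caller that adds its own context.
func loadContainer(name string) error {
	return fmt.Errorf("load container %s: %w", name, errUnrecognizedStatus)
}

func main() {
	err := loadContainer("c1")
	// The wrapped message reads as one sentence because every layer is lowercase.
	fmt.Println(err)
	fmt.Println(errors.Is(err, errUnrecognizedStatus))
}
```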

aaithal · 2017-12-22T00:28:02Z

agent/engine/docker_task_engine.go


 	task, taskFound := engine.state.TaskByID(event.DockerID)
 	cont, containerFound := engine.state.ContainerByID(event.DockerID)
 	if !taskFound || !containerFound {
-		log.Debug("Event for container not managed", "dockerId", event.DockerID)
-		return false
+		seelog.Debugf("Event for container not managed, container: %s", event.DockerID)


can you also print the values of taskFound and containerFound in the log?

aaithal · 2017-12-22T00:47:01Z

agent/stats/engine.go

+func (engine *DockerStatsEngine) addToStatsContainerMapUnsafe(
+	taskARN, containerID string,
+	statsContainer *StatsContainer,
+	mapType statsEngineMapType) *StatsContainer {


as per offline conversation, this still seems a bit clunky to refer to a map that you have access from an enum. Can you consider using a function pointer here instead, which returns you the appropriate map to manipulate?

aaithal · 2017-12-22T00:50:21Z

agent/stats/engine.go

 func (engine *DockerStatsEngine) isIdle() bool {
 	engine.lock.RLock()
 	defer engine.lock.RUnlock()

 	return len(engine.tasksToContainers) == 0
 }

+func (engine *DockerStatsEngine) isHealthContainerIdle() bool {


can you rename this to containerHealthsToMonitor()? we can then do something like

if ok := engine.containerHealthsToMonitor(); ok {
	// we have containers whose healths are being monitored
} else {
	// no containers' healths are being monitored
}

It reads better than isHealthContainerIdle, which doesn't exactly convey the intent of the method.
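
A sketch of the renamed method under simplified assumptions: `statsEngine` is a hypothetical reduction of `DockerStatsEngine` with only the lock and the health-check map, and the map's value type is replaced by an empty struct.

```go
package main

import (
	"fmt"
	"sync"
)

// statsEngine is a hypothetical, reduced stand-in for DockerStatsEngine.
type statsEngine struct {
	lock                         sync.RWMutex
	tasksToHealthCheckContainers map[string]map[string]struct{}
}

// containerHealthsToMonitor reports whether any container's health is
// currently being watched. It takes the read lock itself, so callers must
// not already hold engine.lock.
func (engine *statsEngine) containerHealthsToMonitor() bool {
	engine.lock.RLock()
	defer engine.lock.RUnlock()

	return len(engine.tasksToHealthCheckContainers) != 0
}

func main() {
	engine := &statsEngine{
		tasksToHealthCheckContainers: make(map[string]map[string]struct{}),
	}
	fmt.Println(engine.containerHealthsToMonitor())

	engine.lock.Lock()
	engine.tasksToHealthCheckContainers["task1"] = map[string]struct{}{"c1": {}}
	engine.lock.Unlock()
	fmt.Println(engine.containerHealthsToMonitor())
}
```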

aaithal · 2017-12-22T00:51:54Z

agent/tcs/client/client.go

+const (
+	// tasksInMessage is the maximum number of tasks that can be sent in a message to the backend
+	// This is a very conservative estimate assuming max allowed string lengths for all fields.
+	tasksInMessage = 10


you can rename this to tasksInMetricMessage

aaithal · 2017-12-22T00:52:20Z

agent/tcs/client/client.go

+	if !cs.disableResourceMetrics {
+		go cs.publishMetrics()
+	}
+	go cs.publishHelathMetrics()


typo: publishHelathMetrics -> publishHealthMetrics

aaithal · 2017-12-22T00:53:13Z

agent/tcs/client/client.go

+	// This is a very conservative estimate assuming max allowed string lengths for all fields.
+	tasksInMessage = 10
+	// tasksInHealthMessage is the maximum number of tasks that can be sent in a message to the backend
+	tasksInHealthMessage = 10


please add a TODO for evaluating what this number should be

aaithal · 2018-01-02T20:49:26Z

agent/api/containerstatus.go

@@ -151,6 +180,43 @@ func (cs *ContainerStatus) MarshalJSON() ([]byte, error) {
 	return []byte(`"` + cs.String() + `"`), nil
 }

+// UnmarshalJSON overrides the logic for parsing the JSON-encoded container health data
+func (healthStatus *ContainerHealthStatus) UnmarshalJSON(b []byte) error {
+	if strings.ToLower(string(b)) == "null" {


nit: can you extract *healthStatus = ContainerHealthUnknown outside of these conditional blocks and just set it at the beginning of the method?

aaithal · 2018-01-02T20:50:00Z

agent/api/containerstatus.go

+		return errors.New("container health status unmarshal: status must be a string or null; Got " + string(b))
+	}
+	strStatus := string(b[1 : len(b)-1])
+	if strStatus == "UNKNOWN" {


all of these can be moved to a switch-case block

Please use switch for this.
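
Putting the review suggestions together (default set once at the top, a switch on the status string, lowercase error strings), the method could look roughly like the sketch below. The lowercase type and constant names are local stand-ins, the "HEALTHY" wire value is an assumption by analogy with "UNHEALTHY", and the extra length guard is an addition for inputs shorter than two bytes.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// containerHealthStatus is a local stand-in for ContainerHealthStatus.
type containerHealthStatus int

const (
	containerHealthUnknown containerHealthStatus = iota
	containerHealthy
	containerUnhealthy
)

// UnmarshalJSON parses a JSON-encoded container health status.
func (h *containerHealthStatus) UnmarshalJSON(b []byte) error {
	// Default once, so every early return leaves a well-defined value.
	*h = containerHealthUnknown

	if strings.ToLower(string(b)) == "null" {
		return nil
	}
	if len(b) < 2 || b[0] != '"' || b[len(b)-1] != '"' {
		return errors.New("container health status unmarshal: status must be a string or null; got " + string(b))
	}

	switch string(b[1 : len(b)-1]) {
	case "UNKNOWN":
		// The status is already set to containerHealthUnknown above.
	case "HEALTHY":
		*h = containerHealthy
	case "UNHEALTHY":
		*h = containerUnhealthy
	default:
		return errors.New("container health status unmarshal: unrecognized status: " + string(b))
	}
	return nil
}

func main() {
	var payload struct {
		Health containerHealthStatus `json:"health"`
	}
	err := json.Unmarshal([]byte(`{"health":"UNHEALTHY"}`), &payload)
	fmt.Println(payload.Health == containerUnhealthy, err)
}
```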

aaithal · 2018-01-02T21:03:47Z

agent/engine/docker_container_engine.go

@@ -45,6 +45,10 @@ const (
 	imageNameFormat = "%s:%s"
 	// the buffer size will ensure agent doesn't miss any event from docker
 	dockerEventBufferSize = 100
+	// healthCheckHealthy is the health status returned from docker container health check
+	healthCheckHealthy = "healthy"
+	// healthCheckUnhealthy is unhealth status returned from docker container health check


nit: 'unhealthy'

aaithal · 2018-01-02T21:10:40Z

agent/engine/docker_task_engine.go

+		if cont.Container.HealthStatusShouldBeReported() {
+			seelog.Debugf("Updating container health status: %s", event.DockerContainerMetadata.Health)
+			cont.Container.SetHealthStatus(event.DockerContainerMetadata.Health)
+			engine.saver.Save()


I don't think we need to persist this in the state file. In the event of a restart, inspecting the container should be sufficient to get the health status of the container. wdyt? imo, we can delete this line.

aaithal · 2018-01-02T21:15:09Z

agent/stats/engine.go

-	engine.lock.Lock()
-	defer engine.lock.Unlock()
-
+func (engine *DockerStatsEngine) addContainer(dockerID string) (*StatsContainer, error) {


please rename this to addContainerUnsafe

aaithal · 2018-01-02T21:16:03Z

agent/stats/engine.go

+	return statsContainer, nil
+}
+
+func (engine *DockerStatsEngine) containerMetricsMap() map[string]map[string]*StatsContainer {


please rename this as containerMetricsMapUnsafe

aaithal · 2018-01-02T21:16:16Z

agent/stats/engine.go

+func (engine *DockerStatsEngine) containerMetricsMap() map[string]map[string]*StatsContainer {
+	return engine.tasksToContainers
+}
+func (engine *DockerStatsEngine) healthCheckContainerMetricsMap() map[string]map[string]*StatsContainer {


please rename this as healthCheckContainerMetricsMapUnsafe

aaithal · 2018-01-02T21:18:28Z

agent/stats/engine.go

+func (engine *DockerStatsEngine) addToStatsContainerMapUnsafe(
+	taskARN, containerID string,
+	statsContainer *StatsContainer,
+	fun func() map[string]map[string]*StatsContainer) *StatsContainer {


fun is a very generic, non descriptive name. Please rename that to be something more meaningful

samuelkarp · 2018-01-02T23:29:21Z

agent/api/container.go

-	Version    *string `json:"version"`
+	// Version specifies the docker client API version to use
+	Version *string `json:"version"`
+	// HealthCheck is the configuration of docker health check


Is this another json-serialized object? Is this part of Config/HostConfig?

No, and it's not part of the Config/HostConfig from the payload message. It is sent as a string, and the agent unmarshals it into the docker-recognized struct in api/task.go

Why isn't it part of the Config from the payload message? It's being stuck in the Config anyway, so why do we need a separate unmarshal for it?

samuelkarp · 2018-01-02T23:33:56Z

agent/api/containerstatus.go

+		return errors.New("container health status unmarshal: status must be a string or null; Got " + string(b))
+	}
+	strStatus := string(b[1 : len(b)-1])
+	if strStatus == "UNKNOWN" {


Please use switch for this.

samuelkarp · 2018-01-02T23:37:34Z

agent/engine/docker_container_engine.go

+	logLength := len(dockerContainer.State.Health.Log)
+	if logLength != 0 {
+		// Only save the last log from the health check
+		health.Output = dockerContainer.State.Health.Log[logLength-1].Output


Why would you only have the last line? Also: I don't see any limit here to the length of the line. We should probably limit this in terms of bytes and maybe select the last N bytes instead of the last line.

samuelkarp · 2018-01-02T23:39:03Z

agent/functional_tests/tests/generated/simpletests_unix/simpletests_generated_unix_test.go

@@ -397,7 +397,7 @@ func TestInitProcessEnabled(t *testing.T) {
 	}
 	agent := RunAgent(t, nil)
 	defer agent.Cleanup()
-	agent.RequireVersion(">=1.14.5")
+	agent.RequireVersion(">=1.15.0")


Why did this change?

I believe this was changed to 1.14.5 unexpectedly here, so I changed it back.

@sharanyad Can you clarify whether that was intended?

@richardpen is right. According to the releases, 1.15.0 has the init support in agent. Not sure how the testing change got slipped in.

richardpen · 2018-01-03T00:05:09Z

@samuelkarp Not sure why github doesn't allow me to respond to your comment in place, so I put it here. We only keep the last line because the line should be the same in most cases for the failure and it should be enough for debugging purposes, and not saving the additional output saves memory, especially when the retry number is very big.
As for your suggestion to add a limit, I think that's a good idea, I'll make the change. Thanks.

samuelkarp · 2018-01-03T00:09:30Z

the line should be the same in most cases for the failure

I don't understand what you mean by that.

it should be enough for debugging purpose

This seems like a pretty big assumption, unless I'm misunderstanding who creates the output.

richardpen · 2018-01-03T00:44:20Z

@samuelkarp Sorry for the confusion: the last line isn't the last line of the output, it is the last entry of the outputs, which is the whole output of the latest health check command.

samuelkarp · 2018-01-03T00:45:29Z

@richardpen Thanks, that does help. I think that makes sense, the only thing now is to look at making sure we implement a size limit for the output.
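
A sketch of the agreed direction: take the latest health check log entry and cap its output at a fixed number of bytes, keeping the tail as suggested earlier in the thread. The `healthLogEntry` type, the helper names, and the 1024-byte limit are illustrative assumptions.

```go
package main

import "fmt"

// maxHealthCheckOutputLength is a hypothetical cap, in bytes, on the saved output.
const maxHealthCheckOutputLength = 1024

// healthLogEntry is a hypothetical stand-in for one docker health log entry.
type healthLogEntry struct {
	Output string
}

// lastNBytes returns at most the final n bytes of s. It cuts on a byte
// boundary, so a multi-byte UTF-8 character at the cut may be split.
func lastNBytes(s string, n int) string {
	if n <= 0 {
		return ""
	}
	if len(s) <= n {
		return s
	}
	return s[len(s)-n:]
}

// latestOutput returns the capped output of the most recent health check,
// or "" when no checks have run yet.
func latestOutput(log []healthLogEntry, limit int) string {
	if len(log) == 0 {
		return ""
	}
	return lastNBytes(log[len(log)-1].Output, limit)
}

func main() {
	log := []healthLogEntry{
		{Output: "first check: ok"},
		{Output: "second check: connection refused"},
	}
	fmt.Println(latestOutput(log, maxHealthCheckOutputLength))
	fmt.Println(latestOutput(log, 7))
}
```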

Regenerate the acs api to add new fields in the payload message to support container health check configuration.

Add the container health configuration into DockerConfig struct, and also add the function to pass it to docker when creating the container.

Add the container health event type as a type of event that the agent should handle, and also update the container health status during the periodic reconciliation of container metadata with docker.

Add the functionality to send container health metrics to backend

richardpen · 2018-01-05T01:29:27Z

@samuelkarp @aaithal I have addressed all the comments, please take another look.

aaithal

I've got some more comments. Apologies for not identifying these earlier.

aaithal · 2018-01-05T16:13:46Z

agent/api/task.go

 		if err != nil {
 			return nil, &DockerClientConfigError{"Unable decode given docker config: " + err.Error()}
 		}
 	}
+	if container.HealthCheckType == dockerHealthCheckType && config.Healthcheck == nil {
+		return nil, &DockerClientConfigError{"docker health check is nil while container health check type is DOCKER"}


nit: please split this into multiple lines for readability

aaithal · 2018-01-05T16:14:03Z

agent/api/task.go

 	}

 	resourceSplit := strings.SplitN(resource, arnResourceDelimiter, arnResourceSections)
 	if len(resourceSplit) != arnResourceSections {
-		return "", errors.New(fmt.Sprintf("task get-id: invalid task resource split: %s, expected=%d, actual=%d", resource, arnResourceSections, len(resourceSplit)))
+		return "", errors.Errorf("task get-id: invalid task resource split: %s, expected=%d, actual=%d", resource, arnResourceSections, len(resourceSplit))


nit: please split this into multiple lines for readability

aaithal · 2018-01-05T16:15:25Z

agent/engine/docker_container_engine.go

+	healthCheckHealthy = "healthy"
+	// healthCheckUnhealthy is unhealthy status returned from docker container health check
+	healthCheckUnhealthy = "unhealthy"
+	// maxHealthCheckOutputSize is the maximum size of healthcheck command output that agent will save


nit: please change this to "maximum length" instead of "size"

aaithal · 2018-01-05T16:24:09Z

agent/stats/engine.go

 	if err != nil {
 		return fmt.Errorf("Failed to subscribe to container change event stream, err %v", err)
 	}

-	go engine.listContainersAndStartEventHandler()
-	go engine.waitToStop()
+	engine.synchronizeState()


this returns an error. please make sure that you log a warning if an error is returned.

aaithal · 2018-01-05T16:24:44Z

agent/stats/engine.go

+	defer engine.lock.Unlock()
+	statsContainer, err := engine.addContainerUnsafe(containerID)
+	if err != nil {
+		seelog.Debugf("Adding container to stats watch list failed, err: %v", err)


please log container id here.

aaithal · 2018-01-05T16:32:15Z

agent/stats/engine.go

+func (engine *DockerStatsEngine) containerMetricsMapUnsafe() map[string]map[string]*StatsContainer {
+	return engine.tasksToContainers
+}
+func (engine *DockerStatsEngine) healthCheckContainerMetricsMapUnsafe() map[string]map[string]*StatsContainer {


please add a newline here and rename this to healthCheckContainerMapUnsafe

aaithal · 2018-01-05T16:37:55Z

agent/stats/engine.go

+func (engine *DockerStatsEngine) addToStatsContainerMapUnsafe(
+	taskARN, containerID string,
+	statsContainer *StatsContainer,
+	statsMapToUpdate func() map[string]map[string]*StatsContainer) *StatsContainer {


why is this method returning *StatsContainer? iiuc, you're just returning statsContainer here and that's not changing any state. It may make more sense to return a bool indicating if we should start watching this container or not based on its existence in the map.

I was trying to avoid an extra if check, but I can modify that if this makes the code easier to read.
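
For illustration, a sketch of the shape being discussed: the map to update is chosen through an accessor function, and the method returns a bool saying whether the container was newly added (and so should start being watched). The types are hypothetical reductions of the stats engine, and the caller is assumed to hold the engine lock, as the Unsafe suffix implies.

```go
package main

import "fmt"

// statsContainer and statsEngine are hypothetical, reduced stand-ins.
type statsContainer struct {
	dockerID string
}

type statsEngine struct {
	tasksToContainers            map[string]map[string]*statsContainer
	tasksToHealthCheckContainers map[string]map[string]*statsContainer
}

func newStatsEngine() *statsEngine {
	return &statsEngine{
		tasksToContainers:            make(map[string]map[string]*statsContainer),
		tasksToHealthCheckContainers: make(map[string]map[string]*statsContainer),
	}
}

func (engine *statsEngine) containerMetricsMapUnsafe() map[string]map[string]*statsContainer {
	return engine.tasksToContainers
}

func (engine *statsEngine) healthCheckContainerMapUnsafe() map[string]map[string]*statsContainer {
	return engine.tasksToHealthCheckContainers
}

// addToStatsContainerMapUnsafe adds the container to the map returned by
// statsMapToUpdate and reports true only if it was not already present.
func (engine *statsEngine) addToStatsContainerMapUnsafe(
	taskARN, containerID string,
	container *statsContainer,
	statsMapToUpdate func() map[string]map[string]*statsContainer) bool {

	taskMap := statsMapToUpdate()
	if _, ok := taskMap[taskARN]; !ok {
		taskMap[taskARN] = make(map[string]*statsContainer)
	}
	if _, exists := taskMap[taskARN][containerID]; exists {
		return false
	}
	taskMap[taskARN][containerID] = container
	return true
}

func main() {
	engine := newStatsEngine()
	c := &statsContainer{dockerID: "c1"}

	fmt.Println(engine.addToStatsContainerMapUnsafe("task1", "c1", c, engine.healthCheckContainerMapUnsafe))
	fmt.Println(engine.addToStatsContainerMapUnsafe("task1", "c1", c, engine.healthCheckContainerMapUnsafe))
	fmt.Println(len(engine.tasksToContainers))
}
```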

aaithal · 2018-01-05T16:40:17Z

agent/tcs/client/client.go

+	// due to a connection reset.
+	err := cs.publishHealthMetricsOnce()
+	if err != nil {
+		seelog.Warnf("Error publishing health metrics: %v", err)


nit: "Unable to publish health metrics" would be a better log message

aaithal · 2018-01-05T16:40:23Z

agent/tcs/client/client.go

+		case <-cs.publishTicker.C:
+			err := cs.publishHealthMetricsOnce()
+			if err != nil {
+				seelog.Warnf("Error publishing health metrics: %v", err)


nit: "Unable to publish health metrics" would be a better log message

aaithal · 2018-01-05T16:41:47Z

agent/tcs/handler/handler.go

+// the ack each time it processes a health message
+func ackPublishHealthMetricHandler(timer *time.Timer) func(*ecstcs.AckPublishHealth) {
+	return func(*ecstcs.AckPublishHealth) {
+		seelog.Debug("Received ACKPublishHealth from tcs")


Can you check if the message id can be logged here? That'd improve the message tracking if we need to debug anything with lossy messages in the future.

The message ID isn't available alone here, we can only log the whole message.

oh, never mind then

adnxn · 2018-01-05T18:00:13Z

agent/api/container.go

+	// EntryPoint is entrypoint of the container, corresponding to docker option: --entrypoint
+	EntryPoint *[]string
+	// Environment is the environment variable set in the container
+	Environment map[string]string `json:"environment"`


nit: maybe we should also document context for what the key:value represents in this case

I think this is pretty straightforward, as it's like setting any environment variable. Not sure what to put here. I'll leave it as is for now, unless you have a better suggestion for what to put here.

adnxn · 2018-01-05T18:33:50Z

agent/app/agent.go

-		go tcshandler.StartMetricsSession(telemetrySessionParams)
-	}
+	// Start metrics session in a go routine
+	go tcshandler.StartMetricsSession(telemetrySessionParams)


please address this. not sure if it was addressed offline though.

adnxn · 2018-01-05T19:05:44Z

agent/stats/engine.go

-	engine.lock.Lock()
-	defer engine.lock.Unlock()
-
+// addContainerUnsafe adds a container to the map of containers being watched.


maybe i missed this. why are we getting rid of the locks here and using *Unsafe?

No, the lock was just moved to line 138.

oops not sure how i missed that. thanks.

aaithal

Apart from the comments, the one thing that's also missing is the integration test in 'stats' package to test the container health status tracking code path. Can you please add that as well?

aaithal · 2018-01-08T22:11:00Z

agent/api/containerstatus.go

+
+	strStatus := string(b[1 : len(b)-1])
+	switch strStatus {
+	case "UNKNOWN":


nit: please add a comment here stating that the status is already set to ContainerHealthUnknown.
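
A minimal sketch of how such an `UnmarshalJSON` might read with the requested comment in place. Only `ContainerUnhealthy` and `ContainerHealthUnknown` appear in this review; the type name `ContainerHealthStatus`, the `ContainerHealthy` constant, the `"HEALTHY"` string, and the error messages are assumptions made for the example.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// ContainerHealthStatus is an assumed name for the health status enum.
type ContainerHealthStatus int32

const (
	ContainerHealthUnknown ContainerHealthStatus = iota
	ContainerHealthy
	ContainerUnhealthy
)

// UnmarshalJSON parses a quoted status string into the enum value.
func (s *ContainerHealthStatus) UnmarshalJSON(b []byte) error {
	*s = ContainerHealthUnknown
	if string(b) == "null" {
		return nil
	}
	if len(b) < 2 || b[0] != '"' || b[len(b)-1] != '"' {
		return errors.New("health status must be a JSON string or null, got " + string(b))
	}
	switch string(b[1 : len(b)-1]) {
	case "UNKNOWN":
		// Nothing to do: the status is already set to ContainerHealthUnknown above.
	case "HEALTHY":
		*s = ContainerHealthy
	case "UNHEALTHY":
		*s = ContainerUnhealthy
	default:
		return errors.New("unrecognized health status: " + string(b))
	}
	return nil
}

func main() {
	var payload struct {
		Health ContainerHealthStatus `json:"health"`
	}
	err := json.Unmarshal([]byte(`{"health":"UNHEALTHY"}`), &payload)
	fmt.Println(payload.Health == ContainerUnhealthy, err)
}
```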

aaithal · 2018-01-08T22:35:12Z

agent/engine/docker_task_engine.go

+	// no need to process this in task manager
+	if event.Type == api.ContainerHealthEvent {
+		if cont.Container.HealthStatusShouldBeReported() {
+			seelog.Debugf("Updating container health status: %v", event.DockerContainerMetadata.Health)


please log container name/id and task arn here as well

aaithal · 2018-01-08T22:37:01Z

agent/stats/engine.go

+		seelog.Debugf("Adding container to stats health check watch list, id: %s, task: %s", dockerID, task.Arn)
+	}
+
+	if !shouldCollectStats {


can you move this to line 238, before invoking ResolveContainer?

Lines 238-243 should always be executed to collect the container health status, so moving this check to line 238 would skip that part whenever shouldCollectStats is false.

Ah, I misunderstood. Can you please rename shouldCollectStats to untrackedStatsContainer or something similar? shouldCollectStats is somewhat misleading in this context.

richardpen · 2018-01-08T23:23:29Z

@aaithal I will add the integration test in a separate PR, please take another look, thanks.

aaithal

Please address a couple of minor comments I have before merging this.

aaithal · 2018-01-09T00:04:20Z

agent/stats/engine.go

+		seelog.Debugf("Adding container to stats health check watch list, id: %s, task: %s", dockerID, task.Arn)
+	}
+
+	if !statsContainerTracked {


this still doesn't read right. if stats container is not tracked, we should start tracking it. However, that's not what this variable is conveying. Can you rename this to be watchStatsContainer?

aaithal · 2018-01-09T00:05:12Z

agent/stats/engine.go

+
+// addToStatsContainerMapUnsafe adds the statscontainer into stats for tracking and returns whether
+// stats should start the statscontainer to collect metrics
+func (engine *DockerStatsEngine) addToStatsContainerMapUnsafe(


please update the documentation with a description of what's returned from this method.

damiencarol · 2018-01-13T13:39:51Z

Any news on this one?

richardpen · 2018-01-15T18:49:34Z

@damiencarol Thanks for tracking this. We are still actively working on it. I'm going to merge this PR, but we still need to add some tests; please track #534 for updates.

richardpen mentioned this pull request Dec 12, 2017

Feature Request: Support for Docker Health Checks when bumping a task definition revision #534

Closed

richardpen force-pushed the chc branch 2 times, most recently from 98e41a5 to 989f825 Compare December 12, 2017 22:34

richardpen added this to the 1.18.0 milestone Dec 13, 2017

aaithal suggested changes Dec 13, 2017

aaithal reviewed Dec 13, 2017

richardpen force-pushed the chc branch from 989f825 to e66a1e0 Compare December 15, 2017 18:32

richardpen changed the title ~~[WIP] Enable container health check~~ Enable container health check Dec 15, 2017

aaithal reviewed Dec 15, 2017

richardpen force-pushed the chc branch from f53136d to 1febc14 Compare December 18, 2017 23:00

adnxn reviewed Dec 18, 2017

richardpen force-pushed the chc branch 3 times, most recently from 44f126e to 8630a7f Compare December 20, 2017 21:37

sharanyad reviewed Dec 21, 2017

aaithal reviewed Dec 22, 2017

richardpen force-pushed the chc branch from 8630a7f to dfdc730 Compare December 22, 2017 21:02

aaithal reviewed Jan 2, 2018

samuelkarp reviewed Jan 2, 2018

Peng Yin added 7 commits January 4, 2018 19:38

acs: update acs api for container health

fb51481

Regenerate the acs api to add new fields in the payload message to support container health check configuration.

ecs: update ecs client for container health check

c27fc9f

tcs: update tcs api to support container health metrics

43bfaa7

api: add container health configuration fields

b6ca690

Add the container health configuration to the DockerConfig struct, and add the function that passes it to Docker when creating the container.

capability: add container health check capability

c946b3f

engine: track container health status in docker events

1dafb52

Add the container health event type as a type of event that the agent should handle, and update the container health status during the periodic reconciliation of container metadata with Docker.

tcs: send container health metrics

96d2d62

Add the functionality to send container health metrics to backend

Peng Yin added 3 commits January 4, 2018 19:38

stats: add function to collect container health metrics

45f88a5

lint: modify the variable name and add comments

ceb2b96

test: add test to cover container health change

f378932

richardpen changed the base branch from dev to container-health-check January 5, 2018 00:55

richardpen force-pushed the chc branch from dfdc730 to 3407261 Compare January 5, 2018 01:20

aaithal reviewed Jan 5, 2018

richardpen force-pushed the chc branch 2 times, most recently from 4d56802 to 186876c Compare January 5, 2018 19:07

adnxn reviewed Jan 5, 2018

aaithal reviewed Jan 8, 2018

richardpen force-pushed the chc branch from 186876c to 9b5523b Compare January 8, 2018 23:20

aaithal approved these changes Jan 9, 2018

acs: move the health check config into docker config

131d69e

richardpen force-pushed the chc branch from 9b5523b to 131d69e Compare January 9, 2018 00:12

richardpen merged commit b7d1784 into aws:container-health-check Jan 15, 2018

aaithal modified the milestones: 1.18.0, 1.17.0 Jan 17, 2018

richardpen deleted the chc branch January 30, 2018 18:58
