Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable container health check #1141

Merged
merged 11 commits into from Jan 15, 2018
Merged

Conversation

richardpen
Copy link

@richardpen richardpen commented Dec 12, 2017

Summary

Enable docker container health check in the agent and report container health metrics to backend

Implementation details

Testing

  • Builds on Linux (make release)
  • Builds on Windows (go build -out amazon-ecs-agent.exe ./agent)
  • Unit tests on Linux (make test) pass
  • Unit tests on Windows (go test -timeout=25s ./agent/...) pass
  • Integration tests on Linux (make run-integ-tests) pass
  • Integration tests on Windows (.\scripts\run-integ-tests.ps1) pass
  • Functional tests on Linux (make run-functional-tests) pass
  • Functional tests on Windows (.\scripts\run-functional-tests.ps1) pass

New tests cover the changes:

Description for the changelog

Licensing

This contribution is under the terms of the Apache 2.0 License:

Copy link
Member

@aaithal aaithal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still need to review changes to stats engine and the tcs handler. Publishing first list of comments so that you have an initial set of changes to work with. Can you also create this PR against a new "container-health" branch in the repo?

case ContainerUnhealthy:
return "UNHEALTHY"
default:
return "UNKNOWN"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. When do you expect this to be the case?
  2. Is this a valid backend status? That is, what will the health of the container/task be if we report it as "UNKNOWN"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please address this comment.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When do you expect this to be the case?

When the docker inspect api failed(eg, timeout), agent wasn't able to get the health status of the container.

Is this a valid backend status? That is, what will the health of the container/task be if we report it as "UNKNOWN"?

Yes, backend expect 'UNKNOWN' when agent can't get container health status. In that case the container health status will be reported as unknown.

package api

// DockerEventsType represents the type of docker events
type DockerEventsType int
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can you please rename this to DockerEventType ?

if container.HealthCheckType == dockerHealthCheckType {
// configure the docker health check config if it's set
healthConfig := &docker.HealthConfig{}
err := json.Unmarshal([]byte(*container.DockerConfig.HealthCheck), healthConfig)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use aws.StringValue() here any everywhere else where you're dereferencing pointers directly.

Name: "c1",
HealthCheckType: dockerHealthCheckType,
DockerConfig: DockerConfig{
HealthCheck: aws.String(`{"Test":["command"],"Interval":5000000000,"Timeout":4000000000,"StartPeriod":60000000000,"Retries":5}`),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: Can you please split this into a multi-line string to make it more readable?

@@ -99,6 +100,11 @@ func (agent *ecsAgent) capabilities() ([]*ecs.Attribute, error) {
capabilities = appendNameOnlyAttribute(capabilities, attributePrefix+"execution-role-ecr-pull")
}

if _, ok := supportedVersions[dockerclient.Version_1_29]; ok {
// StartPeriod was added in API 1.29
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. minor: make this more explanatory. "Docker health check start period was added in .."
  2. Also, we it feels like we're ignoring a large number of container instances which are running docker version >= 1.12 and <= 17.05. I think we should instead check if version >= 1.24

@@ -802,6 +822,10 @@ func (dg *dockerGoClient) ContainerEvents(ctx context.Context) (<-chan DockerCon
// mean the container dies (non-init processes). If the container also
// dies, you see a "die" status as well; we'll update suitably there
continue
case "health_status: healthy":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just double checking here that the the event's Status is really "health_status: healthy" and this is not a typo/bug. Should this be be just "health_status"?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, the event's status is "health_status: healthy"

@@ -23,7 +23,7 @@ const (
containerTypeEvent = "container"
)

var containerEvents = []string{"create", "start", "stop", "die", "restart", "oom"}
var containerEvents = []string{"create", "start", "stop", "die", "restart", "oom", "health_status: unhealthy", "health_status: healthy"}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: please split this into multiple lines

go tcshandler.StartMetricsSession(telemetrySessionParams)
}
// Start metrics session in a go routine
go tcshandler.StartMetricsSession(telemetrySessionParams)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is problematic because today, disabling metrics also means disabling the metrics collection engine that gets initialized (stats engine's init) from tcs handler. Customers might choose to do that because they want lighter cpu/memory footprint for the ECS Agent. If agent.cfg.DisableMetrics is false, we should disable both metrics collection and reporting. I think with this change, we're only disabling the reporting pat. We should also disable metrics collection

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please address this. not sure if it was addressed offline though.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has already been addressed. Disable the metrics won't affect the health check part.

engine.processTasks.RLock()
managedTask, ok := engine.managedTasks[task.Arn]
// hold the lock until the message is sent so we don't send on a closed channel
defer engine.processTasks.RUnlock()
if !ok {
log.Crit("Could not find managed task corresponding to a docker event", "event", event, "task", task)
return true
seelog.Criticalf("Could not find managed task corresponding to a docker event, event: %s, task: %s", event, task)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need the whole task here (or in other log lines you've changed). Can you use task.Arn instead?

@@ -63,6 +65,8 @@ type DockerContainerMetadata struct {
StartedAt time.Time
// FinishedAt is the timestamp of container stop
FinishedAt time.Time
//Health contains the result of a container health check
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lint: // Health

func (engine *DockerStatsEngine) isIdle() bool {
engine.lock.RLock()
defer engine.lock.RUnlock()

return len(engine.tasksToContainers) == 0
}

func (engine *DockerStatsEngine) getTaskHealth(taskARN string) *ecstcs.TaskHealth {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should have the 'Unsafe' suffix.

if taskHealth == nil {
continue
}
taskHealths = append(taskHealths, engine.getTaskHealth(taskARN))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can just usetaskHealth here, instead of an addition getTaskHealth call

// copyHealthMetadata performs a deep copy of HealthMetadata object
func copyHealthMetadata(metadata *ecstcs.HealthMetadata, fin bool) *ecstcs.HealthMetadata {
return &ecstcs.HealthMetadata{
Cluster: aws.String(*metadata.Cluster),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can just do Cluster: metadata.Cluster here (same for other string fields in this block.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, metadata.xx is a pointer, but we want a deep copy here, so that the actual value wasn't changed elsewhere.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, missed that. please use aws.StringValue here as well. Avoid doing pointer dereferencing as much as possible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still an outstanding item

// create a request if the number of task reaches the maximum
if (i+1)%tasksInMessage == 0 {
requestMetadata := copyHealthMetadata(metadata, (i+1) == numOfTasks)
requests = append(requests, ecstcs.NewPublishHealthMetricsRequest(requestMetadata, copyTaskHealthMetrics(taskHealths)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please split this into multiple lines so that it's easier to read

numOfTasks := len(taskHealthMetrics)
for i, taskHealth := range taskHealthMetrics {
taskHealths = append(taskHealths, taskHealth)
// create a request if the number of task reaches the maximum
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: ".. reaches the maximum page size"

@richardpen richardpen changed the title [WIP] Enable container health check Enable container health check Dec 15, 2017
Config *string `json:"config"`
HostConfig *string `json:"hostConfig"`
Version *string `json:"version"`
Config *string `json:"config"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: this is a good opportunity to document these fields

@@ -65,11 +81,12 @@ type Container struct {
Overrides ContainerOverrides `json:"overrides"`
DockerConfig DockerConfig `json:"dockerConfig"`
RegistryAuthentication *RegistryAuthenticationData `json:"registryAuthentication"`

HealthCheckType string `json:"healthCheckType,omitempty"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: // HealthCheckType is ..

copyHealth := c.Health

if c.Health.Since != nil {
copyHealth.Since = aws.Time(*c.Health.Since)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please avoid doing pointer dereferencing this way if possible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aaithal: why exactly do you advise against this?


const (
// ContainerStatusEvent represents the container status change events from docker
ContainerStatusEvent DockerEventType = iota
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also document which events fall into this bucket here?

const (
// ContainerStatusEvent represents the container status change events from docker
ContainerStatusEvent DockerEventType = iota
// ContainerHealthEvent represents the container health status event from docker
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above. Please document all the event types that fall into this bucket.

}

if len(taskHealths) == 0 {
return nil, nil, EmptyMetricsError
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please return a different error here. EmptyMetricsError is not the same as EmptyHealthStatusError for example

@@ -445,6 +563,17 @@ func (engine *DockerStatsEngine) doRemoveContainerUnsafe(container *StatsContain
delete(engine.tasksToDefinitions, taskArn)
seelog.Debugf("Deleted task from tasks, arn: %s", taskArn)
}

// Remove the container from health container watch list
if _, ok := engine.tasksToHealthCheckContainers[taskArn][dockerID]; !ok {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need this. delete is a no-op for non-existentkeys

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check makes sure the log below is correct, otherwise the log message will be misleading.

@@ -38,7 +38,7 @@ type UsageStats struct {

// ContainerMetadata contains meta-data information for a container.
type ContainerMetadata struct {
DockerID string `json:"-"`
DockerID string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did you change this?

for i, taskHealth := range taskHealthMetrics {
taskHealths = append(taskHealths, taskHealth)
// create a request if the number of task reaches the maximum page size
if (i+1)%tasksInMessage == 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please define a new const for tasksInHealthMessage and rename tasksInMessage to tasksInMetricsMessage. Also add a TODO for determining the proper value for tasksInHealthMessage. Because of changes in payload structure, we'd have differing values for these I think

// copyHealthMetadata performs a deep copy of HealthMetadata object
func copyHealthMetadata(metadata *ecstcs.HealthMetadata, fin bool) *ecstcs.HealthMetadata {
return &ecstcs.HealthMetadata{
Cluster: aws.String(*metadata.Cluster),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still an outstanding item

Copy link
Contributor

@adnxn adnxn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

took a first pass, ill take another look soon.

@@ -32,15 +33,30 @@ const (
// that specifies that the log driver should be authenticated using the
// execution role
awslogsAuthExecutionRole = "ExecutionRole"

dockerHealthCheckType = "DOCKER"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: docs for dockerHealthCheckType

copyHealth := c.Health

if c.Health.Since != nil {
copyHealth.Since = aws.Time(*c.Health.Since)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aaithal: why exactly do you advise against this?

assert.Equal(t, health.Output, "test")
assert.NotEmpty(t, health.Since)

// set the health status again shouldn't update the timestamp
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so i'm clear, the timestamp shouldn't change cause the Status hasn't changed ya?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, the timestamps means the timestamp when the status changed.

case ContainerHealthEvent:
return "ContainerHealthChangeEvent"
default:
return "UNKNOWN"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should make this something similar to "DockerEventType: UNKNOWN"?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a second thought, I think UNKNOWN should be enough, as in logs we will log like "event: UNKNOWN" which will give you the context.

return nil
}

return errors.New("Unrecognized container health status: " + string(b))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't follow error string format convention

@aaithal do you mean the string should start with a lowercase?

logLength := len(dockerContainer.State.Health.Log)
if logLength != 0 {
// Only save the last log from the health check
health.Output = dockerContainer.State.Health.Log[logLength-1].Output
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this only the last line from the health check log? will this have enough context? also - is this the only way to get visibility into why the health check command may have failed?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this only the last line from the health check log? will this have enough context?

Yes, since the log is the health check command output, and if the command failed the output should be the same. So I think the last line is enough.

is this the only way to get visibility into why the health check command may have failed?

You can also get the failure reason from exitcode. the output is the health check command output, which is more understandable than the exitcode.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would you only have the last line? Also: I don't see any limit here to the length of the line. We should probably limit this in terms of bytes and maybe select the last N bytes instead of the last line.


managedTask.dockerMessages <- dockerContainerChange{container: cont.Container, event: event}
log.Debug("Wrote docker event to the associated task", "task", task, "event", event)
return true
seelog.Debugf("Wrote docker event to the associated task: %s, event: %s", task.Arn, event)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: are we sure we need both Writing/Wrote debug lines here? this may add a lot of noise to our debug logs and they're already pretty noisy.

Copy link
Contributor

@sharanyad sharanyad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial pass. some nits.

// HealthCheckType is the mechnism to use for the container health check
// currently it only supports 'DOCKER'
HealthCheckType string `json:"healthCheckType,omitempty"`
// Health contains the health check information of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit : health check information of ... incomplete?

@@ -221,6 +232,86 @@ func (cs *clientServer) metricsToPublishMetricRequests() ([]*ecstcs.PublishMetri
return requests, nil
}

// publishHelathMetrics send the container health information to backend
func (cs *clientServer) publishHelathMetrics() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: publishHealthMetrics()

ContainerHealthUnknown ContainerHealthStatus = iota
// ContainerHealthy represents the container health check returned healthy
ContainerHealthy
// ContainerUnhealthy represents the container health check failed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment reads like the health check is Unknown. Should it be // ContainerUnhealthy represents the container health check when returned unhealthy ?

@@ -99,6 +100,11 @@ func (agent *ecsAgent) capabilities() ([]*ecs.Attribute, error) {
capabilities = appendNameOnlyAttribute(capabilities, attributePrefix+"execution-role-ecr-pull")
}

if _, ok := supportedVersions[dockerclient.Version_1_24]; ok {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also check for remote api 1.29 here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docker health check was available from 1.24 and afterwards. StartPeriod was added in 1.29. We only check the 1.24 here, and backend will check 1.29 capability if the StartPeriod is specified.


const (
// statsMetricsMap represents the map 'tasksToContainers'
statsMetricsMap statsEngineMapType = iota
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be statsTaskMetricsMap and the health one something like statsTaskHealthMetricsMap since we might have different health check types?

}
}

// synchronize go through all the containers on the instance to synchronize the state on agent start
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: synchronizeState goes..

Copy link
Member

@aaithal aaithal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mostly have minor comments.

@@ -121,7 +150,7 @@ func (cs *ContainerStatus) UnmarshalJSON(b []byte) error {
}
if b[0] != '"' || b[len(b)-1] != '"' {
*cs = ContainerStatusNone
return errors.New("ContainerStatus must be a string or null; Got " + string(b))
return errors.New("containerStatus must be a string or null; Got " + string(b))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you reformat this as: "container status unmarshal: status must be a string or null.."

@@ -137,7 +166,7 @@ func (cs *ContainerStatus) UnmarshalJSON(b []byte) error {
stat, ok := containerStatusMap[strStatus]
if !ok {
*cs = ContainerStatusNone
return errors.New("Unrecognized ContainerStatus")
return errors.New("unrecognized ContainerStatus")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you reformat this as: "container status unmarshal: unrecognized status"


if b[0] != '"' || b[len(b)-1] != '"' {
*healthStatus = ContainerHealthUnknown
return errors.New("containerHealthStatus must be a string or null; Got " + string(b))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here. Please reformat this as per go standard for error strings:
"container health status: ..."

return nil
}

return errors.New("unrecognized container health status: " + string(b))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here too: "container health status: unrecognized status..."


task, taskFound := engine.state.TaskByID(event.DockerID)
cont, containerFound := engine.state.ContainerByID(event.DockerID)
if !taskFound || !containerFound {
log.Debug("Event for container not managed", "dockerId", event.DockerID)
return false
seelog.Debugf("Event for container not managed, container: %s", event.DockerID)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also print the values of taskFound and containerFound in the log?

func (engine *DockerStatsEngine) addToStatsContainerMapUnsafe(
taskARN, containerID string,
statsContainer *StatsContainer,
mapType statsEngineMapType) *StatsContainer {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as per offline conversation, this still seems a bit clunky to refer to a map that you have access from an enum. Can you consider using a function pointer here instead, which returns you the appropriate map to manipulate?

func (engine *DockerStatsEngine) isIdle() bool {
engine.lock.RLock()
defer engine.lock.RUnlock()

return len(engine.tasksToContainers) == 0
}

func (engine *DockerStatsEngine) isHealthContainerIdle() bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you rename this to containerHealthsToMonitor()? we can then do something like

if ok:=engine.containerHealthsToMonitor(); ok {
 // we have containers whose healths are being monitored
} else {
 // no containers' healths are being monitored
}

It reads better than isHealthContainerIdle, which doesn't exactly convey the intent of the method.

const (
// tasksInMessage is the maximum number of tasks that can be sent in a message to the backend
// This is a very conservative estimate assuming max allowed string lengths for all fields.
tasksInMessage = 10
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can rename this to tasksInMetricMessage

if !cs.disableResourceMetrics {
go cs.publishMetrics()
}
go cs.publishHelathMetrics()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: publishHelathMetrics -> publishHealthMetrics

// This is a very conservative estimate assuming max allowed string lengths for all fields.
tasksInMessage = 10
// tasksInHealthMessage is the maximum number of tasks that can be sent in a message to the backend
tasksInHealthMessage = 10
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add a TODO for evaluating what this number should be

@@ -151,6 +180,43 @@ func (cs *ContainerStatus) MarshalJSON() ([]byte, error) {
return []byte(`"` + cs.String() + `"`), nil
}

// UnmarshalJSON overrides the logic for parsing the JSON-encoded container health data
func (healthStatus *ContainerHealthStatus) UnmarshalJSON(b []byte) error {
if strings.ToLower(string(b)) == "null" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can you extract *healthStatus = ContainerHealthUnknown outside of these conditional blocks and just set it at the beginning of the method?

return errors.New("container health status unmarshal: status must be a string or null; Got " + string(b))
}
strStatus := string(b[1 : len(b)-1])
if strStatus == "UNKNOWN" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all of these can be moved to a switch-case block

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use switch for this.

@@ -45,6 +45,10 @@ const (
imageNameFormat = "%s:%s"
// the buffer size will ensure agent doesn't miss any event from docker
dockerEventBufferSize = 100
// healthCheckHealthy is the health status returned from docker container health check
healthCheckHealthy = "healthy"
// healthCheckUnhealthy is unhealth status returned from docker container health check
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: 'unhealthy'

if cont.Container.HealthStatusShouldBeReported() {
seelog.Debugf("Updating container health status: %s", event.DockerContainerMetadata.Health)
cont.Container.SetHealthStatus(event.DockerContainerMetadata.Health)
engine.saver.Save()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to persist this in the state file. In the event of a restart, inspecting the container should be sufficient to get the health status of the container. wdyt? imo, we can delete this line.

engine.lock.Lock()
defer engine.lock.Unlock()

func (engine *DockerStatsEngine) addContainer(dockerID string) (*StatsContainer, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please rename this to addContainerUnsafe

return statsContainer, nil
}

func (engine *DockerStatsEngine) containerMetricsMap() map[string]map[string]*StatsContainer {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please rename this as containerMetricsMapUnsafe

func (engine *DockerStatsEngine) containerMetricsMap() map[string]map[string]*StatsContainer {
return engine.tasksToContainers
}
func (engine *DockerStatsEngine) healthCheckContainerMetricsMap() map[string]map[string]*StatsContainer {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please rename this as healthCheckContainerMetricsMapUnsafe

func (engine *DockerStatsEngine) addToStatsContainerMapUnsafe(
taskARN, containerID string,
statsContainer *StatsContainer,
fun func() map[string]map[string]*StatsContainer) *StatsContainer {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fun is a very generic, non descriptive name. Please rename that to be something more meaningful

Version *string `json:"version"`
// Version specifies the docker client API version to use
Version *string `json:"version"`
// HealthCheck is the configuration of docker health check
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this another json-serialized object? Is this part of Config/HostConfig?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No and it's not part of the Config/HostConfig from the payload message. It was sent as a string and agent will unmarshal it into the docker recognized struct api/task.go

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why isn't it part of the Config from the payload message? It's being stuck in the Config anyway, so why do we need a separate unmarshal for it?

return errors.New("container health status unmarshal: status must be a string or null; Got " + string(b))
}
strStatus := string(b[1 : len(b)-1])
if strStatus == "UNKNOWN" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use switch for this.

logLength := len(dockerContainer.State.Health.Log)
if logLength != 0 {
// Only save the last log from the health check
health.Output = dockerContainer.State.Health.Log[logLength-1].Output
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would you only have the last line? Also: I don't see any limit here to the length of the line. We should probably limit this in terms of bytes and maybe select the last N bytes instead of the last line.

@@ -397,7 +397,7 @@ func TestInitProcessEnabled(t *testing.T) {
}
agent := RunAgent(t, nil)
defer agent.Cleanup()
agent.RequireVersion(">=1.14.5")
agent.RequireVersion(">=1.15.0")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did this change?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this was changed to 1.14.5 unexpectedly here, so I changed it back.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sharanyad Can you clarify whether that was intended?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@richardpen is right. According to the releases, 1.15.0 has the init support in agent. Not sure how the testing change got slipped in.

@richardpen
Copy link
Author

@samuelkarp Not sure why github doesn't allow me to response your comment in place, so I put it here. We only have the last line because the line should be the same in most cases for the failure and it should be enough for debugging purpose, and it saves memory to save the additional output especially when the retry number is very big.
For your suggestion to add limit, I think that's a good idea, I'll make the change. Thanks.

@samuelkarp
Copy link
Contributor

the line should be the same in most cases for the failure

I don't understand what you mean by that.

it should be enough for debugging purpose

This seems like a pretty big assumption, unless I'm misunderstanding who creates the output.

@richardpen
Copy link
Author

@samuelkarp Sorry for the confusing, the last line isn't the last line of the output, it is the last entry of the outputs which is the whole output of the latest health check command.

@samuelkarp
Copy link
Contributor

@richardpen Thanks, that does help. I think that makes sense, the only thing now is to look at making sure we implement a size limit for the output.

Peng Yin added 7 commits January 4, 2018 19:38
Regenerate the acs api to add new fields in the payload message to
support container health check configuration.
Add the container health configuration into DockerConfig struct, and
also add the function to pass it to docker when create the container.
Add the container health event type as a type of event that agent should
handle, also update the container health status during the
periodically conciliation of container metadata with docker.
Add the functionality to send container health metrics to backend
@richardpen richardpen changed the base branch from dev to container-health-check January 5, 2018 00:55
@richardpen
Copy link
Author

@samuelkarp @aaithal I have addressed all the comments, please take another look.

Copy link
Member

@aaithal aaithal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've got some more comments. Apologies for not identifying these earlier.

if err != nil {
return nil, &DockerClientConfigError{"Unable decode given docker config: " + err.Error()}
}
}
if container.HealthCheckType == dockerHealthCheckType && config.Healthcheck == nil {
return nil, &DockerClientConfigError{"docker health check is nil while container health check type is DOCKER"}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: please split this into multiple lines for readability

}

resourceSplit := strings.SplitN(resource, arnResourceDelimiter, arnResourceSections)
if len(resourceSplit) != arnResourceSections {
return "", errors.New(fmt.Sprintf("task get-id: invalid task resource split: %s, expected=%d, actual=%d", resource, arnResourceSections, len(resourceSplit)))
return "", errors.Errorf("task get-id: invalid task resource split: %s, expected=%d, actual=%d", resource, arnResourceSections, len(resourceSplit))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: please split this into multiple lines for readability

healthCheckHealthy = "healthy"
// healthCheckUnhealthy is unhealthy status returned from docker container health check
healthCheckUnhealthy = "unhealthy"
// maxHealthCheckOutputSize is the maximum size of healthcheck command output that agent will save
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: please change this to "maximum length" instead of "size"

if err != nil {
return fmt.Errorf("Failed to subscribe to container change event stream, err %v", err)
}

go engine.listContainersAndStartEventHandler()
go engine.waitToStop()
engine.synchronizeState()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this returns an error. please make sure that you log a warning if an error is returned.

defer engine.lock.Unlock()
statsContainer, err := engine.addContainerUnsafe(containerID)
if err != nil {
seelog.Debugf("Adding container to stats watch list failed, err: %v", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please log container id here.

func (engine *DockerStatsEngine) containerMetricsMapUnsafe() map[string]map[string]*StatsContainer {
return engine.tasksToContainers
}
func (engine *DockerStatsEngine) healthCheckContainerMetricsMapUnsafe() map[string]map[string]*StatsContainer {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add a newline here and rename this to healthCheckContainerMapUnsafe

func (engine *DockerStatsEngine) addToStatsContainerMapUnsafe(
taskARN, containerID string,
statsContainer *StatsContainer,
statsMapToUpdate func() map[string]map[string]*StatsContainer) *StatsContainer {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this method returning *StatsContainer? iiuc, you're just returning statsContainer here and that's not changing any state. It may make more sense to return a bool indicating if we should start watching this container or not based on its existence in the map.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to avoid an extra if check, but I can modify that if this makes the code easier to read.

// due to a connection reset.
err := cs.publishHealthMetricsOnce()
if err != nil {
seelog.Warnf("Error publishing health metrics: %v", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "Unable to publish health metrics" would be a better log message

case <-cs.publishTicker.C:
err := cs.publishHealthMetricsOnce()
if err != nil {
seelog.Warnf("Error publishing health metrics: %v", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "Unable to publish health metrics" would be a better log message

// the ack each time it processes a health message
func ackPublishHealthMetricHandler(timer *time.Timer) func(*ecstcs.AckPublishHealth) {
return func(*ecstcs.AckPublishHealth) {
seelog.Debug("Received ACKPublishHealth from tcs")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you check if the message id can be logged here? That'd improve the message tracking if we need to debug anything with lossy messages in the future.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The message ID isn't available alone here, we can only log the whole message.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, never mind then

// EntryPoint is entrypoint of the container, corresponding to docker option: --entrypoint
EntryPoint *[]string
// Environment is the environment variable set in the container
Environment map[string]string `json:"environment"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe we should also document context for what the key:value represents in this case

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is pretty straightforward, as it's like setting any environment variable. Not sure what to put here. I'll leave it as for now, unless you have a better suggestion what to put here.

go tcshandler.StartMetricsSession(telemetrySessionParams)
}
// Start metrics session in a go routine
go tcshandler.StartMetricsSession(telemetrySessionParams)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please address this. not sure if it was addressed offline though.

engine.lock.Lock()
defer engine.lock.Unlock()

// addContainerUnsafe adds a container to the map of containers being watched.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe i missed this. why are we getting rid of the locks here and using *Unsafe?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the lock was just moved to line 138.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops not sure how i missed that. thanks.

Copy link
Member

@aaithal aaithal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from the comments, the one thing that's also missing is the integration test in 'stats' package to test the container health status tracking code path. Can you please add that as well?


strStatus := string(b[1 : len(b)-1])
switch strStatus {
case "UNKNOWN":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: please add a comment here stating that the status is already set to ContainerHealthUnknown.

// no need to process this in task manager
if event.Type == api.ContainerHealthEvent {
if cont.Container.HealthStatusShouldBeReported() {
seelog.Debugf("Updating container health status: %v", event.DockerContainerMetadata.Health)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please log container name/id and task arn here as well

seelog.Debugf("Adding container to stats health check watch list, id: %s, task: %s", dockerID, task.Arn)
}

if !shouldCollectStats {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move this to line 238, before invoking ResolveContainer?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line 238-243 should be always executed for collecting container health status. So move this to line 238 will skip this part if shouldCollectStats was false.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I misunderstood. Can you please rename shouldCollectStats to untrackedStatsContainer or something similar? shouldCollectStats is somewhat misleading in this context.

@richardpen
Copy link
Author

@aaithal I will add the integration test in a separate PR, please take another look, thanks.

Copy link
Member

@aaithal aaithal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please address a couple of minor comments I have before merging this.

seelog.Debugf("Adding container to stats health check watch list, id: %s, task: %s", dockerID, task.Arn)
}

if !statsContainerTracked {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this still doesn't read right. if stats container is not tracked, we should start tracking it. However, that's not what this variable is conveying. Can you rename this to be watchStatsContainer?


// addToStatsContainerMapUnsafe adds the statscontainer into stats for tracking and returns whether
// stats should start the statscontainer to collect metrics
func (engine *DockerStatsEngine) addToStatsContainerMapUnsafe(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please update the documentation for with a description of what's returned from this method.

@damiencarol
Copy link

Any new on this one?

@richardpen
Copy link
Author

@damiencarol Thanks for tracking this, we are still actively working on this. I'm going to merge this PR, and we still need to add some tests, please track the #534 for updates.

@richardpen richardpen merged commit b7d1784 into aws:container-health-check Jan 15, 2018
@aaithal aaithal modified the milestones: 1.18.0, 1.17.0 Jan 17, 2018
@richardpen richardpen deleted the chc branch January 30, 2018 18:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants