Enable container health check #1141
98e41a5 to 989f825
I still need to review changes to stats engine and the tcs handler. Publishing first list of comments so that you have an initial set of changes to work with. Can you also create this PR against a new "container-health" branch in the repo?
	case ContainerUnhealthy:
		return "UNHEALTHY"
	default:
		return "UNKNOWN"
- When do you expect this to be the case?
- Is this a valid backend status? That is, what will the health of the container/task be if we report it as "UNKNOWN"?
Please address this comment.
> When do you expect this to be the case?
When the docker inspect API call fails (e.g., on a timeout), the agent can't get the health status of the container.
> Is this a valid backend status? That is, what will the health of the container/task be if we report it as "UNKNOWN"?
Yes, the backend expects 'UNKNOWN' when the agent can't get the container health status. In that case the container health status will be reported as unknown.
agent/api/containerevent.go
Outdated
package api

// DockerEventsType represents the type of docker events
type DockerEventsType int
nit: can you please rename this to DockerEventType?
agent/api/task.go
Outdated
	if container.HealthCheckType == dockerHealthCheckType {
		// configure the docker health check config if it's set
		healthConfig := &docker.HealthConfig{}
		err := json.Unmarshal([]byte(*container.DockerConfig.HealthCheck), healthConfig)
please use aws.StringValue() here and everywhere else where you're dereferencing pointers directly.
agent/api/task_test.go
Outdated
		Name:            "c1",
		HealthCheckType: dockerHealthCheckType,
		DockerConfig: DockerConfig{
			HealthCheck: aws.String(`{"Test":["command"],"Interval":5000000000,"Timeout":4000000000,"StartPeriod":60000000000,"Retries":5}`),
minor: Can you please split this into a multi-line string to make it more readable?
agent/app/agent_capability.go
Outdated
@@ -99,6 +100,11 @@ func (agent *ecsAgent) capabilities() ([]*ecs.Attribute, error) {
		capabilities = appendNameOnlyAttribute(capabilities, attributePrefix+"execution-role-ecr-pull")
	}

	if _, ok := supportedVersions[dockerclient.Version_1_29]; ok {
		// StartPeriod was added in API 1.29
- minor: make this more explanatory. "Docker health check start period was added in .."
- Also, it feels like we're ignoring a large number of container instances which are running docker version >= 1.12 and <= 17.05. I think we should instead check if version >= 1.24
@@ -802,6 +822,10 @@ func (dg *dockerGoClient) ContainerEvents(ctx context.Context) (<-chan DockerCon
				// mean the container dies (non-init processes). If the container also
				// dies, you see a "die" status as well; we'll update suitably there
				continue
			case "health_status: healthy":
just double checking here that the event's Status is really "health_status: healthy" and this is not a typo/bug. Should this be just "health_status"?
yes, the event's status is "health_status: healthy"
agent/engine/docker_events_buffer.go
Outdated
@@ -23,7 +23,7 @@ const (
	containerTypeEvent = "container"
)

var containerEvents = []string{"create", "start", "stop", "die", "restart", "oom"}
var containerEvents = []string{"create", "start", "stop", "die", "restart", "oom", "health_status: unhealthy", "health_status: healthy"}
minor: please split this into multiple lines
		go tcshandler.StartMetricsSession(telemetrySessionParams)
	}
	// Start metrics session in a go routine
	go tcshandler.StartMetricsSession(telemetrySessionParams)
This is problematic because today, disabling metrics also means disabling the metrics collection engine that gets initialized (stats engine's init) from the tcs handler. Customers might choose to do that because they want a lighter cpu/memory footprint for the ECS Agent. If agent.cfg.DisableMetrics is true, we should disable both metrics collection and reporting. I think with this change, we're only disabling the reporting part. We should also disable metrics collection.
please address this. not sure if it was addressed offline though.
This has already been addressed. Disabling metrics won't affect the health check part.
agent/engine/docker_task_engine.go
Outdated
	engine.processTasks.RLock()
	managedTask, ok := engine.managedTasks[task.Arn]
	// hold the lock until the message is sent so we don't send on a closed channel
	defer engine.processTasks.RUnlock()
	if !ok {
		log.Crit("Could not find managed task corresponding to a docker event", "event", event, "task", task)
		return true
		seelog.Criticalf("Could not find managed task corresponding to a docker event, event: %s, task: %s", event, task)
I don't think we need the whole task here (or in other log lines you've changed). Can you use task.Arn instead?
agent/engine/types.go
Outdated
@@ -63,6 +65,8 @@ type DockerContainerMetadata struct {
	StartedAt time.Time
	// FinishedAt is the timestamp of container stop
	FinishedAt time.Time
	//Health contains the result of a container health check
lint: // Health
agent/stats/engine.go
Outdated
func (engine *DockerStatsEngine) isIdle() bool {
	engine.lock.RLock()
	defer engine.lock.RUnlock()

	return len(engine.tasksToContainers) == 0
}

func (engine *DockerStatsEngine) getTaskHealth(taskARN string) *ecstcs.TaskHealth {
This should have the 'Unsafe' suffix.
agent/stats/engine.go
Outdated
		if taskHealth == nil {
			continue
		}
		taskHealths = append(taskHealths, engine.getTaskHealth(taskARN))
you can just use taskHealth here, instead of an additional getTaskHealth call
agent/tcs/client/client.go
Outdated
// copyHealthMetadata performs a deep copy of HealthMetadata object
func copyHealthMetadata(metadata *ecstcs.HealthMetadata, fin bool) *ecstcs.HealthMetadata {
	return &ecstcs.HealthMetadata{
		Cluster: aws.String(*metadata.Cluster),
I think you can just do Cluster: metadata.Cluster here (same for other string fields in this block).
No, metadata.xx is a pointer, but we want a deep copy here, so that the copied value can't be changed from elsewhere.
sorry, missed that. please use aws.StringValue here as well. Avoid doing pointer dereferencing as much as possible.
This is still an outstanding item
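A sketch of what the requested change could look like, assuming local stand-ins: `stringValue` and `newString` imitate the nil-handling of `aws.StringValue` and `aws.String`, and `healthMetadata` is a simplified placeholder for `ecstcs.HealthMetadata`. The copy gets its own string, and a nil `Cluster` no longer panics:

```go
package main

import "fmt"

// stringValue imitates aws.StringValue: a nil pointer yields "".
func stringValue(s *string) string {
	if s == nil {
		return ""
	}
	return *s
}

// newString imitates aws.String: it returns a pointer to a fresh copy.
func newString(s string) *string { return &s }

// healthMetadata is a simplified placeholder for ecstcs.HealthMetadata.
type healthMetadata struct {
	Cluster *string
	Fin     *bool
}

// copyHealthMetadata deep-copies the metadata without dereferencing
// pointers directly, so the copy does not share memory with the original.
func copyHealthMetadata(metadata *healthMetadata, fin bool) *healthMetadata {
	return &healthMetadata{
		Cluster: newString(stringValue(metadata.Cluster)),
		Fin:     &fin,
	}
}

func main() {
	original := &healthMetadata{Cluster: newString("default")}
	copied := copyHealthMetadata(original, true)
	*original.Cluster = "changed"
	fmt.Println(*copied.Cluster, *copied.Fin)
}
```

Note that with this pattern a nil field is copied as a pointer to an empty string rather than as nil; whether that is acceptable depends on how the backend treats empty values.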
agent/tcs/client/client.go
Outdated
		// create a request if the number of task reaches the maximum
		if (i+1)%tasksInMessage == 0 {
			requestMetadata := copyHealthMetadata(metadata, (i+1) == numOfTasks)
			requests = append(requests, ecstcs.NewPublishHealthMetricsRequest(requestMetadata, copyTaskHealthMetrics(taskHealths)))
please split this into multiple lines so that it's easier to read
agent/tcs/client/client.go
Outdated
	numOfTasks := len(taskHealthMetrics)
	for i, taskHealth := range taskHealthMetrics {
		taskHealths = append(taskHealths, taskHealth)
		// create a request if the number of task reaches the maximum
minor: ".. reaches the maximum page size"
agent/api/container.go
Outdated
	Config *string `json:"config"`
	HostConfig *string `json:"hostConfig"`
	Version *string `json:"version"`
	Config *string `json:"config"`
minor: this is a good opportunity to document these fields
agent/api/container.go
Outdated
@@ -65,11 +81,12 @@ type Container struct {
	Overrides ContainerOverrides `json:"overrides"`
	DockerConfig DockerConfig `json:"dockerConfig"`
	RegistryAuthentication *RegistryAuthenticationData `json:"registryAuthentication"`

	HealthCheckType string `json:"healthCheckType,omitempty"`
minor: // HealthCheckType is ..
agent/api/container.go
Outdated
	copyHealth := c.Health

	if c.Health.Since != nil {
		copyHealth.Since = aws.Time(*c.Health.Since)
please avoid doing pointer dereferencing this way if possible.
@aaithal: why exactly do you advise against this?
agent/api/containerevent.go
Outdated
const (
	// ContainerStatusEvent represents the container status change events from docker
	ContainerStatusEvent DockerEventType = iota
Can you also document which events fall into this bucket here?
agent/api/containerevent.go
Outdated
const (
	// ContainerStatusEvent represents the container status change events from docker
	ContainerStatusEvent DockerEventType = iota
	// ContainerHealthEvent represents the container health status event from docker
Same as above. Please document all the event types that fall into this bucket.
agent/stats/engine.go
Outdated
	}

	if len(taskHealths) == 0 {
		return nil, nil, EmptyMetricsError
please return a different error here. EmptyMetricsError is not the same as, for example, EmptyHealthStatusError
@@ -445,6 +563,17 @@ func (engine *DockerStatsEngine) doRemoveContainerUnsafe(container *StatsContain
		delete(engine.tasksToDefinitions, taskArn)
		seelog.Debugf("Deleted task from tasks, arn: %s", taskArn)
	}

	// Remove the container from health container watch list
	if _, ok := engine.tasksToHealthCheckContainers[taskArn][dockerID]; !ok {
I don't think you need this. delete is a no-op for non-existent keys
This check makes sure the log below is correct, otherwise the log message will be misleading.
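Both points can be seen in a small sketch, assuming a simplified watch list (`map[string]map[string]struct{}`) and a hypothetical `removeHealthCheckContainer` helper. `delete` is safe on a missing key, but the lookup is what tells the caller whether anything was removed, which keeps the log line accurate:

```go
package main

import "fmt"

// removeHealthCheckContainer removes dockerID from the watch list for taskArn
// and reports whether an entry actually existed. The map shape is a
// simplification of the engine's real bookkeeping.
func removeHealthCheckContainer(watch map[string]map[string]struct{}, taskArn, dockerID string) bool {
	if _, ok := watch[taskArn][dockerID]; !ok {
		// Calling delete here would be a harmless no-op, but logging
		// "removed" afterwards would be misleading.
		return false
	}
	delete(watch[taskArn], dockerID)
	return true
}

func main() {
	watch := map[string]map[string]struct{}{
		"task-1": {"container-a": {}},
	}
	for _, id := range []string{"container-a", "container-b"} {
		if removeHealthCheckContainer(watch, "task-1", id) {
			fmt.Printf("removed %s from the health check watch list\n", id)
		} else {
			fmt.Printf("%s was not in the health check watch list\n", id)
		}
	}
}
```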
agent/stats/types.go
Outdated
@@ -38,7 +38,7 @@ type UsageStats struct {

// ContainerMetadata contains meta-data information for a container.
type ContainerMetadata struct {
	DockerID string `json:"-"`
	DockerID string
why did you change this?
agent/tcs/client/client.go
Outdated
	for i, taskHealth := range taskHealthMetrics {
		taskHealths = append(taskHealths, taskHealth)
		// create a request if the number of task reaches the maximum page size
		if (i+1)%tasksInMessage == 0 {
Please define a new const for tasksInHealthMessage and rename tasksInMessage to tasksInMetricsMessage. Also add a TODO for determining the proper value for tasksInHealthMessage. Because of changes in the payload structure, I think we'd have differing values for these.
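The paging this comment refers to can be sketched as follows, assuming task health entries are reduced to plain strings. `pageTaskHealths` is a hypothetical helper, and the value 10 is simply carried over from the diff pending the requested TODO:

```go
package main

import "fmt"

const (
	// tasksInMetricsMessage is the page size for resource metrics messages.
	tasksInMetricsMessage = 10
	// tasksInHealthMessage is the page size for health messages.
	// TODO: determine the proper value; the health payload differs in shape.
	tasksInHealthMessage = 10
)

// pageTaskHealths splits items into pages of at most pageSize entries.
// A page is closed whenever (i+1) is a multiple of pageSize, and any
// remainder becomes the final, shorter page.
func pageTaskHealths(items []string, pageSize int) [][]string {
	if pageSize <= 0 {
		return nil
	}
	var pages [][]string
	var current []string
	for i, item := range items {
		current = append(current, item)
		if (i+1)%pageSize == 0 {
			pages = append(pages, current)
			current = nil
		}
	}
	if len(current) > 0 {
		pages = append(pages, current)
	}
	return pages
}

func main() {
	tasks := make([]string, 25)
	for i := range tasks {
		tasks[i] = fmt.Sprintf("task-%d", i)
	}
	for i, page := range pageTaskHealths(tasks, tasksInHealthMessage) {
		fmt.Printf("request %d carries %d tasks\n", i, len(page))
	}
	fmt.Println("metrics page size:", tasksInMetricsMessage)
}
```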
took a first pass, I'll take another look soon.
agent/api/container.go
Outdated
@@ -32,15 +33,30 @@ const (
	// that specifies that the log driver should be authenticated using the
	// execution role
	awslogsAuthExecutionRole = "ExecutionRole"

	dockerHealthCheckType = "DOCKER"
nit: docs for dockerHealthCheckType
	assert.Equal(t, health.Output, "test")
	assert.NotEmpty(t, health.Since)

	// set the health status again shouldn't update the timestamp
so i'm clear, the timestamp shouldn't change cause the Status hasn't changed ya?
yes, the timestamp records when the status last changed.
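A minimal sketch of that behaviour, assuming simplified `HealthStatus` and `Container` types (the field names follow the test above; the implementation is illustrative, not the agent's code). `Since` is only refreshed when `Status` actually changes:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// HealthStatus is a simplified stand-in for the container health record.
type HealthStatus struct {
	Status string
	Since  *time.Time
	Output string
}

// Container is a simplified stand-in holding only the health state.
type Container struct {
	lock   sync.Mutex
	Health HealthStatus
}

// SetHealthStatus updates the health record. Since is only refreshed when
// the status changes, so it marks the time of the last transition.
func (c *Container) SetHealthStatus(health HealthStatus) {
	c.lock.Lock()
	defer c.lock.Unlock()

	if c.Health.Status == health.Status {
		return
	}
	now := time.Now()
	c.Health = HealthStatus{Status: health.Status, Since: &now, Output: health.Output}
}

func main() {
	c := &Container{}
	c.SetHealthStatus(HealthStatus{Status: "HEALTHY", Output: "test"})
	first := c.Health.Since
	c.SetHealthStatus(HealthStatus{Status: "HEALTHY", Output: "test"})
	fmt.Println("timestamp unchanged:", c.Health.Since == first)
	c.SetHealthStatus(HealthStatus{Status: "UNHEALTHY", Output: "exit 1"})
	fmt.Println("timestamp replaced:", c.Health.Since != first)
}
```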
	case ContainerHealthEvent:
		return "ContainerHealthChangeEvent"
	default:
		return "UNKNOWN"
should make this something similar to "DockerEventType: UNKNOWN"?
On second thought, I think UNKNOWN should be enough, since in logs we will log something like "event: UNKNOWN", which gives you the context.
agent/api/containerstatus.go
Outdated
		return nil
	}

	return errors.New("Unrecognized container health status: " + string(b))
this doesn't follow error string format convention
@aaithal do you mean the string should start with a lowercase?
	logLength := len(dockerContainer.State.Health.Log)
	if logLength != 0 {
		// Only save the last log from the health check
		health.Output = dockerContainer.State.Health.Log[logLength-1].Output
is this only the last line from the health check log? will this have enough context? also - is this the only way to get visibility into why the health check command may have failed?
> Is this only the last line from the health check log? will this have enough context?
Yes, since the log is the health check command output, and if the command failed the output should be the same. So I think the last line is enough.
> is this the only way to get visibility into why the health check command may have failed?
You can also get the failure reason from the exit code. The output is the health check command output, which is easier to understand than the exit code.
Why would you only have the last line? Also: I don't see any limit here to the length of the line. We should probably limit this in terms of bytes and maybe select the last N bytes instead of the last line.
agent/engine/docker_task_engine.go
Outdated
	managedTask.dockerMessages <- dockerContainerChange{container: cont.Container, event: event}
	log.Debug("Wrote docker event to the associated task", "task", task, "event", event)
	return true
	seelog.Debugf("Wrote docker event to the associated task: %s, event: %s", task.Arn, event)
nit: are we sure we need both Writing/Wrote debug lines here? this may add a lot of noise to our debug logs and they're already pretty noisy.
44f126e to 8630a7f
Initial pass. some nits.
agent/api/container.go
Outdated
	// HealthCheckType is the mechnism to use for the container health check
	// currently it only supports 'DOCKER'
	HealthCheckType string `json:"healthCheckType,omitempty"`
	// Health contains the health check information of
nit: "health check information of ..." incomplete?
agent/tcs/client/client.go
Outdated
@@ -221,6 +232,86 @@ func (cs *clientServer) metricsToPublishMetricRequests() ([]*ecstcs.PublishMetri
	return requests, nil
}

// publishHelathMetrics send the container health information to backend
func (cs *clientServer) publishHelathMetrics() {
nit: publishHealthMetrics()
agent/api/containerstatus.go
Outdated
	ContainerHealthUnknown ContainerHealthStatus = iota
	// ContainerHealthy represents the container health check returned healthy
	ContainerHealthy
	// ContainerUnhealthy represents the container health check failed
This comment reads like the health check is Unknown. Should it be "// ContainerUnhealthy represents the container health check when returned unhealthy"?
@@ -99,6 +100,11 @@ func (agent *ecsAgent) capabilities() ([]*ecs.Attribute, error) {
		capabilities = appendNameOnlyAttribute(capabilities, attributePrefix+"execution-role-ecr-pull")
	}

	if _, ok := supportedVersions[dockerclient.Version_1_24]; ok {
Should we also check for remote api 1.29 here?
Docker health check has been available since remote API 1.24. StartPeriod was added in 1.29. We only check for 1.24 here, and the backend will check the 1.29 capability if StartPeriod is specified.
agent/stats/engine.go
Outdated
const (
	// statsMetricsMap represents the map 'tasksToContainers'
	statsMetricsMap statsEngineMapType = iota
Can this be statsTaskMetricsMap and the health one something like statsTaskHealthMetricsMap since we might have different health check types?
agent/stats/engine.go
Outdated
	}
}

// synchronize go through all the containers on the instance to synchronize the state on agent start
nit: synchronizeState goes..
I mostly have minor comments.
agent/api/containerstatus.go
Outdated
@@ -121,7 +150,7 @@ func (cs *ContainerStatus) UnmarshalJSON(b []byte) error {
	}
	if b[0] != '"' || b[len(b)-1] != '"' {
		*cs = ContainerStatusNone
		return errors.New("ContainerStatus must be a string or null; Got " + string(b))
		return errors.New("containerStatus must be a string or null; Got " + string(b))
Can you reformat this as: "container status unmarshal: status must be a string or null.."
agent/api/containerstatus.go
Outdated
@@ -137,7 +166,7 @@ func (cs *ContainerStatus) UnmarshalJSON(b []byte) error {
	stat, ok := containerStatusMap[strStatus]
	if !ok {
		*cs = ContainerStatusNone
		return errors.New("Unrecognized ContainerStatus")
		return errors.New("unrecognized ContainerStatus")
Can you reformat this as: "container status unmarshal: unrecognized status"
agent/api/containerstatus.go
Outdated
	if b[0] != '"' || b[len(b)-1] != '"' {
		*healthStatus = ContainerHealthUnknown
		return errors.New("containerHealthStatus must be a string or null; Got " + string(b))
Same here. Please reformat this as per the Go standard for error strings: "container health status: ..."
agent/api/containerstatus.go
Outdated
		return nil
	}

	return errors.New("unrecognized container health status: " + string(b))
Same here too: "container health status: unrecognized status..."
agent/engine/docker_task_engine.go
Outdated
	task, taskFound := engine.state.TaskByID(event.DockerID)
	cont, containerFound := engine.state.ContainerByID(event.DockerID)
	if !taskFound || !containerFound {
		log.Debug("Event for container not managed", "dockerId", event.DockerID)
		return false
		seelog.Debugf("Event for container not managed, container: %s", event.DockerID)
can you also print the values of taskFound and containerFound in the log?
agent/stats/engine.go
Outdated
func (engine *DockerStatsEngine) addToStatsContainerMapUnsafe(
	taskARN, containerID string,
	statsContainer *StatsContainer,
	mapType statsEngineMapType) *StatsContainer {
as per offline conversation, it still seems a bit clunky to use an enum to refer to a map that you already have access to. Can you consider using a function pointer here instead, which returns the appropriate map to manipulate?
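One possible shape for that suggestion, sketched with simplified stand-in types. The accessor names follow the ones that appear later in this review; the `containerMapFn` type, `newEngine`, and the boolean return value are assumptions made for the example:

```go
package main

import "fmt"

// StatsContainer is a simplified stand-in for the engine's container record.
type StatsContainer struct{ DockerID string }

// DockerStatsEngine keeps only the two maps relevant to this example.
type DockerStatsEngine struct {
	tasksToContainers            map[string]map[string]*StatsContainer
	tasksToHealthCheckContainers map[string]map[string]*StatsContainer
}

// containerMapFn returns the map that a caller wants to manipulate.
type containerMapFn func() map[string]map[string]*StatsContainer

func newEngine() *DockerStatsEngine {
	return &DockerStatsEngine{
		tasksToContainers:            map[string]map[string]*StatsContainer{},
		tasksToHealthCheckContainers: map[string]map[string]*StatsContainer{},
	}
}

// The Unsafe suffix marks that the caller must already hold the engine lock.
func (engine *DockerStatsEngine) containerMetricsMapUnsafe() map[string]map[string]*StatsContainer {
	return engine.tasksToContainers
}

func (engine *DockerStatsEngine) healthCheckContainerMetricsMapUnsafe() map[string]map[string]*StatsContainer {
	return engine.tasksToHealthCheckContainers
}

// addToStatsContainerMapUnsafe adds the container to whichever map mapFn
// returns and reports whether a new entry was created.
func (engine *DockerStatsEngine) addToStatsContainerMapUnsafe(
	taskARN, containerID string,
	statsContainer *StatsContainer,
	mapFn containerMapFn) bool {
	taskToContainers := mapFn()
	if taskToContainers[taskARN] == nil {
		taskToContainers[taskARN] = map[string]*StatsContainer{}
	}
	if _, exists := taskToContainers[taskARN][containerID]; exists {
		return false
	}
	taskToContainers[taskARN][containerID] = statsContainer
	return true
}

func main() {
	engine := newEngine()
	container := &StatsContainer{DockerID: "abc"}
	engine.addToStatsContainerMapUnsafe("task-1", "abc", container, engine.healthCheckContainerMetricsMapUnsafe)
	fmt.Println(len(engine.tasksToHealthCheckContainers["task-1"]), len(engine.tasksToContainers))
}
```

Passing the accessor as a method value removes the enum-to-map switch from the helper, at the cost of a slightly less obvious call site.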
agent/stats/engine.go
Outdated
func (engine *DockerStatsEngine) isIdle() bool {
	engine.lock.RLock()
	defer engine.lock.RUnlock()

	return len(engine.tasksToContainers) == 0
}

func (engine *DockerStatsEngine) isHealthContainerIdle() bool {
can you rename this to containerHealthsToMonitor()? we can then do something like
if ok := engine.containerHealthsToMonitor(); ok {
	// we have containers whose healths are being monitored
} else {
	// no containers' healths are being monitored
}
It reads better than isHealthContainerIdle, which doesn't exactly convey the intent of the method.
agent/tcs/client/client.go
Outdated
const (
	// tasksInMessage is the maximum number of tasks that can be sent in a message to the backend
	// This is a very conservative estimate assuming max allowed string lengths for all fields.
	tasksInMessage = 10
you can rename this to tasksInMetricMessage
agent/tcs/client/client.go
Outdated
	if !cs.disableResourceMetrics {
		go cs.publishMetrics()
	}
	go cs.publishHelathMetrics()
typo: publishHelathMetrics -> publishHealthMetrics
agent/tcs/client/client.go
Outdated
	// This is a very conservative estimate assuming max allowed string lengths for all fields.
	tasksInMessage = 10
	// tasksInHealthMessage is the maximum number of tasks that can be sent in a message to the backend
	tasksInHealthMessage = 10
please add a TODO for evaluating what this number should be
agent/api/containerstatus.go
Outdated
@@ -151,6 +180,43 @@ func (cs *ContainerStatus) MarshalJSON() ([]byte, error) {
	return []byte(`"` + cs.String() + `"`), nil
}

// UnmarshalJSON overrides the logic for parsing the JSON-encoded container health data
func (healthStatus *ContainerHealthStatus) UnmarshalJSON(b []byte) error {
	if strings.ToLower(string(b)) == "null" {
nit: can you extract *healthStatus = ContainerHealthUnknown outside of these conditional blocks and just set it at the beginning of the method?
agent/api/containerstatus.go
Outdated
		return errors.New("container health status unmarshal: status must be a string or null; Got " + string(b))
	}
	strStatus := string(b[1 : len(b)-1])
	if strStatus == "UNKNOWN" {
all of these can be moved to a switch-case block
Please use switch for this.
@@ -45,6 +45,10 @@ const (
	imageNameFormat = "%s:%s"
	// the buffer size will ensure agent doesn't miss any event from docker
	dockerEventBufferSize = 100
	// healthCheckHealthy is the health status returned from docker container health check
	healthCheckHealthy = "healthy"
	// healthCheckUnhealthy is unhealth status returned from docker container health check
nit: 'unhealthy'
agent/engine/docker_task_engine.go
Outdated
if cont.Container.HealthStatusShouldBeReported() {
seelog.Debugf("Updating container health status: %s", event.DockerContainerMetadata.Health)
cont.Container.SetHealthStatus(event.DockerContainerMetadata.Health)
engine.saver.Save()

I don't think we need to persist this in the state file. In the event of a restart, inspecting the container should be sufficient to get the health status of the container. wdyt? imo, we can delete this line.
agent/stats/engine.go
Outdated
engine.lock.Lock()
defer engine.lock.Unlock()

func (engine *DockerStatsEngine) addContainer(dockerID string) (*StatsContainer, error) {

please rename this to addContainerUnsafe
agent/stats/engine.go
Outdated
return statsContainer, nil
}

func (engine *DockerStatsEngine) containerMetricsMap() map[string]map[string]*StatsContainer {

please rename this as containerMetricsMapUnsafe
agent/stats/engine.go
Outdated
func (engine *DockerStatsEngine) containerMetricsMap() map[string]map[string]*StatsContainer {
return engine.tasksToContainers
}
func (engine *DockerStatsEngine) healthCheckContainerMetricsMap() map[string]map[string]*StatsContainer {

please rename this as healthCheckContainerMetricsMapUnsafe
agent/stats/engine.go
Outdated
func (engine *DockerStatsEngine) addToStatsContainerMapUnsafe(
taskARN, containerID string,
statsContainer *StatsContainer,
fun func() map[string]map[string]*StatsContainer) *StatsContainer {

fun is a very generic, non-descriptive name. Please rename that to be something more meaningful
agent/api/container.go
Outdated
Version *string `json:"version"`
// Version specifies the docker client API version to use
Version *string `json:"version"`
// HealthCheck is the configuration of docker health check

Is this another json-serialized object? Is this part of Config/HostConfig?

No and it's not part of the Config/HostConfig from the payload message. It was sent as a string and agent will unmarshal it into the docker recognized struct api/task.go

Why isn't it part of the Config from the payload message? It's being stuck in the Config anyway, so why do we need a separate unmarshal for it?
logLength := len(dockerContainer.State.Health.Log)
if logLength != 0 {
// Only save the last log from the health check
health.Output = dockerContainer.State.Health.Log[logLength-1].Output

Why would you only have the last line? Also: I don't see any limit here to the length of the line. We should probably limit this in terms of bytes and maybe select the last N bytes instead of the last line.
@@ -397,7 +397,7 @@ func TestInitProcessEnabled(t *testing.T) {
}
agent := RunAgent(t, nil)
defer agent.Cleanup()
agent.RequireVersion(">=1.14.5")
agent.RequireVersion(">=1.15.0")

Why did this change?

I believe this was changed to 1.14.5 unexpectedly here, so I changed it back.

@sharanyad Can you clarify whether that was intended?

@richardpen is right. According to the releases, 1.15.0 has the init support in agent. Not sure how the testing change got slipped in.
@samuelkarp Not sure why GitHub doesn't allow me to respond to your comment in place, so I'm putting it here. We only keep the last line because the line should be the same in most cases for the failure, and it should be enough for debugging purposes. It also saves memory by not keeping the additional output, especially when the retry number is very big.
I don't understand what you mean by that.
This seems like a pretty big assumption, unless I'm misunderstanding who creates the output.
@samuelkarp Sorry for the confusion: the last line isn't the last line of the output, it is the last entry of the outputs, which is the whole output of the latest health check command.
@richardpen Thanks, that does help. I think that makes sense, the only thing now is to look at making sure we implement a size limit for the output.
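One way to bound the saved output, following the "last N bytes" suggestion earlier in this thread, is sketched below. The helper name truncateHealthOutput and the 1024 limit are hypothetical and chosen only for illustration; the limit and the truncation direction in the merged code may differ.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// maxHealthCheckOutputLength is an assumed limit, in bytes.
const maxHealthCheckOutputLength = 1024

// truncateHealthOutput keeps at most maxLen trailing bytes of output.
// If the cut lands in the middle of a multi-byte UTF-8 character, the
// leading partial bytes are dropped so the result starts on a rune boundary.
func truncateHealthOutput(output string, maxLen int) string {
	if maxLen <= 0 {
		return ""
	}
	if len(output) <= maxLen {
		return output
	}
	tail := output[len(output)-maxLen:]
	for len(tail) > 0 && !utf8.RuneStart(tail[0]) {
		tail = tail[1:]
	}
	return tail
}

func main() {
	// Hypothetical health check output, for demonstration only.
	out := "curl: (7) Failed to connect to localhost port 80: Connection refused"
	fmt.Println(truncateHealthOutput(out, 18))
	fmt.Println(len(truncateHealthOutput(out, maxHealthCheckOutputLength)) == len(out))
}
```

Keeping the tail rather than the head favors the final error message of a command, which is usually the most useful part when debugging a failing check.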
Regenerate the acs api to add new fields in the payload message to support container health check configuration.
Add the container health configuration into the DockerConfig struct, and also add the function to pass it to docker when creating the container.
Add the container health event type as a type of event that the agent should handle, and also update the container health status during the periodic reconciliation of container metadata with docker.
Add the functionality to send container health metrics to the backend.
@samuelkarp @aaithal I have addressed all the comments, please take another look.
I've got some more comments. Apologies for not identifying these earlier.
agent/api/task.go
Outdated
if err != nil {
return nil, &DockerClientConfigError{"Unable decode given docker config: " + err.Error()}
}
}
if container.HealthCheckType == dockerHealthCheckType && config.Healthcheck == nil {
return nil, &DockerClientConfigError{"docker health check is nil while container health check type is DOCKER"}

nit: please split this into multiple lines for readability
agent/api/task.go
Outdated
}

resourceSplit := strings.SplitN(resource, arnResourceDelimiter, arnResourceSections)
if len(resourceSplit) != arnResourceSections {
return "", errors.New(fmt.Sprintf("task get-id: invalid task resource split: %s, expected=%d, actual=%d", resource, arnResourceSections, len(resourceSplit)))
return "", errors.Errorf("task get-id: invalid task resource split: %s, expected=%d, actual=%d", resource, arnResourceSections, len(resourceSplit))

nit: please split this into multiple lines for readability
healthCheckHealthy = "healthy"
// healthCheckUnhealthy is unhealthy status returned from docker container health check
healthCheckUnhealthy = "unhealthy"
// maxHealthCheckOutputSize is the maximum size of healthcheck command output that agent will save

nit: please change this to "maximum length" instead of "size"
agent/stats/engine.go
Outdated
if err != nil {
return fmt.Errorf("Failed to subscribe to container change event stream, err %v", err)
}

go engine.listContainersAndStartEventHandler()
go engine.waitToStop()
engine.synchronizeState()

this returns an error. please make sure that you log a warning if an error is returned.
agent/stats/engine.go
Outdated
defer engine.lock.Unlock()
statsContainer, err := engine.addContainerUnsafe(containerID)
if err != nil {
seelog.Debugf("Adding container to stats watch list failed, err: %v", err)

please log container id here.
agent/stats/engine.go
Outdated
func (engine *DockerStatsEngine) containerMetricsMapUnsafe() map[string]map[string]*StatsContainer {
return engine.tasksToContainers
}
func (engine *DockerStatsEngine) healthCheckContainerMetricsMapUnsafe() map[string]map[string]*StatsContainer {

please add a newline here and rename this to healthCheckContainerMapUnsafe
agent/stats/engine.go
Outdated
func (engine *DockerStatsEngine) addToStatsContainerMapUnsafe(
taskARN, containerID string,
statsContainer *StatsContainer,
statsMapToUpdate func() map[string]map[string]*StatsContainer) *StatsContainer {

why is this method returning *StatsContainer? iiuc, you're just returning statsContainer here and that's not changing any state. It may make more sense to return a bool indicating if we should start watching this container or not based on its existence in the map.

I was trying to avoid an extra if check, but I can modify that if this makes the code easier to read.
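The bool-returning variant suggested above might look like the following sketch. StatsContainer and DockerStatsEngine are reduced to hypothetical stand-ins with only the fields this example needs, and the map-handling details are an assumption about how the method could behave, not the code that was merged.

```go
package main

import (
	"fmt"
	"sync"
)

// StatsContainer is a minimal stand-in for the agent's type.
type StatsContainer struct {
	dockerID string
}

// DockerStatsEngine is a minimal stand-in holding only the state used here.
type DockerStatsEngine struct {
	lock              sync.Mutex
	tasksToContainers map[string]map[string]*StatsContainer
}

// containerMetricsMapUnsafe returns the map without taking the lock.
func (engine *DockerStatsEngine) containerMetricsMapUnsafe() map[string]map[string]*StatsContainer {
	return engine.tasksToContainers
}

// addToStatsContainerMapUnsafe records statsContainer in the map returned by
// statsMapToUpdate. It returns true if the container was newly added (the
// caller should start watching it) and false if it was already tracked.
// The caller must hold engine.lock.
func (engine *DockerStatsEngine) addToStatsContainerMapUnsafe(
	taskARN, containerID string,
	statsContainer *StatsContainer,
	statsMapToUpdate func() map[string]map[string]*StatsContainer) bool {

	taskToContainerMap := statsMapToUpdate()
	if _, taskExists := taskToContainerMap[taskARN]; !taskExists {
		taskToContainerMap[taskARN] = make(map[string]*StatsContainer)
	}
	if _, containerExists := taskToContainerMap[taskARN][containerID]; containerExists {
		return false
	}
	taskToContainerMap[taskARN][containerID] = statsContainer
	return true
}

func main() {
	engine := &DockerStatsEngine{
		tasksToContainers: make(map[string]map[string]*StatsContainer),
	}
	engine.lock.Lock()
	defer engine.lock.Unlock()

	// Hypothetical identifiers, for demonstration only.
	c := &StatsContainer{dockerID: "c1"}
	first := engine.addToStatsContainerMapUnsafe("task-1", "c1", c, engine.containerMetricsMapUnsafe)
	second := engine.addToStatsContainerMapUnsafe("task-1", "c1", c, engine.containerMetricsMapUnsafe)
	fmt.Println(first, second)
}
```

Returning a bool lets the caller decide in one place whether to start collection, while the Unsafe suffix documents that locking is the caller's responsibility.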
agent/tcs/client/client.go
Outdated
// due to a connection reset.
err := cs.publishHealthMetricsOnce()
if err != nil {
seelog.Warnf("Error publishing health metrics: %v", err)

nit: "Unable to publish health metrics" would be a better log message
agent/tcs/client/client.go
Outdated
case <-cs.publishTicker.C:
err := cs.publishHealthMetricsOnce()
if err != nil {
seelog.Warnf("Error publishing health metrics: %v", err)

nit: "Unable to publish health metrics" would be a better log message
// the ack each time it processes a health message
func ackPublishHealthMetricHandler(timer *time.Timer) func(*ecstcs.AckPublishHealth) {
return func(*ecstcs.AckPublishHealth) {
seelog.Debug("Received ACKPublishHealth from tcs")

Can you check if the message id can be logged here? That'd improve the message tracking if we need to debug anything with lossy messages in the future.

The message ID isn't available alone here, we can only log the whole message.

oh, never mind then
4d56802
to
186876c
Compare
// EntryPoint is entrypoint of the container, corresponding to docker option: --entrypoint
EntryPoint *[]string
// Environment is the environment variable set in the container
Environment map[string]string `json:"environment"`

nit: maybe we should also document context for what the key:value represents in this case

I think this is pretty straightforward, as it's like setting any environment variable. Not sure what to put here. I'll leave it as is for now, unless you have a better suggestion for what to put here.
go tcshandler.StartMetricsSession(telemetrySessionParams)
}
// Start metrics session in a go routine
go tcshandler.StartMetricsSession(telemetrySessionParams)

please address this. not sure if it was addressed offline though.
engine.lock.Lock()
defer engine.lock.Unlock()

// addContainerUnsafe adds a container to the map of containers being watched.

maybe i missed this. why are we getting rid of the locks here and using *Unsafe?

No, the lock was just moved to line 138.

oops not sure how i missed that. thanks.

Apart from the comments, the one thing that's also missing is the integration test in 'stats' package to test the container health status tracking code path. Can you please add that as well?
strStatus := string(b[1 : len(b)-1])
switch strStatus {
case "UNKNOWN":

nit: please add a comment here stating that the status is already set to ContainerHealthUnknown.
agent/engine/docker_task_engine.go
Outdated
// no need to process this in task manager
if event.Type == api.ContainerHealthEvent {
if cont.Container.HealthStatusShouldBeReported() {
seelog.Debugf("Updating container health status: %v", event.DockerContainerMetadata.Health)

please log container name/id and task arn here as well
agent/stats/engine.go
Outdated
seelog.Debugf("Adding container to stats health check watch list, id: %s, task: %s", dockerID, task.Arn)
}

if !shouldCollectStats {

can you move this to line 238, before invoking ResolveContainer?

Lines 238-243 should always be executed to collect the container health status, so moving this to line 238 would skip that part if shouldCollectStats was false.

Ah, I misunderstood. Can you please rename shouldCollectStats to untrackedStatsContainer or something similar? shouldCollectStats is somewhat misleading in this context.

@aaithal I will add the integration test in a separate PR, please take another look, thanks.
Please address a couple of minor comments I have before merging this.
agent/stats/engine.go
Outdated
seelog.Debugf("Adding container to stats health check watch list, id: %s, task: %s", dockerID, task.Arn)
}

if !statsContainerTracked {

this still doesn't read right. if stats container is not tracked, we should start tracking it. However, that's not what this variable is conveying. Can you rename this to be watchStatsContainer?
agent/stats/engine.go
Outdated
// addToStatsContainerMapUnsafe adds the statscontainer into stats for tracking and returns whether
// stats should start the statscontainer to collect metrics
func (engine *DockerStatsEngine) addToStatsContainerMapUnsafe(

please update the documentation with a description of what's returned from this method.
Any news on this one?
@damiencarol Thanks for tracking this, we are still actively working on this. I'm going to merge this PR, and we still need to add some tests, please track #534 for updates.
Summary
Enable docker container health check in the agent and report container health metrics to backend
Implementation details
Testing
(make release)
(go build -out amazon-ecs-agent.exe ./agent)
(make test) pass
(go test -timeout=25s ./agent/...) pass
(make run-integ-tests) pass
(.\scripts\run-integ-tests.ps1) pass
(make run-functional-tests) pass
(.\scripts\run-functional-tests.ps1) pass
New tests cover the changes:
Description for the changelog
Licensing
This contribution is under the terms of the Apache 2.0 License: