Adding container metrics support #265

abhi · 2017-09-20T06:31:36Z

Built on top of #257
Review by commits

Implements ListContainerStats and ContainerStats to collect metrics from containerd
Added a couple of integration tests to verify stats collection

This commit does not include caching. Caching will be added if the benchmark results show it is needed.

yanxuean · 2017-09-20T08:06:56Z

cmd/cri-containerd/options/options.go

@@ -31,6 +31,8 @@ const configFilePathArgName = "config"

 // ContainerdConfig contains config related to containerd
 type ContainerdConfig struct {
+	// ContainerdRootDir is the root directory path for containerd.
+	ContainerdRootDir string


Will do in my PR. Thanks!

Random-Liu · 2017-09-21T02:47:10Z

pkg/server/container_stats.go

 	"k8s.io/kubernetes/pkg/kubelet/apis/cri/v1alpha1/runtime"
 )

 // ContainerStats returns stats of the container. If the container does not
 // exist, the call returns an error.
 func (c *criContainerdService) ContainerStats(ctx context.Context, in *runtime.ContainerStatsRequest) (*runtime.ContainerStatsResponse, error) {
-	return nil, errors.New("not implemented")
+	// Validate the stats request
+	if in == nil || in.ContainerId == "" {


nit: in.GetContainerId() == ""

Not important, just a nit.

Random-Liu · 2017-09-21T02:53:59Z

pkg/server/container_stats_list.go

+	cs.Cpu = &runtime.CpuUsage{Timestamp: stats.Timestamp.Unix(), UsageCoreNanoSeconds: &runtime.UInt64Value{Value: metrics.CPU.Usage.Total}}
+	cs.Memory = &runtime.MemoryUsage{Timestamp: stats.Timestamp.Unix(), WorkingSetBytes: &runtime.UInt64Value{metrics.Memory.Usage.Usage}}
+
+	cs.Attributes = &runtime.ContainerAttributes{Id: stats.ID}


Please fill in the attributes; all the required information is in c.containerStore.Get(in.ContainerId), which you have already called.

Random-Liu · 2017-09-21T02:57:15Z

pkg/server/container_stats.go

+
+	var cs runtime.ContainerStats
+	if err := getContainerMetrics(resp.Metrics[0], &cs); err != nil {
+		return nil, fmt.Errorf("failed to decode container metrics for container %q: %v", err)


Unused %q. I'm fine with removing it directly.
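The other way to resolve this nit is to keep the verb and supply the missing argument. A minimal sketch, assuming a hypothetical helper name (wrapDecodeError) that is not part of the PR:

```go
package main

import (
	"errors"
	"fmt"
)

// wrapDecodeError is a hypothetical helper, not code from the PR.
// Each verb in the format string has a matching argument:
// %q receives the container ID and %v receives the underlying error.
func wrapDecodeError(containerID string, cause error) error {
	return fmt.Errorf("failed to decode container metrics for container %q: %v", containerID, cause)
}

func main() {
	fmt.Println(wrapDecodeError("abc123", errors.New("unexpected metrics type")))
}
```

Running go vet on the original line would flag the mismatch between verbs and arguments.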

Random-Liu · 2017-09-21T02:57:50Z

pkg/server/container_stats_list.go

+	metrics := s.(*cgroups.Metrics)
+	cs.Cpu = &runtime.CpuUsage{Timestamp: stats.Timestamp.Unix(), UsageCoreNanoSeconds: &runtime.UInt64Value{Value: metrics.CPU.Usage.Total}}
+	cs.Memory = &runtime.MemoryUsage{Timestamp: stats.Timestamp.Unix(), WorkingSetBytes: &runtime.UInt64Value{metrics.Memory.Usage.Usage}}
+


We should also get writable layer size from snapshots. See the imagefs stats PR.

Random-Liu · 2017-09-21T02:59:11Z

pkg/server/container_stats_list.go

+		return fmt.Errorf("failed to extract container metrics: %v", err)
+	}
+	metrics := s.(*cgroups.Metrics)
+	cs.Cpu = &runtime.CpuUsage{Timestamp: stats.Timestamp.Unix(), UsageCoreNanoSeconds: &runtime.UInt64Value{Value: metrics.CPU.Usage.Total}}


nit: This line seems too long, and so does the following one.

Random-Liu · 2017-09-21T05:39:31Z

pkg/server/container_stats_list.go

+}
+func (c *criContainerdService) buildTaskMetricsRequest(r *runtime.ListContainerStatsRequest) (tasks.MetricsRequest, error) {
+	var req tasks.MetricsRequest
+	if r == nil || r.Filter == nil {


nit: r.GetFilter().GetId() and r.GetFilter().GetPodSandboxId()

These getters could help you avoid the nil check.
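To illustrate why the chained getters are nil-safe, here is a minimal sketch. The Request and Filter types are hypothetical stand-ins that only mimic the shape of protobuf-generated Go code; they are not the real CRI types.

```go
package main

import "fmt"

// Filter and Request are hypothetical stand-ins for generated types.
type Filter struct {
	Id           string
	PodSandboxId string
}

type Request struct {
	Filter *Filter
}

// GetFilter returns nil when the receiver itself is nil, so callers
// can chain further getters without checking the request first.
func (r *Request) GetFilter() *Filter {
	if r == nil {
		return nil
	}
	return r.Filter
}

// GetId returns the zero value when the filter is nil.
func (f *Filter) GetId() string {
	if f == nil {
		return ""
	}
	return f.Id
}

// GetPodSandboxId returns the zero value when the filter is nil.
func (f *Filter) GetPodSandboxId() string {
	if f == nil {
		return ""
	}
	return f.PodSandboxId
}

func main() {
	var r *Request // nil request, no filter at all
	// Neither call panics; both return the empty string.
	fmt.Printf("%q %q\n", r.GetFilter().GetId(), r.GetFilter().GetPodSandboxId())
}
```

Calling a method on a nil pointer receiver is legal in Go as long as the method does not dereference it, which is what makes this pattern work.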

Random-Liu · 2017-09-21T05:42:05Z

pkg/server/container_stats_list.go

+	var req tasks.MetricsRequest
+	if r == nil || r.Filter == nil {
+		return req, nil
+	}


I believe the relationship between different filters in CRI is AND.

What about containerd?

If different filters in containerd are ORed, then this is not correct.

If different filters in containerd are ANDed, then we'll get no containers, because no single container has all the IDs.

As mentioned in our discussion, I'm fine with listing all container stats and filtering afterwards if the filter doesn't work that well :) Up to you to decide.
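A minimal sketch of the "list everything, then filter afterwards" option with AND semantics. The stat and filter types and the filterStats function are hypothetical; matchLabelSelector borrows its name from the PR, but the body here is an assumption.

```go
package main

import "fmt"

// stat and filter are hypothetical, simplified stand-ins.
type stat struct {
	ID           string
	PodSandboxID string
	Labels       map[string]string
}

type filter struct {
	ID            string
	PodSandboxID  string
	LabelSelector map[string]string
}

// matchLabelSelector reports whether every selector entry is present
// in labels with the same value. An empty selector matches everything.
func matchLabelSelector(selector, labels map[string]string) bool {
	for k, v := range selector {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// filterStats keeps only the entries that satisfy ALL of the set
// filter fields (AND semantics). Unset fields are ignored.
func filterStats(all []stat, f *filter) []stat {
	if f == nil {
		return all
	}
	var out []stat
	for _, s := range all {
		if f.ID != "" && s.ID != f.ID {
			continue
		}
		if f.PodSandboxID != "" && s.PodSandboxID != f.PodSandboxID {
			continue
		}
		if !matchLabelSelector(f.LabelSelector, s.Labels) {
			continue
		}
		out = append(out, s)
	}
	return out
}

func main() {
	all := []stat{
		{ID: "c1", PodSandboxID: "p1", Labels: map[string]string{"app": "web"}},
		{ID: "c2", PodSandboxID: "p1", Labels: map[string]string{"app": "db"}},
		{ID: "c3", PodSandboxID: "p2", Labels: map[string]string{"app": "web"}},
	}
	f := &filter{PodSandboxID: "p1", LabelSelector: map[string]string{"app": "web"}}
	for _, s := range filterStats(all, f) {
		fmt.Println(s.ID)
	}
}
```

Filtering on the caller's side avoids depending on how containerd combines its own filters, at the cost of fetching metrics for containers that are then discarded.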

Random-Liu · 2017-09-21T05:46:28Z

integration/container_stats_test.go

+)
+
+const (
+	defaultImage = "gcr.io/google_containers/pause:3.0"


This const is global, let's make the name more specific.

Random-Liu · 2017-09-21T05:48:29Z

integration/container_stats_test.go

+	s, err := runtimeService.ContainerStats(sb)
+	t.Logf("Verify stats received for sandbox container")
+	require.NoError(t, err)
+	testStats(t, s)


Why does this work? We don't store sandbox container id in container store, right?

Random-Liu · 2017-09-21T05:49:36Z

pkg/server/container_stats_list.go

+	if err != nil {
+		return nil, fmt.Errorf("failed to build metrics request: %v", err)
+	}
+	resp, err := c.taskService.Metrics(context.Background(), &request)


We may want to add a filter on containerKindLabel=container. As written, this will also include sandbox containers in the list.

I need to discuss with @yujuhong to finalize whether we should include sandbox container stats in ListContainerStats.

Discussed with @yujuhong, let's exclude the sandbox container for now. We need to figure out a way to handle pod level overhead.

ok makes sense. I wasn't sure about it.

Circling back again. If we exclude the sandbox container, the aggregated stats for a pod might be misleading, right? Or is the reasoning that the stats for sandbox containers are negligible and consistent across all pods, so they don't factor into decision making?

@abhinandanpb This is a problem in CRI now.

We have a pod-level cgroup, and ideally Kubelet should make resource management decisions based on resource usage obtained from the pod-level cgroup. The container metrics are only for users who care about how much resource their workloads (application containers) are using.

However, today Kubelet is not using pod-level stats yet. It is still making decisions based on container metrics. This means that the problem you mentioned does exist: Kubelet is not looking at all resource usage inside the pod (sandbox container, container-shim, etc.).

This is an issue that needs to be addressed. We should:

1. Either change Kubelet to make resource management decisions based on pod-level stats, which include everything inside the pod cgroup.

2. Or change CRI to include the pod sandbox's extra overhead.

Option 1 is preferable now, and we'll probably do it next quarter.

For now, we can only provide application container metrics based on the current CRI. We'll address this in 1.9. We are going to refactor cadvisor, and pod cgroup metrics will be part of that effort.

Got it. Thanks for the explanation.

Random-Liu · 2017-09-21T05:52:39Z

integration/container_stats_test.go

+	}
+}
+
+func testStats(t *testing.T, s *runtime.ContainerStats) {


Will this be flaky? Is it possible that a container gets 0 cpu seconds? Do we need to use Eventually to check this?

Just my question. I'm not sure about the answer. :) I think 0 cpu seconds shouldn't be possible, but you may have some more ideas. :)
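If the check does turn out to be flaky, polling is one way to handle it. A minimal sketch, assuming hypothetical helpers (eventually, waitForNonZero) and arbitrary interval and timeout values; it is not the helper used by the integration tests.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// eventually is a hypothetical polling helper: it calls check every
// interval until check reports true, returns an error, or the
// timeout elapses.
func eventually(check func() (bool, error), interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		ok, err := check()
		if err != nil {
			return err
		}
		if ok {
			return nil
		}
		if time.Now().After(deadline) {
			return errors.New("timed out waiting for condition")
		}
		time.Sleep(interval)
	}
}

// waitForNonZero polls read until it returns a non-zero value.
// The 10ms interval and 1s timeout are arbitrary illustration values.
func waitForNonZero(read func() uint64) error {
	return eventually(func() (bool, error) {
		return read() > 0, nil
	}, 10*time.Millisecond, time.Second)
}

func main() {
	calls := 0
	// Fake CPU counter that stays at zero for the first two reads.
	readCPU := func() uint64 {
		calls++
		if calls < 3 {
			return 0
		}
		return 1500
	}
	fmt.Println(waitForNonZero(readCPU), calls)
}
```

In a real test, read would wrap the ContainerStats call and return the reported CPU usage.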

Random-Liu · 2017-09-21T18:04:03Z

integration/container_stats_test.go

+}
+
+func testStats(t *testing.T, s *runtime.ContainerStats) {
+	require.NotEmpty(t, s.Attributes)


nit: Use the getters, which will save many lines. :)

The getter already handles the case where the original object is nil; it just returns the zero value.

abhi · 2017-09-22T16:28:21Z

pkg/server/container_stats.go

 	"k8s.io/kubernetes/pkg/kubelet/apis/cri/v1alpha1/runtime"
 )

 // ContainerStats returns stats of the container. If the container does not
 // exist, the call returns an error.
 func (c *criContainerdService) ContainerStats(ctx context.Context, in *runtime.ContainerStatsRequest) (*runtime.ContainerStatsResponse, error) {
-	return nil, errors.New("not implemented")
+	// Validate the stats request
+	if in == nil || in.GetContainerId() == "" {


will get rid of this

nit: in.GetContainerId() == "" is enough

Random-Liu · 2017-09-25T22:36:04Z

@abhinandanpb Dependent PR is merged. Please rebase.

Random-Liu · 2017-09-26T00:18:52Z

pkg/server/container_stats.go

+
+	var cs runtime.ContainerStats
+	if err := c.getContainerMetrics(resp.Metrics[0], &cs); err != nil {
+		return nil, fmt.Errorf("failed to decode container metrics %v", err)


nit: Add :.

Random-Liu · 2017-09-26T00:28:54Z

pkg/server/container_stats_list.go

+	var usedBytes, inodesUsed uint64
+	sn, err := c.snapshotStore.Get(stats.ID)
+	// If snapshotstore doesnt have cached snapshotStore
+	// set WritableLayer to zero


nit: WritableLayer usage.

Random-Liu · 2017-09-26T00:29:04Z

pkg/server/container_stats_list.go

+
+	var usedBytes, inodesUsed uint64
+	sn, err := c.snapshotStore.Get(stats.ID)
+	// If snapshotstore doesnt have cached snapshotStore


nit: s/doesnt/doesn't
nit: cached snapshot information.

Random-Liu · 2017-09-26T00:30:04Z

pkg/server/container_stats_list.go

+		usedBytes = sn.Inodes
+	}
+	cs.WritableLayer = &runtime.FilesystemUsage{
+		Timestamp: stats.Timestamp.Unix(),


Should use the timestamp in the snapshot information.

Random-Liu · 2017-09-26T03:34:01Z

pkg/server/container_stats_list.go

+// the information in the stats request and the containerStore
+func (c *criContainerdService) buildTaskMetricsRequest(r *runtime.ListContainerStatsRequest) (tasks.MetricsRequest, error) {
+	var req tasks.MetricsRequest
+	if r == nil || r.Filter == nil {


nit: r.GetFilter() == nil

Random-Liu · 2017-09-26T03:34:27Z

pkg/server/container_stats_list.go

+	return nil
+}
+
+// buildTaskMetricsRequest constructs a taskMetricsRequest based on


nit: task.MetricsRequest.

Random-Liu · 2017-09-26T03:37:38Z

pkg/server/container_stats_list.go

+	return req, nil
+}
+
+func matchLabelSelector(selector map[string]string, labels map[string]string) bool {


nit: selector, labels map[string]string.

Random-Liu · 2017-09-26T03:43:29Z

pkg/server/container_stats_list.go

+	}
+
+	// Get the container from store and extract the attributes
+	cnt, err := c.containerStore.Get(stats.ID)


It's possible that a container shows up during the list but gets deleted before this point. In that case, we should not return an error for the whole ListContainerStats.

We should either pass the container config into this function, or distinguish the not-found error here.

The former seems cleaner to me.
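For comparison, a minimal sketch of the second option (distinguishing the not-found error so that one vanished container does not fail the whole list). The store type, the errNotExist sentinel, and collectNames are hypothetical; errors.Is requires Go 1.13 or later.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotExist is a hypothetical sentinel error for missing entries.
var errNotExist = errors.New("container does not exist")

// store is a hypothetical in-memory container store: ID -> name.
type store map[string]string

// Get returns errNotExist when the ID is unknown.
func (s store) Get(id string) (string, error) {
	name, ok := s[id]
	if !ok {
		return "", errNotExist
	}
	return name, nil
}

// collectNames looks up each ID and skips entries that disappeared
// between listing and lookup. Any other error aborts the call.
func collectNames(s store, ids []string) ([]string, error) {
	var names []string
	for _, id := range ids {
		name, err := s.Get(id)
		if errors.Is(err, errNotExist) {
			continue // removed in the meantime; not a failure
		}
		if err != nil {
			return nil, fmt.Errorf("failed to get container %q: %v", id, err)
		}
		names = append(names, name)
	}
	return names, nil
}

func main() {
	s := store{"c1": "web"}
	names, err := collectNames(s, []string{"c1", "already-removed"})
	fmt.Println(names, err)
}
```

Passing the container object into the function, as preferred above, removes the second lookup entirely and therefore the race as well.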

Random-Liu · 2017-09-26T03:44:33Z

pkg/server/container_stats.go

 	"k8s.io/kubernetes/pkg/kubelet/apis/cri/v1alpha1/runtime"
 )

 // ContainerStats returns stats of the container. If the container does not
 // exist, the call returns an error.
 func (c *criContainerdService) ContainerStats(ctx context.Context, in *runtime.ContainerStatsRequest) (*runtime.ContainerStatsResponse, error) {
-	return nil, errors.New("not implemented")
+	// Validate the stats request
+	if in == nil || in.GetContainerId() == "" {


nit: in.GetContainerId() == "" is enough

Random-Liu · 2017-09-26T03:49:19Z

integration/container_stats_test.go

+)
+
+const (
+	defaultTestImage = "gcr.io/google_containers/pause:3.0"


Either add pauseImage in test_utils.go as a utility, or give this variable a specific name such as containerStatsTestImage.

Random-Liu · 2017-09-26T03:51:18Z

integration/container_stats_test.go

+		assert.NoError(t, runtimeService.RemovePodSandbox(sb))
+	}()
+	t.Logf("Create a container config and run container in a pod")
+	opt := func() ContainerOpts {


nit: Add helper function for labels and annotations.

Random-Liu · 2017-09-26T03:53:48Z

integration/container_stats_test.go

+		s, err = runtimeService.ContainerStats(cn)
+		if err != nil {
+			return false, err
+		}


nit: Just testStats here.

Random-Liu · 2017-09-26T03:54:52Z

integration/container_stats_test.go

+		if err != nil {
+			return false, err
+		}
+		for _, s := range stats {


nit: testStats here directly.

Random-Liu · 2017-09-26T03:56:27Z

integration/container_stats_test.go

+		testStats(t, s)
+	}
+}
+


Each of the filter tests takes 10 seconds, which means that the 3 filter tests take 30 seconds. I feel that a unit test for buildTaskMetricsRequest is good enough, given that we don't actually use these filters in CRI. And in a unit test we could cover more cases: label filtering, other combinations, etc.
We could keep one filter test here to make sure containerd.Metrics filtering does work.

WDYT?

The thought process here is that it's fine not to be conservative with tests, especially since this scenario is not exercised by e2e tests. Also, this would cover the basic combinations between the 2 components. I agree that a unit test is good for testing the filtering. I would like to add a TODO here to take this off once we have e2e coverage for this functionality. Always happy to remove code :) Let me know.

I feel that we should:

Add a comprehensive unit test for filtering, whether or not we test the filter here.

Use a table-driven test pattern to get rid of the large amount of duplicated code in this file.

However, I don't want to block this PR because of test code, so I'm fine with adding a TODO for now and coming back to it later.

Random-Liu · 2017-09-26T04:06:52Z

integration/container_stats_test.go

+	}
+}
+
+func testStats(t *testing.T, s *runtime.ContainerStats) {


Pass in container config for verification? Even for ListContainerStats, we could find corresponding container config inside the list and pass in.

abhi · 2017-09-26T17:53:42Z

@Random-Liu addressed comments. PTAL. I have added a TODO to clean up the tests, for which I will raise a PR in a day.

Random-Liu · 2017-09-26T18:21:27Z

pkg/server/container_stats_list.go

-	"errors"
-
+	"fmt"
+	"github.com/golang/glog"


Move to the next group.

Random-Liu · 2017-09-26T18:21:48Z

pkg/server/container_stats_list.go

+	if err != nil {
+		return nil, fmt.Errorf("failed to build metrics request: %v", err)
+	}
+	resp, err := c.taskService.Metrics(context.Background(), &request)


Pass ctx in.

Random-Liu · 2017-09-26T18:23:12Z

pkg/server/container_stats_list.go

+		delete(candidateContainers, stat.ID)
+		containerStats.Stats = append(containerStats.Stats, &cs)
+	}
+	// If there is a transient state where containers are dead at the time of query


It's not a transient state. For any dead container, this will happen until we remove the container.
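A minimal sketch of the merge this hunk performs, under simplified hypothetical types (metric, usage, containerStats, buildStats): containers that report metrics get full stats and are removed from the candidate set, and whatever remains (for example, dead containers not yet removed) gets writable-layer usage only.

```go
package main

import (
	"fmt"
	"sort"
)

// metric, usage, and containerStats are hypothetical, simplified types.
type metric struct {
	ID          string
	CPU, Memory uint64
}

type usage struct {
	CPU, Memory uint64
}

type containerStats struct {
	ID                 string
	Usage              *usage // nil when the container has no running task
	WritableLayerBytes uint64
}

// buildStats merges task metrics with the set of known containers.
// writable maps container ID to writable-layer bytes for every
// container in the store.
func buildStats(metrics []metric, writable map[string]uint64) []containerStats {
	// Work on a copy so the caller's map is not mutated.
	remaining := make(map[string]uint64, len(writable))
	for id, b := range writable {
		remaining[id] = b
	}
	var out []containerStats
	for _, m := range metrics {
		b, ok := remaining[m.ID]
		if !ok {
			continue // not a known container; ignore its metrics
		}
		out = append(out, containerStats{
			ID:                 m.ID,
			Usage:              &usage{CPU: m.CPU, Memory: m.Memory},
			WritableLayerBytes: b,
		})
		delete(remaining, m.ID)
	}
	// Containers left over have no task metrics. Sort the IDs so the
	// output order is deterministic.
	ids := make([]string, 0, len(remaining))
	for id := range remaining {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	for _, id := range ids {
		out = append(out, containerStats{ID: id, WritableLayerBytes: remaining[id]})
	}
	return out
}

func main() {
	stats := buildStats(
		[]metric{{ID: "running", CPU: 5, Memory: 6}},
		map[string]uint64{"running": 100, "dead": 200},
	)
	for _, s := range stats {
		fmt.Println(s.ID, s.Usage != nil, s.WritableLayerBytes)
	}
}
```

This matches the point made above: the leftover entries are not transient, and they keep appearing on every call until the container is removed from the store.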

Random-Liu · 2017-09-26T18:23:30Z

pkg/server/container_stats_list.go

+		containerStats.Stats = append(containerStats.Stats, &cs)
+	}
+	// If there is a transient state where containers are dead at the time of query
+	// but present in the containerStore , then check if the writeableLayer information


get rid of the space before ,.

Random-Liu · 2017-09-26T18:24:08Z

pkg/server/container_stats_list.go

+	stats *types.Metric,
+	cs *runtime.ContainerStats,
+) error {
+


unnecessary empty line.

Random-Liu · 2017-09-26T18:26:22Z

LGTM overall with some nits.

Random-Liu · 2017-09-26T18:27:49Z

integration/container_stats_test.go

+	}()
+	t.Logf("Create a container config and run containers in a pod")
+	containerConfigMap := make(map[string]*runtime.ContainerConfig)
+	for i := 0; i < 3; i++ {


We should have a dead container in the test, and for dead container, testStats won't work.
Probably add a TODO here, and you may want to manually validate it first.

I had added the TODO on top of testStats because that is what needs to change. I have manually verified the dead-container case by introducing a stopContainer call and checking the stats.

=== RUN TestContainerListStatsDead
&ContainerStats{Attributes:&ContainerAttributes{Id:6bf3066a1929839b5a5802cf3b3b49c7c292cc53364b3280d2a0a07dc24d9f77,Metadata:&ContainerMetadata{Name:container0,Attempt:0,},Labels:map[string]string{key: value,},Annotations:map[string]string{a.b.c: test,},},Cpu:nil,Memory:nil,WritableLayer:&FilesystemUsage{Timestamp:1506452046974590603,StorageId:&StorageIdentifier{Uuid:c656051a-9391-40fe-8333-57b410884232,},UsedBytes:&UInt64Value{Value:7,},InodesUsed:&UInt64Value{Value:20480,},},}
&ContainerStats{Attributes:&ContainerAttributes{Id:178a2722bd4ecec8728a952f07474b7714896f4afe8d69b3b58989b387e45804,Metadata:&ContainerMetadata{Name:container1,Attempt:0,},Labels:map[string]string{key: value,},Annotations:map[string]string{a.b.c: test,},},Cpu:nil,Memory:nil,WritableLayer:&FilesystemUsage{Timestamp:1506452046972987825,StorageId:&StorageIdentifier{Uuid:c656051a-9391-40fe-8333-57b410884232,},UsedBytes:&UInt64Value{Value:7,},InodesUsed:&UInt64Value{Value:20480,},},}
&ContainerStats{Attributes:&ContainerAttributes{Id:52e891cd436a10d71bc37d71f978d4142c4d49b7922ca7a2442ca75fe15c10b8,Metadata:&ContainerMetadata{Name:container2,Attempt:0,},Labels:map[string]string{key: value,},Annotations:map[string]string{a.b.c: test,},},Cpu:nil,Memory:nil,WritableLayer:&FilesystemUsage{Timestamp:1506452046973408844,StorageId:&StorageIdentifier{Uuid:c656051a-9391-40fe-8333-57b410884232,},UsedBytes:&UInt64Value{Value:7,},InodesUsed:&UInt64Value{Value:20480,},},}
--- PASS: TestContainerListStatsDead (32.88s)

Signed-off-by: Abhinandan Prativadi <abhi@docker.com>

Random-Liu · 2017-09-26T20:57:05Z

LGTM

k8s-ci-robot added cncf-cla: yes size/XXL labels Sep 20, 2017

Random-Liu self-assigned this Sep 20, 2017

yanxuean reviewed Sep 20, 2017

Random-Liu reviewed Sep 21, 2017

Random-Liu mentioned this pull request Sep 21, 2017

CRI-Containerd Missing Pieces #62

Closed

42 tasks

Random-Liu added this to the v1.0.0-alpha.0 milestone Sep 22, 2017

abhi force-pushed the metrics branch from 8fe68aa to 81c6ddc Compare September 22, 2017 05:59

abhi commented Sep 22, 2017

abhi force-pushed the metrics branch from 81c6ddc to 043ed76 Compare September 22, 2017 16:35

abhi force-pushed the metrics branch from 043ed76 to efcab5e Compare September 26, 2017 00:10

k8s-ci-robot added size/L and removed size/XXL labels Sep 26, 2017

abhi force-pushed the metrics branch 2 times, most recently from 83f1fe2 to 1a98921 Compare September 26, 2017 00:14

k8s-ci-robot added size/XL and removed size/L labels Sep 26, 2017

abhi force-pushed the metrics branch from 1a98921 to 8c873c2 Compare September 26, 2017 00:16

k8s-ci-robot added size/L and removed size/XL labels Sep 26, 2017

Random-Liu reviewed Sep 26, 2017

abhi force-pushed the metrics branch from 8c873c2 to e9348c4 Compare September 26, 2017 17:29

k8s-ci-robot added size/XL and removed size/L labels Sep 26, 2017

abhi force-pushed the metrics branch from e9348c4 to e9f1dc0 Compare September 26, 2017 17:33

abhi force-pushed the metrics branch from e9f1dc0 to d29bdff Compare September 26, 2017 17:47

abhi force-pushed the metrics branch 2 times, most recently from cc071de to 39de552 Compare September 26, 2017 18:21

Random-Liu reviewed Sep 26, 2017

abhi force-pushed the metrics branch 2 times, most recently from c8be4d4 to 49aa6f9 Compare September 26, 2017 19:00

abhi added 2 commits September 26, 2017 12:03

Adding container metrics

d029894

Signed-off-by: Abhinandan Prativadi <abhi@docker.com>

Adding integration test for container stats

853804b

Signed-off-by: Abhinandan Prativadi <abhi@docker.com>

abhi force-pushed the metrics branch from 49aa6f9 to 853804b Compare September 26, 2017 19:03

Random-Liu added the lgtm label Sep 26, 2017

Random-Liu merged commit e7a5001 into containerd:master Sep 26, 2017
