
Node details dashboard has missing metrics with containerd #2800

Closed
Tracked by #4110
wyb1 opened this issue Aug 31, 2020 · 11 comments · Fixed by #6628
Labels
area/monitoring Monitoring (including availability monitoring and alerting) related kind/bug Bug

Comments

wyb1 (Contributor) commented Aug 31, 2020

How to categorize this issue?

/area monitoring
/kind bug
/priority normal

What happened:
The node details dashboard has missing metrics when the shoot is configured to use containerd: the Network I/O pressure panel and the system service usage panels show no data.
[screenshot: node details dashboard with empty panels]

What you expected to happen:
The dashboard should contain the data. Example with docker:
[screenshot: node details dashboard with populated panels (docker runtime)]
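For context, the Network I/O pressure panel is presumably built on cadvisor's per-pod network series (e.g. container_network_receive_bytes_total); a query roughly along these lines, illustrative only and not necessarily the dashboard's exact expression, returns no data in the containerd case:

```
# Illustrative only; the dashboard's actual query and label set may differ.
sum by (pod) (rate(container_network_receive_bytes_total[5m]))
```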

How to reproduce it (as minimally and precisely as possible):
Create a shoot with

```
cri:
  name: containerd
```

and check the node details dashboard: the panels mentioned above show no data.
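For orientation, the cri section sits in the worker pools of the Shoot spec; a minimal sketch, with the worker pool name and omitted fields as placeholders:

```
# Minimal sketch; worker pool name and omitted fields are placeholders.
spec:
  provider:
    workers:
      - name: worker-pool-1
        cri:
          name: containerd
```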

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.18.5
  • Cloud provider or hardware configuration: aws
@wyb1 wyb1 added the kind/bug Bug label Aug 31, 2020
@gardener-robot gardener-robot added area/monitoring Monitoring (including availability monitoring and alerting) related priority/normal labels Aug 31, 2020
@danielfoehrKn
Copy link
Contributor

danielfoehrKn commented Sep 11, 2020

containerd exposes a configurable Prometheus-compatible metrics endpoint (not part of the default config.toml) that we do not enable yet.
Apart from that, it should expose the default metrics via CRI for each container and per-pod ContainerStats:

```
CpuUsage cpu = 2;
// Memory usage gathered from the container.
MemoryUsage memory = 3;
// Usage of the writable layer.
FilesystemUsage writable_layer = 4;
```

I am not sure whether containerd even exposes network I/O; that would need some more investigation.
I can take a look when focusing on the container runtime topic in the future, though I am currently blocked with the quality focus.
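For reference, the metrics endpoint mentioned above is configured via the [metrics] section of containerd's config.toml; a minimal sketch, where the listen address is an example value rather than anything Gardener currently sets:

```
# Minimal sketch of enabling containerd's Prometheus-compatible metrics
# endpoint; the listen address is an example value.
[metrics]
  address = "127.0.0.1:1338"
  grpc_histogram = false
```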

@gardener-robot gardener-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 11, 2020
rfranzke (Member) commented:

Any plans to mitigate this issue @wyb1 @danielfoehrKn ?

danielfoehrKn (Contributor) commented:

Still working on other issues, but I intend to pick this up when I pick up container runtimes again. Otherwise @wyb1 might take a look.

@gardener-robot gardener-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 22, 2021
rfranzke (Member) commented:

/ping @wyb1 @istvanballok

gardener-robot commented:

@istvanballok, @wyb1

/ping @wyb1 @istvanballok

istvanballok (Contributor) commented:

My latest info is that the cadvisor component is (currently) not exposing some metrics if the container runtime is containerd. That is why the panels in the screenshot above are empty. cc @voelzmo

gardener-ci-robot commented:

The Gardener project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten

/close

@gardener-prow gardener-prow bot closed this as completed Mar 30, 2022
gardener-prow bot commented Mar 30, 2022

@gardener-ci-robot: Closing this issue.

In response to this:

/close


istvanballok (Contributor) commented:

/reopen

Looking into this issue, I found that the network-related metrics are actually exposed by cadvisor.
The reason why we drop them is this rule:

```
- source_labels: [ container ]
  regex: ^$
  action: drop
```

If the container label is empty, the series is dropped. This heuristic was probably introduced to keep only relevant metrics.
With docker as the container runtime, the container label was "POD"; with containerd, it is empty.
Note that for network-related metrics the container label does not make sense (so the empty value is correct), because the containers of a pod share the same network namespace and the network metrics therefore cannot distinguish between containers.
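A sketch of one possible adjustment, illustrative only and not necessarily the change that was merged: since Prometheus' RE2 regexes have no negative lookahead, the drop rule can list the container-less metric prefixes it still wants to remove, which leaves the per-pod network series untouched:

```
# Illustrative sketch only, not necessarily the merged change: drop
# container-less cpu/memory/fs series but keep the per-pod network series,
# which legitimately carry an empty container label.
- source_labels: [ container, __name__ ]
  separator: ;
  regex: ;(container_cpu_.*|container_memory_.*|container_fs_.*)
  action: drop
```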

gardener-prow bot commented Aug 18, 2022

@istvanballok: Reopened this issue.

In response to this:

/reopen



@gardener-prow gardener-prow bot reopened this Aug 18, 2022
timebertt (Member) commented:

/remove-lifecycle rotten

@gardener-prow gardener-prow bot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Aug 22, 2022
istvanballok added a commit to istvanballok/gardener that referenced this issue Aug 24, 2022
The runtime cgroup is the cgroup path the container runtime is expected
to be isolated in.
https://github.com/kubernetes/kubernetes/blob/efa5692c0b5f01bd33d8a112ab98b386300198e7/pkg/kubelet/config/flags.go#L31

Without this flag, the cadvisor metrics exposed by the kubelet via

```
k proxy
curl -s http://localhost:8001/api/v1/nodes/<node>/proxy/metrics/cadvisor
```

in a cluster with containerd as the container runtime only cover `/system.slice/kubelet.service`.

With this command line flag, metrics are reported for both
`/system.slice/kubelet.service` and `/system.slice/containerd.service`.

This is the expected behavior based on the experience with clusters
that use docker as a container runtime: in those clusters, metrics
are reported for both the kubelet.service and the docker.service.

Consequently in clusters with containerd, one would expect
metrics for both the kubelet.service and the containerd.service.

See the system services panels in the issue
gardener#2800

Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com>
Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
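For reference, the flag described above is the kubelet's runtime cgroup flag; on a systemd-based containerd node it would look roughly as follows, with the exact cgroup path depending on the OS image:

```
# Rough illustration of the kubelet flag the commit refers to; the exact
# cgroup path depends on the OS image.
kubelet ... --runtime-cgroups=/system.slice/containerd.service
```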
gardener-prow bot pushed a commit that referenced this issue Aug 25, 2022
(same commit message as above)