
Node details dashboard has missing metrics with containerd #2800

Closed
Tracked by #4110
wyb1 opened this issue Aug 31, 2020 · 11 comments · Fixed by #6628
Labels
area/monitoring Monitoring (including availability monitoring and alerting) related kind/bug Bug

Comments

wyb1 (Contributor) commented Aug 31, 2020

How to categorize this issue?

/area monitoring
/kind bug
/priority normal

What happened:
The node details dashboard has missing metrics when the shoot is configured to use containerd: the Network I/O pressure panel and the system service usage panels show no data.
[screenshot: node details dashboard with empty panels]

What you expected to happen:
The dashboard should contain the data. Example with docker:
[screenshot: node details dashboard with populated panels (docker runtime)]
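For context, the Network I/O pressure panel is presumably built on cadvisor's per-pod network series (e.g. container_network_receive_bytes_total); a query roughly along these lines, illustrative only and not necessarily the dashboard's exact expression, returns no data in the containerd case:

```
# Illustrative only; the dashboard's actual query and label set may differ.
sum by (pod) (rate(container_network_receive_bytes_total[5m]))
```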

How to reproduce it (as minimally and precisely as possible):
Create a shoot with

```
cri:
  name: containerd
```

and check the node details dashboard: the panels mentioned above show no data.
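For orientation, the cri section sits in the worker pools of the Shoot spec; a minimal sketch, with the worker pool name and omitted fields as placeholders:

```
# Minimal sketch; worker pool name and omitted fields are placeholders.
spec:
  provider:
    workers:
      - name: worker-pool-1
        cri:
          name: containerd
```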

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.18.5
  • Cloud provider or hardware configuration: aws
@wyb1 wyb1 added the kind/bug Bug label Aug 31, 2020
@gardener-robot gardener-robot added area/monitoring Monitoring (including availability monitoring and alerting) related priority/normal labels Aug 31, 2020
@danielfoehrKn
Copy link
Contributor

danielfoehrKn commented Sep 11, 2020

containerd exposes a configurable Prometheus-compatible metrics endpoint (not part of the default config.toml) that we do not enable yet.
Apart from that, it should expose the default metrics via CRI for each container and per-pod ContainerStats:

```
CpuUsage cpu = 2;
// Memory usage gathered from the container.
MemoryUsage memory = 3;
// Usage of the writable layer.
FilesystemUsage writable_layer = 4;
```

I am not sure whether containerd even exposes network I/O; that would need some more investigation.
I can take a look when focusing on the container runtime topic in the future, though I am currently blocked with the quality focus.
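For reference, the metrics endpoint mentioned above is configured via the [metrics] section of containerd's config.toml; a minimal sketch, where the listen address is an example value rather than anything Gardener currently sets:

```
# Minimal sketch of enabling containerd's Prometheus-compatible metrics
# endpoint; the listen address is an example value.
[metrics]
  address = "127.0.0.1:1338"
  grpc_histogram = false
```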

@gardener-robot gardener-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 11, 2020
rfranzke (Member) commented:

Any plans to mitigate this issue @wyb1 @danielfoehrKn ?

danielfoehrKn (Contributor) commented:

Still working on other issues, but I intend to pick this up when I pick up container runtimes again. Otherwise @wyb1 might take a look.

@gardener-robot gardener-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 22, 2021
rfranzke (Member) commented:

/ping @wyb1 @istvanballok

gardener-robot commented:

@istvanballok, @wyb1

/ping @wyb1 @istvanballok

istvanballok (Contributor) commented:

My latest info is that the cadvisor component is (currently) not exposing some metrics if the container runtime is containerd. That is why the panels in the screenshot above are empty. cc @voelzmo

gardener-ci-robot commented:

The Gardener project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten

/close

@gardener-prow gardener-prow bot closed this as completed Mar 30, 2022
gardener-prow bot commented Mar 30, 2022

@gardener-ci-robot: Closing this issue.

In response to this:

/close


istvanballok (Contributor) commented:

/reopen

Looking into this issue, I found that the network-related metrics are actually exposed by cadvisor.
The reason why we drop them is this rule:

```
- source_labels: [ container ]
  regex: ^$
  action: drop
```

If the container label is empty, the series is dropped. This heuristic was probably introduced to keep only relevant metrics.
With docker as the container runtime, the container label was "POD"; with containerd, it is empty.
Note that for network-related metrics the container label does not make sense (so the empty value is correct), because the containers of a pod share the same network namespace and the network metrics therefore cannot distinguish between containers.
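A sketch of one possible adjustment, illustrative only and not necessarily the change that was merged: since Prometheus' RE2 regexes have no negative lookahead, the drop rule can list the container-less metric prefixes it still wants to remove, which leaves the per-pod network series untouched:

```
# Illustrative sketch only, not necessarily the merged change: drop
# container-less cpu/memory/fs series but keep the per-pod network series,
# which legitimately carry an empty container label.
- source_labels: [ container, __name__ ]
  separator: ;
  regex: ;(container_cpu_.*|container_memory_.*|container_fs_.*)
  action: drop
```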

gardener-prow bot commented Aug 18, 2022

@istvanballok: Reopened this issue.

In response to this:

/reopen



@gardener-prow gardener-prow bot reopened this Aug 18, 2022
timebertt (Member) commented:

/remove-lifecycle rotten

@gardener-prow gardener-prow bot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Aug 22, 2022
istvanballok added a commit to istvanballok/gardener that referenced this issue Aug 24, 2022
The runtime cgroup is the cgroup path the container runtime is expected
to be isolated in.
https://github.com/kubernetes/kubernetes/blob/efa5692c0b5f01bd33d8a112ab98b386300198e7/pkg/kubelet/config/flags.go#L31

Without this flag, the cadvisor metrics exposed by the kubelet via

```
k proxy
curl -s http://localhost:8001/api/v1/nodes/<node>/proxy/metrics/cadvisor
```

in a cluster with containerd as the container runtime only cover `/system.slice/kubelet.service`.

With this command line flag, metrics are reported for both
`/system.slice/kubelet.service` and `/system.slice/containerd.service`.

This is the expected behavior based on the experience with clusters
that use docker as a container runtime: in those clusters, metrics
are reported for both the kubelet.service and the docker.service.

Consequently in clusters with containerd, one would expect
metrics for both the kubelet.service and the containerd.service.

See the system services panels in the issue
gardener#2800

Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com>
Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
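For reference, the flag described above is the kubelet's runtime cgroup flag; on a systemd-based containerd node it would look roughly as follows, with the exact cgroup path depending on the OS image:

```
# Rough illustration of the kubelet flag the commit refers to; the exact
# cgroup path depends on the OS image.
kubelet ... --runtime-cgroups=/system.slice/containerd.service
```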
gardener-prow bot pushed a commit that referenced this issue Aug 25, 2022
(same commit message as above)