Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Properly implement metrics for Kata Containers when using CRI stats. #4470

Conversation

fidencio
Copy link
Contributor

/kind bug

What this PR does / why we need it:

When using CRI stats, Kata Container pods have no metrics at all. This happens because CRI-O doesn't know how to properly unmarshal the message coming from Kata Containers.

Which issue(s) this PR fixes:

None

Special notes for your reviewer:

This PR, as it's at the moment it was opened, allows CRI-O to return something to kubelet, although the returned values for memory are clearly wrong. I'll keep updating the description according to the progress made.

Last but not least, we should compare results with what containerd provides, as I do believe we should be getting information about Block & Net Input & Ouput.

Does this PR introduce a user-facing change?

None

Or should we add a note about this change?

@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. labels Jan 13, 2021
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 13, 2021
@fidencio fidencio force-pushed the wip/runtime-vm-properly-implement-metrics branch from 0b974d3 to e142245 Compare January 13, 2021 21:26
@codecov
Copy link

codecov bot commented Jan 13, 2021

Codecov Report

Merging #4470 (a496e30) into master (8118d4a) will decrease coverage by 0.03%.
The diff coverage is 0.00%.

❗ Current head a496e30 differs from pull request most recent head 1993450. Consider uploading reports for the commit 1993450 to get more accurate results

@@            Coverage Diff             @@
##           master    #4470      +/-   ##
==========================================
- Coverage   41.32%   41.28%   -0.04%     
==========================================
  Files         107      107              
  Lines        9405     9414       +9     
==========================================
  Hits         3887     3887              
- Misses       5074     5083       +9     
  Partials      444      444              

@fidencio fidencio force-pushed the wip/runtime-vm-properly-implement-metrics branch 2 times, most recently from 19c31c5 to bc046e6 Compare January 15, 2021 12:32
@fidencio
Copy link
Contributor Author

fidencio commented Jan 15, 2021

Okay, with the last commit CRI stats are mostly functional when used with the "vm" runtime type. I still need to write tests for those, though.

[fidencio@localhost cri-o]$ kubectl get pods
NAME             READY   STATUS    RESTARTS   AGE
example-fedora   1/1     Running   0          130m

[fidencio@localhost cri-o]$ kubectl get pod example-fedora -o yaml | grep runtimeClassName
      {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"labels":{"app":"example-fedora-app"},"name":"example-fedora","namespace":"default"},"spec":{"containers":[{"args":["-m","http.server","8080"],"command":["python3"],"image":"fedora:33","name":"example-fedora","ports":[{"containerPort":8080}]}],"runtimeClassName":"kata"}}
        f:runtimeClassName: {}
  runtimeClassName: kata

[fidencio@localhost cri-o]$ kubectl top pod
NAME             CPU(cores)   MEMORY(bytes)
example-fedora   1m           9Mi

@fidencio fidencio force-pushed the wip/runtime-vm-properly-implement-metrics branch from bc046e6 to a0a52e4 Compare January 15, 2021 17:03
@fidencio fidencio force-pushed the wip/runtime-vm-properly-implement-metrics branch from a0a52e4 to 885b7fd Compare April 8, 2021 21:06
@fidencio fidencio changed the title [WIP] Properly implement metrics for Kata Containers when using CRI stats. Properly implement metrics for Kata Containers when using CRI stats. Apr 8, 2021
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 8, 2021
@TomSweeneyRedHat
Copy link
Contributor

One nit comment and some unhappy tests.
Otherwise LGTM

@fidencio fidencio force-pushed the wip/runtime-vm-properly-implement-metrics branch from 885b7fd to 79913bf Compare April 8, 2021 22:18
There's no reason for us to keep maintaining our own copy of typeurl,
let's directly rely on the `github.com/containerd/typeurl` instead.

Signed-off-by: Fabiano Fidêncio <fidencio@redhat.com>
As this function is only used by runtimeVM::ContainerStats(), let's move
it to the runtime_vm.go file, making our life easier when doing upcoming
changes on runtimeVM::ContainerStats().

Signed-off-by: Fabiano Fidêncio <fidencio@redhat.com>
CRI-O has been using `containers/libpod/pkg/cgroups` in order to get
metrics and, later on, convert it to CRI Stats. Although this approach
is fine (and desired) for the OCI runtime type. we can't rely on that
for the VM runtime type as the data sent by Kata Containers comes from
`containerd/cgroup`.

Signed-off-by: Fabiano Fidêncio <fidencio@redhat.com>
WorkingSetBytes is the bit that needs to be set in order to provide
kubelet the pod's memory information.

Although it's calculated in a slightly different way for "oci" runtime
type, the logic is quite similar for the "vm" runtime type, with the
only difference being where the TotalInactiveFile information comes
from.

This is the last bit needed in order to have `kubectl top pod $pod`
working, as shown below:
```
[fidencio@localhost cri-o]$ kubectl get pods
NAME             READY   STATUS    RESTARTS   AGE
example-fedora   1/1     Running   0          130m

[fidencio@localhost cri-o]$ kubectl get pod example-fedora -o yaml | grep runtimeClassName
      {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"labels":{"app":"example-fedora-app"},"name":"example-fedora","namespace":"default"},"spec":{"containers":[{"args":["-m","http.server","8080"],"command":["python3"],"image":"fedora:33","name":"example-fedora","ports":[{"containerPort":8080}]}],"runtimeClassName":"kata"}}
        f:runtimeClassName: {}
  runtimeClassName: kata

[fidencio@localhost cri-o]$ kubectl top pod
NAME             CPU(cores)   MEMORY(bytes)
example-fedora   1m           9Mi
```

Signed-off-by: Fabiano Fidêncio <fidencio@redhat.com>
@fidencio fidencio force-pushed the wip/runtime-vm-properly-implement-metrics branch from 79913bf to 1993450 Compare April 9, 2021 04:47
Copy link
Member

@saschagrunert saschagrunert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fidencio, saschagrunert

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [fidencio,saschagrunert]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fidencio
Copy link
Contributor Author

fidencio commented Apr 9, 2021

/retest

1 similar comment
@fidencio
Copy link
Contributor Author

fidencio commented Apr 9, 2021

/retest

@openshift-bot
Copy link

/retest

Please review the full test history for this PR and help us cut down flakes.

7 similar comments
@openshift-bot
Copy link

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link

/retest

Please review the full test history for this PR and help us cut down flakes.

@fidencio
Copy link
Contributor Author

/hold till kata-containers CI is fixed.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 18, 2021
@fidencio
Copy link
Contributor Author

/unhold

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 19, 2021
@fidencio
Copy link
Contributor Author

/test kata-containers

@openshift-bot
Copy link

/retest

Please review the full test history for this PR and help us cut down flakes.

@fidencio
Copy link
Contributor Author

/test kata-containers

@openshift-bot
Copy link

/retest

Please review the full test history for this PR and help us cut down flakes.

@fidencio
Copy link
Contributor Author

/test kata-containers

@openshift-bot
Copy link

/retest

Please review the full test history for this PR and help us cut down flakes.

@fidencio
Copy link
Contributor Author

/test kata-containers

4 similar comments
@fidencio
Copy link
Contributor Author

/test kata-containers

@fidencio
Copy link
Contributor Author

/test kata-containers

@fidencio
Copy link
Contributor Author

/test kata-containers

@fidencio
Copy link
Contributor Author

/test kata-containers

@openshift-bot
Copy link

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Apr 19, 2021

@fidencio: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/openshift-jenkins/e2e_crun_cgroupv2 1993450 link /test e2e_cgroupv2

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit 807c9da into cri-o:master Apr 19, 2021
@fidencio
Copy link
Contributor Author

/cherry-pick release-1.21

@openshift-cherrypick-robot

@fidencio: new pull request created: #4776

In response to this:

/cherry-pick release-1.21

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants