Collect container and Sandbox stats through libctr cgroup managers #7658

Mo-Fatah · 2024-01-05T14:08:49Z

What type of PR is this?

/kind feature

What this PR does / why we need it:

stats server currently manually collects stats through the Populate** functions in cgmgr package which doesn't report correct stats. This PR bases the stats collection on libcontainer cgroup managers.

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Use libcontainer cgroup manager to collect container and Sandbox stats

haircommander · 2024-01-05T19:41:08Z

/ok-to-test
/approve

LGTM

Mo-Fatah · 2024-01-06T11:04:19Z

/retest

sohankunkerkar · 2024-01-08T14:31:39Z

/retest

sohankunkerkar · 2024-01-08T14:35:26Z

@Mo-Fatah do you mind pasting the stats o/p with this change? I just wanted to see the actual stats.

Mo-Fatah · 2024-01-08T15:41:54Z

Sure! @sohankunkerkar
I built crio from the current branch, ran a pod, ran a sleep container, finally used the command crictl statsp -o json

stats from the current changes:

...
           "cpu": {
              "timestamp": "1704727367994867370",
              "usageCoreNanoSeconds": {
                "value": "39525000"
              },
              "usageNanoCores": {
                "value": "0"
              }
            },
            "memory": {
              "timestamp": "1704727367994867370",
              "workingSetBytes": {
                "value": "839680"
              },
              "availableBytes": {
                "value": "265420800"
              },
              "usageBytes": {
                "value": "3014656"
              },
              "rssBytes": {
                "value": "98304"
              },
              "pageFaults": {
                "value": "2561"
              },
              "majorPageFaults": {
                "value": "17"
              }
            },
            "writableLayer": {
              "timestamp": "1704727367994869583",
              "fsId": {
                "mountpoint": "/var/lib/containers/storage/overlay/833e7fe8b7b3de29c496abea65b7260059af7cf54aa97d8f42bf519cd70b28f1/merged"
              },
              "usedBytes": {
                "value": "0"
              },
              "inodesUsed": {
                "value": "4"
              }
            },
            "swap": {
              "timestamp": "1704727367994867370",
              "swapAvailableBytes": {
                "value": "265420800"
              },
              "swapUsageBytes": {
                "value": "3014656"
              }
            }
...

doing the same steps with the main branch:

...
            "cpu": {
              "timestamp": "1704727905610530972",
              "usageCoreNanoSeconds": {
                "value": "35000000"
              },
              "usageNanoCores": null
            },
            "memory": {
              "timestamp": "1704727905610530972",
              "workingSetBytes": {
                "value": "102400"
              },
              "availableBytes": {
                "value": "18446744073441218560"
              },
              "usageBytes": {
                "value": "102400"
              },
              "rssBytes": {
                "value": "0"
              },
              "pageFaults": {
                "value": "2541"
              },
              "majorPageFaults": {
                "value": "0"
              }
            },
            "writableLayer": {
              "timestamp": "1704727905610609128",
              "fsId": {
                "mountpoint": "/var/lib/containers/storage/overlay/12351a22aa4c2f4339754a2e8d670f17b3dc00aeebf903a8c9b7c77a95dd4466/merged"
              },
              "usedBytes": {
                "value": "0"
              },
              "inodesUsed": {
                "value": "4"
              }
            },
            "swap": null
...

Notable changes in this PR:

adding the swap values. (it was neglected before)
some memory values are present (e.g. RSS)
The numbers should be a bit different since the stats reported from libcontainer uses different values from the containers/common package (e.g. in the memory/usageBytes value: containers/common uses memory.stat:anon value from the cgroup while libcontainer uses memory.current which should be the correct value.

sohankunkerkar · 2024-01-08T15:52:42Z

@Mo-Fatah Thanks for the information. The changes LGTM
/lgtm

kwilczynski · 2024-01-08T16:50:37Z

@Mo-Fatah, this value (an excerpt from your example):

              "availableBytes": {
                "value": "18446744073441218560"
              },

That is about 18 Exabytes. if I am not mistaken 😄 How does this sort of value work with metrics?

Update:

I suppose, we are outputting this value via a metric per:

# HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes.
# TYPE process_virtual_memory_max_bytes gauge
process_virtual_memory_max_bytes 1.8446744073709552e+19

However, we will now be going from this large value to the following:

              "availableBytes": {
                "value": "265420800"
              },

Would this be the same metric?

Admittedly, I haven't checked how libcontainer collects these values. Were you able to compare these metrics with what /proc/<PID>/statm and /proc/<PID>/status shows? I am curious.

codecov · 2024-01-08T17:18:44Z

Codecov Report

❗ No coverage uploaded for pull request base (main@87a309c). Click here to learn what that means.
Report is 2 commits behind head on main.
The diff coverage is 0.00%.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #7658   +/-   ##
=======================================
  Coverage        ?   47.83%           
=======================================
  Files           ?      145           
  Lines           ?    16247           
  Branches        ?        0           
=======================================
  Hits            ?     7772           
  Misses          ?     7532           
  Partials        ?      943

Mo-Fatah · 2024-01-08T17:33:44Z

@kwilczynski I am not sure how this huge number is generated 😅 . However, the availableBytes isn't related to libcontainer because we calculate it ourselves in the code. k8s calculate it by memory limit - memory usage ( or workingset)
https://github.com/kubernetes/kubernetes/blob/94f15bbbcbe952762b7f5e6e3f77d86ecec7d7c2/pkg/kubelet/stats/helper.go#L69

which is currently miscalculated on the main branch in cri-o
https://github.com/cri-o/cri-o/blob/main/internal/config/cgmgr/stats_linux.go#L73

cri-o calculates it the opposite order, and usually the memory limit > usage. So, this is my guess why we get this huge number.

haircommander · 2024-01-08T17:36:21Z

@kwilczynski I am not sure how this huge number is generated 😅 . However, the availableBytes isn't related to libcontainer because we calculate it ourselves in the code. k8s calculate it by memory limit - memory usage ( or workingset) https://github.com/kubernetes/kubernetes/blob/94f15bbbcbe952762b7f5e6e3f77d86ecec7d7c2/pkg/kubelet/stats/helper.go#L69

which is currently miscalculated on the main branch in cri-o https://github.com/cri-o/cri-o/blob/main/internal/config/cgmgr/stats_linux.go#L73

cri-o calculates it the opposite order, and usually the memory limit > usage. So, this is my guess why we get this huge number.

whoops, this probably was me a couple of years ago. I wonder if this causes #6747 as well

Mo-Fatah · 2024-01-08T18:34:06Z

@kwilczynski I am not sure how this huge number is generated 😅 . However, the availableBytes isn't related to libcontainer because we calculate it ourselves in the code. k8s calculate it by memory limit - memory usage ( or workingset) https://github.com/kubernetes/kubernetes/blob/94f15bbbcbe952762b7f5e6e3f77d86ecec7d7c2/pkg/kubelet/stats/helper.go#L69
which is currently miscalculated on the main branch in cri-o https://github.com/cri-o/cri-o/blob/main/internal/config/cgmgr/stats_linux.go#L73
cri-o calculates it the opposite order, and usually the memory limit > usage. So, this is my guess why we get this huge number.

whoops, this probably was me a couple of years ago. I wonder if this causes #6747 as well

I guess so, will have a look at it tomorrow.

sohankunkerkar · 2024-01-09T14:35:18Z

/retest

sohankunkerkar · 2024-01-09T15:52:29Z

/test ci-cgroupv2-integration

sohankunkerkar · 2024-01-09T19:12:18Z

/test ci-cgroupv2-integration

Mo-Fatah · 2024-01-11T13:37:23Z

I couldn't reproduce the failures on my machine. They also seem to be persistently failing, don't think it's a flakiness issue

Mo-Fatah · 2024-01-11T13:37:42Z

/retest

haircommander · 2024-01-11T19:13:13Z

@littlejawa can you PTAL?

littlejawa · 2024-01-12T09:04:23Z

@littlejawa can you PTAL?

With kata, the error seems to come from :

# time="2024-01-11 14:34:09.959493738Z" level=error msg="Error getting container stats e53309b9b475e2fcd9b0b1f5691b1c9ee3a3a913e25ae46014d91c48d5b3f1fa: error while statting cgroup v2: [openat2 /sys/fs/cgroup/pod_123.slice/pod_123-456.slice/crio-e53309b9b475e2fcd9b0b1f5691b1c9ee3a3a913e25ae46014d91c48d5b3f1fa.scope/cgroup.procs: no such file or directory]" file="stats/stats_server_linux.go:119"

As I understand it, libcontainer tries to access the cgroup information from the host, but with kata the container's cgroup information is within a VM, not accessible to the host.

The way we do it today is through the call to
cStats, err := ss.Runtime().ContainerStats(context.TODO(), c, sb.CgroupParent())
that you removed.

It allows a different behaviour between runtimes: for regular (oci) runtimes, it is conlcuded by a call to PopulateContainerCgroupStats.
For kata containers, it is done by an exchange with the kata agent within the VM, which returns the expected values. This is done in ContainerStats

I'm wondering if we could keep this indirection: keep calling ss.Runtime().ContainerStats(), and modify the oci part to call cgroupStats() and return the stats in cgroup format? The rest of you code (converting from cgroup to CRI) wouldn't change.

I would need to change the kata-specific part to return cgroup data too.
I will need to do some conversion, as the VM cgroup version may differ from the host's. Today I'm converting it to CRI anyway, but if we expect cgroup stats, I'll need to make sure I provide the right version of it.
But returning cgroup stats seems right here, as the conversion to CRI would then be centralized for all runtimes.

Do you think that would work, or do you have other suggestions?

kwilczynski · 2024-01-12T13:21:46Z

Aside from the bug that we had for a while, and which @Mo-Fatah, fixed (thank you!). We need to make sure that these new metrics hold water.

@Mo-Fatah, were you able to compare these metrics with what /proc/<PID>/statm and /proc/<PID>/status shows?

For instance, from your example, both of these values are the same:

"availableBytes": {
  "value": "265420800"
},

"swapAvailableBytes": {
  "value": "265420800"
},

A problem here or merely a happy coincidence? 😄 If there is no swap, does it show as a value of 0 or a null in JSON?

Basically, everything looks very good, and it's solid work, but we need to make some sanity checks to see if things are what they say they are and that values also make sense for certain metrics.

Generally, these metrics are such that process, a CRI-O process, I suppose, metrics are stored alongside what seems to be system-wide metrics.

"memory": {
  "timestamp": "1704727367994867370",
  "workingSetBytes": {
    "value": "839680"
  },
  "availableBytes": {
    "value": "265420800"
  },
  "usageBytes": {
    "value": "3014656"
  },
  "rssBytes": {
    "value": "98304"
  },
  "pageFaults": {
    "value": "2561"
  },
  "majorPageFaults": {
    "value": "17"
  }
},

For example, within the above, we don't have virtual address space allocated for the CRI-O process, which is also useful to track, even though we have workingSetBytes that most likely include anonymous and virtual address space too.

So, basically, verify some things, make sure that everything looks reasonable, check the Prometheus labels, etc.

Mo-Fatah · 2024-01-12T15:21:48Z

Thanks for the details! @littlejawa . Your approach seems reasonable to me.

for oci I will use cStats, err := ss.Runtime().ContainerStats(context.TODO(), c, sb.CgroupParent()) and will change ContainerStats internally to return the new CgroupStats type.
for kata (vm runtime), will change the ContainerStats function to return CgroupStats instead of CRI stats by converting the metrics into CgroupStats. I see that we already have conversion functions, so I will edit those.

the conversion to CRI finally will be remain in the stats server. (correct me if i lost my way)

Mo-Fatah · 2024-01-12T15:46:17Z

@kwilczynski

"availableBytes": {
"value": "265420800"
},
"swapAvailableBytes": {
"value": "265420800"
},
A problem here or merely a happy coincidence? 😄 If there is no swap, does it show as a value of 0 or a null in JSON?

Not a coincidence. The reason why availableBytes = swapAvailableBytes and usageBytes = swapUsageBytes is due to libcontainer's choice of being compatible with cgroupv1 (the numbers I provided are by cgroupv2) https://github.com/opencontainers/runc/blob/main/libcontainer/cgroups/fs2/memory.go#L122-L128 They add the the memory availableBytes , usageBytes to swapAvailableBytes, swapUsageBytes respectively.
In the stats i provided, the swap values are 0, therefore their values are equal to the memory ones.

were you able to compare these metrics with what /proc//statm and /proc//status shows?
So, basically, verify some things, make sure that everything looks reasonable, check the Prometheus labels, etc.

will have a look again at the numbers and verify them

For example, within the above, we don't have virtual address space allocated for the CRI-O process, which is also useful to track, even though we have workingSetBytes that most likely include anonymous and virtual address space too.

Can you elaborate that part? I mean, those stats are part of the cri-api

haircommander · 2024-01-12T16:46:52Z

For example, within the above, we don't have virtual address space allocated for the CRI-O process, which is also useful to track, even though we have workingSetBytes that most likely include anonymous and virtual address space too.

yeah unfortunately for now we can't fuss with the metrics too much. We basically need 1:1 metrics coverage between what currently exists and what cri-o will collect. In the future we can investigate adding additional ones that better reflect the state of the processes

also note: these are for pods and containers, so the address space of cri-o itself is not relevant here

sohankunkerkar · 2024-01-25T15:37:10Z

Hey @Mo-Fatah once you mark this PR ready for review, could you also squash the redundant commits and place them in the meaningful ones?

Mo-Fatah · 2024-01-26T08:21:40Z

/retest

Mo-Fatah · 2024-01-26T17:51:16Z

/retest

Mo-Fatah · 2024-01-26T18:39:48Z

internal/config/cgmgr/stats_test.go

@@ -1,91 +0,0 @@
-package cgmgr_test


This file just tests UpdateWithMemoryStatsFromFile function which is deleted and no longer used

littlejawa · 2024-01-29T08:54:19Z

internal/lib/stats/stats_server_linux.go

@@ -40,11 +47,12 @@ func (ss *StatsServer) updateSandbox(sb *sandbox.Sandbox) *types.PodSandboxStats
 		if c.StateNoLock().Status == oci.ContainerStateStopped {
 			continue
 		}
-		cStats, err := ss.Runtime().ContainerStats(context.TODO(), c, sb.CgroupParent())
+		cgstats, err := ss.Runtime().ContainerStats(context.TODO(), c, sb.CgroupParent())


There is a problem here: you're expecting ContainerStats() to return a CgroupStats structure, but you're modifying ContainerStats only in the next commit.
Your first commit then doesn't build - you get an error when you try to use cgstats as a parameter to containerCRIStats()

internal/lib/stats/stats_server_linux.go:55:31: cannot use cgstats (variable of type *"k8s.io/cri-api/pkg/apis/runtime/v1".ContainerStats) as *cgmgr.CgroupStats value in argument to containerCRIStats

If you don't want to merge the two commits (which would make it pretty big), maybe you could introduce that function code changes only in the second commit?
Not sure what option is best, but I think we need to have independently buildable commits

I would go with merging the two commits if no problem with that. it will be a big commit but I guess all changes are relevant. ( will make the commit message a bit meaningful)

haircommander · 2024-01-29T15:29:14Z

internal/config/cgmgr/stats_linux.go

-// CgroupStats is a structure used to augment libctr.Stats object, because
-// it does not have all of the information required for all of the stats we're
-// interested in.
+// This is a universal stats object to be used across different runtime implementations.


I find myself worried about the translation here. there is already high cpu usage on stats collection from golang object GC (each tiny object is created and destroyed quite quickly). I think I am okay with this for now but it's possible we may want to find a different way. For instance, having the statsserver run linux only for now

If you'd rather do that as a follow up (or not deal with that and have someone else do it) @Mo-Fatah let me know

I personally wanted to go the way Sohan did before here to just use the , but due to the newly-added FreeBSD build workflow, apparently any inclusion of libcontainer in runtime_vm.go (through importing libcontainer or adding a type that uses a libcontainer type internally) will lead to FreeBSD build failure (at least this what I noticed from multiple CI failures).

I would prefer to solve this in a separate PR if it's Ok.

I would prefer to solve this in a separate PR if it's Ok.

That sounds good!

- Collect container and Sandbox stats through libctr cgroup managers - Move stats collection from statsserver to cgmgr - Use the generic CgroupStats type for all stats functions - Convert all stats collected into CgroupStats - Add new stats types for unsupported Signed-off-by: mohamed <mohamedabdelfatah2027@gmail.com>

Signed-off-by: mohamed <mohamedabdelfatah2027@gmail.com>

Mo-Fatah · 2024-01-30T17:37:15Z

/retest

sohankunkerkar · 2024-01-31T14:58:24Z

/lgtm
/hold
@cri-o/cri-o-maintainers Please feel free unhold if everything looks good.

haircommander · 2024-01-31T15:08:09Z

/hold cancel
/approve

well done @Mo-Fatah !

openshift-ci · 2024-01-31T15:08:16Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: haircommander, Mo-Fatah

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [haircommander]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Mo-Fatah requested a review from mrunalp as a code owner January 5, 2024 14:08

openshift-ci bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. approved Indicates a PR has been approved by an approver from all required OWNERS files. release-note-none Denotes a PR that doesn't merit a release note. labels Jan 5, 2024

openshift-ci bot added the dco-signoff: yes Indicates the PR's author has DCO signed all their commits. label Jan 8, 2024

openshift-ci bot assigned sohankunkerkar Jan 8, 2024

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 8, 2024

openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jan 8, 2024

openshift-ci bot added dco-signoff: no Indicates the PR's author has not DCO signed all their commits. and removed dco-signoff: yes Indicates the PR's author has DCO signed all their commits. labels Jan 26, 2024

Mo-Fatah force-pushed the libctr-based-statsserver branch from 9fa70b9 to 48fbe60 Compare January 26, 2024 15:07

openshift-ci bot added dco-signoff: yes Indicates the PR's author has DCO signed all their commits. and removed dco-signoff: no Indicates the PR's author has not DCO signed all their commits. labels Jan 26, 2024

Mo-Fatah force-pushed the libctr-based-statsserver branch 2 times, most recently from 4987e25 to f212c33 Compare January 26, 2024 16:27

Mo-Fatah commented Jan 26, 2024

View reviewed changes

littlejawa reviewed Jan 29, 2024

View reviewed changes

haircommander reviewed Jan 29, 2024

View reviewed changes

Mo-Fatah added 3 commits January 30, 2024 17:38

clean up the Populate* functions with their helpers

601d859

Signed-off-by: mohamed <mohamedabdelfatah2027@gmail.com>

report memory Swap-only values

5118771

Signed-off-by: mohamed <mohamedabdelfatah2027@gmail.com>

Mo-Fatah force-pushed the libctr-based-statsserver branch from f212c33 to 5118771 Compare January 30, 2024 15:39

openshift-ci bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Jan 31, 2024

openshift-ci bot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm Indicates that a PR is ready to be merged. labels Jan 31, 2024

openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 31, 2024

openshift-merge-bot bot merged commit fc7e852 into cri-o:main Jan 31, 2024
53 of 57 checks passed

Mo-Fatah mentioned this pull request Feb 5, 2024

Invalid WorkingSetBytes, AvailableBytes, UsageBytes, RssBytes for RuntimeService.ListContainerStats (RuntimeService.ContainerStats) #7354

Open

Mo-Fatah mentioned this pull request Feb 19, 2024

Implement CRI metrics collection #7796

Merged

Mo-Fatah deleted the libctr-based-statsserver branch May 3, 2024 09:05

Collect container and Sandbox stats through libctr cgroup managers #7658

Collect container and Sandbox stats through libctr cgroup managers #7658

Conversation

Mo-Fatah commented Jan 5, 2024 • edited by sohankunkerkar

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

haircommander commented Jan 5, 2024

Mo-Fatah commented Jan 6, 2024

sohankunkerkar commented Jan 8, 2024

sohankunkerkar commented Jan 8, 2024

Mo-Fatah commented Jan 8, 2024 • edited

sohankunkerkar commented Jan 8, 2024

kwilczynski commented Jan 8, 2024 • edited

codecov bot commented Jan 8, 2024 • edited

Codecov Report

Mo-Fatah commented Jan 8, 2024 • edited

haircommander commented Jan 8, 2024

Mo-Fatah commented Jan 8, 2024

sohankunkerkar commented Jan 9, 2024

sohankunkerkar commented Jan 9, 2024

sohankunkerkar commented Jan 9, 2024

Mo-Fatah commented Jan 11, 2024

Mo-Fatah commented Jan 11, 2024

haircommander commented Jan 11, 2024

littlejawa commented Jan 12, 2024

kwilczynski commented Jan 12, 2024

Mo-Fatah commented Jan 12, 2024

Mo-Fatah commented Jan 12, 2024 • edited

haircommander commented Jan 12, 2024

sohankunkerkar commented Jan 25, 2024

Mo-Fatah commented Jan 26, 2024

Mo-Fatah commented Jan 26, 2024

Mo-Fatah Jan 26, 2024

Choose a reason for hiding this comment

littlejawa Jan 29, 2024

Choose a reason for hiding this comment

Mo-Fatah Jan 30, 2024

Choose a reason for hiding this comment

haircommander Jan 29, 2024

Choose a reason for hiding this comment

haircommander Jan 29, 2024

Choose a reason for hiding this comment

Mo-Fatah Jan 30, 2024

Choose a reason for hiding this comment

sohankunkerkar Jan 31, 2024 • edited

Choose a reason for hiding this comment

Mo-Fatah commented Jan 30, 2024

sohankunkerkar commented Jan 31, 2024

haircommander commented Jan 31, 2024

openshift-ci bot commented Jan 31, 2024

Mo-Fatah commented Jan 5, 2024 •

edited by sohankunkerkar

Mo-Fatah commented Jan 8, 2024 •

edited

kwilczynski commented Jan 8, 2024 •

edited

codecov bot commented Jan 8, 2024 •

edited

Mo-Fatah commented Jan 8, 2024 •

edited

Mo-Fatah commented Jan 12, 2024 •

edited

sohankunkerkar Jan 31, 2024 •

edited