Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Collect container and Sandbox stats through libctr cgroup managers #7658

Merged
merged 3 commits into from
Jan 31, 2024

Conversation

Mo-Fatah
Copy link
Contributor

@Mo-Fatah Mo-Fatah commented Jan 5, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

stats server currently manually collects stats through the Populate** functions in cgmgr package which doesn't report correct stats. This PR bases the stats collection on libcontainer cgroup managers.

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Use libcontainer cgroup manager to collect container and Sandbox stats

@Mo-Fatah Mo-Fatah requested a review from mrunalp as a code owner January 5, 2024 14:08
@haircommander
Copy link
Member

/ok-to-test
/approve

LGTM

@openshift-ci openshift-ci bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. approved Indicates a PR has been approved by an approver from all required OWNERS files. release-note-none Denotes a PR that doesn't merit a release note. labels Jan 5, 2024
@Mo-Fatah
Copy link
Contributor Author

Mo-Fatah commented Jan 6, 2024

/retest

1 similar comment
@sohankunkerkar
Copy link
Member

/retest

@sohankunkerkar
Copy link
Member

@Mo-Fatah do you mind pasting the stats o/p with this change? I just wanted to see the actual stats.

@openshift-ci openshift-ci bot added the dco-signoff: yes Indicates the PR's author has DCO signed all their commits. label Jan 8, 2024
@Mo-Fatah
Copy link
Contributor Author

Mo-Fatah commented Jan 8, 2024

Sure! @sohankunkerkar
I built crio from the current branch, ran a pod, ran a sleep container, finally used the command crictl statsp -o json

stats from the current changes:

...
           "cpu": {
              "timestamp": "1704727367994867370",
              "usageCoreNanoSeconds": {
                "value": "39525000"
              },
              "usageNanoCores": {
                "value": "0"
              }
            },
            "memory": {
              "timestamp": "1704727367994867370",
              "workingSetBytes": {
                "value": "839680"
              },
              "availableBytes": {
                "value": "265420800"
              },
              "usageBytes": {
                "value": "3014656"
              },
              "rssBytes": {
                "value": "98304"
              },
              "pageFaults": {
                "value": "2561"
              },
              "majorPageFaults": {
                "value": "17"
              }
            },
            "writableLayer": {
              "timestamp": "1704727367994869583",
              "fsId": {
                "mountpoint": "/var/lib/containers/storage/overlay/833e7fe8b7b3de29c496abea65b7260059af7cf54aa97d8f42bf519cd70b28f1/merged"
              },
              "usedBytes": {
                "value": "0"
              },
              "inodesUsed": {
                "value": "4"
              }
            },
            "swap": {
              "timestamp": "1704727367994867370",
              "swapAvailableBytes": {
                "value": "265420800"
              },
              "swapUsageBytes": {
                "value": "3014656"
              }
            }
...

doing the same steps with the main branch:

...
            "cpu": {
              "timestamp": "1704727905610530972",
              "usageCoreNanoSeconds": {
                "value": "35000000"
              },
              "usageNanoCores": null
            },
            "memory": {
              "timestamp": "1704727905610530972",
              "workingSetBytes": {
                "value": "102400"
              },
              "availableBytes": {
                "value": "18446744073441218560"
              },
              "usageBytes": {
                "value": "102400"
              },
              "rssBytes": {
                "value": "0"
              },
              "pageFaults": {
                "value": "2541"
              },
              "majorPageFaults": {
                "value": "0"
              }
            },
            "writableLayer": {
              "timestamp": "1704727905610609128",
              "fsId": {
                "mountpoint": "/var/lib/containers/storage/overlay/12351a22aa4c2f4339754a2e8d670f17b3dc00aeebf903a8c9b7c77a95dd4466/merged"
              },
              "usedBytes": {
                "value": "0"
              },
              "inodesUsed": {
                "value": "4"
              }
            },
            "swap": null
...

Notable changes in this PR:

  • adding the swap values. (it was neglected before)
  • some memory values are present (e.g. RSS)
  • The numbers should be a bit different since the stats reported from libcontainer uses different values from the containers/common package (e.g. in the memory/usageBytes value: containers/common uses memory.stat:anon value from the cgroup while libcontainer uses memory.current which should be the correct value.

@sohankunkerkar
Copy link
Member

@Mo-Fatah Thanks for the information. The changes LGTM
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 8, 2024
@kwilczynski
Copy link
Member

kwilczynski commented Jan 8, 2024

@Mo-Fatah, this value (an excerpt from your example):

              "availableBytes": {
                "value": "18446744073441218560"
              },

That is about 18 Exabytes. if I am not mistaken 😄 How does this sort of value work with metrics?

Update:

I suppose, we are outputting this value via a metric per:

# HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes.
# TYPE process_virtual_memory_max_bytes gauge
process_virtual_memory_max_bytes 1.8446744073709552e+19

However, we will now be going from this large value to the following:

              "availableBytes": {
                "value": "265420800"
              },

Would this be the same metric?

Admittedly, I haven't checked how libcontainer collects these values. Were you able to compare these metrics with what /proc/<PID>/statm and /proc/<PID>/status shows? I am curious.

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jan 8, 2024
Copy link

codecov bot commented Jan 8, 2024

Codecov Report

❗ No coverage uploaded for pull request base (main@87a309c). Click here to learn what that means.
Report is 2 commits behind head on main.
The diff coverage is 0.00%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #7658   +/-   ##
=======================================
  Coverage        ?   47.83%           
=======================================
  Files           ?      145           
  Lines           ?    16247           
  Branches        ?        0           
=======================================
  Hits            ?     7772           
  Misses          ?     7532           
  Partials        ?      943           

@Mo-Fatah
Copy link
Contributor Author

Mo-Fatah commented Jan 8, 2024

@kwilczynski I am not sure how this huge number is generated 😅 . However, the availableBytes isn't related to libcontainer because we calculate it ourselves in the code. k8s calculate it by memory limit - memory usage ( or workingset)
https://github.com/kubernetes/kubernetes/blob/94f15bbbcbe952762b7f5e6e3f77d86ecec7d7c2/pkg/kubelet/stats/helper.go#L69

which is currently miscalculated on the main branch in cri-o
https://github.com/cri-o/cri-o/blob/main/internal/config/cgmgr/stats_linux.go#L73

cri-o calculates it the opposite order, and usually the memory limit > usage. So, this is my guess why we get this huge number.

@haircommander
Copy link
Member

@kwilczynski I am not sure how this huge number is generated 😅 . However, the availableBytes isn't related to libcontainer because we calculate it ourselves in the code. k8s calculate it by memory limit - memory usage ( or workingset) https://github.com/kubernetes/kubernetes/blob/94f15bbbcbe952762b7f5e6e3f77d86ecec7d7c2/pkg/kubelet/stats/helper.go#L69

which is currently miscalculated on the main branch in cri-o https://github.com/cri-o/cri-o/blob/main/internal/config/cgmgr/stats_linux.go#L73

cri-o calculates it the opposite order, and usually the memory limit > usage. So, this is my guess why we get this huge number.

whoops, this probably was me a couple of years ago. I wonder if this causes #6747 as well

@Mo-Fatah
Copy link
Contributor Author

Mo-Fatah commented Jan 8, 2024

@kwilczynski I am not sure how this huge number is generated 😅 . However, the availableBytes isn't related to libcontainer because we calculate it ourselves in the code. k8s calculate it by memory limit - memory usage ( or workingset) https://github.com/kubernetes/kubernetes/blob/94f15bbbcbe952762b7f5e6e3f77d86ecec7d7c2/pkg/kubelet/stats/helper.go#L69
which is currently miscalculated on the main branch in cri-o https://github.com/cri-o/cri-o/blob/main/internal/config/cgmgr/stats_linux.go#L73
cri-o calculates it the opposite order, and usually the memory limit > usage. So, this is my guess why we get this huge number.

whoops, this probably was me a couple of years ago. I wonder if this causes #6747 as well

I guess so, will have a look at it tomorrow.

@sohankunkerkar
Copy link
Member

/retest

@sohankunkerkar
Copy link
Member

/test ci-cgroupv2-integration

1 similar comment
@sohankunkerkar
Copy link
Member

/test ci-cgroupv2-integration

@Mo-Fatah
Copy link
Contributor Author

I couldn't reproduce the failures on my machine. They also seem to be persistently failing, don't think it's a flakiness issue

@Mo-Fatah
Copy link
Contributor Author

/retest

@haircommander
Copy link
Member

@littlejawa can you PTAL?

@littlejawa
Copy link
Contributor

@littlejawa can you PTAL?

With kata, the error seems to come from :

# time="2024-01-11 14:34:09.959493738Z" level=error msg="Error getting container stats e53309b9b475e2fcd9b0b1f5691b1c9ee3a3a913e25ae46014d91c48d5b3f1fa: error while statting cgroup v2: [openat2 /sys/fs/cgroup/pod_123.slice/pod_123-456.slice/crio-e53309b9b475e2fcd9b0b1f5691b1c9ee3a3a913e25ae46014d91c48d5b3f1fa.scope/cgroup.procs: no such file or directory]" file="stats/stats_server_linux.go:119"

As I understand it, libcontainer tries to access the cgroup information from the host, but with kata the container's cgroup information is within a VM, not accessible to the host.

The way we do it today is through the call to
cStats, err := ss.Runtime().ContainerStats(context.TODO(), c, sb.CgroupParent())
that you removed.

It allows a different behaviour between runtimes: for regular (oci) runtimes, it is conlcuded by a call to PopulateContainerCgroupStats.
For kata containers, it is done by an exchange with the kata agent within the VM, which returns the expected values. This is done in ContainerStats

I'm wondering if we could keep this indirection: keep calling ss.Runtime().ContainerStats(), and modify the oci part to call cgroupStats() and return the stats in cgroup format? The rest of you code (converting from cgroup to CRI) wouldn't change.

I would need to change the kata-specific part to return cgroup data too.
I will need to do some conversion, as the VM cgroup version may differ from the host's. Today I'm converting it to CRI anyway, but if we expect cgroup stats, I'll need to make sure I provide the right version of it.
But returning cgroup stats seems right here, as the conversion to CRI would then be centralized for all runtimes.

Do you think that would work, or do you have other suggestions?

@kwilczynski
Copy link
Member

Aside from the bug that we had for a while, and which @Mo-Fatah, fixed (thank you!). We need to make sure that these new metrics hold water.

@Mo-Fatah, were you able to compare these metrics with what /proc/<PID>/statm and /proc/<PID>/status shows?

For instance, from your example, both of these values are the same:

"availableBytes": {
  "value": "265420800"
},
"swapAvailableBytes": {
  "value": "265420800"
},

A problem here or merely a happy coincidence? 😄 If there is no swap, does it show as a value of 0 or a null in JSON?

Basically, everything looks very good, and it's solid work, but we need to make some sanity checks to see if things are what they say they are and that values also make sense for certain metrics.

Generally, these metrics are such that process, a CRI-O process, I suppose, metrics are stored alongside what seems to be system-wide metrics.

"memory": {
  "timestamp": "1704727367994867370",
  "workingSetBytes": {
    "value": "839680"
  },
  "availableBytes": {
    "value": "265420800"
  },
  "usageBytes": {
    "value": "3014656"
  },
  "rssBytes": {
    "value": "98304"
  },
  "pageFaults": {
    "value": "2561"
  },
  "majorPageFaults": {
    "value": "17"
  }
},

For example, within the above, we don't have virtual address space allocated for the CRI-O process, which is also useful to track, even though we have workingSetBytes that most likely include anonymous and virtual address space too.

So, basically, verify some things, make sure that everything looks reasonable, check the Prometheus labels, etc.

@Mo-Fatah
Copy link
Contributor Author

Thanks for the details! @littlejawa . Your approach seems reasonable to me.

  • for oci I will use cStats, err := ss.Runtime().ContainerStats(context.TODO(), c, sb.CgroupParent()) and will change ContainerStats internally to return the new CgroupStats type.
  • for kata (vm runtime), will change the ContainerStats function to return CgroupStats instead of CRI stats by converting the metrics into CgroupStats. I see that we already have conversion functions, so I will edit those.

the conversion to CRI finally will be remain in the stats server. (correct me if i lost my way)

@Mo-Fatah
Copy link
Contributor Author

Mo-Fatah commented Jan 12, 2024

@kwilczynski

"availableBytes": {
"value": "265420800"
},
"swapAvailableBytes": {
"value": "265420800"
},
A problem here or merely a happy coincidence? 😄 If there is no swap, does it show as a value of 0 or a null in JSON?

Not a coincidence. The reason why availableBytes = swapAvailableBytes and usageBytes = swapUsageBytes is due to libcontainer's choice of being compatible with cgroupv1 (the numbers I provided are by cgroupv2) https://github.com/opencontainers/runc/blob/main/libcontainer/cgroups/fs2/memory.go#L122-L128 They add the the memory availableBytes , usageBytes to swapAvailableBytes, swapUsageBytes respectively.
In the stats i provided, the swap values are 0, therefore their values are equal to the memory ones.

were you able to compare these metrics with what /proc//statm and /proc//status shows?
So, basically, verify some things, make sure that everything looks reasonable, check the Prometheus labels, etc.

will have a look again at the numbers and verify them

For example, within the above, we don't have virtual address space allocated for the CRI-O process, which is also useful to track, even though we have workingSetBytes that most likely include anonymous and virtual address space too.

Can you elaborate that part? I mean, those stats are part of the cri-api

@haircommander
Copy link
Member

For example, within the above, we don't have virtual address space allocated for the CRI-O process, which is also useful to track, even though we have workingSetBytes that most likely include anonymous and virtual address space too.

yeah unfortunately for now we can't fuss with the metrics too much. We basically need 1:1 metrics coverage between what currently exists and what cri-o will collect. In the future we can investigate adding additional ones that better reflect the state of the processes

also note: these are for pods and containers, so the address space of cri-o itself is not relevant here

@sohankunkerkar
Copy link
Member

Hey @Mo-Fatah once you mark this PR ready for review, could you also squash the redundant commits and place them in the meaningful ones?

@openshift-ci openshift-ci bot added dco-signoff: no Indicates the PR's author has not DCO signed all their commits. and removed dco-signoff: yes Indicates the PR's author has DCO signed all their commits. labels Jan 26, 2024
@Mo-Fatah
Copy link
Contributor Author

/retest

@openshift-ci openshift-ci bot added dco-signoff: yes Indicates the PR's author has DCO signed all their commits. and removed dco-signoff: no Indicates the PR's author has not DCO signed all their commits. labels Jan 26, 2024
@Mo-Fatah Mo-Fatah force-pushed the libctr-based-statsserver branch 2 times, most recently from 4987e25 to f212c33 Compare January 26, 2024 16:27
@Mo-Fatah
Copy link
Contributor Author

/retest

@@ -1,91 +0,0 @@
package cgmgr_test
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file just tests UpdateWithMemoryStatsFromFile function which is deleted and no longer used

@@ -40,11 +47,12 @@ func (ss *StatsServer) updateSandbox(sb *sandbox.Sandbox) *types.PodSandboxStats
if c.StateNoLock().Status == oci.ContainerStateStopped {
continue
}
cStats, err := ss.Runtime().ContainerStats(context.TODO(), c, sb.CgroupParent())
cgstats, err := ss.Runtime().ContainerStats(context.TODO(), c, sb.CgroupParent())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a problem here: you're expecting ContainerStats() to return a CgroupStats structure, but you're modifying ContainerStats only in the next commit.
Your first commit then doesn't build - you get an error when you try to use cgstats as a parameter to containerCRIStats()

internal/lib/stats/stats_server_linux.go:55:31: cannot use cgstats (variable of type *"k8s.io/cri-api/pkg/apis/runtime/v1".ContainerStats) as *cgmgr.CgroupStats value in argument to containerCRIStats

If you don't want to merge the two commits (which would make it pretty big), maybe you could introduce that function code changes only in the second commit?
Not sure what option is best, but I think we need to have independently buildable commits

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would go with merging the two commits if no problem with that. it will be a big commit but I guess all changes are relevant. ( will make the commit message a bit meaningful)

// CgroupStats is a structure used to augment libctr.Stats object, because
// it does not have all of the information required for all of the stats we're
// interested in.
// This is a universal stats object to be used across different runtime implementations.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find myself worried about the translation here. there is already high cpu usage on stats collection from golang object GC (each tiny object is created and destroyed quite quickly). I think I am okay with this for now but it's possible we may want to find a different way. For instance, having the statsserver run linux only for now

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you'd rather do that as a follow up (or not deal with that and have someone else do it) @Mo-Fatah let me know

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally wanted to go the way Sohan did before here to just use the , but due to the newly-added FreeBSD build workflow, apparently any inclusion of libcontainer in runtime_vm.go (through importing libcontainer or adding a type that uses a libcontainer type internally) will lead to FreeBSD build failure (at least this what I noticed from multiple CI failures).

I would prefer to solve this in a separate PR if it's Ok.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to solve this in a separate PR if it's Ok.

That sounds good!

- Collect container and Sandbox stats through libctr cgroup managers

- Move stats collection from statsserver to cgmgr

- Use the generic CgroupStats type for all stats functions

- Convert all stats collected into CgroupStats

- Add new stats types for unsupported

Signed-off-by: mohamed <mohamedabdelfatah2027@gmail.com>
Signed-off-by: mohamed <mohamedabdelfatah2027@gmail.com>
Signed-off-by: mohamed <mohamedabdelfatah2027@gmail.com>
@Mo-Fatah
Copy link
Contributor Author

/retest

@openshift-ci openshift-ci bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Jan 31, 2024
@sohankunkerkar
Copy link
Member

/lgtm
/hold
@cri-o/cri-o-maintainers Please feel free unhold if everything looks good.

@openshift-ci openshift-ci bot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm Indicates that a PR is ready to be merged. labels Jan 31, 2024
@haircommander
Copy link
Member

/hold cancel
/approve

well done @Mo-Fatah !

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 31, 2024
Copy link
Contributor

openshift-ci bot commented Jan 31, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: haircommander, Mo-Fatah

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit fc7e852 into cri-o:main Jan 31, 2024
53 of 57 checks passed
@Mo-Fatah Mo-Fatah deleted the libctr-based-statsserver branch May 3, 2024 09:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants