fix: fail to inject StressChaos in certain cgroup v1 environment because PidPath returns an unexpected error #4407

Open
wants to merge 3 commits into
base: master

@kaaass kaaass commented Apr 28, 2024

What problem does this PR solve?

Close #4406

This PR fixes an issue where, in certain cgroup v1 environments, Chaos Mesh fails to inject StressChaos into the Pod. A typical error message looks like:

Failed to apply chaos: rpc error: code = Unknown desc = load cgroup v1 manager, pid 31390, cpu path /kubepods/besteffort/pod936b79d1-f6cf-4716-82e8-d302aef341d2/fc121824c882c4280adf95010c74233e8420c4eec617433f4029b0fd3477aed2, memory path /kubepods/besteffort/pod936b79d1-f6cf-4716-82e8-d302aef341d2/fc121824c882c4280adf95010c74233e8420c4eec617433f4029b0fd3477aed2: controller is not supported

The problem is that chaos-daemon does not handle the case where a cgroup controller is present in /sys/fs/cgroup but is not enabled for the target pod. In my case, net_prio is present on the host but missing from the process's cgroup list:

root@chaos-daemon-rlnp7:/# ls /host-sys/fs/cgroup/
blkio  cpu  cpu,cpuacct  cpuacct  cpuset  devices  freezer  hugetlb  memory  net_cls  net_prio  perf_event  pids  systemd  unified
root@chaos-daemon-rlnp7:/# cat /proc/5442/cgroup 
11:freezer:/kubepods/besteffort/pod936b79d1-f6cf-4716-82e8-d302aef341d2/fc121824c882c4280adf95010c74233e8420c4eec617433f4029b0fd3477aed2
10:net_cls:/kubepods/besteffort/pod936b79d1-f6cf-4716-82e8-d302aef341d2/fc121824c882c4280adf95010c74233e8420c4eec617433f4029b0fd3477aed2
9:hugetlb:/kubepods/besteffort/pod936b79d1-f6cf-4716-82e8-d302aef341d2/fc121824c882c4280adf95010c74233e8420c4eec617433f4029b0fd3477aed2
8:perf_event:/kubepods/besteffort/pod936b79d1-f6cf-4716-82e8-d302aef341d2/fc121824c882c4280adf95010c74233e8420c4eec617433f4029b0fd3477aed2
7:pids:/kubepods/besteffort/pod936b79d1-f6cf-4716-82e8-d302aef341d2/fc121824c882c4280adf95010c74233e8420c4eec617433f4029b0fd3477aed2
6:memory:/kubepods/besteffort/pod936b79d1-f6cf-4716-82e8-d302aef341d2/fc121824c882c4280adf95010c74233e8420c4eec617433f4029b0fd3477aed2
5:blkio:/kubepods/besteffort/pod936b79d1-f6cf-4716-82e8-d302aef341d2/fc121824c882c4280adf95010c74233e8420c4eec617433f4029b0fd3477aed2
4:devices:/kubepods/besteffort/pod936b79d1-f6cf-4716-82e8-d302aef341d2/fc121824c882c4280adf95010c74233e8420c4eec617433f4029b0fd3477aed2
3:cpu,cpuacct:/kubepods/besteffort/pod936b79d1-f6cf-4716-82e8-d302aef341d2/fc121824c882c4280adf95010c74233e8420c4eec617433f4029b0fd3477aed2
2:cpuset:/kubepods/besteffort/pod936b79d1-f6cf-4716-82e8-d302aef341d2/fc121824c882c4280adf95010c74233e8420c4eec617433f4029b0fd3477aed2
1:name=systemd:/kubepods/besteffort/pod936b79d1-f6cf-4716-82e8-d302aef341d2/fc121824c882c4280adf95010c74233e8420c4eec617433f4029b0fd3477aed2
0::/kubepods/besteffort/pod936b79d1-f6cf-4716-82e8-d302aef341d2/fc121824c882c4280adf95010c74233e8420c4eec617433f4029b0fd3477aed2
root@chaos-daemon-rlnp7:/# cat /proc/5442/cgroup | grep net_prio
root@chaos-daemon-rlnp7:/# 
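
The manual check above can be sketched in Go. This is only an illustration, not code from this PR: parseProcCgroup and missingControllers are hypothetical helper names, the sample data is shortened, and the "/kubepods/pod-x" path is a placeholder. It assumes the cgroup v1 line format "hierarchy-ID:controller-list:path" and uses the same "name=" fallback as the lookup in PidPath.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseProcCgroup parses the text of /proc/<pid>/cgroup into a map from
// controller name to cgroup path. Hypothetical helper for illustration.
func parseProcCgroup(content string) map[string]string {
	paths := make(map[string]string)
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		parts := strings.SplitN(sc.Text(), ":", 3)
		if len(parts) != 3 {
			continue // skip malformed lines
		}
		// One line may list several co-mounted controllers, e.g. "cpu,cpuacct".
		for _, name := range strings.Split(parts[1], ",") {
			paths[name] = parts[2]
		}
	}
	return paths
}

// missingControllers returns the controllers from mounted that have no entry
// in paths, checking both "<name>" and "name=<name>".
func missingControllers(mounted []string, paths map[string]string) []string {
	var missing []string
	for _, name := range mounted {
		if _, ok := paths[name]; ok {
			continue
		}
		if _, ok := paths["name="+name]; ok {
			continue
		}
		missing = append(missing, name)
	}
	return missing
}

func main() {
	// Shortened sample data; the path is a placeholder.
	content := "10:net_cls:/kubepods/pod-x\n" +
		"6:memory:/kubepods/pod-x\n" +
		"3:cpu,cpuacct:/kubepods/pod-x\n" +
		"1:name=systemd:/kubepods/pod-x\n"
	mounted := []string{"cpu", "cpuacct", "memory", "net_cls", "net_prio", "systemd"}

	paths := parseProcCgroup(content)
	fmt.Println("mounted but not active for the process:", missingControllers(mounted, paths))
}
```

With this sample data only net_prio is reported, which matches the empty grep result above.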

From the code, the error "controller is not supported" is returned by PidPath while cgroups.Load is running. cgroups.Load can in fact tolerate this situation, but only if the path function (PidPath here) returns the ErrControllerNotActive error.

https://github.com/containerd/cgroups/blob/fa6f6841ed3d57355acadbc06f1d7ed4d91ac4f7/cgroup1/cgroup.go#L93

However, PidPath currently returns a different, unexpected error, errors.New("controller is not supported"):

	root, ok := paths[string(name)]
	if !ok {
		if root, ok = paths["name="+string(name)]; !ok {
			return "", errors.New("controller is not supported")
		}
	}
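
To illustrate why the error value matters, here is a minimal sketch of a loader that skips a sentinel error but aborts on anything else. It is a simplified stand-in and not the containerd/cgroups code: errControllerNotActive, loadControllers, adHocPath and sentinelPath are hypothetical names, and the paths are placeholders. The point is that an error created ad hoc with errors.New is a distinct value, so a caller that compares against a sentinel treats it as a hard failure.

```go
package main

import (
	"errors"
	"fmt"
)

// errControllerNotActive is a local stand-in for the sentinel error that
// containerd/cgroups exports as ErrControllerNotActive.
var errControllerNotActive = errors.New("controller not active")

// pathFunc resolves a controller name to its cgroup path.
type pathFunc func(controller string) (string, error)

// loadControllers is a hypothetical, simplified loader: it skips controllers
// whose path lookup reports the sentinel and fails on any other error.
func loadControllers(controllers []string, path pathFunc) ([]string, error) {
	var active []string
	for _, c := range controllers {
		if _, err := path(c); err != nil {
			if errors.Is(err, errControllerNotActive) {
				continue // controller exists on the host but not for this process
			}
			return nil, fmt.Errorf("load controller %q: %w", c, err)
		}
		active = append(active, c)
	}
	return active, nil
}

// activePaths is illustrative data; net_prio is intentionally absent.
var activePaths = map[string]string{
	"cpu":    "/kubepods/pod-x",
	"memory": "/kubepods/pod-x",
}

// adHocPath models the behavior before the fix: a fresh error value.
func adHocPath(controller string) (string, error) {
	if p, ok := activePaths[controller]; ok {
		return p, nil
	}
	return "", errors.New("controller is not supported")
}

// sentinelPath models the behavior after the fix: the sentinel error.
func sentinelPath(controller string) (string, error) {
	if p, ok := activePaths[controller]; ok {
		return p, nil
	}
	return "", errControllerNotActive
}

func main() {
	controllers := []string{"cpu", "memory", "net_prio"}

	active, err := loadControllers(controllers, adHocPath)
	fmt.Println("ad hoc error:", active, err)

	active, err = loadControllers(controllers, sentinelPath)
	fmt.Println("sentinel error:", active, err)
}
```

In this sketch the first call fails on net_prio, while the second call returns the cpu and memory controllers and silently skips net_prio.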

What's changed and how it works?

This PR changes the returned error to ErrControllerNotActive, which matches the behavior of the implementation in containerd/cgroups.

https://github.com/containerd/cgroups/blob/4dacf2bc1300b0d7dc1087b8e27712a597890ba3/paths.go#L80-L85
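
The change can be sketched as follows. This is a self-contained approximation rather than the actual diff: lookupPath is a hypothetical function that mirrors the lookup quoted earlier, and errControllerNotActive is a local stand-in for the ErrControllerNotActive sentinel from containerd/cgroups.

```go
package main

import (
	"errors"
	"fmt"
)

// errControllerNotActive is a local stand-in for the sentinel error that
// containerd/cgroups exports as ErrControllerNotActive.
var errControllerNotActive = errors.New("controller not active")

// lookupPath resolves a controller name against the entries parsed from
// /proc/<pid>/cgroup, trying "<name>" first and then "name=<name>".
// It returns the sentinel when the process is not in that controller.
func lookupPath(paths map[string]string, name string) (string, error) {
	root, ok := paths[name]
	if !ok {
		if root, ok = paths["name="+name]; !ok {
			return "", errControllerNotActive
		}
	}
	return root, nil
}

func main() {
	// Placeholder paths for illustration only.
	paths := map[string]string{
		"memory":       "/kubepods/pod-x",
		"name=systemd": "/kubepods/pod-x",
	}
	for _, name := range []string{"memory", "systemd", "net_prio"} {
		p, err := lookupPath(paths, name)
		switch {
		case errors.Is(err, errControllerNotActive):
			fmt.Printf("%s: skipped (not active for this process)\n", name)
		case err != nil:
			fmt.Printf("%s: error: %v\n", name, err)
		default:
			fmt.Printf("%s: %s\n", name, p)
		}
	}
}
```

Returning a shared sentinel lets the caller distinguish "controller not active for this process" from genuine failures without matching on the error text.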

Related changes

  • This change also requires further updates to the website (e.g. docs)
  • This change also requires further updates to the UI interface

Cherry-pick to release branches (optional)

This PR should be cherry-picked to the following release branches:

  • release-2.6
  • release-2.5

Checklist

CHANGELOG

Must include at least one of them.

  • I have updated the CHANGELOG.md
  • I have labeled this PR with "no-need-update-changelog"

Tests

Must include at least one of them.

  • Unit test
  • E2E test
  • Manual test

Side effects

  • Breaking backward compatibility

DCO

If the DCO check fails, run commands like the ones below to fix it (the exact commands depend on the situation, for example when the failing commit isn't the most recent one):

git commit --amend --signoff
git push --force

Signed-off-by: KAAAsS <admin@kaaass.net>
@STRRL
Member

STRRL commented May 14, 2024

Hi @kaaass, this basically LGTM. Could you resolve the conflicts in CHANGELOG.md?

@kaaass
Author

kaaass commented May 14, 2024

@STRRL Sure! I have resolved them.

Successfully merging this pull request may close these issues.

Failed to apply StressChaos in minikube with qemu driver: controller is not supported
3 participants