Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support cgroup v2 for linux stress experiments #2928

Merged
merged 21 commits into from
Mar 7, 2022

Conversation

igcherkaev
Copy link
Contributor

@igcherkaev igcherkaev commented Feb 23, 2022

What problem does this PR solve?

Support cgroup v2 driver when running experiments on linux (CPU & Mem). If the host/containers where chaosdaemon is running is in full cgroup v2 mode (unified), users will no longer see:

Failed to apply chaos: rpc error: code = Unknown desc = controller is not supported

When they run CPU or Memory stress experiments.

What's changed and how it works?

Introduced a cgroupv2 manager to add newly created stress processes to the respective cgroup if containers based on what cgroup driver containers are using.

P.S. I did not re-order import statements myself, it must have been done by the format/lint/tidy goals in the Makefile when I ran make check or make test. I can revert these changes, if need be.

Related changes

  • Need to update chaos-mesh/website
  • Need to update Dashboard UI
  • Need to cheery-pick to release branches
    • release-2.1
    • release-2.0

Checklist

Tests

  • Unit test
  • E2E test
  • No code
  • Manual test
    • Start kubernetes 1.21+ with cgroupfs and cgroup v2 driver.

    • Verify what cgroup driver you have on the node and in the container:

      # container
      $ kubectl -n test-dev exec -ti demo-c7c688b8-trr42 -- bash
      bash-4.2# mount| grep cgroup
      cgroup on /sys/fs/cgroup type cgroup2 (ro,nosuid,nodev,noexec,relatime,seclabel)
      
      # kubernetes node
      lvdkbw503 ~ # mount| grep cgroup
      cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,seclabel)
    • Schedule new CPU/Mem stress experiment and verify it's working by checking cgroup assigned to respective processes on the worker nodes.

    • Verify it on the node by using pids of the main process in the container and of the stress-ng/memstress process:

       lvdkbw503 ~ # ps -o cgroup 1476124
       CGROUP
       0::/kubepods/burstable/pod8731e44b-df00-41f9-9c50-
       1c6327a16b74/69604c2dd135683bf0758cf623c05c37f6fe61ab437fd2227b0f584e93eedbc0
       lvdkbw503 ~ # ps -o cgroup 2188035
       CGROUP
       0::/kubepods/burstable/pod8731e44b-df00-41f9-9c50-
       1c6327a16b74/69604c2dd135683bf0758cf623c05c37f6fe61ab437fd2227b0f584e93eedbc0

Side effects

N/A

Release note

Please add a release note.

You can safely ignore this section if you don't think this PR needs a release note.

DCO

If you find the DCO check fails, please run commands like below (Depends on the actual situations. For example, if the failed commit isn't the most recent) to fix it:

git commit --amend --signoff
git push --force

@ti-chi-bot
Copy link
Member

ti-chi-bot commented Feb 23, 2022

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • STRRL
  • YangKeao

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot
Copy link
Member

Welcome @igcherkaev!

It looks like this is your first PR to chaos-mesh/chaos-mesh 🎉.

I'm the bot to help you request reviewers, add labels and more, See available commands.

We want to make sure your contribution gets all the attention it needs!



Thank you, and welcome to chaos-mesh/chaos-mesh. 😃

Signed-off-by: Igor Cherkaev <igor.cherkaev@copart.com>
@codecov
Copy link

codecov bot commented Feb 23, 2022

Codecov Report

Merging #2928 (1ebea4f) into master (92b720b) will increase coverage by 0.16%.
The diff coverage is 0.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2928      +/-   ##
==========================================
+ Coverage   37.96%   38.12%   +0.16%     
==========================================
  Files         106      106              
  Lines        9083     9106      +23     
==========================================
+ Hits         3448     3472      +24     
- Misses       5327     5328       +1     
+ Partials      308      306       -2     
Impacted Files Coverage Δ
pkg/chaosdaemon/stress_server_linux.go 0.00% <0.00%> (ø)
pkg/workflow/controllers/deadline_reconciler.go 69.62% <0.00%> (+5.18%) ⬆️
.../workflow/controllers/workflow_entry_reconciler.go 55.02% <0.00%> (+8.99%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 92b720b...1ebea4f. Read the comment docs.

@YangKeao
Copy link
Member

This PR cannot close #2652, because the problem in #2652 is actually not for "Unified" mode, but "Hybrid" mode. In #2652 (comment) , only the blkio is handled by the cgroupv2 controller.

But it's still fine to support unified group v2 only, for now.

@igcherkaev
Copy link
Contributor Author

This PR cannot close #2652, because the problem in #2652 is actually not for "Unified" mode, but "Hybrid" mode. In #2652 (comment) , only the blkio is handled by the cgroupv2 controller.

But it's still fine to support unified group v2 only, for now.

Oh, sorry, I did not know that. @STRRL was talking about cgroup v2 there so I assumed the unified mode. I don't have environment with cgroups in the hybrid mode, and I don't know the root cause of the issue in #2652 and why it's related to blkio, so I'll leave it out of the scope of this PR then.

@igcherkaev
Copy link
Contributor Author

Also, I can backport this fix to release-2.1 branch with a separate PR - cherry-picking won't work since function signatures have changed.

@STRRL
Copy link
Member

STRRL commented Feb 24, 2022

I tested this patch for different container runtime; it works on contained but does not work on docker.

It seems docker would isolate the "cgroup namespace" for the container but containerd (also crio) would not.

When we call cgroupsv2.PidGroupPath(<pid>), chaos-daemon (in containerd) would return:

0::/kubepods/besteffort/poddc367d15-b0dd-4401-86de-84fd7ba7e08b/4d287d5b0651c1bd13ccaecdcfe6f1a65379150a4ec4ae6d46c392b76b812941

But chaos-daemon( in docker) would return

0::/../../kubepods-besteffort-pod25df87b9_2545_47fb_94ba_3292ea427f95.slice/docker-41c1b86aa6257d0cd560db3c99c53f632c4909774280dcc5c57a544fde52a340.scope

The latter one would not work well.

Should we also mount the hosts /proc into chaos-daemon like /host-sys? @YangKeao

@igcherkaev
Copy link
Contributor Author

Should I add a condition to apply it only when runtime is containerd? Docker runtime is deprecated in k8s anyway, maybe Chaos Mesh should do the same and begin deprecating docker runtime?

@STRRL
Copy link
Member

STRRL commented Mar 1, 2022

Should I add a condition to apply it only when runtime is containerd? Docker runtime is deprecated in k8s anyway, maybe Chaos Mesh should do the same and begin deprecating docker runtime?

I think this PR would not require that. I think the support for docker should be finished in the next several PRs.

@STRRL
Copy link
Member

STRRL commented Mar 1, 2022

Hi @igcherkaev , could you resolve the conflicts? Then appending an entry in the CHANGELOG.md

@igcherkaev
Copy link
Contributor Author

@STRRL sure, will work on that.

igcherkaev and others added 8 commits March 1, 2022 13:58
Signed-off-by: Igor Cherkaev <igor.cherkaev@copart.com>
Signed-off-by: SiyuChen <ryougi201@gmail.com>
* chore: move fake_image to linux_amd64

Signed-off-by: SiyuChen <ryougi201@gmail.com>

* chore: fix upload image

Signed-off-by: SiyuChen <ryougi201@gmail.com>

* fix: remove deafult project env

Signed-off-by: SiyuChen <ryougi201@gmail.com>

* chore: add FakeImage to fake_image

Signed-off-by: SiyuChen <ryougi201@gmail.com>

* chore: update e2e image name

Signed-off-by: SiyuChen <ryougi201@gmail.com>

* chore: fix e2e test

Signed-off-by: SiyuChen <ryougi201@gmail.com>

* chore: fix e2e test

Signed-off-by: SiyuChen <ryougi201@gmail.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
* fix(ui): pod phases should be first letter capitalized

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* fix: unify label case

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
* feat: OpenAPI to TypeScript API Client and Forms

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* fix: license checker

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* fix: go (verify)

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* chore: add comments

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* chore: reuse kubebuilder marks

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* chore: update according to RFC

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* chore: prevent ui:form appearing in crds

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* fix: add missing actions

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* chore: update lockfile

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* chore: support PhysicalMachineChaos

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* chore: merge commands into codegen

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* fix: license checker

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* fix: ci

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* fix: supplement another ignore check

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* chore: update package info

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* chore: update README

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* chore: update

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* chore: gen HTTPChaos

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* chore: update changelog

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* fix: remove nested ignores

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* fix: remove remaining markers

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* fix: typo

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

* fix: supplement comments

Signed-off-by: Yue Yang <g1enyy0ung@gmail.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
Signed-off-by: imlonghao <git@imlonghao.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
…-mesh#2921)

* helm: update archived data's ttl

Signed-off-by: cwen0 <cwenyin0@gmail.com>

* update helm doc

Signed-off-by: cwen0 <cwenyin0@gmail.com>

* fix comments

Signed-off-by: cwen0 <cwenyin0@gmail.com>

* fix comments

Signed-off-by: cwen0 <cwenyin0@gmail.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
* fix wrong comment on workflow Dashboard HTTP API

Signed-off-by: xiang <xiang13225080@163.com>

* update swagger

Signed-off-by: xiang <xiang13225080@163.com>
YangKeao and others added 3 commits March 1, 2022 14:21
* add customized logger for daemon and bpm

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* extract the grpc metadata into the context

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* make check

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* make check

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* remove log parameter in killIOChaos

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* add logs

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* remove log parameter in httpchaos server

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* add the context comments for process builder

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* make check

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* use the logger in arguments

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* add back boilerplate

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* fix make check

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>
* add auto compile github action

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* modify the name of action

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* remove comment expression

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* add comments validation

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* rename the zst file name

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* auto detect buildenv and devenv

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* fix artifact url

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* add debug information

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* add cache and debug artifact url

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* fix bash script

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* modify cache step name

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* use another docker buildx driver

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* add download bash script

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* add boilerplate

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* add some comments and help message

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
Signed-off-by: Igor Cherkaev <igor.cherkaev@copart.com>
@ti-chi-bot ti-chi-bot added size/XXL and removed size/M labels Mar 1, 2022
@STRRL
Copy link
Member

STRRL commented Mar 2, 2022

Hi @igcherkaev, the verify check failed. verify relates to codes linters and code generation. You could execute make check on your local machine then commit the changes to fix it.

@igcherkaev
Copy link
Contributor Author

Hi @igcherkaev, the verify check failed. verify relates to codes linters and code generation. You could execute make check on your local machine then commit the changes to fix it.

Hmm, I thought I'd done that. Did it again and pushed the changes.

Copy link
Member

@YangKeao YangKeao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@STRRL STRRL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Thanks for your contribution!

@STRRL
Copy link
Member

STRRL commented Mar 4, 2022

/merge

@ti-chi-bot
Copy link
Member

This pull request has been accepted and is ready to merge.

Commit hash: a434135

@STRRL
Copy link
Member

STRRL commented Mar 4, 2022

/run-e2e-tests

@igcherkaev
Copy link
Contributor Author

I think DCO check fails due to rebasing my branch and including commits made by other folks.

@STRRL
Copy link
Member

STRRL commented Mar 7, 2022

I think DCO check fails due to rebasing my branch and including commits made by other folks.

I would manually set it to PASS. 🥲

@ti-chi-bot ti-chi-bot merged commit 57832f7 into chaos-mesh:master Mar 7, 2022
@igcherkaev
Copy link
Contributor Author

Thank you! Glad my efforts didn't go to waste :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet