
blockchaos: impl blockchaos in chaosdaemon #2907

Merged: 32 commits merged on Jul 5, 2022


@YangKeao (Member) commented Feb 17, 2022:

What problem does this PR solve?

Implement blockchaos in chaosdaemon

Test Steps

This feature is relatively hard to test automatically. I have tested it manually in a CentOS virtual machine with the following steps:

  1. Install the chaos-driver.
  2. Start a k3s Kubernetes cluster.
  3. Install the local static provisioner: https://github.com/kubernetes-sigs/sig-storage-local-static-provisioner
  4. Create a Pod and a PVC.
  5. Create the BlockChaos with the following spec:
     apiVersion: chaos-mesh.org/v1alpha1
     kind: BlockChaos
     metadata:
       name: disk-example
     spec:
       selector:
         labelSelectors:
           app: disk-example
       mode: all
       volumeName: disk-example
       action: delay
       delay:
         latency: 1s
  6. Use kubectl exec to enter the Pod and perform some operations on the disk.

Detailed Steps

These steps are collected into the following scripts, but please execute them one by one.

Bootstrapping the virtual machine

First, we need a virtual machine to test this feature. I used Vagrant to set up the machine. In an empty directory:

vagrant init centos/7
vagrant up --provider virtualbox

vagrant ssh

Then update the kernel and install kernel-devel in the virtual machine:

sudo yum update -y
sudo yum install kernel-devel -y
sudo reboot

You'll also need to install Docker, Go, and Python 3 to build Chaos Mesh:

sudo yum install -y yum-utils python3 gcc
sudo yum-config-manager \
    --add-repo \
    https://download.docker.com/linux/centos/docker-ce.repo
sudo yum install docker-ce docker-ce-cli containerd.io docker-compose-plugin
sudo usermod -aG docker vagrant
sudo systemctl enable docker

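wget https://go.dev/dl/go1.18.3.linux-amd64.tar.gz
sudo rm -rf /usr/local/go && sudo tar -C /usr/local -xzf go1.18.3.linux-amd64.tar.gz
# make the go binary available in new shells
echo 'export PATH=$PATH:/usr/local/go/bin' >> ~/.bashrc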

sudo reboot

Install the chaos-driver

Then, inside the virtual machine, build the chaos-driver:

# clone chaos-driver
git clone https://github.com/chaos-mesh/chaos-driver.git
cd chaos-driver
make all

sudo insmod ./driver/chaos_driver.ko

Now the chaos-driver is installed. In dmesg you should see:

[ 1552.172408] chaos_driver is loading 
[ 1552.172539] io scheduler ioem-mq registered
[ 1552.172541] io scheduler ioem registered
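Registration alone does not show which devices can use the new scheduler. The sysfs file /sys/block/<device>/queue/scheduler lists the schedulers a device accepts, with the active one in square brackets, and writing a name into it as root switches the scheduler. A minimal sketch for checking that a device offers ioem, assuming the device name (e.g. sda) is passed on the command line; parseScheduler and pickIoem are hypothetical helpers, not chaos-daemon functions.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// parseScheduler splits the contents of /sys/block/<dev>/queue/scheduler,
// e.g. "noop [deadline] cfq ioem", into the available schedulers and the
// active one (the entry wrapped in brackets).
func parseScheduler(contents string) (available []string, current string) {
	for _, field := range strings.Fields(contents) {
		if strings.HasPrefix(field, "[") && strings.HasSuffix(field, "]") {
			field = strings.Trim(field, "[]")
			current = field
		}
		available = append(available, field)
	}
	return available, current
}

// pickIoem returns the ioem variant offered by the device. The driver
// registers both names (see dmesg above); which one a device lists depends
// on its block layer (legacy vs. multi-queue), an assumption of this sketch.
func pickIoem(available []string) (string, bool) {
	for _, name := range available {
		if name == "ioem" || name == "ioem-mq" {
			return name, true
		}
	}
	return "", false
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: checksched <device>, e.g. sda")
		os.Exit(2)
	}
	raw, err := os.ReadFile("/sys/block/" + os.Args[1] + "/queue/scheduler")
	if err != nil {
		fmt.Println(err)
		os.Exit(1)
	}
	available, current := parseScheduler(string(raw))
	ioem, ok := pickIoem(available)
	switch {
	case !ok:
		fmt.Println("no ioem scheduler offered: is chaos_driver.ko loaded?")
		os.Exit(1)
	case current == ioem:
		fmt.Println("already using", ioem)
	default:
		fmt.Printf("%s is available (current: %s)\n", ioem, current)
	}
}
```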

Install k3s and helm

curl -sfL https://get.k3s.io | sh -
# after some time
sudo chmod 666 /etc/rancher/k3s/k3s.yaml
k3s kubectl get pods -n kube-system

curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

You should see every pod in the Running or Completed state.

Build and install Chaos Mesh

git clone https://github.com/YangKeao/chaos-mesh.git
cd chaos-mesh
git checkout --track origin/impl-blockchaos-chaosdaemon

make image

docker image save ghcr.io/chaos-mesh/chaos-mesh:latest > ./chaos-mesh.tar
docker image save ghcr.io/chaos-mesh/chaos-daemon:latest > ./chaos-daemon.tar

sudo /usr/local/bin/k3s ctr images import ./chaos-mesh.tar
sudo /usr/local/bin/k3s ctr images import ./chaos-daemon.tar

k3s kubectl create ns chaos-testing
KUBECONFIG=/etc/rancher/k3s/k3s.yaml helm install chaos-mesh ./helm/chaos-mesh -n chaos-testing --set dashboard.securityMode=false --set controllerManager.replicaCount=1 --set chaosDaemon.runtime=containerd --set chaosDaemon.socketPath=/run/k3s/containerd/containerd.sock

Then you can check whether Chaos Mesh is running with k3s kubectl get pods -n chaos-testing.

Set up the injectee Pod

I have prepared one in the examples:

kubectl apply -f examples/blockchaos/hostpath-persistent-volume.yaml
kubectl apply -f examples/blockchaos/pod.yaml

Inject BlockChaos

kubectl apply -f examples/blockchaos/blockchaos-delay-example.yaml

Then you can verify the injection by reading or writing inside /usr/share/nginx/html:

kubectl exec -it hostpath-example -- /bin/bash
# Then execute the following commands in the container

apt update -y && apt install fio -y

cd /usr/share/nginx/html
dd if=

fio --filename=./some --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=16 --time_based --group_reporting --name=iops-test-job --eta-newline=1 --size=1000000000

You can see:

Jobs: 16 (f=16): [r(16)][21.7%][r=639KiB/s][r=159 IOPS][eta 01m:34s]
Jobs: 16 (f=16): [r(16)][22.5%][r=510KiB/s][r=127 IOPS][eta 01m:33s]
Jobs: 16 (f=16): [r(16)][24.2%][r=511KiB/s][r=127 IOPS][eta 01m:31s] 
Jobs: 16 (f=16): [r(16)][25.8%][r=512KiB/s][r=128 IOPS][eta 01m:29s] 
^Cbs: 16 (f=16): [r(16)][26.7%][r=512KiB/s][r=128 IOPS][eta 01m:28s]
# Outside the container, the following command recovers from the chaos
kubectl delete -f ./examples/blockchaos/blockchaos-delay-example.yaml 

# Then execute the following commands in the container
cd /usr/share/nginx/html
fio --filename=./some --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=256 --runtime=120 --numjobs=16 --time_based --group_reporting --name=iops-test-job --eta-newline=1 --size=1000000000

You can see:

Jobs: 16 (f=16): [r(16)][2.5%][r=49.9MiB/s][r=12.8k IOPS][eta 01m:57s]
Jobs: 16 (f=16): [r(16)][3.3%][r=49.2MiB/s][r=12.6k IOPS][eta 01m:56s]
Jobs: 16 (f=16): [r(16)][4.2%][r=50.2MiB/s][r=12.9k IOPS][eta 01m:55s]
Jobs: 16 (f=16): [r(16)][5.0%][r=51.2MiB/s][r=13.1k IOPS][eta 01m:54s]
Jobs: 16 (f=16): [r(16)][5.8%][r=47.6MiB/s][r=12.2k IOPS][eta 01m:53s]
Jobs: 16 (f=16): [r(16)][6.7%][r=49.2MiB/s][r=12.6k IOPS][eta 01m:52s]
Jobs: 16 (f=16): [r(16)][7.5%][r=48.8MiB/s][r=12.5k IOPS][eta 01m:51s]
Jobs: 16 (f=16): [r(16)][9.2%][r=48.3MiB/s][r=12.4k IOPS][eta 01m:49s] 
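The two runs are consistent with the 4 KiB block size: bandwidth equals IOPS times block size, so 128 IOPS × 4 KiB = 512 KiB/s under chaos and 12.8k IOPS × 4 KiB ≈ 50 MiB/s after recovery, roughly a 100x drop in IOPS. A minimal sketch that checks this from the status lines; parseStatus is a hypothetical helper, its regular expression only covers the read-only [r=...] format printed above, and it assumes fio's "k" suffix on IOPS means 1000 (which matches the numbers shown).

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// statusRe matches the read bandwidth and IOPS fields of a fio status line,
// e.g. "[r=512KiB/s][r=128 IOPS]" or "[r=49.9MiB/s][r=12.8k IOPS]".
var statusRe = regexp.MustCompile(`\[r=([0-9.]+)(KiB|MiB)/s\]\[r=([0-9.]+)(k?) IOPS\]`)

// parseStatus returns the read bandwidth in KiB/s and the read IOPS.
func parseStatus(line string) (kibPerSec, iops float64, err error) {
	m := statusRe.FindStringSubmatch(line)
	if m == nil {
		return 0, 0, fmt.Errorf("no read stats in %q", line)
	}
	if kibPerSec, err = strconv.ParseFloat(m[1], 64); err != nil {
		return 0, 0, err
	}
	if m[2] == "MiB" {
		kibPerSec *= 1024
	}
	if iops, err = strconv.ParseFloat(m[3], 64); err != nil {
		return 0, 0, err
	}
	if m[4] == "k" {
		iops *= 1000
	}
	return kibPerSec, iops, nil
}

func main() {
	delayed := "Jobs: 16 (f=16): [r(16)][25.8%][r=512KiB/s][r=128 IOPS][eta 01m:29s]"
	normal := "Jobs: 16 (f=16): [r(16)][2.5%][r=49.9MiB/s][r=12.8k IOPS][eta 01m:57s]"

	bwD, iopsD, err := parseStatus(delayed)
	if err != nil {
		panic(err)
	}
	bwN, iopsN, err := parseStatus(normal)
	if err != nil {
		panic(err)
	}
	// With --bs=4k, bandwidth / IOPS should be close to 4 KiB per request.
	fmt.Printf("KiB per I/O: delayed %.2f, normal %.2f\n", bwD/iopsD, bwN/iopsN)
	// Rough slowdown caused by the injected delay.
	fmt.Printf("IOPS ratio normal/delayed: %.0fx\n", iopsN/iopsD)
}
```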

Further Tests

  • Test with hostPath PV
  • Test with multiple volumes

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>
@ti-chi-bot (Member) commented Feb 17, 2022:

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • Andrewmatilde
  • Hexilee

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@codecov bot commented Feb 17, 2022:

Codecov Report

Merging #2907 (c1f86b3) into master (6a379cf) will decrease coverage by 0.26%.
The diff coverage is 0.00%.


@@            Coverage Diff             @@
##           master    #2907      +/-   ##
==========================================
- Coverage   41.09%   40.82%   -0.27%     
==========================================
  Files         165      166       +1     
  Lines       13851    13969     +118     
==========================================
+ Hits         5692     5703      +11     
- Misses       7726     7835     +109     
+ Partials      433      431       -2     
Impacted Files                                       Coverage Δ
api/v1alpha1/blockchaos_types.go                     0.00% <ø> (ø)
controllers/utils/controller/key.go                  0.00% <0.00%> (ø)
pkg/chaosdaemon/blockchaos_server.go                 0.00% <0.00%> (ø)
pkg/workflow/controllers/chaos_node_reconciler.go    61.31% <0.00%> (+3.15%) ⬆️


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update fdd4bb7...c1f86b3.
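The report's numbers can be cross-checked. Hits, misses, and partials add up to the line count in both columns (5692 + 7726 + 433 = 13851 and 5703 + 7835 + 431 = 13969), and Codecov computes coverage as hits divided by the sum of hits, misses, and partials. The sketch below reproduces both percentages; the exact change is about 0.268 points, which is consistent with the summary sentence truncating it to 0.26% while the table subtracts the displayed values (41.09 − 40.82 = 0.27). That reading of the rounding is an inference, not something the report states.

```go
package main

import "fmt"

// coverage returns the coverage percentage as hits / (hits + misses + partials).
func coverage(hits, misses, partials int) float64 {
	lines := hits + misses + partials
	if lines == 0 {
		return 0
	}
	return 100 * float64(hits) / float64(lines)
}

func main() {
	before := coverage(5692, 7726, 433) // master: 13851 lines
	after := coverage(5703, 7835, 431)  // #2907: 13969 lines
	fmt.Printf("master %.4f%%, #2907 %.4f%%, delta %.4f points\n", before, after, after-before)
}
```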

@YangKeao force-pushed the impl-blockchaos-chaosdaemon branch 2 times, most recently from f61d729 to 7db00da on February 17, 2022 11:10
@YangKeao marked this pull request as ready for review on February 23, 2022 09:49
@YangKeao (Member Author) commented:

/run-e2e-tests

@YangKeao requested a review from a team on February 23, 2022 10:01
@YangKeao (Member Author) commented Mar 2, 2022:

/build-image

@github-actions bot commented Mar 2, 2022:

You can download and import the image with the following command:

./hack/download-image.sh -r chaos-mesh/chaos-mesh -i 1920188139

@STRRL mentioned this pull request on Mar 2, 2022
}

blockchaos := obj.(*v1alpha1.BlockChaos)
if blockchaos.Status.InjectionIds == nil {
Member commented:

I think this will not happen; if it does, it indicates some unknown problem.

func (s *DaemonServer) ApplyBlockChaos(ctx context.Context, req *pb.ApplyBlockChaosRequest) (*pb.ApplyBlockChaosResponse, error) {
log := s.getLoggerFromContext(ctx)

volumeName, err := normalizeVolumeName(ctx, req.VolumePath)
Member commented:

This is weird, but you have to do it. What a pity.

Member commented:

I think you can access the process's mount namespace through /proc/$PID/root/.

Member Author replied:

@Andrewmatilde I have considered this approach; the difficult part is symlink resolution. If a symlink points to an absolute path, e.g. /home/vagrant/test -> /mnt, then following /proc/1/root/home/vagrant/test also leads to /mnt, but to the host's /mnt rather than the container's. To solve this, we would have to implement symlink walking ourselves, which is rather disappointing, since link targets may be relative paths, absolute paths, and so on.

Member commented:

I agree.

@Andrewmatilde (Member) commented:

It seems the example has a problem.
[screenshot]

@YangKeao (Member Author) commented Jun 30, 2022:

@Andrewmatilde It prints "ioem scheduler not found", which suggests the chaos-driver isn't installed. Did you see "io scheduler ioem registered" in dmesg?

@Andrewmatilde (Member) commented Jun 30, 2022:

It's fixed. Thanks!

@Andrewmatilde (Member) left a comment:

LGTM

if pbClient != nil {
defer pbClient.Close()
}
if err != nil {
Member commented:

Why not check this error as soon as it is returned at L52?

Member Author replied:

This is the tricky part (as with all other container-level injections): we need to check whether pbClient should be closed before returning the error.

I believe there are some problems that make it hard to clean up pbClient inside DecodeContainerRecord, but I'm not sure. I'll open an issue to improve DecodeContainerRecord further; in this PR, I think we should stay consistent with all the other implementations.
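The trade-off in this thread, sketched with hypothetical stand-ins (client, decodeKeepOpen, decodeSelfCleaning) rather than the real pbClient and DecodeContainerRecord. In the first pattern, matching the caller-side snippet above, the callee may hand back a live client together with an error, so the caller has to arrange the Close before looking at err. In the second, the callee owns cleanup on its own error paths, which is what would let callers check err immediately, as the reviewer suggested.

```go
package main

import (
	"errors"
	"fmt"
)

// closed counts Close calls so the two patterns can be compared.
var closed int

// client is a hypothetical stand-in for the chaos-daemon gRPC client.
type client struct{}

func (c *client) Close() error { closed++; return nil }

// decodeKeepOpen may return a live client together with an error.
func decodeKeepOpen(fail bool) (*client, error) {
	c := &client{}
	if fail {
		return c, errors.New("container not found")
	}
	return c, nil
}

// decodeSelfCleaning closes the client on its own error paths and returns nil.
func decodeSelfCleaning(fail bool) (c *client, err error) {
	c = &client{}
	defer func() {
		if err != nil {
			c.Close()
			c = nil
		}
	}()
	if fail {
		return c, errors.New("container not found")
	}
	return c, nil
}

func applyKeepOpen(fail bool) error {
	c, err := decodeKeepOpen(fail)
	if c != nil {
		defer c.Close() // must be registered even when err != nil
	}
	if err != nil {
		return err
	}
	// ... use c to inject the chaos ...
	return nil
}

func applySelfCleaning(fail bool) error {
	c, err := decodeSelfCleaning(fail)
	if err != nil {
		return err // nothing left to clean up
	}
	defer c.Close()
	// ... use c to inject the chaos ...
	return nil
}

func main() {
	errA := applyKeepOpen(true)
	errB := applySelfCleaning(true)
	fmt.Println(errA, errB, "closes:", closed)
}
```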

@Hexilee (Member) left a comment:

Rest LGTM

controllers/chaosimpl/blockchaos/impl.go (outdated, resolved)
@YangKeao (Member Author) commented Jul 5, 2022:

/merge

@ti-chi-bot (Member) commented:

This pull request has been accepted and is ready to merge.

Commit hash: c1f86b3

@YangKeao (Member Author) commented Jul 5, 2022:

After nearly five months 😆

@ti-chi-bot merged commit a49c43d into chaos-mesh:master on Jul 5, 2022
STRRL pushed a commit to Garima-Negi/chaos-mesh that referenced this pull request Sep 13, 2022
* implement blockchaos on chaosdaemon

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* make check

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* implement blockchaos helper

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* auto enable ioem scheduler

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* make check

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* fix decode container record

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* fix joining the volume path

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* enable local mount and mount namespace for cdh

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* use host-sys rather than sys in chaos-daemon

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* enable CGO for cdh

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* fine tune the protobuf

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* don't override scheduler if it's already set

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* make check

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* remount sysfs with rw permission

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* modify BlockChaos API

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* regenerate something

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* remove blockchaos limit

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* clean go.sum in e2e-test

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* add CHANGELOG.md of blockchaos

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* remove example of limit

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* use float64 correlation

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* close chaos-driver client

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* make tidy

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>

* return error after recovering

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>
Signed-off-by: STRRL <im@strrl.dev>

6 participants