Fix the issue that io delay injected does not take effect #121

Merged
merged 9 commits into from Jan 13, 2020
6 changes: 4 additions & 2 deletions Makefile
@@ -102,9 +102,9 @@ image:
docker build -t ${DOCKER_REGISTRY}/pingcap/chaos-daemon images/chaos-daemon
docker build -t ${DOCKER_REGISTRY}/pingcap/chaos-mesh images/chaos-mesh
docker build -t ${DOCKER_REGISTRY}/pingcap/chaos-fs images/chaosfs
cp -R hack images/chaos-scripts
cp -R scripts images/chaos-scripts
docker build -t ${DOCKER_REGISTRY}/pingcap/chaos-scripts images/chaos-scripts
rm -rf images/chaos-scripts/hack
rm -rf images/chaos-scripts/scripts
docker build -t ${DOCKER_REGISTRY}/pingcap/chaos-grafana images/grafana
docker build -t ${DOCKER_REGISTRY}/pingcap/chaos-dashboard images/chaos-dashboard

@@ -113,6 +113,8 @@ docker-push:
docker push "${DOCKER_REGISTRY}/pingcap/chaos-fs:latest"
docker push "${DOCKER_REGISTRY}/pingcap/chaos-daemon:latest"
docker push "${DOCKER_REGISTRY}/pingcap/chaos-scripts:latest"
docker push "${DOCKER_REGISTRY}/pingcap/chaos-grafana:latest"
docker push "${DOCKER_REGISTRY}/pingcap/chaos-dashboard:latest"

bin/revive:
GO111MODULE="on" go build -o bin/revive github.com/mgechev/revive
4 changes: 4 additions & 0 deletions cmd/controller-manager/main.go
@@ -180,13 +180,17 @@ func watchConfig(cfg *config.Config, stopCh <-chan struct{}) {
setupLog.Error(err, "watcher got error, try to restart watcher")
default:
setupLog.Error(err, "unable to watch new ConfigMaps")
os.Exit(1)
}
}

select {
case <-stopCh:
close(sigChan)
return
default:
Contributor:

Why add the default branch? Could it lead to multiple configWatchers notifying eventsCh at the same time?

Member Author:

No, this branch is only used to restart configWatcher.Watch.

Contributor:

I misunderstood stopCh before; I thought configWatcher used it to notify the outer goroutine that it had finished.
If configWatcher.Watch or eventsCh receives the message from stopCh before the outer goroutine does, won't configWatcher.Watch be restarted forever, because case <-stopCh at line 187 will not receive the message again? How can this be solved?

Member Author:

stopCh will be closed, so every receiver will observe the event.

Contributor:

Got it, that's OK.

Contributor:

Restarting is OK for the ErrWatchChannelClosed case, but is it OK for other errors?

// sleep 2 seconds to prevent excessive log due to infinite restart
time.Sleep(2 * time.Second)
}
}
}()
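
The review thread above hinges on one property of Go channels: closing a channel is observed by every receiver, and a receive on a closed channel never blocks again. The sketch below is only an illustration of that property and is not code from this PR; `waitAll` and `isStopped` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"sync"
)

// waitAll starts n goroutines that all block on the same stop channel and
// returns how many of them observed the close. (Hypothetical helper.)
func waitAll(n int) int {
	stopCh := make(chan struct{})
	var wg sync.WaitGroup
	var mu sync.Mutex
	seen := 0

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-stopCh // unblocks in every goroutine once stopCh is closed
			mu.Lock()
			seen++
			mu.Unlock()
		}()
	}

	close(stopCh) // unlike a single send, a close reaches all receivers
	wg.Wait()
	return seen
}

// isStopped reports whether stopCh has been closed, without blocking.
// It mirrors the select/default shape used in the hunk above.
func isStopped(stopCh <-chan struct{}) bool {
	select {
	case <-stopCh:
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println(waitAll(3))

	stopCh := make(chan struct{})
	fmt.Println(isStopped(stopCh))
	close(stopCh)
	// The closed channel stays readable, so a later loop iteration still sees it.
	fmt.Println(isStopped(stopCh), isStopped(stopCh))
}
```

Because the closed channel stays readable, a restart loop that checks it on each iteration exits on the next pass even if another goroutine saw the close first.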
117 changes: 44 additions & 73 deletions controllers/iochaos/fs/types.go
@@ -22,6 +22,7 @@ import (
"golang.org/x/sync/errgroup"

"github.com/go-logr/logr"
"github.com/golang/protobuf/ptypes/empty"

"github.com/pingcap/chaos-mesh/api/v1alpha1"
"github.com/pingcap/chaos-mesh/controllers/twophase"
@@ -161,29 +162,31 @@ func (r *Reconciler) cleanFinalizersAndRecover(ctx context.Context, iochaos *v1a
func (r *Reconciler) recoverPod(ctx context.Context, pod *v1.Pod, iochaos *v1alpha1.IoChaos) error {
r.Log.Info("Recovering", "namespace", pod.Namespace, "name", pod.Name)

var ns v1.Namespace
if err := r.Get(ctx, types.NamespacedName{Name: pod.Namespace}, &ns); err != nil {
return err
}
cctx, cancel := context.WithTimeout(ctx, 2*time.Minute)
defer cancel()
err := wait.PollUntil(2*time.Second, func() (bool, error) {
if err := r.recoverInjectAction(ctx, pod, iochaos); err != nil {
if utils.IsCaredNetError(err) {
r.Log.Info("Recover I/O chaos action, network is not ok, retrying...",
"namespace", pod.Namespace, "name", pod.Name)
return false, nil
}

annotations := ns.GetAnnotations()
if annotations == nil {
annotations = make(map[string]string)
}
return false, err
}

if _, ok := annotations[v1alpha1.WebhookInitPodAnnotationKey]; ok {
return r.recoverInjectAction(ctx, pod, iochaos)
}
r.Log.Info("Recover I/O chaos action successfully")

if err := utils.UnsetIoInjection(ctx, r.Client, pod, iochaos); err != nil {
r.Log.Error(err, "failed to unset I/O injection",
return true, nil
}, cctx.Done())
Member:

When the network is not OK and the poll loop runs until cctx times out, will wait.PollUntil return a timeout error or nil?

Member Author:

A timeout error.


if err != nil {
r.Log.Error(err, "failed to recover I/O chaos action",
"namespace", pod.Namespace, "name", pod.Name)
return err
}

return r.Delete(ctx, pod, &client.DeleteOptions{
GracePeriodSeconds: new(int64),
})
return nil
}

func (r *Reconciler) injectAllPods(ctx context.Context, pods []v1.Pod, iochaos *v1alpha1.IoChaos) error {
@@ -210,67 +213,29 @@ func (r *Reconciler) injectAllPods(ctx context.Context, pods []v1.Pod, iochaos *
func (r *Reconciler) injectPod(ctx context.Context, pod *v1.Pod, iochaos *v1alpha1.IoChaos) error {
r.Log.Info("Inject I/O chaos action", "namespace", pod.Namespace, "name", pod.Name)

if err := utils.SetIoInjection(ctx, r.Client, pod, iochaos); err != nil {
r.Log.Error(err, "failed to set I/O injection",
"namespace", pod.Namespace, "name", pod.Name)
return err
}

var ns v1.Namespace
if err := r.Get(ctx, types.NamespacedName{Name: pod.Namespace}, &ns); err != nil {
return err
}

annotations := ns.GetAnnotations()
if annotations == nil {
annotations = make(map[string]string)
}

if _, ok := annotations[v1alpha1.WebhookInitPodAnnotationKey]; !ok {
// need to recreate pod when to inject sidecar
time.Sleep(1 * time.Second)
err := r.Delete(ctx, pod, &client.DeleteOptions{
GracePeriodSeconds: new(int64),
})
if err != nil {
return err
}
}

// TODO: optimize inject action
go func() {
cctx, cancel := context.WithTimeout(ctx, 5*time.Minute)
defer cancel()
err := wait.PollUntil(2*time.Second, func() (bool, error) {
var npod v1.Pod
err := r.Client.Get(ctx, types.NamespacedName{
Namespace: pod.Namespace,
Name: pod.Name,
}, &npod)
if err != nil {
r.Log.Error(err, "failed to get pod", "namespace", pod.Namespace, "name", pod.Name)
cctx, cancel := context.WithTimeout(ctx, 2*time.Minute)
defer cancel()
err := wait.PollUntil(2*time.Second, func() (bool, error) {
if err := r.injectAction(ctx, pod, iochaos); err != nil {
if utils.IsCaredNetError(err) {
r.Log.Info("Inject I/O chaos action, network is not ok, retrying...",
"namespace", pod.Namespace, "name", pod.Name)
return false, nil
}

if err := r.injectAction(ctx, &npod, iochaos); err != nil {
if utils.IsCaredNetError(err) {
r.Log.Info("Inject I/O chaos action, network is not ok, retrying...",
"namespace", pod.Namespace, "name", pod.Name)
return false, nil
}
return false, err
}

return false, err
}
r.Log.Info("Inject I/O chaos action successfully")

r.Log.Info("Inject I/O chaos action successfully")
return true, nil
}, cctx.Done())

return true, nil
}, cctx.Done())
if err != nil {
r.Log.Error(err, "failed to inject I/O chaos action",
"namespace", pod.Namespace, "name", pod.Name)
}
}()
if err != nil {
r.Log.Error(err, "failed to inject I/O chaos action",
"namespace", pod.Namespace, "name", pod.Name)
return err
}

return nil
}
@@ -294,7 +259,13 @@ func (r *Reconciler) injectAction(ctx context.Context, pod *v1.Pod, iochaos *v1a
return err
}

_, err = cli.SetFault(ctx, req)
if len(req.Methods) > 0 {
_, err = cli.SetFault(ctx, req)
return err
}

// Inject the fault into all methods if the methods list is empty.
_, err = cli.SetFaultAll(ctx, req)
return err
}

@@ -311,6 +282,6 @@ func (r *Reconciler) recoverInjectAction(ctx context.Context, pod *v1.Pod, iocha
return err
}

_, err = cli.RecoverAll(ctx, nil)
_, err = cli.RecoverAll(ctx, &empty.Empty{})
return err
}
35 changes: 26 additions & 9 deletions doc/io_chaos.md
@@ -2,7 +2,11 @@

This document helps you to build IO chaos experiments.

IO chaos allows you to simulate file system faults such as IO delay, read/write errors, etc. It can inject delay and errno when you use the IO system calls such as `open`, `read` and `write`.
IO chaos allows you to simulate file system faults such as IO delay,
read/write errors, etc. It can inject delay and errno when you use the IO system calls such as `open`, `read` and `write`.

> Note: IO chaos can only be used if the relevant labels and annotations are set before the application is created.
> For more information, see [Create a chaos experiment](#create-a-chaos-experiment).

## Prerequisites

@@ -35,6 +39,10 @@ ARGS="--pd=${CLUSTER_NAME}-pd:2379 \
--config=/etc/tikv/tikv.toml
```

> Note: The default data directory of TiKV is not a subdirectory of `PersistentVolumes`.
> If your application is a TiDB cluster, you need to modify it in [_start_tikv.sh.tpl](https://github.com/pingcap/tidb-operator/blob/master/charts/tidb-cluster/templates/scripts/_start_tikv.sh.tpl).
> PD has the same issue as TiKV; you need to modify the data directory of PD in [_start_pd.sh.tpl](https://github.com/pingcap/tidb-operator/blob/master/charts/tidb-cluster/templates/scripts/_start_pd.sh.tpl).

## Usage

### Configure a ConfigMap
@@ -98,21 +106,30 @@ Description:

### Create a chaos experiment

#### Before the application starts

In this situation, you can add an [annotation](https://kubernetes.io/docs/concepts/overview/working-with-objects/annotations/) to the application namespace:
Before the application is created, you need to enable the admission webhook with a label and add an [annotation](https://kubernetes.io/docs/concepts/overview/working-with-objects/annotations/) to the application namespace:

```yaml
```bash
admission-webhook.pingcap.com/init-request:chaosfs-tikv
```

Then, you can start your application and define YAML file to start your chaos experiment.
You can use the following commands to set labels and annotations of the application namespace:

#### If the application is already running
```bash
# If the application namespace does not exist, run this command to create it;
# otherwise, skip this command.
kubectl create ns app-ns # "app-ns" is the application namespace

# enable admission-webhook
kubectl label ns app-ns admission-webhook=enabled

In this situation, you just need to define YAML file to start your chaos experiment.
# set annotation
kubectl annotate ns app-ns admission-webhook.pingcap.com/init-request=chaosfs-tikv

> Note that if you are in this situation, the target pods will be modified dynamically and restarted.
# create your application
...
```

Then you can start your application and define a YAML file to start your chaos experiment.

#### Start a chaos experiment
