kube-monkey is an implementation of Netflix's Chaos Monkey for Kubernetes clusters. It randomly deletes Kubernetes (k8s) pods in the cluster, encouraging and validating the development of failure-resilient services.
kube-monkey runs at a pre-configured hour (run_hour, defaults to 8am) on weekdays, and builds a schedule of deployments that will face a random pod death sometime during the same day. The time range during the day when the random pod death might occur is configurable and defaults to 10am to 4pm.
kube-monkey can be configured with a list of namespaces
- to blacklist (any deployments within a blacklisted namespace will not be touched)
To disable the blacklist, provide [""] in the
Opting-In to Chaos
kube-monkey works on an opt-in model and will only schedule terminations for Kubernetes (k8s) apps that have explicitly agreed to have their pods terminated by kube-monkey.
Opt-in is done by setting the following labels on a k8s app:
- kube-monkey/enabled: Set to "enabled" to opt in to kube-monkey.
- kube-monkey/mtbf: Mean time between failures (in days). For example, if set to "3", the k8s app can expect to have a pod killed approximately every third weekday.
- kube-monkey/identifier: A unique identifier for the k8s app. This is used to identify the pods that belong to a k8s app, as pods inherit labels from their k8s app. So, if kube-monkey detects that app foo has enrolled to be a victim, kube-monkey will look for all pods that have the label kube-monkey/identifier: foo to determine which pods are candidates for killing. The recommendation is to set this value to be the same as the app's name.
- kube-monkey/kill-mode: The default behavior is for kube-monkey to kill only ONE pod of your app. You can override this behavior by setting the value to:
  - "kill-all" if you want kube-monkey to kill ALL of your pods regardless of status (not ready or not running pods included). Does not require kill-value. Use this label carefully.
  - "fixed" if you want to kill a specific number of running pods with kill-value. If you overspecify, it will kill all running pods and issue a warning.
  - "random-max-percent" to specify, with kill-value, a maximum percentage of pods that can be killed. At the scheduled time, a uniformly random percentage of the running pods, up to that maximum, will be terminated.
- kube-monkey/kill-value: Specifies the value for kill-mode:
  - for fixed, provide an integer number of pods to kill
  - for random-max-percent, provide a number from 0-100 to specify the max % of pods kube-monkey can kill
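To make the interaction between kill-mode and kill-value concrete, here is a minimal Go sketch of how the number of pods to terminate could be derived from the two labels. It is not kube-monkey's actual code: the function name podsToKill, its parameters, rounding the percentage down, and drawing a whole-number percentage are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
)

// podsToKill is a hypothetical helper that returns how many pods one purge
// would terminate. running counts running pods; total also includes pods
// that are not ready or not running.
func podsToKill(mode string, killValue, running, total int) (int, error) {
	switch mode {
	case "": // label not set: default behavior, one pod (assumed to be a running one)
		if running == 0 {
			return 0, nil
		}
		return 1, nil
	case "kill-all": // kill-value is ignored
		return total, nil
	case "fixed":
		if killValue < 0 {
			return 0, errors.New("kill-value must not be negative")
		}
		if killValue > running {
			// Over-specified: all running pods are killed (the README says a warning is issued).
			return running, nil
		}
		return killValue, nil
	case "random-max-percent":
		if killValue < 0 || killValue > 100 {
			return 0, errors.New("kill-value must be between 0 and 100")
		}
		// Assumption: a whole-number percentage drawn uniformly from [0, killValue].
		percent := rand.Intn(killValue + 1)
		// Assumption: the resulting pod count is rounded down.
		return running * percent / 100, nil
	default:
		return 0, fmt.Errorf("unknown kill-mode %q", mode)
	}
}

func main() {
	n, _ := podsToKill("fixed", 5, 3, 4)
	fmt.Println(n) // 3: over-specified, so every running pod is killed
	n, _ = podsToKill("kill-all", 0, 3, 4)
	fmt.Println(n) // 4: pods that are not ready are included
}
```

With random-max-percent the result varies from run to run, so only its upper bound (kill-value percent of the running pods) is predictable.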
Example of opted-in Deployment killing one pod per purge
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: monkey-victim
  namespace: app-namespace
spec:
  template:
    metadata:
      labels:
        kube-monkey/enabled: enabled
        kube-monkey/identifier: monkey-victim
        kube-monkey/mtbf: '2'
        kube-monkey/kill-mode: "fixed"
        kube-monkey/kill-value: '1'
[... omitted ...]
For newer versions of Kubernetes, you may need to add the labels to the k8s app metadata as well.
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: monkey-victim
  namespace: app-namespace
  labels:
    kube-monkey/enabled: enabled
    kube-monkey/identifier: monkey-victim
    kube-monkey/mtbf: '2'
    kube-monkey/kill-mode: "fixed"
    kube-monkey/kill-value: '1'
spec:
  template:
    metadata:
      labels:
        kube-monkey/enabled: enabled
        kube-monkey/identifier: monkey-victim
[... omitted ...]
Overriding the apiserver
- Since client-go does not support cluster DNS explicitly (there is a // TODO: switch to using cluster DNS. note in the code), you may need to override the apiserver.
- If you are running an unauthenticated system, you may need to force the http apiserver endpoint.
To override the apiserver specify in the config.toml file
How kube-monkey works
Scheduling happens once a day on weekdays; this is when the schedule of terminations for the current day is generated. During scheduling, kube-monkey will:
- Generate a list of eligible k8s apps (k8s apps that have opted-in and are not blacklisted, if specified, and are whitelisted, if specified)
- For each eligible k8s app, flip a biased coin (bias determined by kube-monkey/mtbf) to determine if a pod for that k8s app should be killed today
- For each victim, calculate a random time when a pod will be killed
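The coin flip and the random kill time in the steps above can be sketched in Go as follows. This is an illustration under stated assumptions, not the project's implementation: the names shouldKillToday and randomKillOffset are made up, the bias is assumed to be a 1/mtbf chance per weekday (which matches "approximately every third weekday" for an mtbf of 3), and the kill time is assumed to be uniform over the window at one-second resolution.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// shouldKillToday flips the biased coin for one k8s app.
// Assumption: the chance of a kill on any given weekday is 1/mtbf.
func shouldKillToday(mtbfDays int) bool {
	if mtbfDays < 1 {
		return false // invalid mtbf: never schedule a kill
	}
	return rand.Float64() < 1/float64(mtbfDays)
}

// randomKillOffset returns a uniformly random offset from midnight that
// falls inside [startHour:00, endHour:00), at one-second resolution.
func randomKillOffset(startHour, endHour int) time.Duration {
	start := time.Duration(startHour) * time.Hour
	window := time.Duration(endHour-startHour) * time.Hour
	if window <= 0 {
		return start // empty window: fall back to its start
	}
	seconds := rand.Int63n(int64(window / time.Second))
	return start + time.Duration(seconds)*time.Second
}

func main() {
	// Hypothetical apps, mapped to their kube-monkey/mtbf value.
	apps := map[string]int{"foo": 3, "bar": 1}
	for name, mtbf := range apps {
		if shouldKillToday(mtbf) {
			fmt.Printf("%s: kill scheduled %v after midnight\n", name, randomKillOffset(10, 16))
		} else {
			fmt.Printf("%s: spared today\n", name)
		}
	}
}
```

Under this assumption an app with an mtbf of 1 is scheduled every weekday, while larger values make a kill on any given day proportionally less likely.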
Termination time is the randomly generated time during the day when a victim k8s app will have a pod killed. At termination time, kube-monkey will:
- Check if the k8s app is still eligible (has not opted-out or been blacklisted or removed from the whitelist since scheduling)
- Check if the k8s app has updated kill-mode and kill-value
- Depending on kill-mode and kill-value, terminate pods
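The eligibility check that runs both at scheduling time and again at termination time can be pictured with a short Go sketch. The App type and the isEligible function are hypothetical simplifications (a real client would read namespaces and labels from the Kubernetes API), and the whitelist check is left out to keep the example small.

```go
package main

import "fmt"

// App is a hypothetical, simplified view of a k8s app:
// only its namespace and its labels.
type App struct {
	Namespace string
	Labels    map[string]string
}

// isEligible reports whether app has opted in to kube-monkey and does not
// live in a blacklisted namespace.
func isEligible(app App, blacklist []string) bool {
	if app.Labels["kube-monkey/enabled"] != "enabled" {
		return false // not opted in, or opted out since scheduling
	}
	for _, ns := range blacklist {
		if ns == app.Namespace {
			return false // namespace is blacklisted
		}
	}
	return true
}

func main() {
	labels := map[string]string{
		"kube-monkey/enabled":    "enabled",
		"kube-monkey/identifier": "monkey-victim",
	}
	victim := App{Namespace: "app-namespace", Labels: labels}
	system := App{Namespace: "kube-system", Labels: labels}
	blacklist := []string{"kube-system"}

	fmt.Println(isEligible(victim, blacklist)) // true
	fmt.Println(isEligible(system, blacklist)) // false
}
```

In this sketch a blacklist of [""] excludes nothing, because no app lives in a namespace with an empty name; that is one plausible reading of how the disabling value works.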
Docker images for kube-monkey can be found at DockerHub
Clone the repository and build the container.
go get github.com/asobti/kube-monkey
cd $GOPATH/src/github.com/asobti/kube-monkey
make container
kube-monkey is configured by a TOML file placed at /etc/kube-monkey/config.toml and expects the configmap to exist before the kube-monkey deployment.
Configuration keys and descriptions can be found in
Example config.toml file
[kubemonkey]
dry_run = true                           # Terminations are only logged
run_hour = 8                             # Run scheduling at 8am on weekdays
start_hour = 10                          # Don't schedule any pod deaths before 10am
end_hour = 16                            # Don't schedule any pod deaths after 4pm
blacklisted_namespaces = ["kube-system"] # Critical apps live here
time_zone = "America/New_York"           # Set tzdata timezone example. Note the field is time_zone not timezone
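The hour settings in the example only make sense in a particular order: scheduling has to run before the kill window opens, and the window has to open before it closes. The Go sketch below checks that for values that have already been parsed. The Config type, its field names, and the exact rules are assumptions made for illustration; they are not taken from kube-monkey's own validation code.

```go
package main

import (
	"fmt"
	"time"
)

// Config is a hypothetical in-memory form of the [kubemonkey] hour and
// time zone settings after the TOML file has been parsed.
type Config struct {
	RunHour   int    // run_hour
	StartHour int    // start_hour
	EndHour   int    // end_hour
	TimeZone  string // time_zone
}

// Validate applies sanity checks that are assumed here, not confirmed
// against kube-monkey itself.
func (c Config) Validate() error {
	hours := []struct {
		key   string
		value int
	}{
		{"run_hour", c.RunHour},
		{"start_hour", c.StartHour},
		{"end_hour", c.EndHour},
	}
	for _, h := range hours {
		if h.value < 0 || h.value > 23 {
			return fmt.Errorf("%s must be between 0 and 23, got %d", h.key, h.value)
		}
	}
	if c.StartHour >= c.EndHour {
		return fmt.Errorf("start_hour (%d) must be before end_hour (%d)", c.StartHour, c.EndHour)
	}
	if c.RunHour >= c.StartHour {
		return fmt.Errorf("run_hour (%d) must be before start_hour (%d)", c.RunHour, c.StartHour)
	}
	// time.LoadLocation needs a tzdata database for names other than UTC.
	if _, err := time.LoadLocation(c.TimeZone); err != nil {
		return fmt.Errorf("time_zone: %w", err)
	}
	return nil
}

func main() {
	cfg := Config{RunHour: 8, StartHour: 10, EndHour: 16, TimeZone: "America/New_York"}
	if err := cfg.Validate(); err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Println("config looks consistent")
}
```

Whether the time zone lookup succeeds depends on the tzdata available on the machine or in the container image, which is worth keeping in mind for minimal images.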
- First deploy the expected kube-monkey-config-map configmap in the namespace you intend to run kube-monkey in (for example, the kube-system namespace). Make sure to define the keyname as config.toml:
kubectl create configmap km-config --from-file=config.toml=km-config.toml
or
kubectl apply -f km-config.yaml
- Run kube-monkey as a k8s app within the Kubernetes cluster, in a namespace that has permissions to kill Pods in other namespaces (eg.
examples/ for example Kubernetes yaml files.
A helm chart is provided that assumes you have already compiled and uploaded the container to your own container repository. Once uploaded, you need to edit $PROJECT/helm/kubemonkey/values.yaml and update the value of image.repository to point at the location of your container.
Helm can then be executed using
helm install $release helm/kubemonkey
kube-monkey uses glog and supports all command-line features for glog. To specify a custom v level or a custom log directory on the pod, see args: ["-v=5", "-log_dir=/path/to/custom/log"] in the example deployment file
Standardized glog levels
grep -r V\([0-9]\) *
L1: Highest Level current status info and Errors with Terminations
L2: Successful terminations
L3: More detailed schedule status info
L4: Debugging verbose schedule and config info
L5: Auto-resolved inconsequential issues