Chaos Machine is a complete chaos engineering workflow that enables customers to run controlled chaos experiments and test hypotheses related to system behavior. Chaos Machine uses metric and alarm data from both Amazon CloudWatch and Prometheus as inputs to evaluate system behavior before, during, and after the experiment. The Chaos Machine provides a simple, consistent way to organize and execute chaos experiments, and is suitable both for building and conducting ad-hoc experiments and for integrating into more sophisticated automation pipelines. Chaos Machine uses the AWS Fault Injection Service (FIS) to run controlled experiments, and AWS Step Functions and AWS Lambda for orchestration and execution.
- Review the architecture diagram above to understand how the Chaos Machine works, then see the `examples` directory for working examples to reference.
- Create the `chaos-machine` Lambda layer before deploying the stack. This layer also includes the JSON schema for validation, so the layer must be updated whenever the schema is updated.
make layer/package
terraform apply
provider "aws" {
  region = local.region
}

locals {
  region      = "us-west-2"
  project_env = "dev"
}

module "chaos-machine" {
  source           = "/home/ec2-user/repos/chaos-machine"
  create_iam_roles = true
  project_env      = local.project_env
}
Each execution of the Chaos Machine runs an experiment and tests a hypothesis. Input for the execution is defined by a JSON schema; see the example inputs for reference. The execution input includes two sets of measurables: `steadyState` and `hypothesis`. If you're new to chaos engineering and aren't familiar with these terms, I recommend reviewing the "Test resiliency using chaos engineering" best practice in the reliability pillar of the AWS Well-Architected Framework, guidance in Principles of Chaos Engineering, and Chaos Engineering Stories. In most cases, the hypothesis will be that the system will continue to behave in steady state during and after the chaos experiment. If this is the case, specify `"hypothesis": "steadyState"`. However, the Chaos Machine allows you to specify unique measurables for `hypothesis`. This is particularly useful for evaluating disaster recovery or analyzing specific metrics that may not be part of a well-defined steady state.
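The relationship between the two sets of measurables can be sketched in a few lines of Python. This is only an illustration of the rule described above, not the project's code: the function name `resolve_hypothesis` is hypothetical, and the measurables are opaque placeholders because their real structure is defined by the project's JSON schema.

```python
import json


def resolve_hypothesis(execution_input: dict):
    """Return the measurables that would be evaluated for the hypothesis."""
    hypothesis = execution_input["hypothesis"]
    # The string "steadyState" means: reuse the steadyState measurables.
    if hypothesis == "steadyState":
        return execution_input["steadyState"]
    # Otherwise the hypothesis carries its own, unique measurables.
    return hypothesis


# Placeholder measurables (hypothetical); see the example inputs for real ones.
steady_state = {"placeholder": "steady-state measurables"}
recovery_measurables = {"placeholder": "measurables unique to the hypothesis"}

same_input = {"steadyState": steady_state, "hypothesis": "steadyState"}
split_input = {"steadyState": steady_state, "hypothesis": recovery_measurables}

print(json.dumps(resolve_hypothesis(same_input)))
print(json.dumps(resolve_hypothesis(split_input)))
```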
When defining the metrics and expressions to be used for the `steadyState` and `hypothesis`, I recommend starting by using the Amazon CloudWatch Metrics console to create and test example metrics and expressions with the system to be tested, and then using the Source tab to view and copy the definitions to the execution input file. You can also use this same approach to create CloudWatch alarms by creating a metric for the alarm, then clicking on the bell icon under Actions in the Graphed metrics tab to create the alarm. One of the key features of the chaos machine is that it uses the powerful built-in capabilities of both CloudWatch and Prometheus to evaluate the metric data, rather than having to handle that in the application logic. Thus, you're able to take full advantage of both of these tools to build almost unlimited evaluation expressions. If you use Prometheus for your application monitoring, see the Prometheus section for details.
A test begins when you start an execution of the state machine. During the SteadyState step, a Lambda function will retrieve the measurables defined in `steadyState` for the amount of time specified in the `lookback` to verify that the system has been behaving normally. If the evaluation passes, i.e. the application is in "steady state", the FIS experiment will be started. No measurables are checked during the experiment. Once the experiment is completed, by default, the hypothesis is tested based on data retrieved for the period between the experiment start time and end time. However, if you wish to test your hypothesis during application recovery after the experiment ended, you can use `recoveryDelay` and `recoveryDuration` in the execution input so that metric/alarm data will be retrieved for the period starting `recoveryDelay` seconds after the experiment end time and ending `recoveryDuration` seconds later.
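The evaluation periods described above can be written out as simple date arithmetic. The sketch below is illustrative only: the function names are hypothetical, `lookback` is passed as a `timedelta` because its unit is not stated here, and requiring `recoveryDelay` and `recoveryDuration` together is an assumption of the sketch rather than documented behavior.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

Window = Tuple[datetime, datetime]


def steady_state_window(now: datetime, lookback: timedelta) -> Window:
    """Period checked before the experiment: the last `lookback` up to now."""
    return now - lookback, now


def hypothesis_window(
    experiment_start: datetime,
    experiment_end: datetime,
    recovery_delay: Optional[int] = None,
    recovery_duration: Optional[int] = None,
) -> Window:
    """Period whose metric/alarm data is used to test the hypothesis."""
    if recovery_delay is None and recovery_duration is None:
        # Default: the experiment itself, from start time to end time.
        return experiment_start, experiment_end
    if recovery_delay is None or recovery_duration is None:
        # Assumption of this sketch: the two settings are supplied together.
        raise ValueError("recoveryDelay and recoveryDuration must be set together")
    # Recovery: start recoveryDelay seconds after the experiment ends,
    # and stop recoveryDuration seconds after that.
    window_start = experiment_end + timedelta(seconds=recovery_delay)
    return window_start, window_start + timedelta(seconds=recovery_duration)


start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
end = datetime(2024, 1, 1, 12, 10, tzinfo=timezone.utc)

default_window = hypothesis_window(start, end)
recovery_window = hypothesis_window(start, end, recovery_delay=60, recovery_duration=300)
print(recovery_window[0].isoformat(), recovery_window[1].isoformat())
```

With a 10-minute experiment ending at 12:10, a `recoveryDelay` of 60 and a `recoveryDuration` of 300 give a window from 12:11 to 12:16.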
The Chaos Machine does not create the FIS experiments used during an execution of the machine. Therefore, you should create the experiment templates before beginning the steps below. See the FIS User Guide and the Chaos Engineering Workshop for details.
The `examples` are intended to give users references for how to use the module(s), and to support testing and validating changes to the module's source code. If contributing to the project, please be sure to make any appropriate updates to the relevant examples to allow maintainers to test your changes and to keep the examples up to date for users. Thank you!
- Complete. This example will deploy the chaos machine and required IAM resources.
- Prometheus. This modifies the `complete` example to enable using Prometheus metrics. See the Prometheus section for details.
To help you get familiar with using the Chaos Machine in a realistic scenario, this project includes example execution inputs and automation (using the `pytest` framework) to run an experiment for the PetAdoptions application that is part of the Chaos Engineering Workshop. You can follow the instructions in the Bring your own AWS Account section of the workshop website. Then you can create the AZ Disruption experiment from the workshop, simulate user activity, and try running tests based on the example inputs.
- Example chaos test execution inputs:
  - PetSiteAZDisruption-split.json: This example is configured to use CloudWatch metrics and specifies unique definitions for the `steadyState` and `hypothesis`.
  - PetSiteAZDisruption-same.json: This example is configured to use CloudWatch metrics and reuses the definitions in `steadyState` for `hypothesis`.
  - PetSiteAZDisruption-alarms.json: This example uses only CloudWatch alarms and reuses the definitions in `steadyState` for `hypothesis`.
    - To create an alarm used in this example, follow the instructions in the AZ Disruption page under Understand steady-state to get to the dashboard, click on the options for the `OK (2xx)` widget, choose View in metrics, and click on the bell icon under Actions in the Graphed metrics tab. Name the alarm `PetSiteOkRate`.
    - You can use this command to set an alarm status for testing purposes, e.g. to quickly set back to OK after a failed test.
      aws cloudwatch set-alarm-state --alarm-name "PetSiteOkRate" --state-value OK --state-reason "chaos experiment"
  - PetSiteAZDisruption-prom.json: This example is configured to use Prometheus metrics.
  - PetSiteAZDisruption-mixed.json: This example is configured to use a combination of CloudWatch metrics and alarms, and Prometheus metrics.
  - PetSiteAZDisruption-recovery.json: This example is configured for scenarios where you want to test a hypothesis during application recovery after the FIS experiment has ended.
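If you drive the alarm-based example from automated tests, resetting the alarm between runs can also be done from Python. The following is a minimal sketch, assuming boto3 is installed and credentials are configured; the helper name `reset_alarm` is hypothetical, while `set_alarm_state` is the CloudWatch API operation behind the CLI command shown for the alarms example.

```python
def reset_alarm(cloudwatch_client, alarm_name: str, reason: str = "chaos experiment") -> None:
    """Force a CloudWatch alarm back to OK, e.g. after a failed test run.

    `cloudwatch_client` is expected to behave like boto3.client("cloudwatch").
    The override is temporary: the alarm returns to its real state the next
    time CloudWatch evaluates it.
    """
    cloudwatch_client.set_alarm_state(
        AlarmName=alarm_name,
        StateValue="OK",
        StateReason=reason,
    )


# Usage (requires boto3 and AWS credentials, so it is left as a comment):
#   import boto3
#   reset_alarm(boto3.client("cloudwatch"), "PetSiteOkRate")
```

Passing the client in as an argument keeps the helper easy to call from a pytest fixture and easy to exercise with a stand-in client.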
The `chaos-machine` can also be configured to use Prometheus metrics instead of, or in combination with, CloudWatch metrics. This example deploys Prometheus to the EKS cluster used for the PetAdoptions application.
- Prerequisites
  - Helm
  - kubectl
  - Amazon EBS CSI driver - I recommend installing it as a managed add-on. If you use EKS Pod Identity to grant permissions, be sure to also install the Amazon EKS Pod Identity Agent add-on. If you use IAM Roles for Service Accounts, be sure to also create the IAM Identity Provider for the cluster OIDC endpoint.
- Configure kubectl to connect to the PetSite EKS cluster.
aws eks update-kubeconfig --region us-east-1 --name PetSite
- Deploy Prometheus in the cluster using instructions in the Deploy Prometheus using Helm page of the EKS User Guide.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm upgrade -i prometheus prometheus-community/prometheus \
--create-namespace \
--namespace prometheus \
--set alertmanager.persistence.storageClass="gp2" \
--set server.persistentVolume.storageClass="gp2"
- Verify the pods are in the `READY` state.
kubectl get pods -n prometheus
- Expose Prometheus as a Kubernetes service. For simplicity, this example uses a service of type `NodePort`. However, for real-world or production use cases, you should consider exposing the service using the AWS Load Balancer Controller configured to be internal-facing. In either case, the `steady-state` and `evaluate-hypothesis` Lambda functions must be attached to the private subnets in the VPC where the nodes for the EKS cluster are running to get access to the exposed, but private, Prometheus endpoint.
kubectl expose service prometheus-server -n prometheus --type NodePort --target-port 9090 --name prometheus-service
- Get the URL for the `prometheus-service` to specify for the `prometheusUrl` property in the execution input.
  - List the services running in the `prometheus` namespace.
    kubectl get svc -n prometheus
  - Find the `prometheus-service` of type `NodePort` and note the exposed port - it is the number after the colon, e.g. `80:32288/TCP`.
  - Get the IP address for one of the worker nodes, e.g. `10.1.241.75`.
    kubectl get nodes -o wide
  - Use these values to specify the `prometheusUrl` in an execution input. For example:
    ... "prometheusUrl": "http://10.1.186.117:31793" ...
- Attach the Lambda functions to the VPC. See the example deployment.
  - Navigate to the Subnets page in the VPC console, or use the commands below, to identify the two subnets named `Services/Microservices/PrivateSubnet1` and `Services/Microservices/PrivateSubnet2`, and add the values for subnet ID to `lambda_subnet_ids` in `module "chaos-machine"` in the `main.tf`.
    - Note these subnets will also be used in the FIS experiment template as targets.
    aws ec2 describe-subnets --filters "Name=tag:Name,Values=*Services/Microservices/PrivateSubnet*" --query 'Subnets[*].SubnetId' --output text
  - Navigate to the EKS console and the Networking tab for the cluster named "PetSite", or use the commands below, to identify the Cluster security group, and add that value to `lambda_security_group_ids` variables in the `main.tf`.
    aws eks describe-cluster --name PetSite --query 'cluster.resourcesVpcConfig.clusterSecurityGroupId' --output text
    lambda_subnet_ids = ["subnet-XXXXXXXXXXXXX", "subnet-YYYYYYYYYYYYYY"]
    lambda_security_group_ids = ["sg-ZZZZZZZZZZZZZZ"]
  - Update the stack.
- Try to run an experiment using the example execution input. This example checks whether or not the rate of pet searches (from the load generator) drops below 100 per 2 min across all nodes where the service is running. The example uses two queries labeled `m2` and `e2`, but only the expression `e2` is technically necessary. In this case, `m2` is included only for transparency, so you can see the raw data used for the evaluation. The `e2` query applies the comparison `< bool 100`, which, similar to the way the `Expressions` are used for the example CloudWatch metrics, returns a `1` or `0`. I highly recommend including the extra query for the raw data.
- When you're finished, you can delete the `prometheus-service` and uninstall Prometheus.
kubectl delete service prometheus-service -n prometheus
helm uninstall prometheus -n prometheus
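Assembling the `prometheusUrl` from the `kubectl` output in the steps above is easy to get wrong by hand, so here is a small sketch of the same logic. The function names are hypothetical; the inputs are the PORT(S) value shown by `kubectl get svc` and a worker node's internal IP from `kubectl get nodes -o wide`.

```python
def node_port(ports_column: str) -> int:
    """Extract the exposed node port from a value such as '80:32288/TCP'."""
    port_mapping, _, _protocol = ports_column.partition("/")
    _service_port, separator, exposed = port_mapping.partition(":")
    if not separator:
        # A value without a colon has no node port (not a NodePort service).
        raise ValueError(f"no node port found in {ports_column!r}")
    return int(exposed)


def prometheus_url(node_ip: str, ports_column: str) -> str:
    """Build the value for prometheusUrl in the execution input."""
    return f"http://{node_ip}:{node_port(ports_column)}"


print(prometheus_url("10.1.241.75", "80:32288/TCP"))  # http://10.1.241.75:32288
```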
To run an experiment and test a hypothesis with the Chaos Machine, you provide an input and start an execution of the state machine. You can do this using the AWS Step Functions console, AWS CLI, or any AWS SDK. I recommend starting with the Step Functions console for initial or ad-hoc experimentation, then using an SDK to integrate with your automated testing. There is an example using `pytest` in `examples/tests` that can be referenced to create automated tests using the chaos machine.
- Find and select the state machine in the AWS Step Functions console.
- Choose Start execution.
- Copy and paste the execution input into the Input field.
- Choose Start execution.
export ENVIRONMENT={environment} # corresponds to the module variable project_env, e.g. dev
export AWS_DEFAULT_REGION={region} # set to the AWS region where the chaos machine is deployed, e.g. us-east-1
make pytest experiment-template-id={experimentTemplateId} # specify the value for the experimentTemplateId in the execution input
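For the SDK route, a minimal sketch with the AWS SDK for Python might look like the following. It assumes boto3 is installed and credentials are configured; the helper names are hypothetical, while `start_execution` and `describe_execution` are standard Step Functions API operations. How a terminal status maps to a passed or failed hypothesis is not covered here, so check the execution output and logs rather than relying on the status alone.

```python
import json
import time


def start_chaos_execution(sfn_client, state_machine_arn: str, execution_input: dict) -> str:
    """Start an execution of the state machine and return its ARN.

    `sfn_client` is expected to behave like boto3.client("stepfunctions").
    """
    response = sfn_client.start_execution(
        stateMachineArn=state_machine_arn,
        input=json.dumps(execution_input),  # the API expects a JSON string
    )
    return response["executionArn"]


def wait_for_execution(sfn_client, execution_arn: str, poll_seconds: float = 15.0) -> str:
    """Poll until the execution leaves RUNNING and return its final status."""
    while True:
        status = sfn_client.describe_execution(executionArn=execution_arn)["status"]
        if status != "RUNNING":
            return status
        time.sleep(poll_seconds)


# Usage (requires boto3 and AWS credentials, so it is left as a comment;
# the ARN and file name are placeholders):
#   import boto3
#   sfn = boto3.client("stepfunctions")
#   with open("PetSiteAZDisruption-same.json") as f:
#       arn = start_chaos_execution(sfn, "<state machine ARN>", json.load(f))
#   print(wait_for_execution(sfn, arn))
```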
You can review details of the test, e.g. results of the steady state and hypothesis evaluations, in the CloudWatch log groups associated with the Lambda functions. Both functions will stop executing once the results from an expression or alarm cause the test to fail, so the logs will not include results from any additional expressions or alarms that were not evaluated.
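The stop-on-first-failure behavior explains why later expressions or alarms may be missing from the logs. The sketch below illustrates that pattern only; it is not the Lambda functions' actual code, and the names and the shape of a "check" are hypothetical.

```python
from typing import Callable, Iterable, List, Tuple


def evaluate_fail_fast(
    checks: Iterable[Tuple[str, Callable[[], bool]]],
) -> Tuple[bool, List[str]]:
    """Run named checks in order and stop at the first one that fails.

    Returns the overall result and the names of the checks that were
    actually evaluated, which is all a log would contain.
    """
    evaluated: List[str] = []
    for name, check in checks:
        evaluated.append(name)
        if not check():
            return False, evaluated  # remaining checks are never evaluated
    return True, evaluated


checks = [
    ("e1", lambda: True),
    ("e2", lambda: False),  # fails, so e3 is skipped
    ("e3", lambda: True),
]
print(evaluate_fail_fast(checks))  # (False, ['e1', 'e2'])
```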
If working on feature branches, add the pre-commit configuration to your environment.
make venv
source .venv/bin/activate
pre-commit install
Name | Version |
---|---|
terraform | >= 1 |
aws | >= 4 |
Name | Version |
---|---|
archive | n/a |
aws | >= 4 |
No modules.
Name | Type |
---|---|
aws_cloudwatch_event_rule.continue_execution | resource |
aws_cloudwatch_event_target.continue_execution | resource |
aws_cloudwatch_log_group.lambda | resource |
aws_cloudwatch_log_group.sfn | resource |
aws_dynamodb_table.this | resource |
aws_iam_policy.this | resource |
aws_iam_role.this | resource |
aws_iam_role_policy_attachment.this | resource |
aws_lambda_function.this | resource |
aws_lambda_layer_version.layer | resource |
aws_lambda_permission.this | resource |
aws_sfn_state_machine.this | resource |
archive_file.this | data source |
aws_caller_identity.current | data source |
aws_partition.current | data source |
aws_region.current | data source |
Name | Description | Type | Default | Required |
---|---|---|---|---|
create_chaos_machine | Set to true to create the chaos machine. You might set this to false if your organization requires you to pre-provision IAM resources, which can be created by setting `create_iam_roles = true`. | `bool` | `true` | no |
create_iam_roles | Set to true to create IAM resources. If false, you must provide ARNs for the Lambda and state machine roles. | `bool` | `true` | no |
lambda_cloudwatch_log_group_retention_in_days | Retention period for the CloudWatch log groups associated with each Lambda function. | `number` | `30` | no |
lambda_continue_execution_role_arn | The ARN of the execution role for the continue-execution Lambda function. Required if `create_iam_roles = false`. | `string` | `""` | no |
lambda_environment_variables | Additional environment variables for all Lambda functions. Can be used to set the HTTPS_PROXY and NO_PROXY envs for Lambda functions. | `map(string)` | `{}` | no |
lambda_evaluate_hypothesis_role_arn | The ARN of the execution role for the evaluate-hypothesis Lambda function. Required if `create_iam_roles = false`. | `string` | `""` | no |
lambda_log_level | Log level for the Lambda functions. | `string` | `"INFO"` | no |
lambda_runtime | The runtime of the Lambda function. | `string` | `"python3.11"` | no |
lambda_security_group_ids | Optional list of security group IDs associated with the Lambda function. Required if attaching functions to a VPC. | `list(string)` | `[]` | no |
lambda_start_experiment_role_arn | The ARN of the execution role for the start-experiment Lambda function. Required if `create_iam_roles = false`. | `string` | `""` | no |
lambda_steady_state_role_arn | The ARN of the execution role for the steady-state Lambda function. Required if `create_iam_roles = false`. | `string` | `""` | no |
lambda_subnet_ids | Optional list of subnet IDs associated with the Lambda function. Required if attaching functions to a VPC. | `list(string)` | `[]` | no |
project_env | Name of the project environment, e.g. dev. | `string` | n/a | yes |
state_machine_cloudwatch_log_group_retention_in_days | Retention period for the CloudWatch log group associated with the state machine. | `number` | `30` | no |
state_machine_log_level | Log level for the state machine. | `string` | `"ERROR"` | no |
state_machine_role_arn | The ARN of the execution role for the state machine. Required if `create_iam_roles = false`. | `string` | `""` | no |
Name | Description |
---|---|
role_arns | n/a |