
Chaos Machine is a complete chaos engineering workflow that enables customers to run controlled chaos experiments with the AWS Fault Injection Service and test hypotheses related to system behavior using metrics and alarms from CloudWatch and Prometheus.


awslabs/chaos-machine

Chaos Machine

Chaos Machine is a complete chaos engineering workflow that enables customers to run controlled chaos experiments and test hypotheses related to system behavior. Chaos Machine uses metric and alarm data from both Amazon CloudWatch and Prometheus as inputs to evaluate system behavior before, during, and after the experiment. The Chaos Machine provides a simple, consistent way to organize and execute chaos experiments, and is suitable both for building and conducting ad-hoc experiments and for integrating into more sophisticated automation pipelines. Chaos Machine uses the AWS Fault Injection Service (FIS) to run controlled experiments, and AWS Step Functions and AWS Lambda for orchestration and execution.

Architecture

chaos-machine

Usage

  • Review the architecture diagram above to understand how the Chaos Machine works, then see the examples directory for working examples to reference.
  • Create the chaos-machine Lambda layer before deploying the stack. This layer also includes the JSON schema for validation, so the layer must be updated whenever the schema is updated.
make layer/package
terraform apply
provider "aws" {
  region = local.region
}

locals {
  region      = "us-west-2"
  project_env = "dev"
}

module "chaos-machine" {
  source = "/home/ec2-user/repos/chaos-machine"

  create_iam_roles = true
  project_env      = local.project_env

}

Chaos machine input and execution

Each execution of the Chaos Machine runs an experiment and tests a hypothesis. Input for the execution is defined by a JSON schema; see the example inputs for reference. The execution input includes two sets of measurables: steadyState and hypothesis. If you're new to chaos engineering and aren't familiar with these terms, I recommend reviewing the "Test resiliency using chaos engineering" best practice in the reliability pillar of the AWS Well-Architected Framework, the guidance in Principles of Chaos Engineering, and Chaos Engineering Stories. In most cases, the hypothesis will be that the system continues to behave in steady state during and after the chaos experiment. If this is the case, specify "hypothesis": "steadyState". However, the Chaos Machine also allows you to specify unique measurables for hypothesis. This is particularly useful for evaluating disaster recovery or analyzing specific metrics that may not be part of a well-defined steady state.
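To illustrate the "hypothesis": "steadyState" shorthand described above, here is a minimal Python sketch of how an execution input could be resolved. Only the steadyState and hypothesis keys come from this README. The contents of the measurables (the "placeholder" entries) are hypothetical, and this is not the project's actual code, so refer to the JSON schema and the example inputs for the real structure.

```python
import json


def resolve_hypothesis(execution_input: dict) -> dict:
    """Return the measurables that should be evaluated for the hypothesis."""
    hypothesis = execution_input["hypothesis"]
    # The string "steadyState" means "reuse the steadyState measurables".
    if hypothesis == "steadyState":
        return execution_input["steadyState"]
    # Otherwise the hypothesis carries its own, unique measurables.
    return hypothesis


# Hypothetical input: the inner structure is a placeholder, not the real schema.
example_input = json.loads("""
{
  "steadyState": {"placeholder": "steady-state measurables go here"},
  "hypothesis": "steadyState"
}
""")

print(json.dumps(resolve_hypothesis(example_input), indent=2))
```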

When defining the metrics and expressions to be used for the steadyState and hypothesis, I recommend starting by using the Amazon CloudWatch Metrics console to create and test example metrics and expressions with the system to be tested, and then using the Source tab to view and copy the definitions to the execution input file. You can also use this same approach to create CloudWatch alarms by creating a metric for the alarm, then clicking on the bell icon under Actions in the Graphed metrics tab to create the alarm. One of the key features of the chaos machine is that it uses the powerful built-in capabilities of both CloudWatch and Prometheus to evaluate the metric data, rather than having to handle that in the application logic. Thus, you're able to take full advantage of both of these tools to build almost unlimited evaluation expressions. If you use Prometheus for your application monitoring, see the Prometheus section for details.
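As a rough illustration of letting CloudWatch do the evaluation, the sketch below builds a raw metric query plus a metric math expression that returns 1 or 0, in the shape accepted by the CloudWatch GetMetricData API. The namespace, metric name, threshold, and query IDs are hypothetical, and the format the Chaos Machine expects in its execution input is defined by its JSON schema and may differ from this API shape.

```python
def build_metric_queries(namespace, metric_name, threshold, period=60):
    """Build a MetricDataQueries list for CloudWatch GetMetricData."""
    return [
        {
            # Raw metric, returned so the underlying data stays visible.
            "Id": "m1",
            "MetricStat": {
                "Metric": {"Namespace": namespace, "MetricName": metric_name},
                "Period": period,
                "Stat": "Sum",
            },
            "ReturnData": True,
        },
        {
            # Metric math evaluated by CloudWatch: 1 when m1 is below the
            # threshold for a data point, otherwise 0.
            "Id": "e1",
            "Expression": f"IF(m1 < {threshold}, 1, 0)",
            "Label": "below_threshold",
            "ReturnData": True,
        },
    ]


# Hypothetical namespace and metric name, for illustration only.
queries = build_metric_queries("Hypothetical/Namespace", "Requests", 100)
print(queries[1]["Expression"])
```

With boto3, a list like this could be passed as the MetricDataQueries argument of the CloudWatch client's get_metric_data call, together with StartTime and EndTime.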

A test begins when you start an execution of the state machine. During the SteadyState step, a Lambda function will retrieve the measurables defined in steadyState for the amount of time specified in the lookback to verify that the system has been behaving normally. If the evaluation passes, i.e. the application is in "steady state", the FIS experiment will be started. No measurables are checked during the experiment. Once the experiment is completed, by default, the hypothesis is tested based on data retrieved for the period between the experiment start time and end time. However, if you wish to test your hypothesis during application recovery after the experiment ended, you can use recoveryDelay and recoveryDuration in the execution input so that metric/alarm data will be retrieved for the period starting recoveryDelay seconds after the experiment end time and ending recoveryDuration seconds later.

chaos-machine-timeline
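The time windows described above can be sketched in Python as follows. This is an illustration of the timeline, not the project's implementation. It assumes that lookback is expressed in seconds and that the recovery window applies only when both recoveryDelay and recoveryDuration are provided; both assumptions should be checked against the JSON schema.

```python
from datetime import datetime, timedelta, timezone


def steady_state_window(now, lookback_seconds):
    """Window checked before the experiment: the last `lookback_seconds`."""
    return now - timedelta(seconds=lookback_seconds), now


def hypothesis_window(experiment_start, experiment_end,
                      recovery_delay=None, recovery_duration=None):
    """Window used to test the hypothesis after the experiment completes."""
    # Default: the period between the experiment start and end times.
    if recovery_delay is None or recovery_duration is None:
        return experiment_start, experiment_end
    # Recovery: start recoveryDelay seconds after the experiment end time,
    # and end recoveryDuration seconds later.
    start = experiment_end + timedelta(seconds=recovery_delay)
    return start, start + timedelta(seconds=recovery_duration)


start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
end = datetime(2024, 1, 1, 12, 10, tzinfo=timezone.utc)
print(hypothesis_window(start, end))
print(hypothesis_window(start, end, recovery_delay=60, recovery_duration=300))
```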

FIS experiment template

The Chaos Machine does not create the FIS experiments used during an execution of the machine. Therefore, you should create the experiment templates before beginning the steps below. See the FIS User Guide and the Chaos Engineering Workshop for details.

Examples

The examples are intended to provide users references for how to use the module(s), as well as testing/validating changes to the source code of the module. If contributing to the project, please be sure to make any appropriate updates to the relevant examples to allow maintainers to test your changes and to keep the examples up to date for users. Thank you!

  • Complete. This example will deploy the chaos machine and required IAM resources.
  • Prometheus. This modifies the complete example to enable using Prometheus metrics. See the Prometheus section for details.

Execution inputs

To help you get familiar with using the Chaos Machine in a realistic scenario, this project includes example execution inputs and automation (using the pytest framework) to run an experiment for the PetAdoptions application that is part of the Chaos Engineering Workshop. You can follow the instructions in the Bring your own AWS Account section of the workshop website. Then you can create the AZ Disruption experiment from the workshop, simulate user activity, and try running tests based on the example inputs.

  • Example chaos test execution inputs:
    • PetSiteAZDisruption-split.json: This example is configured to use CloudWatch metrics and specifies unique definitions for the steadyState and hypothesis.
    • PetSiteAZDisruption-same.json: This example is configured to use CloudWatch metrics and reuses the definitions in steadyState for hypothesis.
    • PetSiteAZDisruption-alarms.json: This example uses only CloudWatch alarms and reuses the definitions in steadyState for hypothesis.
      • To create an alarm used in this example, follow the instructions in the AZ Disruption page under Understand steady-state to get to the dashboard, click on the options for the OK (2xx) widget, choose View in metrics, and click on the bell icon under Actions in the Graphed metrics tab. Name the alarm PetSiteOkRate.
      • You can use this command to set an alarm status for testing purposes, e.g. to quickly set back to OK after a failed test.
      aws cloudwatch set-alarm-state --alarm-name "PetSiteOkRate" --state-value OK --state-reason "chaos experiment"
    • PetSiteAZDisruption-prom.json: This example is configured to use Prometheus metrics.
    • PetSiteAZDisruption-mixed.json: This example is configured to use a combination of CloudWatch metrics and alarms, and Prometheus metrics.
    • PetSiteAZDisruption-recovery.json: This example is configured for scenarios where you want to test a hypothesis during application recovery after the FIS experiment has ended.
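If you prefer to reset the alarm from a test script rather than the AWS CLI, the set-alarm-state command shown above maps to the CloudWatch set_alarm_state API. The helper below is a small sketch; the function name is hypothetical, and the client is passed in so the helper can be exercised without AWS credentials.

```python
def reset_alarm(cloudwatch_client, alarm_name, reason="chaos experiment"):
    """Force an alarm back to OK, e.g. after a failed test."""
    cloudwatch_client.set_alarm_state(
        AlarmName=alarm_name,
        StateValue="OK",
        StateReason=reason,
    )


# Usage with boto3 (requires AWS credentials and a configured region):
#   import boto3
#   reset_alarm(boto3.client("cloudwatch"), "PetSiteOkRate")
```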

Prometheus

The chaos-machine can also be configured to use Prometheus metrics instead of, or in combination with, CloudWatch metrics. This example deploys Prometheus to the EKS cluster used for the PetAdoptions application.

  • Prerequisites
    • Helm
    • kubectl
    • Amazon EBS CSI driver - I recommend installing it as a managed add-on. If you use EKS Pod Identity to grant permissions, be sure to also install the Amazon EKS Pod Identity Agent add-on. If you use IAM Roles for Service Accounts, be sure to also create the IAM Identity Provider for the cluster OIDC endpoint.
  • Configure kubectl to connect to the PetSite EKS cluster.
aws eks update-kubeconfig --region us-east-1 --name PetSite
  • Add the prometheus-community Helm repository and install Prometheus.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm upgrade -i prometheus prometheus-community/prometheus \
    --create-namespace \
    --namespace prometheus \
    --set alertmanager.persistence.storageClass="gp2" \
    --set server.persistentVolume.storageClass="gp2"
  • Verify the pods are in the READY state.
kubectl get pods -n prometheus
  • Expose Prometheus as a Kubernetes service. For simplicity, this example uses a service of type NodePort. However, for real-world or production use cases, you should consider exposing the service using the AWS Load Balancer Controller configured to be internal-facing. For either case, the steady-state and evaluate-hypothesis Lambda functions must be attached to the private subnets in the VPC where the nodes for EKS cluster are running to get access to the exposed, but private, Prometheus endpoint.
kubectl expose service prometheus-server -n prometheus --type NodePort --target-port 9090 --name prometheus-service
  • Get the URL for the prometheus-service to specify for the prometheusUrl property in the execution input.
    • List the services running in the prometheus namespace.
    kubectl get svc -n prometheus
    • Find the prometheus-service of type NodePort and note the exposed port - it is the number after the colon, e.g. 80:32288/TCP.
    • Get the IP address for one of the worker nodes, e.g. 10.1.241.75.
    kubectl get nodes -o wide
    • Use these values to specify the prometheusUrl in an execution input. For example:
    ...
    "prometheusUrl": "http://10.1.186.117:31793"
    ...
  • Attach the Lambda functions to the VPC. See the example deployment.
    • Navigate to the Subnets page in the VPC console, or use the commands below, to identify the two subnets named Services/Microservices/PrivateSubnet1 and Services/Microservices/PrivateSubnet2, and add the values for subnet ID to lambda_subnet_ids in module "chaos-machine" in the main.tf.
      • Note these subnets will also be used in the FIS experiment template as targets.
    aws ec2 describe-subnets --filters "Name=tag:Name,Values=*Services/Microservices/PrivateSubnet*" --query 'Subnets[*].SubnetId' --output text
    • Navigate to the EKS console and open the Networking tab for the cluster named "PetSite", or use the command below, to identify the cluster security group, and add that value to the lambda_security_group_ids variable in the main.tf.
    aws eks describe-cluster --name PetSite --query 'cluster.resourcesVpcConfig.clusterSecurityGroupId' --output text
    lambda_subnet_ids         = ["subnet-XXXXXXXXXXXXX", "subnet-YYYYYYYYYYYYYY"]
    lambda_security_group_ids = ["sg-ZZZZZZZZZZZZZZ"]
    • Update the stack.
  • Try to run an experiment using the example execution input. This example checks whether or not the rate of pet searches (from the load generator) drops below 100 per 2 min across all nodes where the service is running. The example uses two queries labeled m2 and e2, but only the expression e2 is technically necessary. The e2 query applies the comparison < bool 100, which returns a 1 or 0, similar to the way Expressions are used in the example CloudWatch metrics. The m2 query is included only for transparency, so you can see the raw data used for the evaluation. I highly recommend including this extra query for the raw data.
  • When you're finished, you can delete the prometheus-service and uninstall prometheus.
kubectl delete service prometheus-service -n prometheus
helm uninstall prometheus -n prometheus
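The steps above for deriving the prometheusUrl can be sketched in Python. The helper names and the metric name in the query are hypothetical; the /api/v1/query path is the standard Prometheus HTTP API endpoint for instant queries, and the final query only illustrates a < bool 100 comparison like the one used by the e2 expression.

```python
from urllib.parse import urlencode


def parse_node_port(port_spec: str) -> int:
    """Extract the node port from a PORT(S) value such as '80:32288/TCP'."""
    mapping = port_spec.split("/")[0]   # '80:32288'
    return int(mapping.split(":")[1])   # the number after the colon


def prometheus_url(node_ip: str, port_spec: str) -> str:
    """Combine a worker node IP and the NodePort into a prometheusUrl."""
    return f"http://{node_ip}:{parse_node_port(port_spec)}"


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a URL for the Prometheus instant query endpoint."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


base = prometheus_url("10.1.241.75", "80:32288/TCP")
print(base)  # http://10.1.241.75:32288

# Hypothetical metric name; the comparison returns 1 or 0 because of `bool`.
promql = "sum(increase(hypothetical_pet_searches_total[2m])) < bool 100"
print(instant_query_url(base, promql))
```

Opening the second URL from a host inside the VPC is a quick way to confirm that the endpoint is reachable before running an experiment.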

Running a test

To run an experiment and test a hypothesis with the Chaos Machine, you provide an input and start an execution of the state machine. You can do this using the AWS Step Functions console, AWS CLI, or any AWS SDK. I recommend starting with the Step Functions console for initial or ad-hoc experimentation, then using an SDK to integrate with your automated testing. The example in examples/tests uses pytest and can serve as a reference for creating automated tests with the Chaos Machine.

Step Functions console:

  • Find and select the state machine in the AWS Step Functions console.
  • Choose Start execution.
  • Copy and paste the execution input into the Input field.
  • Choose Start execution.
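The same steps can be scripted with an SDK. The sketch below uses the Step Functions start_execution and describe_execution calls as exposed by boto3. The function name and polling interval are hypothetical, and the client is passed in so the logic can be tested without AWS access. How the Chaos Machine maps a failed hypothesis to an execution status is not covered here, so inspect the execution output and logs as well.

```python
import json
import time

TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "TIMED_OUT", "ABORTED"}


def run_chaos_test(sfn_client, state_machine_arn, execution_input,
                   poll_seconds=30):
    """Start an execution and block until it reaches a terminal status."""
    started = sfn_client.start_execution(
        stateMachineArn=state_machine_arn,
        input=json.dumps(execution_input),  # the API expects a JSON string
    )
    while True:
        described = sfn_client.describe_execution(
            executionArn=started["executionArn"]
        )
        if described["status"] in TERMINAL_STATUSES:
            return described["status"]
        time.sleep(poll_seconds)


# Usage with boto3 (requires AWS credentials and a configured region):
#   import boto3
#   status = run_chaos_test(boto3.client("stepfunctions"), arn, execution_input)
```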

Pytest

export ENVIRONMENT={environment} # corresponds to the module variable project_env, e.g. dev
export AWS_DEFAULT_REGION={region} # set to the AWS region where the chaos machine is deployed, e.g. us-east-1
make pytest experiment-template-id={experimentTemplateId} # specify the value for the experimentTemplateId in the execution input

You can review details of the test, e.g. results of the steady state and hypothesis evaluations, in the CloudWatch log groups associated with the Lambda functions. Both functions will stop executing once the results from an expression or alarm cause the test to fail, so the logs will not include results from any additional expressions or alarms that were not evaluated.
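The fail-fast behavior described above explains why later expressions or alarms may be missing from the logs. The following sketch shows the general pattern only; the check names and callables are hypothetical and this is not the code used by the Lambda functions.

```python
def evaluate_fail_fast(checks):
    """Run checks in order and stop at the first one that fails."""
    results = []
    for name, check in checks:
        passed = check()
        results.append((name, passed))
        if not passed:
            break  # remaining checks are never evaluated or logged
    return results


checks = [
    ("e1", lambda: True),
    ("e2", lambda: False),     # fails, so evaluation stops here
    ("alarm1", lambda: True),  # never evaluated
]
print(evaluate_fail_fast(checks))  # [('e1', True), ('e2', False)]
```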

Precommit

If working on feature branches, add the pre-commit configuration to your environment.

make venv
source .venv/bin/activate
pre-commit install

Requirements

| Name | Version |
|------|---------|
| terraform | >= 1 |
| aws | >= 4 |

Providers

| Name | Version |
|------|---------|
| archive | n/a |
| aws | >= 4 |

Modules

No modules.

Resources

| Name | Type |
|------|------|
| aws_cloudwatch_event_rule.continue_execution | resource |
| aws_cloudwatch_event_target.continue_execution | resource |
| aws_cloudwatch_log_group.lambda | resource |
| aws_cloudwatch_log_group.sfn | resource |
| aws_dynamodb_table.this | resource |
| aws_iam_policy.this | resource |
| aws_iam_role.this | resource |
| aws_iam_role_policy_attachment.this | resource |
| aws_lambda_function.this | resource |
| aws_lambda_layer_version.layer | resource |
| aws_lambda_permission.this | resource |
| aws_sfn_state_machine.this | resource |
| archive_file.this | data source |
| aws_caller_identity.current | data source |
| aws_partition.current | data source |
| aws_region.current | data source |

Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| create_chaos_machine | Set to true to create the chaos machine. You might set this to false if your organization requires you to pre-provision IAM resources, which can be created by setting create_iam_roles = true. | bool | true | no |
| create_iam_roles | Set to true to create IAM resources. If false, you must provide ARNs for the Lambda and state machine roles. | bool | true | no |
| lambda_cloudwatch_log_group_retention_in_days | Retention period for the CloudWatch log groups associated with each Lambda function. | number | 30 | no |
| lambda_continue_execution_role_arn | The ARN of the execution role for the continue-execution Lambda function. Required if create_iam_roles = false. | string | "" | no |
| lambda_environment_variables | Additional environment variables for all Lambda functions. Can be used to set the HTTPS_PROXY and NO_PROXY environment variables for Lambda functions. | map(string) | {} | no |
| lambda_evaluate_hypothesis_role_arn | The ARN of the execution role for the evaluate-hypothesis Lambda function. Required if create_iam_roles = false. | string | "" | no |
| lambda_log_level | Log level for the Lambda functions. | string | "INFO" | no |
| lambda_runtime | The runtime of the Lambda functions. | string | "python3.11" | no |
| lambda_security_group_ids | Optional list of security group IDs associated with the Lambda functions. Required if attaching functions to a VPC. | list(string) | [] | no |
| lambda_start_experiment_role_arn | The ARN of the execution role for the start-experiment Lambda function. Required if create_iam_roles = false. | string | "" | no |
| lambda_steady_state_role_arn | The ARN of the execution role for the steady-state Lambda function. Required if create_iam_roles = false. | string | "" | no |
| lambda_subnet_ids | Optional list of subnet IDs associated with the Lambda functions. Required if attaching functions to a VPC. | list(string) | [] | no |
| project_env | Name of the project environment, e.g. dev. | string | n/a | yes |
| state_machine_cloudwatch_log_group_retention_in_days | Retention period for the CloudWatch log group associated with the state machine. | number | 30 | no |
| state_machine_log_level | Log level for the state machine. | string | "ERROR" | no |
| state_machine_role_arn | The ARN of the execution role for the state machine. Required if create_iam_roles = false. | string | "" | no |

Outputs

| Name | Description |
|------|-------------|
| role_arns | n/a |
