Chaos Engineering Whitepaper v0.1

What is Chaos Engineering

Chaos Engineering is the practice of injecting controlled failure into a system, application, or sub-routine to determine whether it behaves as expected under degraded conditions.

History

Though Chaos Engineering has been practiced for some time in large corporations, it has only recently become popular, largely due to the work of Netflix and the emergence of Chaos Monkey. Alongside Chaos Monkey, the Principles of Chaos Engineering rose as an early description of the various characteristics of the practice. The practice has evolved quite a bit since the writing of the Principles of Chaos Engineering, but the document laid the foundation for the core tenets of Chaos Engineering.

Principles

Experimentation

The practice of Chaos Engineering should be performed as a form of experimentation. Each experiment should follow the scientific method and should contain a control group or steady state, a hypothesis of behavior under duress, a failure mode which imposes duress, metric-based measurements, and outcomes.
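
As a minimal sketch of this structure (the function names, URL, and 5% error-rate threshold below are illustrative assumptions, not a prescribed API), an experiment can be expressed as a steady-state measurement, a hypothesis, an injected failure, and a metric-based outcome:

```python
# Minimal sketch of the experiment structure described above.
# The helper names, URL, and 5% threshold are illustrative assumptions.
import urllib.request


def measure_error_rate(url: str, samples: int = 20) -> float:
    """Probe the service and return the fraction of failed requests."""
    failures = 0
    for _ in range(samples):
        try:
            urllib.request.urlopen(url, timeout=2)
        except Exception:
            failures += 1
    return failures / samples


def run_experiment(url: str, inject_failure, stop_failure) -> dict:
    # 1. Control group / steady state.
    baseline = measure_error_rate(url)
    # 2. Hypothesis: the error rate stays below 5% while the failure is active.
    hypothesis_threshold = 0.05
    # 3. Failure mode imposing duress (supplied as a callable).
    inject_failure()
    try:
        under_duress = measure_error_rate(url)
    finally:
        stop_failure()
    # 4. Metric-based measurement and outcome.
    return {
        "baseline_error_rate": baseline,
        "error_rate_under_duress": under_duress,
        "hypothesis_held": under_duress <= hypothesis_threshold,
    }
```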

Blast Radius

Practitioners should start with the smallest "blast radius" possible before increasing the effect of the failure mode. Usually this means testing on a small portion of staging traffic before moving to testing in production. Ultimately, the goal of Chaos Engineering is to ensure production systems behave well under duress, so testing should eventually advance to live environments.
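
As a rough sketch of what a small blast radius can look like in practice (the host names and the 5% starting fraction are illustrative assumptions), an experiment can be limited to a small random subset of targets before widening:

```python
# Sketch: limit an experiment to a small, random sample of targets.
# The 5% starting fraction and the host names are illustrative assumptions.
import math
import random


def pick_blast_radius(targets: list[str], fraction: float = 0.05) -> list[str]:
    """Return a small random subset: at least one target, never all of them
    when more than one exists."""
    count = max(1, math.floor(len(targets) * fraction))
    count = min(count, max(1, len(targets) - 1))
    return random.sample(targets, count)


staging_hosts = [f"staging-node-{i}" for i in range(40)]
print(pick_blast_radius(staging_hosts))  # roughly 5% of staging first
# Only after repeated safe runs would the fraction (and environment) grow.
```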

Automation

In a world where rapidly changing environments are the norm, it is important to work toward automating experiments so as to prevent the dreaded Drift into Failure. By automating experiments the practitioner effectively immunizes their system, application or sub-routine against a particular failure mode.
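
A minimal sketch of such automation, assuming a run_experiment helper like the one sketched earlier and an illustrative notify hook, is simply to re-run the experiment on a schedule and flag any regression:

```python
# Sketch: re-run an experiment periodically so regressions surface early.
# run_experiment and notify are assumed helpers, not a real API.
import time


def notify(message: str) -> None:
    print(f"[chaos-alert] {message}")  # stand-in for a real alerting hook


def continuous_chaos(run_experiment, interval_seconds: int = 24 * 3600) -> None:
    while True:
        result = run_experiment()
        if not result["hypothesis_held"]:
            notify(f"Hypothesis no longer holds: {result}")
        time.sleep(interval_seconds)
```

In practice this loop would more likely live in a CI pipeline or a scheduler than in a long-running process.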

Why practice Chaos Engineering

Harness and Improve System Reliability

Chaos Engineering as a practice lends itself to exploring facets of a system's reliability such as its resilience, scalability, security, safety or privacy.

All these dimensions of a system have a direct impact on end-users' perception. All systems experience unplanned events that cause degradation of performance. Experiencing degraded conditions may leave a lasting negative opinion of the service a system renders. In the worst case scenario, poor conditions may lead to legal issues for the service provider.

In effect, Chaos Engineering is a unique practice to enable an organization to harness and improve system reliability proactively and in a controlled manner, rather than respond to unplanned events under pressure.

Benefits for Cloud Native Systems

Chaos Engineering lends itself well to Cloud Native Systems which, by nature, provide the platform for strong system reliability.

Properties of Cloud Native Systems that benefit Chaos Engineering:

  • dynamic: resources and services are designed to come and go, and cannot be trusted to remain for any specific amount of time
  • isolation of concerns: outcomes of a well-focused chaos experiment should be easier to make sense of in a cloud environment
  • automation: being API-driven makes them a great candidate for experimental automation needs
  • value observability: cloud native systems expose mechanisms for an operator to observe the live system's behavior, which is an inherent expectation of a well-crafted chaos experiment

Through these properties, Chaos Engineering experiments can be designed, implemented and automated to provide continuous auditing of the impact of degraded conditions in the system. This feedback loop provides great insights about platform and application behavior under stress, allowing both to adapt and improve accordingly.

With that said, while Cloud Native Systems abstract large chunks of complexity away from users, this complexity does not disappear altogether. It is merely made simpler to deal with. Operators, and developers alike, must understand the stack they rely on to offer relevant responses in face of adversarial conditions.

In a nutshell, Chaos Engineering reminds the actors that the underlying system, for all its benefits, cannot be trusted nor become a black box.

Actors must remain active, and even proactive, in the lifecycle of their system.

Software and Operational Practices In Production

Chaos Engineering is a new practice in the toolbox of product teams. While most operational best practices focus upstream, Chaos Engineering looks downstream to production once the system is live.

Typically, testing happens either in development or a production-lookalike environment but it is seldom performed in production after the system is in the hands of users. In other words, testing is most often performed in safe conditions whereas Chaos Engineering factors in a certain level of risk.

Chaos Engineering does not treat the results of an experiment in a binary - passed|not passed - fashion. Instead, results are meant to be analysed and correlated with the system's state and events at the time the experiment took place.

With that said, the practice of Chaos Engineering should benefit from well-defined policies such as automation or reporting.

Use Cases

So, what are the use cases for Chaos Engineering? As stated in the overview, there is no single golden rule that can be applied across the board.

However, here are a few areas where it makes sense to invest in Chaos Engineering:

  • Dependency on third-party providers that are out of actors' control: what is the effect of using provider X, when that provider goes down?
  • Network dependent services: Can we cope with a link failure when the network (internal or otherwise) cannot be trusted?
  • Service release impact may not be tested for peripheral aspects of the system: How does a new poorly performing release of one of the internal services impact our system?
  • Testing the engagement process and ensuring employees understand how to respond to pages and where playbook resources are: Do the actors know how to react in the case of failures, especially cascading failure modes?
  • Surfacing unknown/transitive dependencies within a system: How well do the actors understand the dependencies within the system, especially as complexity increases?
  • Testing for service resilience: Are the services in the distributed system resilient and able to gracefully handle (and recover from) unexpected failures?
  • Service inter-dependency: How well does the system handle a degraded service that other services depend on?
  • Multi-cloud migration: Has the appropriate stress testing occurred on a distributed system that is moving to the cloud or spreading across many clouds?

Practicing Chaos Engineering

Getting Started With Chaos Engineering

A warning about the word Chaos

"Chaos" can be a scary word because of the ideas associated with it. It is important to notice that the word chaos is generic term for "complete disorder". Chaos engineering is the discipline to run experiments to expose that chaos, to make it visible. In our case, the chaos is in the system already. By exposing the inherent chaos of a system, that system is better understood and improvements can be made in order to make more it resilience. In other words, to make the chaos less affecting the availability of the system.

If you see someone with a syringe who says to you, "I'm going to inject you with something, it's going to be great", would you trust that person? Probably not. If instead you see a trained practitioner or a doctor with a syringe, the syringe contains a vaccine, and the doctor explains what it is going to do to your body along with its potential benefits, would you trust that person? Probably more than in the previous case.

So it is important to mention the benefits of Chaos Engineering when talking about it to someone who is not familiar with it. Talk about the results of the experiments as well as the experiments themselves.

System requirements to endure Chaos Engineering

Chaos Engineering is a fairly disruptive practice as it takes the position that, by forcing the system away from its normal state, we can learn subtle aspects of its behavior that other testing practices wouldn't unearth.

To achieve useful learnings, however, the system needs to be in a fairly reliable and predictable state already. If the system is too fragile or its reactions to failures are completely unknown, the learnings would either be trivial or come with a high volume of false positives. The cost of running Chaos Engineering experiments under those conditions would be higher than traditional tests without providing more benefit. To maximize the Return on Investment of Chaos Engineering, it's important to begin with an understanding of your current steady state.
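
As a minimal sketch of what capturing that steady state can look like in practice (the endpoint and the latency target below are illustrative assumptions, not recommendations), one can record a simple baseline of key metrics before running any experiment:

```python
# Sketch: capture a simple steady-state baseline before experimenting.
# The URL and the p95 latency target are illustrative assumptions.
import statistics
import time
import urllib.request


def sample_latencies(url: str, samples: int = 50) -> list[float]:
    """Measure request latencies against a healthy endpoint."""
    latencies = []
    for _ in range(samples):
        start = time.monotonic()
        urllib.request.urlopen(url, timeout=5)
        latencies.append(time.monotonic() - start)
    return latencies


latencies = sample_latencies("http://localhost:8080/health")
p95 = statistics.quantiles(latencies, n=20)[18]  # 95th percentile
print(f"steady-state p95 latency: {p95:.3f}s (target: < 0.300s)")
```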

As we suggested, Chaos Engineering supports exploring the reliability of a system. While this whitepaper does not aim to be authoritative on the subject, it is good to check the basics of your system:

  • Security: Look at the OWASP project, which provides good security approaches for various use cases.
  • Logging: Your system needs to leave traces about what it is doing so you can investigate it. Central logging is often a requirement to make this process much smoother and faster.
  • Monitoring: Your system constantly sends pulses about its current state. Those signals tell you something about how it fares when you capture the right metrics.
  • Automation: Whenever you deal with enough complexity, automation can save your day. There are many levels of automation but Continuous Integration is a great starting point. Tackle this before moving on to Continuous Delivery, Deployment and perhaps even GitOps.
  • Testing: You want to move fast but with a high degree of confidence. Testing is a must-do practice to achieve this confidence and develop meaningful hypotheses.
  • Alerting: The people to be notified when a system behaves abnormally. An experiment can consist of testing that those alerts are triggered.

Altogether, these have become common practices in building great software infrastructure and applications. Chaos Engineering teams will thrive if engineering teams have already figured these out for themselves. With that said, even basic Chaos Engineering experiments can help you get useful information about your system and how to improve it.

Chaos Engineering in production

In test environments, it is common to create conditions that best suit the testing scenarios you are running. Chaos Engineering experiments that are run in these environments may produce tailored outcomes rather than real-world reactions to the tests. Running Chaos Engineering experiments in production, therefore, is more useful because you will get a more realistic view into how your system will really react to failure.

While running in production may be the best approach, most organizations will want to start by running Chaos Engineering experiments in test environments until they are comfortable with the practice. This is completely acceptable but the real value of the experiments comes from running them in production so that should always be the end goal.

Communicate with the Organization

This is where we need to continue the discussion and figure out how far we want/can go with the patterns.

Should we talk game days for instance? Observability? ((comment from Lorinda - I think we should talk about game days, observability, logging/auditing and notifications. Those are all different examples of communication that happen at different stages. But in our experience, teams want the reassurance that communication will occur at key moments in the process and on a continuous basis))

The following phases may or may not be useful. I think it would be valuable if we could describe what it means to deal with chaos in those various cases, but is it the right place?

Chaos Engineering Perturbations

Degrade Network Conditions

The network is one of the greatest sources of complexity, and fragility, in any system. One often hears that the network cannot be trusted.

It is therefore a prime place for any Chaos Engineer to look for weaknesses. The purpose is to gauge how one service copes with poor communication with a service it depends on.

Common experiments can therefore:

  • add latency to a network call (on receive, send, or both; sketched after this list)
  • add jitter to create noise in the messages
  • lose any number of packets along the channel
  • prevent name discovery
  • randomly close connections
  • inject random data into streams
  • send all data to nowhere

Basically, any number of unhappy scenarios you can think of during a network exchange is a candidate.
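
For instance, on a Linux host the latency and jitter experiments above can be sketched with tc and netem (the interface name and delay values are illustrative assumptions; this requires root privileges and should only target a deliberately small blast radius):

```python
# Sketch: add 200ms of latency (with 50ms jitter) to an interface using
# tc/netem, then remove it. Interface and delay values are assumptions.
# Requires root privileges on a Linux host with the netem qdisc available.
import subprocess


def add_latency(interface: str = "eth0", delay: str = "200ms", jitter: str = "50ms") -> None:
    subprocess.run(
        ["tc", "qdisc", "add", "dev", interface, "root", "netem",
         "delay", delay, jitter],
        check=True,
    )


def remove_latency(interface: str = "eth0") -> None:
    subprocess.run(
        ["tc", "qdisc", "del", "dev", interface, "root", "netem"],
        check=True,
    )
```

Pairing add_latency with the measurement helpers from the earlier sketches turns this into a complete experiment: steady state, injected latency, measurement, and cleanup.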

Vary Computing Resources

Resources allocated to a service should not be trusted to always be available. Whether it is services fighting for computing resources or attached devices going away, it is critical that you understand the impact of any of those situations on your system; a CPU-pressure sketch follows the list below.

  • Reduce the available amount of CPU or memory to a service
  • Detach a device
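
A minimal sketch of the CPU case (the duration is an illustrative assumption) is to saturate the available cores for a bounded period and observe how a co-located service behaves while competing for compute:

```python
# Sketch: burn all available CPU cores for a bounded period so that a
# co-located service has to compete for compute. Duration is an assumption.
import multiprocessing
import time


def burn_cpu(seconds: float) -> None:
    end = time.monotonic() + seconds
    while time.monotonic() < end:
        pass  # busy loop


def stress_cpu(seconds: float) -> None:
    workers = [
        multiprocessing.Process(target=burn_cpu, args=(seconds,))
        for _ in range(multiprocessing.cpu_count())
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()


if __name__ == "__main__":
    stress_cpu(30.0)  # 30 seconds of full CPU pressure (assumed duration)
```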

Stress to the Limits

You may have designed your system to scale and handle a certain level of stress. You may even have proven it under specific conditions through performance testing. However, those types of testing don't usually look at how the system deals with such stress while also enduring degraded conditions (such as a poor network or a lost device), even though failures often pile up precisely because of those high loads.
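
A sketch of what this can look like (the URL is an illustrative assumption, and the perturbation itself, such as the latency injection sketched earlier, is assumed to be activated separately) is to drive concurrent load and compare the observed error rate against the steady-state baseline:

```python
# Sketch: drive concurrent load while a perturbation (e.g. injected latency)
# is active, and report the error rate. The URL is an illustrative assumption.
import concurrent.futures
import urllib.request


def hit(url: str) -> int:
    try:
        return urllib.request.urlopen(url, timeout=5).status
    except Exception:
        return -1


def load_under_duress(url: str, requests_total: int = 500, workers: int = 50) -> float:
    """Return the error rate observed while the system is under load."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        statuses = list(pool.map(hit, [url] * requests_total))
    return sum(1 for s in statuses if s != 200) / requests_total
```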

Simulate Data Loss and its Recovery

Data loss is one of the most critical failures any service can face. Responding to it often goes beyond engineering boundaries. Understanding the impacts of data loss is therefore highly valuable to a Chaos Engineer.

Simulating data loss often depends on the model, architecture and storage of the system, so experimenting with it will take different shapes.

  • Remove storage
  • Change permissions so a service cannot read or write (sketched after this list)
  • Drop messages from a queue to prevent eventual consistency
  • Duplicate messages and deal with integrity
  • Perform a backup
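
As an example, the permission-based experiment above can be sketched by temporarily revoking read and write access to a data directory (the path in the usage comment is an illustrative assumption; the original mode must always be restored):

```python
# Sketch: temporarily make a data directory unreadable/unwritable to the
# service, then restore it. The path in the usage example is an assumption.
import os
import stat
from contextlib import contextmanager


@contextmanager
def revoke_access(path: str):
    original_mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, 0o000)  # service can no longer read or write
    try:
        yield
    finally:
        os.chmod(path, original_mode)  # always restore access


# Usage: observe how the service behaves while the data directory is gone.
# with revoke_access("/var/lib/myservice/data"):
#     run_observations()  # assumed measurement helper
```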

Change ACL Permissions

Never underestimate the impact of a configuration error somewhere that changes the ACL of one service towards another.

Provoke a Security Breach

While security is its own topic, Chaos Engineers must not set it aside but should, on the contrary, bring security experiments to the fore.

Things to look for:

  • Simulate an expired certificate
  • Broken authentication
  • Various common injections (SQL...)
  • Loading from unsafe sources of data

What a Chaos Engineer may be interested in is how the system copes with the dire situation, starting with whether the system detected the attack in the first place. The expired-certificate case is sketched below.
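
As a sketch of the expired-certificate case (the public test host used below is an assumption for illustration), one can verify that a client performing default certificate verification actually rejects the expired certificate rather than silently accepting it:

```python
# Sketch: confirm that a TLS client with default verification rejects an
# expired certificate. The test host is an assumption (a public test endpoint).
import socket
import ssl

HOST = "expired.badssl.com"


def rejects_expired_certificate(host: str = HOST, port: int = 443) -> bool:
    context = ssl.create_default_context()  # default: verify certificates
    try:
        with socket.create_connection((host, port), timeout=5) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return False  # handshake succeeded: expired cert was accepted
    except ssl.SSLCertVerificationError:
        return True  # expired certificate was rejected as expected


print("expired certificate rejected:", rejects_expired_certificate())
```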

Assume application fails to restart

How do you handle an application that does not restart? If it is a new version, is it incompatible with the rest of the system? Can you roll it back?

If it is an existing application, do you have a disk space issue? A change of permissions on the filesystem?
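
A few of these checks can be scripted as a quick first diagnostic when a restart fails (the paths and thresholds below are illustrative assumptions):

```python
# Sketch: quick checks when an application fails to restart.
# The data directory path and the 1 GiB threshold are illustrative assumptions.
import os
import shutil


def diagnose(data_dir: str = "/var/lib/myservice", min_free_bytes: int = 1 << 30) -> None:
    usage = shutil.disk_usage(data_dir)
    if usage.free < min_free_bytes:
        print(f"low disk space: {usage.free} bytes free on {data_dir}")
    if not os.access(data_dir, os.R_OK | os.W_OK):
        print(f"filesystem permissions changed: cannot read/write {data_dir}")
```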

Chaos Engineering Automation

Continuous Chaos Engineering

Chaos Engineering Reporting

Report Findings

Landscape

  • Kubernetes-native chaos engineering

  • Blockade - Docker-based utility for testing network failures and partitions in distributed applications.

  • Chaos Monkey - Version 2 of Chaos Monkey by Netflix

  • Chaos Toolkit - A chaos engineering toolkit to help you build confidence in your software system.

  • chaos-lambda - Randomly terminate ASG instances during business hours.

  • ChaoSlingr - Introducing Security Chaos Engineering. ChaoSlingr focuses primarily on experimentation on AWS infrastructure to proactively instrument system security failure.

  • Drax - DC/OS Resilience Automated Xenodiagnosis tool. It helps to test DC/OS deployments by applying a Chaos Monkey-inspired, proactive and invasive testing approach.

  • Gremlin - Chaos-as-a-Service - Gremlin is a platform that offers everything you need to do Chaos Engineering. Supports all cloud infrastructure providers, Kubernetes, Docker and host-level chaos engineering. Offers an API and control plane.

  • Litmus - An open source framework for chaos engine based qualification of Kubernetes environments

  • MockLab - API mocking (Service Virtualization) as a service which enables modeling real world faults and delays.

  • Monkey - The Infection Monkey is an open source security tool for testing a data center's resiliency to perimeter breaches and internal server infection. The Monkey uses various methods to self-propagate across a data center and reports success to a centralized Monkey Island server.

  • Muxy - A chaos testing tool for simulating real-world distributed system failures.

  • Namazu - Programmable fuzzy scheduler for testing distributed systems.

  • Pod-Reaper - A rules based pod killing container. Pod-Reaper was designed to kill pods that meet specific conditions that can be used for Chaos testing in Kubernetes.

  • Pumba - Chaos testing and network emulation for Docker containers (and clusters).

  • The Simian Army - A suite of tools for keeping your cloud operating in top form.

  • Toxiproxy - A TCP proxy to simulate network and system conditions for chaos and resiliency testing.

  • Wiremock - API mocking (Service Virtualization) which enables modeling real world faults and delays

Appendix A: Additional Material