# Chaos Engineering Workflow - Manual Step-by-Step Execution

This notebook demonstrates the chaos engineering workflow by executing each agent manually, one at a time. This allows you to see exactly what happens at each step and understand the workflow pattern in detail.

## Workflow Overview

The chaos engineering workflow consists of 6 sequential steps:

1. **Hypothesis Generation** - Analyze the AWS workload and generate chaos engineering hypotheses
2. **Hypothesis Prioritization** - Prioritize hypotheses based on impact and feasibility
3. **Experiment Design** - Create AWS FIS experiment templates for prioritized hypotheses
4. **FIS Setup** - Set up experiments in AWS FIS with real resources
5. **Experiment Execution** - Execute selected experiments and monitor results
6. **Results Analysis** - Analyze results and generate insights for system improvements
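
Every step in this notebook follows the same pattern: record a start time, call one agent with a prompt, then report the elapsed time. The sketch below illustrates that pattern in isolation. The `run_step` helper and the two stand-in agents are hypothetical names introduced for illustration only; the notebook itself inlines this logic in each cell and uses the real agents.

```python
import time
from typing import Any, Callable, Tuple


def run_step(name: str, agent: Callable[[str], Any], prompt: str) -> Tuple[Any, float]:
    """Call one agent with a prompt and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = agent(prompt)
    elapsed = time.perf_counter() - start
    print(f"{name} completed in {elapsed:.2f} seconds")
    return result, elapsed


# Stand-in agents (hypothetical): plain callables that take a prompt string.
def fake_hypothesis_agent(prompt: str) -> str:
    return f"hypotheses for: {prompt}"


def fake_prioritization_agent(prompt: str) -> str:
    return f"priorities for: {prompt}"


steps = [
    ("Hypothesis Generation", fake_hypothesis_agent, "analyze repo"),
    ("Hypothesis Prioritization", fake_prioritization_agent, "rank hypotheses"),
]

# Run the steps in order; each entry maps a step name to (result, seconds).
results = {name: run_step(name, agent, prompt) for name, agent, prompt in steps}
```

Keeping the steps in an ordered list makes the dependency between them explicit: each agent reads what the previous one wrote to the database, so the order must not change.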

## Target Application

AWS Retail Store Sample App: https://github.com/aws-containers/retail-store-sample-app.git

## Setup and Imports

In [None]:
import sys
import os
from pathlib import Path
from datetime import datetime

# Setup Python path
current_path = Path.cwd()
parent_path = current_path.parent
sys.path.insert(0, str(parent_path))
sys.path.insert(0, str(current_path))

try:
    del os.environ['AWS_CA_BUNDLE']  # Remove AWS_CA_BUNDLE if set to avoid SSL verification issues
except KeyError:
    pass

print(f"📁 Running from src directory: {current_path}")
print(f"📁 Added parent directory to path: {parent_path}")

# Import agents
from HypothesisGeneratorAgent.agent import agent as hypothesis_agent
from HypothesisPrioritizationAgent.agent import agent as prioritization_agent
from ExperimentDesignAgent.agent import agent as design_agent
from ExperimentsAgent.agent import agent as experiments_agent
from LearningAndIterationAgent.agent import agent as learning_agent

print("✅ All agents imported successfully!")

## AWS Environment Configuration

In [None]:
# Set AWS environment variables if needed
# Replace these values with your actual AWS account details

# Optional: Set AWS credentials if not using IAM roles or AWS CLI profiles
# os.environ['AWS_ACCESS_KEY_ID'] = 'your-access-key'
os.environ['AWS_DEFAULT_REGION'] = 'us-west-2'
# os.environ['AWS_SECRET_ACCESS_KEY'] = 'your-secret-key'
# os.environ['AWS_SESSION_TOKEN'] = 'your-session-token'  # Only for temporary credentials

print("🔧 AWS Environment Configuration:")
print(f"   Region: {os.environ.get('AWS_DEFAULT_REGION', 'us-west-2 (default)')}")
print(f"   Access Key: {'Set' if os.environ.get('AWS_ACCESS_KEY_ID') else 'Using default credentials'}")
print("✅ AWS environment configured!")

## Knowledge Base Configuration

In [None]:
# Configure Knowledge Base ID for the retrieve tool
import boto3

def get_knowledge_base_id():
    """Retrieve Knowledge Base ID from CDK stack outputs"""
    try:
        cf_client = boto3.client('cloudformation')
        stack_name = 'ChaosAgentDatabaseStack'
        response = cf_client.describe_stacks(StackName=stack_name)
        
        for stack in response['Stacks']:
            for output in stack.get('Outputs', []):
                if output['OutputKey'] == 'KnowledgeBaseId':
                    return output['OutputValue']
        return None
    except Exception as e:
        print(f"⚠️  Error retrieving Knowledge Base ID: {e}")
        return None

kb_id = get_knowledge_base_id()
if kb_id:
    os.environ['KNOWLEDGE_BASE_ID'] = kb_id
    print(f"✅ Knowledge Base ID configured: {kb_id}")
else:
    print("⚠️  Knowledge Base ID not found - some tools may not work")
    print("💡 You can manually set it: os.environ['KNOWLEDGE_BASE_ID'] = 'your-kb-id'")
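
The lookup above walks the `Stacks` and `Outputs` lists returned by `describe_stacks`. To see that logic without calling AWS, the sketch below applies it to a hand-written response of the same shape. The `find_stack_output` helper and the `EXAMPLEKBID` value are illustrative assumptions, not part of the project.

```python
from typing import Any, Dict, Optional


def find_stack_output(response: Dict[str, Any], output_key: str) -> Optional[str]:
    """Return the value of the first stack output whose OutputKey matches."""
    for stack in response.get("Stacks", []):
        for output in stack.get("Outputs", []):
            if output.get("OutputKey") == output_key:
                return output.get("OutputValue")
    return None


# Hand-written sample in the shape of a describe_stacks response.
# The ID is a placeholder, not a real Knowledge Base ID.
sample_response = {
    "Stacks": [
        {
            "StackName": "ChaosAgentDatabaseStack",
            "Outputs": [
                {"OutputKey": "KnowledgeBaseId", "OutputValue": "EXAMPLEKBID"},
            ],
        }
    ]
}

print(find_stack_output(sample_response, "KnowledgeBaseId"))
print(find_stack_output(sample_response, "MissingKey"))
```

Returning `None` for a missing key, rather than raising, matches the notebook's behavior of warning and continuing when the Knowledge Base ID is unavailable.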

## Optional EKS Setup
This step is needed only if your experiments target pods running on EKS. It gives the FIS execution role access to the containers in the cluster.
See the AWS FIS documentation for full details.

In [None]:
# Step: Configure EKS Service Account and RBAC for FIS experiments

import boto3
import yaml
import subprocess
import json
import os
from IPython.display import Markdown, display

# Display section header
display(Markdown("## 🔐 Configure FIS Access to EKS Cluster"))
display(Markdown("Setting up Service Account and RBAC permissions for FIS to access all namespaces"))

# Get the FIS execution role ARN
cf_client = boto3.client('cloudformation')
response = cf_client.describe_stacks(StackName='ChaosAgentDatabaseStack')
fis_role_arn = None

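for output in response['Stacks'][0].get('Outputs', []):
    # Not every output has an ExportName, so use .get() to avoid a KeyError
    if output.get('ExportName') == 'ChaosAgentFISExecutionRoleArn':
        fis_role_arn = output['OutputValue']
        break

# Fail early: the access entry step below cannot run without the role ARN
if fis_role_arn is None:
    raise RuntimeError("FIS execution role ARN not found in ChaosAgentDatabaseStack outputs")

print(f"FIS Execution Role ARN: {fis_role_arn}")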

# Update kubeconfig to point to the retail-store cluster
print("Updating kubeconfig to point to the retail-store cluster...")
region = os.environ.get('AWS_DEFAULT_REGION', 'us-west-2')
subprocess.run(
    ["aws", "eks", "update-kubeconfig", "--name", "retail-store", "--region", region], 
    check=True
)

# Get all namespaces in the cluster
print("Getting all namespaces...")
# No shell is involved, so the jsonpath expression needs no surrounding quotes
namespaces_output = subprocess.check_output(
    ["kubectl", "get", "namespaces", "-o", "jsonpath={.items[*].metadata.name}"]
).decode().split()
print(f"Found namespaces: {namespaces_output}")

# Create service account and RBAC configuration for each namespace
for namespace in namespaces_output:
    print(f"\nConfiguring namespace: {namespace}")
    
    # Create RBAC configuration for the namespace
    rbac_config = f"""
kind: ServiceAccount
apiVersion: v1
metadata:
  namespace: {namespace}
  name: fis-service-account

---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: {namespace}
  name: fis-experiments-role
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: [ "get", "create", "patch", "delete"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["create", "list", "get", "delete", "deletecollection"]
- apiGroups: [""]
  resources: ["pods/ephemeralcontainers"]
  verbs: ["update"]
- apiGroups: [""]
  resources: ["pods/exec"]
  verbs: ["create"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: fis-experiments-role-binding
  namespace: {namespace}
subjects:
- kind: ServiceAccount
  name: fis-service-account
  namespace: {namespace}
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: fis-experiment
roleRef:
  kind: Role
  name: fis-experiments-role
  apiGroup: rbac.authorization.k8s.io
"""
    
    # Save RBAC config to a file
    rbac_file = f'fis-rbac-{namespace}.yaml'
    with open(rbac_file, 'w') as f:
        f.write(rbac_config)
    
    # Apply the RBAC configuration
    print(f"Applying RBAC configuration for namespace {namespace}...")
    subprocess.run(["kubectl", "apply", "-f", rbac_file], check=True)
    os.remove(rbac_file)  # Clean up the file after applying

# Create IAM identity mapping for the FIS role
print("\nCreating IAM identity mapping for the FIS role...")
try:
    # Use an EKS access entry to map the FIS role to the Kubernetes user
    print("Attempting to create access entry...")
    subprocess.run([
        "aws", "eks", "create-access-entry",
        "--cluster-name", "retail-store",
        "--principal-arn", fis_role_arn,
        "--username", "fis-experiment",
        "--region", region
    ], check=True)
    print("✅ Successfully created EKS access entry")
except subprocess.CalledProcessError as e:
    # The command also fails if the access entry already exists from a previous run
    print(f"⚠️  Error creating access entry (it may already exist): {e}")


# Display summary
display(Markdown("""
### FIS Access Configuration Complete

The following components have been configured:

1. **Service Account**: `fis-service-account` created in each namespace
2. **RBAC Roles**: Permissions to manage pods and execute commands
3. **IAM Integration**: FIS role mapped to Kubernetes user `fis-experiment`

This enables the following FIS actions:
- `aws:eks:pod-delete`
- `aws:eks:pod-network-latency`
- `aws:eks:pod-network-packet-loss`
- `aws:eks:pod-network-bandwidth`
- `aws:eks:pod-io-stress`
- `aws:eks:pod-cpu-stress`
- `aws:eks:pod-memory-stress`

### Using in FIS Experiments

When creating FIS experiments, use the following parameters:
```json
{
  "kubernetesServiceAccount": "fis-service-account",
  "duration": "PT5M"
}
```

The service account must match the one created in the target namespace.
"""))


# Step 1: Hypothesis Generation

The first step analyzes the AWS Retail Store Sample App repository to generate chaos engineering hypotheses. The HypothesisGeneratorAgent examines the application architecture, identifies potential failure points, and creates testable hypotheses.

In [None]:
print("🔬 STEP 1: HYPOTHESIS GENERATION")
print("=" * 50)
print("📝 What this step does:")
print("   • Analyzes the AWS Retail Store Sample App repository")
print("   • Identifies potential failure points in the architecture")
print("   • Generates testable chaos engineering hypotheses")
print("   • Stores hypotheses in the database for further processing")
print()

# Record start time
step1_start = datetime.now()

# Execute the Hypothesis Generator Agent
print("🚀 Executing HypothesisGeneratorAgent...")
hypothesis_result = hypothesis_agent(
    "Analyze the AWS workload repository (https://github.com/aws-containers/retail-store-sample-app.git)."
)

# Record end time and calculate duration
step1_end = datetime.now()
step1_duration = (step1_end - step1_start).total_seconds()

print(f"\n✅ Step 1 completed in {step1_duration:.2f} seconds")
print("\n📄 Hypothesis Generation Results:")
print("-" * 40)
print(hypothesis_result)
print("-" * 40)

# Step 2: Hypothesis Prioritization

The second step prioritizes the generated hypotheses based on business impact, technical feasibility, risk level, and learning value. This ensures we focus on the most valuable experiments first.
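
To make the four criteria concrete, the sketch below ranks hypotheses with a simple weighted score. This is only an illustration of the idea: the prioritization agent makes its own judgment, and the weights, the 1 to 5 scales, and the example hypotheses are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HypothesisScore:
    title: str
    business_impact: int  # 1 (low) to 5 (high)
    feasibility: int      # 1 (hard to test) to 5 (easy to test)
    risk: int             # 1 (small blast radius) to 5 (large blast radius)
    learning_value: int   # 1 (low) to 5 (high)


# Illustrative weights (assumption); they sum to 1.0.
WEIGHTS = {
    "business_impact": 0.35,
    "feasibility": 0.25,
    "risk": 0.15,
    "learning_value": 0.25,
}


def score(h: HypothesisScore) -> float:
    """Higher is better. Risk is inverted so that lower risk raises the score."""
    return (
        WEIGHTS["business_impact"] * h.business_impact
        + WEIGHTS["feasibility"] * h.feasibility
        + WEIGHTS["learning_value"] * h.learning_value
        + WEIGHTS["risk"] * (6 - h.risk)
    )


def rank(hypotheses: List[HypothesisScore]) -> List[HypothesisScore]:
    """Return hypotheses ordered from highest to lowest score."""
    return sorted(hypotheses, key=score, reverse=True)


# Hypothetical examples, not output from the agent.
candidates = [
    HypothesisScore("Availability zone outage", 5, 1, 5, 4),
    HypothesisScore("Single pod deletion", 3, 5, 1, 3),
    HypothesisScore("Database failover", 5, 3, 4, 5),
]

for priority, h in enumerate(rank(candidates), start=1):
    print(f"{priority}. {h.title} (score {score(h):.2f})")
```

Inverting the risk term reflects the goal stated in the prompt: maximum learning with acceptable risk, so a large blast radius lowers a hypothesis in the ranking.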

In [None]:
print("\n🎯 STEP 2: HYPOTHESIS PRIORITIZATION")
print("=" * 50)
print("📝 What this step does:")
print("   • Retrieves all hypotheses from the database")
print("   • Evaluates each hypothesis based on:")
print("     - Business impact (customer experience, revenue)")
print("     - Technical feasibility (ease of testing, resources)")
print("     - Risk level (blast radius, recovery time)")
print("     - Learning value (insights gained)")
print("   • Assigns priority rankings (1 = highest priority)")
print("   • Updates database with priority information")
print()

# Record start time
step2_start = datetime.now()

# Execute the Hypothesis Prioritization Agent
print("🚀 Executing HypothesisPrioritizationAgent...")
prioritization_result = prioritization_agent("""
Prioritize all hypotheses in the database based on:

1. Business impact (customer experience, revenue impact)
2. Technical feasibility (ease of testing, resource requirements)
3. Risk level (blast radius, recovery time)
4. Learning value (insights gained from the experiment)

Update each hypothesis with a priority ranking from 1 to N (1 = highest priority).
Focus on experiments that provide maximum learning with acceptable risk.
""")

# Record end time and calculate duration
step2_end = datetime.now()
step2_duration = (step2_end - step2_start).total_seconds()

print(f"\n✅ Step 2 completed in {step2_duration:.2f} seconds")
print("\n📄 Hypothesis Prioritization Results:")
print("-" * 40)
print(prioritization_result.message)
print("-" * 40)

# Step 3: Experiment Design

The third step creates detailed AWS FIS experiment templates for the prioritized hypotheses. This includes both the FIS configuration and necessary IAM role setup.
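
For orientation, the sketch below builds a minimal experiment template for the `aws:eks:pod-delete` action as a plain dictionary. It is an assumption about what a generated design might look like, not the agent's actual output. The `build_pod_delete_template` helper, the role ARN, the `ui` namespace, and the label selector are placeholders; check field names against the current AWS FIS documentation before using them.

```python
import json
from typing import Any, Dict


def build_pod_delete_template(
    role_arn: str,
    cluster_arn: str,
    namespace: str,
    label_selector: str,
    service_account: str = "fis-service-account",
) -> Dict[str, Any]:
    """Build a minimal FIS experiment template body for deleting one pod."""
    return {
        "description": f"Delete one pod matching {label_selector} in {namespace}",
        "roleArn": role_arn,
        # 'none' means no automatic stop condition; a real experiment would
        # normally reference a CloudWatch alarm here instead.
        "stopConditions": [{"source": "none"}],
        "targets": {
            "TargetPods": {
                "resourceType": "aws:eks:pod",
                "selectionMode": "COUNT(1)",
                "parameters": {
                    "clusterIdentifier": cluster_arn,
                    "namespace": namespace,
                    "selectorType": "labelSelector",
                    "selectorValue": label_selector,
                },
            }
        },
        "actions": {
            "DeletePod": {
                "actionId": "aws:eks:pod-delete",
                "parameters": {"kubernetesServiceAccount": service_account},
                "targets": {"Pods": "TargetPods"},
            }
        },
    }


# Placeholder values (assumptions), not real resources.
template = build_pod_delete_template(
    role_arn="arn:aws:iam::123456789012:role/ExampleFISRole",
    cluster_arn="arn:aws:eks:us-west-2:123456789012:cluster/retail-store",
    namespace="ui",
    label_selector="app.kubernetes.io/name=ui",
)
print(json.dumps(template, indent=2))
```

The `kubernetesServiceAccount` value must match the service account created in the target namespace during the optional EKS setup.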

In [None]:
print("\n🔧 STEP 3: EXPERIMENT DESIGN")
print("=" * 50)
print("📝 What this step does:")
print("   • Retrieves prioritized hypotheses from the database")
print("   • Creates production-ready AWS FIS experiment templates")
print("   • Designs IAM roles and policies for experiment execution")
print("   • Includes safety measures and rollback procedures")
print("   • Saves experiment designs to the database")
print()

# Record start time
step3_start = datetime.now()

# Execute the Experiment Design Agent
print("🚀 Executing ExperimentDesignAgent...")
design_result = design_agent("""
Retrieve all hypotheses from the database (ordered by priority) and create experiment designs for each.
Make sure to look up the latest documentation for each experiment type.
Do not stop till all hypotheses are processed.
""")

# Record end time and calculate duration
step3_end = datetime.now()
step3_duration = (step3_end - step3_start).total_seconds()

print(f"\n✅ Step 3 completed in {step3_duration:.2f} seconds")
print("\n📄 Experiment Design Results:")
print("-" * 40)
print(design_result.message)
print("-" * 40)

# Step 4: FIS Setup

The fourth step sets up the designed experiments in AWS FIS with real AWS resources. This creates executable experiments ready for testing.

In [None]:
print("\n⚙️ STEP 4: AWS FIS SETUP")
print("=" * 50)
print("📝 What this step does:")
print("   • Retrieves draft experiments from the database")
print("   • Discovers real AWS resources in your account")
print("   • Creates executable FIS experiments with actual resource targets")
print("   • Sets up IAM roles and permissions")
print("   • Updates experiment status to 'created' when ready")
print("   • Validates experiment configurations")
print()

# Record start time
step4_start = datetime.now()

# Execute the Experiments Agent for FIS setup
print("🚀 Executing ExperimentsAgent for FIS setup...")
fis_setup_result = experiments_agent("""
Set up AWS FIS experiments for the workload:

1. Get all draft experiments from the database using get_experiments
2. For each experiment, discover AWS resources and create FIS experiments
3. Update experiment status to 'created' when successfully set up

Focus on creating real, executable FIS experiments in AWS.
                                     
Process all experiments without stopping until all are processed.
""")

# Record end time and calculate duration
step4_end = datetime.now()
step4_duration = (step4_end - step4_start).total_seconds()

print(f"\n✅ Step 4 completed in {step4_duration:.2f} seconds")
print("\n📄 FIS Setup Results:")
print("-" * 40)
print(fis_setup_result.message)
print("-" * 40)



# Step 5: Experiment Execution

The fifth step executes the top priority experiments and monitors their progress. This is where the actual chaos engineering tests run against your AWS infrastructure.
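
Monitoring an experiment amounts to polling its status until it reaches a terminal state. The sketch below shows that loop with a stand-in status source so it runs without AWS. The `wait_for_experiment` helper and its parameters are hypothetical; the agent's own monitoring may work differently.

```python
import time
from typing import Callable, List

# FIS experiment statuses that mean the experiment is no longer running.
TERMINAL_STATES = {"completed", "failed", "stopped", "cancelled"}


def wait_for_experiment(
    get_status: Callable[[], str],
    poll_seconds: float = 10.0,
    timeout_seconds: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Poll get_status until a terminal state is seen; return every status observed."""
    history: List[str] = []
    waited = 0.0
    while True:
        status = get_status()
        history.append(status)
        if status in TERMINAL_STATES:
            return history
        if waited >= timeout_seconds:
            raise TimeoutError(f"Experiment still '{status}' after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Stand-in status source. With boto3, get_status could instead wrap
# fis_client.get_experiment(id=experiment_id)["experiment"]["state"]["status"].
fake_statuses = iter(["pending", "initiating", "running", "running", "completed"])
history = wait_for_experiment(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(" -> ".join(history))
```

Passing `sleep` as a parameter keeps the loop testable, and the timeout guards against an experiment that never leaves the `running` state.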

In [None]:
print("\n⚡ STEP 5: EXPERIMENT EXECUTION")
print("=" * 50)
print("📝 What this step does:")
print("   • Selects the top 5 highest priority experiments with status 'created'")
print("   • Executes experiments sequentially (one at a time for safety)")
print("   • Monitors experiment progress with detailed status updates")
print("   • Waits for completion (completed, failed, or stopped)")
print("   • Captures execution results, duration, and failure details")
print("   • Updates database with final status and results")
print("   • Implements safety measures and monitoring")
print()
print("⚠️  WARNING: This step will execute real chaos experiments on your AWS infrastructure!")
print("   Make sure you understand the impact before proceeding.")
print()

# Record start time
step5_start = datetime.now()

# Execute the Experiments Agent for execution
print("🚀 Executing ExperimentsAgent for experiment execution...")
execution_result = experiments_agent("""
Execute chaos engineering experiments for the workload:

EXECUTION PLAN:
1. Get the top 5 highest priority experiments from the database that have status 'created'
2. For each experiment:
   a. Display experiment details (name, hypothesis, expected impact)
   b. Execute the experiment using AWS FIS start_experiment
   c. Monitor experiment progress with detailed status updates
   d. Wait for completion (completed, failed, or stopped)
   e. Capture execution results, duration, and any failure details
   f. Update database with final status and results
3. Provide a summary of all executed experiments

EXECUTION REQUIREMENTS:
- Execute experiments sequentially (one at a time)
- Wait for each experiment to complete before starting the next
- Capture detailed execution logs and timing information
- Update database status throughout the process
- Handle any execution failures gracefully
- Provide clear status updates for each step

SAFETY MEASURES:
- Verify experiment targets before execution
- Monitor for any unexpected behavior
- Capture stop reasons if experiments are terminated
- Log all AWS FIS API responses

Execute experiments safely and provide detailed progress updates.
""")

# Record end time and calculate duration
step5_end = datetime.now()
step5_duration = (step5_end - step5_start).total_seconds()

print(f"\n✅ Step 5 completed in {step5_duration:.2f} seconds")
print("\n📄 Experiment Execution Results:")
print("-" * 40)
print(execution_result.message)
print("-" * 40)

# Step 6: Results Analysis

The final step analyzes the experiment results and generates actionable insights for improving system resilience.
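
As a rough illustration of the success and failure rates mentioned in this step, the sketch below summarizes a list of experiment records by status. The record layout, the `summarize` helper, and the sample data are assumptions made for this example; they do not reflect the project's database schema or real results.

```python
from collections import Counter
from typing import Any, Dict, List


def summarize(experiments: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Count experiments by final status and compute the share that completed."""
    counts = Counter(e["status"] for e in experiments)
    total = len(experiments)
    completed = counts.get("completed", 0)
    return {
        "total": total,
        "by_status": dict(counts),
        "completed_rate": completed / total if total else 0.0,
    }


# Hypothetical records for illustration only.
sample_experiments = [
    {"name": "exp-a", "status": "completed"},
    {"name": "exp-b", "status": "completed"},
    {"name": "exp-c", "status": "failed"},
    {"name": "exp-d", "status": "stopped"},
]

summary = summarize(sample_experiments)
print(summary)
```

Note that a `completed` status only means the fault injection ran to the end. Whether the hypothesis held still depends on what the system did during the experiment, which is what the agent's analysis is meant to address.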

In [None]:
print("\n📊 STEP 6: RESULTS ANALYSIS")
print("=" * 50)
print("📝 What this step does:")
print("   • Retrieves all executed experiments from the database")
print("   • Analyzes experiment outcomes and failure patterns")
print("   • Extracts key learnings and insights")
print("   • Provides success/failure rates and patterns")
print("   • Generates recommendations for system improvements")
print("   • Suggests follow-up experiments")
print("   • Creates actionable insights for development teams")
print()

# Record start time
step6_start = datetime.now()

# Execute the Learning and Iteration Agent
print("🚀 Executing LearningAndIterationAgent...")
analysis_result = learning_agent("""
Analyze and summarize the results of executed chaos engineering experiments:

ANALYSIS TASKS:
1. Get all experiments from the database. 
2. For each executed experiment:
   a. Display experiment name and hypothesis
   b. Show execution status and duration
   c. Analyze any failure patterns or unexpected behaviors
   d. Extract key learnings and insights
3. Provide overall summary of chaos engineering results
4. Recommend next steps based on findings

REPORTING FORMAT:
- Clear experiment-by-experiment breakdown
- Success/failure rates and patterns
- Key insights and learnings discovered
- Recommendations for system improvements
- Suggestions for follow-up experiments

Focus on actionable insights that can improve system resilience.
""")

# Record end time and calculate duration
step6_end = datetime.now()
step6_duration = (step6_end - step6_start).total_seconds()

print(f"\n✅ Step 6 completed in {step6_duration:.2f} seconds")
print("\n📄 Results Analysis:")
print("-" * 40)
print(analysis_result.message)
print("-" * 40)

# Step 7: Automated Evaluation

Evaluation is one of the most important steps when building a GenAI system. A complete evaluation system for this project would include:

* Creating a benchmark data set.
* Using the benchmark to evaluate each agent, including:
    * The agent output
    * How well the agent selects and uses tools
* Using a mixture of human and automated evaluation of new outputs

To get started, we built a simple automated evaluation system for the hypothesis generation agent.

In [None]:
from HypothesisEvaluatorAgent.agent import hypothesis_evaluator_agent

# Evaluate specific hypotheses
result = hypothesis_evaluator_agent("""
{
  "limit": 5,
  "message": "Evaluate all proposed hypotheses"
}
""")

In [None]:
import base64
from typing import Dict, Any

from IPython.display import Image, display

def display_chart(chart_result: Dict[str, Any]) -> None:
    """
    Display a chart in the notebook from the chart result dictionary.
    
    Args:
        chart_result: Dictionary containing chart data with base64-encoded image
    """
    if not chart_result["success"]:
        print(f"Error: {chart_result['message']}")
        return
    
    # Get the base64-encoded image
    img_data = chart_result["chart_data"]["image_base64"]
    
    # Display the image
    display(Image(data=base64.b64decode(img_data)))
    
    # Print chart metadata
    print(f"Chart Type: {chart_result['chart_type']}")
    print(f"Hypotheses Displayed: {chart_result['hypothesis_count']}")

In [None]:
from HypothesisEvaluatorAgent.evaluation_charts import display_hypothesis_evaluation_chart, display_evaluation_statistics, generate_statistics_chart

chart = display_hypothesis_evaluation_chart(chart_type="bar")

In [None]:
from IPython.display import Image, display
import base64
display_chart(chart)

In [None]:
stats_result = display_evaluation_statistics(limit=50, 
                                             output_path="charts/hypothesis_statistics.png")

In [None]:
img_data = stats_result["chart_data"]["image_base64"]
display(Image(data=base64.b64decode(img_data)))

# Workflow Summary

Let's summarize the complete chaos engineering workflow execution:

In [None]:
print("\n🎉 CHAOS ENGINEERING WORKFLOW COMPLETED!")
print("=" * 60)

# Calculate total workflow duration
total_duration = (step6_end - step1_start).total_seconds()

print(f"\n⏱️  Total Workflow Duration: {total_duration:.2f} seconds ({total_duration/60:.1f} minutes)")
print("\n📊 Step-by-Step Timing:")
print(f"   1. Hypothesis Generation:    {step1_duration:.2f}s")
print(f"   2. Hypothesis Prioritization: {step2_duration:.2f}s")
print(f"   3. Experiment Design:        {step3_duration:.2f}s")
print(f"   4. FIS Setup:               {step4_duration:.2f}s")
print(f"   5. Experiment Execution:     {step5_duration:.2f}s")
print(f"   6. Results Analysis:         {step6_duration:.2f}s")

print("\n✅ What Was Accomplished:")
print("   • Analyzed AWS Retail Store Sample App architecture")
print("   • Generated and prioritized chaos engineering hypotheses")
print("   • Created production-ready AWS FIS experiment templates")
print("   • Set up experiments with real AWS resources")
print("   • Executed selected experiments safely")
print("   • Analyzed results and generated actionable insights")

print("\n🎯 Key Benefits of This Workflow Pattern:")
print("   • Sequential execution with dependency management")
print("   • Clear separation of concerns between agents")
print("   • Comprehensive error handling and monitoring")
print("   • Actionable insights for system improvements")
print("   • Repeatable and scalable process")

print("\n🚀 Next Steps:")
print("   1. Review the generated insights and recommendations")
print("   2. Implement suggested system improvements")
print("   3. Schedule regular chaos engineering sessions")
print("   4. Expand experiments to cover more failure scenarios")
print("   5. Integrate findings into development processes")

print("\n💡 Pro Tips:")
print("   • Start with low-risk experiments in non-production environments")
print("   • Gradually increase experiment complexity and scope")
print("   • Document all learnings and share with your team")
print("   • Use insights to improve system design and architecture")
print("   • Make chaos engineering part of your regular development cycle")

print("\n🎯 YOUR SYSTEM IS NOW MORE RESILIENT!")