# SageMaker Random Cut Forest Exercise

This notebook demonstrates Amazon SageMaker's **Random Cut Forest (RCF)** algorithm for anomaly detection.

## What You'll Learn
1. How to prepare data for anomaly detection
2. How to configure and understand RCF hyperparameters
3. How to train an RCF model
4. How to interpret anomaly scores and evaluate detection performance

## What is Random Cut Forest?

Random Cut Forest is an **unsupervised** algorithm for detecting anomalous data points. It assigns an anomaly score to each data point; higher scores indicate more anomalous observations.

**Key Concept:**
- RCF builds a forest of random trees by recursively partitioning data
- Anomalies are points that require fewer cuts to isolate
- Works with arbitrary-dimensional input
- Based on the research paper "Robust Random Cut Forest Based Anomaly Detection On Streams" (Guha et al.)
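
The isolation idea above can be illustrated with a toy one-dimensional sketch. This is not SageMaker's implementation, nor the full algorithm from the paper; the helpers `cuts_to_isolate` and `average_cuts` and the sample values are hypothetical and only show why a far-away point tends to be separated in fewer random cuts than a point inside a dense cluster.

```python
import random

def cuts_to_isolate(points, target, rng):
    """Count random cuts until `target` is alone in its partition (1-D toy)."""
    remaining = list(points)
    cuts = 0
    while len(remaining) > 1:
        lo, hi = min(remaining), max(remaining)
        if lo == hi:  # only duplicates left; cannot cut further
            break
        cut = rng.uniform(lo, hi)
        # Keep the side of the cut that still contains the target
        if target <= cut:
            remaining = [p for p in remaining if p <= cut]
        else:
            remaining = [p for p in remaining if p > cut]
        cuts += 1
    return cuts

def average_cuts(points, target, trials=200, seed=0):
    """Average the cut count over many random partitionings."""
    rng = random.Random(seed)
    return sum(cuts_to_isolate(points, target, rng) for _ in range(trials)) / trials

# Hypothetical data: a dense cluster of 50 values plus one far-away point
cluster = [45 + 0.2 * i for i in range(50)]
points = cluster + [95.0]

inlier_cuts = average_cuts(points, target=cluster[25])
outlier_cuts = average_cuts(points, target=95.0)
print(f"inlier: {inlier_cuts:.2f} cuts, outlier: {outlier_cuts:.2f} cuts")
```

The outlier's average cut count comes out lower than the inlier's, which is the intuition behind assigning it a higher anomaly score.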

## Use Cases

| Domain | Application |
|--------|-------------|
| Time Series | Spike detection, unusual patterns |
| Cybersecurity | Intrusion detection, fraud |
| IoT | Sensor malfunction, equipment failure |
| Finance | Transaction anomalies, market events |
| Operations | Server issues, traffic anomalies |

---

## ⚠️ Training Cost Information

<div style="background-color: #000000ff; border: 1px solid #28a745; border-radius: 5px; padding: 15px; margin: 10px 0;">

### RCF Uses CPU Instances (Cost-Effective!)

Random Cut Forest is a **CPU-only** algorithm and does NOT require GPU instances.

| Instance Type | vCPU | Memory | On-Demand Price* |
|---------------|------|--------|------------------|
| ml.m5.large | 2 | 8 GB | ~$0.13/hour |
| ml.m5.xlarge | 4 | 16 GB | ~$0.27/hour |
| ml.c5.xlarge | 4 | 8 GB | ~$0.24/hour |
| ml.c5.2xlarge | 8 | 16 GB | ~$0.48/hour |

*Prices are approximate for us-west-2. Check [AWS SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/) for current rates.

### Cost Estimation
- **Training**: Typically 3-5 minutes for small-medium datasets (~$0.01-0.03)
- **Inference endpoint**: ~$0.13/hour for ml.m5.large (delete when not in use!)
- RCF is one of the most cost-effective SageMaker algorithms

</div>
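
The estimates in the cost box are simple rate-times-duration arithmetic. The sketch below reproduces them using the approximate ml.m5.large price from the table; `instance_cost` is a hypothetical helper, and the calculation ignores storage, data transfer, and any billing granularity, so treat the results as rough orders of magnitude.

```python
def instance_cost(hourly_rate_usd, minutes):
    """On-demand cost in USD for running one instance for `minutes`."""
    return hourly_rate_usd * minutes / 60

ML_M5_LARGE_RATE = 0.13  # approximate us-west-2 price from the table above

training_cost = instance_cost(ML_M5_LARGE_RATE, minutes=5)
endpoint_month = instance_cost(ML_M5_LARGE_RATE, minutes=60 * 24 * 30)

print(f"5-minute training job: ${training_cost:.3f}")
print(f"Endpoint left running for 30 days: ${endpoint_month:.2f}")
```

A forgotten endpoint costs far more than the training job itself, which is why Step 8 deletes it.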

## Step 1: Setup and Imports

In [None]:
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.image_uris import retrieve
from sagemaker.estimator import Estimator
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer
import pandas as pd
import numpy as np
import json
import os
from datetime import datetime, timedelta
from dotenv import load_dotenv
import matplotlib.pyplot as plt

# Load environment variables from .env file
load_dotenv()

# Configure AWS session from environment variables
aws_profile = os.getenv('AWS_PROFILE')
aws_region = os.getenv('AWS_REGION', 'us-west-2')
sagemaker_role = os.getenv('SAGEMAKER_ROLE_ARN')

if aws_profile:
    boto3.setup_default_session(profile_name=aws_profile, region_name=aws_region)
else:
    boto3.setup_default_session(region_name=aws_region)

# SageMaker session and role
sagemaker_session = sagemaker.Session()

if sagemaker_role:
    role = sagemaker_role
else:
    role = get_execution_role()

region = sagemaker_session.boto_region_name

print(f"AWS Profile: {aws_profile or 'default'}")
print(f"SageMaker Role: {role}")
print(f"Region: {region}")
print(f"SageMaker SDK Version: {sagemaker.__version__}")

In [None]:
# Configuration
BUCKET_NAME = sagemaker_session.default_bucket()
PREFIX = "random-cut-forest"

# Dataset parameters
NUM_SAMPLES = 5000
ANOMALY_RATE = 0.02  # 2% anomalies
RANDOM_STATE = 42

print(f"S3 Bucket: {BUCKET_NAME}")
print(f"S3 Prefix: {PREFIX}")

## Step 2: Generate Synthetic Data with Anomalies

In [None]:
def generate_time_series_with_anomalies(num_samples=5000, anomaly_rate=0.02, seed=42):
    """
    Generate synthetic time series data with injected anomalies.
    
    Simulates server metrics: CPU usage, memory, request count.
    """
    np.random.seed(seed)
    
    # Time index
    timestamps = pd.date_range(
        start='2024-01-01', 
        periods=num_samples, 
        freq='5min'
    )
    
    # Normal patterns
    # CPU: base load + daily pattern + noise
    hours = np.array([t.hour for t in timestamps])
    daily_pattern = 20 * np.sin(2 * np.pi * hours / 24 - np.pi/2) + 50
    cpu_usage = daily_pattern + np.random.normal(0, 5, num_samples)
    cpu_usage = np.clip(cpu_usage, 0, 100)
    
    # Memory: slow growth + noise
    memory_base = 40 + 0.002 * np.arange(num_samples)
    memory_usage = memory_base + np.random.normal(0, 3, num_samples)
    memory_usage = np.clip(memory_usage, 0, 100)
    
    # Request count: follows CPU pattern roughly
    request_count = (daily_pattern - 30) * 100 + np.random.normal(0, 200, num_samples)
    request_count = np.clip(request_count, 0, None)
    
    # Create dataframe
    df = pd.DataFrame({
        'timestamp': timestamps,
        'cpu_usage': cpu_usage,
        'memory_usage': memory_usage,
        'request_count': request_count
    })
    
    # Inject anomalies
    num_anomalies = int(num_samples * anomaly_rate)
    anomaly_indices = np.random.choice(num_samples, num_anomalies, replace=False)
    
    # Mark anomalies (for evaluation only - RCF doesn't use labels)
    df['is_anomaly'] = False
    df.loc[anomaly_indices, 'is_anomaly'] = True
    
    # Types of anomalies:
    for idx in anomaly_indices:
        anomaly_type = np.random.choice(['spike', 'drop', 'outlier'])
        
        if anomaly_type == 'spike':
            # Sudden spike in both CPU and memory
            df.loc[idx, 'cpu_usage'] = min(100, df.loc[idx, 'cpu_usage'] + np.random.uniform(30, 50))
            df.loc[idx, 'memory_usage'] = min(100, df.loc[idx, 'memory_usage'] + np.random.uniform(20, 40))
        elif anomaly_type == 'drop':
            # Sudden drop in requests
            df.loc[idx, 'request_count'] = max(0, df.loc[idx, 'request_count'] * np.random.uniform(0.1, 0.3))
        else:
            # Multi-dimensional outlier
            df.loc[idx, 'cpu_usage'] = np.random.uniform(85, 100)
            df.loc[idx, 'memory_usage'] = np.random.uniform(80, 100)
            df.loc[idx, 'request_count'] = np.random.uniform(0, 500)
    
    return df

# Generate data
df = generate_time_series_with_anomalies(NUM_SAMPLES, ANOMALY_RATE, RANDOM_STATE)

print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")
print(f"\nAnomaly count: {df['is_anomaly'].sum()} ({100*df['is_anomaly'].mean():.1f}%)")
print(f"\nSample data:")
print(df.head(10))

In [None]:
# Visualize the data with anomalies highlighted
fig, axes = plt.subplots(3, 1, figsize=(14, 10), sharex=True)

# Get anomaly indices for plotting
anomaly_mask = df['is_anomaly']

# CPU Usage
axes[0].plot(df.index, df['cpu_usage'], 'b-', alpha=0.7, label='CPU Usage')
axes[0].scatter(df.index[anomaly_mask], df.loc[anomaly_mask, 'cpu_usage'], 
                c='red', s=50, label='Anomaly', zorder=5)
axes[0].set_ylabel('CPU Usage (%)')
axes[0].legend()
axes[0].set_title('Server Metrics with Injected Anomalies')

# Memory Usage
axes[1].plot(df.index, df['memory_usage'], 'g-', alpha=0.7, label='Memory Usage')
axes[1].scatter(df.index[anomaly_mask], df.loc[anomaly_mask, 'memory_usage'], 
                c='red', s=50, label='Anomaly', zorder=5)
axes[1].set_ylabel('Memory Usage (%)')
axes[1].legend()

# Request Count
axes[2].plot(df.index, df['request_count'], 'm-', alpha=0.7, label='Request Count')
axes[2].scatter(df.index[anomaly_mask], df.loc[anomaly_mask, 'request_count'], 
                c='red', s=50, label='Anomaly', zorder=5)
axes[2].set_ylabel('Requests')
axes[2].set_xlabel('Sample Index')
axes[2].legend()

plt.tight_layout()
plt.show()

## Step 3: Prepare Data for RCF

RCF accepts:
- **CSV format**: text/csv
- **RecordIO-protobuf**: application/x-recordio-protobuf

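For CSV input, omit the header row and include only numeric feature columns. Training data is unlabeled (content type `text/csv;label_size=0`); an optional labeled `test` channel puts the label in the first column (`text/csv;label_size=1`).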
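
A quick local check can catch a stray header row or a column-count mismatch before the file is uploaded. The sketch below is a minimal example using only the standard library; `check_rcf_csv` is a hypothetical helper and the sample rows are made up.

```python
import csv
import io

def check_rcf_csv(text, feature_dim):
    """Return the row count if every row has `feature_dim` numeric fields."""
    rows = 0
    for line_no, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if len(row) != feature_dim:
            raise ValueError(f"line {line_no}: expected {feature_dim} fields, got {len(row)}")
        try:
            [float(value) for value in row]
        except ValueError:
            raise ValueError(f"line {line_no}: non-numeric value (header row?)") from None
        rows += 1
    return rows

good = "52.1,40.3,1980.0\n48.7,41.0,2105.5\n"
bad = "cpu_usage,memory_usage,request_count\n52.1,40.3,1980.0\n"

print(check_rcf_csv(good, feature_dim=3))
try:
    check_rcf_csv(bad, feature_dim=3)
except ValueError as err:
    print(f"rejected: {err}")
```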

In [None]:
# Prepare features (exclude timestamp and is_anomaly label)
feature_columns = ['cpu_usage', 'memory_usage', 'request_count']
train_data = df[feature_columns].values

print(f"Training data shape: {train_data.shape}")
print(f"Features: {feature_columns}")

In [None]:
# Save to CSV (no header)
os.makedirs('data/rcf', exist_ok=True)

np.savetxt('data/rcf/train.csv', train_data, delimiter=',')

print(f"Saved: data/rcf/train.csv ({os.path.getsize('data/rcf/train.csv') / 1024:.1f} KB)")

# Preview
print("\nFile preview (first 5 lines):")
with open('data/rcf/train.csv', 'r') as f:
    for i, line in enumerate(f):
        if i >= 5:
            break
        print(f"  {line.strip()}")

In [None]:
# Upload to S3
s3_client = boto3.client('s3')

train_s3_key = f"{PREFIX}/train/train.csv"
s3_client.upload_file('data/rcf/train.csv', BUCKET_NAME, train_s3_key)

train_uri = f"s3://{BUCKET_NAME}/{PREFIX}/train"
print(f"Data uploaded to: {train_uri}")

## Step 4: Train RCF Model

### Understanding RCF Hyperparameters

| Parameter | Description | Default | Recommendation |
|-----------|-------------|---------|----------------|
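| `num_trees` | Number of trees in the forest | 100 | 50-100 for stability (valid range 50-1000) |
| `num_samples_per_tree` | Random samples given to each tree | 256 | ~1/anomaly_rate (valid range 1-2048) |
| `feature_dim` | Number of features per data point | Required | Must match data |
| `eval_metrics` | Metrics computed on a labeled `test` channel | `accuracy` and `precision_recall_fscore` | Only relevant when a `test` channel is supplied |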

### Key Hyperparameter Details

**num_trees**
- More trees = more stable anomaly scores
- Diminishing returns beyond 100 trees
- Recommended: 50-100 for most use cases
- Higher values increase training time linearly

**num_samples_per_tree**
- Critical parameter for anomaly detection sensitivity
- **Rule of thumb**: `1/num_samples_per_tree ≈ expected_anomaly_rate`
- Example: 2% anomaly rate → `num_samples_per_tree=50` (1/50 = 2%)
- Default of 256 expects ~0.4% anomaly rate
- Lower values = more sensitive to small anomalies
- Higher values = more robust, fewer false positives
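
The rule of thumb above can be written as a small helper. This is a sketch: `suggest_num_samples_per_tree` is a hypothetical function, and the clamp bounds of 1 and 2048 are assumed to be the range the built-in algorithm accepts, so check the current SageMaker documentation before relying on them.

```python
def suggest_num_samples_per_tree(anomaly_rate, lower=1, upper=2048):
    """Apply the 1/anomaly_rate rule of thumb, clamped to [lower, upper]."""
    if not 0 < anomaly_rate <= 1:
        raise ValueError("anomaly_rate must be in (0, 1]")
    return max(lower, min(upper, round(1 / anomaly_rate)))

for rate in (0.02, 0.004, 0.0001):
    suggestion = suggest_num_samples_per_tree(rate)
    print(f"{rate:.2%} anomalies -> num_samples_per_tree={suggestion}")
```

For the 2% anomaly rate used in this notebook, the helper suggests 50.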

**feature_dim**
- Must equal the number of input features
- RCF accepts multi-dimensional input; the built-in algorithm allows `feature_dim` values from 1 to 10,000

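### Evaluation Metrics Explained

`eval_metrics` accepts two values, `accuracy` and `precision_recall_fscore`. Both are computed only when a labeled `test` channel is provided; this notebook trains without one and evaluates with scikit-learn in Step 6 instead.

| Metric | Description | Usage |
|--------|-------------|-------|
| `accuracy` | Fraction of test points classified correctly | Requires labeled test data; can look high even when most anomalies are missed |
| `precision_recall_fscore` | Precision (true positives / predicted positives), recall (true positives / actual positives), and F1 | Requires labeled test data; balances false alarms against missed anomalies |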
In [None]:
# Get RCF container image
rcf_image = retrieve(
    framework='randomcutforest',
    region=region,
    version='1'
)

print(f"RCF Image URI: {rcf_image}")

In [None]:
# Create RCF estimator
rcf_estimator = Estimator(
    image_uri=rcf_image,
    role=role,
    instance_count=1,
    instance_type='ml.m5.large',  # CPU instance
    output_path=f's3://{BUCKET_NAME}/{PREFIX}/output',
    sagemaker_session=sagemaker_session,
    base_job_name='random-cut-forest'
)

In [None]:
# Set hyperparameters
hyperparameters = {
    "num_trees": 100,
    "num_samples_per_tree": 256,  # default (1/256 ≈ 0.4%); round(1 / ANOMALY_RATE) = 50 would match the 2% injected rate
    "feature_dim": len(feature_columns),
}

rcf_estimator.set_hyperparameters(**hyperparameters)

print("RCF hyperparameters:")
for k, v in hyperparameters.items():
    print(f"  {k}: {v}")

In [None]:
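# Start training
from sagemaker.inputs import TrainingInput

print("Starting RCF training job...")
print("This will take approximately 3-5 minutes.\n")

# RCF's train channel expects unlabeled CSV and ShardedByS3Key distribution
train_input = TrainingInput(
    s3_data=train_uri,
    content_type='text/csv;label_size=0',
    distribution='ShardedByS3Key'
)

rcf_estimator.fit(
    {'train': train_input},
    wait=True,
    logs=True
)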

In [None]:
# Get training job info
job_name = rcf_estimator.latest_training_job.name
print(f"Training job completed: {job_name}")
print(f"Model artifacts: {rcf_estimator.model_data}")

## Step 5: Deploy and Score Data

In [None]:
# Deploy the model
print("Deploying RCF model...")
print("This will take approximately 5-7 minutes.\n")

rcf_predictor = rcf_estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.large',
    endpoint_name=f'rcf-{datetime.now().strftime("%Y%m%d%H%M")}'
)

print(f"\nEndpoint deployed: {rcf_predictor.endpoint_name}")

In [None]:
# Configure predictor
rcf_predictor.serializer = CSVSerializer()
rcf_predictor.deserializer = JSONDeserializer()

def get_anomaly_scores(data, predictor, batch_size=500):
    """
    Get anomaly scores for all data points.
    """
    all_scores = []
    
    for i in range(0, len(data), batch_size):
        batch = data[i:i+batch_size]
        response = predictor.predict(batch)
        
        for result in response['scores']:
            all_scores.append(result['score'])
    
    return np.array(all_scores)
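
The helper above relies on the endpoint returning JSON with a top-level `scores` list whose entries each carry a `score` field. The sketch below parses a hypothetical response body of that shape; the score values are made up and do not come from a real endpoint.

```python
import json

# Hypothetical response body, in the shape get_anomaly_scores() expects
raw_body = '{"scores": [{"score": 0.91}, {"score": 1.07}, {"score": 3.42}]}'

response = json.loads(raw_body)
example_scores = [record["score"] for record in response["scores"]]
print(example_scores)
```

One score comes back per input row, in the same order as the rows that were sent.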

In [None]:
# Get anomaly scores for all data
print("Computing anomaly scores...")
scores = get_anomaly_scores(train_data, rcf_predictor)

# Add scores to dataframe
df['anomaly_score'] = scores

print(f"\nScore statistics:")
print(f"  Min:    {scores.min():.4f}")
print(f"  Max:    {scores.max():.4f}")
print(f"  Mean:   {scores.mean():.4f}")
print(f"  Median: {np.median(scores):.4f}")
print(f"  Std:    {scores.std():.4f}")

In [None]:
# Visualize anomaly scores
fig, axes = plt.subplots(2, 1, figsize=(14, 8))

# Score distribution
axes[0].hist(scores[~df['is_anomaly']], bins=50, alpha=0.7, label='Normal', color='blue')
axes[0].hist(scores[df['is_anomaly']], bins=50, alpha=0.7, label='True Anomaly', color='red')
axes[0].set_xlabel('Anomaly Score')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Anomaly Scores')
axes[0].legend()

# Scores over time
axes[1].plot(df.index, df['anomaly_score'], 'b-', alpha=0.7, label='Anomaly Score')
axes[1].scatter(df.index[df['is_anomaly']], df.loc[df['is_anomaly'], 'anomaly_score'], 
                c='red', s=50, label='True Anomaly', zorder=5)

# Add threshold line
threshold = np.percentile(scores, 98)  # Top 2% as anomalies
axes[1].axhline(y=threshold, color='orange', linestyle='--', label=f'Threshold (98th percentile: {threshold:.2f})')

axes[1].set_xlabel('Sample Index')
axes[1].set_ylabel('Anomaly Score')
axes[1].set_title('Anomaly Scores Over Time')
axes[1].legend()

plt.tight_layout()
plt.show()
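
The plot uses a percentile cutoff, which fixes the fraction of points flagged. A common alternative heuristic flags scores more than three standard deviations above the mean, so the number of flagged points depends on how far the tail extends. The sketch below applies that rule to hypothetical scores; `three_sigma_threshold` is an illustrative helper, not an SDK function.

```python
import statistics

def three_sigma_threshold(scores):
    """Mean plus three (population) standard deviations of the scores."""
    return statistics.mean(scores) + 3 * statistics.pstdev(scores)

# Hypothetical scores: 98 ordinary points and 2 clear outliers
example_scores = [1.0 + 0.01 * (i % 10) for i in range(98)] + [3.5, 4.0]

cutoff = three_sigma_threshold(example_scores)
flagged = [s for s in example_scores if s > cutoff]
print(f"cutoff={cutoff:.3f}, flagged={flagged}")
```

Only the two outliers exceed the cutoff here. On real data the two rules can disagree, so it is worth comparing both against labeled examples when they are available.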

## Step 6: Evaluate Detection Performance

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix

def evaluate_threshold(scores, true_labels, threshold):
    """
    Evaluate anomaly detection at a given threshold.
    """
    predictions = (scores >= threshold).astype(int)
    
    precision = precision_score(true_labels, predictions)
    recall = recall_score(true_labels, predictions)
    f1 = f1_score(true_labels, predictions)
    
    return {
        'threshold': threshold,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'predictions': predictions
    }

# Try different thresholds (percentile-based)
true_labels = df['is_anomaly'].astype(int).values

print("Threshold Analysis:")
print("=" * 60)
print(f"{'Percentile':<12} {'Threshold':<12} {'Precision':<12} {'Recall':<12} {'F1':<12}")
print("-" * 60)

for pct in [95, 96, 97, 98, 99]:
    threshold = np.percentile(scores, pct)
    result = evaluate_threshold(scores, true_labels, threshold)
    print(f"{pct}%          {threshold:<12.4f} {result['precision']:<12.4f} {result['recall']:<12.4f} {result['f1']:<12.4f}")

In [None]:
# Calculate AUC-ROC
auc = roc_auc_score(true_labels, scores)
print(f"\nAUC-ROC: {auc:.4f}")

# Use 98th percentile threshold for final evaluation
best_threshold = np.percentile(scores, 98)
final_result = evaluate_threshold(scores, true_labels, best_threshold)

print(f"\nFinal Results (threshold = {best_threshold:.4f}):")
print(f"  Precision: {final_result['precision']:.4f}")
print(f"  Recall:    {final_result['recall']:.4f}")
print(f"  F1 Score:  {final_result['f1']:.4f}")

# Confusion matrix
cm = confusion_matrix(true_labels, final_result['predictions'])
print(f"\nConfusion Matrix:")
print(f"  TN: {cm[0,0]:5d}  FP: {cm[0,1]:5d}")
print(f"  FN: {cm[1,0]:5d}  TP: {cm[1,1]:5d}")

## Step 7: Inspect Top Anomalies

In [None]:
# Show top anomalies
df_sorted = df.sort_values('anomaly_score', ascending=False)

print("Top 15 Detected Anomalies:")
print("=" * 80)
print(df_sorted[['timestamp', 'cpu_usage', 'memory_usage', 'request_count', 'anomaly_score', 'is_anomaly']].head(15).to_string())

## Step 8: Clean Up Resources

In [None]:
# Delete the endpoint
print(f"Deleting endpoint: {rcf_predictor.endpoint_name}")
rcf_predictor.delete_endpoint()
print("Endpoint deleted successfully!")

---

## Summary

In this exercise, you learned:

1. **Data Format**: CSV (no header) or RecordIO-protobuf

2. **Key Hyperparameters**:
   - `num_trees`: More trees = more stable scores (recommend 50-100)
   - `num_samples_per_tree`: Inversely related to expected anomaly rate
   - `feature_dim`: Number of input features

3. **Output**: Anomaly scores (higher = more anomalous)

4. **Threshold Selection**:
   - Use percentile-based thresholds (e.g., 98th percentile)
   - Tune based on precision/recall tradeoff
   - Consider business cost of false positives vs missed anomalies

5. **Evaluation Metrics**:
   - AUC-ROC for overall performance
   - Precision, Recall, F1 at specific threshold
   - Confusion matrix for detailed analysis
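
The point about business cost under Threshold Selection can be made concrete by assigning a cost to each false positive and each missed anomaly, then picking the threshold with the lowest total. This is a minimal sketch on hypothetical scores and labels; `cheapest_threshold` and the cost values are assumptions for illustration only.

```python
def cheapest_threshold(scores, labels, cost_fp, cost_fn):
    """Return (threshold, cost) minimizing total cost over candidate thresholds."""
    best = None
    for threshold in sorted(set(scores)):
        # A point is flagged when its score is at or above the threshold
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
        cost = fp * cost_fp + fn * cost_fn
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best

# Hypothetical scores and labels (1 = anomaly)
example_scores = [0.8, 0.9, 1.0, 1.1, 1.6, 2.0, 2.4, 3.1]
example_labels = [0, 0, 0, 0, 1, 0, 1, 1]

miss_is_costly = cheapest_threshold(example_scores, example_labels, cost_fp=1, cost_fn=10)
alarm_is_costly = cheapest_threshold(example_scores, example_labels, cost_fp=10, cost_fn=1)
print(miss_is_costly, alarm_is_costly)
```

When a missed anomaly is expensive, the chosen threshold drops to catch every anomaly at the price of a false alarm; when false alarms are expensive, it rises and accepts one miss.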

### Instance Recommendations

| Task | Instance Types | Notes |
|------|----------------|-------|
| Training | ml.m5.large, ml.c5.xlarge | CPU only, very cost-effective |
| Inference | ml.c5.xlarge (AWS-recommended); ml.m5.large is used in this notebook | Low latency for streaming |

### CloudWatch Training Metrics

| Metric | Description |
|--------|-------------|
| `test:f1` | F1 score on the labeled test dataset; emitted only when a `test` channel is provided |

### Best Practices

- Start with `num_trees=100` for balance between speed and accuracy
- Set `num_samples_per_tree` based on expected anomaly ratio (1/rate)
- Monitor score distribution over time for drift
- Consider streaming inference for real-time detection
- Normalize features if they have very different scales
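
As an illustration of the normalization advice, the sketch below z-scores each column using only the standard library. `standardize_columns` is a hypothetical helper and the rows are made-up values in the same units as this notebook's features.

```python
import statistics

def standardize_columns(rows):
    """Z-score each column: subtract its mean and divide by its std deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) or 1.0 for col in columns]  # avoid dividing by zero
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Hypothetical rows: cpu_usage (%), memory_usage (%), request_count
rows = [[50.0, 40.0, 2000.0], [60.0, 42.0, 3000.0], [40.0, 38.0, 1000.0]]
scaled = standardize_columns(rows)
print([[round(v, 3) for v in row] for row in scaled])
```

Without scaling, `request_count` spans thousands while the percentage columns span tens, so random cuts would mostly fall along the request axis. In practice the means and standard deviations computed on training data should be reused at inference time.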

### Anomaly Score Interpretation

| Score Range | Typical Meaning |
|-------------|-----------------|
| < mean | Normal behavior |
| mean to 95th percentile | Slightly unusual |
| 95th to 99th percentile | Suspicious, investigate |
| > 99th percentile | Highly anomalous, alert |
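
The bands in the table can be applied programmatically against a reference set of scores. This is a sketch only: `score_band` is a hypothetical helper, and it uses a simple nearest-rank percentile rather than the interpolation that `np.percentile` performs, so cutoffs may differ slightly from those computed earlier in the notebook.

```python
import math

def score_band(score, reference_scores):
    """Map a score to the interpretation bands in the table above."""
    ordered = sorted(reference_scores)
    mean = sum(ordered) / len(ordered)

    def percentile(p):
        # Nearest-rank percentile: smallest value with at least p% of data at or below it
        rank = max(1, math.ceil(len(ordered) * p / 100))
        return ordered[rank - 1]

    if score < mean:
        return "normal"
    if score <= percentile(95):
        return "slightly unusual"
    if score <= percentile(99):
        return "suspicious"
    return "highly anomalous"

# Hypothetical reference scores 1.0, 2.0, ..., 100.0
reference = [float(v) for v in range(1, 101)]
for value in (10.0, 60.0, 97.0, 100.0):
    print(value, "->", score_band(value, reference))
```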

### Next Steps

- Apply to real-world time series data
- Implement streaming anomaly detection with Kinesis
- Combine with CloudWatch Alarms for alerting
- Use SageMaker Model Monitor for production monitoring
- Consider ensemble with other anomaly detection methods