# Beginner's Guide to Anomaly Detection with Pynomaly

Welcome to the world of anomaly detection! This tutorial will teach you the fundamentals of detecting outliers and anomalies in data using Pynomaly.

## 📚 What You'll Learn

1. **What is anomaly detection?**
2. **Types of anomalies**
3. **Basic detection algorithms**
4. **Hands-on implementation**
5. **Evaluation and interpretation**
6. **Real-world applications**

## 🎯 Prerequisites

- Basic Python knowledge
- Familiarity with pandas and numpy
- High school level statistics

Let's get started!

## 1. Understanding Anomaly Detection

### What is an Anomaly?

An **anomaly** (also called an outlier) is a data point that significantly differs from the majority of the data. Think of it as:

- A fraudulent credit card transaction among normal purchases
- A faulty sensor reading in a smart home system
- An unusual network access pattern indicating a security breach
- A defective product in a manufacturing line

### Why is Anomaly Detection Important?

- **Security**: Detect cyber attacks and fraud
- **Quality Control**: Find defective products
- **Health Monitoring**: Identify system failures early
- **Business Intelligence**: Discover unusual customer behavior

In [None]:
# Let's start by installing and importing the necessary libraries
# If you haven't installed pynomaly yet, uncomment the next line:
# !pip install pynomaly

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_blobs
import warnings
warnings.filterwarnings('ignore')

# Set up plotting
plt.style.use('default')
sns.set_palette("husl")

print("✅ Libraries imported successfully!")
print("🚀 Ready to explore anomaly detection!")

## 2. Types of Anomalies

There are three main types of anomalies:

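### 1. Point Anomalies
Individual data points that are unusual compared with the rest of the data, such as a single very large purchase on a card that normally sees small charges.

### 2. Contextual Anomalies
Data points that are normal in one context but anomalous in another. A temperature of 28°C is unremarkable in summer but suspicious in the middle of winter.

### 3. Collective Anomalies
A collection of data points that together form an anomalous pattern, even though each point may look normal on its own. A long run of identical sensor readings is one example.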
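
A contextual anomaly is easiest to see by scoring each value against its own context rather than against the whole dataset. The sketch below uses made-up temperature readings (the `readings` frame, the `season` column, and the `zscore` helper are all hypothetical) and compares a global Z-score with a per-season Z-score computed through pandas `groupby(...).transform(...)`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical temperature readings (°C): 50 winter days and 50 summer days.
readings = pd.DataFrame({
    "season": ["winter"] * 50 + ["summer"] * 50,
    "temp_c": np.concatenate([rng.normal(2, 2, 50), rng.normal(28, 2, 50)]),
})

# Inject one 28°C reading into winter: typical for summer, odd for winter.
readings.loc[0, "temp_c"] = 28.0

def zscore(s):
    """Standard score of each value relative to the series it belongs to."""
    return (s - s.mean()) / s.std(ddof=0)

# Global Z-score: ignores the season, so 28°C looks ordinary.
readings["z_global"] = zscore(readings["temp_c"])

# Contextual Z-score: each reading is compared with its own season only.
readings["z_in_season"] = readings.groupby("season")["temp_c"].transform(zscore)

print(readings.loc[0, ["season", "temp_c", "z_global", "z_in_season"]])
```

The injected reading sits well inside the overall range of temperatures, so only the per-season score singles it out.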

Let's create some example data to visualize these concepts:

In [None]:
# Create sample data to demonstrate different types of anomalies
np.random.seed(42)

# Generate normal data points
normal_data = np.random.normal(0, 1, (200, 2))

# Add point anomalies (outliers)
point_anomalies = np.array([[-4, -4], [4, 4], [-3, 4], [4, -3]])

# Combine data
all_data = np.vstack([normal_data, point_anomalies])
labels = np.hstack([np.zeros(200), np.ones(4)])  # 0 = normal, 1 = anomaly

# Create a DataFrame for easier handling
df = pd.DataFrame(all_data, columns=['Feature_1', 'Feature_2'])
df['Is_Anomaly'] = labels

print(f"Dataset created with {len(df)} data points")
print(f"Normal points: {sum(labels == 0)}")
print(f"Anomalous points: {sum(labels == 1)}")
print(f"Anomaly rate: {sum(labels == 1) / len(labels):.1%}")

In [None]:
# Let's visualize our data
plt.figure(figsize=(10, 6))

# Plot normal points
normal_points = df[df['Is_Anomaly'] == 0]
anomaly_points = df[df['Is_Anomaly'] == 1]

plt.scatter(normal_points['Feature_1'], normal_points['Feature_2'], 
           c='blue', alpha=0.6, label='Normal Points', s=30)
plt.scatter(anomaly_points['Feature_1'], anomaly_points['Feature_2'], 
           c='red', alpha=0.8, label='Anomalies', s=100, marker='x')

plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Point Anomalies Example')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("👆 Notice how the red X marks are clearly separated from the blue cluster!")
print("These are classic examples of point anomalies.")

## 3. Your First Anomaly Detection Algorithm

Let's start with the simplest method: **Statistical Outlier Detection** using the Z-score.

### Z-Score Method

The Z-score tells us how many standard deviations a point is from the mean:

```
Z = (x - μ) / σ
```

Where:
- x = data point
- μ = mean
- σ = standard deviation

Typically, points with |Z| > 3 are considered outliers.
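
One caveat is worth seeing in numbers: the mean and standard deviation are themselves pulled by outliers, and with `np.std` (the population standard deviation) no point in a sample of n values can reach a |Z| above sqrt(n - 1). The sketch below uses made-up readings to show this. For contrast it also computes the modified Z-score, a different technique based on the median and the median absolute deviation (MAD) that this tutorial does not otherwise cover. The 0.6745 constant and the 3.5 cutoff are the commonly quoted defaults and are assumptions here.

```python
import numpy as np

# Hypothetical readings: seven ordinary values and one extreme value.
values = np.array([10, 11, 9, 10, 12, 10, 11, 1000], dtype=float)

# Classic Z-score: the extreme value inflates the mean and the standard deviation.
z = np.abs((values - values.mean()) / values.std())
print("Largest |Z|:", z.max())
print("Upper bound sqrt(n - 1):", np.sqrt(len(values) - 1))
print("Flagged with |Z| > 3:", values[z > 3])

# Modified Z-score: median and MAD are barely affected by a single extreme value.
median = np.median(values)
mad = np.median(np.abs(values - median))
modified_z = 0.6745 * np.abs(values - median) / mad
print("Flagged with modified Z > 3.5:", values[modified_z > 3.5])
```

With only eight points the classic threshold of 3 can never be exceeded, which is why the Z-score method is better suited to larger samples.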

In [None]:
# Let's implement a simple Z-score anomaly detector
def detect_anomalies_zscore(data, threshold=3):
    """
    Detect anomalies using Z-score method.
    
    Parameters:
    - data: 1-D array-like, the values to analyze
    - threshold: float, Z-score threshold (default: 3)
    
    Returns:
    - anomalies: boolean array, True for anomalies
    - z_scores: float array, absolute Z-score of each point
    """
    mean = np.mean(data)
    std = np.std(data)
    
    # Calculate Z-scores
    z_scores = np.abs((data - mean) / std)
    
    # Identify anomalies
    anomalies = z_scores > threshold
    
    return anomalies, z_scores

# Let's test it on a simple 1D dataset
# Create sample data with some outliers
np.random.seed(42)
sample_data = np.concatenate([
    np.random.normal(0, 1, 100),  # Normal data
    [5, -5, 6]  # Clear outliers
])

# Detect anomalies
anomalies, z_scores = detect_anomalies_zscore(sample_data)

print(f"Total data points: {len(sample_data)}")
print(f"Anomalies detected: {sum(anomalies)}")
print(f"Anomaly indices: {np.where(anomalies)[0]}")
print(f"Anomaly values: {sample_data[anomalies]}")
print(f"Z-scores of anomalies: {z_scores[anomalies]}")

In [None]:
# Visualize the Z-score results
plt.figure(figsize=(12, 5))

# Plot 1: Data points
plt.subplot(1, 2, 1)
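normal_indices = ~anomalies
# Plot only the non-anomalous points in blue so the anomalies are not drawn twice
plt.scatter(np.where(normal_indices)[0], sample_data[normal_indices],
           c='blue', alpha=0.6, label='Normal')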
plt.scatter(np.where(anomalies)[0], sample_data[anomalies], 
           c='red', s=100, marker='x', label='Anomalies')
plt.xlabel('Data Point Index')
plt.ylabel('Value')
plt.title('Detected Anomalies')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot 2: Z-scores
plt.subplot(1, 2, 2)
plt.bar(range(len(z_scores)), z_scores, 
        color=['red' if a else 'blue' for a in anomalies], alpha=0.7)
plt.axhline(y=3, color='red', linestyle='--', label='Threshold (Z=3)')
plt.xlabel('Data Point Index')
plt.ylabel('Z-Score')
plt.title('Z-Scores for Each Point')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("🎯 The red bars show Z-scores above the threshold of 3!")

## 4. Introduction to Pynomaly

Now let's use Pynomaly's built-in algorithms. The Z-score looks at one variable at a time and assumes roughly bell-shaped data. Pynomaly provides more advanced algorithms that often cope better with complex, multi-dimensional data.

### Isolation Forest

Isolation Forest is one of the most popular anomaly detection algorithms. It works by:

1. **Randomly selecting** a feature and a split value
2. **Isolating** points by splitting the data
3. **Measuring** how many splits it takes to isolate each point
4. **Anomalies** require fewer splits (easier to isolate)

Think of it like this: if you're in a crowded room, it's hard to isolate you. But if you're standing alone in a corner, it's easy!
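
The isolation idea can be sketched in one dimension with plain NumPy. This is a toy illustration of the four steps above, not Pynomaly's implementation: `isolation_depth` is a hypothetical helper that keeps splitting at a random value and counts how many splits it takes until the target point is alone.

```python
import numpy as np

def isolation_depth(values, target, rng, max_depth=50):
    """Count the random splits needed until `target` is alone in its partition."""
    subset = np.asarray(values, dtype=float)
    depth = 0
    while len(subset) > 1 and depth < max_depth:
        lo, hi = subset.min(), subset.max()
        if lo == hi:  # identical values cannot be separated any further
            break
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the target.
        subset = subset[subset < split] if target < split else subset[subset >= split]
        depth += 1
    return depth

rng = np.random.default_rng(42)
data = np.append(rng.normal(0, 1, 200), 8.0)  # 200 typical points plus one far-away point

outlier = 8.0
inlier = data[np.argmin(np.abs(data))]  # the point closest to the centre of the cluster

# Average over many random trials, as a forest averages over many trees.
outlier_depth = np.mean([isolation_depth(data, outlier, rng) for _ in range(200)])
inlier_depth = np.mean([isolation_depth(data, inlier, rng) for _ in range(200)])

print(f"Average splits to isolate the outlier: {outlier_depth:.1f}")
print(f"Average splits to isolate a central point: {inlier_depth:.1f}")
```

The far-away point needs far fewer splits on average, which is the signal an Isolation Forest turns into an anomaly score.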

In [None]:
# Import Pynomaly's Isolation Forest
from pynomaly.detectors import IsolationForest

# Let's go back to our 2D dataset
X = df[['Feature_1', 'Feature_2']].values
y_true = df['Is_Anomaly'].values.astype(int)  # integer labels, so the bitwise & below works

# Create and train the Isolation Forest detector
# contamination = expected proportion of anomalies in the data
detector = IsolationForest(
    contamination=0.02,  # We expect about 2% of data to be anomalies
    random_state=42,     # For reproducible results
    n_estimators=100     # Number of trees in the forest
)

# Fit the detector to our data
detector.fit(X)

# Get predictions (-1 for anomaly, 1 for normal)
predictions = detector.predict(X)

# Convert to binary (0 for normal, 1 for anomaly)
predicted_anomalies = (predictions == -1).astype(int)

# Get anomaly scores (lower scores indicate more anomalous)
anomaly_scores = detector.decision_function(X)

print(f"Total data points: {len(X)}")
print(f"Predicted anomalies: {sum(predicted_anomalies)}")
print(f"Actual anomalies: {sum(y_true)}")
print(f"Detection rate: {sum(predicted_anomalies & y_true) / sum(y_true):.1%}")

In [None]:
# Visualize the Isolation Forest results
plt.figure(figsize=(15, 5))

# Plot 1: True anomalies vs predictions
plt.subplot(1, 3, 1)
colors = ['blue' if p == 0 else 'red' for p in predicted_anomalies]
plt.scatter(X[:, 0], X[:, 1], c=colors, alpha=0.6)
plt.title('Isolation Forest Predictions')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')

# Plot 2: Anomaly scores
plt.subplot(1, 3, 2)
scatter = plt.scatter(X[:, 0], X[:, 1], c=anomaly_scores, 
                     cmap='viridis', alpha=0.7)
plt.colorbar(scatter, label='Anomaly Score')
plt.title('Anomaly Scores\n(Lower = More Anomalous)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')

# Plot 3: True vs Predicted
plt.subplot(1, 3, 3)
# Normal points correctly classified
correct_normal = (y_true == 0) & (predicted_anomalies == 0)
# Anomalies correctly detected
correct_anomaly = (y_true == 1) & (predicted_anomalies == 1)
# False positives
false_positive = (y_true == 0) & (predicted_anomalies == 1)
# False negatives
false_negative = (y_true == 1) & (predicted_anomalies == 0)

plt.scatter(X[correct_normal, 0], X[correct_normal, 1], 
           c='green', alpha=0.6, label='Correct Normal', s=30)
plt.scatter(X[correct_anomaly, 0], X[correct_anomaly, 1], 
           c='red', marker='x', s=100, label='Correct Anomaly')
plt.scatter(X[false_positive, 0], X[false_positive, 1], 
           c='orange', marker='^', s=60, label='False Positive')
plt.scatter(X[false_negative, 0], X[false_negative, 1], 
           c='purple', marker='v', s=60, label='False Negative')

plt.title('Classification Results')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()

plt.tight_layout()
plt.show()

print("🎯 Green dots: Normal points correctly identified")
print("❌ Red X's: Anomalies correctly detected")
print("🔺 Orange triangles: False positives (normal labeled as anomaly)")
print("🔻 Purple triangles: False negatives (anomaly missed)")

## 5. Evaluating Anomaly Detection Performance

How do we know if our detector is working well? We use several metrics:

### Key Metrics:

- **Precision**: Of all points flagged as anomalies, how many are actually anomalies?
- **Recall**: Of all actual anomalies, how many did we detect?
- **F1-Score**: Harmonic mean of precision and recall
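- **AUC-ROC**: Area under the ROC curve, i.e. the probability that a randomly chosen anomaly is scored as more anomalous than a randomly chosen normal point (independent of any single threshold)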
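
To make these definitions concrete, here is a minimal sketch that computes each metric by hand with NumPy. The labels and scores are made up for illustration (all `toy_*` names are hypothetical), and the scores follow this tutorial's convention that lower means more anomalous.

```python
import numpy as np

# Hypothetical ground truth (1 = anomaly) and scores (lower = more anomalous).
toy_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
toy_scores = np.array([-0.30, -0.20, -0.10, 0.02, 0.06, -0.05, 0.04, 0.10, 0.12, 0.15])
toy_pred = (toy_scores < 0).astype(int)  # flag everything with a negative score

tp = np.sum((toy_true == 1) & (toy_pred == 1))  # anomalies correctly flagged
fp = np.sum((toy_true == 0) & (toy_pred == 1))  # normal points flagged by mistake
fn = np.sum((toy_true == 1) & (toy_pred == 0))  # anomalies that were missed

toy_precision = tp / (tp + fp)
toy_recall = tp / (tp + fn)
toy_f1 = 2 * toy_precision * toy_recall / (toy_precision + toy_recall)

# AUC-ROC as a ranking probability: the share of (anomaly, normal) pairs
# in which the anomaly has the lower (more anomalous) score.
scores_anom = toy_scores[toy_true == 1]
scores_norm = toy_scores[toy_true == 0]
toy_auc = np.mean(scores_anom[:, None] < scores_norm[None, :])

print(f"TP={tp}, FP={fp}, FN={fn}")
print(f"Precision={toy_precision:.3f}, Recall={toy_recall:.3f}, F1={toy_f1:.3f}, AUC={toy_auc:.3f}")
```

Precision, recall, and F1 depend on where the cutoff is placed, while the AUC only depends on how the scores rank the points.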

### Business Perspective:

- **High Precision**: Few false alarms, but might miss some anomalies
- **High Recall**: Catch most anomalies, but might have false alarms
- **Balance**: Usually we want a good balance between both
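
The trade-off shows up as soon as the cutoff on the anomaly score moves. The following sketch uses made-up scores and labels (the `sweep_*` names and `precision_recall_at` are hypothetical) and compares a strict cutoff with a lenient one, again treating lower scores as more anomalous.

```python
import numpy as np

# Hypothetical labels (1 = anomaly) and scores (lower = more anomalous).
sweep_true = np.array([1, 1, 0, 1, 0, 0, 0, 0])
sweep_scores = np.array([-0.30, -0.20, -0.05, 0.02, 0.05, 0.10, 0.12, 0.15])

def precision_recall_at(cutoff):
    """Flag every point scoring below `cutoff` and return (precision, recall)."""
    flagged = sweep_scores < cutoff
    tp = np.sum(flagged & (sweep_true == 1))
    fp = np.sum(flagged & (sweep_true == 0))
    fn = np.sum(~flagged & (sweep_true == 1))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn)
    return precision, recall

strict = precision_recall_at(-0.10)   # flags only the most extreme scores
lenient = precision_recall_at(0.03)   # flags more points

print(f"Strict cutoff:  precision={strict[0]:.2f}, recall={strict[1]:.2f}")
print(f"Lenient cutoff: precision={lenient[0]:.2f}, recall={lenient[1]:.2f}")
```

The strict cutoff raises no false alarms but misses an anomaly. The lenient cutoff catches every anomaly at the price of one false alarm.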

In [None]:
# Calculate evaluation metrics
from sklearn.metrics import (
    precision_score, recall_score, f1_score, 
    roc_auc_score, confusion_matrix, classification_report
)

# Calculate metrics
precision = precision_score(y_true, predicted_anomalies)
recall = recall_score(y_true, predicted_anomalies)
f1 = f1_score(y_true, predicted_anomalies)
auc = roc_auc_score(y_true, -anomaly_scores)  # Note: negative scores because lower = more anomalous

print("🔍 Performance Metrics:")
print(f"Precision: {precision:.3f} ({precision:.1%})")
print(f"Recall:    {recall:.3f} ({recall:.1%})")
print(f"F1-Score:  {f1:.3f} ({f1:.1%})")
print(f"AUC-ROC:   {auc:.3f} ({auc:.1%})")

print("\n📊 What this means:")
print(f"• Out of {sum(predicted_anomalies)} flagged anomalies, {sum(predicted_anomalies & y_true)} were actually anomalous")
print(f"• Out of {sum(y_true)} actual anomalies, {sum(predicted_anomalies & y_true)} were detected")
print(f"• We missed {sum(y_true) - sum(predicted_anomalies & y_true)} anomalies")
print(f"• We had {sum(predicted_anomalies & (1-y_true))} false alarms")

In [None]:
# Create a confusion matrix
from sklearn.metrics import ConfusionMatrixDisplay

fig, ax = plt.subplots(figsize=(8, 6))

# Create confusion matrix
cm = confusion_matrix(y_true, predicted_anomalies)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, 
                             display_labels=['Normal', 'Anomaly'])
disp.plot(cmap='Blues', ax=ax)  # Draw on our axes instead of opening a second, empty figure
ax.set_title('Confusion Matrix\nIsolation Forest Results')

# Add explanations below the axis labels
ax.text(0.5, -0.2, 
        f"True Negatives: {cm[0,0]} | False Positives: {cm[0,1]}\n" +
        f"False Negatives: {cm[1,0]} | True Positives: {cm[1,1]}", 
        transform=ax.transAxes, ha='center', fontsize=10)

plt.show()

print("📖 Reading the Confusion Matrix:")
print("• Top-left (True Negatives): Normal points correctly identified as normal")
print("• Top-right (False Positives): Normal points incorrectly flagged as anomalies")
print("• Bottom-left (False Negatives): Anomalies missed by the detector")
print("• Bottom-right (True Positives): Anomalies correctly detected")

## 6. Comparing Different Algorithms

Let's compare multiple anomaly detection algorithms to see which works best for our data:

In [None]:
# Import multiple detectors from Pynomaly
from pynomaly.detectors import (
    IsolationForest, 
    LocalOutlierFactor, 
    OneClassSVM,
    EllipticEnvelope
)

# Define our detectors
detectors = {
    'Isolation Forest': IsolationForest(contamination=0.02, random_state=42),
    'Local Outlier Factor': LocalOutlierFactor(contamination=0.02),
    'One-Class SVM': OneClassSVM(nu=0.02),  # nu plays a similar role to contamination (roughly the share of points treated as outliers)
    'Elliptic Envelope': EllipticEnvelope(contamination=0.02, random_state=42)
}

# Store results
results = {}

print("🔄 Training and evaluating detectors...\n")

for name, detector in detectors.items():
    print(f"Training {name}...")
    
    # Fit and predict
    if name == 'Local Outlier Factor':
        # LOF doesn't have separate fit/predict methods
        predictions = detector.fit_predict(X)
    else:
        detector.fit(X)
        predictions = detector.predict(X)
    
    # Convert predictions to binary
    predicted_anomalies = (predictions == -1).astype(int)
    
    # Calculate metrics
    precision = precision_score(y_true, predicted_anomalies)
    recall = recall_score(y_true, predicted_anomalies)
    f1 = f1_score(y_true, predicted_anomalies)
    
    # Store results
    results[name] = {
        'precision': precision,
        'recall': recall,
        'f1_score': f1,
        'predictions': predicted_anomalies
    }
    
    print(f"  Precision: {precision:.3f}, Recall: {recall:.3f}, F1: {f1:.3f}")

print("\n✅ All detectors trained and evaluated!")

In [None]:
# Create a comparison chart
# Convert results to DataFrame for easy visualization
results_df = pd.DataFrame(results).T
# Remove the predictions column and cast the metrics from object to float for plotting and idxmax
results_df = results_df.drop('predictions', axis=1).astype(float)

# Create comparison plots
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Plot each metric
metrics = ['precision', 'recall', 'f1_score']
titles = ['Precision', 'Recall', 'F1-Score']
colors = ['skyblue', 'lightcoral', 'lightgreen']

for i, (metric, title, color) in enumerate(zip(metrics, titles, colors)):
    ax = axes[i]
    bars = ax.bar(results_df.index, results_df[metric], color=color, alpha=0.7)
    ax.set_title(f'{title} Comparison')
    ax.set_ylabel(title)
    ax.set_ylim(0, 1)
    
    # Rotate x-axis labels for better readability
    ax.tick_params(axis='x', rotation=45)
    
    # Add value labels on bars
    for bar, value in zip(bars, results_df[metric]):
        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02, 
               f'{value:.3f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

# Find the best performer
best_f1 = results_df['f1_score'].idxmax()
print(f"🏆 Best overall performer (F1-Score): {best_f1}")
print(f"   F1-Score: {results_df.loc[best_f1, 'f1_score']:.3f}")

## 7. Real-World Example: Credit Card Fraud Detection

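Let's apply what we've learned to a simplified version of a realistic scenario: detecting fraudulent credit card transactions. The data below is simulated, so the fraud patterns are much cleaner than they would be in real transaction logs.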

In [None]:
# Create a synthetic credit card transaction dataset
np.random.seed(42)

# Generate features for normal transactions
n_normal = 5000
n_fraud = 50  # About 1% of all transactions

# Normal transactions
normal_amounts = np.random.lognormal(mean=3, sigma=1, size=n_normal)  # Median about $20, with a long right tail
normal_times = np.random.uniform(6, 22, size=n_normal)  # Daytime and evening (06:00-22:00)
normal_locations = np.random.choice(['domestic'], size=n_normal)  # Domestic transactions
normal_frequency = np.random.poisson(3, size=n_normal)  # 3 transactions per day average

# Fraudulent transactions (different patterns)
fraud_amounts = np.concatenate([
    np.random.uniform(1000, 5000, size=n_fraud//2),  # Large amounts
    np.random.uniform(1, 5, size=n_fraud//2)        # Very small amounts
])
fraud_times = np.random.uniform(0, 4, size=n_fraud)  # Night hours
fraud_locations = np.random.choice(['foreign'], size=n_fraud)  # Foreign transactions
fraud_frequency = np.random.poisson(10, size=n_fraud)  # High frequency

# Combine data
amounts = np.concatenate([normal_amounts, fraud_amounts])
times = np.concatenate([normal_times, fraud_times])
is_foreign = np.concatenate(
    [np.zeros(n_normal), np.ones(n_fraud)]  # 0=domestic, 1=foreign
)
frequency = np.concatenate([normal_frequency, fraud_frequency])
labels = np.concatenate([np.zeros(n_normal), np.ones(n_fraud)])  # 0=normal, 1=fraud

# Create DataFrame
fraud_df = pd.DataFrame({
    'amount': amounts,
    'hour_of_day': times,
    'is_foreign': is_foreign,
    'daily_frequency': frequency,
    'is_fraud': labels
})

# Add derived features
fraud_df['amount_log'] = np.log1p(fraud_df['amount'])
fraud_df['is_night'] = (fraud_df['hour_of_day'] < 6).astype(int)
fraud_df['high_frequency'] = (fraud_df['daily_frequency'] > 5).astype(int)

print(f"📊 Credit Card Dataset Created:")
print(f"Total transactions: {len(fraud_df):,}")
print(f"Normal transactions: {sum(fraud_df['is_fraud'] == 0):,}")
print(f"Fraudulent transactions: {sum(fraud_df['is_fraud'] == 1):,}")
print(f"Fraud rate: {fraud_df['is_fraud'].mean():.1%}")

# Show sample data
print("\n📋 Sample transactions:")
print(fraud_df.head(10))

In [None]:
# Visualize the fraud patterns
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Amount distribution
axes[0,0].hist(fraud_df[fraud_df['is_fraud']==0]['amount'], bins=50, alpha=0.7, label='Normal', density=True)
axes[0,0].hist(fraud_df[fraud_df['is_fraud']==1]['amount'], bins=20, alpha=0.7, label='Fraud', density=True)
axes[0,0].set_xlabel('Transaction Amount ($)')
axes[0,0].set_ylabel('Density')
axes[0,0].set_title('Transaction Amount Distribution')
axes[0,0].legend()
axes[0,0].set_xlim(0, 1000)  # Focus on lower amounts for visibility

# Time of day distribution
axes[0,1].hist(fraud_df[fraud_df['is_fraud']==0]['hour_of_day'], bins=24, alpha=0.7, label='Normal', density=True)
axes[0,1].hist(fraud_df[fraud_df['is_fraud']==1]['hour_of_day'], bins=24, alpha=0.7, label='Fraud', density=True)
axes[0,1].set_xlabel('Hour of Day')
axes[0,1].set_ylabel('Density')
axes[0,1].set_title('Transaction Time Distribution')
axes[0,1].legend()

# Foreign transaction comparison
foreign_counts = fraud_df.groupby(['is_foreign', 'is_fraud']).size().unstack()
foreign_counts.plot(kind='bar', ax=axes[1,0], color=['blue', 'red'], alpha=0.7)
axes[1,0].set_xlabel('Location (0=Domestic, 1=Foreign)')
axes[1,0].set_ylabel('Count')
axes[1,0].set_title('Domestic vs Foreign Transactions')
axes[1,0].legend(['Normal', 'Fraud'])
axes[1,0].tick_params(axis='x', rotation=0)

# Frequency distribution
axes[1,1].hist(fraud_df[fraud_df['is_fraud']==0]['daily_frequency'], bins=20, alpha=0.7, label='Normal', density=True)
axes[1,1].hist(fraud_df[fraud_df['is_fraud']==1]['daily_frequency'], bins=10, alpha=0.7, label='Fraud', density=True)
axes[1,1].set_xlabel('Daily Transaction Frequency')
axes[1,1].set_ylabel('Density')
axes[1,1].set_title('Transaction Frequency Distribution')
axes[1,1].legend()

plt.tight_layout()
plt.show()

print("🔍 Notice the different patterns:")
print("• Fraud transactions often have extreme amounts (very high or very low)")
print("• Fraud occurs more often at night")
print("• Fraud transactions are foreign (always, in this synthetic data; real data is far less clear-cut)")
print("• Fraud accounts often have high transaction frequency")

In [None]:
# Train fraud detection model
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Prepare features (exclude the target variable)
feature_columns = ['amount_log', 'hour_of_day', 'is_foreign', 'daily_frequency', 'is_night', 'high_frequency']
X_fraud = fraud_df[feature_columns].values
y_fraud = fraud_df['is_fraud'].values.astype(int)  # integer labels, so counts print as whole numbers

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_fraud, y_fraud, test_size=0.3, random_state=42, stratify=y_fraud
)

# Scale features (important for some algorithms)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Isolation Forest for fraud detection
fraud_detector = IsolationForest(
    contamination=0.01,  # Expected anomaly share; with normal-only training data it acts as a roughly 1% false-alarm budget
    random_state=42,
    n_estimators=200
)

# Fit on normal transactions only (novelty detection). Filtering by label makes this
# semi-supervised rather than fully unsupervised.
normal_transactions = X_train_scaled[y_train == 0]
fraud_detector.fit(normal_transactions)

# Predict on test set
test_predictions = fraud_detector.predict(X_test_scaled)
test_scores = fraud_detector.decision_function(X_test_scaled)

# Convert predictions to binary
test_predictions_binary = (test_predictions == -1).astype(int)

print("🚀 Fraud Detection Model Trained!")
print(f"Training set size: {len(X_train):,} transactions")
print(f"Test set size: {len(X_test):,} transactions")
print(f"Normal transactions in training: {sum(y_train == 0):,}")
print(f"Fraud transactions in test: {sum(y_test == 1):,}")

In [None]:
# Evaluate fraud detection performance
from sklearn.metrics import precision_recall_curve, roc_curve

# Calculate metrics
precision = precision_score(y_test, test_predictions_binary)
recall = recall_score(y_test, test_predictions_binary)
f1 = f1_score(y_test, test_predictions_binary)
auc = roc_auc_score(y_test, -test_scores)

print("🎯 Fraud Detection Performance:")
print(f"Precision: {precision:.3f} ({precision:.1%})")
print(f"Recall:    {recall:.3f} ({recall:.1%})")
print(f"F1-Score:  {f1:.3f}")
print(f"AUC-ROC:   {auc:.3f}")

# Business interpretation
tp = sum((y_test == 1) & (test_predictions_binary == 1))
fp = sum((y_test == 0) & (test_predictions_binary == 1))
fn = sum((y_test == 1) & (test_predictions_binary == 0))

print(f"\n💼 Business Impact:")
print(f"• Detected {tp} out of {sum(y_test)} fraud cases ({tp/sum(y_test):.1%})")
print(f"• {fp} false alarms out of {sum(test_predictions_binary)} flagged transactions")
print(f"• Missed {fn} fraud cases")

# Estimate monetary impact (example)
avg_fraud_amount = fraud_df[fraud_df['is_fraud']==1]['amount'].mean()
fraud_prevented = tp * avg_fraud_amount
fraud_missed = fn * avg_fraud_amount
review_cost = fp * 50  # Assume $50 cost per manual review

print(f"\n💰 Estimated Impact:")
print(f"• Fraud prevented: ${fraud_prevented:,.0f}")
print(f"• Fraud missed: ${fraud_missed:,.0f}")
print(f"• Review costs: ${review_cost:,.0f}")
print(f"• Net benefit: ${fraud_prevented - fraud_missed - review_cost:,.0f}")

In [None]:
# Visualize fraud detection results
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# ROC Curve
fpr, tpr, _ = roc_curve(y_test, -test_scores)
axes[0,0].plot(fpr, tpr, linewidth=2, label=f'ROC Curve (AUC = {auc:.3f})')
axes[0,0].plot([0, 1], [0, 1], 'k--', label='Random Classifier')
axes[0,0].set_xlabel('False Positive Rate')
axes[0,0].set_ylabel('True Positive Rate')
axes[0,0].set_title('ROC Curve')
axes[0,0].legend()
axes[0,0].grid(True, alpha=0.3)

# Precision-Recall Curve
precision_curve, recall_curve, _ = precision_recall_curve(y_test, -test_scores)
axes[0,1].plot(recall_curve, precision_curve, linewidth=2, label='PR Curve')
axes[0,1].axhline(y=sum(y_test)/len(y_test), color='k', linestyle='--', label='Random Classifier')
axes[0,1].set_xlabel('Recall')
axes[0,1].set_ylabel('Precision')
axes[0,1].set_title('Precision-Recall Curve')
axes[0,1].legend()
axes[0,1].grid(True, alpha=0.3)

# Score distribution
normal_scores = test_scores[y_test == 0]
fraud_scores = test_scores[y_test == 1]
axes[1,0].hist(normal_scores, bins=50, alpha=0.7, label='Normal', density=True)
axes[1,0].hist(fraud_scores, bins=20, alpha=0.7, label='Fraud', density=True)
# Draw the threshold before calling legend() so its label is included
axes[1,0].axvline(x=0, color='red', linestyle='--', label='Decision Threshold')
axes[1,0].set_xlabel('Anomaly Score')
axes[1,0].set_ylabel('Density')
axes[1,0].set_title('Score Distribution')
axes[1,0].legend()

# Confusion Matrix
cm = confusion_matrix(y_test, test_predictions_binary)
im = axes[1,1].imshow(cm, interpolation='nearest', cmap='Blues')
axes[1,1].set_title('Confusion Matrix')
tick_marks = np.arange(2)
axes[1,1].set_xticks(tick_marks)
axes[1,1].set_yticks(tick_marks)
axes[1,1].set_xticklabels(['Normal', 'Fraud'])
axes[1,1].set_yticklabels(['Normal', 'Fraud'])
axes[1,1].set_ylabel('True Label')
axes[1,1].set_xlabel('Predicted Label')

# Add text annotations to confusion matrix
for i in range(2):
    for j in range(2):
        axes[1,1].text(j, i, f'{cm[i, j]}', 
                      ha="center", va="center", fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

print("📊 Chart Explanations:")
print("• ROC Curve: The closer the curve is to the top-left corner, the better")
print("• PR Curve: The closer the curve is to the top-right corner, the better")
print("• Score Distribution: The less the normal and fraud histograms overlap, the better")
print("• Confusion Matrix: Diagonal counts should be high and off-diagonal counts low")

## 8. Key Takeaways and Next Steps

Congratulations! You've learned the fundamentals of anomaly detection. Here's what we covered:

### 🎓 What You Learned:

1. **Anomaly Types**: Point, contextual, and collective anomalies
2. **Algorithms**: Statistical methods, Isolation Forest, LOF, and more
3. **Evaluation**: Precision, recall, F1-score, and AUC-ROC
4. **Real Application**: Credit card fraud detection
5. **Business Impact**: Converting technical metrics to business value

### 🚀 Next Steps:

1. **Practice** with your own datasets
2. **Explore** advanced algorithms (deep learning, ensemble methods)
3. **Learn** about time series anomaly detection
4. **Study** real-time detection systems
5. **Understand** domain-specific applications

### 🔧 Practical Tips:

- **Start simple** with statistical methods, then move to complex algorithms
- **Understand your data** before choosing an algorithm
- **Consider the business context** when setting thresholds
- **Always validate** on held-out test data
- **Monitor performance** in production
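
As one way to bring business context into threshold setting, the sketch below picks the score cutoff with the lowest total cost. Everything in it is illustrative: `pick_cutoff`, the `demo_*` arrays, and the $1,000 cost of a missed fraud are assumptions, while the $50 review cost reuses the figure from the fraud example. Scores follow the tutorial's convention that lower means more anomalous.

```python
import numpy as np

def pick_cutoff(scores, labels, cost_missed, cost_review):
    """Return (cutoff, total_cost) minimizing cost when scores <= cutoff are flagged."""
    best_cutoff, best_cost = None, np.inf
    # -inf means "flag nothing"; every observed score is also a candidate cutoff.
    candidates = np.concatenate([[-np.inf], np.unique(scores)])
    for cutoff in candidates:
        flagged = scores <= cutoff
        fp = np.sum(flagged & (labels == 0))   # false alarms that need a manual review
        fn = np.sum(~flagged & (labels == 1))  # anomalies that slip through
        cost = fp * cost_review + fn * cost_missed
        if cost < best_cost:
            best_cutoff, best_cost = cutoff, cost
    return best_cutoff, best_cost

# Hypothetical validation scores and labels (1 = fraud).
demo_scores = np.array([-0.30, -0.20, -0.05, 0.02, 0.05, 0.10, 0.12, 0.15])
demo_labels = np.array([1, 1, 0, 1, 0, 0, 0, 0])

cutoff, cost = pick_cutoff(demo_scores, demo_labels, cost_missed=1000, cost_review=50)
print(f"Cheapest cutoff: {cutoff}, total cost: ${cost}")
```

Because a missed fraud is assumed to cost far more than a review, the cheapest cutoff accepts one false alarm in order to catch every fraud case. The cutoff should be chosen on validation data, not on the final test set.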

### 📚 Further Learning Resources:

- [Intermediate Tutorial: Time Series Anomaly Detection](./intermediate-time-series.ipynb)
- [Advanced Tutorial: Deep Learning for Anomalies](./advanced-deep-learning.ipynb)
- [Industry Examples](../practical-examples/)
- [API Documentation](../../api/README.md)

Happy detecting! 🕵️‍♂️

In [None]:
# Final exercise: Try it yourself!
print("🎉 Congratulations on completing the tutorial!")
print("\n🏆 Your Turn:")
print("1. Try modifying the contamination parameter in the fraud detection example")
print("2. Add new features to improve detection performance")
print("3. Test different algorithms and compare their results")
print("4. Create visualizations for your own dataset")
print("\n💡 Remember: The best anomaly detector is the one that works for YOUR specific problem!")

# Show final summary
print("\n📊 Summary of Key Algorithms:")
algorithms_summary = pd.DataFrame({
    'Algorithm': ['Z-Score', 'Isolation Forest', 'Local Outlier Factor', 'One-Class SVM'],
    'Best For': [
        'Simple, univariate data',
        'High-dimensional tabular data',
        'Local density-based anomalies',
        'Complex decision boundaries'
    ],
    'Speed': ['Fast', 'Fast', 'Medium', 'Slow'],
    'Interpretability': ['High', 'Medium', 'Medium', 'Low']
})
print(algorithms_summary.to_string(index=False))