# Part e) Anomaly Detection using PyOD

This section demonstrates anomaly detection with the Python Outlier Detection (PyOD) library on a synthetic two-feature dataset that contains injected outliers.

In [None]:
# Install PyOD library
!pip install pyod

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# PyOD imports
from pyod.models.knn import KNN
from pyod.models.iforest import IForest
from pyod.models.lof import LOF
from pyod.utils.data import generate_data

print("Libraries imported successfully!")

In [None]:
# Generate synthetic multivariate data with outliers
np.random.seed(42)

# Generate normal data points
n_samples = 300
n_outliers = 30
n_features = 2

# Generate inliers (normal data)
X_inliers = np.random.randn(n_samples, n_features) * 2

# Generate outliers (anomalies)
X_outliers = np.random.uniform(low=-8, high=8, size=(n_outliers, n_features))

# Combine inliers and outliers
X = np.vstack([X_inliers, X_outliers])

# Create ground truth labels (0 = normal, 1 = outlier)
y_true = np.hstack([np.zeros(n_samples), np.ones(n_outliers)])

print(f"Dataset shape: {X.shape}")
print(f"Number of normal points: {n_samples}")
print(f"Number of outliers: {n_outliers}")
print(f"Contamination rate: {n_outliers / (n_samples + n_outliers):.2%}")
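
Because the outliers are drawn uniformly from [-8, 8] in each feature while the inliers have a standard deviation of 2, some labelled outliers land inside the inlier cloud, where no detector can separate them. The sketch below is not part of the original notebook. It rebuilds the same data and counts those points; the cutoff `radius = 6.0` (three inlier standard deviations) is an illustrative assumption.

```python
import numpy as np

# Recreate the dataset from the cell above with the same seed and call order.
np.random.seed(42)
n_samples, n_outliers, n_features = 300, 30, 2
X_inliers = np.random.randn(n_samples, n_features) * 2
X_outliers = np.random.uniform(low=-8, high=8, size=(n_outliers, n_features))

# Assumption: anything within 3 inlier standard deviations (3 * 2 = 6) of the
# origin counts as "inside the inlier cloud". The cutoff is illustrative only.
radius = 6.0
outlier_dist = np.linalg.norm(X_outliers, axis=1)
n_hidden = int((outlier_dist < radius).sum())

print(f"Labelled outliers within radius {radius}: {n_hidden} of {n_outliers}")
```

Any outlier counted here is labelled 1 in `y_true` but looks like a normal point, so it puts a ceiling on the recall a detector can reach on this dataset.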

In [None]:
# Visualize the data before anomaly detection
plt.figure(figsize=(10, 6))
plt.scatter(X[y_true==0, 0], X[y_true==0, 1],
            c='blue', alpha=0.6, s=50, label='Normal Points')
plt.scatter(X[y_true==1, 0], X[y_true==1, 1],
            c='red', alpha=0.8, s=100, marker='x', label='True Anomalies')
plt.xlabel('Feature 1', fontsize=12)
plt.ylabel('Feature 2', fontsize=12)
plt.title('Original Data with True Anomalies', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Train multiple PyOD anomaly detection models
contamination = n_outliers / (n_samples + n_outliers)  # Expected proportion of outliers

print("Training anomaly detection models...")
print(f"Contamination rate used: {contamination:.2%}\n")

# 1. K-Nearest Neighbors (KNN) Detector
print("1. Training KNN Detector...")
knn_clf = KNN(contamination=contamination, n_neighbors=5)
knn_clf.fit(X)
# labels_ holds the training-set labels (0 = normal, 1 = anomaly) and is
# consistent with decision_scores_; predict() is intended for unseen data.
y_pred_knn = knn_clf.labels_
knn_scores = knn_clf.decision_scores_  # Outlier scores (higher = more anomalous)
print("   KNN training complete!")

# 2. Isolation Forest
print("2. Training Isolation Forest...")
iforest_clf = IForest(contamination=contamination, random_state=42)
iforest_clf.fit(X)
y_pred_iforest = iforest_clf.labels_
iforest_scores = iforest_clf.decision_scores_
print("   Isolation Forest training complete!")

# 3. Local Outlier Factor (LOF)
print("3. Training LOF Detector...")
lof_clf = LOF(contamination=contamination, n_neighbors=20)
lof_clf.fit(X)
y_pred_lof = lof_clf.labels_
lof_scores = lof_clf.decision_scores_
print("   LOF training complete!")

print("\nAll models trained successfully!")
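
The `contamination` argument decides where each detector cuts its score distribution. To our understanding, PyOD places the threshold at the `100 * (1 - contamination)` percentile of the training scores and labels every point above it as an anomaly. The following is a minimal NumPy sketch of that rule, using made-up scores in place of a fitted model's `decision_scores_`:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical outlier scores standing in for a detector's decision_scores_.
scores = rng.random(330)
contamination = 30 / 330

# Cut the score distribution so that roughly `contamination` of the
# points fall above the threshold.
threshold = np.percentile(scores, 100 * (1 - contamination))
labels = (scores > threshold).astype(int)  # 1 = anomaly, 0 = normal

print(f"Threshold: {threshold:.4f}")
print(f"Points flagged: {labels.sum()} of {len(scores)}")
```

Under this rule a detector flags about 9% of the points no matter how many true anomalies exist, so an incorrect `contamination` value translates directly into false positives or missed anomalies.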

In [None]:
# Evaluate model performance
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

print("="*60)
print("ANOMALY DETECTION PERFORMANCE EVALUATION")
print("="*60)

models = {
    'KNN': y_pred_knn,
    'Isolation Forest': y_pred_iforest,
    'LOF': y_pred_lof
}

results = []

for name, y_pred in models.items():
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)

    results.append({
        'Model': name,
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1-Score': f1
    })

    print(f"\n{name}:")
    print(f"  Accuracy:  {accuracy:.4f}")
    print(f"  Precision: {precision:.4f}")
    print(f"  Recall:    {recall:.4f}")
    print(f"  F1-Score:  {f1:.4f}")

print("\n" + "="*60)

# Create results DataFrame
results_df = pd.DataFrame(results)
print("\nSummary Table:")
print(results_df.to_string(index=False))
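
With roughly 9% anomalies, accuracy is dominated by the normal class, so it helps to see how the four metrics come from the confusion-matrix counts. The helper below is a hypothetical, dependency-free sketch (the name `binary_metrics` and the toy labels are ours, not part of the notebook) that follows the standard definitions:

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall, f1) for labels where 1 = anomaly."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # anomalies caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # anomalies missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # normals kept

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1


# Toy example: 10 points, 2 true anomalies, and a detector that flags nothing.
y_true_toy = [0] * 8 + [1] * 2
y_pred_toy = [0] * 10
print(binary_metrics(y_true_toy, y_pred_toy))  # (0.8, 0.0, 0.0, 0.0)
```

The detector in the toy example finds no anomalies at all yet still reaches 80% accuracy, which is why precision, recall, and F1 are the metrics to compare across models here.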

In [None]:
# Visualize anomaly detection results for all models
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Plot 1: Original Data with True Anomalies
ax1 = axes[0, 0]
ax1.scatter(X[y_true==0, 0], X[y_true==0, 1],
            c='blue', alpha=0.6, s=50, label='Normal Points')
ax1.scatter(X[y_true==1, 0], X[y_true==1, 1],
            c='red', alpha=0.8, s=100, marker='x', label='True Anomalies')
ax1.set_xlabel('Feature 1', fontsize=11)
ax1.set_ylabel('Feature 2', fontsize=11)
ax1.set_title('Ground Truth', fontsize=12, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Plot 2: KNN Detector Results
ax2 = axes[0, 1]
ax2.scatter(X[y_pred_knn==0, 0], X[y_pred_knn==0, 1],
            c='blue', alpha=0.6, s=50, label='Predicted Normal')
ax2.scatter(X[y_pred_knn==1, 0], X[y_pred_knn==1, 1],
            c='orange', alpha=0.8, s=100, marker='s', label='Predicted Anomalies')
ax2.set_xlabel('Feature 1', fontsize=11)
ax2.set_ylabel('Feature 2', fontsize=11)
ax2.set_title(f'KNN Detector (F1: {f1_score(y_true, y_pred_knn):.3f})',
              fontsize=12, fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)

# Plot 3: Isolation Forest Results
ax3 = axes[1, 0]
ax3.scatter(X[y_pred_iforest==0, 0], X[y_pred_iforest==0, 1],
            c='blue', alpha=0.6, s=50, label='Predicted Normal')
ax3.scatter(X[y_pred_iforest==1, 0], X[y_pred_iforest==1, 1],
            c='green', alpha=0.8, s=100, marker='^', label='Predicted Anomalies')
ax3.set_xlabel('Feature 1', fontsize=11)
ax3.set_ylabel('Feature 2', fontsize=11)
ax3.set_title(f'Isolation Forest (F1: {f1_score(y_true, y_pred_iforest):.3f})',
              fontsize=12, fontweight='bold')
ax3.legend()
ax3.grid(True, alpha=0.3)

# Plot 4: LOF Results
ax4 = axes[1, 1]
ax4.scatter(X[y_pred_lof==0, 0], X[y_pred_lof==0, 1],
            c='blue', alpha=0.6, s=50, label='Predicted Normal')
ax4.scatter(X[y_pred_lof==1, 0], X[y_pred_lof==1, 1],
            c='purple', alpha=0.8, s=100, marker='d', label='Predicted Anomalies')
ax4.set_xlabel('Feature 1', fontsize=11)
ax4.set_ylabel('Feature 2', fontsize=11)
ax4.set_title(f'LOF Detector (F1: {f1_score(y_true, y_pred_lof):.3f})',
              fontsize=12, fontweight='bold')
ax4.legend()
ax4.grid(True, alpha=0.3)

plt.suptitle('Anomaly Detection Results Comparison',
             fontsize=16, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

## Summary and Conclusions

In this demonstration, we applied three detectors from the **PyOD (Python Outlier Detection)** library to a synthetic two-feature dataset with known anomalies.

### Key Findings:

1. **Dataset**: Created a synthetic 2D dataset with 300 normal points and 30 outliers (~9% contamination rate)

2. **Models Tested**:
   - **K-Nearest Neighbors (KNN)**: Distance-based approach that scores each point by its distance to its nearest neighbors, so isolated points receive high scores
   - **Isolation Forest**: Tree-based ensemble method that isolates points through random partitioning; anomalies tend to be isolated in fewer splits
   - **Local Outlier Factor (LOF)**: Density-based method that compares each point's local density with that of its neighbors

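3. **Performance**: Each model was scored against the known labels using accuracy, precision, recall, and F1-score (see the summary table above). At ~9% contamination, accuracy alone is misleading, since labeling every point as normal already scores about 91%; precision, recall, and F1 are the more informative metrics.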
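
To make the distance-based idea concrete, here is a minimal from-scratch sketch of a k-nearest-neighbor outlier score in NumPy. It scores each point by the distance to its k-th nearest neighbor, which to our understanding matches the default behavior of PyOD's `KNN`; the function name `knn_outlier_scores` and the demo data are illustrative assumptions, not code from the library.

```python
import numpy as np


def knn_outlier_scores(X, k=5):
    """Score each point by the distance to its k-th nearest neighbor."""
    # Pairwise Euclidean distances, shape (n, n).
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # After sorting each row, column 0 is the point itself (distance 0),
    # so column k is the k-th nearest *other* point.
    return np.sort(dists, axis=1)[:, k]


rng = np.random.default_rng(0)
cluster = rng.normal(0.0, 1.0, size=(50, 2))  # dense cluster near the origin
far_point = np.array([[10.0, 10.0]])          # one obvious outlier
X_demo = np.vstack([cluster, far_point])

scores = knn_outlier_scores(X_demo, k=5)
print("Index of highest score:", int(scores.argmax()))
```

The brute-force distance matrix costs O(n²) memory, which is fine for a few hundred points; library implementations use tree-based neighbor search instead.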

### Use Cases:
Outlier detectors like those in PyOD are applied in areas such as:
- **Fraud Detection**: Identifying unusual transaction patterns in financial systems
- **Network Security**: Detecting intrusions and malicious activities
- **Manufacturing**: Quality control and defect detection
- **Healthcare**: Identifying unusual patient conditions or medical anomalies
- **Time Series Monitoring**: Detecting anomalies in IoT sensor data

### Next Steps:
- Apply to real-world datasets (e.g., credit card fraud, network traffic)
- Experiment with other PyOD algorithms (HBOS, COPOD, AutoEncoder)
- Tune hyperparameters such as `n_neighbors` and `contamination`, and compare the resulting F1-scores
- Implement ensemble methods combining multiple detectors
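
As a starting point for the ensemble idea, one simple approach is to standardize each detector's scores and average them. The sketch below is an assumption-laden illustration in plain NumPy: `combine_scores` is a hypothetical helper, and the three score arrays are made-up values rather than output from the models above.

```python
import numpy as np


def combine_scores(score_lists):
    """Average z-score-standardized outlier scores across detectors."""
    stacked = np.column_stack(score_lists).astype(float)
    # Standardize each detector's scores so different scales are comparable.
    mean = stacked.mean(axis=0)
    std = stacked.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant scores
    return ((stacked - mean) / std).mean(axis=1)


# Hypothetical scores from three detectors for five points;
# all three give the last point the highest score.
scores_a = np.array([0.1, 0.2, 0.15, 0.3, 2.0])
scores_b = np.array([10.0, 12.0, 11.0, 9.0, 40.0])
scores_c = np.array([-0.5, -0.4, -0.6, -0.3, 0.9])

combined = combine_scores([scores_a, scores_b, scores_c])
print("Most anomalous point:", int(combined.argmax()))  # 4
```

In the notebook, the same helper could be called with `knn_scores`, `iforest_scores`, and `lof_scores`, followed by a percentile cut at the chosen contamination rate.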