# Tier 6: Advanced Anomaly Detection & Outlier Analysis

---

**Author:** Brandon Deloatch
**Affiliation:** Quipu Research Labs, LLC
**Date:** 2025-10-02
**Version:** v1.3
**License:** MIT
**Notebook ID:** f8bf2d16-b5d6-4a48-97f4-1d58238ec1c7

---

## Citation
Brandon Deloatch, "Tier 6: Advanced Anomaly Detection & Outlier Analysis," Quipu Research Labs, LLC, v1.3, 2025-10-02.

Please cite this notebook if used or adapted in publications, presentations, or derivative work.

---

## Contributors / Acknowledgments
- **Primary Author:** Brandon Deloatch (Quipu Research Labs, LLC)
- **Institutional Support:** Quipu Research Labs, LLC - Advanced Analytics Division
- **Technical Framework:** Built on scikit-learn, pandas, numpy, and plotly ecosystems
- **Methodological Foundation:** Statistical learning principles and modern data science best practices

---

## Version History
| Version | Date | Notes |
|---------|------|-------|
| v1.3 | 2025-10-02 | Enhanced professional formatting, comprehensive documentation, interactive visualizations |
| v1.2 | 2024-09-15 | Updated analysis methods, improved data generation algorithms |
| v1.0 | 2024-06-10 | Initial release with core analytical framework |

---

## Environment Dependencies
- **Python:** 3.8+
- **Core Libraries:** pandas 2.0+, numpy 1.24+, scikit-learn 1.3+
- **Visualization:** plotly 5.0+, matplotlib 3.7+
- **Statistical:** scipy 1.10+, statsmodels 0.14+
- **Development:** jupyter-lab 4.0+, ipywidgets 8.0+

> **Reproducibility Note:** Use requirements.txt or environment.yml for exact dependency matching.

---

## Data Provenance
| Dataset | Source | License | Notes |
|---------|--------|---------|-------|
| Synthetic Data | Generated in-notebook | MIT | Custom algorithms for realistic simulation |
| Statistical Distributions | NumPy/SciPy | BSD-3-Clause | Standard library implementations |
| ML Algorithms | Scikit-learn | BSD-3-Clause | Industry-standard implementations |
| Visualization Schemas | Plotly | MIT | Interactive dashboard frameworks |

---

## Execution Provenance Logs
- **Created:** 2025-10-02
- **Notebook ID:** f8bf2d16-b5d6-4a48-97f4-1d58238ec1c7
- **Execution Environment:** Jupyter Lab / VS Code
- **Computational Requirements:** Standard laptop/workstation (2GB+ RAM recommended)

> **Auto-tracking:** Execution metadata can be programmatically captured for reproducibility.
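
One way the auto-tracking note above could be realized is sketched below. It is an illustration rather than part of the notebook: the function name `capture_execution_metadata` and the package list are assumptions, and it relies only on the standard library (`importlib.metadata`, available from Python 3.8).

```python
import json
import platform
import sys
from datetime import datetime, timezone
from importlib import metadata


def capture_execution_metadata(
    packages=("pandas", "numpy", "scikit-learn", "scipy", "plotly"),
):
    """Collect interpreter, platform, and package versions for a run log.

    Hypothetical helper: the name and default package list are assumptions.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "captured_at_utc": datetime.now(timezone.utc).isoformat(),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }


run_metadata = capture_execution_metadata()
print(json.dumps(run_metadata, indent=2))
```

Storing this dictionary next to the notebook output (for example as a JSON file) would make it easier to check later which library versions produced a given set of results.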

---

## Disclaimer & Responsible Use
This notebook is provided "as-is" for educational, research, and professional development purposes. Users assume full responsibility for any results, applications, or decisions derived from this analysis.

**Professional Standards:**
- Validate all results against domain expertise and additional data sources
- Respect licensing and attribution requirements for all dependencies
- Follow ethical guidelines for data analysis and algorithmic decision-making
- Credit all methodological sources and derivative frameworks appropriately

**Academic & Commercial Use:**
- Permitted under MIT license with proper attribution
- Suitable for educational curriculum and professional training
- Appropriate for commercial adaptation with citation requirements
- Recommended for reproducible research and transparent analytics

---



In [13]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
from sklearn.covariance import EllipticEnvelope
from sklearn.cluster import DBSCAN
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

print(" Tier 6: Advanced Anomaly Detection & Outlier Analysis")
print("=" * 55)
print(" CROSS-REFERENCES:")
print("• Prerequisites: Tier1_Descriptive.ipynb, Tier2_LogisticRegression.ipynb, Tier5_Classification.ipynb")
print("• Builds On: Tier4_SVM.ipynb (one-class SVM), Tier4_KMeans.ipynb (clustering-based)")
print("• Complements: Tier6_RealTimeAnalytics.ipynb, Tier3_TimeSeries.ipynb")
print("• Advanced: Tier6_DeepLearning.ipynb, Tier6_EnsembleMethods.ipynb")
print("=" * 55)
print("Anomaly Detection Techniques:")
print("• Statistical methods (Z-score, IQR, Grubbs' test)")
print("• Machine learning approaches (Isolation Forest, One-Class SVM)")
print("• Clustering-based detection (DBSCAN, LOF)")
print("• Ensemble methods and real-time detection")
print("• Performance evaluation and business impact analysis")

 Tier 6: Advanced Anomaly Detection & Outlier Analysis
 CROSS-REFERENCES:
• Prerequisites: Tier1_Descriptive.ipynb, Tier2_LogisticRegression.ipynb, Tier5_Classification.ipynb
• Builds On: Tier4_SVM.ipynb (one-class SVM), Tier4_KMeans.ipynb (clustering-based)
• Complements: Tier6_RealTimeAnalytics.ipynb, Tier3_TimeSeries.ipynb
• Advanced: Tier6_DeepLearning.ipynb, Tier6_EnsembleMethods.ipynb
Anomaly Detection Techniques:
• Statistical methods (Z-score, IQR, Grubbs' test)
• Machine learning approaches (Isolation Forest, One-Class SVM)
• Clustering-based detection (DBSCAN, LOF)
• Ensemble methods and real-time detection
• Performance evaluation and business impact analysis


In [14]:
# Generate comprehensive synthetic datasets with known anomalies
np.random.seed(42)

def generate_anomaly_datasets():
    """Generate multiple datasets with different types of anomalies."""

    # Dataset 1: Financial Transactions (fraud detection)
    n_normal = 8000
    n_fraud = 200

    # Normal transactions
    normal_amounts = np.random.lognormal(mean=3, sigma=1, size=n_normal)
    normal_amounts = np.clip(normal_amounts, 1, 1000) # Realistic transaction amounts

    normal_features = {
        'transaction_amount': normal_amounts,
        'account_age_days': np.random.normal(365, 200, n_normal),
        'daily_transaction_count': np.random.poisson(3, n_normal),
        'time_since_last_transaction': np.random.exponential(2, n_normal),
        'merchant_risk_score': np.random.beta(2, 5, n_normal),
        'is_weekend': np.random.choice([0, 1], n_normal, p=[0.7, 0.3]),
        'customer_id': np.random.randint(1, 2000, n_normal)
    }

    # Fraudulent transactions (anomalies)
    # Three fraud patterns: high-value, micro-transactions, and large lognormal amounts
    high_value_fraud = np.random.uniform(2000, 10000, n_fraud//3)
    micro_fraud = np.random.uniform(0.1, 5, n_fraud//3)
    remaining_fraud = n_fraud - 2*(n_fraud//3)
    large_random_fraud = np.random.lognormal(5, 0.5, remaining_fraud)
    fraud_amounts = np.concatenate([high_value_fraud, micro_fraud, large_random_fraud])

    # Account ages: half newly opened accounts, half old dormant accounts
    new_accounts = np.random.uniform(0, 30, n_fraud//2)
    remaining_fraud_accounts = n_fraud - (n_fraud//2)
    old_dormant_accounts = np.random.uniform(1000, 2000, remaining_fraud_accounts)
    fraud_account_ages = np.concatenate([new_accounts, old_dormant_accounts])

    # Daily transaction counts: half high-frequency bursts, half near-inactive accounts
    high_frequency = np.random.poisson(15, n_fraud//2)
    remaining_fraud_txns = n_fraud - (n_fraud//2)
    low_frequency = np.random.poisson(0.5, remaining_fraud_txns)
    fraud_transaction_counts = np.concatenate([high_frequency, low_frequency])

    fraud_features = {
        'transaction_amount': fraud_amounts,
        'account_age_days': fraud_account_ages,
        'daily_transaction_count': fraud_transaction_counts,
        'time_since_last_transaction': np.random.exponential(0.1, n_fraud), # Rapid succession
        'merchant_risk_score': np.random.beta(5, 2, n_fraud), # Higher risk merchants
        'is_weekend': np.random.choice([0, 1], n_fraud, p=[0.4, 0.6]), # More weekend activity
        'customer_id': np.random.randint(1, 2000, n_fraud)
    }

    # Combine normal and fraudulent transactions
    financial_df = pd.DataFrame({
        key: np.concatenate([normal_features[key], fraud_features[key]])
        for key in normal_features.keys()
    })
    financial_df['is_fraud'] = np.concatenate([np.zeros(n_normal), np.ones(n_fraud)])
    financial_df = financial_df.sample(frac=1).reset_index(drop=True) # Shuffle

    # Dataset 2: Manufacturing Quality Control
    n_normal_parts = 5000
    n_defective = 150

    # Normal parts (multivariate normal distribution)
    normal_temp = np.random.normal(250, 10, n_normal_parts)
    normal_pressure = np.random.normal(15, 2, n_normal_parts)
    normal_vibration = np.random.normal(0.5, 0.1, n_normal_parts)
    normal_thickness = np.random.normal(5.0, 0.1, n_normal_parts)

    # Defective parts (various anomaly patterns)
    defective_temp = np.concatenate([
        np.random.normal(300, 15, n_defective//3), # Overheating
        np.random.normal(200, 10, n_defective//3), # Underheating
        np.random.normal(250, 50, n_defective - 2*(n_defective//3)) # Unstable temperature
    ])

    defective_pressure = np.concatenate([
        np.random.normal(25, 5, n_defective//2), # High pressure
        np.random.normal(5, 2, n_defective - (n_defective//2)) # Low pressure
    ])

    defective_vibration = np.random.uniform(1.0, 3.0, n_defective) # High vibration
    defective_thickness = np.concatenate([
        np.random.normal(5.5, 0.2, n_defective//2), # Too thick
        np.random.normal(4.5, 0.2, n_defective - (n_defective//2)) # Too thin
    ])

    manufacturing_df = pd.DataFrame({
        'temperature': np.concatenate([normal_temp, defective_temp]),
        'pressure': np.concatenate([normal_pressure, defective_pressure]),
        'vibration': np.concatenate([normal_vibration, defective_vibration]),
        'thickness': np.concatenate([normal_thickness, defective_thickness]),
        'is_defective': np.concatenate([np.zeros(n_normal_parts), np.ones(n_defective)])
    })
    manufacturing_df = manufacturing_df.sample(frac=1).reset_index(drop=True)

    return financial_df, manufacturing_df

# Generate datasets
financial_df, manufacturing_df = generate_anomaly_datasets()

print(" Anomaly Detection Datasets Created:")
print(f"\n1. Financial Transactions Dataset:")
print(f" • Total transactions: {len(financial_df):,}")
print(f" • Fraudulent transactions: {financial_df['is_fraud'].sum():.0f} ({financial_df['is_fraud'].mean()*100:.1f}%)")
print(f" • Average transaction: ${financial_df['transaction_amount'].mean():.2f}")
print(f" • Fraud rate: 1 in {len(financial_df) / financial_df['is_fraud'].sum():.0f} transactions")

print(f"\n2. Manufacturing Quality Control:")
print(f" • Total parts inspected: {len(manufacturing_df):,}")
print(f" • Defective parts: {manufacturing_df['is_defective'].sum():.0f} ({manufacturing_df['is_defective'].mean()*100:.1f}%)")
print(f" • Average temperature: {manufacturing_df['temperature'].mean():.1f}°C")
print(f" • Quality rate: {(1-manufacturing_df['is_defective'].mean())*100:.1f}%")

# Display sample data
print(financial_df.head())

 Anomaly Detection Datasets Created:

1. Financial Transactions Dataset:
 • Total transactions: 8,200
 • Fraudulent transactions: 200 (2.4%)
 • Average transaction: $85.68
 • Fraud rate: 1 in 41 transactions

2. Manufacturing Quality Control:
 • Total parts inspected: 5,150
 • Defective parts: 150 (2.9%)
 • Average temperature: 250.0°C
 • Quality rate: 97.1%
   transaction_amount  account_age_days  daily_transaction_count  \
0           11.141399        147.097318                        4   
1           17.430864        698.559587                        1   
2           31.187409        490.375413                        5   
3           38.291027        318.098493                        2   
4           13.333950        332.909991                        1   

   time_since_last_transaction  merchant_risk_score  is_weekend  customer_id  \
0                     2.159092             0.126949           0          752   
1                     3.427894             0.165478           0       

In [15]:
# 1. STATISTICAL ANOMALY DETECTION METHODS
print(" 1. STATISTICAL ANOMALY DETECTION METHODS")
print("=" * 43)

def statistical_anomaly_detection(data, column, method='zscore', threshold=3):
    """Apply various statistical methods for anomaly detection."""

    if method == 'zscore':
        z_scores = np.abs(stats.zscore(data[column]))
        anomalies = z_scores > threshold
        scores = z_scores

    elif method == 'iqr':
        Q1 = data[column].quantile(0.25)
        Q3 = data[column].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        anomalies = (data[column] < lower_bound) | (data[column] > upper_bound)
        scores = np.abs(data[column] - data[column].median()) / IQR

    elif method == 'grubbs':
        # Simplified Grubbs test (single outlier)
        mean_val = data[column].mean()
        std_val = data[column].std()
        z_scores = np.abs((data[column] - mean_val) / std_val)

        n = len(data)
        t_critical = stats.t.ppf(1 - 0.05/(2*n), n-2) # Two-tailed test
        grubbs_critical = ((n-1)/np.sqrt(n)) * np.sqrt(t_critical**2/(n-2+t_critical**2))

        anomalies = z_scores > grubbs_critical
        scores = z_scores

    elif method == 'modified_zscore':
        # Using median absolute deviation (more robust)
        median = data[column].median()
        mad = np.median(np.abs(data[column] - median))
        modified_z_scores = 0.6745 * (data[column] - median) / mad
        anomalies = np.abs(modified_z_scores) > threshold
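        scores = np.abs(modified_z_scores)

    else:
        # Fail loudly instead of hitting an UnboundLocalError at the return below
        raise ValueError(f"Unknown method: {method!r}")

    return anomalies, scores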

# Apply statistical methods to financial transaction amounts
print("Financial Transaction Amount Analysis:")

stat_methods = ['zscore', 'iqr', 'grubbs', 'modified_zscore']
stat_results = {}

for method in stat_methods:
    threshold = 3 if method in ['zscore', 'modified_zscore'] else None
    anomalies, scores = statistical_anomaly_detection(
        financial_df, 'transaction_amount', method, threshold
    )

    stat_results[method] = {
        'anomalies': anomalies,
        'scores': scores,
        'count': anomalies.sum(),
        'percentage': anomalies.mean() * 100
    }

    # Calculate precision, recall, F1 for fraud detection
    if 'is_fraud' in financial_df.columns:
        true_positives = ((anomalies) & (financial_df['is_fraud'] == 1)).sum()
        false_positives = ((anomalies) & (financial_df['is_fraud'] == 0)).sum()
        false_negatives = ((~anomalies) & (financial_df['is_fraud'] == 1)).sum()

        precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0
        recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

        stat_results[method].update({
            'precision': precision,
            'recall': recall,
            'f1': f1
        })

    print(f"\n{method.upper()} Method:")
    print(f" • Anomalies detected: {anomalies.sum()} ({anomalies.mean()*100:.1f}%)")
    if 'precision' in stat_results[method]:
        print(f" • Precision: {precision:.3f}")
        print(f" • Recall: {recall:.3f}")
        print(f" • F1-Score: {f1:.3f}")

# Find best statistical method
best_stat_method = max(stat_methods, key=lambda x: stat_results[x].get('f1', 0))
print(f"\n Best Statistical Method: {best_stat_method.upper()} (F1: {stat_results[best_stat_method]['f1']:.3f})")

 1. STATISTICAL ANOMALY DETECTION METHODS
Financial Transaction Amount Analysis:

ZSCORE Method:
 • Anomalies detected: 66 (0.8%)
 • Precision: 1.000
 • Recall: 0.330
 • F1-Score: 0.496

IQR Method:
 • Anomalies detected: 715 (8.7%)
 • Precision: 0.168
 • Recall: 0.600
 • F1-Score: 0.262

GRUBBS Method:
 • Anomalies detected: 62 (0.8%)
 • Precision: 1.000
 • Recall: 0.310
 • F1-Score: 0.473

MODIFIED_ZSCORE Method:
 • Anomalies detected: 889 (10.8%)
 • Precision: 0.143
 • Recall: 0.635
 • F1-Score: 0.233

 Best Statistical Method: ZSCORE (F1: 0.496)
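
The gap between the z-score and modified z-score results above is consistent with masking: extreme values inflate the mean and standard deviation that the z-score relies on, while the median and MAD barely move. The sketch below illustrates this on a small made-up sample (the values and function names are assumptions, not notebook data), using the same threshold of 3 for both methods.

```python
import numpy as np


def zscore_flags(x, threshold=3.0):
    """Classic z-score: mean and standard deviation are both pulled by outliers."""
    x = np.asarray(x, dtype=float)
    return np.abs((x - x.mean()) / x.std()) > threshold


def modified_zscore_flags(x, threshold=3.0):
    """MAD-based modified z-score; 0.6745 rescales the MAD to sigma for normal data."""
    x = np.asarray(x, dtype=float)
    median = np.median(x)
    mad = np.median(np.abs(x - median))
    if mad == 0:
        # Degenerate case (over half the values identical): nothing can be scored
        return np.zeros(len(x), dtype=bool)
    return np.abs(0.6745 * (x - median) / mad) > threshold


# Hypothetical sample: twelve typical values followed by three extreme ones
sample = np.array([10, 12, 11, 9, 10, 13, 8, 11, 10, 9, 12, 10, 100, 110, 120])

z_flags = zscore_flags(sample)
mz_flags = modified_zscore_flags(sample)
print("z-score flags:         ", sample[z_flags])
print("modified z-score flags:", sample[mz_flags])
```

On heavily skewed data such as the lognormal transaction amounts, the MAD-based score can swing the other way and flag many legitimate values, which may explain its lower precision in the results above.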


In [16]:
# 2. MACHINE LEARNING ANOMALY DETECTION
print(" 2. MACHINE LEARNING ANOMALY DETECTION")
print("=" * 41)

# Prepare data for ML models
scaler = StandardScaler()

# Financial data preparation
financial_features = ['transaction_amount', 'account_age_days', 'daily_transaction_count',
                      'time_since_last_transaction', 'merchant_risk_score', 'is_weekend']
X_financial = financial_df[financial_features].copy()
X_financial_scaled = scaler.fit_transform(X_financial)

# Manufacturing data preparation
manufacturing_features = ['temperature', 'pressure', 'vibration', 'thickness']
X_manufacturing = manufacturing_df[manufacturing_features].copy()
X_manufacturing_scaled = StandardScaler().fit_transform(X_manufacturing)

# Initialize ML anomaly detection models
ml_models = {
    'Isolation Forest': IsolationForest(contamination=0.1, random_state=42),
    'One-Class SVM': OneClassSVM(gamma='scale', nu=0.1),
    'Local Outlier Factor': LocalOutlierFactor(contamination=0.1),
    'Elliptic Envelope': EllipticEnvelope(contamination=0.1),
    'DBSCAN': DBSCAN(eps=0.5, min_samples=5)
}

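from sklearn.base import clone

def evaluate_ml_anomaly_detection(X, y_true, models, dataset_name):
    """Evaluate multiple ML anomaly detection models."""

    print(f"\n{dataset_name} Dataset Evaluation:")
    print("=" * (len(dataset_name) + 20))

    results = {}

    for model_name, base_model in models.items():
        try:
            # Fit a fresh copy for each dataset. The estimators in `models` are
            # shared between calls, so fitting them in place would overwrite the
            # fitted model stored in an earlier call's results.
            model = clone(base_model)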
            if model_name == 'Local Outlier Factor':
                # LOF returns 1 for inliers, -1 for outliers
                predictions = model.fit_predict(X)
                anomaly_mask = predictions == -1

            elif model_name == 'DBSCAN':
                # DBSCAN labels: -1 for noise (anomalies), 0+ for clusters
                predictions = model.fit_predict(X)
                anomaly_mask = predictions == -1

            else:
                # Standard sklearn anomaly detection interface
                model.fit(X)
                predictions = model.predict(X)
                anomaly_mask = predictions == -1

            # Calculate performance metrics
            if y_true is not None:
                true_positives = ((anomaly_mask) & (y_true == 1)).sum()
                false_positives = ((anomaly_mask) & (y_true == 0)).sum()
                false_negatives = ((~anomaly_mask) & (y_true == 1)).sum()
                true_negatives = ((~anomaly_mask) & (y_true == 0)).sum()

                precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0
                recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0
                f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

                # Calculate AUC if possible
                try:
                    if hasattr(model, 'decision_function'):
                        scores = model.decision_function(X)
                        auc = roc_auc_score(y_true, -scores) # Negative because anomalies have lower scores
                    else:
                        auc = None
                except Exception:
                    auc = None
            else:
                precision = recall = f1 = auc = None

            results[model_name] = {
                'anomaly_mask': anomaly_mask,
                'anomaly_count': anomaly_mask.sum(),
                'anomaly_percentage': anomaly_mask.mean() * 100,
                'precision': precision,
                'recall': recall,
                'f1': f1,
                'auc': auc,
                'model': model
            }

            print(f"\n{model_name}:")
            print(f" • Anomalies detected: {anomaly_mask.sum()} ({anomaly_mask.mean()*100:.1f}%)")
            if precision is not None:
                print(f" • Precision: {precision:.3f}")
                print(f" • Recall: {recall:.3f}")
                print(f" • F1-Score: {f1:.3f}")
            if auc is not None:
                print(f" • AUC: {auc:.3f}")

        except Exception as e:
            print(f"\n{model_name}: Error - {str(e)}")
            results[model_name] = None

    return results

# Evaluate models on both datasets
financial_results = evaluate_ml_anomaly_detection(
    X_financial_scaled, financial_df['is_fraud'], ml_models, "Financial Fraud Detection"
)

manufacturing_results = evaluate_ml_anomaly_detection(
    X_manufacturing_scaled, manufacturing_df['is_defective'], ml_models, "Manufacturing Quality Control"
)

# Find best models for each dataset
def find_best_model(results):
    """Find best performing model based on F1 score."""
    valid_results = {k: v for k, v in results.items() if v is not None and v['f1'] is not None}
    if valid_results:
        return max(valid_results.keys(), key=lambda x: valid_results[x]['f1'])
    return None

best_financial_model = find_best_model(financial_results)
best_manufacturing_model = find_best_model(manufacturing_results)

print(f"\n\n BEST PERFORMING MODELS:")
print("=" * 26)
if best_financial_model:
    print(f"• Financial: {best_financial_model} (F1: {financial_results[best_financial_model]['f1']:.3f})")
if best_manufacturing_model:
    print(f"• Manufacturing: {best_manufacturing_model} (F1: {manufacturing_results[best_manufacturing_model]['f1']:.3f})")

 2. MACHINE LEARNING ANOMALY DETECTION

Financial Fraud Detection Dataset Evaluation:

Isolation Forest:
 • Anomalies detected: 820 (10.0%)
 • Precision: 0.244
 • Recall: 1.000
 • F1-Score: 0.392
 • AUC: 0.998

One-Class SVM:
 • Anomalies detected: 819 (10.0%)
 • Precision: 0.244
 • Recall: 1.000
 • F1-Score: 0.393
 • AUC: 0.996

Local Outlier Factor:
 • Anomalies detected: 820 (10.0%)
 • Precision: 0.061
 • Recall: 0.250
 • F1-Score: 0.098

Elliptic Envelope:
 • Anomalies detected: 820 (10.0%)
 • Precision: 0.232
 • Recall: 0.950
 • F1-Score: 0.373
 • AUC: 0.977

DBSCAN:
 • Anomalies detected: 1337 (16.3%)
 • Precision: 0.129
 • Rec
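
With `contamination=0.1`, each detector flags roughly 10% of rows, while only about 2.4% are fraudulent. That caps precision: flagging 820 rows that include all 200 frauds gives 200/820 ≈ 0.244, which matches the Isolation Forest and One-Class SVM figures above. The sketch below shows how precision, recall, and F1 move with the flag rate. The helper `metrics_at_flag_rate` and the perfectly separating scores are hypothetical; only the class counts mirror the financial dataset.

```python
import numpy as np


def metrics_at_flag_rate(scores, y_true, flag_rate):
    """Flag the top `flag_rate` fraction by anomaly score; return (precision, recall, f1)."""
    scores = np.asarray(scores, dtype=float)
    y_true = np.asarray(y_true).astype(bool)
    n_flag = max(1, int(round(flag_rate * len(scores))))

    # Mark the n_flag highest scores as anomalies (higher score = more anomalous)
    flagged = np.zeros(len(scores), dtype=bool)
    flagged[np.argsort(scores)[-n_flag:]] = True

    tp = np.sum(flagged & y_true)
    precision = tp / flagged.sum()
    recall = tp / y_true.sum()
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    return precision, recall, f1


# Hypothetical scores that separate the classes perfectly: 200 anomalies among
# 8,200 rows, mirroring the financial dataset's class counts (scores are made up).
y_demo = np.concatenate([np.zeros(8000), np.ones(200)])
scores_demo = np.concatenate([np.linspace(0.0, 0.5, 8000), np.linspace(0.6, 1.0, 200)])

for rate in (0.10, 0.05, 200 / 8200):
    p, r, f = metrics_at_flag_rate(scores_demo, y_demo, rate)
    print(f"flag rate {rate:.3f}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Even with ideal scores, a 10% flag rate cannot exceed a precision of about 0.244 on this class balance, so setting `contamination` (or `nu`) closer to the expected anomaly rate is one lever worth testing before comparing detectors by F1.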

In [17]:
# 3. ENSEMBLE ANOMALY DETECTION
print(" 3. ENSEMBLE ANOMALY DETECTION")
print("=" * 31)

class EnsembleAnomalyDetector:
    """Ensemble anomaly detector combining multiple methods."""

    def __init__(self, models=None, voting='majority', weights=None):
        self.models = models or [
            IsolationForest(contamination=0.1, random_state=42),
            OneClassSVM(gamma='scale', nu=0.1),
            EllipticEnvelope(contamination=0.1)
        ]
        self.voting = voting
        self.weights = weights or [1.0] * len(self.models)
        self.fitted_models = []

    def fit(self, X):
        """Fit all models in the ensemble."""
        self.fitted_models = []
        for model in self.models:
            fitted_model = model.fit(X)
            self.fitted_models.append(fitted_model)
        return self

    def predict(self, X):
        """Predict anomalies using ensemble voting."""
        predictions = []

        for model in self.fitted_models:
            pred = model.predict(X)
            # Convert to binary (1 for normal, 0 for anomaly)
            pred_binary = (pred == 1).astype(int)
            predictions.append(pred_binary)

        predictions = np.array(predictions)

        if self.voting == 'majority':
            # Flag an anomaly when more than half of the models vote "anomaly"
            ensemble_pred = (np.mean(predictions, axis=0) < 0.5).astype(int)
        elif self.voting == 'weighted':
            # Weighted vote
            weighted_pred = np.average(predictions, axis=0, weights=self.weights)
            ensemble_pred = (weighted_pred < 0.5).astype(int)
        else:
            raise ValueError(f"Unknown voting scheme: {self.voting!r}")

        # Note the flipped convention: the ensemble returns 1 for anomaly, 0 for normal
        return ensemble_pred

    def get_anomaly_scores(self, X):
        """Get ensemble anomaly scores."""
        scores = []

        for model in self.fitted_models:
            if hasattr(model, 'decision_function'):
                score = -model.decision_function(X) # Negative for anomaly scores
            elif hasattr(model, 'score_samples'):
                score = -model.score_samples(X)
            else:
                # Fallback to prediction confidence
                pred = model.predict(X)
                score = (pred == -1).astype(float)
            scores.append(score)

        # Average scores
        ensemble_scores = np.mean(scores, axis=0)
        return ensemble_scores

# Test ensemble detector on financial data
print("Financial Data Ensemble Detection:")

ensemble_detector = EnsembleAnomalyDetector(voting='majority')
ensemble_detector.fit(X_financial_scaled)
ensemble_predictions = ensemble_detector.predict(X_financial_scaled)
ensemble_scores = ensemble_detector.get_anomaly_scores(X_financial_scaled)

# Evaluate ensemble performance
ensemble_anomalies = ensemble_predictions == 1
true_positives = ((ensemble_anomalies) & (financial_df['is_fraud'] == 1)).sum()
false_positives = ((ensemble_anomalies) & (financial_df['is_fraud'] == 0)).sum()
false_negatives = ((~ensemble_anomalies) & (financial_df['is_fraud'] == 1)).sum()

ensemble_precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0
ensemble_recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0
ensemble_f1 = 2 * (ensemble_precision * ensemble_recall) / (ensemble_precision + ensemble_recall) if (ensemble_precision + ensemble_recall) > 0 else 0

print(f"• Ensemble Anomalies detected: {ensemble_anomalies.sum()} ({ensemble_anomalies.mean()*100:.1f}%)")
print(f"• Ensemble Precision: {ensemble_precision:.3f}")
print(f"• Ensemble Recall: {ensemble_recall:.3f}")
print(f"• Ensemble F1-Score: {ensemble_f1:.3f}")

 3. ENSEMBLE ANOMALY DETECTION
Financial Data Ensemble Detection:
• Ensemble Anomalies detected: 758 (9.2%)
• Ensemble Precision: 0.264
• Ensemble Recall: 1.000
• Ensemble F1-Score: 0.418
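
`get_anomaly_scores` averages raw detector outputs, which live on different scales, so one detector can dominate the mean. A possible alternative, sketched here under assumptions, is to rank-normalize each detector's scores before averaging. The helper names and the example scores are hypothetical and not part of the notebook.

```python
import numpy as np


def rank_normalize(scores):
    """Map raw scores to (0, 1] by rank so detectors with different scales are comparable."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores, kind="stable")  # ties are broken by original position
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    return ranks / len(scores)


def average_rank_scores(score_lists, weights=None):
    """Average rank-normalized anomaly scores (higher = more anomalous) across detectors."""
    normalized = np.vstack([rank_normalize(s) for s in score_lists])
    return np.average(normalized, axis=0, weights=weights)


# Hypothetical raw scores from two detectors on the same five samples.
# Detector B's scale is roughly 1000x larger, so a plain mean mostly follows B.
detector_a = np.array([0.10, 0.20, 0.15, 0.90, 0.12])
detector_b = np.array([120.0, 80.0, 100.0, 900.0, 950.0])

plain_mean = np.mean([detector_a, detector_b], axis=0)
rank_mean = average_rank_scores([detector_a, detector_b])
print("plain mean:", plain_mean)
print("rank mean: ", rank_mean)
```

In this toy example the plain mean ranks the last sample highest because detector B's large values dominate, whereas the rank average favors the sample that both detectors score highly.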


In [21]:
# 4. COMPREHENSIVE ANOMALY DETECTION DASHBOARD
print(" 4. COMPREHENSIVE ANOMALY DETECTION DASHBOARD")
print("=" * 49)

# Create comprehensive anomaly detection dashboard
fig = make_subplots(
    rows=2, cols=3,
    subplot_titles=[
        'Financial Transaction Anomalies',
        'Manufacturing Quality Control',
        'Statistical vs ML Methods',
        'Anomaly Score Distribution',
        'Model Performance Comparison',
        'ROC Curves'
    ],
    specs=[[{"secondary_y": False}, {"secondary_y": False}, {"secondary_y": False}],
           [{"secondary_y": False}, {"secondary_y": False}, {"secondary_y": False}]]
)

# 1. Financial Transaction Anomalies (Scatter plot)
fraud_mask = financial_df['is_fraud'] == 1
normal_mask = financial_df['is_fraud'] == 0

fig.add_trace(
    go.Scatter(
        x=financial_df[normal_mask]['transaction_amount'],
        y=financial_df[normal_mask]['account_age_days'],
        mode='markers',
        name='Normal Transactions',
        marker=dict(color='blue', size=4, opacity=0.6)
    ),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(
        x=financial_df[fraud_mask]['transaction_amount'],
        y=financial_df[fraud_mask]['account_age_days'],
        mode='markers',
        name='Fraudulent Transactions',
        marker=dict(color='red', size=6, symbol='x')
    ),
    row=1, col=1
)

# 2. Manufacturing Quality Control
defective_mask = manufacturing_df['is_defective'] == 1
normal_parts_mask = manufacturing_df['is_defective'] == 0

fig.add_trace(
    go.Scatter(
        x=manufacturing_df[normal_parts_mask]['temperature'],
        y=manufacturing_df[normal_parts_mask]['pressure'],
        mode='markers',
        name='Normal Parts',
        marker=dict(color='green', size=4, opacity=0.6)
    ),
    row=1, col=2
)

fig.add_trace(
    go.Scatter(
        x=manufacturing_df[defective_mask]['temperature'],
        y=manufacturing_df[defective_mask]['pressure'],
        mode='markers',
        name='Defective Parts',
        marker=dict(color='red', size=6, symbol='x')
    ),
    row=1, col=2
)

# 3. Statistical vs ML Methods Comparison
methods = []
f1_scores = []

# Add statistical methods
for method, results in stat_results.items():
    if 'f1' in results:
        methods.append(f"Stat: {method}")
        f1_scores.append(results['f1'])

# Add ML methods
for method, results in financial_results.items():
    if results and results['f1'] is not None:
        methods.append(f"ML: {method}")
        f1_scores.append(results['f1'])

fig.add_trace(
    go.Bar(
        x=methods,
        y=f1_scores,
        name='F1 Scores',
        marker_color='lightcoral'
    ),
    row=1, col=3
)

# 4. Anomaly Score Distribution
if best_financial_model and financial_results[best_financial_model]:
    # Get scores from the best financial model
    best_model = financial_results[best_financial_model]['model']
    try:
        if hasattr(best_model, 'decision_function'):
            all_scores = -best_model.decision_function(X_financial_scaled)
        elif hasattr(best_model, 'score_samples'):
            all_scores = -best_model.score_samples(X_financial_scaled)
        else:
            all_scores = ensemble_scores
    except Exception:
        # Fall back to ensemble scores if the model cannot score this data
        all_scores = ensemble_scores

    # Plot one histogram per class instead of repeating the trace code
    is_fraud = (financial_df['is_fraud'] == 1).to_numpy()
    for label, color, mask in [
        ('Normal Scores', 'blue', ~is_fraud),
        ('Anomaly Scores', 'red', is_fraud),
    ]:
        fig.add_trace(
            go.Histogram(
                x=all_scores[mask],
                name=label,
                opacity=0.7,
                marker_color=color,
                nbinsx=30
            ),
            row=2, col=1
        )

# 5. Model Performance Comparison
model_names = []
model_f1_scores = []

for method, results in financial_results.items():
    if results and results['f1'] is not None:
        model_names.append(method)
        model_f1_scores.append(results['f1'])

fig.add_trace(
    go.Bar(
        x=model_names,
        y=model_f1_scores,
        name='Model F1 Scores',
        marker_color='lightgreen'
    ),
    row=2, col=2
)

# 6. ROC Curve
if best_financial_model and financial_results[best_financial_model]:
    y_true = financial_df['is_fraud']
    try:
        fpr, tpr, _ = roc_curve(y_true, all_scores)

        fig.add_trace(
            go.Scatter(
                x=fpr,
                y=tpr,
                mode='lines',
                name=f'ROC ({best_financial_model})',
                line=dict(color='purple', width=2)
            ),
            row=2, col=3
        )

        # Add diagonal reference line
        fig.add_trace(
            go.Scatter(
                x=[0, 1],
                y=[0, 1],
                mode='lines',
                name='Random Classifier',
                line=dict(color='gray', width=1, dash='dash')
            ),
            row=2, col=3
        )
    except Exception as e:
        print(f"ROC curve error: {e}")

# Update layout
fig.update_layout(
    height=800,
    title="Comprehensive Anomaly Detection Dashboard",
    showlegend=True
)

# Update axis labels
fig.update_xaxes(title_text="Transaction Amount", row=1, col=1)
fig.update_xaxes(title_text="Temperature", row=1, col=2)
fig.update_xaxes(title_text="Method", row=1, col=3)
fig.update_xaxes(title_text="Anomaly Score", row=2, col=1)
fig.update_xaxes(title_text="Model", row=2, col=2)
fig.update_xaxes(title_text="False Positive Rate", row=2, col=3)

fig.update_yaxes(title_text="Account Age", row=1, col=1)
fig.update_yaxes(title_text="Pressure", row=1, col=2)
fig.update_yaxes(title_text="F1 Score", row=1, col=3)
fig.update_yaxes(title_text="Frequency", row=2, col=1)
fig.update_yaxes(title_text="F1 Score", row=2, col=2)
fig.update_yaxes(title_text="True Positive Rate", row=2, col=3)

fig.show()

 4. COMPREHENSIVE ANOMALY DETECTION DASHBOARD
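
The ROC panel depends on score orientation: `roc_curve` treats larger scores as stronger evidence for the positive class, while scikit-learn outlier detectors return lower `decision_function` values for more abnormal points. The sketch below illustrates the effect with a small rank-based AUC in pure Python. The labels, scores, and the helper name `rank_auc` are illustrative assumptions, not values or functions from this notebook.

```python
def rank_auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative data: 1 = fraud, 0 = normal
y_true = [0, 0, 0, 0, 1, 1]
# decision_function-style scores: lower means more abnormal
decision_scores = [0.12, 0.08, 0.10, -0.05, -0.20, -0.02]

auc_raw = rank_auc(y_true, decision_scores)
# Negating the scores makes "higher" mean "more anomalous"
auc_flipped = rank_auc(y_true, [-s for s in decision_scores])

print(f"AUC with raw decision scores: {auc_raw:.3f}")
print(f"AUC with negated scores:      {auc_flipped:.3f}")
```

An AUC well below 0.5 on the dashboard is therefore a hint that the scores may need to be negated before plotting, rather than evidence that the model is worse than random.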


In [None]:
# 5. BUSINESS INSIGHTS AND ROI ANALYSIS
print(" 5. BUSINESS INSIGHTS AND ROI ANALYSIS")
print("=" * 40)

# Financial Fraud Detection ROI
print("Financial Fraud Detection System ROI:")

# Business parameters
monthly_transactions = 1_000_000  # Transactions processed per month
fraud_rate = 0.024  # 2.4% observed fraud rate
avg_transaction_value = 85.68
avg_fraud_value = 2_500  # Higher value for fraudulent transactions

# Current manual review costs
manual_review_rate = 0.05  # 5% of transactions manually reviewed
manual_cost_per_review = 2.50  # Cost per manual review
monthly_manual_cost = monthly_transactions * manual_review_rate * manual_cost_per_review

# Expected fraud losses without detection (reference figures; not used in the ROI below)
monthly_fraud_transactions = monthly_transactions * fraud_rate
total_fraud_value = monthly_fraud_transactions * avg_fraud_value
fraud_loss_rate = 0.80  # 80% of undetected fraud results in losses
monthly_fraud_losses = total_fraud_value * fraud_loss_rate

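# With anomaly detection system
# NOTE: the F1-score is used here as a stand-in for the detection rate;
# recall is the direct measure of the share of fraud caught
system_detection_rate = 0.393  # F1-score from best model (One-Class SVM)
system_monthly_cost = 25_000  # System operational cost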

# Detected fraud (prevented losses)
detected_fraud = monthly_fraud_transactions * system_detection_rate
prevented_losses = detected_fraud * avg_fraud_value

# False positives require manual review
false_positive_rate = 0.10  # Estimated false positive rate
false_positives = monthly_transactions * false_positive_rate
manual_review_cost = false_positives * manual_cost_per_review

total_system_cost = system_monthly_cost + manual_review_cost
net_monthly_benefit = prevented_losses - total_system_cost
annual_benefit = net_monthly_benefit * 12
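# ROI is measured against the system operational cost only; manual review
# cost is netted out of the benefit but excluded from the denominator
roi = net_monthly_benefit / system_monthly_cost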

print(f"• Monthly transactions: {monthly_transactions:,}")
print(f"• Fraud rate: {fraud_rate:.1%}")
print(f"• Expected monthly fraud value: ${total_fraud_value:,.0f}")
print(f"• System detection rate: {system_detection_rate:.1%}")
print(f"• Prevented fraud losses: ${prevented_losses:,.0f}/month")
print(f"• System operational cost: ${system_monthly_cost:,.0f}/month")
print(f"• Manual review cost: ${manual_review_cost:,.0f}/month")
print(f"• Net monthly benefit: ${net_monthly_benefit:,.0f}")
print(f"• Annual benefit: ${annual_benefit:,.0f}")
print(f"• ROI: {roi*100:.0f}%")

# Manufacturing Quality Control ROI
print(f"\nManufacturing Quality Control ROI:")

daily_parts = 10_000  # Parts produced per day
defect_rate = 0.029  # 2.9% observed defect rate
cost_per_defective_part = 50  # Cost if defective part reaches customer
production_cost_per_part = 5  # Cost to produce one part

# Current quality control
current_detection_rate = 0.85  # 85% of defects caught by current methods
monthly_parts = daily_parts * 30
monthly_defects = monthly_parts * defect_rate

# Losses from undetected defects
current_detected_defects = monthly_defects * current_detection_rate
current_missed_defects = monthly_defects - current_detected_defects
current_defect_cost = current_missed_defects * cost_per_defective_part

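# With advanced anomaly detection
# NOTE: an F1-score is compared against a detection (recall) rate here, so the
# negative ROI below reflects mixed metrics, not necessarily worse detection
advanced_detection_rate = 0.693  # F1-score from best model (DBSCAN)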
system_cost_per_part = 0.25  # Additional cost per part for advanced detection
advanced_system_monthly_cost = monthly_parts * system_cost_per_part

# Improved defect detection
advanced_detected_defects = monthly_defects * advanced_detection_rate
advanced_missed_defects = monthly_defects - advanced_detected_defects
advanced_defect_cost = advanced_missed_defects * cost_per_defective_part

# Calculate savings
defect_cost_savings = current_defect_cost - advanced_defect_cost
net_monthly_savings = defect_cost_savings - advanced_system_monthly_cost
manufacturing_roi = net_monthly_savings / advanced_system_monthly_cost

print(f"• Daily parts produced: {daily_parts:,}")
print(f"• Monthly defects expected: {monthly_defects:.0f}")
print(f"• Current detection rate: {current_detection_rate:.1%}")
print(f"• Advanced detection rate: {advanced_detection_rate:.1%}")
print(f"• Current defect cost: ${current_defect_cost:,.0f}/month")
print(f"• Advanced defect cost: ${advanced_defect_cost:,.0f}/month")
print(f"• Defect cost savings: ${defect_cost_savings:,.0f}/month")
print(f"• System cost: ${advanced_system_monthly_cost:,.0f}/month")
print(f"• Net monthly savings: ${net_monthly_savings:,.0f}")
print(f"• Manufacturing ROI: {manufacturing_roi*100:.0f}%")

# Ensemble method benefits
print(f"\nEnsemble Method Analysis:")
ensemble_f1 = 0.418  # Ensemble F1-score
single_best_f1 = 0.393  # Best single model F1-score
ensemble_improvement = (ensemble_f1 - single_best_f1) / single_best_f1

print(f"• Best single model F1: {single_best_f1:.3f}")
print(f"• Ensemble model F1: {ensemble_f1:.3f}")
print(f"• Performance improvement: {ensemble_improvement:.1%}")
print(f"• Additional fraud detection: ${prevented_losses * ensemble_improvement:,.0f}/month")

# Combined systems summary
total_annual_investment = (system_monthly_cost * 12) + (advanced_system_monthly_cost * 12)
total_annual_benefits = annual_benefit + (net_monthly_savings * 12)
combined_roi = total_annual_benefits / total_annual_investment

print(f"\nCombined Anomaly Detection Systems:")
print(f"• Total annual investment: ${total_annual_investment:,.0f}")
print(f"• Total annual benefits: ${total_annual_benefits:,.0f}")
print(f"• Combined ROI: {combined_roi*100:.0f}%")
print(f"• Payback period: {total_annual_investment/total_annual_benefits*12:.1f} months")

 5. BUSINESS INSIGHTS AND ROI ANALYSIS
Financial Fraud Detection System ROI:
• Monthly transactions: 1,000,000
• Fraud rate: 2.4%
• Expected monthly fraud value: $60,000,000
• System detection rate: 39.3%
• Prevented fraud losses: $23,580,000/month
• System operational cost: $25,000/month
• Manual review cost: $250,000/month
• Net monthly benefit: $23,305,000
• Annual benefit: $279,660,000
• ROI: 93220%

Manufacturing Quality Control ROI:
• Daily parts produced: 10,000
• Monthly defects expected: 8700
• Current detection rate: 85.0%
• Advanced detection rate: 69.3%
• Current defect cost: $65,250/month
• Advanced defect cost: $133,545/month
• Defect cost savings: $-68,295/month
• System cost: $75,000/month
• Net monthly savings: $-143,295
• Manufacturing ROI: -191%

Ensemble Method Analysis:
• Best single model F1: 0.393
• Ensemble model F1: 0.418
• Performance improvement: 6.4%
• Additional fraud detection: $1,500,000/month

Combined Anomaly Detection Systems:
• Total annual investment
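
The fraud ROI above uses the F1-score as the detection rate and a separate estimated false positive rate. As a hedged alternative, the sketch below derives both quantities from precision and recall. It assumes that every flagged transaction is manually reviewed and that every caught fraud is fully prevented. The function name `fraud_roi` is hypothetical, and the precision and recall values are the One-Class SVM figures reported in the performance summary table.

```python
def fraud_roi(transactions, fraud_rate, avg_fraud_value,
              precision, recall, review_cost, system_cost):
    """Monthly ROI sketch driven by precision and recall (assumptions in the text above)."""
    fraud_count = transactions * fraud_rate
    caught = fraud_count * recall              # true positives
    flagged = caught / precision               # true positives + false positives
    false_positives = flagged - caught
    prevented = caught * avg_fraud_value       # assumes full recovery of caught fraud
    total_cost = system_cost + flagged * review_cost
    net_benefit = prevented - total_cost
    return {
        "caught": caught,
        "flagged": flagged,
        "false_positives": false_positives,
        "prevented": prevented,
        "total_cost": total_cost,
        "net_benefit": net_benefit,
        "roi": net_benefit / total_cost,       # ROI against total cost, not system cost alone
    }

result = fraud_roi(
    transactions=1_000_000,
    fraud_rate=0.024,
    avg_fraud_value=2_500,
    precision=0.244,   # One-Class SVM precision from the summary table
    recall=1.000,      # One-Class SVM recall from the summary table
    review_cost=2.50,
    system_cost=25_000,
)

print(f"Flagged for review: {result['flagged']:,.0f}")
print(f"False positives:    {result['false_positives']:,.0f}")
print(f"Total monthly cost: ${result['total_cost']:,.0f}")
print(f"Net monthly benefit: ${result['net_benefit']:,.0f}")
print(f"ROI on total cost:  {result['roi'] * 100:.0f}%")
```

Dividing by total cost instead of the system cost alone gives a more conservative ROI, and tying the review workload to precision keeps the false positive count consistent with the model's measured behavior.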

In [None]:
# 6. IMPLEMENTATION GUIDELINES AND RECOMMENDATIONS
print(" 6. IMPLEMENTATION GUIDELINES AND RECOMMENDATIONS")
print("=" * 53)

print("Anomaly Detection Method Selection Guide:")
print("=" * 42)

# Method selection recommendations
recommendations = {
    "Statistical Methods": {
        "Best For": ["Small datasets", "Interpretable results", "Simple baseline"],
        "Z-Score": "Normal distributions, few outliers",
        "IQR": "Skewed distributions, robust to outliers",
        "Grubbs Test": "Single outlier detection, normal data",
        "Modified Z-Score": "Non-normal distributions, median-based"
    },
    "Machine Learning Methods": {
        "Best For": ["Large datasets", "Complex patterns", "High accuracy"],
        "Isolation Forest": "High-dimensional data, fast training",
        "One-Class SVM": "Non-linear patterns, small datasets",
        "Local Outlier Factor": "Local density-based anomalies",
        "Elliptic Envelope": "Gaussian distributed data",
        "DBSCAN": "Cluster-based anomalies, varying densities"
    },
    "Ensemble Methods": {
        "Best For": ["Critical applications", "Robust detection", "Reduced false positives"],
        "Voting": "Combine multiple weak detectors",
        "Weighted": "Emphasize best-performing models",
        "Stacking": "Meta-learning for optimal combination"
    }
}

for category, details in recommendations.items():
    print(f"\n{category}:")
    # Read "Best For" without mutating the dict so the cell can be re-run safely
    print(f"  Best For: {', '.join(details['Best For'])}")
    for method, description in details.items():
        if method == "Best For":
            continue
        print(f"  • {method}: {description}")

print("\nPerformance Summary:")
print("• Best Statistical Method: Z-Score (F1: 0.496)")
print("• Best ML Method - Financial: One-Class SVM (F1: 0.393)")
print("• Best ML Method - Manufacturing: DBSCAN (F1: 0.693)")
print("• Ensemble Improvement: 6.4% over the best single ML model (F1: 0.418 vs 0.393)")

print(f"\nImplementation Checklist:")
implementation_steps = [
    "Data Quality: Ensure clean, preprocessed data",
    "Feature Engineering: Create relevant domain-specific features",
    "Baseline: Start with simple statistical methods",
    "Model Selection: Test multiple algorithms on validation set",
    "Hyperparameter Tuning: Optimize contamination rates and model parameters",
    "Ensemble: Combine complementary methods for robustness",
    "Evaluation: Use domain-relevant metrics (precision vs recall trade-offs)",
    "Monitoring: Implement continuous model performance tracking",
    "Feedback Loop: Collect expert feedback to improve detection",
    "Scalability: Design for real-time or batch processing needs"
]

for i, step in enumerate(implementation_steps, 1):
    print(f"{i:2d}. {step}")

print(f"\nReal-Time Considerations:")
realtime_factors = [
    "Latency Requirements: Choose fast algorithms (Isolation Forest is generally faster than One-Class SVM)",
    "Memory Usage: Consider model size for edge deployment",
    "Incremental Learning: Update models with new data streams",
    "Alert Systems: Define escalation procedures for detected anomalies",
    "False Positive Management: Balance sensitivity vs operational burden"
]

for factor in realtime_factors:
    print(f"• {factor}")

print(f"\nCross-Reference Learning Path:")
learning_path = [
    "Prerequisites: Tier1_Descriptive.ipynb (statistical foundations)",
    "Building On: Tier2_LogisticRegression.ipynb (classification metrics)",
    "Advanced: Tier5_Classification.ipynb (ensemble methods)",
    "Specialized: Tier6_IsolationForest.ipynb, Tier6_OneClassSVM.ipynb",
    "Integration: Tier6_RealTimeAnalytics.ipynb (deployment patterns)",
    "Domain-Specific: Tier3_TimeSeries.ipynb (temporal anomalies)"
]

for path in learning_path:
    print(f"• {path}")

print("\nNext Steps:")
print("• Experiment with domain-specific feature engineering")
print("• Implement online learning for concept drift adaptation")
print("• Develop custom ensemble methods for your use case")
print("• Create automated hyperparameter optimization pipelines")
print("• Build interpretability tools for anomaly explanations")

 6. IMPLEMENTATION GUIDELINES AND RECOMMENDATIONS
Anomaly Detection Method Selection Guide:

Statistical Methods:
  Best For: Small datasets, Interpretable results, Simple baseline
  • Z-Score: Normal distributions, few outliers
  • IQR: Skewed distributions, robust to outliers
  • Grubbs Test: Single outlier detection, normal data
  • Modified Z-Score: Non-normal distributions, median-based

Machine Learning Methods:
  Best For: Large datasets, Complex patterns, High accuracy
  • Isolation Forest: High-dimensional data, fast training
  • One-Class SVM: Non-linear patterns, small datasets
  • Local Outlier Factor: Local density-based anomalies
  • Elliptic Envelope: Gaussian distributed data
  • DBSCAN: Cluster-based anomalies, varying densities

Ensemble Methods:
  Best For: Critical applications, Robust detection, Reduced false positives
  • Voting: Combine multiple weak detectors
  • Weighted: Emphasize best-performing models
  • Stacking: Meta-learning for optimal combination

Perf
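
The selection guide contrasts the Z-Score with the median-based Modified Z-Score. The sketch below shows why that distinction matters on a small sample: a single extreme value inflates the mean and standard deviation enough to hide itself from the Z-Score, while the median and median absolute deviation (MAD) stay stable. The readings and the thresholds (3.0 and 3.5) are illustrative assumptions, not the settings used elsewhere in this notebook.

```python
import statistics

def z_scores(values):
    """Classic Z-scores using the mean and population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def modified_z_scores(values):
    """Modified Z-scores using the median and the median absolute deviation."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero; modified Z-score is undefined for this sample")
    # 0.6745 rescales the MAD to be comparable to a standard deviation under normality
    return [0.6745 * (v - med) / mad for v in values]

# Illustrative sensor readings with one obvious outlier at the end
readings = [10.1, 9.8, 10.0, 10.2, 9.9, 10.0, 10.1, 25.0]

z_flags = [abs(z) > 3.0 for z in z_scores(readings)]
mz_flags = [abs(m) > 3.5 for m in modified_z_scores(readings)]

print("Z-score flags:         ", z_flags)
print("Modified Z-score flags:", mz_flags)
```

With only eight points, no value can reach a population Z-score of 3, so the plain Z-Score flags nothing here, whereas the Modified Z-Score isolates the last reading.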

In [29]:
# Additional Simple Visualizations
print(" ADDITIONAL ANOMALY DETECTION VISUALIZATIONS")
print("=" * 46)

# Print summary table
table_width = 74  # 25 + 15 + 10 + 10 + 10 column widths plus 4 separating spaces
print("\nPerformance Summary Table:")
print("=" * table_width)
print(f"{'Method':<25} {'Type':<15} {'F1-Score':<10} {'Precision':<10} {'Recall':<10}")
print("=" * table_width)

# Statistical methods
for method, results in stat_results.items():
    if 'f1' in results:
        print(f"{method.title():<25} {'Statistical':<15} {results['f1']:<10.3f} {results['precision']:<10.3f} {results['recall']:<10.3f}")

# ML methods
for method, results in financial_results.items():
    if results and results['f1'] is not None:
        print(f"{method:<25} {'ML':<15} {results['f1']:<10.3f} {results['precision']:<10.3f} {results['recall']:<10.3f}")

# Ensemble: values are hardcoded from the earlier ensemble cell
# (F1=0.418, Precision=0.264, Recall=1.000)
print(f"{'Ensemble':<25} {'Ensemble':<15} {'0.418':<10} {'0.264':<10} {'1.000':<10}")

print("=" * table_width)

# Find best method
best_stat_f1 = max([results['f1'] for results in stat_results.values() if 'f1' in results])
best_ml_f1 = max([results['f1'] for results in financial_results.values() if results and results['f1'] is not None])
ensemble_score = 0.418  # Fixed ensemble F1 score from earlier analysis
all_f1_scores = [best_stat_f1, best_ml_f1, ensemble_score]
best_overall_f1 = max(all_f1_scores)

if best_overall_f1 == ensemble_score:
    best_method = "Ensemble Method"
elif best_overall_f1 == best_stat_f1:
    best_method = f"Statistical: {best_stat_method.title()}"
else:
    best_method = f"ML: {best_financial_model}"

print(f"Best Overall Method: {best_method} (F1: {best_overall_f1:.3f})")

# Create simple comparison chart
methods_simple = ['Best Statistical', 'Best ML', 'Ensemble']
scores_simple = [best_stat_f1, best_ml_f1, ensemble_score]

fig_simple = go.Figure(data=[
    go.Bar(x=methods_simple, 
           y=scores_simple,
           marker_color=['lightblue', 'lightcoral', 'gold'],
           text=[f'{score:.3f}' for score in scores_simple],
           textposition='auto')
])

fig_simple.update_layout(
    title="Top Anomaly Detection Methods Comparison",
    xaxis_title="Method Category",
    yaxis_title="F1-Score",
    height=400
)

fig_simple.show()

 ADDITIONAL ANOMALY DETECTION VISUALIZATIONS

Performance Summary Table:
Method                    Type            F1-Score   Precision  Recall    
Zscore                    Statistical     0.496      1.000      0.330     
Iqr                       Statistical     0.262      0.168      0.600     
Grubbs                    Statistical     0.473      1.000      0.310     
Modified_Zscore           Statistical     0.233      0.143      0.635     
Isolation Forest          ML              0.392      0.244      1.000     
One-Class SVM             ML              0.393      0.244      1.000     
Local Outlier Factor      ML              0.098      0.061      0.250     
Elliptic Envelope         ML              0.373      0.232      0.950     
DBSCAN                    ML              0.224      0.129      0.860     
Ensemble                  Ensemble        0.418      0.264      1.000     
Best Overall Method: Statistical: Zscore (F1: 0.496)
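
The table ranks Z-Score first on F1 even though it recovers only a third of the anomalies. F1 is the harmonic mean of precision and recall, so a method with perfect precision and modest recall can outscore one with perfect recall and low precision. The sketch below recomputes F1 from three precision and recall pairs taken from the table. The helper name `f1_from` is hypothetical.

```python
def f1_from(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from the performance summary table
reported = {
    "Zscore": (1.000, 0.330),
    "Grubbs": (1.000, 0.310),
    "Ensemble": (0.264, 1.000),
}

for name, (p, r) in reported.items():
    print(f"{name:<10} P={p:.3f} R={r:.3f} F1={f1_from(p, r):.3f}")
```

Which method is "best" therefore depends on the cost of a missed anomaly versus a false alarm. Where missed fraud is expensive, the ensemble's recall of 1.000 may matter more than its lower F1.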
