# Tier 6: Isolation Forest

---

**Author:** Brandon Deloatch
**Affiliation:** Quipu Research Labs, LLC
**Date:** 2025-10-02
**Version:** v1.3
**License:** MIT
**Notebook ID:** 2b64eced-5f5c-439f-bd23-22ac395445be

---

## Citation
Brandon Deloatch, "Tier 6: Isolation Forest," Quipu Research Labs, LLC, v1.3, 2025-10-02.

Please cite this notebook if used or adapted in publications, presentations, or derivative work.

---

## Contributors / Acknowledgments
- **Primary Author:** Brandon Deloatch (Quipu Research Labs, LLC)
- **Institutional Support:** Quipu Research Labs, LLC - Advanced Analytics Division
- **Technical Framework:** Built on scikit-learn, pandas, numpy, and plotly ecosystems
- **Methodological Foundation:** Statistical learning principles and modern data science best practices

---

## Version History
| Version | Date | Notes |
|---------|------|-------|
| v1.3 | 2025-10-02 | Enhanced professional formatting, comprehensive documentation, interactive visualizations |
| v1.2 | 2024-09-15 | Updated analysis methods, improved data generation algorithms |
| v1.0 | 2024-06-10 | Initial release with core analytical framework |

---

## Environment Dependencies
- **Python:** 3.8+
- **Core Libraries:** pandas 2.0+, numpy 1.24+, scikit-learn 1.3+
- **Visualization:** plotly 5.0+, matplotlib 3.7+
- **Statistical:** scipy 1.10+, statsmodels 0.14+
- **Development:** jupyter-lab 4.0+, ipywidgets 8.0+

> **Reproducibility Note:** Use requirements.txt or environment.yml for exact dependency matching.

---

## Data Provenance
| Dataset | Source | License | Notes |
|---------|--------|---------|-------|
| Synthetic Data | Generated in-notebook | MIT | Custom algorithms for realistic simulation |
| Statistical Distributions | NumPy/SciPy | BSD-3-Clause | Standard library implementations |
| ML Algorithms | Scikit-learn | BSD-3-Clause | Industry-standard implementations |
| Visualization Schemas | Plotly | MIT | Interactive dashboard frameworks |

---

## Execution Provenance Logs
- **Created:** 2025-10-02
- **Notebook ID:** 2b64eced-5f5c-439f-bd23-22ac395445be
- **Execution Environment:** Jupyter Lab / VS Code
- **Computational Requirements:** Standard laptop/workstation (2GB+ RAM recommended)

> **Auto-tracking:** Execution metadata can be programmatically captured for reproducibility.

---

## Disclaimer & Responsible Use
This notebook is provided "as-is" for educational, research, and professional development purposes. Users assume full responsibility for any results, applications, or decisions derived from this analysis.

**Professional Standards:**
- Validate all results against domain expertise and additional data sources
- Respect licensing and attribution requirements for all dependencies
- Follow ethical guidelines for data analysis and algorithmic decision-making
- Credit all methodological sources and derivative frameworks appropriately

**Academic & Commercial Use:**
- Permitted under MIT license with proper attribution
- Suitable for educational curriculum and professional training
- Appropriate for commercial adaptation with citation requirements
- Recommended for reproducible research and transparent analytics

---



In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.ensemble import IsolationForest
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

print(" Tier 6: Isolation Forest - Libraries Loaded!")
print("=" * 50)
print("Isolation Forest Techniques:")
print("• Tree-based anomaly detection")
print("• Path length anomaly scoring")
print("• Contamination parameter optimization")
print("• Real-time anomaly monitoring")
print("• Multi-dimensional outlier detection")

 Tier 6: Isolation Forest - Libraries Loaded!
Isolation Forest Techniques:
• Tree-based anomaly detection
• Path length anomaly scoring
• Contamination parameter optimization
• Real-time anomaly monitoring
• Multi-dimensional outlier detection
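
The "path length anomaly scoring" item above can be made concrete. What follows is a minimal sketch of the score from the original Isolation Forest paper (Liu, Ting, and Zhou, 2008), s(x, n) = 2^(-E[h(x)] / c(n)), where E[h(x)] is the mean depth at which a point is isolated and c(n) normalizes by the expected depth for a sub-sample of size n. The function names are illustrative and are not part of scikit-learn, whose `score_samples` returns the negated version of this score.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def average_path_length(n: int) -> float:
    """c(n): expected path length of an unsuccessful binary-search-tree lookup."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length: float, n: int) -> float:
    """Paper-style score in (0, 1]; values near 1 indicate likely anomalies."""
    return 2.0 ** (-mean_path_length / average_path_length(n))


n = 256  # sub-sample size suggested in the paper
typical_depth = average_path_length(n)
print(f"c({n}) = {typical_depth:.3f}")
print(f"score at typical depth: {anomaly_score(typical_depth, n):.3f}")
print(f"score at depth 2:       {anomaly_score(2.0, n):.3f}")
```

A point isolated at exactly the expected depth scores 0.5; shallower isolation pushes the score toward 1, deeper isolation toward 0.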


In [2]:
# Generate Isolation Forest datasets
np.random.seed(42)

# 1. Financial transaction data with fraud
n_transactions = 10000
normal_transactions = pd.DataFrame({
 'amount': np.random.lognormal(3, 1, int(n_transactions * 0.95)),
 'frequency': np.random.poisson(5, int(n_transactions * 0.95)),
 'time_of_day': np.random.normal(12, 4, int(n_transactions * 0.95)) % 24,
 'merchant_risk': np.random.beta(2, 5, int(n_transactions * 0.95))
})

# Add fraudulent transactions (5% contamination)
n_fraud = int(n_transactions * 0.05)
fraud_transactions = pd.DataFrame({
 'amount': np.random.lognormal(5, 1.5, n_fraud), # Larger amounts
 'frequency': np.random.poisson(15, n_fraud), # Higher frequency
 'time_of_day': np.random.choice([2, 3, 23], n_fraud), # Unusual times
 'merchant_risk': np.random.beta(5, 2, n_fraud) # Higher risk
})

# Combine datasets
financial_data = pd.concat([normal_transactions, fraud_transactions], ignore_index=True)
financial_labels = np.concatenate([np.zeros(len(normal_transactions)),
 np.ones(len(fraud_transactions))])

# 2. Network traffic data with intrusions
n_connections = 5000
network_data = pd.DataFrame({
 'packet_size': np.random.exponential(1000, n_connections),
 'duration': np.random.gamma(2, 30, n_connections),
 'port_number': np.random.choice([80, 443, 22, 21], n_connections, p=[0.4, 0.3, 0.2, 0.1]),
 'bytes_transferred': np.random.lognormal(8, 2, n_connections)
})

# Add intrusion patterns (3% contamination)
n_intrusions = int(n_connections * 0.03)
intrusion_indices = np.random.choice(n_connections, n_intrusions, replace=False)
network_labels = np.zeros(n_connections)
network_labels[intrusion_indices] = 1

# Modify intrusion data
network_data.loc[intrusion_indices, 'packet_size'] *= 10 # Larger packets
network_data.loc[intrusion_indices, 'duration'] *= 5 # Longer duration
network_data.loc[intrusion_indices, 'bytes_transferred'] *= 20 # More data

print(" Isolation Forest Datasets:")
print(f"Financial transactions: {len(financial_data)} samples, {financial_labels.sum():.0f} fraudulent")
print(f"Network connections: {len(network_data)} samples, {network_labels.sum():.0f} intrusions")
print(f"Financial fraud rate: {financial_labels.mean()*100:.1f}%")
print(f"Network intrusion rate: {network_labels.mean()*100:.1f}%")

 Isolation Forest Datasets:
Financial transactions: 10000 samples, 500 fraudulent
Network connections: 5000 samples, 150 intrusions
Financial fraud rate: 5.0%
Network intrusion rate: 3.0%
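
The two `amount` distributions above differ mainly in scale, which can be checked from the closed-form lognormal mean, exp(mu + sigma^2 / 2), and median, exp(mu). The sketch below only restates the parameters used in the generator; the helper names are illustrative.

```python
import math


def lognormal_mean(mu: float, sigma: float) -> float:
    """Theoretical mean of a lognormal distribution."""
    return math.exp(mu + sigma ** 2 / 2)


def lognormal_median(mu: float) -> float:
    """Theoretical median of a lognormal distribution."""
    return math.exp(mu)


# Parameters copied from the synthetic data generator
normal_mean = lognormal_mean(3, 1)
fraud_mean = lognormal_mean(5, 1.5)

print(f"Normal amount: mean ~ {normal_mean:.0f}, median ~ {lognormal_median(3):.0f}")
print(f"Fraud amount:  mean ~ {fraud_mean:.0f}, median ~ {lognormal_median(5):.0f}")
print(f"Ratio of theoretical means: {fraud_mean / normal_mean:.1f}x")
```

Because the fraud distribution is heavy-tailed, the sample mean over a few hundred draws can sit noticeably away from the theoretical value.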


In [3]:
# 1. ISOLATION FOREST ANOMALY DETECTION
print(" 1. ISOLATION FOREST ANALYSIS")
print("=" * 31)

# Apply Isolation Forest to financial data
scaler_financial = StandardScaler()
financial_scaled = scaler_financial.fit_transform(financial_data)

# Test different contamination parameters
contamination_rates = [0.01, 0.03, 0.05, 0.1, 0.15]
isolation_results = {}

for contamination in contamination_rates:
    iso_forest = IsolationForest(
        contamination=contamination,
        random_state=42,
        n_estimators=100
    )

    # Fit and predict
    anomaly_predictions = iso_forest.fit_predict(financial_scaled)
    anomaly_scores = iso_forest.decision_function(financial_scaled)

    # Convert predictions (-1 for anomaly, 1 for normal) to binary
    binary_predictions = (anomaly_predictions == -1).astype(int)

    # Calculate metrics against the known synthetic labels
    true_positives = np.sum((binary_predictions == 1) & (financial_labels == 1))
    n_flagged = np.sum(binary_predictions == 1)
    precision = true_positives / n_flagged if n_flagged > 0 else 0
    recall = true_positives / np.sum(financial_labels == 1)
    f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

    isolation_results[contamination] = {
        'predictions': binary_predictions,
        'scores': anomaly_scores,
        'precision': precision,
        'recall': recall,
        'f1_score': f1_score
    }

    print(f"Contamination {contamination:.2f}: Precision={precision:.3f}, Recall={recall:.3f}, F1={f1_score:.3f}")

# Select best contamination rate
best_contamination = max(isolation_results.keys(), key=lambda x: isolation_results[x]['f1_score'])
best_model_results = isolation_results[best_contamination]

print(f"\nBest contamination rate: {best_contamination:.2f}")
print(f"Best F1-score: {best_model_results['f1_score']:.3f}")

# Analyze anomaly scores distribution
normal_scores = best_model_results['scores'][financial_labels == 0]
fraud_scores = best_model_results['scores'][financial_labels == 1]

print(f"Normal transaction scores: mean={normal_scores.mean():.3f}, std={normal_scores.std():.3f}")
print(f"Fraud transaction scores: mean={fraud_scores.mean():.3f}, std={fraud_scores.std():.3f}")

# Rough feature-importance proxy: ratio of each feature's variance among flagged
# anomalies to its variance among the rest (not a model-derived importance)
anomaly_indices = best_model_results['predictions'] == 1
if np.sum(anomaly_indices) > 0:
    anomaly_data = financial_data[anomaly_indices]
    normal_data = financial_data[~anomaly_indices]

    feature_importance = {}
    for col in financial_data.columns:
        anomaly_var = anomaly_data[col].var()
        normal_var = normal_data[col].var()
        importance = anomaly_var / (normal_var + 1e-10) # Avoid division by zero
        feature_importance[col] = importance

    print(f"\nFeature importance for anomaly detection:")
    for feature, importance in sorted(feature_importance.items(), key=lambda x: x[1], reverse=True):
        print(f"• {feature}: {importance:.2f}")

 1. ISOLATION FOREST ANALYSIS
Contamination 0.01: Precision=1.000, Recall=0.200, F1=0.333
Contamination 0.03: Precision=0.993, Recall=0.596, F1=0.745
Contamination 0.05: Precision=0.932, Recall=0.932, F1=0.932
Contamination 0.10: Precision=0.500, Recall=1.000, F1=0.667
Contamination 0.15: Precision=0.333, Recall=1.000, F1=0.500

Best contamination rate: 0.05
Best F1-score: 0.932
Normal transaction scores: mean=0.169, std=0.044
Fraud transaction scores: mean=-0.063, std=0.044

Feature importance for anomaly detection:
• amount: 678.82
• time_of_day: 5.55
• frequency: 3.93
• merchant_risk: 1.50
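
The precision and recall pattern in the contamination sweep follows largely from counting. As a sketch, assume an idealized detector that ranks every true anomaly ahead of every normal point and flags exactly `round(n_samples * contamination)` points; both are simplifying assumptions, and `ideal_metrics` is a hypothetical helper rather than anything from the notebook.

```python
def ideal_metrics(n_samples: int, n_anomalies: int, contamination: float):
    """Upper-bound precision, recall, and F1 for a perfectly ranking detector."""
    n_flagged = round(n_samples * contamination)
    true_positives = min(n_flagged, n_anomalies)
    precision = true_positives / n_flagged if n_flagged else 0.0
    recall = true_positives / n_anomalies
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Same dataset size and anomaly count as the financial data (10,000 rows, 500 fraud)
for rate in [0.01, 0.03, 0.05, 0.10, 0.15]:
    p, r, f1 = ideal_metrics(10_000, 500, rate)
    print(f"Contamination {rate:.2f}: Precision<={p:.3f}, Recall<={r:.3f}, F1<={f1:.3f}")
```

Below the true fraud rate, recall is capped by the number of points flagged; above it, precision is capped because extra flags can only land on normal points. This is why F1 tends to peak when `contamination` is close to the actual anomaly share.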

In [4]:
# 2. INTERACTIVE ISOLATION FOREST VISUALIZATIONS
print(" 2. INTERACTIVE ISOLATION FOREST VISUALIZATIONS")
print("=" * 49)

# Create comprehensive dashboard
fig = make_subplots(
 rows=2, cols=2,
 subplot_titles=[
 'Contamination Rate Performance',
 'Anomaly Score Distribution',
 'Transaction Amount vs Frequency (Colored by Anomaly)',
 'Feature Comparison: Normal vs Anomalous'
 ]
)

# Performance comparison across contamination rates
contamination_list = list(isolation_results.keys())
f1_scores = [isolation_results[c]['f1_score'] for c in contamination_list]
precision_scores = [isolation_results[c]['precision'] for c in contamination_list]
recall_scores = [isolation_results[c]['recall'] for c in contamination_list]

fig.add_trace(
 go.Scatter(x=contamination_list, y=f1_scores, mode='lines+markers',
 name='F1-Score', line=dict(color='blue', width=3)),
 row=1, col=1
)
fig.add_trace(
 go.Scatter(x=contamination_list, y=precision_scores, mode='lines+markers',
 name='Precision', line=dict(color='green', width=2)),
 row=1, col=1
)
fig.add_trace(
 go.Scatter(x=contamination_list, y=recall_scores, mode='lines+markers',
 name='Recall', line=dict(color='red', width=2)),
 row=1, col=1
)

# Anomaly score distributions
fig.add_trace(
 go.Histogram(x=normal_scores, name='Normal Transactions',
 opacity=0.7, nbinsx=30, marker_color='blue'),
 row=1, col=2
)
fig.add_trace(
 go.Histogram(x=fraud_scores, name='Fraudulent Transactions',
 opacity=0.7, nbinsx=30, marker_color='red'),
 row=1, col=2
)

# Transaction scatter plot colored by anomaly detection
financial_data['predicted_anomaly'] = best_model_results['predictions']
financial_data['anomaly_score'] = best_model_results['scores']

# Normal transactions
normal_trans = financial_data[financial_data['predicted_anomaly'] == 0]
fig.add_trace(
 go.Scatter(x=normal_trans['amount'], y=normal_trans['frequency'],
 mode='markers', name='Normal (Predicted)',
 marker=dict(color='lightblue', size=4, opacity=0.6)),
 row=2, col=1
)

# Anomalous transactions
anomaly_trans = financial_data[financial_data['predicted_anomaly'] == 1]
fig.add_trace(
 go.Scatter(x=anomaly_trans['amount'], y=anomaly_trans['frequency'],
 mode='markers', name='Anomalous (Predicted)',
 marker=dict(color='red', size=6, opacity=0.8)),
 row=2, col=1
)

# Feature comparison box plots
features = list(financial_data.columns[:-2]) # Exclude prediction columns
for i, feature in enumerate(features):
 normal_values = financial_data[financial_data['predicted_anomaly'] == 0][feature]
 anomaly_values = financial_data[financial_data['predicted_anomaly'] == 1][feature]

 fig.add_trace(
 go.Box(y=normal_values, name=f'Normal {feature}',
 marker_color='lightblue', showlegend=False),
 row=2, col=2
 )
 fig.add_trace(
 go.Box(y=anomaly_values, name=f'Anomaly {feature}',
 marker_color='red', showlegend=False),
 row=2, col=2
 )

fig.update_layout(height=800, title="Isolation Forest Anomaly Detection Dashboard")
fig.update_xaxes(title_text="Contamination Rate", row=1, col=1)
fig.update_xaxes(title_text="Anomaly Score", row=1, col=2)
fig.update_xaxes(title_text="Transaction Amount ($)", row=2, col=1)
fig.update_xaxes(title_text="Features", row=2, col=2)
fig.update_yaxes(title_text="Score", row=1, col=1)
fig.update_yaxes(title_text="Frequency", row=1, col=2)
fig.update_yaxes(title_text="Transaction Frequency", row=2, col=1)
fig.update_yaxes(title_text="Feature Values", row=2, col=2)
fig.show()

# Real-time monitoring simulation
print(f"\n REAL-TIME ANOMALY MONITORING SIMULATION:")

# Simulate new incoming transactions
new_transactions = pd.DataFrame({
 'amount': [50, 10000, 75, 50000, 200], # Mix of normal and suspicious amounts
 'frequency': [3, 25, 5, 40, 8], # Mix of frequencies
 'time_of_day': [14, 2, 10, 23, 16], # Mix of times
 'merchant_risk': [0.2, 0.8, 0.3, 0.9, 0.1] # Mix of risk levels
})

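# Refit a final model at the selected contamination rate (the models from the
# sweep were not kept), then score the new transactions with the fitted scaler
new_scaled = scaler_financial.transform(new_transactions)
final_model = IsolationForest(contamination=best_contamination, random_state=42, n_estimators=100)
final_model.fit(financial_scaled)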

new_predictions = final_model.predict(new_scaled)
new_scores = final_model.decision_function(new_scaled)

print("New transaction analysis:")
for i, (pred, score) in enumerate(zip(new_predictions, new_scores)):
 status = " ANOMALY" if pred == -1 else " NORMAL"
 amount = new_transactions.iloc[i]['amount']
 risk_level = "HIGH" if score < -0.1 else "MEDIUM" if score < 0 else "LOW"
 print(f"Transaction {i+1}: ${amount:,.0f} -> {status} (Score: {score:.3f}, Risk: {risk_level})")

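# Business impact calculation
fraud_detection_rate = best_model_results['recall']
# Note: 1 - precision is the share of alerts that are false (the false discovery
# rate), not the share of legitimate transactions that get flagged. Applying it
# to every legitimate transaction below likely overstates the false positive cost.
false_positive_rate = 1 - best_model_results['precision']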
avg_fraud_amount = financial_data[financial_labels == 1]['amount'].mean()
total_transactions_per_year = 1_000_000
fraud_rate = 0.05

prevented_fraud_amount = (total_transactions_per_year * fraud_rate *
 fraud_detection_rate * avg_fraud_amount)
false_positive_cost = (total_transactions_per_year * (1 - fraud_rate) *
 false_positive_rate * 50) # $50 cost per false positive

net_savings = prevented_fraud_amount - false_positive_cost
implementation_cost = 200_000

print(f"\n ISOLATION FOREST BUSINESS ROI:")
print(f"• Annual transactions: {total_transactions_per_year:,}")
print(f"• Fraud detection rate: {fraud_detection_rate:.1%}")
print(f"• Average fraud amount: ${avg_fraud_amount:,.0f}")
print(f"• Prevented fraud losses: ${prevented_fraud_amount:,.0f}")
print(f"• False positive costs: ${false_positive_cost:,.0f}")
print(f"• Net annual savings: ${net_savings:,.0f}")
print(f"• Implementation cost: ${implementation_cost:,.0f}")
print(f"• ROI: {(net_savings - implementation_cost)/implementation_cost*100:.0f}%")
print(f"• Payback period: {implementation_cost/net_savings*12:.1f} months")

 2. INTERACTIVE ISOLATION FOREST VISUALIZATIONS



 REAL-TIME ANOMALY MONITORING SIMULATION:
New transaction analysis:
Transaction 1: $50 ->  NORMAL (Score: 0.193, Risk: LOW)
Transaction 2: $10,000 ->  ANOMALY (Score: -0.177, Risk: HIGH)
Transaction 3: $75 ->  NORMAL (Score: 0.188, Risk: LOW)
Transaction 4: $50,000 ->  ANOMALY (Score: -0.194, Risk: HIGH)
Transaction 5: $200 ->  NORMAL (Score: 0.042, Risk: LOW)
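
The scores and labels printed above are linked by a fixed threshold. In scikit-learn, `decision_function` is `score_samples` shifted by the fitted `offset_`, and `predict` returns -1 wherever the shifted score is negative. The sketch below checks this on a small synthetic dataset that is an assumption for illustration, not the notebook's data; `risk_tier` mirrors the notebook's cutoffs and is a hypothetical helper.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Small illustrative dataset: one dense cluster plus a distant minority cluster
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 1, size=(950, 2)),
    rng.normal(6, 1, size=(50, 2)),
])

model = IsolationForest(contamination=0.05, n_estimators=100, random_state=0).fit(X)

raw_scores = model.score_samples(X)    # lower means more abnormal
decision = model.decision_function(X)  # raw score minus the fitted offset
labels = model.predict(X)              # -1 for anomaly, 1 for normal


def risk_tier(score: float, high_cutoff: float = -0.1) -> str:
    """Map a decision_function score to the tiers used in the monitoring loop."""
    if score < high_cutoff:
        return "HIGH"
    return "MEDIUM" if score < 0 else "LOW"


print("decision == score - offset:", np.allclose(decision, raw_scores - model.offset_))
print("flagged share:", (labels == -1).mean())
print("tiers for first three points:", [risk_tier(s) for s in decision[:3]])
```

Since the offset is derived from `contamination` on the training data, the zero point of the score moves whenever that parameter changes, so fixed cutoffs such as -0.1 are only comparable between models fitted with the same setting.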

 ISOLATION FOREST BUSINESS ROI:
• Annual transactions: 1,000,000
• Fraud detection rate: 93.2%
• Average fraud amount: $399
• Prevented fraud losses: $18,594,978
• False positive costs: $3,230,000
• Net annual savings: $15,364,978
• Implementation cost: $200,000
• ROI: 7582%
• Payback period: 0.2 months
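
The false positive cost above is sensitive to which error rate is applied. As a sketch, the confusion counts below are inferred from the reported metrics at contamination 0.05 (precision = recall = 0.932 with 500 fraud cases implies about 466 true positives among about 500 flagged); they are assumptions, not values printed by the notebook. The example contrasts the false discovery rate, 1 - precision, with the false positive rate, FP / (FP + TN).

```python
# Inferred counts (assumption): 10,000 transactions, 500 fraud, 500 flagged
n_legit = 9_500
true_positives = 466
false_positives = 500 - true_positives  # flagged but legitimate

false_discovery_rate = false_positives / (true_positives + false_positives)
false_positive_rate = false_positives / n_legit

annual_legit = 1_000_000 * (1 - 0.05)   # same volume and fraud rate as the ROI cell
cost_per_false_positive = 50            # same unit cost as the ROI cell

cost_using_fdr = annual_legit * false_discovery_rate * cost_per_false_positive
cost_using_fpr = annual_legit * false_positive_rate * cost_per_false_positive

print(f"False discovery rate (1 - precision): {false_discovery_rate:.3%}")
print(f"False positive rate (FP / legit):     {false_positive_rate:.3%}")
print(f"Cost if rate applied = FDR: ${cost_using_fdr:,.0f}")
print(f"Cost if rate applied = FPR: ${cost_using_fpr:,.0f}")
```

Under these assumed counts, the figure based on 1 - precision matches the cost reported above, while the false positive rate gives a much smaller estimate, so the reported net savings may be conservative.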
