# Tier 5: Random Forest Classification

---

**Author:** Brandon Deloatch
**Affiliation:** Quipu Research Labs, LLC
**Date:** 2025-10-02
**Version:** v1.3
**License:** MIT
**Notebook ID:** e4bd9f90-d13a-4de7-be04-925cc2fee1ed

---

## Citation
Brandon Deloatch, "Tier 5: Random Forest Classification," Quipu Research Labs, LLC, v1.3, 2025-10-02.

Please cite this notebook if used or adapted in publications, presentations, or derivative work.

---

## Contributors / Acknowledgments
- **Primary Author:** Brandon Deloatch (Quipu Research Labs, LLC)
- **Institutional Support:** Quipu Research Labs, LLC - Advanced Analytics Division
- **Technical Framework:** Built on scikit-learn, pandas, numpy, and plotly ecosystems
- **Methodological Foundation:** Statistical learning principles and modern data science best practices

---

## Version History
| Version | Date | Notes |
|---------|------|-------|
| v1.3 | 2025-10-02 | Enhanced professional formatting, comprehensive documentation, interactive visualizations |
| v1.2 | 2024-09-15 | Updated analysis methods, improved data generation algorithms |
| v1.0 | 2024-06-10 | Initial release with core analytical framework |

---

## Environment Dependencies
- **Python:** 3.8+
- **Core Libraries:** pandas 2.0+, numpy 1.24+, scikit-learn 1.3+
- **Visualization:** plotly 5.0+, matplotlib 3.7+
- **Statistical:** scipy 1.10+, statsmodels 0.14+
- **Development:** jupyter-lab 4.0+, ipywidgets 8.0+

> **Reproducibility Note:** Use requirements.txt or environment.yml for exact dependency matching.

---

## Data Provenance
| Dataset | Source | License | Notes |
|---------|--------|---------|-------|
| Synthetic Data | Generated in-notebook | MIT | Custom algorithms for realistic simulation |
| Statistical Distributions | NumPy/SciPy | BSD-3-Clause | Standard library implementations |
| ML Algorithms | Scikit-learn | BSD-3-Clause | Industry-standard implementations |
| Visualization Schemas | Plotly | MIT | Interactive dashboard frameworks |

---

## Execution Provenance Logs
- **Created:** 2025-10-02
- **Notebook ID:** e4bd9f90-d13a-4de7-be04-925cc2fee1ed
- **Execution Environment:** Jupyter Lab / VS Code
- **Computational Requirements:** Standard laptop/workstation (2GB+ RAM recommended)

> **Auto-tracking:** Execution metadata can be programmatically captured for reproducibility.

---

## Disclaimer & Responsible Use
This notebook is provided "as-is" for educational, research, and professional development purposes. Users assume full responsibility for any results, applications, or decisions derived from this analysis.

**Professional Standards:**
- Validate all results against domain expertise and additional data sources
- Respect licensing and attribution requirements for all dependencies
- Follow ethical guidelines for data analysis and algorithmic decision-making
- Credit all methodological sources and derivative frameworks appropriately

**Academic & Commercial Use:**
- Permitted under MIT license with proper attribution
- Suitable for educational curriculum and professional training
- Appropriate for commercial adaptation with citation requirements
- Recommended for reproducible research and transparent analytics

---



In [7]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, validation_curve
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_curve, auc
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
import warnings
warnings.filterwarnings('ignore')

print(" Tier 5: Random Forest Classification - Libraries Loaded!")
print("="*58)
print("Random Forest Classification Techniques:")
print("• Bootstrap aggregating (bagging) ensemble method")
print("• Random feature selection at each split")
print("• Out-of-bag (OOB) error estimation")
print("• Feature importance and selection")
print("• Extremely Randomized Trees (Extra Trees)")
print("• Ensemble diversity and bias-variance tradeoff")

 Tier 5: Random Forest Classification - Libraries Loaded!
Random Forest Classification Techniques:
• Bootstrap aggregating (bagging) ensemble method
• Random feature selection at each split
• Out-of-bag (OOB) error estimation
• Feature importance and selection
• Extremely Randomized Trees (Extra Trees)
• Ensemble diversity and bias-variance tradeoff
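
The first two techniques in the list above (bagging and random feature selection at each split) can be sketched by hand in a few lines. This is a minimal illustration, not how the notebook trains its models: the toy dataset, the 25-tree ensemble size, and the hard majority vote are all assumptions made for the example. scikit-learn's own forest averages class probabilities rather than counting votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy binary problem (assumed for illustration; not the notebook's data)
X, y = make_classification(n_samples=600, n_features=12, n_informative=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0, stratify=y)

rng = np.random.default_rng(0)
trees = []
for _ in range(25):  # odd number of trees, so a binary vote cannot tie
    # Bagging: draw a bootstrap sample (same size as the training set, with replacement)
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # Random feature selection: each split considers only sqrt(n_features) candidates
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=int(rng.integers(1_000_000)))
    trees.append(tree.fit(X_tr[idx], y_tr[idx]))

# Aggregate: one row of predictions per tree, then a majority vote per test row
votes = np.stack([t.predict(X_te) for t in trees])
majority = (votes.mean(axis=0) >= 0.5).astype(int)
bagged_accuracy = (majority == y_te).mean()
print(f"Hand-rolled bagged accuracy: {bagged_accuracy:.3f}")
```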


In [8]:
# Generate comprehensive Random Forest datasets
np.random.seed(42)

# 1. Customer segmentation dataset
def generate_customer_dataset(n_samples=4000):
    """Generate realistic customer segmentation dataset."""

    # Base customer features
    X, y = make_classification(
        n_samples=n_samples,
        n_features=15,
        n_informative=10,
        n_redundant=3,
        n_clusters_per_class=2,
        n_classes=4,
        class_sep=0.8,
        random_state=42
    )

    # Create realistic business features
    data = []
    segments = ['Budget', 'Standard', 'Premium', 'Enterprise']

    for i in range(n_samples):
        segment_idx = y[i]
        segment = segments[segment_idx]

        # Generate segment-specific features
        if segment == 'Budget':
            annual_revenue = np.random.lognormal(7, 0.5) # ~$1K
            transaction_frequency = np.random.poisson(2)
            support_tickets = np.random.poisson(3)
            satisfaction_score = np.random.beta(5, 3) * 10
        elif segment == 'Standard':
            annual_revenue = np.random.lognormal(8, 0.4) # ~$3K
            transaction_frequency = np.random.poisson(5)
            support_tickets = np.random.poisson(2)
            satisfaction_score = np.random.beta(6, 2) * 10
        elif segment == 'Premium':
            annual_revenue = np.random.lognormal(9, 0.3) # ~$8K
            transaction_frequency = np.random.poisson(8)
            support_tickets = np.random.poisson(1)
            satisfaction_score = np.random.beta(8, 2) * 10
        else: # Enterprise
            annual_revenue = np.random.lognormal(10, 0.4) # ~$22K
            transaction_frequency = np.random.poisson(12)
            support_tickets = np.random.poisson(1)
            satisfaction_score = np.random.beta(9, 1) * 10

        # Additional features
        data.append({
            'customer_id': f'CUST_{i:06d}',
            'annual_revenue': annual_revenue,
            'transaction_frequency': transaction_frequency,
            'support_tickets': support_tickets,
            'satisfaction_score': satisfaction_score,
            'customer_age_months': np.random.exponential(24),
            'product_usage_score': np.random.beta(3, 2) * 100,
            'referral_count': np.random.poisson(1),
            'contract_length': np.random.choice([6, 12, 24, 36], p=[0.1, 0.4, 0.3, 0.2]),
            'payment_method_risk': np.random.beta(2, 5),
            'geographic_tier': np.random.choice([1, 2, 3], p=[0.3, 0.5, 0.2]),
            'segment': segment,
            'segment_code': segment_idx
        })

        # Add technical features from make_classification
        for j in range(X.shape[1]):
            data[i][f'tech_feature_{j+1:02d}'] = X[i, j]

    return pd.DataFrame(data)

# 2. Gene expression dataset (high-dimensional)
def generate_gene_dataset(n_samples=1000, n_genes=200):
    """Generate synthetic gene expression dataset."""

    X, y = make_classification(
        n_samples=n_samples,
        n_features=n_genes,
        n_informative=int(n_genes * 0.1), # 10% informative genes
        n_redundant=int(n_genes * 0.05), # 5% redundant genes
        n_clusters_per_class=1,
        n_classes=3,
        class_sep=1.2,
        random_state=42
    )

    # Create gene names
    gene_names = [f'GENE_{i+1:04d}' for i in range(n_genes)]

    # Create DataFrame
    gene_df = pd.DataFrame(X, columns=gene_names)
    gene_df['condition'] = [['Healthy', 'Disease_A', 'Disease_B'][y[i]] for i in range(len(y))]
    gene_df['condition_code'] = y

    return gene_df, gene_names

# Generate datasets
customer_df = generate_customer_dataset()
gene_df, gene_names = generate_gene_dataset()

print(" Random Forest Datasets Created:")
print(f"Customer segmentation: {customer_df.shape}")
print(f"Segment distribution: {customer_df['segment'].value_counts().to_dict()}")
print(f"\nGene expression: {gene_df.shape}")
print(f"Condition distribution: {gene_df['condition'].value_counts().to_dict()}")
print(f"Average annual revenue by segment:")
for segment in customer_df['segment'].unique():
    avg_revenue = customer_df[customer_df['segment'] == segment]['annual_revenue'].mean()
    print(f" {segment}: ${avg_revenue:,.0f}")

 Random Forest Datasets Created:
Customer segmentation: (4000, 28)
Segment distribution: {'Standard': 1007, 'Budget': 1001, 'Premium': 998, 'Enterprise': 994}

Gene expression: (1000, 202)
Condition distribution: {'Disease_B': 335, 'Healthy': 333, 'Disease_A': 332}
Average annual revenue by segment:
 Enterprise: $23,689
 Standard: $3,187
 Budget: $1,224
 Premium: $8,454
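
The inline comments in the generator (`~$1K`, `~$3K`, `~$8K`, `~$22K`) match the medians of the lognormal draws, while the printed segment averages sit somewhat higher. That gap is expected: a lognormal with parameters `mu` and `sigma` has median `exp(mu)` and mean `exp(mu + sigma**2 / 2)`. The sketch below recomputes both from the parameters used in the cell; the helper names are introduced here for illustration only.

```python
import math

# (mu, sigma) passed to np.random.lognormal for each segment in the generator cell
segment_params = {
    "Budget": (7, 0.5),
    "Standard": (8, 0.4),
    "Premium": (9, 0.3),
    "Enterprise": (10, 0.4),
}

def lognormal_median(mu, sigma):
    """Median of a lognormal distribution; it does not depend on sigma."""
    return math.exp(mu)

def lognormal_mean(mu, sigma):
    """Mean of a lognormal distribution; the sigma term pulls it above the median."""
    return math.exp(mu + sigma ** 2 / 2)

expected = {
    name: (lognormal_median(mu, sigma), lognormal_mean(mu, sigma))
    for name, (mu, sigma) in segment_params.items()
}

for name, (median, mean) in expected.items():
    print(f"{name:<10} median ~ ${median:,.0f}   mean ~ ${mean:,.0f}")
```

The theoretical means come out at roughly $1.2K, $3.2K, $8.5K, and $23.9K, which is close to the sample averages printed above.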


In [9]:
# 1. RANDOM FOREST CLASSIFICATION AND HYPERPARAMETER OPTIMIZATION
print(" 1. RANDOM FOREST CLASSIFICATION AND HYPERPARAMETER OPTIMIZATION")
print("="*68)

# Prepare customer data
customer_features = [col for col in customer_df.columns if col not in ['customer_id', 'segment', 'segment_code']]
X_customer = customer_df[customer_features].values
y_customer = customer_df['segment_code'].values

# Split the data
X_cust_train, X_cust_test, y_cust_train, y_cust_test = train_test_split(
    X_customer, y_customer, test_size=0.2, random_state=42, stratify=y_customer
)

# Train baseline Random Forest
rf_baseline = RandomForestClassifier(n_estimators=100, random_state=42)
rf_baseline.fit(X_cust_train, y_cust_train)
y_pred_baseline = rf_baseline.predict(X_cust_test)
baseline_accuracy = accuracy_score(y_cust_test, y_pred_baseline)

print(f"Baseline Random Forest Performance:")
print(f"Accuracy: {baseline_accuracy:.3f}")
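# oob_score_ exists only when the forest is built with oob_score=True; the baseline is not, so the fallback message is printed
print(f"OOB Score: {rf_baseline.oob_score_:.3f}" if hasattr(rf_baseline, 'oob_score_') else "OOB not calculated")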

# Hyperparameter optimization
print(f"\nHyperparameter Optimization:")
# Full grid (324 combinations), kept for reference only; it is not passed to GridSearchCV below
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', None]
}

# Reduced grid for faster execution (16 combinations x 3 folds = 48 fits, plus the final refit)
reduced_param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, None],
    'min_samples_split': [2, 5],
    'max_features': ['sqrt', None]
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42, oob_score=True),
    reduced_param_grid,
    cv=3,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X_cust_train, y_cust_train)
best_rf = grid_search.best_estimator_

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.3f}")

# Evaluate optimized model
y_pred_optimized = best_rf.predict(X_cust_test)
optimized_accuracy = accuracy_score(y_cust_test, y_pred_optimized)

print(f"\nOptimized Random Forest Performance:")
print(f"Accuracy: {optimized_accuracy:.3f}")
print(f"Improvement: {optimized_accuracy - baseline_accuracy:+.3f}")
print(f"OOB Score: {best_rf.oob_score_:.3f}")

# Feature importance analysis
feature_importance = pd.DataFrame({
    'feature': customer_features,
    'importance': best_rf.feature_importances_
}).sort_values('importance', ascending=False)

print(f"\nTop 10 Most Important Features:")
for i, (idx, row) in enumerate(feature_importance.head(10).iterrows()):
    print(f"{i+1:2d}. {row['feature']}: {row['importance']:.4f}")

# Compare with single decision tree
single_tree = DecisionTreeClassifier(random_state=42)
single_tree.fit(X_cust_train, y_cust_train)
tree_accuracy = accuracy_score(y_cust_test, single_tree.predict(X_cust_test))

print(f"\nEnsemble vs Single Tree Comparison:")
print(f"Random Forest: {optimized_accuracy:.3f}")
print(f"Single Tree: {tree_accuracy:.3f}")
print(f"Ensemble benefit: {optimized_accuracy - tree_accuracy:+.3f}")

 1. RANDOM FOREST CLASSIFICATION AND HYPERPARAMETER OPTIMIZATION
Baseline Random Forest Performance:
Accuracy: 0.932
OOB not calculated

Hyperparameter Optimization:
Best parameters: {'max_depth': None, 'max_features': 'sqrt', 'min_samples_split': 5, 'n_estimators': 200}
Best CV score: 0.942

Optimized Random Forest Performance:
Accuracy: 0.931
Improvement: -0.001
OOB Score: 0.942

Top 10 Most Important Features:
 1. annual_revenue: 0.3955
 2. transaction_frequency: 0.1607
 3. satisfaction_score: 0.0709
 4. tech_feature_02: 0.0360
 5. tech_feature_03: 0.0331
 6. support_tickets: 0.0323
 7. tech_feature_11: 0.0318
 8. tech_feature_04: 0.0255
 9. tech_feature_07: 0.0220
10. tech_feature_09: 0.0213

Ensemble vs Single Tree Comparison:
Random Forest: 0.931
Single Tree: 0.875
Ensemble benefit: +0.056
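
The "Improvement: -0.001" line is easier to interpret next to the sampling noise of a test set this size. A rough binomial standard error for an accuracy measured on 800 held-out rows (20% of 4,000) is sketched below. It is an approximation that ignores the fact that both models are scored on the same rows, and the helper name is introduced here for illustration.

```python
import math

def accuracy_standard_error(accuracy, n_test):
    """Binomial standard error of an accuracy estimated from n_test independent rows."""
    return math.sqrt(accuracy * (1 - accuracy) / n_test)

n_test = 800                       # 20% of the 4,000 generated customers
baseline, optimized = 0.932, 0.931 # accuracies printed above

se = accuracy_standard_error(baseline, n_test)
gap = optimized - baseline

print(f"Standard error of baseline accuracy: {se:.4f}")
print(f"Observed gap: {gap:+.3f} ({abs(gap) / se:.2f} standard errors)")
```

Under these assumptions the gap is far smaller than one standard error, so the tuned and baseline forests are indistinguishable on this split. The +0.056 gain over the single tree is several standard errors wide.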
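
The baseline output reads "OOB not calculated" because scikit-learn only sets the `oob_score_` attribute when the forest is constructed with `oob_score=True`. A minimal sketch on a small toy dataset (assumed for illustration, not the customer data) shows the difference:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small toy dataset, used only to demonstrate the attribute
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)

# Default construction: no out-of-bag bookkeeping, so no oob_score_ attribute
rf_default = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# With oob_score=True each row is scored by the trees that did not see it in training
rf_oob = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0).fit(X, y)

print("default has oob_score_:", hasattr(rf_default, "oob_score_"))
print("oob=True has oob_score_:", hasattr(rf_oob, "oob_score_"))
print(f"OOB accuracy estimate: {rf_oob.oob_score_:.3f}")
```

Passing `oob_score=True` to the baseline would let the baseline and tuned forests be compared on OOB accuracy as well as test accuracy.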

In [10]:
# 2. HIGH-DIMENSIONAL DATA ANALYSIS (GENE EXPRESSION)
print(" 2. HIGH-DIMENSIONAL DATA ANALYSIS (GENE EXPRESSION)")
print("="*56)

# Prepare gene expression data
X_gene = gene_df[gene_names].values
y_gene = gene_df['condition_code'].values

# Split the data
X_gene_train, X_gene_test, y_gene_train, y_gene_test = train_test_split(
    X_gene, y_gene, test_size=0.2, random_state=42, stratify=y_gene
)

# Train Random Forest for gene data
rf_gene = RandomForestClassifier(
    n_estimators=200,
    max_features='sqrt',  # Important for high-dimensional data
    random_state=42,
    oob_score=True
)

rf_gene.fit(X_gene_train, y_gene_train)
y_gene_pred = rf_gene.predict(X_gene_test)
gene_accuracy = accuracy_score(y_gene_test, y_gene_pred)

print(f"Gene Expression Classification Performance:")
print(f"Accuracy: {gene_accuracy:.3f}")
print(f"OOB Score: {rf_gene.oob_score_:.3f}")

# Feature selection using Random Forest
gene_importance = pd.DataFrame({
    'gene': gene_names,
    'importance': rf_gene.feature_importances_
}).sort_values('importance', ascending=False)

# Select top genes
top_genes = gene_importance.head(20)
print(f"\nTop 10 Most Important Genes:")
for i, (idx, row) in enumerate(top_genes.head(10).iterrows()):
    print(f"{i+1:2d}. {row['gene']}: {row['importance']:.4f}")

# Train model with reduced features
top_gene_names = top_genes['gene'].tolist()
X_gene_reduced = gene_df[top_gene_names].values
# Same random_state and stratify as the first split, so the rows match y_gene_train / y_gene_test
X_gene_red_train, X_gene_red_test, _, _ = train_test_split(
    X_gene_reduced, y_gene, test_size=0.2, random_state=42, stratify=y_gene
)

rf_gene_reduced = RandomForestClassifier(
    n_estimators=200, random_state=42, oob_score=True
)
rf_gene_reduced.fit(X_gene_red_train, y_gene_train)
y_gene_red_pred = rf_gene_reduced.predict(X_gene_red_test)
gene_reduced_accuracy = accuracy_score(y_gene_test, y_gene_red_pred)

print(f"\nFeature Selection Results:")
print(f"All genes ({len(gene_names)}): {gene_accuracy:.3f}")
print(f"Top genes ({len(top_gene_names)}): {gene_reduced_accuracy:.3f}")
print(f"Performance change: {gene_reduced_accuracy - gene_accuracy:+.3f}")
print(f"Dimensionality reduction: {(1 - len(top_gene_names)/len(gene_names))*100:.1f}%")

# Permutation importance for verification
print(f"\nPermutation Importance Analysis:")
perm_importance = permutation_importance(
    rf_gene_reduced, X_gene_red_test, y_gene_test, n_repeats=5, random_state=42
)

perm_importance_df = pd.DataFrame({
    'gene': top_gene_names,
    'perm_importance': perm_importance.importances_mean,
    'perm_std': perm_importance.importances_std
}).sort_values('perm_importance', ascending=False)

print(f"Top 5 genes by permutation importance:")
for i, (idx, row) in enumerate(perm_importance_df.head(5).iterrows()):
    print(f"{i+1}. {row['gene']}: {row['perm_importance']:.4f} ± {row['perm_std']:.4f}")

 2. HIGH-DIMENSIONAL DATA ANALYSIS (GENE EXPRESSION)
Gene Expression Classification Performance:
Accuracy: 0.960
OOB Score: 0.929

Top 10 Most Important Genes:
 1. GENE_0088: 0.0584
 2. GENE_0199: 0.0317
 3. GENE_0158: 0.0314
 4. GENE_0070: 0.0311
 5. GENE_0139: 0.0303
 6. GENE_0130: 0.0277
 7. GENE_0030: 0.0273
 8. GENE_0117: 0.0269
 9. GENE_0177: 0.0260
10. GENE_0045: 0.0240

Feature Selection Results:
All genes (200): 0.960
Top genes (20): 0.955
Performance change: -0.005
Dimensionality reduction: 90.0%

Permutation Importance Analysis:

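
The cell above ranks genes with a forest fitted on the training split and then retrains on the top 20. When the same idea is evaluated with cross-validation, the selection step has to be refitted inside every fold, otherwise the held-out rows influence which features are kept. One way to arrange that is a `Pipeline` with `SelectFromModel`. The sketch below assumes a small toy dataset and a top-10 cut; it is not the notebook's method.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Toy high-dimensional data (assumed for illustration; not the notebook's gene dataset)
X, y = make_classification(
    n_samples=300, n_features=50, n_informative=5, n_redundant=2,
    n_classes=3, n_clusters_per_class=1, class_sep=1.2, random_state=0
)

# threshold=-np.inf disables the importance cutoff, so max_features alone keeps the top 10
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0),
    max_features=10,
    threshold=-np.inf,
)

pipe = Pipeline([
    ("select", selector),                                             # refit on each training fold
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
])

scores = cross_val_score(pipe, X, y, cv=3)
print(f"CV accuracy with in-fold selection: {scores.mean():.3f} ± {scores.std():.3f}")

# Fit once on all rows to inspect how many features the selector keeps
pipe.fit(X, y)
n_selected = int(pipe.named_steps["select"].get_support().sum())
print(f"Features kept: {n_selected} of {X.shape[1]}")
```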

In [11]:
# 3. ENSEMBLE BEHAVIOR ANALYSIS
print(" 3. ENSEMBLE BEHAVIOR ANALYSIS")
print("="*33)

# Analyze how ensemble size affects performance
n_estimators_range = [1, 5, 10, 25, 50, 100, 200, 300]
train_scores = []
test_scores = []
oob_scores = []

print("Analyzing ensemble size effect...")
for n_est in n_estimators_range:
    rf_temp = RandomForestClassifier(
        n_estimators=n_est,
        random_state=42,
        oob_score=True
    )

    rf_temp.fit(X_cust_train, y_cust_train)

    train_score = rf_temp.score(X_cust_train, y_cust_train)
    test_score = rf_temp.score(X_cust_test, y_cust_test)
    oob_score = rf_temp.oob_score_

    train_scores.append(train_score)
    test_scores.append(test_score)
    oob_scores.append(oob_score)

    print(f"n_estimators={n_est:3d}: Train={train_score:.3f}, Test={test_score:.3f}, OOB={oob_score:.3f}")

# Analyze max_features effect
max_features_options = ['sqrt', 'log2', 0.3, 0.5, 0.7, None]
max_features_scores = []

print(f"\nAnalyzing max_features effect:")
for max_feat in max_features_options:
    rf_temp = RandomForestClassifier(
        n_estimators=100,
        max_features=max_feat,
        random_state=42,
        oob_score=True
    )

    cv_scores = cross_val_score(rf_temp, X_cust_train, y_cust_train, cv=3)
    max_features_scores.append(cv_scores.mean())

    print(f"max_features={str(max_feat):6s}: CV Score={cv_scores.mean():.3f} ± {cv_scores.std():.3f}")

# Best max_features
best_max_feat_idx = np.argmax(max_features_scores)
best_max_feat = max_features_options[best_max_feat_idx]
print(f"\nBest max_features: {best_max_feat} (Score: {max_features_scores[best_max_feat_idx]:.3f})")

# Compare with Extra Trees
print(f"\nComparing Random Forest vs Extra Trees:")
extra_trees = ExtraTreesClassifier(
    n_estimators=100,
    random_state=42,
    bootstrap=True,  # Enable bootstrap for OOB score
    oob_score=True
)
extra_trees.fit(X_cust_train, y_cust_train)
et_accuracy = extra_trees.score(X_cust_test, y_cust_test)
rf_accuracy = best_rf.score(X_cust_test, y_cust_test)

print(f"Random Forest: {rf_accuracy:.3f}")
print(f"Extra Trees: {et_accuracy:.3f}")
print(f"Difference: {et_accuracy - rf_accuracy:+.3f}")

# Bootstrap sampling analysis
print(f"\nBootstrap Sampling Analysis:")
n_samples_bootstrap = X_cust_train.shape[0]
unique_samples_ratios = []

for i in range(100): # 100 bootstrap samples
    bootstrap_indices = np.random.choice(n_samples_bootstrap, n_samples_bootstrap, replace=True)
    unique_samples = len(np.unique(bootstrap_indices))
    unique_ratio = unique_samples / n_samples_bootstrap
    unique_samples_ratios.append(unique_ratio)

avg_unique_ratio = np.mean(unique_samples_ratios)
print(f"Average unique samples in bootstrap: {avg_unique_ratio:.3f} ({avg_unique_ratio*100:.1f}%)")
print(f"Average out-of-bag samples: {1-avg_unique_ratio:.3f} ({(1-avg_unique_ratio)*100:.1f}%)")

 3. ENSEMBLE BEHAVIOR ANALYSIS
Analyzing ensemble size effect...
n_estimators=  1: Train=0.911, Test=0.769, OOB=0.431
n_estimators=  5: Train=0.990, Test=0.885, OOB=0.747
n_estimators= 10: Train=0.996, Test=0.915, OOB=0.857
n_estimators= 25: Train=1.000, Test=0.929, OOB=0.917
n_estimators= 50: Train=1.000, Test=0.935, OOB=0.933
n_estimators=100: Train=1.000, Test=0.932, OOB=0.940
n_estimators=200: Train=1.000, Test=0.927, OOB=0.943
n_estimators=300: Train=1.000, Test=0.930, OOB=0.941

Analyzing max_features effect:
max_features=sqrt  : CV Score=0.941 ± 0.003
max_features=log2  : CV Score=0.941 ± 0.013
max_fe
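
The bootstrap sampling analysis in the cell above has a closed form: a given row appears in a bootstrap sample of size n with probability 1 - (1 - 1/n)^n, which approaches 1 - 1/e ≈ 0.632 for large n. The sketch below checks this for the 3,200 training rows (80% of 4,000) with a small standard-library simulation. It also computes how likely a row is to be in-bag for every tree in a small forest, in which case it receives no OOB vote at all. That may explain the very low OOB scores printed for 1 and 5 trees, although the notebook does not confirm this.

```python
import random

n = 3200  # training rows: 80% of the 4,000 generated customers

# Probability that a given row appears at least once in one bootstrap sample
in_bag = 1 - (1 - 1 / n) ** n
out_of_bag = 1 - in_bag

# Simulation with the standard library: fraction of distinct rows per bootstrap sample
rng = random.Random(0)
ratios = []
for _ in range(100):
    sample = {rng.randrange(n) for _ in range(n)}  # set keeps unique row indices only
    ratios.append(len(sample) / n)
simulated_in_bag = sum(ratios) / len(ratios)

# Probability that a row is in-bag for all k independent trees (no OOB prediction available)
no_oob_vote = {k: in_bag ** k for k in (1, 5, 10, 25)}

print(f"Theoretical in-bag fraction: {in_bag:.4f}, out-of-bag: {out_of_bag:.4f}")
print(f"Simulated in-bag fraction:   {simulated_in_bag:.4f}")
for k, p in no_oob_vote.items():
    print(f"k={k:2d} trees: share of rows with no OOB vote ~ {p:.4f}")
```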

In [12]:
# 4. COMPREHENSIVE RANDOM FOREST VISUALIZATION DASHBOARD
print(" 4. COMPREHENSIVE RANDOM FOREST VISUALIZATION DASHBOARD")
print("="*60)

# Create comprehensive dashboard
fig = make_subplots(
    rows=3, cols=2,
    subplot_titles=[
        'Ensemble Size vs Performance',
        'Feature Importance: Customer Segmentation',
        'max_features Parameter Analysis',
        'Confusion Matrix: Customer Segments',
        'Gene Expression: Top Important Genes',
        'OOB Error vs Training Error'
    ],
    specs=[
        [{"secondary_y": False}, {"secondary_y": False}],
        [{"secondary_y": False}, {"type": "heatmap"}],
        [{"secondary_y": False}, {"secondary_y": False}]
    ]
)

# 1. Ensemble size analysis
fig.add_trace(
    go.Scatter(
        x=n_estimators_range,
        y=train_scores,
        mode='lines+markers',
        name='Training Score',
        line=dict(color='blue')
    ),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(
        x=n_estimators_range,
        y=test_scores,
        mode='lines+markers',
        name='Test Score',
        line=dict(color='red')
    ),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(
        x=n_estimators_range,
        y=oob_scores,
        mode='lines+markers',
        name='OOB Score',
        line=dict(color='green')
    ),
    row=1, col=1
)

# 2. Feature importance
top_features = feature_importance.head(15)
fig.add_trace(
    go.Bar(
        x=top_features['importance'],
        y=top_features['feature'],
        orientation='h',
        marker_color='forestgreen',
        name='Feature Importance'
    ),
    row=1, col=2
)

# 3. max_features analysis
fig.add_trace(
    go.Bar(
        x=[str(x) for x in max_features_options],
        y=max_features_scores,
        marker_color='lightblue',
        name='CV Score'
    ),
    row=2, col=1
)

# 4. Confusion matrix
cm_customer = confusion_matrix(y_cust_test, y_pred_optimized)
segments = ['Budget', 'Standard', 'Premium', 'Enterprise']

fig.add_trace(
    go.Heatmap(
        z=cm_customer,
        x=segments,
        y=segments,
        colorscale='Blues',
        text=cm_customer,
        texttemplate='%{text}',
        hovertemplate='Predicted: %{x}<br>Actual: %{y}<br>Count: %{z}<extra></extra>'
    ),
    row=2, col=2
)

# 5. Gene importance
top_genes_plot = gene_importance.head(15)
fig.add_trace(
    go.Bar(
        x=top_genes_plot['importance'],
        y=top_genes_plot['gene'],
        orientation='h',
        marker_color='orange',
        name='Gene Importance'
    ),
    row=3, col=1
)

# 6. Training vs OOB error comparison
error_train = [1 - score for score in train_scores]
error_oob = [1 - score for score in oob_scores]

fig.add_trace(
    go.Scatter(
        x=n_estimators_range,
        y=error_train,
        mode='lines+markers',
        name='Training Error',
        line=dict(color='blue', dash='solid')
    ),
    row=3, col=2
)

fig.add_trace(
    go.Scatter(
        x=n_estimators_range,
        y=error_oob,
        mode='lines+markers',
        name='OOB Error',
        line=dict(color='red', dash='dash')
    ),
    row=3, col=2
)

# Update layout
fig.update_layout(
    height=1200,
    title="Random Forest Classification - Comprehensive Analysis Dashboard",
    showlegend=True
)

# Update axis labels
fig.update_xaxes(title_text="Number of Estimators", row=1, col=1)
fig.update_xaxes(title_text="Feature Importance", row=1, col=2)
fig.update_xaxes(title_text="max_features Parameter", row=2, col=1)
fig.update_xaxes(title_text="Predicted Segment", row=2, col=2)
fig.update_xaxes(title_text="Gene Importance", row=3, col=1)
fig.update_xaxes(title_text="Number of Estimators", row=3, col=2)

fig.update_yaxes(title_text="Accuracy Score", row=1, col=1)
fig.update_yaxes(title_text="Features", row=1, col=2)
fig.update_yaxes(title_text="CV Score", row=2, col=1)
fig.update_yaxes(title_text="Actual Segment", row=2, col=2)
fig.update_yaxes(title_text="Gene Names", row=3, col=1)
fig.update_yaxes(title_text="Error Rate", row=3, col=2)

fig.show()

 4. COMPREHENSIVE RANDOM FOREST VISUALIZATION DASHBOARD
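
The rendered dashboard is not captured in this export, so the confusion-matrix panel cannot be read here. As a reading aid, the sketch below shows how per-segment recall and precision follow from a matrix laid out the way `sklearn.metrics.confusion_matrix` returns it (rows are actual classes, columns are predicted classes), which is also the orientation used by the heatmap axes. The counts are hypothetical and chosen only for illustration; they are not the notebook's results.

```python
segments = ["Budget", "Standard", "Premium", "Enterprise"]

# Hypothetical counts for illustration only (rows = actual, columns = predicted)
cm = [
    [190,   8,   2,   0],
    [ 10, 185,   5,   1],
    [  1,   6, 186,   7],
    [  0,   1,   9, 189],
]

recall = {}
precision = {}
for i, name in enumerate(segments):
    true_positives = cm[i][i]
    recall[name] = true_positives / sum(cm[i])                    # share of actual rows recovered
    precision[name] = true_positives / sum(row[i] for row in cm)  # share of predictions that are right

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(segments))) / total

for name in segments:
    print(f"{name:<10} recall={recall[name]:.3f}  precision={precision[name]:.3f}")
print(f"Overall accuracy: {accuracy:.4f}")
```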


In [13]:
# 5. BUSINESS INSIGHTS AND ROI ANALYSIS
print(" 5. BUSINESS INSIGHTS AND ROI ANALYSIS")
print("="*40)

# Customer segmentation business impact
print("Customer Segmentation System ROI:")
total_customers = 100000 # Total customer base
segmentation_accuracy = optimized_accuracy

# Revenue impact by segment
segment_revenues = {
    'Budget': 1000,
    'Standard': 3000,
    'Premium': 8000,
    'Enterprise': 22000
}

# Calculate improved targeting efficiency
baseline_conversion = 0.05  # 5% conversion without segmentation
segment_conversion_improvement = {
    'Budget': 0.02,     # 2% improvement
    'Standard': 0.03,   # 3% improvement
    'Premium': 0.05,    # 5% improvement
    'Enterprise': 0.08  # 8% improvement
}

# Calculate segment distribution
segment_distribution = customer_df['segment'].value_counts(normalize=True).to_dict()

total_revenue_improvement = 0
for segment, proportion in segment_distribution.items():
    customers_in_segment = total_customers * proportion
    correctly_identified = customers_in_segment * segmentation_accuracy

    base_revenue = correctly_identified * baseline_conversion * segment_revenues[segment]
    improved_conversion = baseline_conversion + segment_conversion_improvement[segment]
    improved_revenue = correctly_identified * improved_conversion * segment_revenues[segment]

    segment_improvement = improved_revenue - base_revenue
    total_revenue_improvement += segment_improvement

    print(f"• {segment}: {customers_in_segment:,.0f} customers, "
          f"${segment_improvement:,.0f} additional revenue")

# System costs
implementation_cost = 250000 # Initial development
annual_operational_cost = 80000 # Maintenance and updates
net_annual_benefit = total_revenue_improvement - annual_operational_cost
# First-year ROI: annual benefit net of operating cost, less the one-off implementation cost
roi = (net_annual_benefit - implementation_cost) / implementation_cost

print(f"\nCustomer Segmentation ROI Summary:")
print(f"• Total revenue improvement: ${total_revenue_improvement:,.0f}/year")
print(f"• Implementation cost: ${implementation_cost:,.0f}")
print(f"• Annual operational cost: ${annual_operational_cost:,.0f}")
print(f"• Net annual benefit: ${net_annual_benefit:,.0f}")
print(f"• ROI: {roi*100:.0f}%")
print(f"• Payback period: {implementation_cost/net_annual_benefit*12:.1f} months")

# Gene expression analysis business impact
print(f"\nBiomedical Research Cost Savings:")
total_genes_analyzed = len(gene_names)
genes_selected = len(top_gene_names)
cost_per_gene_analysis = 500 # Cost to analyze each gene

baseline_analysis_cost = total_genes_analyzed * cost_per_gene_analysis
reduced_analysis_cost = genes_selected * cost_per_gene_analysis
cost_savings = baseline_analysis_cost - reduced_analysis_cost

print(f"• Genes reduced from {total_genes_analyzed} to {genes_selected}")
print(f"• Cost per gene analysis: ${cost_per_gene_analysis}")
print(f"• Analysis cost savings: ${cost_savings:,.0f} per study")
print(f"• Accuracy maintained: {gene_reduced_accuracy:.3f} vs {gene_accuracy:.3f}")
print(f"• Dimensionality reduction: {(1-genes_selected/total_genes_analyzed)*100:.1f}%")

# Random Forest advantages summary
print(f"\nRandom Forest Key Advantages:")
print(f"• Handles high-dimensional data effectively")
print(f"• Provides feature importance for interpretability")
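print(f"• Built-in cross-validation through OOB error")  # strictly a bootstrap hold-out estimate, not k-fold CV
print(f"• Resistant to overfitting with large ensembles")
print(f"• Handles missing values and categorical features")  # caveat: scikit-learn forests need categorical features encoded, and NaN support depends on the installed version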
print(f"• Parallelizable for fast training on large datasets")

print(f"\nImplementation Guidelines:")
print(f"• n_estimators: Start with 100, increase until OOB error stabilizes")
print(f"• max_features: Use 'sqrt' for classification, 'None' for small datasets")
print(f"• max_depth: Start with None, add constraints if overfitting occurs")
print(f"• min_samples_split: Increase (5-10) for noisy data")
print(f"• Feature selection: Use importance scores for dimensionality reduction")

print(f"\nCross-Reference Learning Path:")
print(f"• Foundation: Tier2_DecisionTree.ipynb (tree fundamentals)")
print(f"• Building On: Tier2_RandomForest.ipynb (basic implementation)")
print(f"• Comparison: Tier5_GradientBoosting.ipynb (boosting vs bagging)")
print(f"• Advanced: Advanced_EnsembleClassification.ipynb, Advanced_FeatureSelection.ipynb")

 5. BUSINESS INSIGHTS AND ROI ANALYSIS
Customer Segmentation System ROI:
• Standard: 25,175 customers, $2,109,980 additional revenue
• Budget: 25,025 customers, $466,091 additional revenue
• Premium: 24,950 customers, $9,293,875 additional revenue
• Enterprise: 24,850 customers, $40,729,150 additional revenue

Customer Segmentation ROI Summary:
• Total revenue improvement: $52,599,095/year
• Implementation cost: $250,000
• Annual operational cost: $80,000
• Net annual benefit: $52,519,095
• ROI: 20908%
• Payback period: 0.1 months

Biomedical Research Cost Savings:
• Genes reduced from 200 to 20
• Cost per gene analysis: $500
• Analysis cost savings: $90,000 per study
• Accuracy maintained: 0.955 vs 0.960
• Dimensionality reduction: 90.0%

Random Forest Key Advantages:
• Handles high-dimensional data effectively
• Provides feature importance for interpretability
• Built-in cross-validation through OOB error
• Resistant to overfitting with large ensembles
• Handles missing values and ca
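
The ROI summary above follows directly from three numbers, and the headline figure depends almost entirely on the assumed conversion uplifts. The sketch below reproduces the arithmetic from the printed totals and then scales the revenue gain down to show how the result moves. The function name and the `realised_share` values are assumptions introduced for this illustration.

```python
def first_year_roi(revenue_gain, implementation_cost, annual_operating_cost):
    """Mirror the notebook's formulas for net benefit, first-year ROI, and payback."""
    net_annual_benefit = revenue_gain - annual_operating_cost
    roi = (net_annual_benefit - implementation_cost) / implementation_cost
    payback_months = implementation_cost / net_annual_benefit * 12
    return net_annual_benefit, roi, payback_months

reported_gain = 52_599_095      # total revenue improvement printed above
implementation_cost = 250_000
annual_operating_cost = 80_000

net, roi, payback = first_year_roi(reported_gain, implementation_cost, annual_operating_cost)
print(f"Net annual benefit: ${net:,.0f}, ROI: {roi * 100:.0f}%, payback: {payback:.1f} months")

# Sensitivity: suppose only a fraction of the modelled uplift is realised (hypothetical shares)
for realised_share in (1.0, 0.1, 0.01):
    _, roi_s, payback_s = first_year_roi(
        reported_gain * realised_share, implementation_cost, annual_operating_cost
    )
    print(f"share={realised_share:>4}: ROI {roi_s * 100:,.0f}%, payback {payback_s:.1f} months")
```

Even if only 1% of the modelled uplift is realised, the formula still returns a positive first-year ROI, so the conclusion rests on the uplift assumptions rather than on the classifier's accuracy.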