# Tier 2: Random Forest

---

**Author:** Brandon Deloatch
**Affiliation:** Quipu Research Labs, LLC
**Date:** 2025-10-02
**Version:** v1.3
**License:** MIT
**Notebook ID:** af9c8a07-a69a-44d9-bb82-a24d738f13c3

---

## Citation
Brandon Deloatch, "Tier 2: Random Forest," Quipu Research Labs, LLC, v1.3, 2025-10-02.

Please cite this notebook if used or adapted in publications, presentations, or derivative work.

---

## Contributors / Acknowledgments
- **Primary Author:** Brandon Deloatch (Quipu Research Labs, LLC)
- **Institutional Support:** Quipu Research Labs, LLC - Advanced Analytics Division
- **Technical Framework:** Built on scikit-learn, pandas, numpy, and plotly ecosystems
- **Methodological Foundation:** Statistical learning principles and modern data science best practices

---

## Version History
| Version | Date | Notes |
|---------|------|-------|
| v1.3 | 2025-10-02 | Enhanced professional formatting, comprehensive documentation, interactive visualizations |
| v1.2 | 2024-09-15 | Updated analysis methods, improved data generation algorithms |
| v1.0 | 2024-06-10 | Initial release with core analytical framework |

---

## Environment Dependencies
- **Python:** 3.8+
- **Core Libraries:** pandas 2.0+, numpy 1.24+, scikit-learn 1.3+
- **Visualization:** plotly 5.0+, matplotlib 3.7+
- **Statistical:** scipy 1.10+, statsmodels 0.14+
- **Development:** jupyter-lab 4.0+, ipywidgets 8.0+

> **Reproducibility Note:** Use requirements.txt or environment.yml for exact dependency matching.
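
To compare an environment against the minimum versions listed above, the installed versions can be queried with the standard library. The following is a minimal sketch: the helper name `installed_versions` is hypothetical, the package names are assumed to match their PyPI distribution names, and the code only reports versions rather than enforcing them.

```python
from importlib import metadata  # standard library since Python 3.8


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names as published on PyPI (assumed to match the list above).
core = ["pandas", "numpy", "scikit-learn", "plotly", "matplotlib", "scipy", "statsmodels"]
for name, version in installed_versions(core).items():
    print(f"{name}: {version or 'not installed'}")
```

The printed versions can then be pinned in `requirements.txt` or `environment.yml`.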

---

## Data Provenance
| Dataset | Source | License | Notes |
|---------|--------|---------|-------|
| Synthetic Data | Generated in-notebook | MIT | Custom algorithms for realistic simulation |
| Statistical Distributions | NumPy/SciPy | BSD-3-Clause | Standard library implementations |
| ML Algorithms | Scikit-learn | BSD-3-Clause | Industry-standard implementations |
| Visualization Schemas | Plotly | MIT | Interactive dashboard frameworks |

---

## Execution Provenance Logs
- **Created:** 2025-10-02
- **Notebook ID:** af9c8a07-a69a-44d9-bb82-a24d738f13c3
- **Execution Environment:** Jupyter Lab / VS Code
- **Computational Requirements:** Standard laptop/workstation (2GB+ RAM recommended)

> **Auto-tracking:** Execution metadata can be programmatically captured for reproducibility.
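
As one possible way to capture execution metadata programmatically, the sketch below collects a few run details with the standard library. The function name `capture_execution_metadata` and the field names are illustrative assumptions, not part of the notebook.

```python
import json
import platform
import sys
from datetime import datetime, timezone


def capture_execution_metadata(notebook_id):
    """Collect basic run metadata; the field names are illustrative."""
    return {
        "notebook_id": notebook_id,
        "executed_at_utc": datetime.now(timezone.utc).isoformat(),
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "executable": sys.executable,
    }


run_info = capture_execution_metadata("af9c8a07-a69a-44d9-bb82-a24d738f13c3")
print(json.dumps(run_info, indent=2))
```

Writing this dictionary to a JSON file next to the notebook outputs would give each run a simple provenance record.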

---

## Disclaimer & Responsible Use
This notebook is provided "as-is" for educational, research, and professional development purposes. Users assume full responsibility for any results, applications, or decisions derived from this analysis.

**Professional Standards:**
- Validate all results against domain expertise and additional data sources
- Respect licensing and attribution requirements for all dependencies
- Follow ethical guidelines for data analysis and algorithmic decision-making
- Credit all methodological sources and derivative frameworks appropriately

**Academic & Commercial Use:**
- Permitted under MIT license with proper attribution
- Suitable for educational curriculum and professional training
- Appropriate for commercial adaptation with citation requirements
- Recommended for reproducible research and transparent analytics

---



In [None]:
# Import Essential Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

# Scikit-learn imports
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.metrics import mean_squared_error, r2_score, roc_curve, auc, roc_auc_score
from sklearn.metrics import precision_recall_curve, mean_absolute_error
from sklearn.tree import DecisionTreeClassifier

# Feature selection
from sklearn.feature_selection import SelectFromModel
from sklearn.inspection import permutation_importance

import warnings
warnings.filterwarnings('ignore')

print("Tier 2: Random Forest - Libraries Loaded Successfully!")
print("=" * 65)
print("Available Random Forest Techniques:")
print("• Random Forest Classification - Ensemble voting for robust classification")
print("• Random Forest Regression - Ensemble averaging for continuous prediction")
print("• Feature Importance Analysis - Variable ranking and selection")
print("• Out-of-Bag (OOB) Validation - Built-in model validation")
print("• Hyperparameter Optimization - n_estimators, max_depth, min_samples tuning")
print("• Ensemble Interpretation - Understanding collective decision making")
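
The out-of-bag validation listed above works because each tree is trained on a bootstrap sample, which leaves some rows unseen by that tree. The sketch below simulates this with the standard library only, as an illustration rather than scikit-learn's implementation. The helper name `oob_fraction` is hypothetical.

```python
import random


def oob_fraction(n_samples, rng):
    """Fraction of rows left out of one bootstrap sample of size n_samples."""
    drawn = {rng.randrange(n_samples) for _ in range(n_samples)}
    return 1 - len(drawn) / n_samples


rng = random.Random(42)
n = 1000
simulated = sum(oob_fraction(n, rng) for _ in range(50)) / 50
expected = (1 - 1 / n) ** n  # approaches 1/e (about 0.368) as n grows
print(f"simulated OOB fraction: {simulated:.3f}, theoretical: {expected:.3f}")
```

Roughly a third of the training rows are out-of-bag for any given tree, which is what allows the forest to score itself without a separate validation split.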

In [None]:
# Generate Comprehensive Datasets for Random Forest Analysis
np.random.seed(42)

def generate_random_forest_datasets():
 """Generate datasets optimized for Random Forest analysis"""

 # 1. CLASSIFICATION DATASET - Employee Performance Prediction
 n_employees = 1200

 # Employee demographics
 age = np.random.normal(35, 10, n_employees)
 age = np.clip(age, 22, 65)

 years_experience = np.random.exponential(scale=5, size=n_employees) + 1
 years_experience = np.clip(years_experience, 1, 30)

 education_level = np.random.choice([1, 2, 3, 4], size=n_employees, p=[0.2, 0.3, 0.3, 0.2])
 # 1=High School, 2=Bachelor's, 3=Master's, 4=PhD

 # Performance-related features
 training_hours = np.random.gamma(shape=2, scale=20, size=n_employees)
 training_hours = np.clip(training_hours, 5, 120)

 projects_completed = np.random.poisson(lam=8, size=n_employees) + 1

 team_size = np.random.choice([3, 5, 8, 12, 15], size=n_employees, p=[0.2, 0.3, 0.25, 0.15, 0.1])

 work_from_home_days = np.random.poisson(lam=2, size=n_employees)
 work_from_home_days = np.clip(work_from_home_days, 0, 5)

 # Behavioral features
 meeting_attendance = np.random.beta(a=8, b=2, size=n_employees) # High attendance generally

 peer_collaboration_score = np.random.normal(7, 2, n_employees)
 peer_collaboration_score = np.clip(peer_collaboration_score, 1, 10)

 innovation_score = np.random.gamma(shape=3, scale=2, size=n_employees)
 innovation_score = np.clip(innovation_score, 1, 10)

 # Department (affects performance patterns)
 departments = ['Engineering', 'Sales', 'Marketing', 'HR', 'Finance']
 department = np.random.choice(departments, size=n_employees, p=[0.3, 0.25, 0.2, 0.15, 0.1])

 # Create realistic performance ratings with complex interactions
 performance_score = (
 0.1 * (age - 35) / 10 + # Slight age effect
 0.2 * (years_experience - 5) / 10 + # Experience matters
 0.15 * (education_level - 2) + # Education impact
 0.2 * (training_hours - 40) / 40 + # Training effect
 0.1 * (projects_completed - 8) / 5 + # Project completion
 -0.05 * (team_size - 8) / 5 + # Smaller teams might be better
 0.1 * (meeting_attendance - 0.8) / 0.2 + # Attendance matters
 0.15 * (peer_collaboration_score - 7) / 3 + # Collaboration
 0.1 * (innovation_score - 6) / 4 + # Innovation
 np.random.normal(0, 0.3, n_employees) # Random variation
 )

 # Adjust for department effects
 dept_effects = {'Engineering': 0.1, 'Sales': 0.05, 'Marketing': 0.0, 'HR': -0.05, 'Finance': 0.02}
 performance_score += np.array([dept_effects[dept] for dept in department])

 # Convert to performance categories
 # Use percentiles to create balanced classes
 performance_percentiles = np.percentile(performance_score, [33, 67])
 performance_rating = np.zeros(n_employees, dtype=int)
 performance_rating[performance_score <= performance_percentiles[0]] = 0 # Needs Improvement
 performance_rating[(performance_score > performance_percentiles[0]) &
 (performance_score <= performance_percentiles[1])] = 1 # Meets Expectations
 performance_rating[performance_score > performance_percentiles[1]] = 2 # Exceeds Expectations

 # Encode department as numerical
 dept_encoder = LabelEncoder()
 department_encoded = dept_encoder.fit_transform(department)

 classification_df = pd.DataFrame({
 'age': age,
 'years_experience': years_experience,
 'education_level': education_level,
 'training_hours': training_hours,
 'projects_completed': projects_completed,
 'team_size': team_size,
 'work_from_home_days': work_from_home_days,
 'meeting_attendance': meeting_attendance,
 'peer_collaboration_score': peer_collaboration_score,
 'innovation_score': innovation_score,
 'department': department_encoded,
 'performance_rating': performance_rating
 })

 # 2. REGRESSION DATASET - Real Estate Price Prediction
 n_houses = 1000

 # Property characteristics
 house_size = np.random.gamma(shape=3, scale=600, size=n_houses) + 800
 house_size = np.clip(house_size, 800, 4000)

 bedrooms = np.random.poisson(lam=3, size=n_houses) + 1
 bedrooms = np.clip(bedrooms, 1, 6)

 bathrooms = np.random.poisson(lam=2, size=n_houses) + 1
 bathrooms = np.clip(bathrooms, 1, 5)

 garage_spaces = np.random.choice([0, 1, 2, 3], size=n_houses, p=[0.1, 0.3, 0.5, 0.1])

 house_age = np.random.exponential(scale=15, size=n_houses) + 1
 house_age = np.clip(house_age, 1, 100)

 lot_size = np.random.gamma(shape=2, scale=3000, size=n_houses) + 2000
 lot_size = np.clip(lot_size, 2000, 20000)

 # Location features
 distance_to_downtown = np.random.exponential(scale=8, size=n_houses) + 1
 distance_to_downtown = np.clip(distance_to_downtown, 1, 30)

 school_rating = np.random.beta(a=5, b=2, size=n_houses) * 10 + 1
 school_rating = np.clip(school_rating, 1, 10)

 crime_rate = np.random.exponential(scale=3, size=n_houses) + 0.5
 crime_rate = np.clip(crime_rate, 0.5, 15)

 # Neighborhood amenities
 parks_nearby = np.random.poisson(lam=2, size=n_houses)
 parks_nearby = np.clip(parks_nearby, 0, 8)

 shopping_centers_nearby = np.random.poisson(lam=1.5, size=n_houses)
 shopping_centers_nearby = np.clip(shopping_centers_nearby, 0, 5)

 # Property condition and features
 renovation_score = np.random.beta(a=3, b=2, size=n_houses) * 10
 renovation_score = np.clip(renovation_score, 1, 10)

 has_pool = np.random.binomial(1, 0.25, n_houses)
 has_fireplace = np.random.binomial(1, 0.4, n_houses)
 has_basement = np.random.binomial(1, 0.6, n_houses)

 # Generate house prices with complex non-linear relationships
 base_price = (
 house_size * 120 + # Base price per sq ft
 bedrooms * 15000 + # Bedroom premium
 bathrooms * 12000 + # Bathroom premium
 garage_spaces * 8000 + # Garage value
 -house_age * 1000 + # Depreciation
 lot_size * 10 + # Lot size value
 -distance_to_downtown * 2000 + # Location premium
 school_rating * 8000 + # School quality
 -crime_rate * 3000 + # Safety factor
 parks_nearby * 2000 + # Recreation access
 shopping_centers_nearby * 3000 + # Convenience
 renovation_score * 5000 + # Condition
 has_pool * 15000 + # Pool premium
 has_fireplace * 8000 + # Fireplace value
 has_basement * 12000 # Basement value
 )

 # Add non-linear interactions
 # Premium for large houses with many bedrooms
 luxury_bonus = np.where((house_size > 2500) & (bedrooms >= 4), 50000, 0)

 # Penalty for old houses far from downtown
 location_age_penalty = np.where((house_age > 30) & (distance_to_downtown > 15), -30000, 0)

 # Bonus for high-rated schools with low crime
 safe_school_bonus = np.where((school_rating > 8) & (crime_rate < 2), 25000, 0)

 house_price = (base_price + luxury_bonus + location_age_penalty + safe_school_bonus +
 np.random.normal(0, 20000, n_houses))
 house_price = np.maximum(house_price, 100000) # Minimum price floor

 regression_df = pd.DataFrame({
 'house_size': house_size,
 'bedrooms': bedrooms,
 'bathrooms': bathrooms,
 'garage_spaces': garage_spaces,
 'house_age': house_age,
 'lot_size': lot_size,
 'distance_to_downtown': distance_to_downtown,
 'school_rating': school_rating,
 'crime_rate': crime_rate,
 'parks_nearby': parks_nearby,
 'shopping_centers_nearby': shopping_centers_nearby,
 'renovation_score': renovation_score,
 'has_pool': has_pool,
 'has_fireplace': has_fireplace,
 'has_basement': has_basement,
 'price': house_price
 })

 # 3. HIGH-DIMENSIONAL DATASET - Gene Expression Classification
 n_samples = 400
 n_genes = 100 # Many features to showcase Random Forest's robustness

 # Generate gene expression data
 np.random.seed(42)

 # Create correlated gene groups (pathways)
 pathway_sizes = [10, 8, 12, 15, 20] # Different pathway sizes
 pathway_effects = [2.0, 1.5, 1.8, 1.2, 2.5] # Effect sizes

 gene_expression = np.random.normal(0, 1, (n_samples, n_genes))

 # Create two disease classes
 disease_status = np.random.binomial(1, 0.5, n_samples)

 # Add pathway effects for disease samples
 pathway_start = 0
 for pathway_size, effect in zip(pathway_sizes, pathway_effects):
     pathway_end = pathway_start + pathway_size

     # Disease samples have shifted expression in this pathway
     disease_mask = disease_status == 1
     gene_expression[disease_mask, pathway_start:pathway_end] += np.random.normal(
         effect, 0.5, (disease_mask.sum(), pathway_size)
     )

     pathway_start = pathway_end

 # Genes beyond the pathways (column index 65 onward) get no added effect:
 # they stay pure noise and should carry no predictive power

 # Create gene names
 gene_names = [f'Gene_{i+1:03d}' for i in range(n_genes)]

 # Create DataFrame
 gene_df = pd.DataFrame(gene_expression, columns=gene_names)
 gene_df['disease_status'] = disease_status

 return classification_df, regression_df, gene_df

# Generate datasets
print(" Generating Random Forest optimized datasets...")
classification_df, regression_df, gene_df = generate_random_forest_datasets()

print(f"Classification Dataset (Employee Performance): {classification_df.shape}")
print(f"Regression Dataset (House Prices): {regression_df.shape}")
print(f"High-dimensional Dataset (Gene Expression): {gene_df.shape}")

print("\nClassification Dataset (Employee Performance):")
print(classification_df.head())
performance_labels = ['Needs Improvement', 'Meets Expectations', 'Exceeds Expectations']
perf_counts = classification_df['performance_rating'].value_counts().sort_index()
for i, count in enumerate(perf_counts):
 print(f"• {performance_labels[i]}: {count} ({count/len(classification_df):.1%})")

print("\nRegression Dataset (House Prices):")
print(regression_df.head())
print(f"Price Range: ${regression_df['price'].min():,.0f} - ${regression_df['price'].max():,.0f}")
print(f"Median Price: ${regression_df['price'].median():,.0f}")

print("\nHigh-dimensional Dataset (Gene Expression):")
print(gene_df.head())
print(f"Disease Distribution: {gene_df['disease_status'].value_counts().to_dict()}")
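
The performance ratings above are built by cutting a continuous score at two percentiles so that the three classes come out roughly balanced. The sketch below shows the same idea with the standard library only. It uses exact tertiles from `statistics.quantiles` rather than the notebook's 33rd and 67th percentiles, and the helper name `tertile_labels` is hypothetical.

```python
import random
import statistics
from collections import Counter


def tertile_labels(scores):
    """Label each score 0, 1, or 2 by which third of the distribution it falls in."""
    low_cut, high_cut = statistics.quantiles(scores, n=3)
    return [int(s > low_cut) + int(s > high_cut) for s in scores]


rng = random.Random(42)
scores = [rng.gauss(0, 1) for _ in range(300)]
labels = tertile_labels(scores)
print(Counter(labels))
```

Because the cut points come from the data itself, the class sizes stay near one third each regardless of how the underlying score is distributed.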

In [None]:
# 1. RANDOM FOREST CLASSIFICATION ANALYSIS
print("1. RANDOM FOREST CLASSIFICATION ANALYSIS")
print("=" * 43)

# Prepare classification data
class_features = ['age', 'years_experience', 'education_level', 'training_hours',
 'projects_completed', 'team_size', 'work_from_home_days',
 'meeting_attendance', 'peer_collaboration_score', 'innovation_score', 'department']
X_class = classification_df[class_features]
y_class = classification_df['performance_rating']

# Split data
X_class_train, X_class_test, y_class_train, y_class_test = train_test_split(
 X_class, y_class, test_size=0.2, random_state=42, stratify=y_class
)

print(f"Training set: {X_class_train.shape}")
print(f"Test set: {X_class_test.shape}")
print(f"Class distribution: {y_class_train.value_counts().sort_index().to_dict()}")

# Train basic Random Forest
rf_basic = RandomForestClassifier(n_estimators=100, random_state=42)
rf_basic.fit(X_class_train, y_class_train)

# Predictions
y_class_pred = rf_basic.predict(X_class_test)
y_class_proba = rf_basic.predict_proba(X_class_test)

# Performance metrics
class_accuracy = accuracy_score(y_class_test, y_class_pred)

print("\nBasic Random Forest Performance:")
print(f"• Test Accuracy: {class_accuracy:.4f}")

# rf_basic was fit without oob_score=True, so it has no oob_score_ attribute;
# refit an otherwise identical forest with OOB scoring enabled
rf_oob = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf_oob.fit(X_class_train, y_class_train)
print(f"• OOB Score: {rf_oob.oob_score_:.4f}")

# Cross-validation
cv_scores = cross_val_score(rf_basic, X_class_train, y_class_train, cv=5)
print(f"• Cross-validation: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")

print(f"\nClassification Report:")
print(classification_report(y_class_test, y_class_pred, target_names=performance_labels))

# Feature importance analysis
feature_importance = rf_basic.feature_importances_
importance_df = pd.DataFrame({
 'Feature': class_features,
 'Importance': feature_importance
}).sort_values('Importance', ascending=False)

print(f"\n Feature Importance Analysis:")
for _, row in importance_df.iterrows():
 print(f"• {row['Feature']}: {row['Importance']:.4f}")

# Visualize feature importance
fig_importance = go.Figure()

fig_importance.add_trace(
 go.Bar(
 x=importance_df['Importance'],
 y=importance_df['Feature'],
 orientation='h',
 marker_color='lightblue',
 hovertemplate="Feature: %{y}<br>Importance: %{x:.4f}<extra></extra>"
 )
)

fig_importance.update_layout(
 title="Random Forest Feature Importance (Employee Performance)",
 xaxis_title="Feature Importance",
 yaxis_title="Features",
 height=600
)
fig_importance.show()

# Compare with single decision tree
dt_single = DecisionTreeClassifier(random_state=42)
dt_single.fit(X_class_train, y_class_train)
dt_accuracy = dt_single.score(X_class_test, y_class_test)

print(f"\n Single Decision Tree vs Random Forest:")
print(f"• Single Decision Tree Accuracy: {dt_accuracy:.4f}")
print(f"• Random Forest Accuracy: {class_accuracy:.4f}")
print(f"• Improvement: {(class_accuracy - dt_accuracy)*100:.1f} percentage points")

# Hyperparameter optimization
print(f"\n Hyperparameter Optimization:")

# Grid search for optimal parameters
param_grid = {
 'n_estimators': [50, 100, 200],
 'max_depth': [None, 10, 20],
 'min_samples_split': [2, 5, 10],
 'min_samples_leaf': [1, 2, 4]
}

# Full grid: 3 x 3 x 3 x 3 = 81 candidates, each fit once per fold (243 fits with cv=3)
grid_search = GridSearchCV(
 RandomForestClassifier(random_state=42),
 param_grid,
 cv=3, # Reduced for speed
 scoring='accuracy',
 n_jobs=-1
)

grid_search.fit(X_class_train, y_class_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")

# Test best model
best_rf = grid_search.best_estimator_
best_accuracy = best_rf.score(X_class_test, y_class_test)
print(f"Best model test accuracy: {best_accuracy:.4f}")

# Confusion matrix for the basic model (rf_basic), not the tuned best_rf
cm_class = confusion_matrix(y_class_test, y_class_pred)

fig_cm_class = ff.create_annotated_heatmap(
 z=cm_class,
 x=performance_labels,
 y=performance_labels,
 annotation_text=cm_class,
 colorscale='Blues',
 showscale=True
)

fig_cm_class.update_layout(
 title="Random Forest Confusion Matrix (Employee Performance)",
 xaxis_title="Predicted Performance",
 yaxis_title="Actual Performance",
 height=500
)
fig_cm_class.show()

# Number of estimators effect
print(f"\n Effect of Number of Estimators:")

n_estimators_range = [10, 25, 50, 100, 200, 300]
train_scores = []
test_scores = []
oob_scores = []

for n_est in n_estimators_range:
 rf_temp = RandomForestClassifier(n_estimators=n_est, oob_score=True, random_state=42)
 rf_temp.fit(X_class_train, y_class_train)

 train_scores.append(rf_temp.score(X_class_train, y_class_train))
 test_scores.append(rf_temp.score(X_class_test, y_class_test))
 oob_scores.append(rf_temp.oob_score_)

# Visualize effect of n_estimators
fig_n_est = go.Figure()

fig_n_est.add_trace(
 go.Scatter(
 x=n_estimators_range,
 y=train_scores,
 mode='lines+markers',
 name='Training Accuracy',
 line=dict(color='blue')
 )
)

fig_n_est.add_trace(
 go.Scatter(
 x=n_estimators_range,
 y=test_scores,
 mode='lines+markers',
 name='Test Accuracy',
 line=dict(color='red')
 )
)

fig_n_est.add_trace(
 go.Scatter(
 x=n_estimators_range,
 y=oob_scores,
 mode='lines+markers',
 name='OOB Score',
 line=dict(color='green')
 )
)

fig_n_est.update_layout(
 title="Random Forest: Effect of Number of Estimators",
 xaxis_title="Number of Estimators",
 yaxis_title="Accuracy",
 height=500
)
fig_n_est.show()

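# Report where the OOB score peaks instead of assuming a fixed range
best_n_est = n_estimators_range[int(np.argmax(oob_scores))]
print(f"Highest OOB score in this run at n_estimators={best_n_est}")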
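
To see why accuracy tends to level off as estimators are added, consider a simplified model of voting: two classes, an odd number of voters, and each voter independently correct with the same probability. These assumptions are idealized; real trees in a forest are correlated and the notebook's task has three classes, so actual gains are smaller. The function name `majority_vote_accuracy` is hypothetical.

```python
from math import comb


def majority_vote_accuracy(n_voters, p_correct):
    """P(majority of n independent voters is correct), for odd n and two classes."""
    if n_voters % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    needed = n_voters // 2 + 1
    return sum(
        comb(n_voters, k) * p_correct**k * (1 - p_correct) ** (n_voters - k)
        for k in range(needed, n_voters + 1)
    )


for n in [1, 11, 51, 101, 201]:
    print(f"{n:>3} voters: {majority_vote_accuracy(n, 0.6):.4f}")
```

Under these assumptions the ensemble accuracy rises quickly at first and then flattens, which matches the diminishing returns typically seen in the n_estimators curves.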

In [None]:
# 2. RANDOM FOREST REGRESSION ANALYSIS
print("2. RANDOM FOREST REGRESSION ANALYSIS")
print("=" * 37)

# Prepare regression data
reg_features = ['house_size', 'bedrooms', 'bathrooms', 'garage_spaces', 'house_age',
 'lot_size', 'distance_to_downtown', 'school_rating', 'crime_rate',
 'parks_nearby', 'shopping_centers_nearby', 'renovation_score',
 'has_pool', 'has_fireplace', 'has_basement']
X_reg = regression_df[reg_features]
y_reg = regression_df['price']

# Split data
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(
 X_reg, y_reg, test_size=0.2, random_state=42
)

print(f"Training set: {X_reg_train.shape}")
print(f"Test set: {X_reg_test.shape}")

# Train Random Forest Regressor
rf_reg = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=42)
rf_reg.fit(X_reg_train, y_reg_train)

# Predictions
y_reg_pred = rf_reg.predict(X_reg_test)

# Performance metrics
reg_mse = mean_squared_error(y_reg_test, y_reg_pred)
reg_r2 = r2_score(y_reg_test, y_reg_pred)
reg_mae = mean_absolute_error(y_reg_test, y_reg_pred)

print(f"\n Random Forest Regression Performance:")
print(f"• Test R²: {reg_r2:.4f}")
print(f"• Test RMSE: ${np.sqrt(reg_mse):,.0f}")
print(f"• Test MAE: ${reg_mae:,.0f}")
print(f"• OOB Score: {rf_reg.oob_score_:.4f}")

# Cross-validation
cv_scores_reg = cross_val_score(rf_reg, X_reg_train, y_reg_train, cv=5, scoring='r2')
print(f"• Cross-validation R²: {cv_scores_reg.mean():.4f} ± {cv_scores_reg.std():.4f}")

# Feature importance for regression
reg_importance = rf_reg.feature_importances_
reg_importance_df = pd.DataFrame({
 'Feature': reg_features,
 'Importance': reg_importance
}).sort_values('Importance', ascending=False)

print(f"\n Feature Importance Analysis (House Price Prediction):")
for _, row in reg_importance_df.iterrows():
 print(f"• {row['Feature']}: {row['Importance']:.4f}")

# Visualize regression feature importance
fig_reg_importance = go.Figure()

fig_reg_importance.add_trace(
 go.Bar(
 x=reg_importance_df['Importance'],
 y=reg_importance_df['Feature'],
 orientation='h',
 marker_color='lightgreen',
 hovertemplate="Feature: %{y}<br>Importance: %{x:.4f}<extra></extra>"
 )
)

fig_reg_importance.update_layout(
 title="Random Forest Feature Importance (House Price Prediction)",
 xaxis_title="Feature Importance",
 yaxis_title="Features",
 height=600
)
fig_reg_importance.show()

# Actual vs Predicted plot
fig_pred = go.Figure()

fig_pred.add_trace(
 go.Scatter(
 x=y_reg_test,
 y=y_reg_pred,
 mode='markers',
 marker=dict(color='blue', opacity=0.6),
 name='Predictions',
 hovertemplate="Actual: $%{x:,.0f}<br>Predicted: $%{y:,.0f}<extra></extra>"
 )
)

# Perfect prediction line
min_price = min(y_reg_test.min(), y_reg_pred.min())
max_price = max(y_reg_test.max(), y_reg_pred.max())

fig_pred.add_trace(
 go.Scatter(
 x=[min_price, max_price],
 y=[min_price, max_price],
 mode='lines',
 line=dict(color='red', dash='dash'),
 name='Perfect Prediction',
 hovertemplate="Perfect Line<extra></extra>"
 )
)

fig_pred.update_layout(
 title=f"Random Forest: Actual vs Predicted House Prices (R² = {reg_r2:.3f})",
 xaxis_title="Actual Price ($)",
 yaxis_title="Predicted Price ($)",
 height=500
)
fig_pred.show()

# Residuals analysis
residuals = y_reg_test - y_reg_pred

fig_residuals = go.Figure()

fig_residuals.add_trace(
 go.Scatter(
 x=y_reg_pred,
 y=residuals,
 mode='markers',
 marker=dict(color='green', opacity=0.6),
 hovertemplate="Predicted: $%{x:,.0f}<br>Residual: $%{y:,.0f}<extra></extra>"
 )
)

fig_residuals.add_hline(y=0, line_dash="dash", line_color="red")

fig_residuals.update_layout(
 title="Random Forest Residuals Analysis",
 xaxis_title="Predicted Price ($)",
 yaxis_title="Residuals ($)",
 height=500
)
fig_residuals.show()

# Permutation importance for more robust feature ranking
print(f"\n Permutation Importance Analysis:")

perm_importance = permutation_importance(
 rf_reg, X_reg_test, y_reg_test, n_repeats=10, random_state=42
)

perm_importance_df = pd.DataFrame({
 'Feature': reg_features,
 'Perm_Importance_Mean': perm_importance.importances_mean,
 'Perm_Importance_Std': perm_importance.importances_std
}).sort_values('Perm_Importance_Mean', ascending=False)

print("Permutation importance on the test set (mean drop in R² when a feature is shuffled), top 5:")
for _, row in perm_importance_df.head(5).iterrows():
 print(f"• {row['Feature']}: {row['Perm_Importance_Mean']:.4f} ± {row['Perm_Importance_Std']:.4f}")

# Compare built-in vs permutation importance
fig_importance_comparison = go.Figure()

# Merge dataframes for comparison
comparison_df = reg_importance_df.merge(
 perm_importance_df, on='Feature', suffixes=('_builtin', '_permutation')
)

fig_importance_comparison.add_trace(
 go.Scatter(
 x=comparison_df['Importance'],
 y=comparison_df['Perm_Importance_Mean'],
 mode='markers+text',
 text=comparison_df['Feature'],
 textposition='top center',
 marker=dict(size=10, color='blue'),
 name='Feature Importance Comparison',
 hovertemplate="Built-in: %{x:.4f}<br>Permutation: %{y:.4f}<br>Feature: %{text}<extra></extra>"
 )
)

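# Add a y = x reference line. Built-in importances sum to 1 while permutation
# importances are drops in R², so the line is a visual guide, not a target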
max_importance = max(comparison_df['Importance'].max(), comparison_df['Perm_Importance_Mean'].max())
fig_importance_comparison.add_trace(
 go.Scatter(
 x=[0, max_importance],
 y=[0, max_importance],
 mode='lines',
 line=dict(color='red', dash='dash'),
 name='Perfect Correlation',
 showlegend=False
 )
)

fig_importance_comparison.update_layout(
 title="Built-in vs Permutation Feature Importance",
 xaxis_title="Built-in Importance",
 yaxis_title="Permutation Importance",
 height=600
)
fig_importance_comparison.show()
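
The idea behind permutation importance can be sketched without any libraries: shuffle one feature column, re-score the model, and record how much the error grows. The version below is a simplified illustration, not scikit-learn's implementation. It measures the increase in MSE (scikit-learn's default for a regressor is the drop in the estimator's score), and `predict` is a toy stand-in for a fitted model.

```python
import random


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance_sketch(predict, X, y, n_repeats, rng):
    """Mean increase in MSE when each column of X is shuffled in turn."""
    baseline = mse(y, [predict(row) for row in X])
    importances = []
    for col in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            increases.append(mse(y, [predict(row) for row in X_perm]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Toy stand-in for a fitted model: depends strongly on x0, weakly on x1, not on x2.
def predict(row):
    return 3.0 * row[0] + 0.5 * row[1]


rng = random.Random(42)
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
y = [predict(row) for row in X]
print(permutation_importance_sketch(predict, X, y, n_repeats=10, rng=rng))
```

A feature the model never uses gets an importance of exactly zero here, which is the property that makes permutation importance a useful cross-check on impurity-based rankings.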

In [None]:
# 3. HIGH-DIMENSIONAL DATA ANALYSIS
print("3. HIGH-DIMENSIONAL DATA ANALYSIS")
print("=" * 33)

# Prepare gene expression data
gene_features = [col for col in gene_df.columns if col != 'disease_status']
X_gene = gene_df[gene_features]
y_gene = gene_df['disease_status']

print(f"Gene expression dataset: {X_gene.shape}")
print(f"Number of features: {len(gene_features)}")
print(f"Sample size: {len(y_gene)}")

# Split data
X_gene_train, X_gene_test, y_gene_train, y_gene_test = train_test_split(
 X_gene, y_gene, test_size=0.2, random_state=42, stratify=y_gene
)

print(f"Training set: {X_gene_train.shape}")
print(f"Test set: {X_gene_test.shape}")

# Train Random Forest on high-dimensional data
rf_gene = RandomForestClassifier(
 n_estimators=200, # More trees for stability with many features
 max_features='sqrt', # Consider sqrt(p) candidate features at each split
 oob_score=True,
 random_state=42
)

rf_gene.fit(X_gene_train, y_gene_train)

# Predictions
y_gene_pred = rf_gene.predict(X_gene_test)
y_gene_proba = rf_gene.predict_proba(X_gene_test)

# Performance metrics
gene_accuracy = accuracy_score(y_gene_test, y_gene_pred)
gene_auc = roc_auc_score(y_gene_test, y_gene_proba[:, 1])

print(f"\n High-dimensional Random Forest Performance:")
print(f"• Test Accuracy: {gene_accuracy:.4f}")
print(f"• ROC AUC: {gene_auc:.4f}")
print(f"• OOB Score: {rf_gene.oob_score_:.4f}")

# Feature importance analysis
gene_importance = rf_gene.feature_importances_
gene_importance_df = pd.DataFrame({
 'Gene': gene_features,
 'Importance': gene_importance
}).sort_values('Importance', ascending=False)

# Identify top important genes
print(f"\n Top 15 Most Important Genes:")
for _, row in gene_importance_df.head(15).iterrows():
 gene_num = int(row['Gene'].split('_')[1])
 # Genes 1-65 belong to the simulated pathways (sizes 10 + 8 + 12 + 15 + 20 = 65)
 is_signal = "Signal" if gene_num <= 65 else "Noise"
 print(f"• {row['Gene']}: {row['Importance']:.4f} ({is_signal})")

# Visualize gene importance
fig_gene_importance = go.Figure()

# Color genes by signal vs noise
colors = ['green' if int(gene.split('_')[1]) <= 65 else 'red'
 for gene in gene_importance_df.head(20)['Gene']]

fig_gene_importance.add_trace(
 go.Bar(
 x=gene_importance_df.head(20)['Gene'],
 y=gene_importance_df.head(20)['Importance'],
 marker_color=colors,
 hovertemplate="Gene: %{x}<br>Importance: %{y:.4f}<extra></extra>"
 )
)

fig_gene_importance.update_layout(
 title="Top 20 Gene Importance (Green=Signal, Red=Noise)",
 xaxis_title="Genes",
 yaxis_title="Feature Importance",
 xaxis_tickangle=-45,
 height=600
)
fig_gene_importance.show()

# Feature selection using Random Forest
print(f"\n Feature Selection Analysis:")

# Keep genes whose importance is at or above the mean importance.
# SelectFromModel refits a clone of rf_gene on the training data.
selector = SelectFromModel(rf_gene, threshold='mean')
selector.fit(X_gene_train, y_gene_train)

selected_features = selector.get_support()
selected_gene_names = np.array(gene_features)[selected_features]
n_selected = len(selected_gene_names)

print(f"• Selected {n_selected} out of {len(gene_features)} genes")
print(f"• Selection threshold: {selector.threshold_:.4f}")

# Analyze quality of feature selection
signal_genes_selected = sum(1 for gene in selected_gene_names if int(gene.split('_')[1]) <= 65)
noise_genes_selected = n_selected - signal_genes_selected
total_signal_genes = 65
total_noise_genes = len(gene_features) - 65

precision = signal_genes_selected / n_selected if n_selected > 0 else 0
recall = signal_genes_selected / total_signal_genes

print(f"• Signal genes selected: {signal_genes_selected}/{total_signal_genes}")
print(f"• Noise genes selected: {noise_genes_selected}/{total_noise_genes}")
print(f"• Selection precision: {precision:.3f}")
print(f"• Selection recall: {recall:.3f}")

# Train model on selected features only
X_gene_train_selected = selector.transform(X_gene_train)
X_gene_test_selected = selector.transform(X_gene_test)

rf_gene_selected = RandomForestClassifier(n_estimators=200, random_state=42)
rf_gene_selected.fit(X_gene_train_selected, y_gene_train)

selected_accuracy = rf_gene_selected.score(X_gene_test_selected, y_gene_test)

print(f"\n Performance Comparison:")
print(f"• All genes ({len(gene_features)}): {gene_accuracy:.4f} accuracy")
print(f"• Selected genes ({n_selected}): {selected_accuracy:.4f} accuracy")
print(f"• Feature reduction: {(1 - n_selected/len(gene_features))*100:.1f}%")

# ROC Curve comparison
fpr_all, tpr_all, _ = roc_curve(y_gene_test, y_gene_proba[:, 1])
y_gene_proba_selected = rf_gene_selected.predict_proba(X_gene_test_selected)
fpr_selected, tpr_selected, _ = roc_curve(y_gene_test, y_gene_proba_selected[:, 1])

fig_roc_comparison = go.Figure()

fig_roc_comparison.add_trace(
 go.Scatter(
 x=fpr_all,
 y=tpr_all,
 mode='lines',
 name=f'All Genes (AUC = {gene_auc:.3f})',
 line=dict(color='blue', width=2)
 )
)

auc_selected = roc_auc_score(y_gene_test, y_gene_proba_selected[:, 1])
fig_roc_comparison.add_trace(
 go.Scatter(
 x=fpr_selected,
 y=tpr_selected,
 mode='lines',
 name=f'Selected Genes (AUC = {auc_selected:.3f})',
 line=dict(color='green', width=2)
 )
)

fig_roc_comparison.add_trace(
 go.Scatter(
 x=[0, 1],
 y=[0, 1],
 mode='lines',
 name='Random Classifier',
 line=dict(color='red', dash='dash')
 )
)

fig_roc_comparison.update_layout(
 title="ROC Curves: All Genes vs Selected Genes",
 xaxis_title="False Positive Rate",
 yaxis_title="True Positive Rate",
 height=500
)
fig_roc_comparison.show()

# Analyze max_features parameter effect
print(f"\n Effect of max_features Parameter:")

max_features_options = ['sqrt', 'log2', None, 0.1, 0.3, 0.5]
max_features_scores = []

for max_feat in max_features_options:
 rf_temp = RandomForestClassifier(
 n_estimators=100,
 max_features=max_feat,
 random_state=42
 )
 cv_score = cross_val_score(rf_temp, X_gene_train, y_gene_train, cv=3).mean()
 max_features_scores.append(cv_score)

 feat_name = str(max_feat) if max_feat is not None else 'all'
 print(f"• max_features={feat_name}: {cv_score:.4f}")

# Find optimal max_features
best_idx = np.argmax(max_features_scores)
best_max_features = max_features_options[best_idx]
print(f"\n Optimal max_features: {best_max_features}")
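
To make the max_features options above concrete, the sketch below converts each setting into a number of candidate features per split for the 100-gene dataset. It follows the rules as described in scikit-learn's documentation for tree-based estimators, to the best of my understanding; treat the exact rounding as an assumption that may vary by version. The helper name `resolve_max_features` is hypothetical.

```python
from math import log2, sqrt


def resolve_max_features(max_features, n_features):
    """Number of candidate features per split for a given max_features setting."""
    if max_features is None:
        return n_features
    if max_features == "sqrt":
        return max(1, int(sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(log2(n_features)))
    if isinstance(max_features, float):
        return max(1, int(max_features * n_features))
    return int(max_features)  # an integer count is used as given


for option in ["sqrt", "log2", None, 0.1, 0.3, 0.5]:
    print(f"max_features={option!s:>5} -> {resolve_max_features(option, 100)} of 100 genes")
```

Note that `'sqrt'` and `0.1` resolve to the same count for 100 features, so their cross-validation scores in the comparison above should match when the random seed is identical.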

In [None]:
# 4. ENSEMBLE BEHAVIOR ANALYSIS
print("4. ENSEMBLE BEHAVIOR ANALYSIS")
print("=" * 31)

# Analyze individual tree predictions vs ensemble
print("Understanding how Random Forest combines individual tree predictions:")

# Use the employee performance dataset for this analysis
rf_ensemble = RandomForestClassifier(n_estimators=10, random_state=42) # Small ensemble for visualization
rf_ensemble.fit(X_class_train, y_class_train)

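# Get predictions from individual trees.
# The sub-estimators are fit on plain arrays with encoded targets, so pass an array
# and map their output (indices into classes_) back to the original labels.
X_class_test_arr = np.asarray(X_class_test)
individual_predictions = []
for tree in rf_ensemble.estimators_:
 tree_pred = rf_ensemble.classes_[tree.predict(X_class_test_arr).astype(int)]
 individual_predictions.append(tree_pred)

individual_predictions = np.array(individual_predictions)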

# Ensemble prediction
ensemble_pred = rf_ensemble.predict(X_class_test)

# Analyze agreement between trees
print(f"\n Tree Agreement Analysis:")

# Calculate agreement for each sample
agreements = []
for i in range(len(X_class_test)):
 sample_predictions = individual_predictions[:, i]
 # Count how many trees agree with the final ensemble prediction
 agreement = np.sum(sample_predictions == ensemble_pred[i]) / len(rf_ensemble.estimators_)
 agreements.append(agreement)

agreements = np.array(agreements)

print(f"• Average tree agreement: {agreements.mean():.3f}")
print(f"• Minimum agreement: {agreements.min():.3f}")
print(f"• Maximum agreement: {agreements.max():.3f}")

# Analyze disagreement cases
low_agreement_mask = agreements < 0.6
high_agreement_mask = agreements > 0.9

print(f"• Samples with low agreement (<60%): {low_agreement_mask.sum()}")
print(f"• Samples with high agreement (>90%): {high_agreement_mask.sum()}")

# Visualize agreement distribution
fig_agreement = go.Figure()

fig_agreement.add_trace(
 go.Histogram(
 x=agreements,
 nbinsx=20,
 name='Tree Agreement',
 marker_color='lightblue',
 opacity=0.7
 )
)

fig_agreement.update_layout(
 title="Distribution of Tree Agreement in Random Forest",
 xaxis_title="Fraction of Trees Agreeing with Ensemble",
 yaxis_title="Number of Samples",
 height=500
)
fig_agreement.show()

# Prediction confidence analysis
print(f"\n Prediction Confidence Analysis:")

# Get class probabilities
class_probabilities = rf_ensemble.predict_proba(X_class_test)
max_probabilities = class_probabilities.max(axis=1)

# Correlate confidence with agreement
correlation = np.corrcoef(agreements, max_probabilities)[0, 1]
print(f"• Correlation between tree agreement and prediction confidence: {correlation:.3f}")

# Visualize confidence vs agreement
fig_conf_agreement = go.Figure()

# Color points by correctness
correct_predictions = (ensemble_pred == y_class_test)
colors = ['green' if correct else 'red' for correct in correct_predictions]

fig_conf_agreement.add_trace(
 go.Scatter(
 x=agreements,
 y=max_probabilities,
 mode='markers',
 marker=dict(color=colors, opacity=0.6),
 hovertemplate="Agreement: %{x:.3f}<br>Confidence: %{y:.3f}<extra></extra>"
 )
)

fig_conf_agreement.update_layout(
 title="Prediction Confidence vs Tree Agreement (Green=Correct, Red=Incorrect)",
 xaxis_title="Tree Agreement",
 yaxis_title="Prediction Confidence",
 height=500
)
fig_conf_agreement.show()

# Out-of-Bag (OOB) analysis
print(f"\n Out-of-Bag (OOB) Analysis:")

# Train forests of increasing size and compare OOB accuracy with test accuracy
n_estimators_range = [10, 25, 50, 100, 200, 500]
oob_scores = []
test_scores = []

for n_est in n_estimators_range:
 rf_oob_temp = RandomForestClassifier(
 n_estimators=n_est,
 oob_score=True,
 random_state=42
 )
 rf_oob_temp.fit(X_class_train, y_class_train)

 oob_scores.append(rf_oob_temp.oob_score_)
 test_scores.append(rf_oob_temp.score(X_class_test, y_class_test))

print("OOB Score vs Test Score by Number of Estimators:")
for i, n_est in enumerate(n_estimators_range):
 print(f"• n_estimators={n_est}: OOB={oob_scores[i]:.4f}, Test={test_scores[i]:.4f}")

# Visualize OOB vs Test performance
fig_oob_test = go.Figure()

fig_oob_test.add_trace(
 go.Scatter(
 x=n_estimators_range,
 y=oob_scores,
 mode='lines+markers',
 name='OOB Score',
 line=dict(color='blue')
 )
)

fig_oob_test.add_trace(
 go.Scatter(
 x=n_estimators_range,
 y=test_scores,
 mode='lines+markers',
 name='Test Score',
 line=dict(color='red')
 )
)

fig_oob_test.update_layout(
 title="OOB Score vs Test Score",
 xaxis_title="Number of Estimators",
 yaxis_title="Accuracy",
 height=500
)
fig_oob_test.show()

# Bootstrap sampling analysis
print(f"\n Bootstrap Sampling Analysis:")

# Analyze what fraction of training data each tree sees
n_train_samples = len(X_class_train)
bootstrap_fractions = []

# Simulate bootstrap sampling (seeded so the result is reproducible)
rng = np.random.default_rng(42)
for _ in range(100): # 100 simulations
 bootstrap_sample = rng.choice(n_train_samples, size=n_train_samples, replace=True)
 unique_samples = len(np.unique(bootstrap_sample))
 fraction_seen = unique_samples / n_train_samples
 bootstrap_fractions.append(fraction_seen)

avg_fraction = np.mean(bootstrap_fractions)
theoretical_fraction = 1 - (1 - 1/n_train_samples)**n_train_samples
print(f"• Average fraction of training data seen per tree: {avg_fraction:.3f}")
print(f"• Theoretical expectation, 1 - (1 - 1/n)^n ≈ 1 - 1/e: {theoretical_fraction:.3f}")
print(f"• This means ~{(1-avg_fraction)*100:.1f}% of data is out-of-bag for each tree")

# Feature subsampling analysis
print(f"\n Feature Subsampling Analysis:")

n_features = len(class_features)
print(f"• Total features: {n_features}")

# Number of features considered at each split for different max_features settings
# (scikit-learn always considers at least one feature per split)
max_features_counts = {
 "'sqrt'": max(1, int(np.sqrt(n_features))),
 "'log2'": max(1, int(np.log2(n_features))),
 'None': n_features,
 '0.3': max(1, int(0.3 * n_features))
}

for option, n_feat in max_features_counts.items():
 fraction = n_feat / n_features
 print(f"• max_features={option}: {n_feat} features ({fraction:.1%} of total)")

print(f"\nFeature subsampling decorrelates the trees, which reduces the variance of the ensemble.")
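
The agreement analysis above compares individual tree outputs with the ensemble prediction. Two details of scikit-learn's implementation matter for that comparison: the trees in `estimators_` predict encoded class indices rather than the original labels, and the forest averages per-tree class probabilities (soft voting) instead of counting hard votes. The sketch below illustrates both on synthetic data with string labels; the data and the names `rf_demo`, `tree_labels`, and `agreement` are illustrative assumptions, not the notebook's employee dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with string labels (assumption, not the notebook's dataset)
X_demo, y_idx = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=0
)
y_demo = np.array(["low", "high"])[y_idx]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

rf_demo = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_tr, y_tr)

# Each tree returns indices into rf_demo.classes_; decode them to the original labels
tree_idx = np.array([tree.predict(X_te) for tree in rf_demo.estimators_]).astype(int)
tree_labels = rf_demo.classes_[tree_idx]  # shape: (n_trees, n_samples)

# The forest's probabilities are the mean of the per-tree probabilities
mean_proba = np.mean([tree.predict_proba(X_te) for tree in rf_demo.estimators_], axis=0)
forest_proba = rf_demo.predict_proba(X_te)

# Fraction of trees whose decoded label matches the ensemble prediction
ensemble_labels = rf_demo.predict(X_te)
agreement = (tree_labels == ensemble_labels).mean(axis=0)

print(f"Max |mean tree proba - forest proba|: {np.abs(mean_proba - forest_proba).max():.2e}")
print(f"Mean tree agreement: {agreement.mean():.3f}")
```

Because the ensemble uses averaged probabilities, a sample can in principle receive a prediction that differs from the hard majority vote when trees have impure leaves, so agreement and `predict_proba` confidence are related but not identical quantities.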

In [None]:
# 5. BUSINESS INSIGHTS AND STRATEGIC RECOMMENDATIONS
print(" 5. BUSINESS INSIGHTS AND STRATEGIC RECOMMENDATIONS")
print("=" * 54)

# Comprehensive business analysis
print(" Random Forest Business Applications Analysis:")

print(f"\n1. MODEL PERFORMANCE SUMMARY:")
print(f" • Employee Performance Classification: {class_accuracy:.1%} accuracy")
print(f" • House Price Prediction: R² = {reg_r2:.3f} (${np.sqrt(reg_mse):,.0f} RMSE)")
print(f" • Gene Expression Analysis: {gene_accuracy:.1%} accuracy, {gene_auc:.3f} AUC")

# Employee performance insights
print(f"\n2. EMPLOYEE PERFORMANCE INSIGHTS:")

top_performance_factors = importance_df.head(5)
print(f" Top 5 Performance Drivers:")
for i, (_, row) in enumerate(top_performance_factors.iterrows(), 1):
 feature = row['Feature']
 importance = row['Importance']
 print(f" {i}. {feature}: {importance:.3f} importance")

# Actionable recommendations based on feature importance
print(f"\n HR Strategy Recommendations:")
top_factor = top_performance_factors.iloc[0]['Feature']
if 'training' in top_factor.lower():
 print(f" • Invest heavily in employee training programs")
 print(f" • Target 40+ training hours annually for high performers")
elif 'collaboration' in top_factor.lower():
 print(f" • Implement team collaboration tools and practices")
 print(f" • Measure and reward collaborative behaviors")
elif 'innovation' in top_factor.lower():
 print(f" • Create innovation time and recognition programs")
 print(f" • Encourage creative problem-solving initiatives")
else:
 print(f" • Review the top driver ({top_factor}) with HR before choosing an intervention")

# Calculate ROI of performance improvements (all figures are illustrative assumptions, not measured values)
total_employees = 10000
current_high_performers = total_employees * 0.33 # Current top 33%
target_improvement = 0.15 # 15% improvement in classification

# Assuming high performers generate 30% more value
avg_employee_value = 100000 # Annual value
high_performer_premium = 0.30
additional_value_per_improvement = avg_employee_value * high_performer_premium

potential_new_high_performers = total_employees * target_improvement
additional_annual_value = potential_new_high_performers * additional_value_per_improvement

print(f"\n ROI Analysis:")
print(f" • Current high performers: {current_high_performers:,.0f}")
print(f" • With 15% classification improvement: +{potential_new_high_performers:,.0f} identified")
print(f" • Additional annual value: ${additional_annual_value:,.0f}")
implementation_cost = total_employees * 50 # Assumed $50 per employee for training/tools
print(f" • Implementation cost estimate: ${implementation_cost:,.0f} (training/tools)")
print(f" • Net ROI: {(additional_annual_value - implementation_cost) / implementation_cost * 100:.0f}%")

# Real estate insights
print(f"\n3. REAL ESTATE PRICING INSIGHTS:")

top_price_factors = reg_importance_df.head(5)
print(f" Top 5 Price Drivers:")
for i, (_, row) in enumerate(top_price_factors.iterrows(), 1):
 feature = row['Feature']
 importance = row['Importance']
 print(f" {i}. {feature}: {importance:.3f} importance")

# Property investment recommendations
print(f"\n Investment Strategy Recommendations:")
most_important_factor = top_price_factors.iloc[0]['Feature']
if 'size' in most_important_factor.lower():
 print(f" • Focus on property size as primary value driver")
 print(f" • Target properties >2500 sq ft for premium market")
elif 'school' in most_important_factor.lower():
 print(f" • Prioritize properties in high-rated school districts")
 print(f" • School rating >8 provides significant premium")
elif 'location' in most_important_factor.lower():
 print(f" • Location proximity is crucial for valuation")
 print(f" • Properties within 10 miles of downtown preferred")

# Market analysis automation value
properties_valued_monthly = 1000
manual_appraisal_cost = 500
automated_appraisal_cost = 50
r2_threshold = 0.90 # Require R² of at least 0.90 before automating (R² is not an accuracy percentage)

if reg_r2 >= r2_threshold:
 monthly_savings = properties_valued_monthly * (manual_appraisal_cost - automated_appraisal_cost)
 print(f"\n Appraisal Automation ROI:")
 print(f" • Monthly property valuations: {properties_valued_monthly:,}")
 print(f" • Cost per manual appraisal: ${manual_appraisal_cost}")
 print(f" • Cost per automated appraisal: ${automated_appraisal_cost}")
 print(f" • Monthly savings: ${monthly_savings:,}")
 print(f" • Annual savings: ${monthly_savings * 12:,}")
else:
 print(f"\n Model needs improvement (R² = {reg_r2:.3f}) before automation deployment")

# Biomedical research insights
print(f"\n4. BIOMEDICAL RESEARCH INSIGHTS:")

print(f" Gene Discovery Results:")
signal_genes_in_top_20 = sum(1 for gene in gene_importance_df.head(20)['Gene']
 if int(gene.split('_')[1]) <= 65)
print(f" • Signal genes in top 20: {signal_genes_in_top_20}/20")
print(f" • Feature selection precision: {precision:.1%}")
print(f" • Feature selection recall: {recall:.1%}")

# Research cost savings
total_genes_to_validate = 100
cost_per_gene_validation = 5000
selected_genes_to_validate = n_selected

validation_cost_all = total_genes_to_validate * cost_per_gene_validation
validation_cost_selected = selected_genes_to_validate * cost_per_gene_validation
cost_savings = validation_cost_all - validation_cost_selected

print(f"\n Research Cost Analysis:")
print(f" • Cost to validate all genes: ${validation_cost_all:,}")
print(f" • Cost to validate selected genes: ${validation_cost_selected:,}")
print(f" • Research cost savings: ${cost_savings:,}")
print(f" • Efficiency gain: {(1 - selected_genes_to_validate/total_genes_to_validate)*100:.0f}% reduction")

# Implementation strategy
print(f"\n5. IMPLEMENTATION STRATEGY:")

print(f"\n Phase 1 - Employee Performance (Months 1-3):")
print(f" • Deploy performance prediction model for 25% of workforce")
print(f" • Focus on high-confidence predictions (agreement >80%)")
print(f" • A/B test interventions based on model recommendations")
print(f" • Expected outcome: 10-15% improvement in performance identification")

print(f"\n Phase 2 - Real Estate Automation (Months 2-4):")
print(f" • Implement automated property valuation for pre-screening")
print(f" • Human review for properties with low model confidence")
print(f" • Integration with existing appraisal workflows")
print(f" • Expected outcome: 50% reduction in appraisal time")

print(f"\n Phase 3 - Research Optimization (Months 3-6):")
print(f" • Deploy feature selection pipeline for new studies")
print(f" • Validate top gene candidates with wet lab experiments")
print(f" • Iterative model improvement with validation results")
print(f" • Expected outcome: 60% reduction in experimental costs")

print(f"\n6. RANDOM FOREST ADVANTAGES:")
print(f" • Handles numerical features directly (categorical features need encoding in scikit-learn)")
print(f" • Relatively robust to outliers in the input features")
print(f" • Built-in feature importance ranking")
print(f" • No need for feature scaling")
print(f" • Strong performance on high-dimensional data")
print(f" • Built-in validation through OOB scoring")
print(f" • Missing values may need imputation, depending on the scikit-learn version")

print(f"\n7. LIMITATIONS AND MITIGATION:")
print(f" • Can overfit with very noisy data")
print(f" → Use OOB scoring and cross-validation for monitoring")
print(f" • Memory intensive with large datasets")
print(f" → Cap max_depth or n_estimators, or subsample rows with max_samples")
print(f" • Less interpretable than single decision trees")
print(f" → Supplement with SHAP values for instance-level explanations")
print(f" • Approximates linear trends with step functions and cannot extrapolate beyond the training range")
print(f" → Ensemble with linear models when appropriate")

print(f"\n8. MONITORING AND MAINTENANCE:")
print(f" • Compare OOB score across retrains and track live prediction quality to detect drift")
print(f" • Track feature importance stability over time")
print(f" • Retrain quarterly or when performance degrades >5%")
print(f" • A/B test model updates against current production")
print(f" • Regular feature engineering and selection reviews")

print(f"\n" + "="*80)
print(f" RANDOM FOREST LEARNING SUMMARY:")
print(f" Mastered ensemble learning and bagging principles")
print(f" Applied Random Forest to classification and regression")
print(f" Analyzed feature importance and selection techniques")
print(f" Understood OOB validation and ensemble behavior")
print(f" Optimized hyperparameters for different problem types")
print(f" Generated comprehensive business strategies and ROI analysis")
print(f"="*80)
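
The recommendations above rest on impurity-based importances (`feature_importances_`), which are computed from the training data and can overstate noisy or high-cardinality features. One possible cross-check is permutation importance on held-out data, sketched here with `sklearn.inspection.permutation_importance`. The synthetic dataset and the names `rf_demo`, `perm`, `impurity_rank`, and `perm_rank` are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption); shuffle=False keeps the 4 informative
# features in columns 0-3 and fills the remaining columns with noise
X_demo, y_demo = make_classification(
    n_samples=500, n_features=12, n_informative=4, n_redundant=0,
    shuffle=False, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

rf_demo = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Permutation importance: drop in held-out accuracy when one column is shuffled
perm = permutation_importance(rf_demo, X_te, y_te, n_repeats=5, random_state=0)

# Rank features by each measure, most important first
impurity_rank = np.argsort(rf_demo.feature_importances_)[::-1]
perm_rank = np.argsort(perm.importances_mean)[::-1]

print("Top 4 by impurity importance:   ", impurity_rank[:4])
print("Top 4 by permutation importance:", perm_rank[:4])
```

Features that rank highly under both measures are safer candidates for the kind of strategy recommendations printed above; a feature that ranks highly only under impurity importance deserves a closer look before acting on it.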