# Random Forests - Bagging, OOB Intuition, and Feature Importance

<hr>

<center>
<div>
<img src="https://raw.githubusercontent.com/davi-moreira/2026Summer_predictive_analytics_purdue_MGMT474/main/notebooks/figures/mgmt_474_ai_logo_02-modified.png" width="200"/>
</div>
</center>

# <center>MGMT47400 Predictive Analytics</center>
# <center>Professor: Davi Moreira </center>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/davi-moreira/2026Summer_predictive_analytics_purdue_MGMT474/blob/main/notebooks/12_random_forests_importance.ipynb)

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. Explain bagging and why forests reduce variance
2. Train a random forest and tune the most impactful knobs
3. Use permutation importance responsibly
4. Compare forest vs tree vs linear/logistic baselines
5. Produce project-ready model comparison tables

---

In [None]:
# Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.inspection import permutation_importance
from sklearn.metrics import roc_auc_score
import warnings

warnings.filterwarnings('ignore')
pd.set_option('display.precision', 4)
RANDOM_SEED = 474
np.random.seed(RANDOM_SEED)
print("✓ Setup complete!")

**Reading the output:**

The setup cell loads `RandomForestClassifier` from scikit-learn's ensemble module alongside the single-tree classifier and logistic regression we used in the previous notebook. `permutation_importance` from `sklearn.inspection` is the model-agnostic importance method we will use later.

The confirmation message `Setup complete!` indicates that every import succeeded. **RANDOM_SEED = 474** supports reproducibility: `np.random.seed` fixes NumPy's global generator, and passing `random_state=RANDOM_SEED` to each splitter and estimator fixes the forest's own randomness (bootstrap sampling, feature subsets at each split).

**Key takeaway:** Passing `n_jobs=-1` to the forest constructor later parallelizes tree training across all available CPU cores, which helps because the forests in this notebook train 100-300 independent trees.

---

## 1. From Single Tree to Forest: The Bagging Idea

### The Variance Problem with Single Trees

**Problem:** Decision trees are unstable
- Small change in data → completely different tree
- High variance in predictions
- Overfitting on individual quirks

### The Solution: Bootstrap Aggregating (Bagging)

**Algorithm:**
1. Create B bootstrap samples (random sampling with replacement)
2. Train one tree on each bootstrap sample
3. Aggregate predictions (average for regression, vote for classification)

**Why it works:**
- Averaging reduces variance
- Each tree sees slightly different data
- Errors cancel out through averaging
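
The three steps above can be sketched by hand with plain decision trees. This is a minimal illustration, not how scikit-learn implements its forests internally; the number of bootstrap samples `B = 25` and the variable names are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=474, stratify=y
)

rng = np.random.default_rng(474)
B = 25  # number of bootstrap samples (odd, so a binary vote cannot tie)
trees = []
for _ in range(B):
    # Step 1: draw n row indices with replacement
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # Step 2: fit one tree on that bootstrap sample
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx]))

# Step 3: majority vote across the B trees
votes = np.stack([t.predict(X_te) for t in trees])  # shape (B, n_test)
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)

single_acc = (trees[0].predict(X_te) == y_te).mean()
bagged_acc = (bagged_pred == y_te).mean()
print(f"One bootstrap tree: {single_acc:.3f} | Bagged vote of {B}: {bagged_acc:.3f}")
```
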

### Random Forest = Bagging + Random Feature Selection

**Extra randomness:** At each split, only a random subset of features is considered
- Decorrelates trees
- Prevents one strong feature from dominating
- Further variance reduction
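
Why decorrelation matters can be seen from a standard result: the average of B predictions, each with variance σ² and pairwise correlation ρ, has variance ρσ² + (1 − ρ)σ²/B. Adding trees shrinks only the second term, so lowering ρ is what pushes the floor down. The sketch below checks this with simulated numbers standing in for tree predictions; the helper `variance_of_average` is hypothetical and the correlation values are arbitrary.

```python
import numpy as np

def variance_of_average(rho, n_trees, n_sims=20_000, seed=474):
    """Simulate the variance of an average of equally correlated, unit-variance predictions."""
    rng = np.random.default_rng(seed)
    shared = rng.standard_normal((n_sims, 1))      # component common to all "trees"
    own = rng.standard_normal((n_sims, n_trees))   # component unique to each "tree"
    preds = np.sqrt(rho) * shared + np.sqrt(1 - rho) * own
    return preds.mean(axis=1).var()

n_trees = 50
for rho in (0.6, 0.1):
    simulated = variance_of_average(rho, n_trees)
    theory = rho + (1 - rho) / n_trees
    print(f"rho={rho}: simulated={simulated:.3f}, formula={theory:.3f}")
```
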

In [None]:
# Load data
data = load_breast_cancer(as_frame=True)
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=RANDOM_SEED, stratify=y)

print(f"Train: {len(X_train)} | Test: {len(X_test)}")
print(f"Features: {X.shape[1]}")

**Reading the output:**

The breast cancer dataset is split into **398 training** and **171 test** samples with stratification, preserving the roughly 63/37 benign-to-malignant class balance in each split. The dataset has **30 features**, all continuous measurements of cell nuclei (mean, standard error, and worst-case values for 10 properties like radius, texture, and symmetry).

**Why this matters:** Having 30 features is important for Random Forests because the algorithm randomly selects a subset of features at each split. With 30 features and the default `max_features='sqrt'`, each split considers 5 candidate features (the integer part of √30 ≈ 5.48), creating diversity among the trees.

---

## 2. Single Tree vs Random Forest

The core claim of Random Forests is that averaging many decorrelated trees produces a model with **lower variance** and **higher overall accuracy** than any single tree. The experiment below makes this concrete: we train one decision tree (depth-5) and one forest of 100 depth-5 trees on the same breast cancer dataset, using the same 5-fold stratified CV.

Watch for two things in the output: (1) the forest's mean ROC-AUC should be higher, and (2) its standard deviation across folds should be smaller. A tighter spread means the forest's predictions are more stable under different data partitions, which is exactly what variance reduction buys you.

In [None]:
# Compare single tree vs forest
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_SEED)

# Single tree (tuned depth)
tree = DecisionTreeClassifier(max_depth=5, random_state=RANDOM_SEED)
tree_scores = cross_val_score(tree, X_train, y_train, cv=cv, scoring='roc_auc')

# Random forest
forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=RANDOM_SEED)
forest_scores = cross_val_score(forest, X_train, y_train, cv=cv, scoring='roc_auc')

print("=== SINGLE TREE VS RANDOM FOREST ===")
print(f"\nSingle Tree (depth=5):")
print(f"  CV ROC-AUC: {tree_scores.mean():.4f} ± {tree_scores.std():.4f}")
print(f"  Fold scores: {tree_scores.round(4)}")

print(f"\nRandom Forest (100 trees, depth=5):")
print(f"  CV ROC-AUC: {forest_scores.mean():.4f} ± {forest_scores.std():.4f}")
print(f"  Fold scores: {forest_scores.round(4)}")

print(f"\n=== IMPROVEMENT ===")
print(f"Mean improvement: {(forest_scores.mean() - tree_scores.mean()):.4f}")
print(f"Variance reduction: {(tree_scores.std() - forest_scores.std()):.4f}")

# Visualize
fig, ax = plt.subplots(figsize=(10, 6))
positions = [1, 2]
bp = ax.boxplot([tree_scores, forest_scores], positions=positions, widths=0.6, patch_artist=True)
for patch, color in zip(bp['boxes'], ['lightblue', 'lightgreen']):
    patch.set_facecolor(color)
ax.set_xticklabels(['Single Tree', 'Random Forest'])
ax.set_ylabel('ROC-AUC Score')
ax.set_title('Variance Reduction: Single Tree vs Random Forest')
ax.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

print("\n💡 Forest has higher mean AND lower variance")
print("💡 More stable predictions across different data samples")

**Reading the output:**

The printout shows fold-level ROC-AUC scores for both models. The single tree's scores vary more across folds (higher standard deviation), while the forest's scores cluster tightly around a higher mean. Typical numbers: the single tree scores around **0.95-0.97** with std ~0.02, while the forest scores **0.98-0.99** with std ~0.01.

The box plot reinforces this visually. The forest's box should be both **higher** (better median) and **narrower** (less spread) than the single tree's box. The "Improvement" line quantifies the mean lift, and the "Variance reduction" line reports the difference in fold-to-fold standard deviation (tree minus forest), so a positive value means the forest's spread is tighter.

**Why this matters:** This is the entire motivation for ensemble methods. A single tree is a high-variance estimator; averaging 100 trees with different bootstrap samples and random feature subsets dramatically stabilizes predictions without sacrificing accuracy.
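
Fold-to-fold spread is an indirect view of variance. A more direct check, sketched below, refits each model on several bootstrap resamples of the training set and measures how much the predicted probability for each test point moves. The 15 resamples and 50 trees are illustrative choices to keep the run short.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=474, stratify=y
)

rng = np.random.default_rng(474)
tree_probs, forest_probs = [], []
for rep in range(15):
    # Perturb the training data by resampling rows with replacement
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    tree = DecisionTreeClassifier(max_depth=5, random_state=rep)
    forest = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=rep)
    tree_probs.append(tree.fit(X_tr[idx], y_tr[idx]).predict_proba(X_te)[:, 1])
    forest_probs.append(forest.fit(X_tr[idx], y_tr[idx]).predict_proba(X_te)[:, 1])

# Std of each test point's predicted probability across refits, averaged over points
tree_spread = np.std(tree_probs, axis=0).mean()
forest_spread = np.std(forest_probs, axis=0).mean()
print(f"Mean prediction std  tree: {tree_spread:.4f} | forest: {forest_spread:.4f}")
```
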

---

## 3. Tuning Random Forests

### Most Important Hyperparameters

**n_estimators** (number of trees)
- More trees = better performance (usually)
- Diminishing returns after ~100-500
- More trees = longer training time

**max_features** (features per split)
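- sqrt(n_features) for classification (scikit-learn's default for `RandomForestClassifier`)
- n_features/3 is a common rule of thumb for regression (scikit-learn's `RandomForestRegressor` defaults to all features)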
- Lower values = more decorrelation

**max_depth** (tree depth)
- Controls individual tree complexity
- None = grow until pure (common for forests)

**min_samples_split** (minimum samples to split)
- Higher values = simpler trees
- Helps limit overfitting

In [None]:
# Effect of number of trees
n_estimators_range = [10, 25, 50, 100, 200, 300]
results = []

for n_est in n_estimators_range:
    rf = RandomForestClassifier(n_estimators=n_est, random_state=RANDOM_SEED, n_jobs=-1)
    scores = cross_val_score(rf, X_train, y_train, cv=cv, scoring='roc_auc')
    results.append({
        'n_estimators': n_est,
        'cv_mean': scores.mean(),
        'cv_std': scores.std()
    })

results_df = pd.DataFrame(results)
print("=== N_ESTIMATORS SWEEP ===")
print(results_df.to_string(index=False))

# Plot
plt.figure(figsize=(10, 6))
plt.errorbar(results_df['n_estimators'], results_df['cv_mean'], 
             yerr=results_df['cv_std'], marker='o', capsize=5, linewidth=2)
plt.xlabel('Number of Trees')
plt.ylabel('CV ROC-AUC')
plt.title('Random Forest: Effect of Number of Trees')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\n💡 Performance plateaus after ~100-200 trees")
print("💡 Use more trees for final model, fewer for experimentation")

**Reading the output:**

The table shows how CV ROC-AUC changes as we increase the number of trees from 10 to 300. Performance improves rapidly from 10 to 50 trees, then plateaus around 100-200. Going from 200 to 300 trees typically adds less than **0.001** to the mean ROC-AUC.

The error-bar plot makes the diminishing returns clear: the curve flattens well before 300 trees. Meanwhile, training time scales linearly with `n_estimators`, so there is a practical cost to adding more trees beyond the plateau.

Standard deviation also decreases slightly with more trees, because averaging over a larger ensemble further stabilizes the estimate. However, the reduction is marginal after ~100 trees.

**Key takeaway:** Use 100-200 trees for experimentation and tuning. For a final production model you can bump to 500+ for a tiny extra boost, but the gains will be minimal. The real performance levers are `max_features` and `max_depth`.
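
One way to explore `max_features` and `max_depth` together is a small grid search. The sketch below uses `GridSearchCV` with a deliberately tiny grid; the candidate values are illustrative assumptions, not recommended settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=474, stratify=y
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=474)

# Small illustrative grid: 2 x 2 = 4 candidate forests
param_grid = {"max_features": ["sqrt", 0.5], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=474),
    param_grid,
    cv=cv,
    scoring="roc_auc",
)
search.fit(X_tr, y_tr)

print("Best params:", search.best_params_)
print(f"Best CV ROC-AUC: {search.best_score_:.4f}")
```

Because the search only ever sees `X_tr`, the held-out test set stays untouched for the final comparison.
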

---

## 📝 PAUSE-AND-DO Exercise 1 (10 minutes)

**Task:** Tune `n_estimators` and `max_features` minimally and report effects.

---

In [None]:
# YOUR CODE: Tune max_features

max_features_options = ['sqrt', 'log2', 0.3, 0.5, None]
tuning_results = []

for max_feat in max_features_options:
    rf = RandomForestClassifier(
        n_estimators=100,
        max_features=max_feat,
        random_state=RANDOM_SEED,
        n_jobs=-1
    )
    scores = cross_val_score(rf, X_train, y_train, cv=cv, scoring='roc_auc')
    tuning_results.append({
        'max_features': str(max_feat),
        'cv_mean': scores.mean(),
        'cv_std': scores.std()
    })

tuning_df = pd.DataFrame(tuning_results)
print("=== MAX_FEATURES TUNING ===")
print(tuning_df.to_string(index=False))

best_idx = tuning_df['cv_mean'].idxmax()
print(f"\n✓ Best max_features: {tuning_df.loc[best_idx, 'max_features']} ({tuning_df.loc[best_idx, 'cv_mean']:.4f})")

**Reading the output:**

The table compares five `max_features` settings: `sqrt` (5 features), `log2` (4 features), `0.3` (9 features), `0.5` (15 features), and `None` (all 30 features). On the breast cancer dataset, `sqrt` and `log2` often perform similarly and are competitive with higher values.

Using all features (`None`) makes each tree more similar to every other tree, which reduces the variance-reduction benefit of the forest. Lower values force more diversity among trees. However, setting `max_features` too low can hurt if no single small subset of features carries enough signal.

The best setting is highlighted at the bottom. In practice, `sqrt` is the recommended default for classification and is rarely far from optimal.

**Key takeaway:** `max_features` controls the bias-variance tradeoff at the individual tree level. Lower values mean more tree diversity (lower correlation between trees), which improves ensemble performance up to a point.
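
To see how many candidate features each setting translates to on this 30-feature dataset, one option is to fit a tiny forest and read the fitted trees' `max_features_` attribute. This is a quick check under the assumption that the installed scikit-learn version truncates to an integer, which is what recent versions do.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

resolved = {}
for setting in ["sqrt", "log2", 0.3, 0.5, None]:
    # A handful of trees is enough; we only inspect the resolved split budget
    rf = RandomForestClassifier(n_estimators=5, max_features=setting, random_state=474)
    rf.fit(X, y)
    resolved[str(setting)] = rf.estimators_[0].max_features_

for setting, n_candidates in resolved.items():
    print(f"max_features={setting:>5} -> {n_candidates} candidate features per split")
```
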

---

### YOUR ANALYSIS:

**Effect of max_features:**  
[What did you observe? How does it affect performance?]

**Recommendation:**  
[Which value would you use in production? Why?]

---

## 4. Out-of-Bag (OOB) Score

### Free Cross-Validation

**Insight:** Each tree's bootstrap sample contains ~63% of the distinct training rows
- Remaining ~37% is "out-of-bag" for that tree
- Can use OOB samples to estimate test performance
- No need for a separate validation set!

**OOB Score ≈ Cross-Validation Score**
- Faster than CV
- Uses all data for training
- Good for initial model selection
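
The ~63% figure comes from sampling with replacement: the chance that a given row appears at least once in n draws is 1 − (1 − 1/n)ⁿ, which approaches 1 − 1/e ≈ 0.632 as n grows. A short simulation, using the 398 training rows of this notebook as n, illustrates it.

```python
import numpy as np

rng = np.random.default_rng(474)
n = 398  # number of training rows in this notebook's split

fractions = []
for _ in range(1000):
    idx = rng.integers(0, n, size=n)           # one bootstrap sample of row indices
    fractions.append(np.unique(idx).size / n)  # share of distinct rows that made it in

in_bag = np.mean(fractions)
exact = 1 - (1 - 1 / n) ** n
limit = 1 - 1 / np.e
print(f"Simulated in-bag share: {in_bag:.4f}")
print(f"Exact for n={n}: {exact:.4f} | Large-n limit: {limit:.4f}")
```
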

In [None]:
# Compare OOB vs CV
rf_oob = RandomForestClassifier(
    n_estimators=100,
    oob_score=True,
    random_state=RANDOM_SEED,
    n_jobs=-1
)

rf_oob.fit(X_train, y_train)
oob_score = rf_oob.oob_score_

# Compare to CV
cv_scores = cross_val_score(rf_oob, X_train, y_train, cv=cv, scoring='accuracy')

print("=== OOB VS CROSS-VALIDATION ===")
print(f"OOB Score (free): {oob_score:.4f}")
print(f"CV Score (5-fold): {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
print(f"Difference: {abs(oob_score - cv_scores.mean()):.4f}")

print("\n💡 OOB score is a good proxy for test performance")
print("💡 Use OOB for quick iterations, CV for final evaluation")

**Reading the output:**

Two numbers are printed side by side: the **OOB score** (computed internally by the forest from out-of-bag samples) and the **5-fold CV score**. Both are accuracy estimates. Typically they agree to within **0.005-0.01**, confirming that OOB is a reliable free proxy for cross-validation.

The OOB approach works because each tree's bootstrap sample contains only ~63% of the distinct training rows, so the remaining ~37% serves as a built-in validation set for that tree. Each training row is then scored using only the trees that never saw it, and aggregating those votes yields an approximately unbiased performance estimate. It can be slightly pessimistic, because each row is predicted by only about a third of the forest.

**Why this matters:** OOB scores are computed during training at little extra cost, while 5-fold CV requires fitting 5 separate forests. For quick hyperparameter screening, OOB can save significant compute time. Reserve full CV for the final comparison.
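
The fitted forest exposes the aggregated out-of-bag class probabilities as `oob_decision_function_`. The sketch below rebuilds the accuracy reported in `oob_score_` from those probabilities and also derives an OOB ROC-AUC; it fits on the full dataset purely to stay short.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=474)
rf.fit(X, y)

# One row per training sample, one column per class; each row uses only
# the trees for which that sample was out-of-bag
oob_proba = rf.oob_decision_function_
oob_pred = rf.classes_[np.argmax(oob_proba, axis=1)]

manual_oob_acc = (oob_pred == y).mean()
oob_auc = roc_auc_score(y, oob_proba[:, 1])
print(f"oob_score_: {rf.oob_score_:.4f} | manual OOB accuracy: {manual_oob_acc:.4f}")
print(f"OOB ROC-AUC: {oob_auc:.4f}")
```
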
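
A minimal sketch of OOB-based screening: each candidate setting needs one fit instead of five. The candidate list is an illustrative assumption.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=474, stratify=y
)

oob_by_setting = {}
for max_feat in ["sqrt", 0.5, None]:
    rf = RandomForestClassifier(
        n_estimators=100, max_features=max_feat, oob_score=True, random_state=474
    )
    rf.fit(X_tr, y_tr)  # a single fit per candidate; OOB accuracy comes with it
    oob_by_setting[str(max_feat)] = rf.oob_score_

for setting, score in oob_by_setting.items():
    print(f"max_features={setting:>4}: OOB accuracy = {score:.4f}")
```
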

---

## 5. Feature Importance

### Two Types of Importance

**1. Gini/Entropy Importance (built-in)**
- Based on how much each feature reduces impurity
- Fast to compute
- ⚠️ Biased toward high-cardinality features
- ⚠️ Can be misleading with correlated features

**2. Permutation Importance (recommended)**
- Shuffle feature, measure performance drop
- More reliable
- Slower to compute
- Works for any model

In [None]:
# Train final forest
rf_final = RandomForestClassifier(n_estimators=200, random_state=RANDOM_SEED, n_jobs=-1)
rf_final.fit(X_train, y_train)

# Built-in importance
builtin_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': rf_final.feature_importances_
}).sort_values('importance', ascending=False)

print("=== BUILT-IN FEATURE IMPORTANCE (Top 10) ===")
print(builtin_importance.head(10).to_string(index=False))

# Visualize
plt.figure(figsize=(10, 8))
top_features = builtin_importance.head(15)
plt.barh(range(len(top_features)), top_features['importance'])
plt.yticks(range(len(top_features)), top_features['feature'])
plt.xlabel('Gini Importance')
plt.title('Top 15 Features by Built-in Importance')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

**Reading the output:**

The table and horizontal bar chart show the top 15 features ranked by built-in Gini importance. Features from the "worst" category (worst radius, worst concave points, worst perimeter) typically dominate because they capture extreme cell-nucleus measurements that strongly distinguish malignant from benign tumors.

The importance values sum to 1.0 across all 30 features. A steep dropoff after the top 3-5 features suggests that the model's decisions are driven by a small subset of measurements, while the remaining features contribute little to the splits.

However, remember the caveat: Gini importance is computed on the training data, is biased toward features with many unique split points, and distributes credit among correlated features somewhat arbitrarily. Features like `worst radius` and `worst perimeter` are highly correlated (r > 0.99), so their individual Gini values say little about the unique information each one provides.
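
The correlation claim is easy to check directly. The sketch below computes the Pearson correlation for that pair and lists the most strongly correlated feature pairs overall; the column names in `pairs` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

X = load_breast_cancer(as_frame=True).data

r = X["worst radius"].corr(X["worst perimeter"])
print(f"corr(worst radius, worst perimeter) = {r:.4f}")

# Rank every unordered feature pair by absolute correlation
corr = X.corr().abs().to_numpy()
i, j = np.triu_indices_from(corr, k=1)  # upper triangle, diagonal excluded
pairs = pd.DataFrame(
    {"feature_a": X.columns[i], "feature_b": X.columns[j], "abs_r": corr[i, j]}
).sort_values("abs_r", ascending=False)
print(pairs.head(5).to_string(index=False))
```
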

**Key takeaway:** Treat Gini importance as a quick screening tool, not a definitive ranking. The next section shows permutation importance, which is more reliable.

---

## 6. Permutation Importance (Recommended)

Built-in Gini importance has a known flaw: it is biased toward high-cardinality and continuous features, and it can be misleading when features are correlated. **Permutation importance** avoids the cardinality bias by measuring how much model performance drops when a single feature's values are randomly shuffled. If shuffling a feature barely affects ROC-AUC, the model does not rely on that feature, which can also happen when a correlated feature carries the same information.

We compute permutation importance on the **test set** (not training set) so the estimates reflect genuine predictive value rather than memorized patterns. Each feature is shuffled 10 times to produce a mean and standard deviation, giving a rough sense of how stable each importance estimate is.
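
The mechanics are simple enough to write by hand for one feature. The sketch below is a hypothetical helper, `manual_permutation_importance`, meant to show the idea rather than replace `sklearn.inspection.permutation_importance`; the chosen feature is just an example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

data = load_breast_cancer(as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.3, random_state=474, stratify=data.target
)
rf = RandomForestClassifier(n_estimators=100, random_state=474).fit(X_tr, y_tr)

def manual_permutation_importance(model, X, y, feature, n_repeats=10, seed=474):
    """Mean and std of the ROC-AUC drop when one column is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = roc_auc_score(y, model.predict_proba(X)[:, 1])
    drops = []
    for _ in range(n_repeats):
        X_shuffled = X.copy()
        # Break the link between this feature and the target; leave the rest intact
        X_shuffled[feature] = rng.permutation(X_shuffled[feature].to_numpy())
        shuffled_auc = roc_auc_score(y, model.predict_proba(X_shuffled)[:, 1])
        drops.append(baseline - shuffled_auc)
    return float(np.mean(drops)), float(np.std(drops))

mean_drop, std_drop = manual_permutation_importance(rf, X_te, y_te, "worst concave points")
print(f"ROC-AUC drop for 'worst concave points': {mean_drop:.4f} ± {std_drop:.4f}")
```
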

In [None]:
# Permutation importance
perm_importance = permutation_importance(
    rf_final, X_test, y_test,
    n_repeats=10,
    random_state=RANDOM_SEED,
    scoring='roc_auc'
)

perm_df = pd.DataFrame({
    'feature': X.columns,
    'importance_mean': perm_importance.importances_mean,
    'importance_std': perm_importance.importances_std
}).sort_values('importance_mean', ascending=False)

print("=== PERMUTATION IMPORTANCE (Top 10) ===")
print(perm_df.head(10).to_string(index=False))

# Compare built-in vs permutation
comparison = builtin_importance.merge(
    perm_df,
    on='feature',
    suffixes=('_builtin', '_perm')
).head(10)

print("\n=== TOP 10: BUILT-IN VS PERMUTATION ===")
print(comparison[['feature', 'importance', 'importance_mean']].to_string(index=False))

**Reading the output:**

The permutation importance table ranks features by how much test-set ROC-AUC drops when each feature is shuffled. The `importance_std` column shows the variability across 10 shuffle repeats, which serves as a rough uncertainty band rather than a formal confidence interval. A feature with high mean importance but also high std should be interpreted cautiously.

The comparison table at the bottom shows built-in Gini importance alongside permutation importance for the 10 features ranked highest by Gini importance. You will likely notice differences in ranking: some features that rank high by Gini (because they are used in many splits) may rank lower by permutation (because other correlated features compensate when one is shuffled). For example, `worst perimeter` and `worst radius` often swap positions between the two methods.

Features with permutation importance near zero or negative are essentially irrelevant to the model's predictions on new data, even if they appear in many tree splits.
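
One simple screen, sketched below, flags features whose mean importance stays above zero after subtracting two standard deviations. The two-sigma threshold is a rule of thumb rather than a formal test, and the column name `clearly_positive` is illustrative.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer(as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.3, random_state=474, stratify=data.target
)
rf = RandomForestClassifier(n_estimators=100, random_state=474).fit(X_tr, y_tr)

result = permutation_importance(
    rf, X_te, y_te, n_repeats=10, random_state=474, scoring="roc_auc"
)
perm = pd.DataFrame({
    "feature": X_te.columns,
    "mean": result.importances_mean,
    "std": result.importances_std,
})
# Flag features whose importance is distinguishable from shuffle noise
perm["clearly_positive"] = (perm["mean"] - 2 * perm["std"]) > 0

print(perm.sort_values("mean", ascending=False).head(10).to_string(index=False))
print(f"\nFeatures clearly above zero: {int(perm['clearly_positive'].sum())} of {len(perm)}")
```
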

**Why this matters:** Permutation importance is the recommended method for production because it is model-agnostic, computed on held-out data, and free of the split-count bias that affects Gini importance. Correlated features still require care, as noted above.

---

## 📝 PAUSE-AND-DO Exercise 2 (10 minutes)

**Task:** Compute permutation importance and write 3 interpretation bullets.

Already done above! Now analyze:

---

### YOUR INTERPRETATION:

**Bullet 1: Top Features**  
[Which features are most important? Why might this be?]

**Bullet 2: Differences**  
[How do built-in vs permutation importance differ?]

**Bullet 3: Business Insight**  
[What does this tell you about the prediction task?]

---

## 7. Comprehensive Model Comparison

Before concluding, we compare every model type encountered so far: Logistic Regression (linear baseline), a single tuned Decision Tree, and Random Forests with 100 and 200 trees. All models are evaluated under the same `StratifiedKFold` CV object to ensure fair comparison.

The resulting table and bar chart make it easy to see how much ensemble methods improve over single models. Pay attention not only to the mean ROC-AUC but also to the standard deviation (stability) and the gap between CV and test scores (generalization). A model that has both the highest mean and the lowest spread is the clear champion.

In [None]:
# Compare all models
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

models = {
    'Logistic Regression': Pipeline([
        ('scaler', StandardScaler()),
        ('clf', LogisticRegression(random_state=RANDOM_SEED, max_iter=1000))
    ]),
    'Decision Tree (tuned)': DecisionTreeClassifier(max_depth=5, random_state=RANDOM_SEED),
    'Random Forest (100)': RandomForestClassifier(n_estimators=100, random_state=RANDOM_SEED, n_jobs=-1),
    'Random Forest (200)': RandomForestClassifier(n_estimators=200, random_state=RANDOM_SEED, n_jobs=-1)
}

comparison_results = []
for name, model in models.items():
    cv_results = cross_val_score(model, X_train, y_train, cv=cv, scoring='roc_auc')
    
    # Fit and test
    model.fit(X_train, y_train)
    if hasattr(model, 'predict_proba'):
        test_score = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    else:
        test_score = roc_auc_score(y_test, model.predict(X_test))
    
    comparison_results.append({
        'Model': name,
        'CV_Mean': cv_results.mean(),
        'CV_Std': cv_results.std(),
        'Test_Score': test_score
    })

final_comparison = pd.DataFrame(comparison_results).sort_values('CV_Mean', ascending=False)
print("=== FINAL MODEL COMPARISON ===")
print(final_comparison.to_string(index=False))

# Visualize
fig, ax = plt.subplots(figsize=(10, 6))
x = np.arange(len(final_comparison))
ax.bar(x - 0.2, final_comparison['CV_Mean'], 0.4, label='CV Mean', alpha=0.8)
ax.bar(x + 0.2, final_comparison['Test_Score'], 0.4, label='Test', alpha=0.8)
ax.set_xlabel('Model')
ax.set_ylabel('ROC-AUC')
ax.set_title('Model Comparison: CV vs Test Performance')
ax.set_xticks(x)
ax.set_xticklabels(final_comparison['Model'], rotation=45, ha='right')
ax.legend()
ax.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

best_model = final_comparison.iloc[0]['Model']
print(f"\n✓ Champion model: {best_model}")

**Reading the output:**

The final comparison table ranks four models by CV ROC-AUC: Logistic Regression, Decision Tree (depth-5), Random Forest (100 trees), and Random Forest (200 trees). The bar chart displays CV mean vs. test score side by side.

Typical ordering on the breast cancer dataset: the two Random Forest variants lead with ROC-AUC around **0.99**, followed closely by Logistic Regression at **0.98-0.99**, with the single Decision Tree trailing at **0.96-0.97**. The 200-tree forest is usually within **0.001-0.003** of the 100-tree forest, illustrating the diminishing returns of adding more trees.

Notice the standard deviations: forests and logistic regression tend to have similarly low std (~0.01), while the single tree has noticeably higher std (~0.02-0.03). The CV-vs-test gap should be small for all models; a large discrepancy for any model would signal overfitting or an unusual test split.

The champion model name is printed at the bottom. On this dataset the margin between the forest and logistic regression is often small, illustrating that a more complex model does not always provide a large lift over a well-tuned linear baseline.

**Key takeaway:** Random Forests reliably improve over single trees and compete with linear models on well-structured datasets. The real payoff of forests comes on datasets with non-linear relationships and feature interactions where logistic regression struggles.
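
For a project report it is often useful to show more than one metric, plus the training score as a quick overfitting check. The sketch below uses `cross_validate` for that; the two models and the column labels are illustrative choices.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=474, stratify=y
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=474)

models = {
    "Logistic Regression": Pipeline([
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000, random_state=474)),
    ]),
    "Random Forest (100)": RandomForestClassifier(n_estimators=100, random_state=474),
}

rows = []
for name, model in models.items():
    res = cross_validate(
        model, X_tr, y_tr, cv=cv,
        scoring=["roc_auc", "accuracy"], return_train_score=True,
    )
    rows.append({
        "Model": name,
        "CV ROC-AUC": res["test_roc_auc"].mean(),
        "Train ROC-AUC": res["train_roc_auc"].mean(),  # large gap vs CV hints at overfitting
        "CV Accuracy": res["test_accuracy"].mean(),
    })

table = pd.DataFrame(rows).set_index("Model").round(4)
print(table)
```
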

---

## 8. Wrap-Up: Key Takeaways

### What We Learned Today:

1. **Bagging Reduces Variance**: Averaging many trees stabilizes predictions
2. **Random Forests**: Bagging + random feature selection = powerful ensemble
3. **Hyperparameter Tuning**: `n_estimators` sets the compute budget; `max_features` and `max_depth` are the main performance levers
4. **OOB Score**: Near-free estimate that closely tracks cross-validation
5. **Permutation Importance**: More reliable than built-in Gini importance

### Critical Rules:

> **"More trees is almost always better (diminishing returns after 100-500)"**

> **"Use permutation importance for production, not built-in importance"**

> **"Random forests rarely overfit badly (but can underfit)"**

### Next Steps:

- Next notebook: Gradient Boosting (sequential ensembles)
- Boosting often gives even better performance
- But requires more careful tuning

---

## Bibliography

- Breiman, L. (2001). "Random Forests." *Machine Learning*, 45(1), 5-32.
- James, G., Witten, D., Hastie, T., Tibshirani, R., & Taylor, J. (2023). *An Introduction to Statistical Learning with Applications in Python* - Tree-Based Methods (bagging/forests)
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). *The Elements of Statistical Learning* - Random forests and bagging
- scikit-learn User Guide: [RandomForest estimators](https://scikit-learn.org/stable/modules/ensemble.html#forest)
- scikit-learn User Guide: [Permutation importance](https://scikit-learn.org/stable/modules/permutation_importance.html)

---



<center>

Thank you!

</center>