# Exercise 11 (as assigned)
Due Tuesday, November 11, 2025  11:59pm

Random Forest and XGBoost Analysis
Analyzing 2016 election data to predict voter gap (trump - clinton)

We will be using the county_level_election.csv dataset. This is 2016 election data and we are going to measure ‘votergap’ as the outcome. ‘votergap’ = trump-clinton. The exercise will build on the work from the decision tree lab.

# Section 2: Bagging / Random Forest
We are going to be using test and training splits, cross validation, and fitting a random forest to the data. Create an 80/20 Train/Test split. For accuracy use the .score method.
from sklearn.ensemble import RandomForestRegressor

1.	Set the number of estimators to be 100, the features to be the square root of available features, and iterate through depths (1-20). Use only 5 folds for cross validation to save some compute resources. Plot the max depth on the x axis and the accuracy on the y axis for training and for the mean cross validation.
2.	Based on the plot, how many nodes would you recommend as the max depth?
3.	What is the accuracy (mean cv) at your chosen depth?
4.	The cross validation looks different than the lab, why?

# Section 3: Boosting / XGBoost
import xgboost as xgb

5.	Use the defaults for most parameters. Iterate through depths (1-20). Use only 5 folds for cross validation to save some compute resources. Plot the max depth on the x axis and the accuracy on the y axis for training and for the mean cross validation.
6.	Based on the plot, how many nodes would you recommend as the max depth?
7.	What is the accuracy (mean cv) at your chosen depth?
8.	The cross validation looks different than random forest, why?



# Week 11 Exercise: Random Forest and XGBoost Analysis
## 2016 Election Data - Predicting Voter Gap

### Overview
This exercise analyzes 2016 U.S. presidential election data at the county level to predict **votergap** (% Trump votes - % Clinton votes). We will compare two ensemble learning approaches:

1. **Random Forest (Bagging)** - Parallel independent trees with bootstrap sampling
2. **XGBoost (Boosting)** - Sequential error-correcting trees

### Dataset
- **File:** county_level_election.csv
- **Unit of Analysis:** One row per U.S. county
- **Target Variable:** votergap = trump % - clinton %
- **Features:** 11 demographic and health indicators
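
As a quick illustration of how the target is defined, here is a minimal sketch on made-up rows. The column names `trump` and `clinton` are my assumption based on the assignment text; the CSV already ships with `votergap`, so this is only to make the sign convention concrete.

```python
import pandas as pd

# Toy rows for illustration only; 'trump' and 'clinton' column names are assumed
toy = pd.DataFrame({
    "county": ["A", "B", "C"],
    "trump": [60.0, 35.5, 48.0],
    "clinton": [35.0, 60.5, 48.0],
})

# Positive gap: Trump led; negative gap: Clinton led; zero: tie
toy["votergap"] = toy["trump"] - toy["clinton"]
print(toy)
```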

### Analysis Goals
- Create 80/20 train/test split
- Tune max_depth hyperparameter (range: 1-20)
- Use 5-fold cross-validation
- Compare model performance and behavior
- Understand why CV patterns differ between algorithms

In [None]:
#suppress future warnings (XGBoost was complaining)
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)

#library import
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
import xgboost as xgb

from sklearn.tree import export_graphviz
import graphviz

#set random seed for reproducibility
RANDOM_STATE = 42


## Section 1: Data Loading and Preparation

I'll load the county-level election data and prepare our feature matrix (X) and target vector (y).


In [None]:
election_df = pd.read_csv('county_level_election.csv')


In [None]:
print(f"Dataset shape: {election_df.shape}")
print(f"Number of counties: {election_df.shape[0]:,}")
print(f"Number of variables: {election_df.shape[1]}")

In [None]:
#first few rows (printed explicitly, since a cell only auto-displays its last expression)
print(election_df.head())

#column names
election_df.columns.tolist()


### Feature Selection

I'll use 11 demographic and health indicators as our features to predict votergap:
- **hispanic:** % Hispanic population
- **minority:** % Minority population  
- **female:** % Female population
- **unemployed:** Unemployment rate
- **income:** Median household income
- **nodegree:** % Without high school degree
- **bachelor:** % With bachelor's degree
- **inactivity:** Physical inactivity rate
- **obesity:** Obesity rate
- **density:** Population density
- **cancer:** Cancer rate

In [None]:
feature_columns = [
    'hispanic', 'minority', 'female', 'unemployed', 'income',
    'nodegree', 'bachelor', 'inactivity', 'obesity', 'density', 'cancer'
]

X = election_df[feature_columns]
y = election_df['votergap']


In [None]:
#display feature matrix shape
print(f"Feature matrix shape: {X.shape}")
print(f"Features: {len(feature_columns)}")

In [None]:
print("Target variable: votergap")
print(f"Range: {y.min():.2f} to {y.max():.2f}")
print(f"Mean: {y.mean():.2f}")
print(f"Median: {y.median():.2f}")

### Train/Test Split

Creating an 80/20 split for training and testing. The training set will be used for model building and cross-validation, while the test set provides an independent evaluation of final model performance.


In [None]:
#create train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.20, 
    random_state=RANDOM_STATE
)


In [None]:
#display split sizes
print(f"Training set: {X_train.shape[0]:,} counties ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"Test set: {X_test.shape[0]:,} counties ({X_test.shape[0]/len(X)*100:.1f}%)")


## Section 2: Random Forest Analysis

Random Forest uses **bagging** (Bootstrap AGGregatING):
- Builds multiple independent trees in parallel
- Each tree trained on bootstrap sample (sampling with replacement)
- Final prediction = average of all tree predictions
- Reduces variance through ensemble averaging
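
To make the bullets above concrete, here is a rough hand-rolled version of bagging. It is a sketch on synthetic data (a stand-in I generate with `make_regression`, not the county data), and it is not how `RandomForestRegressor` is implemented internally; it only mirrors the bootstrap-then-average idea.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the county data (assumption: 11 numeric features)
X_demo, y_demo = make_regression(n_samples=300, n_features=11, noise=10.0, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25
trees = []
for i in range(n_trees):
    # Bootstrap sample: draw as many rows as the dataset has, with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    # Each tree considers a random sqrt-sized subset of features at every split
    tree = DecisionTreeRegressor(max_depth=5, max_features="sqrt", random_state=i)
    trees.append(tree.fit(X_demo[idx], y_demo[idx]))

# Ensemble prediction = average of the individual tree predictions
all_preds = np.stack([t.predict(X_demo) for t in trees])
bagged_pred = all_preds.mean(axis=0)
print(all_preds.shape, bagged_pred.shape)
```

Because each tree is fit independently of the others, the loop could run in parallel, which is what `n_jobs` controls in scikit-learn's implementation.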

### Configuration
- **n_estimators:** 100 trees
- **max_features:** Square root of total features (√11 ≈ 3.3, truncated to 3)
- **Cross-validation:** 5 folds
- **Max depth range:** 1 to 20

In [None]:
#calculate max features for random forest
n_features = X_train.shape[1]
max_features_sqrt = int(np.sqrt(n_features))

print(f"Total features: {n_features}")
print(f"Max features per split: {max_features_sqrt} (square root)")


In [None]:
#initialize storage for random forest results
rf_depths = range(1, 21)
rf_train_scores = []
rf_cv_scores = []


### Training Random Forest Models

Iterating through max_depth values from 1 to 20, training a Random Forest for each depth and evaluating with both training R² and 5-fold cross-validation R².
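
The assignment calls the `.score` output "accuracy", but for a regressor it is R². The sketch below, on synthetic data rather than the election data, checks that `.score` matches the usual definition of one minus the ratio of residual to total sum of squares.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption: not the county dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=11, noise=15.0, random_state=0)
model = RandomForestRegressor(n_estimators=20, max_depth=4, random_state=0).fit(X_demo, y_demo)

pred = model.predict(X_demo)
ss_res = np.sum((y_demo - pred) ** 2)            # residual sum of squares
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)   # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

# The two numbers should agree up to floating-point error
print(r2_manual, model.score(X_demo, y_demo))
```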

In [None]:
#train random forest models across depths
print("Training Random Forest models...")
print(f"{'Depth':<8} {'Train R²':<12} {'CV R²':<12}")
print("-" * 35)

for depth in rf_depths:
    rf_model = RandomForestRegressor(
        n_estimators=100,
        max_depth=depth,
        max_features=max_features_sqrt,
        random_state=RANDOM_STATE,
        n_jobs=1
    )
    
    rf_model.fit(X_train, y_train)
    train_score = rf_model.score(X_train, y_train)
    rf_train_scores.append(train_score)
    
    cv_scores = cross_val_score(
        rf_model, X_train, y_train, 
        cv=5, 
        scoring='r2',
        n_jobs=1
    )
    cv_score_mean = cv_scores.mean()
    rf_cv_scores.append(cv_score_mean)
    
    if depth % 5 == 0 or depth == 1:
        print(f"{depth:<8} {train_score:<12.4f} {cv_score_mean:<12.4f}")



In [None]:
#find the best random forest depth
rf_best_idx = np.argmax(rf_cv_scores)
rf_best_depth = rf_depths[rf_best_idx]
rf_best_cv_score = rf_cv_scores[rf_best_idx]

print(f"\nBest depth based on CV: {rf_best_depth}")
print(f"CV R² at best depth: {rf_best_cv_score:.4f}")


### Random Forest Visualization

Plotting training and cross-validation R² scores across all depth values to visualize model performance and identify optimal complexity.


In [None]:
#plot our random forest performance
plt.figure(figsize=(12, 6))
plt.plot(rf_depths, rf_train_scores, 'b-o', label='Training R²', linewidth=2, markersize=6)
plt.plot(rf_depths, rf_cv_scores, 'r-s', label='Cross-Validation R²', linewidth=2, markersize=6)
plt.axvline(x=rf_best_depth, color='g', linestyle='--', alpha=0.7, 
            label=f'Best Depth = {rf_best_depth}')
plt.xlabel('Max Depth', fontsize=12)
plt.ylabel('R² Score', fontsize=12)
plt.title('Random Forest: Model Performance vs Max Depth', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('random_forest_performance.png', dpi=300, bbox_inches='tight')
plt.show()

### Random Forest Test Set Evaluation

Training the final Random Forest model with the optimal depth and evaluating on the held-out test set.


In [None]:
#train the final random forest model
rf_final = RandomForestRegressor(
    n_estimators=100,
    max_depth=rf_best_depth,
    max_features=max_features_sqrt,
    random_state=RANDOM_STATE,
    n_jobs=-1
)

rf_final.fit(X_train, y_train)
rf_test_score = rf_final.score(X_test, y_test)

#show random forest results summary
print("=" * 80)
print("RANDOM FOREST RESULTS")
print("=" * 80)
print(f"Recommended max_depth: {rf_best_depth}")
print(f"Training R² at chosen depth: {rf_train_scores[rf_best_idx]:.4f}")
print(f"Cross-Validation R² at chosen depth: {rf_best_cv_score:.4f}")
print(f"Test R² at chosen depth: {rf_test_score:.4f}")
print("=" * 80)


### Visualizing a Sample Tree from Random Forest

Since Random Forest contains 100 trees, I'll visualize just one example tree to see how the model makes decisions based on demographic and health features.

In [None]:
#extract one tree from the Random Forest (tree index 0)
single_tree = rf_final.estimators_[0]

dot_data = export_graphviz(
    single_tree,
    out_file=None,
    feature_names=feature_columns,
    filled=True,
    rounded=True,
    special_characters=True,
    max_depth=3
)

#display the graph inline in the notebook
graph = graphviz.Source(dot_data)
graph

In [None]:
#Adding Paul's decision tree training code for comparison

# Test for up to 20 layers of depth
depths = list(range(1, 21))
train_scores = []
cvmeans = []
cvstds = []
cv_scores = []

for depth in depths:

    print(f'Training with depth: {depth}...')
    dtree = DecisionTreeRegressor(max_depth=depth, random_state=42)

    # Perform training and 10-fold cross validation
    train_scores.append(dtree.fit(X_train, y_train).score(X_train, y_train))
    scores = cross_val_score(estimator=dtree, X=X_train, y=y_train, cv=10)

    cvmeans.append(scores.mean())
    cvstds.append(scores.std())

cvmeans = np.array(cvmeans)
cvstds = np.array(cvstds)

# Plot the mean accuracy from cross validation with a 2 std band shaded
plt.plot(depths, cvmeans, '*-', label="Mean CV")
plt.fill_between(depths, cvmeans - 2*cvstds, cvmeans + 2*cvstds, alpha=0.3)

# Plot training accuracy of a model with depth N
# against the cross-validation accuracy of the model with depth N
plt.plot(depths, train_scores, '-+', label="Train")
plt.legend()
plt.ylabel("Accuracy")
plt.xlabel("Depth")
plt.xticks(depths);

### Question: Based on the plot, how many nodes would you recommend as the max depth?

**Answer:** 20

The cross-validation R² continues to improve through depth 20, reaching its peak at this maximum tested depth. While the gap between training and CV scores widens (indicating some overfitting), the CV score itself continues climbing, demonstrating that deeper trees capture meaningful patterns in this dataset.
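
When the best depth lands at the edge of the tested range, one optional sanity check is to ask which smaller depths score within one standard error of the best. The helper below is a hypothetical sketch of that rule; it needs the per-fold scores for every depth (the loop above keeps only the means), and the numbers in the demo are made up purely to show the mechanics.

```python
import numpy as np

def smallest_depth_within_one_se(depths, fold_scores):
    """Return the smallest depth whose mean CV score is within one
    standard error of the best mean CV score.

    fold_scores has shape (n_depths, n_folds); depths must be in ascending order.
    """
    fold_scores = np.asarray(fold_scores, dtype=float)
    means = fold_scores.mean(axis=1)
    # Standard error of the mean across folds, per depth
    ses = fold_scores.std(axis=1, ddof=1) / np.sqrt(fold_scores.shape[1])
    best = int(np.argmax(means))
    threshold = means[best] - ses[best]
    for depth, mean in zip(depths, means):
        if mean >= threshold:
            return depth
    return depths[best]

# Made-up fold scores for illustration only (not results from this notebook)
demo_depths = [1, 2, 3, 4]
demo_scores = [
    [0.40, 0.42, 0.38, 0.41, 0.39],
    [0.60, 0.62, 0.58, 0.61, 0.59],
    [0.70, 0.73, 0.67, 0.71, 0.69],
    [0.71, 0.74, 0.66, 0.72, 0.70],
]
chosen = smallest_depth_within_one_se(demo_depths, demo_scores)
print(chosen)
```

In the toy numbers, depth 4 has the highest mean, but depth 3 is within one standard error of it, so the rule picks depth 3.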


### Question: What is the accuracy (mean CV) at your chosen depth?

**Answer:** The mean CV R² at depth 20 is about 0.7375, as displayed above in the results summary.


### Question: The cross validation looks different than the lab, why?

**Answer:** Several factors contribute to different CV behavior compared to simpler lab datasets:

1. **Dataset Complexity**
   - Election data contains complex non-linear relationships between demographic factors and voting patterns
   - County-level aggregation introduces geographic variance and regional clustering effects
   - The votergap has high variability across counties

2. **Feature Interactions**
   - 11 features create a rich decision space with potential interactions
   - Some features may have threshold effects or non-linear relationships with votergap

3. **Overfitting Pattern**
   - Large gap between training R² (0.96) and CV R² (0.74) indicates overfitting
   - However, CV continues improving because Random Forest's averaging mechanism helps generalization
   - Lab datasets were likely less complex with clearer signal-to-noise ratios

4. **Regional Effects**
   - Geographic clustering of counties creates natural data structure
   - Cross-validation folds may capture different regional patterns
   - This introduces more CV variance than independent observations would
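
One way to probe the regional-effects point is to compare ordinary shuffled folds with folds that keep each region together. This is a minimal sketch on synthetic data; the `groups` array is a hypothetical stand-in for something like a state code, and I have not confirmed the dataset has such a column. On synthetic data the groups carry no real structure, so the sketch only shows the mechanics.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold, KFold, cross_val_score

# Synthetic stand-in data (assumption: not the county dataset)
X_demo, y_demo = make_regression(n_samples=300, n_features=11, noise=10.0, random_state=0)

# Hypothetical region labels: 10 regions with 30 rows each
groups = np.repeat(np.arange(10), 30)

model = RandomForestRegressor(n_estimators=30, max_depth=5, random_state=0)

# Ordinary 5-fold CV: rows from the same region can land in both train and validation
shuffled_scores = cross_val_score(
    model, X_demo, y_demo, cv=KFold(n_splits=5, shuffle=True, random_state=0)
)

# Grouped 5-fold CV: every region is held out as a whole
grouped_scores = cross_val_score(
    model, X_demo, y_demo, cv=GroupKFold(n_splits=5), groups=groups
)

print(f"Shuffled: mean={shuffled_scores.mean():.3f}, std={shuffled_scores.std():.3f}")
print(f"Grouped:  mean={grouped_scores.mean():.3f}, std={grouped_scores.std():.3f}")
```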


## Section 3: XGBoost Analysis

XGBoost uses **boosting**:
- Builds trees sequentially, each correcting errors of previous trees
- Each tree fits residual errors from the ensemble so far
- More aggressive learning achieves high accuracy with fewer, shallower trees
- More prone to overfitting if not carefully tuned
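
The residual-fitting idea in the bullets above can be sketched by hand. This is plain gradient boosting with squared error on synthetic data, not XGBoost's actual algorithm (which adds regularization and second-order gradient information); the learning rate and number of rounds are illustrative values I picked.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: not the county dataset)
X_demo, y_demo = make_regression(n_samples=300, n_features=11, noise=10.0, random_state=0)

learning_rate = 0.3   # illustrative value
n_rounds = 50         # illustrative value

# Start from a constant prediction: the mean of the target
prediction = np.full(len(y_demo), y_demo.mean())
mse_per_round = [np.mean((y_demo - prediction) ** 2)]

for _ in range(n_rounds):
    # Each new shallow tree is fit to what the ensemble still gets wrong
    residuals = y_demo - prediction
    small_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, residuals)
    # Add a damped version of the tree's correction to the running prediction
    prediction = prediction + learning_rate * small_tree.predict(X_demo)
    mse_per_round.append(np.mean((y_demo - prediction) ** 2))

print(f"Training MSE before boosting: {mse_per_round[0]:.1f}")
print(f"Training MSE after {n_rounds} rounds: {mse_per_round[-1]:.1f}")
```

Unlike bagging, each round depends on the prediction from the previous round, so the trees cannot be built independently.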

### Configuration
- **Default XGBoost parameters**
- **Cross-validation:** 5 folds
- **Max depth range:** 1 to 20


In [None]:
#initialize storage for XGBoost results
xgb_depths = range(1, 21)
xgb_train_scores = []
xgb_cv_scores = []

#train XGBoost models across depths
#Iterating through max_depth values from 1 to 20
#  training an XGBoost model for each depth and evaluating with both training R² and 5-fold cross-validation R²
print("Training XGBoost models...")
print(f"{'Depth':<8} {'Train R²':<12} {'CV R²':<12}")
print("-" * 35)

#across the depths
for depth in xgb_depths:
    xgb_model = xgb.XGBRegressor(
        max_depth=depth,
        random_state=RANDOM_STATE,
        n_jobs=1
    )

    #fit the model
    xgb_model.fit(X_train, y_train)
    train_score = xgb_model.score(X_train, y_train)
    xgb_train_scores.append(train_score)

    #check scores
    cv_scores = cross_val_score(
        xgb_model, X_train, y_train,
        cv=5,
        scoring='r2',
        n_jobs=1
    )
    cv_score_mean = cv_scores.mean()
    xgb_cv_scores.append(cv_score_mean)
    
    if depth % 5 == 0 or depth == 1:
        print(f"{depth:<8} {train_score:<12.4f} {cv_score_mean:<12.4f}")

#find the best XGBoost depth
xgb_best_idx = np.argmax(xgb_cv_scores)
xgb_best_depth = xgb_depths[xgb_best_idx]
xgb_best_cv_score = xgb_cv_scores[xgb_best_idx]

print(f"\nBest depth based on CV: {xgb_best_depth}")
print(f"CV R² at best depth: {xgb_best_cv_score:.4f}")



In [None]:
#plot training and cross-validation R² scores across all depth values 
#  to visualize the boosting algorithm's characteristic overfitting pattern

plt.figure(figsize=(12, 6))
plt.plot(xgb_depths, xgb_train_scores, 'b-o', label='Training R²', linewidth=2, markersize=6)
plt.plot(xgb_depths, xgb_cv_scores, 'r-s', label='Cross-Validation R²', linewidth=2, markersize=6)
plt.axvline(x=xgb_best_depth, color='g', linestyle='--', alpha=0.7, 
            label=f'Best Depth = {xgb_best_depth}')
plt.xlabel('Max Depth', fontsize=12)
plt.ylabel('R² Score', fontsize=12)
plt.title('XGBoost: Model Performance vs Max Depth', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('xgboost_performance.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
#train final XGBoost model
xgb_final = xgb.XGBRegressor(
    max_depth=xgb_best_depth,
    random_state=RANDOM_STATE,
    n_jobs=1
)

xgb_final.fit(X_train, y_train)
xgb_test_score = xgb_final.score(X_test, y_test)

#show XGBoost results summary
print("=" * 80)
print("XGBOOST RESULTS")
print("=" * 80)
print(f"Recommended max_depth: {xgb_best_depth}")
print(f"Training R² at chosen depth: {xgb_train_scores[xgb_best_idx]:.4f}")
print(f"Cross-Validation R² at chosen depth: {xgb_best_cv_score:.4f}")
print(f"Test R² at chosen depth: {xgb_test_score:.4f}")
print("=" * 80)

### Question: Based on the plot, how many nodes would you recommend as the max depth?

**Answer:** 2

XGBoost achieves its best cross-validation performance at depth 2. Unlike Random Forest, the CV score peaks early and then gradually declines as depth increases. This classic pattern shows overfitting setting in at higher depths. Depth 2 provides optimal generalization.


### Question: What is the accuracy (mean CV) at your chosen depth?

**Answer:** The mean CV R² at depth 2 is about 0.7462, as displayed above in the results summary.


### Question: The cross validation looks different than random forest, why?

**Answer:** The fundamental algorithmic differences explain the contrasting CV patterns:

### 1. Ensemble Approach

**Random Forest (Bagging):**
- Builds multiple INDEPENDENT trees in parallel
- Each tree trained on bootstrap sample (sampling with replacement)
- Final prediction = average of all tree predictions
- Trees can be very deep because averaging reduces variance
- Requires more depth to capture complex patterns

**XGBoost (Boosting):**
- Builds trees SEQUENTIALLY, each correcting previous errors
- Each new tree fits the residual errors of the current ensemble
- More efficient learning through focused error correction
- Achieves higher accuracy with shallower trees
- More sensitive to overfitting as depth increases

### 2. Learning Efficiency

- XGBoost reaches peak CV performance at depth 2
- Random Forest needs depth 20 to reach similar performance
- Sequential error correction is more sample-efficient than parallel averaging
- Boosting learns patterns faster but also overfits faster

### 3. Overfitting Behavior

**Random Forest:**
- CV score steadily improves with depth (less sensitive to overfitting)
- Large gap between training and CV, but both keep improving
- Averaging mechanism provides natural regularization

**XGBoost:**
- CV score peaks early then declines (more prone to overfit)
- Training score quickly reaches perfection (R² = 1.0 at depth 5)
- Sequential learning compounds errors when trees get too complex

### 4. Bias-Variance Tradeoff

**Random Forest:**
- Starts with high bias (shallow trees underfit)
- Reduces bias by increasing depth
- Variance controlled through ensemble averaging

**XGBoost:**
- Starts with lower bias (boosting is inherently aggressive)
- Quickly achieves good fit
- Variance increases rapidly with depth due to sequential dependencies
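
The contrast above can be explored with a small depth sweep on synthetic data. To keep the sketch dependent on scikit-learn only, I swap in `GradientBoostingRegressor` for XGBoost; it is a different implementation of boosting, so treat it as an analogy rather than a reproduction of the XGBoost results. The dataset, depths, and tree counts are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic non-linear regression problem (assumption: stand-in for the county data)
X_demo, y_demo = make_friedman1(n_samples=300, n_features=11, noise=1.0, random_state=0)

depths = [1, 2, 4, 8]
results = {"bagging": [], "boosting": []}

for depth in depths:
    models = {
        "bagging": RandomForestRegressor(
            n_estimators=50, max_depth=depth, max_features="sqrt", random_state=0
        ),
        # Stand-in for XGBoost: scikit-learn's gradient boosting
        "boosting": GradientBoostingRegressor(
            n_estimators=50, max_depth=depth, random_state=0
        ),
    }
    for name, model in models.items():
        mean_cv = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2").mean()
        results[name].append(mean_cv)

for name, scores in results.items():
    best_depth = depths[int(np.argmax(scores))]
    print(f"{name}: best depth = {best_depth}, CV R² by depth = {np.round(scores, 3)}")
```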


In [None]:
#create model comparison visualization
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

axes[0].plot(rf_depths, rf_train_scores, 'b-o', label='Training R²', linewidth=2, markersize=5)
axes[0].plot(rf_depths, rf_cv_scores, 'r-s', label='CV R²', linewidth=2, markersize=5)
axes[0].axvline(x=rf_best_depth, color='g', linestyle='--', alpha=0.7, 
                label=f'Best: depth={rf_best_depth}')
axes[0].set_xlabel('Max Depth', fontsize=12)
axes[0].set_ylabel('R² Score', fontsize=12)
axes[0].set_title('Random Forest Performance', fontsize=13, fontweight='bold')
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)

axes[1].plot(xgb_depths, xgb_train_scores, 'b-o', label='Training R²', linewidth=2, markersize=5)
axes[1].plot(xgb_depths, xgb_cv_scores, 'r-s', label='CV R²', linewidth=2, markersize=5)
axes[1].axvline(x=xgb_best_depth, color='g', linestyle='--', alpha=0.7, 
                label=f'Best: depth={xgb_best_depth}')
axes[1].set_xlabel('Max Depth', fontsize=12)
axes[1].set_ylabel('R² Score', fontsize=12)
axes[1].set_title('XGBoost Performance', fontsize=13, fontweight='bold')
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('model_comparison.png', dpi=300, bbox_inches='tight')
plt.show()


In [None]:
#display final comparison table
comparison_df = pd.DataFrame({
    'Metric': ['Best Depth', 'Training R²', 'CV R²', 'Test R²'],
    'Random Forest': [
        rf_best_depth,
        rf_train_scores[rf_best_idx],
        rf_best_cv_score,
        rf_test_score
    ],
    'XGBoost': [
        xgb_best_depth,
        xgb_train_scores[xgb_best_idx],
        xgb_best_cv_score,
        xgb_test_score
    ]
})

#show
comparison_df

## Key Insights

### Performance
1. **XGBoost achieves slightly better performance**
   - CV R²: 0.7462 vs Random Forest's 0.7375
   - Test R²: 0.7507 vs Random Forest's 0.7333

2. **XGBoost is far more efficient**
   - Needs only depth 2 vs Random Forest's depth 20
   - Shallower trees generally mean faster training and prediction (not timed in this notebook)
   - Much smaller model size

3. **XGBoost shows better generalization**
   - Smaller gap between training and test scores
   - Test performance actually exceeds CV (good sign)

4. **Random Forest shows classic overfitting**
   - Large gap between training (0.96) and test (0.73)
   - Still improving at maximum tested depth

### Algorithm Characteristics

**Random Forest (Bagging):**
- More stable and less prone to overfitting
- Requires more complexity to achieve good performance
- Good for datasets with high variance or noise
- Easier to tune (less sensitive to hyperparameters)

**XGBoost (Boosting):**
- More efficient and achieves better performance faster
- Requires careful tuning to avoid overfitting
- Excellent for maximizing predictive accuracy
- Better for datasets with clear patterns to learn

### Recommendation
For this election prediction task, **XGBoost is the superior choice**:
- Better accuracy with far simpler models
- More efficient computationally
- Better generalization to test data

The analysis demonstrates the fundamental tradeoff between bagging and boosting approaches in ensemble learning.
