# **AI TECH INSTITUTE** · *Intermediate AI & Data Science*
### Week 9 Session 2: Tree-Based Models for Regression
**Instructor:** Amir Charkhi | **Goal:** Master Decision Trees and Ensemble Methods

### Learning Objectives
- Understand decision tree fundamentals and splitting criteria
- Learn ensemble methods: Random Forest and Gradient Boosting
- Compare tree-based models with linear models
- Master feature importance interpretation
- Apply advanced hyperparameter tuning
- Understand bias-variance tradeoff in practice

---

## 1. Import Libraries

**What you need to do:**  
Import all necessary libraries for tree-based modeling.

**Required imports:**
- NumPy and Pandas for data handling
- Matplotlib and Seaborn for visualization
- Scikit-learn for tree models and evaluation
- XGBoost and LightGBM for advanced gradient boosting

**💡 Hint:** We'll need `DecisionTreeRegressor`, `RandomForestRegressor`, `GradientBoostingRegressor`, and optionally `XGBRegressor` and `LGBMRegressor`.

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Scikit-learn imports
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Advanced gradient boosting libraries
try:
    from xgboost import XGBRegressor
    print("✅ XGBoost available")
except ImportError:
    print("⚠️ XGBoost not installed. Install with: pip install xgboost")

try:
    from lightgbm import LGBMRegressor
    print("✅ LightGBM available")
except ImportError:
    print("⚠️ LightGBM not installed. Install with: pip install lightgbm")

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

print("\n✅ All core libraries imported successfully!")

---
## 2. Load and Prepare Dataset

**What you need to do:**  
Load the same Online Retail dataset we used for linear models.

**Theory:**  
We'll use the identical dataset to **fairly compare** tree-based models with linear models.

**Our Goal:** Predict **TotalSales** and compare with linear model performance.

**💡 Hint:** We'll reuse the same feature engineering pipeline for consistency.

In [None]:
# Load the Online Retail dataset
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx'

print("📥 Loading Online Retail dataset from UCI...")
print("This may take a minute...\n")

df_raw = pd.read_excel(url)

print("✅ Dataset loaded successfully!")
print(f"📊 Shape: {df_raw.shape[0]:,} rows × {df_raw.shape[1]} columns")

---
## 3. Data Cleaning & Feature Engineering

**What you need to do:**  
Apply the same preprocessing steps as the linear models notebook.

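**Note:** Tree-based models have different characteristics:
- ✅ Don't require feature scaling (splits depend only on the ordering of feature values)
- ✅ Handle non-linear relationships naturally
- ✅ Automatically capture feature interactions
- ✅ Robust to outliers in the features (outliers in the target still influence squared-error splits and leaf means)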
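
The scale-invariance claim can be checked on a small synthetic example. This is a minimal sketch, not part of the course pipeline: the data, the factor of 1000, and the variable names are all made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): two integer-valued features
X = rng.integers(0, 100, size=(300, 2)).astype(float)
y = 2.0 * X[:, 0] + 30.0 * (X[:, 1] > 50) + rng.normal(0, 1, size=300)

# Rescale the features, e.g. a change of units from kilo-units to units
X_scaled = X * 1000.0

tree_raw = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
tree_scaled = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_scaled, y)

# The thresholds move with the scale, but the partition of the samples does not
pred_raw = tree_raw.predict(X)
pred_scaled = tree_scaled.predict(X_scaled)
same_predictions = np.allclose(pred_raw, pred_scaled)
print(f"Identical predictions after rescaling: {same_predictions}")
```

A linear model with regularization would generally change under the same rescaling, which is why the linear models notebook needed a scaler and this one does not.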

In [None]:
# Data Cleaning
print("🧹 Cleaning data...\n")

df = df_raw.copy()

# Remove missing CustomerIDs and cancellations
df = df.dropna(subset=['CustomerID'])
df = df[~df['InvoiceNo'].astype(str).str.startswith('C')]
df = df[(df['Quantity'] > 0) & (df['UnitPrice'] > 0)]

# Create target variable
df['TotalSales'] = df['Quantity'] * df['UnitPrice']

# Extract time-based features
df['Year'] = df['InvoiceDate'].dt.year
df['Month'] = df['InvoiceDate'].dt.month
df['DayOfWeek'] = df['InvoiceDate'].dt.dayofweek
df['Hour'] = df['InvoiceDate'].dt.hour

print(f"✅ Cleaned dataset: {df.shape[0]:,} transactions")

In [None]:
# Feature Engineering at Invoice Level
print("🔨 Engineering features...\n")

invoice_features = df.groupby('InvoiceNo').agg({
    'TotalSales': 'sum',
    'Quantity': 'sum',
    'UnitPrice': 'mean',
    'StockCode': 'nunique',
    'CustomerID': 'first',
    'Country': 'first',
    'Year': 'first',
    'Month': 'first',
    'DayOfWeek': 'first',
    'Hour': 'first'
}).reset_index()

invoice_features.rename(columns={
    'Quantity': 'TotalItems',
    'UnitPrice': 'AvgItemPrice',
    'StockCode': 'NumUniqueProducts'
}, inplace=True)

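# Note: AvgPricePerItem is computed from the target, and TotalItems × AvgPricePerItem equals TotalSales exactly,
# so a model given both features can reconstruct the target (target leakage); read the scores with that in mind
invoice_features['AvgPricePerItem'] = invoice_features['TotalSales'] / invoice_features['TotalItems']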

# Country encoding
top_countries = invoice_features['Country'].value_counts().head(5).index.tolist()
invoice_features['Country_Group'] = invoice_features['Country'].apply(
    lambda x: x if x in top_countries else 'Other'
)
country_dummies = pd.get_dummies(invoice_features['Country_Group'], prefix='Country', drop_first=True)
invoice_features = pd.concat([invoice_features, country_dummies], axis=1)

invoice_features['IsWeekend'] = (invoice_features['DayOfWeek'] >= 5).astype(int)

print(f"✅ Created {invoice_features.shape[0]:,} invoice-level samples")
print(f"📊 Total features: {invoice_features.shape[1]} columns")

---
## 4. Train-Validation-Test Split

**⚠️ CRITICAL: Same split strategy as linear models for fair comparison**

**What you need to do:**  
Split data: 60% train, 20% validation, 20% test

In [None]:
# Select features - EXCLUDE original string columns
feature_cols = [
    'TotalItems', 'AvgItemPrice', 'NumUniqueProducts', 
    'AvgPricePerItem', 'Year', 'Month', 'DayOfWeek', 
    'Hour', 'IsWeekend'
]

# Add country dummy variables (Country_Group also starts with 'Country_' but holds strings, so skip it)
country_cols = [
    col for col in invoice_features.columns
    if col.startswith('Country_') and col != 'Country_Group'
]
feature_cols.extend(country_cols)

# Create feature matrix - Country and Country_Group are excluded because they are not in feature_cols
X = invoice_features[feature_cols].copy()
y = invoice_features['TotalSales'].copy()

print(f"🎯 Features: {len(feature_cols)} columns")
print(f"📊 X shape: {X.shape}, y shape: {y.shape}")

# Verify all columns are numeric (get_dummies may return boolean columns, which are accepted here)
print(f"\n✅ All features are numeric: {X.select_dtypes(include=[np.number, 'bool']).shape[1] == X.shape[1]}")

In [None]:
# Split data
print("✂️ Splitting data...\n")

X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 0.25 of the remaining 80% gives 20% of the full dataset for validation
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)

print(f"📊 Training:   {X_train.shape[0]:>6,} samples ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"📊 Validation: {X_val.shape[0]:>6,} samples ({X_val.shape[0]/len(X)*100:.1f}%)")
print(f"📊 Test:       {X_test.shape[0]:>6,} samples ({X_test.shape[0]/len(X)*100:.1f}%)")
print("\n🔒 Test set locked until final evaluation!")

print("\n💡 Note: Tree-based models DON'T require feature scaling!")

In [None]:
# 🔧 FIX: Remove string column and convert booleans to integers
print("🔧 Fixing data types...\n")

# Drop the Country_Group column if it exists
if 'Country_Group' in X_train.columns:
    X_train = X_train.drop(columns=['Country_Group'])
    X_val = X_val.drop(columns=['Country_Group'])
    X_test = X_test.drop(columns=['Country_Group'])
    print("✅ Dropped Country_Group column")

# Convert boolean columns to int
bool_cols = X_train.select_dtypes(include=['bool']).columns.tolist()
if bool_cols:
    X_train[bool_cols] = X_train[bool_cols].astype(int)
    X_val[bool_cols] = X_val[bool_cols].astype(int)
    X_test[bool_cols] = X_test[bool_cols].astype(int)
    print(f"✅ Converted {len(bool_cols)} boolean columns to int")

# Verify fix
print("\n✅ Final verification:")
print(f"   X_train shape: {X_train.shape}")
print(f"   All numeric: {X_train.select_dtypes(include=[np.number]).shape[1] == X_train.shape[1]}")
print(f"   No objects: {len(X_train.select_dtypes(include=['object']).columns) == 0}")
print(f"   No booleans: {len(X_train.select_dtypes(include=['bool']).columns) == 0}")

print("\n📊 X_train dtypes after fix:")
print(X_train.dtypes)

---
## 5. Quick EDA Summary

**What you need to do:**  
Brief exploration of training data (detailed EDA was done in linear models notebook).

In [None]:
# Quick summary
print("📊 Training Set Summary:")
print("="*80)
print(X_train.describe())
print("\n🎯 Target Variable:")
print(y_train.describe())

---
## 6. Model 1: Decision Tree Regressor

**📚 Theory:**  
Decision Trees make predictions by learning decision rules from features. They recursively partition the feature space into regions.

**How Decision Trees Work:**
1. Start at root node with all training data
2. Find best feature and split point that minimizes error
3. Create child nodes and repeat recursively
4. Stop when reaching stopping criteria (max_depth, min_samples, etc.)
5. Predict: average target value in leaf node

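**Splitting Criteria for Regression:**
- **Mean Squared Error (MSE):** Most common (`criterion='squared_error'`, the scikit-learn default); minimizes squared differences
- **Mean Absolute Error (MAE):** More robust to outliers (`criterion='absolute_error'`), but slower to train

**Mathematical Form:**
$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2$$

Where $\bar{y}$ is the mean target value in that node and $n$ is the number of samples in it. A candidate split is scored by the sample-weighted average of the MSE in its two child nodes, and the split with the lowest score is chosen.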
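
To make the split search concrete, here is a minimal sketch for a single feature. The function `best_split` and the toy arrays are hypothetical and written for illustration; scikit-learn's actual implementation is optimized and differs in detail.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, weighted_mse) of the best single split on feature x."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_score = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # identical values cannot be separated by a threshold
        left, right = y_sorted[:i], y_sorted[i:]
        # Sample-weighted average of the child MSEs (each child predicts its own mean)
        score = (len(left) * left.var() + len(right) * right.var()) / len(y_sorted)
        if score < best_score:
            best_threshold = (x_sorted[i - 1] + x_sorted[i]) / 2  # midpoint between neighbors
            best_score = score
    return best_threshold, best_score

# Toy data (made up): two clear groups of target values
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([5.0, 6.0, 5.5, 20.0, 21.0, 19.5])

threshold, score = best_split(x, y)
print(f"Best threshold: {threshold}, weighted child MSE: {score:.4f}, parent MSE: {y.var():.4f}")
```

A full tree repeats this search over every feature at every node, then recurses into the two children until a stopping criterion is met.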

**Key Hyperparameters:**
- **max_depth:** Maximum depth of tree (prevents overfitting)
- **min_samples_split:** Minimum samples required to split node
- **min_samples_leaf:** Minimum samples required in leaf node
- **max_features:** Number of features to consider for best split
- **min_impurity_decrease:** Minimum decrease in impurity required to split

**Pros:**
- Easy to understand and visualize
- Handles non-linear relationships naturally
- No feature scaling needed
- Captures feature interactions automatically
- Works with numerical features directly; in scikit-learn, categorical features must be encoded first (e.g., one-hot encoding)
- Fast predictions

**Cons:**
- **Prone to overfitting** (high variance)
- Unstable: small data changes → different trees
- Can create biased trees with imbalanced data
- Cannot extrapolate: predictions are leaf averages, so they never fall outside the range of the training targets
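
The extrapolation limit is easy to see on a toy problem. The sketch below uses a made-up, perfectly linear target; the arrays and names are illustrative only.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data (made up): y = 3x on the range 0 <= x < 10
X_train = np.arange(0, 10, 0.5).reshape(-1, 1)
y_train = 3.0 * X_train.ravel()

tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# Query points far outside the training range
X_new = np.array([[20.0], [100.0]])
pred_new = tree.predict(X_new)

print(f"Largest training target: {y_train.max()}")
print(f"Predictions at x=20 and x=100: {pred_new}")
```

Both query points fall into the right-most leaf, so the tree returns that leaf's average instead of continuing the linear trend. The same holds for Random Forest and Gradient Boosting built on such trees.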

**When to Use:**
- As a baseline for tree-based models
- When interpretability is critical
- When you have non-linear relationships
- As a building block for ensembles (Random Forest, Boosting)

**Common Issue: Overfitting**
- Unrestricted trees can memorize training data (training R² close to 1.0)
- This leads to poor generalization on new data
- Solution: Constrain tree growth with hyperparameters
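
The effect of constraining growth can be sketched on synthetic data before running it on the retail dataset. Everything below (the data-generating formula, the chosen `max_depth` and `min_samples_leaf`) is an assumption made for illustration, not a recommended setting.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)

# Synthetic noisy regression problem (illustrative only)
X = rng.uniform(0, 10, size=(500, 3))
y = 10 * np.sin(X[:, 0]) + X[:, 1] + rng.normal(0, 2, size=500)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=42)

full_tree = DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr)
small_tree = DecisionTreeRegressor(
    max_depth=3, min_samples_leaf=10, random_state=42
).fit(X_tr, y_tr)

for name, model in [("Unrestricted", full_tree), ("Constrained", small_tree)]:
    print(
        f"{name:<13} leaves={model.get_n_leaves():>4}  "
        f"train R2={model.score(X_tr, y_tr):.3f}  "
        f"val R2={model.score(X_va, y_va):.3f}"
    )
```

The unrestricted tree fits the training set essentially perfectly because every leaf ends up holding a single sample; the gap between its training and validation scores is the signal to look for.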

**📖 References:**
- [Scikit-learn: Decision Trees](https://scikit-learn.org/stable/modules/tree.html)
- [ISL Book - Chapter 8: Tree-Based Methods](https://www.statlearning.com/)

---

In [None]:
# Train Decision Tree (unrestricted)
print("🌳 Training Decision Tree Regressor (Unrestricted)...\n")

dt_model = DecisionTreeRegressor(random_state=42)
dt_model.fit(X_train, y_train)

print("✅ Decision Tree trained!")
print("\n📊 Tree Structure:")
print(f"   Max depth reached: {dt_model.get_depth()}")
print(f"   Number of leaves: {dt_model.get_n_leaves()}")
print(f"   Total nodes: {dt_model.tree_.node_count}")

In [None]:
# Evaluate on training and validation sets
y_pred_train_dt = dt_model.predict(X_train)
y_pred_val_dt = dt_model.predict(X_val)

# Training metrics
dt_train_rmse = np.sqrt(mean_squared_error(y_train, y_pred_train_dt))
dt_train_r2 = r2_score(y_train, y_pred_train_dt)

# Validation metrics
dt_val_rmse = np.sqrt(mean_squared_error(y_val, y_pred_val_dt))
dt_val_mae = mean_absolute_error(y_val, y_pred_val_dt)
dt_val_r2 = r2_score(y_val, y_pred_val_dt)

print("📊 Decision Tree Performance:")
print("="*70)
print(f"{'Metric':<30} {'Training':>15} {'Validation':>15}")
print("="*70)
print(f"{'RMSE ($)':<30} ${dt_train_rmse:>14,.2f} ${dt_val_rmse:>14,.2f}")
print(f"{'MAE ($)':<30} {'':>15} ${dt_val_mae:>14,.2f}")
print(f"{'R² Score':<30} {dt_train_r2:>15.4f} {dt_val_r2:>15.4f}")
print("="*70)

if dt_train_r2 > 0.95 and dt_val_r2 < 0.80:
    print("\n⚠️ WARNING: High training R² but lower validation R² indicates OVERFITTING!")
    print("The tree has memorized the training data.")
    print("Solution: Constrain tree growth with hyperparameters.")

### Feature Importance

In [None]:
# Feature importance
dt_feature_importance = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': dt_model.feature_importances_
}).sort_values('Importance', ascending=False)

print("\n🎯 Decision Tree Feature Importance (Top 10):")
print("="*60)
print(dt_feature_importance.head(10).to_string(index=False))

# Visualize
plt.figure(figsize=(10, 6))
top_features = dt_feature_importance.head(10)
plt.barh(top_features['Feature'], top_features['Importance'])
plt.xlabel('Feature Importance', fontsize=11)
plt.title('Decision Tree: Top 10 Feature Importance', fontsize=12, pad=15)
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

print("\n💡 Impurity-based importance = share of the training-set error reduction from splits on that feature (it can favor high-cardinality features)")
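
Because `feature_importances_` is computed from the training data, it can be useful to cross-check it with permutation importance on held-out data. The sketch below uses `sklearn.inspection.permutation_importance` on synthetic data; the dataset and the tree settings are assumptions for illustration, and on the retail data the call would take the fitted model with `X_val` and `y_val` instead.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic data (illustrative): column 0 is strong, column 1 is weak, column 2 is pure noise
X = rng.normal(size=(400, 3))
y = 5.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(0, 0.5, size=400)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

model = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on the validation set and measure the drop in R²
result = permutation_importance(model, X_va, y_va, n_repeats=10, random_state=0)
perm_importance = result.importances_mean

for idx in np.argsort(perm_importance)[::-1]:
    print(f"feature_{idx}: {perm_importance[idx]:.4f} (+/- {result.importances_std[idx]:.4f})")
```

A feature that ranks high on impurity-based importance but near zero here is worth a second look, since the tree may be splitting on it without gaining predictive value on new data.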

### Cross-Validation

In [None]:
# Cross-validation
print("🔄 Performing 5-Fold Cross-Validation...\n")

cv_scores_dt = cross_val_score(
    dt_model, X_train, y_train,
    cv=5,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1
)
cv_scores_dt = -cv_scores_dt

print("📊 Cross-Validation RMSE:")
print("="*60)
for fold, score in enumerate(cv_scores_dt, 1):
    print(f"Fold {fold}: ${score:,.2f}")
print("="*60)
print(f"Mean CV RMSE:   ${cv_scores_dt.mean():,.2f} (± ${cv_scores_dt.std():,.2f})")
print(f"Validation RMSE: ${dt_val_rmse:,.2f}")

if cv_scores_dt.std() > cv_scores_dt.mean() * 0.2:
    print("\n⚠️ High variance in CV scores suggests model instability.")

### Hyperparameter Tuning: Decision Tree

In [None]:
# Hyperparameter tuning
print("🎯 Tuning Decision Tree hyperparameters...\n")

param_grid_dt = {
    'max_depth': [3, 5, 7, 10, 15, None],
    'min_samples_split': [2, 5, 10, 20],
    'min_samples_leaf': [1, 2, 4, 8],
    'max_features': ['sqrt', 'log2', None]
}

dt_grid = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid_dt,
    cv=5,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1,
    verbose=1
)

dt_grid.fit(X_train, y_train)

print(f"\n✅ Best parameters: {dt_grid.best_params_}")
print(f"📊 Best CV RMSE: ${-dt_grid.best_score_:,.2f}")

In [None]:
# Evaluate tuned model
best_dt = dt_grid.best_estimator_
y_pred_train_dt_tuned = best_dt.predict(X_train)
y_pred_val_dt_tuned = best_dt.predict(X_val)

dt_tuned_train_rmse = np.sqrt(mean_squared_error(y_train, y_pred_train_dt_tuned))
dt_tuned_train_r2 = r2_score(y_train, y_pred_train_dt_tuned)
dt_tuned_val_rmse = np.sqrt(mean_squared_error(y_val, y_pred_val_dt_tuned))
dt_tuned_val_mae = mean_absolute_error(y_val, y_pred_val_dt_tuned)
dt_tuned_val_r2 = r2_score(y_val, y_pred_val_dt_tuned)

print("📊 Decision Tree (Tuned) Performance:")
print("="*70)
print(f"{'Metric':<30} {'Training':>15} {'Validation':>15}")
print("="*70)
print(f"{'RMSE ($)':<30} ${dt_tuned_train_rmse:>14,.2f} ${dt_tuned_val_rmse:>14,.2f}")
print(f"{'MAE ($)':<30} {'':>15} ${dt_tuned_val_mae:>14,.2f}")
print(f"{'R² Score':<30} {dt_tuned_train_r2:>15.4f} {dt_tuned_val_r2:>15.4f}")
print("="*70)
print("\n📊 Tree Structure (Tuned):")
print(f"   Max depth: {best_dt.get_depth()}")
print(f"   Number of leaves: {best_dt.get_n_leaves()}")

improvement = ((dt_val_rmse - dt_tuned_val_rmse) / dt_val_rmse) * 100
print(f"\n💡 Improvement over default: {improvement:.1f}% reduction in RMSE")

---
## 7. Model 2: Random Forest Regressor

**📚 Theory:**  
Random Forest is an **ensemble method** that combines multiple decision trees to create a more robust and accurate model.

**How Random Forest Works:**
1. Create multiple decision trees (e.g., 100 trees)
2. For each tree:
   - **Bootstrap sampling:** Randomly sample training data with replacement
   - **Random feature subset:** At each split, consider only random subset of features
3. Each tree makes predictions independently
4. Final prediction = **average** of all tree predictions

**Key Concepts:**
- **Bagging (Bootstrap Aggregating):** Reduces variance by averaging predictions
- **Random Feature Selection:** Each split considers a different random subset of features → decorrelates trees
- **Out-of-Bag (OOB) Samples:** ~37% of data not used in each tree → free validation set
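
The ~37% figure follows from sampling with replacement: the chance that a given row is never drawn in $n$ draws is $(1 - 1/n)^n$, which approaches $1/e \approx 0.368$. A minimal simulation, with made-up sample and tree counts:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000   # illustrative training-set size
n_trees = 20         # illustrative number of bootstrap samples

oob_fractions = []
for _ in range(n_trees):
    # Bootstrap: draw n_samples row indices with replacement
    idx = rng.integers(0, n_samples, size=n_samples)
    in_bag = np.unique(idx).size
    oob_fractions.append(1 - in_bag / n_samples)

mean_oob = float(np.mean(oob_fractions))
theory = (1 - 1 / n_samples) ** n_samples

print(f"Simulated OOB fraction: {mean_oob:.4f}")
print(f"Theoretical (1 - 1/n)^n: {theory:.4f}")
```

In scikit-learn, passing `oob_score=True` to `RandomForestRegressor` uses these left-out rows to compute an R² estimate, available as `oob_score_` after fitting.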

**Mathematical Form:**
$$\hat{y}_{RF} = \frac{1}{B} \sum_{b=1}^{B} \hat{y}_b(x)$$

Where $B$ is the number of trees and $\hat{y}_b$ is prediction from tree $b$.
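
The averaging formula can be verified directly against a fitted forest. This sketch relies on the `estimators_` attribute of scikit-learn's `RandomForestRegressor`; the synthetic data and the choice of 25 trees are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Synthetic data (illustrative only)
X = rng.uniform(-3, 3, size=(300, 4))
y = X[:, 0] ** 2 + 2 * X[:, 1] + rng.normal(0, 0.3, size=300)

forest = RandomForestRegressor(n_estimators=25, random_state=1).fit(X, y)

# Predictions of each individual tree, stacked into shape (B, n_samples)
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

matches = np.allclose(manual_average, forest.predict(X))
print(f"Forest prediction equals the mean over {per_tree.shape[0]} trees: {matches}")

# The spread across trees gives a rough, uncalibrated sense of prediction uncertainty
print(f"Std across trees for the first sample: {per_tree[:, 0].std():.3f}")
```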

**Key Hyperparameters:**
- **n_estimators:** Number of trees (more trees give more stable predictions with diminishing returns, but train and predict more slowly)
- **max_depth:** Maximum depth of each tree
- **min_samples_split:** Minimum samples to split node
- **min_samples_leaf:** Minimum samples in leaf
- **max_features:** Features to consider at each split ('sqrt', 'log2', or number)
- **max_samples:** Size of bootstrap sample (None = 100%)

**Typical Defaults (often work well):**
- n_estimators: 100-500
- max_features: 'sqrt' for classification, 1/3 of features for regression (a classic recommendation; scikit-learn's `RandomForestRegressor` defaults to all features)
- max_depth: None (grow until pure leaves or min_samples_leaf)

**Pros:**
- **Reduces overfitting** compared to single decision tree
- More stable and robust than single trees
- Handles high-dimensional data well
- Provides feature importance
- Works with missing values (in some implementations)
- Can estimate prediction uncertainty
- Parallelizable (trees are independent)

**Cons:**
- Less interpretable than single tree
- Slower training and prediction than single tree
- Larger model size (memory)
- Can still overfit with noisy data
- Not great for extrapolation

**When to Use:**
- When single decision tree overfits
- When you need robust, accurate predictions
- When you have sufficient computational resources
- As a strong baseline for tabular data
- When you need feature importance rankings

**Why Random Forest Works:**
- **Wisdom of Crowds:** Averaging reduces variance
- **Diversity:** Random sampling creates diverse trees
- **Bias-Variance Tradeoff:** Trades small increase in bias for large decrease in variance
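
The variance argument can be written out under a simplifying assumption: suppose the $B$ tree predictions are identically distributed, each with variance $\sigma^2$ and pairwise correlation $\rho$. The variance of their average is then

$$\text{Var}\left(\frac{1}{B} \sum_{b=1}^{B} \hat{y}_b(x)\right) = \rho \sigma^2 + \frac{1 - \rho}{B} \sigma^2$$

Adding trees shrinks only the second term, so once $B$ is large the remaining variance is governed by $\rho$. This is why Random Forest randomizes the features considered at each split: less correlated trees lower the floor that averaging alone cannot remove.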

**📖 References:**
- [Scikit-learn: Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)
- [Original Paper: Breiman (2001)](https://link.springer.com/article/10.1023/A:1010933404324)
- [ISL Book - Chapter 8: Random Forests](https://www.statlearning.com/)

---

In [None]:
# Train Random Forest
print("🌲🌲🌲 Training Random Forest Regressor...\n")

rf_model = RandomForestRegressor(
    n_estimators=100,
    random_state=42,
    n_jobs=-1,
    verbose=1
)
rf_model.fit(X_train, y_train)

print("\n✅ Random Forest trained with 100 trees!")

In [None]:
# Evaluate Random Forest
y_pred_train_rf = rf_model.predict(X_train)
y_pred_val_rf = rf_model.predict(X_val)

rf_train_rmse = np.sqrt(mean_squared_error(y_train, y_pred_train_rf))
rf_train_r2 = r2_score(y_train, y_pred_train_rf)
rf_val_rmse = np.sqrt(mean_squared_error(y_val, y_pred_val_rf))
rf_val_mae = mean_absolute_error(y_val, y_pred_val_rf)
rf_val_r2 = r2_score(y_val, y_pred_val_rf)

print("📊 Random Forest Performance:")
print("="*70)
print(f"{'Metric':<30} {'Training':>15} {'Validation':>15}")
print("="*70)
print(f"{'RMSE ($)':<30} ${rf_train_rmse:>14,.2f} ${rf_val_rmse:>14,.2f}")
print(f"{'MAE ($)':<30} {'':>15} ${rf_val_mae:>14,.2f}")
print(f"{'R² Score':<30} {rf_train_r2:>15.4f} {rf_val_r2:>15.4f}")
print("="*70)

# Compare the train/validation gap with the single tree instead of assuming it shrank
rf_gap = rf_val_rmse - rf_train_rmse
dt_gap = dt_val_rmse - dt_train_rmse
print(f"\n💡 Train-to-validation RMSE gap: ${rf_gap:,.2f} (Random Forest) vs ${dt_gap:,.2f} (single tree)")

### Feature Importance

In [None]:
# Feature importance
rf_feature_importance = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': rf_model.feature_importances_
}).sort_values('Importance', ascending=False)

print("🎯 Random Forest Feature Importance (Top 10):")
print("="*60)
print(rf_feature_importance.head(10).to_string(index=False))

# Visualize
plt.figure(figsize=(10, 6))
top_features_rf = rf_feature_importance.head(10)
plt.barh(top_features_rf['Feature'], top_features_rf['Importance'], color='forestgreen')
plt.xlabel('Feature Importance', fontsize=11)
plt.title('Random Forest: Top 10 Feature Importance', fontsize=12, pad=15)
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

### Cross-Validation

In [None]:
# Cross-validation
print("🔄 Performing 5-Fold Cross-Validation on Random Forest...\n")
print("⏳ This may take a few minutes...\n")

cv_scores_rf = cross_val_score(
    rf_model, X_train, y_train,
    cv=5,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1
)
cv_scores_rf = -cv_scores_rf

print("📊 Cross-Validation RMSE:")
print("="*60)
for fold, score in enumerate(cv_scores_rf, 1):
    print(f"Fold {fold}: ${score:,.2f}")
print("="*60)
print(f"Mean CV RMSE: ${cv_scores_rf.mean():,.2f} (± ${cv_scores_rf.std():,.2f})")

### Hyperparameter Tuning: Random Forest

In [None]:
# Hyperparameter tuning with RandomizedSearchCV (faster than GridSearch)
print("🎯 Tuning Random Forest hyperparameters...\n")
print("Using RandomizedSearchCV for faster search...\n")

param_dist_rf = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 15, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', 0.5],
    'max_samples': [0.7, 0.8, 0.9, None]
}

rf_random = RandomizedSearchCV(
    RandomForestRegressor(random_state=42, n_jobs=-1),
    param_dist_rf,
    n_iter=20,  # Try 20 random combinations
    cv=5,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1,
    verbose=1,
    random_state=42
)

rf_random.fit(X_train, y_train)

print(f"\n✅ Best parameters: {rf_random.best_params_}")
print(f"📊 Best CV RMSE: ${-rf_random.best_score_:,.2f}")

In [None]:
# Evaluate tuned Random Forest
best_rf = rf_random.best_estimator_
y_pred_train_rf_tuned = best_rf.predict(X_train)
y_pred_val_rf_tuned = best_rf.predict(X_val)

rf_tuned_train_rmse = np.sqrt(mean_squared_error(y_train, y_pred_train_rf_tuned))
rf_tuned_train_r2 = r2_score(y_train, y_pred_train_rf_tuned)
rf_tuned_val_rmse = np.sqrt(mean_squared_error(y_val, y_pred_val_rf_tuned))
rf_tuned_val_mae = mean_absolute_error(y_val, y_pred_val_rf_tuned)
rf_tuned_val_r2 = r2_score(y_val, y_pred_val_rf_tuned)

print("📊 Random Forest (Tuned) Performance:")
print("="*70)
print(f"{'Metric':<30} {'Training':>15} {'Validation':>15}")
print("="*70)
print(f"{'RMSE ($)':<30} ${rf_tuned_train_rmse:>14,.2f} ${rf_tuned_val_rmse:>14,.2f}")
print(f"{'MAE ($)':<30} {'':>15} ${rf_tuned_val_mae:>14,.2f}")
print(f"{'R² Score':<30} {rf_tuned_train_r2:>15.4f} {rf_tuned_val_r2:>15.4f}")
print("="*70)

---
## 8. Model 3: Gradient Boosting Regressor

**📚 Theory:**  
Gradient Boosting builds trees **sequentially**, where each new tree corrects errors made by previous trees.

**How Gradient Boosting Works:**
1. Start with simple model (constant prediction = mean)
2. Calculate residuals (errors) from current model
3. Fit new tree to predict these residuals
4. Add new tree's predictions (scaled by learning rate) to ensemble
5. Repeat steps 2-4 for N iterations
6. Final prediction = initial prediction + sum of all tree predictions (each scaled by the learning rate)

**Key Difference from Random Forest:**
- **Random Forest:** Trees built independently in parallel (bagging)
- **Gradient Boosting:** Trees built sequentially, each correcting previous errors (boosting)

**Mathematical Form:**
$$F_M(x) = F_0(x) + \sum_{m=1}^{M} \nu \cdot h_m(x)$$

Where:
- $F_M(x)$ = Final prediction after M iterations
- $F_0(x)$ = Initial prediction (usually mean)
- $\nu$ = Learning rate (shrinkage)
- $h_m(x)$ = Prediction from tree $m$ fitted to residuals
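
The formula can be implemented by hand in a few lines for squared-error loss, where the quantity each tree fits is simply the current residual. This is a teaching sketch on synthetic data, not how `GradientBoostingRegressor` is implemented internally; the data, `n_rounds`, and the helper `boosted_predict` are made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic data (illustrative only)
X = rng.uniform(0, 10, size=(300, 1))
y = 5 * np.sin(X[:, 0]) + rng.normal(0, 0.5, size=300)

learning_rate = 0.1   # nu in the formula
n_rounds = 50         # M in the formula

initial_prediction = y.mean()                       # F_0(x)
prediction = np.full(y.shape, initial_prediction)
trees = []
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(n_rounds):
    residuals = y - prediction                      # errors of the current model
    tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, residuals)
    prediction = prediction + learning_rate * tree.predict(X)   # F_m = F_{m-1} + nu * h_m
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))

def boosted_predict(X_new):
    """F_M(x) = F_0 + sum over m of nu * h_m(x)."""
    out = np.full(X_new.shape[0], initial_prediction)
    for fitted_tree in trees:
        out += learning_rate * fitted_tree.predict(X_new)
    return out

print(f"Training MSE: {mse_history[0]:.3f} -> {mse_history[-1]:.3f} after {n_rounds} rounds")
```

With squared-error loss and a learning rate between 0 and 1, the training error cannot increase from one round to the next, which is also why validation error, not training error, has to decide when to stop.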

**Key Hyperparameters:**
- **n_estimators:** Number of boosting iterations (trees)
- **learning_rate:** Shrinkage factor (typically 0.01-0.3)
  - Lower learning rate needs more trees but often performs better
  - Rule of thumb: learning_rate × n_estimators ≈ constant
- **max_depth:** Tree depth (typically 3-8 for boosting, shallow trees work well)
- **min_samples_split / min_samples_leaf:** Control tree complexity
- **subsample:** Fraction of samples for each tree (0-1, introduces randomness)
- **max_features:** Features to consider at each split

**Typical Good Defaults:**
- n_estimators: 100-1000 (more = better, but watch overfitting)
- learning_rate: 0.1 (lower if using many trees)
- max_depth: 3-5 (shallow trees work well)
- subsample: 0.8 (adds randomness, reduces overfitting)
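
One practical way to "watch overfitting" is to track validation error after each boosting stage with `staged_predict` and see where it bottoms out. A minimal sketch on synthetic data; the dataset and the settings below are assumptions for illustration, not tuned values.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)

# Synthetic data (illustrative only)
X = rng.uniform(0, 10, size=(600, 3))
y = 5 * np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 1.0, size=600)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=7)

gbr = GradientBoostingRegressor(
    n_estimators=300, learning_rate=0.1, max_depth=3, subsample=0.8, random_state=7
)
gbr.fit(X_tr, y_tr)

# staged_predict yields the ensemble's prediction after 1, 2, ..., n_estimators trees
val_rmse_per_stage = [
    np.sqrt(mean_squared_error(y_va, stage_pred))
    for stage_pred in gbr.staged_predict(X_va)
]
best_n_trees = int(np.argmin(val_rmse_per_stage)) + 1

print(f"Lowest validation RMSE {min(val_rmse_per_stage):.3f} at {best_n_trees} trees")
print(f"Validation RMSE with all 300 trees: {val_rmse_per_stage[-1]:.3f}")
```

If the best stage is well below `n_estimators`, the extra trees are not helping on new data; if it sits at the very end, more trees or a higher learning rate may be worth trying.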

**Pros:**
- **Often best performance** on structured/tabular data
- Captures complex non-linear relationships
- Handles feature interactions naturally
- Less prone to overfitting than deep trees (when tuned properly)
- Feature importance available

**Cons:**
- **Sequential training** (trees cannot be built in parallel as in Random Forest, because each one depends on the previous ones)
- Slower to train than Random Forest
- More hyperparameters to tune
- Can overfit if not careful (especially with high learning rate)
- Less interpretable than single tree or Random Forest

**When to Use:**
- When you want best possible accuracy
- When you have time for hyperparameter tuning
- When training time is not critical
- For Kaggle competitions and production systems

**Gradient Boosting Variants:**
- **Scikit-learn GradientBoosting:** Good baseline, well-tested
- **XGBoost:** Faster, more features, industry standard
- **LightGBM:** Very fast, memory efficient, handles large datasets
- **CatBoost:** Handles categorical features automatically

**📖 References:**
- [Scikit-learn: Gradient Boosting](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html)
- [Original Paper: Friedman (2001)](https://projecteuclid.org/euclid.aos/1013203451)
- [ISL Book - Chapter 8: Boosting](https://www.statlearning.com/)

---

In [None]:
# Train Gradient Boosting
print("🚀 Training Gradient Boosting Regressor...\n")

gb_model = GradientBoostingRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    random_state=42,
    verbose=1
)
gb_model.fit(X_train, y_train)

print("\n✅ Gradient Boosting trained with 100 trees!")

In [None]:
# Evaluate Gradient Boosting
y_pred_train_gb = gb_model.predict(X_train)
y_pred_val_gb = gb_model.predict(X_val)

gb_train_rmse = np.sqrt(mean_squared_error(y_train, y_pred_train_gb))
gb_train_r2 = r2_score(y_train, y_pred_train_gb)
gb_val_rmse = np.sqrt(mean_squared_error(y_val, y_pred_val_gb))
gb_val_mae = mean_absolute_error(y_val, y_pred_val_gb)
gb_val_r2 = r2_score(y_val, y_pred_val_gb)

print("📊 Gradient Boosting Performance:")
print("="*70)
print(f"{'Metric':<30} {'Training':>15} {'Validation':>15}")
print("="*70)
print(f"{'RMSE ($)':<30} ${gb_train_rmse:>14,.2f} ${gb_val_rmse:>14,.2f}")
print(f"{'MAE ($)':<30} {'':>15} ${gb_val_mae:>14,.2f}")
print(f"{'R² Score':<30} {gb_train_r2:>15.4f} {gb_val_r2:>15.4f}")
print("="*70)

### Feature Importance

In [None]:
# Feature importance
gb_feature_importance = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': gb_model.feature_importances_
}).sort_values('Importance', ascending=False)

print("🎯 Gradient Boosting Feature Importance (Top 10):")
print("="*60)
print(gb_feature_importance.head(10).to_string(index=False))

# Visualize
plt.figure(figsize=(10, 6))
top_features_gb = gb_feature_importance.head(10)
plt.barh(top_features_gb['Feature'], top_features_gb['Importance'], color='darkorange')
plt.xlabel('Feature Importance', fontsize=11)
plt.title('Gradient Boosting: Top 10 Feature Importance', fontsize=12, pad=15)
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

### Hyperparameter Tuning: Gradient Boosting

In [None]:
# Hyperparameter tuning
print("🎯 Tuning Gradient Boosting hyperparameters...\n")
print("⏳ This will take several minutes...\n")

param_grid_gb = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 5, 10],
    'subsample': [0.8, 1.0]
}

gb_random = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid_gb,
    n_iter=20,
    cv=5,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1,
    verbose=1,
    random_state=42
)

gb_random.fit(X_train, y_train)

print(f"\n✅ Best parameters: {gb_random.best_params_}")
print(f"📊 Best CV RMSE: ${-gb_random.best_score_:,.2f}")

In [None]:
# Evaluate tuned Gradient Boosting
best_gb = gb_random.best_estimator_
y_pred_train_gb_tuned = best_gb.predict(X_train)
y_pred_val_gb_tuned = best_gb.predict(X_val)

gb_tuned_train_rmse = np.sqrt(mean_squared_error(y_train, y_pred_train_gb_tuned))
gb_tuned_train_r2 = r2_score(y_train, y_pred_train_gb_tuned)
gb_tuned_val_rmse = np.sqrt(mean_squared_error(y_val, y_pred_val_gb_tuned))
gb_tuned_val_mae = mean_absolute_error(y_val, y_pred_val_gb_tuned)
gb_tuned_val_r2 = r2_score(y_val, y_pred_val_gb_tuned)

print("📊 Gradient Boosting (Tuned) Performance:")
print("="*70)
print(f"{'Metric':<30} {'Training':>15} {'Validation':>15}")
print("="*70)
print(f"{'RMSE ($)':<30} ${gb_tuned_train_rmse:>14,.2f} ${gb_tuned_val_rmse:>14,.2f}")
print(f"{'MAE ($)':<30} {'':>15} ${gb_tuned_val_mae:>14,.2f}")
print(f"{'R² Score':<30} {gb_tuned_train_r2:>15.4f} {gb_tuned_val_r2:>15.4f}")
print("="*70)

---
## 9. Model Comparison: Tree-Based Models

**What you need to do:**  
Compare all tree-based models to identify the best performer.

In [None]:
# Create comparison table
tree_comparison_df = pd.DataFrame({
    'Model': [
        'Decision Tree (default)',
        'Decision Tree (tuned)',
        'Random Forest (default)',
        'Random Forest (tuned)',
        'Gradient Boosting (default)',
        'Gradient Boosting (tuned)'
    ],
    'RMSE': [
        dt_val_rmse, dt_tuned_val_rmse,
        rf_val_rmse, rf_tuned_val_rmse,
        gb_val_rmse, gb_tuned_val_rmse
    ],
    'MAE': [
        dt_val_mae, dt_tuned_val_mae,
        rf_val_mae, rf_tuned_val_mae,
        gb_val_mae, gb_tuned_val_mae
    ],
    'R²': [
        dt_val_r2, dt_tuned_val_r2,
        rf_val_r2, rf_tuned_val_r2,
        gb_val_r2, gb_tuned_val_r2
    ]
})

tree_comparison_df = tree_comparison_df.sort_values('RMSE')

print("\n" + "="*80)
print("📊 TREE-BASED MODELS COMPARISON - VALIDATION SET PERFORMANCE")
print("="*80)
print(tree_comparison_df.to_string(index=False))
print("="*80)

best_tree_model_name = tree_comparison_df.iloc[0]['Model']
print(f"\n🏆 BEST TREE MODEL: {best_tree_model_name}")
print(f"   RMSE: ${tree_comparison_df.iloc[0]['RMSE']:,.2f}")
print(f"   R²: {tree_comparison_df.iloc[0]['R²']:.4f}")

In [None]:
# Visualize comparison
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# RMSE
axes[0].barh(tree_comparison_df['Model'], tree_comparison_df['RMSE'], color='steelblue')
axes[0].set_xlabel('RMSE ($)', fontsize=11)
axes[0].set_title('Tree Models: RMSE Comparison\n(Lower is Better)', fontsize=12, pad=15)
axes[0].invert_yaxis()

# MAE
axes[1].barh(tree_comparison_df['Model'], tree_comparison_df['MAE'], color='coral')
axes[1].set_xlabel('MAE ($)', fontsize=11)
axes[1].set_title('Tree Models: MAE Comparison\n(Lower is Better)', fontsize=12, pad=15)
axes[1].invert_yaxis()

# R²
axes[2].barh(tree_comparison_df['Model'], tree_comparison_df['R²'], color='seagreen')
axes[2].set_xlabel('R² Score', fontsize=11)
axes[2].set_title('Tree Models: R² Comparison\n(Higher is Better)', fontsize=12, pad=15)
axes[2].invert_yaxis()

plt.tight_layout()
plt.show()

---
## 10. Final Evaluation on Test Set

**⚠️ CRITICAL: Test set evaluation for best tree model**

**What you need to do:**  
Evaluate the best tree model on held-out test data.

In [None]:
# Select best model: map every name in the comparison table to its fitted estimator,
# so the model that gets evaluated always matches the name that gets reported
fitted_tree_models = {
    'Decision Tree (default)': dt_model,
    'Decision Tree (tuned)': best_dt,
    'Random Forest (default)': rf_model,
    'Random Forest (tuned)': best_rf,
    'Gradient Boosting (default)': gb_model,
    'Gradient Boosting (tuned)': best_gb
}
final_tree_model = fitted_tree_models[best_tree_model_name]

print(f"🏆 Selected Model: {best_tree_model_name}")
print("\n🔓 Unlocking test set for final evaluation...\n")

In [None]:
# Final test set evaluation
y_pred_test_tree = final_tree_model.predict(X_test)

test_rmse_tree = np.sqrt(mean_squared_error(y_test, y_pred_test_tree))
test_mae_tree = mean_absolute_error(y_test, y_pred_test_tree)
test_r2_tree = r2_score(y_test, y_pred_test_tree)

print("\n" + "="*80)
print(f"📊 FINAL TEST SET PERFORMANCE: {best_tree_model_name}")
print("="*80)
print(f"Root Mean Squared Error (RMSE): ${test_rmse_tree:>12,.2f}")
print(f"Mean Absolute Error (MAE):      ${test_mae_tree:>12,.2f}")
print(f"R² Score:                       {test_r2_tree:>12.4f}")
print("="*80)

# Compare to validation
val_rmse_tree = tree_comparison_df.iloc[0]['RMSE']
val_r2_tree = tree_comparison_df.iloc[0]['R²']

print("\n🔍 Validation vs Test:")
print(f"   Validation RMSE: ${val_rmse_tree:,.2f}  →  Test RMSE: ${test_rmse_tree:,.2f}")
print(f"   Validation R²: {val_r2_tree:.4f}  →  Test R²: {test_r2_tree:.4f}")

In [None]:
# Visualize predictions vs actual
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Predictions vs Actual
axes[0].scatter(y_test, y_pred_test_tree, alpha=0.5, s=20)
axes[0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()],
             'r--', linewidth=2, label='Perfect Prediction')
axes[0].set_xlabel('Actual Total Sales ($)', fontsize=11)
axes[0].set_ylabel('Predicted Total Sales ($)', fontsize=11)
axes[0].set_title(f'{best_tree_model_name}\nPredictions vs Actual (Test Set)', fontsize=12, pad=15)
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Residuals
residuals_tree = y_test - y_pred_test_tree
axes[1].scatter(y_pred_test_tree, residuals_tree, alpha=0.5, s=20)
axes[1].axhline(y=0, color='r', linestyle='--', linewidth=2)
axes[1].set_xlabel('Predicted Total Sales ($)', fontsize=11)
axes[1].set_ylabel('Residuals ($)', fontsize=11)
axes[1].set_title('Residual Plot', fontsize=12, pad=15)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---
## 11. Key Takeaways & Insights

**What you should have learned:**

### 1️⃣ Tree-Based Model Family

✅ **Decision Tree**
- Simple, interpretable, prone to overfitting
- Use as baseline or building block for ensembles

✅ **Random Forest**
- Ensemble of trees trained independently on bootstrap samples (bagging)
- Reduces overfitting through averaging
- More stable than a single tree
- Trees can be trained in parallel (fast)

✅ **Gradient Boosting**
- Sequential ensemble (boosting)
- Each tree is fit to the errors left by the trees before it
- Often achieves the best performance
- Trees must be trained one after another (slower)

### 2️⃣ Key Differences: Trees vs Linear Models

**Tree-Based Models:**
- ✅ Handle non-linear relationships automatically
- ✅ Don't require feature scaling
- ✅ Capture feature interactions naturally
- ✅ Robust to outliers in the input features
- ❌ Less interpretable (especially ensembles)
- ❌ Can overfit with small datasets
- ❌ Poor at extrapolation (predictions stay flat outside the range of the training data)

**Linear Models:**
- ✅ Highly interpretable (coefficients = feature effects)
- ✅ Fast training and prediction
- ✅ Work well with limited data
- ✅ Good at extrapolation (when the linear trend holds outside the training range)
- ❌ Assume linear relationships
- ❌ Require feature engineering for interactions
- ❌ Sensitive to feature scales (regularized models)
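
The extrapolation difference above can be seen in a few lines. This is a minimal sketch on synthetic data (an assumption, not the course dataset): the target follows a straight line, both models are trained on `x` between 0 and 10, and both are then asked about `x = 20`. The variable names are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): a linear trend y = 2x + noise, with x in [0, 10]
rng = np.random.default_rng(42)
X_train = rng.uniform(0, 10, size=(200, 1))
y_train = 2.0 * X_train.ravel() + rng.normal(0, 0.5, size=200)

tree = DecisionTreeRegressor(max_depth=5, random_state=42).fit(X_train, y_train)
linear = LinearRegression().fit(X_train, y_train)

# Query a point far outside the training range (the underlying trend gives about 40)
X_new = np.array([[20.0]])
tree_pred = tree.predict(X_new)[0]
linear_pred = linear.predict(X_new)[0]

print(f"Largest training target:   {y_train.max():.2f}")
print(f"Tree prediction at x=20:   {tree_pred:.2f}")
print(f"Linear prediction at x=20: {linear_pred:.2f}")
```

A tree predicts the mean of the training targets in the leaf a point falls into, so its prediction cannot exceed the largest target it saw. The linear model continues the fitted line, which helps only if the trend really does continue.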

### 3️⃣ Ensemble Methods Wisdom

**Why Ensembles Work:**
- **Bagging (Random Forest):** Reduces variance by averaging many diverse trees
- **Boosting (Gradient Boosting):** Reduces bias by fitting each new tree to the errors that remain
- **Key Insight:** Ensemble of weak learners → strong learner
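
Both ideas can be sketched on synthetic data (an assumption, not the course dataset). The first half does bagging by hand: deep trees are fit on bootstrap resamples and their predictions averaged. The second half uses `staged_predict` to watch the training error of a `GradientBoostingRegressor` fall as trees are added. Treat this as an illustration, not as how `RandomForestRegressor` is implemented internally.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear regression problem (assumption)
X, y = make_friedman1(n_samples=600, n_features=10, noise=1.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

# --- Bagging by hand: fit deep trees on bootstrap resamples, then average ---
rng = np.random.default_rng(0)
n_trees = 50
tree_preds = []
for _ in range(n_trees):
    idx = rng.integers(0, len(X_tr), size=len(X_tr))  # sample rows with replacement
    tree = DecisionTreeRegressor(random_state=0).fit(X_tr[idx], y_tr[idx])
    tree_preds.append(tree.predict(X_va))
tree_preds = np.array(tree_preds)

# Average validation MSE of the individual trees vs MSE of their averaged prediction
single_tree_mse = np.mean([mean_squared_error(y_va, p) for p in tree_preds])
bagged_mse = mean_squared_error(y_va, tree_preds.mean(axis=0))
print(f"Average single-tree MSE: {single_tree_mse:.3f}")
print(f"Bagged (averaged) MSE:   {bagged_mse:.3f}")

# --- Boosting: track training error as trees are added one at a time ---
gb = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                               max_depth=3, random_state=0)
gb.fit(X_tr, y_tr)
staged_train_rmse = [np.sqrt(mean_squared_error(y_tr, p))
                     for p in gb.staged_predict(X_tr)]
print(f"Boosting train RMSE after 1 tree:    {staged_train_rmse[0]:.3f}")
print(f"Boosting train RMSE after 100 trees: {staged_train_rmse[-1]:.3f}")
```

The averaged prediction can never have a higher MSE than the average MSE of the individual trees, which is the variance reduction that bagging relies on. The boosting curve shows the complementary effect: each added tree removes part of the error the previous trees left behind.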

**Trade-offs:**
- Accuracy ↑, Interpretability ↓
- Stability ↑, Training Time ↑
- Generalization ↑, Complexity ↑

### 4️⃣ Practical Guidelines

**When to Use Which Model:**

1. **Start with Random Forest:**
   - Works well out-of-the-box
   - Good baseline for tree models
   - Relatively insensitive to hyperparameter settings

2. **Try Gradient Boosting if:**
   - You need maximum accuracy
   - You have time for tuning
   - Dataset is not too small

3. **Use Single Decision Tree if:**
   - Interpretability is critical
   - You need to explain every decision
   - Dataset is small

4. **Consider Linear Models if:**
   - Relationships are approximately linear
   - You need coefficient interpretation
   - Speed is critical
   - You need to extrapolate
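
In practice, these guidelines are a starting point, and a quick cross-validated comparison settles the choice for a given dataset. The sketch below assumes synthetic data and mostly default hyperparameters. The `candidates` dictionary and its model names are illustrative, not the tuned models from this notebook.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear regression problem (assumption)
X, y = make_friedman1(n_samples=500, n_features=10, noise=1.0, random_state=1)

# Hypothetical candidate set covering the four model types discussed above
candidates = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(max_depth=5, random_state=1),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=1),
    'Gradient Boosting': GradientBoostingRegressor(random_state=1),
}

# 5-fold cross-validated RMSE (scikit-learn returns it negated, so flip the sign)
cv_rmse = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5,
                             scoring='neg_root_mean_squared_error')
    cv_rmse[name] = -scores.mean()

# Print models from best (lowest RMSE) to worst
for name, rmse in sorted(cv_rmse.items(), key=lambda kv: kv[1]):
    print(f"{name:<20} CV RMSE: {rmse:.3f}")
```

The ranking depends on the data: on a problem that is close to linear, the linear model may well come out on top.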

### 5️⃣ Feature Importance Insights
- Different models may rank features differently
- Ensemble methods provide more stable rankings than a single tree
- Always validate importance with domain knowledge
- High importance ≠ causation
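
One way to check a ranking is to compute importance in two different ways and see whether they agree. The sketch below assumes synthetic data from `make_friedman1`, where only the first five features affect the target. It compares a Random Forest's built-in impurity-based `feature_importances_` with `permutation_importance` measured on held-out data.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): features 0-4 drive the target, 5-9 are pure noise
X, y = make_friedman1(n_samples=800, n_features=10, noise=1.0, random_state=2)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=2)

rf = RandomForestRegressor(n_estimators=200, random_state=2).fit(X_tr, y_tr)

# Impurity-based importance: computed from the training data, sums to 1
impurity_imp = rf.feature_importances_

# Permutation importance: drop in validation score when one feature is shuffled
perm = permutation_importance(rf, X_va, y_va, n_repeats=10, random_state=2)
perm_imp = perm.importances_mean

# List features from most to least important by the impurity-based measure
for i in np.argsort(impurity_imp)[::-1]:
    print(f"feature_{i}: impurity={impurity_imp[i]:.3f}  permutation={perm_imp[i]:.3f}")
```

When the two measures disagree sharply on a feature, that is a cue to look closer before trusting either ranking.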

---

### 📝 Reflection Questions
1. Did tree models outperform linear models on this dataset? Why?
2. Why does Random Forest reduce overfitting compared to single trees?
3. What's the trade-off between Random Forest and Gradient Boosting?
4. How do feature importance rankings differ across models?
5. When would you prefer a linear model over a tree model?

---

### 🚀 Next Steps: Week 10
**Advanced Topics:**
- Classification models (same algorithms, different task)
- Advanced boosting: XGBoost, LightGBM, CatBoost
- Model deployment (Flask/FastAPI)
- Time series forecasting

---

**AI Tech Institute** | *Building Tomorrow's AI Engineers Today*