# Boosting Predictions: XGBoost

## Gradient Boosting Machines

In the previous session, we used Random Forests, which build trees *independently*.
Now we introduce **XGBoost (Extreme Gradient Boosting)**, which builds trees *sequentially*: each new tree is fit to the errors left by the trees before it.
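
To make the sequential idea concrete, here is a minimal sketch of boosting with squared-error loss, built from plain scikit-learn trees on synthetic data. It is not XGBoost's actual algorithm (which adds regularization and uses second-order gradient information); the data and the names `X_demo`, `y_demo`, and `mse_history` are invented for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.3
# Start from a constant prediction: the mean of the target
prediction = np.full_like(y_demo, y_demo.mean())
mse_history = [np.mean((y_demo - prediction) ** 2)]

for _ in range(20):
    # Each new tree is trained on what the ensemble still gets wrong
    residuals = y_demo - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_demo, residuals)
    # Add a shrunken version of the tree's output to the running prediction
    prediction += learning_rate * tree.predict(X_demo)
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print(f"Training MSE after 0 trees:  {mse_history[0]:.4f}")
print(f"Training MSE after 20 trees: {mse_history[-1]:.4f}")
```

The training error shrinks as trees are added, because every round targets the residuals of the ensemble so far. A Random Forest, by contrast, would fit every tree to the original target.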

**Dataset**: Malaysia Customer Transactions 2025 — predicting `total_transaction_amount`.

**Why XGBoost?**
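- **Speed**: Parallelized, histogram-based tree construction keeps training fast on large tabular datasets.
- **Accuracy**: A frequent component of winning solutions in Kaggle competitions on tabular data.
- **Robustness**: Handles missing numeric values (`NaN`) internally and doesn't require feature scaling.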

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error, r2_score

%matplotlib inline
sns.set_theme(style="whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

print("XGBoost version:", xgb.__version__)

## 1. Data Loading

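**Note on Scaling**: Unlike KNN or Neural Networks, tree-based models like XGBoost are *invariant* to feature scaling, because splits depend only on the ordering of feature values, not their magnitude. We therefore use the encoded data directly, without standardizing it.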
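
The invariance can be checked with a small experiment. The sketch below uses a single scikit-learn decision tree as a stand-in for tree-based models in general (not XGBoost itself) and synthetic integer features on very different scales; all names and data here are assumptions made for the illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Two integer-valued features on very different scales (synthetic)
X_raw = np.column_stack([
    rng.integers(1, 50, size=300),
    rng.integers(1000, 100000, size=300),
]).astype(float)
y_toy = 20.0 * X_raw[:, 0] + 0.001 * X_raw[:, 1] + rng.normal(scale=1.0, size=300)

# Standardize the features: a monotonic transform that preserves ordering
X_scaled = StandardScaler().fit_transform(X_raw)

tree_raw = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_raw, y_toy)
tree_scaled = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_scaled, y_toy)

# Both trees partition the samples the same way, so predictions should match
same_predictions = np.allclose(tree_raw.predict(X_raw), tree_scaled.predict(X_scaled))
print("Same predictions with and without scaling:", same_predictions)
```

Scaling changes the split thresholds but not which samples fall on each side of a split, which is why the fitted values should agree.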

In [None]:
# Load data
df = pd.read_csv("dummy_malaysia_customer_transactions_2025.csv")

# Encode categorical features as integer codes (the trees treat these codes as ordered numbers)
le_region = LabelEncoder()
le_quarter = LabelEncoder()
df['region_encoded'] = le_region.fit_transform(df['region'].fillna('Unknown'))
df['quarter_encoded'] = le_quarter.fit_transform(df['quarter'].fillna('Unknown'))

# Select Features and Target
features = ['region_encoded', 'quarter_encoded', 'number_of_purchases']
target = 'total_transaction_amount'

data = df[features + [target]].dropna()
X = data[features]
y = data[target]

# Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training Data Shape: {X_train.shape}")

## 2. Baseline XGBoost Model

We train an XGBoost regressor with mostly default settings.
Key parameters:
- `n_estimators`: Number of boosting rounds (trees).
- `max_depth`: Maximum depth of each tree.
- `learning_rate` (or `eta`): Shrinks each tree's contribution; smaller values reduce overfitting but need more rounds.
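
To see how these three parameters interact, the boosting update can be written as a single formula. The symbols $F_m$, $f_m$, $\eta$, and $M$ are notation introduced here for illustration; they do not appear elsewhere in the notebook.

$$
F_m(x) = F_{m-1}(x) + \eta \, f_m(x), \qquad m = 1, \dots, M
$$

Here $M$ corresponds to `n_estimators`, $\eta$ to `learning_rate`, and each $f_m$ is a tree no deeper than `max_depth`. A smaller $\eta$ shrinks every tree's contribution, so more rounds are typically needed to reach the same fit.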

In [None]:
# Initialize model (random_state is the scikit-learn-style name for the seed)
xgb_model = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=100, random_state=42)

# Train
xgb_model.fit(X_train, y_train)

# Predict
y_pred = xgb_model.predict(X_test)

# Evaluate
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("--- Baseline XGBoost ---")
print(f"MSE: {mse:.2f}")
print(f"R2 Score: {r2:.4f}")

## 3. Dealing with Overfitting: Early Stopping

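**Early Stopping** lets us set `n_estimators` high and stop training once the validation metric has not improved for `N` consecutive rounds (`early_stopping_rounds`). XGBoost monitors the last dataset in `eval_set`, which should be data the model is not trained on.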
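
The stopping rule itself is simple enough to sketch in plain Python. The function below is a hypothetical helper written for this illustration, not part of XGBoost, and the validation scores are invented numbers.

```python
def early_stopping_round(val_scores, patience):
    """Return (best_round, stop_round) for per-round validation errors (lower is better)."""
    best_score = float("inf")
    best_round = -1
    for round_idx, score in enumerate(val_scores):
        if score < best_score:
            # New best: remember it and reset the patience window
            best_score = score
            best_round = round_idx
        elif round_idx - best_round >= patience:
            # No improvement for `patience` consecutive rounds: stop here
            return best_round, round_idx
    # Ran out of rounds without triggering the rule
    return best_round, len(val_scores) - 1


# Made-up validation RMSE values, one per boosting round
val_rmse = [10.0, 8.0, 7.0, 6.5, 6.6, 6.7, 6.55, 6.9, 7.2]
best_round, stop_round = early_stopping_round(val_rmse, patience=3)
print(f"Best round: {best_round}, stopped at round: {stop_round}")
# prints: Best round: 3, stopped at round: 6
```

Training stops at round 6 because rounds 4, 5, and 6 all fail to beat the score from round 3, which is kept as the best iteration.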

In [None]:
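# Hold out a validation set from the training data so the test set stays untouched
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# XGBoost uses the last entry in eval_set to decide when to stop
eval_set = [(X_tr, y_tr), (X_val, y_val)]

xgb_early = xgb.XGBRegressor(n_estimators=1000, learning_rate=0.05, random_state=42, early_stopping_rounds=10)
xgb_early.fit(X_tr, y_tr, eval_set=eval_set, verbose=False)

print(f"Best Iteration: {xgb_early.best_iteration}")
print(f"Best Validation RMSE: {xgb_early.best_score:.4f}")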

## 4. Hyperparameter Tuning

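XGBoost is powerful but sensitive to hyperparameters. We'll tune:
- `learning_rate`: Lower values usually generalize better but require more trees.
- `max_depth`: Shallower trees are less prone to overfitting.
- `n_estimators`: Number of boosting rounds, which trades off against `learning_rate`.

`subsample` (the fraction of training rows sampled for each tree) is another common knob; it is not part of the grid below.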
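
Before launching a grid search it helps to count how many models it will train. The sketch below assumes the same grid values and 3-fold cross-validation used in the next cell; `grid` and `cv_folds` are names chosen for this illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the grid searched in this section
grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1, 0.2],
    "n_estimators": [100, 200],
}
cv_folds = 3

# Every combination of values is one candidate model
n_candidates = len(ParameterGrid(grid))
total_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {total_fits} fits")
# prints: 18 candidates x 3 folds = 54 fits
```

With its default `refit=True`, `GridSearchCV` then trains the best candidate once more on the full training set. Each extra value in the grid multiplies the cost, so it pays to keep the grid small.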

In [None]:
param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [100, 200]
}

xgb_tuned = xgb.XGBRegressor(objective='reg:squarederror', random_state=42)
grid_search = GridSearchCV(estimator=xgb_tuned, param_grid=param_grid, cv=3, scoring='r2', verbose=1)
grid_search.fit(X_train, y_train)

print(f"Best Params: {grid_search.best_params_}")
print(f"Best CV R2: {grid_search.best_score_:.4f}")

best_xgb = grid_search.best_estimator_

## 5. Visualization: Feature Importance

By default, `plot_importance` ranks features by `weight`: the number of times each feature is used to split the data across all trees.

In [None]:
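fig, ax = plt.subplots(figsize=(8, 5))
# Pass ax explicitly; otherwise plot_importance draws on a new figure and leaves this one empty
xgb.plot_importance(best_xgb, ax=ax, max_num_features=10, height=0.5)
ax.set_title("XGBoost Feature Importance (Customer Transactions)")
plt.show()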

## 6. Exercises

Challenge yourself with these tasks!

In [None]:
# Task 1: Tune 'min_child_weight' and 'gamma'
# These parameters control regularization.
# Add them to a new GridSearchCV and see if you can beat the previous best score.

# Your code here

In [None]:
# Task 2: Compare Speed
# Use Python's 'time' module to measure training time of Random Forest vs XGBoost.
# Which one is faster for this dataset?

import time
# Your timing code here

## Summary

You've now added **XGBoost** to your ML toolkit!
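1. **Gradient Boosting** reduces bias by fitting each new tree to the errors of the previous ones.
2. **Early Stopping** limits overfitting by halting training once the validation metric stops improving.
3. **Tuning** `learning_rate`, `max_depth`, and `n_estimators` with cross-validation can improve on the default settings.

XGBoost is a strong choice when accuracy on tabular data matters.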