# Exploring Gradient Boosted Decision Trees
Outline and rough draft of a notebook investigating XGBoost, with PME to follow later.

"XGBoost stands for “Extreme Gradient Boosting”, where the term “Gradient Boosting” originates from the paper Greedy Function Approximation: A Gradient Boosting Machine, by Friedman."

from [Introduction to Boosted Trees](https://xgboost.readthedocs.io/en/stable/tutorials/model.html)


## Import Data

In [None]:
%pip install ucimlrepo

In [None]:
%pip install xgboost

In [None]:
from ucimlrepo import fetch_ucirepo

# fetch dataset
seoul_bike_sharing_demand = fetch_ucirepo(id=560)

# data (as pandas dataframes)
X = seoul_bike_sharing_demand.data.features
y = seoul_bike_sharing_demand.data.targets

# metadata
print(seoul_bike_sharing_demand.metadata)

# variable information
print(seoul_bike_sharing_demand.variables)

## Data Preprocessing

In [None]:
import pandas as pd
import tensorflow as tf
import xgboost as xgb
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from xgboost import XGBRegressor, plot_importance, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, r2_score

In [None]:
# Recombine features and target for analysis

df = pd.concat([X, y], axis=1)
print(df.columns)

In [None]:
# Convert the date column to datetime; the source strings are day-first (DD/MM/YYYY)
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')

# Derive calendar features (weekday: Monday=0 ... Sunday=6)
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
df['Weekday'] = df['Date'].dt.weekday
df.drop(columns=['Date'], inplace=True)
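
The `format` string matters because the dates are written day-first. A small standard-library sketch, using an illustrative date string rather than a row read from the dataset, derives the same three features and shows what happens if the same text is misread as month-first.

```python
from datetime import datetime

raw = "01/12/2017"  # illustrative day-first date string

day_first = datetime.strptime(raw, "%d/%m/%Y")
month_first = datetime.strptime(raw, "%m/%d/%Y")  # wrong reading of the same text

# weekday() uses Monday=0 ... Sunday=6, the same convention as pandas' .dt.weekday
print(day_first.month, day_first.day, day_first.weekday())
print(month_first.month, month_first.day, month_first.weekday())
```

Both calls succeed without error, so a wrong format can silently swap day and month for any date whose day is 12 or lower.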

In [None]:
# Encoding categorical variables
df = pd.get_dummies(df, columns=['Seasons', 'Holiday', 'Functioning Day'], drop_first=True)

In [None]:
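# Standardize numeric feature columns (the target is left unscaled).
# Note: tree-based models such as XGBoost are largely insensitive to feature
# scaling, so this step is optional here. Fitting the scaler before the
# train/eval/test split also lets held-out statistics leak into training;
# if scaling is kept, fitting on the training split only is the safer choice.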

scaler = StandardScaler()
num_cols = df.select_dtypes(include=[np.number]).columns.drop('Rented Bike Count')
df[num_cols] = scaler.fit_transform(df[num_cols])
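
Standardization is a monotonic rescaling, so a threshold split on a feature partitions the rows the same way before and after it. The sketch below checks this on made-up numbers with two hypothetical helpers, `best_partition` and `standardize`. It illustrates the idea for a single exact split and is not a claim about every XGBoost configuration.

```python
def best_partition(xs, ys):
    """Return the set of row indices left of the best single split (lowest SSE)."""
    best_sse, best_left = None, None
    for threshold in sorted(set(xs))[:-1]:  # the largest value cannot split anything
        left = [i for i, x in enumerate(xs) if x <= threshold]
        right = [i for i, x in enumerate(xs) if x > threshold]
        sse = 0.0
        for group in (left, right):
            mean = sum(ys[i] for i in group) / len(group)
            sse += sum((ys[i] - mean) ** 2 for i in group)
        if best_sse is None or sse < best_sse:
            best_sse, best_left = sse, frozenset(left)
    return best_left


def standardize(xs):
    """Subtract the mean and divide by the population standard deviation."""
    mean = sum(xs) / len(xs)
    std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mean) / std for x in xs]


temps = [-5.0, 0.0, 4.0, 12.0, 20.0, 27.0]  # made-up feature values
counts = [50, 60, 70, 400, 450, 500]        # made-up target values

same = best_partition(temps, counts) == best_partition(standardize(temps), counts)
print("same partition after standardizing:", same)
```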

In [None]:
df.info()

In [None]:
# Separate features and target
X = df.drop(columns=['Rented Bike Count'])
y = df['Rented Bike Count']

# Split into train (60%), eval (20%), test (20%)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_eval, X_test, y_eval, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

In [None]:
X.head()

## Exploring XGBoost

In [None]:
# Train XGBoost model
eval_set = [(X_train, y_train), (X_eval, y_eval)]

model = xgb.XGBRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5
)

model.fit(X_train, y_train, eval_set=eval_set, verbose=True)
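
The roles of `n_estimators` and `learning_rate` are easier to see in a stripped-down version of boosting. The sketch below is plain gradient boosting for squared error with depth-1 stumps, written in pure Python on toy data. It is not XGBoost's actual algorithm, which also uses second-order gradient information and regularization, and the helper names `fit_stump` and `boost` are made up for illustration.

```python
def fit_stump(xs, residuals):
    """Find the single-threshold split that minimizes squared error.

    Returns (threshold, left_value, right_value): points with x <= threshold
    receive left_value, the rest receive right_value.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:  # the largest x cannot split anything
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_estimators=100, learning_rate=0.1):
    """Plain gradient boosting for squared error with depth-1 stumps."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    preds = [base] * len(ys)
    rmse_per_round = []
    for _ in range(n_estimators):
        # For squared error, the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        # Each new stump contributes only a shrunken correction.
        preds = [p + learning_rate * (left_value if x <= threshold else right_value)
                 for x, p in zip(xs, preds)]
        mse = sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)
        rmse_per_round.append(mse ** 0.5)
    return preds, rmse_per_round


# Toy data with a step-like relationship (illustrative values only).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [10, 12, 11, 30, 32, 31, 50, 52]
preds, rmse_per_round = boost(xs, ys, n_estimators=50, learning_rate=0.1)
print(f"RMSE after round 1: {rmse_per_round[0]:.2f}, "
      f"after round 50: {rmse_per_round[-1]:.2f}")
```

A smaller `learning_rate` makes each round's correction smaller, so more rounds (`n_estimators`) are needed to reach the same training error. That trade-off is what the two parameters in the cell above control.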

## XGBoost has a built-in plotting API

[📘 Documentation](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.plot_importance)

XGBoost provides a built-in function to **plot the importance of each feature** based on various criteria. The `importance_type` parameter allows you to choose how importance is measured:

- **`weight`**: The number of times a feature is used to split the data, counted across all trees.
- **`gain`**: The average gain of splits which use the feature.
- **`cover`**: The average coverage of splits which use the feature, where _coverage_ is defined as the number of samples affected by the split.
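
The three definitions differ only in how per-split statistics are aggregated per feature. As a rough illustration, the sketch below aggregates a hand-written list of split records. The records, the feature names, and the `importance` helper are hypothetical and are not XGBoost internals. On a fitted model, the corresponding scores can be read with `model.get_booster().get_score(importance_type=...)`.

```python
from collections import defaultdict

# Hypothetical split records collected across all trees: (feature, gain, cover).
splits = [
    ("Hour", 120.0, 500),
    ("Hour", 80.0, 300),
    ("Temperature", 200.0, 400),
    ("Hour", 10.0, 50),
    ("Humidity", 30.0, 100),
]


def importance(splits, importance_type):
    """Aggregate per-split statistics into one score per feature."""
    counts = defaultdict(int)
    gain_sums = defaultdict(float)
    cover_sums = defaultdict(float)
    for feature, gain, cover in splits:
        counts[feature] += 1
        gain_sums[feature] += gain
        cover_sums[feature] += cover
    if importance_type == "weight":
        return dict(counts)  # how often each feature is used to split
    if importance_type == "gain":
        return {f: gain_sums[f] / counts[f] for f in counts}  # average gain
    if importance_type == "cover":
        return {f: cover_sums[f] / counts[f] for f in counts}  # average cover
    raise ValueError(f"unknown importance_type: {importance_type}")


for kind in ("weight", "gain", "cover"):
    print(kind, importance(splits, kind))
```

In this toy example `Hour` ranks first by `weight`, while `Temperature` ranks first by `gain` and `cover`, which is why the three plots can order features differently.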


In [None]:
plot_importance(model, importance_type='gain')
plt.show()

In [None]:
plot_importance(model, importance_type='weight')
plt.show()

In [None]:
plot_importance(model, importance_type='cover')
plt.show()

### Evaluate model performance

In [None]:
# Predict on test set
y_pred_xgb = model.predict(X_test)

# Evaluation metrics
mae_xgb = mean_absolute_error(y_test, y_pred_xgb)
r2_xgb = r2_score(y_test, y_pred_xgb)

print("\nXGBoost Metrics:")
print(f"MAE: {mae_xgb:.2f}")
print(f"R²: {r2_xgb:.4f}")
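
For reference, both metrics are simple to compute by hand. The sketch below follows the standard definitions: MAE is the mean of the absolute errors, and R² is 1 - SS_res / SS_tot. The numbers are made up, and the function names `mae` and `r2` are illustrative and not part of scikit-learn.

```python
def mae(y_true, y_pred):
    """Mean absolute error: the average size of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up target values and predictions, for illustration only.
y_true = [100, 200, 300, 400]
y_pred = [110, 190, 320, 380]

print(f"MAE: {mae(y_true, y_pred):.2f}")
print(f"R²: {r2(y_true, y_pred):.4f}")
```

MAE is expressed in the same units as the target, while R² is unitless: 1 means perfect predictions and 0 means no better than always predicting the mean of the target.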


In [None]:
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred_xgb, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('XGBoost: Predicted vs Actual Values')
plt.tight_layout()
plt.show()


In [None]:
results = model.evals_result()
train_loss = results['validation_0']['rmse']
eval_loss = results['validation_1']['rmse']

plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(train_loss, label='Train RMSE')
plt.plot(eval_loss, label='Validation RMSE')
plt.title('XGBoost RMSE Over Boosting Rounds')
plt.xlabel('Boosting Round')
plt.ylabel('RMSE')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(np.abs(np.array(train_loss) - np.array(eval_loss)), label='Train-Eval RMSE Gap')
plt.title('Generalization Gap Over Boosting Rounds')
plt.xlabel('Boosting Round')
plt.ylabel('Gap')
plt.legend()

plt.tight_layout()
plt.show()
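
One use of the validation curve is choosing how many boosting rounds to keep. A minimal sketch, assuming a simple patience rule: stop once validation RMSE has gone `patience` rounds without improving on the best value seen so far. The helper `stopping_round` and the RMSE values are hypothetical. XGBoost has its own early-stopping support, which is the better choice in practice.

```python
def stopping_round(eval_losses, patience=3):
    """Return (best_round, stopped_at) under a simple patience rule.

    best_round is the 0-based index of the lowest loss seen before stopping;
    stopped_at is the index of the round at which training would halt.
    """
    best_round, best_loss = 0, eval_losses[0]
    for i, loss in enumerate(eval_losses):
        if loss < best_loss:
            best_round, best_loss = i, loss
        elif i - best_round >= patience:
            return best_round, i
    return best_round, len(eval_losses) - 1  # never triggered: ran to the end


# Made-up validation RMSE values: improving at first, then drifting upward.
eval_rmse = [500.0, 420.0, 380.0, 360.0, 355.0, 357.0, 358.0, 361.0, 365.0]
best, stopped = stopping_round(eval_rmse, patience=3)
print(f"best round: {best}, stopped at round: {stopped}")
```

With the notebook's variables, `stopping_round(eval_loss)` would apply the same rule to the recorded validation RMSE.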
