# 📕 07 - Building an XGBoost Regressor Model

In the previous notebook, I established baseline models to serve as a point of comparison. Now I'm turning to XGBoost as my first machine learning model for the time series data.

XGBoost, short for eXtreme Gradient Boosting, is a decision-tree-based ensemble algorithm built on the gradient boosting framework. It builds many weak prediction models, typically shallow decision trees, in a stage-wise fashion: each new tree is fit to the errors left by the trees before it, so the ensemble's predictions improve step by step.
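
To make the stage-wise idea concrete, here is a toy sketch of boosting for squared error in plain Python. It uses depth-1 trees (stumps) on a single feature and made-up data. The helper names `fit_stump` and `boost` are mine, not part of XGBoost, and the real library adds regularization, second-order gradient information, and many optimizations on top of this basic loop.

```python
# Toy gradient boosting for squared error with depth-1 trees (stumps).
# Illustrative only: this is NOT how XGBoost is implemented internally.

def fit_stump(xs, residuals):
    """Pick the threshold on xs that best splits the residuals (lowest squared error)."""
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Return the training MSE after each round and a predict function."""
    base = sum(ys) / len(ys)  # round 0: predict the mean everywhere
    stumps = []
    preds = [base] * len(ys)
    mse_history = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        mse_history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return mse_history, predict


# Made-up data with three rough "levels".
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [1.0, 1.5, 1.2, 4.0, 4.2, 3.9, 8.0, 8.3]
mse_history, predict = boost(xs, ys)
print(f"MSE after round 1: {mse_history[0]:.3f}, after round 20: {mse_history[-1]:.3f}")
```

Each round fits a stump to what the current ensemble still gets wrong, then adds a damped version of it, so the training error shrinks from round to round.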

Despite the rising popularity of deep learning techniques, especially for non-tabular data types like images and text, XGBoost remains a go-to solution for tabular datasets. It offers several advantages, including the ability to handle missing values, built-in support for parallel processing, and optimization for both accuracy and computing speed, making it an efficient and precise tool for such data.


In [1]:
# import libraries
from datetime import datetime

import pandas as pd
import xgboost as xgb
from sklearn.metrics import mean_absolute_error

from src.data_split import train_test_split
from src.paths import TRANSFORMED_DATA_DIR

To generate predictions for the upcoming rides, I'll focus primarily on past rides as input features. The `pickup_hour` column, which contains datetime values like `2022-01-29 00:00:00`, could be used to derive various numerical features, but I've chosen to set it aside for now, along with `pickup_location_id`.

Later on, I may use `pickup_hour` for additional feature engineering.
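
As a rough idea of what that feature engineering could look like, the sketch below derives a few calendar features from a single `pickup_hour` timestamp using only the standard library. The function name `calendar_features` and the choice of features are assumptions for illustration, not something this notebook uses yet.

```python
from datetime import datetime


def calendar_features(pickup_hour: datetime) -> dict:
    """Derive simple numeric features from one pickup_hour timestamp (hypothetical helper)."""
    return {
        "hour": pickup_hour.hour,                       # 0..23
        "day_of_week": pickup_hour.weekday(),           # Monday=0 ... Sunday=6
        "is_weekend": int(pickup_hour.weekday() >= 5),  # 1 for Saturday/Sunday
    }


# One of the timestamps from the table above: Saturday, 2022-01-29 at 04:00.
features = calendar_features(datetime(2022, 1, 29, 4, 0, 0))
print(features)
```

On the full DataFrame, the pandas `.dt` accessor (for example `df['pickup_hour'].dt.hour` or `df['pickup_hour'].dt.dayofweek`) gives the same values column-wise.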


In [2]:
# load transformed data
df = pd.read_parquet(TRANSFORMED_DATA_DIR / 'tabular_data.parquet')
df.head()

Unnamed: 0,rides_previous_672_hour,rides_previous_671_hour,rides_previous_670_hour,rides_previous_669_hour,rides_previous_668_hour,rides_previous_667_hour,rides_previous_666_hour,rides_previous_665_hour,rides_previous_664_hour,rides_previous_663_hour,...,rides_previous_7_hour,rides_previous_6_hour,rides_previous_5_hour,rides_previous_4_hour,rides_previous_3_hour,rides_previous_2_hour,rides_previous_1_hour,pickup_hour,pickup_location_id,target_rides_next_hour
0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,2.0,0.0,0.0,...,2.0,0.0,1.0,0.0,0.0,0.0,0.0,2022-01-29 00:00:00,1,0.0
1,0.0,0.0,0.0,1.0,1.0,0.0,2.0,0.0,0.0,1.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,2022-01-29 01:00:00,1,0.0
2,0.0,0.0,1.0,1.0,0.0,2.0,0.0,0.0,1.0,2.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,2022-01-29 02:00:00,1,0.0
3,0.0,1.0,1.0,0.0,2.0,0.0,0.0,1.0,2.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2022-01-29 03:00:00,1,0.0
4,1.0,1.0,0.0,2.0,0.0,0.0,1.0,2.0,1.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2022-01-29 04:00:00,1,0.0


In [10]:
# split data into train and test
X_train, y_train, X_test, y_test = train_test_split(
    df=df,
    cutoff_date=datetime(2022, 8, 1, 0, 0, 0),
    target_col_name='target_rides_next_hour'
)
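
The `train_test_split` used here is the project's own helper from `src.data_split`, and its implementation is not shown in this notebook. The sketch below is my guess at what a cutoff-based split of this kind might look like. The name `cutoff_split`, the `time_col` parameter, and the choice to put rows at exactly the cutoff into the test set are all assumptions.

```python
from datetime import datetime

import pandas as pd


def cutoff_split(df, cutoff_date, target_col_name, time_col="pickup_hour"):
    """Hypothetical time-based split: rows before the cutoff train, the rest test."""
    train = df[df[time_col] < cutoff_date].reset_index(drop=True)
    test = df[df[time_col] >= cutoff_date].reset_index(drop=True)

    # Separate features from the target, in the same return order the notebook uses.
    X_train = train.drop(columns=[target_col_name])
    y_train = train[target_col_name]
    X_test = test.drop(columns=[target_col_name])
    y_test = test[target_col_name]
    return X_train, y_train, X_test, y_test


# Tiny made-up frame with the same column naming as the real data.
toy = pd.DataFrame({
    "rides_previous_1_hour": [3.0, 5.0, 2.0, 4.0],
    "pickup_hour": pd.to_datetime(["2022-07-30", "2022-07-31", "2022-08-01", "2022-08-02"]),
    "target_rides_next_hour": [5.0, 2.0, 4.0, 1.0],
})
X_tr, y_tr, X_te, y_te = cutoff_split(toy, datetime(2022, 8, 1), "target_rides_next_hour")
print(X_tr.shape, X_te.shape)
```

Splitting on time rather than at random keeps every test row later than every training row, so the evaluation mimics forecasting hours the model has not seen.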

In [11]:
# print all shapes from train_test_split
print(f'X_train shape: {X_train.shape}')
print(f'y_train shape: {y_train.shape}')
print(f'X_test shape: {X_test.shape}')
print(f'y_test shape: {y_test.shape}')

X_train shape: (1170240, 674)
y_train shape: (1170240,)
X_test shape: (972815, 674)
y_test shape: (972815,)


In [12]:
# extract only past rides columns
past_rides_cols = [col for col in df.columns if col.startswith('rides_')]
X_train_only_numeric = X_train[past_rides_cols]

In [13]:
# instantiate and fit model
model = xgb.XGBRegressor()
model.fit(X_train_only_numeric, y_train)

In [14]:
# predict on test set
X_test_only_numeric = X_test[past_rides_cols]
predicts = model.predict(X_test_only_numeric)

In [15]:
# evaluate model with MAE
mae = mean_absolute_error(y_test, predicts)
print(f'MAE: {mae}')

MAE: 2.6424973011016846
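
For reference, MAE is the average of the absolute differences between actual and predicted values. The sketch below recomputes it by hand on a few made-up numbers. The function name `mean_absolute_error_manual` is mine, and it should match what `sklearn.metrics.mean_absolute_error` returns for the same plain inputs.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up ride counts: absolute errors are 1, 0, 2, 1, so the mean is 4 / 4.
toy_true = [0.0, 2.0, 5.0, 1.0]
toy_pred = [1.0, 2.0, 3.0, 2.0]
toy_mae = mean_absolute_error_manual(toy_true, toy_pred)
print(f"MAE: {toy_mae}")  # MAE: 1.0
```

Read this way, the MAE above means the predictions are off by about 2.64 rides on average per location and hour.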


## Analysis of XGBoost Model Results

Moving from the heuristic baseline models to the XGBoost regressor noticeably lowered the error on the test set:

- **XGBoost Regressor Model**: Our XGBoost model, leveraging gradient boosting on decision trees, reported a Mean Absolute Error (MAE) of **2.6425** on the test set.

To provide a frame of reference:
  
- Our best-performing baseline, the **Baseline Model v3**, had an MAE of **3.4963**. The XGBoost model lowers that by about 0.85 rides per prediction, a reduction of roughly 24%.
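
The size of the gap can be checked with a couple of lines of arithmetic, using the two MAE values reported in these notebooks:

```python
baseline_mae = 3.4963             # Baseline Model v3, from the previous notebook
xgboost_mae = 2.6424973011016846  # XGBoost MAE printed above

absolute_gain = baseline_mae - xgboost_mae
relative_gain = absolute_gain / baseline_mae

print(f"Absolute improvement: {absolute_gain:.4f} rides")
print(f"Relative improvement: {relative_gain:.1%}")
```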

It's worth noting that these results were achieved with XGBoost's default hyperparameters and only the lagged ride counts as features, with no feature engineering or tuning. This suggests there is room for further improvement as we refine our approach.
