# 📕 06 - Building a Baseline Model

Before training more complex machine learning models, I will establish baseline models to compare them against. A baseline should be fast and simple: it sets the bar that any later model has to clear to justify its extra complexity. This notebook covers three baseline models:

1. **Previous-Hour Predictor**: Forecasts the next hour's demand based on the prior hour.
  
2. **Same Hour from the Previous Week Predictor**: Uses data from the same hour one week earlier.
  
3. **Four-Week Average Predictor**: Averages the ride counts for the same hour over the previous four weeks.

In [2]:
# Automatically reload modules before executing any line.
%reload_ext autoreload
%autoreload 2

# import libraries
from datetime import datetime

import pandas as pd
import numpy as np
from sklearn.metrics import mean_absolute_error

from src.paths import TRANSFORMED_DATA_DIR
from src.data_split import train_test_split

# load tabular data and display the first few rows
df = pd.read_parquet(TRANSFORMED_DATA_DIR / 'tabular_data.parquet')
df.head()

| | rides_previous_672_hour | rides_previous_671_hour | rides_previous_670_hour | rides_previous_669_hour | rides_previous_668_hour | rides_previous_667_hour | rides_previous_666_hour | rides_previous_665_hour | rides_previous_664_hour | rides_previous_663_hour | ... | rides_previous_7_hour | rides_previous_6_hour | rides_previous_5_hour | rides_previous_4_hour | rides_previous_3_hour | rides_previous_2_hour | rides_previous_1_hour | pickup_hour | pickup_location_id | target_rides_next_hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 2.0 | 0.0 | 0.0 | ... | 2.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2022-01-29 | 1 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 1.0 | 2.0 | 1.0 | 2.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 2022-01-30 | 1 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2022-01-31 | 1 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 2.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 2022-02-01 | 1 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2022-02-02 | 1 | 0.0 |


In [3]:
# Split the data into training and test sets at a fixed cutoff date
X_train, y_train, X_test, y_test = train_test_split(
    df=df,
    cutoff_date=datetime(2022, 6, 1, 0, 0, 0),
    target_col_name='target_rides_next_hour'
)

# print shapes for all datasets
print(f'X_train shape: {X_train.shape}')
print(f'y_train shape: {y_train.shape}')
print(f'X_test shape: {X_test.shape}')
print(f'y_test shape: {y_test.shape}')

X_train shape: (32595, 674)
y_train shape: (32595,)
X_test shape: (56710, 674)
y_test shape: (56710,)
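
The `train_test_split` helper comes from `src.data_split`, and its implementation is not shown in this notebook. As a rough sketch of what a cutoff-date split could look like, assuming it filters on the `pickup_hour` column and only removes the target from the features (the name `split_by_cutoff` and the toy data are hypothetical, not the project's code):

```python
from datetime import datetime

import pandas as pd


def split_by_cutoff(df: pd.DataFrame, cutoff_date: datetime, target_col_name: str):
    """Hypothetical stand-in for src.data_split.train_test_split."""
    # Rows strictly before the cutoff go to training, the rest to testing.
    train = df[df["pickup_hour"] < cutoff_date].reset_index(drop=True)
    test = df[df["pickup_hour"] >= cutoff_date].reset_index(drop=True)

    # Features are every column except the target.
    X_train = train.drop(columns=[target_col_name])
    y_train = train[target_col_name]
    X_test = test.drop(columns=[target_col_name])
    y_test = test[target_col_name]
    return X_train, y_train, X_test, y_test


# Toy data: four rows around the cutoff (all values are made up).
toy = pd.DataFrame({
    "rides_previous_1_hour": [3.0, 5.0, 2.0, 4.0],
    "pickup_hour": pd.to_datetime(
        ["2022-05-30", "2022-05-31", "2022-06-01", "2022-06-02"]
    ),
    "pickup_location_id": [1, 1, 1, 1],
    "target_rides_next_hour": [5.0, 2.0, 4.0, 6.0],
})

X_tr, y_tr, X_te, y_te = split_by_cutoff(
    toy, datetime(2022, 6, 1), "target_rides_next_hour"
)
print(X_tr.shape, X_te.shape)
```

Splitting by time rather than at random keeps every test row later than every training row, which mirrors how a forecast is used in practice.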


## Baseline Model v1: The Previous-Hour Predictor

Before diving into complex models, it's always wise to start with a simple baseline. This provides a reference point, and if our more sophisticated models can't surpass this baseline, it indicates that they might not be as valuable as we initially thought.

Here, I introduce `BaselineModelPreviousHour` as our starting point. It is not a learned model but a naive heuristic: *to predict the number of rides in the next hour, it simply returns the number of rides observed in the most recent hour* (the `rides_previous_1_hour` column).

In [42]:
class BaselineModelPreviousHour:
    """
    A simple baseline model that predicts the next hour's rides based on the previous hour.
    """
    
    def fit(self, X_train: pd.DataFrame, y_train: pd.Series) -> None:
        """
        Fit method is intentionally left empty since our baseline model doesn't require training.
        """
        pass

    def predict(self, X_test: pd.DataFrame) -> np.ndarray:
        """
        Predict the number of rides for the next hour using the number of rides in the previous hour.

        Args:
            X_test (pd.DataFrame): The test features.

        Returns:
            np.ndarray: Predicted number of rides for the next hour.
        """
        return X_test['rides_previous_1_hour'].values
        

# Instantiate and use the model
base_model = BaselineModelPreviousHour()
predictions = base_model.predict(X_test)

test_mae = mean_absolute_error(y_test, predictions)
print(f'Test MAE: {test_mae:.4f}')


Test MAE: 6.0558
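
The score above is a mean absolute error: the average of the absolute differences between actual and predicted ride counts, expressed in rides. As a small illustration on made-up numbers (the arrays below are not taken from the dataset), the same quantity can be computed by hand with NumPy, which is what `mean_absolute_error` returns for one-dimensional inputs with default arguments:

```python
import numpy as np

# Made-up ride counts for five (location, hour) rows.
y_true = np.array([4.0, 0.0, 7.0, 2.0, 5.0])
y_pred = np.array([3.0, 1.0, 10.0, 2.0, 1.0])  # e.g. the previous hour's counts

# Absolute errors per row: 1, 1, 3, 0, 4.
abs_errors = np.abs(y_true - y_pred)

# MAE is their average, in the same unit as the target (rides).
mae = abs_errors.mean()
print(f"MAE: {mae:.4f}")  # MAE: 1.8000
```

Read this way, a test MAE of about 6 means the previous-hour predictor is off by roughly six rides per location and hour on average.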


## Baseline Model v2: The Same Hour from the Previous Week Predictor

What if, instead of basing our prediction on the last hour, we looked back exactly one week to the corresponding hour? Ride demand tends to follow weekly patterns, such as weekday rush hours, so the same hour seven days earlier is often a better guide than the hour that just passed. `BaselineModelPreviousWeek` implements this idea by returning the ride count from 7 × 24 = 168 hours earlier (the `rides_previous_168_hour` column).
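
To see why a 168-hour lag lines up with "the same hour last week", here is a minimal sketch using only the standard library. It assumes that `rides_previous_k_hour` holds the count from `k` hours before the hour being predicted and that timestamps are naive (no daylight-saving shifts); the target hour is a hypothetical example.

```python
from datetime import datetime, timedelta

# Hypothetical hour to predict: Wednesday 2022-06-08, 08:00.
target_hour = datetime(2022, 6, 8, 8)

# The weekly baseline looks back 7 days * 24 hours = 168 hours.
lag_hours = 7 * 24
source_hour = target_hour - timedelta(hours=lag_hours)

# Name of the lag column the baseline reads.
lag_column = f"rides_previous_{lag_hours}_hour"

print(lag_column)
print("predicting:", target_hour.strftime("%A %Y-%m-%d %H:%M"))
print("using:     ", source_hour.strftime("%A %Y-%m-%d %H:%M"))
```

Both timestamps fall on the same weekday and the same hour of day, which is exactly the alignment this baseline relies on.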


In [4]:
class BaselineModelPreviousWeek:
    """
    A simple baseline model that predicts the next hour's rides based on the same hour from the previous week.
    """
    
    def fit(self, X_train: pd.DataFrame, y_train: pd.Series) -> None:
        """
        Fit method is intentionally left empty since our baseline model doesn't require training.
        """
        pass

    def predict(self, X_test: pd.DataFrame) -> np.ndarray:
        """
        Predict the number of rides for the next hour using the number of rides from the same hour the previous week.

        Args:
            X_test (pd.DataFrame): The test features.

        Returns:
            np.ndarray: Predicted number of rides for the next hour.
        """
        return X_test[f'rides_previous_{7*24}_hour'].values

# Instantiate and use the model
base_model = BaselineModelPreviousWeek()
predictions = base_model.predict(X_test)

test_mae = mean_absolute_error(y_test, predictions)
print(f'Test MAE: {test_mae:.4f}')

Test MAE: 3.6811


## Baseline Model v3: The Four-Week Average Predictor

The third and last baseline widens the window to the last four weeks. Instead of using a single week's value, it averages the **same hour** over the previous four weeks, so that one unusual week (a holiday or a one-off event, for example) has less influence on the prediction.

This **non-ML** approach trades responsiveness for stability: it reacts more slowly to recent changes but is less sensitive to noise in any single week. Let's see whether the longer window improves on our previous baselines.
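
To make the averaging concrete, here is a small sketch on made-up numbers (the toy frame is hypothetical; only the column naming follows the dataset). The first row has an unusual spike one week back, which the four-week average dampens:

```python
import pandas as pd

# Same hour 1, 2, 3 and 4 weeks back: lags of 168, 336, 504 and 672 hours.
lag_cols = [f"rides_previous_{i * 7 * 24}_hour" for i in range(1, 5)]

# Made-up counts for two rows, ordered like lag_cols.
toy = pd.DataFrame(
    [
        [20.0, 8.0, 10.0, 6.0],  # spike one week ago
        [3.0, 5.0, 4.0, 4.0],    # stable demand
    ],
    columns=lag_cols,
)

# Prediction of the previous-week baseline: a single lag column.
previous_week = toy["rides_previous_168_hour"].to_numpy()

# Prediction of the four-week baseline: row-wise mean of the four lag columns.
four_week_avg = toy[lag_cols].mean(axis=1).to_numpy()

print(lag_cols)
print("previous week:", previous_week.tolist())   # [20.0, 3.0]
print("4-week average:", four_week_avg.tolist())  # [11.0, 4.0]
```

In the first row the single-week value of 20 is pulled down to 11 by the three quieter weeks, while the stable second row barely changes.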

In [6]:
class BaselineModelLast4Weeks:
    """
    A simple baseline model that predicts the next hour's rides based on the average rides 
    of the same hour from the previous 4 weeks.
    """
    
    def fit(self, X_train: pd.DataFrame, y_train: pd.Series) -> None:
        """
        Fit method is intentionally left empty since our baseline model doesn't require training.
        """
        pass

    def predict(self, X_test: pd.DataFrame) -> np.ndarray:
        """
        Predict the number of rides for the next hour using the average rides of the same hour
        from the last 4 weeks.

        Args:
            X_test (pd.DataFrame): The test features.

        Returns:
            np.ndarray: Predicted number of rides for the next hour.
        """
        average_4_weeks = np.average(
            [X_test[f'rides_previous_{i*7*24}_hour'].values for i in range(1, 5)],
            axis=0
        )
        return average_4_weeks

# Instantiate and use the model
base_model = BaselineModelLast4Weeks()
predictions = base_model.predict(X_test)

test_mae = mean_absolute_error(y_test, predictions)
print(f'Test MAE: {test_mae:.4f}')

Test MAE: 3.1963


## Analysis of Baseline Models' Results

Across the three baselines, the test Mean Absolute Error (MAE) falls at each step:

- **Baseline Model v1 (Previous-Hour Predictor)**: Forecasting from the rides in just the preceding hour gives an MAE of **6.0558**.

- **Baseline Model v2 (Previous-Week Predictor)**: Predicting from the corresponding hour of the prior week exploits the weekly pattern in ride demand and lowers the MAE to **3.6811**.

- **Baseline Model v3 (Four-Week Average Predictor)**: Averaging the same hour over the last four weeks reduces the MAE further, to **3.1963**.
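
To put these numbers side by side, the short sketch below computes each baseline's MAE reduction relative to v1. The MAE values are the ones reported in this notebook; the dictionary keys are just labels chosen for illustration.

```python
# Test MAE values reported in this notebook.
mae = {
    "v1 previous hour": 6.0558,
    "v2 previous week": 3.6811,
    "v3 four-week average": 3.1963,
}

# Relative reduction of each model's MAE compared with v1, in percent.
reference = mae["v1 previous hour"]
reductions = {
    name: (reference - value) / reference * 100 for name, value in mae.items()
}

for name, value in mae.items():
    print(f"{name:<22} MAE={value:.4f}  reduction vs v1: {reductions[name]:5.1f}%")
```

This works out to a reduction of about 39.2% for v2 and about 47.2% for v3, both measured against v1.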


## Conclusion

The steady drop in MAE across these models shows how much signal the historical patterns carry, with the largest gain coming from the switch to the same hour one week earlier. None of these models uses a machine learning algorithm, but their results are the benchmarks that any later model has to beat.

Next, we will explore machine learning techniques, starting with an XGBoost model, to see whether it can improve on these baselines.