# 06 | Baseline Model #
## Introduction ##
In this notebook, we will create a baseline model. A baseline model is a simple rule, inferred by looking at the data, that uses no ML and has no complexity, but that gives us a **baseline performance**. Because we frame this time-series problem as a regression task (predicting the number of rides in the next hour), we will use **an error metric for regression problems**, namely **the mean absolute error** (MAE). Once we have this value, we can use it as a reference for the models we train later and evaluate how well they perform compared to our baseline.
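
As a quick illustration of the metric, the mean absolute error is the average of the absolute differences between the targets and the predictions. The sketch below computes it by hand with NumPy; the function name `mean_absolute_error_manual` and the toy values are made up for illustration and are not part of the project code.

```python
import numpy as np


def mean_absolute_error_manual(y_true, y_pred):
    """Average of the absolute differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


# Toy values, made up for illustration only
y_true = [3.0, 0.0, 5.0, 2.0]
y_pred = [1.0, 0.0, 8.0, 2.0]

mae = mean_absolute_error_manual(y_true, y_pred)
print(f"{mae=:.2f}")  # absolute errors are 2, 0, 3, 0, so the mean is 1.25
```

Because the error is expressed in the same unit as the target, an MAE of 1.25 here reads as "off by 1.25 rides on average".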

## Load the training data ##

In [7]:
import pandas as pd
from src.paths import TRANSFORMED_DATA_DIR
from datetime import datetime
from src.data_split import train_test_split
import numpy as np
from sklearn.metrics import mean_absolute_error



In [2]:
df = pd.read_parquet(TRANSFORMED_DATA_DIR / "tabular_data.parquet")
df

Unnamed: 0,rides_previous_672_hour,rides_previous_671_hour,rides_previous_670_hour,rides_previous_669_hour,rides_previous_668_hour,rides_previous_667_hour,rides_previous_666_hour,rides_previous_665_hour,rides_previous_664_hour,rides_previous_663_hour,rides_previous_662_hour,rides_previous_661_hour,rides_previous_660_hour,rides_previous_659_hour,rides_previous_658_hour,rides_previous_657_hour,rides_previous_656_hour,rides_previous_655_hour,rides_previous_654_hour,rides_previous_653_hour,rides_previous_652_hour,rides_previous_651_hour,rides_previous_650_hour,rides_previous_649_hour,rides_previous_648_hour,rides_previous_647_hour,rides_previous_646_hour,rides_previous_645_hour,rides_previous_644_hour,rides_previous_643_hour,rides_previous_642_hour,rides_previous_641_hour,rides_previous_640_hour,rides_previous_639_hour,rides_previous_638_hour,rides_previous_637_hour,rides_previous_636_hour,rides_previous_635_hour,rides_previous_634_hour,rides_previous_633_hour,...,rides_previous_37_hour,rides_previous_36_hour,rides_previous_35_hour,rides_previous_34_hour,rides_previous_33_hour,rides_previous_32_hour,rides_previous_31_hour,rides_previous_30_hour,rides_previous_29_hour,rides_previous_28_hour,rides_previous_27_hour,rides_previous_26_hour,rides_previous_25_hour,rides_previous_24_hour,rides_previous_23_hour,rides_previous_22_hour,rides_previous_21_hour,rides_previous_20_hour,rides_previous_19_hour,rides_previous_18_hour,rides_previous_17_hour,rides_previous_16_hour,rides_previous_15_hour,rides_previous_14_hour,rides_previous_13_hour,rides_previous_12_hour,rides_previous_11_hour,rides_previous_10_hour,rides_previous_9_hour,rides_previous_8_hour,rides_previous_7_hour,rides_previous_6_hour,rides_previous_5_hour,rides_previous_4_hour,rides_previous_3_hour,rides_previous_2_hour,rides_previous_1_hour,pickup_hour,pickup_location_id,target_rides_next_hour
0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,2.0,0.0,0.0,1.0,2.0,1.0,5.0,5.0,13.0,3.0,3.0,1.0,3.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,1.0,2.0,1.0,2.0,0.0,1.0,1.0,3.0,3.0,0.0,...,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,3.0,3.0,1.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,2022-01-29,1,0.0
1,0.0,0.0,0.0,0.0,0.0,4.0,1.0,2.0,1.0,2.0,0.0,1.0,1.0,3.0,3.0,0.0,9.0,5.0,4.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,2.0,1.0,2.0,2.0,...,0.0,0.0,1.0,3.0,3.0,1.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,2022-01-30,1,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,2.0,1.0,2.0,2.0,2.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,2.0,2.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,2.0,1.0,4.0,2.0,2.0,0.0,1.0,2.0,0.0,0.0,0.0,0.0,2022-01-31,1,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,2.0,2.0,1.0,2.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,...,1.0,2.0,1.0,4.0,2.0,2.0,0.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,2.0,1.0,0.0,1.0,1.0,0.0,0.0,2022-02-01,1,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,2.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,...,1.0,0.0,0.0,1.0,0.0,1.0,2.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2022-02-02,1,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
89300,3.0,0.0,2.0,3.0,2.0,3.0,13.0,8.0,9.0,9.0,6.0,8.0,15.0,12.0,11.0,19.0,31.0,30.0,14.0,6.0,4.0,4.0,7.0,1.0,6.0,4.0,0.0,0.0,2.0,0.0,14.0,7.0,8.0,4.0,7.0,7.0,8.0,16.0,22.0,26.0,...,2.0,2.0,1.0,0.0,5.0,4.0,2.0,1.0,3.0,4.0,4.0,0.0,4.0,3.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,2.0,4.0,4.0,3.0,1.0,1.0,4.0,2.0,3.0,6.0,5.0,3.0,1.0,6.0,1.0,3.0,2022-12-27,265,3.0
89301,6.0,4.0,0.0,0.0,2.0,0.0,14.0,7.0,8.0,4.0,7.0,7.0,8.0,16.0,22.0,26.0,21.0,39.0,20.0,4.0,5.0,7.0,5.0,9.0,7.0,2.0,3.0,4.0,7.0,4.0,10.0,9.0,7.0,11.0,10.0,10.0,8.0,4.0,15.0,23.0,...,3.0,1.0,1.0,4.0,2.0,3.0,6.0,5.0,3.0,1.0,6.0,1.0,3.0,3.0,1.0,2.0,1.0,0.0,0.0,0.0,0.0,2.0,6.0,2.0,1.0,1.0,2.0,0.0,1.0,1.0,4.0,2.0,1.0,2.0,2.0,2.0,8.0,2022-12-28,265,1.0
89302,7.0,2.0,3.0,4.0,7.0,4.0,10.0,9.0,7.0,11.0,10.0,10.0,8.0,4.0,15.0,23.0,27.0,19.0,13.0,7.0,6.0,7.0,16.0,9.0,6.0,5.0,4.0,3.0,0.0,3.0,11.0,12.0,9.0,10.0,9.0,15.0,9.0,17.0,16.0,24.0,...,1.0,1.0,2.0,0.0,1.0,1.0,4.0,2.0,1.0,2.0,2.0,2.0,8.0,1.0,3.0,0.0,8.0,0.0,0.0,0.0,1.0,0.0,0.0,3.0,3.0,2.0,1.0,4.0,4.0,2.0,2.0,3.0,5.0,1.0,1.0,0.0,8.0,2022-12-29,265,3.0
89303,6.0,5.0,4.0,3.0,0.0,3.0,11.0,12.0,9.0,10.0,9.0,15.0,9.0,17.0,16.0,24.0,19.0,22.0,9.0,8.0,2.0,4.0,8.0,5.0,6.0,6.0,4.0,2.0,5.0,1.0,0.0,3.0,5.0,8.0,9.0,4.0,2.0,7.0,5.0,8.0,...,3.0,2.0,1.0,4.0,4.0,2.0,2.0,3.0,5.0,1.0,1.0,0.0,8.0,3.0,4.0,6.0,3.0,2.0,1.0,1.0,1.0,0.0,2.0,5.0,2.0,1.0,2.0,1.0,3.0,2.0,3.0,3.0,1.0,2.0,0.0,1.0,2.0,2022-12-30,265,7.0


## Split the data into training and test data ##

In [3]:
X_train, y_train, X_test, y_test = train_test_split(
    df,
    cutoff_date=datetime(2022, 6, 1, 0, 0, 0),  # cutoff at 2022-06-01 00:00
    target_column_name="target_rides_next_hour"
)

print(f"{X_train.shape= }")
print(f"{y_train.shape= }")
print(f"{X_test.shape= }")
print(f"{y_test.shape= }")

X_train.shape= (32595, 674)
y_train.shape= (32595,)
X_test.shape= (56710, 674)
y_test.shape= (56710,)
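
The implementation of `train_test_split` lives in `src.data_split` and is not shown in this notebook. A minimal sketch of what a cutoff-based split might look like follows, assuming rows are assigned by comparing `pickup_hour` with the cutoff and that the target column is removed from the features. The function name `split_by_cutoff` and the toy DataFrame are hypothetical.

```python
from datetime import datetime

import pandas as pd


def split_by_cutoff(df, cutoff_date, target_column_name):
    """Hypothetical time-based split: rows before the cutoff train, the rest test."""
    train = df[df["pickup_hour"] < cutoff_date].reset_index(drop=True)
    test = df[df["pickup_hour"] >= cutoff_date].reset_index(drop=True)

    # Features are every column except the target
    X_train = train.drop(columns=[target_column_name])
    y_train = train[target_column_name]
    X_test = test.drop(columns=[target_column_name])
    y_test = test[target_column_name]
    return X_train, y_train, X_test, y_test


# Toy data, made up for illustration only
toy = pd.DataFrame({
    "rides_previous_1_hour": [1.0, 2.0, 3.0, 4.0],
    "pickup_hour": pd.to_datetime(
        ["2022-05-30", "2022-05-31", "2022-06-01", "2022-06-02"]
    ),
    "pickup_location_id": [1, 1, 1, 1],
    "target_rides_next_hour": [2.0, 3.0, 4.0, 5.0],
})

X_tr, y_tr, X_te, y_te = split_by_cutoff(
    toy, datetime(2022, 6, 1), "target_rides_next_hour"
)
print(X_tr.shape, X_te.shape)
```

Splitting on time rather than at random keeps every test row strictly after the training rows, so the evaluation does not use information from the future.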


## Build the baseline model ##
To predict the taxi demand for the upcoming hour, we will look at the demand in the **previous hour**. Our model takes the last observed demand and predicts that the demand for the next hour will be the same. It is a very simple model: the next hour is assumed to be exactly the same as the previous one. To do this, we will create a `BaselineModelPreviousHour` class.

In [5]:
class BaselineModelPreviousHour:
    """
    Prediction = actual demand observed in the last hour
    """
    def fit(self, X_train: pd.DataFrame, y_train: pd.Series):
        pass

    def predict(self, X_test: pd.DataFrame) -> pd.Series:
        """
        Dummy baseline model that predicts the upcoming number of rides by returning the number of rides from the previous hour.

        Args:

        - `X_test` (`pd.DataFrame`) - Pandas DataFrame with test features.

        Returns:

        - `pd.Series` - predicted results.
        """
        # The most recent lag feature is used directly as the prediction
        return X_test["rides_previous_1_hour"]

Most ML models expose a basic interface with **two methods**. You first **fit** the model on features and targets. Then you **predict**: given test features, the model returns an array of predictions (the inference step). Our baseline needs no training, so `fit` does nothing, but following this interface is useful for understanding how ML models work and which methods they usually provide.
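
One practical benefit of this shared interface is that the same evaluation code can be reused for any model. The sketch below is an illustration, not part of the project code: `LastValueModel` and `evaluate` are hypothetical names, and the toy data is made up.

```python
import numpy as np
import pandas as pd


class LastValueModel:
    """Hypothetical stand-in that follows the fit/predict interface."""

    def fit(self, X_train, y_train):
        return self  # nothing to learn for a fixed rule

    def predict(self, X_test):
        return X_test["rides_previous_1_hour"]


def evaluate(model, X_train, y_train, X_test, y_test):
    """Fit the model, predict on the test features, and return the MAE."""
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    errors = np.abs(np.asarray(y_test) - np.asarray(predictions))
    return float(np.mean(errors))


# Toy data, made up for illustration only
X_toy = pd.DataFrame({"rides_previous_1_hour": [0.0, 2.0, 4.0]})
y_toy = pd.Series([1.0, 2.0, 2.0])

toy_mae = evaluate(LastValueModel(), X_toy, y_toy, X_toy, y_toy)
print(f"{toy_mae=:.2f}")  # absolute errors are 1, 0, 2, so the mean is 1.00
```

Any object that provides `fit` and `predict`, whether a baseline or a trained estimator, could be passed to `evaluate` unchanged.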

In [6]:
model = BaselineModelPreviousHour()
predictions = model.predict(X_test)
predictions

0        0.0
1        0.0
2        0.0
3        0.0
4        0.0
        ... 
56705    3.0
56706    8.0
56707    8.0
56708    2.0
56709    7.0
Name: rides_previous_1_hour, Length: 56710, dtype: float32

How accurate are these predictions? We need an error metric, in our case the **mean absolute error**. We can use `scikit-learn`'s `mean_absolute_error` function to compute it.

In [8]:
test_mae = mean_absolute_error(y_test, predictions)
print(f"{test_mae=:.4f}")

test_mae=6.0558


Our mean absolute error is `6.0558`. We will use it as a reference number to compare with ML models. But we can improve our baseline model before moving on to ML.

## Create a second baseline model ##
We will take advantage of the fact that our data shows some seasonality. Instead of the previous hour, we will use the demand observed one week earlier to estimate the demand for the upcoming hour. For example, for a given Friday, we look at the same time on the previous Friday and use that value as our prediction. To do this, we will create a `BaselineModelPreviousWeek` class.


In [9]:
class BaselineModelPreviousWeek:
    """
    Baseline model that predicts the number of rides for the next hour using a 7-day interval.
    
    Prediction = actual demand observed at `t - 7` days.
    """
    def fit(self, X_train: pd.DataFrame, y_train: pd.Series):
        pass
    
    def predict(self, X_test: pd.DataFrame) -> pd.Series:
        """
        Dummy baseline model that predicts the upcoming number of rides by returning the number of rides based on a `t - 7` days interval period.

        Args:

        - `X_test` (`pd.DataFrame`) - Pandas DataFrame with test features.

        Returns:

        - `pd.Series` - predicted results.
        """
        # 7 days * 24 hours = 168, i.e. the `rides_previous_168_hour` column
        return X_test[f"rides_previous_{7*24}_hour"]

In [11]:
model = BaselineModelPreviousWeek()
predictions = model.predict(X_test)

In [12]:
test_mae = mean_absolute_error(y_test, predictions)
print(f"{test_mae=:.4f}")

test_mae=3.6811


As we can see, our mean absolute error dropped significantly, from `6.0558` to `3.6811`. Our model performs much better now. But can we improve on it? Let's try a third baseline model.
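
One way to see why a weekly lag can beat the previous hour is to try both rules on a series with a known periodic pattern. The sketch below uses purely synthetic data (a daily cycle, a weekend level shift, and Gaussian noise, all chosen arbitrarily for illustration), so it says nothing about the taxi data itself; it only shows the mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
hours = np.arange(24 * 7 * 8)  # eight weeks of hourly observations

# Synthetic demand: daily cycle + weekend level shift + noise (all made up)
daily = 8.0 * np.sin(2 * np.pi * hours / 24)
weekly = np.where((hours // 24) % 7 >= 5, 5.0, 0.0)
demand = 10.0 + daily + weekly + rng.normal(0.0, 0.5, size=hours.size)

lag_week = 7 * 24
actual = demand[lag_week:]                     # hours we try to predict
pred_previous_hour = demand[lag_week - 1:-1]   # value one hour earlier
pred_previous_week = demand[:-lag_week]        # value 168 hours earlier

mae_hour = float(np.mean(np.abs(actual - pred_previous_hour)))
mae_week = float(np.mean(np.abs(actual - pred_previous_week)))
print(f"{mae_hour=:.4f}")
print(f"{mae_week=:.4f}")
```

On this synthetic series the weekly lag lands on the same phase of both cycles, so its error is mostly noise, while the previous-hour rule also pays for the hour-to-hour change in the pattern.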


## Create a third baseline model ##

In this model, we will average the values that we observed 7, 14, 21 and 28 days ago. We leverage the seasonality of the data by taking four different measurements and averaging them. To do this, we create a `BaselineModelLast4Weeks` class.

In [13]:
class BaselineModelLast4Weeks:
    """
    Baseline model that predicts the number of rides for the next hour using the average number of rides for the past 4 weeks of the observed value.

    Prediction = average of the actual demand observed at `t - 7` days, `t - 14` days, `t - 21` days and `t - 28` days.
    """
    def fit(self, X_train: pd.DataFrame, y_train: pd.Series):
        pass

    def predict(self, X_test: pd.DataFrame) -> pd.Series:
        """
        Dummy baseline model that predicts the upcoming number of rides by returning the average number of rides for the past 4 weeks of the observed value.

        Args:

        - `X_test` (`pd.DataFrame`) - Pandas DataFrame with test features.

        Returns:

        - `pd.Series` - predicted results.
        """   
        # Equal-weight average of the same hour over the last four weeks
        return 0.25 * (
            X_test[f"rides_previous_{7*24}_hour"]
            + X_test[f"rides_previous_{2*7*24}_hour"]
            + X_test[f"rides_previous_{3*7*24}_hour"]
            + X_test[f"rides_previous_{4*7*24}_hour"]
        )

In [14]:
model = BaselineModelLast4Weeks()
predictions = model.predict(X_test)

test_mae = mean_absolute_error(y_test, predictions)
print(f"{test_mae=:.4f}")

test_mae=3.1963
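
The four-week average can also be written as a row-wise mean over the weekly lag columns, which makes it easy to change the number of weeks. This is a sketch of an equivalent formulation on made-up toy data, not the code used to produce the result above; `HOURS_PER_WEEK` and `weekly_lag_columns` are names introduced here for illustration.

```python
import pandas as pd

HOURS_PER_WEEK = 7 * 24

# Lag columns at 1, 2, 3 and 4 weeks: 168, 336, 504 and 672 hours
weekly_lag_columns = [
    f"rides_previous_{k * HOURS_PER_WEEK}_hour" for k in range(1, 5)
]

# Toy features, made up for illustration only
toy = pd.DataFrame({
    "rides_previous_168_hour": [4.0, 0.0],
    "rides_previous_336_hour": [2.0, 0.0],
    "rides_previous_504_hour": [6.0, 1.0],
    "rides_previous_672_hour": [0.0, 3.0],
})

# Row-wise mean over the four weekly lags
predictions = toy[weekly_lag_columns].mean(axis=1)
print(predictions.tolist())  # (4+2+6+0)/4 = 3.0 and (0+0+1+3)/4 = 1.0
```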


Our error is now even lower, which highlights the importance of building strong baseline models. **Don't underestimate the power of simple baseline models**. We can now start building ML models using this value - **3.1963** - as our reference. Any model, regardless of how complex it is, must achieve a test MAE lower than this number to add any real value to our product.