## Import dependencies

In [None]:
# display full output in Notebook, instead of only the last result
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# customized preprocessing functions
import util

# standard libraries
import pandas as pd
import numpy as np
import os
from datetime import datetime
import time
import matplotlib.pyplot as plt


# scikit-learn
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import OrdinalEncoder 
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

# models
from pmdarima import auto_arima
import statsmodels.api as sm
import xgboost as xgb

## Import data

In [None]:
data_path = "../00_data"

df_rapperswil = pd.read_csv(os.path.join(data_path, "features_rapperswil.csv"), sep=",")
df_burgdorf = pd.read_csv(os.path.join(data_path, "features_burgdorf.csv"), sep=",")

print('Dataset shape of Rapperswil data: {}'.format(df_rapperswil.shape))
print('Dataset shape of Burgdorf data: {}'.format(df_burgdorf.shape))

In [None]:
# keep only target variable and date as index for arima
# df_rapperswil = df_rapperswil[['date', 'occupancy_rate']]

# keep only date, hour and target variable for the heuristic model
df_rapperswil = df_rapperswil[['date', 'hour', 'occupancy_rate']]
df_burgdorf = df_burgdorf[['date', 'hour', 'occupancy_rate']]

## Train/Test Split

This notebook processes the data of one parking site at a time. To evaluate the second parking site, run the notebook again with that site's DataFrame passed to `split`.

In [None]:
def split(df, split_date):
    """Split df chronologically at split_date (format '%Y-%m-%d %H:%M:%S')."""
    
    # parse the split date; rows up to and including it go to the train set
    split_date = datetime.strptime(split_date, '%Y-%m-%d %H:%M:%S')
    
    # split df into train and test set
    df_train = df.loc[df['date'] <= split_date].copy()
    df_test = df.loc[df['date'] > split_date].copy()
    
    # set date as index in both sets
    df_train = df_train.set_index('date')
    df_test = df_test.set_index('date')
    
    return df_train, df_test

In [None]:
df_rapperswil['date'] = pd.to_datetime(df_rapperswil['date'])
df_burgdorf['date'] = pd.to_datetime(df_burgdorf['date'])

df_train, df_test = split(df_burgdorf, '2021-08-01 01:00:00')

## 1a. Heuristic Model

As shown in the data exploration part, parking occupancy follows a fairly clear 24-hour cyclical pattern with some variance. A natural baseline model is therefore to predict, for each hour of the day, the average occupancy rate observed for that hour in the training set.

In [None]:
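# compute the mean occupancy rate per hour on the training set
prediction = pd.DataFrame(df_train.groupby('hour')['occupancy_rate'].mean())

In [None]:
# attach the hourly mean to every test row; the shared column name yields
# occupancy_rate_x (actual) and occupancy_rate_y (predicted)
df_all = pd.merge(df_test, prediction, on=['hour'])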

In [None]:
# apply evaluation metrics
print('MAE: ', round(mean_absolute_error(y_true=df_all['occupancy_rate_x'], y_pred=df_all['occupancy_rate_y']), 2))
print('RMSE: ', round(mean_squared_error(y_true=df_all['occupancy_rate_x'], y_pred=df_all['occupancy_rate_y'], squared=False), 2))
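
The hourly-mean baseline can also be sketched on a few rows of made-up data. The values and the column name `heuristic` below are illustrative assumptions, not results from the parking data. This variant uses `Series.map` instead of `pd.merge`, which keeps the date index of the test set intact:

```python
import pandas as pd

# Hypothetical toy data; column names mirror the notebook's.
train = pd.DataFrame(
    {"hour": [0, 1, 0, 1], "occupancy_rate": [0.2, 0.6, 0.4, 0.8]},
    index=pd.to_datetime(["2021-07-30 00:00", "2021-07-30 01:00",
                          "2021-07-31 00:00", "2021-07-31 01:00"]),
)
test = pd.DataFrame(
    {"hour": [0, 1], "occupancy_rate": [0.25, 0.75]},
    index=pd.to_datetime(["2021-08-01 00:00", "2021-08-01 01:00"]),
)

# Mean occupancy per hour of day, learned from the training rows only.
hourly_mean = train.groupby("hour")["occupancy_rate"].mean()

# Look up the hourly mean for each test row; the date index is preserved.
test["heuristic"] = test["hour"].map(hourly_mean)

# Mean absolute error of the baseline on the toy test set.
mae = (test["occupancy_rate"] - test["heuristic"]).abs().mean()
print(test)
print("MAE:", round(mae, 2))
```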

## 1b. ARIMA Model

Model parameters:

- **p:** The number of lag observations included in the model, also called the lag order.
- **d:** The number of times that the raw observations are differenced, also called the degree of differencing.
- **q:** The size of the moving average window, also called the order of moving average.
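
For reference, one common way to write an ARIMA(p, d, q) model uses the lag operator $L$, defined by $L y_t = y_{t-1}$. The symbols $\phi_i$ (autoregressive coefficients), $\theta_j$ (moving-average coefficients), $c$ (constant) and $\varepsilon_t$ (white-noise error) are notation assumed here and do not appear elsewhere in the notebook:

$$
\left(1 - \sum_{i=1}^{p} \phi_i L^i\right) (1 - L)^d \, y_t = c + \left(1 + \sum_{j=1}^{q} \theta_j L^j\right) \varepsilon_t
$$

The factor $(1 - L)^d$ applies differencing $d$ times, the left-hand sum covers the $p$ lagged observations, and the right-hand sum covers the $q$ lagged errors.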

Preliminary steps:

**1) Stationarity:**
ARIMA models assume stationarity: a flat-looking series without trend or periodic fluctuations, with constant variance and a constant autocorrelation structure over time. If the series is non-stationary, we need to difference it (replace each observation by its change from the previous one) until it becomes stationary; the number of differencing steps is the parameter d.

**2) Analyse Autocorrelation:** We use the autocorrelation function (ACF) and the partial autocorrelation function (PACF) to determine a suitable number of MA terms (q) and AR terms (p).
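
The two preliminary steps can be illustrated with a minimal sketch on synthetic data. The random walk below is a made-up example, not the parking series, and `sample_acf` is a hypothetical helper written for this illustration (statsmodels offers ready-made ACF/PACF functions). The idea: a non-stationary series shows autocorrelation that stays close to 1, while its first difference does not.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation of x for lags 0..max_lag (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = np.sum(centered ** 2)
    return np.array([
        np.sum(centered[: len(x) - k] * centered[k:]) / denom
        for k in range(max_lag + 1)
    ])

# Hypothetical toy series: a random walk, which is non-stationary.
rng = np.random.default_rng(0)
steps = rng.normal(size=2000)
random_walk = np.cumsum(steps)

# First-order differencing (d = 1) recovers the stationary increments.
differenced = np.diff(random_walk)

acf_raw = sample_acf(random_walk, max_lag=5)
acf_diff = sample_acf(differenced, max_lag=5)

print("lag-1 ACF, raw series:        ", round(acf_raw[1], 3))
print("lag-1 ACF, differenced series:", round(acf_diff[1], 3))
```

A slowly decaying ACF on the raw series suggests that differencing is needed; once the series is differenced, the remaining ACF and PACF spikes guide the choice of q and p.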

In [None]:
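# search p and q stepwise on the target series only; d is left to
# auto_arima's differencing test and capped by max_d
stepwise_model = auto_arima(df_train['occupancy_rate'], start_p=1, max_p=3, start_q=1,
                            max_q=3, max_d=3,
                            seasonal=False,
                            trace=True,
                            error_action='ignore',
                            suppress_warnings=True,
                            stepwise=True)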

In [None]:
# refit the model on the target series with the order selected by the stepwise search
stepwise_model.fit(df_train['occupancy_rate'])

In [None]:
# forecast one value per test row and assign the values positionally to the test set
prediction = stepwise_model.predict(n_periods=len(df_test))
df_test['arima'] = np.asarray(prediction)

# merge train and test set in chronological order for visualization purposes
df_all = pd.concat([df_train, df_test], sort=False)

1) Calculate evaluation metrics on the test set

**Mean absolute error** \
MAE measures the average magnitude of the errors in a set of predictions, without considering their direction. It’s the average over the test sample of the absolute differences between prediction and actual observation where all individual differences have equal weight.

**Root mean squared error** \
RMSE is a quadratic scoring rule that also measures the average magnitude of the error. It’s the square root of the average of squared differences between prediction and actual observation.
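
Written out for $n$ test observations, with $y_i$ the actual value and $\hat{y}_i$ the prediction (symbols introduced here for illustration), the two metrics are:

$$
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|,
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
$$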

**Difference of these two measures** \
Taking the square root of the average squared errors has some interesting implications for RMSE. Since the errors are squared before they are averaged, the RMSE gives a relatively high weight to large errors. This means the RMSE is more useful when large errors are particularly undesirable. For a fixed MAE, the RMSE increases as the variance of the error magnitudes increases.
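
A small sketch with made-up error values (not taken from the parking data) shows this effect. The helper names `mae` and `rmse` are defined here for illustration only. All three error sets share the same MAE, but the RMSE grows as the errors become more unevenly distributed:

```python
import numpy as np

def mae(errors):
    """Mean absolute error of a list of prediction errors."""
    return float(np.mean(np.abs(errors)))

def rmse(errors):
    """Root mean squared error of a list of prediction errors."""
    return float(np.sqrt(np.mean(np.square(errors))))

# Hypothetical error sets with identical MAE but increasing spread.
error_sets = {
    "equal errors": [2, 2, 2, 2],
    "one larger error": [1, 1, 1, 5],
    "one dominant error": [0, 0, 0, 8],
}

for label, errors in error_sets.items():
    print(f"{label:>20}: MAE = {mae(errors):.2f}, RMSE = {rmse(errors):.2f}")
```

The MAE is 2.00 in every case, while the RMSE rises from 2.00 to 2.65 to 4.00.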

In [None]:
print('MAE: ', round(mean_absolute_error(y_true=df_test['occupancy_rate'], y_pred=df_test['arima']), 2))
print('RMSE: ', round(mean_squared_error(y_true=df_test['occupancy_rate'], y_pred=df_test['arima'], squared=False), 2))

2) Plot actual and predicted values

In [None]:
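# recompute the forecast so this cell can run on its own; assign values positionally
df_test['arima'] = np.asarray(stepwise_model.predict(n_periods=len(df_test)))
# concatenate in chronological order so the plotted lines run left to right
df_all = pd.concat([df_train, df_test], sort=False)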

In [None]:
df_all[['occupancy_rate', 'arima']].plot(figsize=(15, 5))