# Traffic Flow Regression
 - This notebook builds a machine learning model to predict the traffic flow at a site from the time of day and the day of the week
 - See EDA.ipynb for a deeper look at the data

# Imports

In [81]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, SGDRegressor, Ridge, Lasso, ElasticNet
from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor



# Prepare Dataset
Reading in a dataset already cleaned in EDA.ipynb

In [42]:
df = pd.read_csv('ml_datasets/traffic_flow_regression.csv')
df.head()

Unnamed: 0,region,locn,X,Y,site,day,date,start_time,end_time,flow,day_ind
0,RGA,Airton GrHill,-6.356151,53.293594,N01111A,TU,2022-04-01,03:00:00,03:15:00,13,1
1,RGA,Airton GrHill,-6.356151,53.293594,N01111A,TU,2022-04-01,03:15:00,03:30:00,10,1
2,RGA,Airton GrHill,-6.356151,53.293594,N01111A,TU,2022-04-01,03:30:00,03:45:00,0,1
3,RGA,Airton GrHill,-6.356151,53.293594,N01111A,TU,2022-04-01,03:45:00,04:00:00,9,1
4,RGA,Airton GrHill,-6.356151,53.293594,N01111A,TU,2022-04-01,04:00:00,04:15:00,0,1


Let's simplify by aggregating the data by hour
- For now, we keep only site, day, date, flow and the time in hours

In [43]:
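# Parse the time strings with an explicit format so pandas does not have to infer it
df['start_time'] = pd.to_datetime(df['start_time'], format='%H:%M:%S')

# Keep only the hour of each 15-minute interval
df['time_(hour)'] = df['start_time'].dt.hour

# Sum the 15-minute flows into one hourly total per site, day and date
df = df.groupby(['time_(hour)', 'site', 'day', 'date']).agg({'flow': 'sum'}).reset_index()
df.head()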


Unnamed: 0,time_(hour),site,day,date,flow
0,0,N01111A,FR,2022-01-04,127
1,0,N01111A,FR,2022-01-14,111
2,0,N01111A,FR,2022-01-21,93
3,0,N01111A,FR,2022-01-28,113
4,0,N01111A,FR,2022-02-18,120
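
As a sanity check on the aggregation step, here is a minimal sketch on a made-up frame. The site name `S1` and the flow values are illustrative assumptions, not rows from the dataset. It shows four 15-minute counts collapsing into a single hourly total.

```python
import pandas as pd

# Toy 15-minute counts for one site on one date (values are made up for illustration)
toy = pd.DataFrame({
    "site": ["S1"] * 5,
    "day": ["TU"] * 5,
    "date": ["2022-04-01"] * 5,
    "start_time": ["03:00:00", "03:15:00", "03:30:00", "03:45:00", "04:00:00"],
    "flow": [13, 10, 0, 9, 4],
})

# Parse the time strings with an explicit format and keep only the hour
toy["time_(hour)"] = pd.to_datetime(toy["start_time"], format="%H:%M:%S").dt.hour

# Sum the 15-minute counts within each (hour, site, day, date) group
hourly = (
    toy.groupby(["time_(hour)", "site", "day", "date"])
       .agg({"flow": "sum"})
       .reset_index()
)
print(hourly)
```

The four intervals starting at 03:00 sum to 32 and the single 04:00 interval stays at 4, so the toy frame ends up with two hourly rows.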


# Modelling
- For now, it makes sense to keep things simple and build a separate ML model for each site.
- We will use only the time and day features to make a prediction
    - Both are simple categorical features
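
One way the "separate model per site" idea could look in code is sketched below. This is only an illustration on synthetic data: the site names `SITE_A` and `SITE_B`, the random flows and the `models` dictionary are assumptions introduced here, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the aggregated data; site names and flows are made up
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "site": np.repeat(["SITE_A", "SITE_B"], 48),
    "day": np.tile(np.repeat([0, 1], 24), 2),      # day, already encoded as a number
    "time_(hour)": np.tile(np.arange(24), 4),
})
toy["flow"] = rng.integers(0, 500, size=len(toy))

# Fit one independent model per site, keyed by site ID
models = {}
for site, group in toy.groupby("site"):
    site_model = DecisionTreeRegressor(random_state=0)
    site_model.fit(group[["time_(hour)", "day"]], group["flow"])
    models[site] = site_model

print(sorted(models))
```

Keeping the fitted models in a dictionary means a prediction for a given site is a lookup followed by `predict`, and no model ever sees another site's rows.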


## Create X and y datasets

In [44]:
X = df[['site', 'day', 'time_(hour)']]
y = df[['flow', 'site']]

## Feature transformation
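Let's use ordinal encoding, since day and time have a meaningful order. Note that OrdinalEncoder with default settings assigns codes in sorted order, so the string day codes are numbered alphabetically rather than in calendar order.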

In [45]:
ordinal_encoder = OrdinalEncoder()
# Take an explicit copy so the assignment below modifies X itself and does not raise SettingWithCopyWarning
X = X.copy()
X[['day', 'time_(hour)']] = ordinal_encoder.fit_transform(X[['day', 'time_(hour)']])
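
If the encoded day should follow the calendar, the order can be passed to the encoder explicitly. The sketch below is an assumption-laden illustration: only `TU` and `FR` are visible in the data shown in this notebook, so the full `day_order` list is a guess at the remaining codes, and the toy frame is made up.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical day codes; only 'TU' and 'FR' appear in the data shown in the notebook
day_order = ["MO", "TU", "WE", "TH", "FR", "SA", "SU"]
hour_order = list(range(24))

toy = pd.DataFrame({"day": ["FR", "MO", "TU"], "time_(hour)": [0, 7, 21]})

# Default behaviour: categories are sorted, so string codes are numbered alphabetically
default_codes = OrdinalEncoder().fit_transform(toy[["day"]]).ravel()

# Explicit categories: codes follow the order of the lists given above
encoder = OrdinalEncoder(categories=[day_order, hour_order])
explicit_codes = encoder.fit_transform(toy[["day", "time_(hour)"]])

print(default_codes)   # FR, MO, TU in alphabetical order -> [0. 1. 2.]
print(explicit_codes)  # FR -> 4, MO -> 0, TU -> 1 in the day column
```

For a tree-based model the exact numbering matters less than for a linear one, because trees only split on thresholds, but an explicit order keeps the codes interpretable.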


## Train Test split
The train/test split should be created per site

In [46]:
def train_test_split_site(site):
    # Select the encoded features and the target for a single site
    X_site = X[X['site']==site][['time_(hour)', 'day']]
    y_site = y[y['site']==site]['flow']
    X_train, X_test, y_train, y_test = train_test_split(X_site, y_site, test_size=0.2, random_state=42)

    return X_train, X_test, y_train, y_test


## Create model

Let's start with a decision tree regressor

In [82]:
model = DecisionTreeRegressor()

## Fit Model
### Let's first pick the site N01111A

In [83]:
X_train, X_test, y_train, y_test = train_test_split_site('N01111A')
model.fit(X_train, y_train)
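
Since the imports also bring in linear models, a quick comparison helps show why the model choice matters for this kind of data. The sketch below uses synthetic flows with a morning and an evening peak. The peak shapes, noise level and the `scores` dictionary are assumptions made for illustration, and the numbers it prints say nothing about the real site.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic flows with morning and evening peaks (made up, not the real site data)
rng = np.random.default_rng(42)
hours = np.tile(np.arange(24), 7 * 8)            # 8 weeks of hourly rows
days = np.repeat(np.tile(np.arange(7), 8), 24)   # day index 0..6 for each row
flow = (
    500 * np.exp(-((hours - 8) ** 2) / 8)
    + 400 * np.exp(-((hours - 17) ** 2) / 8)
    + rng.normal(0, 20, size=hours.size)
)
X_toy = pd.DataFrame({"time_(hour)": hours, "day": days})

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, flow, test_size=0.2, random_state=42)

# Fit each candidate on the same split and record its R² on the held-out rows
scores = {}
for name, candidate in [
    ("linear", LinearRegression()),
    ("tree", DecisionTreeRegressor(random_state=42)),
]:
    candidate.fit(X_tr, y_tr)
    scores[name] = r2_score(y_te, candidate.predict(X_te))

print(scores)
```

A straight line in the hour cannot follow a two-peak daily profile, while a tree can assign a separate value to each hour and day combination, so on data shaped like this the tree scores higher.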

# Model Performance

In [84]:
y_pred = model.predict(X_test)

In [85]:
X_train

Unnamed: 0,time_(hour),day
18278,21.0,0.0
6208,7.0,5.0
2742,3.0,6.0
9680,11.0,5.0
15668,18.0,0.0
...,...,...
6169,7.0,3.0
6204,7.0,5.0
7093,8.0,6.0
4484,5.0,6.0


In [86]:
# Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred)

# Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)

# Root Mean Squared Error (RMSE)
rmse = np.sqrt(mse)

# R-squared (R²)
r2 = r2_score(y_test, y_pred)

# Print the evaluation metrics
print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")
print(f"R-squared (R²): {r2}")

Mean Absolute Error (MAE): 105.85119582867544
Mean Squared Error (MSE): 31886.808902306893
Root Mean Squared Error (RMSE): 178.5687791925198
R-squared (R²): 0.9412836118110991
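
If the same metrics are needed for several sites or models, they could be wrapped in a small helper. The function name `regression_report` and the toy arrays below are assumptions introduced for this sketch; the metric functions are the scikit-learn ones already imported in this notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return MAE, MSE, RMSE and R² as a dictionary."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "mae": mean_absolute_error(y_true, y_pred),
        "mse": mse,
        "rmse": float(np.sqrt(mse)),  # RMSE is the square root of MSE
        "r2": r2_score(y_true, y_pred),
    }


# Made-up flows, purely to exercise the helper
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 320.0, 380.0])
report = regression_report(y_true, y_pred)
print(report)
```

On these toy values the absolute errors are 10, 10, 20 and 20, giving an MAE of 15 and an MSE of 250. RMSE is in the same units as the flow, which makes it easier to compare against typical hourly counts than MSE.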
