# Traffic Flow Regression
 - This notebook builds a machine learning model to predict the traffic flow at a site given the day of the week and the time of day
 - see EDA.ipynb for a deeper understanding of the data

# Imports

In [11]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, SGDRegressor, Ridge, Lasso, ElasticNet
from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor



# Prepare Dataset
Reading in a dataset already cleaned in EDA.ipynb

In [12]:
df = pd.read_csv('ml_datasets/traffic_flow_regression.csv')
df.head()

Unnamed: 0,region,locn,X,Y,site,day,date,start_time,end_time,flow,day_ind
0,RGA,Airton GrHill,-6.356151,53.293594,N01111A,TU,2022-04-01,03:00:00,03:15:00,13,1
1,RGA,Airton GrHill,-6.356151,53.293594,N01111A,TU,2022-04-01,03:15:00,03:30:00,10,1
2,RGA,Airton GrHill,-6.356151,53.293594,N01111A,TU,2022-04-01,03:30:00,03:45:00,0,1
3,RGA,Airton GrHill,-6.356151,53.293594,N01111A,TU,2022-04-01,03:45:00,04:00:00,9,1
4,RGA,Airton GrHill,-6.356151,53.293594,N01111A,TU,2022-04-01,04:00:00,04:15:00,0,1


Let's simplify by aggregating the data by hour
- We will include only site, day, date, flow and the hour of day for now

In [13]:
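# start_time holds clock times such as '03:15:00'; an explicit format means
# pandas does not have to guess how to parse each value
df['start_time'] = pd.to_datetime(df['start_time'], format='%H:%M:%S')

df['time_(hour)'] = df['start_time'].dt.hour

# Sum the 15-minute counts within each hour into one hourly flow per site and date
df = df.groupby(['time_(hour)', 'site', 'day', 'date']).agg({'flow': 'sum'}).reset_index()
df.head()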

Unnamed: 0,time_(hour),site,day,date,flow
0,0,N01111A,FR,2022-01-04,127
1,0,N01111A,FR,2022-01-14,111
2,0,N01111A,FR,2022-01-21,93
3,0,N01111A,FR,2022-01-28,113
4,0,N01111A,FR,2022-02-18,120


# Modelling
- It probably makes sense for the moment to just keep it simple and build a separate ML model for each site.
- We will also just use the features time and day to make a prediction
    - Both are simple categorical features, so they need little preparation
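
The one-model-per-site idea can be sketched as a dictionary of fitted estimators keyed by site. This is a minimal illustration on made-up data, not the code used later in this notebook: the frame `toy`, the site labels `A` and `B`, and the name `models_by_site` are all hypothetical.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.tree import DecisionTreeRegressor

# Toy stand-in for the hourly dataset; every value here is made up.
toy = pd.DataFrame({
    "site": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "day": [0, 0, 1, 1, 0, 0, 1, 1],
    "time_(hour)": [8, 20, 8, 20, 8, 20, 8, 20],
    "flow": [900, 300, 700, 250, 90, 30, 70, 25],
})

template = DecisionTreeRegressor(random_state=0)
models_by_site = {}
for site, group in toy.groupby("site"):
    # clone() gives each site its own unfitted copy of the template estimator
    model = clone(template)
    model.fit(group[["day", "time_(hour)"]], group["flow"])
    models_by_site[site] = model

# Same day and hour, but each site answers with its own model
query = pd.DataFrame({"day": [0], "time_(hour)": [8]})
pred_a = models_by_site["A"].predict(query)[0]
pred_b = models_by_site["B"].predict(query)[0]
print(pred_a, pred_b)
```

Keeping one estimator per site means a busy site and a quiet site never share parameters, at the cost of each model seeing less data.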


## Create X and y datasets

In [14]:
# .copy() gives X its own data, so the encoded columns can be assigned below
# without pandas raising a SettingWithCopyWarning
X = df[['site', 'day', 'time_(hour)']].copy()
y = df[['flow', 'site']]

## Feature transformation
Let's use ordinal encoding for day and time
- Hour of day has a natural numeric order, which the encoder preserves
- Note that OrdinalEncoder sorts string categories alphabetically by default, so the day codes are numbered in alphabetical order rather than in calendar order

In [15]:
ordinal_encoder = OrdinalEncoder()
X[['day', 'time_(hour)']] = ordinal_encoder.fit_transform(X[['day', 'time_(hour)']])
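
If calendar order matters for the day feature, `OrdinalEncoder` accepts an explicit `categories` list. The sketch below compares the default alphabetical codes with calendar-ordered ones. The seven two-letter codes in `week_order` are an assumption: only `TU` and `FR` are visible in the previews above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Assumed weekday codes; only 'TU' and 'FR' appear in the data previews.
week_order = ["MO", "TU", "WE", "TH", "FR", "SA", "SU"]

days = pd.DataFrame({"day": ["FR", "MO", "SU", "TU"]})

# Default behaviour: categories are sorted, so the codes follow the alphabet
default_codes = OrdinalEncoder().fit_transform(days).ravel()

# Explicit categories: the codes follow the position in week_order
ordered_codes = OrdinalEncoder(categories=[week_order]).fit_transform(days).ravel()

print(default_codes)  # FR=0, MO=1, SU=2, TU=3
print(ordered_codes)  # FR=4, MO=0, SU=6, TU=1
```

For a tree model the numbering matters little, because splits can isolate any code. For linear regression the numbering defines the slope, so the choice can change the fit.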


## Train Test split
The train/test split should be created per site

In [16]:
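def train_test_split_site(df, site):
    # Note: the split reads the module-level X and y built above; the df
    # argument is not used and is kept so existing calls still work
    on_site = X['site'] == site
    X_site = X.loc[on_site, ['time_(hour)', 'day']]
    y_site = y.loc[y['site'] == site, 'flow']

    # Hold out 20% of the site's rows for testing; fixed seed for repeatability
    X_train, X_test, y_train, y_test = train_test_split(X_site, y_site, test_size=0.2, random_state=42)

    return X_train, X_test, y_train, y_test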


## Create model

- For version 1, let's use two simple models: Linear Regression and a Decision Tree

In [17]:
lr = LinearRegression()
dt = DecisionTreeRegressor()

## Fit Model
### Let's first pick the site N01111A

In [18]:
X_train, X_test, y_train, y_test = train_test_split_site(df, 'N01111A')

Linear Regression

In [19]:
X_train

Unnamed: 0,time_(hour),day
17753,21.0,0.0
6033,7.0,5.0
2667,3.0,6.0
9405,11.0,5.0
15218,18.0,0.0
...,...,...
5994,7.0,3.0
6029,7.0,5.0
6893,8.0,6.0
4359,5.0,6.0


In [20]:
lr.fit(X_train, y_train)

Decision Tree

In [21]:
dt.fit(X_train, y_train)

# Predict

In [22]:
predictions = {}
predictions['lr'] = lr.predict(X_test)
predictions['dt'] = dt.predict(X_test)

# Model Performance


In [23]:
def model_performance(predictions, y_test, model_type):

    y_pred = predictions[model_type]

    mae = mean_absolute_error(y_test, y_pred)

    # Mean Squared Error (MSE)
    mse = mean_squared_error(y_test, y_pred)

    # Root Mean Squared Error (RMSE)
    rmse = np.sqrt(mse)

    # R-squared (R²)
    r2 = r2_score(y_test, y_pred)


    # Print the evaluation metrics
    print(f"Mean Absolute Error (MAE): {mae}")
    print(f"Mean Squared Error (MSE): {mse}")
    print(f"Root Mean Squared Error (RMSE): {rmse}")
    if model_type == 'lr':
        print(f"R-squared (R²): {r2}")

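    # Note: this divides MAE by the mean flow of the global df (all sites), so it
    # is a normalised MAE rather than a true per-observation MAPE
    mean_flow = np.mean(df['flow'])
    print(f'Mean Absolute Percentage Error (MAPE): {100*mae/mean_flow:.1f}%')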

    return mae, mse, rmse
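
The percentage printed by the function above is MAE divided by the mean flow, which is not the same quantity as a per-observation MAPE. The sketch below contrasts the two on made-up numbers; `y_true_demo` and `y_pred_demo` are hypothetical and unrelated to the traffic data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up values spanning several orders of magnitude
y_true_demo = np.array([10.0, 100.0, 1000.0])
y_pred_demo = np.array([20.0, 110.0, 1010.0])

mae_demo = mean_absolute_error(y_true_demo, y_pred_demo)

# MAE expressed as a percentage of the mean target (what the notebook prints)
mae_over_mean = 100 * mae_demo / y_true_demo.mean()

# Per-observation MAPE: average of |error| / |actual|
# (undefined when an actual value is zero)
true_mape = 100 * np.mean(np.abs((y_true_demo - y_pred_demo) / y_true_demo))

print(f"MAE / mean: {mae_over_mean:.1f}%")
print(f"MAPE:       {true_mape:.1f}%")
```

Every prediction here is off by 10, so the normalised MAE is about 2.7%, while the per-observation MAPE is 37% because the error on the smallest value is 100% of that value. The normalised form is the safer choice when some actual flows can be zero.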

## - Linear Regression

In [24]:
results = {
    'lr': {
        'mae': None,
        'mse': None,
        'rmse': None
    },
    'dt': {
        'mae': None,
        'mse': None,
        'rmse': None
    }

}
# Assuming model_performance returns a tuple (mae, mse, rmse)
results['lr']['mae'], results['lr']['mse'], results['lr']['rmse'] = model_performance(predictions,y_test, 'lr')


Mean Absolute Error (MAE): 595.0585401535006
Mean Squared Error (MSE): 444291.9835040713
Root Mean Squared Error (RMSE): 666.5523111534993
R-squared (R²): 0.1818804869259122
Mean Absolute Percentage Error (MAPE): 66.8%


## - Decision Tree

In [25]:
results['dt']['mae'], results['dt']['mse'], results['dt']['rmse'] = model_performance(predictions,y_test, 'dt')


Mean Absolute Error (MAE): 105.85119582867544
Mean Squared Error (MSE): 31886.808902306893
Root Mean Squared Error (RMSE): 178.5687791925198
Mean Absolute Percentage Error (MAPE): 11.9%


### Observations
- The decision tree makes better predictions than the linear regression model: its MAE, MSE and RMSE are all smaller
- This suggests the relationship between the features and the target variable is non-linear, which a decision tree captures better than a straight-line fit
- The MAE of the decision tree is still about 106. On average, its prediction is 106 away from the actual hourly flow
- To put this in perspective, the value printed as MAPE for the decision tree is 11.9%. It is computed as MAE divided by the mean flow in df, so the average error is about 12% of the mean hourly flow. Strictly speaking this is a normalised MAE rather than a per-observation MAPE
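
The non-linearity point can be illustrated with a synthetic daily profile that is low at night and peaks at midday. This is a sketch under stated assumptions: the sine-shaped `flow_demo` curve is invented for illustration, and the errors are measured on the training points themselves, so they show how well each model can represent the shape rather than how well it generalises.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

hours = np.arange(24).reshape(-1, 1)

# Synthetic flow: near zero at midnight, peaking at noon (not real data)
flow_demo = 1000 * np.sin(np.pi * hours.ravel() / 24) ** 2

lr_demo = LinearRegression().fit(hours, flow_demo)
dt_demo = DecisionTreeRegressor(random_state=0).fit(hours, flow_demo)

# In-sample errors: a straight line cannot follow a rise-then-fall pattern
lr_mae = mean_absolute_error(flow_demo, lr_demo.predict(hours))
dt_mae = mean_absolute_error(flow_demo, dt_demo.predict(hours))

print(f"Linear regression MAE: {lr_mae:.1f}")
print(f"Decision tree MAE:     {dt_mae:.1f}")
```

A line through a symmetric peak ends up close to the overall mean, so it misses both the night-time trough and the midday peak, while the tree can assign a separate value to each hour.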

#### Evaluating models
- The acceptable margin for error always depends on the purpose of the model. If our predictions were going to be used in a medical scenario, for example, we would want the model to be as accurate as possible
- In a business context we would need to agree with the appropriate stakeholders on what counts as an acceptable margin for error
- Let's assume our margin for error is 0 and strive for it by improving our model. See V2!

# All sites
Let's look at all sites using the same method to see whether the pattern seen for that site holds elsewhere

In [26]:
def train_fit_eval_site(site):

    # DATAFRAME is a module-level alias of df, assigned in the cell that calls this function
    df = DATAFRAME[DATAFRAME['site']==site]

    X_train, X_test, y_train, y_test = train_test_split_site(df, site)
    
    lr.fit(X_train, y_train)
    dt.fit(X_train, y_train)
    predictions = {}

    predictions['lr'] = lr.predict(X_test)
    predictions['dt'] = dt.predict(X_test)

    # len(df) counts every row for the site, i.e. before the 80/20 train/test split
    print('SITE: ',site, ' Number of training examples:', len(df))

    print('Linear Regression:')
    model_performance(predictions,y_test, 'lr')

    print('Decision Tree:')
    model_performance(predictions,y_test, 'dt')

In [27]:
df['site'].unique()

array(['N01111A', 'N01131A', 'N01151A', 'N02111A', 'N02131A', 'N03121A'],
      dtype=object)

In [28]:
DATAFRAME = df
for site in df['site'].unique():
    train_fit_eval_site(site)

SITE:  N01111A  Number of training examples: 3480
Linear Regression:
Mean Absolute Error (MAE): 595.0585401535006
Mean Squared Error (MSE): 444291.9835040713
Root Mean Squared Error (RMSE): 666.5523111534993
R-squared (R²): 0.1818804869259122
Mean Absolute Percentage Error (MAPE): 66.8%
Decision Tree:
Mean Absolute Error (MAE): 105.85119582867544
Mean Squared Error (MSE): 31886.808902306893
Root Mean Squared Error (RMSE): 178.5687791925198
Mean Absolute Percentage Error (MAPE): 11.9%
SITE:  N01131A  Number of training examples: 2879
Linear Regression:
Mean Absolute Error (MAE): 144.54884377590952
Mean Squared Error (MSE): 126955.37377810039
Root Mean Squared Error (RMSE): 356.30797602369273
R-squared (R²): 0.0006871624393772757
Mean Absolute Percentage Error (MAPE): 16.2%
Decision Tree:
Mean Absolute Error (MAE): 145.65707689046383
Mean Squared Error (MSE): 138745.56103937022
Root Mean Squared Error (RMSE): 372.48565212551506
Mean Absolute Percentage Error (MAPE): 16.3%
SITE:  N01151A 

As a table...

| Site     | Rows (before split) | Model             | Mean Absolute Error (MAE) | Mean Squared Error (MSE) | Root Mean Squared Error (RMSE) | R-squared (R²) | MAE as % of mean flow (printed as MAPE) |
|----------|------------------|-------------------|----------------------------|---------------------------|----------------------------------|----------------|--------------------------------------|
| N01111A  | 3480             | Linear Regression  | 595.06                     | 444291.98                 | 666.55                           | 0.1819         | 66.8%                                |
|          |                  | Decision Tree      | 105.85                     | 31886.81                  | 178.57                           |                | 11.9%                                |
| N01131A  | 2879             | Linear Regression  | 144.55                     | 126955.37                 | 356.31                           | 0.0007         | 16.2%                                |
|          |                  | Decision Tree      | 145.66                     | 138745.56                 | 372.49                           |                | 16.3%                                |
| N01151A  | 3480             | Linear Regression  | 546.97                     | 374030.12                 | 611.58                           | 0.1362         | 61.4%                                |
|          |                  | Decision Tree      | 93.03                      | 27962.88                  | 167.22                           |                | 10.4%                                |
| N02111A  | 3480             | Linear Regression  | 760.37                     | 725177.17                 | 851.57                           | 0.2787         | 85.3%                                |
|          |                  | Decision Tree      | 118.66                     | 46061.67                  | 214.62                           |                | 13.3%                                |
| N02131A  | 3480             | Linear Regression  | 585.46                     | 467660.51                 | 683.86                           | 0.3044         | 65.7%                                |
|          |                  | Decision Tree      | 105.09                     | 31225.56                  | 176.71                           |                | 11.8%                                |
| N03121A  | 3480             | Linear Regression  | 521.87                     | 356974.59                 | 597.47                           | 0.2509         | 58.6%                                |
|          |                  | Decision Tree      | 115.25                     | 36362.44                  | 190.69                           |                | 12.9%                                |


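We see the same general pattern at five of the six sites: the decision tree's MAPE is roughly 10-13%, while linear regression is far worse at 59-85%. The exception is N01131A, where both models land at about 16% and the linear regression R² is close to zero (0.0007)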
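
A table like the one above can be built directly instead of being copied from printed output. The sketch below collects one dictionary of metrics per site and model into a DataFrame. The helper `metrics_row`, the label `SITE_A` and all numbers are hypothetical and only show the shape of the approach.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def metrics_row(site, model_name, y_true, y_pred):
    """Return one row of evaluation metrics as a plain dict."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "site": site,
        "model": model_name,
        "mae": mean_absolute_error(y_true, y_pred),
        "mse": mse,
        "rmse": np.sqrt(mse),
        "r2": r2_score(y_true, y_pred),
    }


# Made-up actual values and predictions for a single hypothetical site
y_true_demo = np.array([100.0, 200.0, 300.0])
rows = [
    metrics_row("SITE_A", "lr", y_true_demo, np.array([150.0, 200.0, 250.0])),
    metrics_row("SITE_A", "dt", y_true_demo, np.array([100.0, 210.0, 300.0])),
]

# One row per (site, model) pair, ready to sort, filter or export
summary = pd.DataFrame(rows).set_index(["site", "model"])
print(summary.round(2))
```

In the notebook's loop over sites, each call to the evaluation step would append two such rows, and `summary.to_markdown()` (which needs the optional `tabulate` package) could then render the table.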