# Traffic Flow Regression - V2
 - This notebook builds on traffic_flow_regression_V1 by adding further feature engineering and a more robust model selection and training process
 - See EDA.ipynb for a deeper understanding of the data

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, SGDRegressor, Ridge, Lasso, ElasticNet
from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
import sys, os 

sys.path.append(os.path.join(os.getcwd(), 'utils'))


## Prepare Dataset
Let's challenge ourselves more in V2 and keep the traffic flow in 30-minute intervals rather than hourly ones. This will require a more fine-tuned model. We could keep the 15-minute intervals, but 30 minutes intuitively seems more sensible and reduces the number of uninteresting intervals (0 flow).

In [3]:

# Load the dataset
df = pd.read_csv('ml_datasets/traffic_flow_regression.csv')

# Convert 'start_time' and 'date' to datetime
df['start_time'] = pd.to_datetime(df['start_time'])
df['date'] = pd.to_datetime(df['date'])

# Round the start_time to the nearest half-hour
df['start_time'] = df['start_time'].dt.round('30min')

# Format each half-hour interval as an 'HH:MM' label (e.g. '00:00', '00:30')
df['time_(half_hour)'] = df['start_time'].dt.strftime('%H:%M')

# Group by the new half-hour intervals along with other specified columns
df = df.groupby(['time_(half_hour)', 'start_time', 'site', 'day', 'date', 'X', 'Y']).agg({'flow': 'sum'}).reset_index()

# Display the DataFrame (the notebook truncates long output)
df




Unnamed: 0,time_(half_hour),start_time,site,day,date,X,Y,flow
0,00:00,2024-10-12 00:00:00,N01111A,FR,2022-01-04,-6.356151,53.293594,79
1,00:00,2024-10-12 00:00:00,N01111A,FR,2022-01-14,-6.356151,53.293594,58
2,00:00,2024-10-12 00:00:00,N01111A,FR,2022-01-21,-6.356151,53.293594,43
3,00:00,2024-10-12 00:00:00,N01111A,FR,2022-01-28,-6.356151,53.293594,58
4,00:00,2024-10-12 00:00:00,N01111A,FR,2022-02-18,-6.356151,53.293594,69
...,...,...,...,...,...,...,...,...
41394,23:30,2024-10-12 23:30:00,N03121A,WE,2022-05-25,-6.424314,53.357202,67
41395,23:30,2024-10-12 23:30:00,N03121A,WE,2022-06-04,-6.424314,53.357202,104
41396,23:30,2024-10-12 23:30:00,N03121A,WE,2022-06-15,-6.424314,53.357202,68
41397,23:30,2024-10-12 23:30:00,N03121A,WE,2022-06-22,-6.424314,53.357202,74
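
The cell above bins readings with `dt.round('30min')`, which sends each timestamp to the nearest half-hour mark rather than to the start of the window it falls in. The sketch below compares it with `dt.floor('30min')` on made-up data, assuming the raw readings arrive every 15 minutes (an assumption, not something checked against the file). Readings exactly halfway between two marks (00:15, 00:45) are assigned by pandas' tie-breaking rule and may land in a neighbouring window, so it may be worth checking which behaviour is intended.

```python
import pandas as pd

# Hypothetical 15-minute readings; the spacing of the real file is an assumption.
raw = pd.DataFrame({
    "start_time": pd.to_datetime([
        "2022-01-04 00:00", "2022-01-04 00:15",
        "2022-01-04 00:30", "2022-01-04 00:45",
    ]),
    "flow": [10, 20, 30, 40],
})

# floor() maps every reading to the start of the half-hour window it belongs to.
raw["floored"] = raw["start_time"].dt.floor("30min")

# round() maps every reading to the nearest half-hour mark instead.
raw["rounded"] = raw["start_time"].dt.round("30min")

floored_totals = raw.groupby("floored")["flow"].sum()
rounded_totals = raw.groupby("rounded")["flow"].sum()

print(floored_totals)
print(rounded_totals)
```

With `floor`, each half-hour window contains exactly the two readings that start inside it. Printing both totals side by side makes any difference between the two binning rules visible.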


# Feature Engineering
- Feature engineering is the process of selecting, manipulating and transforming raw data into features that can be used in supervised learning
- A crucial aspect of a data scientist's job is to create predictive features from the raw data
- This is one reason why domain knowledge can be particularly important for a data scientist, although creativity and logical thinking play a part too

## What causes traffic?
- Luckily I don't particularly need much domain knowledge to get started. I live in Dublin and drive so that qualifies me somewhat!
- We already use the day and the time of day as features. I would expect these to be the most predictive features for traffic flow in general, other than specific events like car accidents and road works
- These features were pretty much handed to us. But is there anything else in the data that could be predictive?
### 1. The month!
- I've always noticed traffic to be much better when schools are off, so including information on the month could be useful for us
- We record the traffic on a given date, but over a six-month window each date occurs only once, which makes the raw date a poor feature
- By extracting the month instead, we can turn a feature that was not useful (the date) into something that might be


Create a month column from the date column

In [4]:
df['month'] = df['date'].dt.month
df.head()

Unnamed: 0,time_(half_hour),start_time,site,day,date,X,Y,flow,month
0,00:00,2024-10-12,N01111A,FR,2022-01-04,-6.356151,53.293594,79,1
1,00:00,2024-10-12,N01111A,FR,2022-01-14,-6.356151,53.293594,58,1
2,00:00,2024-10-12,N01111A,FR,2022-01-21,-6.356151,53.293594,43,1
3,00:00,2024-10-12,N01111A,FR,2022-01-28,-6.356151,53.293594,58,1
4,00:00,2024-10-12,N01111A,FR,2022-02-18,-6.356151,53.293594,69,2


In a similar manner we can create the hour as a feature, as was done in V1. In this version it acts as a slightly more abstract feature, since we now predict flow in shorter (half-hour) intervals.

In [5]:
df['time_(hour)'] = df['start_time'].dt.hour


## What causes traffic? (cont.)
- That was a pretty easy addition of a feature. What matters right now isn't whether it ends up being highly predictive, but the principle of how potentially useful features can be created from what is handed to us
- One thing I noticed from the V1 notebook is the isolation of each site. I think it makes sense to have separate models for each site, but that splits a dataset of ~40,000 samples into ~7,000 per site.
### 2. Traffic Flow at other junctions 
- As we saw in our EDA, some of the sites are quite close together. If one site has a lot of traffic flow, the sites close to it will likely have traffic flow too, as cars may be driving through several sites
- To model this, let's create, for each sample, features holding the traffic flow at every site in the **previous interval** relative to that sample.
#### **NB: Data Leakage**
- Let's make up a use case for this model: we are building a real-time traffic control system for these sites
- Here the model has already been trained and is being fed live data. Every half an hour it predicts the traffic flow for the next half an hour. Since one of our features is the traffic flow at other junctions, this will be fed in too.
- Obviously in this scenario **we do not have the traffic flow data for the other sites for that interval, as it hasn't happened yet**.
- Therefore the useful data that would be available to us is the traffic flow from the previous interval
- This may seem obvious, but when building and training a model on historical data it is easy to see the dataset as just a bunch of numbers and forget where they come from in the real world
- This is what separates building a model for fun and to learn about machine learning from actually putting a model into production where it has a use in the real world.

Create a datetime column

In [6]:
df['datetime'] = pd.to_datetime(df['date'].astype(str) + ' ' + df['time_(half_hour)'])


Function to get the flow at the previous interval in time

In [7]:

# Get all unique sites
unique_sites = df['site'].unique()

# Look up the flow of one given site in the interval before the row's timestamp
def calculate_previous_flow_for_sites(row, site):
    # Get the time of the current row
    current_time = row['datetime']
    
    # Find the previous interval
    previous_time = current_time - pd.Timedelta(minutes=30)
    
    # Filter the flow for the specific site at the previous time
    previous_flow = df[(df['datetime'] == previous_time) & (df['site'] == site)]['flow']
    
    # Return the flow value if it exists, else 0
    return previous_flow.iloc[0] if not previous_flow.empty else 0

# For each unique site, create a column with the flow of that site in the previous interval
# for site in unique_sites:
#     df[f'previous_flow_{site}'] = df.apply(calculate_previous_flow_for_sites, axis=1, site=site)

df = pd.read_csv('ml_datasets/traffic_flow_regression_mod.csv', index_col=None)

# Display the updated DataFrame
df.head()

Unnamed: 0,time_(half_hour),start_time,site,day,date,X,Y,flow,month,time_(hour),datetime,previous_flow_N01111A,previous_flow_N01131A,previous_flow_N01151A,previous_flow_N02111A,previous_flow_N02131A,previous_flow_N03121A,overall_previous_flow
0,00:00,2024-10-12 00:00:00,N01111A,FR,2022-01-04,-6.356151,53.293594,79,1,0,2022-01-04 00:00:00,25,0,26,53,29,68,201
1,00:00,2024-10-12 00:00:00,N01111A,FR,2022-01-14,-6.356151,53.293594,58,1,0,2022-01-14 00:00:00,35,0,25,68,39,17,184
2,00:00,2024-10-12 00:00:00,N01111A,FR,2022-01-21,-6.356151,53.293594,43,1,0,2022-01-21 00:00:00,32,0,30,62,35,58,217
3,00:00,2024-10-12 00:00:00,N01111A,FR,2022-01-28,-6.356151,53.293594,58,1,0,2022-01-28 00:00:00,45,0,32,91,45,50,263
4,00:00,2024-10-12 00:00:00,N01111A,FR,2022-02-18,-6.356151,53.293594,69,2,0,2022-02-18 00:00:00,58,0,37,63,52,53,263
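
The row-wise function above filters the whole DataFrame once per row and per site, which is slow on ~40,000 rows. The sketch below is one possible vectorized alternative, not the method used to produce the CSV loaded above. It assumes `datetime` has a datetime dtype and that each (`datetime`, `site`) pair appears once. The helper name `add_previous_flows` and the toy sites `A` and `B` are made up for illustration.

```python
import pandas as pd

def add_previous_flows(frame: pd.DataFrame) -> pd.DataFrame:
    """Attach each site's flow from 30 minutes earlier to every row (0 if missing)."""
    # One row per timestamp, one column per site.
    wide = frame.pivot_table(index="datetime", columns="site", values="flow", aggfunc="sum")
    # Move the index forward by 30 minutes so that a lookup at time t
    # returns the flow that was recorded at t - 30 minutes.
    wide.index = wide.index + pd.Timedelta(minutes=30)
    wide.columns = [f"previous_flow_{site}" for site in wide.columns]
    out = frame.join(wide, on="datetime", how="left")
    # Mirror the row-wise function: a missing previous interval becomes 0.
    new_cols = list(wide.columns)
    out[new_cols] = out[new_cols].fillna(0).astype(int)
    return out

# Toy data with two hypothetical sites and two consecutive intervals.
toy = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2022-01-04 00:00", "2022-01-04 00:00",
        "2022-01-04 00:30", "2022-01-04 00:30",
    ]),
    "site": ["A", "B", "A", "B"],
    "flow": [5, 7, 11, 13],
})

result = add_previous_flows(toy)
print(result)
```

Like the original function, this fills a missing previous interval with 0, which cannot be told apart from a genuine zero-flow interval. Leaving such values as NaN is an alternative worth considering.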


Create a column that is the overall previous flow across all sites

In [8]:
df['overall_previous_flow'] = df[[f'previous_flow_{site}' for site in unique_sites]].sum(axis=1)
df.head()

Unnamed: 0,time_(half_hour),start_time,site,day,date,X,Y,flow,month,time_(hour),datetime,previous_flow_N01111A,previous_flow_N01131A,previous_flow_N01151A,previous_flow_N02111A,previous_flow_N02131A,previous_flow_N03121A,overall_previous_flow
0,00:00,2024-10-12 00:00:00,N01111A,FR,2022-01-04,-6.356151,53.293594,79,1,0,2022-01-04 00:00:00,25,0,26,53,29,68,201
1,00:00,2024-10-12 00:00:00,N01111A,FR,2022-01-14,-6.356151,53.293594,58,1,0,2022-01-14 00:00:00,35,0,25,68,39,17,184
2,00:00,2024-10-12 00:00:00,N01111A,FR,2022-01-21,-6.356151,53.293594,43,1,0,2022-01-21 00:00:00,32,0,30,62,35,58,217
3,00:00,2024-10-12 00:00:00,N01111A,FR,2022-01-28,-6.356151,53.293594,58,1,0,2022-01-28 00:00:00,45,0,32,91,45,50,263
4,00:00,2024-10-12 00:00:00,N01111A,FR,2022-02-18,-6.356151,53.293594,69,2,0,2022-02-18 00:00:00,58,0,37,63,52,53,263


## Let's see model performance with these additional features

In [9]:
cat_features = ['day', 'time_(hour)']
num_features = ['previous_flow_N01111A','previous_flow_N01131A','previous_flow_N01151A','previous_flow_N02111A','previous_flow_N02131A','previous_flow_N03121A','overall_previous_flow']
# Now, import the module
import DecisionTree_site 
dt = DecisionTree_site.DecisionTree('N01111A', cat_features, num_features, 'flow', df)

# TODO: TURN TO SKLEARN PIPELINE
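
For the TODO above, here is a minimal sketch of what an sklearn pipeline could look like. It does not reproduce `DecisionTree_site`, whose internals are not shown here. It assumes the categorical features are ordinal-encoded and the numeric ones passed through unchanged. The data is synthetic and `build_pipeline` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeRegressor

def build_pipeline(cat_features, num_features):
    """Encode categorical columns, pass numeric ones through, then fit a tree."""
    preprocess = ColumnTransformer([
        # Categories not seen during fit are mapped to -1 instead of raising.
        ("cat", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), cat_features),
        ("num", "passthrough", num_features),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        ("model", DecisionTreeRegressor(random_state=0)),
    ])

# Synthetic stand-in for one site's data (values are made up).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "day": rng.choice(["MO", "TU", "WE", "TH", "FR"], size=n),
    "time_(hour)": rng.integers(0, 24, size=n),
    "overall_previous_flow": rng.integers(0, 500, size=n),
})
toy["flow"] = 0.5 * toy["overall_previous_flow"] + 3 * toy["time_(hour)"] + rng.normal(0, 5, size=n)

cat_features = ["day", "time_(hour)"]
num_features = ["overall_previous_flow"]

X_train, X_test, y_train, y_test = train_test_split(
    toy[cat_features + num_features], toy["flow"], test_size=0.2, random_state=0
)

pipe = build_pipeline(cat_features, num_features).fit(X_train, y_train)
predictions = pipe.predict(X_test)
print(predictions[:5])
```

Keeping the encoder inside the pipeline means it is fitted on the training split only, and the same fitted object can later be passed to tools such as `GridSearchCV`.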

In [10]:
dt.transform_split_train_score_eval()

Mean Absolute Error (MAE) Test: 73.97212837837837
Mean Absolute Error (MAE) Train: 0.597972972972973
Root Mean Squared Error (RMSE) Test: 136.08673425380178
Root Mean Squared Error (RMSE) Train: 4.819338429483646
R-squared (R²) Test: 0.9179051341801862
R-squared (R²) Train: 0.999897438999266
Mean Absolute Percentage Error (MAPE) Test: 16.1%
Mean Absolute Percentage Error (MAPE) Train: 0.1%
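
The large gap between train and test scores above (MAE of about 0.6 versus 74) suggests the tree is largely memorising the training data. For reference, the four reported metrics can be computed with sklearn roughly as sketched below. How `transform_split_train_score_eval` computes them is not shown here, so `regression_report` is a hypothetical helper and the numbers are toy values.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)

def regression_report(y_true, y_pred):
    """Return MAE, RMSE, R² and MAPE (as a percentage) for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "RMSE": float(np.sqrt(mse)),
        "R2": r2_score(y_true, y_pred),
        # sklearn returns MAPE as a fraction, so scale it to a percentage.
        "MAPE_%": 100 * mean_absolute_percentage_error(y_true, y_pred),
    }

# Toy values, not taken from the traffic dataset.
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 180.0, 400.0])

report = regression_report(y_true, y_pred)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

MAPE divides by the true value, so intervals with a true flow of 0 can inflate it heavily. It is worth checking how such intervals are handled before reading too much into that figure.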
