<a href="https://colab.research.google.com/github/dewaalaw/dsprojects/blob/main/traffic_flow_forecasting_part_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import datetime

import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf
import matplotlib.pyplot as plt

from tensorflow.keras import Model, Sequential

from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.losses import MeanSquaredError
from tensorflow.keras.metrics import MeanAbsoluteError

from tensorflow.keras.layers import Dense, Conv1D, LSTM, Lambda, Reshape, RNN, LSTMCell

import warnings
warnings.filterwarnings('ignore')

In [None]:
from google.colab import drive
drive.mount('/gdrive')

Mounted at /gdrive


In [None]:
df = pd.read_csv('/gdrive/MyDrive/TrafficFlowData/Metro_Interstate_Traffic_Volume.csv')

In [None]:
df.head(10)

In [None]:
df['date_time'] = pd.to_datetime(df['date_time'])
df = df.drop_duplicates(subset='date_time', ignore_index=True)
df.shape

(40575, 9)

In [None]:
date_range = list(pd.date_range('2012-10-02 09:00:00', '2018-09-30 23:00:00', freq='h'))  # Full hourly range spanning the dataset
print(len(date_range))

52551


In [None]:
temp_df = pd.DataFrame({'date_time':date_range})
df = pd.merge(temp_df, df, how='left', on='date_time')
#df.head()

In [None]:
df.isna().sum()
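To illustrate why the left merge above surfaces gaps, here is a minimal sketch on a toy frame. The `observed` frame and its values are invented for illustration and are not taken from the dataset. Any hour present in the full range but absent from the observations comes back as a row of NaN, which is what `isna().sum()` then counts.

```python
import pandas as pd

# Hypothetical toy data: four consecutive hours, with the third one missing.
full_hours = pd.date_range('2020-01-01 00:00', periods=4,
                           freq=pd.Timedelta(hours=1))
observed = pd.DataFrame({
    'date_time': full_hours.delete(2),   # drop 02:00 to simulate a gap
    'traffic_volume': [10, 20, 40],
})

# Left-merge onto the complete hourly range so every hour gets a row.
complete = pd.merge(pd.DataFrame({'date_time': full_hours}), observed,
                    how='left', on='date_time')

n_missing = int(complete['traffic_volume'].isna().sum())
print(complete)
print('missing hours:', n_missing)
```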

In [None]:
fig, ax = plt.subplots(figsize=(13, 6))
ax.plot(df.traffic_volume)
ax.set_xlabel('Date')
ax.set_ylabel('Traffic Volume')

fig.autofmt_xdate()
plt.tight_layout()

In [None]:
df = df[35000:].reset_index(drop=True) #Collect from 35000th row, resetting index
# df.head(5)

In [None]:
df = df.drop(['holiday', 'weather_main', 'weather_description'], axis=1)
# df.shape

In [None]:
df = df.fillna(df.groupby(df.date_time.dt.hour).transform('median'))
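The fill step above replaces each missing value with the median of the values recorded at the same hour of day. A minimal sketch of the same idea on a single column, using an invented three-day toy series (the frame `toy` and its values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy series: three days of hourly data where the value
# equals the hour of day, so the expected fill value is easy to check.
idx = pd.date_range('2020-01-01', periods=72, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({
    'date_time': idx,
    'traffic_volume': np.tile(np.arange(24, dtype=float), 3),
})
toy.loc[30, 'traffic_volume'] = np.nan   # 06:00 on the second day

# Median of each hour-of-day group, broadcast back to the original rows.
hourly_median = (toy.groupby(toy['date_time'].dt.hour)['traffic_volume']
                    .transform('median'))
toy['traffic_volume'] = toy['traffic_volume'].fillna(hourly_median)

print(toy.loc[30])
```

The missing 06:00 reading is filled from the other 06:00 readings, so the daily pattern is preserved rather than flattened by a global median.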

In [None]:
df.isna().sum()

date_time         0
temp              0
rain_1h           0
snow_1h           0
clouds_all        0
traffic_volume    0
dtype: int64

In [None]:
fig, ax = plt.subplots(figsize=(14, 6))
ax.plot(df.traffic_volume)
ax.set_xlabel('Date')
ax.set_ylabel('Traffic Volume')

fig.autofmt_xdate()
plt.tight_layout()

In [None]:
df.to_csv('/gdrive/MyDrive/TrafficFlowData/metro_interstate_traffic_volume_preprocessed.csv')

Westbound traffic volume on I-94 between Minneapolis and St Paul in Minnesota, starting on September 29, 2016, at 5 p.m. Daily seasonality is clearly visible, with traffic lower at the start and end of each day.

In [None]:
fig, ax = plt.subplots(figsize=(14,6))
ax.plot(df['traffic_volume'])
ax.set_xlabel('Time')
ax.set_ylabel('Traffic volume')

plt.xticks(np.arange(7, 400, 24), ['Friday', 'Saturday', 'Sunday', 'Monday', 
                                   'Tuesday', 'Wednesday', 'Thursday', 
                                   'Friday', 'Saturday', 'Sunday', 'Monday', 
                                   'Tuesday', 'Wednesday', 'Thursday', 
                                   'Friday', 'Saturday', 'Sunday' ])
plt.xlim(0, 400)
fig.autofmt_xdate()
plt.tight_layout()

Hourly temperature (in Kelvin) from September 29, 2016, to September 30, 2018. Yearly seasonality is rather apparent despite the slight noise.

In [None]:
fig, ax = plt.subplots(figsize = (14, 6))
ax.plot(df['temp'])
ax.set_xlabel('Time')
ax.set_ylabel('Temperature (K)')

plt.xticks([2239, 10999], [2017, 2018])

fig.autofmt_xdate()
plt.tight_layout()

Hourly temperature in Kelvin starting on September 29, 2016, at 5 p.m. CST. Despite the noise, temperatures are lower at the start and end of each day and peak around midday, suggesting daily seasonality.

In [None]:
fig, ax = plt.subplots(figsize=(14,6))
ax.plot(df['temp'])
ax.set_xlabel('Time')
ax.set_ylabel('Temperature (K)')

plt.xticks(np.arange(7, 400, 24), ['Friday', 'Saturday', 'Sunday', 'Monday', 
                                   'Tuesday', 'Wednesday', 'Thursday', 
                                   'Friday', 'Saturday', 'Sunday', 'Monday', 
                                   'Tuesday', 'Wednesday', 'Thursday', 
                                   'Friday', 'Saturday', 'Sunday' ])
plt.xlim(0, 400)
fig.autofmt_xdate()
plt.tight_layout()

Feature Engineering and Data Splitting

In [None]:
df.describe()

In [None]:
# Drop columns that are mostly zero throughout the data (see Peixeiro 2022, pg 242)
cols_to_drop = ['rain_1h', 'snow_1h']
df = df.drop(cols_to_drop, axis=1)

In [None]:
# Convert date_time from string to a numerical value of seconds (see Peixeiro 2022, pg 242)
# Retain cyclical nature of time by way of the sin transformation
timestamp_s = pd.to_datetime(df['date_time']).map(datetime.datetime.timestamp)
day = 24 * 60 * 60 # Number of seconds in a day
df['day_sin'] = (np.sin(timestamp_s * (2*np.pi/day))).values # Radians are used in the sine function

While the sine transformation restores the cyclical property of time, on its own it cannot distinguish a.m. from p.m., because two different times of day can share the same sine value. We correct this by also applying a cosine transformation. Being out of phase with the sine function, it lets each time of day be identified by its (sine, cosine) pair.

In [None]:
df['day_cos'] = (np.cos(timestamp_s * (2*np.pi/day))).values
df = df.drop(['date_time'], axis=1)
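The ambiguity of the sine alone, and how the cosine resolves it, can be checked numerically. The sketch below is a simplification: it measures time in hours since midnight rather than in epoch seconds as the notebook does, and `encode_time_of_day` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

SECONDS_PER_DAY = 24 * 60 * 60

def encode_time_of_day(hours):
    """Map hours since midnight to (sin, cos) coordinates on the unit circle."""
    seconds = np.asarray(hours, dtype=float) * 3600
    angle = seconds * (2 * np.pi / SECONDS_PER_DAY)
    return np.sin(angle), np.cos(angle)

# 03:00 and 09:00 share the same sine value...
sin_vals, cos_vals = encode_time_of_day([3, 9])
same_sine = bool(np.isclose(sin_vals[0], sin_vals[1]))
# ...but their cosine values differ, so the pair separates them.
different_cosine = not np.isclose(cos_vals[0], cos_vals[1])

# Across all 24 hours, every (sin, cos) pair is distinct.
s, c = encode_time_of_day(np.arange(24))
unique_pairs = {(round(float(a), 6), round(float(b), 6)) for a, b in zip(s, c)}

print(same_sine, different_cosine, len(unique_pairs))
```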

In [None]:
# Confirm sine vs cosine function
df.sample(50).plot.scatter('day_sin', 'day_cos', figsize=(14, 10)).set_aspect('equal')

The above plot confirms that we have converted the time strings to numerical values that retain the cyclical nature of time, a format conducive to deep learning models.
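The same property can be verified without a plot: every (day_sin, day_cos) point should lie on the unit circle, and the encoding should repeat every 24 hours. A minimal sketch, assuming a synthetic hourly index (`idx`) in place of the notebook's `date_time` column:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly index standing in for the notebook's date_time column.
idx = pd.date_range('2016-09-29 17:00', periods=48,
                    freq=pd.Timedelta(hours=1))

# Seconds since the Unix epoch, independent of the index's storage unit.
timestamp_s = ((idx - pd.Timestamp('1970-01-01'))
               / pd.Timedelta(seconds=1)).to_numpy()

day = 24 * 60 * 60  # Number of seconds in a day
day_sin = np.sin(timestamp_s * (2 * np.pi / day))
day_cos = np.cos(timestamp_s * (2 * np.pi / day))

# sin^2 + cos^2 = 1 for every timestamp, i.e. points sit on the unit circle.
on_unit_circle = bool(np.allclose(day_sin**2 + day_cos**2, 1.0))

# The second day reproduces the first day's encoding.
repeats_daily = bool(np.allclose(day_sin[:24], day_sin[24:])
                     and np.allclose(day_cos[:24], day_cos[24:]))

print(on_unit_circle, repeats_daily)
```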

In [None]:
n = len(df)

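# Split the data chronologically into train, validation, and test sets (ratio 70:20:10).
# .copy() so the later in-place scaling does not write to a view of df.
train_df = df[0:int(n*0.7)].copy()
val_df = df[int(n*0.7):int(n*0.9)].copy()
test_df = df[int(n*0.9):].copy()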
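Because this is a time series, the split is chronological rather than shuffled, so the validation and test sets always lie in the future relative to the training set. A minimal sketch of the same split as a reusable function (`chronological_split` and its parameter names are hypothetical, introduced only for illustration):

```python
import numpy as np
import pandas as pd

def chronological_split(frame, train_end_frac=0.7, val_end_frac=0.9):
    """Split a time-ordered frame into train/val/test without shuffling."""
    n = len(frame)
    train_end = int(n * train_end_frac)
    val_end = int(n * val_end_frac)
    return (frame.iloc[:train_end].copy(),
            frame.iloc[train_end:val_end].copy(),
            frame.iloc[val_end:].copy())

# Toy frame whose index doubles as the time order.
frame = pd.DataFrame({'value': np.arange(1000)})
train, val, test = chronological_split(frame)

print(len(train), len(val), len(test))
```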

As seen below, we scale the data rather than normalise it. Scaling changes only the range of the values, not the shape of their distribution; in this instance it places the values between 0 and 1. Normalising, on the other hand, affects both the distribution and the scale.

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(train_df) # Avoid data leakage by fitting the scaler on the training set only

train_df[train_df.columns] = scaler.transform(train_df[train_df.columns])
val_df[val_df.columns] = scaler.transform(val_df[val_df.columns])
test_df[test_df.columns] = scaler.transform(test_df[test_df.columns])
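The min-max scaling applied above amounts to `(x - min) / (max - min)`, with the minimum and maximum taken from the training set. A minimal NumPy sketch with invented values shows two consequences: training values land exactly in [0, 1], while validation or test values outside the training range can fall outside that interval.

```python
import numpy as np

# Hypothetical toy values, not taken from the dataset.
train = np.array([2.0, 4.0, 6.0, 10.0])
val = np.array([3.0, 12.0])   # 12.0 lies above the training maximum

# Statistics come from the training set only, mirroring scaler.fit(train_df).
train_min, train_max = train.min(), train.max()

def min_max_scale(values):
    """Scale values using the training minimum and maximum."""
    return (values - train_min) / (train_max - train_min)

train_scaled = min_max_scale(train)
val_scaled = min_max_scale(val)

print(train_scaled)   # all within [0, 1]
print(val_scaled)     # second value exceeds 1 because 12.0 > train_max
```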

In [None]:
# Save each frame as a csv
train_df.to_csv('/gdrive/MyDrive/TrafficFlowData/metro_interstate_traffic_volume_train.csv')
val_df.to_csv('/gdrive/MyDrive/TrafficFlowData/metro_interstate_traffic_volume_validate.csv')
test_df.to_csv('/gdrive/MyDrive/TrafficFlowData/metro_interstate_traffic_volume_test.csv')

In [None]:
wide_window.plot(baseline_last)

Predictions of the baseline single-step model on three sequences from the sample batch. The prediction at each timestep is the last known value, suggesting that the baseline works as expected.

In [None]:
print(performance['Baseline - Last'][1])

0.08518269658088684


## Multi-step Baseline Model
Here we predict more than one timestep into the future: given 24 hours of input, we forecast the traffic volume for the following 24 hours. This requires a matching data window. Since we are predicting 24 timesteps into the future from an input of 24 hours, the input width, label width, and shift are all 24.

In [None]:
multi_window = DataWindow(input_width=24, label_width=24, shift=24, 
                          label_columns=['traffic_volume'])
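With an input width, label width, and shift of 24, each window covers 48 consecutive timesteps: the first 24 are inputs and the last 24 are labels. The sketch below illustrates that slicing on a plain array. `make_windows` is a hypothetical helper written for this illustration; it is not the notebook's `DataWindow` class and makes no claim about how that class is implemented.

```python
import numpy as np

def make_windows(series, input_width=24, label_width=24, shift=24):
    """Slice a 1-D series into (inputs, labels) pairs of sliding windows."""
    total = input_width + shift          # full window length
    label_start = total - label_width    # labels occupy the window's tail
    inputs, labels = [], []
    for start in range(len(series) - total + 1):
        window = series[start:start + total]
        inputs.append(window[:input_width])
        labels.append(window[label_start:])
    return np.array(inputs), np.array(labels)

# Toy series 0..99, so each value equals its own timestep.
series = np.arange(100)
X, y = make_windows(series)

print(X.shape, y.shape)
print(X[0][[0, -1]], y[0][[0, -1]])   # first window: inputs 0..23, labels 24..47
```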

We are thus faced with two possible baselines: predict the last known value for each of the next 24 timesteps, or repeat the last 24 timesteps as the forecast for the next 24.
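Both baselines reduce to simple array operations. The following NumPy sketch shows them on an invented toy batch of shape (batch, time, features); it is meant only to make the two ideas concrete and uses `np.tile` in place of the TensorFlow equivalent.

```python
import numpy as np

# Hypothetical toy batch: 2 sequences, 24 input timesteps, 1 feature.
inputs = np.arange(48, dtype=float).reshape(2, 24, 1)

# Baseline 1: repeat the last known value for each of the next 24 timesteps.
last_value_forecast = np.tile(inputs[:, -1:, :], (1, 24, 1))

# Baseline 2: reuse the whole 24-step input sequence as the 24-step forecast.
repeat_forecast = inputs.copy()

print(last_value_forecast.shape, repeat_forecast.shape)
print(last_value_forecast[0, :3, 0])   # the last input value of sequence 0, repeated
```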

### Predict the last known value

In [None]:
class MultiStepLastBaseline(Model):
  def __init__(self, label_index=None):
    super().__init__()
    self.label_index = label_index

  def call(self, inputs):
    if self.label_index is None:
      # No target specified: repeat the last known value of every column
      # over the next 24 timesteps.
      return tf.tile(inputs[:, -1:, :], [1, 24, 1])
    # Target specified: repeat the last known value of the target column
    # only over the next 24 timesteps. The one-column slice keeps the
    # feature axis, so the output shape is (batch, 24, 1).
    return tf.tile(inputs[:, -1:, self.label_index:self.label_index + 1], [1, 24, 1])

In [None]:
ms_baseline_last = MultiStepLastBaseline(label_index=column_indices['traffic_volume'])

ms_baseline_last.compile(loss=MeanSquaredError(), metrics=[MeanAbsoluteError()])

ms_val_performance = {}
ms_performance = {}

ms_val_performance['Baseline - Last'] = ms_baseline_last.evaluate(multi_window.val)
ms_performance['Baseline - Last'] = ms_baseline_last.evaluate(multi_window.test, verbose=0)



In [None]:
multi_window.plot(ms_baseline_last)