# **Using an LSTM to Predict Traffic Patterns in Manhattan**

As part of this cyberinfrastructure competition, teams must develop a machine learning model to help predict the movement of people in Manhattan. Due to the nature of the data, this might be confusing for some teams initially.

This notebook expands on the training dataset and shows how the data can be used to generate predictions with a sample LSTM model.

We begin by importing some basic machine learning libraries.

In [21]:
import h5py
import numpy as np
from datetime import datetime
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import LSTM, Dense, Reshape

## Opening the Training Data

The TrainingData.h5 file (stored here as datasets/training_data.h5) is in HDF5 format. This isn't as common as the .csv or .npy formats that many are familiar with, so we'll begin by simply opening and reading the data.

The training file contains two datasets, trip and timeslot:

1) trip contains the bike traffic information for all trips in Manhattan in a given time interval
2) timeslot contains the corresponding time intervals for the trip data

Below we'll show how to open and access the trip and timeslot data.

In [22]:
with h5py.File('datasets/training_data.h5', 'r') as f:
    # Access the trip dataset and their corresponding timestamps
    traffic_data = f['trip'][()]
    dates = f['timeslot'][()]

## Understanding the Data

Now let's examine the data and see what it looks like, as well as any potential preprocessing that we have to do

First, the timeslot data: each timeslot corresponds to a 30-minute interval in October 2019. The first timeslot in the data is labelled 2019-10-01 (October 1st, 2019) at 00:30, which covers 12:00 am to 12:30 am. There are therefore 48 x 31 = 1488 timeslots (data samples) in total.


In [23]:
dates[0]

b'201910010030'

In [24]:
dates.shape

(1488,)
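
The count of 1488 can be reproduced without the data file. The sketch below rebuilds the label sequence for October 2019, assuming (this is not confirmed by the dataset itself) that the labels advance in 30-minute steps from the first one with no missing slots.

```python
from datetime import datetime, timedelta

SLOTS_PER_DAY = 48      # 24 hours x 2 half-hour slots
DAYS_IN_OCTOBER = 31

# Assumption: the first label is 00:30 on October 1st, as dates[0] shows,
# and labels advance in 30-minute steps with no gaps.
first_slot = datetime(2019, 10, 1, 0, 30)
expected_labels = [
    (first_slot + timedelta(minutes=30 * i)).strftime('%Y%m%d%H%M').encode()
    for i in range(SLOTS_PER_DAY * DAYS_IN_OCTOBER)
]

print(len(expected_labels))   # 1488
print(expected_labels[0])     # b'201910010030'
```

If the file follows this pattern, comparing expected_labels against dates would flag any missing or duplicated timeslot.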

Next there's the trip data. Each data sample is a 2x16x8 tensor: the first 16x8 matrix shows how people moved out of different regions of Manhattan (outflow), and the second 16x8 matrix shows how people moved into different regions of Manhattan (inflow). Each 16x8 matrix is an abstracted grid map of Manhattan.


In [25]:
traffic_data[0]

array([[[ 0.,  0.,  0.,  0.,  1.,  0.,  0.,  1.],
        [ 0.,  0.,  3.,  4.,  4.,  0.,  1.,  0.],
        [ 0.,  1.,  0.,  8.,  0.,  1.,  0.,  0.],
        [ 0.,  0.,  2.,  5.,  1.,  2.,  1.,  0.],
        [ 0.,  2.,  1.,  1.,  1.,  1.,  0.,  0.],
        [ 0.,  0.,  2.,  1.,  5.,  2.,  0.,  0.],
        [ 0.,  2.,  1.,  1.,  0.,  1.,  0.,  0.],
        [ 0.,  4.,  4.,  6.,  7.,  0.,  0.,  0.],
        [ 1.,  1., 11.,  4.,  2.,  0.,  0.,  1.],
        [ 0.,  4.,  1.,  1.,  1.,  1.,  0.,  1.],
        [ 0.,  1.,  1.,  0.,  0.,  0.,  0.,  2.],
        [ 1.,  1.,  3.,  4.,  0.,  0.,  0.,  0.],
        [ 2.,  2.,  0.,  2.,  0.,  0.,  1.,  0.],
        [ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
        [ 1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.]],

       [[ 0.,  0.,  0.,  0.,  1.,  0.,  0.,  1.],
        [ 0.,  0.,  0.,  1.,  0.,  1.,  1.,  0.],
        [ 0.,  1.,  1.,  3.,  0.,  1.,  2.,  0.],
        [ 0.,  1.,  0.,  0.,  1.,  3.,  0.,  0.]

In [26]:
traffic_data.shape

(1488, 2, 16, 8)
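
To make the layout concrete, here is a small sketch of how the outflow and inflow grids can be separated by indexing the second axis. It uses a random stand-in array (sample_traffic is a hypothetical name) with the same layout as traffic_data, so it runs without the data file.

```python
import numpy as np

# Stand-in for traffic_data with the same layout: (samples, 2, 16, 8).
# The values are random and only illustrate the indexing.
rng = np.random.default_rng(0)
sample_traffic = rng.integers(0, 12, size=(4, 2, 16, 8)).astype(float)

outflow = sample_traffic[:, 0]   # trips leaving each grid cell
inflow = sample_traffic[:, 1]    # trips arriving in each grid cell

# Total trips per timeslot in each direction, summed over the 16x8 grid
outflow_per_slot = outflow.sum(axis=(1, 2))
inflow_per_slot = inflow.sum(axis=(1, 2))

print(outflow.shape, inflow.shape)   # (4, 16, 8) (4, 16, 8)
print(outflow_per_slot.shape)        # (4,)
```

The same indexing applied to traffic_data would give two arrays of shape (1488, 16, 8).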

## Processing the Data

Before using the traffic and timeslot data, some processing must be done to give the machine learning algorithm the best chance of being accurate. Note that all teams are welcome and encouraged to process the data however they wish. This tutorial only contains some basic data processing.

For the timeslot dates data, we'll change them from the stored string format into an array where the year, month, day, hour, and minute are separated for ease of use.

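For the traffic trip data, scaling the matrix values to the range [0, 1] is a common preprocessing step. The cells below do not apply it, so the model in this tutorial is trained on the raw counts and the reported errors are in those units.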
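
As an illustration of the scaling mentioned above, here is a minimal min-max sketch on stand-in data. The array, the split point, and all variable names are assumptions made for this example; the idea is to compute the minimum and maximum on the training portion only and reuse them for the held-out portion.

```python
import numpy as np

# Stand-in for the trip tensor; random counts, not real data.
rng = np.random.default_rng(0)
sample_traffic = rng.integers(0, 50, size=(10, 2, 16, 8)).astype(float)

# Hypothetical split: fit the scaling on the first 8 samples only,
# so statistics from the held-out samples do not leak into training.
train_part, test_part = sample_traffic[:8], sample_traffic[8:]

data_min = train_part.min()
data_max = train_part.max()
scale = data_max - data_min   # assumed non-zero for this example

train_scaled = (train_part - data_min) / scale
test_scaled = (test_part - data_min) / scale   # may fall slightly outside [0, 1]

# Undo the scaling to report errors in trip counts again
train_restored = train_scaled * scale + data_min

print(train_scaled.min(), train_scaled.max())   # 0.0 1.0
```

If a model is trained on scaled targets, its predictions need the same inverse step before an error in trip counts is computed.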

In [27]:
formatted_dates = []

for date_string in dates:
    formatted_date = datetime.strptime(date_string.decode(), '%Y%m%d%H%M')

    year = formatted_date.year
    month = formatted_date.month
    day = formatted_date.day
    hour = formatted_date.hour
    minute = formatted_date.minute

    formatted_dates.append(np.array([year, month, day, hour, minute]))

# Shape (samples, 5, 1); -1 infers the sample count instead of hard-coding 1488
formatted_dates = np.array(formatted_dates).reshape(-1, 5, 1)

In [28]:
formatted_dates[0]

array([[2019],
       [  10],
       [   1],
       [   0],
       [  30]])

In [29]:
formatted_dates.shape

(1488, 5, 1)

Now, with the data processed, we'll do a simple training and testing split. For this tutorial we will use the last 5 days of October 2019 (5 x 48 = 240 timeslots) to test our model and the rest for training.

Note: As this data is time dependent, the data should not be shuffled during the train/test split process.

In [30]:
test_size = 240

train_traffic_data = traffic_data[:-test_size]
test_traffic_data = traffic_data[-test_size:]

train_formatted_dates = formatted_dates[:-test_size]
test_formatted_dates = formatted_dates[-test_size:]
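
The value 240 comes from 5 days of 48 half-hour slots each. The short sketch below works through that arithmetic and checks the chronological property of the split on a stand-in index array (slot_index is a hypothetical name, not part of the dataset).

```python
import numpy as np

SLOTS_PER_DAY = 48   # 30-minute intervals per day
test_days = 5
test_size = test_days * SLOTS_PER_DAY
print(test_size)   # 240

# Stand-in index array with one entry per timeslot in October 2019
slot_index = np.arange(48 * 31)

train_idx = slot_index[:-test_size]
test_idx = slot_index[-test_size:]

# Chronological split: every training slot precedes every test slot
print(len(train_idx), len(test_idx))      # 1248 240
print(train_idx.max() < test_idx.min())   # True
```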

## Running the ML Models

First we'll begin by changing the names of our data to match standard conventions. It's important to note here the goal of the model is to take in some timeslot and predict the traffic data tensor at that timeslot.

As such, the timeslot dates will be our X variable and the bike traffic data will be the target variable.

In [31]:
X_train, X_test = train_formatted_dates, test_formatted_dates
y_train, y_test = train_traffic_data, test_traffic_data

Next, we'll define a basic LSTM model. Teams are welcome to choose whichever model architecture they are comfortable with.

In [32]:
# Basic LSTM Model
model = Sequential()
model.add(LSTM(50, activation='tanh', input_shape=(5, 1)))
model.add(Dense(2 * 16 * 8, activation='linear')) 
model.add(Reshape((2, 16, 8))) 
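
To get a feel for the size of this network, the parameter count can be worked out by hand. The sketch below assumes the standard LSTM layout of four gates, each with an input kernel, a recurrent kernel, and a bias. The totals should agree with what model.summary() reports, although that call has not been run here.

```python
# Hand count of trainable parameters for the model defined above.
units = 50               # LSTM(50, ...)
input_features = 1       # input_shape=(5, 1): 5 time steps, 1 feature each
output_size = 2 * 16 * 8

# Each of the 4 LSTM gates has an input kernel, a recurrent kernel and a bias
lstm_params = 4 * (units * input_features + units * units + units)

# Dense layer: one weight per (input, output) pair plus one bias per output
dense_params = units * output_size + output_size

# The Reshape layer has no trainable parameters
total_params = lstm_params + dense_params
print(lstm_params, dense_params, total_params)   # 10400 13056 23456
```

With input_shape=(5, 1), the LSTM reads the five date components (year, month, day, hour, minute) as a sequence of five steps with one feature each.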

Next we'll compile and train the LSTM model. For this example, we'll use the mean squared error (MSE) as the loss, which is reported at each epoch.

Thanks to our earlier work organizing the data, the process of actually training and testing our model is straightforward.

In [33]:
model.compile(optimizer='adam', loss='mse')
model.fit(X_train, y_train, epochs=10, batch_size=64, validation_data=(X_test, y_test))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x1895d9c9f10>

In [34]:
mse = model.evaluate(X_test, y_test)

# Show rmse to see how model performs on the test set
rmse = np.sqrt(mse)
print(f'Root Mean Squared Error: {rmse}')

Root Mean Squared Error: 9.611063590204829
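
For reference, the MSE is the average of the squared differences over every element of the target tensor, and the RMSE is its square root. Here is a small sketch with made-up arrays (y_true and y_pred are illustrative values, not model output):

```python
import numpy as np

# Tiny made-up targets and predictions with the layout (samples, 2, 16, 8)
y_true = np.zeros((3, 2, 16, 8))
y_pred = np.full((3, 2, 16, 8), 2.0)   # every prediction is off by 2 trips
y_pred[0] = 0.0                        # except the first sample, which is exact

mse = np.mean((y_true - y_pred) ** 2)  # average over every element
rmse = np.sqrt(mse)

print(mse)    # about 2.667 (two thirds of the elements have squared error 4)
print(rmse)   # about 1.633
```

The RMSE is in the same units as y_test, which makes it easier to interpret than the MSE.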


In [35]:
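# Recompiling creates a fresh optimizer but keeps the weights learned so far,
# so these 200 epochs continue from the 10 epochs trained above
model.compile(optimizer='adam', loss='mse')
model.fit(X_train, y_train, epochs=200, batch_size=64, validation_data=(X_test, y_test))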

Epoch 1/200
Epoch 2/200
...
Epoch 77/200
Epoch 78

<keras.callbacks.History at 0x18972601ca0>

When examining the model's results during the training stage, it's important to make sure the model's loss gets lower as the number of epochs grows. This is a good indication that the model is actually learning patterns and trends in the data.

This example does not delve into hyperparameter tuning, but it is a good way to get some extra performance out of your model. Just make sure not to overfit your hyperparameters to the data, as these models will be tested later on with a separate dataset as part of the criteria that decide where teams place overall.
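
One simple way to watch for overfitting is to track the validation loss per epoch and note where it stops improving. The sketch below is a plain-Python illustration of that idea; best_epoch is a hypothetical helper and the loss values are made up. In Keras, the History object returned by model.fit holds the per-epoch values in history.history['val_loss'] when validation_data is passed, and the built-in EarlyStopping callback offers similar behaviour.

```python
def best_epoch(val_losses, patience=3):
    """Return (epoch, loss) for the lowest validation loss seen, stopping
    the scan once the loss has failed to improve for `patience` epochs."""
    best_loss = float('inf')
    best_idx = 0
    waited = 0
    for idx, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_idx, waited = loss, idx, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_idx, best_loss

# Made-up validation losses for illustration only (not from this notebook)
example_val_losses = [95.0, 80.0, 70.0, 66.0, 67.0, 69.0, 72.0, 65.0]
print(best_epoch(example_val_losses))   # (4, 66.0)
```

A larger patience tolerates longer plateaus: with patience=10 the same list would return epoch 8, at the cost of training for longer before stopping.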

In [36]:
mse = model.evaluate(X_test, y_test)

# Show rmse to see how model performs on the test set
rmse = np.sqrt(mse)
print(f'Root Mean Squared Error: {rmse}')

Root Mean Squared Error: 7.36705209520649


The model improves with the extra training: the RMSE on the test set drops from about 9.61 after the first 10 epochs to about 7.37 after the additional 200 epochs. Let's go ahead and show how to save and reload this model.

This is a very important step, as your models will be tested later on, so having them saved will make that process much easier. This is especially true for larger models that may take a long time to train. Below we show a quick example of how to save and reload your model. Note that model.save stores the whole model (architecture and weights), not only the weights.

In [37]:
model.save('lstm_model.keras')

In [38]:
new_model = load_model('lstm_model.keras')

mse = new_model.evaluate(X_test, y_test)

# Show rmse to see how model performs on the test set
rmse = np.sqrt(mse)
print(f'Root Mean Squared Error: {rmse}')

Root Mean Squared Error: 7.36705209520649


This concludes this tutorial notebook. Please keep in mind that you don't need to replicate this tutorial in your final models. Teams are welcome to preprocess their data differently, change their model's architecture, change their model's hyperparameters, and so on. This is a time to learn and explore different topics within the scope of machine learning.

If any teams have follow-up questions or need any help relating to this hackathon, please feel free to attend any of the upcoming sessions. We are all more than happy to help answer any questions you might have!

# **Thank you and best of luck in the competition!**