<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Timeseries Forecasting with LSTM Neural Networks

_Authors: Kiefer Katovich (SF)_

---

### Learning Objectives
- Learn how to build a basic LSTM using Keras.
- Train the LSTM model on unemployment data and forecast future values.
- Understand the different ways to fit the LSTM model for timeseries data.
- Understand the "stateful" LSTM setup and how it compares to the standard fitting procedure.

### Lesson Guide
- [Introduction](#introduction)
	- [A note on the lesson content](#a-note-on-the-lesson-content)
- [Load the Keras modules](#load-the-keras-modules)
- [Prepare the unemployment data for timeseries modeling](#prepare-the-unemployment-data-for-timeseries-modeling)
	- [Create the first-order differenced unemployment rate](#create-the-first-order-differenced-unemployment-rate)
	- [Normalize the differenced unemployment data with `MinMaxScaler`](#normalize-the-differenced-unemployement-data-with-minmaxscaler)
	- [Split the timeseries into 50% train/test splits](#split-the-timeseries-into--traintest-splits)
- [Write a function to create the predictor and target data](#write-a-function-to-create-the-predictor-and-target-data)
	- [Create training and testing data for a lag of 1](#create-training-and-testing-data-for-a-lag-of-)
- [Reshape the data to work with the LSTM](#reshape-the-data-to-work-with-the-lstm)
- [Constructing the Keras model](#constructing-the-keras-model)
	- [Fit the LSTM model](#fit-the-lstm-model)
	- [Plot the original data, the training predictions and the testing predictions](#plot-the-original-data-the-training-predictions-and-the-testing-predictions)
- [LSTM with multiple lags as predictors](#lstm-with-multiple-lags-as-predictors)
- [Reframing the problem using the LSTM "time steps" dimension](#refraing-the-problem-using-the-lstm-time-steps-dimension)
	- [Rebuild and fit the LSTM model, and plot the predictions](#rebuild-and-fit-the-lstm-model-and-plot-the-predictions)
- ["Stateful" LSTM models](#stateful-lstm-models)


<a id="introduction"></a>
## Introduction
---

Modeling timeseries and forecasting with neural networks is a growing trend. The Long Short Term Memory (LSTM) recurrent neural network architecture is a popular choice when "context" or memory across time is a desired capability of the model.

In this walkthrough/codealong lecture we will be building an LSTM using the Keras framework to forecast quarterly US unemployment timeseries data.

<a id="a-note-on-the-lesson-content"></a>
### A note on the lesson content

This codealong focuses primarily on the Keras implementation and application of LSTM neural networks. This lecture does not cover:
- The theory behind recurrent neural networks.
- The mathematics or theory behind the architecture of LSTM networks.

There are a variety of great resources to dive deeper into LSTM networks:
- [A beginners guide to recurrent networks and LSTMs](http://deeplearning4j.org/lstm.html#a-beginners-guide-to-recurrent-networks-and-lstms)
- [Understanding LSTMs](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
- [This tutorial provides a great introduction to building a simple LSTM for timeseries forecasting.](http://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/)


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime

sns.set_style('whitegrid')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

<a id="load-the-keras-modules"></a>
## Load the Keras modules
---

Keras is a popular, high-level framework for neural networks in Python. It is a "wrapper" on top of lower-level backend frameworks, particularly TensorFlow and Theano.

> **Note:** this lesson is written with Keras running on the TensorFlow backend instead of the Theano backend. This does, unfortunately, have an impact on the way the code is written. Theano and TensorFlow have different indexing orders. [To change the backend that Keras runs on, please see this guide in the documentation.](https://keras.io/backend/)

**We need to import three classes from Keras:**

- **`Sequential`** is the model that will house our neural network layers. It stacks together the input, "hidden" and output layers of the network.
- **`Dense`** is a standard fully-connected neural network layer where each node has input/output connections to every node in the layers it is connected to.
- **`LSTM`** is the Long Short Term Memory recurrent neural network layer. This will house our LSTM "neurons" or "cells".

We are also importing `MinMaxScaler` from sklearn. LSTMs (and neural networks in general) tend to train better when their inputs are normalized.

In [2]:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from sklearn.preprocessing import MinMaxScaler

Using TensorFlow backend.


<a id="prepare-the-unemployment-data-for-timeseries-modeling"></a>
## Prepare the unemployment data for timeseries modeling
---

First we will load quarterly US unemployment data to do some basic forecasting using LSTMs. 

**Load the unemployment data from CSV file and perform any required cleaning.**

In [3]:
data = pd.read_csv('./datasets/seasonally-adjusted-quarterly-us.csv')

In [4]:
# A:

<a id="create-the-first-order-differenced-unemployment-rate"></a>
### Create the first-order differenced unemployment rate

In timeseries modeling the raw timeseries is rarely used. Typically we will use the first-order differenced timeseries (or higher-order differences). Technically, this differencing is done to make the timeseries stationary, or at least closer to stationary.

However, there are more intuitive reasons why we would want to model the differences as opposed to the actual values. Take the AAPL stock price, for example, and imagine we are a day-trader. If we hold the AAPL stock, we are of course interested in having a model that can predict the price of the stock in the future. However, what we are *really* interested in is how the stock price will *change* in the future from the current point. It is more useful to say "the stock will go up by one point" than it is to say "the stock price will be 51." 

In [5]:
# A:
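
As a rough sketch of this step, first-order differencing can be done with the pandas `diff()` method. The values and the series name below are made up for illustration and are not taken from the lesson's CSV file.

```python
import pandas as pd

# Hypothetical quarterly unemployment rates (made-up values, not the lesson data).
rate = pd.Series([5.0, 5.2, 5.1, 5.6, 6.0], name='unemployment_rate')

# First-order difference: the value at time t minus the value at time t-1.
rate_diff = rate.diff()

# The first observation has no predecessor, so its difference is NaN; drop it.
rate_diff = rate_diff.dropna()

print(rate_diff)
```

The differenced series is one observation shorter than the original, which is why the later steps start from the second time point.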

<a id="normalize-the-differenced-unemployement-data-with-minmaxscaler"></a>
### Normalize the differenced unemployment data with `MinMaxScaler`

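We want the rate to be restricted to the range -1 to 1. `MinMaxScaler` scales to the range 0 to 1 by default, so pass `feature_range=(-1, 1)` when creating it.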

> **Note:** the differencing will make the first value of the series a NaN value. Make sure to drop this from the dataset.

In [6]:
# A:
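
A minimal sketch of the scaling step, assuming the differenced values are already free of NaN. The numbers are made up for illustration; `feature_range`, `fit_transform`, and `inverse_transform` are standard `MinMaxScaler` features.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical differenced rates (made-up values), with the leading NaN already dropped.
rate_diff = np.array([0.2, -0.1, 0.5, 0.4, -0.3])

# MinMaxScaler defaults to the range (0, 1), so request (-1, 1) explicitly.
scaler = MinMaxScaler(feature_range=(-1, 1))

# The scaler expects a 2D array of shape (n_samples, n_features).
scaled = scaler.fit_transform(rate_diff.reshape(-1, 1))

# inverse_transform maps scaled values (or later predictions) back to the original units.
restored = scaler.inverse_transform(scaled)

print(scaled.ravel())
```

Keeping the fitted `scaler` object around is useful later, since predictions come out on the scaled range and may need to be converted back.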

<a id="split-the-timeseries-into--traintest-splits"></a>
### Split the timeseries into 50% train/test splits

We don't want a random train/test split in this case. With timeseries data, we are interested in how our model generalizes to future data in particular. Make the test set the second half of the data through time.

In [7]:
# A:
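
One way to make the chronological split, sketched on a made-up series that is assumed to be ordered from oldest to newest:

```python
import numpy as np

# Hypothetical scaled series with 10 observations, ordered oldest to newest.
series = np.linspace(-1, 1, 10)

# Split at the midpoint without shuffling, so every test point is later in time
# than every training point.
split_index = len(series) // 2
train, test = series[:split_index], series[split_index:]

print(len(train), len(test))
```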

<a id="write-a-function-to-create-the-predictor-and-target-data"></a>
## Write a function to create the predictor and target data
---

The function will need to create a Y target and X predictor. The X predictor matrix will simply be the shifted versions of Y, our target unemployment timeseries. In other words, we want our features to be previous timesteps of the target data for given lags.

**Make a function with two arguments:**
1. The timeseries data.
2. The number of lags of the timeseries to have as predictors.

The default should create a dataset where X is the unemployment rate at a given time (t) and Y is the unemployment rate at the next time (t + 1). At the default lag of 1, X will just be the unemployment timeseries of Y shifted back by 1.

> **Note:** make sure that the output X and Y are of the same length! You will need to slice off a row (at least - depends on the lag order).

In [8]:
# A:
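
A possible shape for this function is sketched below. The name `create_lagged_data` and the example values are hypothetical; the lesson does not prescribe a particular implementation.

```python
import numpy as np

def create_lagged_data(series, lag=1):
    """Return (X, y), where each row of X holds the `lag` values that precede y."""
    series = np.asarray(series, dtype=float).ravel()
    X, y = [], []
    # The last usable window starts at len(series) - lag - 1.
    for i in range(len(series) - lag):
        X.append(series[i:i + lag])   # values at t-lag, ..., t-1 (oldest first)
        y.append(series[i + lag])     # value at t
    return np.array(X), np.array(y)

# Made-up series for illustration.
series = [0.1, 0.2, 0.3, 0.4, 0.5]

X, y = create_lagged_data(series)            # default lag of 1
X3, y3 = create_lagged_data(series, lag=3)   # three lags as predictors

print(X.shape, y.shape)
print(X3.shape, y3.shape)
```

With the five-value series above, a lag of 1 yields four rows and a lag of 3 yields two rows, so X and y always have the same length.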

<a id="create-training-and-testing-data-for-a-lag-of-"></a>
### Create training and testing data for a lag of 1

Again, this means our X will just have 1 column.

In [9]:
# A:

<a id="reshape-the-data-to-work-with-the-lstm"></a>
## Reshape the data to work with the LSTM
---

The LSTM expects its input as a 3D array of the form:

    [samples, time_steps, features]

We have been using 2D predictor matrices for our machine learning algorithms, where our X predictor matrix has been in the form:

    [samples, features]

Since we are working with timeseries (which is the data an LSTM expects), we are now required to provide information about the time.

You can use the `np.reshape()` command to turn your 2D X matrix into a 3D array that works for the LSTM layer. We will talk about the "time step" dimension more down the line.

> **Note:** In the case of a single lag this time step dimension is redundant. Later on, when we redesign the X matrix so that our individual features have multiple timesteps, this 3D format requirement will be clearer.

In [10]:
# A:
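
As a small illustration of the reshape for the single-lag case, using a made-up predictor matrix:

```python
import numpy as np

# Hypothetical lag-1 predictor matrix: 4 samples, 1 feature.
X = np.array([[0.1], [0.2], [0.3], [0.4]])

# Insert a time step dimension of length 1 between samples and features:
# [samples, features] -> [samples, time_steps, features]
X_3d = np.reshape(X, (X.shape[0], 1, X.shape[1]))

print(X.shape, '->', X_3d.shape)
```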

<a id="constructing-the-keras-model"></a>
## Constructing the Keras model

Our LSTM model will be constructed in three parts.

First initialize the sequential layer-to-layer neural network model:

```python
model = Sequential()
```
    
Add an LSTM layer with 4 blocks/cells/neurons. We specify the `input_shape` to be the same dimensions as our features. You will notice that the `input_shape` below takes a tuple `(None, lag)`. The `None` is a placeholder for the timesteps of our features. By putting `None` we are simply avoiding specifying the timesteps.

```python
model.add(LSTM(4, input_shape=(None, lag)))
```

Add the output layer as a layer of one neuron that is fully connected to all of the previous LSTM cells:

```python
model.add(Dense(1))
```

In [11]:
# A:

<a id="fit-the-lstm-model"></a>
### Fit the LSTM model

We can fit the model with these commands:

```python
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(trainX, trainY, epochs=100, batch_size=1, verbose=2)  # `epochs` was named `nb_epoch` in Keras 1.x
```

This uses the mean squared error loss (appropriate for regression) and fits the model over 100 "epochs", or passes over the training data. Multiple passes are needed because each gradient descent update only moves the weights by a small step, governed by the learning rate (which we have left at its default).

The `optimizer='adam'` selects the type of algorithm for gradient descent. The Adam optimizer performs well and is often recommended for many types of neural network architectures.

In [12]:
# A:
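
Putting the construction and fitting steps together, a minimal end-to-end sketch might look like the following. The sine-wave series is a synthetic stand-in for the scaled, differenced unemployment data, and the epoch count is reduced so the sketch runs quickly. It assumes Keras 2 or later, where the argument is called `epochs`.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, LSTM

lag = 1

# Synthetic stand-in for the scaled, differenced series (not the lesson data).
rng = np.random.RandomState(0)
series = np.sin(np.linspace(0, 6, 40)) + 0.05 * rng.randn(40)

# Lag-1 predictors and targets; X is reshaped to [samples, time_steps, features].
trainX = series[:-1].reshape(-1, 1, lag)
trainY = series[1:]

# Sequential model: one LSTM layer with 4 cells feeding a single output neuron.
model = Sequential()
model.add(LSTM(4, input_shape=(None, lag)))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')

# Only 5 epochs to keep the sketch fast; the lesson fits for 100.
model.fit(trainX, trainY, epochs=5, batch_size=1, verbose=0)

# One prediction per training row, returned as a column vector.
trainPredict = model.predict(trainX, verbose=0)
print(trainPredict.shape)
```

Recent Keras versions may warn that passing `input_shape` to the first layer is discouraged; the model should still build.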

<a id="plot-the-original-data-the-training-predictions-and-the-testing-predictions"></a>
### Plot the original data, the training predictions and the testing predictions

You can predict from a Keras model much like with a sklearn model:

```python
trainPredict = model.predict(trainX)
testPredict = model.predict(testX)
```

When plotting, make sure you shift the predictions forward by the lag for both the training and testing sets, so that they line up with the original series!

In [13]:
# A:
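
One way to line the predictions up with the original series is to place them into NaN-filled arrays of the full series length. The sketch below uses made-up stand-ins for the series and for the `model.predict` output, so it only illustrates the indexing.

```python
import numpy as np

lag = 1

# Stand-in for the full series, split in half through time.
series = np.arange(10, dtype=float)
split_index = len(series) // 2
train, test = series[:split_index], series[split_index:]

# Stand-ins for model.predict output: one prediction per (X, y) row in each split.
trainPredict = train[lag:] + 0.1
testPredict = test[lag:] + 0.1

# Start from all-NaN arrays so a line plot leaves gaps where there is no prediction.
trainPlot = np.full(len(series), np.nan)
testPlot = np.full(len(series), np.nan)

# The first prediction in each split belongs at position `lag` within that split.
trainPlot[lag:split_index] = trainPredict
testPlot[split_index + lag:] = testPredict

print(trainPlot)
print(testPlot)
```

The three arrays (`series`, `trainPlot`, `testPlot`) can then be passed to `plt.plot` one after another on the same axes. Real `model.predict` output has shape `(n, 1)`, so it would need `.ravel()` before being assigned this way.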

<a id="lstm-with-multiple-lags-as-predictors"></a>
## LSTM with multiple lags as predictors
---

Instead of predicting the unemployment rate from only the prior rate, we can predict it from the `t-1`, `t-2`, and `t-3` rates.

You can use the function you wrote above to construct a new X and y where X now has 3 predictors according to the different lags.

Create the new Y and X variables below:

In [14]:
# A:

<a id="refraing-the-problem-using-the-lstm-time-steps-dimension"></a>
## Reframing the problem using the LSTM "time steps" dimension
---

Recall that our X matrix is converted to the form:

    [samples, time steps, features]
    
In the model we just made, we were saying that we had 3 different features, each 1 time step long. This works fine, but it would be more appropriate to say that we had 1 feature with 3 different time steps, since that's what the data actually is (the unemployment rate lagged to different degrees).

Instead of reshaping our data where the time step dimension is 1, we can instead reshape it so that the feature dimension is 1 and the time step dimension is 3, like so:

```python
trainX = np.reshape(trainX, (trainX.shape[0], trainX.shape[1], 1))
testX = np.reshape(testX, (testX.shape[0], testX.shape[1], 1))
```

In this toy example, this will for all intents and purposes be the same, but it is more appropriate to specify it this way since the variable is the same.

Recreate your training and testing data, this time placing your lags on the time dimension rather than the feature dimension:

In [15]:
# A:

<a id="rebuild-and-fit-the-lstm-model-and-plot-the-predictions"></a>
### Rebuild and fit the LSTM model, and plot the predictions

You will need to change the `input_shape` now to be `input_shape=(lag, 1)`, which indicates we have "lag" number of timesteps and 1 predictor/feature.

In [16]:
# A:

<a id="stateful-lstm-models"></a>
## "Stateful" LSTM models
---

There is another even more appropriate way we can construct this LSTM model. Right now we are restricting the model to use the previous 3 timesteps only, but wouldn't it be better to allow the model to "remember" *all* of the previous timesteps? How would we do this?

LSTM models in Keras can be set as "stateful". This means that instead of resetting their internal state after each training batch, the internal state of the neurons is maintained. We can use this to our advantage to have the LSTM maintain a memory of all the previous timesteps as it fits:
1. Build the X and y data with lag 1.
2. Reshape the data to be 3D.
3. Set the batch size to 1 (this means that only 1 observation will be fed into the model at a time.)
4. Construct the LSTM model with a stateful LSTM layer.
5. Fit the model multiple times, setting `epochs=1` (`nb_epoch=1` in Keras 1.x), `shuffle=False`, and `batch_size=1`. This will go through the observations sequentially until the end, feeding them into the model one by one.
6. At the end of each iteration through all the data, call `model.reset_states()` to reset the internal state manually for the next iteration through.

The code to do this is outlined below:

In [17]:
# A:
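
One way the numbered steps above might look is sketched here. It runs on a synthetic stand-in series with only a few passes, and assumes a recent Keras where the fixed batch size is declared with `Input(batch_shape=...)`; older versions typically passed `batch_input_shape` to the LSTM layer instead. It also calls `reset_states()` on the LSTM layer directly, which for a model with a single recurrent layer is assumed to be equivalent to `model.reset_states()`.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, LSTM, Input

# Synthetic stand-in for the scaled, differenced series (not the lesson data).
rng = np.random.RandomState(0)
series = np.sin(np.linspace(0, 6, 30)) + 0.05 * rng.randn(30)

# Steps 1-2: lag-1 data reshaped to [samples, time_steps, features].
trainX = series[:-1].reshape(-1, 1, 1)
trainY = series[1:]

# Step 3: a stateful layer needs the batch size fixed in the input shape.
batch_size = 1

# Step 4: stateful LSTM layer followed by a single output neuron.
lstm = LSTM(4, stateful=True)
model = Sequential()
model.add(Input(batch_shape=(batch_size, 1, 1)))
model.add(lstm)
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')

# Steps 5-6: one ordered pass per iteration, then a manual state reset.
n_passes = 3  # kept small for the sketch; a real fit would use many more
for i in range(n_passes):
    model.fit(trainX, trainY, epochs=1, batch_size=batch_size,
              shuffle=False, verbose=0)
    lstm.reset_states()

# Predict with the same batch size the model was built with.
trainPredict = model.predict(trainX, batch_size=batch_size, verbose=0)
print(trainPredict.shape)
```

Because `shuffle=False` and the state is only cleared between passes, the layer carries its memory from the first observation to the last within each pass.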