# Multivariate Time Series Forecasting

For this analysis, we will employ a special type of neural network called ___Long Short-Term Memory___ (LSTM), which is well suited to ___sequential, time series data___. I want to quickly go over LSTMs for my own benefit in understanding this analysis. I use [this site](https://cnvrg.io/pytorch-lstm/) as a helpful source.

---
### Purpose

Why wouldn't a traditional NN work for this analysis? For the dataset we have, the order of the data, i.e. the sequence, matters. A traditional NN is limited here for the following reasons:

* It has a fixed input length.
* It cannot remember the sequence of the data, i.e. it ignores order.
* It cannot share parameters across the sequence.

A sequential task requires a model that can:

* Handle variable-length sequences
* Track long-term dependencies
* Maintain information about the order
* Share parameters across the sequence

---

### Recurrent Neural Network

A ___Recurrent Neural Network___ (RNN) is structured like a typical NN (unidirectional), but has loops inside it that persist information across time steps $t$, hence "recurrent". In other words, a ___recurrence relation___ is applied at every time step to process a sequence:

\begin{align}
    h_{t} = f_{w}(h_{t - 1}, x_{t})\, ,
\end{align}

where $h_t$ is the current hidden state, $f_w$ is a function parameterized by the weights $w$, $h_{t-1}$ is the previous state, and $x_t$ is the input vector at time step $t$. _Note that the same function and the same set of parameters are used at every time step_. More concretely, given the input vector $x_{t}$, the RNN updates its hidden state with a standard NN operation:

\begin{align}
    h_{t} = \tanh\left( W^{T}_{hh}h_{t-1} + W^{T}_{xh}x_{t}\right)\, .
\end{align}

Here there are two separate weight matrices: $W_{hh}$ multiplies the previous state $h_{t-1}$ and $W_{xh}$ multiplies the input $x_t$. The non-linearity ($\tanh$) is then applied to the sum of the two products. Finally, the output vector $\hat{y}_{t}$ at time step $t$ is

\begin{align}
    \hat{y}_{t} = W^{T}_{hy}h_{t} \, ,
\end{align}

which is a transformed version of the internal state, obtained by multiplying it by another weight matrix, $W_{hy}$. This is how an RNN updates its hidden state and calculates the output.
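
The two equations above can be sketched in NumPy as follows. This is only a minimal illustration: the function name `rnn_step`, the layer sizes, and the random weights are assumptions made for the example, not part of the analysis itself.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy):
    """One step of a vanilla RNN cell, following the equations above."""
    # h_t = tanh(W_hh^T h_{t-1} + W_xh^T x_t)
    h_t = np.tanh(W_hh.T @ h_prev + W_xh.T @ x_t)
    # y_hat_t = W_hy^T h_t
    y_hat_t = W_hy.T @ h_t
    return h_t, y_hat_t

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 3, 4, 2          # illustrative sizes (assumption)
W_xh = rng.normal(scale=0.1, size=(n_in, n_hidden))
W_hh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
W_hy = rng.normal(scale=0.1, size=(n_hidden, n_out))

x_t = rng.normal(size=n_in)              # input vector at time step t
h_prev = np.zeros(n_hidden)              # previous hidden state
h_t, y_hat_t = rnn_step(x_t, h_prev, W_xh, W_hh, W_hy)
print(h_t.shape, y_hat_t.shape)
```

With the shapes chosen here, the transposes in the code match the $W^{T}$ notation of the equations directly.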

A single cell looks like this:

<img src="figures/single-RNN-cell.jpg" alt="fishy" class="bg-primary mb-1" width="200px">

---

Consider the __unfolding process__:

<img src="figures/Unfolding-RNNs.jpg" alt="fishy" class="bg-primary mb-1" width="800px">

* An input is fed in at every time step, and an output $\hat{y}$ is generated at every time step. The same weight matrices are used at every time step.
* $W_{hh}$ is the weight matrix applied to the previous state, as shown in the equation above and in the figure.
* $W_{xh}$ is the weight matrix applied to the input at every time step.
* $W_{hy}$ is the weight matrix applied to the hidden state to produce the output $\hat{y}$.
* From the outputs $\hat{y}_0$, $\hat{y}_1$, $\hat{y}_2$, $\dots$, $\hat{y}_t$, you can calculate the losses $L_{0}$, $L_{1}$, $L_{2}$, $\dots$, $L_{t}$, one per time step.
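
The unfolding can be sketched as a plain loop that reuses the same three weight matrices at every step. The helper name `rnn_forward`, the squared-error loss, and the random data are assumptions for illustration only; the figure does not specify a loss function.

```python
import numpy as np

def rnn_forward(xs, ys, W_xh, W_hh, W_hy):
    """Unroll a vanilla RNN over a sequence, reusing the same weights."""
    h = np.zeros(W_hh.shape[0])               # initial hidden state
    y_hats, losses = [], []
    for x_t, y_t in zip(xs, ys):
        h = np.tanh(W_hh.T @ h + W_xh.T @ x_t)
        y_hat = W_hy.T @ h
        y_hats.append(y_hat)
        # One loss per time step (squared error is an assumption here)
        losses.append(0.5 * np.sum((y_hat - y_t) ** 2))
    return np.array(y_hats), np.array(losses)

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 3, 4, 2               # illustrative sizes
W_xh = rng.normal(scale=0.1, size=(n_in, n_hidden))
W_hh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
W_hy = rng.normal(scale=0.1, size=(n_hidden, n_out))

# The same weights handle sequences of different lengths.
for seq_len in (5, 8):
    xs = rng.normal(size=(seq_len, n_in))
    ys = rng.normal(size=(seq_len, n_out))
    y_hats, losses = rnn_forward(xs, ys, W_xh, W_hh, W_hy)
    print(seq_len, y_hats.shape, losses.sum())
```

Because the weights do not depend on the sequence length, the same function covers the fixed-input-length limitation mentioned earlier.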

---

### Backpropagation through time in RNN

As a __summary__ of the forward pass:

1) The RNN updates the hidden state from the input and the previous state.
2) It computes the output vector via a simple neural network operation, $W_{hy}^{T} h_{t}$.
3) It returns the output and the updated hidden state.
4) The per-step losses are summed into a total loss $L$, which is then propagated backwards to complete the ___backpropagation___.
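
Written out, the total loss from step 4 and its gradient with respect to any of the shared weight matrices (written generically as $W$ here, which is notation assumed for this note) are:

\begin{align}
    L = \sum_{t} L_{t}\, , \qquad \frac{\partial L}{\partial W} = \sum_{t} \frac{\partial L_{t}}{\partial W}\, .
\end{align}

Because the same $W$ appears at every time step, each $L_t$ contributes to its gradient.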

Backpropagation in RNNs works similarly to backpropagation in simple neural networks, which has the following main steps:

1) Feedforward pass
2) Take the derivative of the loss with respect to each parameter
3) Shift the parameters in the direction that minimizes the loss

<img src="figures/Backpropagation-in-RNNs.jpg" alt="fishy" class="bg-primary mb-1" width="800px">

The figure above shows the following steps:

* Complete a _feedforward pass_, i.e. compute the outputs directly from the inputs, in one pass (blue to pink via black arrows).
* Calculate the loss at each output (khaki Ls).
* Take the derivative of each loss with respect to the parameters.
* Propagate backward to update the weights (follow red arrows).
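
As a rough sketch of these steps, the snippet below runs a forward pass, backpropagates through time by hand, and takes one gradient-descent step. The squared-error loss, the helper names `forward` and `bptt`, the learning rate, and the random data are all assumptions made for illustration.

```python
import numpy as np

def forward(xs, ys, W_xh, W_hh, W_hy):
    """Feedforward pass: returns the total loss, hidden states, and outputs."""
    hs = [np.zeros(W_hh.shape[0])]             # hs[0] is the initial state
    y_hats, loss = [], 0.0
    for x_t, y_t in zip(xs, ys):
        h = np.tanh(W_hh.T @ hs[-1] + W_xh.T @ x_t)
        y_hat = W_hy.T @ h
        hs.append(h)
        y_hats.append(y_hat)
        loss += 0.5 * np.sum((y_hat - y_t) ** 2)   # assumed squared-error loss
    return loss, hs, y_hats

def bptt(xs, ys, W_xh, W_hh, W_hy):
    """Backpropagation through time for the total (summed) loss."""
    loss, hs, y_hats = forward(xs, ys, W_xh, W_hh, W_hy)
    dW_xh = np.zeros_like(W_xh)
    dW_hh = np.zeros_like(W_hh)
    dW_hy = np.zeros_like(W_hy)
    dh_next = np.zeros_like(hs[0])             # gradient arriving from step t+1
    for t in reversed(range(len(xs))):
        dy = y_hats[t] - ys[t]                 # dL_t / dy_hat_t
        dW_hy += np.outer(hs[t + 1], dy)
        dh = W_hy @ dy + dh_next               # gradient w.r.t. h_t
        da = (1.0 - hs[t + 1] ** 2) * dh       # back through tanh
        dW_xh += np.outer(xs[t], da)
        dW_hh += np.outer(hs[t], da)
        dh_next = W_hh @ da                    # pass the gradient to step t-1
    return loss, dW_xh, dW_hh, dW_hy

rng = np.random.default_rng(0)
T, n_in, n_hidden, n_out = 6, 3, 4, 2          # illustrative sizes
xs = rng.normal(size=(T, n_in))
ys = rng.normal(size=(T, n_out))
W_xh = rng.normal(scale=0.5, size=(n_in, n_hidden))
W_hh = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
W_hy = rng.normal(scale=0.5, size=(n_hidden, n_out))

loss, dW_xh, dW_hh, dW_hy = bptt(xs, ys, W_xh, W_hh, W_hy)

lr = 0.01                                      # assumed learning rate
new_loss, _, _ = forward(xs, ys, W_xh - lr * dW_xh,
                         W_hh - lr * dW_hh, W_hy - lr * dW_hy)
print(loss, new_loss)
```

Note how `dh_next` is multiplied by `W_hh` once per time step on the way back.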

Computing the gradients involves many repeated factors of $W_{hh}$, together with repeated derivative computations, which makes training problematic. If these repeated factors are greater than 1, the gradients explode; if they are smaller than 1, the gradients vanish. This can usually be mitigated by the choice of activation function (the Rectified Linear Unit, ReLU, instead of $\tanh$), by weight initialization, or by changing the network architecture. We will focus on the last option here, where we modify the architecture of the RNN and use a more complex recurrent unit with gates, such as the LSTM or the GRU (Gated Recurrent Unit).
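
A small numerical illustration of this effect, under simplifying assumptions: the $\tanh$ derivative is ignored, and $W_{hh}$ is taken to be a scaled orthogonal matrix so that each multiplication changes the gradient norm by exactly the scale factor. The helper name `backprop_norms` is hypothetical.

```python
import numpy as np

def backprop_norms(W_hh, steps):
    """Norm of a gradient vector after each repeated multiplication by W_hh."""
    n = W_hh.shape[0]
    g = np.ones(n) / np.sqrt(n)                # unit-norm starting gradient
    norms = []
    for _ in range(steps):
        g = W_hh @ g                           # one step back in time
        norms.append(np.linalg.norm(g))
    return norms

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))   # orthogonal matrix, preserves norms

vanishing = backprop_norms(0.5 * Q, 50)        # factors smaller than 1
exploding = backprop_norms(1.5 * Q, 50)        # factors greater than 1
print(vanishing[-1], exploding[-1])
```

After 50 steps the first norm has shrunk by a factor of $0.5^{50}$ and the second has grown by $1.5^{50}$.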

---

### Long Short Term Memory (LSTMs)

LSTMs are a special type of RNN that address some of the important shortcomings of plain RNNs, namely long-term dependencies and vanishing gradients. ___LSTMs are best suited for long-term dependencies___.

* LSTMs introduce self-loops to produce paths where gradients can flow for a long duration, which mitigates vanishing gradients.
* This is the main contribution of the initial long short-term memory model (Hochreiter and Schmidhuber, 1997).
* Later, a crucial addition made the weight on this self-loop conditioned on the context, rather than fixed.
* This allows the time scale of integration to change dynamically.

This means that even when an LSTM has fixed parameters, the time scale of integration can change based on the input sequence, because the time constants are output by the model itself.

#### LSTMs work in four steps:

The key building block behind the LSTM is a structure known as a __gate__. Information is __added__ or __removed__ through these gates. A gate optionally lets information through, for example via a __sigmoid layer__ followed by a pointwise multiplication.
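
To make the gating idea concrete, here is a tiny sketch of a sigmoid followed by a pointwise multiplication. The vectors `info` and `gate_logits` are made-up values chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

info = np.array([2.0, -1.0, 0.5])              # information arriving at the gate
gate_logits = np.array([10.0, -10.0, 0.0])     # pre-activation gate values

gate = sigmoid(gate_logits)                    # close to 1, close to 0, exactly 0.5
passed = gate * info                           # pointwise multiplication
print(gate, passed)
```

A gate value near 1 lets the entry through almost unchanged, a value near 0 blocks it, and 0.5 lets half of it through.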


1) ___Forget the irrelevant history___:
    * This is done via the ___forget gate___, whose main purpose is to _decide which information the LSTM should keep or carry, and which information it should throw away_.
    * The gate is a function of the previous hidden state $h_{t-1}$ and the new input $x_t$.
    * This is needed because not all the information in a sequence or a sentence is important.
    * The forget gate is denoted by $f_{i}^{(t)}$ (for time step $t$ and cell $i$). It sets a weight between 0 and 1 that decides how much information to let through:
    \begin{align}
    f_{i}^{(t)} = \sigma \left( b_{i}^{f} + \sum_{j} U_{i,j}^{f} x_{j}^{(t)}  + \sum_{j} W_{i,j}^{f}h_{j}^{(t-1)} \right)\, ,
    \end{align}
    * where $x^{(t)}$ is the current input vector, $h^{(t)}$ is the current hidden state, containing the outputs of all the LSTM cells, and $b^f$, $U^f$, $W^f$ are respectively the biases, input weights, and recurrent weights for the forget gates.



2) ___Perform the computations & store the relevant new information___
    * Once the LSTM has decided what information to keep and what to discard, it performs some computations to store the new information via the ___input gate___, sometimes known as the ___external input gate___.
    * To update the internal cell state, the previous hidden state and the current input, together with a bias, are first passed into a sigmoid activation function, which decides which values to update by mapping them to values between 0 and 1.



3) ___Use the above steps to selectively update the internal state___




4) ___Generate the output___
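
The forget gate of step 1 and the input gate of step 2 can be sketched in vectorized form as below. This is a hedged illustration: the helper name `gate`, the sizes, and the random parameters are assumptions, and the input gate is assumed to have the same functional form as the forget gate with its own bias and weights (here called `b_g`, `U_g`, `W_g`).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate(x_t, h_prev, b, U, W):
    """sigma(b + U x_t + W h_prev), the vector form of the forget-gate equation."""
    return sigmoid(b + U @ x_t + W @ h_prev)

rng = np.random.default_rng(0)
n_in, n_cells = 3, 4                           # illustrative sizes
x_t = rng.normal(size=n_in)                    # current input x^(t)
h_prev = rng.normal(size=n_cells)              # previous hidden state h^(t-1)

# Forget-gate parameters b^f, U^f, W^f (random values for illustration)
b_f = np.zeros(n_cells)
U_f = rng.normal(size=(n_cells, n_in))
W_f = rng.normal(size=(n_cells, n_cells))

# Input-gate parameters: same form, separate values (assumption)
b_g = np.zeros(n_cells)
U_g = rng.normal(size=(n_cells, n_in))
W_g = rng.normal(size=(n_cells, n_cells))

f_t = gate(x_t, h_prev, b_f, U_f, W_f)         # how much history to keep
g_t = gate(x_t, h_prev, b_g, U_g, W_g)         # which values to update

# A strongly negative bias drives the forget gate towards 0 (forget everything).
f_closed = gate(x_t, h_prev, b_f - 50.0, U_f, W_f)
print(f_t, g_t, f_closed)
```

Each gate produces one value between 0 and 1 per cell, which is then used as a pointwise weight.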
