# Recurrent Neural Networks

RNNs are based on the same principles as those behind FFNNs, which is why we spent so much time reminding ourselves of the feedforward and backpropagation steps that are used in the training phase.

There are two main differences between FFNNs and RNNs. The Recurrent Neural Network uses:

1. ***sequences*** as inputs in the training phase, and
2. ***memory*** elements

Memory is defined as the output of the hidden layer neurons, which serves as additional input to the network during the next training step.

<img src="rnn_img/m9.png" alt="drawing" width="500"/>

Where:
- y is the output
- x is the input
- s is the state, which carries the temporal dependency

### Applications:

1. Sentiment Analysis
2. Speech Recognition
3. Time Series Prediction
4. Natural Language Processing
5. Gesture Recognition

<img src="rnn_img/m10.png" alt="drawing" width="500"/>
<img src="rnn_img/m11.png" alt="drawing" width="500"/>
<img src="rnn_img/m12.png" alt="drawing" width="500"/>

As we've seen, in an FFNN the output at any time t is a function of the current input and the weights. This can be easily expressed using the following equation:

$y_t = F(x_t, W)$

In RNNs, the output at time t depends not only on the current input and the weights, but also on previous inputs. In this case the output at time t will be defined as:

<img src="rnn_img/m13.png" alt="drawing" width="500"/>

This is the RNN folded model:

<img src="rnn_img/m14.png" alt="drawing" width="200"/>

In this picture, $\bar{x}$ represents the input vector, $\bar{y}$ represents the output vector and $\bar{s}$ denotes the state vector.

$W_x$ is the weight matrix connecting the inputs to the state layer.

$W_y$ is the weight matrix connecting the state layer to the output layer.

$W_s$ represents the weight matrix connecting the state from the previous timestep to the state in the current timestep.
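
To make the roles of the three matrices concrete, here is a minimal NumPy sketch of their shapes. The layer sizes (`n_in`, `n_state`, `n_out`) and the random initialisation are assumptions chosen only for illustration; the row-vector convention ($\bar{x} W$) follows the equations used in these notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes, chosen only for illustration.
n_in, n_state, n_out = 3, 4, 2

W_x = rng.normal(scale=0.1, size=(n_in, n_state))     # input -> state
W_s = rng.normal(scale=0.1, size=(n_state, n_state))  # previous state -> current state
W_y = rng.normal(scale=0.1, size=(n_state, n_out))    # state -> output

print(W_x.shape, W_s.shape, W_y.shape)
```

Note that $W_s$ is always square, because it maps a state vector to another state vector of the same size.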



### Unfolded in time

The unfolded model is usually what we use when working with RNNs.

<img src="rnn_img/m15.png" alt="drawing" width="800"/>


In FFNNs the hidden layer depends only on the current inputs and weights, as well as on an activation function $\Phi$, in the following way:

$\bar{h'} = \Phi(\bar{x} W^1)$

In RNNs the state layer depends on the current inputs, their corresponding weights, the activation function and also on the previous state:

<img src="rnn_img/w13.png" alt="drawing" width="400"/>
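
A single state update can be sketched in a few lines. This assumes the figure shows the usual Elman update $\bar{s}_t = \Phi(\bar{x}_t W_x + \bar{s}_{t-1} W_s)$, uses tanh for $\Phi$, and starts from a zero state; the function name `next_state` and the weight values are made up for illustration.

```python
import numpy as np

def next_state(x_t, s_prev, W_x, W_s, phi=np.tanh):
    """One state update: s_t = phi(x_t W_x + s_prev W_s)."""
    return phi(x_t @ W_x + s_prev @ W_s)

W_x = np.array([[0.5, -0.5]])   # 1 input, 2 state units
W_s = np.array([[0.1, 0.0],
                [0.0, 0.1]])    # state -> state

s0 = np.zeros(2)                                 # initial state (assumed zero)
s1 = next_state(np.array([1.0]), s0, W_x, W_s)   # input 1.0 at t=1
s2 = next_state(np.array([0.0]), s1, W_x, W_s)   # input 0.0 at t=2

print(s1)
print(s2)
```

Even though the input at the second step is zero, `s2` is non-zero: the previous state carries information about the earlier input forward in time.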


The output vector is calculated exactly the same as in FFNNs. It can be a linear combination of the inputs to each output node with the corresponding weight matrix $W_y$, or a softmax function of the same linear combination.

$y_t = \sigma(\bar{s}_t W_y)$
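
Both output variants can be sketched as follows. The helper names (`softmax`, `output`) and the numbers are hypothetical; the only thing taken from the text is that the output is either the linear combination $\bar{s}_t W_y$ or a softmax of it.

```python
import numpy as np

def softmax(z):
    """Softmax of a 1-D vector, shifted by its max for numerical stability."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def output(s_t, W_y, use_softmax=True):
    """Output layer: linear combination s_t W_y, optionally passed through softmax."""
    z = s_t @ W_y
    return softmax(z) if use_softmax else z

s_t = np.array([0.2, -0.1, 0.4])
W_y = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])

y_lin = output(s_t, W_y, use_softmax=False)  # plain linear output
y_soft = output(s_t, W_y)                    # entries are positive and sum to 1

print(y_lin)
print(y_soft)
```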

# Unfolded Model

The Elman Network (Unfolded Network) is usually modelled bottom up.

<img src="rnn_img/m16.png" alt="drawing" width="500"/>

State inputs are the outputs of the hidden layer from the previous forward pass. They are fed into the hidden layer together with the current inputs.

<img src="rnn_img/m18.png" alt="drawing" width="500"/>
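
The unfolded picture corresponds to a simple loop over the sequence. The following is a minimal NumPy sketch, assuming a tanh activation, a zero initial state and a linear output layer; the sizes and random weights are made up for illustration.

```python
import numpy as np

def forward(xs, s0, W_x, W_s, W_y):
    """Run the RNN over a sequence and return all states and outputs."""
    states, outputs = [], []
    s = s0
    for x_t in xs:                          # one iteration per timestep
        s = np.tanh(x_t @ W_x + s @ W_s)    # new state uses the previous state
        states.append(s)
        outputs.append(s @ W_y)             # linear output at this timestep
    return np.array(states), np.array(outputs)

rng = np.random.default_rng(1)
W_x = rng.normal(size=(3, 4))   # input -> state
W_s = rng.normal(size=(4, 4))   # state -> state
W_y = rng.normal(size=(4, 2))   # state -> output
xs = rng.normal(size=(5, 3))    # sequence of 5 input vectors

states, outputs = forward(xs, np.zeros(4), W_x, W_s, W_y)
print(states.shape, outputs.shape)
```

The same three weight matrices are reused at every timestep; only the state changes as the loop advances.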

# Backpropagation Through Time

We are now ready to understand how to train the RNN.

When we train RNNs we also use backpropagation, but with a conceptual change. The process is similar to that in the FFNN, with the exception that we need to consider previous time steps, as the system has memory. This process is called Backpropagation Through Time (BPTT).

We measure the error with a Loss Function, here the square of the difference between the desired and the calculated outputs. There are variations of the Loss Function, for example scaling it by a constant. In the backpropagation example we used a scaling factor of 1/2 for calculation convenience.

As described previously, the two most commonly used loss functions are the Mean Squared Error (MSE) (usually used in regression problems) and the cross entropy (usually used in classification problems). We are using a variation of the MSE.

The state vector $\bar{s}_t$ is calculated the following way:

<img src="rnn_img/m20.png" alt="drawing" width="200"/>

The output vector $\bar{y}_t$ can be the product of the state vector $\bar{s}_t$ and the corresponding weight elements of matrix $W_y$. As mentioned before, if the desired outputs are between 0 and 1, we can also use a softmax function. The following set of equations depicts these calculations:

<img src="rnn_img/m21.png" alt="drawing" width="200"/>

As mentioned before, for the error calculations we will use the Loss Function, where:

$E_t$ represents the output error at time t

$d_t$ represents the desired output at time t

$y_t$ represents the calculated output at time t

<img src="rnn_img/m22.png" alt="drawing" width="200"/>
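
As a small numeric illustration of this loss, the sketch below computes $E_t = (d_t - y_t)^2$. Summing over the components of a vector output and the optional `scale` argument (for the 1/2 variant) are assumptions made here, and the numbers are invented.

```python
import numpy as np

def squared_error(d_t, y_t, scale=1.0):
    """Squared error between desired output d_t and calculated output y_t."""
    diff = np.asarray(d_t) - np.asarray(y_t)
    return scale * np.sum(diff ** 2)

d_t = [1.0, 0.0]   # desired output (made-up values)
y_t = [0.8, 0.1]   # calculated output (made-up values)

E = squared_error(d_t, y_t)                  # 0.2**2 + 0.1**2
E_half = squared_error(d_t, y_t, scale=0.5)  # same loss scaled by 1/2

print(E, E_half)
```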

In BPTT we train the network at timestep t while also taking into account all of the previous timesteps.

The easiest way to explain the idea is to simply jump into an example.

In this example we will focus on the BPTT process for time step t=3. You will see that in order to adjust all three weight matrices, $W_x$, $W_s$ and $W_y$, we need to consider timestep 3 as well as timestep 2 and timestep 1.

As we are focusing on timestep t=3, the Loss function will be: $E_3=(\bar{d}_3-\bar{y}_3)^2$

<img src="rnn_img/m23.png" alt="drawing" width="200"/>

To update each weight matrix, we need to find the partial derivatives of the Loss Function at time 3 with respect to each of the weight matrices. We will modify each matrix using gradient descent while considering the previous timesteps.

<img src="rnn_img/m24.png" alt="drawing" width="200"/>
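
The update itself is the usual gradient descent step. Here is a minimal sketch, assuming a learning rate `lr` (a hypothetical name and value, not taken from the text) and a gradient that has already been computed:

```python
import numpy as np

def gradient_descent_step(W, dE_dW, lr=0.1):
    """Return the updated weights W - lr * dE/dW."""
    return W - lr * dE_dW

W_y = np.array([[0.5, -0.2],
                [0.1,  0.3]])
dE_dW_y = np.array([[1.0,  0.0],
                    [0.0, -2.0]])   # made-up gradient values for illustration

W_y_new = gradient_descent_step(W_y, dE_dW_y)
print(W_y_new)
```

The same step is applied to $W_x$ and $W_s$; what differs between the matrices is how the gradient is obtained.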

We will now unfold the model. You will see that unfolding the model in time is very helpful in visualizing the number of steps (translated into multiplication) needed in the Backpropagation Through Time process. These multiplications stem from the chain rule and are easily visualized using this model.

In this section we will see how to use Backpropagation Through Time (BPTT) when adjusting two weight matrices:

$W_y$ - the weight matrix connecting the state to the output

$W_s$ - the weight matrix connecting one state to the next state

<img src="rnn_img/m25.png" alt="drawing" width="600"/>

The partial derivative of the Loss Function with respect to $W_y$ is found by a simple one step chain rule. (Note that in this case we do not need to use BPTT.)

<img src="rnn_img/m27.png" alt="drawing" width="200"/>
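
The one step chain rule for $W_y$ can be checked numerically. The sketch below assumes a linear output $\bar{y}_3 = \bar{s}_3 W_y$ and the loss $E_3 = (\bar{d}_3 - \bar{y}_3)^2$ summed over output components; the helper names and values are made up, and the finite-difference check is included only to confirm the analytic gradient.

```python
import numpy as np

def loss(W_y, s3, d3):
    """E_3 for a linear output layer y_3 = s_3 W_y."""
    return np.sum((d3 - s3 @ W_y) ** 2)

def grad_W_y(W_y, s3, d3):
    """dE_3/dW_y = dE_3/dy_3 * dy_3/dW_y (no earlier timesteps involved)."""
    dE_dy = -2.0 * (d3 - s3 @ W_y)   # derivative of the squared error
    return np.outer(s3, dE_dy)       # dy_3/dW_y contributes the state s_3

def numerical_grad(f, W, eps=1e-6):
    """Central finite differences, one weight at a time."""
    g = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[idx] += eps
        W_minus[idx] -= eps
        g[idx] = (f(W_plus) - f(W_minus)) / (2 * eps)
    return g

s3 = np.array([0.3, -0.6, 0.1])   # state at t=3 (made-up values)
d3 = np.array([1.0, 0.0])         # desired output at t=3 (made-up values)
W_y = np.array([[0.2, -0.4],
                [0.7,  0.1],
                [-0.3, 0.5]])

analytic = grad_W_y(W_y, s3, d3)
numeric = numerical_grad(lambda W: loss(W, s3, d3), W_y)
print(np.max(np.abs(analytic - numeric)))
```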

Generally speaking, we can consider multiple timesteps back, and not only 3 as in this example. For an arbitrary timestep N, the gradient calculation needed for adjusting $W_y$ is:

<img src="rnn_img/m28.png" alt="drawing" width="200"/>

<img src="rnn_img/m30.png" alt="drawing" width="1200"/>

<img src="rnn_img/m31.png" alt="drawing" width="1200"/>
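
For $W_s$ the gradient accumulates one term per timestep, because every earlier state also depends on $W_s$. The sketch below uses a scalar RNN to keep the sum readable. It assumes $s_t = \tanh(w_x x_t + w_s s_{t-1})$, a linear output $y_t = w_y s_t$ and $E_3 = (d_3 - y_3)^2$; all names and numbers are hypothetical, and the finite-difference value is there only as a check.

```python
import math

def forward(xs, w_x, w_s, w_y, s0=0.0):
    """Return the list of states [s_0, s_1, ..., s_N] and the last output y_N."""
    states = [s0]
    for x in xs:
        states.append(math.tanh(w_x * x + w_s * states[-1]))
    return states, w_y * states[-1]

def loss(xs, d, w_x, w_s, w_y):
    _, y = forward(xs, w_x, w_s, w_y)
    return (d - y) ** 2

def grad_w_s(xs, d, w_x, w_s, w_y):
    """dE_N/dw_s as a sum over timesteps i = 1..N of chain-rule products."""
    s, y = forward(xs, w_x, w_s, w_y)
    n = len(xs)
    dE_dy = -2.0 * (d - y)
    dy_ds = w_y
    total = 0.0
    for i in range(1, n + 1):
        # ds_N/ds_i: product of one-step factors ds_t/ds_{t-1} for t = i+1..N
        ds_n_ds_i = 1.0
        for t in range(i + 1, n + 1):
            ds_n_ds_i *= (1.0 - s[t] ** 2) * w_s
        # direct dependence of s_i on w_s, holding s_{i-1} fixed
        # (for i = 1 this is zero here because the initial state is zero)
        ds_i_dw = (1.0 - s[i] ** 2) * s[i - 1]
        total += dE_dy * dy_ds * ds_n_ds_i * ds_i_dw
    return total

xs, d = [0.5, -0.3, 0.8], 0.4      # three timesteps, target at t=3
w_x, w_s, w_y = 0.7, 0.5, 1.2

analytic = grad_w_s(xs, d, w_x, w_s, w_y)
eps = 1e-6
numeric = (loss(xs, d, w_x, w_s + eps, w_y)
           - loss(xs, d, w_x, w_s - eps, w_y)) / (2 * eps)
print(analytic, numeric)
```

The inner product over `t` is where the repeated multiplications mentioned earlier come from: the further back a timestep lies, the more factors its term contains.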

We still need to adjust $W_x$, the weight matrix connecting the input to the state.

<img src="rnn_img/m32.png" alt="drawing" width="1200"/>

<img src="rnn_img/m33.png" alt="drawing" width="1200"/>

<img src="rnn_img/m34.png" alt="drawing" width="1200"/>
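
In practice the accumulated sum for $W_x$ is usually computed with a single backward loop that passes the gradient from each state to the previous one. The following NumPy sketch assumes a tanh state layer, a zero initial state, a linear output and the squared-error loss at the last timestep; the names, sizes and random values are illustrative, and the finite-difference comparison is only a sanity check.

```python
import numpy as np

def forward(xs, W_x, W_s, W_y):
    """Return states [s_0, ..., s_N] and the output at the last timestep."""
    s = np.zeros(W_s.shape[0])
    states = [s]
    for x_t in xs:
        s = np.tanh(x_t @ W_x + s @ W_s)
        states.append(s)
    return states, s @ W_y

def loss(xs, d, W_x, W_s, W_y):
    _, y = forward(xs, W_x, W_s, W_y)
    return np.sum((d - y) ** 2)

def grad_W_x(xs, d, W_x, W_s, W_y):
    """Accumulate dE_N/dW_x over all timesteps, going backwards in time."""
    states, y = forward(xs, W_x, W_s, W_y)
    dW_x = np.zeros_like(W_x)
    ds = (-2.0 * (d - y)) @ W_y.T            # dE_N/ds_N
    for t in range(len(xs), 0, -1):
        da = ds * (1.0 - states[t] ** 2)     # back through tanh at timestep t
        dW_x += np.outer(xs[t - 1], da)      # contribution of timestep t
        ds = da @ W_s.T                      # pass the gradient on to s_{t-1}
    return dW_x

rng = np.random.default_rng(0)
W_x = rng.normal(scale=0.5, size=(2, 3))
W_s = rng.normal(scale=0.5, size=(3, 3))
W_y = rng.normal(scale=0.5, size=(3, 2))
xs = rng.normal(size=(3, 2))                 # three timesteps
d = np.array([0.5, -0.5])                    # desired output at t=3

analytic = grad_W_x(xs, d, W_x, W_s, W_y)

# Central finite differences as an independent check.
eps = 1e-6
numeric = np.zeros_like(W_x)
for idx in np.ndindex(W_x.shape):
    W_plus, W_minus = W_x.copy(), W_x.copy()
    W_plus[idx] += eps
    W_minus[idx] -= eps
    numeric[idx] = (loss(xs, d, W_plus, W_s, W_y)
                    - loss(xs, d, W_minus, W_s, W_y)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))
```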

# Summary

<img src="rnn_img/m35.png" alt="drawing" width="1200"/>

Possible Activation Functions:

1. Hyperbolic Tangent
2. Sigmoid
3. ReLU
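
For reference, the three activation functions can be written in NumPy as follows. This is a minimal sketch; the helper names `sigmoid` and `relu` are chosen here, while `np.tanh` is NumPy's own function.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Rectified linear unit, zero for negative inputs."""
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
print(np.tanh(z))   # output in (-1, 1)
print(sigmoid(z))
print(relu(z))
```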

<img src="rnn_img/m41.png" alt="drawing" width="1200"/>

Gradient Clipping rescales the gradient whenever its size exceeds a chosen threshold. This keeps exploding gradients under control.
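
One common variant, clipping by norm, can be sketched as below. The function name `clip_by_norm` and the threshold values are assumptions for illustration; clipping each element to a fixed range is another option that is not shown here.

```python
import numpy as np

def clip_by_norm(grad, threshold):
    """Rescale grad so that its L2 norm is at most threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        return grad * (threshold / norm)   # same direction, smaller length
    return grad

g = np.array([3.0, 4.0])            # L2 norm is 5
clipped = clip_by_norm(g, 1.0)      # rescaled to norm 1
unchanged = clip_by_norm(g, 10.0)   # below the threshold, left as is

print(clipped, unchanged)
```

Because the gradient is rescaled rather than truncated, its direction is preserved and only the step size is limited.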