# Programming for Data Science and Artificial Intelligence

## 17. Recurrent Neural Networks from Scratch

- [WEIDMAN] Ch6
- [CHARU] Ch7

Recurrent neural networks are designed to handle data that arrives in sequences: instead of each batch having shape **(batch_size, features)**, we now add a further dimension, **time steps**.

<img src = "figures/rnn.png" width=300>

For example, one observation could have the features from time t = 1 with the value of the target from time t = 2, the next observation could have the features from time t = 2 with the value of the target from time t = 3, and so on. If we wanted to use data from multiple time steps to make each prediction rather than data from just one time step, we could use the features from t = 1 and t = 2 to predict the target at t = 3, the features from t = 2 and t = 3 to predict the target at t = 4, and so on.
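As a minimal sketch of this windowing idea, assume the series is stored as a NumPy array with one row per time step. The helper `make_windows` and the toy data are hypothetical, introduced only for illustration:

```python
import numpy as np

def make_windows(features, targets, window):
    """Pair `window` consecutive feature rows with the target one step later.

    features: (num_steps, num_features), targets: (num_steps,)
    Returns X of shape (num_examples, window, num_features)
    and y of shape (num_examples,).
    """
    num_examples = len(features) - window
    X = np.stack([features[i:i + window] for i in range(num_examples)])
    y = targets[window:window + num_examples]
    return X, y

# Toy series: 6 time steps with 2 features each; the targets are made up.
features = np.arange(12, dtype=float).reshape(6, 2)
targets = np.arange(6, dtype=float) * 10

X1, y1 = make_windows(features, targets, window=1)  # features at t -> target at t + 1
X2, y2 = make_windows(features, targets, window=2)  # features at t, t + 1 -> target at t + 2
print(X1.shape, y1.shape)  # (5, 1, 2) (5,)
print(X2.shape, y2.shape)  # (4, 2, 2) (4,)
```

Each row of `X1` or `X2` is still handled as an independent observation, which is exactly the limitation discussed next.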

However, **treating each time step as independent ignores the fact that the data is ordered sequentially**. How would we ideally want to use the sequential nature of the data to make better predictions? The solution would look something like this:

1. Use features from time step t = 1 to make predictions for the corresponding target at t = 1.

2. Use features from time step t = 2 as well as the information from t = 1, including the value of the target at t = 1, to make predictions for t = 2.

3. Use features from time step t = 3 as well as the accumulated information from t = 1 and t = 2 to make predictions at t = 3.

4. And so on, at each step using the information from all prior time steps to make a prediction.

To do this, it seems that we'd want to pass our data through the neural network one sequence element at a time, with the data from the first time step being passed through first, then the data from the next time step, and so on. In addition, we’ll want our neural network to **accumulate information** about what it has seen before as the new sequence elements are passed through. 

In a more concrete fashion, consider the following steps and figure:

1. In the first time step, t = 1, we would pass through the observation from the first time step (along with randomly initialized representations, perhaps). We would output a prediction for t = 1, along with representations at each layer.

2. In the next time step, we would pass through the observation from the second time step, t = 2, **along with the representations computed during the first time step** (which, again, are just the outputs of the neural network’s layers), and combine these somehow (it is in this combining step that the variants of RNNs we'll learn about differ). We would use these two pieces of information to output a prediction for t = 2 as well as the updated representations at each layer, which are now a function of the inputs passed in at both t = 1 and t = 2.

3. In the third time step, we would pass through the observation from t = 3, as well as the representations that now incorporate the information from t = 1 and t = 2, and use this information to make predictions for t = 3, as well as additional updated representations at each layer, which now incorporate information from time steps 1–3.

This process is depicted in this figure:

<img src = "figures/rnn2.png" width=450>

We see that each layer has a representation that is **persistent**, getting updated over time as new observations are passed through.
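To make the idea of a persistent representation concrete, here is a small sketch for a single sequence. The combining rule used here (a `tanh` of two linear maps) is only one possible choice and is an assumption of this example, as are the names `W_x`, `W_h`, and `run`:

```python
import numpy as np

rng = np.random.default_rng(0)
num_features, hidden_size, sequence_length = 3, 4, 5

# Hypothetical weights; a plain tanh combination is just one possible choice.
W_x = rng.normal(scale=0.5, size=(num_features, hidden_size))
W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))

def run(sequence):
    """Return the representation after each time step for one sequence."""
    h = np.zeros(hidden_size)                 # starting representation
    states = []
    for x_t in sequence:                      # one sequence element at a time
        h = np.tanh(x_t @ W_x + h @ W_h)      # combine new input with old representation
        states.append(h)
    return np.stack(states)

sequence = rng.normal(size=(sequence_length, num_features))
states = run(sequence)

# Perturb only the first observation and compare the representations step by step.
perturbed = sequence.copy()
perturbed[0] += 1.0
changed = np.abs(run(perturbed) - states).max(axis=1) > 0
print(states.shape)  # (5, 4)
print(changed)       # one flag per time step: did the representation change?
```

Because the representation at each step is computed from the previous one, a change to the first observation can propagate to every later step, while a change to the last observation cannot affect earlier steps.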

### The First Class for RNNs: RNNLayer

RNNs deal with data in which each observation is two-dimensional, with dimensions `(sequence_length, num_features)` (the order of these two axes is a convention, not a requirement). Since it is generally more efficient computationally to pass data forward in batches, `RNNLayer` will have to take in three-dimensional ndarrays of shape `(batch_size, sequence_length, num_features)`.

However, we want to feed our data through our `RNNLayers` one sequence element at a time, as shown in the previous picture; how can we do this if our input, `data`, has shape `(batch_size, sequence_length, num_features)`? Here’s how:

1. Select a two-dimensional array from the second axis, starting with `data[:, 0, :]`. This ndarray will have shape `(batch_size, num_features)`.

2. Initialize a **hidden state** for the `RNNLayer` that will continually get updated with each sequence element passed in, this time of shape `(batch_size, hidden_size)`. This ndarray will represent the layer’s **accumulated information** about the data that has been passed in during the prior time steps.

3. Pass these two ndarrays forward through the first time step in this layer. We'll end up designing `RNNLayer` to output `ndarrays` of different dimensionality than the inputs, just like regular Dense layers can, so the output will be of shape `(batch_size, num_outputs)`. In addition, update the neural network’s representation for **each observation**: at **each time step**, our `RNNLayer` should also output an ndarray of shape `(batch_size, hidden_size)`.

4. Select the next two-dimensional array from data: `data[:, 1, :]`.

5. Pass this data, as well as the values of the `RNN`’s representations outputted at the first time step, into the second time step at this layer to get another output of shape `(batch_size, num_outputs)`, as well as updated representations of shape `(batch_size, hidden_size)`.

6. Continue until all `sequence_length` time steps have been passed through the layer. Then concatenate all the results together to get an output from that layer of shape `(batch_size, sequence_length, num_outputs)`.
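The six steps above can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the final `RNNLayer`: the weight names `W_xh`, `W_hh`, and `W_hy`, the zero initial hidden state, and the `tanh` update are all choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, sequence_length, num_features = 2, 5, 3
hidden_size, num_outputs = 4, 6

data = rng.normal(size=(batch_size, sequence_length, num_features))

# Hypothetical weights for one layer (names are illustrative only).
W_xh = rng.normal(scale=0.1, size=(num_features, hidden_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_hy = rng.normal(scale=0.1, size=(hidden_size, num_outputs))

hidden = np.zeros((batch_size, hidden_size))      # step 2: initial hidden state
outputs = []
for t in range(sequence_length):
    x_t = data[:, t, :]                           # steps 1 and 4: (batch_size, num_features)
    hidden = np.tanh(x_t @ W_xh + hidden @ W_hh)  # steps 3 and 5: updated hidden state
    outputs.append(hidden @ W_hy)                 # output: (batch_size, num_outputs)

layer_output = np.stack(outputs, axis=1)          # step 6: join along the time axis
print(layer_output.shape)  # (2, 5, 6)
```

Using `np.stack(..., axis=1)` puts the time steps back on the second axis, so the output lines up with the input's `(batch_size, sequence_length, ...)` layout.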

This gives us an idea of how our `RNNLayers` should work—and we'll solidify this understanding when we code it up—but it also hints that we'll need another class to handle receiving the data and updating the layer's hidden state at each time step. For this we'll use the `RNNNode`, the next class we’ll cover.

### The Second Class for RNNs: RNNNode

Based on the description from the prior section, an `RNNNode` should have a `forward` method with the following inputs and outputs:

Two ndarrays as inputs:

- One for the data inputs to the network, of shape `[batch_size, num_features]`

- One for the representations of the observations at that time step, of shape `[batch_size, hidden_size]`

Two ndarrays as outputs:

- One for the outputs of the network at that time step, of shape `[batch_size, num_outputs]`

- One for the updated representations of the observations at that time step, of shape `[batch_size, hidden_size]`
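A sketch of a node with this interface might look as follows. The internals are assumptions made for illustration: the parameter dictionary and its keys (`W_h`, `b_h`, `W_y`, `b_y`), the concatenation of input and hidden state, and the `tanh` activation are not prescribed by the description above.

```python
import numpy as np

class RNNNode:
    """One time step of a recurrent layer (illustrative sketch)."""

    def forward(self, x_in, h_in, params):
        """
        x_in:   (batch_size, num_features)  data at this time step
        h_in:   (batch_size, hidden_size)   representations from the prior step
        params: dict of weights shared by every node in the layer
        Returns (x_out, h_out) with shapes
        (batch_size, num_outputs) and (batch_size, hidden_size).
        """
        z = np.concatenate([x_in, h_in], axis=1)        # (batch, features + hidden)
        h_out = np.tanh(z @ params["W_h"] + params["b_h"])
        x_out = h_out @ params["W_y"] + params["b_y"]
        return x_out, h_out

rng = np.random.default_rng(0)
batch_size, num_features, hidden_size, num_outputs = 2, 3, 4, 6
params = {
    "W_h": rng.normal(scale=0.1, size=(num_features + hidden_size, hidden_size)),
    "b_h": np.zeros((1, hidden_size)),
    "W_y": rng.normal(scale=0.1, size=(hidden_size, num_outputs)),
    "b_y": np.zeros((1, num_outputs)),
}

node = RNNNode()
x_out, h_out = node.forward(
    rng.normal(size=(batch_size, num_features)),
    np.zeros((batch_size, hidden_size)),
    params,
)
print(x_out.shape, h_out.shape)  # (2, 6) (2, 4)
```

Passing the weights in as an argument, rather than storing them on the node, reflects the fact that all time steps of a layer share one set of weights.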

Next, we’ll show how the two classes, RNNNode and RNNLayer, fit together.

### Putting These Two Classes Together

The RNNLayer class will wrap around a list of `RNNNodes` and will (at least) contain a `forward` method that has the following inputs and outputs:

- Input: a batch of sequences of observations of shape `[batch_size, sequence_length, num_features]`

- Output: the neural network output of those sequences of shape `[batch_size, sequence_length, num_outputs]`

The figure below shows the order that data would move forward through an `RNN` with two `RNNLayers` with five `RNNNodes` each. At each time step, inputs initially of dimension `feature_size` are passed successively forward through the corresponding `RNNNode` in each `RNNLayer`, with the network ultimately outputting a prediction at that time step of dimension `output_size`. In addition, each `RNNNode` passes a **hidden state** forward to the next `RNNNode` within each layer. Once data from each of the five time steps has been passed forward through all the layers, we will have a final set of predictions of shape `(5, output_size)` for each sequence, where `output_size` should be the same dimension as the targets. These predictions would then be compared to the target, and the loss gradient would be computed, kicking off the backward pass.

<img src = "figures/rnn3.png" width=600>

Alternatively, data could flow through the RNN in the order shown below. Whatever the order, the following must occur:

- Each layer needs to process its data at a given time step before the next layer does: in the figure above, for example, 2 can’t happen before 1, and 4 can’t happen before 3.

- Similarly, each layer has to process all of its time steps in order: in the figure above, for example, 4 can’t happen before 2, and 3 can’t happen before 1.

- The last layer has to produce an output of dimension `output_size` for each observation, matching the dimension of the targets.

<img src = "figures/rnn4.png" width=600>
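The two orderings can be compared directly in a small sketch. Everything here is a simplifying assumption for illustration (the helper names `make_params`, `step`, `time_major`, and `layer_major`, the `tanh` update, and the zero initial hidden states); the point is only that both orders respect the constraints listed above and therefore produce the same output.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, sequence_length, feature_size = 2, 5, 3
hidden_size, output_size = 4, 6

def make_params(in_size, out_size):
    """Hypothetical weights for one recurrent layer."""
    return {
        "W_x": rng.normal(scale=0.3, size=(in_size, hidden_size)),
        "W_h": rng.normal(scale=0.3, size=(hidden_size, hidden_size)),
        "W_y": rng.normal(scale=0.3, size=(hidden_size, out_size)),
    }

def step(x_t, h, p):
    """One time step of one layer: returns (output, new hidden state)."""
    h_new = np.tanh(x_t @ p["W_x"] + h @ p["W_h"])
    return h_new @ p["W_y"], h_new

# Layer 1 maps feature_size -> hidden_size, layer 2 maps hidden_size -> output_size.
layers = [make_params(feature_size, hidden_size), make_params(hidden_size, output_size)]
data = rng.normal(size=(batch_size, sequence_length, feature_size))

def time_major(data):
    """At each time step, pass through every layer before moving on in time."""
    hiddens = [np.zeros((batch_size, hidden_size)) for _ in layers]
    outputs = []
    for t in range(sequence_length):
        x = data[:, t, :]
        for i, p in enumerate(layers):
            x, hiddens[i] = step(x, hiddens[i], p)
        outputs.append(x)
    return np.stack(outputs, axis=1)

def layer_major(data):
    """Run one layer over all time steps, then feed its outputs to the next layer."""
    x_seq = data
    for p in layers:
        h = np.zeros((batch_size, hidden_size))
        outs = []
        for t in range(sequence_length):
            out, h = step(x_seq[:, t, :], h, p)
            outs.append(out)
        x_seq = np.stack(outs, axis=1)
    return x_seq

a, b = time_major(data), layer_major(data)
print(a.shape, np.allclose(a, b))  # (2, 5, 6) True
```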

### The Backward Pass

Backpropagation through recurrent neural networks is often described as a separate algorithm called **backpropagation through time.** While this does indeed describe what happens during backpropagation, it makes things sound a lot more complicated than they are. Keeping in mind the explanation of how data flows forward through an RNN, we can describe what happens on the backward pass this way: we pass data backward through the `RNN` by passing gradients backward through the network in reverse of the order that we passed inputs forward on the forward pass—which, indeed, is the same thing we do in regular feed-forward networks.

Looking at the figures above, the forward pass works as follows:

1. We start with a batch of observations, each of shape `(feature_size, sequence_length)`.

2. These inputs are broken up into the individual `sequence_length` elements and passed into the network one at a time.

3. Each element gets passed through all the layers, ultimately getting transformed into an output of size `output_size`.

4. At the same time, the layer passes the hidden state forward into the layer’s computation at the next time step.

5. This continues for all `sequence_length` time steps, resulting in a total output of size `(output_size, sequence_length)`.

Backpropagation simply works the same way, but in reverse:

1. We start with a gradient of shape `[output_size, sequence_length]`, representing how much each element of the output (also of size `[output_size, sequence_length]`) ultimately impacts the loss computed for that batch of observations.

2. These gradients are broken up into the individual `sequence_length` elements and passed backward through the layers in reverse order.

3. The gradient for an individual element is passed backward through all the layers.

4. At the same time, the layers pass the gradient of the loss with respect to the hidden state at that time step backward into the layers’ computations at the prior time steps.

5. This continues for all `sequence_length` time steps, until the gradients have been passed backward to every layer in the network, thus allowing us to compute the gradient of the loss with respect to each of the weights, just as we do in the case of regular feed-forward networks.

This parallelism between the backward and forward pass is highlighted in the figure below, which shows how data flows through an RNN during the backward pass. You’ll notice, of course, that it is very similar to the figures above but with reversed arrows.

<img src = "figures/rnn5.png" width=600>

This highlights that, at a high level, the forward and backward passes for an RNNLayer are very similar to those of a layer in a normal neural network: they both receive `ndarrays` of a certain shape as input, output `ndarrays` of another shape, and on the backward pass receive an **output gradient of the same shape as their output and produce an input gradient of the same shape as their input**. There is a key difference in the way the weight gradients are handled in `RNNLayers` versus other layers, however, so we’ll briefly cover that before we shift to coding this all up.

### Accumulating gradients for the weights in an RNN

In recurrent neural networks, just as in regular neural networks, each layer will have one set of weights. That means that the same set of weights will affect the layer's output at all `sequence_length` time steps; during backpropagation, therefore, the same set of weights will receive `sequence_length` different gradients. For example, in the circle labeled *1* in the backward-pass figure above, the second layer will receive a gradient for the last time step, while in the circle labeled *3*, the layer will receive a gradient for the second-to-last time step; both of these gradients belong to the same set of weights. Thus, during backpropagation, we'll have to accumulate gradients for the weights over a series of time steps, which means that however we choose to store the weights, we'll have to update their gradients using something like the following:

`weight_grad += grad_from_time_step`

This is different from the `Dense` and `Conv2D` layers, in which each backward pass produced a single gradient per parameter, which we simply stored in `param_grad`.
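The following sketch shows this accumulation inside a backward pass for a single simplified recurrent layer, and checks one accumulated gradient against a finite-difference estimate. The layer itself (a `tanh` hidden update followed by a linear output), the weight names, and the stand-in loss `sum(output * output_grad)` are assumptions chosen to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, sequence_length, num_features = 2, 5, 3
hidden_size, num_outputs = 4, 6

params = {
    "W_x": rng.normal(scale=0.3, size=(num_features, hidden_size)),
    "W_h": rng.normal(scale=0.3, size=(hidden_size, hidden_size)),
    "W_y": rng.normal(scale=0.3, size=(hidden_size, num_outputs)),
}
data = rng.normal(size=(batch_size, sequence_length, num_features))

def forward(data, params):
    """Return the layer output and the hidden state stored at every time step."""
    h = np.zeros((batch_size, hidden_size))
    hiddens, outputs = [], []
    for t in range(sequence_length):
        h = np.tanh(data[:, t, :] @ params["W_x"] + h @ params["W_h"])
        hiddens.append(h)
        outputs.append(h @ params["W_y"])
    return np.stack(outputs, axis=1), hiddens

def backward(data, hiddens, output_grad, params):
    """Walk the time steps in reverse, accumulating one gradient per weight."""
    weight_grads = {name: np.zeros_like(w) for name, w in params.items()}
    input_grad = np.zeros_like(data)
    h_grad = np.zeros((batch_size, hidden_size))   # gradient arriving from later steps
    for t in reversed(range(sequence_length)):
        y_grad = output_grad[:, t, :]
        h_prev = hiddens[t - 1] if t > 0 else np.zeros_like(hiddens[0])
        # Gradient w.r.t. h_t comes from this step's output and from step t + 1.
        total_h_grad = y_grad @ params["W_y"].T + h_grad
        z_grad = total_h_grad * (1.0 - hiddens[t] ** 2)    # back through tanh
        weight_grads["W_y"] += hiddens[t].T @ y_grad       # weight_grad += grad_from_time_step
        weight_grads["W_x"] += data[:, t, :].T @ z_grad
        weight_grads["W_h"] += h_prev.T @ z_grad
        input_grad[:, t, :] = z_grad @ params["W_x"].T
        h_grad = z_grad @ params["W_h"].T                  # handed to the prior time step
    return input_grad, weight_grads

output, hiddens = forward(data, params)
output_grad = rng.normal(size=output.shape)   # stand-in for the loss gradient
input_grad, weight_grads = backward(data, hiddens, output_grad, params)

# Finite-difference check on one entry of W_h, using loss = sum(output * output_grad).
eps = 1e-6
def loss_with(delta):
    shifted = {k: v.copy() for k, v in params.items()}
    shifted["W_h"][0, 0] += delta
    return np.sum(forward(data, shifted)[0] * output_grad)

numeric = (loss_with(eps) - loss_with(-eps)) / (2 * eps)
print(input_grad.shape == data.shape)  # True
print(weight_grads["W_h"][0, 0], numeric)
```

Note that the input gradient has the same shape as the input, while each weight gradient is a sum of `sequence_length` per-step contributions.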

We’ve laid out how `RNNs` work and the classes we want to build to implement them; now let’s start figuring out the details.