# Programming for Data Science and Artificial Intelligence

## 17. Recurrent Neural Networks from Scratch

- [WEIDMAN] Ch6
- [CHARU] Ch7

Recurrent neural networks (RNNs) are designed to handle data that arrives in sequences: instead of each batch having shape **(batch_size, features)**, we now add an additional dimension for **time steps**.

<img src = "figures/rnn.png" width=300>
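To make the extra dimension concrete, here is a minimal NumPy sketch. The sizes are made up for illustration, and the axis order **(batch_size, time_steps, features)** is an assumption; some libraries put the time axis first.

```python
import numpy as np

# Illustrative sizes only; none of these numbers come from the text.
batch_size, time_steps, features = 4, 5, 3

# Non-sequential data: one feature vector per observation.
tabular_batch = np.zeros((batch_size, features))

# Sequential data: one feature vector per observation per time step.
# Assumed axis order: (batch_size, time_steps, features).
sequence_batch = np.zeros((batch_size, time_steps, features))

# Slicing out a single time step recovers the familiar 2D shape.
first_step = sequence_batch[:, 0, :]

print(tabular_batch.shape, sequence_batch.shape, first_step.shape)
```

Each slice along the time axis looks exactly like the batches used by the networks from earlier chapters, which is why an RNN can process a sequence one step at a time.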

A standard feed-forward network could handle such data by treating each time step as a separate observation. For example, one observation could have the features from time t = 1 with the value of the target from time t = 2, the next observation could have the features from time t = 2 with the value of the target from time t = 3, and so on. If we wanted to use data from multiple time steps to make each prediction rather than data from just one time step, we could use the features from t = 1 and t = 2 to predict the target at t = 3, the features from t = 2 and t = 3 to predict the target at t = 4, and so on.
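The pairing described above can be sketched as a sliding window over a toy series. The helper `make_windows` and the toy data are hypothetical, introduced only for illustration; they are not taken from the referenced chapters.

```python
import numpy as np

def make_windows(features, targets, window):
    """Pair `window` consecutive feature rows with the target one step later."""
    X, y = [], []
    for start in range(len(features) - window):
        X.append(features[start:start + window])
        y.append(targets[start + window])
    return np.array(X), np.array(y)

# Toy series: 0-based index i holds the values for time t = i + 1.
features = np.arange(1, 7).reshape(-1, 1)   # the feature at time t is just t
targets = np.arange(1, 7) * 10              # the target at time t is 10 * t

# One time step per prediction: features at t = 1 -> target at t = 2, ...
X1, y1 = make_windows(features, targets, window=1)

# Two time steps per prediction: features at t = 1, 2 -> target at t = 3, ...
X2, y2 = make_windows(features, targets, window=2)

print(X1.shape, X2.shape)
```

Each window is built without reference to any other window, so a network trained on `X2` and `y2` has no way of carrying information from one window to the next.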

However, **treating each time step as independent ignores the fact that the data is ordered sequentially**. How would we ideally want to use the sequential nature of the data to make better predictions? The solution would look something like this:

1. Use features from time step t = 1 to make predictions for the corresponding target at t = 1.

2. Use features from time step t = 2 as well as the information from t = 1, including the value of the target at t = 1, to make predictions for t = 2.

3. Use features from time step t = 3 as well as the accumulated information from t = 1 and t = 2 to make predictions at t = 3.

4. And so on, at each step using the information from all prior time steps to make a prediction.

To do this, it seems that we'd want to pass our data through the neural network one sequence element at a time, with the data from the first time step being passed through first, then the data from the next time step, and so on. In addition, we’ll want our neural network to **accumulate information** about what it has seen before as the new sequence elements are passed through. 

More concretely, consider the following steps and the figure that follows them:

1. In the first time step, t = 1, we would pass through the observation from the first time step (along with randomly initialized representations, perhaps). We would output a prediction for t = 1, along with representations at each layer.

2. In the next time step, we would pass through the observation from the second time step, t = 2, **along with the representations computed during the first time step** (which, again, are just the outputs of the neural network’s layers), and combine these somehow (it is in this combining step that the variants of RNNs we'll learn about differ). We would use these two pieces of information to output a prediction for t = 2 as well as the updated representations at each layer, which are now a function of the inputs passed in at both t = 1 and t = 2.

3. In the third time step, we would pass through the observation from t = 3, as well as the representations that now incorporate the information from t = 1 and t = 2, and use this information to make predictions for t = 3, as well as additional updated representations at each layer, which now incorporate information from time steps 1–3.

This process is depicted in this figure:

<img src = "figures/rnn2.png" width=500>
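The three steps above can be sketched as a loop over time steps. This is only a minimal illustration under stated assumptions: a single recurrent layer, a `tanh` of a weighted sum as the combining step (the "vanilla" RNN choice, one of several variants), a zero initial representation rather than a random one, and made-up sizes and parameter names (`W_xh`, `W_hh`, `W_hy`).

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, time_steps, n_features, n_hidden, n_out = 2, 3, 4, 5, 1

# Hypothetical parameters for a single recurrent layer.
W_xh = rng.normal(scale=0.1, size=(n_features, n_hidden))  # input -> hidden
W_hh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))    # previous hidden -> hidden
b_h = np.zeros(n_hidden)
W_hy = rng.normal(scale=0.1, size=(n_hidden, n_out))       # hidden -> prediction
b_y = np.zeros(n_out)

def rnn_forward(x, h0):
    """Run a batch of sequences through the layer one time step at a time."""
    h = h0
    predictions = []
    for t in range(x.shape[1]):
        # Combine the current observation with the representation carried
        # over from the previous step (assumed combining rule: tanh).
        h = np.tanh(x[:, t, :] @ W_xh + h @ W_hh + b_h)
        # Output a prediction for this time step from the updated representation.
        predictions.append(h @ W_hy + b_y)
    return np.stack(predictions, axis=1), h

x = rng.normal(size=(batch_size, time_steps, n_features))
h0 = np.zeros((batch_size, n_hidden))  # assumed zero start instead of random
preds, h_last = rnn_forward(x, h0)

print(preds.shape, h_last.shape)
```

Because `h` is reused across iterations, the prediction at t = 3 depends on the inputs from t = 1 and t = 2, whereas the prediction at t = 1 is unaffected by later inputs.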
