# Recurrent neural network (RNN) basics

This `09_rnn_basics` notebook focuses on recurrent neural networks (RNNs), a key architecture for processing sequential data in machine learning. RNNs are widely used for tasks such as language modeling, time-series prediction, and more. 

The notebook covers building and training a simple RNN model, evaluating its performance, and visualizing its predictions. It also dives into important techniques for handling long sequences and padding, which are essential for dealing with variable-length input data.

## Table of contents

1. [Understanding RNNs](#understanding-rnns)
2. [Setting up the environment](#setting-up-the-environment)
3. [Building a simple RNN model](#building-a-simple-rnn-model)
4. [Training the RNN model](#training-the-rnn-model)
5. [Evaluating the RNN model](#evaluating-the-rnn-model)
6. [Visualizing model predictions](#visualizing-model-predictions)
7. [Handling long sequences and padding](#handling-long-sequences-and-padding)

## Understanding RNNs

Recurrent Neural Networks (RNNs) are a class of neural networks specifically designed to handle sequential data. Unlike traditional feedforward networks, RNNs have a recurrent connection that allows information to persist over time steps, making them well-suited for tasks involving sequences, such as time series prediction, natural language processing (NLP), and speech recognition.

### **Why RNNs?**

RNNs excel in tasks where context and the order of inputs are essential. For instance, in natural language processing, understanding the meaning of a word often depends on the words that precede it in a sentence. Standard neural networks treat inputs independently, which means they lack the ability to capture these sequential relationships. RNNs address this issue by maintaining a hidden state that evolves as it processes each element of the sequence, thereby retaining information about previous inputs.

### **Key concepts of RNNs**

#### **Sequential data and time steps**

In RNNs, the input data typically comes in sequences. Each item in the sequence is referred to as a time step. For example, in a sentence, each word would be one time step. This structure allows RNNs to process one element at a time while maintaining a memory of previous elements, which is essential for understanding context.

#### **Hidden state and recurrence**

The defining feature of RNNs is their ability to maintain a hidden state that evolves over time. At each time step, the hidden state is updated based on both the current input and the hidden state from the previous time step. This recurrence lets an RNN carry information across time steps. In principle it can therefore capture long-term dependencies in sequences, although in practice standard RNNs struggle to learn them because of the vanishing gradient problem described below.

#### **Output of the RNN**

At each time step, the RNN generates an output based on the hidden state. The output can be produced at every time step (for tasks like sequence prediction) or only after processing the entire sequence (for tasks like sentiment analysis or sequence classification). The flexibility of producing outputs either at each time step or at the end of the sequence makes RNNs adaptable to a variety of tasks.

#### **The vanishing gradient problem**

One of the challenges RNNs face is the **vanishing gradient problem**. When training RNNs using backpropagation through time (BPTT), the gradients of the loss with respect to the network’s weights can become very small as they are propagated backward through many time steps. This makes it difficult for the network to learn long-range dependencies, as information from earlier time steps may be lost. This issue is particularly problematic for standard RNNs in tasks that require understanding of long-term context.

### **Variants of RNNs**

Several advanced architectures have been developed to address the limitations of standard RNNs, particularly the vanishing gradient problem:

#### **Long Short-Term Memory (LSTM)**

LSTMs are a specialized type of RNN designed to capture long-term dependencies in sequences. They achieve this through a gating mechanism that controls the flow of information within each LSTM cell. LSTMs have three key gates:

- **Forget gate**: Decides which information from the previous cell state should be discarded.
- **Input gate**: Determines which new information should be added to the cell state.
- **Output gate**: Controls what part of the cell state should be output at the current time step.

This gated structure allows LSTMs to preserve information across long sequences. It substantially mitigates the vanishing gradient problem (without eliminating it entirely) and makes LSTMs effective for tasks like language modeling, translation, and time series forecasting.

#### **Gated Recurrent Unit (GRU)**

GRUs are a simplified version of LSTMs. They combine the forget and input gates into a single gate and use fewer parameters, making them less complex but still powerful. GRUs often perform similarly to LSTMs, especially in tasks where sequences are relatively short. Their simpler structure makes them more computationally efficient, which can be advantageous in certain applications.

### **Applications of RNNs**

RNNs are well-suited for tasks that involve sequential or time-dependent data. Some common applications include:

- **Time series forecasting**: RNNs can predict future values in a time series by learning patterns from historical data, making them useful for applications such as stock market prediction, weather forecasting, and anomaly detection.
- **Language modeling and text generation**: In NLP tasks, RNNs can be used to predict the next word in a sentence or generate text based on a given input, making them valuable for applications like machine translation and chatbot development.
- **Speech recognition**: RNNs process sequences of audio signals, making them ideal for converting spoken language into text, as used in virtual assistants and voice-controlled systems.
- **Video analysis**: RNNs can process sequences of video frames to identify actions or events in video data, which is useful for tasks like video classification and activity recognition.

### **Maths**

#### **RNN architecture and hidden state**

At the core of an RNN is the idea of maintaining a hidden state, which evolves over time as the network processes each input in a sequence. The hidden state at time step $ t $, denoted $ h_t $, is a function of both the input at time step $ t $, $ x_t $, and the hidden state from the previous time step, $ h_{t-1} $. This recurrence is what gives RNNs the ability to retain memory of previous inputs.

Mathematically, the hidden state is computed as follows:

$$
h_t = f(W_{xh} x_t + W_{hh} h_{t-1} + b_h)
$$

Where:
- $ h_t $ is the hidden state at time step $ t $,
- $ x_t $ is the input at time step $ t $,
- $ W_{xh} $ is the weight matrix for the input,
- $ W_{hh} $ is the weight matrix for the hidden state,
- $ b_h $ is the bias term for the hidden state,
- $ f $ is a non-linear activation function, typically **tanh** or **ReLU**.

This equation shows how the hidden state is influenced by both the current input $ x_t $ and the hidden state from the previous time step $ h_{t-1} $. The recursive nature of the hidden state allows the RNN to capture dependencies between elements in the sequence.
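
As a minimal sketch of this recurrence, the loop below applies the update equation by hand and compares the result with PyTorch's `nn.RNN`. The sizes and random input are illustrative assumptions. Note that PyTorch names $ W_{xh} $ `weight_ih_l0`, names $ W_{hh} $ `weight_hh_l0`, and keeps two bias vectors whose sum plays the role of $ b_h $.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumptions, not taken from the notebook).
input_size, hidden_size, seq_len = 3, 4, 5

rnn = nn.RNN(input_size, hidden_size, nonlinearity="tanh", batch_first=True)
x = torch.randn(1, seq_len, input_size)  # (batch, time, features)

with torch.no_grad():
    W_xh, W_hh = rnn.weight_ih_l0, rnn.weight_hh_l0
    b_h = rnn.bias_ih_l0 + rnn.bias_hh_l0

    h = torch.zeros(hidden_size)  # h_0
    manual_states = []
    for t in range(seq_len):
        # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
        h = torch.tanh(W_xh @ x[0, t] + W_hh @ h + b_h)
        manual_states.append(h)
    manual_states = torch.stack(manual_states)  # (time, hidden)

    # Reference: the built-in layer, which also starts from a zero hidden state.
    reference, _ = rnn(x)

print(torch.allclose(manual_states, reference[0], atol=1e-5))
```

The same weights are reused at every iteration of the loop, which is the weight sharing that the unrolled view of the network makes explicit.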

#### **Output of the RNN**

The output of an RNN at each time step, $ o_t $, is computed using the hidden state $ h_t $. Typically, the output is a function of the hidden state and an output weight matrix $ W_{ho} $:

$$
o_t = g(W_{ho} h_t + b_o)
$$

Where:
- $ o_t $ is the output at time step $ t $,
- $ W_{ho} $ is the weight matrix for the output,
- $ b_o $ is the bias term for the output,
- $ g $ is an activation function, such as a softmax function for classification tasks.

The output can be computed at every time step or only after processing the entire sequence, depending on the specific task (e.g., sequence generation, classification).

#### **Unrolling through time**

To train an RNN, the network is "unrolled" through time. In this unrolled version, each time step is treated as a separate layer, with the same weights $ W_{xh} $ and $ W_{hh} $ being shared across all time steps. For a sequence of length $ T $, the RNN would be unrolled into $ T $ layers, each corresponding to one time step.

This unrolling enables us to apply the backpropagation algorithm to compute the gradients of the loss with respect to the weights.

#### **Backpropagation through time (BPTT)**

The training process for RNNs is performed using **backpropagation through time (BPTT)**, a variant of backpropagation adapted for sequential data. BPTT computes the gradients of the loss function with respect to the weights by considering the entire sequence. Given a loss function $ L $, the total loss over a sequence is the sum of the losses at each time step:

$$
L_{\text{total}} = \sum_{t=1}^{T} L_t
$$

Where:
- $ L_t $ is the loss at time step $ t $,
- $ T $ is the total number of time steps.

The gradient of the loss with respect to the weights is computed using the chain rule. For example, for the weight matrix $ W_{xh} $, the gradient is:

$$
\frac{\partial L_{\text{total}}}{\partial W_{xh}} = \sum_{t=1}^{T} \frac{\partial L_t}{\partial o_t} \cdot \frac{\partial o_t}{\partial h_t} \cdot \frac{\partial h_t}{\partial W_{xh}}
$$

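Backpropagation through time calculates the gradient contribution of each time step and sums the contributions across all time steps. Note that $ \frac{\partial h_t}{\partial W_{xh}} $ is a total derivative: $ h_t $ depends on $ W_{xh} $ both directly and through every earlier hidden state $ h_{t-1}, h_{t-2}, \dots $, so each term in the sum expands into a chain of products that reaches back to the start of the sequence. The challenge with BPTT, particularly in long sequences, is the **vanishing gradient problem**, where the gradients become very small as they are propagated backward through many time steps. This makes it difficult for the network to learn long-range dependencies.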

#### **Vanishing gradient problem**

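The vanishing gradient problem arises because the same recurrent weight matrix $ W_{hh} $ is applied at every time step. Backpropagating through $ k $ time steps multiplies the gradient by $ k $ factors, each combining $ W_{hh} $ with the derivative of the activation function. The derivative of **tanh** is at most 1 and the derivative of **sigmoid** is at most 0.25, so unless the weights are large these factors shrink the gradient, which then decays roughly exponentially with $ k $. This makes it difficult for the RNN to learn from inputs far back in the sequence. When the weights are large, the opposite can happen and the gradients grow exponentially (the **exploding gradient problem**).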

Mathematically, the vanishing gradient problem occurs because the derivative of the hidden state at time step $ t $ with respect to the hidden state at time step $ t-k $ is:

$$
\frac{\partial h_t}{\partial h_{t-k}} = \prod_{i=1}^{k} \frac{\partial h_{t-i+1}}{\partial h_{t-i}}
$$

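For the recurrence above, each factor in this product is $ \mathrm{diag}(f'(a_j)) \, W_{hh} $, where $ a_j = W_{xh} x_j + W_{hh} h_{j-1} + b_h $ is the pre-activation at step $ j $. If these factors have norm below 1, as is typical with saturating activations like **tanh** or **sigmoid** and moderately sized weights, the product becomes very small as $ k $ increases, leading to vanishing gradients.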
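
The decay can be illustrated with a scalar RNN in plain Python. This is a sketch under stated assumptions: the recurrent weight `w_hh = 0.5`, the input weight, and the constant input are hypothetical values chosen so that every factor in the product is at most 0.5.

```python
import math

w_hh = 0.5               # scalar stand-in for W_hh (hypothetical value)
w_xh = 1.0               # scalar stand-in for W_xh (hypothetical value)
steps = 50
inputs = [0.1] * steps   # hypothetical constant input sequence

h = 0.0
factors = []
for x in inputs:
    h = math.tanh(w_xh * x + w_hh * h)
    # For tanh, d h_t / d h_{t-1} = w_hh * (1 - h_t^2).
    factors.append(w_hh * (1.0 - h * h))

# Multiply factors from the most recent step backward:
# after k factors the running product equals d h_T / d h_{T-k}.
grad = 1.0
grad_by_distance = []
for factor in reversed(factors):
    grad *= factor
    grad_by_distance.append(grad)

for k in (1, 5, 10, 25, 50):
    print(f"k={k:2d}  dh_T/dh_(T-k) = {grad_by_distance[k - 1]:.3e}")
```

Because each factor is bounded by `w_hh`, the gradient with respect to a state `k` steps back is bounded by `w_hh ** k`. Setting `w_hh` well above 1 in the same script can produce growth instead of decay as long as the tanh does not saturate.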

#### **Gradient clipping**

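**Gradient clipping** addresses the opposite failure mode, **exploding gradients**, in which repeated multiplication by large factors makes the gradients grow uncontrollably. Gradient clipping sets a threshold on the magnitude (typically the norm) of the gradients. If the gradient norm exceeds this threshold, the gradients are rescaled so that the norm equals the threshold; smaller gradients are left unchanged. This helps stabilize the training of RNNs, particularly when dealing with long sequences. Clipping does not help with vanishing gradients, which are instead addressed by gated architectures such as LSTMs and GRUs.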

#### **Long Short-Term Memory (LSTM)** and **Gated Recurrent Unit (GRU)**

LSTMs and GRUs are popular variants of RNNs that address the vanishing gradient problem by introducing gating mechanisms that control the flow of information through the network.

##### **LSTM**

In an LSTM, the hidden state is complemented by a separate **cell state** that can carry information across long sequences with only small, gate-controlled changes. Three gates govern this process: the **forget gate** and **input gate** update the cell state, and the **output gate** determines how much of the cell state is exposed as the hidden state at each time step.

The LSTM architecture mitigates the vanishing gradient problem by allowing gradients to flow more easily through the cell state, making it easier to learn long-range dependencies.

##### **GRU**

GRUs are a simpler alternative to LSTMs, combining the forget and input gates into a single **update gate**. GRUs retain many of the benefits of LSTMs, such as the ability to capture long-term dependencies, but with a more streamlined architecture and fewer parameters.
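
The parameter difference can be checked directly in PyTorch. The sketch below counts the trainable parameters of single-layer `nn.RNN`, `nn.GRU`, and `nn.LSTM` modules; the sizes are arbitrary assumptions. In PyTorch's implementation a GRU holds three sets of RNN-sized weights and an LSTM holds four.

```python
import torch.nn as nn


def count_parameters(module: nn.Module) -> int:
    """Total number of trainable parameters in a module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


input_size, hidden_size = 10, 20  # illustrative sizes

rnn_params = count_parameters(nn.RNN(input_size, hidden_size))
gru_params = count_parameters(nn.GRU(input_size, hidden_size))
lstm_params = count_parameters(nn.LSTM(input_size, hidden_size))

print(f"RNN:  {rnn_params}")
print(f"GRU:  {gru_params}")
print(f"LSTM: {lstm_params}")
```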

## Setting up the environment

##### **Q1: How do you install the necessary libraries for building and training RNNs in PyTorch?**


##### **Q2: How do you import the required modules for working with RNNs in PyTorch?**


##### **Q3: How do you set up your environment to use a GPU if available, or fallback to a CPU in PyTorch?**
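
One common pattern, shown here as a minimal sketch, is to pick the device once and move both the model and the tensors to it. The tensor `x` is only a placeholder.

```python
import torch

# Use the GPU when CUDA is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.zeros(2, 3).to(device)  # placeholder tensor moved to the chosen device
print(f"Using device: {device}")
```

Models are moved the same way with `model.to(device)`. Inputs and model parameters must live on the same device before the forward pass.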


##### **Q4: How do you check the version of PyTorch installed in your environment?**
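
The installed version is exposed as a string attribute of the package:

```python
import torch

print(torch.__version__)
```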

## Building a simple RNN model

##### **Q5: How do you define an RNN model using PyTorch’s `nn.RNN` module?**
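
A possible starting point is to wrap `nn.RNN` in an `nn.Module` and add a linear layer on top. The class name `SimpleRNN`, the classification head, and all sizes below are assumptions made for illustration, not the notebook's reference solution.

```python
import torch
import torch.nn as nn


class SimpleRNN(nn.Module):
    """Hypothetical sequence classifier: nn.RNN followed by a linear layer."""

    def __init__(self, input_size, hidden_size, num_layers, num_classes):
        super().__init__()
        # input_size: features per time step
        # hidden_size: dimension of the hidden state h_t
        # num_layers: number of stacked recurrent layers
        self.rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # x: (batch, seq_len, input_size)
        out, _ = self.rnn(x)             # out: (batch, seq_len, hidden_size)
        return self.fc(out[:, -1, :])    # classify from the last time step


model = SimpleRNN(input_size=8, hidden_size=16, num_layers=2, num_classes=3)
logits = model(torch.randn(4, 10, 8))
print(logits.shape)
```

With `batch_first=True` the input is laid out as `(batch, seq_len, input_size)`; without it, `nn.RNN` expects `(seq_len, batch, input_size)`.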


##### **Q6: How do you specify the input size, hidden size, and number of layers when building an RNN in PyTorch?**


##### **Q7: How do you initialize the hidden state for an RNN in PyTorch before starting the forward pass?**
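
A minimal sketch, assuming a zero initial state: the hidden state tensor has shape `(num_layers, batch, hidden_size)` regardless of `batch_first`. The sizes are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch_size, seq_len = 4, 10
input_size, hidden_size, num_layers = 8, 16, 2

rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True)
x = torch.randn(batch_size, seq_len, input_size)

# h0 is (num_layers, batch, hidden_size), even when batch_first=True.
h0 = torch.zeros(num_layers, batch_size, hidden_size, device=x.device)

out, h_n = rnn(x, h0)
out_default, _ = rnn(x)  # omitting h0 also starts from zeros

print(out.shape, h_n.shape)
```

Passing `h0` explicitly is mainly useful when the state should be carried over between calls, for example when a long sequence is processed in chunks.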


##### **Q8: How do you implement a forward pass through the RNN model in PyTorch?**


##### **Q9: How do you retrieve the final hidden state output by the RNN model in PyTorch?**
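
`nn.RNN` returns a pair: the per-step outputs and the final hidden state of every layer. The sketch below (with illustrative sizes) picks out the last layer's final state and checks that, for unpadded unidirectional input, it matches the output at the last time step.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

rnn = nn.RNN(input_size=8, hidden_size=16, num_layers=2, batch_first=True)
x = torch.randn(4, 10, 8)

out, h_n = rnn(x)
# out: (batch, seq_len, hidden_size), top-layer hidden state at every step
# h_n: (num_layers, batch, hidden_size), final hidden state of every layer

final_hidden = h_n[-1]  # last layer, shape (batch, hidden_size)
print(torch.allclose(final_hidden, out[:, -1, :]))
```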

## Training the RNN model

##### **Q10: How do you define the loss function for training an RNN model on a sequence classification task in PyTorch?**


##### **Q11: How do you choose and configure an optimizer for training an RNN in PyTorch?**


##### **Q12: How do you implement a training loop for the RNN model that includes forward pass, loss computation, and backpropagation in PyTorch?**
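
The sketch below puts the loss function, optimizer, and training loop together on synthetic data. Everything specific here is an assumption for illustration: the `SimpleRNN` class, the toy task (label 1 when a sequence sums to a positive value), the Adam learning rate, and full-batch updates.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class SimpleRNN(nn.Module):
    """Hypothetical classifier used only for this example."""

    def __init__(self, input_size=1, hidden_size=16, num_classes=2):
        super().__init__()
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        out, _ = self.rnn(x)
        return self.fc(out[:, -1, :])


# Synthetic data: 64 sequences of length 12 with one feature each.
X = torch.randn(64, 12, 1)
y = (X.sum(dim=(1, 2)) > 0).long()

model = SimpleRNN()
criterion = nn.CrossEntropyLoss()  # expects raw logits and class indices
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

num_epochs = 5
losses = []
for epoch in range(num_epochs):
    model.train()
    optimizer.zero_grad()        # clear gradients from the previous step
    logits = model(X)            # forward pass
    loss = criterion(logits, y)  # loss computation
    loss.backward()              # backpropagation through time
    optimizer.step()             # parameter update
    losses.append(loss.item())
    print(f"epoch {epoch + 1}/{num_epochs}  loss {loss.item():.4f}")
```

In practice the data would come from a `DataLoader` in mini-batches, and the per-epoch loss would be the average over those batches. The `losses` list can be reused later to plot a loss curve.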


##### **Q13: How do you track and print the training loss at each epoch during training in PyTorch?**


##### **Q14: How do you implement gradient clipping in PyTorch to prevent exploding gradients during RNN training?**
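
Clipping is applied between `loss.backward()` and `optimizer.step()`. The sketch below uses `torch.nn.utils.clip_grad_norm_`; the threshold `max_norm = 1.0`, the artificial loss, and the sizes are hypothetical values chosen to make the gradient large.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

rnn = nn.RNN(input_size=4, hidden_size=8, batch_first=True)
optimizer = torch.optim.SGD(rnn.parameters(), lr=0.1)
max_norm = 1.0  # hypothetical clipping threshold

x = torch.randn(16, 30, 4)
out, _ = rnn(x)
loss = (out ** 2).sum() * 100.0  # artificial loss, scaled up on purpose

optimizer.zero_grad()
loss.backward()

# Rescales all gradients in place so that their combined L2 norm is at most
# max_norm, and returns the norm measured before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(rnn.parameters(), max_norm)
norm_after = torch.sqrt(sum((p.grad ** 2).sum() for p in rnn.parameters()))

optimizer.step()
print(f"before: {norm_before.item():.3f}  after: {norm_after.item():.3f}")
```

Gradients whose norm is already below the threshold are left unchanged, so clipping only intervenes on unusually large updates.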

## Evaluating the RNN model

##### **Q15: How do you evaluate the performance of a trained RNN model on a validation dataset in PyTorch?**


##### **Q16: How do you calculate the accuracy of an RNN model on a test set in PyTorch?**
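
A minimal sketch of an accuracy computation, assuming a classifier that returns one row of logits per sequence. The untrained `SimpleRNN` and the random test data are placeholders standing in for a trained model and a real test set.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class SimpleRNN(nn.Module):
    """Hypothetical classifier standing in for a trained model."""

    def __init__(self, input_size=1, hidden_size=16, num_classes=2):
        super().__init__()
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        out, _ = self.rnn(x)
        return self.fc(out[:, -1, :])


model = SimpleRNN()
X_test = torch.randn(32, 12, 1)          # placeholder test inputs
y_test = torch.randint(0, 2, (32,))      # placeholder test labels

model.eval()                             # switch off training-only behaviour
correct, total = 0, 0
with torch.no_grad():                    # no gradients needed for evaluation
    for start in range(0, len(X_test), 8):       # mini-batches of 8
        xb = X_test[start:start + 8]
        yb = y_test[start:start + 8]
        preds = model(xb).argmax(dim=1)          # predicted class per sequence
        correct += (preds == yb).sum().item()
        total += yb.size(0)

accuracy = correct / total
print(f"accuracy: {accuracy:.2%}")
```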


##### **Q17: How do you run inference with a trained RNN model on new sequence data in PyTorch?**


##### **Q18: How do you save and load a trained RNN model in PyTorch for later use?**
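
A common approach is to save only the `state_dict` and load it into a freshly constructed model with the same architecture. The sketch below uses a temporary directory and a hypothetical file name so that it leaves nothing behind.

```python
import os
import tempfile

import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.RNN(input_size=4, hidden_size=8, batch_first=True)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "rnn_model.pt")  # hypothetical file name
    torch.save(model.state_dict(), path)          # save the parameters only

    # Rebuild the same architecture, then load the saved parameters into it.
    restored = nn.RNN(input_size=4, hidden_size=8, batch_first=True)
    restored.load_state_dict(torch.load(path))

restored.eval()

x = torch.randn(2, 5, 4)
with torch.no_grad():
    same = torch.allclose(model(x)[0], restored(x)[0])
print(same)
```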

## Visualizing model predictions

##### **Q19: How do you visualize the predicted output versus the actual output for a sequence prediction task in PyTorch?**


##### **Q20: How do you plot the loss curve over the training epochs to analyze the RNN model's learning behavior in PyTorch?**

## Handling long sequences and padding

##### **Q21: How do you handle sequences of varying lengths when training an RNN in PyTorch?**


##### **Q22: How do you pad sequences to ensure they have the same length in a batch when training an RNN in PyTorch?**
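
`torch.nn.utils.rnn.pad_sequence` stacks a list of variable-length tensors into one batch, filling the shorter ones with a padding value. The sequences below are illustrative.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three sequences with 5, 2, and 4 time steps, each with 3 features.
sequences = [torch.ones(5, 3), torch.ones(2, 3), torch.ones(4, 3)]
lengths = torch.tensor([len(s) for s in sequences])

# Result has shape (batch, max_len, features); shorter sequences get zeros.
padded = pad_sequence(sequences, batch_first=True, padding_value=0.0)

print(padded.shape)
print(lengths)
```

Keeping the original `lengths` alongside the padded batch matters, because the model later needs them to tell real time steps from padding.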


##### **Q23: How do you use `nn.utils.rnn.pack_padded_sequence` to handle padded sequences in an RNN model in PyTorch?**


##### **Q24: How do you unpack the sequences using `nn.utils.rnn.pad_packed_sequence` after processing them through an RNN in PyTorch?**
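
The sketch below shows the full round trip: pad, pack, run the RNN, then unpack. Sizes and data are illustrative assumptions. `enforce_sorted=False` lets the batch stay in its original order.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import (
    pack_padded_sequence,
    pad_packed_sequence,
    pad_sequence,
)

torch.manual_seed(0)

sequences = [torch.randn(5, 3), torch.randn(2, 3), torch.randn(4, 3)]
lengths = torch.tensor([5, 2, 4])  # true lengths, kept on the CPU
padded = pad_sequence(sequences, batch_first=True)

rnn = nn.RNN(input_size=3, hidden_size=6, batch_first=True)

# Pack so the RNN skips the padded time steps.
packed = pack_padded_sequence(
    padded, lengths, batch_first=True, enforce_sorted=False
)
packed_out, h_n = rnn(packed)

# Unpack back to a padded tensor of shape (batch, max_len, hidden_size).
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

print(out.shape, out_lengths)
```

After unpacking, positions beyond each sequence's length are filled with zeros, and `h_n` holds the hidden state at each sequence's last real time step rather than at the last padded position.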


##### **Q25: How do you modify the RNN model to correctly handle packed sequences in PyTorch?**
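
One way to do this, sketched under assumptions, is to pass the sequence lengths into `forward` and pack inside the model. The class name `PackedRNNClassifier`, the classification head, and the sizes are hypothetical.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_sequence


class PackedRNNClassifier(nn.Module):
    """Hypothetical classifier that packs padded input before the RNN."""

    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, padded, lengths):
        # lengths must be on the CPU for pack_padded_sequence.
        packed = pack_padded_sequence(
            padded, lengths.cpu(), batch_first=True, enforce_sorted=False
        )
        _, h_n = self.rnn(packed)
        # h_n[-1] is the state at each sequence's last real time step,
        # so the padding does not influence the prediction.
        return self.fc(h_n[-1])


torch.manual_seed(0)
sequences = [torch.randn(5, 3), torch.randn(2, 3), torch.randn(4, 3)]
lengths = torch.tensor([5, 2, 4])
padded = pad_sequence(sequences, batch_first=True)

model = PackedRNNClassifier(input_size=3, hidden_size=6, num_classes=2)
logits = model(padded, lengths)
print(logits.shape)
```

Using `h_n` instead of `out[:, -1, :]` is the key design choice here: with padded input, the last column of the output tensor corresponds to padding for every sequence shorter than the longest one.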

## Conclusion