# INTERMEDIATE PREREQUISITES

# Neural Network Models: RNN, LSTM, Bi-LSTM, and GRU

## Recurrent Neural Network (RNN)

![image](res/rnn.png)

### Model Introduction
Recurrent Neural Networks (RNNs) are a class of artificial neural networks designed to work with sequential data. Unlike feedforward neural networks, RNNs have connections that form directed cycles, allowing them to maintain an internal state or "memory" of previous inputs.

### How and Why the Model was Created
RNNs were developed in the 1980s to address the limitation of traditional neural networks in processing sequential data. The key motivation was to create a network that could use its internal memory to process sequences of inputs, making it suitable for tasks like speech recognition and language modeling.

### Detailed Working Explanation
1. **Basic Structure**: An RNN consists of input, hidden, and output layers. The hidden layer has a self-loop connection, allowing it to pass information from one time step to the next.

2. **Forward Pass**:
   At each time step t, the RNN takes an input x_t and the previous hidden state h_(t-1) to compute the current hidden state h_t:
   h_t = tanh(W_hx * x_t + W_hh * h_(t-1) + b_h)
   Where W_hx and W_hh are the input-to-hidden and hidden-to-hidden weight matrices, and b_h is a bias vector. The same weights are reused at every time step.

3. **Output Computation**:
   The output y_t is computed based on the current hidden state:
   y_t = W_hy * h_t + b_y
   Where W_hy is a weight matrix and b_y is a bias vector.

4. **Backpropagation Through Time (BPTT)**:
   RNNs are trained using BPTT, which unrolls the network through time and applies backpropagation. The gradients are computed for each time step and summed.

5. **Gradient Flow**:
   During training, gradients are multiplied by one Jacobian per time step as they are propagated back through time, so they can shrink toward zero (vanish) or grow without bound (explode), making it difficult to learn long-term dependencies.

6. **Example**: In a character-level language model, each input x_t is an encoding of the current character (for example, a one-hot vector), and the output y_t scores the candidates for the next character. The hidden state h_t captures the context of previous characters.
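
The forward pass and output computation in steps 2 and 3 can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `rnn_forward`, the zero initial hidden state, the dimensions, and the random weights are all assumptions made for the example.

```python
import numpy as np

def rnn_forward(xs, W_hx, W_hh, W_hy, b_h, b_y):
    """Run a vanilla RNN over xs of shape (T, input_dim), starting from h_0 = 0."""
    h = np.zeros(W_hh.shape[0])  # assumed initial hidden state
    hs, ys = [], []
    for x_t in xs:
        # h_t = tanh(W_hx * x_t + W_hh * h_(t-1) + b_h)
        h = np.tanh(W_hx @ x_t + W_hh @ h + b_h)
        # y_t = W_hy * h_t + b_y
        y = W_hy @ h + b_y
        hs.append(h)
        ys.append(y)
    return np.stack(hs), np.stack(ys)

# Illustrative sizes and random weights (assumptions, not from the text).
rng = np.random.default_rng(0)
T, input_dim, hidden_dim, output_dim = 5, 3, 4, 2
W_hx = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
b_y = np.zeros(output_dim)

xs = rng.normal(size=(T, input_dim))
hs, ys = rnn_forward(xs, W_hx, W_hh, W_hy, b_h, b_y)
print(hs.shape, ys.shape)  # (5, 4) (5, 2)
```

Note that the same three weight matrices are used inside the loop at every time step, which is what lets the network handle sequences of any length.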

### When to Use (Use Cases)
- Time series prediction
- Natural language processing tasks (e.g., sentiment analysis, text generation)
- Speech recognition
- Music generation

### Advantages
- Can process sequences of variable length
- Shares parameters across time steps, reducing the number of parameters to learn
- Capable of capturing temporal dependencies in data

### Disadvantages
- Suffers from vanishing and exploding gradient problems
- Difficulty in capturing long-term dependencies
- Processes time steps one after another, so computation cannot be parallelized across the sequence and becomes expensive for very long sequences
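
The vanishing and exploding gradient problem can be illustrated numerically. For the update h_t = tanh(W_hx * x_t + W_hh * h_(t-1) + b_h), the gradient of h_T with respect to h_0 is a product of per-step Jacobians diag(1 - h_t^2) * W_hh. The sketch below makes simplifying assumptions that are not part of the text: all hidden states are fixed at zero (where the tanh derivative takes its maximum value of 1), and W_hh is a scaled orthogonal matrix, so the norm of the product is simply the scale raised to the power T. The function name `backprop_jacobian_norm` is hypothetical.

```python
import numpy as np

def backprop_jacobian_norm(W_hh, hs):
    """Spectral norm of d h_T / d h_0 given hidden states h_1..h_T.

    Each step contributes the Jacobian diag(1 - h_t**2) @ W_hh.
    """
    J = np.eye(W_hh.shape[0])
    for h_t in hs:
        J = (np.diag(1.0 - h_t**2) @ W_hh) @ J
    return np.linalg.norm(J, ord=2)

rng = np.random.default_rng(0)
hidden_dim, T = 4, 50

# Random orthogonal matrix: all singular values equal 1 (illustrative choice).
Q, _ = np.linalg.qr(rng.normal(size=(hidden_dim, hidden_dim)))

# Assumption: hidden states at zero, i.e. the network linearized around h = 0.
hs = np.zeros((T, hidden_dim))

small = backprop_jacobian_norm(0.9 * Q, hs)  # about 0.9**50, shrinks toward 0
large = backprop_jacobian_norm(1.1 * Q, hs)  # about 1.1**50, grows rapidly
print(small, large)
```

In a trained network the hidden states are not zero and the tanh derivative is below 1, which makes vanishing even more likely than this idealized case suggests.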


## Long Short-Term Memory (LSTM)


![image](res/lstm.png)

### Model Introduction
Long Short-Term Memory (LSTM) networks are a specialized form of RNNs designed to capture long-term dependencies in sequential data. They use a gating mechanism to control the flow of information, allowing them to remember or forget information selectively.

### How and Why the Model was Created
LSTMs were introduced by Hochreiter and Schmidhuber in 1997 to address the vanishing gradient problem faced by traditional RNNs. The goal was to create a model that could learn and remember information over long sequences, which is crucial for many real-world applications.

### Detailed Working Explanation
1. **LSTM Cell Structure**: An LSTM cell consists of a cell state and three gates: forget gate, input gate, and output gate.

2. **Forget Gate**:
   f_t = σ(W_f * [h_(t-1), x_t] + b_f)
   This gate decides what information to discard from the cell state.

3. **Input Gate**:
   i_t = σ(W_i * [h_(t-1), x_t] + b_i)
   c̃_t = tanh(W_c * [h_(t-1), x_t] + b_c)
   The input gate i_t decides how much of the candidate values c̃_t to write into the cell state.

4. **Cell State Update**:
   c_t = f_t * c_(t-1) + i_t * c̃_t
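   The forget gate scales the previous cell state and the input gate scales the candidate c̃_t. Both products here are element-wise, unlike the matrix products W * [...] in the gate equations.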

5. **Output Gate**:
   o_t = σ(W_o * [h_(t-1), x_t] + b_o)
   h_t = o_t * tanh(c_t)
   This gate controls which parts of the cell state are exposed as the hidden state h_t. The product o_t * tanh(c_t) is element-wise.

6. **Example**: In a sentiment analysis task, the LSTM can learn to focus on key words or phrases that strongly indicate sentiment, while forgetting less relevant information.
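
Steps 2 through 5 translate almost line by line into a single-step function. The following is a minimal NumPy sketch under stated assumptions: the name `lstm_step`, the `params` dictionary layout, the dimensions, and the random initialization are illustrative choices, not part of the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step.

    params maps "W_f", "W_i", "W_c", "W_o" to (hidden, hidden + input)
    matrices and "b_f", "b_i", "b_c", "b_o" to (hidden,) vectors.
    """
    concat = np.concatenate([h_prev, x_t])                      # [h_(t-1), x_t]
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])       # forget gate
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])       # input gate
    c_tilde = np.tanh(params["W_c"] @ concat + params["b_c"])   # candidate values
    c_t = f_t * c_prev + i_t * c_tilde                          # cell state update
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])       # output gate
    h_t = o_t * np.tanh(c_t)                                    # new hidden state
    return h_t, c_t

# Illustrative sizes and random weights (assumptions).
rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
params = {}
for gate in "fico":
    params[f"W_{gate}"] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim + input_dim))
    params[f"b_{gate}"] = np.zeros(hidden_dim)

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(6, input_dim)):
    h, c = lstm_step(x_t, h, c, params)
print(h.shape, c.shape)  # (4,) (4,)
```

When the forget gate is close to 1 and the input gate close to 0, the cell state is copied forward nearly unchanged, which is the mechanism that lets information survive over many steps.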

### When to Use (Use Cases)
- Machine translation
- Speech recognition
- Sentiment analysis
- Time series forecasting with long-term trends

### Advantages
- Capable of learning long-term dependencies
- Mitigates the vanishing gradient problem
- Selective memory through gating mechanism
- Robust performance across a wide range of sequence lengths

### Disadvantages
- More complex than standard RNNs, requiring more computational resources
- Can be challenging to train, requiring careful initialization and hyperparameter tuning
- Potential for overfitting, especially on smaller datasets

## Bidirectional LSTM (Bi-LSTM)

![image](res/bilistm.png)

### Model Introduction
Bidirectional LSTM (Bi-LSTM) is an extension of the standard LSTM that processes input sequences in both forward and backward directions. This allows the network to capture context from both past and future states, providing a more comprehensive understanding of the sequence.

### How and Why the Model was Created
Bi-LSTMs were developed to address the limitation of unidirectional LSTMs in tasks where future context is as important as past context. They were introduced to improve performance in tasks like speech recognition and natural language processing, where understanding the full context of a sequence is crucial.

### Detailed Working Explanation
1. **Network Structure**: A Bi-LSTM consists of two separate LSTM layers: one processes the input sequence from left to right (forward), and the other from right to left (backward).

2. **Forward Pass**:
   Forward LSTM: h_f_t = LSTM_f(x_t, h_f_(t-1))
   Backward LSTM: h_b_t = LSTM_b(x_t, h_b_(t+1))
   Where LSTM_f and LSTM_b are the forward and backward LSTM functions respectively.

3. **Output Combination**:
   The outputs from both directions are combined, often by concatenation:
   h_t = [h_f_t, h_b_t]

4. **Final Output**:
   y_t = W_y * h_t + b_y
   Where W_y is a weight matrix and b_y is a bias vector.

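5. **Training**: The forward and backward LSTMs have separate parameters and are trained jointly: the loss gradient flows into both through the combined h_t, and each direction is updated using backpropagation through time.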

6. **Example**: In named entity recognition, a Bi-LSTM can use both preceding and following words to accurately classify an entity, which is particularly useful for disambiguating entities based on context.
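
The two directional passes and their concatenation (steps 2 and 3) can be sketched as below. Several details are assumptions made to keep the example short: the helper names `lstm_step`, `run_lstm`, `bilstm`, and `init_direction` are hypothetical, both directions start from zero states, and the four gate matrices are stacked row-wise into one matrix W purely as an implementation convenience.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step; W stacks W_f, W_i, W_c, W_o row-wise."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f_t, i_t, o_t = sigmoid(z[:H]), sigmoid(z[H:2 * H]), sigmoid(z[3 * H:])
    c_tilde = np.tanh(z[2 * H:3 * H])
    c_t = f_t * c_prev + i_t * c_tilde
    return o_t * np.tanh(c_t), c_t

def run_lstm(xs, W, b, hidden_dim):
    """Run one LSTM over xs in the given order and return all hidden states."""
    h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
    hs = []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, W, b)
        hs.append(h)
    return np.stack(hs)

def bilstm(xs, fwd, bwd, hidden_dim):
    h_f = run_lstm(xs, *fwd, hidden_dim)              # left to right
    # Right to left: feed the reversed sequence, then flip back to time order.
    h_b = run_lstm(xs[::-1], *bwd, hidden_dim)[::-1]
    return np.concatenate([h_f, h_b], axis=1)         # h_t = [h_f_t, h_b_t]

# Illustrative sizes and random weights (assumptions).
rng = np.random.default_rng(0)
T, input_dim, hidden_dim = 6, 3, 4

def init_direction():
    W = rng.normal(scale=0.5, size=(4 * hidden_dim, hidden_dim + input_dim))
    return W, np.zeros(4 * hidden_dim)

fwd, bwd = init_direction(), init_direction()
xs = rng.normal(size=(T, input_dim))
hs = bilstm(xs, fwd, bwd, hidden_dim)
print(hs.shape)  # (6, 8)
```

Each row of `hs` has twice the hidden size: the first half summarizes the inputs up to that position and the second half summarizes the inputs from that position to the end.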

### When to Use (Use Cases)
- Named Entity Recognition
- Part-of-speech tagging
- Machine translation
- Sentiment analysis where whole-sentence context is important

### Advantages
- Captures both past and future context
- Improves performance in tasks where bidirectional context is crucial
- Reduces ambiguity in classification tasks
- Can be combined with attention mechanisms, which can further improve performance on some tasks

### Disadvantages
- Increased computational complexity compared to unidirectional LSTMs
- Requires the entire sequence to be available before processing, making it unsuitable for streaming or real-time applications that must produce outputs before the sequence ends
- Potential for overfitting, especially on smaller datasets
- More complex to implement and tune

## Gated Recurrent Unit (GRU)

![image](res/gru.ppm)

### Model Introduction
Gated Recurrent Units (GRUs) are a type of recurrent neural network designed as a simpler alternative to LSTMs. They use a gating mechanism to control information flow but with a simplified structure compared to LSTMs.

### How and Why the Model was Created
GRUs were introduced by Cho et al. in 2014 as part of an effort to create a more computationally efficient alternative to LSTMs. The goal was to maintain the ability to capture long-term dependencies while reducing the number of parameters and computational complexity.

### Detailed Working Explanation
1. **GRU Cell Structure**: A GRU cell consists of two gates: reset gate and update gate.

2. **Update Gate**:
   z_t = σ(W_z * [h_(t-1), x_t])
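   This gate decides how much of the previous hidden state is replaced by new content: with the update rule in step 5, z_t near 0 carries h_(t-1) forward largely unchanged, and z_t near 1 overwrites it with the candidate.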

3. **Reset Gate**:
   r_t = σ(W_r * [h_(t-1), x_t])
   This gate decides how much of the previous hidden state is used when computing the candidate hidden state; r_t near 0 makes the candidate ignore the past.

4. **Candidate Hidden State**:
   h̃_t = tanh(W * [r_t * h_(t-1), x_t])
   This is the new memory content, which will be used to update the hidden state. Here r_t * h_(t-1) is an element-wise product, while W is applied as a matrix multiplication.

5. **Hidden State Update**:
   h_t = (1 - z_t) * h_(t-1) + z_t * h̃_t
   The hidden state is updated based on the update gate and the candidate hidden state.

6. **Example**: In a text classification task, the GRU can learn to focus on key phrases while ignoring less relevant parts of the input sequence.
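
Steps 2 through 5 can be written as one short function. This is a minimal NumPy sketch that follows the bias-free equations as given in this section; the name `gru_step`, the dimensions, and the random weights are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W):
    """One GRU step; each weight matrix has shape (hidden, hidden + input)."""
    concat = np.concatenate([h_prev, x_t])                       # [h_(t-1), x_t]
    z_t = sigmoid(W_z @ concat)                                  # update gate
    r_t = sigmoid(W_r @ concat)                                  # reset gate
    # Candidate state uses the reset-scaled previous hidden state.
    h_tilde = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))
    # Interpolate between the old state and the candidate.
    return (1.0 - z_t) * h_prev + z_t * h_tilde

# Illustrative sizes and random weights (assumptions).
rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
shape = (hidden_dim, hidden_dim + input_dim)
W_z, W_r, W = (rng.normal(scale=0.1, size=shape) for _ in range(3))

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(6, input_dim)):
    h = gru_step(x_t, h, W_z, W_r, W)
print(h.shape)  # (4,)
```

Unlike the LSTM, there is no separate cell state: the hidden state itself is carried forward, and a single gate z_t balances keeping against overwriting.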

### When to Use (Use Cases)
- Text classification
- Sentiment analysis
- Time series prediction with moderate sequence lengths
- When computational efficiency is a priority

### Advantages
- Simpler architecture than LSTMs, with fewer parameters
- Generally faster to train and run than LSTMs
- Effective at capturing medium to long-range dependencies
- Often performs comparably to LSTMs on many tasks
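
The "fewer parameters" claim can be made concrete with a quick count. The sketch below assumes one weight matrix of shape (hidden, hidden + input) and one bias vector per gate-sized block for both models, so the comparison is like for like (the GRU equations in this section omit the bias terms). The function names and the example sizes are hypothetical, and library implementations may count biases differently.

```python
def gate_params(input_dim, hidden_dim):
    """One gate-sized block: a (hidden, hidden + input) matrix plus one bias vector."""
    return hidden_dim * (hidden_dim + input_dim) + hidden_dim

def lstm_params(input_dim, hidden_dim):
    # Forget, input, and output gates plus the candidate cell state.
    return 4 * gate_params(input_dim, hidden_dim)

def gru_params(input_dim, hidden_dim):
    # Update and reset gates plus the candidate hidden state.
    return 3 * gate_params(input_dim, hidden_dim)

input_dim, hidden_dim = 128, 256  # illustrative sizes
print(lstm_params(input_dim, hidden_dim))  # 394240
print(gru_params(input_dim, hidden_dim))   # 295680
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same size.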

### Disadvantages
- May be less powerful than LSTMs for some complex tasks requiring fine-grained control over memory
- Introduced much later than LSTMs (2014 versus 1997), so they have a shorter track record of study and analysis
- Performance can vary depending on the specific task and dataset
- May struggle with very long-term dependencies compared to LSTMs in some cases