**References:**

- Blog:
    - [Dive into Deep Learning - RNN](https://d2l.ai/chapter_recurrent-neural-networks/index.html)
    - [Dive into Deep Learning - LSTM](https://d2l.ai/chapter_recurrent-modern/lstm.html)
    - [Dive into Deep Learning - GRU](https://d2l.ai/chapter_recurrent-modern/gru.html)
    - [Scaler - Deep Learning for Sequence Modelling](https://www.scaler.com/topics/deep-learning/)
    - [Analytics Vidhya - Intro to LSTM](https://www.analyticsvidhya.com/blog/2017/12/fundamentals-of-deep-learning-introduction-to-lstm/) <br></br>

- Videos:
    - [StatQuest - RNN, LSTM Clearly Explained!](https://www.youtube.com/playlist?list=PLblh5JKOoLUIxGDQs4LFFD--41Vzf-ME1)
    - [Serrano Academy - Intro to RNN](https://youtu.be/UNmqTiOnRfg)
    - [Brandon Rohrer - How RNNs & LSTM Work](https://e2eml.school/how_rnns_lstm_work)
    - [Michael Phi - Illustrated Guide to RNNs, LSTMS, GRU](https://www.youtube.com/@thea.i.hacker-michaelphi6569/videos)
    - [Stanford - RNN](https://youtu.be/6niqTuYFZLQ)
    - [Brandon Rohrer - Transformers](https://e2eml.school/transformers) <br></br>
    
- Kaggle, Code Implementation:
    - [Intro to Recurrent Neural Networks LSTM | GRU - TensorFlow](https://www.kaggle.com/code/thebrownviking20/intro-to-recurrent-neural-networks-lstm-gru) by [Siddharth Yadav](https://www.linkedin.com/in/siddharth-yadav-43a59115a/)
    - [NLP Beginner 1: (RNN,LSTM,GRU) (Embeddings, GloVe) - PyTorch](https://www.kaggle.com/code/deshwalmahesh/nlp-beginner-1-rnn-lstm-gru-embeddings-glove) by [Mahesh Deshwal](https://www.linkedin.com/in/deshwalmahesh/)

## **1. Recurrent Neural Networks (RNNs):**

### 1.1 What are RNNs & How do they work?

1. **What are RNNs?**

    - Recurrent Neural Networks (RNNs) are a type of neural network architecture designed to handle sequential data. This means that they can be used to model data that has a temporal order, such as text, speech, or music. 

    - They are particularly well-suited for tasks where the order and context of data matter, such as Time-Series Prediction, NLP, and Speech Recognition. 

2. **How do RNNs work?**

    - RNNs maintain a hidden state that gets updated with each new input, allowing them to capture information from previous steps and use it to influence the current step's output. This hidden state represents the network's understanding of the data that it has seen so far. The hidden state is then used to predict the next element in the sequence.
    
    - **Here is an example of how an RNN might be used to predict the next word in a sentence:** 
        - The RNN would start with a hidden state that is initialized to all zeros. Then, it would read the first word in the sentence, and update its hidden state based on that word. The RNN would then read the second word in the sentence, and update its hidden state again. This process would continue until the RNN has read the entire sentence. Finally, the RNN would output a prediction for the next word in the sentence.

* **

### 1.2 Limitations of RNNs

**Limitations:**

1. The **vanishing and exploding gradient** problems are two of the most common limitations of RNNs. They occur when the gradients of the loss function become very small or very large as they are propagated back through many time steps, which makes it difficult for the network to learn.

    - **The vanishing gradient problem occurs when the factors that are repeatedly multiplied during backpropagation (which depend on the recurrent weights and the activation derivatives) are small, typically below 1 in magnitude. It mainly hurts the learning of long-term dependencies**, because the gradient is multiplied by these factors once per time step as it propagates back through time. The contribution of distant time steps shrinks toward zero, so the network learns very little from them.

    - **The exploding gradient problem occurs when those repeated factors are large, typically above 1 in magnitude. It becomes more likely as sequences get longer**, because the same multiplication is repeated at every time step. The gradients can then grow very large, which makes the weight updates unstable and the network difficult to train.
        
    - There are a number of techniques that can be used to mitigate the vanishing and exploding gradient problems in RNNs. These techniques include:

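        - **Using a smaller learning rate:** A smaller learning rate does not change the gradients themselves, but it limits the size of each weight update, which reduces the damage done by an occasional very large gradient. It does not help with vanishing gradients.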
        
        - **Using gradient clipping:** Gradient clipping is a technique that limits the magnitude of the gradients. This can help to prevent the gradients from becoming too large and causing the network to become unstable.
        
        - **Using a more regularized network:** Regularization techniques such as weight decay discourage the weights from growing too large, which can reduce the risk of exploding gradients.<br></br>
        

2. There are a number of other limitations of RNNs:

    - **RNNs can be slow to train:** Training is slow, especially for long sequences, because each time step depends on the previous hidden state. The computation cannot be parallelized across time steps, and the gradients have to be propagated back through the whole sequence.

    - **RNNs can be sensitive to the initial state:** The initial state of the RNN can have a significant impact on the output of the network. This can make it difficult to train the network to be robust to different starting points.

    - **RNNs can be difficult to interpret:** RNNs can be difficult to interpret, as they are essentially black boxes. This can make it difficult to understand how the network is making its predictions.
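
Gradient clipping, listed among the mitigation techniques above, can be sketched in a few lines of NumPy. This is a minimal illustration of clipping by the global norm, not a reference implementation: the function name `clip_by_global_norm`, the threshold of `5.0`, and the example gradients are all assumptions made for the demonstration.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so that their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]  # same direction, smaller magnitude
    return grads, total_norm

# Made-up gradients standing in for dL/dU, dL/dW and dL/dV; one entry has "exploded".
grads = [np.array([[3.0, 4.0]]), np.array([[120.0]]), np.array([[0.5]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))

print(f"norm before clipping: {norm_before:.2f}")
print(f"norm after clipping:  {norm_after:.2f}")
```

Clipping rescales all gradients by the same factor, so the update direction is preserved and only its length is capped. Gradients whose norm is already below the threshold are returned unchanged.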
    
* **

### 1.3 Intuitive & Mathematical understanding of RNNs

**Intuitive Explanation:**

>Imagine a neural network that's designed to understand sequences of data, like sentences in a paragraph or frames in a video. RNNs are like a chain of connected units, where each unit not only processes the current data point but also remembers information from the past. It's like passing a message along a line of friends, where each friend adds their own knowledge to the message before passing it on.

**Mathematical Explanation:**

>In RNNs, at each step of the sequence, the current input is combined with the previous hidden state using certain weight matrices. This combined information is then passed through a nonlinear function (usually a tanh activation) to generate the new hidden state. The hidden state at each step stores information from the past, allowing the network to consider the context as it processes the sequence.

* **

This is what an RNN looks like:

<div align='center'>
    <img src='images/rnn_unrolled.png' title='RNN' width=500/>
    <img src='images/rnn.png' title='RNN' width=500/>
</div>

Here $h_t$ represents the hidden state at time $t$ and $x_t$ represents the input at time $t$.

- **At time $t$, the cell has an input $x_t$ and an output $y_t$. Part of the output, represented by the hidden state $h_t$, is fed back into the cell for use at the next time step $t+1$.**

- Just as in a traditional neural network, where the learned parameters are stored as weight matrices, the RNN's parameters are defined by the three weight matrices $U(W_{xh})$, $V(W_{hy})$, and $W(W_{hh})$, **corresponding to the weights of the input, output, and hidden states respectively.**

- The unrolled view in the figure above shows the same RNN drawn out for the complete sequence. Unrolling just means that we draw one copy of the cell per time step. The network shown here has three time steps, suitable for processing three-element sequences. 

- **Note that the weight matrices U, V, and W, are shared between each of the time steps.** This is because we are applying the same operation to different inputs at each time step. Being able to share these weights across all the time steps greatly reduces the number of parameters that the RNN needs to learn.


- The three weights in the RNN can be understood as:
    - $U \text{ or } W_{xh} \rightarrow$ input to hidden, 
    - $W \text{ or } W_{hh} \rightarrow$ hidden to hidden, and 
    - $V \text{ or } W_{hy} \rightarrow$ hidden to output 

* **

### 1.4 Workings of a Simple RNN

<div align='center'>
    <img src='images/rnn_unrolled.png' title='RNN' width=500/>
    <img src='images/rnn.png' title='RNN' width=500/>
</div>

1. Hidden State ($h_t$):

    - The internal state of the RNN at time $t$ is given by the hidden vector $h_t$. It is computed by adding the product of the weight matrix $W(W_{hh})$ and the hidden state $h_{t-1}$ at time $t-1$ to the product of the weight matrix $U(W_{xh})$ and the input $x_t$ at time $t$, adding a bias, and passing the result through a $\tanh$ activation function. 
    
    - Equation: $h_t = \tanh(W_{hh}h_{t-1} + W_{xh}x_t + b_h)$ 
    
        - `tanh` is commonly preferred over sigmoid here because its output is zero-centered and its derivative is larger (at most 1, versus 0.25 for sigmoid). This tends to make learning more efficient in practice and makes gradients shrink less quickly, which reduces but does not eliminate the vanishing gradient problem.<br></br>
    
    - Here, $W_{hh}$ or $W$ is the weight matrix for the recurrent connections, $W_{xh}$ or $U$ is the weight matrix for the input connections, and $b_h$ is the bias term, which, like the weights, is shared across time steps.


2. Output ($y_t$):

    - The output vector $y_t$ at time $t$ is the product of the weight matrix $V$ and the hidden state $h_t$, passed through a softmax activation, such that the resulting vector is a set of output probabilities.
    
    - Equation: $y_t = \text{softmax}(W_{hy}h_t + b_y)$ 

    - Here, $W_{hy} \text{ or } V $ is the weight matrix for the output connections, and $b_y$ is the output bias term.
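
The two equations above can be turned into a short forward pass. The following is a minimal NumPy sketch, assuming small random (untrained) weights and made-up dimensions; the function name `rnn_forward` and the variable `b_h` (the bias of the hidden-state update) are choices made for this illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

def rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y):
    """Apply the same weights at every time step and collect h_t and y_t."""
    h = np.zeros(W_hh.shape[0])  # initial hidden state: all zeros
    hs, ys = [], []
    for x in xs:
        h = np.tanh(W_hh @ h + W_xh @ x + b_h)  # hidden state update
        y = softmax(W_hy @ h + b_y)             # output probabilities
        hs.append(h)
        ys.append(y)
    return np.array(hs), np.array(ys)

# Assumed toy dimensions and random weights, for illustration only
rng = np.random.default_rng(0)
input_size, hidden_size, output_size, T = 4, 3, 5, 6
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_hy = rng.normal(scale=0.1, size=(output_size, hidden_size))
b_h, b_y = np.zeros(hidden_size), np.zeros(output_size)

xs = rng.normal(size=(T, input_size))  # a sequence of T input vectors
hs, ys = rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y)
print(hs.shape, ys.shape)  # (6, 3) (6, 5)
```

Note that the loop reuses `W_xh`, `W_hh`, and `W_hy` at every step, which is exactly the weight sharing described earlier.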

* **

#### ***Intuitive understanding of Weight Matrices:***

The weight matrices in neural networks play a crucial role in determining how information is transformed and propagated through the network. 

**Weight Matrices in RNNs:** 

In a standard RNN, two weight matrices govern the hidden state update (the third, $W_{hy}$ or $V$, maps the hidden state to the output):

1. **$W_{xh}$ or $U$:** This matrix represents the weights connecting the input data ($x_t$) to the hidden state ($h_t$) at the current time step. **It determines how much influence the input has on the hidden state.**


2. **$W_{hh}$ or $W$:** This matrix represents the recurrent weights connecting the previous hidden state ($h_{t-1}$) to the current hidden state ($h_t$). **It controls the propagation of information from previous time steps to the current time step.**

Both of these matrices are learned during the training process to optimize the network's performance on the given task. They control how information flows and is transformed as it moves through the network over time steps.

### 1.5 Backpropagation through time (BPTT)

- Just like traditional neural networks, training RNNs also involves the backpropagation of gradients. 

- The difference, in this case, is that since the weights are shared by all time steps, the gradient at each output depends not only on the current time step but also on the previous ones. This process is called **backpropagation through time**. 

- Because the weights $U$, $V$, and $W$, are shared across the different time steps in the case of RNNs, we need to sum up the gradients across the various time steps in the case of BPTT. This is the key difference between traditional backpropagation and BPTT.

* **

Consider the RNN with five time steps shown in the following figure: 

<div align='center'>
    <img src='images/bptt.png'/>
</div>

During the forward pass, the network produces predictions $\hat{y}_t$ at time $t$ that are compared with the label $y_t$ to compute a loss at that timestep $L_t$. The loss at all time steps is added to determine the overall loss $L = \sum L_t$. 

Our objective is to minimize this loss by finding good values for the RNN's three weight matrices: input to hidden $U$, hidden to hidden $W$, and hidden to output $V$.

During backpropagation (shown by the dotted lines), the gradients of the loss with respect to the weights $U$, $V$, and $W$, are computed at each time step and the parameters updated with the sum of the gradients:

$$\frac{\partial L}{\partial W} = \sum_{t} \frac{\partial L_t}{\partial W}$$


$$V = V - \alpha \frac{\partial L}{\partial V}, U = U - \alpha \frac{\partial L}{\partial U}, W = W - \alpha \frac{\partial L}{\partial W}$$
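
The summation over time steps can be made concrete with a small NumPy sketch. It is a simplified illustration rather than a full training loop: biases are omitted for brevity, the loss is assumed to be softmax cross-entropy, and the dimensions, random data, and function names (`forward`, `bptt`) are all assumptions. One entry of the analytic gradient is compared against a numerical estimate as a sanity check.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def forward(xs, targets, U, W, V):
    """Run the RNN over the sequence; return the total loss, hidden states and predictions."""
    hs = [np.zeros(W.shape[0])]          # hs[0] is the initial hidden state
    ys, loss = [], 0.0
    for x, target in zip(xs, targets):
        hs.append(np.tanh(W @ hs[-1] + U @ x))
        ys.append(softmax(V @ hs[-1]))
        loss += -np.log(ys[-1][target])  # cross-entropy loss L_t at this step
    return loss, hs, ys

def bptt(xs, targets, U, W, V):
    """Return dL/dU, dL/dW and dL/dV, each summed over all time steps."""
    _, hs, ys = forward(xs, targets, U, W, V)
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    dh_next = np.zeros(W.shape[0])         # gradient flowing back from step t+1
    for t in reversed(range(len(xs))):
        dlogits = ys[t].copy()
        dlogits[targets[t]] -= 1.0         # gradient of softmax + cross-entropy
        dV += np.outer(dlogits, hs[t + 1])
        dh = V.T @ dlogits + dh_next       # from this step's output and from the future
        da = (1.0 - hs[t + 1] ** 2) * dh   # back through tanh
        dU += np.outer(da, xs[t])
        dW += np.outer(da, hs[t])          # hs[t] is the previous hidden state
        dh_next = W.T @ da
    return dU, dW, dV

rng = np.random.default_rng(0)
T, n_in, n_h, n_out = 5, 3, 4, 3
U = rng.normal(scale=0.5, size=(n_h, n_in))
W = rng.normal(scale=0.5, size=(n_h, n_h))
V = rng.normal(scale=0.5, size=(n_out, n_h))
xs = rng.normal(size=(T, n_in))
targets = rng.integers(0, n_out, size=T)

dU, dW, dV = bptt(xs, targets, U, W, V)

# Numerical check of one entry of dL/dW using central differences
eps = 1e-5
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 1] += eps
W_minus[0, 1] -= eps
numeric = (forward(xs, targets, U, W_plus, V)[0]
           - forward(xs, targets, U, W_minus, V)[0]) / (2 * eps)
print("difference between numeric and analytic gradient:", abs(numeric - dW[0, 1]))

# One gradient descent step with an assumed learning rate alpha
alpha = 0.1
U_new, W_new, V_new = U - alpha * dU, W - alpha * dW, V - alpha * dV
```

The `+=` lines inside the backward loop are where the per-step gradients are accumulated, which is the distinguishing feature of BPTT compared with backpropagation in a feed-forward network.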

* **

### 1.6 Vanishing and exploding gradients

BPTT is particularly sensitive to vanishing and exploding gradients because the gradient of the loss with respect to $W$ contains a product of terms, with one factor for every time step the gradient is propagated back through.

Consider the case where the individual components in the product are less than 1. As we backpropagate across multiple time steps, the product of gradients becomes smaller and smaller, ultimately leading to the problem of vanishing gradients. Similarly, if the gradients are larger than 1, the products get larger and larger, and ultimately lead to the problem of exploding gradients.
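
For reference, the product can be written out explicitly. The following is the standard textbook form for the simple RNN of section 1.4, given here as a sketch rather than taken from the original text. The symbol $\frac{\partial^{+} h_k}{\partial W}$ is notation assumed here for the derivative of $h_k$ with respect to $W$ when $h_{k-1}$ is held fixed.

$$\frac{\partial L_t}{\partial W} = \sum_{k=1}^{t} \frac{\partial L_t}{\partial h_t} \left( \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} \right) \frac{\partial^{+} h_k}{\partial W}$$

With a $\tanh$ activation, each factor in the product is

$$\frac{\partial h_j}{\partial h_{j-1}} = \text{diag}\left(1 - h_j^2\right) W_{hh}$$

The term for an early step $k$ contains $t - k$ such factors. If the norm of each factor is below 1, that term shrinks roughly exponentially with the distance $t - k$ (vanishing). If it is above 1, the term can grow roughly exponentially (exploding).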

### 1.7 Types of RNN

<div align='center'>
    <img src='images/rnn_types.png'/>
</div>

## **2. Long Short-Term Memory (LSTM):**

[Dive into Deep Learning - LSTM](https://d2l.ai/chapter_recurrent-modern/lstm.html)

### 2.1 What are LSTMs?

- **Long Short-Term Memory (LSTM)** is an advanced type of RNN architecture specifically designed to address the vanishing gradient problem. LSTMs introduce memory cells and various gating mechanisms to control the flow of information within the network. 

- LSTMs have a special structure that allows them to remember information for long periods of time, even when the relevant items are far apart in the sequence. This is done by using a set of gates that control the flow of information through the network. The gates allow the LSTM to decide what information to forget and what information to remember.

- The key components of an LSTM cell are:

    1. **Cell State ($C_t$):** This is the memory of the cell that can store relevant information over long sequences. It can be modified through various gates and operations.

    2. **Hidden State ($h_t$):** This is the output of the LSTM cell that can be used for predictions and influencing future cell states.

    3. **Input Gate:** Controls the flow of new information into the cell state.

    4. **Forget Gate:** Determines what information from the cell state should be discarded.

    5. **Output Gate:** Controls the flow of information from the cell state to the hidden state.

    6. **Candidate Values:** These are potential values that can be added to the cell state.

LSTMs excel at learning and remembering long-term dependencies in sequential data due to their gating mechanisms that enable them to store or discard information as needed.



### 2.2 LSTM Cell

- SimpleRNN combines the hidden state from the previous time step and the current input through a `tanh` layer to implement recurrence. **LSTMs also implement recurrence in a similar way, but instead of a single tanh layer, there are four layers interacting in a very specific way.** 

- The figure below illustrates the transformations that are applied to the hidden state at time step $t$:
<div align='center'>
    <img src='images/lstm_cell.png' width=650/>
</div>

- Let's look at it component by component:

    - The line across the top of the diagram is the **cell state $c$**, representing the internal memory of the unit.
    
    - The line across the bottom is the **hidden state $h$**, and the **$i$, $f$, $o$, and $g$ gates** are the mechanisms by which the LSTM works around the vanishing gradient problem. 
        - Here, $i$, $f$, and $o$ are the input, forget, and output gates. $g$ is the cell state candidate, an internal `tanh` layer rather than a gate in the strict sense. <br></br>
    
    - During training, the LSTM learns the parameters of these gates.
    
* **

**Intuitive Explanation:**
>Think of LSTMs as an improved version of RNNs, designed to handle long sequences more effectively. Imagine a memory cell that's really good at remembering important things and forgetting less important stuff. LSTMs are like having a personal assistant who decides which memories to store, update, or recall. This helps the network understand what's important even when the data is far apart in the sequence.

**Mathematical Explanation:**
>LSTMs use multiple gates (like switches) that control the flow of information. The input gate decides what information is added to the memory cell. The forget gate decides what information is discarded from the memory cell. The output gate determines how much information is used to influence the output. This mechanism helps LSTMs learn and retain relevant information over long sequences.

* **

### 2.3 Key Components of an LSTM Cell

Let's look at how these gates work inside an LSTM cell by looking at the equations of the cell. These equations describe how the value of the hidden state $h_t$ at time $t$ is calculated from the value of the hidden state $h_{t-1}$ at the previous time step.

<div align='center'>
    <img src='images/lstm_cell.png' width=650/>
</div>

The LSTM cell's computations involve several equations that describe the interactions of its components. Here's a breakdown of the key equations:

1. **Input Gate ($i_t$):**
   - This gate determines how much new information should be added to the cell state.
   - It is calculated using the `sigmoid` activation function, denoted $\sigma$.
   - $\text{Equation: } i_t = \sigma(W_{xi}x_t + W_{hi}h_{t-1} + b_i)$ <br></br>

2. **Forget Gate ($f_t$):**
   - This gate controls what information to remove from the cell state.
   - It is calculated using the sigmoid activation function.
   - $\text{Equation: } f_t = \sigma(W_{xf}x_t + W_{hf}h_{t-1} + b_f)$ <br></br>

3. **Cell State Candidate ($\tilde{C}_t$ or $g$):**
   - This represents the new candidate values that could be added to the cell state.
   - It is calculated using the tanh activation function.
   - $\text{Equation: } \tilde{C}_t = \tanh(W_{xc}x_t + W_{hc}h_{t-1} + b_c)$ <br></br>

4. **Cell State Update ($C_t$):**
   - This updates the cell state by combining the forget gate, input gate, and the candidate values.
   - $\text{Equation: } C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$
   - $\odot$ denotes element-wise multiplication <br></br>

5. **Output Gate ($o_t$):**
   - This gate determines the amount of information to output from the cell state.
   - It is calculated using the sigmoid activation function.
   - $\text{Equation: } o_t = \sigma(W_{xo}x_t + W_{ho}h_{t-1} + b_o)$ <br></br>

6. **Hidden State ($h_t$):**
   - The final hidden state is obtained by combining the output of the cell state and the output gate.
   - $\text{Equation: } h_t = o_t \odot \tanh(C_t)$
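
The six equations above translate almost line by line into code. Below is a minimal NumPy sketch of a single LSTM step, assuming random (untrained) parameters and made-up sizes; the function names `lstm_step` and `init_params` and the dictionary layout are illustrative choices, not a library API.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(n_x, n_h, rng):
    """Create W_x*, W_h* and b_* for the four blocks: i, f, c (candidate), o."""
    p = {}
    for gate in "ifco":
        p[f"W_x{gate}"] = rng.normal(scale=0.1, size=(n_h, n_x))
        p[f"W_h{gate}"] = rng.normal(scale=0.1, size=(n_h, n_h))
        p[f"b_{gate}"] = np.zeros(n_h)
    return p

def lstm_step(x_t, h_prev, C_prev, p):
    i_t = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["b_i"])      # input gate
    f_t = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["b_f"])      # forget gate
    C_tilde = np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])  # candidate
    C_t = f_t * C_prev + i_t * C_tilde                                  # cell state update
    o_t = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["b_o"])      # output gate
    h_t = o_t * np.tanh(C_t)                                            # hidden state
    return h_t, C_t

rng = np.random.default_rng(0)
n_x, n_h = 4, 3                       # assumed input and hidden sizes
params = init_params(n_x, n_h, rng)

h, C = np.zeros(n_h), np.zeros(n_h)   # initial hidden and cell state
for x_t in rng.normal(size=(5, n_x)): # a sequence of 5 random inputs
    h, C = lstm_step(x_t, h, C, params)
print("h:", h)
print("C:", C)
```

In NumPy, `*` between arrays of the same shape is the element-wise product written as $\odot$ in the equations.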
   
* **

### 2.4 Weight Matrices in LSTMs:

In Long Short-Term Memory (LSTM) networks, the weight matrices are used in various gates and operations:

1. **$W_{xi}, W_{hi}, b_i$:** These matrices and bias terms are associated with the input gate. They determine how the current input ($x_t$) and the previous hidden state $(h_{t-1})$ influence the input gate's decision to let new information into the cell state.

2. **$W_{xf}, W_{hf}, b_f$:** These matrices and bias terms are linked to the forget gate. They control the influence of the input and previous hidden state on the forget gate, deciding what information to discard from the cell state.

3. **$W_{xc}, W_{hc}, b_c$:** These matrices and bias terms are related to the computation of the candidate cell state values. They determine how the input and previous hidden state contribute to generating potential new values for the cell state.

4. **$W_{xo}, W_{ho}, b_o$:** These matrices and bias terms are connected to the output gate. They influence the output gate's decision to expose certain information from the cell state to the hidden state.
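
Since an LSTM cell holds four such sets of parameters, its parameter count is roughly four times that of a simple RNN with the same sizes. The sketch below counts parameters for the formulation used in this section (one bias vector per block); the function names and the example sizes are assumptions. Some libraries store the biases differently (PyTorch's `nn.LSTM`, for instance, keeps two bias vectors per block), so their reported counts can differ slightly.

```python
def simple_rnn_param_count(input_size, hidden_size):
    # W_xh (hidden x input) + W_hh (hidden x hidden) + one bias vector
    return hidden_size * input_size + hidden_size * hidden_size + hidden_size

def lstm_param_count(input_size, hidden_size):
    # Four blocks (input gate, forget gate, candidate, output gate),
    # each with its own W_x*, W_h* and bias
    return 4 * simple_rnn_param_count(input_size, hidden_size)

print(simple_rnn_param_count(10, 20))  # 200 + 400 + 20 = 620
print(lstm_param_count(10, 20))        # 4 * 620 = 2480
```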



### 2.5 Indepth Working of LSTM Components

<div align='center'>
    <img src='images/lstm_cell.png' width=750/>
</div>

Let's understand these in more detail...

1. **Input Gate($i_t$):**

    - **Intuitive Explanation:**
        - The input gate is like a doorman that decides which new information is important enough to let into the cell state. 
        - It evaluates the current input and the previous hidden state to figure out what should be stored for later use.<br></br>
        
    - **Mathematical Explanation:**
        - The input gate's value ($i_t$) is calculated using the `sigmoid` activation function, which squashes the combined influence of the current input ($x_t$) and the previous hidden state ($h_{t-1}$), weighted by the matrices $(W_{xi} \text{ and } W_{hi})$ and shifted by a bias term $(b_i)$, into the range 0 to 1:
        
        - Equation: $i_t = \sigma(W_{xi}x_t + W_{hi}h_{t-1} + b_i)$
        
        
2. **Forget Gate $(f_t)$:**

    - **Intuitive Explanation:**
        - The forget gate decides what information from the past cell state should be kept and what should be discarded. 
        - It considers the previous hidden state and the current input to make this decision.<br></br>
        
    - **Mathematical Explanation:**
        - The forget gate's value $(f_t)$ is also calculated using the `sigmoid` activation function. Weight matrices $(W_{xf} \text{ and } W_{hf})$ and a bias term $(b_f)$ determine how the previous hidden state $(h_{t-1})$ and the current input $(x_t)$ influence the forget gate's decision.
        
        - Equation: $f_t = \sigma(W_{xf}x_t + W_{hf}h_{t-1} + b_f)$
 
 
* **
Before moving on to the next one, let's first understand what ***cell state*** means:

The "cell state" is a fundamental concept in LSTM networks. It's a crucial component that enables LSTMs to capture and store information over long sequences.

- **Cell State $(C_t)$:** 

    - **Intuitive Explanation:**
        - Think of the cell state as a kind of memory within the LSTM cell. It's like a notebook that the LSTM uses to jot down important information it encounters as it processes a sequence of data. This memory allows the LSTM to remember patterns, relationships, and dependencies that exist across different time steps.<br></br>
        
    - **Mathematical Explanation:**
        - The cell state is a vector that evolves over time as the LSTM processes a sequence. It's updated through a combination of the input gate, forget gate, and cell state candidate. 
            - The input gate decides what new information should be added to the cell state, 
            - while the forget gate determines what information from the previous cell state should be discarded. 
            - The cell state candidate proposes new information to be included. 
        - We will look at the cell state update equation later.
        
* **


3. **Cell State Candidate $(\tilde{C}_t \text{ or } g)$:**

    - **Intuitive Explanation:**
        - Think of the cell state candidate as a suggestion box for new information. It's like having a space where the LSTM cell considers what new details might be important to remember. This candidate information is generated by combining the current input and the previous hidden state in a meaningful way.<br></br>
        
    - **Mathematical Explanation:**
        - The cell state candidate $(\tilde{C}_t \text{ or } g)$ is calculated using the `tanh` activation function. Weight matrices $(W_{xc} \text{ and } W_{hc})$ and a bias term $(b_c)$ control the influence of the current input $(x_t)$ and the previous hidden state $(h_{t-1})$ on the candidate values.
        
        - Equation: $\tilde{C}_t = \tanh(W_{xc}x_t + W_{hc}h_{t-1} + b_c)$


4. **Cell State Update $(C_t)$:**

    - **Intuitive Explanation:**
        - ***The cell state is like a long-term memory that gets updated based on the input gate, forget gate, and the cell state candidate.*** It decides what to remember, forget, and update for the current step.<br></br>
        
    - **Mathematical Explanation:**
        - The new cell state $(C_t)$ is the combination of previous cell state $(C_{t-1})$, modified by the forget gate $(f_t)$ to discard some information, and updated by the input gate $(i_t)$ and the cell state candidate $(\tilde{C}_t)$ to add new information.
        
        - Equation: $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$
        - $\odot$ denotes element-wise multiplication


5. **Output Gate $(o_t)$:**

    - **Intuitive Explanation:**
        - ***The output gate controls how much of the cell state should be exposed to the current hidden state. It determines what part of the cell state is important for the current prediction.***<br></br>
        
    - **Mathematical Explanation:**
        - The output gate $(o_t)$ is calculated using the `sigmoid` activation function. Weight matrices $(W_{xo} \text{ and } W_{ho})$ and a bias term $(b_o)$ decide how the current input $(x_t)$ and the previous hidden state $(h_{t-1})$ influence the output gate's decision.
        
        - Equation: $o_t = \sigma(W_{xo}x_t + W_{ho}h_{t-1} + b_o)$
        
        
6. **Hidden State $(h_t)$:**

    - **Intuitive Explanation:**
        - ***The hidden state is the cell's version of the current thought or understanding. It's influenced by the output gate and the modified cell state, carrying the relevant memory information for the current prediction.***<br></br>
        
    - **Mathematical Explanation:**
        - The new hidden state $(h_t)$ is obtained by passing the updated cell state $(C_t)$ through the `tanh` activation function and multiplying the result element-wise by the output gate $(o_t)$.
        
        - Equation: $h_t = o_t \odot \tanh(C_t)$

In summary, an LSTM uses these equations to decide what to remember, forget, and update in a sequence of data. The gates and candidate values work together to control how information flows through the cell and how memory is maintained over time steps.
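
To see the gating at work, it can help to plug in hand-picked numbers. In the sketch below the gate values are chosen by hand (assumed, not computed from any input) so that each component of a three-dimensional cell state shows a different behavior: keep, overwrite, and blend.

```python
import numpy as np

C_prev  = np.array([2.0, -1.0,  0.5])   # previous cell state
C_tilde = np.array([0.9,  0.9, -0.9])   # candidate values (tanh output, so within [-1, 1])

f_t = np.array([1.0, 0.0, 0.5])         # forget gate: keep all, drop all, keep half
i_t = np.array([0.0, 1.0, 0.5])         # input gate:  add none, add all, add half
o_t = np.array([1.0, 0.0, 1.0])         # output gate: expose, hide, expose

C_t = f_t * C_prev + i_t * C_tilde      # cell state update
h_t = o_t * np.tanh(C_t)                # hidden state

print("C_t:", C_t)   # approximately [ 2.0, 0.9, -0.2]
print("h_t:", h_t)   # approximately [ 0.964, 0.0, -0.197]
```

The first component is carried over untouched, the second is replaced entirely by the candidate, and the third is a mix of old and new. The second component also shows that the cell can hold a value (0.9) while the output gate hides it from the hidden state.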

### 2.6 Memory in LSTMs:

In the context of Long Short-Term Memory (LSTM) networks, the "cell state" is often referred to as the long-term memory. The hidden state plays the role of "short-term memory", although the architecture does not label it as a separate memory component.

Here's a more detailed explanation:

1. **Long-Term Memory (Cell State):** The cell state in an LSTM is designed to capture and store information that is relevant over longer sequences. It acts as a memory that can maintain important context and dependencies across different time steps. The cell state evolves and gets updated with each step, incorporating new information through the input gate and retaining or discarding information through the forget gate.

2. **Short-Term Memory:** While LSTMs are specialized for capturing long-term dependencies, the hidden state ($h_t$) serves as a form of short-term memory within the LSTM architecture. The hidden state carries information about the current step's input, as well as relevant context from previous steps. However, this short-term memory is often less emphasized compared to the cell state, which focuses on capturing more persistent patterns and context.

In summary, within the LSTM framework, the cell state is primarily associated with long-term memory due to its ability to retain important information across longer sequences. The hidden state can be thought of as representing a form of short-term memory, capturing recent information and context for the current prediction. However, the division between short-term and long-term memory is not as explicit in the LSTM architecture as it might be in human memory models.

### 2.7 How do LSTMs make predictions?

The output gate in a Long Short-Term Memory (LSTM) network does not produce predictions by itself, but it shapes the hidden state from which predictions are made, for example when predicting the next word in a sequence, a common task in natural language processing.

The output gate plays a crucial role in determining how much of the cell state's information is used to influence the current hidden state $(h_t)$. This hidden state is often considered the LSTM's current understanding or representation of the input sequence up to that point.

In the context of predicting the next word in a sentence:

- The LSTM processes the sequence of words up to the current point.

- The output gate determines which parts of the accumulated information in the cell state are relevant for generating a prediction.

- By allowing only relevant information to pass through, the LSTM's hidden state focuses on the most important features of the sequence.

- This hidden state is then typically passed through a linear (fully connected) layer that produces one score (logit) for each possible next word.

- A softmax activation function is applied to these scores to convert them into a probability distribution over the possible next words.

So, in essence, the output gate helps the LSTM decide which parts of the cell state's long-term memory are most relevant for generating predictions at the current time step. It's a critical component that enables the LSTM to make informed decisions based on the context it has learned from the input sequence.
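
The last two steps, turning a hidden state into a distribution over words, can be sketched as follows. The vocabulary, the sizes, and the random output weights `W_hy` and `b_y` are assumptions for illustration, and `h_t` is a random stand-in for the hidden state an LSTM would produce. Because nothing here is trained, the predicted word is arbitrary.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

def next_word_distribution(h_t, W_hy, b_y):
    """Linear output layer followed by softmax."""
    logits = W_hy @ h_t + b_y   # one score per word in the vocabulary
    return softmax(logits)

vocab = ["the", "weather", "is", "beautiful", "today"]  # toy vocabulary (assumed)
rng = np.random.default_rng(1)
n_h = 8
h_t = np.tanh(rng.normal(size=n_h))                     # stand-in for an LSTM hidden state
W_hy = rng.normal(scale=0.5, size=(len(vocab), n_h))    # untrained output weights
b_y = np.zeros(len(vocab))

probs = next_word_distribution(h_t, W_hy, b_y)
predicted = vocab[int(np.argmax(probs))]

for word, prob in zip(vocab, probs):
    print(f"{word:>10s}: {prob:.3f}")
print("predicted next word:", predicted)
```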

### 2.8 Understanding LSTM via an Example

***Predicting next word in a long sentence***

Let's dive deep into the concept of Long Short-Term Memory (LSTM) using the example of predicting the next word in a long sentence. This example will help us understand how an LSTM processes sequential data, retains context, and generates predictions.

- **Scenario:** Imagine we're building a language model that predicts the next word in a sentence. 

Let's walk through the LSTM process step by step:

1. **Input Encoding:**
    - Suppose our sentence is: ***"The weather is beautiful today."*** 
    
    - Each word in this sentence is a discrete time step. We convert each word into a numerical representation using techniques like word embeddings. These embeddings are fed into the LSTM one at a time.<br></br>

2. **LSTM Cells:**
    - At each time step, the LSTM cell processes the current word's embedding along with the previous hidden state $(h_{t-1})$ and cell state $(C_{t-1})$.<br></br>

3. **Forget Gate:**
    - The forget gate $(f_t)$ determines what information to retain from the previous cell state $(C_{t-1})$. It considers the previous hidden state and the current word's embedding to decide which past information is still relevant.<br></br>

4. **Input Gate:**
    - The input gate $(i_t)$ evaluates the current word's embedding and the previous hidden state to decide what new information should be added to the cell state. It identifies relevant context for the current prediction.<br></br>

5. **Cell State Update:**
    - The cell state update equation combines the previous cell state scaled by the forget gate with the candidate values scaled by the input gate. It decides what to forget and what to add to the cell state, shaping its content.<br></br>

6. **Output Gate:**
    - The output gate $(o_t)$ determines which parts of the cell state should influence the current hidden state $(h_t)$. It identifies context that is important for generating predictions.<br></br>

7. **Hidden State and Prediction:**
    - The hidden state $(h_t)$ is the LSTM's current understanding of the sequence up to the current time step. It captures context, relationships, and patterns. This hidden state is then used to predict the next word in the sequence.<br></br>

8. **Generating Predictions:**
    - The hidden state $(h_t)$ is passed through a linear output layer followed by a softmax activation function, which converts it into a probability distribution over the vocabulary of words. Each word in the vocabulary corresponds to a possible next word in the sentence. The word with the highest probability is chosen as the predicted next word.<br></br>

9. **Next Time Step:**
    - The LSTM proceeds to the next time step (word) in the sequence. The updated hidden state $(h_t)$ and cell state $(C_t)$ from the current time step are used as the previous hidden state and cell state for the next time step.<br></br>

10. **Repeating the Process:**
    - The process of processing each word through the LSTM, updating the hidden state and cell state, and generating predictions is repeated until the entire sentence is processed.<br></br>

11. **Capturing Dependencies:**
    - Throughout this process, the LSTM captures dependencies between words across different time steps. For instance, when predicting the word "beautiful," the LSTM will have learned to consider the word "weather" from earlier in the sentence.<br></br>

12. **Long-Range Context:**
    - Thanks to the cell state and the mechanism of gates, the LSTM can capture long-range dependencies. It can remember relevant information from the beginning of the sentence even when predicting the last word.<br></br>


In summary, an LSTM processes sequential data by iteratively updating its hidden state and cell state, using forget gates, input gates, and output gates to control the flow of information. This allows it to capture context, dependencies, and patterns in the data, making it a powerful tool for tasks like predicting the next word in a sentence.
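
The walkthrough above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not a trained model: the weights are random, and the toy vocabulary, the dimension sizes, the parameter names (`W_out`, `b_out`, the `params` dictionary), and the helper names `lstm_step` and `predict_next` are all hypothetical. It assumes the standard LSTM formulation (forget gate, input gate, candidate cell state, output gate) and applies a linear output layer before the softmax.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step. `p` maps a block name to (W_x, W_h, b)."""
    def pre(name):
        W_x, W_h, b = p[name]
        return W_x @ x_t + W_h @ h_prev + b

    f_t = sigmoid(pre("f"))              # forget gate
    i_t = sigmoid(pre("i"))              # input gate
    c_tilde = np.tanh(pre("c"))          # candidate cell state
    o_t = sigmoid(pre("o"))              # output gate
    c_t = f_t * c_prev + i_t * c_tilde   # cell state update
    h_t = o_t * np.tanh(c_t)             # hidden state
    return h_t, c_t

def predict_next(h_t, W_out, b_out):
    """Linear output layer followed by softmax over the vocabulary."""
    probs = softmax(W_out @ h_t + b_out)
    return int(np.argmax(probs)), probs  # greedy choice

# Toy setup: every value here is an assumption for illustration only.
rng = np.random.default_rng(0)
vocab = ["the", "weather", "is", "beautiful", "today"]
embed_dim, hidden_dim = 4, 8
embeddings = rng.normal(size=(len(vocab), embed_dim))
params = {
    name: (rng.normal(scale=0.1, size=(hidden_dim, embed_dim)),
           rng.normal(scale=0.1, size=(hidden_dim, hidden_dim)),
           np.zeros(hidden_dim))
    for name in ("f", "i", "c", "o")
}
W_out = rng.normal(scale=0.1, size=(len(vocab), hidden_dim))
b_out = np.zeros(len(vocab))

h = np.zeros(hidden_dim)
c = np.zeros(hidden_dim)
for word in ["the", "weather", "is"]:
    x = embeddings[vocab.index(word)]
    h, c = lstm_step(x, h, c, params)  # state is carried to the next step
    next_id, probs = predict_next(h, W_out, b_out)

print(vocab[next_id], probs.round(3))
```

Because the weights are untrained, the predicted word is arbitrary. The point is the data flow: the same `h` and `c` are passed from one time step to the next, and the prediction is read from `h` at each step.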

## **3. Gated Recurrent Unit (GRU):**

### 3.1 What is Gated Recurrent Unit (GRU)?

**Gated Recurrent Unit (GRU):**

- The Gated Recurrent Unit (GRU) is another variant of the RNN architecture that, like the LSTM, is designed to address the vanishing gradient problem. GRUs are somewhat simpler than LSTMs and consist of two main gates:

    1. **Update Gate:** Determines how much of the previous state (memory) to keep and how much of the new candidate values to integrate.

    2. **Reset Gate:** Decides which parts of the previous state should be ignored in favor of the new input, i.e., how to combine the new input with the previous memory.

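- GRUs strike a balance between simplicity and effectiveness. For the same hidden size they have fewer parameters than LSTMs (three blocks of weights instead of four, and no separate cell state), which makes them computationally less expensive while still being capable of capturing long-term dependencies.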
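
As a rough illustration of the cost difference, the sketch below counts the recurrent parameters of one layer, assuming each gate or candidate block consists of an input matrix, a hidden matrix, and one bias vector. The helper name `recurrent_param_count` and the layer sizes are hypothetical, and some library implementations keep additional bias vectors, so their reported counts can differ slightly.

```python
def recurrent_param_count(input_dim, hidden_dim, num_blocks):
    """Parameters for `num_blocks` sets of (W_x, W_h, b)."""
    per_block = hidden_dim * input_dim + hidden_dim * hidden_dim + hidden_dim
    return num_blocks * per_block

input_dim, hidden_dim = 128, 256  # hypothetical layer sizes

# GRU: update gate, reset gate, candidate hidden state -> 3 blocks
gru_params = recurrent_param_count(input_dim, hidden_dim, num_blocks=3)
# LSTM: input, forget, and output gates plus candidate cell state -> 4 blocks
lstm_params = recurrent_param_count(input_dim, hidden_dim, num_blocks=4)

print(gru_params, lstm_params, gru_params / lstm_params)
# 295680 394240 0.75
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same size.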

* **

**Intuitive Explanation:**
>Think of GRUs as a more streamlined version of LSTMs. Imagine a smart door that knows when to let new information in, when to update its memory, and when to forget things. GRUs are like having an intuitive doorman who decides whether to open the door wide, keep it partly open, or close it altogether based on the situation.

**Mathematical Explanation:**
> GRUs simplify the LSTM architecture by using update and reset gates. The update gate controls how much of the previous hidden state is carried forward, and the reset gate determines which parts of the previous state should be ignored. This allows GRUs to capture relevant information while being computationally efficient.

* **

In summary, RNNs, LSTMs, and GRUs are all neural network architectures that are specialized for processing sequential data. 

LSTMs and GRUs were developed to overcome the limitations of traditional RNNs in handling long-term dependencies and vanishing gradients. 

LSTMs achieve this through memory cells and three gating mechanisms (input, forget, output), while GRUs use two gates (update, reset) for similar purposes. 

These architectures have significantly improved the capabilities of neural networks in tasks involving sequences like language modeling, speech recognition, and more.

* **

### 3.2 GRU Cell

[Dive into Deep Learning - GRU](https://d2l.ai/chapter_recurrent-modern/gru.html)

<div align='center'>
    <img src='images/gru_cell.png' width=750/>
</div>

The GRU cell has simpler equations compared to the LSTM but still captures the essential dynamics. Here's a breakdown of the key equations:

1. **Update Gate $(z_t)$:**
   - This gate decides how much of the previous hidden state to retain and how much of the new input to integrate.
   - It is calculated using the `sigmoid` activation function.
   - Equation: $z_t = \sigma(W_{xz}x_t + W_{hz}h_{t-1} + b_z)$ <br></br>

2. **Reset Gate $(r_t)$:**
   - This gate determines which parts of the previous hidden state should be ignored in favor of the new input.
   - It is calculated using the `sigmoid` activation function.
   - Equation: $r_t = \sigma(W_{xr}x_t + W_{hr}h_{t-1} + b_r)$ <br></br>

3. **Candidate Hidden State $(\tilde{h}_t)$:**
   - This is the new candidate hidden state that combines the new input and the previous hidden state.
   - It is calculated using the `tanh` activation function.
   - Equation: $\tilde{h}_t = \tanh(W_{xh}x_t + W_{hh}(r_t \odot h_{t-1}) + b_h)$, where $\odot$ denotes element-wise multiplication <br></br>

4. **Hidden State $(h_t)$:**
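   - The final hidden state is an element-wise blend of the previous hidden state and the candidate hidden state, weighted by the update gate. With the convention used here, $z_t$ close to 0 keeps the previous state and $z_t$ close to 1 replaces it with the candidate.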
   - Equation: $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$

These equations represent the core mathematical operations of the GRU cell. They define how information flows through the gates and how memory is updated and maintained over time steps.
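
The four equations can be sketched as a single NumPy function. This is a minimal illustration with random, untrained weights. The function name `gru_step`, the parameter dictionary `p`, and the dimension sizes are assumptions for the example, and the candidate state uses the form $W_{hh}(r_t \odot h_{t-1})$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU time step: update gate, reset gate, candidate, new state."""
    z_t = sigmoid(p["W_xz"] @ x_t + p["W_hz"] @ h_prev + p["b_z"])  # update gate
    r_t = sigmoid(p["W_xr"] @ x_t + p["W_hr"] @ h_prev + p["b_r"])  # reset gate
    # Candidate state: the reset gate scales h_prev before W_hh is applied
    h_tilde = np.tanh(p["W_xh"] @ x_t + p["W_hh"] @ (r_t * h_prev) + p["b_h"])
    # New state: element-wise blend of the old state and the candidate
    return (1.0 - z_t) * h_prev + z_t * h_tilde

# Hypothetical sizes and random weights, for illustration only
rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 5
p = {}
for gate in ("z", "r", "h"):
    p[f"W_x{gate}"] = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
    p[f"W_h{gate}"] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
    p[f"b_{gate}"] = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)                       # initial hidden state
for x_t in rng.normal(size=(4, input_dim)):    # a sequence of 4 inputs
    h = gru_step(x_t, h, p)

print(h)
```

Unlike the LSTM, there is no separate cell state: the hidden state `h` is the only quantity carried from one time step to the next.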

* **

### 3.3 Weight Matrices in GRU:

In Gated Recurrent Units (GRUs), weight matrices and bias terms appear in the update gate, the reset gate, and the candidate hidden state:

1. **$W_{xz}, W_{hz}, b_z$:** These matrices and bias terms control how the input $(x_t)$ and the previous hidden state $(h_{t-1})$ influence the update gate. The update gate decides how much of the previous hidden state to retain and how much of the new input to integrate.

2. **$W_{xr}, W_{hr}, b_r$:** These matrices and bias terms are associated with the reset gate. They determine how the input and previous hidden state influence the reset gate, which decides which parts of the previous hidden state to ignore.

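3. **$W_{xh}, W_{hh}, b_h$:** These matrices and the bias term are used to generate the candidate hidden state $(\tilde{h}_t)$ from the input and the reset-gated previous hidden state. The candidate is then blended with the previous hidden state, weighted by the update gate, to compute the new hidden state $(h_t)$.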

In all cases, the values in these weight matrices are learned during training through optimization algorithms like gradient descent. The values determine the strength and direction of the connections between different parts of the network, ultimately affecting the network's ability to learn and represent patterns in the data.
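
To make the roles of these parameters concrete, the sketch below builds one set of GRU parameters and prints their shapes. The helper name `init_gru_params`, the small random initialization, and the sizes are assumptions for the example. The shapes follow from the equations: every `W_x*` matrix maps an input vector to the hidden size, every `W_h*` matrix maps a hidden vector to the hidden size, and every bias has the hidden size.

```python
import numpy as np

def init_gru_params(input_dim, hidden_dim, seed=0):
    """Create randomly initialized GRU parameters (before any training)."""
    rng = np.random.default_rng(seed)
    params = {}
    for gate in ("z", "r", "h"):  # update gate, reset gate, candidate state
        params[f"W_x{gate}"] = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
        params[f"W_h{gate}"] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
        params[f"b_{gate}"] = np.zeros(hidden_dim)
    return params

params = init_gru_params(input_dim=3, hidden_dim=5)
for name, value in params.items():
    print(f"{name}: {value.shape}")
```

Training would then adjust every entry of these nine arrays. The values shown here are only the starting point.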

* **

### 3.4 GRU in a bit more depth

<div align='center'>
    <img src='images/gru_cell.png' width=750/>
</div>

The key equations of a Gated Recurrent Unit (GRU), each explained intuitively and mathematically:

1. **Reset Gate $(r_t)$:**

    - **Intuitive Explanation:**
        - The reset gate determines what information from the previous hidden state should be ignored or "reset." It's like deciding which parts of the past should be temporarily put aside to process the current input more effectively.<br></br>

    - **Mathematical Explanation:**
        - The reset gate's value $(r_t)$ is calculated using the sigmoid activation function, which considers the current input $(x_t)$ and the previous hidden state $(h_{t-1})$. Weight matrices $(W_{xr} \text{ and } W_{hr})$ and a bias term $(b_r)$ control the influence:

        - Equation:  $r_t = \sigma(W_{xr}x_t + W_{hr}h_{t-1} + b_r)$

2. **Update Gate $(z_t)$:**

    - **Intuitive Explanation:**
        - The update gate decides how much of the previous hidden state should be combined with the new candidate hidden state. It determines how much of the past information to retain and how much to replace with new information.<br></br>

    - **Mathematical Explanation:**
        - The update gate's value $(z_t)$ is calculated using the sigmoid activation function, similar to the reset gate. Weight matrices $(W_{xz} \text{ and } W_{hz})$ and a bias term $(b_z)$ control the influence of the current input $(x_t)$ and the previous hidden state $(h_{t-1})$:

        - Equation: $z_t = \sigma(W_{xz}x_t + W_{hz}h_{t-1} + b_z)$

3. **Candidate Hidden State $(\tilde{h}_t)$:**

    - **Intuitive Explanation:**
        - The candidate hidden state represents new information that could be added to the hidden state. It's like a suggestion for an updated understanding of the current input.<br></br>

    - **Mathematical Explanation:**
        - The candidate hidden state $(\tilde{h}_t)$ is calculated using the hyperbolic tangent (tanh) activation function. Weight matrices $(W_{xh} \text{ and } W_{hh})$ and a bias term $(b_h)$ control the transformation of the current input $(x_t)$ and the reset-gated previous hidden state $(r_t \odot h_{t-1})$:

        - Equation: $\tilde{h}_t = \tanh(W_{xh}x_t + W_{hh}(r_t \odot h_{t-1}) + b_h)$
        - $(\odot)$ represents element-wise multiplication

4. **Hidden State $(h_t)$:**

    - **Intuitive Explanation:**
        - The hidden state is like the GRU's current thought or understanding of the input sequence up to the current time step. It's influenced by the update gate and the candidate hidden state, carrying relevant information for making predictions.<br></br>

    - **Mathematical Explanation:**
        - The new hidden state $(h_t)$ is an element-wise weighted average of the previous hidden state $(h_{t-1})$ and the candidate hidden state $(\tilde{h}_t)$, where the update gate $(z_t)$ sets how much of the candidate to incorporate:

        - Equation: $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$

In summary, a GRU uses these components to manage its memory, decide which information to retain, replace, or update, and generate a meaningful hidden state for predictions. This architecture enables GRUs to capture sequential patterns and dependencies in data.
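
The intuition behind the two gates is easiest to see in their limiting cases. The sketch below forces the gates to 0 or 1 by hand instead of computing them, and checks what the candidate and hidden-state equations above then produce. The helper names `candidate` and `blend` and all numeric values are assumptions for the example.

```python
import numpy as np

def candidate(x_t, h_prev, r_t, W_xh, W_hh, b_h):
    """Candidate hidden state with a reset-gated previous state."""
    return np.tanh(W_xh @ x_t + W_hh @ (r_t * h_prev) + b_h)

def blend(h_prev, h_tilde, z_t):
    """New hidden state: weighted average controlled by the update gate."""
    return (1.0 - z_t) * h_prev + z_t * h_tilde

# Hypothetical sizes and random values, for illustration only
rng = np.random.default_rng(1)
input_dim, hidden_dim = 2, 3
W_xh = rng.normal(size=(hidden_dim, input_dim))
W_hh = rng.normal(size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
x_t = rng.normal(size=input_dim)
h_prev = rng.uniform(-1, 1, size=hidden_dim)

zeros, ones = np.zeros(hidden_dim), np.ones(hidden_dim)

# r_t = 0: the candidate depends on the current input only
h_tilde = candidate(x_t, h_prev, zeros, W_xh, W_hh, b_h)
print(np.allclose(h_tilde, np.tanh(W_xh @ x_t + b_h)))      # True

# z_t = 0: the previous hidden state is copied unchanged
print(np.allclose(blend(h_prev, h_tilde, zeros), h_prev))   # True

# z_t = 1: the previous hidden state is fully replaced by the candidate
print(np.allclose(blend(h_prev, h_tilde, ones), h_tilde))   # True
```

In a trained GRU the gates take values between these extremes, and they do so separately for each hidden unit, so some units can hold long-term information while others track the most recent input.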

* **