
# Recurrent Neural Networks (RNNs): Long-Term Dependencies and the Vanishing Gradient Problem

This notebook provides a detailed explanation of the long-term dependency issue in simple Recurrent Neural Networks (RNNs), specifically focusing on the **vanishing gradient problem**. It also breaks down the underlying mathematical formulas that lead to this limitation, combining insights from the provided Udemy video transcription.

---

## 1. Introduction to Long-Term Dependency in RNNs

Recurrent Neural Networks (RNNs) are designed to process sequential data, where the output at a given step depends on previous computations. They achieve this by maintaining a "hidden state" that captures information from prior time steps.

However, simple RNNs face a significant challenge: they struggle to effectively capture **long-term dependencies**. This means that information from earlier parts of a long sequence tends to have a diminishing or negligible impact on predictions made much later in the sequence. This limitation makes simple RNNs unsuitable for use cases where understanding distant context is crucial (e.g., predicting the next word in a sentence based on context set many words earlier).

To clarify this abstract concept, let's dive into a concrete example and the underlying mathematics.

---

## 2. Setting up the RNN Example

Consider a simple RNN processing a sequence of words in a sentence. Let's assume a sentence has $T$ words, $X_1, X_2, \dots, X_T$.

**Unfolding the RNN over Time:**

The RNN processes each word sequentially, unfolding its structure over time steps:

* **At time $t=1$:** Input is $X_1$.
* **At time $t=2$:** Input is $X_2$.
* ...
* **At time $t=T$:** Input is $X_T$, producing the final output $\hat{y}_T$.

At each time step $t$, the RNN calculates a hidden state $O_t$ (often written $h_t$) based on the current input $X_t$ and the previous hidden state $O_{t-1}$. This hidden state is then used to compute an output $\hat{y}_t$.

The calculation of the hidden state $O_t$ often involves an activation function (e.g., Sigmoid or Tanh):

$O_t = \text{Activation}(X_t \cdot W_I + O_{t-1} \cdot W_H + B_H)$

Where:
* $X_t$: Input vector at time $t$.
* $W_I$: Input weight matrix, transforming the input into the hidden state space.
* $O_{t-1}$: Hidden state vector from the previous time step.
* $W_H$: Recurrent weight matrix, transforming the previous hidden state into the current hidden state. The same matrix is reused (it "recurs") at every time step.
* $B_H$: Bias vector for the hidden layer.
* $\text{Activation}(\cdot)$: An element-wise non-linear activation function (e.g., Sigmoid, Tanh).

The output $\hat{y}_t$ is typically calculated from the hidden state $O_t$ through an output layer:

$\hat{y}_t = \text{Activation}(O_t \cdot W_O + B_O)$

Where:
* $W_O$: Output weight matrix.
* $B_O$: Bias vector for the output layer.
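
The two formulas above can be sketched as a small forward pass. This is a minimal illustration, assuming scalar weights, a Sigmoid activation for both layers, and an initial hidden state of 0; the function name `rnn_forward` and all numeric values are made up for the example.

```python
import math

def sigmoid(z):
    """Logistic sigmoid for a scalar input."""
    return 1.0 / (1.0 + math.exp(-z))

def rnn_forward(xs, w_i, w_h, b_h, w_o, b_o, o_init=0.0):
    """Run a scalar simple RNN over xs; return hidden states and outputs."""
    hidden_states, outputs = [], []
    o_prev = o_init  # assumed initial hidden state O_0
    for x_t in xs:
        # O_t = Sigmoid(X_t * W_I + O_{t-1} * W_H + B_H)
        o_t = sigmoid(x_t * w_i + o_prev * w_h + b_h)
        # y_hat_t = Sigmoid(O_t * W_O + B_O)
        y_hat_t = sigmoid(o_t * w_o + b_o)
        hidden_states.append(o_t)
        outputs.append(y_hat_t)
        o_prev = o_t  # the hidden state is carried to the next step
    return hidden_states, outputs

# Illustrative inputs and weights (not taken from any trained model).
xs = [0.5, -1.0, 0.25, 2.0]
hidden_states, outputs = rnn_forward(xs, w_i=0.8, w_h=0.5, b_h=0.1, w_o=1.2, b_o=-0.3)
for t, (o_t, y_t) in enumerate(zip(hidden_states, outputs), start=1):
    print(f"t={t}: O_t={o_t:.4f}, y_hat_t={y_t:.4f}")
```

Note that the same `w_i`, `w_h`, and `b_h` are used at every step; only the hidden state changes as the sequence is consumed.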

---

## 3. Weight Updates and Backpropagation Through Time (BPTT)

Training an RNN, like any neural network, involves updating its weights ($W_I, W_H, W_O$) to minimize a loss function (e.g., cross-entropy for classification, mean squared error for regression). This is typically done using an optimization algorithm like Gradient Descent, which relies on calculating the gradients of the loss function with respect to each weight.

The generic weight update formula is:

$W_{new} = W_{old} - \text{learning\_rate} \times \frac{\partial \text{Loss}}{\partial W_{old}}$

The term $\frac{\partial \text{Loss}}{\partial W_{old}}$ is the **gradient**, which indicates the direction and magnitude of the steepest ascent of the loss function. To minimize the loss, we move in the opposite direction of the gradient.
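
As a quick illustration of the update rule, the sketch below applies it repeatedly to a toy one-parameter loss. The loss $\text{Loss}(w) = (w - 3)^2$, the learning rate, and the helper name `gradient_descent_step` are assumptions chosen only to make the arithmetic easy to follow.

```python
def gradient_descent_step(w_old, grad, learning_rate):
    """W_new = W_old - learning_rate * dLoss/dW_old."""
    return w_old - learning_rate * grad

# Toy loss (an assumption): Loss(w) = (w - 3)^2, so dLoss/dw = 2 * (w - 3).
w = 0.0
learning_rate = 0.1
for step in range(50):
    grad = 2.0 * (w - 3.0)
    w = gradient_descent_step(w, grad, learning_rate)

print(f"w after 50 steps: {w:.5f}")  # moves toward the minimiser w = 3
```

If `grad` were close to zero at every step, `w` would stay near its starting value, which is exactly what happens to the early-step contributions discussed later.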

In RNNs, this gradient calculation involves a process called **Backpropagation Through Time (BPTT)**, which is essentially backpropagation applied across the unfolded time steps.

### 3.1. Gradient for Output Weights ($W_O$)

Calculating the gradient for the output weights ($W_O$) is relatively straightforward because $W_O$ directly affects the output $\hat{y}_t$ at each time step. Using the chain rule:

$\frac{\partial \text{Loss}}{\partial W_O} = \frac{\partial \text{Loss}}{\partial \hat{y}_T} \times \frac{\partial \hat{y}_T}{\partial W_O}$ (considering the loss at the final time step $T$)

This chain is short and typically doesn't pose a vanishing gradient problem on its own.

### 3.2. Gradient for Recurrent (Hidden) Weights ($W_H$) - The Core of the Vanishing Gradient Problem

This is where the long-term dependency issue manifests. The recurrent weight $W_H$ is used at *every* time step to compute the current hidden state from the previous one. Therefore, its influence propagates across all time steps, and its gradient needs to accumulate contributions from all these connections.

Consider the task of updating $W_H$ based on the loss calculated at the final time step $T$ (e.g., $T=50$ words in a sentence). To capture the influence of $W_H$ from an early time step (e.g., $t=1$) on the final loss, the chain rule becomes very long:

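$\frac{\partial \text{Loss}_T}{\partial W_H} = \sum_{k=1}^{T} \frac{\partial \text{Loss}_T}{\partial O_T} \times \frac{\partial O_T}{\partial O_{T-1}} \times \frac{\partial O_{T-1}}{\partial O_{T-2}} \times \dots \times \frac{\partial O_{k+1}}{\partial O_k} \times \frac{\partial O_k}{\partial W_H}$

Here the last factor $\frac{\partial O_k}{\partial W_H}$ is the direct dependence of $O_k$ on $W_H$ (with $O_{k-1}$ held fixed), and the $k$-th term of the sum contains $T-k$ factors of the form $\frac{\partial O_t}{\partial O_{t-1}}$.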

Let's focus on a single term of this sum: the path that carries the influence of $W_H$ at an early time step (here $k=2$) to the final output, as demonstrated in the transcription:

$\left. \frac{\partial \text{Loss}}{\partial W_H} \right|_{k=2} = \frac{\partial \text{Loss}}{\partial \hat{y}_T} \times \frac{\partial \hat{y}_T}{\partial O_T} \times \frac{\partial O_T}{\partial O_{T-1}} \times \frac{\partial O_{T-1}}{\partial O_{T-2}} \times \dots \times \frac{\partial O_2}{\partial W_H}$

The crucial terms in this long product are of the form $\frac{\partial O_t}{\partial O_{t-1}}$, which represents how the hidden state at time $t$ changes with respect to the hidden state at time $t-1$.
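
The sum over paths can be checked numerically. The sketch below is a minimal scalar version, assuming Sigmoid activations, an initial hidden state of 0, and a squared-error loss at the final step only; the function names and all numbers are hypothetical. It accumulates the gradient of $W_H$ by walking backwards through time and compares the result with a finite-difference estimate.

```python
import math

def sigmoid(z):
    """Logistic sigmoid for a scalar input."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(xs, w_i, w_h, b_h, w_o, b_o):
    """Return hidden states [O_0, ..., O_T] (O_0 = 0 by assumption) and y_hat_T."""
    states = [0.0]
    for x_t in xs:
        states.append(sigmoid(x_t * w_i + states[-1] * w_h + b_h))
    y_hat = sigmoid(states[-1] * w_o + b_o)
    return states, y_hat

def final_loss(xs, target, w_i, w_h, b_h, w_o, b_o):
    """Squared-error loss at the final time step only (a toy choice)."""
    _, y_hat = forward(xs, w_i, w_h, b_h, w_o, b_o)
    return 0.5 * (y_hat - target) ** 2

def grad_w_h(xs, target, w_i, w_h, b_h, w_o, b_o):
    """Backpropagation through time for the scalar recurrent weight W_H."""
    states, y_hat = forward(xs, w_i, w_h, b_h, w_o, b_o)
    # dLoss/dO_T = dLoss/dy_hat_T * dy_hat_T/dO_T
    d_loss_d_state = (y_hat - target) * y_hat * (1.0 - y_hat) * w_o
    grad = 0.0
    for k in range(len(xs), 0, -1):
        sig_prime = states[k] * (1.0 - states[k])  # Sigmoid'(z_k)
        # Direct dependence of O_k on W_H: Sigmoid'(z_k) * O_{k-1}
        grad += d_loss_d_state * sig_prime * states[k - 1]
        # Extend the chain one step back: dO_k/dO_{k-1} = Sigmoid'(z_k) * W_H
        d_loss_d_state *= sig_prime * w_h
    return grad

params = dict(w_i=0.8, w_h=0.5, b_h=0.1, w_o=1.2, b_o=-0.3)
xs, target = [0.5, -1.0, 0.25, 2.0, -0.5], 1.0
analytic = grad_w_h(xs, target, **params)

# Central finite difference in W_H as an independent check.
eps = 1e-6
plus = dict(params, w_h=params["w_h"] + eps)
minus = dict(params, w_h=params["w_h"] - eps)
numeric = (final_loss(xs, target, **plus) - final_loss(xs, target, **minus)) / (2 * eps)
print(f"BPTT: {analytic:.8f}  finite difference: {numeric:.8f}")
```

Each pass through the loop adds one term of the sum and then multiplies the running chain by one more $\frac{\partial O_k}{\partial O_{k-1}}$ factor, which is where the repeated multiplication comes from.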

---

## 4. Decomposing the Recurrent Gradient Term: $\frac{\partial O_t}{\partial O_{t-1}}$

Let's break down the mathematical expression for $\frac{\partial O_t}{\partial O_{t-1}}$. Assuming the activation function for the hidden state is **Sigmoid** (as discussed in the transcription for $O_3$ and $O_2$):

Recall the hidden state calculation:
$O_t = \text{Sigmoid}(X_t \cdot W_I + O_{t-1} \cdot W_H + B_H)$

To find $\frac{\partial O_t}{\partial O_{t-1}}$, we apply the chain rule:

$\frac{\partial O_t}{\partial O_{t-1}} = \frac{\partial}{\partial O_{t-1}} \left( \text{Sigmoid}(X_t \cdot W_I + O_{t-1} \cdot W_H + B_H) \right)$

This involves two parts:
1.  The derivative of the outer function (Sigmoid).
2.  The derivative of the inner function (the argument of Sigmoid) with respect to $O_{t-1}$.

Let $z_t = X_t \cdot W_I + O_{t-1} \cdot W_H + B_H$ (the net input to the activation function at time $t$).

Then, $\frac{\partial O_t}{\partial O_{t-1}} = \frac{d \text{Sigmoid}(z_t)}{d z_t} \times \frac{\partial z_t}{\partial O_{t-1}}$

We know that the derivative of the Sigmoid function, $\text{Sigmoid}'(z)$, is given by $\text{Sigmoid}(z) \times (1 - \text{Sigmoid}(z))$.

Now, let's find $\frac{\partial z_t}{\partial O_{t-1}}$:

$\frac{\partial}{\partial O_{t-1}} (X_t \cdot W_I + O_{t-1} \cdot W_H + B_H)$

Since $X_t \cdot W_I$ and $B_H$ are constants with respect to $O_{t-1}$, their derivatives are zero.
The derivative of $O_{t-1} \cdot W_H$ with respect to $O_{t-1}$ is simply $W_H$. (This treats $W_H$ as a scalar for simplicity. In the vector case, $\frac{\partial z_{t,j}}{\partial O_{t-1,i}} = (W_H)_{ij}$, so the derivative is the matrix $W_H$ itself, up to a transpose depending on the layout convention.)

Therefore, the crucial term becomes:

$\frac{\partial O_t}{\partial O_{t-1}} = \text{Sigmoid}'(z_t) \times W_H$
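
This result is easy to sanity-check numerically. The sketch below assumes the scalar case used in the derivation and arbitrary illustrative values; `hidden_step` is a hypothetical helper name. It compares $\text{Sigmoid}'(z_t) \times W_H$ with a central finite difference of the hidden-state update.

```python
import math

def sigmoid(z):
    """Logistic sigmoid for a scalar input."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Sigmoid'(z) = Sigmoid(z) * (1 - Sigmoid(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def hidden_step(x_t, o_prev, w_i, w_h, b_h):
    """O_t = Sigmoid(X_t * W_I + O_{t-1} * W_H + B_H), scalar case."""
    return sigmoid(x_t * w_i + o_prev * w_h + b_h)

# Illustrative values (assumptions, not from the source).
x_t, o_prev = 0.7, 0.4
w_i, w_h, b_h = 0.8, 0.5, 0.1

z_t = x_t * w_i + o_prev * w_h + b_h
analytic = sigmoid_prime(z_t) * w_h

# Perturb O_{t-1} slightly in both directions and measure the change in O_t.
eps = 1e-6
numeric = (hidden_step(x_t, o_prev + eps, w_i, w_h, b_h)
           - hidden_step(x_t, o_prev - eps, w_i, w_h, b_h)) / (2 * eps)
print(f"analytic: {analytic:.8f}  numeric: {numeric:.8f}")
```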

**Key Properties of $\text{Sigmoid}'(z)$:**

The derivative of the Sigmoid function, $\text{Sigmoid}'(z)$, has a maximum value of **0.25** (when $z=0$) and approaches 0 as $|z|$ increases. Its range is **(0, 0.25]**.
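
The bound on $\text{Sigmoid}'(z)$ can be confirmed by scanning a grid of values. This is only a numerical spot check over an assumed range of $[-10, 10]$, not a proof.

```python
import math

def sigmoid_prime(z):
    """Sigmoid'(z) = Sigmoid(z) * (1 - Sigmoid(z))."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

# Grid from -10 to 10 in steps of 0.01 (the range is an arbitrary choice).
grid = [i / 100.0 for i in range(-1000, 1001)]
values = [sigmoid_prime(z) for z in grid]

max_value = max(values)
argmax = grid[values.index(max_value)]
print(f"max Sigmoid'(z) on the grid: {max_value} at z = {argmax}")
print(f"Sigmoid'(5)  = {sigmoid_prime(5.0):.6f}")
print(f"Sigmoid'(10) = {sigmoid_prime(10.0):.6f}")
```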

---

## 5. The Vanishing Gradient Problem Explained

The issue arises when we multiply many of these $\frac{\partial O_t}{\partial O_{t-1}}$ terms together in the long chain for the gradient of $W_H$.

For instance, if a sentence has 50 words, and we want the gradient to effectively update $W_H$ based on the first word ($X_1$), the chain rule involves multiplying terms like:

$\left( \text{Sigmoid}'(\dots) \times W_H \right)_{\text{for } O_{50} \to O_{49}} \times \left( \text{Sigmoid}'(\dots) \times W_H \right)_{\text{for } O_{49} \to O_{48}} \times \dots \times \left( \text{Sigmoid}'(\dots) \times W_H \right)_{\text{for } O_2 \to O_1}$

Since each $\text{Sigmoid}'(\dots)$ term is at most 0.25, every factor $\text{Sigmoid}'(\dots) \times W_H$ has magnitude below 1 whenever $|W_H| < 4$. With the small weights that are common at initialization (e.g., $|W_H| < 1$, chosen partly to prevent exploding gradients), each factor is below 0.25, and the product of many such terms becomes incredibly small.

**Example:** If, on average, $(\text{Sigmoid}'(\dots) \times W_H)$ yields a value of 0.1 for each step, then for a 50-step sequence, the contribution from the first word would be proportional to $(0.1)^{49}$, which is a number extremely close to zero.
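
The arithmetic in this example can be reproduced directly. The sketch below assumes a constant per-step factor; 0.1 is the value from the example, and 0.25 is the largest factor a Sigmoid unit can produce when $|W_H| = 1$. The helper name `chain_factor` is hypothetical.

```python
def chain_factor(per_step, n_steps):
    """Product of n_steps identical per-step factors."""
    return per_step ** n_steps

for n_steps in (1, 5, 10, 25, 49):
    print(f"{n_steps:>2} steps: 0.1 -> {chain_factor(0.1, n_steps):.3e}, "
          f"0.25 -> {chain_factor(0.25, n_steps):.3e}")

early_word_factor = chain_factor(0.1, 49)   # the example's (0.1)^49
best_case_factor = chain_factor(0.25, 49)   # Sigmoid' at its maximum, |W_H| = 1
```

Even the most favourable Sigmoid case, $0.25^{49} = 2^{-98}$, is far too small to produce a meaningful weight update.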

**Consequences of Vanishing Gradients:**

1.  **Limited Learning from Distant Past:** When the part of the gradient that flows through early time steps approaches zero, its contribution to the term $\text{learning\_rate} \times \frac{\partial \text{Loss}}{\partial W_{old}}$ in the weight update formula also approaches zero. The weights therefore barely change in response to information from early time steps, and the RNN effectively "forgets" the beginning of long sequences.
2.  **Bias Towards Recent Information:** Words or inputs closer to the output time step will have shorter gradient chains. Their gradients will not vanish as much, meaning they will contribute more significantly to weight updates. This causes simple RNNs to prioritize short-term dependencies over long-term ones.
3.  **Suboptimal Performance:** In tasks requiring an understanding of context from many steps ago (e.g., sentiment analysis of a long review where the key positive/negative word appeared at the start), simple RNNs will perform poorly because they cannot effectively use that distant information.

---

## 6. Contrast: Short Chain Rule (for Recent Dependencies)

Consider updating $W_H$ based on the most recent time step, say $T=50$. The chain rule for this would be much shorter:

$\frac{\partial \text{Loss}}{\partial W_H} \propto \frac{\partial \text{Loss}}{\partial \hat{y}_T} \times \frac{\partial \hat{y}_T}{\partial O_T} \times \frac{\partial O_T}{\partial W_H}$

This chain involves fewer multiplications of small derivatives. Consequently, the gradient will be a more significant, non-zero value, allowing the weights to be updated effectively based on recent information. This explains why simple RNNs are better at capturing short-term dependencies.
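
The contrast between short and long chains can be made concrete by computing $\frac{\partial O_T}{\partial O_k}$ for every $k$ in a 50-step sequence. The sketch below assumes a scalar Sigmoid RNN with an initial hidden state of 0; the inputs, the weights, and the name `chain_factors` are illustrative assumptions.

```python
import math

def sigmoid(z):
    """Logistic sigmoid for a scalar input."""
    return 1.0 / (1.0 + math.exp(-z))

def chain_factors(xs, w_i, w_h, b_h):
    """Return f with f[k] = dO_T/dO_k for k = 1..T (index 0 is unused)."""
    states = [0.0]  # assumed initial hidden state O_0
    for x_t in xs:
        states.append(sigmoid(x_t * w_i + states[-1] * w_h + b_h))
    T = len(xs)
    factors = [0.0] * (T + 1)
    factors[T] = 1.0  # dO_T/dO_T: the shortest possible chain
    for k in range(T - 1, 0, -1):
        sig_prime = states[k + 1] * (1.0 - states[k + 1])  # Sigmoid'(z_{k+1})
        # dO_T/dO_k = dO_T/dO_{k+1} * Sigmoid'(z_{k+1}) * W_H
        factors[k] = factors[k + 1] * sig_prime * w_h
    return factors

# A deterministic 50-step input sequence (arbitrary choice for illustration).
xs = [math.sin(t) for t in range(1, 51)]
factors = chain_factors(xs, w_i=0.8, w_h=0.9, b_h=0.1)

for k in (50, 49, 45, 40, 25, 1):
    print(f"dO_50/dO_{k:<2} = {factors[k]:.3e}")
```

The factor shrinks by roughly an order of magnitude every couple of steps, so hidden states near the end of the sequence dominate the gradient while the earliest ones contribute almost nothing.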

---

## 7. Limitations of Simple RNNs Summarized

The **vanishing gradient problem** is the fundamental reason why simple RNNs struggle with:

* **Long-term dependencies:** They cannot effectively "remember" or propagate information from early parts of a long sequence.
* **Tasks requiring long-range context:** Their performance degrades significantly when the relevant information is far away in the input sequence.

---

## 8. Attempts to Mitigate and Advanced Solutions

Initial attempts to address vanishing gradients included:

* **Changing Activation Functions:**
    * **Tanh (Hyperbolic Tangent):** The `tanh` function outputs values between -1 and 1. Its derivative, `tanh'(x)`, lies in **(0, 1]** and reaches 1 only at $x = 0$. While its maximum derivative (1) is higher than sigmoid's (0.25), repeatedly multiplying values below 1 still leads to shrinking gradients over very long sequences. It alleviates the problem slightly but doesn't solve it completely.
    * **ReLU (Rectified Linear Unit) and Leaky ReLU:**
        * **ReLU:** $f(x) = \max(0, x)$. Its derivative is $1$ for $x > 0$ and $0$ for $x < 0$.
        * **Leaky ReLU:** $f(x) = \max(\alpha x, x)$ where $\alpha$ is a small positive constant. Its derivative is $1$ for $x > 0$ and $\alpha$ for $x < 0$.
        * These activation functions help because their derivative is exactly 1 for every positive input, so the activation itself does not shrink the gradient. However, the remaining $W_H$ factors are then multiplied together undamped, which can lead to **exploding gradients** when their magnitude exceeds 1, and these activations don't provide a direct mechanism for selective memory.
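
To compare the activation functions listed above, the sketch below multiplies each derivative by itself over 50 steps. It assumes the pre-activation stays at a fixed value $z = 0.5$ and ignores the $W_H$ factor entirely, so it isolates only the activation's contribution; both choices are simplifications for illustration.

```python
import math

def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def tanh_prime(z):
    return 1.0 - math.tanh(z) ** 2

def relu_prime(z):
    # The value at exactly z = 0 is a convention; 0 is used here.
    return 1.0 if z > 0 else 0.0

def leaky_relu_prime(z, alpha=0.01):
    return 1.0 if z > 0 else alpha

z, steps = 0.5, 50  # fixed pre-activation and chain length (assumptions)
products = {
    "sigmoid": sigmoid_prime(z) ** steps,
    "tanh": tanh_prime(z) ** steps,
    "relu": relu_prime(z) ** steps,
    "leaky_relu": leaky_relu_prime(z) ** steps,
}
for name, value in products.items():
    print(f"{name:>10}: {value:.3e}")
```

Tanh decays far more slowly than sigmoid but still decays, while ReLU and Leaky ReLU leave the product at 1 for positive inputs.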

The most effective solutions came from developing more sophisticated RNN architectures that specifically address the vanishing gradient problem and the related issue of forgetting long-term information:

* **Long Short-Term Memory (LSTM) Networks:** These networks introduce "gates" (input, forget, and output gates) that control the flow of information into and out of a "cell state." This cell state acts as a conveyor belt for information, allowing gradients to flow more easily across many time steps. This largely mitigates the vanishing gradient problem and enables LSTMs to learn long-term dependencies.
* **Gated Recurrent Unit (GRU) Networks:** A simpler variant of LSTMs, GRUs combine the functionality of the forget and input gates into a single "update gate" and also merge the cell state and hidden state. They are generally computationally less expensive than LSTMs while still being highly effective at capturing long-term dependencies.
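
The "conveyor belt" idea can be illustrated with the standard LSTM cell-state update, $c_t = f_t \cdot c_{t-1} + i_t \cdot g_t$, where $f_t$ is the forget gate, $i_t$ the input gate, and $g_t$ the candidate value. The sketch below is a heavy simplification: it treats the gates as fixed scalar numbers instead of computing them from the input and hidden state, and the gate values are hypothetical. Under that assumption, the sensitivity of the final cell state to the initial one is just the product of the forget gates.

```python
def cell_update(c_prev, f_t, i_t, g_t):
    """LSTM cell-state update: c_t = f_t * c_{t-1} + i_t * g_t."""
    return f_t * c_prev + i_t * g_t

def run_cell(c_0, f_t, i_t, g_t, steps):
    """Apply the cell update repeatedly with fixed (hypothetical) gate values."""
    c = c_0
    for _ in range(steps):
        c = cell_update(c, f_t, i_t, g_t)
    return c

steps = 49
f_t, i_t, g_t = 0.98, 0.1, 0.5  # fixed gate values, chosen for illustration

# Sensitivity of the final cell state to the initial one, by finite differences.
eps = 1e-3
sensitivity = (run_cell(1.0 + eps, f_t, i_t, g_t, steps)
               - run_cell(1.0 - eps, f_t, i_t, g_t, steps)) / (2 * eps)

print(f"dc_49/dc_0 with forget gate 0.98: {sensitivity:.4f}")
print(f"simple-RNN factor (0.1)^49:       {0.1 ** steps:.3e}")
```

When the forget gate stays close to 1, the gradient along the cell state survives across many steps, unlike the repeated $\text{Sigmoid}'(\dots) \times W_H$ factors in a simple RNN.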

These advanced RNN architectures are crucial for modern natural language processing and other sequential data tasks where understanding long-range context is paramount. They will be the focus of subsequent discussions.

---

## 9. Conclusion

In summary, simple Recurrent Neural Networks (RNNs), while foundational for sequential data processing, are inherently limited by the **vanishing gradient problem**. This mathematical phenomenon arises from the repeated multiplication of small derivative values during backpropagation through time, particularly when using activation functions like sigmoid or tanh. As a result, gradients from early time steps in a long sequence shrink to near zero, effectively preventing the network from learning and retaining information about long-term dependencies.

This limitation means simple RNNs struggle with tasks that require memory of distant past events in a sequence, leading to reduced accuracy in applications like long-text understanding or complex time-series prediction. The inability to effectively update weights based on early inputs forces the network to rely more heavily on recent information.

To overcome this fundamental challenge, more sophisticated architectures like Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs) were developed. These models introduce gating mechanisms that provide a more controlled and stable flow of gradients across many time steps, thereby allowing them to effectively capture and utilize long-term dependencies. Understanding the vanishing gradient problem in simple RNNs is therefore essential for appreciating the necessity and ingenuity behind LSTMs and GRUs, which form the bedrock of many state-of-the-art sequential data models today.