# ✅ **What are RNNs**

**Recurrent Neural Networks (RNNs)** are a type of neural network designed to handle **sequential data** such as:
- Time series
- Text
- Audio
- Video

Unlike standard feedforward networks, RNNs **have memory**: they carry information about previous inputs in internal **hidden states**, which makes them well suited to problems where **order and context** matter.

---

# Core Idea

RNNs process one element at a time from the input sequence and **pass information forward** through hidden states. This allows them to model **temporal dependencies**.

---


# RNN Architecture & Equations

Given an input sequence:  
$$
x_1, x_2, \dots, x_T
$$

At each time step $t$:

1. **Hidden state update**  
$$
h_t = \tanh(W_{xh} \cdot x_t + W_{hh} \cdot h_{t-1} + b_h)
$$

2. **Output (optional)**  
$$
y_t = W_{hy} \cdot h_t + b_y
$$

Where:
- $x_t$: input at time step $t$
- $h_t$: hidden state at time $t$
- $W_{xh}, W_{hh}, W_{hy}$: weight matrices
- $b_h, b_y$: bias vectors
- $\tanh$: non-linear activation function
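
The two equations above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the layer sizes, the random initialisation, and the helper names `rnn_step` and `rnn_output` are assumptions made for the example.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """Hidden state update: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

def rnn_output(h_t, W_hy, b_y):
    """Optional linear readout: y_t = W_hy h_t + b_y."""
    return W_hy @ h_t + b_y

# Illustrative sizes and random weights (assumptions, not from the notes).
input_size, hidden_size, output_size = 3, 4, 2
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_hy = rng.normal(scale=0.1, size=(output_size, hidden_size))
b_h = np.zeros(hidden_size)
b_y = np.zeros(output_size)

x_t = rng.normal(size=input_size)   # input at one time step
h_prev = np.zeros(hidden_size)      # previous hidden state (here h_0 = 0)

h_t = rnn_step(x_t, h_prev, W_xh, W_hh, b_h)
y_t = rnn_output(h_t, W_hy, b_y)
print(h_t.shape, y_t.shape)  # (4,) (2,)
```

Note the shapes: $W_{xh}$ maps the input to the hidden size, $W_{hh}$ is square, and $W_{hy}$ maps the hidden state to the output size.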

---


# RNN vs ANN

| **Feature**                 | **ANN**                         |     **RNN**                         |
|------------------------|----------------------------------|-------------------------------------|
| Input type             | Fixed-size                      | Sequential                           |
| Memory of past inputs  | None                            |Maintains hidden state                |
| Weight sharing         | No                              | Yes (across time steps)              |
| Variable input length  | No                              | Yes                                  |
| Temporal modeling      | Not supported                   | Supported                            |

---


# ✅ **Forward Propagation in RNN**

**Forward propagation** in an RNN means passing the input sequence step by step through the RNN to compute hidden states and outputs.

---

# Equations

At each time step $t$:

- Hidden state:
$$
h_t = \tanh(W_{xh} \cdot x_t + W_{hh} \cdot h_{t-1} + b_h)
$$

- Output (optional):
$$
y_t = W_{hy} \cdot h_t + b_y
$$

---

# Example (3 time steps)

Given $x_1, x_2, x_3$ and $h_0 = 0$:

1. Compute $h_1, y_1$ from $x_1$ and $h_0$  
2. Compute $h_2, y_2$ from $x_2$ and $h_1$  
3. Compute $h_3, y_3$ from $x_3$ and $h_2$
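
The three steps can be unrolled as a simple loop. The sketch below assumes small random weights and a hypothetical helper `rnn_forward`; only the recurrence itself comes from the equations above.

```python
import numpy as np

def rnn_forward(xs, h0, W_xh, W_hh, W_hy, b_h, b_y):
    """Run a vanilla RNN over a sequence; return all hidden states and outputs."""
    h = h0
    hs, ys = [], []
    for x_t in xs:                                  # same weights reused at every step
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)    # uses the previous h
        hs.append(h)
        ys.append(W_hy @ h + b_y)
    return np.stack(hs), np.stack(ys)

# Illustrative sizes and random weights (assumptions).
rng = np.random.default_rng(1)
input_size, hidden_size, output_size = 2, 3, 1
W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
W_hy = rng.normal(scale=0.5, size=(output_size, hidden_size))
b_h = np.zeros(hidden_size)
b_y = np.zeros(output_size)

xs = rng.normal(size=(3, input_size))   # x_1, x_2, x_3
h0 = np.zeros(hidden_size)              # h_0 = 0

hs, ys = rnn_forward(xs, h0, W_xh, W_hh, W_hy, b_h, b_y)
print(hs.shape)  # (3, 3): h_1, h_2, h_3
print(ys.shape)  # (3, 1): y_1, y_2, y_3
```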

---

# Visualization

(x1, h0) → [RNN] → h1 → y1

(x2, h1) → [RNN] → h2 → y2

(x3, h2) → [RNN] → h3 → y3


---

# Summary

- Each step uses the previous hidden state $h_{t-1}$ and the current input $x_t$
- The same weights are shared at every step
- The hidden state carries context forward, which lets the network model sequences over time

---


# ✅ **Types of RNNs**

RNNs can be structured differently based on input-output sequence format.

---

| Type            | Input        | Output         | Example               |
|-----------------|--------------|----------------|------------------------|
| One-to-One      | Single       | Single         | Image classification  |
| One-to-Many     | Single       | Sequence       | Image captioning      |
| Many-to-One     | Sequence     | Single         | Sentiment analysis    |
| Many-to-Many    | Sequence     | Sequence       | Translation, NER      |
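
In code, the difference between many-to-one and many-to-many largely comes down to which hidden states are read out. The sketch below uses random stand-in hidden states and an assumed readout layer, so the numbers are meaningless; only the shapes matter.

```python
import numpy as np

rng = np.random.default_rng(0)
T, hidden_size, num_classes = 5, 4, 3            # illustrative sizes (assumptions)

hs = np.tanh(rng.normal(size=(T, hidden_size)))  # stand-in for h_1 .. h_T
W_hy = rng.normal(size=(num_classes, hidden_size))
b_y = np.zeros(num_classes)

# Many-to-one: read out only the last hidden state (e.g. one sentiment score per sequence).
y_many_to_one = W_hy @ hs[-1] + b_y
print(y_many_to_one.shape)   # (3,)

# Many-to-many (aligned): read out every hidden state (e.g. one tag per token).
y_many_to_many = hs @ W_hy.T + b_y
print(y_many_to_many.shape)  # (5, 3)
```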




---

#  Many-to-Many RNN: Two Types

| Type                                     | Input Length | Output Length                    | Example                    |
|------------------------------------------|--------------|----------------------------------|----------------------------|
| Fixed-Length (aligned)                   | $T$          | $T$ (one output per input step)  | POS tagging, NER           |
| Variable-Length (sequence-to-sequence)   | $T$          | $T'$ (may differ from $T$)       | Translation, summarization |


---


# ✅ **Backpropagation in RNNs (BPTT)**

Backpropagation in RNNs is called **Backpropagation Through Time (BPTT)**.

---

# Why?

- RNNs share weights across time steps
- Errors are propagated **back through all time steps**
- We must compute how **past hidden states affect future losses**

---

# Steps of BPTT

1. Forward pass to compute:
   - $h_1, h_2, \dots, h_T$
   - $y_1, y_2, \dots, y_T$
2. Compute total loss:
   $$
   \mathcal{L} = \sum_{t=1}^{T} \mathcal{L}_t
   $$
3. Backward pass:
   - Compute gradients w.r.t. the weights $W_{xh}, W_{hh}, W_{hy}$
   - Accumulate them over all time steps
   - Update the weights
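
The steps above can be sketched for a tiny vanilla RNN. Several things here are assumptions made for illustration: a squared-error loss per step, random data, the helper names, and a plain gradient-descent update. A finite-difference check on one entry of $W_{hh}$ is included as a sanity test of the backward pass.

```python
import numpy as np

def forward(xs, params):
    W_xh, W_hh, W_hy, b_h, b_y = params
    h = np.zeros(W_hh.shape[0])
    hs, ys = [h], []                       # hs[0] holds h_0
    for x_t in xs:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        hs.append(h)
        ys.append(W_hy @ h + b_y)
    return hs, ys

def loss_fn(xs, targets, params):
    # Total loss = sum over time of 0.5 * ||y_t - target_t||^2 (assumed loss).
    _, ys = forward(xs, params)
    return sum(0.5 * np.sum((y - t) ** 2) for y, t in zip(ys, targets))

def bptt(xs, targets, params):
    W_xh, W_hh, W_hy, b_h, b_y = params
    hs, ys = forward(xs, params)
    grads = [np.zeros_like(p) for p in params]
    dW_xh, dW_hh, dW_hy, db_h, db_y = grads
    dh_next = np.zeros_like(hs[0])         # gradient arriving from step t+1
    for t in reversed(range(len(xs))):
        dy = ys[t] - targets[t]            # dL_t / dy_t
        dW_hy += np.outer(dy, hs[t + 1])
        db_y += dy
        dh = W_hy.T @ dy + dh_next         # current loss plus all future losses
        da = dh * (1.0 - hs[t + 1] ** 2)   # back through tanh
        dW_xh += np.outer(da, xs[t])       # accumulate over time (shared weights)
        dW_hh += np.outer(da, hs[t])
        db_h += da
        dh_next = W_hh.T @ da              # pass gradient back to h_{t-1}
    return grads

rng = np.random.default_rng(0)
n_in, n_h, n_out, T = 2, 3, 1, 4           # illustrative sizes
shapes = [(n_h, n_in), (n_h, n_h), (n_out, n_h), (n_h,), (n_out,)]
params = [rng.normal(scale=0.5, size=s) for s in shapes]
xs = rng.normal(size=(T, n_in))
targets = rng.normal(size=(T, n_out))

grads = bptt(xs, targets, params)

# Finite-difference check on W_hh[0, 1].
eps = 1e-6
W_hh = params[1]
W_hh[0, 1] += eps
loss_plus = loss_fn(xs, targets, params)
W_hh[0, 1] -= 2 * eps
loss_minus = loss_fn(xs, targets, params)
W_hh[0, 1] += eps                          # restore the original value
numeric = (loss_plus - loss_minus) / (2 * eps)
print("analytic:", grads[1][0, 1], "numeric:", numeric)

# One gradient-descent update (learning rate is an arbitrary choice).
lr = 0.01
for p, g in zip(params, grads):
    p -= lr * g
```

The key line is `dh = W_hy.T @ dy + dh_next`: the gradient at step $t$ combines the loss at step $t$ with everything that flows back from later steps.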

---

# Challenges

- **Vanishing gradients** → can't learn long-term patterns
- **Exploding gradients** → unstable training


# Solutions:
- Gradient clipping
- Use LSTM or GRU instead of basic RNN
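
Gradient clipping is commonly done by rescaling all gradients when their combined norm exceeds a threshold. A minimal sketch, assuming a hypothetical helper `clip_by_global_norm` and made-up gradient values:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))  # leave small gradients untouched
    return [g * scale for g in grads], total_norm

# Made-up gradients whose global norm is sqrt(9 + 16 + 144) = 13.
grads = [np.array([3.0, 4.0]), np.array([[12.0]])]

clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before)  # 13.0
print(norm_after)
```

Clipping changes the size of the update but not its direction, so it addresses exploding gradients only; it does nothing for vanishing gradients.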

---


# ✅ **Problems with Simple RNNs**

Simple (vanilla) RNNs have several issues, especially with long sequences:

---

## 1. Vanishing Gradient

- Gradients shrink exponentially as they are propagated back through many time steps
- Early time steps contribute almost nothing to the weight updates → the RNN fails to learn long-term dependencies
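
To see the shrinkage numerically: for a tanh RNN, each backward step multiplies the gradient by the Jacobian $\text{diag}(1 - h_t^2)\, W_{hh}$. The sketch below accumulates that product over 30 steps. It assumes zero inputs and a recurrent matrix whose singular values all equal 0.9, both chosen only to make the effect easy to see.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, T = 8, 30

# Orthogonal matrix scaled by 0.9: every singular value of W_hh is 0.9 (assumption).
Q, _ = np.linalg.qr(rng.normal(size=(hidden_size, hidden_size)))
W_hh = 0.9 * Q

h = np.tanh(rng.normal(size=hidden_size))   # arbitrary starting state h_0
J = np.eye(hidden_size)                     # accumulates d h_t / d h_0
norms = []
for t in range(T):
    h = np.tanh(W_hh @ h)                   # inputs omitted (x_t = 0) for simplicity
    J = np.diag(1.0 - h ** 2) @ W_hh @ J    # chain rule: one more Jacobian factor
    norms.append(np.linalg.norm(J, 2))      # spectral norm of d h_t / d h_0

for t in (1, 10, 20, 30):
    print(t, norms[t - 1])
```

Because the tanh derivative is at most 1, the norm after $t$ steps is bounded by $0.9^t$, so the influence of $h_0$ on $h_t$ decays at least geometrically.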

---

## 2. Exploding Gradient

- Gradients grow rapidly → causes unstable training and NaNs

---

## 3. Short-Term Memory

- In practice, the hidden state is dominated by recent inputs
- The network struggles with tasks that need long-range memory

---

## 4. Hard to Train

- Unstable (vanishing or exploding) gradients make optimization difficult
- Training becomes slow and can stall or diverge

---

# Solutions

| Problem                | Solution                        |
|------------------------|----------------------------------|
| Vanishing gradients     | LSTM / GRU, ReLU, LayerNorm      |
| Exploding gradients     | Gradient Clipping                |
| Short-term memory       | Use gated RNNs (LSTM, GRU)       |


---



# ✅ **LSTM (Long Short-Term Memory)**

LSTM is a gated RNN that mitigates the **vanishing gradient problem** and captures **long-term dependencies** far better than a vanilla RNN.

---

# Components of an LSTM Cell

- **Cell state** $C_t$: long-term memory
- **Hidden state** $h_t$: short-term memory
- **Gates**:
  - Forget gate
  - Input gate
  - Output gate

---



<div style="text-align: center;">
    <img src="Screenshots/LSTM.png" style="width: 800px;"/>
</div>

---

# LSTM Equations

Given input $x_t$, previous hidden state $h_{t-1}$, and cell state $C_{t-1}$:

1. **Forget gate**:
$$
f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)
$$

2. **Input gate**:
$$
i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)
$$
$$
\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)
$$

3. **Cell state update**:
$$
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t
$$

4. **Output gate**:
$$
o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)
$$
$$
h_t = o_t \odot \tanh(C_t)
$$
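
The four steps translate almost line by line into NumPy. This is a sketch under assumptions: the parameter dictionary `p`, the sizes, and the random initialisation are invented for the example, and each weight matrix acts on the concatenation $[h_{t-1}, x_t]$ as in the equations above.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, C_prev, p):
    """One LSTM step following the equations above; p holds W_* and b_*."""
    z = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    f_t = sigmoid(p["W_f"] @ z + p["b_f"])       # forget gate
    i_t = sigmoid(p["W_i"] @ z + p["b_i"])       # input gate
    C_tilde = np.tanh(p["W_C"] @ z + p["b_C"])   # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde           # cell state update (element-wise)
    o_t = sigmoid(p["W_o"] @ z + p["b_o"])       # output gate
    h_t = o_t * np.tanh(C_t)                     # new hidden state
    return h_t, C_t

# Illustrative sizes and random weights (assumptions).
rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
p = {}
for name in ("f", "i", "C", "o"):
    p[f"W_{name}"] = rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
    p[f"b_{name}"] = np.zeros(hidden_size)

h, C = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):     # a sequence of 5 inputs
    h, C = lstm_step(x_t, h, C, p)
print(h.shape, C.shape)  # (4,) (4,)
```

When $f_t \approx 1$ and $i_t \approx 0$, the update reduces to $C_t \approx C_{t-1}$, which is how the cell can carry information across many steps.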

---

# Summary

| Symbol        | Meaning                            |
|---------------|-------------------------------------|
| $f_t$         | Forget gate                         |
| $i_t$         | Input gate                          |
| $o_t$         | Output gate                         |
| $\tilde{C}_t$ | Candidate cell state                |
| $C_t$         | Cell state (long-term memory)       |
| $h_t$         | Hidden state (short-term memory)    |

---

# Why LSTM is Better

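- Handles long sequences better than a vanilla RNN
- Reduces vanishing gradients: the cell state update $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$ is additive, so gradients can flow through $C_t$ largely unchanged when $f_t$ is close to 1
- Uses gates to decide what to keep, what to write, and what to expose at each step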

---



# ✅ **GRU (Gated Recurrent Unit)**

GRU is a simpler variant of LSTM that also mitigates the **vanishing gradient problem** using gates, but with **fewer components**.

---

# Components of GRU

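- **Update gate** $z_t$: how much of the candidate state replaces the previous hidden state (with the convention used in these notes, $z_t \approx 0$ keeps the past)
- **Reset gate** $r_t$: how much of the previous hidden state is used when computing the candidate
- **Hidden state** $h_t$: acts as both memory and output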

---


<div style="text-align: center;">
    <img src="Screenshots/GRU.png" style="width: 500px;"/>
</div>

---

# GRU Equations

1. **Update gate**:
$$
z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z)
$$

2. **Reset gate**:
$$
r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)
$$

3. **Candidate hidden state**:
$$
\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t] + b_h)
$$

4. **Final hidden state**:
$$
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
$$
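
A NumPy sketch of one GRU step, following the four equations above. The parameter dictionary `p`, the sizes, and the random weights are assumptions for illustration. Note that some references swap the roles of $z_t$ and $1 - z_t$; the code follows the convention written here.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU step following the equations above; p holds W_* and b_*."""
    concat = np.concatenate([h_prev, x_t])               # [h_{t-1}, x_t]
    z_t = sigmoid(p["W_z"] @ concat + p["b_z"])          # update gate
    r_t = sigmoid(p["W_r"] @ concat + p["b_r"])          # reset gate
    gated = np.concatenate([r_t * h_prev, x_t])          # [r_t * h_{t-1}, x_t]
    h_tilde = np.tanh(p["W_h"] @ gated + p["b_h"])       # candidate hidden state
    return (1.0 - z_t) * h_prev + z_t * h_tilde          # blend old state and candidate

# Illustrative sizes and random weights (assumptions).
rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
p = {}
for name in ("z", "r", "h"):
    p[f"W_{name}"] = rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
    p[f"b_{name}"] = np.zeros(hidden_size)

h = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):             # a sequence of 5 inputs
    h = gru_step(x_t, h, p)
print(h.shape)  # (4,)
```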

---

# Why Use GRU?

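- Usually faster and simpler than LSTM (two gates, no separate cell state)
- Requires fewer parameters (three weight blocks instead of four for the same sizes)
- Often performs comparably to LSTM on many tasks
- Can be a good choice for smaller datasets and faster inference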
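
The parameter saving can be counted directly from the equations in these notes, where every gate or candidate is one $W \cdot [h_{t-1}, x_t] + b$ block. The sizes below are arbitrary, and the function names are made up for the example. Libraries that keep separate input and hidden biases will report slightly different totals.

```python
def gate_block_params(input_size, hidden_size):
    """Parameters in one W . [h_{t-1}, x_t] + b block."""
    return hidden_size * (hidden_size + input_size) + hidden_size

def lstm_param_count(input_size, hidden_size):
    # Four blocks: forget gate, input gate, candidate cell state, output gate.
    return 4 * gate_block_params(input_size, hidden_size)

def gru_param_count(input_size, hidden_size):
    # Three blocks: update gate, reset gate, candidate hidden state.
    return 3 * gate_block_params(input_size, hidden_size)

input_size, hidden_size = 128, 256   # illustrative sizes (assumptions)
print(lstm_param_count(input_size, hidden_size))  # 394240
print(gru_param_count(input_size, hidden_size))   # 295680
```

For equal sizes the GRU has exactly three quarters of the LSTM's parameters under this counting.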

---

# GRU vs LSTM

| Feature         | LSTM               | GRU                      |
|-----------------|--------------------|--------------------------|
| Gates           | 3 (input, forget, output) | 2 (reset, update) |
| Cell state      | Yes                | No (merged into hidden)  |
| Complexity      | Higher             | Lower                    |

---



# ✅ **RNN vs LSTM vs GRU**



<div style="text-align: center;">
    <img src="Screenshots/RNNvsLSTMvsGRU.png" style="width: 1000px;"/>
</div>

---

# ✅ **Deep RNNs (Stacked RNNs)**

A **Deep RNN** has multiple RNN layers stacked on top of each other.

- The output of one layer becomes the input to the next.
- Helps in learning **hierarchical and complex sequence patterns**.

# Layer Structure

$$
\begin{aligned}
\text{Layer 1: } & (x_1, x_2, x_3) \rightarrow (h_1, h_2, h_3) \\
\text{Layer 2: } & (h_1, h_2, h_3) \rightarrow (h_1', h_2', h_3')
\end{aligned}
$$
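
Stacking amounts to feeding the full hidden-state sequence of one layer into the next. A minimal sketch with two vanilla RNN layers follows; the helper names, sizes, and weights are assumptions for illustration.

```python
import numpy as np

def rnn_layer(xs, W_xh, W_hh, b_h):
    """Run one vanilla RNN layer over a sequence and return all hidden states."""
    h = np.zeros(W_hh.shape[0])
    hs = []
    for x_t in xs:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        hs.append(h)
    return np.stack(hs)

def stacked_rnn(xs, layers):
    """Feed the hidden-state sequence of each layer into the next one."""
    seq = xs
    for W_xh, W_hh, b_h in layers:
        seq = rnn_layer(seq, W_xh, W_hh, b_h)
    return seq

# Illustrative sizes and random weights (assumptions).
rng = np.random.default_rng(0)
input_size, hidden_sizes, T = 3, [5, 4], 6
layers, in_dim = [], input_size
for H in hidden_sizes:
    layers.append((rng.normal(scale=0.3, size=(H, in_dim)),
                   rng.normal(scale=0.3, size=(H, H)),
                   np.zeros(H)))
    in_dim = H                      # the next layer consumes this layer's hidden states

xs = rng.normal(size=(T, input_size))
top = stacked_rnn(xs, layers)
print(top.shape)  # (6, 4): one top-layer hidden state per time step
```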

---

# Advantages

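- Can capture more abstract temporal patterns, since each layer builds on the one below  
- Often improves model performance, at the cost of more parameters and slower training  
- Works well for complex tasks like **language modeling**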

---

# ✅ **Bidirectional RNNs**

A **Bidirectional RNN** processes the input sequence in both **forward and backward** directions using two separate RNNs.

- **Forward RNN** processes:

  $$
  x_1 \rightarrow x_2 \rightarrow \dots \rightarrow x_T
  $$

- **Backward RNN** processes:

  $$
  x_T \rightarrow x_{T-1} \rightarrow \dots \rightarrow x_1
  $$

- The final output at time step $t$ is:

  $$
  h_t = [\overrightarrow{h_t};\ \overleftarrow{h_t}]
  $$
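
A sketch of the idea with two independent vanilla RNNs: one reads the sequence left to right, the other right to left, and their hidden states are concatenated per time step. Helper names, sizes, and weights are assumptions for illustration.

```python
import numpy as np

def run_rnn(xs, W_xh, W_hh, b_h):
    """Run a vanilla RNN over xs and return the hidden state at every step."""
    h = np.zeros(W_hh.shape[0])
    hs = []
    for x_t in xs:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        hs.append(h)
    return np.stack(hs)

def bidirectional_rnn(xs, fwd, bwd):
    h_fwd = run_rnn(xs, *fwd)                      # processes x_1 ... x_T
    h_bwd = run_rnn(xs[::-1], *bwd)[::-1]          # processes x_T ... x_1, then re-aligned
    return np.concatenate([h_fwd, h_bwd], axis=1)  # [forward h_t ; backward h_t]

def init_params(rng, input_size, hidden_size):
    return (rng.normal(scale=0.3, size=(hidden_size, input_size)),
            rng.normal(scale=0.3, size=(hidden_size, hidden_size)),
            np.zeros(hidden_size))

# Illustrative sizes and random weights (assumptions).
rng = np.random.default_rng(0)
T, input_size, hidden_size = 4, 3, 5
fwd = init_params(rng, input_size, hidden_size)    # separate weights per direction
bwd = init_params(rng, input_size, hidden_size)
xs = rng.normal(size=(T, input_size))

H = bidirectional_rnn(xs, fwd, bwd)
print(H.shape)  # (4, 10): each step has forward and backward halves
```

The second reversal matters: without it, row $t$ would pair the forward state for $x_t$ with the backward state for a different position. The whole sequence must also be available before the backward pass can start.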

---

# Advantages

- Uses both **past and future** context  
- Improves performance on tasks like:
  - Part-of-Speech (POS) tagging
  - Named Entity Recognition (NER)
  - Machine Translation (in the encoder, where the full source sentence is available)

# Example

Sequence:  
$$
\begin{aligned}
\text{Input: } & x_1\quad x_2\quad x_3\quad x_4 \\
\text{Forward: } & \rightarrow\quad \rightarrow\quad \rightarrow\quad \rightarrow \\
\text{Backward: } & \leftarrow\quad \leftarrow\quad \leftarrow\quad \leftarrow \\
\text{Final Output: } & [\overrightarrow{h_1};\ \overleftarrow{h_1}],\ [\overrightarrow{h_2};\ \overleftarrow{h_2}],\ \dots
\end{aligned}
$$

---

# Note:
- Stacking and bidirectionality can be combined with any cell type, including LSTM and GRU, for example:
  - Bidirectional LSTM (BiLSTM)
  - Bidirectional GRU (BiGRU)

---
