# Comprehensive Guide to Recurrent Neural Networks (RNNs)

## Table of Contents
1. [Introduction to RNNs](#introduction)
2. [Basic RNN Architecture](#architecture)
3. [Mathematical Foundations](#math)
4. [Forward Propagation in RNNs](#forward-prop)
5. [Backpropagation Through Time (BPTT)](#bptt)
6. [Numerical Example](#example)
7. [Common Challenges](#challenges)
8. [Variants and Solutions](#variants)

## 1. Introduction to RNNs <a name="introduction"></a>

Recurrent Neural Networks (RNNs) are designed to work with sequential data, where the order of inputs matters. Unlike feedforward networks, RNNs maintain an internal state (memory) that gets updated as they process a sequence.

Key Applications:
- Natural Language Processing
- Time Series Prediction
- Speech Recognition
- Audio Generation
- Video Processing

In [2]:
import numpy as np
%load_ext nb_js_diagrammers

## 2. Basic RNN Architecture <a name="architecture"></a>

In [2]:
%%mermaid_magic
graph LR
    X[Input x_t] --> H[Hidden State h_t]
    H_prev[Previous State h_t-1] --> H
    H --> Y[Output y_t]
    H --> H_next[Next State h_t+1]
    
    style X fill:#f9f,stroke:#333,stroke-width:2px
    style H fill:#bbf,stroke:#333,stroke-width:2px
    style Y fill:#bfb,stroke:#333,stroke-width:2px
    style H_prev fill:#bbf,stroke:#333,stroke-width:2px
    style H_next fill:#bbf,stroke:#333,stroke-width:2px

- **The RNN processes input sequences one element at a time:**

    1. Takes current input (x_t)
    2. Combines it with the previous hidden state (h_{t-1})
    3. Produces current hidden state (h_t)
    4. Generates output (y_t)

## 3. Mathematical Foundations <a name="math"></a>

### Core Equations:


1. **Hidden State Update:**


$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$$


2. **Output Calculation:**


$$y_t = W_{hy} h_t + b_y$$


Where:

- $W_{hh}$: Hidden-to-hidden weights
- $W_{xh}$: Input-to-hidden weights
- $W_{hy}$: Hidden-to-output weights
- $b_h$, $b_y$: Bias terms
- $\tanh$: Hyperbolic tangent activation function
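
The two core equations can be sketched as a single NumPy function. This is a minimal illustration rather than a reference implementation: the function name `rnn_step`, the layer sizes, and the random weights are all assumptions made for the example. It follows the column-vector form of the equations, so `W_xh` has shape `(hidden_dim, input_dim)`.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """One RNN time step in the column-vector form of the core equations."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)  # hidden state update
    y_t = W_hy @ h_t + b_y                            # linear output
    return h_t, y_t

# Hypothetical sizes and random weights, chosen only for illustration.
input_dim, hidden_dim, output_dim = 2, 3, 1
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
b_y = np.zeros(output_dim)

h_t, y_t = rnn_step(np.array([1.0, 2.0]), np.zeros(hidden_dim),
                    W_xh, W_hh, W_hy, b_h, b_y)
print(h_t.shape, y_t.shape)  # (3,) (1,)
```

Because $\tanh$ is applied element-wise, every entry of `h_t` lies strictly between -1 and 1, whatever the weights are.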

### Unrolled Network Visualization

In [3]:
%%mermaid_magic
graph LR
    subgraph t-1
        X1[x_t-1] --> H1[h_t-1]
        H1 --> Y1[y_t-1]
    end
    
    subgraph t
        X2[x_t] --> H2[h_t]
        H2 --> Y2[y_t]
    end
    
    subgraph t+1
        X3[x_t+1] --> H3[h_t+1]
        H3 --> Y3[y_t+1]
    end
    
    H1 --> H2
    H2 --> H3
    
    style X1 fill:#f9f,stroke:#333,stroke-width:2px
    style X2 fill:#f9f,stroke:#333,stroke-width:2px
    style X3 fill:#f9f,stroke:#333,stroke-width:2px
    style H1 fill:#bbf,stroke:#333,stroke-width:2px
    style H2 fill:#bbf,stroke:#333,stroke-width:2px
    style H3 fill:#bbf,stroke:#333,stroke-width:2px
    style Y1 fill:#bfb,stroke:#333,stroke-width:2px
    style Y2 fill:#bfb,stroke:#333,stroke-width:2px
    style Y3 fill:#bfb,stroke:#333,stroke-width:2px

## 4. Forward Propagation in RNNs <a name="forward-prop"></a>

The forward pass involves:

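1. Initialize the hidden state h_0 (usually zeros)
2. For each time step t:
   - Get input x_t
   - Calculate the new hidden state h_t from x_t and h_{t-1}
   - Generate output y_t
   - Pass h_t on to the next time step
3. After the last time step, return the outputs and the final hidden state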
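
The steps above can be sketched as a loop over the sequence. This is an illustrative sketch, not a definitive implementation: `rnn_forward`, the sizes, and the random weights are assumptions. It uses a row-vector layout (`x_t @ W_xh`, with `W_xh` of shape `(input_dim, hidden_dim)`), which is the layout the numerical example in Section 6 uses.

```python
import numpy as np

def rnn_forward(xs, h_0, W_xh, W_hh, W_hy, b_h, b_y):
    """Run an RNN over xs of shape (T, input_dim); return every h_t and y_t."""
    h_t = h_0                      # initial hidden state
    hs, ys = [], []
    for x_t in xs:                 # one iteration per time step
        h_t = np.tanh(x_t @ W_xh + h_t @ W_hh + b_h)  # new hidden state
        y_t = h_t @ W_hy.T + b_y                      # output for this step
        hs.append(h_t)
        ys.append(y_t)
    return np.stack(hs), np.stack(ys)

# Hypothetical sizes and random weights, for illustration only.
T, input_dim, hidden_dim, output_dim = 4, 2, 3, 1
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(input_dim, hidden_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))
b_h, b_y = np.zeros(hidden_dim), np.zeros(output_dim)

xs = rng.normal(size=(T, input_dim))
hs, ys = rnn_forward(xs, np.zeros(hidden_dim), W_xh, W_hh, W_hy, b_h, b_y)
print(hs.shape, ys.shape)  # (4, 3) (4, 1)
```

The same weight matrices are reused at every iteration; only the hidden state changes from one step to the next.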

In [4]:
%%mermaid_magic
flowchart TD
    A[Initialize h_0] --> B[Get input x_t]
    B --> C[Calculate h_t]
    C --> D[Generate y_t]
    D --> E{More timesteps?}
    E -->|Yes| B
    E -->|No| F[End]

## 5. Backpropagation Through Time (BPTT) <a name="bptt"></a>


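- **BPTT is the standard training algorithm for RNNs:**
    1. Run the forward pass over the entire sequence, storing every hidden state
    2. Calculate the loss at each time step and sum the per-step losses
    3. Backpropagate the error from the last time step to the first
    4. Update the shared weights with the accumulated gradients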

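The gradient calculation applies the chain rule through time. For the recurrent weights ($W = W_{hh}$ or $W = W_{xh}$), with total loss $L = \sum_t L_t$:

$$\frac{\partial L}{\partial W} = \sum_t \left(\frac{\partial L_t}{\partial y_t} \cdot \frac{\partial y_t}{\partial h_t} \cdot \frac{\partial h_t}{\partial W}\right)$$

Because the weights are shared across time steps, $h_t$ depends on $W$ both directly and through every earlier hidden state, so the last factor expands into a sum over earlier steps:

$$\frac{\partial h_t}{\partial W} = \sum_{k=1}^{t} \left(\prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}}\right) \frac{\partial^{+} h_k}{\partial W}$$

Here $\partial^{+} h_k / \partial W$ denotes the immediate derivative of $h_k$ with respect to $W$, treating $h_{k-1}$ as a constant.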
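
A compact NumPy sketch of BPTT may make the chain rule concrete. Several things here are assumptions rather than part of the text above: the squared-error loss, the `targets` array, the function names, and the random weights. The layout is row-vector (`x_t @ W_xh`), matching the numerical example in Section 6.

```python
import numpy as np

def forward(xs, W_xh, W_hh, W_hy, b_h, b_y):
    """Forward pass; hs[0] is h_0 = 0 and hs[t + 1] is the state after xs[t]."""
    hs = [np.zeros(W_hh.shape[0])]
    ys = []
    for x_t in xs:
        hs.append(np.tanh(x_t @ W_xh + hs[-1] @ W_hh + b_h))
        ys.append(hs[-1] @ W_hy.T + b_y)
    return hs, np.array(ys)

def loss_fn(ys, targets):
    """Assumed loss: squared error summed over all time steps."""
    return 0.5 * np.sum((ys - targets) ** 2)

def bptt(xs, targets, W_xh, W_hh, W_hy, b_h, b_y):
    """Return the loss and the gradients of every parameter."""
    hs, ys = forward(xs, W_xh, W_hh, W_hy, b_h, b_y)
    grads = {"W_xh": np.zeros_like(W_xh), "W_hh": np.zeros_like(W_hh),
             "W_hy": np.zeros_like(W_hy), "b_h": np.zeros_like(b_h),
             "b_y": np.zeros_like(b_y)}
    dh_next = np.zeros_like(b_h)              # gradient arriving from step t + 1
    for t in reversed(range(len(xs))):
        dy = ys[t] - targets[t]               # dL_t / dy_t for squared error
        grads["W_hy"] += np.outer(dy, hs[t + 1])
        grads["b_y"] += dy
        dh = dy @ W_hy + dh_next              # from the output and from the future
        da = (1.0 - hs[t + 1] ** 2) * dh      # back through tanh
        grads["W_xh"] += np.outer(xs[t], da)
        grads["W_hh"] += np.outer(hs[t], da)  # hs[t] is the previous state
        grads["b_h"] += da
        dh_next = da @ W_hh.T                 # pass the gradient to step t - 1
    return loss_fn(ys, targets), grads

# Hypothetical data and weights, for illustration only.
rng = np.random.default_rng(0)
T, input_dim, hidden_dim, output_dim = 5, 2, 3, 1
W_xh = rng.normal(scale=0.5, size=(input_dim, hidden_dim))
W_hh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.5, size=(output_dim, hidden_dim))
b_h, b_y = np.zeros(hidden_dim), np.zeros(output_dim)
xs = rng.normal(size=(T, input_dim))
targets = rng.normal(size=(T, output_dim))

loss, grads = bptt(xs, targets, W_xh, W_hh, W_hy, b_h, b_y)
print(grads["W_hh"].shape)  # (3, 3)
```

The `dh_next` variable is where the product of $\partial h_j / \partial h_{j-1}$ terms accumulates: each backward iteration multiplies the incoming gradient by one more Jacobian. A finite-difference check on a few weight entries is a simple way to validate a sketch like this.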


## 6. Numerical Example <a name="example"></a>

Let's work through a simple example with:
- Input dimension: 2
- Hidden state dimension: 3
- Output dimension: 1
- Sequence length: 2

### Initial parameters:

In [3]:
W_xh = np.array([[0.1, 0.2, 0.3],
                 [0.4, 0.5, 0.6]])

W_hh = np.array([[0.7, 0.8, 0.9],
                 [0.1, 0.2, 0.3],
                 [0.4, 0.5, 0.6]])

W_hy = np.array([[0.1, 0.2, 0.3]])

b_h = np.array([0.1, 0.1, 0.1])
b_y = np.array([0.1])

### Input sequence:

In [6]:
x = np.array([[1, 2],    # x_1
              [3, 4]])   # x_2
print(x)

[[1 2]
 [3 4]]


### Forward pass calculations:

In [8]:
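# Row-vector layout: W_xh and W_hh are stored transposed relative to the
# equations in Section 3, so x @ W_xh here plays the role of W_xh x there.

# Time step 1: h_0 is zero, so only the input and the bias contribute
h_0 = np.zeros(3)
h_1 = np.tanh(np.dot(x[0], W_xh) + np.dot(h_0, W_hh) + b_h)
y_1 = np.dot(h_1, W_hy.T) + b_y

# Time step 2: same weights, with h_1 carrying information from x_1
h_2 = np.tanh(np.dot(x[1], W_xh) + np.dot(h_1, W_hh) + b_h)
y_2 = np.dot(h_2, W_hy.T) + b_y

print(h_1, y_1)  # y_1 is approximately 0.625
print(h_2, y_2)  # y_2 is approximately 0.699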

## 7. Common Challenges <a name="challenges"></a>

### 1. Vanishing Gradients

In [10]:
%%mermaid_magic -h 100
graph LR
    A[Earlier States] -->|Weak Gradient| B[Later States]
    style A fill:#f99,stroke:#333,stroke-width:2px
    style B fill:#9f9,stroke:#333,stroke-width:2px
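
A small experiment can show the effect numerically. The sketch below is illustrative only: it assumes a tanh RNN with no inputs, and it rescales a random `W_hh` so that its largest singular value is 0.5, a value chosen purely to make the decay easy to see. It then pushes a gradient vector backward through 30 time steps and records its norm.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, T = 8, 30

# Assumption for the demo: rescale W_hh so its spectral norm is exactly 0.5.
W_hh = rng.normal(size=(hidden_dim, hidden_dim))
W_hh *= 0.5 / np.linalg.norm(W_hh, 2)

# Forward pass without inputs, keeping every hidden state.
h_t = rng.normal(size=hidden_dim)
hs = []
for _ in range(T):
    h_t = np.tanh(h_t @ W_hh)
    hs.append(h_t)

# Backward pass: each step multiplies by tanh'(.) = 1 - h_t^2 and by W_hh^T.
grad = np.ones(hidden_dim)
norms = []
for h_t in reversed(hs):
    grad = ((1.0 - h_t ** 2) * grad) @ W_hh.T
    norms.append(np.linalg.norm(grad))

print(norms[0], norms[-1])  # the last value is many orders of magnitude smaller
```

Each backward step shrinks the gradient norm by at least a factor of two here, so after 30 steps almost no learning signal reaches the earliest states.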

### 2. Exploding Gradients

In [11]:
%%mermaid_magic -h 100
graph LR
    A[Earlier States] -->|Extreme Gradient| B[Later States]
    style A fill:#99f,stroke:#333,stroke-width:2px
    style B fill:#f99,stroke:#333,stroke-width:2px
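
The opposite effect can be sketched in the same way. This example rests on assumptions chosen for the demonstration: `W_hh` is 1.5 times an orthogonal matrix, and the inputs, bias, and `h_0` are all zero, so every hidden state stays at zero and the tanh derivative is exactly 1. The helper `clip_by_norm` shows gradient norm clipping, a common remedy that the text above does not name; it is included only as an illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, T = 8, 30

# Assumption for the demo: every singular value of W_hh equals 1.5.
Q, _ = np.linalg.qr(rng.normal(size=(hidden_dim, hidden_dim)))
W_hh = 1.5 * Q

# With all hidden states at zero, each backward step is a multiply by W_hh^T.
grad = np.ones(hidden_dim)
norms = []
for _ in range(T):
    grad = grad @ W_hh.T
    norms.append(np.linalg.norm(grad))

def clip_by_norm(g, max_norm):
    """Rescale g so that its L2 norm does not exceed max_norm."""
    n = np.linalg.norm(g)
    return g if n <= max_norm else g * (max_norm / n)

clipped = clip_by_norm(grad, max_norm=5.0)
print(norms[0], norms[-1])       # the norm grows by a factor of 1.5 per step
print(np.linalg.norm(clipped))   # approximately 5.0
```

Clipping keeps the direction of the gradient and only limits its length, so a single oversized update cannot throw the weights far off course.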

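### 3. Long-term Dependencies

Because gradients shrink as they are propagated back through many time steps, a plain RNN struggles to connect an output to inputs that occurred far earlier in the sequence.

### 4. Training Instability

Vanishing and exploding gradients together make the loss sensitive to the learning rate and to weight initialization, so training can stall or diverge.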

## 8. Variants and Solutions <a name="variants"></a>

### LSTM (Long Short-Term Memory)

In [13]:
%%mermaid_magic -h 300
graph TD
    I[Input Gate] --> C[Cell State]
    F[Forget Gate] --> C
    O[Output Gate] --> H[Hidden State]
    C --> H
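
One step of an LSTM cell can be sketched in NumPy as follows. This is the commonly used formulation rather than anything specified above: packing all four weight blocks into one matrix `W`, the gate ordering inside it, and the names `lstm_step` and `sigmoid` are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W: (input_dim + hidden_dim, 4 * hidden_dim), b: (4 * hidden_dim,)."""
    n = h_prev.shape[0]
    z = np.concatenate([x_t, h_prev]) @ W + b
    i = sigmoid(z[:n])           # input gate: how much new content to write
    f = sigmoid(z[n:2 * n])      # forget gate: how much old cell state to keep
    o = sigmoid(z[2 * n:3 * n])  # output gate: how much cell state to expose
    g = np.tanh(z[3 * n:])       # candidate cell content
    c_t = f * c_prev + i * g     # additive cell state update
    h_t = o * np.tanh(c_t)       # hidden state read from the cell
    return h_t, c_t

# Hypothetical sizes and random weights, for illustration only.
input_dim, hidden_dim = 2, 3
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(input_dim + hidden_dim, 4 * hidden_dim))
b = np.zeros(4 * hidden_dim)

h_t, c_t = lstm_step(np.array([1.0, 2.0]), np.zeros(hidden_dim),
                     np.zeros(hidden_dim), W, b)
print(h_t.shape, c_t.shape)  # (3,) (3,)
```

The line `c_t = f * c_prev + i * g` is the key difference from the plain RNN: the cell state is updated by addition, so when the forget gate is close to 1 the gradient can pass through many steps without being squashed by a tanh each time.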

### GRU (Gated Recurrent Unit)

In [14]:
%%mermaid_magic -h 200
graph TD
    R[Reset Gate] --> H[Hidden State]
    U[Update Gate] --> H
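
A GRU step can be sketched in the same style. The formulation below is one common convention, in which the update gate `z` weights the candidate state and `1 - z` weights the previous state; some libraries swap those two roles. The function and parameter names are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    """One GRU step. Each W has shape (input_dim + hidden_dim, hidden_dim)."""
    xh = np.concatenate([x_t, h_prev])
    z = sigmoid(xh @ W_z + b_z)   # update gate: how much to overwrite
    r = sigmoid(xh @ W_r + b_r)   # reset gate: how much past state to use
    h_tilde = np.tanh(np.concatenate([x_t, r * h_prev]) @ W_h + b_h)
    return (1.0 - z) * h_prev + z * h_tilde  # blend old state and candidate

# Hypothetical sizes and random weights, for illustration only.
input_dim, hidden_dim = 2, 3
rng = np.random.default_rng(0)
shape = (input_dim + hidden_dim, hidden_dim)
W_z, W_r, W_h = (rng.normal(scale=0.1, size=shape) for _ in range(3))
b_z = b_r = b_h = np.zeros(hidden_dim)

h_t = gru_step(np.array([1.0, 2.0]), np.zeros(hidden_dim),
               W_z, W_r, W_h, b_z, b_r, b_h)
print(h_t.shape)  # (3,)
```

Compared with the LSTM, the GRU has no separate cell state and one gate fewer, so it has fewer parameters for the same hidden size.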

### Key improvements:

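- Better gradient flow: state is carried forward through a largely additive path, which reduces vanishing gradients
- Controlled information flow: gates decide what to write, what to keep, and what to expose at each step
- Improved long-term memory: relevant information can persist across many time steps
- More stable training than a plain tanh RNN on long sequences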