# Gated Recurrent Units (GRUs): A Simplified Approach to Recurrent Networks

This Jupyter Notebook introduces Gated Recurrent Units (GRUs), a popular and often more computationally efficient alternative to LSTMs for handling long-term dependencies in sequential data. GRUs were introduced by Cho et al. in 2014.

## Key Learning Objectives:

### 1. **Motivation for GRUs: Addressing LSTM's Complexity**
* **LSTM Recap**: LSTMs mitigate the vanishing gradient problem and capture long-term dependencies using three gates (Forget, Input, Output) and two distinct memory states: the cell state ($C_t$, long-term memory) and the hidden state ($H_t$, short-term memory).
* **LSTM Disadvantage**: This complex architecture leads to:
    * **More Trainable Parameters**: Each gate has its own weight matrices and bias vectors.
    * **Increased Training Time**: More parameters and operations mean longer computational time for both forward and backward propagation.
    * **Higher Computational Cost**: Especially for deployment on resource-constrained devices.
* **GRU Goal**: To achieve similar performance to LSTMs in capturing long-term dependencies but with a simpler architecture, fewer parameters, and faster training.

### 2. **GRU Architecture: Key Differences from LSTM**

* **Combined Memory**: Unlike LSTMs with separate $C_t$ and $H_t$, GRUs combine these into a single **hidden state** ($H_t$). This $H_t$ acts as both the short-term and long-term memory.
* **Fewer Gates**: GRUs reduce the three gates of LSTMs to just two:
    * **Update Gate ($Z_t$)**
    * **Reset Gate ($R_t$)**
* **No Separate Output Gate**: GRUs have no output gate. The full hidden state $H_t$ is exposed at every time step, and the update gate alone controls how much of it changes.

### 3. **Understanding GRU Gates and Their Operations**

#### **3.1. Reset Gate ($R_t$)**
* **Purpose**: Decides how much of the *previous hidden state* ($H_{t-1}$) should be *forgotten* or *reset* when computing the *new candidate hidden state*. It's similar to the forget gate in LSTMs but focuses on the previous hidden state for the *candidate* calculation.
* **Inputs**: Current input ($X_t$) and previous hidden state ($H_{t-1}$).
* **Calculation**: $R_t = \sigma(W_r \cdot [H_{t-1}, X_t] + b_r)$
    * $W_r$: Weight matrix for the Reset Gate.
    * $\sigma$: Sigmoid activation, outputting values between 0 and 1.
* **Function in Candidate State**: The $R_t$ value is **point-wise multiplied** with $H_{t-1}$ before being used to calculate the candidate hidden state. A value of 0 for an element in $R_t$ means "completely forget" the corresponding element in $H_{t-1}$ for the new candidate.
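
The reset gate calculation above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the sizes, the random weights, and the helper names `sigmoid` and `reset_gate` are assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    # Logistic function: squashes each element into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def reset_gate(h_prev, x_t, W_r, b_r):
    # R_t = sigmoid(W_r . [H_{t-1}, X_t] + b_r)
    concat = np.concatenate([h_prev, x_t])  # [H_{t-1}, X_t]
    return sigmoid(W_r @ concat + b_r)

# Illustrative sizes and random values (assumptions, not from the notebook).
rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3
h_prev = rng.standard_normal(hidden_size)
x_t = rng.standard_normal(input_size)
W_r = rng.standard_normal((hidden_size, hidden_size + input_size))
b_r = np.zeros(hidden_size)

r_t = reset_gate(h_prev, x_t, W_r, b_r)

# Point-wise multiplication: each element of H_{t-1} is scaled by its gate value.
filtered = r_t * h_prev
print("R_t:", r_t)
print("R_t * H_{t-1}:", filtered)
```

Because every gate value lies between 0 and 1, each element of `filtered` is never larger in magnitude than the corresponding element of `h_prev`.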

#### **3.2. Candidate Hidden State ($\tilde{H}_t$)**
* **Purpose**: This is a *temporary* or *candidate* new hidden state. It proposes the new information that could be added to the overall hidden state.
* **Inputs**: Current input ($X_t$) and the *reset-filtered* previous hidden state ($R_t \times H_{t-1}$).
* **Calculation**: $\tilde{H}_t = \tanh(W_h \cdot [R_t \times H_{t-1}, X_t] + b_h)$
    * $W_h$: Weight matrix for the candidate hidden state.
    * $\tanh$: Tanh activation, outputting values between -1 and 1.
* **Role of Reset Gate**: The $R_t \times H_{t-1}$ term is where the "resetting" takes effect. If $R_t$ has small values, it effectively "resets" or zeroes out parts of the previous hidden state, meaning those parts are not considered when creating the new candidate hidden state.
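
To make the resetting effect concrete, the sketch below computes the candidate state for two different previous hidden states. With the reset gate forced to 0, both candidates come out identical because only $X_t$ contributes; with the gate forced to 1, they differ. The function name `candidate_state`, the sizes, and the random values are assumptions for illustration only.

```python
import numpy as np

def candidate_state(h_prev, x_t, r_t, W_h, b_h):
    # H~_t = tanh(W_h . [R_t * H_{t-1}, X_t] + b_h)
    gated = r_t * h_prev  # reset-filtered previous hidden state
    return np.tanh(W_h @ np.concatenate([gated, x_t]) + b_h)

# Illustrative sizes and random values (assumptions, not from the notebook).
rng = np.random.default_rng(1)
hidden_size, input_size = 4, 3
x_t = rng.standard_normal(input_size)
W_h = rng.standard_normal((hidden_size, hidden_size + input_size))
b_h = np.zeros(hidden_size)

# Two different histories that see the same current input.
h_a = rng.standard_normal(hidden_size)
h_b = rng.standard_normal(hidden_size)

r_zero = np.zeros(hidden_size)  # gate fully closed: forget H_{t-1}
r_one = np.ones(hidden_size)    # gate fully open: use all of H_{t-1}

cand_a_reset = candidate_state(h_a, x_t, r_zero, W_h, b_h)
cand_b_reset = candidate_state(h_b, x_t, r_zero, W_h, b_h)
cand_a_open = candidate_state(h_a, x_t, r_one, W_h, b_h)
cand_b_open = candidate_state(h_b, x_t, r_one, W_h, b_h)

print("Same candidate when reset:", np.allclose(cand_a_reset, cand_b_reset))
print("Same candidate when open: ", np.allclose(cand_a_open, cand_b_open))
```

In a trained network the gate values are learned and usually fall strictly between these two extremes, so the candidate sees a partially attenuated history.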

#### **3.3. Update Gate ($Z_t$)**
* **Purpose**: This is the most crucial gate, combining the functionalities of both the Forget Gate and Input Gate in an LSTM. It decides how much of the *previous hidden state* ($H_{t-1}$) to retain and how much of the *new candidate hidden state* ($\tilde{H}_t$) to incorporate.
* **Inputs**: Current input ($X_t$) and previous hidden state ($H_{t-1}$).
* **Calculation**: $Z_t = \sigma(W_z \cdot [H_{t-1}, X_t] + b_z)$
    * $W_z$: Weight matrix for the Update Gate.
    * $\sigma$: Sigmoid activation, outputting values between 0 and 1.

#### **3.4. Final Hidden State ($H_t$)**
* **Purpose**: The actual hidden state that gets passed to the next time step and used for predictions.
* **Calculation**: $H_t = Z_t \times H_{t-1} + (1 - Z_t) \times \tilde{H}_t$
* **Explanation**:
    * **Coupling**: Notice the crucial coupling: $Z_t$ determines how much of $H_{t-1}$ is kept, and $(1 - Z_t)$ simultaneously determines how much of $\tilde{H}_t$ (the new candidate) is added.
    * If $Z_t$ is close to 1: The GRU largely keeps the old hidden state ($H_{t-1}$) and ignores the new candidate ($\tilde{H}_t$). This means "remembering" previous long-term information.
    * If $Z_t$ is close to 0: The GRU largely discards the old hidden state ($H_{t-1}$) and incorporates the new candidate ($\tilde{H}_t$). This means "updating" with new information.
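    * Because the two weights $Z_t$ and $(1 - Z_t)$ sum to 1, each element of $H_t$ is a weighted average of the old state and the candidate: retaining more of the old information necessarily means admitting less of the new, and vice versa. A single gate therefore does the work that LSTMs split between the forget and input gates.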
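
Putting sections 3.1 to 3.4 together, a single GRU time step might look like the following NumPy sketch. It follows the equations as written in this notebook; the function name `gru_step`, the parameter tuple layout, the sizes, and the random weights are assumptions for illustration, and library implementations may arrange the weights differently.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h_prev, x_t, params):
    """One GRU time step following the equations in sections 3.1 to 3.4."""
    W_r, b_r, W_z, b_z, W_h, b_h = params
    hx = np.concatenate([h_prev, x_t])             # [H_{t-1}, X_t]
    r_t = sigmoid(W_r @ hx + b_r)                  # reset gate
    z_t = sigmoid(W_z @ hx + b_z)                  # update gate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]) + b_h)
    h_t = z_t * h_prev + (1.0 - z_t) * h_tilde     # final hidden state
    return h_t, h_tilde

# Illustrative sizes and random weights (assumptions, not from the notebook).
rng = np.random.default_rng(2)
hidden_size, input_size, seq_len = 4, 3, 5

def random_weights():
    return 0.5 * rng.standard_normal((hidden_size, hidden_size + input_size))

zeros = np.zeros(hidden_size)
params = (random_weights(), zeros, random_weights(), zeros, random_weights(), zeros)

# Run over a short random sequence, starting from a zero hidden state.
h = np.zeros(hidden_size)
stays_between = True
for x_t in rng.standard_normal((seq_len, input_size)):
    h_new, h_tilde = gru_step(h, x_t, params)
    # Each element of H_t should lie between H_{t-1} and the candidate.
    lo, hi = np.minimum(h, h_tilde), np.maximum(h, h_tilde)
    stays_between &= bool(np.all((h_new >= lo - 1e-12) & (h_new <= hi + 1e-12)))
    h = h_new

# A large update-gate bias pushes Z_t toward 1, so the old state is kept.
W_r, b_r, W_z, _, W_h, b_h = params
keep_params = (W_r, b_r, W_z, np.full(hidden_size, 30.0), W_h, b_h)
h_kept, _ = gru_step(h, x_t, keep_params)

print("H_t always between H_{t-1} and candidate:", stays_between)
print("State preserved when Z_t is near 1:", np.allclose(h_kept, h, atol=1e-6))
```

The second check mirrors the "$Z_t$ close to 1" case described above: the step returns essentially the same hidden state it was given, regardless of the input.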

### Diagram

![LSTM and GRU diagram](images/lstm_gru.png)

### 4. **Advantages of GRUs over LSTMs**
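* **Fewer Parameters**: Since there are only two gates and no separate cell state, a GRU layer learns three sets of weights and biases (reset gate, update gate, candidate) where an LSTM learns four, which is roughly 25% fewer parameters for the same input and hidden sizes.
* **Faster Training**: Fewer parameters and fewer operations per time step generally mean less computation per training step, though the number of steps needed to converge still depends on the task.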
* **Simpler Architecture**: Easier to implement and understand.
* **Comparable Performance**: Despite their simplicity, GRUs often achieve performance comparable to LSTMs on many tasks, especially with smaller datasets.
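
The parameter saving can be estimated with a quick count. The sketch below assumes the textbook formulation used in this notebook, where each gate or candidate has one weight matrix over $[H_{t-1}, X_t]$ and one bias vector; the function names and layer sizes are hypothetical, and framework implementations may store biases differently, so their reported counts can deviate slightly.

```python
def block_params(input_size, hidden_size):
    # One weight matrix over [H_{t-1}, X_t] plus one bias vector.
    return hidden_size * (hidden_size + input_size) + hidden_size

def gru_params(input_size, hidden_size):
    # Reset gate, update gate, candidate hidden state.
    return 3 * block_params(input_size, hidden_size)

def lstm_params(input_size, hidden_size):
    # Forget gate, input gate, output gate, candidate cell state.
    return 4 * block_params(input_size, hidden_size)

# Hypothetical layer sizes chosen only for illustration.
input_size, hidden_size = 128, 256
gru = gru_params(input_size, hidden_size)
lstm = lstm_params(input_size, hidden_size)
ratio = gru / lstm

print(f"GRU parameters:  {gru}")
print(f"LSTM parameters: {lstm}")
print(f"GRU / LSTM ratio: {ratio:.2f}")
```

Under these assumptions the ratio is 3/4 for any choice of sizes, since both counts are multiples of the same per-block figure.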

### Conclusion:
GRUs offer an elegant and efficient solution for capturing long-term dependencies in sequential data. By combining the cell state and hidden state into a single unit and streamlining the gating mechanism, GRUs provide a powerful yet simpler alternative to LSTMs, making them a popular choice in various deep learning applications.

**Next Video**: Further practical applications and comparisons of RNN variants.