## **GRU (Gated Recurrent Unit)**

A **Gated Recurrent Unit (GRU)** is a type of **Recurrent Neural Network (RNN)** designed to mitigate the vanishing gradient problem and capture long-term dependencies more reliably, while keeping the computational cost lower than that of **LSTMs (Long Short-Term Memory networks)**.

### **Why GRUs?**

Traditional RNNs suffer from:

- **Vanishing gradient problem** → Cannot retain long-term dependencies.
- **Exploding gradients** → Causes unstable training.
- **Difficulty in learning long sequences**.

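GRUs mitigate the vanishing gradient problem, and with it the difficulty of learning long sequences, by introducing **gates** that control the flow of information, selectively deciding what should be passed on and what should be forgotten. Gating alone does not remove exploding gradients, which are typically handled separately, for example with gradient clipping.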

---

## **1. Structure of a GRU**

A GRU consists of **two gates**:

1. **Reset Gate** $r_t$ → Decides how much past information to forget when forming the new candidate state.
2. **Update Gate** $z_t$ → Controls how much of the previous hidden state is carried over and how much is replaced by new information.

Unlike **LSTMs**, GRUs do not have a separate memory cell $C_t$, making them computationally more efficient.

---

## **2. Mathematical Formulation of GRU**

At time step $t$, given:

- **Input**: $x_t$ (current input)
- **Previous hidden state**: $h_{t-1}$

A GRU computes the next hidden state $h_t$ using the following steps:

### **Step 1: Compute the Reset Gate**

$$
r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)
$$

- $W_r$ and $U_r$ are weight matrices.
- $b_r$ is the bias.
- $\sigma$ is the sigmoid activation function.

**Purpose:**

- If $r_t$ is close to **0**, the previous hidden state is largely ignored when the candidate hidden state is computed, so past information is forgotten.
- If $r_t$ is close to **1**, most of the past information is kept.
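
As a minimal sketch of the reset-gate equation, the NumPy snippet below evaluates $r_t$ for one time step. The sizes (input dimension 3, hidden dimension 2) and the random weights are illustrative assumptions; the variable names simply mirror the symbols above and are not tied to any library.

```python
import numpy as np

def sigmoid(a):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 2   # illustrative sizes, not from the text

W_r = rng.normal(size=(hidden_size, input_size))    # input-to-hidden weights
U_r = rng.normal(size=(hidden_size, hidden_size))   # hidden-to-hidden weights
b_r = np.zeros(hidden_size)                         # bias

x_t = rng.normal(size=input_size)       # current input x_t
h_prev = rng.normal(size=hidden_size)   # previous hidden state h_{t-1}

# r_t = sigmoid(W_r x_t + U_r h_{t-1} + b_r)
r_t = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)
print(r_t.shape)   # (2,): one gate value per hidden unit
```

Each entry of `r_t` lies strictly between 0 and 1, so the gate acts as a per-unit soft switch rather than a hard on/off decision.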

---

### **Step 2: Compute the Update Gate**

$$
z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)
$$

- $W_z, U_z, b_z$ are the weights and bias.
- $\sigma$ is the sigmoid activation function.

**Purpose:**

- If $z_t$ is close to **0**, the hidden state is mostly influenced by the new candidate hidden state.
- If $z_t$ is close to **1**, the hidden state remains mostly unchanged (helps in long-term memory retention).
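
To illustrate the two regimes, the sketch below evaluates the update gate with a strongly positive and a strongly negative bias. The helper name `update_gate`, the fixed weights, and the bias values of ±5 are assumptions chosen only to make the effect visible.

```python
import numpy as np

def sigmoid(a):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-a))

def update_gate(x_t, h_prev, W_z, U_z, b_z):
    """z_t = sigmoid(W_z x_t + U_z h_{t-1} + b_z)"""
    return sigmoid(W_z @ x_t + U_z @ h_prev + b_z)

# Small, fixed illustrative values (hypothetical, not from the text).
W_z = np.full((2, 3), 0.1)
U_z = np.full((2, 2), 0.1)
x_t = np.array([1.0, -0.5, 0.25])
h_prev = np.array([0.3, -0.2])

# A large positive bias pushes z_t toward 1 (keep the old state);
# a large negative bias pushes z_t toward 0 (use the candidate state).
z_keep = update_gate(x_t, h_prev, W_z, U_z, b_z=np.full(2, 5.0))
z_replace = update_gate(x_t, h_prev, W_z, U_z, b_z=np.full(2, -5.0))
print(z_keep, z_replace)
```

In a trained network the gate values come from learned weights and the current inputs rather than a hand-set bias; the bias is used here only as a simple way to steer the gate.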

---

### **Step 3: Compute the Candidate Hidden State**

$$
\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)
$$

- $\tilde{h}_t$ is the candidate hidden state.
- $r_t \odot h_{t-1}$ → Element-wise multiplication of reset gate and previous hidden state.
- $\tanh$ keeps every entry of $\tilde{h}_t$ in the open interval $(-1, 1)$.

**Purpose:**

- Uses $r_t$ to decide how much previous memory to include.
- Helps the network selectively use relevant past information.
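
The effect of the reset gate on the candidate can be sketched as follows. The helper name `candidate_state` and all numeric values are illustrative assumptions; the snippet compares the extreme cases $r_t = 0$ and $r_t = 1$.

```python
import numpy as np

def candidate_state(x_t, h_prev, r_t, W_h, U_h, b_h):
    """h~_t = tanh(W_h x_t + U_h (r_t * h_{t-1}) + b_h)"""
    return np.tanh(W_h @ x_t + U_h @ (r_t * h_prev) + b_h)

# Fixed illustrative values (hypothetical, not from the text).
W_h = np.array([[0.5, -0.3, 0.1],
                [0.2, 0.4, -0.6]])
U_h = np.array([[0.7, -0.2],
                [0.1, 0.9]])
b_h = np.zeros(2)
x_t = np.array([1.0, 0.5, -1.0])
h_prev = np.array([0.8, -0.4])

h_reset = candidate_state(x_t, h_prev, np.zeros(2), W_h, U_h, b_h)  # r_t = 0
h_full = candidate_state(x_t, h_prev, np.ones(2), W_h, U_h, b_h)    # r_t = 1

# With r_t = 0 the recurrent term vanishes and only the input contributes.
print(np.allclose(h_reset, np.tanh(W_h @ x_t + b_h)))
print(h_reset, h_full)
```

With $r_t = 0$ the candidate behaves as if the sequence had just started, while with $r_t = 1$ it reduces to a plain tanh RNN update.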

---

### **Step 4: Compute the Final Hidden State**

$$
h_t = (1 - z_t) \odot \tilde{h}_t + z_t \odot h_{t-1}
$$

- If $z_t$ is **0**, it updates the hidden state fully using $\tilde{h}_t$.
- If $z_t$ is **1**, it keeps the old hidden state $h_{t-1}$.

This equation is a per-unit interpolation between the old hidden state and the candidate, which gives **smooth memory updates** that balance new and old information.
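
Putting the four steps together, a single GRU step and a loop over a short sequence can be sketched in NumPy as below. The function names `init_params` and `gru_step`, the parameter dictionary layout, the sizes, and the random initialisation are all assumptions made for illustration; this is not the API of any particular framework.

```python
import numpy as np

def sigmoid(a):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-a))

def init_params(input_size, hidden_size, seed=0):
    """Random weights and zero biases for the r, z and candidate (h) blocks."""
    rng = np.random.default_rng(seed)
    p = {}
    for g in ("r", "z", "h"):
        p[f"W_{g}"] = 0.1 * rng.normal(size=(hidden_size, input_size))
        p[f"U_{g}"] = 0.1 * rng.normal(size=(hidden_size, hidden_size))
        p[f"b_{g}"] = np.zeros(hidden_size)
    return p

def gru_step(x_t, h_prev, p):
    """One GRU time step following Steps 1-4."""
    r_t = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])              # Step 1
    z_t = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])              # Step 2
    h_tilde = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r_t * h_prev) + p["b_h"])  # Step 3
    return (1.0 - z_t) * h_tilde + z_t * h_prev                               # Step 4

params = init_params(input_size=3, hidden_size=4)
xs = np.random.default_rng(1).normal(size=(5, 3))   # sequence of 5 inputs

h = np.zeros(4)   # initial hidden state
for x_t in xs:
    h = gru_step(x_t, h, params)
print(h.shape)   # (4,)
```

Because each step mixes a tanh output with the previous state using weights that sum to 1, a hidden state that starts at zero stays inside $(-1, 1)$ for the whole sequence.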

---

## **3. How GRUs Solve RNN Problems**

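- **Mitigates vanishing gradients**: when $z_t$ is close to 1, the previous hidden state is copied almost unchanged, which gives gradients a near-identity path back through time.
- **Learns long-term dependencies** via the update gate, which can hold information across many time steps.
- **Simpler than LSTMs** (fewer parameters, typically faster training).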
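
One way to see the first point is to differentiate the Step 4 update with respect to $h_{t-1}$. The derivation below is a sketch using the symbols defined earlier; $\operatorname{diag}(\cdot)$, which builds a diagonal matrix from a vector, is notation introduced here.

$$
\frac{\partial h_t}{\partial h_{t-1}} = \operatorname{diag}(z_t) + \operatorname{diag}(1 - z_t)\,\frac{\partial \tilde{h}_t}{\partial h_{t-1}} + \operatorname{diag}(h_{t-1} - \tilde{h}_t)\,\frac{\partial z_t}{\partial h_{t-1}}
$$

When $z_t$ is close to 1, the first term approaches the identity matrix and the other two shrink: the second is scaled by $1 - z_t$, and the third contains the sigmoid derivative $z_t \odot (1 - z_t)$, since $\partial z_t / \partial h_{t-1} = \operatorname{diag}(z_t \odot (1 - z_t))\, U_z$. The gradient can then pass through the time step almost unchanged instead of being multiplied by a recurrent weight matrix and an activation derivative at every step.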

---

## **4. Differences Between GRUs and LSTMs**

| Feature                       | GRU                      | LSTM                               |
| ----------------------------- | ------------------------ | ---------------------------------- |
| Number of Gates               | 2 (Reset, Update)        | 3 (Forget, Input, Output)          |
| Memory Cell                   | No separate memory cell  | Has a memory cell $C_t$            |
| Computational Complexity      | Lower (fewer parameters) | Higher (more complex architecture) |
| Performance on Small Datasets | Often competitive or better, depending on the task | May need more training data to fit its extra parameters |
| Training Speed                | Typically faster         | Typically slower due to more parameters |
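
The parameter gap can be made concrete with a rough count. The sketch below assumes each gate or candidate block has one input weight matrix, one recurrent weight matrix, and one bias vector, as in the equations above, and that a standard LSTM has four such blocks (forget, input, output, and cell candidate). The helper `gated_rnn_params` and the layer sizes are hypothetical; some libraries store two bias vectors per block, which changes the totals slightly.

```python
def gated_rnn_params(input_size, hidden_size, n_blocks):
    """Count parameters when every block has W (hidden x input),
    U (hidden x hidden) and a bias vector (hidden)."""
    per_block = hidden_size * input_size + hidden_size * hidden_size + hidden_size
    return n_blocks * per_block

input_size, hidden_size = 128, 256   # illustrative sizes

gru = gated_rnn_params(input_size, hidden_size, n_blocks=3)    # reset, update, candidate
lstm = gated_rnn_params(input_size, hidden_size, n_blocks=4)   # forget, input, output, cell candidate

print(gru, lstm, gru / lstm)   # 295680 394240 0.75
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same size, regardless of the dimensions chosen.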

---

## **5. Summary**

- **GRUs** use **reset and update gates** to control memory flow.
- **Mathematically**, they use **sigmoid and tanh activations** to regulate information retention.
- **Compared to LSTMs**, they are **simpler and faster** while still handling long-term dependencies well. 

