# Gated Recurrent Unit (GRU) Documentation

## 1. Introduction
A Gated Recurrent Unit (GRU) is a type of recurrent neural network (RNN) introduced by Cho et al. in 2014 as a simplified alternative to the Long Short-Term Memory (LSTM) network. GRUs capture sequential dependencies while mitigating the vanishing gradient problem that affects standard RNNs.

Reference paper: [Cho et al., 2014 - Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation](https://arxiv.org/abs/1406.1078)

## 2. Mathematical Formulation
A GRU has two gates:
- **Update Gate**: Controls how much of the previous hidden state is carried forward and how much is replaced by the new candidate state.
- **Reset Gate**: Controls how much of the previous hidden state is used when computing the candidate state.

Let:
- \( x_t \) be the input vector at time step \( t \)
- \( h_t \) be the hidden state at time step \( t \)
- \( W_z, W_r, W_h \) be the input weight matrices for the update gate, reset gate, and candidate hidden state
- \( U_z, U_r, U_h \) be the corresponding recurrent weight matrices
- \( b_z, b_r, b_h \) be the corresponding bias vectors
- \( \sigma \) be the sigmoid activation function
- \( \odot \) be element-wise multiplication
- \( \tanh \) be the hyperbolic tangent activation function

The GRU update equations are:

### 2.1. Reset Gate
$$
r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)
$$

### 2.2. Update Gate
$$
z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)
$$

### 2.3. Candidate Hidden State
$$
\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)
$$

### 2.4. Final Hidden State Update
$$
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
$$
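
The four equations above can be sketched as a single time step in NumPy. This is a minimal illustration, not a reference implementation: the names `gru_step` and `init_params`, the dictionary layout of the parameters, the small random initialization, and the sizes used in the demo are all assumptions made for this example.

```python
import numpy as np

def sigmoid(a):
    """Element-wise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, params):
    """One GRU time step following sections 2.1 to 2.4.

    x_t:    input vector, shape (input_size,)
    h_prev: previous hidden state h_{t-1}, shape (hidden_size,)
    params: dict with W_*, U_*, b_* for the gates r, z and the candidate h
            (this layout is an assumption for the sketch)
    """
    # 2.1 Reset gate
    r_t = sigmoid(params["W_r"] @ x_t + params["U_r"] @ h_prev + params["b_r"])
    # 2.2 Update gate
    z_t = sigmoid(params["W_z"] @ x_t + params["U_z"] @ h_prev + params["b_z"])
    # 2.3 Candidate hidden state (reset gate scales h_prev element-wise)
    h_tilde = np.tanh(
        params["W_h"] @ x_t + params["U_h"] @ (r_t * h_prev) + params["b_h"]
    )
    # 2.4 Final hidden state: interpolate between h_prev and the candidate
    return (1.0 - z_t) * h_prev + z_t * h_tilde

def init_params(input_size, hidden_size, seed=0):
    """Hypothetical helper: small random weights and zero biases."""
    rng = np.random.default_rng(seed)
    params = {}
    for name in ("r", "z", "h"):
        params[f"W_{name}"] = rng.normal(scale=0.1, size=(hidden_size, input_size))
        params[f"U_{name}"] = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
        params[f"b_{name}"] = np.zeros(hidden_size)
    return params

# Unroll the cell over a short random sequence, starting from h_0 = 0.
params = init_params(input_size=3, hidden_size=4)
xs = np.random.default_rng(1).normal(size=(5, 3))  # 5 time steps, 3 features
h = np.zeros(4)
for x_t in xs:
    h = gru_step(x_t, h, params)

print(h.shape)  # (4,)
```

Because each step is a convex combination of the previous state and a `tanh` output, the hidden state stays within \([-1, 1]\) when it starts at zero.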

## 3. Explanation
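- The **reset gate** \( r_t \) controls how much of the previous hidden state \( h_{t-1} \) enters the candidate computation. When \( r_t \) is close to 0, the candidate depends mostly on the current input \( x_t \).
- The **update gate** \( z_t \) sets the mixing ratio in section 2.4: values close to 0 retain the previous hidden state, and values close to 1 replace it with the candidate.
- The **candidate hidden state** \( \tilde{h}_t \) is a proposed new state computed from \( x_t \) and the reset-gated previous state \( r_t \odot h_{t-1} \).
- The **final hidden state** \( h_t \) is an element-wise linear interpolation between the previous hidden state and the candidate hidden state.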
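
The gate behavior described above can be checked at the extremes with a few lines of NumPy. The sketch below uses made-up vectors and hand-set gate values rather than learned ones; the helper names `blend` and `candidate` are hypothetical and exist only for this illustration.

```python
import numpy as np

def blend(z_t, h_prev, h_tilde):
    """Final hidden state from section 2.4."""
    return (1.0 - z_t) * h_prev + z_t * h_tilde

def candidate(x_t, h_prev, r_t, W_h, U_h, b_h):
    """Candidate hidden state from section 2.3."""
    return np.tanh(W_h @ x_t + U_h @ (r_t * h_prev) + b_h)

# Illustrative values only (not taken from any trained model).
h_prev = np.array([0.5, -0.2])
h_tilde = np.array([-0.9, 0.7])

keep = blend(np.zeros(2), h_prev, h_tilde)      # z_t = 0: previous state is kept
replace = blend(np.ones(2), h_prev, h_tilde)    # z_t = 1: candidate takes over
half = blend(np.full(2, 0.5), h_prev, h_tilde)  # z_t = 0.5: midpoint of the two

# With r_t = 0 the candidate no longer depends on the previous hidden state.
rng = np.random.default_rng(0)
W_h = rng.normal(size=(2, 3))
U_h = rng.normal(size=(2, 2))
b_h = np.zeros(2)
x_t = rng.normal(size=3)
r_zero = np.zeros(2)
c_a = candidate(x_t, np.array([0.5, -0.2]), r_zero, W_h, U_h, b_h)
c_b = candidate(x_t, np.array([-0.8, 0.9]), r_zero, W_h, U_h, b_h)

print(np.allclose(keep, h_prev), np.allclose(replace, h_tilde), np.allclose(c_a, c_b))
```

In a trained GRU the gates take intermediate values per hidden unit, so different units can retain or overwrite their state independently.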

## 4. Advantages of GRU
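- **Fewer parameters** than LSTMs: a GRU has three sets of weights (two gates plus the candidate state) where an LSTM has four, and it keeps no separate cell state
- **Computationally efficient**: fewer matrix multiplications per time step than an LSTM of the same hidden size
- **Mitigates vanishing gradients** better than vanilla RNNs, because the update gate can pass the previous hidden state through nearly unchanged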
- **Performs well in sequential tasks** (e.g., speech recognition, NLP, time-series prediction)
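
The parameter saving can be made concrete with a short count. The sketch below assumes the formulation in section 2 (one input matrix, one recurrent matrix, and one bias vector per gate or candidate) and the standard LSTM with four such sets. The function names and the example sizes are assumptions, and some frameworks keep separate input and recurrent biases, so their reported counts can differ slightly.

```python
def gated_block_params(input_size, hidden_size):
    """Parameters in one W (hidden x input), one U (hidden x hidden), one bias."""
    return hidden_size * input_size + hidden_size * hidden_size + hidden_size

def gru_param_count(input_size, hidden_size):
    # Reset gate, update gate, candidate state: three blocks.
    return 3 * gated_block_params(input_size, hidden_size)

def lstm_param_count(input_size, hidden_size):
    # Input, forget, and output gates plus the cell candidate: four blocks.
    return 4 * gated_block_params(input_size, hidden_size)

gru = gru_param_count(input_size=128, hidden_size=256)
lstm = lstm_param_count(input_size=128, hidden_size=256)
print(gru, lstm, gru / lstm)  # 295680 394240 0.75
```

Under these assumptions a GRU layer always has 75% of the parameters of an LSTM layer with the same input and hidden sizes.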

## 5. Reference Paper
Cho, Kyunghyun, et al. *"Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation."* arXiv preprint arXiv:1406.1078 (2014).

