# Overview of Gradient Descent

Gradient descent is a **first-order iterative** optimization algorithm for finding the minimum of a function.  
It's used to minimize the loss function by updating model parameters in the **opposite direction of the gradient**.

**Key Components:**
- **Learning rate ($\eta$):** Determines step size
- **Gradient ($\nabla_w$):** Partial derivatives of loss w.r.t. parameters
- **Convergence criteria:** Stopping conditions for the algorithm

**Gradient Descent Variations**

* **Batch Gradient Descent**:
    - Processes the **entire training set** per iteration.
    - Provides **smooth updates** toward the optimal solution.
    - **Computationally intensive** for large datasets.
    - **Infeasible** for some applications. *(BGD needs the entire training set, which is not available in settings such as online learning.)*

* **Mini-batch Gradient Descent**:
    - Processes a **small, random subset** of the data per iteration.
    - **Balances gradient noise against computational cost** (less noisy than SGD, cheaper per update than BGD).
    - **Preferred in practice** for most deep learning applications.

* **Stochastic Gradient Descent**:
    - Processes **one training example** per iteration.
    - **Advantages:**
        - Suitable for **online/real-time** learning.
        - **Frequent updates** can lead to faster convergence.
    - **Disadvantages:**
        - Introduces **high variance (noise)** in updates.
        - **Erratic convergence path** (less smooth than BGD).

<div style="text-align:center">
    <img src="../assets/gradient_variations.png" alt="comparison of different gradient variations">
</div>

# Batch Gradient Descent (Vanilla GD)

## Characteristics

- Processes the entire training set to compute a single update
- Provides the exact gradient direction (no sampling noise)
- Converges to the global minimum for convex functions, provided the learning rate is suitably small

## Algorithm

**Input:** $ (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), ..., (x^{(N)}, y^{(N)}) $

> Initialize all weights  
> Do:  
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Update: $ W = W - \eta \nabla_W \frac{1}{N} \sum_{n=1}^N \text{loss}\left(f(x^{(n)}; W), y^{(n)}\right) $  
> Until convergence
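
As a minimal sketch of the loop above, the following fits a one-dimensional linear model with squared loss. The toy data, the model `w * x + b`, the learning rate, and the gradient-norm stopping rule are illustrative assumptions, not part of the algorithm statement itself.

```python
def batch_gradient_descent(xs, ys, eta=0.1, tol=1e-10, max_iters=10_000):
    """Fit y ~ w * x + b by full-batch gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(max_iters):
        # Gradient of (1/N) * sum (w*x + b - y)^2, computed over the ENTIRE training set.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= eta * grad_w
        b -= eta * grad_b
        # Assumed convergence criterion: stop once the squared gradient norm is tiny.
        if grad_w ** 2 + grad_b ** 2 < tol:
            break
    return w, b


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]          # toy data generated by y = 2x + 1
w, b = batch_gradient_descent(xs, ys)
print(w, b)                        # approaches w = 2, b = 1
```

Every update touches all $N$ samples, which is why the cost per step grows linearly with the dataset size.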

## Advantages & Disadvantages

**Advantages:**
- Stable, deterministic convergence
- Straightforward implementation

**Disadvantages:**
- Computationally expensive for large $N$
- Requires entire dataset in memory
- Can get stuck in local minima for non-convex functions

**Practical Considerations:**
- Rarely used in modern deep learning
- May be suitable for small datasets

# Stochastic Gradient Descent (SGD)

## Characteristics

- Processes **one random sample** per iteration
- **High variance** updates (noisy gradients)
- Can **escape local minima** due to noise

## SGD: Incremental Update

**Algorithm:**

> Given $ (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), ..., (x^{(N)}, y^{(N)}) $  
> Initialize all weights  
> $ j = 0 $  
> Do:  
> &nbsp;&nbsp;&nbsp;&nbsp;Randomly permute data  
> &nbsp;&nbsp;&nbsp;&nbsp;For all $ n = 1 : N $:  
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$ j = j + 1 $  
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Update: $ W = W - \eta_j \nabla_W \text{loss}\left(f(x^{(n)}; W), y^{(n)}\right) $  
> Until $ Err $ has converged

**Key Properties:**
- Looping through the samples in the same order on every pass can lead to **cyclic behavior**
- Training instances must therefore be presented in random order *(hence "stochastic" gradient descent)*
- The *learning rate* usually decreases with the update index $j$, making it a function of $j$
- The iterations can make multiple passes over the data
- A single pass through the entire training data is called an **epoch**
- An epoch over a training set with $N$ samples results in $N$ updates of parameters *(for single sample SGD)*
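
The properties above can be sketched with a per-sample loop on a toy linear regression. The data, the squared loss, and the particular decay rule `eta0 / (1 + 0.001 * j)` are assumptions chosen for illustration only.

```python
import random


def sgd(xs, ys, eta0=0.05, epochs=300, seed=0):
    """Single-sample SGD for y ~ w * x + b with squared loss."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(zip(xs, ys))
    j = 0                                   # global update counter
    for _ in range(epochs):                 # one epoch = one full pass over the data
        rng.shuffle(data)                   # new random order each epoch avoids cyclic behavior
        for x, y in data:
            j += 1
            eta = eta0 / (1 + 0.001 * j)    # assumed schedule: eta_j shrinks as j grows
            err = w * x + b - y
            w -= eta * 2 * err * x          # gradient of (w*x + b - y)^2 w.r.t. w
            b -= eta * 2 * err              # gradient w.r.t. b
    return w, b, j


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]                   # toy data generated by y = 2x + 1
w, b, num_updates = sgd(xs, ys)
print(w, b, num_updates)                    # num_updates = epochs * N
```

With $N = 4$ samples and 300 epochs, the counter ends at $4 \times 300$ updates, one per sample per epoch.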

## Convergence

SGD **converges almost surely** to a global/local minimum under:

**Sufficient Conditions:**
- **Infinite exploration:**
    $$\sum_k \eta_k = \infty$$
- **Vanishing steps:**
    $$\sum_k \eta_k^2 \lt \infty$$

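Among power-law schedules $\eta_k \propto k^{-p}$, both conditions hold exactly when $\frac{1}{2} \lt p \le 1$, so the fastest-decaying series that satisfies both requirements is: $\eta_k \propto \frac{1}{k}$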

**Optimal Learning Rate Schedule:**
- For strongly convex functions: $\eta_k \propto \frac{1}{k}$
- For non-convex functions: **Heuristic tuning** is common.

**Convergence Behavior:**
- Convex loss $\rightarrow$ Converges to the **global optimum**
- Non-convex loss $\rightarrow$ Converges to a **stationary point**
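
A quick numerical check of the two step-size conditions for $\eta_k = \frac{1}{k}$ is sketched below. It only illustrates the trend of the partial sums and is not a proof; the cut-off values of `K` are arbitrary.

```python
import math


def partial_sums(K):
    """Partial sums of eta_k = 1/k and eta_k^2 = 1/k^2 for k = 1..K."""
    s1 = sum(1.0 / k for k in range(1, K + 1))
    s2 = sum(1.0 / k ** 2 for k in range(1, K + 1))
    return s1, s2


for K in (10, 1_000, 100_000):
    s1, s2 = partial_sums(K)
    # s1 keeps growing (roughly like ln K); s2 stays bounded by pi^2 / 6.
    print(f"K={K:>7}  sum eta = {s1:7.3f}  sum eta^2 = {s2:.6f}")

print("pi^2 / 6 =", math.pi ** 2 / 6)
```

The first sum grows without bound, so the iterates can travel arbitrarily far, while the second stays finite, so the accumulated noise remains bounded.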

## Minimize Loss Function

**Expected vs. Empirical Loss**

- **Expected Loss (ideal, intractable):**
    $$E\left[loss\left(f(X;W), g(X)\right)\right] = \int_{X} loss\left(f(X;W), g(X)\right) P(X) dX$$

- **Empirical Loss (practical):**
    $$Err\left(f(X;W), g(X)\right) = \frac{1}{N} \sum_{i=1}^N loss\left(f(x^{(i)}; W), y^{(i)}\right)$$


The variance of the empirical error (for independently drawn samples):
$$var(Err) = \frac{var(loss)}{N}$$
The larger this variance, the greater the likelihood that the empirical minimizer $\hat{W}$ will differ significantly from the expected-loss minimizer $W^*$.
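
The $\frac{1}{N}$ factor follows in one line, assuming the $N$ training samples are drawn independently from $P(X)$ and writing $loss_i$ as shorthand for $loss\left(f(x^{(i)}; W), y^{(i)}\right)$:

$$var(Err) = var\left(\frac{1}{N} \sum_{i=1}^N loss_i\right) = \frac{1}{N^2} \sum_{i=1}^N var(loss_i) = \frac{N \, var(loss)}{N^2} = \frac{var(loss)}{N}$$

The second equality uses independence (the covariance terms vanish), and the third uses the fact that every $loss_i$ has the same variance.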

## SGD vs. BGD

| Aspect               | SGD                              | Batch GD                        |
|----------------------|----------------------------------|---------------------------------|
| **Update Speed**     | Fast (per-sample updates)        | Slow (full-batch updates)       |
| **Variance**         | High (noisy gradients)           | Low (smooth convergence)        |
| **Scalability**      | Excellent for large datasets     | Limited by memory               |
| **Memory Usage**     | Low (processes 1 sample at time) | High (stores entire dataset)    |
| **Convergence**      | Erratic path, may oscillate      | Stable, direct path             |
| **Online Learning**  | Yes (can update continuously)    | No (requires full dataset)      |
| **Use Cases**        | Large datasets, streaming data   | Small datasets (<10k samples)   |

# Mini-Batch Gradient Descent 

## Characteristics

- Processes **small, randomly sampled batches** (typically 32–512 samples per batch).
- Balances noise and computational efficiency.
- The **default choice** for most deep learning applications.

**Algorithm Overview:**

Given $ (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), ..., (x^{(N)}, y^{(N)}) $

> Initialize all weights  
> $ j = 0 $  
> Do:  
> &nbsp;&nbsp;&nbsp;&nbsp;Randomly permute data  
> &nbsp;&nbsp;&nbsp;&nbsp;Split data into batches $\{B_1, \cdots,B_m\}$  
> &nbsp;&nbsp;&nbsp;&nbsp;For each batch $B$:  
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$ j = j + 1 $  
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Update: $ W = W - \eta_j \nabla_W \frac{1}{|B|} \sum_{n \in B} \text{loss}\left(f(x^{(n)}; W), y^{(n)}\right) $  
> Until $ Err $ has converged
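
The loop above can be sketched on a toy linear regression as follows. The data, the batch size of 2, the constant learning rate, and the fixed epoch count (used here in place of a convergence test) are illustrative assumptions.

```python
import random


def minibatch_gd(xs, ys, batch_size=2, eta=0.05, epochs=300, seed=0):
    """Mini-batch gradient descent for y ~ w * x + b with squared loss."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(zip(xs, ys))
    updates = 0
    for _ in range(epochs):
        rng.shuffle(data)                                   # randomly permute data
        batches = [data[i:i + batch_size]                   # split into batches B_1..B_m
                   for i in range(0, len(data), batch_size)]
        for batch in batches:
            # Average the per-sample gradients over the current batch only.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= eta * grad_w
            b -= eta * grad_b
            updates += 1
    return w, b, updates


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]                                   # toy data generated by y = 2x + 1
w, b, updates = minibatch_gd(xs, ys)
print(w, b, updates)                                        # updates = epochs * number of batches
```

Setting `batch_size=1` recovers single-sample SGD, and `batch_size=len(xs)` recovers batch gradient descent.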

## Minimizing the Loss Function

- Mini-batch updates minimize the **batch error** (here $B$ denotes the batch size):
    $$
    \text{BatchErr}\left(f(X;W), g(X)\right) = \frac{1}{B} \sum_{i=1}^B loss\left(f(x^{(i)}; W), y^{(i)}\right)
    $$

- **Expected batch error = Expected loss:**
    $$
    E\left[\text{BatchErr}\right] = E\left[loss\left(f(X;W), g(X)\right)\right]
    $$

- **Variance of the batch error** (reduces with larger batch size):
    $$
    var(\text{BatchErr}) = \frac{var(loss)}{B}
    $$
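
The $\frac{1}{B}$ scaling can be checked empirically. The sketch below draws random batches (with replacement) from a hypothetical pool of per-sample losses and compares the measured variance of the batch mean with $\frac{var(loss)}{B}$. The synthetic losses, trial count, and seeds are assumptions.

```python
import random
import statistics


def batch_error_variance(losses, batch_size, trials=10_000, seed=0):
    """Empirical variance of the mean loss over random batches drawn with replacement."""
    rng = random.Random(seed)
    means = [statistics.fmean(rng.choices(losses, k=batch_size)) for _ in range(trials)]
    return statistics.pvariance(means)


pool_rng = random.Random(1)
losses = [pool_rng.random() ** 2 for _ in range(1_000)]   # hypothetical per-sample losses
var_loss = statistics.pvariance(losses)

for B in (1, 4, 16, 64):
    measured = batch_error_variance(losses, B)
    print(f"B={B:>2}  measured={measured:.6f}  var(loss)/B={var_loss / B:.6f}")
```

The two columns should agree up to sampling error, and both shrink as the batch size grows.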

## Key Advantages & Considerations

**Advantages:**

- **Efficiency on Large Datasets**
    - Processes data in smaller, manageable batches.
    - Vectorization enables parallel computation.

- **Parallel Training on GPUs**
    - Multiple mini-batches can be processed in parallel.
    - Independent local updates are aggregated into a global model.

**Potential Challenge: Noisy Gradients**
- **Solution:** Use larger batch sizes when possible.

## Mini-Batch Gradient Descent with Forward/Backward Propagation

Each batch $t$ consists of inputs $X^t$ and the corresponding outputs $Y^t$.

> For $\text{epoch} = 1, \cdots, k$:  
> &nbsp;&nbsp;&nbsp;&nbsp;**Shuffle** the training data.  
> &nbsp;&nbsp;&nbsp;&nbsp;For $t = 1, ..., m$ (where $m$ = number of batches):  
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**Forward propagation** on batch $ X^t $:  
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$J^t = \frac{1}{|\text{Batch}_t|} \sum_{n \in \text{Batch}_t} L \left( \hat{Y}_n^t, Y_n^t \right) + \lambda R(W)$  
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**Backpropagation** on $ J^t $ to compute gradients $ dW $.  
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;For each layer $ l = 1, ..., L $:  
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$W^{[l]} = W^{[l]} - \alpha \, dW^{[l]}$

> **Forward pass** (Vectorized implementation):  
> Initialize input: $ A^{[0]} = X^t $.  
> For each layer $ l = 1, ..., L $:  
> &nbsp;&nbsp;&nbsp;&nbsp;$Z^{[l]} = W^{[l]} A^{[l-1]}$  
> &nbsp;&nbsp;&nbsp;&nbsp;$A^{[l]} = f^{[l]}\left(Z^{[l]}\right)$  
> $\hat{Y}^t = A^{[L]} \quad \text{(Output of the last layer)}$  

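- $ J^t $: Batch loss (with regularization term $ \lambda R(W) $).  
- $ L(\cdot, \cdot) $: Per-sample loss function (e.g., cross-entropy, MSE). As the upper bound in $ l = 1, ..., L $, the same letter denotes the number of layers.  
- $ f^{[l]} $: Activation function of layer $ l $.  
- $ \alpha $: Learning rate.  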
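
A dependency-free sketch of the vectorized forward pass is shown below, with matrices stored as lists of rows and examples stored as columns. The two-layer network, its weights, and the helper names (`matmul`, `relu`, `identity`, `forward`) are hypothetical and exist only to make the recursion concrete. As in the pseudocode, no bias term is included.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def relu(Z):
    return [[max(0.0, z) for z in row] for row in Z]


def identity(Z):
    return Z


def forward(X, weights, activations):
    """X has shape (features, batch); weights[l] has shape (units_l, units_{l-1})."""
    A = X                                    # A^[0] = X^t
    for W, f in zip(weights, activations):
        Z = matmul(W, A)                     # Z^[l] = W^[l] A^[l-1]
        A = f(Z)                             # A^[l] = f^[l](Z^[l])
    return A                                 # Y_hat^t = A^[L]


# Hypothetical network: 2 inputs -> 2 hidden units (ReLU) -> 1 linear output.
W1 = [[1.0, -1.0], [0.5, 0.5]]
W2 = [[1.0, 2.0]]
X = [[1.0, 0.0], [2.0, 3.0]]                 # two examples stored as columns
Y_hat = forward(X, [W1, W2], [relu, identity])
print(Y_hat)                                 # one output row, one column per example
```

Because the whole batch is a single matrix, each layer is one matrix product regardless of how many examples the batch holds.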

## Choosing the Mini-Batch Size

- **Small datasets:** Use **full-batch** gradient descent.
- **Large datasets:** Typical sizes are 64, 128, 256, or 1024.
- **GPU memory constraints:** Ensure batch data + computations fit in memory.
- **General rule:** There is no universal best value. It depends on the optimization landscape, so treat batch size as a hyperparameter to tune.

# Learning Rate

## Introduction

The **learning rate ($\eta$)** is a critical hyperparameter that controls the step size during optimization. It directly influences:
- **Convergence speed** – How quickly the model learns.
- **Final performance** – Whether the model reaches a good (or optimal) solution.

| Aspect                  | High Learning Rate (η)                           | Low Learning Rate (η)           |
|-------------------------|--------------------------------------------------|---------------------------------|
| **Convergence Speed**   | Faster initial progress                          | Slower but stable updates       |
| **Minima Behavior**     | Risk of overshooting minima                      | Risk of getting stuck           |
| **Stability**           | May cause divergence (∞/NaN)                     | Stable convergence              |
| **Final Performance**   | Potentially better escape from poor local minima | May converge suboptimally       |
| **Typical Use Cases**   | Early training phases                            | Fine-tuning phases              |
| **Loss Landscape**      | Better for rugged landscapes                     | Better for smooth landscapes    |

**Practical Guidelines**
- **Loss stagnates?** $\eta$ is likely **too low**.
- **Loss explodes/NaN?** $\eta$ is likely **too high**.
- **Start small**, then tune until the loss decreases smoothly.
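
These regimes are easy to reproduce on the one-dimensional function $f(w) = w^2$, where each gradient step multiplies $w$ by $(1 - 2\eta)$. The function, starting point, and the four learning rates below are illustrative assumptions.

```python
def run_gd(eta, w0=1.0, steps=20):
    """Plain gradient descent on f(w) = w^2, whose gradient is 2w."""
    w = w0
    for _ in range(steps):
        w -= eta * 2 * w
    return w


for eta in (0.001, 0.1, 0.9, 1.1):
    print(f"eta={eta:<5}  |w| after 20 steps = {abs(run_gd(eta)):.3e}")
```

With $\eta = 0.001$ the iterate barely moves (stagnation), $\eta = 0.1$ converges steadily, $\eta = 0.9$ converges while flipping sign on every step (overshooting), and $\eta = 1.1$ grows without bound (divergence).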

<div style="text-align:center">
    <img src="../assets/learning_rate_effect.png" alt="learning rate effect">
</div>

## Learning Rate Decay

### **Introduction**

As training progresses, **reducing $\eta$** helps refine convergence:
- **Early training:** Larger steps for rapid progress.
- **Later stages:** Smaller steps for precise optimization.

**Benefits:**
- **Faster convergence**
- **Improved final accuracy**



**Learning Rate Strategies**

- **Adaptive Learning Rate Methods**  
    Algorithms like Adagrad, RMSprop, and Adam adjust the learning rate dynamically based on parameters or gradients.

    - **Adagrad:** Adapts the learning rate to each parameter, performing smaller updates for frequently occurring features.
    - **RMSprop:** Modifies Adagrad by using a moving average of squared gradients to scale the learning rate.
    - **Adam:** Combines RMSprop-style scaling with momentum, using exponentially decaying averages of both past gradients (first moment) and past squared gradients (second moment).

- **Learning Rate Schedules**  
    These methods adjust the learning rate globally (same for all parameters) based on a predefined rule or formula.
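
To make the adaptive idea concrete, here is a scalar sketch of the standard Adam update with its commonly used default coefficients ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$). The objective $(w - 3)^2$, the step count, and the function name `adam_minimize` are assumptions made for illustration. In practice the framework optimizers would be used instead.

```python
import math


def adam_minimize(grad, w0, eta=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Scalar Adam: momentum (m) plus RMSprop-style scaling (v), with bias correction."""
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g          # decaying average of past gradients
        v = beta2 * v + (1 - beta2) * g * g      # decaying average of past squared gradients
        m_hat = m / (1 - beta1 ** t)             # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)             # bias-corrected second moment
        w -= eta * m_hat / (math.sqrt(v_hat) + eps)
    return w


w_final = adam_minimize(lambda w: 2 * (w - 3), w0=0.0)   # gradient of (w - 3)^2
print(w_final)                                           # approaches the minimizer w = 3
```

Dropping `m` (using `g` directly) gives an RMSprop-like update, and replacing the decaying average `v` with a running sum of `g * g` gives an Adagrad-like update.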

### **Learning Rate Schedules**

#### **Step-wise Decay**

The learning rate is reduced by a constant factor after a fixed number of epochs.
$$
\text{lr} = \text{lr}_0 d^{\lfloor \frac{1 + \text{epoch}}{s} \rfloor}
$$
Where:
- $\text{lr}_0$: Initial learning rate
- $d$: Decay factor applied at each drop (e.g., 0.5)
- $s$: Step size (number of epochs between drops)
- epoch: Index of the current epoch
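
The formula translates directly into a small function. The default values below (`lr0=0.01`, `d=0.5`, `s=10`) are assumptions chosen to mirror typical settings.

```python
import math


def step_decay(epoch, lr0=0.01, d=0.5, s=10):
    """lr = lr0 * d ** floor((1 + epoch) / s)."""
    return lr0 * d ** math.floor((1 + epoch) / s)


for epoch in (0, 8, 9, 18, 19, 29):
    print(epoch, step_decay(epoch))
```

Because of the `1 + epoch` term, with zero-based epoch indices the first drop happens at epoch `s - 1` (epoch 9 here), and each later drop follows `s` epochs after the previous one.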

<div style="text-align:center">
    <img src="../assets/step_decay.png" alt="step-wise learning rate">
</div>

**Implementation:**

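```python
import tensorflow as tf
import torch

# TensorFlow: with staircase=True the rate is halved every `decay_steps` optimizer steps
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01, decay_steps=10000, decay_rate=0.5, staircase=True)
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule)

# PyTorch: the rate is halved every 10 calls to scheduler.step() (typically once per epoch)
model = torch.nn.Linear(10, 1)  # placeholder model so the snippet runs on its own
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
```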

#### **Reduce on Loss Plateau Decay**

This strategy reduces the learning rate when a performance metric (e.g., validation loss) has stopped improving for a set number of epochs.

**Advantages:**
- Automatically adapts to training dynamics
- Prevents premature convergence
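
The logic can be sketched without any framework. The class below is a simplified, hypothetical stand-in (its name, defaults, and the strict "lower is better" comparison are assumptions). Both PyTorch and Keras ship their own `ReduceLROnPlateau` implementations with more options.

```python
class ReduceOnPlateau:
    """Multiply the learning rate by `factor` once more than `patience`
    consecutive epochs pass without a new best validation loss."""

    def __init__(self, lr, factor=0.5, patience=2, min_lr=1e-6):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:             # improvement: record it and reset the counter
            self.best = val_loss
            self.bad_epochs = 0
        else:                                # no improvement this epoch
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0
        return self.lr


sched = ReduceOnPlateau(lr=0.1)
val_losses = [1.0, 0.8, 0.8, 0.81, 0.79, 0.8, 0.8, 0.8]   # made-up validation losses
history = [sched.step(loss) for loss in val_losses]
print(history)
```

In this made-up run the rate stays at 0.1 until the third consecutive epoch without improvement after the best loss of 0.79, at which point it is halved.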

<div style="text-align:center">
    <img src="../assets/plateau_decay.png" alt="plateau decay example">
</div>

#### **Fixed Learning Rate**

**Simplest** approach where the **rate remains constant** throughout training.

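**Implementation:**
```python
import tensorflow as tf
import torch
# TensorFlow: constant learning rate
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
# PyTorch: constant learning rate (no scheduler attached)
model = torch.nn.Linear(10, 1)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
```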

#### **Time-Based Decay**

The learning rate decreases over time using a predefined formula, often proportionally to the inverse of the training epoch number.  
This helps in taking larger steps at the beginning and finer steps as the model approaches convergence.

##### **Cosine Decay**

Smooth reduction following a cosine curve.

**Formula:**
$$
\alpha_t = \frac{1}{2} \alpha_0 \left(1 + \cos\left(\frac{\pi t}{T}\right)\right)
$$
Where:
- $\alpha_0$: Initial learning rate
- $\alpha_t$: Learning rate at epoch t
- $T$: Total number of epochs
    
<div style="text-align:center">
    <img src="../assets/lr_decay_cosine.png" alt="cosine learning rate decay">
</div>

##### **Linear Decay**

Straightforward linear reduction.

**Formula:**
$$
\alpha_t = \alpha_0(1 - \frac{t}{T})
$$
Where:
- $\alpha_0$: Initial learning rate
- $\alpha_t$: Learning rate at epoch t
- $T$: Total number of epochs

<div style="text-align:center">
    <img src="../assets/lr_decay_linear.png" alt="linear learning rate decay">
</div>

##### **Inverse Square Root Decay**

Aggressive early decay with slower reduction later.

**Formula:**
$$
\alpha_t = \frac{\alpha_0}{\sqrt{t}}
$$
Where:
- $\alpha_0$: Initial learning rate
- $\alpha_t$: Learning rate at epoch $t$ (with $t \ge 1$)

<div style="text-align:center">
    <img src="../assets/lr_decay_inverse_sqrt.png" alt="inverse sqrt learning rate decay">
</div>

##### **Exponential Decay**

Continuous exponential reduction.

**Formula:**
$$
\alpha_t = \alpha_0 e^{-kt}
$$
Where:
- $\alpha_0$: Initial learning rate
- $\alpha_t$: Learning rate at epoch t
- $k$: Decay rate ($k > 0$)

<div style="text-align:center">
    <img src="../assets/lr_decay_exponential.png" alt="exponential learning rate decay">
</div>

**Implementation:**
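```python
import tensorflow as tf
import torch

# TensorFlow: smooth decay, lr = 0.01 * 0.9 ** (step / decay_steps) when staircase=False
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01, decay_steps=10000, decay_rate=0.9, staircase=False)
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule)

# PyTorch: the rate is multiplied by gamma on every call to scheduler.step()
model = torch.nn.Linear(10, 1)  # placeholder model so the snippet runs on its own
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)
```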

##### **Polynomial Decay**

Flexible decay controlled by power parameter.

**Formula:**
$$
\alpha_t = \alpha_0(1 - \frac{t}{T})^m
$$
Where:
- $\alpha_0$: Initial learning rate
- $\alpha_t$: Learning rate at epoch t
- $T$: Total number of epochs
- $m$: Power hyperparameter
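
For a side-by-side comparison, the time-based schedules in this section can be written as plain functions. This sketch uses the standard forms: cosine decay as $\frac{1}{2}\alpha_0\left(1 + \cos\frac{\pi t}{T}\right)$ and exponential decay as $\alpha_0 e^{-kt}$ with an assumed decay rate $k > 0$. The values of `a0`, `T`, `k`, and `m` are illustrative.

```python
import math


def cosine_decay(t, T, a0):
    return 0.5 * a0 * (1 + math.cos(math.pi * t / T))


def linear_decay(t, T, a0):
    return a0 * (1 - t / T)


def inverse_sqrt_decay(t, a0):
    return a0 / math.sqrt(t)                 # defined for t >= 1


def exponential_decay(t, a0, k=0.05):
    return a0 * math.exp(-k * t)             # k > 0 is an assumed decay rate


def polynomial_decay(t, T, a0, m=2.0):
    return a0 * (1 - t / T) ** m             # m = 1 reduces to linear decay


a0, T = 0.1, 100
print(" t   cosine   linear  inv_sqrt  exponential  polynomial")
for t in (1, 25, 50, 75, 100):
    print(f"{t:>3}  {cosine_decay(t, T, a0):.5f}  {linear_decay(t, T, a0):.5f}  "
          f"{inverse_sqrt_decay(t, a0):.5f}   {exponential_decay(t, a0):.5f}      "
          f"{polynomial_decay(t, T, a0):.5f}")
```

Cosine, linear, and polynomial decay all reach zero at $t = T$, whereas inverse square root and exponential decay only approach zero asymptotically.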

<div style="text-align:center">
    <img src="../assets/all_decay_schedules.png" alt="all schedules">
</div>

#### **Cyclical Learning Rate (CLR)**

This method allows the learning rate to oscillate between a minimum and a maximum boundary.  
It can be implemented using various functions such as triangular, sinusoidal, or exponential patterns.

**Benefits:**
- Can help escape poor local minima
- Reduces the need for manual learning rate tuning
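
The triangular pattern, the simplest of these, can be sketched as a function of the iteration index. The boundary values and the half-cycle length `step_size` below are assumptions picked for illustration.

```python
def triangular_clr(iteration, base_lr=0.001, max_lr=0.01, step_size=2000):
    """Triangular cyclical schedule: rise for `step_size` iterations, then fall back."""
    cycle_pos = iteration % (2 * step_size)          # position inside the current cycle
    frac = 1 - abs(cycle_pos / step_size - 1)        # goes 0 -> 1 -> 0 over one cycle
    return base_lr + (max_lr - base_lr) * frac


for it in (0, 1000, 2000, 3000, 4000):
    print(it, triangular_clr(it))
```

The rate starts at `base_lr`, peaks at `max_lr` after `step_size` iterations, and returns to `base_lr` after `2 * step_size` iterations, after which the cycle repeats.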

<div style="text-align:center">
    <img src="../assets/clr.png" alt="clr learning rate decay">
</div>

**Implementation:**
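```python
import torch
# PyTorch: triangular cycle between base_lr and max_lr; call scheduler.step() after every batch
model = torch.nn.Linear(10, 1)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.CyclicLR(optimizer, base_lr=0.001, max_lr=0.01)
```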

#### **Learning Rate Warmup**

Starts with a small learning rate and gradually *(e.g., linearly)* increases it over a few initial epochs or iterations.  
This is particularly useful in preventing the model from diverging in the initial phase of training.  
It is often used when training deep networks.

**Use Cases:**
- Transformer models
- Large batch training
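
A linear warmup can be sketched in a few lines. The target rate, the warmup length, and the choice to hold the rate constant afterwards (rather than hand over to a decay schedule) are assumptions.

```python
def linear_warmup(step, target_lr=0.001, warmup_steps=1000):
    """Ramp linearly from near zero up to `target_lr`, then hold it constant."""
    if step < warmup_steps:
        return target_lr * (step + 1) / warmup_steps
    return target_lr


for step in (0, 499, 999, 5000):
    print(step, linear_warmup(step))
```

In practice the constant phase is usually replaced by one of the decay schedules above, so the rate first rises during warmup and then decays.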

<div style="text-align:center">
    <img src="../assets/linear_warmup.png" alt="linear warmup learning rate decay">
</div>