# 📝 Notes on Optimizers in Deep Learning  

---

## 1. Introduction  
Optimizers are algorithms used to **update the weights** of a neural network in order to minimize the **loss function**.  
They determine **how the model learns** and how quickly it converges.  

---

## 2. Key Terms You Must Know  

### Loss Function (Cost Function)  
- A measure of error between predictions and actual values.  
- Example:  
  - Mean Squared Error (MSE) → Regression  
  - Cross-Entropy Loss → Classification  
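
As a rough illustration of the two losses above, here is a minimal NumPy sketch. The helper names `mse` and `cross_entropy` and the sample values are made up for this example, and the cross-entropy shown is the single-sample form for a vector of predicted class probabilities.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy(probs, label):
    """Negative log-probability assigned to the true class index."""
    return -np.log(probs[label])

# Regression: three targets and three predictions (illustrative values).
mse_value = mse(np.array([1.0, 2.0, 3.0]), np.array([1.5, 2.0, 2.5]))

# Classification: predicted probabilities for 3 classes, true class is 0.
ce_value = cross_entropy(np.array([0.7, 0.2, 0.1]), label=0)
```

Both values shrink toward zero as predictions approach the targets, which is the property the optimizers in these notes exploit.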

### Gradient  
- The vector of **partial derivatives of the loss** with respect to the weights.  
- Points in the direction of steepest **increase** of the loss, so weights are updated in the **opposite direction** to reduce error.  

### Learning Rate (η)  
- A hyperparameter that controls the **step size** in updating weights.  
- Small η → slow learning  
- Large η → unstable, may diverge  

### Weight Update Rule  
General formula:  
$$
w = w - \eta \cdot \nabla L
$$  
**Where:**
- $w$: weight (parameter)
- $\eta$: learning rate
- $\nabla L$: gradient of the loss with respect to $w$
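
To make the update rule and the role of $\eta$ concrete, the sketch below applies $w = w - \eta \cdot \nabla L$ to the toy loss $L(w) = w^2$, whose gradient is $2w$. The toy loss, the function name `run_gd`, and the three learning rates are assumptions chosen for illustration only.

```python
def run_gd(lr, steps=20, w0=1.0):
    """Minimize L(w) = w**2 with plain gradient descent; return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w        # dL/dw for L(w) = w**2
        w = w - lr * grad     # the weight update rule
    return w

small = run_gd(lr=0.01)   # each step multiplies w by 0.98: slow progress
good = run_gd(lr=0.4)     # each step multiplies w by 0.2: fast convergence
large = run_gd(lr=1.1)    # each step multiplies w by -1.2: diverges
```

On this loss every step multiplies $w$ by $(1 - 2\eta)$, so any $\eta > 1$ makes $|w|$ grow instead of shrink.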

### Epoch  
- One complete pass through the **entire training dataset**.  

### Batch & Mini-Batch  
- **Batch Gradient Descent:** Uses the full dataset to compute gradient.  
- **Stochastic Gradient Descent (SGD):** Updates weights after every single sample.  
- **Mini-Batch Gradient Descent:** Uses small subsets (common in practice).  

### Convergence  
- The point where the model stops improving (loss stabilizes).  

### Momentum  
- Technique that **remembers past gradients** to smooth updates.  
- Helps avoid zig-zagging and speeds up convergence.  

### Adaptive Learning Rate  
- Optimizers like **RMSProp** and **Adam** adjust the learning rate for each parameter automatically.  

---



## 1. Gradient Descent (Batch GD)  

- Uses the **entire dataset** to compute the gradient before updating weights.  
- Uses the exact gradient of the training loss, but each update is **slow** for large datasets.  

$$
w \leftarrow w - \eta \,\nabla_w \mathcal{L}(w)
$$  

**Key points:**  
- Stable but computationally expensive.  
- Rarely used in pure form; more practical variants exist.  
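
A minimal sketch of batch gradient descent, assuming a synthetic one-feature linear regression problem with MSE loss. The data, the helper `mse_grad`, and the hyperparameters are illustrative assumptions, not part of the notes.

```python
import numpy as np

# Synthetic data: y = 3x + 0.5 plus a little noise (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + 0.5 + 0.1 * rng.normal(size=200)

def mse_grad(w, b, X, y):
    """Gradient of the mean squared error for predictions X @ w + b."""
    err = X @ w + b - y
    grad_w = 2.0 * (X.T @ err) / len(y)
    grad_b = 2.0 * err.mean()
    return grad_w, grad_b

w, b = np.zeros(1), 0.0
lr = 0.1
for epoch in range(200):
    grad_w, grad_b = mse_grad(w, b, X, y)   # uses the full dataset every step
    w = w - lr * grad_w
    b = b - lr * grad_b
```

Here one epoch equals exactly one weight update, which is why the method gets expensive as the dataset grows.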

---

## 2. Stochastic Gradient Descent (SGD)  

- Updates weights **after every single training sample**.  
- Each update is much cheaper than in batch gradient descent, but the gradient estimate is very noisy.  

$$
w \leftarrow w - \eta \,\nabla_w \mathcal{L}(w; x_i, y_i)
$$  

Where $(x_i, y_i)$ is a single training sample.  

**Key points:**  
- Introduces randomness, which can help escape local minima.  
- Can oscillate heavily around minima.  
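
The per-sample update can be sketched as follows, again assuming a synthetic linear regression problem with squared error. The data and hyperparameters are illustrative assumptions.

```python
import numpy as np

# Synthetic data: y = 3x + 0.5 plus a little noise (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + 0.5 + 0.1 * rng.normal(size=200)

w, b = np.zeros(1), 0.0
lr = 0.01
for epoch in range(20):
    for i in rng.permutation(len(y)):     # visit samples in a fresh random order
        err = X[i] @ w + b - y[i]         # scalar error for sample (x_i, y_i)
        w = w - lr * 2.0 * err * X[i]     # gradient of err**2 w.r.t. w
        b = b - lr * 2.0 * err            # gradient of err**2 w.r.t. b
```

Each epoch now performs 200 updates instead of one, at the cost of a noisier path toward the minimum.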

---

## 3. Mini-Batch Gradient Descent  

- Uses a **small subset (batch)** of training data for each update.  
- Most widely used in practice.  

$$
w \leftarrow w - \eta \,\nabla_w \mathcal{L}(w; B)
$$  

Where $B$ = batch of samples.  

**Key points:**  
- Balances the stability of batch gradient descent against the speed of SGD.  
- Well suited to GPUs, which can process the samples of a batch in parallel.  
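
A minimal sketch of the mini-batch loop, assuming the same kind of synthetic regression data. The generator `minibatches`, the batch size of 32, and the learning rate are hypothetical choices for illustration.

```python
import numpy as np

# Synthetic data: y = 3x + 0.5 plus a little noise (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + 0.5 + 0.1 * rng.normal(size=200)

def minibatches(n, batch_size, rng):
    """Yield index arrays that cover a shuffled range(n) exactly once."""
    order = rng.permutation(n)
    for start in range(0, n, batch_size):
        yield order[start:start + batch_size]

w, b = np.zeros(1), 0.0
lr, batch_size = 0.1, 32
for epoch in range(50):
    for idx in minibatches(len(y), batch_size, rng):
        err = X[idx] @ w + b - y[idx]                 # errors on batch B
        w = w - lr * 2.0 * (X[idx].T @ err) / len(idx)
        b = b - lr * 2.0 * err.mean()
```

With 200 samples and a batch size of 32, each epoch has six full batches and one final batch of 8 samples.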

---

## 4. SGD with Momentum  

- Adds a **momentum term** that remembers past gradients to smooth updates.  
- Helps accelerate learning in relevant directions and dampens oscillations.  

Update rules:  

$$
v_t = \gamma v_{t-1} + \eta \,\nabla_w \mathcal{L}(w_t)
$$  

$$
w_{t+1} = w_t - v_t
$$  

Where:  
- $v_t$: velocity (accumulated gradient)  
- $\gamma$: momentum coefficient (typically 0.9)  

**Key points:**  
- Often converges faster than vanilla SGD.  
- Reduces zig-zagging, especially in ravines.  
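
The two update rules above can be sketched on a toy "ravine", assumed here to be the quadratic $f(w) = \tfrac{1}{2}(w_1^2 + 50\,w_2^2)$, which is much steeper in one direction than the other. The loss, the helper names `grad` and `run`, and the hyperparameters are illustrative assumptions.

```python
import numpy as np

def grad(w):
    """Gradient of the toy ravine f(w) = 0.5 * (w[0]**2 + 50 * w[1]**2)."""
    return np.array([1.0, 50.0]) * w

def run(lr, gamma, steps=100):
    """Run SGD with momentum from a fixed start; return the final weights."""
    w = np.array([10.0, 1.0])
    v = np.zeros_like(w)
    for _ in range(steps):
        v = gamma * v + lr * grad(w)   # v_t = gamma * v_{t-1} + eta * grad
        w = w - v                      # w_{t+1} = w_t - v_t
    return w

plain = run(lr=0.02, gamma=0.0)      # gamma = 0 recovers plain gradient descent
momentum = run(lr=0.02, gamma=0.9)   # typical momentum coefficient
```

In this particular setup the momentum run ends much closer to the minimum at the origin after the same number of steps, because the accumulated velocity speeds up progress along the shallow direction.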

---

## 5. RMSProp (Root Mean Square Propagation)  

- Adjusts the learning rate for each parameter individually.  
- Uses an exponentially decaying average of squared gradients.  

Update rules:  

$$
E[g^2]_t = \beta E[g^2]_{t-1} + (1-\beta) g_t^2
$$  

$$
w_{t+1} = w_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \, g_t
$$  

Where:  
- $g_t$: gradient at time $t$  
- $\beta$: decay rate (commonly 0.9)  
- $\epsilon$: small constant to avoid division by zero  

**Key points:**  
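- Often works well on non-stationary problems and for training RNNs.  
- Dividing each step by the recent gradient magnitude makes updates less sensitive to very large or very small gradients, although it does not remove exploding/vanishing gradients by itself.  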
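
A minimal sketch of one RMSProp step, following the two update rules above with $\epsilon$ inside the square root. The function name `rmsprop_step`, the default hyperparameters, and the sample gradient are assumptions for illustration.

```python
import numpy as np

def rmsprop_step(w, g, avg_sq, lr=0.01, beta=0.9, eps=1e-8):
    """One RMSProp update; returns the new weights and the running average."""
    avg_sq = beta * avg_sq + (1.0 - beta) * g**2    # E[g^2]_t
    w = w - lr * g / np.sqrt(avg_sq + eps)          # per-parameter scaled step
    return w, avg_sq

w = np.array([10.0, 1.0])
g = np.array([10.0, 50.0])            # gradients differing by a factor of 5
w_new, avg_sq = rmsprop_step(w, g, np.zeros_like(w))
step = w - w_new                      # nearly identical in both coordinates
```

Starting from a zero running average, the first step has magnitude close to $\eta / \sqrt{1-\beta}$ in every coordinate regardless of the gradient scale, which is the per-parameter adaptation described above.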

---

## 6. Adam (Adaptive Moment Estimation)  

- Combines **Momentum + RMSProp**.  
- Maintains moving averages of both gradients and squared gradients.  

Update rules:  

1. Compute moving averages:  

$$
m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t
$$  

$$
v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2
$$  

2. Bias correction:  

$$
\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad 
\hat{v}_t = \frac{v_t}{1-\beta_2^t}
$$  

3. Parameter update:  

$$
w_{t+1} = w_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \,\hat{m}_t
$$  

Where:  
- $\beta_1 \approx 0.9$, $\beta_2 \approx 0.999$  
- $\epsilon$ is a small constant (e.g., $10^{-8}$)  

**Key points:**  
- A common default choice in practice (fast, and usually reliable with little tuning).  
- Works well on most problems, but sometimes generalizes worse than well-tuned SGD.  
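
The three steps above can be sketched in NumPy as follows. The function name `adam_step` and the sample gradient are assumptions for illustration; the default values mirror the typical settings listed above, with a learning rate of 0.001 chosen for the example.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (t starts at 1); returns new w, m, v."""
    m = beta1 * m + (1.0 - beta1) * g         # moving average of gradients
    v = beta2 * v + (1.0 - beta2) * g**2      # moving average of squared gradients
    m_hat = m / (1.0 - beta1**t)              # bias-corrected first moment
    v_hat = v / (1.0 - beta2**t)              # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([10.0, 1.0])
m, v = np.zeros_like(w), np.zeros_like(w)
g = np.array([10.0, 50.0])                    # illustrative gradient
w_new, m, v = adam_step(w, g, m, v, t=1)
first_step = w - w_new                        # close to lr in every coordinate
```

At $t = 1$ the bias correction gives $\hat{m}_1 = g_1$ and $\hat{v}_1 = g_1^2$, so the first step has magnitude close to $\eta$ for every parameter, independent of the gradient scale.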

---

### Summary Table  

| Optimizer | Formula (Core) | Key Feature | Pros | Cons |
|-----------|----------------|-------------|------|------|
| Gradient Descent | $w \leftarrow w - \eta \nabla L$ | Full dataset | Stable | Very slow |
| SGD | $w \leftarrow w - \eta \nabla L(x_i)$ | One sample | Fast, randomness helps | Noisy |
| Mini-Batch | $w \leftarrow w - \eta \nabla L(B)$ | Batch of data | Balanced, GPU friendly | Needs batch tuning |
| SGD + Momentum | $v_t = \gamma v_{t-1} + \eta g_t$ | Uses momentum | Faster, smoother | Needs momentum hyperparam |
| RMSProp | $w \leftarrow w - \frac{\eta}{\sqrt{E[g^2]_t+\epsilon}} g_t$ | Adaptive learning rate | Good for RNNs | May over-adapt |
| Adam | $w \leftarrow w - \frac{\eta}{\sqrt{\hat{v}_t}+\epsilon} \hat{m}_t$ | Adaptive + momentum | Common default, fast convergence | Can generalize worse than SGD |

---
