## Activation Functions

| Activation                        | Formula                                                      | Range   | Strengths ✅                                                                                                                         | Drawbacks ❌                                                                                                      | Typical Use                                                        |
| --------------------------------- | ------------------------------------------------------------ | ------- | ----------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------ |
| **Sigmoid** | $f(x) = \frac{1}{1+e^{-x}}$ | (0, 1) | - Smooth, differentiable<br>- Interpretable as probability | - Vanishing gradients (saturates at extremes)<br>- Not zero-centered (slows learning)<br>- Requires an exponential (costlier than ReLU) | Output layer in binary classification |
| **Tanh**                          | $f(x) = \tanh(x)$                                            | (-1, 1) | - Zero-centered<br>- Stronger gradients than sigmoid<br>- Better for hidden layers than sigmoid                                     | - Still suffers vanishing gradients at extremes                                                                  | Hidden layers in RNNs (older architectures)                        |
| **ReLU** | $f(x) = \max(0, x)$ | \[0, ∞) | - Simple & efficient<br>- Sparse activations (many zeros)<br>- No gradient saturation for positive inputs | - **Dying ReLU problem** (neurons stuck at 0 stop receiving gradient)<br>- Not zero-centered | Default choice for hidden layers in modern DNNs |
| **Leaky ReLU** | $f(x) = x$ if $x>0$, else $\alpha x$ ($\alpha \approx 0.01$) | (-∞, ∞) | - Mitigates the dying ReLU problem<br>- Keeps a small gradient for negative inputs | - Slightly more computation than ReLU<br>- $\alpha$ is a hand-set hyperparameter | Hidden layers where dying ReLU is a concern |
| **ELU** (Exponential Linear Unit) | $f(x) = x$ if $x>0$, else $\alpha(e^x-1)$ | (-α, ∞) | - Smooth negative side pushes mean activations toward 0<br>- Avoids dying ReLU<br>- Sometimes converges faster than ReLU | - More computationally expensive (exp function)<br>- Choice of $\alpha$ matters | Hidden layers in deep CNNs (alternative to ReLU) |
| **Maxout** | $f(x) = \max_{i=1,\dots,k} (x^T W_i + b_i)$ | (-∞, ∞) | - Very flexible (generalizes ReLU & Leaky ReLU)<br>- Represents convex piecewise-linear functions with up to $k$ pieces, so it can approximate any convex function<br>- Works well with Dropout | - More parameters ($k$ weight sets per unit)<br>- Higher computation cost<br>- Less common in practice now | Hidden layers (when expressive power is needed, esp. with Dropout) |
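
The formulas in the table can be sketched in plain Python as scalar functions. This is a minimal illustration, not a library implementation: the function names, the default `alpha` values, and the list-based `maxout` signature are assumptions made for the example.

```python
import math

def sigmoid(x):
    # Branch on the sign of x so math.exp never receives a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def tanh(x):
    return math.tanh(x)

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    # alpha is the small slope applied to negative inputs (assumed default).
    return x if x > 0 else alpha * x

def elu(x, alpha=1.0):
    # Saturates smoothly toward -alpha for very negative inputs (assumed default alpha).
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def maxout(x, weights, biases):
    # x: input vector; weights: k weight vectors; biases: k scalars.
    # Returns the largest of the k affine projections x . W_i + b_i.
    return max(
        sum(xi * wi for xi, wi in zip(x, w)) + b
        for w, b in zip(weights, biases)
    )
```

With `weights=[[1.0], [0.0]]` and `biases=[0.0, 0.0]`, `maxout([x], ...)` returns `max(x, 0)`, which is how Maxout reproduces ReLU as a special case.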


## Gradient Descent Algorithms Comparison

| Algorithm                             | Description                                                         | Strengths ✅                                                                                                                          | Drawbacks ❌                                                                                       | Typical Use                                    |
| ------------------------------------- | ------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------- | ---------------------------------------------- |
| **Batch Gradient Descent**            | Uses the **entire dataset** to compute gradient before each update. | - Stable & accurate gradient<br>- Smooth convergence                                                                                 | - Very slow for large datasets<br>- High memory usage<br>- Not suitable for online/streaming data | Small datasets or convex optimization problems |
| **Stochastic Gradient Descent (SGD)** | Uses **one training example at a time** to update weights. | - Fast updates<br>- Gradient noise can help escape shallow local minima<br>- Works with streaming data | - Very noisy updates<br>- Convergence may oscillate<br>- Needs careful tuning of learning rate | Large-scale online learning, streaming tasks |
| **Mini-batch Gradient Descent**       | Uses a **small batch of samples** (e.g., 32–256) for each update.   | - Balance of efficiency & stability<br>- Faster convergence than Batch GD<br>- Less noisy than SGD<br>- Vectorization (GPU friendly) | - Still some noise in updates<br>- Batch size must be tuned                                       | Standard choice in deep learning               |
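
The three variants differ only in how many samples feed each gradient estimate, so one training loop with a `batch_size` parameter covers all of them. The sketch below assumes a toy one-dimensional linear regression with noise-free data; `grad_mse`, `train`, and the learning rate are illustrative choices, not taken from any particular library.

```python
import random

def grad_mse(w, b, batch):
    """Gradient of mean squared error for the model y = w*x + b over one batch."""
    n = len(batch)
    gw = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    gb = sum(2 * (w * x + b - y) for x, y in batch) / n
    return gw, gb

def train(data, batch_size, lr=0.1, epochs=200, seed=0):
    """batch_size=len(data): batch GD; batch_size=1: SGD; anything between: mini-batch."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # new sample order each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            gw, gb = grad_mse(w, b, batch)
            w -= lr * gw
            b -= lr * gb
    return w, b

# Toy data generated from y = 2x + 1 (hypothetical, noise-free).
data = [(k / 10, 2 * (k / 10) + 1) for k in range(-10, 11)]

w_batch, b_batch = train(data, batch_size=len(data))  # 1 update per epoch
w_sgd, b_sgd = train(data, batch_size=1)              # 21 updates per epoch
w_mini, b_mini = train(data, batch_size=4)            # 6 updates per epoch
```

Because the toy data has no noise, all three runs should end close to `w = 2`, `b = 1`; on real data the differences in update noise and cost per update listed in the table become visible.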


## Mini-Batch Gradient Descent Variants and Comparison

| Optimizer                                  | Update Rule (simplified)                                                                                  | Strengths ✅                                                                          | Drawbacks ❌                                                                            | Typical Use                                      |
| ------------------------------------------ | --------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------- | ------------------------------------------------ |
| **1. Vanilla (Standard) Mini-batch GD**    | $w := w - \eta \nabla L(w)$                                                                               | - Simple, easy to implement<br>- Works if learning rate is tuned                     | - Fixed learning rate (sensitive)<br>- Can get stuck in poor minima                    | Rarely used alone in deep nets                   |
| **2. Momentum**                            | $v := \beta v - \eta \nabla L(w)$; $w := w + v$                                                           | - Accelerates in right direction<br>- Reduces oscillations                           | - Needs momentum parameter ($\beta$) tuning                                            | Deep nets where loss surface is ravine-shaped    |
| **3. Nesterov Accelerated Gradient (NAG)** | $v := \beta v - \eta \nabla L(w + \beta v)$; $w := w + v$ | - Evaluates the gradient at the look-ahead point → often faster convergence<br>- Improves over plain momentum | - Slightly more computation<br>- Momentum parameter ($\beta$) still needs tuning | Deep nets (often faster training than plain momentum) |
| **4. AdaGrad** | $w := w - \frac{\eta}{\sqrt{G_t + \epsilon}} \nabla L(w)$ (where $G_t$ accumulates squared gradients) | - Per-parameter adaptive learning rate<br>- Works well for sparse data (NLP, text) | - Effective learning rate shrinks monotonically, so training may stall before converging | Sparse features, NLP |
| **5. RMSProp** | $w := w - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \nabla L(w)$ (where $E[g^2]_t$ is an exponentially decaying average of past squared gradients) | - Fixes AdaGrad’s diminishing LR<br>- Works well for non-stationary data | - Extra hyperparameter tuning (decay rate) | RNNs, non-stationary problems |
| **6. Adam (Adaptive Moment Estimation)** | Combines **Momentum + RMSProp**: $m := \beta_1 m + (1-\beta_1) g$; $v := \beta_2 v + (1-\beta_2) g^2$; $w := w - \eta \frac{\hat{m}}{\sqrt{\hat{v}} + \epsilon}$ (where $g = \nabla L(w)$ and $\hat{m}$, $\hat{v}$ are the bias-corrected averages) | - Fast convergence<br>- Works well in practice<br>- Default choice for deep learning | - May generalize worse than SGD in some cases<br>- Sensitive to learning rate schedule | Most modern deep nets (CNNs, Transformers, etc.) |
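
The update rules above can be compared side by side on a one-parameter problem. The following is a rough sketch under stated assumptions: the toy loss $L(w) = (w-3)^2$, the function names, and every default hyperparameter are chosen for illustration only, and the Adam variant uses the common bias-corrected form.

```python
import math

def grad(w):
    """Gradient of the toy loss L(w) = (w - 3)^2, which is minimized at w = 3."""
    return 2.0 * (w - 3.0)

def vanilla(w, steps, lr=0.1):
    for _ in range(steps):
        w -= lr * grad(w)
    return w

def momentum(w, steps, lr=0.1, beta=0.9):
    v = 0.0
    for _ in range(steps):
        v = beta * v - lr * grad(w)
        w += v
    return w

def nesterov(w, steps, lr=0.1, beta=0.9):
    v = 0.0
    for _ in range(steps):
        v = beta * v - lr * grad(w + beta * v)  # gradient at the look-ahead point
        w += v
    return w

def adagrad(w, steps, lr=0.5, eps=1e-8):
    G = 0.0  # running sum of squared gradients (never decays)
    for _ in range(steps):
        g = grad(w)
        G += g * g
        w -= lr / math.sqrt(G + eps) * g
    return w

def rmsprop(w, steps, lr=0.01, rho=0.9, eps=1e-8):
    avg = 0.0  # exponentially decaying average of squared gradients
    for _ in range(steps):
        g = grad(w)
        avg = rho * avg + (1 - rho) * g * g
        w -= lr / math.sqrt(avg + eps) * g
    return w

def adam(w, steps, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g        # first moment (momentum-like)
        v = b2 * v + (1 - b2) * g * g    # second moment (RMSProp-like)
        m_hat = m / (1 - b1 ** t)        # bias correction for the zero initialization
        v_hat = v / (1 - b2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w
```

Calling, for example, `adam(0.0, 500)` or `momentum(0.0, 500)` should return a value near 3. With a fixed learning rate, RMSProp tends to hover within roughly the step size of the minimum rather than settling exactly, which is one reason it is usually paired with a decaying schedule.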
