# 1. **Regularization in Deep Learning**
Models with a large number of free parameters can describe
an amazingly wide range of phenomena. Even if such a model agrees well with the available
data, that doesn’t make it a good model. It may just mean there’s enough freedom in the
model that it can describe almost any data set of the given size, without capturing any
genuine insights into the underlying phenomenon. When that happens the model will work
well for the existing data, but will fail to generalize to new situations.
Regularization in deep learning is used to prevent **overfitting**, helping models generalize better to unseen data. 

---

###  **1.1 Weight-Based Regularization**
These add a penalty to the loss function based on the model's weights.

1. **L1 Regularization (Lasso)**
   - Adds the sum of the absolute values of the weights, scaled by a factor λ, to the loss.
   - Promotes sparsity (many weights become zero).

2. **L2 Regularization (Ridge)**
   - Adds the sum of the squared weights, scaled by a factor λ, to the loss.
   - Keeps weights small but not sparse.

3. **Elastic Net**
   - Combines L1 and L2 penalties.
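As a rough illustration of the three penalties listed above, here is a minimal pure-Python sketch. The function names, the `l1_ratio` mixing parameter, and the `base_loss` value are assumptions made for this example and do not come from any particular library.

```python
def l1_penalty(weights, lam):
    """lam * sum(|w|): the L1 (Lasso) term added to the loss."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """lam * sum(w^2): the L2 (Ridge) term added to the loss."""
    return lam * sum(w * w for w in weights)


def elastic_net_penalty(weights, lam, l1_ratio=0.5):
    """Weighted mix of both penalties; l1_ratio=1 is pure L1, 0 is pure L2."""
    l1 = l1_penalty(weights, lam)
    l2 = l2_penalty(weights, lam)
    return l1_ratio * l1 + (1 - l1_ratio) * l2


weights = [0.5, -2.0, 0.0, 1.5]
base_loss = 0.30  # hypothetical unregularized loss value
print(base_loss + l1_penalty(weights, lam=0.01))
print(base_loss + l2_penalty(weights, lam=0.01))
print(base_loss + elastic_net_penalty(weights, lam=0.01))
```

Note how the L2 term is dominated by the largest weight (`-2.0`), while the L1 term charges every nonzero weight in proportion to its magnitude.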

---

###  **1.2 Architecture-Based Regularization**
Changes the structure or behavior of the network during training.

4. **Dropout**
   - Randomly "drops" units (sets them to zero) during training.
   - Prevents co-adaptation of neurons: no unit can rely on specific other units being present.
   - Acts somewhat like an ensemble of many thinned sub-networks trained inside one model.

5. **DropConnect**
   - Instead of dropping activations, it randomly drops weights.

6. **Batch Normalization (BN)**
   - Normalizes layer inputs to stabilize training.
   - Has some regularization effect but was not designed primarily for it.

7. **Layer Normalization / Group Normalization**
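   - Similar to BN, but normalizes across features (or groups of channels) within each example rather than across the batch, so it works better where batch statistics are unreliable (e.g., NLP, small batch sizes).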
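The Dropout item in the list above can be sketched in a few lines of NumPy. This uses the common "inverted dropout" variant, where the surviving activations are rescaled during training so that nothing has to change at inference time. The function name and the `drop_prob` and `rng` parameters are choices made for this sketch, not a library API.

```python
import numpy as np


def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout: zero units with probability drop_prob, rescale the rest."""
    if not training or drop_prob == 0.0:
        return activations
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob  # True = unit is kept
    return activations * mask / keep_prob


rng = np.random.default_rng(0)
a = np.ones((4, 5))
dropped = dropout(a, drop_prob=0.5, rng=rng)
print(dropped)                                # entries are either 0.0 or 2.0
print(dropout(a, 0.5, rng, training=False))   # unchanged at inference time
```

Dividing by `keep_prob` keeps the expected activation the same with and without the mask, which is why the inference path can simply return its input.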

---

###  **1.3 Data-Based Regularization**
Involves modifying the data to encourage better generalization.

8. **Data Augmentation**
   - Random transformations (e.g., rotation, cropping, flipping) applied to input data.
   - Encourages the model to become invariant to these changes.

9. **Mixup**
   - Blends two input images and their labels with the same mixing weight to create a new training example.

10. **Cutout / CutMix / Random Erasing**
    - Removes or replaces parts of the input image.
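To make the Mixup item above concrete, here is a small NumPy sketch. Drawing the mixing coefficient from a Beta(α, α) distribution is the usual formulation; the value `alpha = 0.2` and the toy 2×2 "images" are assumptions chosen only for illustration.

```python
import numpy as np


def mixup(x1, y1, x2, y2, lam):
    """Blend two examples and their one-hot labels with the same coefficient lam."""
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y


rng = np.random.default_rng(0)
x1, y1 = np.zeros((2, 2)), np.array([1.0, 0.0])  # toy "image" of class 0
x2, y2 = np.ones((2, 2)), np.array([0.0, 1.0])   # toy "image" of class 1
lam = rng.beta(0.2, 0.2)  # alpha = 0.2 is an assumed hyperparameter
x_mix, y_mix = mixup(x1, y1, x2, y2, lam)
print(lam, y_mix)
```

The mixed label is a soft target that still sums to 1, so it can be fed to a cross-entropy loss in place of a one-hot vector.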

---

###  **1.4 Training Control**
Controls training duration or gradient size to avoid overfitting and unstable updates.

11. **Early Stopping**
    - Stops training when validation performance stops improving.

12. **Gradient Clipping**
    - Limits the size of gradients to prevent exploding gradients; this is mainly a training-stability technique rather than a regularizer in the strict sense.
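Early stopping from the list above is often implemented with a "patience" counter. The sketch below assumes that formulation; the function name, the `patience` parameter, and the validation losses are made up for the example.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch) as 0-based indices into val_losses."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


val_losses = [0.62, 0.58, 0.60, 0.75, 0.90]  # made-up validation curve
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=2)
print(stop_epoch, best_epoch)  # stops at index 3, best model was at index 1
```

In practice one would also save a checkpoint whenever `best_epoch` changes and restore it after stopping.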

---

###  **1.5 Noise-Based Regularization**

13. **Label Smoothing**
    - Softens one-hot labels (e.g., instead of `[0, 1, 0]`, use `[0.1, 0.8, 0.1]`).

14. **Input Noise**
    - Adds random noise to input during training.

15. **Weight Noise**
    - Adds noise directly to model weights during training.
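The label-smoothing example above (`[0, 1, 0]` becoming `[0.1, 0.8, 0.1]`) matches the standard recipe of mixing the one-hot target with a uniform distribution over the classes. A minimal sketch, assuming a smoothing strength of `epsilon = 0.3` (the value that reproduces those numbers for three classes):

```python
def smooth_labels(one_hot, epsilon):
    """Mix the one-hot target with a uniform distribution over the K classes."""
    k = len(one_hot)
    return [(1.0 - epsilon) * y + epsilon / k for y in one_hot]


smoothed = smooth_labels([0, 1, 0], epsilon=0.3)
print(smoothed)  # approximately [0.1, 0.8, 0.1], up to floating-point rounding
```

Smaller values such as `epsilon = 0.1` are more typical in practice; 0.3 is used here only to match the example in the text.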

---

###  **1.6 Advanced / Bayesian Approaches**

16. **Variational Dropout / Bayesian Neural Networks**
    - Models uncertainty by treating weights as distributions instead of fixed values.

17. **Stochastic Depth**
    - Randomly skips entire layers (used in ResNets).

---


##  **2. When Is Regularization Needed?**

Regularization is needed when your model is **overfitting**, i.e., it performs well on the training data but poorly on validation or test data.

---

###  **2.1 Symptoms of Overfitting (Signs That Regularization Is Needed)**

####  1. **Large Gap Between Training and Validation Metrics**
- **Training accuracy** is high, but **validation/test accuracy** is much lower.
- Or, **training loss** is much lower than **validation loss**.

####  2. **Validation Loss Increases While Training Loss Decreases**
- A clear sign your model is memorizing the training data instead of generalizing.

####  3. **Degrading F1 Score on Validation/Test**
- Especially important for **imbalanced datasets**, where accuracy can stay high while the F1 score is poor.

---

###  **2.2 Metrics to Watch**

####  **Classification**
| Metric        | What to Look For                                                                 |
|---------------|----------------------------------------------------------------------------------|
| Accuracy      | High train accuracy, low val/test accuracy ⇒ overfitting                         |
| F1 Score      | More stable on imbalanced data; large drop from train to test ⇒ overfitting      |
| Precision/Recall | Significant drop on the validation set ⇒ model has fit the training data too closely |

####  **Regression**
| Metric        | What to Look For                                                                 |
|---------------|----------------------------------------------------------------------------------|
| MSE / MAE     | Low train error, high val/test error ⇒ overfitting                              |
| R² Score      | Close to 1.0 on training, much lower on test ⇒ overfitting                       |

---

###  **2.3 Visualization Can Help Too**
- **Learning curves** (plotting loss/accuracy vs. epochs for both training and validation):
  - If the validation loss **starts increasing while training loss keeps dropping**, that’s a red flag.

---

###  Example Scenario
You're training a neural net on image classification:

| Epoch | Train Acc | Val Acc | Train Loss | Val Loss |
|-------|-----------|---------|------------|----------|
| 1     | 70%       | 68%     | 0.6        | 0.62     |
| 5     | 95%       | 72%     | 0.2        | 0.58     |
| 10    | 99%       | 65%     | 0.1        | 0.75     |

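Validation accuracy peaks around epoch 5 and then falls while validation loss rises, even though the training metrics keep improving. You're overfitting. Time to apply regularization (e.g., Dropout, L2, Data Augmentation).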
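The same diagnosis can be automated. The sketch below copies the numbers from the table and checks the two symptoms from section 2.1; the 10-point accuracy-gap threshold is an arbitrary choice for illustration, not a standard value.

```python
history = [
    # (epoch, train_acc, val_acc, train_loss, val_loss), copied from the table
    (1, 0.70, 0.68, 0.60, 0.62),
    (5, 0.95, 0.72, 0.20, 0.58),
    (10, 0.99, 0.65, 0.10, 0.75),
]

best = min(history, key=lambda row: row[4])  # row with the lowest validation loss
last = history[-1]
acc_gap = last[1] - last[2]                  # final train/validation accuracy gap

val_loss_rising = last[4] > best[4]          # validation loss got worse after its best epoch
overfitting = val_loss_rising and acc_gap > 0.10  # 0.10 is an illustrative threshold

print(f"best epoch: {best[0]}, final accuracy gap: {acc_gap:.2f}, overfitting: {overfitting}")
```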

---

##  L2 Regularization in Neural Networks

L2 regularization is used **not only in linear regression** but also **in deep learning**, where we apply it to all weights in the network.

---

###  **Example: Cross-Entropy Loss with L2**

$
C = -\frac{1}{n} \sum_x \sum_{j} \left[ y_j \ln a^L_j + (1 - y_j) \ln(1 - a^L_j) \right] + \frac{\lambda}{2n} \sum_w w^2
$

- First term: cross-entropy (how wrong our predictions are)
- Second term: L2 penalty (sums squares of all weights)

---

###  **Example: MSE with L2**

$
C = \frac{1}{2n} \sum_x \| y - a^L \|^2 + \frac{\lambda}{2n} \sum_w w^2
$

Both forms share this pattern:

$
C = C_0 + \frac{\lambda}{2n} \sum_w w^2
$

- $ C_0 $: original loss (unregularized)
- $ n $: number of training examples
- $ \lambda $: regularization parameter ($ \lambda > 0 $)
  - **Small $ \lambda $** → prioritize fitting the training data
  - **Large $ \lambda $** → prioritize small weights
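To see how $ \lambda $ shifts the balance, the general pattern $ C = C_0 + \frac{\lambda}{2n} \sum_w w^2 $ can be evaluated directly. The weights, the value of `c0`, and the dataset size in this sketch are made-up numbers.

```python
def l2_regularized_cost(c0, weights, lam, n):
    """C = C0 + (lam / (2 n)) * sum of squared weights."""
    return c0 + (lam / (2.0 * n)) * sum(w * w for w in weights)


weights = [0.5, -2.0, 1.5]  # sum of squares = 6.5
for lam in (0.0, 0.1, 5.0):
    print(lam, l2_regularized_cost(c0=0.25, weights=weights, lam=lam, n=100))
```

With `lam = 0.1` the penalty is a tiny fraction of `c0`, so fitting the data dominates; with `lam = 5.0` the penalty is of the same order as `c0`, so shrinking the weights becomes a real priority for the optimizer.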

---

##  Gradient Descent with L2 Regularization

Gradient update without regularization:

$
w_{\text{new}} = w - \eta \frac{\partial C_0}{\partial w}
$

With L2 regularization:

$
w_{\text{new}} = w - \eta \left( \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} w \right)
= \left(1 - \frac{\eta \lambda}{n} \right) w - \eta \frac{\partial C_0}{\partial w}
$

- The term $ \left(1 - \frac{\eta \lambda}{n} \right) $ **shrinks** the weight on every update
- This is known as **weight decay**
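The two forms of the update above are algebraically identical, which a quick numerical check can confirm. All numbers in this sketch (weight, gradient, $ \eta $, $ \lambda $, $ n $) are hypothetical.

```python
def step_gradient_form(w, grad_c0, eta, lam, n):
    """w - eta * (dC0/dw + (lam / n) * w)"""
    return w - eta * (grad_c0 + (lam / n) * w)


def step_weight_decay_form(w, grad_c0, eta, lam, n):
    """(1 - eta * lam / n) * w - eta * dC0/dw"""
    return (1.0 - eta * lam / n) * w - eta * grad_c0


w, grad_c0 = 2.0, 0.3        # hypothetical weight and gradient of C0
eta, lam, n = 0.5, 5.0, 100  # hypothetical hyperparameters and dataset size
a = step_gradient_form(w, grad_c0, eta, lam, n)
b = step_weight_decay_form(w, grad_c0, eta, lam, n)
print(a, b)  # both forms give the same new weight

# With a zero data gradient, repeated updates shrink the weight geometrically.
w_decay = 2.0
for _ in range(3):
    w_decay = step_weight_decay_form(w_decay, 0.0, eta, lam, n)
print(w_decay)
```

The second loop isolates the decay factor: when the data gradient is zero, each step multiplies the weight by $ 1 - \frac{\eta \lambda}{n} $, here 0.975.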

---

##  Mini-Batch Stochastic Gradient Descent (SGD)

For a mini-batch of size $ m $:

### Weight update:

$
w_{\text{new}} = \left(1 - \frac{\eta \lambda}{n} \right) w - \frac{\eta}{m} \sum_x \frac{\partial C_x}{\partial w}
$

### Bias update (no regularization term):

$
b_{\text{new}} = b - \frac{\eta}{m} \sum_x \frac{\partial C_x}{\partial b}
$

- Biases are **not regularized**: the L2 penalty sums over weights only, so it contributes nothing to the gradient with respect to $ b $.
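Both update rules can be combined into one mini-batch step. The NumPy sketch below assumes the per-example gradients $ \partial C_x / \partial w $ and $ \partial C_x / \partial b $ have already been computed (e.g., by backpropagation) and are stacked along the first axis; the function name and all numbers are illustrative.

```python
import numpy as np


def sgd_l2_step(w, b, grads_w, grads_b, eta, lam, n):
    """One mini-batch update; grads_w and grads_b hold one gradient per example."""
    m = len(grads_w)  # mini-batch size
    w_new = (1.0 - eta * lam / n) * w - (eta / m) * np.sum(grads_w, axis=0)
    b_new = b - (eta / m) * np.sum(grads_b, axis=0)  # no decay factor on the bias
    return w_new, b_new


w = np.array([1.0, -2.0])
b = np.array([0.5])
grads_w = np.array([[0.2, 0.0], [0.4, -0.2]])  # hypothetical per-example dC_x/dw
grads_b = np.array([[0.1], [0.3]])             # hypothetical per-example dC_x/db
w_new, b_new = sgd_l2_step(w, b, grads_w, grads_b, eta=0.1, lam=5.0, n=1000)
print(w_new, b_new)
```

Note that the decay factor uses the full training-set size `n`, while the gradient average uses the mini-batch size `m`, exactly as in the equations above.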

---

##  Summary

| Concept                          | Equation / Insight |
|----------------------------------|---------------------|
| OLS loss                         | $ \lVert X\beta - Y \rVert^2 $ |
| Ridge loss (L2)                  | $ \lVert X\beta - Y \rVert^2 + \lambda \lVert \beta \rVert^2 $ |
| Ridge solution                   | $ \hat{\beta}_R = (X^T X + \lambda I)^{-1} X^T Y $ |
| Neural network loss (w/ L2)      | $ C = C_0 + \frac{\lambda}{2n} \sum_w w^2 $ |
| Weight update rule (SGD + L2)    | $ w_{\text{new}} = \left(1 - \frac{\eta \lambda}{n} \right) w - \eta \nabla_w C_0 $ |
