# Regularization
Machine learning models need to generalize well to new examples that they have not seen during training. In this module, we introduce regularization, which helps prevent models from overfitting the training data.

## The Problem of Overfitting
Overfitting and underfitting both concern a model's ability to capture the true underlying structure of the data rather than its noise.

### Three States of Modeling
- **Underfitting (High Bias)**:
    - **Cause**: The model is too simple (e.g., a linear function for non-linear data) or has too few features.
    - **Effect**: It fails to capture the essential structure of the data, resulting in poor performance on both training and new data.

- **Balanced Fit**:
    - **Cause**: Adding an appropriate number of features (e.g., quadratic terms) to match the data's complexity.
    - **Effect**: The model generalizes well to new, unseen data.

- **Overfitting (High Variance)**:
    - **Cause**: The model is overly complex (e.g., high-order polynomials).
    - **Effect**: It fits the training data perfectly (passing through every point) but captures random noise rather than the trend, leading to poor predictions for new data.
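
The three states can be reproduced on synthetic data. The following is a minimal sketch, assuming noisy quadratic data and NumPy's `polyfit`; the data, the degrees (1, 2, 9), and names such as `train_mse` are illustrative choices, not part of the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a quadratic trend plus Gaussian noise.
x_train = np.linspace(-1.0, 1.0, 10)
y_train = 1.0 + 2.0 * x_train + 3.0 * x_train**2 + rng.normal(0.0, 0.3, x_train.size)
x_new = np.linspace(-0.95, 0.95, 50)
y_new = 1.0 + 2.0 * x_new + 3.0 * x_new**2 + rng.normal(0.0, 0.3, x_new.size)


def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


train_mse, new_mse = {}, {}
for degree in (1, 2, 9):  # too simple, matched to the data, too complex
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse[degree] = mse(coeffs, x_train, y_train)
    new_mse[degree] = mse(coeffs, x_new, y_new)
    print(f"degree {degree}: train MSE = {train_mse[degree]:.4f}, "
          f"new-data MSE = {new_mse[degree]:.4f}")
```

Training error can only go down as the degree increases, because every lower-degree polynomial is a special case of a higher-degree one; the degree-9 fit passes through all ten training points. Overfitting shows up in the error on new data, so that is the column to compare.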

### How to Address Overfitting
When a model is too complex for the training data, there are two main strategies for addressing overfitting:
- **Feature Reduction**: Manually or algorithmically remove less important features to simplify the hypothesis.
- **Regularization**: Keep all features but reduce the magnitude (weight) of the parameters $\theta_i$. This is particularly effective when you have many features that each contribute a small amount of information.

## Cost Function
Let's explore how `Regularization` modifies the `Cost Function` to solve overfitting by penalizing large parameters.

### Core Concept: Penalizing Complexity
**The Strategy**: 
To prevent **overfitting**, we modify the `Cost Function` of the model. The added penalty does not punish the model for being inaccurate (the original error term already does that); it punishes the model for having large parameter values ($\theta$).

**The Logic**:
- **Large $\theta$ values** create "aggressive" curves that over-react to noise in the data.
- **Small $\theta$ values** force those same curves to become flatter and smoother.

**The Benefit**: 
By making the parameters smaller, the model simplifies its own hypothesis. It retains all available features but "mutes" their intensity, resulting in a curve that ignores random fluctuations and focuses on the true underlying trend. This allows the model to **generalize** (perform accurately on new data rather than just memorizing the training set).


### Regularized Cost Function
To apply this to all features simultaneously, we add a `regularization term` to the cost function:
$$
J(\theta)=\frac{1}{2m}\left[\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2+\lambda\sum_{j=1}^{n}\theta_j^2\right]
$$

Where:
- **$\lambda$ (Lambda)**: The regularization parameter. It controls the trade-off between fitting the training data and keeping the parameters small.
- **The Penalty $(\sum\theta_j^2)$**: By convention, we regularize parameters from $j=1$ to $n$ (excluding the bias term $\theta_0$).
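
As a rough illustration, the cost can be computed as below. This is a sketch under stated assumptions: a linear hypothesis $h_\theta(x)=\theta^Tx$, a design matrix `X` whose first column is all ones, and a penalty scaled by $\frac{1}{2m}$ (the scaling that yields the $\frac{\lambda}{m}\theta_j$ term used in the gradient descent update later in these notes). The function name `regularized_cost` is hypothetical.

```python
import numpy as np


def regularized_cost(theta, X, y, lam):
    """Squared-error cost plus an L2 penalty on theta[1:] (theta[0] is the bias).

    Assumes a linear hypothesis X @ theta and a penalty scaled by 1 / (2m).
    """
    m = y.size
    residuals = X @ theta - y
    fit_term = residuals @ residuals / (2 * m)
    penalty = lam * np.sum(theta[1:] ** 2) / (2 * m)  # skips theta_0
    return float(fit_term + penalty)


# Toy data that theta = (0, 1) fits exactly, so only the penalty contributes.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.array([0.0, 1.0])

cost_no_penalty = regularized_cost(theta, X, y, lam=0.0)    # 0.0
cost_with_penalty = regularized_cost(theta, X, y, lam=6.0)  # 6 * 1^2 / (2 * 3) = 1.0
print(cost_no_penalty, cost_with_penalty)
```

Because the fit is perfect here, the second value isolates the penalty: a perfectly accurate model still pays for the size of $\theta_1$.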



### Role of $\lambda$ (Lambda)
The **Regularization Parameter ($\lambda$)** acts as a control knob that balances two goals: 
- Fitting the training data well.
- Keeping the model simple.

| $\lambda$ value | Effect on Model | Risk |
|---|---|---|
| **Too Large** | Penalizes parameters too heavily; many $\theta \approx 0$ | **Underfitting** (High Bias) |
| **Optimal** | Smooths the curve while maintaining the trend | **Good Generalization** |
| **Too Small / Zero** | The penalty term disappears; the model remains complex | **Overfitting** (High Variance) |
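
The effect of the knob can be observed numerically. The sketch below assumes noisy sine data expanded into degree-8 polynomial features, and it solves for $\theta$ with the closed-form normal equation covered later in these notes; the data and the particular $\lambda$ values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: noisy sine samples expanded into polynomial features.
x = np.linspace(-1.0, 1.0, 20)
y = np.sin(np.pi * x) + rng.normal(0.0, 0.2, x.size)
X = np.vander(x, N=9, increasing=True)  # columns x^0 .. x^8; column 0 is the bias feature

L = np.eye(X.shape[1])
L[0, 0] = 0.0  # leave theta_0 unpenalized

results = []
for lam in (0.0, 0.1, 10.0, 1000.0):
    theta = np.linalg.solve(X.T @ X + lam * L, X.T @ y)
    train_mse = float(np.mean((X @ theta - y) ** 2))
    weight_norm = float(np.linalg.norm(theta[1:]))
    results.append((lam, train_mse, weight_norm))
    print(f"lambda = {lam:7.1f}  train MSE = {train_mse:.4f}  "
          f"||theta[1:]|| = {weight_norm:.4f}")
```

As $\lambda$ grows, the norm of the penalized parameters shrinks and the training error rises: the two goals in the table pull in opposite directions. Choosing $\lambda$ in practice would require held-out data, which this sketch does not include.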

## Regularized Linear Regression


### Gradient Descent
In the update rule, we separate $\theta_0$ from the rest. For all other parameters, we add the regularization term $\frac{\lambda}{m}\theta_j$:

**Recall**: the standard gradient descent rule:
$$
\theta_j:=
\theta_j-\alpha\begin{bmatrix}
\frac{\partial}{\partial\theta_j}J(\theta)
\end{bmatrix}\\
=\theta_j-\alpha\begin{bmatrix}
\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)}).x_j^{(i)}
\end{bmatrix}
$$


**The Update Rule**:
$$
\theta_0:=\theta_0-\alpha\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)}).x_0^{(i)}\\
\theta_j:=\theta_j-\alpha\begin{bmatrix}
\frac{\partial}{\partial\theta_j}J^{(regularized)}(\theta)
\end{bmatrix}=\theta_j-\alpha\begin{bmatrix}(\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)}).x_j^{(i)})+\frac{\lambda}{m}\theta_j\end{bmatrix},\ 
j\in\{1,2,...,n\}
$$


Equivalent form:
$$
\theta_0:=\theta_0-\alpha\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)}).x_0^{(i)}\\
\theta_j:=\theta_j(1-\alpha\frac{\lambda}{m}) - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)}).x_j^{(i)},\ j\in\{1,2,\cdots,n\}
$$

- The factor $(1-\alpha\frac{\lambda}{m})$ is always less than 1 (for $\alpha>0$ and $\lambda>0$), so it shrinks $\theta_j$ slightly in every iteration. Notice that the second term is exactly the same as in the unregularized update.
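
Both forms of the update can be written in a few lines. This is a minimal sketch assuming a linear hypothesis and a design matrix `X` with a leading column of ones; the function names, the toy data, the learning rate `alpha=0.1`, and the iteration count are all illustrative assumptions.

```python
import numpy as np


def gradient_step(theta, X, y, alpha, lam):
    """One regularized update: gradient of the fit term plus (lam / m) * theta_j."""
    m = y.size
    grad = X.T @ (X @ theta - y) / m
    reg = (lam / m) * theta
    reg[0] = 0.0  # theta_0 is not penalized
    return theta - alpha * (grad + reg)


def gradient_step_shrink(theta, X, y, alpha, lam):
    """Equivalent form: shrink theta_j by (1 - alpha * lam / m), then step."""
    m = y.size
    grad = X.T @ (X @ theta - y) / m
    shrink = np.full(theta.shape, 1.0 - alpha * lam / m)
    shrink[0] = 1.0  # no shrinkage for theta_0
    return theta * shrink - alpha * grad


# Toy data: y = 1 + 2x on five points.
X = np.column_stack([np.ones(5), np.arange(5.0)])
y = 1.0 + 2.0 * X[:, 1]

theta = np.zeros(2)
for _ in range(2000):
    theta = gradient_step(theta, X, y, alpha=0.1, lam=1.0)
print(theta)  # slope ends up below 2 because of the penalty
```

The two functions return the same vector for any input; the second simply makes the per-iteration shrinkage explicit.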

### Normal Equation
**Regularization** can also be applied to the non-iterative normal equation. We add a matrix $L$ to the calculation to handle the penalty:
$$
X_{[m\times (n+1)]}=
\begin{bmatrix}
    (x^{(1)})^T\\
    \vdots\\
    (x^{(m)})^T
\end{bmatrix},  
y_{[m\times1]}=
\begin{bmatrix}
    y^{(1)}\\
    \vdots\\
    y^{(m)}
\end{bmatrix}\\
\theta_{[(n+1)\times1]}=(X^TX+\lambda\cdot L)^{-1}X^Ty
$$
- The **$L$ Matrix**: An $(n+1)\times (n+1)$ identity-like matrix, but with a **0** (zero) in the top-left corner (to ensure $\theta_0$ is not penalized):
$$
L_{[(n+1)\times(n+1)]}=\begin{bmatrix}
0       & 0         & \cdots & 0\\
0       & 1         & \cdots & 0\\
\vdots  & \vdots    & \ddots & \vdots\\
0       & 0         & \cdots & 1
\end{bmatrix}
$$
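
The closed-form solution translates directly into NumPy. The sketch below assumes `X` already contains the leading column of ones; the function name `regularized_normal_equation` and the toy data are hypothetical. It uses `np.linalg.solve` rather than forming the inverse explicitly, which is a common numerical choice and not something the formula itself requires.

```python
import numpy as np


def regularized_normal_equation(X, y, lam):
    """Solve (X^T X + lam * L) theta = X^T y, with L = identity except L[0, 0] = 0."""
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0  # do not penalize theta_0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)


# Toy data lying exactly on the line y = x.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 1.0, 2.0])

theta_plain = regularized_normal_equation(X, y, lam=0.0)  # ordinary least squares
theta_reg = regularized_normal_equation(X, y, lam=2.0)    # slope shrunk toward zero
print(theta_plain, theta_reg)
```

For this data, $\lambda=0$ recovers $\theta=(0, 1)$, while $\lambda=2$ gives $\theta=(0.5, 0.5)$: the slope is halved and the unpenalized intercept moves to compensate.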


### Key Benefits
- **Solves Non-Invertibility $(m\le n)$**: If the number of examples $m$ is less than or equal to the number of features $n$, the original $X^TX$ is non-invertible (singular). For $\lambda>0$, adding $\lambda\cdot L$ makes the matrix **$[X^TX+\lambda\cdot L]$** invertible, ensuring that a unique solution exists.
- **Prevents Overfitting**: By discouraging large weights, the model becomes less sensitive to noise in the training data.
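
The invertibility claim can be checked on a tiny example. This sketch assumes a made-up dataset with $m=2$ examples and $n=3$ features, and $\lambda=1$; it relies on the fact that $X^TX$ has the same rank as $X$, which cannot exceed $m$.

```python
import numpy as np

# Hypothetical dataset: m = 2 examples, n = 3 features, plus the bias column.
X = np.array([[1.0, 1.0, 2.0, 3.0],
              [1.0, 4.0, 5.0, 7.0]])
y = np.array([1.0, 2.0])

gram = X.T @ X  # 4 x 4, but rank(X^T X) = rank(X) <= m = 2, so it is singular
L = np.eye(4)
L[0, 0] = 0.0
lam = 1.0

rank_X = np.linalg.matrix_rank(X)
rank_reg = np.linalg.matrix_rank(gram + lam * L)

# The regularized system has full rank, so a unique solution exists.
theta = np.linalg.solve(gram + lam * L, X.T @ y)
print(rank_X, rank_reg, theta)
```

The regularized matrix stays invertible even though $\theta_0$ is unpenalized, because the bias column of ones is never zero.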

## Regularized Logistic Regression

---
# The end!