Ridge (L2) regularization modifies the objective of standard linear regression. Plain linear regression minimizes the mean squared error (MSE) alone; ridge adds a penalty term equal to the sum of the squared model parameters, scaled by a regularization parameter $\alpha$.

### Quick Recap: Mean Squared Error (MSE)

MSE measures the average of the squared differences between true values $y_i$ and predicted values $\hat{y}_i$:

$$
\text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2
$$

For example, if a model predicts values $\hat{y}_1 = -3$ and $\hat{y}_2 = 7$ for true values $y_1 = -3$ and $y_2 = 12$, the total squared error is 

$$
( -3 - (-3) )^2 + ( 12 - 7 )^2 = 0 + 25 = 25,
$$
and the MSE for two data points is $25/2 = 12.5$.
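The arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the helper name `compute_mse` is chosen here for illustration and is not part of the original text.

```python
def compute_mse(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


# True values (-3, 12) against predictions (-3, 7), as in the example above.
mse = compute_mse([-3, 12], [-3, 7])
print(mse)  # 12.5
```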

### How Ridge Regression Modifies This

Ridge regression minimizes an objective function that is the sum of the MSE and a penalty term:

$$
\text{Objective} = \text{MSE} + \alpha \sum_{j=1}^d \theta_j^2
$$

where $\theta_j$ are the coefficients (excluding the intercept), and $\alpha \ge 0$ is a fixed hyperparameter controlling the penalty's weight. The sum $\sum_j \theta_j^2$ is the squared L2 norm of the coefficient vector, so the penalty grows with coefficient magnitude and pushes the coefficients towards zero (an effect called shrinkage).

Continuing the earlier example, suppose $\alpha=2$ and the model's coefficients are $\theta_1=3$ and $\theta_2=2$ (values chosen for illustration). The penalty is

$$
2 \times (3^2 + 2^2) = 2 \times (9 + 4) = 26.
$$

So the total objective is 

$$
12.5 + 26 = 38.5.
$$

Ridge regression finds the parameters that minimize this total, balancing a lower prediction error against keeping the parameters small.
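The same calculation can be written as a small function. This is a sketch of the objective as defined in this section (MSE plus $\alpha$ times the sum of squared coefficients); the name `ridge_objective` is hypothetical, and the MSE is passed in directly rather than computed from data.

```python
def ridge_objective(mse, coefficients, alpha):
    """MSE plus alpha times the sum of squared coefficients (intercept excluded)."""
    penalty = alpha * sum(theta ** 2 for theta in coefficients)
    return mse + penalty


# MSE = 12.5, coefficients (3, 2), alpha = 2, as in the worked example.
objective = ridge_objective(mse=12.5, coefficients=[3, 2], alpha=2)
print(objective)  # 38.5
```

Setting `alpha=0` returns the MSE unchanged, which corresponds to ordinary linear regression.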

### Trade-Off and Model Selection

Ridge regression prefers models with smaller coefficients, even at the cost of a higher MSE. For example, consider two models:

- Model A: MSE = 10, coefficients $3, 4$  
- Model B: MSE = 3, coefficients $8, 6$

With $\alpha=2$, the objective function values are:

- Model A: $10 + 2 \times (9 + 16) = 10 + 50 = 60$  
- Model B: $3 + 2 \times (64 + 36) = 3 + 200 = 203$

With this $\alpha$, ridge favors Model A despite its higher MSE, because its smaller coefficients incur a much smaller penalty. Limiting coefficient size in this way is how ridge discourages overfitting.
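The comparison can be reproduced in code. The sketch below reuses the objective from this section; the `models` dictionary and the `preferred_model` helper are illustrative names, and the second value $\alpha = 0.05$ is an extra assumption added here to show that the preference depends on $\alpha$.

```python
def ridge_objective(mse, coefficients, alpha):
    """MSE plus alpha times the sum of squared coefficients."""
    return mse + alpha * sum(theta ** 2 for theta in coefficients)


# The two candidate models from the text.
models = {
    "A": {"mse": 10, "coefficients": [3, 4]},
    "B": {"mse": 3, "coefficients": [8, 6]},
}


def preferred_model(alpha):
    """Return the name of the model with the lowest ridge objective."""
    return min(
        models,
        key=lambda name: ridge_objective(
            models[name]["mse"], models[name]["coefficients"], alpha
        ),
    )


print(preferred_model(alpha=2))     # A  (60 vs 203)
print(preferred_model(alpha=0.05))  # B  (penalty too weak to outweigh the MSE gap)
```

The objectives are $10 + 25\alpha$ for Model A and $3 + 100\alpha$ for Model B, so Model A wins whenever $\alpha > 7/75 \approx 0.093$.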

### Effect of Alpha

- When $\alpha$ is small, the penalty is negligible, so ridge behaves like standard linear regression and allows large coefficients.
- When $\alpha$ is large, the penalty dominates, shrinking the coefficients closer to zero. This simplifies the model but can increase bias.
- The best $\alpha$ balances fit against coefficient size. Because $\alpha$ is a hyperparameter, it is typically chosen by validation (for example, cross-validation) rather than learned from the training objective.

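Graphically, the penalty is a bowl centered at the origin. As $\alpha$ increases, this bowl contributes more to the objective surface, so the surface becomes more steeply curved around the origin and its minimum shifts towards smaller coefficient values, pulling the model parameters closer to zero.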
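The shrinkage effect can be seen numerically in the simplest possible case. The sketch below assumes a single feature and no intercept, so the objective $\frac{1}{n}\sum_i (y_i - \theta x_i)^2 + \alpha\theta^2$ has the closed-form minimizer $\theta = \sum_i x_i y_i \,/\, (\sum_i x_i^2 + n\alpha)$, obtained by setting the derivative with respect to $\theta$ to zero. The data and the function name `ridge_slope` are made up for illustration.

```python
def ridge_slope(xs, ys, alpha):
    """Closed-form minimizer of (1/n) * sum((y - theta*x)^2) + alpha * theta^2.

    Assumes one feature and no intercept.
    """
    n = len(xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + n * alpha)


xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]  # made-up data lying exactly on y = 2x

alphas = [0, 1, 10, 100]
slopes = [ridge_slope(xs, ys, alpha) for alpha in alphas]

for alpha, slope in zip(alphas, slopes):
    print(f"alpha={alpha:>3}: theta={slope:.3f}")
```

With `alpha=0` the slope is the ordinary least-squares value of 2; each larger `alpha` yields a smaller slope, moving towards zero without reaching it.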

In summary, Ridge regression adds a **shrinkage** penalty on coefficients to reduce overfitting by controlling the complexity of the model through the $\alpha$ parameter. 

By minimizing the combined objective of prediction error and parameter size, it accepts a somewhat higher error on the training data in exchange for a simpler model that tends to generalize better.
