# Regularization
Models with a large number of free parameters can describe
an amazingly wide range of phenomena. Even if such a model agrees well with the available
data, that doesn’t make it a good model. It may just mean there’s enough freedom in the
model that it can describe almost any data set of the given size, without capturing any
genuine insights into the underlying phenomenon. When that happens the model will work
well for the existing data, but will fail to generalize to new situations.


Regularization is a technique used in machine learning to prevent overfitting. Overfitting occurs when a model performs very well on training data but poorly on new, unseen data. Regularization introduces a penalty on the complexity of the model, effectively reducing the risk of overfitting.

Here are some of the common regularization methods and a brief example of how they are applied in the context of linear regression:

1. **L1 Regularization (Lasso Regression)**:
    - It adds a penalty proportional to the sum of the absolute values of the coefficients.
    - Objective = Loss (e.g., Mean Squared Error) + $\lambda \sum_i |w_i|$

2. **L2 Regularization (Ridge Regression)**:
    - It adds a penalty proportional to the sum of the squares of the coefficients.
    - Objective = Loss (e.g., Mean Squared Error) + $\lambda \sum_i w_i^2$

3. **Elastic Net**:
    - A combination of L1 and L2 regularization, with a separate strength for each penalty.
    - Objective = Loss (e.g., Mean Squared Error) + $\lambda_1 \sum_i |w_i| + \lambda_2 \sum_i w_i^2$
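
The three objectives listed above can be sketched in a few lines of NumPy. This is only an illustration: the helper names (`l1_penalty`, `l2_penalty`, `elastic_net_penalty`, `mse`) and the toy data are assumptions made for this example, and the penalties are taken to sum over all coefficients.

```python
import numpy as np

def l1_penalty(w, lam):
    """Lasso penalty: lam times the sum of absolute coefficient values."""
    return lam * np.sum(np.abs(w))

def l2_penalty(w, lam):
    """Ridge penalty: lam times the sum of squared coefficient values."""
    return lam * np.sum(w ** 2)

def elastic_net_penalty(w, lam1, lam2):
    """Elastic net: an L1 term and an L2 term, each with its own strength."""
    return l1_penalty(w, lam1) + l2_penalty(w, lam2)

def mse(X, y, w):
    """Mean squared error of the linear model X @ w."""
    return np.mean((X @ w - y) ** 2)

# Toy data (hypothetical values, chosen only to make the arithmetic easy)
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0]])
y = np.array([1.0, -1.0])
w = np.array([0.5, -2.0, 0.0])

lasso_objective = mse(X, y, w) + l1_penalty(w, 0.1)
ridge_objective = mse(X, y, w) + l2_penalty(w, 0.1)
enet_objective = mse(X, y, w) + elastic_net_penalty(w, 0.1, 0.1)
print(lasso_objective, ridge_objective, enet_objective)
```

Note that the same weight vector is charged differently by each penalty: the L2 term is dominated by the largest coefficient, while the L1 term grows linearly with every coefficient.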




## Linear least squares
$X_{m\times n}{\vec {\beta_{n\times 1} }}=Y_{m\times 1}$

${\displaystyle L(D,{\vec {\beta }})=||X{\vec {\beta }}-Y||^{2}=(X{\vec {\beta }}-Y)^{T}(X{\vec {\beta }}-Y)=Y^{T}Y-Y^{T}X{\vec {\beta }}-{\vec {\beta }}^{T}X^{T}Y+{\vec {\beta }}^{T}X^{T}X{\vec {\beta }}}$


${\displaystyle {\frac {\partial L(D,{\vec {\beta }})}{\partial {\vec {\beta }}}}={\frac {\partial \left(Y^{T}Y-Y^{T}X{\vec {\beta }}-{\vec {\beta }}^{T}X^{T}Y+{\vec {\beta }}^{T}X^{T}X{\vec {\beta }}\right)}{\partial {\vec {\beta }}}}=-2X^{T}Y+2X^{T}X{\vec {\beta }}}$


Setting the gradient of the loss to zero and solving for ${\displaystyle {\vec {\beta }}}$, we get:

${\displaystyle -2X^{T}Y+2X^{T}X{\vec {\beta }}=0\Rightarrow X^{T}Y=X^{T}X{\vec {\beta }}\Rightarrow {\vec {\hat {\beta }}}=(X^{T}X)^{-1}X^{T}Y}$
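
As a quick numerical check of the normal equations, the sketch below solves $X^{T}X\vec{\beta}=X^{T}Y$ on synthetic data and compares the result with NumPy's least-squares solver. The data, the seed, and the variable names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 3                       # m samples, n features
X = rng.normal(size=(m, n))
beta_true = np.array([2.0, -1.0, 0.5])
Y = X @ beta_true + 0.01 * rng.normal(size=m)

# Solve X^T X beta = X^T Y directly instead of forming the inverse explicitly
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Reference solution from NumPy's least-squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

# The gradient -2 X^T Y + 2 X^T X beta should vanish at the solution
gradient = -2 * X.T @ Y + 2 * X.T @ X @ beta_hat
print(beta_hat, np.abs(gradient).max())
```

Using `np.linalg.solve` rather than `np.linalg.inv` is a common numerical choice; both express the same formula when $X^{T}X$ is invertible.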

## Tikhonov regularization (ridge regression) with L2 norm
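We add the squared magnitude of $\beta$, scaled by $\lambda$, to the cost. This penalizes large weights and, all other things being equal, keeps the weights small (close to zero). Minimizing the penalized least-squares loss gives the closed-form solution: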

${\displaystyle {\hat {\beta }}_{R}=(\mathbf {X} ^{\mathsf {T}}\mathbf {X} +\lambda \mathbf {I} )^{-1}\mathbf {X} ^{\mathsf {T}}\mathbf {y} }$
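
The closed-form ridge estimate can be evaluated directly. The following is a minimal sketch on synthetic data; the function name `ridge_closed_form`, the data, and the chosen value of `lam` are assumptions for this example, and no intercept is fitted.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Return (X^T X + lam * I)^{-1} X^T y without forming the inverse."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
y = X @ np.array([1.0, 0.0, -3.0, 2.0]) + 0.1 * rng.normal(size=30)

beta_ols = ridge_closed_form(X, y, 0.0)     # lam = 0 recovers ordinary least squares
beta_ridge = ridge_closed_form(X, y, 10.0)  # lam > 0 shrinks the coefficients

# Gradient of ||X b - y||^2 + lam * ||b||^2 at the ridge solution
grad = -2 * X.T @ y + 2 * X.T @ X @ beta_ridge + 2 * 10.0 * beta_ridge
print(np.linalg.norm(beta_ols), np.linalg.norm(beta_ridge))
```

The ridge solution has a smaller norm than the least-squares one, which is the shrinkage the penalty is meant to produce.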

$\begin{eqnarray} C = -\frac{1}{n} \sum_{xj} \left[ y_j \ln a^L_j+(1-y_j) \ln
(1-a^L_j)\right] + \frac{\lambda}{2n} \sum_w w^2.
\end{eqnarray}$

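The first term is the usual cross-entropy cost. The second term is the sum of the squares of all the weights in the network, scaled by $\lambda / 2n$, where $\lambda > 0$ is the regularization parameter.

The same penalty can be added to other cost functions, such as the quadratic cost: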


$\begin{eqnarray} C = \frac{1}{2n} \sum_x \|y-a^L\|^2 +
  \frac{\lambda}{2n} \sum_w w^2.
\end{eqnarray}$

In both cases we can write the regularized cost function as:

$\begin{eqnarray}  C = C_0 + \frac{\lambda}{2n}
\sum_w w^2,
\end{eqnarray}$

where $C_0$ is the original, unregularized cost function.

The regularization parameter $\lambda$ sets the trade-off between the two terms: when $\lambda$ is small we prefer to minimize the original cost function, but when $\lambda$ is large we prefer small weights.
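
To make the trade-off concrete, here is a small sketch that evaluates $C = C_0 + \frac{\lambda}{2n}\sum_w w^2$ for a fixed set of weights and several values of $\lambda$. The function name `l2_regularized_cost`, the weight values, and the value of `c0` are hypothetical and serve only as an illustration.

```python
import numpy as np

def l2_regularized_cost(c0, weights, lam, n):
    """C = C_0 + lam / (2 n) * (sum of squares of all weights).

    `weights` is a list of arrays, for example one weight matrix per layer.
    Biases are deliberately left out of the penalty.
    """
    penalty = sum(np.sum(w ** 2) for w in weights)
    return c0 + lam / (2 * n) * penalty

# Hypothetical two-layer weights and unregularized cost
weights = [np.array([[1.0, -2.0], [0.5, 0.0]]), np.array([[3.0, 1.0]])]
c0 = 0.30
n = 100  # training set size

for lam in (0.0, 0.1, 5.0):
    print(lam, l2_regularized_cost(c0, weights, lam, n))
```

With a small `lam` the total cost is almost entirely $C_0$, so the optimizer mostly fits the data; with a large `lam` the penalty term dominates and shrinking the weights becomes the cheaper way to reduce $C$.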

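Taking the partial derivatives of the regularized cost gives the gradient descent update rules, with learning rate $\eta$. The penalty does not depend on the biases, so the rule for the biases is unchanged:

$\begin{eqnarray}
b_{new} = b -\eta \frac{\partial C_0}{\partial b}.
\end{eqnarray}$

The rule for the weights picks up an extra term, which rescales $w$ by the factor $1-\frac{\eta \lambda}{n}$ before the usual gradient step:

$\begin{eqnarray}
  w_{new} & = & w-\eta \frac{\partial C_0}{\partial w}-\frac{\eta \lambda}{n} w \\
  & = & \left(1-\frac{\eta \lambda}{n}\right) w -\eta \frac{\partial C_0}{\partial w}.
\end{eqnarray}$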

For stochastic gradient descent we can estimate $\partial C_0 / \partial w$ by averaging over a mini-batch of $m$ training examples. Thus the regularized learning rule for stochastic gradient descent becomes:

$\begin{eqnarray} 
  w_{new}= \left(1-\frac{\eta \lambda}{n}\right) w -\frac{\eta}{m}
  \sum_x \frac{\partial C_x}{\partial w}, 
\end{eqnarray}$

$\begin{eqnarray}
  b_{new} = b - \frac{\eta}{m} \sum_x \frac{\partial C_x}{\partial b},
\end{eqnarray}$


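where $n$ is, as usual, the size of the training set and $m$ is the number of training examples in the mini-batch. Because the factor $1-\frac{\eta \lambda}{n}$ shrinks the weights at every step, L2 regularization is also known as weight decay.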
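
The mini-batch update rules above can be sketched for a simple linear model with a quadratic per-example cost $C_x = \frac{1}{2}(w \cdot x + b - y)^2$. The model, the data, the hyperparameter values, and the names `sgd_step` and `train` are assumptions chosen for illustration; the update lines themselves follow the equations above.

```python
import numpy as np

def sgd_step(w, b, X_batch, y_batch, eta, lam, n):
    """One regularized SGD step; n is the full training set size."""
    m = X_batch.shape[0]                      # mini-batch size
    err = X_batch @ w + b - y_batch           # dC_x/db for each example
    grad_w = X_batch.T @ err / m              # (1/m) * sum_x dC_x/dw
    grad_b = err.mean()                       # (1/m) * sum_x dC_x/db
    w_new = (1 - eta * lam / n) * w - eta * grad_w   # weight decay + gradient step
    b_new = b - eta * grad_b                         # biases are not decayed
    return w_new, b_new

rng = np.random.default_rng(0)
n, m = 200, 20
X = rng.normal(size=(n, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.3 + 0.05 * rng.normal(size=n)

def train(lam, epochs=50, eta=0.1):
    """Run mini-batch SGD from zero weights with a fixed shuffling seed."""
    w, b = np.zeros(3), 0.0
    order_rng = np.random.default_rng(42)
    for _ in range(epochs):
        idx = order_rng.permutation(n)
        for start in range(0, n, m):
            batch = idx[start:start + m]
            w, b = sgd_step(w, b, X[batch], y[batch], eta, lam, n)
    return w, b

w_plain, b_plain = train(lam=0.0)
w_decay, b_decay = train(lam=50.0)
print(np.linalg.norm(w_plain), np.linalg.norm(w_decay))
```

With `lam=0.0` the rule reduces to ordinary SGD; with a positive `lam` the learned weight vector ends up shorter, while the bias is updated exactly as before.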


Heuristically, if the cost function is unregularized, the length of the weight vector is likely to grow, all other things being equal. Over time the weight vector can become very large. It can then get stuck pointing in more or less the same direction, because gradient descent updates change the direction only slightly when the vector is long. This makes it hard for the learning algorithm to explore the weight space properly, and consequently harder to find good minima of the cost function.
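
The geometric part of this heuristic is easy to check numerically. The sketch below applies the same update to a short and a long weight vector and measures how far each one rotates. It assumes, purely for illustration, that the update has a fixed size and is perpendicular to the weights; the function name `angle_change` is hypothetical.

```python
import numpy as np

def angle_change(w, step):
    """Angle in degrees between w and w + step."""
    w_new = w + step
    cos = w @ w_new / (np.linalg.norm(w) * np.linalg.norm(w_new))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

direction = np.array([1.0, 0.0])
step = np.array([0.0, 0.1])   # same update applied to both weight vectors

small = angle_change(1.0 * direction, step)     # short weight vector
large = angle_change(100.0 * direction, step)   # long weight vector
print(small, large)
```

The long vector rotates roughly a hundred times less than the short one for the same step, which is the sense in which a large weight vector "gets stuck" in its direction.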


<img src='images/regularization.svg'>

## Lasso with L1 norm
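Lasso (Least Absolute Shrinkage and Selection Operator) replaces the squared L2 penalty with the L1 norm of the coefficients, $\lambda \sum_i |\beta_i|$. Unlike ridge regression, it can shrink some coefficients exactly to zero, which is why it also acts as a variable selection method.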
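
The "selection" behaviour can be seen in a special case. Assuming an orthonormal design ($X^{T}X = I$) and the objectives $\frac{1}{2}\|X\beta - y\|^2 + \lambda \|\beta\|_1$ for lasso and $\frac{1}{2}\|X\beta - y\|^2 + \frac{\lambda}{2}\|\beta\|^2$ for ridge, the lasso solution is the least-squares solution passed through a soft-thresholding function, while the ridge solution is a uniform rescaling. The sketch below relies on those assumptions; the coefficient values and the name `soft_threshold` are made up for illustration.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink each entry of z toward zero by t, clipping at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

beta_ols = np.array([3.0, -0.4, 0.05, -1.2])   # hypothetical least-squares coefficients
lam = 0.5

beta_lasso = soft_threshold(beta_ols, lam)     # small coefficients become exactly zero
beta_ridge = beta_ols / (1.0 + lam)            # every coefficient shrinks, none vanish

print(beta_lasso)
print(beta_ridge)
```

In this setting lasso removes the two coefficients whose magnitude is below `lam`, whereas ridge keeps all four and only scales them down.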

Refs: [1](https://towardsdatascience.com/l1-and-l2-regularization-methods-ce25e7fc831c)

## Advanced Regularization
[1](https://www.youtube.com/watch?v=ATo7vnzy5sY)

```python
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.datasets import make_regression

# Generate some sample data
X, y = make_regression(n_samples=100, n_features=2, noise=0.1)

# Lasso Regression (L1 regularization)
lasso = Lasso(alpha=0.1)  # alpha plays the role of lambda in the formula
lasso.fit(X, y)

# Ridge Regression (L2 regularization)
ridge = Ridge(alpha=0.1)
ridge.fit(X, y)

# Elastic Net (Combination of L1 and L2)
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)  # l1_ratio controls the mixture of L1 vs. L2
elastic_net.fit(X, y)
```

In the example above, the `alpha` parameter controls the amount of regularization: a larger value means more regularization. The `l1_ratio` parameter in `ElasticNet` determines the balance between L1 and L2 regularization.

In practice, choosing the best regularization type and parameter often requires experimentation and cross-validation.
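
One possible way to run that search is with scikit-learn's built-in cross-validated estimators. The sketch below is an illustration under assumptions: the candidate grid `alphas`, the 5-fold split, and the fixed `random_state` are arbitrary choices, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV

# Same kind of synthetic data as in the example above, with a fixed seed
X, y = make_regression(n_samples=100, n_features=2, noise=0.1, random_state=0)

# Candidate regularization strengths (a hypothetical grid)
alphas = [0.001, 0.01, 0.1, 1.0, 10.0]

# Each estimator tries every alpha with 5-fold cross-validation
ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X, y)
lasso_cv = LassoCV(alphas=alphas, cv=5).fit(X, y)

print("best ridge alpha:", ridge_cv.alpha_)
print("best lasso alpha:", lasso_cv.alpha_)
```

After fitting, `alpha_` holds the selected strength and the estimator itself is refit on the full data with that value, so it can be used for prediction directly.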
