# Regularization

As a general term, *regularization* refers to replacing something that is difficult to compute accurately with something more tractable. For learning models, regularization is a common way to combat overfitting.

Imagine we had an $\real^{n\times 4}$ feature matrix in which the features are identical; that is, the predictor variables satisfy $x_1=x_2=x_3=x_4$, and suppose the target $y$ also equals $x_1$. Clearly, we get a perfect regression if we use

$$
y = 1x_1 + 0x_2 + 0x_3 + 0x_4.
$$

But an equally good regression is 

$$
y = \frac{1}{4}x_1 + \frac{1}{4}x_2 + \frac{1}{4}x_3 + \frac{1}{4}x_4.
$$

For that matter, so is

$$
y = 1000x_1 - 500x_2 - 500x_3 + 1x_4.
$$

A problem with more than one solution is called **ill-posed**. If we made tiny changes to the predictor variables in this thought experiment, the problem would technically be well-posed, but a wide range of solutions would still be very nearly correct. Such a problem is said to be **ill-conditioned**, and for practical purposes it remains just as difficult.
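
The thought experiment is easy to check numerically. The following is a minimal sketch using NumPy, with a column of random values standing in for the shared feature; the sample size, seed, and perturbation size are arbitrary assumptions. It confirms that all three coefficient vectors reproduce the target, and it uses the condition number of the feature matrix to show how close to singular the slightly perturbed problem remains.

```python
import numpy as np

rng = np.random.default_rng(0)      # arbitrary seed, for reproducibility
n = 50                              # arbitrary number of samples
x = rng.normal(size=n)              # the one underlying feature

# Four identical feature columns, and a target equal to the first one
X = np.column_stack([x, x, x, x])
y = x.copy()

# The three coefficient vectors from the text
candidates = [
    np.array([1.0, 0.0, 0.0, 0.0]),
    np.array([0.25, 0.25, 0.25, 0.25]),
    np.array([1000.0, -500.0, -500.0, 1.0]),
]
for w in candidates:
    print(w, "fits exactly:", np.allclose(X @ w, y))

# Identical columns make X rank-deficient, so the solution is not unique
print("rank of X:", np.linalg.matrix_rank(X))

# Perturb the features slightly: full rank now, but badly conditioned
X_pert = X + 1e-6 * rng.normal(size=X.shape)
print("rank after perturbation:", np.linalg.matrix_rank(X_pert))
print("condition number after perturbation:", np.linalg.cond(X_pert))
```

A large condition number means that small changes in the data can produce large changes in the least-squares coefficients, which is the practical difficulty described above.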

The ill conditioning can be regularized away by modifying the least squares loss function to penalize complexity in the model, in the form of excessively large regression coefficients. The common choices are **ridge regression**,

$$
L(\bfw) = \twonorm{ \bfX \bfw- \bfy }^2 + \alpha \twonorm{\bfw}^2,
$$

and **LASSO**, 

$$
L(\bfw) = \twonorm{ \bfX \bfw- \bfy }^2 + \alpha \onenorm{\bfw}.
$$

Here $\alpha \ge 0$ is a hyperparameter that sets the strength of the penalty. As $\alpha\to 0$, both forms revert to the usual least squares loss, but as $\alpha \to \infty$, the optimization increasingly prioritizes keeping $\bfw$ small over fitting the data.
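
For ridge regression, setting the gradient of $L$ to zero gives the linear system $(\bfX^T\bfX + \alpha \mathbf{I})\bfw = \bfX^T\bfy$. The sketch below solves that system with NumPy on synthetic data and prints the norm of the solution for several penalty strengths. The data, the seed, the list of $\alpha$ values, and the helper name `ridge_solve` are illustrative assumptions, and no intercept term is included.

```python
import numpy as np

rng = np.random.default_rng(1)                 # arbitrary seed
n, d = 40, 4
x = rng.normal(size=n)
# Nearly identical columns, as in the thought experiment above
X = np.column_stack([x] * d) + 1e-3 * rng.normal(size=(n, d))
y = x + 0.01 * rng.normal(size=n)

def ridge_solve(X, y, alpha):
    """Minimize ||Xw - y||^2 + alpha * ||w||^2 via the normal equations."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

w_ls = np.linalg.lstsq(X, y, rcond=None)[0]    # unregularized least squares
print("least squares:", w_ls, " norm =", np.linalg.norm(w_ls))

for alpha in [1e-8, 1e-2, 1.0, 100.0]:
    w = ridge_solve(X, y, alpha)
    print(f"alpha = {alpha:g}: w = {w}, norm = {np.linalg.norm(w):.4f}")
```

The norm of the ridge solution cannot increase as $\alpha$ grows, which is the shrinkage the penalty is designed to produce.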

While the ridge regression loss is easier to minimize quickly, LASSO has an interesting advantage, as illustrated in this figure.

```{figure} ../_static/regularization.png
```

LASSO tends to produce **sparse** results, meaning that some of the regression coefficients are zero or negligible. These zeros indicate predictor variables that have little predictive value, which can be valuable information in itself. Moreover, when the regression is rerun without these variables, there may be little effect on the bias but a reduction in variance.
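
One way to see where the zeros come from is to look at a single coefficient in isolation. For a scalar $w$ and a fixed value $a$, minimizing $(w-a)^2 + \alpha w^2$ gives $w = a/(1+\alpha)$, which shrinks $a$ but never reaches zero. Minimizing $(w-a)^2 + \alpha |w|$ gives $w = \operatorname{sign}(a)\max(|a| - \alpha/2,\, 0)$, which is exactly zero whenever $|a| \le \alpha/2$. The sketch below illustrates this one-dimensional simplification, not the full multivariate problem; the function names and the values of `a` and `alpha` are arbitrary choices. It evaluates both formulas and checks them against a brute-force search over a grid.

```python
import numpy as np

def ridge_1d(a, alpha):
    """Minimizer of (w - a)**2 + alpha * w**2."""
    return a / (1 + alpha)

def lasso_1d(a, alpha):
    """Minimizer of (w - a)**2 + alpha * |w| (soft thresholding)."""
    return np.sign(a) * np.maximum(np.abs(a) - alpha / 2, 0.0)

alpha = 1.0                                   # arbitrary penalty strength
a_values = np.array([-2.0, -0.4, 0.1, 0.5, 3.0])

# Brute-force check: minimize each penalized loss over a fine grid of w
grid = np.linspace(-4, 4, 80001)              # spacing 1e-4
for a in a_values:
    w_ridge_grid = grid[np.argmin((grid - a) ** 2 + alpha * grid**2)]
    w_lasso_grid = grid[np.argmin((grid - a) ** 2 + alpha * np.abs(grid))]
    print(f"a = {a:5.2f}: ridge {ridge_1d(a, alpha):7.4f} (grid {w_ridge_grid:7.4f}), "
          f"LASSO {lasso_1d(a, alpha):7.4f} (grid {w_lasso_grid:7.4f})")
```

Every value of `a` with magnitude at most `alpha / 2` is mapped to zero by the LASSO formula, while the ridge formula only scales it down.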

## Case study: Diabetes progression

We'll apply regularized regression to data collected about the progression of diabetes.

```python
from sklearn import datasets

# Load the features and the target together as one DataFrame
diabetes = datasets.load_diabetes(as_frame=True)["frame"]
diabetes
```

| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.050680 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019908 | -0.017646 | 151.0 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068330 | -0.092204 | 75.0 |
| 2 | 0.085299 | 0.050680 | 0.044451 | -0.005671 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002864 | -0.025930 | 141.0 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022692 | -0.009362 | 206.0 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031991 | -0.046641 | 135.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 437 | 0.041708 | 0.050680 | 0.019662 | 0.059744 | -0.005697 | -0.002566 | -0.028674 | -0.002592 | 0.031193 | 0.007207 | 178.0 |
| 438 | -0.005515 | 0.050680 | -0.015906 | -0.067642 | 0.049341 | 0.079165 | -0.028674 | 0.034309 | -0.018118 | 0.044485 | 104.0 |
| 439 | 0.041708 | 0.050680 | -0.015906 | 0.017282 | -0.037344 | -0.013840 | -0.024993 | -0.011080 | -0.046879 | 0.015491 | 132.0 |
| 440 | -0.045472 | -0.044642 | 0.039062 | 0.001215 | 0.016318 | 0.015283 | -0.028674 | 0.026560 | 0.044528 | -0.025930 | 220.0 |


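First, we look at basic linear regression on all 10 predictive features in the data. The `score` method of a scikit-learn regressor reports the coefficient of determination, $R^2$, which we evaluate here on the held-out test set.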

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# The features are all columns but the last; the target is the last column
X = diabetes.iloc[:, :-1]
y = diabetes.iloc[:, -1]

# Hold out 20% of the samples for testing
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

lm = LinearRegression()
lm.fit(X_tr, y_tr)
print("linear model score:", lm.score(X_te, y_te))
```

```
linear model score: 0.33222203269065154
```


Ridge regression improves the score a bit, at least for some values of the hyperparameter $\alpha$:

```python
from sklearn.linear_model import Ridge

rr = Ridge(alpha=0.5)   # alpha sets the strength of the penalty
rr.fit(X_tr, y_tr)
print("ridge regression score:", rr.score(X_te, y_te))
```

```
ridge regression score: 0.35667773997495866
```
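
Because the benefit depends on the hyperparameter, it can help to try several values. The sketch below repeats the split used above and scans a handful of values of `alpha`; the particular grid and the name `ridge_scores` are arbitrary choices. Comparing test scores this way is for illustration only, since in practice the hyperparameter is better chosen with a validation set or cross-validation.

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

diabetes = datasets.load_diabetes(as_frame=True)["frame"]
X = diabetes.iloc[:, :-1]
y = diabetes.iloc[:, -1]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Unregularized baseline for comparison
baseline = LinearRegression().fit(X_tr, y_tr).score(X_te, y_te)
print(f"linear model: R^2 = {baseline:.4f}")

# Arbitrary grid of penalty strengths, for illustration only
ridge_scores = {}
for alpha in [0.001, 0.01, 0.1, 0.5, 1.0, 10.0]:
    rr = Ridge(alpha=alpha).fit(X_tr, y_tr)
    ridge_scores[alpha] = rr.score(X_te, y_te)
    print(f"alpha = {alpha:<6g} R^2 = {ridge_scores[alpha]:.4f}")
```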


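Note that scikit-learn's `Lasso` divides the squared-error term by $2n$, where $n$ is the number of samples, so its `alpha` is not on the same scale as the one passed to `Ridge`. With the value chosen here, a LASSO regression makes a smaller improvement: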

```python
from sklearn.linear_model import Lasso

lass = Lasso(alpha=0.2)   # alpha sets the strength of the 1-norm penalty
lass.fit(X_tr, y_tr)
print("LASSO model score:", lass.score(X_te, y_te))
```

```
LASSO model score: 0.3433304728856845
```


However, while ridge regression still uses all of the features, LASSO ignores four of them:

```python
print("ridge coeffs:")
print(rr.coef_)
print("LASSO coeffs:")
print(lass.coef_)
```

```
ridge coeffs:
[   8.3095978  -121.63358099  388.91388623  219.55009665  -19.77469383
  -70.82842908 -183.84763557  125.0866159   328.9173479   100.08665295]
LASSO coeffs:
[  -0.          -90.57212015  546.28988664  196.86417823   -0.
  -19.04570829 -198.35369805    0.          469.75585901    0.        ]
```


We can use the magnitude of the LASSO coefficients to rank the relative importance of the predictive features:

```python
import numpy as np

idx = np.argsort(np.abs(lass.coef_))  # sort zero to largest
idx = idx[::-1]                       # reverse
X.columns[idx]
```

```
Index(['bmi', 's5', 's3', 'bp', 'sex', 's2', 's6', 's4', 's1', 'age'], dtype='object')
```
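
Pairing the coefficients with their column names makes the result easier to read. As an alternative sketch, the same ranking can be obtained by sorting a labeled pandas `Series`, which also makes it simple to list the features that LASSO dropped. The names `coef`, `ranked`, and `dropped` are introduced here for illustration. The order among features whose coefficient is zero is arbitrary.

```python
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

diabetes = datasets.load_diabetes(as_frame=True)["frame"]
X = diabetes.iloc[:, :-1]
y = diabetes.iloc[:, -1]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

lass = Lasso(alpha=0.2).fit(X_tr, y_tr)

# Label each coefficient with its feature name
coef = pd.Series(lass.coef_, index=X.columns)

# Rank by magnitude, largest first
ranked = coef.abs().sort_values(ascending=False)
print(ranked)

# Features whose coefficient is exactly zero
dropped = coef.index[coef == 0].tolist()
print("dropped by LASSO:", dropped)
```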

Finally, we will use cross-validation to compare basic regression on all of the features with regression on just the top 5 features:

```python
from sklearn.model_selection import cross_val_score, KFold

# Use the same 8 shuffled folds for both comparisons
kf = KFold(n_splits=8, shuffle=True, random_state=10)

scores = cross_val_score(lm, X, y, cv=kf)
print("scores with all predictors:")
print(f"mean = {scores.mean():.5f}, std = {scores.std():.4f}")

# Keep only the 5 columns with the largest LASSO coefficients
scores = cross_val_score(lm, X.iloc[:, idx[:5]], y, cv=kf)
print("scores with top 5 predictors:")
print(f"mean = {scores.mean():.5f}, std = {scores.std():.4f}")
```

```
scores with all predictors:
mean = 0.48061, std = 0.1164
scores with top 5 predictors:
mean = 0.48293, std = 0.0962
```


When fewer features are used, the spread of the cross-validation scores decreases (the standard deviation drops from 0.1164 to 0.0962), and the mean testing score actually goes up a bit as well.
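
The choice of five features is only one possibility. As a sketch of how one might explore the others, the loop below repeats the comparison for every number of top-ranked features, using the same LASSO ranking and the same folds as above. The dictionary `results` is a name introduced here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

diabetes = datasets.load_diabetes(as_frame=True)["frame"]
X = diabetes.iloc[:, :-1]
y = diabetes.iloc[:, -1]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Rank features by LASSO coefficient magnitude, largest first
lass = Lasso(alpha=0.2).fit(X_tr, y_tr)
idx = np.argsort(np.abs(lass.coef_))[::-1]

kf = KFold(n_splits=8, shuffle=True, random_state=10)
lm = LinearRegression()

# Cross-validate plain linear regression on the top k features, k = 1..10
results = {}
for k in range(1, X.shape[1] + 1):
    scores = cross_val_score(lm, X.iloc[:, idx[:k]], y, cv=kf)
    results[k] = (scores.mean(), scores.std())
    print(f"top {k:2d}: mean = {scores.mean():.5f}, std = {scores.std():.4f}")
```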