# Intro to penalized regression models

<font color = 'green'> __Regularization:__ <font color = 'black'> Adding a constraint or a penalty to a model to reduce its complexity. We did this with trees by setting a maximum depth. The purpose is to reduce overfitting by preventing the model from fitting its parameters too freely to the training data.

<font color = 'green'> __Bias-variance tradeoff:__ <font color = 'black'> Prediction error can be roughly broken down into
* irreducible error, coming from inherent variability in the target variable; we can't do anything about this
* bias, coming from a model that is insufficient to capture the relationships between the predictors and the target
* variance, coming from sensitivity of the model to variations in the training data
    
Overfitting error is due to excess variance. More complex models generally have lower bias but higher variance. This is the *bias-variance tradeoff.* We can exploit it by introducing a small amount of bias to reduce variance -- this is the goal of regularization.
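
The tradeoff can be illustrated with a small simulation. This is only a sketch under assumed conditions: the "true" function, the noise level, and the polynomial degrees are all made up for illustration and have nothing to do with the body fat data. We repeatedly draw noisy training targets, fit a simple model (a line) and a more complex one (a degree-7 polynomial), and estimate the squared bias and the variance of their predictions.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # Hypothetical "true" relationship, chosen only for illustration
    return np.sin(2 * np.pi * x)

x_train = np.linspace(0, 1, 30)   # fixed predictor values
noise_sd, n_repeats = 0.3, 200    # assumed noise level and number of simulated data sets

def bias_variance(degree):
    """Estimate squared bias and variance of a polynomial fit of the given degree."""
    preds = np.empty((n_repeats, x_train.size))
    for r in range(n_repeats):
        # Draw a fresh set of noisy targets each time
        y = true_f(x_train) + rng.normal(0, noise_sd, x_train.size)
        coefs = np.polyfit(x_train, y, degree)
        preds[r] = np.polyval(coefs, x_train)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - true_f(x_train)) ** 2)  # systematic error
    variance = np.mean(preds.var(axis=0))                  # sensitivity to the sample
    return bias_sq, variance

bias_simple, var_simple = bias_variance(degree=1)
bias_complex, var_complex = bias_variance(degree=7)
print(f"degree 1: bias^2 = {bias_simple:.4f}, variance = {var_simple:.4f}")
print(f"degree 7: bias^2 = {bias_complex:.4f}, variance = {var_complex:.4f}")
```

In this setup the line should show the larger bias and the degree-7 polynomial the larger variance.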

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error

We have a data set of observations of 252 people's body dimensions, along with measurements of their body fat percentage. Our goal is to estimate body fat from the other variables.

In [None]:
bodyfat = pd.read_csv('bodyfat.csv')
print(bodyfat.shape)
bodyfat.head()

<font color = 'green'> __Cost function:__ <font color = 'black'> A cost function or loss function measures the inaccuracy of a model. Curve fitting models, including linear regression, operate by finding coefficients of the model so that the cost function is minimized.

<font color = 'green'> __Linear model:__ <font color = 'black'> For our purposes, a linear model is one with a model equation of the form

$$ \hat y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_m x_m $$
    
or, written as a vector equation,

$$ \hat y = \boldsymbol \beta \cdot \mathbf x $$

where $\boldsymbol \beta = (\beta_0, \beta_1, \ldots, \beta_m)$ and $\mathbf x = (1, x_1, x_2, \ldots, x_m)$ (the leading 1 multiplies $\beta_0$ to give the constant term / intercept).
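
As a quick check that the two forms agree, here is a minimal sketch with made-up numbers (the coefficient and predictor values are hypothetical):

```python
import numpy as np

# Hypothetical coefficients: intercept beta_0 = 2, then beta_1, beta_2, beta_3
beta = np.array([2.0, 0.5, -1.0, 3.0])
# Hypothetical predictor values x_1, x_2, x_3
x = np.array([4.0, 1.0, 2.0])

# Vector form: prepend a 1 to x so that it pairs with the intercept
x_aug = np.concatenate(([1.0], x))
y_hat_vector = beta @ x_aug

# Written-out form: beta_0 + beta_1 x_1 + beta_2 x_2 + beta_3 x_3
y_hat_sum = beta[0] + np.sum(beta[1:] * x)

print(y_hat_vector, y_hat_sum)  # both are 9.0
```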

<font color = 'green'> __Least squares:__ <font color = 'black'> In least squares, we use the cost function

$$ \mathcal L(\boldsymbol \beta) = \frac{1}{N} \sum_i |y_i - \hat y_i|^2 $$

(i.e. the mean squared error)
    
The script $\mathcal L$ is common notation for a cost function (from the other common term for a cost function, *loss function*). Note that we think of $\mathcal L$ as a function of the coefficients $\boldsymbol \beta$, not so much as a function of $\mathbf x$ or $y$. This is because we want to find the value of $\boldsymbol \beta$ that minimizes the loss.
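
To make the idea of "loss as a function of the coefficients" concrete, here is a small sketch on synthetic data (the data and the name `mse_loss` are assumptions for illustration). It evaluates the MSE at the least squares solution and at a few nearby coefficient vectors; every nearby vector should give a larger loss.

```python
import numpy as np

rng = np.random.default_rng(1)
N, m = 100, 3

# Synthetic predictors and target (made up for illustration)
X = rng.normal(size=(N, m))
X_aug = np.column_stack([np.ones(N), X])   # column of 1s for the intercept
true_beta = np.array([1.0, 2.0, -1.0, 0.5])
y = X_aug @ true_beta + rng.normal(scale=0.5, size=N)

def mse_loss(beta):
    """Mean squared error, viewed as a function of the coefficient vector."""
    residuals = y - X_aug @ beta
    return np.mean(residuals ** 2)

# Least squares coefficients: the minimizer of mse_loss
beta_hat, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
loss_at_min = mse_loss(beta_hat)

# Any perturbation of beta_hat increases the loss
perturbed_losses = [
    mse_loss(beta_hat + rng.normal(scale=0.1, size=m + 1)) for _ in range(5)
]
print("loss at minimizer:", loss_at_min)
print("losses nearby:    ", perturbed_losses)
```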

<font color = 'green'> __Ridge regression:__ <font color = 'black'> A modification of least squares, using the loss function
    
$$ \mathcal L_{ridge}(\boldsymbol \beta) = \frac{1}{N} \sum_i |y_i - \hat y_i|^2 + \alpha \|\boldsymbol \beta\|^2$$

where the double vertical bars $\| \cdot \|$ refer to vector magnitude. So, in ridge regression, we "penalize" models that have large coefficients by adding the sum of the squares of those coefficients, scaled by $\alpha$, to the cost. (In practice the intercept $\beta_0$ is usually left out of the penalty; scikit-learn's `Ridge` does not penalize it.) The result is a model that tries to minimize the MSE while also keeping the coefficients small. The tuning parameter (hyperparameter) $\alpha$ controls the balance between these two goals. The larger $\alpha$ is, the more the model will prioritize small coefficients.
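
The effect of $\alpha$ can be seen directly with a small sketch. It assumes synthetic, centered data (so no intercept is needed) and uses the closed-form minimizer of the ridge loss as written above, $\boldsymbol \beta = (X^T X + N \alpha I)^{-1} X^T \mathbf y$. The helper name `ridge_coefs` is hypothetical, and note that scikit-learn's `Ridge` scales its loss differently, so its `alpha` is not numerically the same as this one.

```python
import numpy as np

rng = np.random.default_rng(2)
N, m = 60, 5

# Synthetic data, made up for illustration
X = rng.normal(size=(N, m))
y = X @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(scale=1.0, size=N)

# Center predictors and target so that the intercept can be dropped
X = X - X.mean(axis=0)
y = y - y.mean()

def ridge_coefs(alpha):
    """Minimizer of (1/N) * sum of squared errors + alpha * ||beta||^2."""
    return np.linalg.solve(X.T @ X + N * alpha * np.eye(m), X.T @ y)

alphas = [0.0, 0.1, 1.0, 10.0]
norms = [np.linalg.norm(ridge_coefs(a)) for a in alphas]
for a, n in zip(alphas, norms):
    print(f"alpha = {a:5.1f}  ->  ||beta|| = {n:.3f}")
```

With $\alpha = 0$ this reduces to ordinary least squares, and the coefficient magnitude shrinks as $\alpha$ grows.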

<font color = 'green'> __Lasso:__ <font color = 'black'> A modification of least squares, using the loss function
    
$$ \mathcal L_{lasso}(\boldsymbol \beta) = \frac{1}{N} \sum_i |y_i - \hat y_i|^2 + \alpha \sum_j |\beta_j|$$
    
so we are penalizing the absolute values of the coefficients instead of their squares.
    
This seems like a small difference, but the lasso method has one important property. Ridge regression shrinks the coefficients toward 0 but generally does not set any of them exactly to 0, whereas the lasso can. Since setting a coefficient to 0 amounts to dropping that predictor from the model, the lasso method is capable of performing *variable selection*; it can suggest which predictors, if any, should be excluded from the model entirely. This can be useful when we have a large number of predictors and cannot easily tell which are the most useful.
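
Here is a minimal sketch of that difference on synthetic data, where only the first three of ten predictors actually affect the target. The data and the `alpha` values are assumptions chosen to make the effect visible, not recommended settings.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(3)
N = 200

# Ten synthetic predictors; only the first three influence y
X = rng.normal(size=(N, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(size=N)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

# Count coefficients that are exactly zero in each model
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Ridge coefficients:", np.round(ridge.coef_, 3))
print("Exact zeros (lasso, ridge):", n_zero_lasso, n_zero_ridge)
```

The ridge coefficients for the irrelevant predictors should be small but nonzero, while the lasso should set at least some of them to exactly 0.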

<font color = 'green'> __Elastic net:__ <font color = 'black'> A model that combines the lasso with ridge regression by using the cost function
    
$$ \mathcal L_{net}(\boldsymbol \beta) = t \cdot \mathcal L_{lasso}(\boldsymbol \beta) + (1-t) \cdot \mathcal L_{ridge}(\boldsymbol \beta)$$
    
Here $0 \leq t \leq 1$ is another tuning parameter, controlling the mixture of the two penalties. Notice that when $t = 1$, we recover the lasso; when $t = 0$, we recover ridge regression.
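
In scikit-learn, the `l1_ratio` parameter of `ElasticNet` plays the role of $t$, although the library scales the terms of its objective somewhat differently from the formula above. If `l1_ratio` is not given, it defaults to 0.5. The sketch below, on made-up synthetic data, checks that `l1_ratio=1.0` reproduces the lasso fit.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(4)

# Synthetic data, made up for illustration
X = rng.normal(size=(150, 6))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=150)

alpha = 0.1
lasso = Lasso(alpha=alpha).fit(X, y)
net_all_l1 = ElasticNet(alpha=alpha, l1_ratio=1.0).fit(X, y)   # pure L1 penalty
net_mixed = ElasticNet(alpha=alpha, l1_ratio=0.5).fit(X, y)    # equal mixture

print("Lasso:               ", np.round(lasso.coef_, 3))
print("ElasticNet, ratio 1: ", np.round(net_all_l1.coef_, 3))
print("ElasticNet, ratio .5:", np.round(net_mixed.coef_, 3))
```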

Now let's fit a few examples of the above models to our data. First, we set aside a training set.

In [None]:
# Randomly select 50 observations as the training set; the rest form the test set

train_idx = np.random.choice(bodyfat.index, 50, replace = False)
bodyfat_train = bodyfat.loc[train_idx]
bodyfat_test = bodyfat.drop(train_idx)
train_X = bodyfat_train.drop('bodyfat', axis = 1)
train_y = bodyfat_train['bodyfat']
test_X = bodyfat_test.drop('bodyfat', axis = 1)
test_y = bodyfat_test['bodyfat']

We'll fit a few of these models to the training data:

In [None]:
ols_model = LinearRegression()
ridge_model = Ridge(alpha = 1)
lasso_model = Lasso(alpha = 0.01)
net_model = ElasticNet(alpha = 0.5)

ols_model.fit(train_X, train_y)
ridge_model.fit(train_X, train_y)
lasso_model.fit(train_X, train_y)
net_model.fit(train_X, train_y)

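The values of $\alpha$ are chosen above to conveniently illustrate some features of the models, but in practice you'd want to select these using cross-validation. Note also that scikit-learn scales the squared-error term differently in `Ridge` than in `Lasso` and `ElasticNet`, so the same numerical value of $\alpha$ does not mean the same penalty strength across models.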
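
One way to do that selection is sketched below. It uses synthetic stand-in data so that it runs on its own; with the body fat data you would pass `train_X` and `train_y` instead. The grid of candidate values is an arbitrary assumption.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)

# Synthetic stand-in for the training data (made up for illustration)
X = rng.normal(size=(100, 8))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=100)

candidate_alphas = [0.001, 0.01, 0.1, 1.0, 10.0]   # arbitrary grid
cv_mse = []
for alpha in candidate_alphas:
    # scikit-learn reports negated MSE so that larger scores are better
    scores = cross_val_score(
        Lasso(alpha=alpha), X, y, cv=5, scoring="neg_mean_squared_error"
    )
    cv_mse.append(-scores.mean())

# Keep the alpha with the lowest average validation MSE
best_alpha = candidate_alphas[int(np.argmin(cv_mse))]
for alpha, mse in zip(candidate_alphas, cv_mse):
    print(f"alpha = {alpha:6.3f}  ->  CV MSE = {mse:.3f}")
print("Selected alpha:", best_alpha)
```

The same loop works for `Ridge` and `ElasticNet`; scikit-learn also provides `RidgeCV`, `LassoCV`, and `ElasticNetCV`, which run this kind of search internally.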

In [None]:
ols_preds = ols_model.predict(test_X)
ridge_preds = ridge_model.predict(test_X)
lasso_preds = lasso_model.predict(test_X)
net_preds = net_model.predict(test_X)

In [None]:
# mean_squared_error expects (y_true, y_pred); take the square root to get RMSE
print("OLS RMSE:", np.sqrt(mean_squared_error(test_y, ols_preds)))
print("Ridge RMSE:", np.sqrt(mean_squared_error(test_y, ridge_preds)))
print("Lasso RMSE:", np.sqrt(mean_squared_error(test_y, lasso_preds)))
print("Elastic net RMSE:", np.sqrt(mean_squared_error(test_y, net_preds)))

In the original run, the lasso performed the best; because the training set is chosen at random, your results may vary. Let's take a look at the coefficients.

In [None]:
print("OLS coefficients:")
print(ols_model.coef_)
print("Ridge coefficients:")
print(ridge_model.coef_)
print("Lasso coefficients:")
print(lasso_model.coef_)
print("Elastic net coefficients:")
print(net_model.coef_)

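Interestingly, the lasso set several coefficients to exactly 0, dropping those variables from the model. The next two cells list the variables it kept and the ones it dropped.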

In [None]:
# Predictors the lasso kept (nonzero coefficient)
list(train_X.columns[lasso_model.coef_ != 0])

In [None]:
# Predictors the lasso dropped (coefficient exactly zero)
list(train_X.columns[lasso_model.coef_ == 0])