## Overview of regularization

---

The goal of **regularizing** regression models is to:
- **automatically** avoid overfitting 
- **while** we fit our model
- by adding a "penalty" to our loss function.

### Before regularization (OLS):

$$
\begin{align}
\text{minimize: MSE} &= \textstyle\frac{1}{n}\sum (y_i - \hat{y}_i)^2 \\ \\
                     &= \textstyle\frac{1}{n}\|\mathbf{y} - \hat{\mathbf{y}}\|^2 \\ \\
                     &= \textstyle\frac{1}{n}\|\mathbf{y} - \mathbf{X\beta}\|^2
\end{align}
$$

### After regularization (Ridge):

$$
\begin{align}
\text{minimize: MSE + penalty} &= \textstyle\frac{1}{n}\sum (y_i - \hat{y}_i)^2 + \alpha \sum \beta_j^2 \\ \\
                               &= \textstyle\frac{1}{n}\|\mathbf{y} - \hat{\mathbf{y}}\|^2 + \alpha \|\beta\|^2 \\ \\
                               &= \textstyle\frac{1}{n}\|\mathbf{y} - \mathbf{X\beta}\|^2 + \alpha \|\beta\|^2
\end{align}
$$

Adding this penalty term onto the end and then minimizing has a similar effect to the one described above. That is, **ridge regression shrinks our regression coefficients closer to zero to make our model simpler**. We are accepting more bias in exchange for decreased variance. We'll be tasked with picking the "best" $\alpha$ that optimizes this bias-variance tradeoff.
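
To make the shrinkage concrete, here is a minimal NumPy sketch that minimizes the ridge loss above in closed form on synthetic data. The data, the `ridge_coefs` helper, and the grid of $\alpha$ values are all made up for illustration, and the sketch omits the intercept for simplicity. Setting the gradient of $\frac{1}{n}\|\mathbf{y} - \mathbf{X\beta}\|^2 + \alpha\|\beta\|^2$ to zero gives $\beta = (\mathbf{X}^T\mathbf{X} + n\alpha\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$.

```python
import numpy as np

# Synthetic data (illustrative only, not the wine data).
rng = np.random.default_rng(0)
n, p = 50, 5
X_sim = rng.normal(size=(n, p))
true_beta = np.array([3.0, -2.0, 0.0, 1.0, 0.5])
y_sim = X_sim @ true_beta + rng.normal(scale=1.0, size=n)

def ridge_coefs(X, y, alpha):
    """Minimize (1/n)*||y - X b||^2 + alpha*||b||^2 (no intercept)."""
    n_rows, n_cols = X.shape
    # Solve (X'X + n*alpha*I) b = X'y instead of inverting explicitly.
    return np.linalg.solve(X.T @ X + n_rows * alpha * np.eye(n_cols), X.T @ y)

# alpha = 0 is plain OLS; larger alpha means a heavier penalty.
alphas = [0.0, 0.1, 1.0, 10.0]
norms = [np.linalg.norm(ridge_coefs(X_sim, y_sim, a)) for a in alphas]

for a, norm in zip(alphas, norms):
    print(f'alpha = {a:>5}: ||beta|| = {norm:.3f}')
```

The printed norms should get smaller as $\alpha$ grows: a heavier penalty pulls the whole coefficient vector toward zero.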


### Sidenote on notation:
We'll be using $\alpha$ to denote our **regularization parameter**, since that's what Scikit-Learn uses. This departs from most of the statistics and data science literature, where it is usually denoted $\lambda$. Why? Only Google knows.

### [Neat parameter space visualization!](https://timothykbook.shinyapps.io/RegularizationPlot/)

In [None]:
import pandas as pd
import numpy as np

In [None]:
# Load in the wine .csv.
wine = pd.read_csv('./data/winequality_merged.csv')

# Convert all columns to lowercase and replace spaces in column names.
wine.columns = wine.columns.str.lower().str.replace(' ', '_')

In [None]:
# How big is this dataset?
wine.shape

In [None]:
# Check for missing values.
wine.isnull().sum()

In [None]:
from sklearn.preprocessing import PolynomialFeatures

# Create X and y.
X = wine.drop('quality', axis=1)
y = wine['quality']

# Instantiate our PolynomialFeatures object to create all squared terms
# and two-way interaction terms.
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)

# Fit and transform our X data.
X_overfit = poly.fit_transform(X)

In [None]:
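# List the names of the generated features.
# (`get_feature_names` was removed in scikit-learn 1.2; use `get_feature_names_out`.)
poly.get_feature_names_out(X.columns)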

In [None]:
# Check out the dimensions of X_overfit.
X_overfit.shape

### Preprocessing

In [None]:
# Import train_test_split and StandardScaler.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [None]:
# Create train/test splits.
X_train, X_test, y_train, y_test = train_test_split(
    X_overfit,
    y,
    test_size=0.7,
    random_state=42
)

In [None]:
# Scale our data.
# Relabeling scaled data as "Z" is common.
sc = StandardScaler()
Z_train = sc.fit_transform(X_train)
Z_test = sc.transform(X_test)
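
Scaling matters here because the penalty is applied to the size of the coefficients, so features measured on different scales would be penalized unevenly. The sketch below uses made-up random data to show what `fit_transform` on the training set and `transform` on the test set do: the training columns end up with mean 0 and standard deviation 1, while the test columns are only approximately standardized because they reuse the training means and standard deviations.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data standing in for a train/test split.
rng = np.random.default_rng(42)
toy_train = rng.normal(loc=10, scale=3, size=(100, 2))
toy_test = rng.normal(loc=10, scale=3, size=(40, 2))

toy_sc = StandardScaler()
toy_Z_train = toy_sc.fit_transform(toy_train)  # learn means/stds from train only
toy_Z_test = toy_sc.transform(toy_test)        # reuse the train means/stds

print('train mean:', toy_Z_train.mean(axis=0).round(3))
print('train std: ', toy_Z_train.std(axis=0).round(3))
print('test mean: ', toy_Z_test.mean(axis=0).round(3))
print('test std:  ', toy_Z_test.std(axis=0).round(3))
```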

In [None]:
print(f'Z_train shape is: {Z_train.shape}')
print(f'y_train shape is: {y_train.shape}')
print(f'Z_test shape is: {Z_test.shape}')
print(f'y_test shape is: {y_test.shape}')

### Normal Linear Regression

In [None]:
# Import the appropriate library and fit our OLS model.
from sklearn.linear_model import LinearRegression

ols = LinearRegression()
ols.fit(Z_train, y_train)

# How does the model score on the training and test data?
print(ols.score(Z_train, y_train))
print(ols.score(Z_test, y_test))

### Ridge Regularization

In [None]:
# Ridge regressor lives here:
from sklearn.linear_model import Ridge

In [None]:
# Instantiate.
ridge_model = Ridge(alpha=10)

# Fit.
ridge_model.fit(Z_train, y_train)

# Evaluate model using R2.
print(ridge_model.score(Z_train, y_train))
print(ridge_model.score(Z_test, y_test))
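
The `alpha=10` above was picked by hand. One common way to choose it is cross-validation over a grid of candidates, which `RidgeCV` does for us. The sketch below is self-contained and runs on synthetic data from `make_regression`. The grid `np.logspace(-2, 3, 50)` and the 5-fold setting are assumptions for illustration, not tuned values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 30 features, only 5 of them informative.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=30, n_informative=5, noise=20.0, random_state=42
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=42
)

demo_sc = StandardScaler()
Zd_train = demo_sc.fit_transform(Xd_train)
Zd_test = demo_sc.transform(Xd_test)

# Candidate alphas on a log scale (an arbitrary illustrative grid).
candidate_alphas = np.logspace(-2, 3, 50)

# Try every candidate with 5-fold CV on the training data, keep the best.
ridge_cv = RidgeCV(alphas=candidate_alphas, cv=5)
ridge_cv.fit(Zd_train, yd_train)

print('Chosen alpha:', ridge_cv.alpha_)
print('Train R2:', ridge_cv.score(Zd_train, yd_train))
print('Test R2: ', ridge_cv.score(Zd_test, yd_test))
```

On the wine data, the same pattern would be `RidgeCV(alphas=...).fit(Z_train, y_train)`.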

### LASSO Regularization

In [None]:
# Imports similar to Ridge
from sklearn.linear_model import Lasso

In [None]:
# Instantiate.
lasso_model = Lasso(alpha=.007)

# Fit.
lasso_model.fit(Z_train, y_train)

# Evaluate model using R2.
print(lasso_model.score(Z_train, y_train))
print(lasso_model.score(Z_test, y_test))

### Compare Coefficients

#### OLS

In [None]:
ols.coef_

#### Ridge

In [None]:
ridge_model.coef_

#### LASSO

In [None]:
lasso_model.coef_

In [None]:
list(zip(poly.get_feature_names_out(X.columns), ols.coef_))

In [None]:
list(zip(poly.get_feature_names_out(X.columns), ridge_model.coef_))

In [None]:
list(zip(poly.get_feature_names_out(X.columns), lasso_model.coef_))
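
Scanning long coefficient lists by eye is tedious, so here is a sketch that summarizes the key difference numerically: the LASSO can set coefficients to exactly zero, while Ridge only shrinks them. It runs on synthetic data, and the `alpha` values are arbitrary choices for illustration. The same loop should work with the fitted `ols`, `ridge_model`, and `lasso_model` above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data: 30 features, only 5 actually drive the target.
X_toy, y_toy = make_regression(
    n_samples=200, n_features=30, n_informative=5, noise=20.0, random_state=42
)
Z_toy = StandardScaler().fit_transform(X_toy)

toy_models = {
    'OLS': LinearRegression(),
    'Ridge': Ridge(alpha=10),
    'LASSO': Lasso(alpha=10),
}

zero_counts = {}
for name, model in toy_models.items():
    model.fit(Z_toy, y_toy)
    # Count coefficients that were driven to exactly zero.
    zero_counts[name] = int(np.sum(model.coef_ == 0))
    print(f'{name:>5}: {zero_counts[name]} of {model.coef_.size} coefficients '
          f'are exactly zero, largest |coef| = {np.abs(model.coef_).max():.2f}')
```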

## If there is time left...


## BOTH!

Can't decide?

![](../imgs/ridge-VS-lasso.jpg)

The Elastic Net combines the Ridge and Lasso penalties.  It adds *both* penalties to the loss function:

$$
\begin{eqnarray}
SSE + Lasso + Ridge &=& \sum_{i=1}^n \left(y_i - \hat{y}_i\right)^2 + \alpha\left[\rho\sum_{j=1}^p |\beta_j| + (1-\rho)\sum_{j=1}^p \beta_j^2\right] \\
&=& \|\mathbf{y} - \mathbf{X}\beta\|^2 + \alpha\left(\rho\|\beta\|_1 + (1 - \rho)\|\beta\|^2\right)
\end{eqnarray}
$$


In the elastic net, the effect of the ridge versus the lasso is balanced by the $\rho$ parameter. It is the fraction of the total penalty assigned to the Lasso (L1) term, with the remaining $1 - \rho$ going to the Ridge (L2) term, and it must be between zero and one.

`ElasticNet` in sklearn has two parameters:
- `alpha`: the regularization strength.
- `l1_ratio`: the amount of L1 vs L2 penalty (i.e., $\rho$). An `l1_ratio` of 0 leaves only the Ridge penalty, whereas an `l1_ratio` of 1 leaves only the Lasso penalty.
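
Here is a rough sketch of how `l1_ratio` changes the fit, again on synthetic data with arbitrary illustrative settings. It counts exact zeros for a few values of `l1_ratio`, then checks that `l1_ratio=1` reproduces the `Lasso` coefficients for the same `alpha`. One caveat: scikit-learn scales its `ElasticNet` objective differently from the formula above (the squared-error term is divided by $2n$ and the L2 term carries a factor of $\frac{1}{2}$), so `alpha` values are not directly comparable between `ElasticNet` and `Ridge`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: 30 features, only 5 actually drive the target.
X_en, y_en = make_regression(
    n_samples=200, n_features=30, n_informative=5, noise=20.0, random_state=42
)
Z_en = StandardScaler().fit_transform(X_en)

# Same alpha, different mixes of L1 and L2 penalty.
for ratio in [0.1, 0.5, 0.9]:
    enet = ElasticNet(alpha=1.0, l1_ratio=ratio).fit(Z_en, y_en)
    n_zero = int(np.sum(enet.coef_ == 0))
    print(f'l1_ratio = {ratio}: {n_zero} of {enet.coef_.size} coefficients are exactly zero')

# With l1_ratio=1 only the L1 penalty remains, which is the Lasso.
enet_l1 = ElasticNet(alpha=1.0, l1_ratio=1.0).fit(Z_en, y_en)
lasso_same_alpha = Lasso(alpha=1.0).fit(Z_en, y_en)
same_as_lasso = np.allclose(enet_l1.coef_, lasso_same_alpha.coef_)
print('l1_ratio = 1 matches Lasso:', same_as_lasso)
```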

In [None]:
from sklearn.linear_model import ElasticNet

In [None]:
# Instantiate and fit (l1_ratio is left at its default of 0.5).
elastic_model = ElasticNet(alpha=.5).fit(Z_train, y_train)

# Evaluate model using R2.
print(elastic_model.score(Z_train, y_train))
print(elastic_model.score(Z_test, y_test))

In [None]:
elastic_model.coef_

In [None]:
list(zip(poly.get_feature_names_out(X.columns), elastic_model.coef_))