# Python Machine Learning: Regularization

In machine learning, the name of the game is generalization. We want to have a model perform well on the training set, but we need to make sure that the patterns the model learns can actually generalize to data the model hasn't seen before.

So, the scenario we want to avoid is that of **overfitting**. This occurs when our model too strongly learns patterns in the training data, and doesn't generalize well. Overfit models tend to exhibit large generalization gaps: large differences in predictive performance between the training and test data.

Overfitting can happen for a variety of reasons, the most well known of which is having a model that's too complicated. Luckily, all is not lost. There are a variety of approaches we can use to combat overfitting. In general, these approaches are called **regularization**.

## Overfitting and Regularization

In the previous lesson, we discussed feature engineering, the process by which we create new features in order to make our model more expressive. One tradeoff to adding features to the model is that the model becomes more complex, which makes it prone to overfitting. 

For example, consider a basic regression with the points shown below:

![overfitting](../images/overfitting.png)

We could fit a simple line to this data, which will exhibit some error. However, we could also fit a more complex model - say, a polynomial - which could fit the training data perfectly. There will be no error in the training predictions, which seems great!

But do we *really* think the polynomial is making good predictions on *all* possible data points? Look at how it behaves in between the training examples. It's very likely on *new* data - that is, when the model needs to generalize - the linear model will perform much better than the polynomial model. This is because the polynomial model overfit to the data.
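To make this concrete, here is a minimal sketch on synthetic data (the data-generating line, noise level, and polynomial degree are assumptions chosen for illustration, not the data in the figure). It fits a line and a degree-7 polynomial to eight noisy points, then scores both on fresh points drawn from the same process.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_trend(x):
    # Assumed ground truth for this toy example: a straight line
    return 2.0 * x + 1.0

# Eight noisy points used for fitting
x_fit = np.linspace(-1, 1, 8)
y_fit = true_trend(x_fit) + rng.normal(scale=0.3, size=x_fit.size)

# Fresh points from the same process, never seen during fitting
x_new = rng.uniform(-1, 1, size=200)
y_new = true_trend(x_new) + rng.normal(scale=0.3, size=x_new.size)

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

results = {}
for degree in (1, 7):
    # Least-squares polynomial fit; degree 7 passes through all 8 points
    coefs = np.polyfit(x_fit, y_fit, deg=degree)
    fit_mse = mse(y_fit, np.polyval(coefs, x_fit))
    new_mse = mse(y_new, np.polyval(coefs, x_new))
    results[degree] = (fit_mse, new_mse)
    print(f"degree {degree}: fit MSE = {fit_mse:.4f}, new-data MSE = {new_mse:.4f}")
```

The degree-7 polynomial drives the fitting error to essentially zero because it has as many coefficients as there are points. Its error on the new points is the number to watch: in setups like this one, it is usually noticeably larger than the line's.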

So, it's common in machine learning to follow a "parsimony principle". Specifically, we aim to choose simpler models that can still be predictive, because simpler models are less likely to overfit, and thus tend to generalize better.

Regularization is often thought of in terms of the **bias-variance tradeoff**. Specifically, prediction error can be broken down into two main components: bias and variance. The linear model exhibits higher bias, since it makes larger errors on the training examples. But the polynomial model has higher variance: its predictions can swing wildly between nearby training samples, and would change substantially if the training data changed slightly.

We don't always have to use linear regression in the spirit of opting for simpler models. Sometimes, it's good to use the complicated model, particularly if it makes sense in a specific context. This is where **regularization** is useful: a technique we can use to make a model less prone to overfitting during training. It's important to note that regularization is more of a concept than it is a specific, standardized technique. There are many approaches used for regularizing. Today, we're going to cover the usage of **penalty terms** to regularize linear models.

---
### Challenge 1: Warm-Up

Before we get started, let's warm up by importing our data and performing a train/test split. We've provided the importing code for you. Go ahead and split the data into train/test sets using an 80/20 split, and a random state of 23.

---

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

In [None]:
# Import data
data = pd.read_csv('../data/auto-mpg.csv')
# Remove the response variable and car name
X = data.drop(columns=['car name', 'mpg'])
# Assign response variable to its own variable
y = data['mpg'].astype(np.float64)

In [None]:
# YOUR CODE HERE


## Ridge Regression

Recall the formulation of a linear model. We have the parameters we are trying to estimate, given in the model:

$$Y = \beta_0 + \beta_1 X_1 + \ldots + \beta_P X_P$$

We do this by minimizing the following objective function:

$$
\begin{align}
\text{MSE} = L(\beta) &= \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2 \\
&= \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \beta_0 - \sum_{j=1}^P \beta_j X_{ij}\right)^2
\end{align}
$$

We're going to regularize this model. We're not going to change the actual linear model - that's the top equation - but we will change how we choose the $\beta$ parameters. Specifically, we're going to do **ridge regression** (also called $\ell_2$ regularization and Tikhonov regularization). Instead of using the least squares objective function, specified in the second equation, we're going to use the following objective function: 

$$ L(\beta) = \sum_{i=1}^N (y_i - \hat y_i)^2  + \alpha \sum_{j=1}^P \beta_j^2 $$ 

What's the difference? There's a second term added on, which is equal to the sum of the squares of the $\beta$ values. What does this mean?

Our goal is for the loss, $L(\beta)$, to be as small as possible. The first term says we can make that small if we make our errors, $y_i - \hat y_i$, small. The second term says that we increase the loss if the $\beta$ values get too large. There's a tradeoff here: if we make the $\beta$ values all zero to accommodate the second term, then the first term will be large. So, in ridge regression, we try to minimize the errors, while trying hard not to make the coefficients too big.

Also, note that ridge regression requires a **hyperparameter**, called $\alpha$ (sometimes $\lambda$). This hyperparameter indicates how much regularization should be done. In other words, how much do we care about the coefficient penalty term vs. how much do we care about the sum of squared errors term? The higher the value of $\alpha$, the more regularization, and the smaller the resulting coefficients will be. On the other hand, if we use an $\alpha$ value of 0, we get the same solution as ordinary least squares (OLS) regression.
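The effect of $\alpha$ can be checked numerically. The sketch below is an illustration on synthetic data, not the auto-mpg data: it solves the ridge objective directly through its closed-form solution, $\hat\beta = (X^\top X + \alpha I)^{-1} X^\top y$, after centering the data so the unpenalized intercept can be left out. The helper `ridge_closed_form` is a hypothetical name introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): 100 samples, 5 features
X_sim = rng.normal(size=(100, 5))
true_beta = np.array([3.0, -2.0, 0.0, 1.0, 0.5])
y_sim = X_sim @ true_beta + rng.normal(scale=1.0, size=100)

# Center features and response so no intercept term is needed
Xc = X_sim - X_sim.mean(axis=0)
yc = y_sim - y_sim.mean()

def ridge_closed_form(X, y, alpha):
    """Minimize ||y - X b||^2 + alpha * ||b||^2 via the normal equations."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

alphas = [0.0, 1.0, 10.0, 100.0, 1000.0]
coefs = [ridge_closed_form(Xc, yc, a) for a in alphas]
norms = [np.linalg.norm(b) for b in coefs]

# Ordinary least squares on the same centered data, for comparison
ols_beta, *_ = np.linalg.lstsq(Xc, yc, rcond=None)

for a, n in zip(alphas, norms):
    print(f"alpha = {a:>6}: ||beta|| = {n:.3f}")
print("alpha = 0 matches OLS:", np.allclose(coefs[0], ols_beta))
```

As $\alpha$ grows, the length of the coefficient vector shrinks toward zero, and at $\alpha = 0$ the solution coincides with the least squares fit.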

Why does ridge regression serve as a good regularizer? The penalty actually does several things, which are beneficial for our model:
1. **Multicollinearity:** Ridge regression was devised largely to combat multicollinearity, which occurs when features are highly correlated with each other. Ordinary least squares struggles in these scenarios, because multicollinearity can cause a huge increase in variance: it makes the parameter estimates unstable. Adding the penalty term stabilizes the parameter estimates, at the cost of a little bias. This results in better generalization performance.
2. **Low Number of Samples:** The most common scenario where you might overfit is when you have many features, but not many samples. Adding the penalty term stabilizes the model in these scenarios. There's not a great intuition for this without diving into the math, so you can just take it at face value.
3. **Shrinkage:** The $\ell_2$ penalty results in shrinkage, or a small reduction in the size of the parameters. This introduces some bias, but helps regularize by reducing the variance that often comes with overfit models.
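The multicollinearity point above can be illustrated with a small simulation. This is a sketch under assumed settings (two nearly identical synthetic features, no intercept, $\alpha = 1$): the features stay fixed, only the noise in the response is redrawn, and we compare how much the estimated coefficients move from one draw to the next.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_trials, alpha = 50, 200, 1.0

# Two nearly identical features: x2 is x1 plus a tiny perturbation
x1 = rng.normal(size=n_samples)
x2 = x1 + 0.01 * rng.normal(size=n_samples)
X_col = np.column_stack([x1, x2])

ols_coefs, ridge_coefs = [], []
for _ in range(n_trials):
    # Same features every time; only the noise in the response changes
    y_col = x1 + x2 + rng.normal(scale=1.0, size=n_samples)
    gram = X_col.T @ X_col
    ols_coefs.append(np.linalg.solve(gram, X_col.T @ y_col))
    ridge_coefs.append(np.linalg.solve(gram + alpha * np.eye(2), X_col.T @ y_col))

# Spread of each coefficient across the repeated fits
ols_std = np.std(ols_coefs, axis=0)
ridge_std = np.std(ridge_coefs, axis=0)
print("OLS coefficient std:  ", ols_std)
print("Ridge coefficient std:", ridge_std)
```

The least squares coefficients vary far more across draws than the ridge coefficients, because the data barely distinguish between the two features and the penalty settles that ambiguity.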

## Ridge Regression in Practice

As with linear regression, `scikit-learn` makes it easy to fit a ridge regression. We simply use the `Ridge` class from `scikit-learn`. This time, however, we're going to specify some arguments when we create the ridge regression object. The most important one is the regularization penalty, $\alpha$, which we need to choose:

In [None]:
from sklearn.linear_model import Ridge
# Create models
ridge = Ridge(
    # Regularization penalty
    alpha=10,
    random_state=1)
# Fit object
ridge.fit(X_train, y_train)

In [None]:
# Run predictions
y_train_pred_ridge = ridge.predict(X_train)
y_test_pred_ridge = ridge.predict(X_test)

In [None]:
# Evaluate model
print(f'Training R^2: {ridge.score(X_train, y_train)}')
print(f'Test R^2: {ridge.score(X_test, y_test)}')
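# Take the square root of the MSE to get RMSE (the `squared` argument is deprecated in recent scikit-learn releases)
print(f'Train RMSE: {np.sqrt(mean_squared_error(y_train, y_train_pred_ridge))}')
print(f'Test RMSE: {np.sqrt(mean_squared_error(y_test, y_test_pred_ridge))}')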

---
### Challenge 2: Benchmarking

Re-run the ordinary least squares on the data using `LinearRegression`. Then, create a new ridge regression where the `alpha` penalty is set equal to zero. How do the performances of these models compare to each other? How do they compare with the original ridge regression? Be sure to compare both the training performances and test performances.

---

In [None]:
from sklearn.linear_model import LinearRegression
# YOUR CODE HERE
# Create models

# Fit models

# Run predictions

# Evaluate models


Based on your experiments, you probably found that ridge regression resulted in worse training performance, but slightly better generalization performance! So regularization can help, particularly in this case, where we know the features are correlated with each other.

## Choosing Hyperparameters: Validation Sets

The issue with our analysis thus far is that we don't know what $\alpha$ value we should use. Since hyperparameters are chosen *before* we fit the model, we can't just choose them based on the training data: the training error alone would always favor the least regularization. So, how should we go about conducting **hyperparameter search**: identifying the best hyperparameter(s) to use?

Let's think back to our original goal. We want a model that generalizes to unseen data. So, ideally, the choice of the hyperparameter should be such that the performance on unseen data is the best. We can't use the test set for this, but what if we had another set of held-out data? 

This is the basis for a **validation set**. If we had an extra held-out dataset, we could try a bunch of hyperparameters on the training set, and see which one results in a model that performs the best on the validation set. We then would choose that hyperparameter, and use it to refit the model on both the training data and validation data. We could then, finally, evaluate on the test set.

![validation](../images/validation.png)

So, you'll often see a dataset not only split up into training/test sets, but training/validation/test sets, particularly when you need to choose a hyperparameter.
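The workflow just described might look like the following sketch. It uses synthetic data and a 60/20/20 split; the candidate $\alpha$ values and the toy variable names (`X_tr`, `X_val`, `X_te`) are assumptions made for illustration, and are kept distinct from the notebook's own `X_train` and `X_test`.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X_toy = rng.normal(size=(200, 8))
y_toy = X_toy @ rng.normal(size=8) + rng.normal(scale=2.0, size=200)

# First split off the test set (20%), then carve a validation set
# out of the remainder (25% of 80% = 20% of the full data)
X_trval, X_te, y_trval, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_trval, y_trval, test_size=0.25, random_state=0)

# Fit on the training set, score each candidate on the validation set
candidate_alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
val_errors = []
for alpha in candidate_alphas:
    model = Ridge(alpha=alpha).fit(X_tr, y_tr)
    val_errors.append(mean_squared_error(y_val, model.predict(X_val)))

best_alpha = candidate_alphas[int(np.argmin(val_errors))]

# Refit on training + validation data, then evaluate once on the test set
final_model = Ridge(alpha=best_alpha).fit(X_trval, y_trval)
test_mse = mean_squared_error(y_te, final_model.predict(X_te))
print(f"best alpha: {best_alpha}, test MSE: {test_mse:.3f}")
```

The test set is touched exactly once, at the very end, so the reported test error is not influenced by the hyperparameter search.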

### Cross-Validation

We just formulated the process of choosing a hyperparameter with a single validation set. However, there are many ways to perform validation. The most common way is **cross-validation**. Cross-validation is motivated by the concern that we may not choose the best hyperparameter if we're only validating on a small fraction of the data. If the validation sample, just by chance, contains specific data samples, we may bias our model in favor of those samples, and limit its generalizability.

So, during cross-validation, we effectively validate on the *entire* dataset, by breaking it up into folds. Here's the process:

1. Perform a train/test split, as you normally would.
2. Choose a number of folds - the most common is $K=5$ - and split up your training data into that many equally sized "folds".
3. For *each* hyperparameter, we're going to fit $K$ models. Let's assume $K=5$. The first model will be fit on Folds 2-5, and validated on Fold 1. The second model will be fit on Folds 1, 3-5, and validated on Fold 2. This process continues for all 5 splits.
4. Each hyperparameter's performance is summarized by the average predictive performance on all 5 held-out folds. We then choose the hyperparameter that had the best average performance.
5. We can then refit a new model to the entire training set, using our chosen hyperparameter. That's our final model - evaluate it on the test set!
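
Steps 2 through 5 can be written out by hand to see what is happening under the hood. The following is a minimal sketch using `KFold` on synthetic data; here `X_cv` and `y_cv` stand in for the training set left over after step 1, and the candidate $\alpha$ values are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(3)
# Stand-in for the training portion of a train/test split
X_cv = rng.normal(size=(150, 6))
y_cv = X_cv @ rng.normal(size=6) + rng.normal(scale=1.5, size=150)

candidate_alphas = [0.1, 1.0, 10.0, 100.0]
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

mean_cv_errors = []
for alpha in candidate_alphas:
    fold_errors = []
    # Each iteration holds out one fold and fits on the other four
    for fit_idx, val_idx in kfold.split(X_cv):
        model = Ridge(alpha=alpha).fit(X_cv[fit_idx], y_cv[fit_idx])
        preds = model.predict(X_cv[val_idx])
        fold_errors.append(mean_squared_error(y_cv[val_idx], preds))
    # Summarize this alpha by its average held-out error
    mean_cv_errors.append(np.mean(fold_errors))

best_alpha = candidate_alphas[int(np.argmin(mean_cv_errors))]

# Refit on the entire training set with the chosen alpha
final_model = Ridge(alpha=best_alpha).fit(X_cv, y_cv)
print(f"mean CV errors: {np.round(mean_cv_errors, 3)}, best alpha: {best_alpha}")
```

Every training sample is used for validation exactly once per hyperparameter, which is what makes the averaged score less dependent on any single lucky or unlucky split.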

![cross-validation](https://scikit-learn.org/stable/_images/grid_search_cross_validation.png)

## Cross-Validation in Practice

You guessed it: `scikit-learn` makes it really easy to fit a model with cross-validation. We'll use the `RidgeCV` class. Check out the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html) for details about it.

`RidgeCV` is going to need to know a few things from us: which $\alpha$ values do we want to try? How many folds should we use? We'll specify these when creating the model object.

In [None]:
from sklearn.linear_model import RidgeCV
# Create ridge model, with CV
ridge_cv = RidgeCV(
    # Which alpha values to test for?
    alphas=np.logspace(-1, 3, 100),
    # Number of folds
    cv=5)
# Fit model
ridge_cv.fit(X_train, y_train)
# Evaluate model
print(ridge_cv.score(X_train, y_train))
print(ridge_cv.score(X_test, y_test))

We can also access the best $\alpha$ value:

In [None]:
ridge_cv.alpha_

As well as the coefficients:

In [None]:
ridge_cv.coef_
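
A bare array of coefficients is hard to read, so it can help to pair each one with its feature name. The sketch below does this on a small synthetic `DataFrame`; the column names (`feat_a` through `feat_d`) and the candidate $\alpha$ values are hypothetical and only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(5)
feature_names = ["feat_a", "feat_b", "feat_c", "feat_d"]
X_named = pd.DataFrame(rng.normal(size=(120, 4)), columns=feature_names)
# Only feat_a and feat_c actually drive the response in this toy setup
y_named = (2.0 * X_named["feat_a"] - 1.0 * X_named["feat_c"]
           + rng.normal(scale=0.5, size=120))

toy_alphas = [0.1, 1.0, 10.0, 100.0]
ridge_cv_toy = RidgeCV(alphas=toy_alphas, cv=5).fit(X_named, y_named)

# Label each coefficient with its column, then order by magnitude
coef_table = pd.Series(ridge_cv_toy.coef_, index=X_named.columns)
coef_table = coef_table.reindex(
    coef_table.abs().sort_values(ascending=False).index)

print("chosen alpha:", ridge_cv_toy.alpha_)
print(coef_table)
```

The same pattern should carry over to the notebook's model by using `X_train.columns` as the index, assuming `X_train` is a `DataFrame`.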

## Bonus Material: Lasso Regression

**Lasso regression** (also called $\ell_1$ regularization) is another form of regularized regression that penalizes the coefficients. Rather than taking a squared penalty of the coefficients, Lasso uses an absolute value penalty: 

$$ L(\beta) = \sum_{i=1}^N (y_i - \hat y_i)^2  + \alpha \sum_{j=1}^P |\beta_j| $$ 

This has a similar effect on making the coefficients smaller, but also has a tendency to force some coefficients to be set *exactly equal to 0*. This leads to what are called **sparse** models, and is another way to reduce overfitting introduced by more complex models.

Setting some coefficients exactly equal to zero has the added benefit of performing **feature selection**: it can exactly identify if some features are not worth including in the model, because their coefficients are set exactly to 0 (meaning that their values would have no impact on prediction).
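
To see the difference between the two penalties, here is a minimal sketch on synthetic data in which only the first three of ten features carry any signal. The data, the true coefficients, and the `alpha` values are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(4)
X_sparse = rng.normal(size=(200, 10))
# Only the first three features influence the response
true_beta = np.zeros(10)
true_beta[:3] = [3.0, -2.0, 1.5]
y_sparse = X_sparse @ true_beta + rng.normal(scale=0.5, size=200)

lasso_toy = Lasso(alpha=0.5).fit(X_sparse, y_sparse)
ridge_toy = Ridge(alpha=0.5).fit(X_sparse, y_sparse)

# Count coefficients that are exactly zero under each penalty
n_zero_lasso = int(np.sum(lasso_toy.coef_ == 0))
n_zero_ridge = int(np.sum(ridge_toy.coef_ == 0))

print("Lasso coefficients:", np.round(lasso_toy.coef_, 3))
print("Ridge coefficients:", np.round(ridge_toy.coef_, 3))
print(f"exact zeros: lasso = {n_zero_lasso}, ridge = {n_zero_ridge}")
```

Ridge shrinks the uninformative coefficients toward zero without ever reaching it, while Lasso removes them outright. Note that the two `alpha` values are not on the same scale: `scikit-learn`'s `Lasso` divides the squared-error term by $2N$, so the same number means a different penalty strength in each class.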

---
### Challenge 3: Performing a Lasso Fit

Below, we've imported the `Lasso` object from `scikit-learn` for you. Just like `Ridge`, it needs to know what the strength of the regularization penalty is before fitting to the data. 

Fit several Lasso models, with different regularization strengths. Try one with a small but non-zero regularization strength, and try one with a very large regularization strength. Look at the coefficients. What do you notice?

---

In [None]:
from sklearn.linear_model import Lasso
# YOUR CODE HERE
