# Regularization
Author: Carson Andorf

Up to this point, we have improved our machine learning models by minimizing training loss. This can work very well, but in many cases making the training loss as small as possible actually causes the model's performance on validation data (i.e., the validation loss) *to get worse* (see the figure below)! This problem is referred to as overtraining or overfitting. Overfitting occurs when the model starts to fit the quirks of the training set rather than patterns that generalize to new data.


![Two loss function](../nb-images/Regularization.svg)
<div style="text-align: right"> (Image from Google machine learning crash course) </div>

For instance, consider this simple dataset that contains information about insect, fish, and bird species and whether or not they can fly:

|Name|Class|Can fly|
|:--:|:---:|:-----:|
|Pileated woodpecker|Birds|Yes|
|Emu|Birds|No|
|Northern cardinal|Birds|Yes|
|Blacktip shark|Cartilaginous fishes|No|
|Bluntnose stingray|Cartilaginous fishes|No|
|Black drum|Bony fishes|No|
|Florida carpenter ant|Insects|No|
|Periodical cicada|Insects|Yes|
|Luna moth|Insects|Yes|

Your task is to develop a model to classify whether or not an animal can fly, based on information available in the dataset.  Here is a relatively simple model based on these data:
  * If the animal is a bird or an insect, predict that it can fly.
  * Otherwise, predict that it cannot fly.

This model is imperfect.  Indeed, it misclassifies 2 of the examples in the training data.  We can try developing a more complex model to reduce our training loss:
  * If the species is a bird and has a one-word name, predict that it cannot fly.
  * If it is a bird with a two-word name, predict that it can fly.
  * If it is an insect with a three-word name, predict that it cannot fly.
  * If it is an insect with a two-word name, predict that it can fly.
  * Otherwise, predict that it cannot fly.

Aha!  That model classifies each training example perfectly!  When presented with new examples, however (e.g., "albatross" or "zebra swallowtail butterfly"), the more complex model will often fail spectacularly while the simpler model, although imperfect, will perform relatively well.  Clearly, our more complex model is disastrously overfitted to the training data.
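The two rule sets above can be written out as a rough Python sketch to make the comparison concrete. The function names (`simple_model`, `complex_model`, `count_errors`) are illustrative only, and the labels for the two new examples (both can fly) are supplied here as an assumption about how they would be annotated.

```python
# Training data from the table above: (name, class, can_fly)
training_data = [
    ("Pileated woodpecker", "Birds", True),
    ("Emu", "Birds", False),
    ("Northern cardinal", "Birds", True),
    ("Blacktip shark", "Cartilaginous fishes", False),
    ("Bluntnose stingray", "Cartilaginous fishes", False),
    ("Black drum", "Bony fishes", False),
    ("Florida carpenter ant", "Insects", False),
    ("Periodical cicada", "Insects", True),
    ("Luna moth", "Insects", True),
]

# New examples that were not part of the training data (labels assumed here).
new_data = [
    ("Albatross", "Birds", True),
    ("Zebra swallowtail butterfly", "Insects", True),
]


def simple_model(name, animal_class):
    """Predict flight for birds and insects, no flight for everything else."""
    return animal_class in ("Birds", "Insects")


def complex_model(name, animal_class):
    """The overfitted rule set, which also looks at the number of words in the name."""
    n_words = len(name.split())
    if animal_class == "Birds" and n_words == 1:
        return False
    if animal_class == "Birds" and n_words == 2:
        return True
    if animal_class == "Insects" and n_words == 3:
        return False
    if animal_class == "Insects" and n_words == 2:
        return True
    return False


def count_errors(model, data):
    """Count the examples for which the model's prediction is wrong."""
    return sum(model(name, cls) != can_fly for name, cls, can_fly in data)


for label, model in [("simple", simple_model), ("complex", complex_model)]:
    print(label, "model:",
          count_errors(model, training_data), "training errors,",
          count_errors(model, new_data), "errors on new examples")
```

On the training data the complex rules make no mistakes, but on the two new examples the ranking of the models reverses, which is the signature of overfitting.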

The bottom line is that we want our models to be general enough to work well on new examples. Methods that help prevent overfitting are collectively referred to as *regularization* techniques. Regularization essentially applies Occam’s razor to the model: a less complex machine learning model with good empirical results is preferred over a more complex model. Another way to think about regularization is that you should not trust your training examples too much.

There are a variety of ways to implement regularization to prevent overfitting.  One option is to stop training early before overfitting happens. In the figure above, you would try to stop at the point on the red curve where validation loss begins to increase.  Another approach, which is often easier to implement in practice, is to explicitly penalize model complexity in the model-fitting procedure.  Instead of just minimizing loss, we minimize `loss + complexity`.
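As a minimal sketch of the early-stopping idea, the function below scans a sequence of per-epoch validation losses and reports the epoch with the lowest loss, giving up once the loss has failed to improve for a few epochs in a row. The function name, the `patience` parameter, and the loss values are all hypothetical and chosen only for illustration.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the best epoch seen before the validation loss
    fails to improve for `patience` consecutive epochs."""
    best_epoch = 0
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss has stopped improving
    return best_epoch


# Hypothetical validation losses recorded after each training epoch.
val_losses = [0.90, 0.62, 0.48, 0.41, 0.39, 0.42, 0.47, 0.55]
stop_at = early_stopping_epoch(val_losses)
print("Best epoch:", stop_at, "with validation loss", val_losses[stop_at])
```

In a real training loop you would also save the model weights at the best epoch so they can be restored after stopping.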

## L<sub>1</sub> and L<sub>2</sub> regularization

For this lesson, we will focus on two widely used regularization methods: L<sub>1</sub> and L<sub>2</sub> regularization.  Both of these methods represent model complexity as a function of the model's feature weights.

L<sub>1</sub> regularization defines model complexity as the sum of the absolute values of the feature weights (multiplied by a constant, *lambda*, which we will discuss later).

![L1 regularization](../nb-images/reg_formula1.png)

In this formula, weights near zero have small effects on the model complexity, whereas larger weights have a larger effect.

For example, if your model has the weights [-0.5, -0.2, 0.5, 0.7, 1.0, 2.5], the L<sub>1</sub> regularization term (before multiplying by *lambda*) is just the sum of the absolute values of the weights:

```python
import numpy as np
weights = [-0.5, -0.2, 0.5, 0.7, 1.0, 2.5]
print(sum(np.absolute(weights)))
```

```
5.4
```


L<sub>2</sub> regularization represents model complexity as the sum of the squares of the feature weights (again multiplied by a constant, *lambda*).

![L2 regularization](../nb-images/reg_formula2.png)

Using the same weights from above, the L<sub>2</sub> regularization term can be computed with the following code.

```python
print(sum(np.square(weights)))
```

```
8.280000000000001
```


To use either of these regularization terms when fitting a model, the usual procedure is to add the regularization term to whatever loss function you want to use.  E.g., *total loss* = *loss* + *regularization term*.
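To illustrate how the pieces fit together, here is a small sketch that adds each penalty to a mean squared error loss. The helper names (`l1_penalty`, `l2_penalty`, `total_loss`), the target and prediction values, and the value of `lam` are assumptions made for this example; only the weights come from the text above.

```python
import numpy as np


def l1_penalty(weights, lam):
    """lambda times the sum of the absolute values of the weights."""
    return lam * np.sum(np.abs(weights))


def l2_penalty(weights, lam):
    """lambda times the sum of the squared weights."""
    return lam * np.sum(np.square(weights))


def total_loss(y_true, y_pred, weights, lam, penalty):
    """Mean squared error plus the chosen regularization term."""
    mse = np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)
    return mse + penalty(weights, lam)


weights = np.array([-0.5, -0.2, 0.5, 0.7, 1.0, 2.5])
y_true = np.array([1.0, 2.0, 3.0])  # hypothetical targets
y_pred = np.array([1.5, 2.0, 2.0])  # hypothetical predictions
lam = 0.1                           # hypothetical regularization strength

print("Total loss with L1 penalty:", total_loss(y_true, y_pred, weights, lam, l1_penalty))
print("Total loss with L2 penalty:", total_loss(y_true, y_pred, weights, lam, l2_penalty))
```

The data-fit part of the loss is identical in both cases; only the penalty on the weights differs.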


## Lambda

The regularization parameter in the above formulas, *lambda*, allows you to adjust the balance between minimizing the loss function and penalizing overly complex models. Increasing *lambda* strengthens the regularization effect and pushes more of the model weights toward zero, while decreasing *lambda* places more emphasis on reducing the loss function. Thus, the value of *lambda* is very important during model fitting. If *lambda* is too large, your model might be overly simple and prone to underfitting. If it is too small, your model will be more complex and you run the risk of overfitting. The optimal value of *lambda* is data-dependent and usually needs to be estimated, for example by comparing performance on validation data.
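One simple way to see this trade-off is to fit the same model with several values of *lambda* and compare the results on held-out data. The sketch below does this for L<sub>2</sub>-regularized least squares using its closed-form solution. The synthetic data, the `fit_ridge` helper, and the grid of *lambda* values are all assumptions made for illustration, not a recommended tuning procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): 40 observations, 10 features, 3 of them informative.
n_samples, n_features = 40, 10
X = rng.normal(size=(n_samples, n_features))
true_w = np.zeros(n_features)
true_w[:3] = [2.0, -1.0, 0.5]
y = X @ true_w + rng.normal(scale=1.0, size=n_samples)

# Hold out the last 15 observations for validation.
X_train, y_train = X[:25], y[:25]
X_val, y_val = X[25:], y[25:]


def fit_ridge(X, y, lam):
    """Closed-form minimizer of ||y - Xw||^2 + lam * ||w||^2 (no intercept)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)


lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
weight_norms = []
val_losses = []
for lam in lambdas:
    w = fit_ridge(X_train, y_train, lam)
    weight_norms.append(np.sum(w ** 2))
    val_losses.append(np.mean((y_val - X_val @ w) ** 2))
    print(f"lambda={lam:>6}: sum of squared weights={weight_norms[-1]:.3f}, "
          f"validation MSE={val_losses[-1]:.3f}")

best_lambda = lambdas[int(np.argmin(val_losses))]
print("lambda with the lowest validation MSE:", best_lambda)
```

The sum of squared weights shrinks as *lambda* grows, while the validation loss tells you which amount of shrinkage actually helps on this particular data set.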


## Practical differences between L<sub>1</sub> and L<sub>2</sub> regularization

L<sub>1</sub> and L<sub>2</sub> regularization can both help prevent overfitting. From a practical standpoint, perhaps the most important difference between the two is that L<sub>1</sub> regularization can help with *feature selection*. Because of its mathematical properties, L<sub>1</sub> regularization can result in models where some of the feature weights are exactly 0, effectively removing those features from the model. In contrast, L<sub>2</sub> regularization shrinks model weights toward 0 but generally does not make them exactly 0. L<sub>1</sub> regularization can also be more robust and resistant to larger outliers in the data. On the other hand, L<sub>2</sub> regularization results in a minimization problem with a unique solution, which is not always the case for L<sub>1</sub> regularization. Which regularization method is best depends on the specifics of the data, the modeling problem, and the goals of the analysis.
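The difference in sparsity can be seen in a deliberately simplified setting: a single weight `w` that would equal `z` without regularization, fit by minimizing `0.5 * (w - z)**2` plus a penalty. Under that assumption both penalties have closed-form solutions. The function names below are illustrative, and the value of `lam` is arbitrary.

```python
import numpy as np


def l1_solution(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*|w|: soft thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)


def l2_solution(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*w**2: proportional shrinkage."""
    return z / (1.0 + 2.0 * lam)


# Treat each entry as the unregularized value of an independent weight.
unregularized = np.array([-0.5, -0.2, 0.5, 0.7, 1.0, 2.5])
lam = 0.5  # hypothetical regularization strength

w_l1 = l1_solution(unregularized, lam)
w_l2 = l2_solution(unregularized, lam)

print("L1 weights:", w_l1)
print("L2 weights:", w_l2)
print("Zero weights with L1:", int(np.sum(w_l1 == 0)))
print("Zero weights with L2:", int(np.sum(w_l2 == 0)))
```

The L<sub>1</sub> penalty subtracts a fixed amount from each weight and clips at zero, so small weights vanish entirely. The L<sub>2</sub> penalty rescales every weight by the same factor, so nonzero weights stay nonzero.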
 

## Practice Example

Regularization is supported in scikit-learn. For example, the `LogisticRegression` model supports both L<sub>1</sub>- and L<sub>2</sub>-regularized regression. The two major parameters are `penalty` and `C`. The parameter `penalty` specifies which type of regularization to use (e.g., `'l2'`). The parameter `C` is the inverse of the regularization parameter *lambda*, so smaller values of `C` represent stronger regularization.
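As a rough illustration of these two parameters, the sketch below fits L<sub>1</sub>-regularized logistic regression models with several values of `C`. It makes a few assumptions that are not part of this lesson: it loads scikit-learn's bundled copy of the Iris data with `load_iris`, keeps only two species so the problem is binary, standardizes the features, and uses the `'liblinear'` solver, which supports both `'l1'` and `'l2'` penalties. Depending on your scikit-learn version, the way the penalty is specified may differ or produce a deprecation warning.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Assumption: keep only versicolor (1) and virginica (2) for a binary problem.
X, y = load_iris(return_X_y=True)
mask = y != 0
X = StandardScaler().fit_transform(X[mask])
y = y[mask]

results = {}
for C in [0.01, 1.0, 100.0]:
    # Smaller C means stronger regularization (C is the inverse of lambda).
    model = LogisticRegression(penalty="l1", C=C, solver="liblinear")
    model.fit(X, y)
    results[C] = model.coef_.ravel()
    print(f"C={C:>6}: coefficients = {np.round(results[C], 3)}")
```

Changing `penalty="l1"` to `penalty="l2"` in the same loop lets you compare how the two penalties treat the coefficients as `C` varies.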

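The following programming exercise shows the effect of changing the type and strength of regularization on linear regression models fit to the Iris dataset. It uses scikit-learn's `LinearRegression`, `Ridge` (L<sub>2</sub>), and `Lasso` (L<sub>1</sub>) models. Note that `Ridge` and `Lasso` take a parameter `alpha` that plays the role of *lambda* directly, so larger values of `alpha` represent stronger regularization.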

### Step 1: Import the necessary modules

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error as mse
```

### Step 2:  Load the data set and prepare the features and response

```python
idata = pd.read_csv('../nb-datasets/iris_dataset.csv')
idata['species'] = idata['species'].astype('category')

# Convert the categorical variable "species" to 1-hot encoding (AKA "dummy variables"),
# but eliminate the first dummy variable because it is collinear with the other two
# and does not provide any additional information.
idata_enc = pd.get_dummies(idata, drop_first=True)

# Separate the x and y values.
x = idata_enc.drop(columns='petal_length')
y = idata_enc['petal_length']

# Split the train and test sets.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)

# See what we have.
print(y_train)
x_train.head()
```

```
72     4.9
80     3.8
101    5.1
7      1.5
6      1.4
      ... 
29     1.6
147    5.2
40     1.3
32     1.5
129    5.8
Name: petal_length, Length: 112, dtype: float64
```

| |sepal_length|sepal_width|petal_width|species_versicolor|species_virginica|
|:--:|:--:|:--:|:--:|:--:|:--:|
|72|6.3|2.5|1.5|1|0|
|80|5.5|2.4|1.1|1|0|
|101|5.8|2.7|1.9|0|1|
|7|5.0|3.4|0.2|0|0|
|6|4.6|3.4|0.3|0|0|
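To see what `drop_first=True` does in the encoding step, it can help to run `pd.get_dummies` on a tiny frame. The three-row `demo` DataFrame below is made up for illustration and is not part of the exercise data.

```python
import pandas as pd

# Tiny hypothetical frame with one row per species.
demo = pd.DataFrame({
    "petal_width": [0.2, 1.5, 1.9],
    "species": pd.Categorical(["setosa", "versicolor", "virginica"]),
})

full = pd.get_dummies(demo)                      # one column per species
reduced = pd.get_dummies(demo, drop_first=True)  # first species column dropped

print(list(full.columns))
print(list(reduced.columns))
```

With the first dummy column dropped, a row with zeros in both remaining species columns is understood to be the dropped species, so no information is lost.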


### Step 4: Give standard linear regression a try

We'll give regular old non-regularized linear regression a try first.  This will give us a baseline to compare to the regularization methods.

```python
model = LinearRegression()
model.fit(x_train, y_train)

print('Train R2:', model.score(x_train, y_train))
print('Test R2:', model.score(x_test, y_test))

train_loss = mse(y_train, model.predict(x_train))
test_loss = mse(y_test, model.predict(x_test))
print('Train loss:', train_loss)
print('Test loss:', test_loss)

print('Coefficients:', model.coef_)
```

```
Train R2: 0.9793028093800101
Test R2: 0.971832233015809
Train loss: 0.06542523192185211
Test loss: 0.07901487787870036
Coefficients: [ 0.65503965 -0.2961708   0.47704172  1.48747658  2.03756475]
```


### Step 5: Let's give L<sub>2</sub> regularization a try

The results above suggest that overfitting really isn't too much of a problem here. We have a relatively small number of features (i.e., parameters we must estimate) compared to the number of observations, so this result should not be very surprising. Nevertheless, let's try using L<sub>2</sub> regularization and see what happens.


```python
model = Ridge(alpha = 2.0)
model.fit(x_train, y_train)

print('Train R2:', model.score(x_train, y_train))
print('Test R2:', model.score(x_test, y_test))

train_loss = mse(y_train, model.predict(x_train))
test_loss = mse(y_test, model.predict(x_test))
print('Train loss:', train_loss)
print('Test loss:', test_loss)

print('Coefficients:', model.coef_)
```

```
Train R2: 0.9755523410003801
Test R2: 0.9685744986895839
Train loss: 0.07959825668785966
Test loss: 0.08148606374415827
Coefficients: [ 0.74299035 -0.46691584  1.06369227  0.55090305  0.73545887]
```


### Step 6: Finally, let's give L<sub>1</sub> regularization a try

```python
model = Lasso(alpha = 0.1)
model.fit(x_train, y_train)

print('Train R2:', model.score(x_train, y_train))
print('Test R2:', model.score(x_test, y_test))

train_loss = mse(y_train, model.predict(x_train))
test_loss = mse(y_test, model.predict(x_test))
print('Train loss:', train_loss)
print('Test loss:', test_loss)

print('Coefficients:', model.coef_)
```

```
Train R2: 0.9510922774078785
Test R2: 0.9511275681292637
Train loss: 0.1592369010450801
Test loss: 0.12672580969872624
Coefficients: [ 0.59186282 -0.13893823  1.51562559  0.          0.        ]
```
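The L<sub>1</sub> result above sets the last two coefficients to 0, which is the feature-selection effect described earlier. As a small sketch, the snippet below pairs the printed coefficients with the feature column names from Step 2 and lists which features the model kept. The values are copied from the output above rather than recomputed, and the variable names are illustrative.

```python
import numpy as np

# Column order of x_train, as shown in Step 2.
feature_names = [
    "sepal_length", "sepal_width", "petal_width",
    "species_versicolor", "species_virginica",
]

# Coefficients copied from the L1-regularized output above.
lasso_coef = np.array([0.59186282, -0.13893823, 1.51562559, 0.0, 0.0])

selected = [name for name, c in zip(feature_names, lasso_coef) if c != 0]
dropped = [name for name, c in zip(feature_names, lasso_coef) if c == 0]

print("Features kept by the model:", selected)
print("Features effectively removed:", dropped)
```

Because the train/test split is random, rerunning the exercise may give somewhat different coefficients, and a different value of `alpha` can change which features are removed.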


## After completing the exercise, answer the questions below

1. How did changing the value of `alpha` affect the coefficient weights and training/test performance?
2. What value of `alpha` gave you the best test performance?
3. For this problem, do you recommend a high or low regularization strength?  Does this make sense based on the data?
4. Compare the `Lasso` (L<sub>1</sub>) and `Ridge` (L<sub>2</sub>) models. Which regularization method produced more zero weights?
5. How could you use a sparse matrix (i.e. a matrix with a lot of zero weights) for feature selection?