# Regularization
Author: Carson Andorf

Up to this point, we have improved our machine learning models by minimizing training loss. This often works well, but in many cases additional training iterations continue to decrease training loss while the loss on validation data (validation loss) starts to increase (see the figure below). This is referred to as overtraining or overfitting. Overfitting occurs when the model starts to fit the quirks of the training set rather than patterns that generalize.


![Two loss function](../nb-images/Regularization.svg)
<div style="text-align: right"> (Image from the Google Machine Learning Crash Course) </div>

Imagine you were trying to build a model that learns English, but your training set consisted only of speech from teenagers. In the early iterations, the model would quickly reduce training loss, but it could eventually reach a point of overfitting to the slang, abbreviations, and potentially limited vocabulary unique to how teenagers speak. We want our model to be general enough to still work well on new examples. A method to reduce overfitting is called regularization. Regularization essentially attempts to apply Occam’s razor to the model: a less complex machine learning model with good empirical results is preferred over a more complex model. Another way to think about regularization is that you should not trust your training examples too much.

There are two main ways to use regularization to prevent overfitting. The first is to stop training early, before overfitting happens. In the figure above, you would try to stop at the point on the red curve where validation loss begins to increase. This method is difficult to implement in practice. The second is to penalize model complexity. Instead of just minimizing loss (empirical risk minimization), we minimize Loss + Complexity. This is called structural risk minimization.
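One common heuristic for the early-stopping idea is to keep training until the validation loss has failed to improve for a few iterations in a row, then fall back to the best iteration seen. The sketch below illustrates this on a made-up list of validation losses. The function name `early_stopping_epoch`, the `patience` parameter, and the loss values are all assumptions chosen for illustration, not part of any particular library.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the best epoch, ending the scan once the
    validation loss has failed to improve for `patience` epochs in a row."""
    best_epoch = 0
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember it and reset the counter
            best_epoch, best_loss = epoch, loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss keeps rising, so stop here
    return best_epoch

# Hypothetical validation losses: they fall, then rise as overfitting sets in.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
stop_at = early_stopping_epoch(val_losses)
print("Best epoch:", stop_at, "validation loss:", val_losses[stop_at])
```

The choice of `patience` is itself a judgment call: too small and noise in the validation loss can end training prematurely, too large and the model keeps training well past the best point.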

## L1 and L2 Regularization

There are two common ways to represent model complexity:

  * As a function of all the feature weights
  * As a function of the total number of nonzero feature weights

We will focus on two regularization methods that quantify complexity as a function of weights: L1 and L2 regularization.  

L1 defines the complexity as the sum of the absolute values of the feature weights.

![L1 regularization](../nb-images/reg_formula1.png)

In this formula, weights near zero have a small effect on the model complexity, whereas larger weights have a larger effect.

For example, if your model has the weights [-0.5,-0.2,0.5,0.7,1.0,2.5], the L1 regularization term is just the sum of the absolute values of the weights.

In [4]:
import numpy as np
weights = [-0.5,-0.2,0.5,0.7,1.0,2.5]
print("|-0.5| + |-0.2| + |0.5| + |0.7| + |1.0| + |2.5| =", sum(np.absolute(weights)))

|-0.5| + |-0.2| + |0.5| + |0.7| + |1.0| + |2.5| = 5.4


L2 defines the regularization term as the sum of the squares of the feature weights.

![L2 regularization](../nb-images/reg_formula2.png)

Using the same weights as above, the L2 regularization term can be computed with the following code.

In [5]:
import numpy as np
weights = [-0.5,-0.2,0.5,0.7,1.0,2.5]
# Square each weight, then add up the squares
weights_squared = np.square(weights)
print("(-0.5)^2 + (-0.2)^2 + 0.5^2 + 0.7^2 + 1.0^2 + 2.5^2 =", sum(weights_squared))

(-0.5)^2 + (-0.2)^2 + 0.5^2 + 0.7^2 + 1.0^2 + 2.5^2 = 8.280000000000001
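To see how differently the two terms treat large weights, it can help to look at how much each weight contributes to the total. The sketch below reuses the example weights from above; the variable names are only illustrative. With these numbers, the largest weight (2.5) accounts for a little under half of the L1 term but about three quarters of the L2 term.

```python
import numpy as np

weights = np.array([-0.5, -0.2, 0.5, 0.7, 1.0, 2.5])

l1_terms = np.abs(weights)      # each weight's contribution to the L1 term
l2_terms = np.square(weights)   # each weight's contribution to the L2 term

# Fraction of the total penalty that comes from each weight
l1_share = l1_terms / l1_terms.sum()
l2_share = l2_terms / l2_terms.sum()

for w, s1, s2 in zip(weights, l1_share, l2_share):
    print(f"w = {w:5.2f}   L1 share = {s1:6.1%}   L2 share = {s2:6.1%}")
```

Because squaring magnifies large values, L2 concentrates its penalty on the largest weights, while L1 penalizes every unit of weight equally.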


## Lambda
Lambda (also called the regularization rate) in the formulas above lets you choose how much regularization to apply to your model. Increasing lambda strengthens the regularization effect and pushes more of the weights toward zero. Choose lambda with care: if it is too large, the model becomes simple but may be oversimplified and underfit; if it is too small, the model is more complex and you run the risk of overfitting. The optimal value of lambda is data-dependent, so you may need to do some tuning to find it.
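The effect of lambda on the weights can be illustrated with a small experiment. The sketch below uses linear regression with an L2 penalty (ridge regression) rather than the logistic regression used later in this notebook, because it has a closed-form solution that fits in a few lines. The synthetic data, the helper `ridge_weights`, and the list of lambda values are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): 50 examples, 4 features
X = rng.normal(size=(50, 4))
true_w = np.array([2.0, -1.0, 0.5, 0.0])
y = X @ true_w + rng.normal(scale=0.1, size=50)

def ridge_weights(X, y, lam):
    """Minimize ||Xw - y||^2 + lam * ||w||^2 using the closed-form solution."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

lambdas = [0.0, 1.0, 10.0, 100.0, 1000.0]
norms = []
for lam in lambdas:
    w = ridge_weights(X, y, lam)
    norms.append(np.sum(np.square(w)))  # the L2 term without the lambda factor
    print(f"lambda = {lam:7.1f}   weights = {np.round(w, 3)}   sum of squares = {norms[-1]:.4f}")
```

As lambda grows, the sum of squared weights shrinks toward zero. At the largest values the weights are so small that the model can no longer follow the data, which is the underfitting case described above.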

## L1 vs L2
Both L1 and L2 can help with overfitting. L1 can also help with feature selection, as it tends to produce sparse weight vectors, allowing you to keep only the features with non-zero weights. L1 is also often described as more robust to large outliers in the data. L2 has only one solution (L1 may have multiple solutions) and is more computationally efficient. L2 is often the better choice in practice when the model is a function of all of the input features.
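Why L1 tends to produce exact zeros while L2 does not can be seen in a simplified setting where each weight is pulled toward a target value `a` independently of the others. This is a sketch under that assumption, not the full training of a model; the helper names `l1_shrink` and `l2_shrink` and the value of `lam` are made up for illustration. Under the L1 penalty the minimizer is the soft-thresholding rule, which sets any weight smaller than `lam` in magnitude to exactly zero. Under the L2 penalty every weight is scaled down by the same factor, so none reach zero.

```python
import numpy as np

def l1_shrink(a, lam):
    """Minimizer of 0.5 * (w - a)**2 + lam * |w| (soft thresholding)."""
    # The trailing + 0.0 turns any -0.0 into 0.0 for cleaner printing
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0) + 0.0

def l2_shrink(a, lam):
    """Minimizer of 0.5 * (w - a)**2 + lam * w**2."""
    return a / (1.0 + 2.0 * lam)

# Treat the example weights from earlier as the unregularized solution
unregularized = np.array([-0.5, -0.2, 0.5, 0.7, 1.0, 2.5])
lam = 0.5

w_l1 = l1_shrink(unregularized, lam)
w_l2 = l2_shrink(unregularized, lam)

print("L1:", w_l1, "zeros:", np.sum(w_l1 == 0))
print("L2:", w_l2, "zeros:", np.sum(w_l2 == 0))
```

With these numbers, three of the six weights land exactly at zero under L1, while L2 simply halves every weight.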
 

## Practice Example
Regularization is supported in scikit-learn. For example, the LogisticRegression model supports both L1 and L2 regularization. The two major parameters are penalty and C. The penalty parameter specifies which type of regularization to use (e.g. 'l2'). The parameter C is the inverse of the regularization strength (e.g. 0.1), so smaller values of C mean stronger regularization.

The following programming exercise will show the effect of changing the type and strength of regularization on the logistic regression models using the Iris dataset.

### Step 1: Import the necessary modules

In [6]:
# import modules
import pandas
import numpy
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

### Step 2: Load the data set and create your features and target classes

In [7]:
# import iris data from scikit and data preparation
iris = datasets.load_iris()

# Create X from the features
iris_X = iris.data

# Create y from target class
iris_y = iris.target

# Add the next two lines to make this a binary classification problem (remove 1 of the 3 classes)
#iris_X = iris_X[iris_y != 2]
#iris_y = iris_y[iris_y != 2]

In [8]:
# check data shape
print(iris_X.shape)

# View first 5 rows
print(iris_X[0:5])

(150, 4)
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]


In [9]:
# View classes
print(iris_y)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


### Step 3: Prepare your training and test sets

In [10]:
# splitting into train and test data.
X_train, X_test, y_train, y_test = train_test_split(iris_X, iris_y, test_size=0.25, random_state=0)

#### Standardize Features

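The regularization penalty is computed directly from the weights (for L1, the sum of their absolute values), and the size of each weight depends on the scale of its feature. We therefore standardize the features so that all of the weights are on a comparable scale.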


In [11]:
# Create a scaler object
sc = StandardScaler()

# Fit the scaler to the training data and transform
X_train_std = sc.fit_transform(X_train)

# Apply the scaler to the test data
X_test_std = sc.transform(X_test)
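To make the fit/transform split concrete, the same standardization can be written out by hand with NumPy. This is a sketch on a tiny made-up data set (the `_demo` arrays are hypothetical): the mean and standard deviation are learned from the training rows only and then applied unchanged to the test rows. It uses the population standard deviation (`ddof=0`), which is the convention StandardScaler follows.

```python
import numpy as np

# Hypothetical tiny split with two features on very different scales
X_train_demo = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0], [4.0, 700.0]])
X_test_demo = np.array([[2.5, 400.0], [5.0, 900.0]])

# "Fit": learn the mean and standard deviation from the training data only
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)  # population standard deviation (ddof=0)

# "Transform": apply the same training statistics to both sets
X_train_demo_std = (X_train_demo - mean) / std
X_test_demo_std = (X_test_demo - mean) / std

print("Train mean after scaling:", X_train_demo_std.mean(axis=0))
print("Train std after scaling: ", X_train_demo_std.std(axis=0))
print("Scaled test set:")
print(X_test_demo_std)
```

Reusing the training statistics on the test set keeps information about the test data from leaking into the model, which is why the scaler is fit on the training data and only applied to the test data.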

### Step 4: Apply regularization

In this example, we will loop through several regularization strengths, which are stored in the list C. You can switch between L1 and L2 regularization by changing the variable reg_model. Pay careful attention to how C and the choice of regularization method affect your model weights.

In [13]:
C = [5, 1, .1, .001]
reg_model = "l1"
#reg_model = "l2"

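for c in C:
    # The liblinear solver supports both the 'l1' and 'l2' penalties
    model = LogisticRegression(penalty=reg_model, C=c, solver='liblinear', multi_class='auto')
    # As written, the model is fit on the unscaled features; substitute
    # X_train_std and X_test_std to use the standardized features from Step 3
    model.fit(X_train, y_train)
    predicted_train = model.predict(X_train)
    predicted = model.predict(X_test)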
    print('C:', c)
    print('////////////////////////////////////////////////////')
    print('Coefficient of each feature:')
    print(model.coef_)
    print('')
    # Performance on the held-out test set
    print('Confusion Matrix:')
    print(metrics.confusion_matrix(y_test, predicted))
    print('Test accuracy:', metrics.accuracy_score(y_test, predicted))
    print('')
    # Performance on the training set
    print('Confusion Matrix:')
    print(metrics.confusion_matrix(y_train, predicted_train))
    print('Training accuracy:', metrics.accuracy_score(y_train, predicted_train))
    print('////////////////////////////////////////////////////')
    print('')

C: 5
////////////////////////////////////////////////////
Coefficient of each feature:
[[ 0.          3.4247781  -3.84598588  0.        ]
 [-0.02944101 -2.37470414  1.06330456 -2.42186994]
 [-4.2998018  -2.97935922  6.47634691  5.72496752]]

Confusion Matrix:
[[13  0  0]
 [ 0 14  2]
 [ 0  0  9]]
Test accuracy: 0.9473684210526315

Confusion Matrix:
[[37  0  0]
 [ 0 33  1]
 [ 0  0 41]]
Training accuracy: 0.9910714285714286
////////////////////////////////////////////////////

C: 1
////////////////////////////////////////////////////
Coefficient of each feature:
[[ 0.          2.34857721 -2.64903276  0.        ]
 [ 0.41214186 -1.4761859   0.44239796 -1.18472386]
 [-3.07915763 -1.46576355  3.76849411  3.21530389]]

Confusion Matrix:
[[13  0  0]
 [ 0 12  4]
 [ 0  0  9]]
Test accuracy: 0.8947368421052632

Confusion Matrix:
[[37  0  0]
 [ 0 31  3]
 [ 0  0 41]]
Training accuracy: 0.9732142857142857
////////////////////////////////////////////////////

C: 0.1
///////////////////////////////////



## After completing the exercise, answer the questions below

1. How did changing the value of C affect the coefficient weights and training/test performance?
2. What value of C gave you the best test accuracy?
3. For this problem, do you recommend a high or low regularization strength?  Does this make sense based on the data?
4. Switch the reg_model from 'l1' to 'l2'. Which regularization (l1 or l2) had models with more zero weights?
5. How could you use a sparse weight matrix (i.e. a matrix with many zero weights) for feature selection?

Bonus: Can you make this a binary classification problem?  How does that change your results? 