<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Introduction to Regularization

_Authors:_ Tim Book, Matt Brems

## Learning Objectives

1. Describe what a loss function is.
2. Define regularization.
3. Describe and differentiate LASSO and Ridge regularization.
4. Understand how regularization affects the bias-variance tradeoff.
5. Implement LASSO regression and Ridge regression.

[Optional read](https://towardsdatascience.com/regularization-in-machine-learning-76441ddcf99a) from _Medium_. I found this immensely helpful in preparing for the intuition deck we just covered on this topic.

## Review

<details><summary>What is the bias-variance tradeoff?</summary>

- Mean squared error can be decomposed into a squared bias component plus a variance component (plus an irreducible error, but we don't have control over this part, so we often ignore it).
    <center>$E[MSE] = \text{bias}^2 + \text{variance} + \sigma^2$</center>
- The bias-variance tradeoff refers to the fact that taking steps to minimize bias usually comes at the expense of an increase in variance (and possibly making a complex, overfit model). Similarly, taking steps to minimize variance usually comes at the expense of an increase in bias (and possibly making a too simplistic underfit model).

</details>

---

<details><summary>What evidence/information would lead me to believe that my model suffers from high variance?</summary>
    
- After splitting my data into training and testing sets, if I see that my model <b>performs way better on my training set than my testing set, this means that my model is not generalizing very well to "new" data</b>.
- An example might be where our training MSE is substantially lower than our testing MSE, or where our training R-squared is substantially higher than our testing R-squared. In other words, <i>training vs testing metrics are quite different</i>.
</details>

## Why is high variance bad?

High variance is bad because it means that our model <b>doesn't generalize well to new data</b>. This means that our model looks as though it performs well on our training data but won't perform as well on new, unseen data.

---
<details><summary>How might we try to fix a model that suffers from high variance?</summary>

There are multiple ways:
- Gather more data. (Although this is usually expensive and time-consuming.)
- Drop <i>(irrelevant)</i> features.
- Make our existing features less complex. (i.e. get rid of interaction terms or higher order terms.)
- Choose a simpler model.
- <b>Regularization!</b>
</details>

## Pop Math Quiz

### Problem 1
**What value of $b$ minimizes the following term...**

$$ f = (y - b)^2 $$

<details><summary></summary>
When $b = y$, this expression has value 0. Since it's squared, it can't go below that.
</details>

### Problem 2
**What is the value of $b$ that minimizes...**

$$ f = (y - b)^2 + \alpha b^2 $$

where $\alpha > 0$?

<details><summary></summary>
This is more complicated, isn't it? You can use differential calculus and come up with an answer:

$$ \frac{\partial{f}}{\partial{b}} = 2(y - b)(0 - 1) + 2\alpha b $$

To find the minimum of $f$, set the partial derivative with respect to $b$ to zero and solve for $\hat b$:
    
$$ 0 = 2(\hat b - y) + 2\alpha \hat b $$
    
$$ \hat{b} = \frac{y}{1 + \alpha} $$

But what is the effect of $\alpha$ on our solution?
- As $\alpha$ increases, $\hat{b}$ shrinks toward zero (and moves away from the unpenalized answer, $y$).
</details>
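
The closed-form answer to Problem 2 can be sanity-checked numerically. The sketch below is only an illustration: the value `y = 3.0`, the grid of candidate `b` values, and the function names are all assumptions made for this example.

```python
# Brute-force check of Problem 2: minimize f(b) = (y - b)^2 + alpha * b^2.
def penalized_loss(b, y, alpha):
    return (y - b) ** 2 + alpha * b ** 2

def closed_form(y, alpha):
    # The solution derived above: b_hat = y / (1 + alpha).
    return y / (1 + alpha)

y = 3.0  # illustrative value, not from the lesson
grid = [i / 1000 for i in range(-5000, 5001)]  # candidate b values in [-5, 5]

for alpha in [0.0, 0.5, 1.0, 10.0]:
    b_grid = min(grid, key=lambda b: penalized_loss(b, y, alpha))
    print(f"alpha={alpha:>4}: grid minimizer={b_grid:.3f}, "
          f"y/(1+alpha)={closed_form(y, alpha):.3f}")
```

The grid minimizer should track $y / (1 + \alpha)$ up to the grid resolution, moving toward zero as $\alpha$ grows.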

## Overview of regularization

---

**Regularizing** a regression model means:
- **automatically** avoiding ***overfitting*** while we _fit our model_, by adding a `penalty` to our loss function.

### _Before_ regularization (OLS - ordinary least squares):
where $\|\mathbf{v}\|$ denotes the (Euclidean, or $\ell_2$) norm of a vector $\mathbf{v}$, so $\|\mathbf{v}\|^2$ is the sum of its squared components.

$$
\begin{align}
\text{minimize: MSE} &= \textstyle\frac{1}{n}\sum (y_i - \hat{y}_i)^2 \\ \\
                     &= \textstyle\frac{1}{n}\|\mathbf{y} - \hat{\mathbf{y}}\|^2 \\ \\
                     &= \textstyle\frac{1}{n}\|\mathbf{y} - \mathbf{X\beta}\|^2
\end{align}
$$

### _After_ regularization (Ridge):

$$
\begin{align}
\text{minimize: MSE + penalty} &= \textstyle\frac{1}{n}\sum (y_i - \hat{y}_i)^2 + \alpha \sum \beta_j^2 \\ \\
                               &= \textstyle\frac{1}{n}\|\mathbf{y} - \hat{\mathbf{y}}\|^2 + \alpha \|\beta\|^2 \\ \\
                               &= \textstyle\frac{1}{n}\|\mathbf{y} - \mathbf{X}\beta\|^2 + \alpha \|\beta\|^2
\end{align}
$$

Adding this penalty term onto the end and then minimizing has a similar effect to the one described above in Problem 2. That is, **ridge regression shrinks our regression coefficients closer to zero to make our model simpler**. We are accepting more bias in exchange for decreased variance. We'll be tasked with picking the "best" $\alpha$ that optimizes this bias-variance tradeoff.
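
The shrinkage effect can be seen with a few lines of NumPy. This is a minimal sketch on synthetic data (the data, `true_beta`, and the helper name `ridge_coefs` are assumptions for illustration, not part of the lesson). It minimizes the loss exactly as written above, $\frac{1}{n}\|\mathbf{y} - \mathbf{X}\beta\|^2 + \alpha\|\beta\|^2$, without an intercept; setting the gradient to zero gives $(\mathbf{X}^T\mathbf{X} + n\alpha\mathbf{I})\beta = \mathbf{X}^T\mathbf{y}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
true_beta = np.array([3.0, -2.0, 0.0, 1.0, 0.5])  # made-up coefficients
y = X @ true_beta + rng.normal(scale=1.0, size=n)

def ridge_coefs(X, y, alpha):
    """Minimizer of (1/n)||y - X b||^2 + alpha ||b||^2 (no intercept)."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n * alpha * np.eye(p), X.T @ y)

alphas = [0.0, 0.1, 1.0, 10.0, 100.0]
norms = [np.linalg.norm(ridge_coefs(X, y, a)) for a in alphas]
for a, nrm in zip(alphas, norms):
    print(f"alpha={a:>6}: ||beta|| = {nrm:.4f}")
```

The coefficient norm should fall as `alpha` rises. Note that Scikit-Learn's `Ridge` does not divide the squared error by $n$, so its `alpha` is on a different scale from the one in this sketch.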

### Other Variations

| Name | Loss Function |
| --- | --- |
| OLS | MSE |
| Ridge Regression (L2 Norm) | MSE + $\alpha\|\beta\|^2_2$ |
| LASSO Regression (L1 Norm) | MSE + $\alpha\|\beta\|_1$ |
| $L_q$-Regression | MSE + $\alpha\|\beta\|^q_q$ |

### Sidenote on notation:
We'll be using $\alpha$ to denote our **regularization parameter**, since that's what Scikit-Learn uses. However, this is contrary to data science literature. It is normally denoted with a $\lambda$. Why? Only Google knows.
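
The penalty terms in the table above are straightforward to compute by hand. The sketch below uses a made-up coefficient vector and a hypothetical helper, `penalty`, purely to show how the L1 and L2 penalties differ for the same $\beta$.

```python
import numpy as np

def penalty(beta, alpha, q):
    """alpha * ||beta||_q^q, i.e. alpha * sum(|beta_j|^q)."""
    beta = np.asarray(beta, dtype=float)
    return alpha * np.sum(np.abs(beta) ** q)

beta = np.array([0.5, -2.0, 0.0, 3.0])  # illustrative coefficients
alpha = 0.1

ridge_penalty = penalty(beta, alpha, q=2)  # alpha * sum(beta_j^2)
lasso_penalty = penalty(beta, alpha, q=1)  # alpha * sum(|beta_j|)
print("ridge penalty:", ridge_penalty)
print("lasso penalty:", lasso_penalty)
```

Large coefficients dominate the ridge penalty because they are squared, while the LASSO penalty grows only linearly with each coefficient's size.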

### [Neat parameter space visualization!](https://timothykbook.shinyapps.io/RegularizationPlot/)
Increasing the regularization parameter shrinks the coefficients.

## What is the effect of regularization?

---

**To demonstrate the effects of regularization, we will be using a dataset on wine quality.**

### Load the wine .csv

This version has red and white wines concatenated together and tagged with a binary 1/0 indicator (1 is red wine). There are many other variables purportedly related to the rated quality of the wine.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('darkgrid') # optional, just plots charts on a darker (gray) canvas 

In [2]:
# Load in the wine .csv.
wine = pd.read_csv('datasets/winequality_merged.csv')

# Convert all columns to lowercase and replace spaces in column names with _
wine.columns = wine.columns.str.lower().str.replace(' ', '_')

In [3]:
# Check the first five rows.
wine.head()

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,ph,sulphates,alcohol,quality,red_wine
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,1
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,1
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,1
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,1
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,1


In [4]:
# How big is this dataset?
wine.shape

(6497, 13)

In [5]:
# Check for missing values.--> there are none
wine.isnull().sum()

fixed_acidity           0
volatile_acidity        0
citric_acid             0
residual_sugar          0
chlorides               0
free_sulfur_dioxide     0
total_sulfur_dioxide    0
density                 0
ph                      0
sulphates               0
alcohol                 0
quality                 0
red_wine                0
dtype: int64

In [6]:
from sklearn.preprocessing import PolynomialFeatures # to generate polynomial and interaction terms

# Create X and y.
X = wine.drop('quality', axis=1)  # take all columns except 'quality', which is our response
y = wine['quality']

# Instantiate our PolynomialFeatures object to create all squared and two-way interaction terms (degree=2 by default),
# meaning, for [a, b] --> [a, b, a^2, ab, b^2].
# include_bias=False, because sklearn fits the intercept itself, so we don't need an explicit column of ones.
poly = PolynomialFeatures(include_bias=False)
# Fit and transform our X data.
X_overfit = poly.fit_transform(X)

In [7]:
X.shape # num of feature cols, leaving 'quality'-->12

(6497, 12)

In [8]:
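# Note: get_feature_names was removed in scikit-learn 1.2; on newer versions use
# poly.get_feature_names_out(X.columns), which returns an array instead of a list.
poly.get_feature_names(X.columns)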

['fixed_acidity',
 'volatile_acidity',
 'citric_acid',
 'residual_sugar',
 'chlorides',
 'free_sulfur_dioxide',
 'total_sulfur_dioxide',
 'density',
 'ph',
 'sulphates',
 'alcohol',
 'red_wine',
 'fixed_acidity^2',
 'fixed_acidity volatile_acidity',
 'fixed_acidity citric_acid',
 'fixed_acidity residual_sugar',
 'fixed_acidity chlorides',
 'fixed_acidity free_sulfur_dioxide',
 'fixed_acidity total_sulfur_dioxide',
 'fixed_acidity density',
 'fixed_acidity ph',
 'fixed_acidity sulphates',
 'fixed_acidity alcohol',
 'fixed_acidity red_wine',
 'volatile_acidity^2',
 'volatile_acidity citric_acid',
 'volatile_acidity residual_sugar',
 'volatile_acidity chlorides',
 'volatile_acidity free_sulfur_dioxide',
 'volatile_acidity total_sulfur_dioxide',
 'volatile_acidity density',
 'volatile_acidity ph',
 'volatile_acidity sulphates',
 'volatile_acidity alcohol',
 'volatile_acidity red_wine',
 'citric_acid^2',
 'citric_acid residual_sugar',
 'citric_acid chlorides',
 'citric_acid free_sulfur_diox

In [9]:
# Check out the dimensions of X_overfit.
# Degree-2 features on 12 columns: 12 original + 12 squared + 66 pairwise products = 90.
# We have gone from 12 to 90 features for predicting 'quality'!
X_overfit.shape

(6497, 90)
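
The jump from 12 to 90 columns follows from a simple count. A small sketch (the helper name `n_poly_features` is hypothetical) that reproduces it with the standard library:

```python
from math import comb

def n_poly_features(n_features, degree=2):
    """Number of polynomial terms of degree 1..degree (no bias column)."""
    # Monomials of total degree <= d in p variables number C(p + d, d),
    # which includes the constant term, so we subtract 1.
    return comb(n_features + degree, degree) - 1

p = 12
print(f"{p} original + {p} squared + {comb(p, 2)} interactions "
      f"= {n_poly_features(p)} features")
```

The count grows quadratically with the number of original columns, which is why polynomial expansion is an easy way to build a model that overfits.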

#### Let's split our data up into training and testing sets. 

***Why do we split our data into training and testing sets? Share your take.***

In [10]:
# Import train_test_split.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [11]:
# Create train/test splits using X_overfit, y defined above.
# Setting random_state makes the split repeatable, instead of a different split on each run.
# Note: a test set this much larger than the training set is rare in practice; we do it here on purpose to make the model overfit.
# The default test_size in sklearn is 0.25. Rule of thumb: train_size > test_size.
X_train, X_test, y_train, y_test = train_test_split(X_overfit, y, test_size=0.7, random_state=42)

In [12]:
# Scale our data to standardize different feature magnitudes.
# Relabeling scaled data as "Z" is common (see z-score scaling in the feature engineering lesson).
# Note that we only transform() the test set, reusing the mean and standard deviation learned by fitting on the training set.
sc = StandardScaler()
Z_train = sc.fit_transform(X_train)
Z_test = sc.transform(X_test)

In [13]:
print(f'Z_train shape is: {Z_train.shape}')
print(f'y_train shape is: {y_train.shape}')
print(f'Z_test shape is: {Z_test.shape}')
print(f'y_test shape is: {y_test.shape}')

Z_train shape is: (1949, 90)
y_train shape is: (1949,)
Z_test shape is: (4548, 90)
y_test shape is: (4548,)
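
What "fit on train, transform test" means can be written out by hand. The sketch below mimics the scaling step with plain NumPy on made-up arrays (`X_tr` and `X_te` are stand-ins, not the wine data); it is an illustration of the idea rather than a replacement for `StandardScaler`.

```python
import numpy as np

rng = np.random.default_rng(42)
X_tr = rng.normal(loc=50.0, scale=10.0, size=(100, 3))  # stand-in for X_train
X_te = rng.normal(loc=50.0, scale=10.0, size=(40, 3))   # stand-in for X_test

mu = X_tr.mean(axis=0)    # learned from the training rows only
sigma = X_tr.std(axis=0)  # population std (ddof=0), as StandardScaler uses

Z_tr = (X_tr - mu) / sigma
Z_te = (X_te - mu) / sigma  # reuse the training mean and std

print("train means:", Z_tr.mean(axis=0).round(3))
print("test means: ", Z_te.mean(axis=0).round(3))
```

The training columns end up with mean 0 and standard deviation 1 exactly, while the test columns are only approximately standardized. That is expected: the test set must not influence anything the model learns, including the scaling constants.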


## Standardizing predictors is required

Let's remind ourselves of our new loss function _(with penalty)_. This is the L2 Regularization:

$$MSE + \alpha \|\beta\|^2$$

<details><summary>Why do you think scaling (standardizing) the features is required in a regularization context?</summary>
    Recall that the size of each <b>coefficient</b> ($\beta$) depends on the scale of its corresponding <b>variable</b>. Our <b>penalty term</b> depends on these <b>coefficients</b>. Scaling is required so that the <b><i>regularization penalizes each variable equally</i></b>.
</details>
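
To see how a coefficient depends on the units of its variable, here is a small sketch with a made-up feature measured in kilometres and then in metres (the data and the helper `slope` are assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(1)
dist_km = rng.uniform(1, 10, size=50)  # hypothetical feature, in kilometres
y = 2.0 * dist_km + rng.normal(scale=0.1, size=50)

def slope(x, y):
    """OLS slope for a single feature, fitted with an intercept."""
    A = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(A, y, rcond=None)[0][1]

slope_km = slope(dist_km, y)
slope_m = slope(dist_km * 1000, y)  # same feature expressed in metres

print("slope per km:", slope_km)
print("slope per m: ", slope_m)
print("ratio:       ", slope_km / slope_m)
```

Both fits make identical predictions, yet the kilometre coefficient is 1000 times larger, so its squared contribution to a ridge penalty would be a million times larger. Without standardizing, the penalty would punish a variable for its units rather than for its usefulness.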

## But First: OLS (ordinary least squares / linear regression model)
Fitting and evaluating **without** any regularization

***Fact:*** 

*Why is linear regression referred to as least squares?*
The least squares regression line is the line that makes the vertical distances from the data points to the line as small as possible overall. It's called "least squares" because the best-fitting line is the one that minimizes the sum of the squared errors (residuals).

In [14]:
# Import the appropriate library and fit our OLS model.
from sklearn.linear_model import LinearRegression

In [15]:
ols = LinearRegression() # instantiate
ols.fit(Z_train, y_train) # model fit

LinearRegression()

In [16]:
# How does the model score (R^2) on the training and test data?
print(ols.score(Z_train, y_train))
print(ols.score(Z_test, y_test))

0.4069819524268312
0.22021547039841416


<details><summary>What do these $R^2$s tell you?</summary>
    <b><i>The train and test scores differ by a lot: the model hasn't generalized well. In other words, it is overfit!</i></b>
</details>
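
For reference, the `.score()` method of a Scikit-Learn regressor returns $R^2$. A hand-rolled version is sketched below (the function name `r_squared` and the toy arrays are illustrative assumptions), together with the train/test gap from the two scores printed above.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_residual / SS_total."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Toy check: perfect predictions versus always predicting the mean.
y_toy = np.array([1.0, 2.0, 3.0, 4.0])
print(r_squared(y_toy, y_toy))
print(r_squared(y_toy, np.full(4, y_toy.mean())))

# Gap between the OLS scores reported above.
train_r2, test_r2 = 0.4069819524268312, 0.22021547039841416
gap = train_r2 - test_r2
print(f"train - test R^2 gap: {gap:.3f}")
```

A model that only predicts the mean scores 0, so the test $R^2$ tells us how much better than that baseline the model does on unseen data.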

In [17]:
ols.coef_ # Estimated coefficients for the linear regression model

array([-6.25749886e+01, -3.02971610e+01, -5.01103963e+01, -1.31671753e+02,
       -6.05594065e+01, -6.69469156e+01,  1.06254306e+02,  1.17149975e+02,
       -3.63264002e+01, -5.28038087e+01,  9.10286690e+01, -2.06348431e-01,
       -7.51604769e-01, -1.51877470e-01, -2.69504397e-01, -6.37134850e-01,
       -1.46230396e+00, -5.61003144e-01,  6.66040229e-01,  6.43672195e+01,
       -1.63465852e-01, -2.26602773e-01,  4.96393198e-01,  3.61062695e-01,
       -1.51450896e-02, -5.61980374e-02, -2.76987557e-01,  3.72765182e-02,
       -8.77421185e-03,  2.51096607e-01,  2.84960544e+01,  3.51506311e-01,
       -1.52413698e-01,  1.45769693e+00,  2.63918881e-01, -1.26622223e-01,
       -2.43408218e-01,  1.68109546e-01,  8.29009668e-02,  4.50602888e-02,
        5.04419929e+01, -9.30023851e-01, -1.83270244e-01,  1.19871740e+00,
       -1.10452250e-01, -1.13190897e+00, -1.43140646e-01, -3.35884795e-01,
        6.51264578e-01,  1.37020668e+02, -4.10760519e+00, -4.42930669e-01,
        1.96024882e+00, -

In [18]:
len(ols.coef_) # there are 90, one for each of our features

90

## And Now: Ridge

### Let's think about this...

$$ \text{Loss function} = MSE + \alpha\|\beta\|^2$$ 

$$ \text{Loss function} =\frac{1}{n}\|\mathbf{y} - \mathbf{X}\beta\|^2 + \alpha\|\beta\|^2 $$

<details><summary>What's the optimal value of $\beta$ when $\alpha = 0$?</summary>
Our problem <b>reduces to OLS solution</b>, because the penalty term becomes 0!

With the penalty gone, we minimize the squared error alone. Setting its gradient with respect to $\beta$ to zero gives:

$$ \mathbf{0} = -2\mathbf{X}^T(\mathbf{y} - \mathbf{X}\hat\beta) $$

With some matrix math, we can calculate $\hat\beta$ as: <a href="https://online.stat.psu.edu/stat462/node/132/">(link for further reference)</a>

$$ \hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} $$
</details>
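
The closed-form OLS solution quoted above can be checked against NumPy's least-squares solver. This is a sketch on synthetic data (all values and variable names are made up for illustration, and no intercept is fitted).

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Solve (X^T X) beta = X^T y rather than forming the inverse explicitly.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Reference answer from NumPy's least-squares routine.
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

# At the minimizer, the gradient of ||y - X beta||^2 should vanish.
gradient = -2 * X.T @ (y - X @ beta_normal)

print(beta_normal)
print(beta_lstsq)
```

Using `np.linalg.solve` instead of inverting $\mathbf{X}^T\mathbf{X}$ is a common numerical habit; the two routes agree whenever $\mathbf{X}^T\mathbf{X}$ is invertible.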

<details><summary>What's the optimal value of $\beta$ when $\alpha = \infty$?</summary>
When $\alpha = \infty$, anything besides $\hat{\beta} = \mathbf{0}$ makes our whole loss function <b>$\infty$</b>. Since we want the loss to be as small as possible, it must be that $\hat{\beta} = \mathbf{0}$!
</details>

<details><summary>Some facts...</summary>
$\alpha$ is a constant that sets the <i>strength</i> of the regularization. The <b>higher the value, the greater the impact</b> of the penalty term on the loss function. If the value were <b>zero, we would revert to the plain least squares</b> loss function. If the value were a billion, however, the residual sum of squares component would have a much smaller effect on the loss than the regularization term.
</details>

### We can look at a traceplot to visualize above:
Similar to the visualization on slide 15 of the "Regularization" deck: the coefficients $\beta$ grow in magnitude as $\alpha$ decreases (reading the x-axis left to right, $\alpha$ is decreasing).

![](../imgs/ridge-trace.png)

### Ok, so which $\alpha$ is best?

We'll primarily choose the **optimal** $\alpha$ via **cross validation**.
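
To make "choosing $\alpha$ by cross validation" concrete, here is a minimal hand-written k-fold sketch in NumPy. It is an illustration under several assumptions: the data are synthetic, no intercept is fitted, no scaling is done inside the folds, and `ridge_fit` / `cv_mse` are hypothetical helper names. It is not how `RidgeCV` is implemented.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Minimizer of (1/n)||y - X b||^2 + alpha ||b||^2 (no intercept)."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n * alpha * np.eye(p), X.T @ y)

def cv_mse(X, y, alpha, k=5, seed=0):
    """Average validation MSE over k folds for one value of alpha."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        beta = ridge_fit(X[train], y[train], alpha)
        errors.append(np.mean((y[val] - X[val] @ beta) ** 2))
    return float(np.mean(errors))

# Synthetic data: few rows, many columns, so an unpenalized fit overfits easily.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 30))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=1.0, size=60)

alphas = np.logspace(-3, 2, 20)
scores = [cv_mse(X, y, a) for a in alphas]
best_alpha = alphas[int(np.argmin(scores))]
print("best alpha on this grid:", best_alpha)
```

Each candidate $\alpha$ is judged only on rows the model did not see while fitting, which is what lets cross validation pick a penalty that generalizes rather than one that merely fits the training data.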

In [20]:
# Linear least squares with L2 regularization
from sklearn.linear_model import Ridge

In [21]:
# Instantiate.
# alpha is the Regularization strength. we're passing 10, instead of the default 1 for stronger regularization.
ridge_model = Ridge(alpha=10)

# Fit.
ridge_model.fit(Z_train, y_train)

# Evaluate model using R2.
print(ridge_model.score(Z_train, y_train))
print(ridge_model.score(Z_test, y_test))

0.37708346347575306
0.24423639703199995


<details><summary>What do these $R^2$s tell you?</summary>
<b><i>The gap between train and test scores has narrowed compared with plain OLS, but the scores still aren't close, so the model is still overfit to some degree.</i></b>
</details>

## Brute-forcing the answer

In [22]:
# Ridge regression with "built-in" cross-validation (advancing from above approach w/o cv)
from sklearn.linear_model import RidgeCV

In [23]:
# Set up a list of ridge alphas to check.
# np.logspace generates 100 evenly spaced exponents between 0 and 5,
# then converts them to alphas between 10^0 and 10^5 (that is, evenly spaced on a log scale).
r_alphas = np.logspace(0, 5, 100)

# Cross-validate over our list of ridge alphas.
# alphas: pass an Array of alpha values to try. It is still the Regularization strength
ridge_cv = RidgeCV(alphas=r_alphas, scoring='r2', cv=5).fit(Z_train, y_train)# fitting 5-fold CV

In [24]:
print(ridge_cv.score(Z_train, y_train))
print(ridge_cv.score(Z_test, y_test))

0.3484515362721913
0.29488871636997716


In [25]:
# getting the optimal value of alpha from ridge cv
ridge_cv.alpha_

104.76157527896652

<details><summary>What do these $R^2$s tell you?</summary>
<b><i>The gap between train and test scores has narrowed further compared with the non-CV ridge regression, and the test score has improved.</i></b>
</details>
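
The grid of candidate alphas used with `RidgeCV` above came from `np.logspace`. The short sketch below spells out what that call produces; the variable names `exponents`, `ratios`, and `nearest` are just for this illustration.

```python
import numpy as np

r_alphas = np.logspace(0, 5, 100)

# logspace(a, b, n) is 10 raised to n evenly spaced exponents from a to b.
exponents = np.linspace(0, 5, 100)
same_grid = np.allclose(r_alphas, 10.0 ** exponents)

# On a log scale the points are evenly spaced: consecutive ratios are constant.
ratios = r_alphas[1:] / r_alphas[:-1]

# Locate the grid point closest to the alpha that ridge_cv reported.
nearest = int(np.argmin(np.abs(r_alphas - 104.76157527896652)))

print("smallest, largest:", r_alphas[0], r_alphas[-1])
print("matches 10**linspace:", same_grid)
print("index of nearest grid point:", nearest)
```

A log-spaced grid is a sensible default because the useful range of $\alpha$ often spans several orders of magnitude. The value returned by `ridge_cv.alpha_` is always one of the grid points supplied, so a coarse grid limits how precisely the best $\alpha$ can be located.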

## Defining the LASSO

LASSO regression (a.k.a. L1 regression) is largely the same as ridge, except with a **different penalty term** (_absolute value_ of coefficients instead of _squared_).

$$
\begin{align}
\text{minimize: MSE + penalty} &= \textstyle\frac{1}{n}\sum (y_i - \hat{y}_i)^2 + \alpha \sum |\beta_j| \\ \\
                               &= \textstyle\frac{1}{n}\|\mathbf{y} - \hat{\mathbf{y}}\|^2 + \alpha \|\beta\|_1 \\ \\
                               &= \textstyle\frac{1}{n}\|\mathbf{y} - \mathbf{X}\beta\|^2 + \alpha \|\beta\|_1
\end{align}
$$

The penalty is now made up of the **$\ell_1$-norm**, also known as the **Manhattan distance**. It is simply the sum of the absolute values of the vector components.

### The LASSO traceplot looks a little different...
We'll see it later and discuss what LASSO actually does differently from Ridge.

In [26]:
# Imports similar to Ridge, this time for Lasso instead
from sklearn.linear_model import Lasso, LassoCV

## LASSO Regression

In [27]:
# Reminder of results from evaluations before this
print(" OLS ".center(18, "="))# syntax: str.center(width, fillchar=' ')
print(ols.score(Z_train, y_train))
print(ols.score(Z_test, y_test))
print()
print(" Ridge ".center(18, "="))
print(ridge_cv.score(Z_train, y_train))
print(ridge_cv.score(Z_test, y_test))

0.4069819524268312
0.22021547039841416

0.3484515362721913
0.29488871636997716


In [28]:
# Set up a list of Lasso alphas to check.
# np.logspace generates 100 evenly spaced exponents between -3 and 0,
# then converts them to alphas between 10^-3 and 1 (that is, evenly spaced on a log scale).
l_alphas = np.logspace(-3, 0, 100)

# Cross-validate over our list of Lasso alphas.
lasso_cv = LassoCV(alphas=l_alphas, cv=5, max_iter=50000).fit(Z_train, y_train);

In [29]:
print(lasso_cv.score(Z_train, y_train))
print(lasso_cv.score(Z_test, y_test))

0.3366703928164062
0.2855544326156588


In [30]:
# getting the optimal value of alpha from lasso cv
lasso_cv.alpha_

0.007564633275546291

## Ridge vs LASSO, what's the diff?!
Let's check out the **coefficients** of the Lasso and Ridge models.

In [31]:
ridge_cv.coef_

array([-7.41082332e-04, -6.73622850e-02,  4.11979658e-03,  3.58118418e-02,
       -7.86169814e-03,  7.96837868e-02,  4.07202575e-02, -6.40134004e-02,
        1.67657002e-03,  2.63542441e-02,  4.89229448e-02, -5.96326152e-04,
        3.97892719e-03, -3.38291402e-02, -4.33716705e-03,  1.37005906e-01,
        3.07581441e-02,  3.60944933e-02, -6.74979641e-03,  6.29819066e-04,
        8.10392864e-03,  3.73389457e-02,  6.71509211e-03,  7.15269227e-02,
        1.12923745e-02, -1.19654956e-02, -7.94404967e-02, -1.56703211e-02,
       -3.95481031e-03, -4.59646775e-02, -6.81012094e-02, -6.04899669e-02,
       -3.27140260e-04,  4.46369117e-02,  7.03665877e-02, -5.18587000e-02,
        2.51915541e-02,  3.39269447e-02,  4.95212161e-02, -2.47711305e-02,
        3.96451349e-03,  6.67705387e-04, -3.08860641e-02,  2.35098736e-02,
        3.55708420e-02,  3.93321581e-02,  4.86155004e-02, -3.50408563e-02,
        5.62724938e-02,  3.71036473e-02, -2.69137546e-02, -1.40643736e-02,
       -1.15979037e-02,  

In [33]:
lasso_cv.coef_

array([-0.        , -0.        , -0.        ,  0.        , -0.        ,
        0.        ,  0.        , -0.        ,  0.        ,  0.        ,
        0.        ,  0.        , -0.        , -0.        , -0.        ,
        0.16835564,  0.        ,  0.        , -0.        , -0.        ,
       -0.        ,  0.        , -0.        ,  0.06311839,  0.        ,
       -0.        , -0.05120041, -0.        , -0.        , -0.07081332,
       -0.16039109, -0.        , -0.        , -0.        ,  0.00492753,
       -0.01006693,  0.        ,  0.        ,  0.        , -0.        ,
       -0.        , -0.        , -0.        , -0.        ,  0.01658083,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
       -0.        , -0.        , -0.        ,  0.        ,  0.        ,
        0.        , -0.        , -0.        , -0.        , -0.        ,
       -0.        ,  0.00238066, -0.28613203, -0.        ,  0.        ,
        0.13542067,  0.        ,  0.25586092, -0.06659136, -0.  

**By comparing the coefficients from Ridge vs Lasso, we can conclude that Lasso is effective for feature selection**: it shrinks the coefficients of features that are **not relevant or not good predictors** of the response all the way to zero, leaving only the features it finds most useful. Ridge, by contrast, shrinks coefficients toward zero but leaves them nonzero.

## Cliffsnotes: L.A.S.S.O.
LASSO is actually an acronym:

* **L**east
* **A**bsolute
* **S**hrinkage and
* **S**election
* **O**perator

**SHRINKAGE**: Higher $\alpha$ "shrinks" $\beta$ towards $\mathbf{0}$.

**SELECTION**: Higher $\alpha$ zeros out small $\beta$s.
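
The one-coefficient setting from the Pop Math Quiz gives some intuition for why LASSO selects while ridge only shrinks. If we swap the ridge penalty in Problem 2 for an absolute-value penalty, $f = (y - b)^2 + \alpha|b|$, the minimizer is a "soft-thresholding" rule: move $y$ toward zero by $\alpha/2$ and stop at zero. The sketch below compares the two; the function names and input values are assumptions for illustration, and a real multi-feature LASSO fit is more involved.

```python
def ridge_1d(y, alpha):
    """Minimizer of (y - b)^2 + alpha * b^2."""
    return y / (1 + alpha)

def lasso_1d(y, alpha):
    """Minimizer of (y - b)^2 + alpha * |b|: soft-thresholding at alpha / 2."""
    shrunk = max(abs(y) - alpha / 2, 0.0)
    return shrunk if y >= 0 else -shrunk

for y in [3.0, 0.4, -0.4]:
    for alpha in [0.5, 1.0, 4.0]:
        print(f"y={y:>4}, alpha={alpha}: "
              f"ridge={ridge_1d(y, alpha):.3f}, lasso={lasso_1d(y, alpha):.3f}")
```

Ridge divides by $1 + \alpha$, so its answer gets small but never reaches zero unless $y$ is zero. The soft-threshold rule returns exactly zero whenever $|y| \le \alpha/2$, which is the "selection" behavior described above.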

![](../imgs/lasso-trace.svg)

## Mathematical interpretation of what LASSO was doing here
Mathematically, LASSO is solving an optimization problem, and $\alpha$ plays the role of a **Lagrange multiplier**. This means that optimizing our loss function:

$$ \|\mathbf{y} - \mathbf{X}\beta\|^2 + \alpha\|\beta\|_1 $$

is equivalent, for a suitable budget $t$ that depends on $\alpha$, to optimizing the **constrained loss function**:

$$ \|\mathbf{y} - \mathbf{X}\beta\|^2 \quad \text{such that} \quad \|\beta\|_1 \le t $$

_(We explained this concept with a constant, $s$, while covering the intuition deck.)_

## [TRY OUT THIS APP!](https://timothykbook.shinyapps.io/RegularizationPlot/) 
to see for yourself how changing $\alpha$ (or $\lambda$) affects the coefficients $\beta$.

## Now onto Regularizing Logistic Regression
_Note: we'll cover more on logistic regression in a "**classification**" machine learning context in future lessons._

For now, just treat it as yet another variation of the linear regression model.

In [35]:
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

### Let's take a look at the LogisticRegression default parameters:

In [36]:
LogisticRegression().get_params()

{'C': 1.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 100,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': None,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

### Notice the parameter `penalty`? Regularization is the hidden default for logistic regression!
Unless regularization is necessary, **it should not be done!!** (It makes interpreting the coefficients much more difficult.) In Scikit-Learn 0.21 and higher, you can disable regularization by passing in `penalty='none'`! (From version 1.2 on, the string `'none'` is deprecated in favor of `penalty=None`.)

If you _do_ want to regularize, note that there is a much friendlier `LogisticRegressionCV` we will use.

In [37]:
# generate a random classification problem
from sklearn.datasets import make_classification

In [38]:
# create a dataset with 1000 rows and 200 columns, only 15 of which are informative features --> simulates a noisy dataset
X, y = make_classification(n_samples=1000, n_features=200, n_informative=15, random_state=123)
print(X, y)

[[-0.4405465  -0.77063522 -0.48266557 ...  1.84166626 -0.91101634
   0.96037298]
 [-0.77339715 -0.21886082 -2.11639665 ... -0.55028366  0.63370087
   0.71023287]
 [-2.61619714 -0.90948965 -3.21558029 ...  0.16634024  0.51847037
   0.03842324]
 ...
 [-0.06833379 -1.28884401 -0.77598478 ...  0.69427821 -0.36789168
   1.14419823]
 [ 0.25399511 -0.61331032 -0.23905231 ... -0.69668623 -0.10576577
  -2.77347662]
 [-0.25432662  0.74350343 -0.14684419 ...  0.17122285  0.47578506
   0.3489792 ]] [1 0 0 0 0 0 1 0 1 1 1 1 0 1 0 0 0 1 1 0 0 1 1 1 0 0 0 0 1 0 1 1 0 0 1 1 0
 1 0 1 0 0 0 0 0 1 0 1 1 0 0 0 0 1 0 1 1 0 0 1 1 1 0 0 1 0 0 0 1 0 1 0 1 0
 1 1 1 0 1 0 1 0 0 1 0 0 0 1 0 1 0 0 0 1 0 1 1 0 0 0 0 1 1 0 1 0 0 1 0 0 1
 0 0 0 0 1 0 0 0 0 1 0 1 1 1 1 0 0 0 0 0 1 1 0 0 1 0 0 1 1 1 1 0 1 0 1 0 0
 1 0 1 0 0 1 1 0 0 1 0 1 0 0 1 0 1 0 0 1 1 1 0 0 1 0 0 1 1 1 0 0 0 0 1 0 0
 1 1 0 1 0 1 0 1 1 0 0 1 1 0 1 0 1 1 0 1 0 0 1 0 1 1 0 0 1 1 1 1 0 1 1 0 0
 0 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 0 1 0 0 1 0

In [39]:
# train-test split
# scaling standardization
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)

In [40]:
# logistic regression model training
# C: inverse of regularization strength (1/alpha); smaller values specify stronger regularization
# here C=1e9 (10^9): a very large value that effectively turns regularization off
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train_sc, y_train)

# See the gap in training vs testing metric! Overfit!
print(logreg.score(X_train_sc, y_train))
print(logreg.score(X_test_sc, y_test))

0.9306666666666666
0.636


In [41]:
# we arrive at the same outcome by setting penalty='none' (use penalty=None on scikit-learn 1.2 and later)
logreg_pen_none = LogisticRegression(penalty='none')
logreg_pen_none.fit(X_train_sc, y_train)

# See the gap in training vs testing metric! Overfit!
print(logreg_pen_none.score(X_train_sc, y_train))
print(logreg_pen_none.score(X_test_sc, y_test))

0.9306666666666666
0.636


In [42]:
# logistic regression_cv (5-fold cv) model training
# sklearn's documentation recommends the "liblinear" solver for small datasets; it also supports the "l1" penalty.
# If not specified, the default solver is "lbfgs", which works with the "l2" penalty (or no penalty) but not "l1".
logreg_cv = LogisticRegressionCV(cv=5, penalty="l1", solver="liblinear")
logreg_cv.fit(X_train_sc, y_train)

# we don't see overfitting now! training-testing metrics are close
print(logreg_cv.score(X_train_sc, y_train))
print(logreg_cv.score(X_test_sc, y_test))

0.8146666666666667
0.812


In [43]:
# getting the optimal value of "C" for class 1, inverse of "alpha"
logreg_cv.C_

array([0.04641589])

Scikit-Learn [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html) for `LogisticRegressionCV`.

# Elephant in the Room: Categorical Variables
Think about it. What does it mean to scale a binary variable? _(Our response, `y` in a binary classification problem either takes the value 1 or 0, this is also referred to as class 1 or class 0.)_ How about a categorical variable dummified into several columns? What does it mean to shrink the coefficients associated with these columns? What happens if the LASSO zeros out one category, but not others? Not sure, either.

It turns out, it's **not a great idea to combine scaling and categorical data**. It often just doesn't make sense to do. This is true for all algorithms where we need to scale. So what do we do? A few options:

* Set separate regularization parameters for each x-variable (not available in Scikit-Learn).
* Carry out the _grouped LASSO_ technique (not available in Scikit-Learn, and doesn't solve all problems anyway).
* Manually decide on a scale for these variables (time consuming, unintuitive, still doesn't work with regularization).
* Don't use those variables (but you want them!).
* Just do it anyway. Who knows, it'll probably be fine! (¯\_(ツ)_/¯)

In the bonus section, we'll explore `Elastic Net` regularization using the same classification dataset but **without scaling**.

## Important Notes
- The $\alpha$ hyperparameter for regularization is **unrelated** to significance level in hypothesis testing.
- In certain resources, including [ISLR](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf), you'll see that $\lambda$ is used instead of $\alpha$ for regularization strength.
- We must standardize before regularizing, but regularization and standardization are not the same things!
- **FROM NOW ON, YOU MUST PAY ATTENTION TO REGULARIZATION WHEN CONDUCTING LOGISTIC REGRESSION!!!**
- Ridge regression is sometimes called **weight decay**, but usually only when regularizing neural networks.
- LASSO regression is sometimes called **basis pursuit**, but that's very old fashioned.
- The y-intercept for these models is not regularized.

## Recap
- Regularization is used when evidence suggests our model is suffering from high error due to variance.
- Evidence to suggest our model suffers from high error due to variance includes substantially better performance on our training set than our testing set.
- LASSO tends to be "more brutal" than Ridge regularization in that it will zero out coefficients.
- If you want to combine LASSO and Ridge regularization, there is a technique called "ElasticNet" that does exactly this.

## ElasticNet Regression (bonus)

---

Can't decide?

![](../imgs/why-not-both.jpg)

The Elastic Net combines the Ridge and Lasso penalties.  It adds *both* penalties to the loss function:

$$
\begin{eqnarray}
SSE + Ridge + Lasso &=& \sum_{i=1}^n \left(y_i - \hat{y}_i\right)^2 + \alpha\left[\rho\sum_{j=1}^p |\beta_j| + (1-\rho)\sum_{j=1}^p \beta_j^2\right] \\
&=& \|\mathbf{y} - \mathbf{X}\beta\|^2 + \alpha\left(\rho\|\beta\|_1 + (1 - \rho)\|\beta\|^2\right)
\end{eqnarray}
$$


In the elastic net, the balance between the ridge and the lasso penalties is set by the $\rho$ (L1 ratio) parameter. It is the share of the total penalty given to the Lasso term, with the remaining $1 - \rho$ going to the Ridge term, and it must be between zero and one.

`ElasticNet` in sklearn has two parameters:
- `alpha`: the regularization strength.
- `l1_ratio`: the amount of L1 vs L2 penalty (i.e., $\rho$). An l1_ratio of **0 is equivalent to the Ridge**, whereas an **l1_ratio of 1 is equivalent to the Lasso**.
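
The combined penalty in the formula above is easy to compute directly, which makes the role of $\rho$ visible. This is a sketch following the lesson's formula with a made-up coefficient vector; `elastic_net_penalty` is a hypothetical helper. Scikit-Learn's own objective scales things slightly differently (it halves the L2 term and divides the squared error by $2n$), so the numbers here are not what `ElasticNet` computes internally.

```python
import numpy as np

def elastic_net_penalty(beta, alpha, rho):
    """alpha * (rho * ||beta||_1 + (1 - rho) * ||beta||_2^2)."""
    beta = np.asarray(beta, dtype=float)
    l1 = np.sum(np.abs(beta))
    l2_squared = np.sum(beta ** 2)
    return alpha * (rho * l1 + (1 - rho) * l2_squared)

beta = np.array([0.5, -2.0, 0.0, 3.0])  # illustrative coefficients
alpha = 0.1

for rho in [0.0, 0.5, 1.0]:
    print(f"rho={rho}: penalty={elastic_net_penalty(beta, alpha, rho):.4f}")
```

With `rho=0` the penalty is the pure ridge term and with `rho=1` it is the pure lasso term; values in between blend the two.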


In [44]:
from sklearn.linear_model import ElasticNet

We fit the model over a grid of alpha values with an l1_ratio of 0.5 (note that an l1_ratio of 0.5 is the default, balancing ridge and lasso, so we don't need to explicitly declare l1_ratio=0.5).
- Using a $\rho$ value below 0.05 can empirically cause issues in `sklearn`.

In [45]:
from sklearn.linear_model import ElasticNetCV

In [46]:
# Set up a list of alphas to check.
enet_alphas = np.linspace(0.5, 1.0, 100)  # 100 evenly spaced values from 0.5 to 1.0

# Instantiate model.
enet_model = ElasticNetCV(alphas=enet_alphas, cv=5)

# Fit model; cross-validation selects the best alpha from the grid.
enet_model = enet_model.fit(X_train, y_train)

# Generate predictions.
enet_model_preds = enet_model.predict(X_test)
enet_model_preds_train = enet_model.predict(X_train)

# Evaluate model (score is R^2; note y here is the 0/1 class label treated as a number).
print(enet_model.score(X_train, y_train))
print(enet_model.score(X_test, y_test))

0.21246113948141954
0.21368840628470087


In [47]:
# Here is the optimal value of alpha. It sits at the lower edge of our grid (0.5), so smaller alphas may be worth trying.
enet_model.alpha_

0.5