# Lasso Regression and Ridge Regression

### Learning Objectives

By the end of this lesson students will:

- Understand regularization at a high level.
- Be able to use Ridge regression to apply regularization.
- Be able to use Lasso regression to apply regularization.

## Regularization

- Regularization is a method for "constraining" or "regularizing" the size of the coefficients, thus "shrinking" them toward zero.
- It reduces model variance, at the cost of some added bias, and thus reduces overfitting.
- It often improves model generalization.

Our goal is to locate the optimum model complexity, and thus regularization is useful when we believe our model is too complex.

<a id="how-does-regularization-work"></a>
### How Does Regularization Work?

For a normal linear regression model, we estimate the coefficients using the least squares criterion, which minimizes the residual sum of squares (RSS).

For a regularized linear regression model, we minimize the sum of RSS and a "penalty term" that penalizes coefficient size.

### Ridge regression  minimizes: $$\text{RSS} + \alpha \sum_{j=1}^p \beta_j^2$$

This is __L2__ regularization: the penalty is the squared _Euclidean_ length of the coefficient vector (the straight-line distance given by the Pythagorean theorem). Think _squared_.

### Lasso regression minimizes: $$\text{RSS} + \alpha \sum_{j=1}^p |\beta_j|$$

This is __L1__ regularization, also known as the _Manhattan_ or _Taxicab_ norm, among [many other names](https://en.wikipedia.org/wiki/Taxicab_geometry). Think _absolute value_.

- $p$ is the number of features.
- $\beta_j$ is a model coefficient.
- $\alpha$ is a tuning parameter:
    - An $\alpha$ of zero imposes no penalty on the coefficient size and is equivalent to a normal linear regression model; a tiny $\alpha$ is nearly equivalent.
    - Increasing the $\alpha$ penalizes the coefficients and thus shrinks them.
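
To make the two penalty terms concrete, here is a minimal sketch that evaluates both objectives by hand with NumPy. The helper names (`rss`, `ridge_objective`, `lasso_objective`) and the toy data are made up for illustration, and the model has no intercept to keep the arithmetic short.

```python
import numpy as np

def rss(X, y, beta):
    """Residual sum of squares for a linear model without an intercept."""
    residuals = y - X @ beta
    return float(residuals @ residuals)

def ridge_objective(X, y, beta, alpha):
    """RSS plus the L2 penalty: alpha * sum(beta_j ** 2)."""
    return rss(X, y, beta) + alpha * float(np.sum(beta ** 2))

def lasso_objective(X, y, beta, alpha):
    """RSS plus the L1 penalty: alpha * sum(|beta_j|)."""
    return rss(X, y, beta) + alpha * float(np.sum(np.abs(beta)))

# Toy data, made up for illustration: 3 observations, 2 features.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([2.0, -1.0, 1.0])
beta = np.array([2.0, -1.0])  # fits the toy data exactly, so RSS is 0

for alpha in (0.0, 1.0, 10.0):
    print(alpha,
          ridge_objective(X, y, beta, alpha),
          lasso_objective(X, y, beta, alpha))
```

Because the RSS is zero here, everything that is printed comes from the penalty: the same coefficients cost more as $\alpha$ grows, which is what pushes the fitted coefficients toward zero.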
    

## A larger alpha results in more regularization ☝️

- Lasso regression can shrink some coefficients all the way to zero, thus removing those features from the model.
- Ridge regression shrinks coefficients toward zero, but they rarely reach zero.
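
The difference between the two bullets above can be checked with a small experiment. This is a sketch on synthetic data from `make_regression` (not the lesson's dataset); the alpha values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 10 features, only 3 of which actually drive the target.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)

zero_counts = {}
for alpha in (0.1, 1.0, 10.0, 100.0):
    ridge = Ridge(alpha=alpha).fit(X, y)
    lasso = Lasso(alpha=alpha).fit(X, y)
    # Count coefficients that are exactly zero for each model.
    ridge_zeros = int(np.sum(ridge.coef_ == 0))
    lasso_zeros = int(np.sum(lasso.coef_ == 0))
    zero_counts[alpha] = (ridge_zeros, lasso_zeros)
    print(f"alpha={alpha:>6}: ridge zeros={ridge_zeros}, lasso zeros={lasso_zeros}")
```

On data like this, the Lasso zero count should grow as alpha increases, while Ridge coefficients get smaller without landing exactly on zero.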

#### Parameters:

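- **alpha:** must be positive; increase it for more regularization.
- **normalize:** older scikit-learn versions accepted this flag to rescale the features before fitting. It was deprecated in version 1.0 and removed in 1.2, and it was not identical to `StandardScaler` (it divided by the L2 norm rather than the standard deviation). In current versions, scale the features yourself, for example with `StandardScaler`, before fitting.
    Sometimes scaling helps and sometimes it hurts.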
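
One common way to put the features on a comparable scale before the penalty is applied is to standardize them inside a pipeline. The sketch below assumes synthetic data and an arbitrary `alpha=1.0`; the variable name `scaled_ridge` is made up for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, with one feature deliberately put on a much larger scale.
X, y = make_regression(n_samples=100, n_features=5, noise=5.0, random_state=0)
X[:, 0] *= 1000.0

# The scaler is fit on the same data passed to fit(), then Ridge sees scaled features.
scaled_ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
scaled_ridge.fit(X, y)

# make_pipeline names each step after its lowercased class name.
print(scaled_ridge.named_steps['ridge'].coef_)
```

Keeping the scaler in the pipeline means it is refit on the training portion only whenever the pipeline is fit, which avoids leaking information from held-out data.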

# Boston housing 

In [None]:
# usual imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.model_selection import train_test_split, cross_val_score

In [None]:
# read in the data
df_boston = pd.read_csv('../data/boston_data.csv')

In [None]:
# inspect 
df_boston.head()

In [None]:
df_boston.info()

### Break into X and y

In [None]:
X = df_boston.drop('MEDV', axis=1)
y = df_boston['MEDV']

#### Save the feature columns for later use

In [None]:
feature_cols = X.columns  # saved for labeling the coefficients later
X.head()

In [None]:
y.head()

### Split into training and test sets
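
A minimal sketch of the split, using synthetic stand-ins (`X_demo`, `y_demo`) rather than the lesson's `X` and `y`. The `test_size` and `random_state` values are arbitrary assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and target.
X_demo, y_demo = make_regression(n_samples=100, n_features=5,
                                 noise=10.0, random_state=42)

# Hold out 25% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42)

print(X_train.shape, X_test.shape)
```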

### Instantiate the models

In [None]:

lr = LinearRegression()   # the variable name `lr` is an assumption
ridge = Ridge()           # default alpha is 1
lasso = Lasso()           # default alpha is 1

### Fit and score the models

Let's fit and score the models. We aren't creating a validation dataset or using cross-validation here because we want to focus on what Ridge and Lasso do.
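
The fit-and-score loop might look like the sketch below. It runs on synthetic data rather than the lesson's dataset, and the dictionary keys and default alpha values are illustrative choices.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 10 features, 4 of them informative.
X_demo, y_demo = make_regression(n_samples=200, n_features=10, n_informative=4,
                                 noise=15.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo, random_state=1)

models = {
    'linear': LinearRegression(),
    'ridge': Ridge(alpha=1.0),
    'lasso': Lasso(alpha=1.0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # score() returns R^2 for scikit-learn regressors.
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name:>6}: train R^2={scores[name][0]:.3f}, test R^2={scores[name][1]:.3f}")
```

Plain linear regression always has the highest training R^2 of the three, since it minimizes training RSS with no penalty; the question regularization tries to answer is which model does best on the test set.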

### Vanilla Linear Regression 🍦


### Ridge (L2)

### Lasso (L1)

### Let's examine the coefficients

Make a DataFrame, plot, and look at the absolute value of each coefficient.

### Vanilla Linear Regression

#### Ridge

In [None]:
df_ridge = pd.DataFrame([ridge.coef_], columns=feature_cols)  # mirrors the Lasso cell below
df_ridge.plot(kind='barh', title='Ridge Coefficients');

In [None]:
df_ridge_vals = df_ridge.T.abs().sort_values(by=0, ascending=False)
df_ridge_vals.columns=['Ridge']
df_ridge_vals

#### Lasso

In [None]:
df_lasso = pd.DataFrame( [lasso.coef_], columns=feature_cols)
df_lasso

In [None]:
df_lasso.plot(kind='barh');

In [None]:
df_lasso_vals = df_lasso.T.abs().sort_values(by=0, ascending=False)
df_lasso_vals.columns=['Lasso']
df_lasso_vals

#### Concatenate the DataFrames
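
One way to line the models up side by side is `pd.concat` with `axis=1`, which aligns rows on the index (the feature names), so it works even when each per-model DataFrame has been sorted differently. This is a sketch on synthetic data; the feature names and the helper `abs_coefs` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso

X_demo, y_demo = make_regression(n_samples=200, n_features=6, n_informative=3,
                                 noise=10.0, random_state=0)
cols = [f'x{i}' for i in range(X_demo.shape[1])]  # hypothetical feature names

def abs_coefs(model, label):
    """Fit the model and return |coefficients| as a one-column DataFrame."""
    model.fit(X_demo, y_demo)
    return pd.DataFrame({label: np.abs(model.coef_)}, index=cols)

# axis=1 places the three columns next to each other, matched by feature name.
df_all = pd.concat(
    [abs_coefs(LinearRegression(), 'Linear'),
     abs_coefs(Ridge(), 'Ridge'),
     abs_coefs(Lasso(), 'Lasso')],
    axis=1)
print(df_all)
```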

#### Sum the coefficients for each model.
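
Summing the absolute coefficients gives a single number per model that reflects how much shrinkage was applied. A minimal sketch on synthetic data, with default alpha values and a made-up `coef_sums` dictionary:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso

X_demo, y_demo = make_regression(n_samples=200, n_features=6, n_informative=3,
                                 noise=10.0, random_state=0)

coef_sums = {}
for name, model in [('Linear', LinearRegression()),
                    ('Ridge', Ridge()),
                    ('Lasso', Lasso())]:
    model.fit(X_demo, y_demo)
    # Total coefficient magnitude (the L1 norm of the coefficient vector).
    coef_sums[name] = float(np.abs(model.coef_).sum())

print(coef_sums)
```

The Lasso total should not exceed the plain linear regression total, since the L1 penalty is exactly this sum; Ridge totals are usually smaller too, though its penalty targets the squared coefficients instead.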

# Summary

You've seen how to use regularization with Linear Regression model variants to decrease variance and attempt to improve generalizability in your models. Most machine learning models have regularization hyperparameters. You definitely want to try them out!

<a id="advice-for-applying-regularization"></a>
### Advice for Applying Regularization


**How should you choose between lasso regression and ridge regression?**

- Lasso regression is preferred if we believe many features are irrelevant or if we prefer a sparse model.
- Ridge regression can work particularly well if there is a high degree of multicollinearity among your features, although it can make feature importances harder to interpret.
- ElasticNet regression is a combination of lasso regression and ridge regression. It requires more tuning and is less interpretable, but it might work better.
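
To see how ElasticNet blends the two penalties, the sketch below varies scikit-learn's `l1_ratio` parameter on synthetic data. The alpha and `l1_ratio` values are arbitrary, and `results` and `lasso_ref` are names made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

X_demo, y_demo = make_regression(n_samples=200, n_features=10, n_informative=3,
                                 noise=10.0, random_state=0)

# l1_ratio=1.0 is a pure L1 (lasso) penalty; values closer to 0 weight the L2 (ridge) part more.
results = {}
for l1_ratio in (0.1, 0.5, 1.0):
    enet = ElasticNet(alpha=1.0, l1_ratio=l1_ratio).fit(X_demo, y_demo)
    results[l1_ratio] = enet.coef_
    n_zero = int(np.sum(enet.coef_ == 0))
    print(f"l1_ratio={l1_ratio}: {n_zero} coefficients are exactly zero")

# Reference fit: Lasso with the same alpha should match ElasticNet at l1_ratio=1.0.
lasso_ref = Lasso(alpha=1.0).fit(X_demo, y_demo)
print(np.allclose(results[1.0], lasso_ref.coef_))
```

The extra tuning mentioned above comes from having two knobs, `alpha` and `l1_ratio`, instead of one.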

Most models have parameters for regularization. In Logistic Regression it's `C`: higher values mean less regularization, and regularization is applied by default. Fun! 😉
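
The inverse relationship between `C` and regularization strength can be checked directly. This sketch uses scikit-learn's `LogisticRegression` on synthetic classification data; the three `C` values and the `max_iter` setting are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_clf, y_clf = make_classification(n_samples=300, n_features=8, n_informative=4,
                                   random_state=0)

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    # Smaller C means a stronger penalty on the coefficients.
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_clf, y_clf)
    coef_norms[C] = float(np.linalg.norm(clf.coef_))
    print(f"C={C:>6}: coefficient norm = {coef_norms[C]:.3f}")
```

The coefficient norm should grow as `C` grows, the opposite direction from `alpha` in Ridge and Lasso.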

### Check for understanding

1. How does regularization relate to the bias-variance tradeoff?
2. How does Ridge Regression differ from Lasso Regression?


### More resources

#### The docs:

- [Lasso](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html)
- [Ridge](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html)  
- [ElasticNet](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html) 


To go deeper with Bias/Variance and Regularization with Ridge Regression, [here's a good article](https://towardsdatascience.com/ridge-regression-for-better-usage-2f19b3a202db).