# Cheat Sheet - Regularized Regressions

In this cheat sheet, we'll extend our knowledge of linear regression with our first automated model building approach. 

It is worth calling out that sklearn does not support stepwise model building out of the box. Its developers would rather have us use inherently sparse models, like ElasticNet or Lasso. We'll cover those, but I think stepwise is a nice place to start. For more info on this, it's probably worth doing some additional reading on the ML approach to model building (sklearn-style) vs. the causal approach (statsmodels). In the ML approach, we don't care too much about p-values (which are traditionally used in stepwise model building).

## Instructions

**1. Start a new project and import the diamonds data (again)**

In [2]:
#Import the required libraries, and then the data
import seaborn
import pandas as pd
diamonds = seaborn.load_dataset("diamonds")

**2. Run some summary statistics**

Running `dtypes` and `describe()` is a good way to get a summary of your dataset.

In [3]:
diamonds.dtypes

carat       float64
cut        category
color      category
clarity    category
depth       float64
table       float64
price         int64
x           float64
y           float64
z           float64
dtype: object

In [4]:
diamonds.describe()

| | carat | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|
| count | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 |
| mean | 0.79794 | 61.749405 | 57.457184 | 3932.799722 | 5.731157 | 5.734526 | 3.538734 |
| std | 0.474011 | 1.432621 | 2.234491 | 3989.439738 | 1.121761 | 1.142135 | 0.705699 |
| min | 0.2 | 43.0 | 43.0 | 326.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.4 | 61.0 | 56.0 | 950.0 | 4.71 | 4.72 | 2.91 |
| 50% | 0.7 | 61.8 | 57.0 | 2401.0 | 5.7 | 5.71 | 3.53 |
| 75% | 1.04 | 62.5 | 59.0 | 5324.25 | 6.54 | 6.54 | 4.04 |
| max | 5.01 | 79.0 | 95.0 | 18823.0 | 10.74 | 58.9 | 31.8 |


In [5]:
diamonds.isna().sum()

carat      0
cut        0
color      0
clarity    0
depth      0
table      0
price      0
x          0
y          0
z          0
dtype: int64

In [6]:
diamonds["cut"].value_counts()

Ideal        21551
Premium      13791
Very Good    12082
Good          4906
Fair          1610
Name: cut, dtype: int64

In [7]:
diamonds["clarity"].value_counts()

SI1     13065
VS2     12258
SI2      9194
VS1      8171
VVS2     5066
VVS1     3655
IF       1790
I1        741
Name: clarity, dtype: int64

In [8]:
diamonds["color"].value_counts()

G    11292
E     9797
F     9542
H     8304
D     6775
I     5422
J     2808
Name: color, dtype: int64

**3. Split the data into test and train**

In [9]:
target='price'
predictors=['carat', 'x', 'y', 'z', 'cut', 'color', 'clarity']

In [10]:
from sklearn.model_selection import train_test_split
X=pd.get_dummies(diamonds[predictors])
y=diamonds[target]
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, random_state=0) 
#we set random state so we all get the same answers!
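
To see what `pd.get_dummies` does to the categorical columns before the split, here is a minimal sketch on a made-up three-row frame. The `toy` frame and its values are assumptions for illustration only, not rows from the diamonds data.

```python
import pandas as pd

# Toy frame standing in for the diamonds data (values are made up)
toy = pd.DataFrame({
    "carat": [0.3, 0.7, 1.1],
    "cut": pd.Categorical(["Ideal", "Good", "Ideal"]),
})

# Numeric columns pass through unchanged; each category level
# becomes its own indicator column named <column>_<level>
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
```

The same thing happens to `cut`, `color`, and `clarity` above, which is why `X` ends up with more columns than the `predictors` list has entries.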

**4. Look at the linear model help file from scikit-learn and train your model**

To create our model, we will use the scikit-learn library. Referencing the following documentation, we can train our model.

https://scikit-learn.org/stable/modules/linear_model.html

There are three types of regularized regressions:
1. Ridge
2. Lasso
3. ElasticNet

I'll show you one, and then you should try the rest!
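
For reference, the three models differ only in the penalty they add to the least-squares loss. The objectives below follow the scikit-learn user guide linked above. The symbols are my own shorthand: $w$ for the coefficients, $n$ for the number of training rows, $\alpha$ for the penalty strength, and $\rho$ for ElasticNet's `l1_ratio`; the intercept is left out for brevity.

$$
\text{Ridge:} \quad \min_w \; \lVert Xw - y \rVert_2^2 + \alpha \lVert w \rVert_2^2
$$

$$
\text{Lasso:} \quad \min_w \; \frac{1}{2n} \lVert Xw - y \rVert_2^2 + \alpha \lVert w \rVert_1
$$

$$
\text{ElasticNet:} \quad \min_w \; \frac{1}{2n} \lVert Xw - y \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
$$

Setting $\rho = 1$ in the ElasticNet objective recovers the Lasso objective.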

In [110]:
#Ridge model first - make sure we optimize alpha!
import numpy as np
from sklearn import linear_model

#13 log-spaced candidates: 1e-6, 1e-5, ..., 1e6 (a 20-point grid would not contain 10.0)
reg = linear_model.RidgeCV(alphas=np.logspace(-6, 6, 13))
reg.fit(X_train, y_train)
reg.alpha_

10.0

In [111]:
from sklearn import metrics
r2=metrics.r2_score(y_train, reg.predict(X_train))
print(r2)

#not bad! Let's check for overfitting

0.9188417562348511


In [100]:
r2=metrics.r2_score(y_test, reg.predict(X_test))
print(r2)
#perfect. Test R2 is about the same as train R2, so no sign of overfitting.

0.9205064503948074


For a Lasso regression, you just need to change RidgeCV to LassoCV. For ElasticNet - you guessed it! ElasticNetCV. Try both, and see how they perform compared to Ridge.
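
One way to run that comparison is to loop over the three estimators and collect train and test R2 for each. This is only a sketch: it uses synthetic data from `make_regression` so it runs on its own, and the names `X_demo`, `y_demo`, `Xtr`, `Xte`, `ytr`, `yte`, and `scores` are placeholders I made up. Swap in `X_train`, `X_test`, `y_train`, and `y_test` from step 3 to compare on the diamonds data.

```python
import numpy as np
from sklearn import linear_model, metrics
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the diamonds split (assumption for illustration)
X_demo, y_demo = make_regression(n_samples=300, n_features=10, n_informative=4,
                                 noise=10.0, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

models = {
    "ridge": linear_model.RidgeCV(alphas=np.logspace(-6, 6, 13)),
    "lasso": linear_model.LassoCV(cv=5),
    "elasticnet": linear_model.ElasticNetCV(cv=5, l1_ratio=[.1, .5, .9, 1]),
}

# Fit each model and store (train R2, test R2)
scores = {}
for name, model in models.items():
    model.fit(Xtr, ytr)
    scores[name] = (
        metrics.r2_score(ytr, model.predict(Xtr)),
        metrics.r2_score(yte, model.predict(Xte)),
    )
    print(f"{name}: train R2={scores[name][0]:.3f}, test R2={scores[name][1]:.3f}")
```

A large gap between the two numbers for any model would be a hint of overfitting.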

There are other ways to play with hyperparameter tuning. For example:

In [102]:
#let sklearn do it all for you: it picks its own alpha grid and scores it with 10-fold cross-validation
reg2 = linear_model.LassoCV(cv=10)
reg2.fit(X_train, y_train)
reg2.alpha_

3.9611066118293086

In [103]:
r2=metrics.r2_score(y_train, reg2.predict(X_train))
print(r2)

0.918411972391226


In [109]:
#elastic net has two parameters (alpha and l1_ratio): pass a list of l1_ratio values and let sklearn search over both
reg3 = linear_model.ElasticNetCV(cv=10, l1_ratio = [.1, .5, .7, .9, .95, .99, 1])
reg3.fit(X_train, y_train)
reg3.alpha_

3.9611066118293086
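
After fitting, `ElasticNetCV` stores the selected mixing parameter in `l1_ratio_` alongside `alpha_`, and the fitted coefficients in `coef_`. The sketch below reads all three back on synthetic data; `X_demo`, `y_demo`, `enet`, and `n_zero` are made-up names, and the data is not the diamonds set. The same attributes should be available on `reg3` above.

```python
import numpy as np
from sklearn import linear_model
from sklearn.datasets import make_regression

# Synthetic stand-in data (assumption for illustration)
X_demo, y_demo = make_regression(n_samples=200, n_features=8, n_informative=3,
                                 noise=5.0, random_state=0)

enet = linear_model.ElasticNetCV(cv=5, l1_ratio=[.1, .5, .9, 1])
enet.fit(X_demo, y_demo)

# Both tuned parameters are stored on the fitted estimator
print("alpha:", enet.alpha_)
print("l1_ratio:", enet.l1_ratio_)

# Count how many coefficients the L1 part of the penalty set to exactly zero
n_zero = int(np.sum(enet.coef_ == 0))
print("coefficients shrunk to exactly zero:", n_zero)
```

If `l1_ratio_` comes back as 1, the selected model is a pure Lasso.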