# Ridge and Lasso Regression - Lab

## Introduction

In this lab, you'll practice Ridge and Lasso regression!

## Objectives

You will be able to:

- Use Lasso and Ridge regression in Python
- Compare Lasso and Ridge with standard linear regression

## Housing Prices Data

Let's look at yet another house pricing data set.

In [2]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('Housing_Prices/train.csv')

Look at `df.info()`.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

We'll make a first selection of the data by removing the columns with `dtype = object`, so that our first model only contains **numeric features**.

Make sure to remove the `SalePrice` column from the predictors (which you store in `X`), then replace missing values with the median of each feature.

Store the target in `y`.
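The steps above can be sketched on a small made-up frame. The column values and the names `toy`, `X_toy` and `y_toy` are assumptions for illustration only, and the sketch selects numeric columns directly, which leaves the same set as dropping the object-typed ones here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the housing frame
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Keep only the numeric columns (this drops the text column "Street")
numeric = toy.select_dtypes(include="number")

# Separate predictors from the target
X_toy = numeric.drop(columns="SalePrice")
y_toy = toy["SalePrice"]

# Replace each missing value with the median of its own column
X_toy = X_toy.fillna(X_toy.median())

print(X_toy)
```

Calling `fillna` with a Series of medians fills each column with its own median, so no loop over columns is needed.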

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
# Remove "object"-type features; df2 keeps SalePrice for the first model below
df2 = df.drop(columns=df.select_dtypes(['object']).columns)

# Impute null values with the median of each feature
df2 = df2.fillna(df2.median())

# Remove SalePrice from the predictors stored in `X`
X = df2.drop('SalePrice', axis=1)

# Create y
y = df['SalePrice']



Look at the information of `X` again

## Let's use this data to perform a first naive linear regression model

Compute the R squared and the MSE for both train and test set.
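As a minimal sketch of the two metrics, the block below fits a linear regression on synthetic data and reports R-squared and MSE for both splits. The data, the variable names, and the use of scikit-learn's `train_test_split` (rather than the `sample`/`drop` split used in this lab) are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: three predictors, a linear signal, and some noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Hold out 20% of the rows as a test set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# Fit on the training split only
model = LinearRegression().fit(X_tr, y_tr)

# Report both metrics on each split
for name, X_part, y_part in [("train", X_tr, y_tr), ("test", X_te, y_te)]:
    pred = model.predict(X_part)
    print(name, "R2:", r2_score(y_part, pred), "MSE:", mean_squared_error(y_part, pred))
```

For a regressor, `model.score(X, y)` returns the same R-squared as `r2_score(y, model.predict(X))`.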

In [5]:
from sklearn.metrics import mean_squared_error, mean_squared_log_error
from sklearn.linear_model import LinearRegression
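# Split in train and test
dftest=df2.sample(frac=.2,random_state=1)
dftrain=df2.drop(dftest.index)
# Fit the model on the training split and print R-squared for test, then train
regression=LinearRegression()
regression.fit(dftrain.drop('SalePrice',axis=1),dftrain['SalePrice'])
print(regression.score(dftest.drop('SalePrice',axis=1),dftest['SalePrice']))
print(regression.score(dftrain.drop('SalePrice',axis=1),dftrain['SalePrice']))

0.9120350638422956
0.7084318143222732

Note: the two scores shown were recorded with the model fit on `dftest` (the 20% sample) rather than `dftrain`, so 0.912 is an in-sample R-squared and 0.708 is the R-squared on the remaining 80%. With the model fit on `dftrain`, as in the cell above, the values will differ.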


## Normalize your data

We haven't scaled our data yet. Let's create a new model that uses `preprocessing.MinMaxScaler` to scale our predictors to the [0, 1] range!
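As a small illustration of what min-max scaling does, the sketch below applies the formula `(x - min) / (max - min)` by hand and compares it with scikit-learn's `MinMaxScaler`. The array `values` is made up for this example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: two features on very different scales
values = np.array([[1.0, 200.0],
                   [2.0, 400.0],
                   [4.0, 1000.0]])

# Manual min-max scaling, column by column
col_min = values.min(axis=0)
col_max = values.max(axis=0)
manual = (values - col_min) / (col_max - col_min)

# The same transformation with scikit-learn
scaled = MinMaxScaler().fit_transform(values)

print(scaled)
```

After scaling, every column spans the interval from 0 to 1, so no single feature dominates because of its units.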

In [6]:
from sklearn import preprocessing

# Scale the predictors to the [0, 1] range (the train-test split happens in the next cell)
scale = preprocessing.MinMaxScaler()
transformed = scale.fit_transform(X)
X = pd.DataFrame(transformed, columns = X.columns)



Perform the same linear regression on this data and print out R-squared and MSE.

In [7]:
# Split the scaled predictors in train and test
df2=X
dftest=df2.sample(frac=.2,random_state=1)
dftrain=df2.drop(dftest.index)
# Fit the model on the training split and print R-squared for test, then train
regression=LinearRegression()
regression.fit(dftrain,y[dftrain.index])
print(regression.score(dftest,y[dftest.index]))
print(regression.score(dftrain,y[dftrain.index]))

0.8146371673340913
0.8047661304423562


## Include dummy variables

We haven't included dummy variables so far. Let's go back to our "object" variables and create dummies for them.
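As a quick sketch of what dummy encoding produces, the block below runs `pd.get_dummies` on a tiny made-up frame. The frame `toy_cat` and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical categorical data with one missing value
toy_cat = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],
    "Alley": ["Grvl", None, "Pave"],
})

# One indicator column per (column, category) pair
dummies = pd.get_dummies(toy_cat)

print(dummies.columns.tolist())
print(dummies)
```

By default a missing value gets no indicator of its own, so the second row has no `Alley_*` column set.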

In [36]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 37 columns):
Id               1460 non-null float64
MSSubClass       1460 non-null float64
LotFrontage      1460 non-null float64
LotArea          1460 non-null float64
OverallQual      1460 non-null float64
OverallCond      1460 non-null float64
YearBuilt        1460 non-null float64
YearRemodAdd     1460 non-null float64
MasVnrArea       1460 non-null float64
BsmtFinSF1       1460 non-null float64
BsmtFinSF2       1460 non-null float64
BsmtUnfSF        1460 non-null float64
TotalBsmtSF      1460 non-null float64
1stFlrSF         1460 non-null float64
2ndFlrSF         1460 non-null float64
LowQualFinSF     1460 non-null float64
GrLivArea        1460 non-null float64
BsmtFullBath     1460 non-null float64
BsmtHalfBath     1460 non-null float64
FullBath         1460 non-null float64
HalfBath         1460 non-null float64
BedroomAbvGr     1460 non-null float64
KitchenAbvGr     1460 non-null floa

In [44]:
import numpy as np

In [47]:
# Create X_cat which contains only the categorical variables
X_cat=df.select_dtypes(['object'])
np.shape(X_cat)

(1460, 43)

In [48]:
# Make dummies (get_dummies already returns a DataFrame)
cat_dum_df=pd.get_dummies(X_cat)

In [49]:
cat_dum_df

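| | MSZoning_C (all) | MSZoning_FV | MSZoning_RH | MSZoning_RL | MSZoning_RM | Street_Grvl | Street_Pave | Alley_Grvl | Alley_Pave | LotShape_IR1 | ... | SaleType_ConLw | SaleType_New | SaleType_Oth | SaleType_WD | SaleCondition_Abnorml | SaleCondition_AdjLand | SaleCondition_Alloca | SaleCondition_Family | SaleCondition_Normal | SaleCondition_Partial |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 5 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 6 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 7 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 8 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |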


Merge the dummy variables created from `X_cat` with our scaled `X` so you have one big predictor dataframe.
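A column-wise merge of this kind can be sketched with `pd.concat` and `axis=1`. The two small frames below (`num` and `dum`) are made up for illustration.

```python
import pandas as pd

# Hypothetical scaled numeric predictors and dummy columns sharing the same index
num = pd.DataFrame({"LotArea": [0.1, 0.5, 0.9]})
dum = pd.DataFrame({"Street_Grvl": [0, 1, 0], "Street_Pave": [1, 0, 1]})

# axis=1 places the frames side by side, aligning rows on the index
combined = pd.concat([num, dum], axis=1)

print(combined)
```

Because rows are aligned on the index, both frames need matching index labels; otherwise the result contains missing values for the unmatched rows.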

In [50]:
df_cat=pd.concat([df2,cat_dum_df],axis=1)

In [51]:
df_cat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Columns: 289 entries, Id to SaleCondition_Partial
dtypes: float64(37), uint8(252)
memory usage: 781.4 KB


Perform the same linear regression on this data and print out R-squared and MSE.

In [52]:
dftest=df_cat.sample(frac=.2,random_state=1)
dftrain=df_cat.drop(dftest.index)
# Fit the model on the training split and print R-squared for test, then train
regression=LinearRegression()
regression.fit(dftrain,y[dftrain.index])
print(regression.score(dftest,y[dftest.index]))
print(regression.score(dftrain,y[dftrain.index]))

-1.9225292576469821e+18
0.9348567582254832


Notice the severe overfitting above: our training R-squared is quite high, but the testing R-squared is hugely negative! Our predictions on the test set are far off. Similarly, the testing MSE is orders of magnitude higher than the training MSE.

## Perform Ridge and Lasso regression

Use all the data (normalized features and dummy categorical variables) and perform both Lasso and Ridge regression! Each time, look at R-squared and MSE.
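For reference, these are the objective functions that scikit-learn's `Ridge` and `Lasso` minimize, as stated in the library's documentation. Here $w$ is the coefficient vector, $n$ is the number of training samples, and $\alpha$ is the regularization parameter passed as `alpha`.

$$
\text{Ridge:}\quad \min_{w}\ \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
$$

$$
\text{Lasso:}\quad \min_{w}\ \frac{1}{2n}\lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

The squared-error term is scaled differently in the two objectives, so the same value of `alpha` does not imply the same strength of regularization in both models.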

## Lasso

With default parameter (alpha = 1)

In [53]:
from sklearn.linear_model import Lasso, Ridge
# Fit Ridge and Lasso with alpha = 1 on the training split
ridge=Ridge(alpha=1)
ridge.fit(dftrain,y[dftrain.index])
lasso=Lasso(alpha=1)
lasso.fit(dftrain,y[dftrain.index])
# Lasso R-squared: test, then train
print(lasso.score(dftest,y[dftest.index]))
print(lasso.score(dftrain,y[dftrain.index]))
# Ridge R-squared: test, then train
print(ridge.score(dftest,y[dftest.index]))
print(ridge.score(dftrain,y[dftrain.index]))

0.8699059415538675
0.9347583173999161
0.8689879902765233
0.9188230828252542




With a higher regularization parameter (alpha = 10)

In [54]:
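# Fit Ridge and Lasso with alpha = 10 on the training split
ridge=Ridge(alpha=10)
ridge.fit(dftrain,y[dftrain.index])
lasso=Lasso(alpha=10)
lasso.fit(dftrain,y[dftrain.index])
# Lasso R-squared: test, then train
print(lasso.score(dftest,y[dftest.index]))
print(lasso.score(dftrain,y[dftrain.index]))
# Ridge R-squared: test, then train
print(ridge.score(dftest,y[dftest.index]))
print(ridge.score(dftrain,y[dftrain.index]))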

0.8799054308760433
0.9332155928641535
0.8503609112713046
0.8904422954870598


## Ridge

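With default parameter (alpha = 1)

The Ridge model with alpha = 1 was fit in the same cell as Lasso above. Its test and train R-squared are the third and fourth values printed there (about 0.869 and 0.919).

With a higher regularization parameter (alpha = 10)

The Ridge model with alpha = 10 was likewise fit in the same cell as Lasso above. Its test and train R-squared are the third and fourth values printed there (about 0.850 and 0.890).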

## Look at the metrics, what are your main conclusions?

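Both regularization methods greatly reduced overfitting to the training data and drastically improved generalization to the test data: the test R-squared rose from a huge negative value to roughly 0.85 to 0.88, while the training R-squared stayed between about 0.89 and 0.93.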

## Compare number of parameter estimates that are (very close to) 0 for Ridge and Lasso

In [72]:
np.sum(abs(lasso.coef_)<.1)

73

In [73]:
np.sum(abs(ridge.coef_)<.1)

3

Compare with the total length of the parameter space and draw conclusions!

In [74]:
len(lasso.coef_)

289

Using L1 regularization (Lasso) gave us automatic feature selection: 73 of the 289 coefficients, about a fourth, were shrunk to within 0.1 of zero. Ridge, on the other hand, shrank only 3 coefficients that close to zero.
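The same contrast can be reproduced on synthetic data where only a few features matter. In this sketch the data, the choice of `alpha=0.5`, and the names `lasso_demo` and `ridge_demo` are assumptions for illustration, not the housing models fit above.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 20 features, of which only the first 3 influence the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 20))
true_coef = np.zeros(20)
true_coef[:3] = [5.0, -3.0, 2.0]
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=200)

# Fit both regularized models with the same alpha
lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=0.5).fit(X_demo, y_demo)

# Count coefficients that are exactly zero in each model
print("Lasso coefficients exactly 0:", np.sum(lasso_demo.coef_ == 0))
print("Ridge coefficients exactly 0:", np.sum(ridge_demo.coef_ == 0))
```

The L1 penalty can set coefficients exactly to zero, while the L2 penalty only shrinks them toward zero, which is why the Ridge count is typically zero here.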

## Summary

Great! You now know how to perform Lasso and Ridge regression.