# Ridge and Lasso Regression - Lab

## Introduction

In this lab, you'll practice fitting and comparing Ridge and Lasso regression models!

## Objectives

You will be able to:

- Use Lasso and Ridge regression in Python
- Compare Lasso and Ridge with standard regression

## Housing Prices Data

Let's look at yet another house pricing data set.

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('Housing_Prices/train.csv')

Look at `df.info()`.

We'll make a first selection of the data by removing the columns with `dtype = object`, so that our first model only contains **numeric features**.

Make sure to remove the `SalePrice` column from the predictors (which you store in `X`), then replace missing values with the median of each feature.

Store the target in `y`.

In [25]:
# Load necessary packages
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Remove "object"-type features and SalePrice from `X`
# (the built-in `object` is used because `np.object` was removed from NumPy)
non_object = [col for col in df.columns if df[col].dtype != object]
X = df[non_object].drop(columns=['SalePrice'])
# Impute null values with the median of each column
X = X.fillna(X.median())

# Create y
y = df['SalePrice']
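
To make the selection and imputation steps easier to inspect, here is a minimal sketch on a tiny hand-made frame. The frame `toy` and its values are hypothetical and only stand in for the housing data; `select_dtypes(include="number")` is used here as an alternative way to keep the numeric columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the housing data
toy = pd.DataFrame({
    "LotArea": [8450.0, np.nan, 11250.0, 9550.0],
    "OverallQual": [7, 6, 7, np.nan],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],  # categorical column
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Keep numeric columns only and separate the target from the predictors
toy_X = toy.select_dtypes(include="number").drop(columns="SalePrice")
toy_y = toy["SalePrice"]

# Replace each missing value with the median of its own column
toy_X = toy_X.fillna(toy_X.median())

print(toy_X)
```

The missing `LotArea` entry becomes the median of the three observed values, and likewise for `OverallQual`.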

Look at the information of `X` again

In [26]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 37 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
LotFrontage      1460 non-null float64
LotArea          1460 non-null int64
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
MasVnrArea       1460 non-null float64
BsmtFinSF1       1460 non-null int64
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
TotRmsAbvGrd     1460 non-null int64
...

## Let's use this data to fit a first naive linear regression model

Compute the R-squared and the MSE for both the training and test sets.

In [29]:
from sklearn.metrics import mean_squared_error, mean_squared_log_error

# Split in train and test
X_train, X_test, y_train, y_test = train_test_split(X,y)

# Fit the model and print R2 and MSE for train and test
linreg = LinearRegression()
linreg.fit(X_train, y_train)

train_r2 = linreg.score(X_train, y_train)
train_mse = mean_squared_error(y_train, linreg.predict(X_train))

test_r2 = linreg.score(X_test, y_test)
test_mse = mean_squared_error(y_test, linreg.predict(X_test))

print ('r2_train ', train_r2)
print('mse_train ', train_mse)

print ('r2_test ', test_r2)
print('mse_test ', test_mse)

r2_train  0.867050997273126
mse_train  862319767.9476825
r2_test  0.515750844768049
mse_test  2792891884.65966


## Normalize your data

We haven't normalized our data yet. Let's create a new model that uses `preprocessing.scale` to scale our predictors!

In [30]:
from sklearn import preprocessing

# Scale the predictors (the train-test split happens in the next cell)

X_scaled = preprocessing.scale(X)
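
As a quick check of what `preprocessing.scale` does, the sketch below standardizes two hypothetical columns on very different scales and prints their means and standard deviations afterwards. It also shows one common alternative, `StandardScaler` fitted on the training rows only, so that the test rows do not influence the scaling. All `demo_*` names and the random data are assumptions made for illustration.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical features on very different scales
demo_X = np.column_stack([
    rng.normal(10_000, 3_000, size=200),
    rng.normal(5, 2, size=200),
])

# preprocessing.scale standardizes each column using the full data set
demo_scaled = preprocessing.scale(demo_X)
print(demo_scaled.mean(axis=0).round(6), demo_scaled.std(axis=0).round(6))

# Alternative: learn the mean and standard deviation from the training rows only
demo_train, demo_test = train_test_split(demo_X, random_state=0)
scaler = StandardScaler().fit(demo_train)
demo_train_scaled = scaler.transform(demo_train)
demo_test_scaled = scaler.transform(demo_test)
```

After scaling, every column has mean 0 and standard deviation 1 (up to floating-point error), which puts the coefficients of a penalized model on a comparable footing.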



Perform the same linear regression on this data and print out R-squared and MSE.

In [31]:
# Your code here

X_train, X_test, y_train, y_test = train_test_split(X_scaled,y)

# Fit the model and print R2 and MSE for train and test
linreg = LinearRegression()
linreg.fit(X_train, y_train)

train_r2 = linreg.score(X_train, y_train)
train_mse = mean_squared_error(y_train, linreg.predict(X_train))

test_r2 = linreg.score(X_test, y_test)
test_mse = mean_squared_error(y_test, linreg.predict(X_test))

print ('r2_train ', train_r2)
print('mse_train ', train_mse)

print ('r2_test ', test_r2)
print('mse_test ', test_mse)

r2_train  0.8051142191317928
mse_train  1226361729.6325958
r2_test  0.8235411799371564
mse_test  1119423068.1737351


## Include dummy variables

So far we haven't included the categorical variables. Let's use our "object" variables again and create dummies.

In [35]:
# Create X_cat which contains only the categorical variables
X_cat = df[[col for col in df.columns if df[col].dtype == object]]
np.shape(X_cat)

(1460, 43)

In [36]:
# Make dummies
X_cat = pd.get_dummies(X_cat)
np.shape(X_cat)

(1460, 252)

Merge `X_cat` together with our scaled `X` so you have one big predictor dataframe.

In [43]:
X_scaled_df = pd.DataFrame(X_scaled, columns = X.columns)
X_scaled_df.head()

| | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | GarageArea | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | -1.730865 | 0.073375 | -0.220875 | -0.207142 | 0.651479 | -0.5172 | 1.050994 | 0.878668 | 0.514104 | 0.575425 | ... | 0.351 | -0.752176 | 0.216503 | -0.359325 | -0.116339 | -0.270208 | -0.068692 | -0.087688 | -1.599111 | 0.138777 |
| 1 | -1.728492 | -0.872563 | 0.46032 | -0.091886 | -0.071836 | 2.179628 | 0.156734 | -0.429577 | -0.57075 | 1.171992 | ... | -0.060731 | 1.626195 | -0.704483 | -0.359325 | -0.116339 | -0.270208 | -0.068692 | -0.087688 | -0.48911 | -0.614439 |
| 2 | -1.72612 | 0.073375 | -0.084636 | 0.07348 | 0.651479 | -0.5172 | 0.984752 | 0.830215 | 0.325915 | 0.092907 | ... | 0.631726 | -0.752176 | -0.070361 | -0.359325 | -0.116339 | -0.270208 | -0.068692 | -0.087688 | 0.990891 | 0.138777 |
| 3 | -1.723747 | 0.309859 | -0.44794 | -0.096897 | 0.651479 | -0.5172 | -1.863632 | -0.720298 | -0.57075 | -0.499274 | ... | 0.790804 | -0.752176 | -0.176048 | 4.092524 | -0.116339 | -0.270208 | -0.068692 | -0.087688 | -1.599111 | -1.367655 |
| 4 | -1.721374 | 0.073375 | 0.641972 | 0.375148 | 1.374795 | -0.5172 | 0.951632 | 0.733308 | 1.366489 | 0.463568 | ... | 1.698485 | 0.780197 | 0.56376 | -0.359325 | -0.116339 | -0.270208 | -0.068692 | -0.087688 | 2.100892 | 0.138777 |


In [46]:
# Your code here
X_new = pd.concat([X_scaled_df, X_cat], axis = 1)
X_new.head()

| | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | SaleType_ConLw | SaleType_New | SaleType_Oth | SaleType_WD | SaleCondition_Abnorml | SaleCondition_AdjLand | SaleCondition_Alloca | SaleCondition_Family | SaleCondition_Normal | SaleCondition_Partial |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | -1.730865 | 0.073375 | -0.220875 | -0.207142 | 0.651479 | -0.5172 | 1.050994 | 0.878668 | 0.514104 | 0.575425 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | -1.728492 | -0.872563 | 0.46032 | -0.091886 | -0.071836 | 2.179628 | 0.156734 | -0.429577 | -0.57075 | 1.171992 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | -1.72612 | 0.073375 | -0.084636 | 0.07348 | 0.651479 | -0.5172 | 0.984752 | 0.830215 | 0.325915 | 0.092907 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | -1.723747 | 0.309859 | -0.44794 | -0.096897 | 0.651479 | -0.5172 | -1.863632 | -0.720298 | -0.57075 | -0.499274 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | -1.721374 | 0.073375 | 0.641972 | 0.375148 | 1.374795 | -0.5172 | 0.951632 | 0.733308 | 1.366489 | 0.463568 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
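
The dummy-and-merge step can be checked on a small scale. The sketch below uses two hypothetical three-row frames, `num` and `cat`, that share the same default index; `pd.concat` with `axis=1` aligns rows on that index.

```python
import pandas as pd

# Hypothetical scaled numeric column and one categorical column
num = pd.DataFrame({"LotArea": [0.5, -1.2, 0.7]})
cat = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"]})

# One indicator column per category level
dummies = pd.get_dummies(cat)

# Side-by-side merge; rows are matched on the index
combined = pd.concat([num, dummies], axis=1)
print(combined)
```

Each category level gets its own column, so the indicator columns of one variable always sum to one per row. Passing `drop_first=True` to `pd.get_dummies` is one common way to avoid that exact linear dependence.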


Perform the same linear regression on this data and print out R-squared and MSE.

In [47]:
# Your code here

X_train, X_test, y_train, y_test = train_test_split(X_new,y)

# Fit the model and print R2 and MSE for train and test
linreg = LinearRegression()
linreg.fit(X_train, y_train)

train_r2 = linreg.score(X_train, y_train)
train_mse = mean_squared_error(y_train, linreg.predict(X_train))

test_r2 = linreg.score(X_test, y_test)
test_mse = mean_squared_error(y_test, linreg.predict(X_test))

print ('r2_train ', train_r2)
print('mse_train ', train_mse)

print ('r2_test ', test_r2)
print('mse_test ', test_mse)

r2_train  0.9465618486738968
mse_train  355285527.99336475
r2_test  -9.457745985460683e+18
mse_test  4.992060971721355e+28


Notice the severe overfitting above: our training R-squared is quite high, but the testing R-squared is negative! Our predictions are far off. Similarly, the testing MSE is orders of magnitude higher than the training MSE.
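
A negative R-squared can look surprising. R-squared is computed as 1 - SS_res / SS_tot, so it drops below zero whenever the model's squared error is larger than that of simply predicting the mean of the target. A minimal sketch with hypothetical numbers (the helper `r2_manual` is written here for illustration and is not part of scikit-learn):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([100.0, 200.0, 300.0, 400.0])

# Predicting the mean of y_true for every row gives R^2 = 0
mean_pred = np.full_like(y_true, y_true.mean())
# Hypothetical predictions that are wildly off
bad_pred = np.array([4000.0, -3000.0, 5000.0, -2000.0])

def r2_manual(y, y_hat):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

print(r2_score(y_true, mean_pred))
print(r2_score(y_true, bad_pred), r2_manual(y_true, bad_pred))
```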

## Perform Ridge and Lasso regression

Use all the data (normalized features and dummy categorical variables) and perform both Lasso and Ridge regression! Each time, look at R-squared and MSE.
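
For reference, these are the objectives that the scikit-learn documentation gives for the two estimators, where $w$ is the coefficient vector, $n$ the number of training samples, and $\alpha$ the regularization strength (the `alpha` argument):

$$
\text{Lasso:}\quad \min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2 + \alpha\,\lVert w \rVert_1
$$

$$
\text{Ridge:}\quad \min_{w}\; \lVert y - Xw \rVert_2^2 + \alpha\,\lVert w \rVert_2^2
$$

The two penalties are scaled differently relative to the squared-error term, so the same numeric `alpha` does not mean the same amount of regularization in both models.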

## Lasso

With default parameter (alpha = 1)

In [51]:
# Your code here
from sklearn.linear_model import Lasso, Ridge
lasso = Lasso()
lasso.fit(X_train, y_train)

train_r2 = lasso.score(X_train, y_train)
train_mse = mean_squared_error(y_train, lasso.predict(X_train))

test_r2 = lasso.score(X_test, y_test)
test_mse = mean_squared_error(y_test, lasso.predict(X_test))

print('lasso with alpha = 1')
print('')
print ('r2_train ', train_r2)
print('mse_train ', train_mse)
print('')
print ('r2_test ', test_r2)
print('mse_test ', test_mse)

lasso with alpha = 1

r2_train  0.9439743610344493
mse_train  372488535.38311225

r2_test  0.7988686767195288
mse_test  1061626978.2278401


With a higher regularization parameter (alpha = 10)

In [53]:
# Your code here
lasso = Lasso(alpha=10)
lasso.fit(X_train, y_train)

train_r2 = lasso.score(X_train, y_train)
train_mse = mean_squared_error(y_train, lasso.predict(X_train))

test_r2 = lasso.score(X_test, y_test)
test_mse = mean_squared_error(y_test, lasso.predict(X_test))

print('lasso with alpha = 10')
print('')
print ('r2_train ', train_r2)
print('mse_train ', train_mse)
print('')
print ('r2_test ', test_r2)
print('mse_test ', test_mse)

lasso with alpha = 10

r2_train  0.9422319474971241
mse_train  384073036.3104295

r2_test  0.8216514041386561
mse_test  941373416.1706312


## Ridge

With default parameter (alpha = 1)

In [54]:
# Your code here

ridge = Ridge()
ridge.fit(X_train, y_train)

train_r2 = ridge.score(X_train, y_train)
train_mse = mean_squared_error(y_train, ridge.predict(X_train))

test_r2 = ridge.score(X_test, y_test)
test_mse = mean_squared_error(y_test, ridge.predict(X_test))

print('ridge with alpha = 1')
print('')
print ('r2_train ', train_r2)
print('mse_train ', train_mse)
print('')
print ('r2_test ', test_r2)
print('mse_test ', test_mse)

ridge with alpha = 1

r2_train  0.9252815085056401
mse_train  496768657.6476704

r2_test  0.8699005936894559
mse_test  686700794.9731388


With a higher regularization parameter (alpha = 10)

In [55]:
# Your code here

ridge = Ridge(alpha=10)
ridge.fit(X_train, y_train)

train_r2 = ridge.score(X_train, y_train)
train_mse = mean_squared_error(y_train, ridge.predict(X_train))

test_r2 = ridge.score(X_test, y_test)
test_mse = mean_squared_error(y_test, ridge.predict(X_test))

print('ridge with alpha = 10')
print('')
print ('r2_train ', train_r2)
print('mse_train ', train_mse)
print('')
print ('r2_test ', test_r2)
print('mse_test ', test_mse)

ridge with alpha = 10

r2_train  0.897409754920297
mse_train  682075043.4962487

r2_test  0.893797393091827
mse_test  560566851.6117699


## Look at the metrics, what are your main conclusions?

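Ridge shows the smallest gap between training and test performance. With alpha = 10, the Ridge R-squared is about 0.90 on the training set and 0.89 on the test set, while Lasso with alpha = 10 scores about 0.94 on the training set and 0.82 on the test set. Both regularized models generalize far better than the unregularized regression on the same predictors, whose test R-squared was negative.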
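
The repeated fit-and-score cells can be condensed into one helper, which makes side-by-side comparison easier. This is a minimal sketch on synthetic data from `make_regression`, not the housing data; the names `evaluate`, `demo_X`, `demo_y`, and the chosen sizes are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data: many features, few informative
demo_X, demo_y = make_regression(
    n_samples=200, n_features=100, n_informative=10, noise=20.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(demo_X, demo_y, random_state=0)

def evaluate(model):
    """Fit the model and return (train R2, test R2, train MSE, test MSE)."""
    model.fit(X_tr, y_tr)
    return (
        model.score(X_tr, y_tr),
        model.score(X_te, y_te),
        mean_squared_error(y_tr, model.predict(X_tr)),
        mean_squared_error(y_te, model.predict(X_te)),
    )

models = {
    "linear": LinearRegression(),
    "lasso alpha=1": Lasso(alpha=1, max_iter=10_000),
    "lasso alpha=10": Lasso(alpha=10, max_iter=10_000),
    "ridge alpha=1": Ridge(alpha=1),
    "ridge alpha=10": Ridge(alpha=10),
}

results = {name: evaluate(model) for name, model in models.items()}
for name, (r2_tr, r2_te, mse_tr, mse_te) in results.items():
    print(f"{name:15s} r2_train={r2_tr:.3f} r2_test={r2_te:.3f} "
          f"mse_train={mse_tr:.1f} mse_test={mse_te:.1f}")
```

Because every model is scored on the same split, differences in the printed metrics come from the models rather than from a different random partition.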

## Compare number of parameter estimates that are (very close to) 0 for Ridge and Lasso

In [56]:
# number of Ridge params almost zero
print(sum(abs(ridge.coef_) < 10**(-10)))

9


In [57]:
# number of Lasso params almost zero
print(sum(abs(lasso.coef_) < 10**(-10)))

76


Compare with the total length of the parameter space and draw conclusions!

In [59]:
# your code here
n_zero = sum(abs(lasso.coef_) < 10**(-10))
print('lasso', len(lasso.coef_))
print('fraction of variables removed with lasso', n_zero/len(lasso.coef_))

lasso 289
fraction of variables removed with lasso 0.2629757785467128
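
The same comparison can be reproduced on synthetic data where only a few features carry signal. The sketch below assumes data from `make_regression` with 5 informative features out of 50; the `sparse_*` and `demo_*` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 50 features, only 5 of which influence the target
sparse_X, sparse_y = make_regression(
    n_samples=200, n_features=50, n_informative=5, noise=10.0, random_state=0
)

demo_ridge = Ridge(alpha=10).fit(sparse_X, sparse_y)
demo_lasso = Lasso(alpha=10).fit(sparse_X, sparse_y)

# Count coefficients whose absolute value is (very close to) zero
tol = 1e-10
ridge_zeros = int(np.sum(np.abs(demo_ridge.coef_) < tol))
lasso_zeros = int(np.sum(np.abs(demo_lasso.coef_) < tol))

print("ridge coefficients near zero:", ridge_zeros)
print("lasso coefficients near zero:", lasso_zeros)
print("fraction removed by lasso:", lasso_zeros / len(demo_lasso.coef_))
```

The L1 penalty can set coefficients exactly to zero, which acts as feature selection, while the L2 penalty only shrinks them toward zero.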


## Summary

Great! You now know how to perform Lasso and Ridge regression.