# Ridge and Lasso Regression - Lab

## Introduction

In this lab, you'll practice applying Ridge and Lasso regression!

## Objectives

You will be able to:

- Use Lasso and Ridge regression in Python
- Compare Lasso and Ridge with standard linear regression

## Housing Prices Data

Let's look at yet another house pricing data set.

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('Housing_Prices/train.csv')

Look at `df.info()`.

In [2]:
# Your code here
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

First, select a subset of the data by removing the columns with `dtype = object`, so that the first model only contains numeric columns, which we treat as **continuous features**.

Make sure to remove the `SalePrice` column from the predictors (which you store in `X`), then replace missing values with the median of each feature.

Store the target in `y`.

In [3]:
# Load necessary packages
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import Lasso, Ridge, LinearRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Remove "object"-type features and SalePrice from `X`
features = [col for col in df.columns if df[col].dtype in [np.float64, np.int64] and col != 'SalePrice']
X = df[features].copy()  # copy so the imputation below does not write into a view of df

# Impute null values with the median of each column
for col in X:
    med = X[col].median()
    X[col] = X[col].fillna(med)

# Create y
y = df['SalePrice']
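
As a small illustration of the median imputation used above, the following sketch applies the same idea to a toy DataFrame. The values are made up for the example and are not taken from the housing data; `toy` and `toy_imputed` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with one missing entry per column
toy = pd.DataFrame({
    "LotFrontage": [65.0, 80.0, np.nan, 60.0],
    "MasVnrArea": [196.0, np.nan, 162.0, 0.0],
})

# DataFrame.median() skips NaN by default; fillna matches the medians by column name
toy_imputed = toy.fillna(toy.median())

print(toy_imputed)
```

Each missing entry is replaced by the median of the observed values in its own column, so no information leaks between features.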

Look at the first rows of `X` (transposed for readability).

In [4]:
X.head().transpose()

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Id | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
| MSSubClass | 60.0 | 20.0 | 60.0 | 70.0 | 60.0 |
| LotFrontage | 65.0 | 80.0 | 68.0 | 60.0 | 84.0 |
| LotArea | 8450.0 | 9600.0 | 11250.0 | 9550.0 | 14260.0 |
| OverallQual | 7.0 | 6.0 | 7.0 | 7.0 | 8.0 |
| OverallCond | 5.0 | 8.0 | 5.0 | 5.0 | 5.0 |
| YearBuilt | 2003.0 | 1976.0 | 2001.0 | 1915.0 | 2000.0 |
| YearRemodAdd | 2003.0 | 1976.0 | 2002.0 | 1970.0 | 2000.0 |
| MasVnrArea | 196.0 | 0.0 | 162.0 | 0.0 | 350.0 |
| BsmtFinSF1 | 706.0 | 978.0 | 486.0 | 216.0 | 655.0 |


## Let's use this data to fit a first naive linear regression model

Compute the R-squared and the MSE for both the training and test sets.

In [5]:
import sklearn.metrics as metric

# Split in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit the model and print R2 and MSE for train and test
linreg = LinearRegression()
linreg.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [6]:
#train
y_hat_train = linreg.predict(X_train)
metric.r2_score(y_train, y_hat_train), metric.mean_squared_error(y_train, y_hat_train)

(0.8046400702344444, 1186117094.2978184)

In [7]:
#test
y_hat_test = linreg.predict(X_test)
metric.r2_score(y_test, y_hat_test), metric.mean_squared_error(y_test, y_hat_test)

(0.824086540554548, 1232328149.8815506)
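
To make the two metrics concrete, here is a minimal sketch that computes MSE and R-squared from their definitions on made-up numbers and checks them against `sklearn.metrics`. The arrays `y_true` and `y_pred` are hypothetical. Because MSE is expressed in squared dollars, its square root (RMSE) is often easier to read: a test MSE of about 1.23e9, as above, corresponds to a typical error of roughly 35,000 dollars.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

# Made-up sale prices and predictions, for illustration only
y_true = np.array([200000.0, 150000.0, 320000.0, 180000.0])
y_pred = np.array([210000.0, 140000.0, 300000.0, 190000.0])

# MSE: mean of the squared residuals
mse = np.mean((y_true - y_pred) ** 2)

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# RMSE is on the same scale as the target
rmse = np.sqrt(mse)

print(mse, r2, rmse)
print(mean_squared_error(y_true, y_pred), r2_score(y_true, y_pred))
```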

## Normalize your data

We haven't normalized our data yet. Let's create a new model that uses `preprocessing.scale` to scale our predictors!

In [8]:
from sklearn import preprocessing

# Scale the data and perform train test split
X_scaled = preprocessing.scale(X)
X_train_scaled, X_test_scaled, y_train, y_test = train_test_split(X_scaled, y)

Perform the same linear regression on this data and print out R-squared and MSE.

In [9]:
# Your code here
linreg_scaled = LinearRegression()
linreg_scaled.fit(X_train_scaled, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [10]:
y_hat_train_scaled = linreg_scaled.predict(X_train_scaled)
metric.r2_score(y_train, y_hat_train_scaled), metric.mean_squared_error(y_train, y_hat_train_scaled)

(0.8200239675757008, 1188075258.6318893)

In [11]:
y_hat_test_scaled = linreg_scaled.predict(X_test_scaled)
metric.r2_score(y_test, y_hat_test_scaled), metric.mean_squared_error(y_test, y_hat_test_scaled)

(0.7781210759159756, 1203095283.6889532)
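
The cells above scale the full data set before splitting, so the test rows influence the means and standard deviations used for scaling. One common alternative, sketched here on synthetic data, is to split first and fit a `StandardScaler` on the training rows only. The names `X_demo`, `y_demo`, and the generated values are assumptions for illustration, not part of the lab data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic predictors and target (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)

# Split first, so the test rows play no part in the scaling statistics
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean and std from the training rows
X_te_scaled = scaler.transform(X_te)      # reuse the training statistics on the test rows

print(X_tr_scaled.mean(axis=0).round(3), X_tr_scaled.std(axis=0).round(3))
```

The training columns end up with mean 0 and standard deviation 1, while the test columns are only approximately standardized, which mirrors how the model would see genuinely new data.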

## Include dummy variables

Your model hasn't included dummy variables so far. Let's use the "object" variables again and create dummies.

In [12]:
# Create X_cat which contains only the categorical variables
features_cat = [col for col in df.columns if df[col].dtype == object]  # np.object was removed from recent NumPy versions
X_cat = df[features_cat]

np.shape(X_cat)

(1460, 43)

In [13]:
# Make dummies
X_cat = pd.get_dummies(X_cat)
np.shape(X_cat)

(1460, 252)
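
The jump from 43 categorical columns to 252 dummy columns happens because `pd.get_dummies` creates one indicator column per category level. The toy example below (made-up values, hypothetical names `toy_cat`, `dummies`, `dummies_reduced`) shows the expansion. It also shows that the indicators belonging to one original column sum to 1 in every row, which makes them collinear with the intercept; passing `drop_first=True` is one way to drop a level per column if that is a concern.

```python
import pandas as pd

# Two made-up categorical columns with two levels each
toy_cat = pd.DataFrame({
    "Street": ["Pave", "Pave", "Grvl"],
    "LotShape": ["Reg", "IR1", "Reg"],
})

# One indicator column per level: 2 columns become 4
dummies = pd.get_dummies(toy_cat)
print(dummies.columns.tolist())

# The indicators of a single original column always sum to 1 per row
street_sum = dummies[["Street_Grvl", "Street_Pave"]].sum(axis=1)
print(street_sum.tolist())

# Dropping the first level of each column removes that redundancy
dummies_reduced = pd.get_dummies(toy_cat, drop_first=True)
print(dummies_reduced.columns.tolist())
```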

In [14]:
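X_all = pd.concat([pd.DataFrame(X_scaled), X_cat], axis = 1)
X_all.columns = X_all.columns.astype(str)  # newer scikit-learn versions reject mixed int/str column names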

In [15]:
X_all.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | SaleType_ConLw | SaleType_New | SaleType_Oth | SaleType_WD | SaleCondition_Abnorml | SaleCondition_AdjLand | SaleCondition_Alloca | SaleCondition_Family | SaleCondition_Normal | SaleCondition_Partial |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.730865 | 0.073375 | -0.220875 | -0.207142 | 0.651479 | -0.5172 | 1.050994 | 0.878668 | 0.514104 | 0.575425 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | -1.728492 | -0.872563 | 0.46032 | -0.091886 | -0.071836 | 2.179628 | 0.156734 | -0.429577 | -0.57075 | 1.171992 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | -1.72612 | 0.073375 | -0.084636 | 0.07348 | 0.651479 | -0.5172 | 0.984752 | 0.830215 | 0.325915 | 0.092907 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | -1.723747 | 0.309859 | -0.44794 | -0.096897 | 0.651479 | -0.5172 | -1.863632 | -0.720298 | -0.57075 | -0.499274 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | -1.721374 | 0.073375 | 0.641972 | 0.375148 | 1.374795 | -0.5172 | 0.951632 | 0.733308 | 1.366489 | 0.463568 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |


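Merge `X_cat` together with the scaled `X` so you have one big predictor DataFrame (this is `X_all`, created above). Then split it into training and test sets and fit a linear regression.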

In [16]:
# Your code here
X_train_all, X_test_all, y_train, y_test = train_test_split(X_all, y)
linreg_all = LinearRegression()
linreg_all.fit(X_train_all, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Perform the same linear regression on this data and print out R-squared and MSE.

In [17]:
# Your code here
y_hat_train_all = linreg_all.predict(X_train_all)
metric.r2_score(y_train, y_hat_train_all), metric.mean_squared_error(y_train, y_hat_train_all)

(0.9355130588318349, 402643175.49589044)

In [18]:
y_hat_test_all = linreg_all.predict(X_test_all)
metric.r2_score(y_test, y_hat_test_all), metric.mean_squared_error(y_test, y_hat_test_all)

(-5.875210528055431e+21, 3.8150545681927415e+31)

Notice the severe overfitting above: the training R-squared is quite high, but the testing R-squared is negative, which means the test predictions are far worse than simply predicting the mean. Similarly, the testing MSE is many orders of magnitude higher than the training MSE.
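
A negative R-squared can look surprising, so here is a minimal sketch on made-up numbers (not the lab data). Predicting the mean for every observation gives an R-squared of exactly 0, and a single wildly wrong prediction is enough to push the score far below 0. The arrays `baseline` and `wild` are hypothetical.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([100.0, 200.0, 300.0, 400.0])

# Baseline: always predict the mean of the observed values
baseline = np.full_like(y_true, y_true.mean())

# Three perfect predictions and one that is far off
wild = np.array([100.0, 200.0, 300.0, 4000.0])

r2_baseline = r2_score(y_true, baseline)
r2_wild = r2_score(y_true, wild)

print(r2_baseline, r2_wild)
```

The second score is about -258, even though three of the four predictions are exact. A few extreme test predictions, as can happen with an unregularized model on hundreds of collinear dummy columns, dominate the metric in the same way.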

## Perform Ridge and Lasso regression

Use all the data (normalized features and dummy categorical variables) to fit both a Lasso and a Ridge regression. Each time, look at R-squared and MSE.
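
As a reminder of what the two estimators minimize, the objectives below follow the scikit-learn documentation for `Ridge` and `Lasso`. The notation is an assumption made for this note: $X$ is the predictor matrix, $y$ the target, $w$ the coefficient vector, $n$ the number of training samples, and $\alpha$ the regularization parameter. The intercept is not penalized.

$$
\text{Ridge:}\quad \min_{w}\; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
$$

$$
\text{Lasso:}\quad \min_{w}\; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

Because the squared-error term is scaled differently in the two objectives, the same value of `alpha` does not imply the same amount of shrinkage for Ridge and Lasso.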

## Lasso

With default parameter (alpha = 1)

In [19]:
# Your code here
lasso = Lasso()
lasso.fit(X_train_all, y_train)

Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
      normalize=False, positive=False, precompute=False, random_state=None,
      selection='cyclic', tol=0.0001, warm_start=False)

In [20]:
y_h_lasso_train = lasso.predict(X_train_all)
metric.r2_score(y_train, y_h_lasso_train), metric.mean_squared_error(y_train, y_h_lasso_train)

(0.9455084807589708, 340233820.1712495)

In [21]:
y_h_lasso_test = lasso.predict(X_test_all)
metric.r2_score(y_test, y_h_lasso_test), metric.mean_squared_error(y_test, y_h_lasso_test)

(0.8064161178982726, 1257032527.1137457)

With a higher regularization parameter (alpha = 10)

In [22]:
# Your code here
lasso = Lasso(alpha=10)
lasso.fit(X_train_all, y_train)

Lasso(alpha=10, copy_X=True, fit_intercept=True, max_iter=1000, normalize=False,
      positive=False, precompute=False, random_state=None, selection='cyclic',
      tol=0.0001, warm_start=False)

In [23]:
y_h_lasso_train = lasso.predict(X_train_all)
metric.r2_score(y_train, y_h_lasso_train), metric.mean_squared_error(y_train, y_h_lasso_train)

(0.9441542045192635, 348689642.01326275)

In [24]:
y_h_lasso_test = lasso.predict(X_test_all)
metric.r2_score(y_test, y_h_lasso_test), metric.mean_squared_error(y_test, y_h_lasso_test)

(0.823673227621524, 1144973878.3740826)

## Ridge

With default parameter (alpha = 1)

In [25]:
# Your code here
ridge = Ridge()
ridge.fit(X_train_all, y_train)

Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
      normalize=False, random_state=None, solver='auto', tol=0.001)

In [26]:
y_h_ridge_train = ridge.predict(X_train_all)
metric.r2_score(y_train, y_h_ridge_train), metric.mean_squared_error(y_train, y_h_ridge_train)

(0.9276049986787073, 452019473.9276067)

In [27]:
y_h_ridge_test = ridge.predict(X_test_all)
metric.r2_score(y_test, y_h_ridge_test), metric.mean_squared_error(y_test, y_h_ridge_test)

(0.8427457082485419, 1021127160.0387732)

With a higher regularization parameter (alpha = 10)

In [28]:
# Your code here
ridge = Ridge(alpha=10)
ridge.fit(X_train_all, y_train)

Ridge(alpha=10, copy_X=True, fit_intercept=True, max_iter=None, normalize=False,
      random_state=None, solver='auto', tol=0.001)

In [29]:
y_h_ridge_train = ridge.predict(X_train_all)
metric.r2_score(y_train, y_h_ridge_train), metric.mean_squared_error(y_train, y_h_ridge_train)

(0.9036030167767437, 601882904.1994766)

In [30]:
y_h_ridge_test = ridge.predict(X_test_all)
metric.r2_score(y_test, y_h_ridge_test), metric.mean_squared_error(y_test, y_h_ridge_test)

(0.846021578640885, 999855370.295347)
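
Rather than trying two values of `alpha` by hand, the comparison can be run over a small grid. The sketch below does this for Ridge on synthetic data from `make_regression`; the data, the grid, and the name `results` are assumptions for illustration and do not reproduce the lab's numbers.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic problem with many uninformative features (illustration only)
X_syn, y_syn = make_regression(n_samples=200, n_features=100, n_informative=10,
                               noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=0)

# Fit one Ridge model per alpha and record train and test R-squared
results = {}
for alpha in [0.01, 0.1, 1, 10, 100]:
    model = Ridge(alpha=alpha).fit(X_tr, y_tr)
    results[alpha] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"alpha={alpha:>6}: train R2={results[alpha][0]:.3f}, test R2={results[alpha][1]:.3f}")
```

The training R-squared can only decrease as `alpha` grows, so the value to watch is the test score. scikit-learn also provides `RidgeCV` and `LassoCV`, which choose `alpha` by cross-validation instead of a single train/test split.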

## Look at the metrics, what are your main conclusions?   

Conclusions here

In [31]:
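# Ridge performs best on the test data: test R-squared is about 0.843 for alpha=1 and 0.846 for alpha=10,
# compared with about 0.806 (alpha=1) and 0.824 (alpha=10) for Lasso.
# Both regularized models avoid the extreme test error of the unregularized regression on the same split.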

## Compare number of parameter estimates that are (very close to) 0 for Ridge and Lasso

Compare with the total number of parameters and draw conclusions!

In [32]:
# Number of Ridge (alpha=10) coefficients with absolute value below 1
(abs(ridge.coef_)<1).sum()

11

In [33]:
# Number of Lasso (alpha=10) coefficients with absolute value below 1
(abs(lasso.coef_)<1).sum()

74

Lasso effectively performed variable selection: 74 of its 289 coefficients (about 25%) are close to zero, compared with only 11 for Ridge, which essentially removes those variables from your model!

In [34]:
# your code here
(abs(lasso.coef_)<1).sum()/len(lasso.coef_)

0.2560553633217993
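
The threshold of 1 used above counts coefficients that are small; Lasso can also set coefficients to exactly zero, which Ridge generally does not. The following sketch illustrates the difference on synthetic data where only 5 of 50 features carry signal. The data and the names `lasso_demo` and `ridge_demo` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 50 features, of which only 5 influence the target
X_syn, y_syn = make_regression(n_samples=200, n_features=50, n_informative=5,
                               noise=10.0, random_state=0)

lasso_demo = Lasso(alpha=1.0).fit(X_syn, y_syn)
ridge_demo = Ridge(alpha=1.0).fit(X_syn, y_syn)

# Count coefficients that are exactly zero in each model
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))
n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))

print("Lasso exact zeros:", n_zero_lasso, "of", len(lasso_demo.coef_))
print("Ridge exact zeros:", n_zero_ridge, "of", len(ridge_demo.coef_))
```

Ridge shrinks coefficients toward zero without eliminating them, while the L1 penalty in Lasso removes features outright, which is why Lasso is often used for variable selection.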

## Summary

Great! You now know how to perform Lasso and Ridge regression.