# Ridge and Lasso Regression - Lab

## Introduction

In this lab, you'll practice using Ridge and Lasso regression!

## Objectives

You will be able to:

- Use Lasso and Ridge regression in Python
- Compare Lasso and Ridge with standard regression

## Housing Prices Data

Let's look at yet another house pricing data set.

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('Housing_Prices/train.csv')

Look at `df.info()`.

In [3]:
# Your code here
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

We'll make a first selection of the data by removing the features with `dtype = object`, so that our first model only contains **numeric features**.

Make sure to remove the `SalePrice` column from the predictors (which you store in `X`), then replace missing values with the median of each feature.

Store the target in `y`.

In [4]:
# Load necessary packages
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Remove "object"-type features and SalePrice from `X`
features = [col for col in df.columns if df[col].dtype in [np.float64, np.int64] and col != 'SalePrice']
X = df[features].copy()  # copy so the imputation below does not act on a view of `df`

# Impute null values with the median of each column
X = X.fillna(X.median())

# Create y
y = df.SalePrice
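
As a minimal sketch of the selection and imputation steps above, here is the same idea on a tiny made-up frame. The frame `toy` and its values are hypothetical, and `select_dtypes` is used as an alternative to the list comprehension in the cell above, not as the lab's required approach.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the housing data
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 60.0],
    "LotArea": [8450, 9600, 11250, 9550],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Keep numeric columns only, then drop the target from the predictors
X_toy = toy.select_dtypes(include="number").drop(columns="SalePrice")

# Replace each missing value with the median of its own column
X_toy = X_toy.fillna(X_toy.median())

# The target is stored separately
y_toy = toy["SalePrice"]

print(X_toy)
```

Assigning the result of `fillna` back to the variable avoids modifying a slice of the original frame in place.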

Look at the information of `X` again

In [5]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 37 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
LotFrontage      1460 non-null float64
LotArea          1460 non-null int64
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
MasVnrArea       1460 non-null float64
BsmtFinSF1       1460 non-null int64
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
TotRmsAbvGrd     1460 non-null int64
F

## Let's use this data to perform a first naive linear regression model

Compute the R squared and the MSE for both train and test set.

In [7]:
from sklearn.metrics import mean_squared_error, mean_squared_log_error

# Split in train and test
X_train, X_test, y_train, y_test = train_test_split(X,y)

# Fit the model and print R2 and MSE for train and test
linreg = LinearRegression()
linreg.fit(X_train, y_train)
# R-squared on the training and test sets
print('Training r^2:', linreg.score(X_train, y_train))
print('Testing r^2:', linreg.score(X_test, y_test))
# MSE between the true targets and the model's predictions
print('Training MSE:', mean_squared_error(y_train, linreg.predict(X_train)))
print('Testing MSE:', mean_squared_error(y_test, linreg.predict(X_test)))


Training r^2: 0.8002792703767118
Testing r^2: 0.8419604129830918
Training MSE: 1276864534.0518208
Testing MSE: 951354909.433369
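
To make the two metrics concrete, the sketch below computes MSE and R-squared by hand on a few made-up numbers and compares them with scikit-learn's `mean_squared_error` and `r2_score`. The arrays `y_true` and `y_pred` are illustrative assumptions, not values from the housing data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical targets and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

# MSE: average squared residual
mse_manual = np.mean((y_true - y_pred) ** 2)

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# RMSE is in the same units as the target, which is easier to interpret
rmse = np.sqrt(mse_manual)

print("MSE:", mse_manual, "sklearn:", mean_squared_error(y_true, y_pred))
print("R^2:", r2_manual, "sklearn:", r2_score(y_true, y_pred))
print("RMSE:", rmse)
```

Because MSE is expressed in squared dollars here, taking its square root helps: a testing MSE of about 951 million corresponds to a typical error of roughly 31,000 dollars.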


## Normalize your data

We haven't normalized our data yet. Let's create a new model that uses `preprocessing.scale` to standardize our predictors (zero mean and unit variance per column)!

In [8]:
from sklearn import preprocessing

# scale the data and perform train test split
X_scaled = preprocessing.scale(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled,y)
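
`preprocessing.scale` computes each column's mean and standard deviation from all rows, including those that later end up in the test set. A commonly used alternative, sketched here on synthetic data, is to fit a `StandardScaler` on the training rows only and reuse it for the test rows. The names `X_demo`, `y_demo`, and the generated values are assumptions for illustration, and this is not the approach the lab itself takes.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic predictors and target (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)

# Split first, so the test rows play no part in the scaling statistics
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean/std on the training rows
X_te_scaled = scaler.transform(X_te)      # apply the same mean/std to the test rows

print(X_tr_scaled.mean(axis=0).round(6), X_tr_scaled.std(axis=0).round(6))
```

Passing `random_state` to `train_test_split` also makes the split reproducible, which keeps metrics comparable between runs.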


Perform the same linear regression on this data and print out R-squared and MSE.

In [9]:
# Your code here
linreg_norm = LinearRegression()
linreg_norm.fit(X_train, y_train)
print('Training r^2:', linreg_norm.score(X_train, y_train))
print('Testing r^2:', linreg_norm.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, linreg_norm.predict(X_train)))
print('Testing MSE:', mean_squared_error(y_test, linreg_norm.predict(X_test)))

Training r^2: 0.7987511627404219
Testing r^2: 0.841777187574803
Training MSE: 1285875639.5455616
Testing MSE: 954600272.7646444


## Include dummy variables

We haven't included dummy variables so far: let's use our "object" variables again and create dummies

In [10]:
# Create X_cat which contains only the categorical variables
features_cat = [col for col in df.columns if df[col].dtype == object]  # `np.object` was removed in NumPy 1.24
X_cat = df[features_cat]

np.shape(X_cat)

(1460, 43)

In [11]:
# Make dummies
X_cat = pd.get_dummies(X_cat)
np.shape(X_cat)

(1460, 252)
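
The jump from 43 to 252 columns happens because `pd.get_dummies` creates one indicator column per distinct category value. A small sketch on a hypothetical two-column frame (`toy_cat` and its values are made up) shows the expansion:

```python
import pandas as pd

# Hypothetical categorical frame
toy_cat = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],
    "LotShape": ["Reg", "IR1", "IR2"],
})

# One indicator column per (column, category) pair: 2 + 3 = 5 columns
dummies = pd.get_dummies(toy_cat)

print(dummies.columns.tolist())
print(dummies.shape)
```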

Merge `X_cat` with our scaled `X` so you have one big predictor DataFrame.

In [12]:
# Your code here
# Keep the original column names so that all feature names are strings
X_all = pd.concat([pd.DataFrame(X_scaled, columns=X.columns), X_cat], axis=1)

Perform the same linear regression on this data and print out R-squared and MSE.

In [13]:
# Your code here
X_train, X_test, y_train, y_test = train_test_split(X_all, y)
linreg_all = LinearRegression()
linreg_all.fit(X_train, y_train)
print('Training r^2:', linreg_all.score(X_train, y_train))
print('Testing r^2:', linreg_all.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, linreg_all.predict(X_train)))
print('Testing MSE:', mean_squared_error(y_test, linreg_all.predict(X_test)))

Training r^2: 0.9378350168099928
Testing r^2: -4.86154676687089e+16
Training MSE: 377416423.7379076
Testing MSE: 3.407271789418969e+26


Notice the severe overfitting above: our training R squared is quite high, but the testing R squared is negative! Our predictions are far off. Similarly, the testing MSE is orders of magnitude higher than the training MSE.
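
One plausible contributor to such unstable test predictions is that the full set of dummy columns for a category always sums to one, which duplicates the intercept and makes the design matrix rank deficient. The sketch below illustrates this on a hypothetical single-column example (`toy_cat` is made up). It is offered as a possible explanation, not a diagnosis of the run above.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two levels
toy_cat = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave", "Grvl", "Pave"]})

# Full dummy encoding: Street_Grvl + Street_Pave == 1 on every row
full = pd.get_dummies(toy_cat).to_numpy(dtype=float)

# Design matrix with an explicit intercept column plus both dummies
design = np.column_stack([np.ones(len(full)), full])

# Three columns, but only two are linearly independent
rank = np.linalg.matrix_rank(design)
print("columns:", design.shape[1], "rank:", rank)
```

With perfectly collinear columns, unpenalized least squares has no unique solution, so coefficient values can become very large. A penalty on coefficient size is one way to keep them in check.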

## Perform Ridge and Lasso regression

Use all the data (normalized features and dummy categorical variables) and perform both Lasso and Ridge regression. Each time, look at R-squared and MSE.
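
For reference, the objectives being minimized can be written as follows, using the form given in the scikit-learn documentation. The symbols are notation introduced here: $X$ is the predictor matrix, $y$ the target, $w$ the coefficient vector, $n$ the number of training samples, and $\alpha$ the regularization strength (the `alpha` argument). The intercept is not penalized.

```latex
\begin{aligned}
\text{OLS:}   \quad & \min_{w} \; \lVert y - Xw \rVert_2^2 \\
\text{Ridge:} \quad & \min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2 \\
\text{Lasso:} \quad & \min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
\end{aligned}
```

Because the two estimators scale the squared-error term differently, the same numeric `alpha` does not imply the same amount of shrinkage in `Ridge` and `Lasso`.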

## Lasso

With default parameter (alpha = 1)

In [14]:
# Your code here
from sklearn.linear_model import Lasso, Ridge

lasso = Lasso() 
lasso.fit(X_train, y_train)
print('Training r^2:', lasso.score(X_train, y_train))
print('Testing r^2:', lasso.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, lasso.predict(X_train)))
print('Testing MSE:', mean_squared_error(y_test, lasso.predict(X_test)))

Training r^2: 0.9377942063217
Testing r^2: 0.8743345274503859
Training MSE: 377664192.6224483
Testing MSE: 880741130.4569166


With a higher regularization parameter (alpha = 10)

In [15]:
# Your code here
lasso = Lasso(alpha=10)  # Lasso applies an L1 penalty to the coefficients
lasso.fit(X_train, y_train)
print('Training r^2:', lasso.score(X_train, y_train))
print('Testing r^2:', lasso.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, lasso.predict(X_train)))
print('Testing MSE:', mean_squared_error(y_test, lasso.predict(X_test)))

Training r^2: 0.9366202378138652
Testing r^2: 0.8821517812107345
Training MSE: 384791597.36176413
Testing MSE: 825952995.1460052


## Ridge

With default parameter (alpha = 1)

In [16]:
# Your code here
ridge = Ridge()  # Ridge applies an L2 penalty to the coefficients
ridge.fit(X_train, y_train)
print('Training r^2:', ridge.score(X_train, y_train))
print('Testing r^2:', ridge.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, ridge.predict(X_train)))
print('Testing MSE:', mean_squared_error(y_test, ridge.predict(X_test)))

Training r^2: 0.9240559879657525
Testing r^2: 0.8816260757184347
Training MSE: 461071747.39623636
Testing MSE: 829637463.4425193


With a higher regularization parameter (alpha = 10)

In [17]:
# Your code here
ridge = Ridge(alpha=10)  # Ridge applies an L2 penalty to the coefficients
ridge.fit(X_train, y_train)
print('Training r^2:', ridge.score(X_train, y_train))
print('Testing r^2:', ridge.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, ridge.predict(X_train)))
print('Testing MSE:', mean_squared_error(y_test, ridge.predict(X_test)))

Training r^2: 0.8980811805414304
Testing r^2: 0.8856473434337042
Training MSE: 618770156.0872627
Testing MSE: 801453939.348277


## Look at the metrics, what are your main conclusions?

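On this train-test split, the unregularized model with all features fits the training data well (R squared of about 0.94) but fails completely on the test data. Both Lasso and Ridge bring the testing R squared back to roughly 0.87 to 0.89. Raising alpha from 1 to 10 lowers the training R squared slightly and raises the testing R squared slightly for both methods, and Ridge with alpha = 10 shows the smallest gap between training and testing performance.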
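
To compare several models side by side, a small loop can collect the four metrics into one table. The sketch below uses synthetic data from `make_regression`, and the names `models` and `summary` are hypothetical, so the numbers it prints say nothing about the housing data.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (illustrative only): 50 features, 10 of them informative
X_demo, y_demo = make_regression(
    n_samples=200, n_features=50, n_informative=10, noise=20.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Same estimators and alphas as in the cells above
models = {
    "linear": LinearRegression(),
    "lasso_1": Lasso(alpha=1),
    "lasso_10": Lasso(alpha=10),
    "ridge_1": Ridge(alpha=1),
    "ridge_10": Ridge(alpha=10),
}

rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    rows.append({
        "model": name,
        "train_r2": model.score(X_tr, y_tr),
        "test_r2": model.score(X_te, y_te),
        "train_mse": mean_squared_error(y_tr, model.predict(X_tr)),
        "test_mse": mean_squared_error(y_te, model.predict(X_te)),
    })

summary = pd.DataFrame(rows).set_index("model")
print(summary)
```

Fitting every model on the same split, as done here, keeps the comparison fair.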

## Compare number of parameter estimates that are (very close to) 0 for Ridge and Lasso

In [22]:
ridge.coef_

array([  -172.8131268 ,  -5152.65676822,  -2946.12721339,   4490.68455693,
        14988.98877711,   5114.66550093,   3039.00691734,   1545.56979317,
         2048.53935029,    642.70591032,   1008.29842454,    184.7482657 ,
         1225.03427924,   5416.07277993,  10887.22176555,    817.2813523 ,
        13104.4053565 ,   3241.48900316,   -444.28194048,   3653.75786914,
         2625.37056317,  -1718.12278431,  -2834.65716119,   5846.84218744,
         2993.96310616,  -2019.01732277,   9690.93794638,    259.59747463,
         2222.80220149,   -523.7244643 ,    881.00479674,   1845.50815524,
         1907.96316621,   1458.80238408,   -898.65450185,    275.79743806,
         -645.55506881,  -7802.62442741,   8510.92145356,   2024.54255522,
         2268.29661558,  -5001.13619695,  -6620.3010912 ,   6620.3010912 ,
         -693.57245245,  -2039.66705584,   3209.74866886,   7628.01629652,
       -12150.65491603,   1312.88995064, -12627.00604274,   6555.62803258,
         -398.31196795,  

In [18]:
# number of Ridge params almost zero
print(sum(abs(ridge.coef_) < 10**(-10)))

9


In [19]:
# number of Lasso params almost zero
print(sum(abs(lasso.coef_) < 10**(-10)))

73


Compare with the total number of parameters and draw conclusions!

In [20]:
# your code here
len(lasso.coef_)

289

In [21]:
sum(abs(lasso.coef_) < 10**(-10)) / len(lasso.coef_)

0.25259515570934254
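
The difference in the number of zeroed coefficients follows from the penalties: the L1 penalty of Lasso can set coefficients exactly to zero, while the L2 penalty of Ridge only shrinks them toward zero. The following sketch reproduces the count on synthetic data where only 5 of 30 features carry signal. The data, the `alpha` values, and the variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (illustrative only): 30 features, 5 of them informative
X_demo, y_demo = make_regression(
    n_samples=200, n_features=30, n_informative=5, noise=10.0, random_state=0
)

lasso_demo = Lasso(alpha=5.0).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=5.0).fit(X_demo, y_demo)

# Count coefficients whose magnitude is below a small tolerance
tol = 1e-10
n_zero_lasso = int(np.sum(np.abs(lasso_demo.coef_) < tol))
n_zero_ridge = int(np.sum(np.abs(ridge_demo.coef_) < tol))

print("Lasso (near-)zero coefficients:", n_zero_lasso, "of", len(lasso_demo.coef_))
print("Ridge (near-)zero coefficients:", n_zero_ridge, "of", len(ridge_demo.coef_))
```

This is why Lasso is often used as a form of feature selection: the features whose coefficients survive are the ones the model actually uses.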

## Summary

Great! You now know how to perform Lasso and Ridge regression.