# Ridge and Lasso Regression - Lab

## Introduction

In this lab, you'll practice your knowledge on Ridge and Lasso regression!

## Objectives

You will be able to:

- Use Lasso and ridge regression in Python
- Compare Lasso and Ridge with standard regression

## Housing Prices Data

Let's look at yet another house pricing data set.

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('Housing_Prices/train.csv')

Look at df.info

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

We'll make a first selection of the data by removing some of the data with `dtype = object`, this way our first model only contains **continuous features**

Make sure to remove the SalesPrice column from the predictors (which you store in `X`), then replace missing inputs by the median per feature.

Store the target in `y`.

In [19]:
# Load necessary packages
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression


# remove "object"-type features and SalesPrice from `X`
features = [col for col in df.columns if df[col].dtype in [np.float64, np.int64] and col!='SalePrice']
X = df[features]

# Impute null values
for column in X:
    X[column].fillna(value= X[column].median(), inplace=True)
# Create y
y = df.SalePrice

In [20]:
X.isna().sum()

Id               0
MSSubClass       0
LotFrontage      0
LotArea          0
OverallQual      0
OverallCond      0
YearBuilt        0
YearRemodAdd     0
MasVnrArea       0
BsmtFinSF1       0
BsmtFinSF2       0
BsmtUnfSF        0
TotalBsmtSF      0
1stFlrSF         0
2ndFlrSF         0
LowQualFinSF     0
GrLivArea        0
BsmtFullBath     0
BsmtHalfBath     0
FullBath         0
HalfBath         0
BedroomAbvGr     0
KitchenAbvGr     0
TotRmsAbvGrd     0
Fireplaces       0
GarageYrBlt      0
GarageCars       0
GarageArea       0
WoodDeckSF       0
OpenPorchSF      0
EnclosedPorch    0
3SsnPorch        0
ScreenPorch      0
PoolArea         0
MiscVal          0
MoSold           0
YrSold           0
dtype: int64

Look at the information of `X` again

In [22]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 37 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
LotFrontage      1460 non-null float64
LotArea          1460 non-null int64
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
MasVnrArea       1460 non-null float64
BsmtFinSF1       1460 non-null int64
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
TotRmsAbvGrd     1460 non-null int64
F

## Let's use this data to perform a first naive linear regression model

Compute the R squared and the MSE for both train and test set.

In [29]:
from sklearn.metrics import mean_squared_error, mean_squared_log_error

# Split in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y)
# Fit the model and print R2 and MSE for train and test
linreg = LinearRegression()
linreg.fit(X_train, y_train)


LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [30]:
print(linreg.score(X_train, y_train), mean_squared_error(y_train, linreg.predict(X_train)))

0.8209433355238992 1102226136.7077594


In [31]:
print(linreg.score(X_test, y_test), mean_squared_error(y_test, linreg.predict(X_test)))

0.7839695663973767 1460342178.3368423


## Normalize your data

We haven't normalized our data, let's create a new model that uses `preprocessing.scale` to scale our predictors!

In [32]:
from sklearn import preprocessing

X_scaled = preprocessing.scale(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)
linreg = LinearRegression()
linreg.fit(X_train, y_train)




LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Perform the same linear regression on this data and print out R-squared and MSE.

In [33]:
print(linreg.score(X_train, y_train), mean_squared_error(y_train, linreg.predict(X_train)))
print(linreg.score(X_test, y_test), mean_squared_error(y_test, linreg.predict(X_test)))

0.801507729897501 1236089905.7083972
0.8287153733383981 1120793997.845457


In [36]:
X_scaled

array([[-1.73086488,  0.07337496, -0.22087509, ..., -0.08768781,
        -1.5991111 ,  0.13877749],
       [-1.7284922 , -0.87256276,  0.46031974, ..., -0.08768781,
        -0.48911005, -0.61443862],
       [-1.72611953,  0.07337496, -0.08463612, ..., -0.08768781,
         0.99089135,  0.13877749],
       ...,
       [ 1.72611953,  0.30985939, -0.1754621 , ...,  4.95311151,
        -0.48911005,  1.64520971],
       [ 1.7284922 , -0.87256276, -0.08463612, ..., -0.08768781,
        -0.8591104 ,  1.64520971],
       [ 1.73086488, -0.87256276,  0.23325479, ..., -0.08768781,
        -0.1191097 ,  0.13877749]])

## Include dummy variables

We haven't included dummy variables so far: let's use our "object" variables again and create dummies

In [34]:
features_cat = [col for col in df.columns if df[col].dtype in [np.object]]
X_cat = df[features_cat]

In [35]:
X_cat = pd.get_dummies(X_cat)

Merge `x_cat` together with our scaled `X` so you have one big predictor dataframe.

In [37]:
X_all = pd.concat([pd.DataFrame(X_scaled), X_cat], axis=1)

Perform the same linear regression on this data and print out R-squared and MSE.

In [38]:
X_train, X_test, y_train, y_test = train_test_split(X_all, y)
linreg = LinearRegression()
linreg.fit(X_train, y_train)
print(linreg.score(X_train, y_train), mean_squared_error(y_train, linreg.predict(X_train)))
print(linreg.score(X_test, y_test), mean_squared_error(y_test, linreg.predict(X_test)))


0.9366506034618687 373530044.9461187
-2.8085569128670306e+19 2.113939099002042e+29


Notice the severe overfitting above; our training R squared is quite high, but the testing R squared is negative! Our predictions are far far off. Similarly, the scale of the Testing MSE is orders of magnitude higher then that of the training.

## Perform Ridge and Lasso regression

Use all the data (normalized features and dummy categorical variables) and perform Lasso and Ridge regression for both! Each time, look at R-squared and MSE.

In [39]:
from sklearn.linear_model import Lasso, Ridge

## Lasso

With default parameter (alpha = 1)

In [40]:
X_train, X_test, y_train, y_test = train_test_split(X_all, y)
linreg = Lasso()
linreg.fit(X_train, y_train)
print(linreg.score(X_train, y_train), mean_squared_error(y_train, linreg.predict(X_train)))
print(linreg.score(X_test, y_test), mean_squared_error(y_test, linreg.predict(X_test)))


0.9453580939084304 315364992.0710216
0.8312125496053924 1335562632.1270173


With a higher regularization parameter (alpha = 10)

In [41]:
# Your code hereX_train, X_test, y_train, y_test = train_test_split(X_all, y)
linreg = Lasso(alpha=10)
linreg.fit(X_train, y_train)
print(linreg.score(X_train, y_train), mean_squared_error(y_train, linreg.predict(X_train)))
print(linreg.score(X_test, y_test), mean_squared_error(y_test, linreg.predict(X_test)))


0.9437753553226992 324499745.5458211
0.837415042864253 1286484230.8407419


## Ridge

With default parameter (alpha = 1)

In [42]:
X_train, X_test, y_train, y_test = train_test_split(X_all, y)
linreg = Ridge()
linreg.fit(X_train, y_train)
print(linreg.score(X_train, y_train), mean_squared_error(y_train, linreg.predict(X_train)))
print(linreg.score(X_test, y_test), mean_squared_error(y_test, linreg.predict(X_test)))


0.9322031352117632 407906088.0887437
0.7871336290311843 1527767977.2266047


With default parameter (alpha = 10)

In [43]:
X_train, X_test, y_train, y_test = train_test_split(X_all, y)
linreg = Lasso(alpha=10)
linreg.fit(X_train, y_train)
print(linreg.score(X_train, y_train), mean_squared_error(y_train, linreg.predict(X_train)))
print(linreg.score(X_test, y_test), mean_squared_error(y_test, linreg.predict(X_test)))


0.9432041874262295 351979545.15729433
0.8183483128943126 1197638124.9556024


## Look at the metrics, what are your main conclusions?

Conclusions here

## Compare number of parameter estimates that are (very close to) 0 for Ridge and Lasso

In [None]:
# number of Ridge params almost zero

In [None]:
# number of Lasso params almost zero

Compare with the total length of the parameter space and draw conclusions!

In [None]:
# your code here

## Summary

Great! You now know how to perform Lasso and Ridge regression.