# Ridge and Lasso Regression - Lab

## Introduction

In this lab, you'll apply your knowledge of Ridge and Lasso regression!

## Objectives

You will be able to:

- Use Lasso and Ridge regression in Python
- Compare Lasso and Ridge with standard (unregularized) linear regression

## Housing Prices Data

Let's look at yet another house pricing data set.

In [3]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('Housing_Prices/train.csv')

In [4]:
print(df['Street'].dtype)

object


Look at `df.info()`.

In [5]:
# Your code here
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

We'll make a first selection of the data by removing the columns with `dtype = object`. This way, our first model only contains **numeric features**.

Make sure to remove the `SalePrice` column from the predictors (which you store in `X`), then replace missing values with the median of each feature.

Store the target in `y`.

In [6]:
# Load necessary packages
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Remove "object"-type features and SalePrice from `X`
features = [col for col in df.columns if df[col].dtype != object and col != 'SalePrice']
# Copy so that imputation does not write to a view of `df`
X = df[features].copy()
# Impute null values with the median of each feature
for col in X.columns:
    med = X[col].median()
    X[col] = X[col].fillna(value=med)
    

# Create y
y = df['SalePrice']
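
The selection and imputation step can also be written without an explicit loop. The sketch below is one possible variant, run on a small made-up frame (the values in `toy` are invented for illustration and are not taken from `train.csv`). It assumes `select_dtypes` and a frame-wide `fillna` are acceptable substitutes for the list comprehension and the per-column loop.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are made up)
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "LotArea": [8450, 9600, 11250, 9550],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Keep numeric columns only, then drop the target from the predictors
X_toy = toy.select_dtypes(include="number").drop(columns="SalePrice")

# Fill each column's missing values with that column's median
X_toy = X_toy.fillna(X_toy.median())

# Store the target separately
y_toy = toy["SalePrice"]

print(X_toy)
print(X_toy.isnull().sum())
```

Because `X_toy.median()` returns one value per column, `fillna` matches those values to columns by name, so each feature is imputed with its own median.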



Look at the information of `X` again

In [7]:
X.info()
len(y)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 37 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
LotFrontage      1460 non-null float64
LotArea          1460 non-null int64
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
MasVnrArea       1460 non-null float64
BsmtFinSF1       1460 non-null int64
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
TotRmsAbvGrd     1460 non-null int64
F

1460

## Let's use this data to fit a first naive linear regression model

Compute the R-squared and the MSE for both the train and test sets.

In [17]:
from sklearn.metrics import mean_squared_error, mean_squared_log_error

# Split in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Fit the model and print R2 and MSE for train and test
linreg = LinearRegression()
linreg.fit(X_train, y_train)


print("Training Rsq: ", linreg.score(X_train, y_train))
print("Test Rsq: ", linreg.score(X_test, y_test))

print("Train MSE: ", mean_squared_error(y_train, linreg.predict(X_train)))
print("Test MSE: ", mean_squared_error(y_test, linreg.predict(X_test)))

Training Rsq:  0.8223583652339906
Test Rsq:  0.7706973714998399
Train MSE:  1108428055.823818
Test MSE:  1489000220.7469826
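
To make the two metrics concrete, here is a minimal sketch that computes R-squared and MSE by hand and checks them against scikit-learn. The data are synthetic and the names `X_demo`, `y_demo`, `r2_manual`, and `mse_manual` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data: three features and a noisy linear target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X_demo, y_demo)
pred = model.predict(X_demo)

# R-squared = 1 - (residual sum of squares / total sum of squares)
residual_ss = np.sum((y_demo - pred) ** 2)
total_ss = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1 - residual_ss / total_ss

# MSE = residual sum of squares divided by the number of observations
mse_manual = residual_ss / len(y_demo)

print(r2_manual, model.score(X_demo, y_demo))
print(mse_manual, mean_squared_error(y_demo, pred))
```

From the formula, R-squared drops below zero whenever the residual sum of squares exceeds the total sum of squares, that is, whenever the model predicts worse than simply using the mean of the targets being scored. On held-out data this can happen even if the training fit looks good.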


## Normalize your data

We haven't normalized our data yet. Let's create a new model that uses `preprocessing.scale` to scale our predictors!

In [26]:
from sklearn import preprocessing

# Scale the data and perform train test split

X_scaled = preprocessing.scale(X)

X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns)
X_scaled_df.head()

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)
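
Note that `preprocessing.scale(X)` standardizes with the mean and standard deviation of the whole dataset, so the test rows influence the scaling. A common alternative, sketched here on synthetic data rather than the housing set, is to split first and fit a `StandardScaler` on the training rows only. This is a variation on the lab's approach, not the method used in the cells above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic predictors with a non-zero mean and non-unit variance
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50, scale=10, size=(200, 4))
y_demo = rng.normal(size=200)

# Split first, so the test rows play no part in estimating the scaling
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Learn mean and standard deviation from the training rows only
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # reuses the training statistics

print(X_tr_scaled.mean(axis=0))
print(X_tr_scaled.std(axis=0))
```

The scaled test columns will generally not have a mean of exactly 0, which is expected: they are transformed with the training statistics.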

Perform the same linear regression on this data and print out R-squared and MSE.

In [30]:
# Your code here
linreg_norm = LinearRegression()
linreg_norm.fit(X_train, y_train)

print("Train r^2: ", linreg_norm.score(X_train, y_train))
print("Test r^2: ", linreg_norm.score(X_test, y_test))

print("Train MSE: ", mean_squared_error(y_train, linreg_norm.predict(X_train)))
print("Test MSE: ", mean_squared_error(y_test, linreg_norm.predict(X_test)))



Train r^2:  0.8143230405145303
Test r^2:  0.7967701584244086
Train MSE:  1212948854.1652727
Test MSE:  1142638670.22824


## Include dummy variables

We haven't included dummy variables so far. Let's go back to our "object" variables and create dummies.

In [43]:
# Create X_cat which contains only the categorical variables
features_cat = [col for col in df.columns if df[col].dtype in [object]]

X_cat = df[features_cat]
np.shape(X_cat)

(1460, 43)

In [46]:
# Make dummies
X_cat = pd.get_dummies(X_cat)
X_cat.shape
X_cat.head()

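|   | MSZoning_C (all) | MSZoning_FV | MSZoning_RH | MSZoning_RL | MSZoning_RM | Street_Grvl | Street_Pave | Alley_Grvl | Alley_Pave | LotShape_IR1 | ... | SaleType_ConLw | SaleType_New | SaleType_Oth | SaleType_WD | SaleCondition_Abnorml | SaleCondition_AdjLand | SaleCondition_Alloca | SaleCondition_Family | SaleCondition_Normal | SaleCondition_Partial |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |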
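
To see what `pd.get_dummies` does on a small scale, here is a minimal sketch with a made-up two-column frame (`toy_cat` and its values are illustrative). It also shows how a missing value is handled with the default settings, which matters for sparse columns such as `Alley`.

```python
import pandas as pd

# Two categorical columns; the second has a missing value in row 1
toy_cat = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],
    "Alley": ["Grvl", None, "Pave"],
})

# One indicator column per (column, category) pair
dummies = pd.get_dummies(toy_cat)

print(dummies.columns.tolist())
print(dummies)
```

With the default `dummy_na=False`, a missing entry gets no indicator of its own: every dummy column derived from that variable is 0 (or `False`, depending on the pandas version) for that row.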


Merge `X_cat` with our scaled `X` (`X_scaled_df`) so you have one big predictor DataFrame.

In [50]:
# Your code here
X_scaled_df.shape, X_cat.shape

((1460, 37), (1460, 252))

In [54]:
X_all = pd.concat([X_scaled_df, X_cat], axis = 1)

In [56]:
X_all.shape

(1460, 289)

Perform the same linear regression on this data and print out R-squared and MSE.

In [58]:
# Your code here
X_train, X_test, y_train, y_test = train_test_split(X_all, y)
linreg_all = LinearRegression()
linreg_all.fit(X_train, y_train)
print("Train r^2: ", linreg_all.score(X_train, y_train))
print("Test r^2: ", linreg_all.score(X_test, y_test))
print("Train MSE: ", mean_squared_error(y_train, linreg_all.predict(X_train)))
print("Test MSE: ", mean_squared_error(y_test, linreg_all.predict(X_test)))

Train r^2:  0.9382310125450031
Test r^2:  -6.367824090801666e+21
Train MSE:  395642035.913242
Test MSE:  3.826182623055971e+31


Notice the severe overfitting above: our training R-squared is quite high, but the testing R-squared is negative! Our predictions are far off. Similarly, the testing MSE is many orders of magnitude higher than the training MSE.
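
One plausible contributor to this blow-up (an assumption, not something the lab verifies) is that a full set of dummy columns is perfectly collinear with the intercept: for every row, the dummies of one variable sum to 1. The sketch below illustrates this on a made-up column by comparing the rank of the design matrix with its number of columns. `design_rank` is a hypothetical helper written for this example.

```python
import numpy as np
import pandas as pd

# A single made-up categorical column
toy = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave", "Pave", "Grvl", "Pave"]})

# Full dummy set versus a set with the first category dropped
full = pd.get_dummies(toy).astype(float)
reduced = pd.get_dummies(toy, drop_first=True).astype(float)

def design_rank(dummy_frame):
    """Return (rank, number of columns) of the matrix [intercept | dummies]."""
    design = np.column_stack([np.ones(len(dummy_frame)), dummy_frame.to_numpy()])
    return int(np.linalg.matrix_rank(design)), design.shape[1]

rank_full, ncols_full = design_rank(full)
rank_reduced, ncols_reduced = design_rank(reduced)

print("full dummies:   ", rank_full, "of", ncols_full)
print("drop_first=True:", rank_reduced, "of", ncols_reduced)
```

When the rank is smaller than the number of columns, the least-squares coefficients are not uniquely determined, and an unregularized fit can end up with very large coefficients that extrapolate badly. A penalty on coefficient size, as in Ridge and Lasso, removes that freedom.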

## Perform Ridge and Lasso regression

Use all the data (normalized features and dummy categorical variables) to fit both Lasso and Ridge regression models. Each time, look at R-squared and MSE.

## Lasso

With the default parameter (alpha = 1)

In [61]:
# Your code here
from sklearn.linear_model import Lasso, Ridge
lasso = Lasso()
lasso.fit(X_train, y_train)
print("Train r^2: ", lasso.score(X_train, y_train))
print("Test r^2: ", lasso.score(X_test, y_test))
print("Train MSE: ", mean_squared_error(y_train, lasso.predict(X_train)))
print("Test MSE: ", mean_squared_error(y_test, lasso.predict(X_test)))

Train r^2:  0.9429058659081541
Test r^2:  0.8198271092996983
Train MSE:  365698716.80767304
Test MSE:  1082590181.062099


With a higher regularization parameter (alpha = 10)

In [62]:
# Your code here
lasso = Lasso(alpha=10)
lasso.fit(X_train, y_train)
print("Train r^2: ", lasso.score(X_train, y_train))
print("Test r^2: ", lasso.score(X_test, y_test))
print("Train MSE: ", mean_squared_error(y_train, lasso.predict(X_train)))
print("Test MSE: ", mean_squared_error(y_test, lasso.predict(X_test)))

Train r^2:  0.9412207949401519
Test r^2:  0.827479660809333
Train MSE:  376491914.7522631
Test MSE:  1036608917.7754762


## Ridge

With the default parameter (alpha = 1)

In [63]:
# Your code here
ridge = Ridge()
ridge.fit(X_train, y_train)
print("Train r^2: ", ridge.score(X_train, y_train))
print("Test r^2: ", ridge.score(X_test, y_test))
print("Train MSE: ", mean_squared_error(y_train, ridge.predict(X_train)))
print("Test MSE: ", mean_squared_error(y_test, ridge.predict(X_test)))

Train r^2:  0.9294715165940687
Test r^2:  0.8335499227290634
Train MSE:  451748262.58769494
Test MSE:  1000135029.1386648


With a higher regularization parameter (alpha = 10)

In [65]:
# Your code here
ridge = Ridge(alpha=10)
ridge.fit(X_train, y_train)
print("Train r^2: ", ridge.score(X_train, y_train))
print("Test r^2: ", ridge.score(X_test, y_test))
print("Train MSE: ", mean_squared_error(y_train, ridge.predict(X_train)))
print("Test MSE: ", mean_squared_error(y_test, ridge.predict(X_test)))

Train r^2:  0.9076971829711367
Test r^2:  0.8463552752651059
Train MSE:  591216983.7077788
Test MSE:  923192549.7974312
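
The four Lasso and Ridge cells repeat the same fit-and-score pattern. As a sketch of how that pattern could be factored out, the example below uses a hypothetical helper `report` and synthetic data from `make_regression` (not the housing data), so the numbers it prints are unrelated to the results above.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: 30 features, only 5 of which carry signal
X_demo, y_demo = make_regression(
    n_samples=200, n_features=30, n_informative=5, noise=10.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

def report(model):
    """Fit `model` on the training split and return train/test R-squared and MSE."""
    model.fit(X_tr, y_tr)
    return {
        "train_r2": model.score(X_tr, y_tr),
        "test_r2": model.score(X_te, y_te),
        "train_mse": mean_squared_error(y_tr, model.predict(X_tr)),
        "test_mse": mean_squared_error(y_te, model.predict(X_te)),
    }

models = [
    ("lasso_1", Lasso(alpha=1)),
    ("lasso_10", Lasso(alpha=10)),
    ("ridge_1", Ridge(alpha=1)),
    ("ridge_10", Ridge(alpha=10)),
]

results = {}
for name, model in models:
    results[name] = report(model)
    print(name, results[name])
```

Scoring every model on the same split, as done here, keeps the comparison between models fair.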


## Look at the metrics, what are your main conclusions?

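With all 289 features, unregularized linear regression overfits badly: the training R-squared is about 0.94, but the test R-squared is hugely negative. Lasso and Ridge both bring the test R-squared back to roughly 0.82 to 0.85 while keeping the training R-squared above 0.90. For both methods, raising alpha from 1 to 10 lowers the training fit slightly and improves the test fit, and Ridge with alpha = 10 has the lowest test MSE in this run. Because `train_test_split` is called without a fixed `random_state`, the exact numbers change from run to run, and the two numeric-only models were scored on different splits, so only the models fit after the last split are directly comparable.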

## Compare number of parameter estimates that are (very close to) 0 for Ridge and Lasso

In [68]:
# number of Ridge params almost zero
print(sum(abs(ridge.coef_) < 10**(-10)))

9


In [70]:
# number of Lasso params almost zero
print(sum(abs(lasso.coef_) < 10**(-10)))

63


Compare with the total number of parameters and draw conclusions!

In [73]:
# your code here
len(lasso.coef_)

289

In [76]:
sum(abs(lasso.coef_) < 10**(-10)) / len(lasso.coef_)

0.2179930795847751
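
The difference in sparsity can be reproduced in isolation. The sketch below uses synthetic data from `make_regression` in which only 5 of 50 features carry signal, and counts near-zero coefficients with the same `1e-10` threshold used above. The value `alpha=5.0` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 50 features, only 5 of which are informative
X_demo, y_demo = make_regression(
    n_samples=200, n_features=50, n_informative=5, noise=10.0, random_state=0
)

lasso_demo = Lasso(alpha=5.0).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=5.0).fit(X_demo, y_demo)

# Count coefficients that are (very close to) zero
tol = 1e-10
n_zero_lasso = int(np.sum(np.abs(lasso_demo.coef_) < tol))
n_zero_ridge = int(np.sum(np.abs(ridge_demo.coef_) < tol))

print("Lasso near-zero coefficients:", n_zero_lasso, "of", len(lasso_demo.coef_))
print("Ridge near-zero coefficients:", n_zero_ridge, "of", len(ridge_demo.coef_))
```

The L1 penalty in Lasso can set coefficients exactly to zero, which acts as a form of feature selection, while the L2 penalty in Ridge shrinks coefficients toward zero without typically making them exactly zero.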

## Summary

Great! You now know how to perform Lasso and Ridge regression.