# Ridge and Lasso Regression - Lab

## Introduction

In this lab, you'll practice applying Ridge and Lasso regression!

## Objectives

You will be able to:

- Use Lasso and Ridge regression in Python
- Compare Lasso and Ridge with standard regression

## Housing Prices Data

Let's look at yet another house pricing data set.

In [45]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('Housing_Prices/train.csv')
df.head()

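| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |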


Look at `df.info()`.

In [46]:
# Your code here
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

First, make a selection of the data by removing the columns with `dtype = object`. This way, our first model contains only **continuous features**.

Make sure to remove the `SalePrice` column from the predictors (which you store in `X`), then replace missing values with the median of each feature.

Store the target in `y`.
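As a minimal sketch of the selection and imputation steps described above, the following uses a small made-up frame (`toy`, `X_toy` and `y_toy` are hypothetical names, and the values are only illustrative). It keeps the numeric columns, drops the target, and fills each missing value with its column median.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the housing data
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "LotArea": [8450, 9600, 11250, 9550],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Keep numeric columns only, then drop the target from the predictors
X_toy = toy.select_dtypes(include="number").drop(columns="SalePrice")

# Replace each missing value by the median of its own column
X_toy = X_toy.fillna(X_toy.median())

# Store the target separately
y_toy = toy["SalePrice"]

print(X_toy)
```

`select_dtypes(include="number")` is one way to pick numeric columns; an explicit dtype check per column, as used in the lab solution, works as well.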

In [47]:
# Load necessary packages
import numpy as np

# Remove "object"-type features and SalePrice from `X`
features = [col for col in df.columns
            if df[col].dtype in [np.float64, np.int64] and col != 'SalePrice']
# Copy so that the imputation below does not write into a slice of `df`
data = df[features].copy()

# Impute null values. Note: this uses the column mean rather than the median
# requested above; the outputs in this lab were produced with the mean.
data = data.fillna(data.mean())

data.isna().sum()

X = data
# Create y
y = df.SalePrice


Look at the information of `X` again

In [48]:
X.shape

(1460, 37)

## Let's use this data to fit a first naive linear regression model

Compute the R-squared and the MSE for both the train and test sets.

In [49]:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression


# Split in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12)

# Fit the model and print R2 and MSE for train and test
reg =  LinearRegression().fit(X_train, y_train)
y_hat_train = reg.predict(X_train) 

print('Training r^2:', reg.score(X_train, y_train))
print('Testing r^2:',reg.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, reg.predict(X_train)))
print('Testing MSE:', mean_squared_error(y_test, reg.predict(X_test)))

Training r^2: 0.8605540732037886
Testing r^2: 0.6466526213390591
Training MSE: 846300047.025201
Testing MSE: 2423448188.093814


## Normalize your data

We haven't normalized our data yet. Let's create a new model that uses `preprocessing.scale` to scale our predictors!

In [51]:
from sklearn import preprocessing

# Scale the data and perform train test split
X_scaled=preprocessing.scale(X)
X_train, X_test, y_train, y_test = train_test_split (X_scaled,y, test_size=0.3, random_state=12 )
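The cell above scales the full data set before splitting, so the test rows contribute to the means and standard deviations used for scaling. A common alternative, sketched here on synthetic data, is to split first and fit a `StandardScaler` on the training rows only. This is not the approach the lab uses, and the names `X_demo`, `y_demo` and `scaler` are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic predictors and target, used only for illustration
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50, scale=10, size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)

# Split first ...
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=12
)

# ... then learn the scaling parameters from the training rows only
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # reuses the training mean and std

print(X_tr_scaled.mean(axis=0).round(6))
print(X_tr_scaled.std(axis=0).round(6))
```

With this ordering the scaled training columns have mean 0 and standard deviation 1, while the test columns are only approximately standardized.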


Perform the same linear regression on this data and print out R-squared and MSE.

In [52]:
# Your code here

reg =  LinearRegression().fit(X_train, y_train)
y_hat_train = reg.predict(X_train) 

print('Training r^2:', reg.score(X_train, y_train))
print('Testing r^2:',reg.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, reg.predict(X_train)))
print('Testing MSE:', mean_squared_error(y_test, reg.predict(X_test)))

Training r^2: 0.8605537387933268
Testing r^2: 0.6467893223017898
Training MSE: 846302076.5688105
Testing MSE: 2422510618.6637087


## Include dummy variables

Your model hasn't included dummy variables so far. Let's use the "object" variables again and create dummies.

In [57]:
# Create X_cat which contains only the categorical variables
# (`np.object` was removed in NumPy 1.24; the builtin `object` works instead)
cat_features = [col for col in df.columns if df[col].dtype == object and
                col != 'SalePrice']
cat_data = df[cat_features]
cat_data.head(2)
X_cat = cat_data

In [58]:
# Make dummies
X_cat = pd.get_dummies(X_cat, drop_first=True)
np.shape(X_cat)

(1460, 209)
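To see what `drop_first=True` does, here is a small sketch on a made-up frame (`toy_cat` is a hypothetical name). Each categorical column with k levels becomes k - 1 indicator columns, because the first level in sorted order is dropped and serves as the reference category.

```python
import pandas as pd

# Hypothetical categorical data: 2 levels for Street, 3 for LotShape
toy_cat = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],
    "LotShape": ["Reg", "IR1", "IR2"],
})

# One indicator column per level, minus the first level of each variable
dummies = pd.get_dummies(toy_cat, drop_first=True)

print(dummies.columns.tolist())
print(dummies.shape)
```

Dropping one level per variable avoids perfectly collinear indicator columns when the model also fits an intercept.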

Merge `X_cat` together with our scaled `X` so you have one big predictor dataframe.

In [62]:
# Your code here
# Keep the original column names so that all column labels are strings
X_all = pd.concat([pd.DataFrame(X_scaled, columns=X.columns), X_cat], axis=1)

Perform the same linear regression on this data and print out R-squared and MSE.

In [63]:
# Your code here
X_train, X_test, y_train, y_test = train_test_split (X_all,y, test_size=0.3, random_state=12 )
reg =  LinearRegression().fit(X_train, y_train)
y_hat_train = reg.predict(X_train) 

print('Training r^2:', reg.score(X_train, y_train))
print('Testing r^2:',reg.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, reg.predict(X_train)))
print('Testing MSE:', mean_squared_error(y_test, reg.predict(X_test)))


Training r^2: 0.9437339794769517
Testing r^2: -1.8397718099125166e+19
Training MSE: 341479574.9765166
Testing MSE: 1.261815405603126e+29


Notice the severe overfitting above: the training R-squared is quite high, but the testing R-squared is hugely negative, which means the test predictions are far worse than a constant prediction of the mean. Similarly, the testing MSE is about 20 orders of magnitude higher than the training MSE.
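A negative R-squared like the one above follows directly from the definition, R-squared = 1 - SS_res / SS_tot: whenever the residual sum of squares exceeds the total sum of squares around the mean, the score drops below zero. The sketch below works through this on three made-up values (`y_true` and `y_bad` are hypothetical).

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([100.0, 200.0, 300.0])
y_bad = np.array([300.0, 200.0, 100.0])  # predictions in reversed order

# Residual sum of squares: 200**2 + 0 + 200**2 = 80000
ss_res = np.sum((y_true - y_bad) ** 2)
# Total sum of squares around the mean (200): 100**2 + 0 + 100**2 = 20000
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot  # 1 - 4 = -3
print(r2_manual, r2_score(y_true, y_bad))
```

Predicting the mean for every observation would give an R-squared of exactly 0, so any negative value indicates a model that does worse than that baseline on the data being scored.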

## Perform Ridge and Lasso regression

Use all the data (normalized features and dummy categorical variables) to perform both Lasso and Ridge regression. Each time, look at R-squared and MSE.
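For reference, these are the objectives minimized by scikit-learn's `Ridge` and `Lasso` estimators as given in the library documentation. The notation is chosen here and is not part of the lab: $w$ is the coefficient vector, $n$ the number of training samples, and $\alpha$ the regularization parameter.

$$
\text{Ridge:}\quad \min_{w}\; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
$$

$$
\text{Lasso:}\quad \min_{w}\; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

The squared penalty shrinks coefficients toward zero, whereas the absolute-value penalty can set some of them exactly to zero. The two loss terms are also scaled differently, so the same value of $\alpha$ does not imply the same penalty strength for both models.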

## Lasso

With default parameter (alpha = 1)

In [77]:
from sklearn.linear_model import Lasso, Ridge, LinearRegression
# Your code here


lasso = Lasso(alpha=1)
lasso.fit(X_train, y_train)

y_h_lasso_train = (lasso.predict(X_train))
y_h_lasso_test = (lasso.predict(X_test))

print('Training r^2:', lasso.score(X_train, y_train))
print('Testing r^2:', lasso.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, y_h_lasso_train))
print('Testing MSE:', mean_squared_error(y_test, y_h_lasso_test))

Training r^2: 0.9436643735561844
Testing r^2: 0.6804284074079939
Training MSE: 341902014.6660228
Testing MSE: 2191795507.209634


With a higher regularization parameter (alpha = 10)

In [78]:
# Your code here
lasso10 = Lasso(alpha=10)
lasso10.fit(X_train, y_train)

y_h_lasso_train = (lasso10.predict(X_train))
y_h_lasso_test = (lasso10.predict(X_test))

print('Training r^2:', lasso10.score(X_train, y_train))
print('Testing r^2:', lasso10.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, y_h_lasso_train))
print('Testing MSE:', mean_squared_error(y_test, y_h_lasso_test))

Training r^2: 0.9408709772069287
Testing r^2: 0.6784074292780236
Training MSE: 358855191.5428914
Testing MSE: 2205656472.602428


## Ridge

With default parameter (alpha = 1)

In [81]:
# Your code here
ridge = Ridge(alpha=1)
ridge.fit(X_train, y_train)

y_h_ridge_train = ridge.predict(X_train)
y_h_ridge_test = ridge.predict(X_test)

print('Training r^2:', ridge.score(X_train, y_train))
print('Testing r^2:', ridge.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, y_h_ridge_train))
print('Testing MSE:', mean_squared_error(y_test, y_h_ridge_test))

Training r^2: 0.9397191314274647
Testing r^2: 0.6770269021193969
Training MSE: 365845766.02378577
Testing MSE: 2215124877.4732013


With a higher regularization parameter (alpha = 10)

In [82]:
# Your code here
ridge10 = Ridge(alpha=10)
ridge10.fit(X_train, y_train)

y_h_ridge_train = ridge10.predict(X_train)
y_h_ridge_test = ridge10.predict(X_test)


print('Training r^2:', ridge10.score(X_train, y_train))
print('Testing r^2:', ridge10.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, y_h_ridge_train))
print('Testing MSE:', mean_squared_error(y_test, y_h_ridge_test))

Training r^2: 0.9301796789848084
Testing r^2: 0.6869144923693935
Training MSE: 423740888.787844
Testing MSE: 2147310414.6440845


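## Look at the metrics. What are your main conclusions?

Compared to the unregularized linear regression on the same features, Ridge and Lasso did a much better job on the test data: the testing R-squared rose from a large negative value to roughly 0.68, while the training R-squared stayed around 0.93 to 0.94. Increasing alpha to 10 slightly improved the Ridge testing R-squared (0.677 to 0.687) but did not help Lasso (0.680 to 0.678). The gap between training and testing scores remains large, so the regularized models still overfit.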


## Compare number of parameter estimates that are (very close to) 0 for Ridge and Lasso

Compare with the total length of the parameter space and draw conclusions!

In [83]:
# number of Ridge params almost zero
print(sum(abs(ridge.coef_) < 10**(-10)))

7


In [84]:
# number of Lasso params almost zero
print(sum(abs(lasso.coef_) < 10**(-10)))

10


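With alpha = 1, Lasso set 10 of the 246 coefficients (about 4%) to essentially zero, compared to 7 for Ridge. At this penalty strength Lasso performs only limited variable selection; a larger alpha would generally drive more coefficients to zero.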

In [89]:
# your code here
print(X_all.shape)

len(lasso.coef_)



(1460, 246)


246
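To illustrate how the number of zeroed coefficients depends on alpha, the following sketch fits Lasso at three penalty strengths on synthetic data and compares the result with Ridge. Everything here is an assumption made for illustration: the data comes from `make_regression`, and `alpha_max` is computed as the smallest penalty at which Lasso (with an intercept) sets every coefficient to zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 20 features, only 5 of which influence the target
X_syn, y_syn = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

# Smallest alpha for which all Lasso coefficients are zero
alpha_max = np.max(np.abs(X_syn.T @ (y_syn - y_syn.mean()))) / len(y_syn)

# Count exactly-zero Lasso coefficients at increasing penalty strengths
zero_counts = {}
for frac in [0.01, 0.5, 2.0]:
    lasso_syn = Lasso(alpha=frac * alpha_max, max_iter=10000).fit(X_syn, y_syn)
    zero_counts[frac] = int(np.sum(lasso_syn.coef_ == 0))

# Ridge shrinks coefficients but does not set them exactly to zero here
ridge_syn = Ridge(alpha=2.0 * alpha_max).fit(X_syn, y_syn)
ridge_zeros = int(np.sum(ridge_syn.coef_ == 0))

print("Lasso zero coefficients by fraction of alpha_max:", zero_counts)
print("Ridge zero coefficients:", ridge_zeros)
```

On the housing data the same idea applies: comparing `sum(abs(lasso.coef_) < 10**(-10))` across several alpha values shows how strongly each setting prunes the 246 predictors.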

## Summary

Great! You now know how to perform Lasso and Ridge regression.