# Multiple Linear Regression
## Housing Case Study

#### Problem Statement:

A US-based housing company named Surprise Housing has decided to enter the Australian market. The company uses data analytics to purchase houses at a price below their actual values and flip them on at a higher price. For the same purpose, the company has collected a data set from the sale of houses in Australia. The data is provided in the CSV file below.



The company is looking at prospective properties to buy to enter the market. You are required to build a regression model using regularisation in order to predict the actual value of the prospective properties and decide whether to invest in them or not.



The company wants to know:

- Which variables are significant in predicting the price of a house, and
- How well those variables describe the price of a house.

Also, determine the optimal value of lambda for ridge and lasso regression.



#### Business Goal



You are required to model the price of houses with the available independent variables. Management will then use this model to understand how exactly the prices vary with the variables. They can adjust the strategy of the firm accordingly and concentrate on areas that will yield high returns. Further, the model will be a good way for management to understand the pricing dynamics of a new market.

**So interpretation is important!**

## Step 1: Reading and Understanding the Data

Let us first import NumPy and pandas and read the housing dataset.

In [3]:
# Suppress warnings

import warnings
warnings.filterwarnings('ignore')

In [4]:
import numpy as np
import pandas as pd

In [5]:
housing = pd.read_csv("train.csv")

In [6]:
# Check the head of the dataset
housing.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


Inspect the various aspects of the housing dataframe

In [7]:
housing.shape

(1460, 81)

In [8]:
housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

In [9]:
housing.describe()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,1460.0,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,1460.0,...,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,730.5,56.89726,70.049958,10516.828082,6.099315,5.575342,1971.267808,1984.865753,103.685262,443.639726,...,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753,180921.19589
std,421.610009,42.300571,24.284752,9981.264932,1.382997,1.112799,30.202904,20.645407,181.066207,456.098091,...,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,79442.502883
min,1.0,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,365.75,20.0,59.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,129975.0
50%,730.5,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,383.5,...,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,1095.25,70.0,80.0,11601.5,7.0,6.0,2000.0,2004.0,166.0,712.25,...,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,1460.0,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,...,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0
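
The non-null counts reported by `info()` show that some columns, such as `LotFrontage` and `Alley`, have fewer than 1460 entries. Below is a minimal sketch of how the missing share per column could be tabulated. The `missing_summary` helper and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with at least one missing value, worst first
    summary = summary[summary["missing"] > 0]
    return summary.sort_values("missing", ascending=False)


# Toy frame standing in for `housing` (values are made up for illustration)
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "Alley": [np.nan, np.nan, np.nan, "Pave"],
    "LotArea": [8450, 9600, 11250, 9550],
})
report = missing_summary(toy)
print(report)
```

On the real data the same call would be `missing_summary(housing)`.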


## Step 2: Visualising the Data

Let's now spend some time doing what is arguably the most important step - **understanding the data**.
- If there is some obvious multicollinearity going on, this is the first place to catch it
- Here's where you'll also identify if some predictors directly have a strong association with the outcome variable

We'll visualise our data using `matplotlib` and `seaborn`.

In [14]:
import matplotlib.pyplot as plt
import seaborn as sns

In [15]:
%matplotlib

Using matplotlib backend: <object object at 0x0000012D2382BF20>


#### Visualising Numeric Variables

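Let's make a pairplot of the numeric variables. A pairplot over every column of a frame this wide is very slow (the original run was interrupted), so the cell below plots a small subset.

In [16]:
# Plot a subset of numeric columns; the choice of columns here is illustrative
num_cols = ['SalePrice', 'LotArea', 'OverallQual', 'GrLivArea', 'YearBuilt']
sns.pairplot(housing[num_cols])
plt.show()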

#### Visualising Categorical Variables

As you might have noticed, there are a few categorical variables as well. Let's make a boxplot for some of these variables.

In [None]:
plt.figure(figsize=(20, 12))
plt.subplot(2,3,1)
sns.boxplot(x = 'MSZoning', y = 'SalePrice', data = housing)
plt.subplot(2,3,2)
sns.boxplot(x = 'Street', y = 'SalePrice', data = housing)
plt.subplot(2,3,3)
sns.boxplot(x = 'LotShape', y = 'SalePrice', data = housing)
plt.subplot(2,3,4)
sns.boxplot(x = 'LandContour', y = 'SalePrice', data = housing)
plt.subplot(2,3,5)
sns.boxplot(x = 'Utilities', y = 'SalePrice', data = housing)
plt.subplot(2,3,6)
sns.boxplot(x = 'LotConfig', y = 'SalePrice', data = housing)
plt.show()

In [None]:
plt.figure(figsize=(20, 12))
plt.subplot(2,3,1)
sns.boxplot(x = 'LandSlope', y = 'SalePrice', data = housing)
plt.subplot(2,3,2)
sns.boxplot(x = 'Neighborhood', y = 'SalePrice', data = housing)
plt.subplot(2,3,3)
sns.boxplot(x = 'Condition1', y = 'SalePrice', data = housing)
plt.subplot(2,3,4)
sns.boxplot(x = 'Condition2', y = 'SalePrice', data = housing)
plt.subplot(2,3,5)
sns.boxplot(x = 'BldgType', y = 'SalePrice', data = housing)
plt.subplot(2,3,6)
sns.boxplot(x = 'HouseStyle', y = 'SalePrice', data = housing)
plt.show()

In [None]:
plt.figure(figsize=(20, 12))
plt.subplot(2,3,1)
sns.boxplot(x = 'RoofStyle', y = 'SalePrice', data = housing)
plt.subplot(2,3,2)
sns.boxplot(x = 'RoofMatl', y = 'SalePrice', data = housing)
plt.subplot(2,3,3)
sns.boxplot(x = 'Exterior1st', y = 'SalePrice', data = housing)
plt.subplot(2,3,4)
sns.boxplot(x = 'Exterior2nd', y = 'SalePrice', data = housing)
plt.subplot(2,3,5)
sns.boxplot(x = 'MasVnrType', y = 'SalePrice', data = housing)
plt.subplot(2,3,6)
sns.boxplot(x = 'ExterQual', y = 'SalePrice', data = housing)
plt.show()

In [None]:
plt.figure(figsize=(20, 12))
plt.subplot(2,3,1)
sns.boxplot(x = 'ExterCond', y = 'SalePrice', data = housing)
plt.subplot(2,3,2)
sns.boxplot(x = 'Foundation', y = 'SalePrice', data = housing)
plt.subplot(2,3,3)
sns.boxplot(x = 'BsmtQual', y = 'SalePrice', data = housing)
plt.subplot(2,3,4)
sns.boxplot(x = 'BsmtCond', y = 'SalePrice', data = housing)
plt.subplot(2,3,5)
sns.boxplot(x = 'BsmtExposure', y = 'SalePrice', data = housing)
plt.subplot(2,3,6)
sns.boxplot(x = 'BsmtFinType1', y = 'SalePrice', data = housing)
plt.show()

In [None]:
plt.figure(figsize=(20, 12))
plt.subplot(2,3,1)
sns.boxplot(x = 'BsmtFinType2', y = 'SalePrice', data = housing)
plt.subplot(2,3,2)
sns.boxplot(x = 'Heating', y = 'SalePrice', data = housing)
plt.subplot(2,3,3)
sns.boxplot(x = 'HeatingQC', y = 'SalePrice', data = housing)
plt.subplot(2,3,4)
sns.boxplot(x = 'CentralAir', y = 'SalePrice', data = housing)
plt.subplot(2,3,5)
sns.boxplot(x = 'Electrical', y = 'SalePrice', data = housing)
plt.subplot(2,3,6)
sns.boxplot(x = 'KitchenQual', y = 'SalePrice', data = housing)
plt.show()

In [None]:
plt.figure(figsize=(20, 12))
plt.subplot(2,3,1)
sns.boxplot(x = 'Functional', y = 'SalePrice', data = housing)
plt.subplot(2,3,2)
sns.boxplot(x = 'FireplaceQu', y = 'SalePrice', data = housing)
plt.subplot(2,3,3)
sns.boxplot(x = 'GarageType', y = 'SalePrice', data = housing)
plt.subplot(2,3,4)
sns.boxplot(x = 'GarageFinish', y = 'SalePrice', data = housing)
plt.subplot(2,3,5)
sns.boxplot(x = 'GarageQual', y = 'SalePrice', data = housing)
plt.subplot(2,3,6)
sns.boxplot(x = 'GarageCond', y = 'SalePrice', data = housing)
plt.show()

In [None]:
plt.figure(figsize=(20, 12))
plt.subplot(2,3,1)
sns.boxplot(x = 'PavedDrive', y = 'SalePrice', data = housing)
plt.subplot(2,3,2)
sns.boxplot(x = 'PoolQC', y = 'SalePrice', data = housing)
plt.subplot(2,3,3)
sns.boxplot(x = 'Fence', y = 'SalePrice', data = housing)
plt.subplot(2,3,4)
sns.boxplot(x = 'MiscFeature', y = 'SalePrice', data = housing)
plt.subplot(2,3,5)
sns.boxplot(x = 'SaleType', y = 'SalePrice', data = housing)
plt.subplot(2,3,6)
sns.boxplot(x = 'SaleCondition', y = 'SalePrice', data = housing)
plt.show()

The boxplots suggest that some of these categorical features have little impact on sale price: `Street`, `Utilities`, `LotShape`, `LotConfig`, `LandSlope`, `PavedDrive` and `Fence`. Removing such columns from the dataframe reduces execution time.
Some of them are left out when the predictors are selected later.
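
This visual judgement can be backed by a number. The rough sketch below assumes that the spread of group medians relative to the overall median is an acceptable proxy for how strongly a categorical column separates prices. The helper name `median_spread` and the toy data are hypothetical.

```python
import pandas as pd


def median_spread(df: pd.DataFrame, cat_col: str, target: str) -> float:
    """Range of per-category median targets, divided by the overall median."""
    medians = df.groupby(cat_col)[target].median()
    return float((medians.max() - medians.min()) / df[target].median())


# Toy data (made up): 'Quality' separates prices, 'Street' barely does
toy = pd.DataFrame({
    "Quality": ["Ex", "Ex", "TA", "TA", "Fa", "Fa"],
    "Street": ["Pave", "Grvl", "Pave", "Grvl", "Pave", "Grvl"],
    "SalePrice": [300000, 310000, 180000, 190000, 100000, 110000],
})

for col in ["Quality", "Street"]:
    print(col, round(median_spread(toy, col, "SalePrice"), 3))
```

A column with a spread close to zero behaves like `Street` in the toy data and is a natural candidate for removal.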

We can also visualise two categorical features together by using the `hue` argument.

In [None]:
plt.figure(figsize = (10, 5))
sns.boxplot(x = 'GarageType', y = 'SalePrice', hue = 'GarageQual', data = housing)
plt.show()

In [None]:
plt.figure(figsize = (10, 5))
sns.boxplot(x = 'GarageType', y = 'SalePrice', hue = 'GarageCond', data = housing)
plt.show()

In [None]:
plt.figure(figsize = (10, 5))
sns.boxplot(x = 'Heating', y = 'SalePrice', hue = 'HeatingQC', data = housing)
plt.show()

## Step 3: Data Preparation

- `MSSubClass` holds numeric codes, but it is really a categorical column.

- It could be converted to a categorical type. In the cell below, the text-valued categorical columns are converted to integer codes with `LabelEncoder`.

In [10]:
# Import label encoder
from sklearn import preprocessing

# Categorical columns to convert to integer codes
label_cols = ['MSZoning', 'LotShape', 'LandContour', 'LotConfig', 'LandSlope',
              'Condition2', 'Condition1', 'BldgType', 'HouseStyle', 'RoofStyle',
              'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual',
              'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure',
              'BsmtFinType1', 'BsmtFinType2', 'HeatingQC', 'CentralAir', 'Electrical',
              'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish',
              'GarageQual', 'GarageCond', 'PavedDrive', 'SaleType', 'SaleCondition']

# label_encoder object knows how to understand word labels.
label_encoder = preprocessing.LabelEncoder()

# Encode each column in turn. fit_transform already returns integers,
# so no separate pd.to_numeric call is needed.
for col in label_cols:
    housing[col] = label_encoder.fit_transform(housing[col])
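
One thing worth checking with this approach: `LabelEncoder` assigns codes by sorted label order, not by any quality ranking. The sketch below uses made-up values in the style of `ExterQual` to show the resulting codes. A linear model treats these codes as a numeric scale, so an explicit mapping or one-hot encoding may be preferable for some columns.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up sample in the style of the ExterQual column
quality = pd.Series(["TA", "Gd", "Ex", "Fa", "Gd"])

encoder = LabelEncoder()
codes = encoder.fit_transform(quality)

# classes_ holds the labels in sorted order; the code is the position in that list
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)          # {'Ex': 0, 'Fa': 1, 'Gd': 2, 'TA': 3}
print(codes.tolist())   # [3, 2, 0, 1, 2]
```

Here "Fa" (fair) receives a lower code than "Gd" (good) only because of alphabetical order, and "TA" (typical) ends up with the highest code.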

In [18]:
#housing['MSSubClass'] = pd.Categorical(housing.MSSubClass)
housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   int32  
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   int32  
 8   LandContour    1460 non-null   int32  
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   int32  
 11  LandSlope      1460 non-null   int32  
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   int32  
 14  Condition2     1460 non-null   int32  
 15  BldgType       1460 non-null   int32  
 16  HouseStyle     1460 non-null   int32  
 17  OverallQual    1460 non-null   int64  
 18  OverallC

In [19]:
plt.figure(figsize=(20, 12))
plt.subplot(2,3,1)
sns.boxplot(x = 'MSSubClass', y = 'SalePrice', data = housing)

<Axes: xlabel='MSSubClass', ylabel='SalePrice'>

In [11]:
housing.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

In [12]:
# split into X and y
X = housing.loc[:, ['MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea',
                    'LotShape', 'LandContour', 'LotConfig',
                    'LandSlope', 'Condition1', 'Condition2', 'BldgType',
                    'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
                    'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
                    'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
                    'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
                    'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF',
                    'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
                    'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
                    'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
                    'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
                    'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
                    'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
                    'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea',
                    'MiscVal', 'MoSold', 'YrSold', 'SaleType',
                    'SaleCondition' ]] # predictors in variable X

y = housing['SalePrice'] # response variable in y

In [13]:
# Preview the rows without missing values; the result is not assigned, so X is unchanged
X.dropna()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,LotShape,LandContour,LotConfig,LandSlope,Condition1,Condition2,...,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,60,3,65.0,8450,3,3,4,0,2,2,...,61,0,0,0,0,0,2,2008,8,4
1,20,3,80.0,9600,3,3,2,0,1,2,...,0,0,0,0,0,0,5,2007,8,4
2,60,3,68.0,11250,0,3,4,0,2,2,...,42,0,0,0,0,0,9,2008,8,4
3,70,3,60.0,9550,0,3,0,0,2,2,...,35,272,0,0,0,0,2,2006,8,0
4,60,3,84.0,14260,0,3,2,0,2,2,...,84,0,0,0,0,0,12,2008,8,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,60,3,62.0,7917,3,3,4,0,2,2,...,40,0,0,0,0,0,8,2007,8,4
1456,20,3,85.0,13175,3,3,4,0,2,2,...,0,0,0,0,0,0,2,2010,8,4
1457,70,3,66.0,9042,3,3,4,0,2,2,...,60,0,0,0,0,2500,5,2010,8,4
1458,20,3,68.0,9717,3,3,4,0,2,2,...,0,112,0,0,0,0,4,2010,8,4


In [14]:
# Drop columns. This decision is based on the VIF values calculated at a later stage and on the correlations
housing_drop_columns =  ['LotFrontage','MasVnrArea','GarageYrBlt','BsmtFinSF1','TotalBsmtSF',
                         'BsmtFinSF2','GrLivArea','LowQualFinSF','2ndFlrSF','1stFlrSF','BsmtUnfSF',
                        'YearBuilt','GarageCars','GarageArea','MSSubClass','TotRmsAbvGrd']
#'LandContour_Low','3SsnPorch', 'ScreenPorch'
X = X.drop(housing_drop_columns, axis=1)

In [15]:
cor = X.corr()
cor

Unnamed: 0,MSZoning,LotArea,LotShape,LandContour,LotConfig,LandSlope,Condition1,Condition2,BldgType,HouseStyle,...,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition
MSZoning,1.0,-0.034452,0.061887,-0.017854,-0.009895,-0.022055,-0.027874,0.044606,0.00569,-0.105315,...,-0.154704,0.115509,0.000362,0.019089,-0.003128,0.009293,-0.031496,-0.020628,0.097437,0.009494
LotArea,-0.034452,1.0,-0.165315,-0.149083,-0.121161,0.436868,0.023846,0.022164,-0.205721,-0.03319,...,0.084774,-0.01834,0.020423,0.04316,0.077672,0.038068,0.001205,-0.014261,0.012292,0.034169
LotShape,0.061887,-0.165315,1.0,0.085434,0.221102,-0.099951,-0.115003,-0.043768,0.116262,-0.104026,...,-0.075412,0.078213,-0.036459,-0.053054,-0.020051,-0.042061,-0.033455,0.036449,-0.000911,-0.038118
LandContour,-0.017854,-0.149083,0.085434,1.0,-0.025527,-0.374267,0.024801,-0.016185,0.051143,0.075234,...,0.040676,-0.058742,-0.021404,0.003836,-0.013098,0.020912,-0.011599,0.020507,-0.025754,0.033809
LotConfig,-0.009895,-0.121161,0.221102,-0.025527,1.0,-0.007256,0.021457,0.033868,0.107229,-0.032945,...,-0.054614,-0.070429,-0.030479,-0.004657,-0.046798,-0.018427,0.018902,-0.005992,0.014325,0.051579
LandSlope,-0.022055,0.436868,-0.099951,-0.374267,-0.007256,1.0,-0.016762,-0.026322,-0.053582,-0.031793,...,-0.032622,-0.008843,0.008694,0.052976,-0.015505,-0.003518,0.007072,-0.002305,0.054858,-0.043095
Condition1,-0.027874,0.023846,-0.115003,0.024801,0.021457,-0.016762,1.0,-0.074268,-0.023501,0.096714,...,0.085861,-0.079213,0.07061,0.011043,0.008742,-0.011454,-0.009868,-0.009819,-0.002338,0.057747
Condition2,0.044606,0.022164,-0.043768,-0.016185,0.033868,-0.026322,-0.074268,1.0,0.009014,-0.02694,...,0.034507,0.013098,-0.003693,-0.008576,-0.00218,0.126814,0.004049,-0.021495,0.004848,0.045074
BldgType,0.00569,-0.205721,0.116262,0.051143,0.107229,-0.053582,-0.023501,0.009014,1.0,0.066552,...,-0.03716,-0.114726,-0.022845,-0.028046,-0.02828,-0.009583,-0.025764,0.002006,-0.040306,-0.00353
HouseStyle,-0.105315,-0.03319,-0.104026,0.075234,-0.032945,-0.031793,0.096714,-0.02694,0.066552,1.0,...,0.136452,-0.065176,-0.026934,-0.025323,0.07663,-0.040903,0.025728,-0.018005,0.048582,0.022753


In [861]:
# plotting correlations on a heatmap

# figure size
plt.figure(figsize=(16,8))

# heatmap
sns.heatmap(X.corr(), cmap="YlGnBu", annot=True)
plt.show()
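
With this many columns the annotated heatmap is hard to read. A possible complement is to list the strongest pairs directly. The sketch below assumes that an absolute correlation above 0.7 counts as "high"; the threshold and the helper name `high_corr_pairs` are illustrative choices.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.7) -> pd.Series:
    """Return column pairs whose absolute correlation exceeds the threshold."""
    corr = df.corr().abs()
    # Keep the strict upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Synthetic data: 'b' is nearly a copy of 'a', 'c' is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
print(high_corr_pairs(toy))
```

Applied to the notebook's data, the call would be `high_corr_pairs(X)`.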


## Step 4: Splitting the Data into Training and Testing Sets
As you know, the first basic step for regression is performing a train-test split.

In [16]:
# scaling the features - necessary before using Ridge or Lasso
from sklearn.preprocessing import scale

# storing column names in cols, since column names are (annoyingly) lost after
# scaling (the df is converted to a numpy array)
cols = X.columns
X = pd.DataFrame(scale(X))
X.columns = cols
X.columns

Index(['MSZoning', 'LotArea', 'LotShape', 'LandContour', 'LotConfig',
       'LandSlope', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
       'OverallQual', 'OverallCond', 'YearRemodAdd', 'RoofStyle', 'RoofMatl',
       'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond',
       'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
       'BsmtFinType2', 'HeatingQC', 'CentralAir', 'Electrical', 'BsmtFullBath',
       'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr',
       'KitchenQual', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'WoodDeckSF',
       'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea',
       'MiscVal', 'MoSold', 'YrSold', 'SaleType', 'SaleCondition'],
      dtype='object')

In [17]:
# split into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.7,
                                                    test_size = 0.3, random_state=100)
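
Note that the scaling cell above standardises the full data set before the split, so the test rows influence the scaling statistics. A common alternative is to split first and fit a `StandardScaler` on the training rows only. It is shown here as a sketch on synthetic data, not as the notebook's own procedure.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y (column names borrowed from the data set, values made up)
rng = np.random.default_rng(100)
X_demo = pd.DataFrame({
    "LotArea": rng.normal(10000, 3000, size=100),
    "OverallQual": rng.integers(1, 11, size=100).astype(float),
})
y_demo = pd.Series(rng.normal(180000, 50000, size=100))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=100
)

# Learn mean and standard deviation from the training rows only
scaler = StandardScaler().fit(X_tr)

# Wrap the arrays back into DataFrames so column names are kept
X_tr_scaled = pd.DataFrame(scaler.transform(X_tr), columns=X_tr.columns, index=X_tr.index)
X_te_scaled = pd.DataFrame(scaler.transform(X_te), columns=X_te.columns, index=X_te.index)

print(X_tr_scaled.mean().round(6))
```

The test set is transformed with the training statistics, so its column means will be close to, but not exactly, zero.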

## Step 5: Model Building and Evaluation

# **Linear Regression**
Let's now try predicting house prices using linear regression.

In [18]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score

In [19]:
# Instantiate
lm = LinearRegression()

# Fit a line
lm.fit(X_train, y_train)

In [20]:
# Print the coefficients and intercept
print(lm.intercept_)
print(lm.coef_)

180798.0087733301
[-2.83525854e+02  4.84344306e+03 -2.35821275e+03  3.74662914e+03
 -5.11290203e+01  2.18951218e+03 -4.26618697e+02 -1.18739899e+03
 -8.38872061e+03 -5.89529913e+03  2.61982850e+04  3.83126277e+03
  2.41177446e+03  3.95276478e+03  5.88283800e+03 -7.42226190e+02
 -2.29362858e+03 -1.68354024e+03 -6.27873920e+03  5.26646672e+02
  3.61541462e+03 -1.07696246e+04  1.67585610e+03 -4.34830818e+03
 -2.54557394e+03 -1.30893103e+01 -1.15948810e+03  2.38250594e+02
 -1.35229574e+03  7.58343196e+03  1.05641533e+03  1.16784205e+04
  6.57626593e+03  6.15217796e+03  1.72605118e+03 -9.54245695e+03
  1.93672398e+03  4.48721752e+03 -4.35652085e+03 -8.12690595e+00
 -2.42491935e+03  7.88059546e+02 -2.23297324e+02  1.57422337e+03
  3.36844906e+03  3.60161692e+02  4.01632249e+02  1.14030125e+03
  1.39943436e+03  9.25834840e+02 -5.79747296e+02 -1.18122191e+03
 -2.69694203e+03 -6.39722186e+02  9.46770571e+02]


In [21]:
from sklearn.metrics import r2_score, mean_squared_error

In [22]:
y_pred_train = lm.predict(X_train)
y_pred_test = lm.predict(X_test)

metric = []
r2_train_lr = r2_score(y_train, y_pred_train)
print(r2_train_lr)
metric.append(r2_train_lr)

r2_test_lr = r2_score(y_test, y_pred_test)
print(r2_test_lr)
metric.append(r2_test_lr)

rss1_lr = np.sum(np.square(y_train - y_pred_train))
print(rss1_lr)
metric.append(rss1_lr)

rss2_lr = np.sum(np.square(y_test - y_pred_test))
print(rss2_lr)
metric.append(rss2_lr)

mse_train_lr = mean_squared_error(y_train, y_pred_train)
print(mse_train_lr)
metric.append(mse_train_lr**0.5)

mse_test_lr = mean_squared_error(y_test, y_pred_test)
print(mse_test_lr)
metric.append(mse_test_lr**0.5)

0.8014935844033103
0.7911917057806257
1266609647686.294
588570555357.2267
1240557931.1325111
1343768391.2265449
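
The same six numbers are computed again for the ridge and lasso models later on. One way to avoid repeating the cell is to wrap the calculation in a function. The sketch below uses a hypothetical helper `regression_metrics` and made-up values.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score


def regression_metrics(y_true, y_pred) -> dict:
    """Return R2, residual sum of squares and RMSE for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = mean_squared_error(y_true, y_pred)
    return {
        "r2": r2_score(y_true, y_pred),
        "rss": float(np.sum(np.square(y_true - y_pred))),
        "rmse": float(np.sqrt(mse)),
    }


# Made-up prices and predictions, for illustration only
actual = [200000, 150000, 300000, 250000]
predicted = [210000, 140000, 290000, 260000]
print(regression_metrics(actual, predicted))
```

In the notebook this could be called once for the training split and once for the test split of each model, for example `regression_metrics(y_train, y_pred_train)`.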


In [23]:
# Check for the VIF values of the feature variables.
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Calculate the VIFs again for the new model

vif = pd.DataFrame()
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif


Unnamed: 0,Features,VIF
16,Exterior2nd,3.96
15,Exterior1st,3.94
10,OverallQual,3.33
37,Fireplaces,2.73
18,ExterQual,2.54
38,FireplaceQu,2.51
31,FullBath,2.46
12,YearRemodAdd,2.44
40,GarageFinish,2.25
21,BsmtQual,2.18


I went through a lot of trial and error to improve the R2 score on the train and test data. I then checked the VIF values, which are all below 5, so I stopped here and moved on to the L1/L2 regularisation steps.
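
For reference, the VIF of a feature is `1 / (1 - R^2)`, where `R^2` comes from regressing that feature on all the others. The sketch below computes it with scikit-learn on synthetic data. It fits an intercept in each auxiliary regression, which is an assumption of this example; the `statsmodels` function used above works on the design matrix exactly as given.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def vif_table(df: pd.DataFrame) -> pd.Series:
    """VIF per column: 1 / (1 - R^2) from regressing it on the other columns."""
    vifs = {}
    for col in df.columns:
        others = df.drop(columns=col)
        r2 = LinearRegression().fit(others, df[col]).score(others, df[col])
        vifs[col] = 1.0 / (1.0 - r2)
    return pd.Series(vifs).sort_values(ascending=False)


# Synthetic data: x2 is almost a copy of x1, x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
toy = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.2, size=300),
    "x3": rng.normal(size=300),
})
print(vif_table(toy).round(2))
```

The two near-duplicate columns get a large VIF, while the independent column stays close to 1.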

**Ridge and Lasso Regression**

Let's now apply ridge and lasso regression to the same house price data used for the linear regression above.
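
For orientation, these are the objectives that scikit-learn's `Ridge` and `Lasso` minimise. The `alpha` parameter plays the role of lambda in the problem statement. Here $w$ is the coefficient vector and $n$ the number of training rows; this notation is an assumption of this note, not taken from the notebook.

$$
\text{Ridge:}\quad \min_{w}\; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
$$

$$
\text{Lasso:}\quad \min_{w}\; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

Because the squared-error term is scaled differently in the two objectives, the best `alpha` for ridge and the best `alpha` for lasso are not directly comparable.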

## **Ridge Regression**

In [40]:
# list of alphas to tune - if value too high it will lead to underfitting, if it is too low,
# it will not handle the overfitting
params = {'alpha': [0.0001, 0.001, 0.01, 0.05, 0.1,
 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 2.0, 3.0,
 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 20, 50, 100, 500, 1000 ]}

ridge = Ridge()

# cross validation
folds = 5
model_cv = GridSearchCV(estimator = ridge,
                        param_grid = params,
                        scoring= 'neg_mean_absolute_error',
                        cv = folds,
                        return_train_score=True,
                        verbose = 1)
model_cv.fit(X_train, y_train)

Fitting 5 folds for each of 28 candidates, totalling 140 fits


In [41]:
# Printing the best hyperparameter alpha
print(model_cv.best_params_)

{'alpha': 500}
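
The selected `alpha` is near the top of the grid, so it can help to look at the whole score curve and not only at `best_params_`. Below is a minimal sketch of how `cv_results_` can be tabulated. The synthetic data and the shortened grid are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for X_train and y_train
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(size=120)

grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(
    Ridge(),
    param_grid=grid,
    scoring="neg_mean_absolute_error",
    cv=5,
    return_train_score=True,
)
search.fit(X_demo, y_demo)

# One row per alpha; scores are negated MAE, so values closer to zero are better
results = pd.DataFrame(search.cv_results_)[
    ["param_alpha", "mean_train_score", "mean_test_score"]
]
print(results)
print(search.best_params_)
```

If the best value sits at or next to the largest value in the grid, widening the grid is a reasonable next step.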


In [91]:
# Fitting the Ridge model for alpha = 500 and printing the penalised coefficients
alpha = 500
ridge = Ridge(alpha=alpha)

ridge.fit(X_train, y_train)
print(ridge.coef_)

[ -599.21351736  4741.1649488  -2394.57070502  1763.37688894
  -484.66450632  1270.61470749  -114.55920431  -511.93450718
 -5028.924285   -1903.22175153 15224.08522533  1598.82356767
  3776.41307347  3637.68335153  4332.89963434  -501.92865003
  -740.73143371 -2029.86611021 -7660.1891566    563.06664242
  3064.53536459 -9198.63233239   916.74275697 -3625.42588775
 -1735.08971174   -97.98960599 -2536.4190542   1166.23812477
   122.01714833  4904.68522735   295.3456335   8723.56495861
  4366.58889508  5260.0672164    476.11356563 -8662.44577514
  2065.40864685  5326.35901461 -5263.93572141 -2155.36598325
 -3902.51210768   190.33787708   463.85923844  1451.26339065
  3623.21814535  1894.96086308   316.93039543   955.21063167
  1421.06643062  1477.17748449  -438.60589272  -306.5993758
 -1525.0763117   -578.60678415  1197.21123424]


In [92]:
# Let's calculate some metrics such as R2 score, RSS and RMSE
y_pred_train = ridge.predict(X_train)
y_pred_test = ridge.predict(X_test)

metric2 = []
r2_train_lr = r2_score(y_train, y_pred_train)
print(r2_train_lr)
metric2.append(r2_train_lr)

r2_test_lr = r2_score(y_test, y_pred_test)
print(r2_test_lr)
metric2.append(r2_test_lr)

rss1_lr = np.sum(np.square(y_train - y_pred_train))
print(rss1_lr)
metric2.append(rss1_lr)

rss2_lr = np.sum(np.square(y_test - y_pred_test))
print(rss2_lr)
metric2.append(rss2_lr)

mse_train_lr = mean_squared_error(y_train, y_pred_train)
print(mse_train_lr)
metric2.append(mse_train_lr**0.5)

mse_test_lr = mean_squared_error(y_test, y_pred_test)
print(mse_test_lr)
metric2.append(mse_test_lr**0.5)

0.7820521719580862
0.7762684665112415
1390659444735.6738
630634876879.2559
1362056263.2082996
1439805654.97547


## **Lasso Regression**

In [93]:
lasso = Lasso()

# cross validation
model_cv = GridSearchCV(estimator = lasso,
                        param_grid = params,
                        scoring= 'neg_mean_absolute_error',
                        cv = folds,
                        return_train_score=True,
                        verbose = 1)

model_cv.fit(X_train, y_train)

Fitting 5 folds for each of 28 candidates, totalling 140 fits


In [94]:
# Printing the best hyperparameter alpha
print(model_cv.best_params_)

{'alpha': 1000}
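
Before fixing alpha, it can help to look at how the cross-validated score changes across the whole grid, and to check whether the selected value sits at the edge of the grid (in which case a wider grid may be worth trying). Whether 1000 is the largest value in `params` is not visible in this section, so the sketch below uses a hypothetical grid and synthetic data; only `GridSearchCV`, `cv_results_` and `best_params_` are taken from scikit-learn as is.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data and a hypothetical alpha grid (both assumptions)
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 noise=10.0, random_state=0)
demo_params = {'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}

demo_cv = GridSearchCV(estimator=Lasso(max_iter=10000),
                       param_grid=demo_params,
                       scoring='neg_mean_absolute_error',
                       cv=5,
                       return_train_score=True)
demo_cv.fit(X_demo, y_demo)

# One row per candidate alpha, with mean train and validation scores
cv_results = pd.DataFrame(demo_cv.cv_results_)
summary = cv_results[['param_alpha', 'mean_train_score', 'mean_test_score']]
print(summary)

# Flag the case where the selected alpha is the largest value in the grid
best_alpha = demo_cv.best_params_['alpha']
print(best_alpha, best_alpha == max(demo_params['alpha']))
```

With the notebook's objects, the same inspection would be `pd.DataFrame(model_cv.cv_results_)`.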


In [95]:
# Fitting the Lasso model for alpha = 1000 and printing the penalised coefficients

alpha = 1000

lasso = Lasso(alpha=alpha)

lasso.fit(X_train, y_train)

In [96]:
lasso.coef_

array([-0.00000000e+00,  5.22861890e+03, -1.76476149e+03,  2.22997068e+03,
       -0.00000000e+00,  8.78937763e+02, -0.00000000e+00, -0.00000000e+00,
       -7.00465412e+03, -3.27765172e+03,  2.68957702e+04,  1.88393447e+03,
        1.68860624e+03,  3.26251502e+03,  4.89963965e+03, -1.20350211e+02,
       -1.48726772e+03, -7.90162474e+02, -6.28464864e+03,  0.00000000e+00,
        2.69416735e+03, -1.04621229e+04,  9.23065991e+02, -3.23084949e+03,
       -2.19354877e+03, -0.00000000e+00, -5.13133139e+02,  2.50875908e+00,
       -0.00000000e+00,  6.84628519e+03,  3.80876086e+02,  1.08737037e+04,
        4.99043141e+03,  5.47766685e+03,  0.00000000e+00, -9.48586002e+03,
        1.00335661e+03,  4.91923125e+03, -3.94410128e+03, -5.05975014e+02,
       -2.58271707e+03,  0.00000000e+00,  0.00000000e+00,  9.66937221e+02,
        3.01991567e+03,  0.00000000e+00, -0.00000000e+00,  4.81650974e+02,
        4.10513522e+02,  0.00000000e+00, -0.00000000e+00, -0.00000000e+00,
       -1.23106831e+03, -
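
Several entries in the Lasso output are exactly zero, which is how Lasso performs feature selection. A small sketch of counting the dropped and retained coefficients follows; the synthetic data and the `alpha` value are assumptions for illustration and do not reproduce the housing results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic stand-in data (an assumption): only 5 of 20 features are informative
X_demo, y_demo = make_regression(n_samples=200, n_features=20, n_informative=5,
                                 noise=5.0, random_state=0)
demo_lasso = Lasso(alpha=5.0).fit(X_demo, y_demo)

# Coefficients shrunk exactly to zero correspond to features Lasso has dropped
zero_mask = demo_lasso.coef_ == 0
n_zero = int(np.sum(zero_mask))
n_kept = int(np.sum(~zero_mask))
print(f"{n_zero} coefficients set to zero, {n_kept} retained")
```

Applied to the notebook's model, `np.sum(lasso.coef_ == 0)` would give the number of features removed at alpha = 1000.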

In [97]:
# Let's calculate some metrics: R2 score, RSS and RMSE

y_pred_train = lasso.predict(X_train)
y_pred_test = lasso.predict(X_test)

metric3 = []
r2_train_lr = r2_score(y_train, y_pred_train)
print(r2_train_lr)
metric3.append(r2_train_lr)

r2_test_lr = r2_score(y_test, y_pred_test)
print(r2_test_lr)
metric3.append(r2_test_lr)

rss1_lr = np.sum(np.square(y_train - y_pred_train))
print(rss1_lr)
metric3.append(rss1_lr)

rss2_lr = np.sum(np.square(y_test - y_pred_test))
print(rss2_lr)
metric3.append(rss2_lr)

# Note: the MSE is printed, but its square root (RMSE) is what gets stored in metric3
mse_train_lr = mean_squared_error(y_train, y_pred_train)
print(mse_train_lr)
metric3.append(mse_train_lr**0.5)

mse_test_lr = mean_squared_error(y_test, y_pred_test)
print(mse_test_lr)
metric3.append(mse_test_lr**0.5)

0.7961451691469081
0.7898894192064888
1300736274491.9243
592241326841.8533
1273982639.0714245
1352149148.040761


In [98]:
# Creating a table which contain all the metrics

# The last two rows hold RMSE values (square roots of the MSE), so they are labelled RMSE
lr_table = {'Metric': ['R2 Score (Train)','R2 Score (Test)','RSS (Train)','RSS (Test)',
                       'RMSE (Train)','RMSE (Test)'],
        'Linear Regression': metric
        }

lr_metric = pd.DataFrame(lr_table ,columns = ['Metric', 'Linear Regression'] )

rg_metric = pd.Series(metric2, name = 'Ridge Regression')
ls_metric = pd.Series(metric3, name = 'Lasso Regression')

final_metric = pd.concat([lr_metric, rg_metric, ls_metric], axis = 1)

final_metric

Unnamed: 0,Metric,Linear Regression,Ridge Regression,Lasso Regression
0,R2 Score (Train),0.8014936,0.7820522,0.7961452
1,R2 Score (Test),0.7911917,0.7762685,0.7898894
2,RSS (Train),1266610000000.0,1390659000000.0,1300736000000.0
3,RSS (Test),588570600000.0,630634900000.0,592241300000.0
4,RMSE (Train),35221.55,36906.05,35692.89
5,RMSE (Test),36657.45,37944.77,36771.58


## Let's observe the changes in the coefficients after regularization

In [99]:
betas = pd.DataFrame(index=X.columns)

In [100]:
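# DataFrame has no `rows` attribute; the row labels live in `index`
# (already set to X.columns by the constructor above)
betas.index = X.columns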

In [101]:
betas['Linear'] = lm.coef_
betas['Ridge'] = ridge.coef_
betas['Lasso'] = lasso.coef_

In [102]:
pd.set_option('display.max_rows', None)
betas.head(68)

Unnamed: 0,Linear,Ridge,Lasso
MSZoning,-283.525854,-599.213517,-0.0
LotArea,4843.44306,4741.164949,5228.6189
LotShape,-2358.212751,-2394.570705,-1764.761486
LandContour,3746.629136,1763.376889,2229.970684
LotConfig,-51.12902,-484.664506,-0.0
LandSlope,2189.512182,1270.614707,878.937763
Condition1,-426.618697,-114.559204,-0.0
Condition2,-1187.398993,-511.934507,-0.0
BldgType,-8388.720611,-5028.924285,-7004.654117
HouseStyle,-5895.299128,-1903.221752,-3277.651716
