<h2 style='color:blue' align='center'>L1 and L2 Regularization with Regression </h2>

In [30]:
# import libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

In [31]:
# Suppress Warnings for clean notebook
import warnings
warnings.filterwarnings('ignore')

**We are going to use the Melbourne House Price dataset to predict house prices from various features.**
#### The Dataset Link is
https://www.kaggle.com/anthonypino/melbourne-housing-market

In [32]:
# read dataset
dataset = pd.read_csv('./Melbourne_housing_FULL.csv')

In [33]:
def basic_dataset_inspection(table):
    print("Top 5 Sample of dataset")
    print(table.head())
    print("Bottom 5 Sample of dataset")
    print(table.tail())
    print("Column - Names of Given dataset")
    print(table.columns)
    print()
    print("Shape(rows x columns) - of Given dataset")
    print(table.shape)
    print()
    print("Data types - Column Names")
    print(table.dtypes)
    print()
    print("Summary of dataset")
    table.info()  # info() prints directly and returns None, so it is not wrapped in print()
    print()
    print("Count of null/NaN values in each column of dataset")
    print(table.isnull().sum())
    print()
    print("Dataset Summary ")
    print(table.describe())
    print("Unique Values under each column")
    print(table.nunique())
    print()
basic_dataset_inspection(dataset)

Top 5 Sample of dataset
       Suburb             Address  Rooms Type      Price Method SellerG  \
0  Abbotsford       68 Studley St      2    h        NaN     SS  Jellis   
1  Abbotsford        85 Turner St      2    h  1480000.0      S  Biggin   
2  Abbotsford     25 Bloomburg St      2    h  1035000.0      S  Biggin   
3  Abbotsford  18/659 Victoria St      3    u        NaN     VB  Rounds   
4  Abbotsford        5 Charles St      3    h  1465000.0     SP  Biggin   

        Date  Distance  Postcode  ...  Bathroom  Car  Landsize  BuildingArea  \
0  3/09/2016       2.5    3067.0  ...       1.0  1.0     126.0           NaN   
1  3/12/2016       2.5    3067.0  ...       1.0  1.0     202.0           NaN   
2  4/02/2016       2.5    3067.0  ...       1.0  0.0     156.0          79.0   
3  4/02/2016       2.5    3067.0  ...       2.0  1.0       0.0           NaN   
4  4/03/2017       2.5    3067.0  ...       2.0  0.0     134.0         150.0   

   YearBuilt         CouncilArea Lattitude  

In [34]:
# keep only the columns that are relevant to our purpose
cols_to_use = ['Suburb', 'Rooms', 'Type', 'Method', 'SellerG', 'Regionname', 'Propertycount', 
               'Distance', 'CouncilArea', 'Bedroom2', 'Bathroom', 'Car', 'Landsize', 'BuildingArea', 'Price']
dataset = dataset[cols_to_use]

In [35]:
dataset.head()

,Suburb,Rooms,Type,Method,SellerG,Regionname,Propertycount,Distance,CouncilArea,Bedroom2,Bathroom,Car,Landsize,BuildingArea,Price
0,Abbotsford,2,h,SS,Jellis,Northern Metropolitan,4019.0,2.5,Yarra City Council,2.0,1.0,1.0,126.0,,
1,Abbotsford,2,h,S,Biggin,Northern Metropolitan,4019.0,2.5,Yarra City Council,2.0,1.0,1.0,202.0,,1480000.0
2,Abbotsford,2,h,S,Biggin,Northern Metropolitan,4019.0,2.5,Yarra City Council,2.0,1.0,0.0,156.0,79.0,1035000.0
3,Abbotsford,3,u,VB,Rounds,Northern Metropolitan,4019.0,2.5,Yarra City Council,3.0,2.0,1.0,0.0,,
4,Abbotsford,3,h,SP,Biggin,Northern Metropolitan,4019.0,2.5,Yarra City Council,3.0,2.0,0.0,134.0,150.0,1465000.0


In [36]:
dataset.shape

(34857, 15)

#### Checking for NaN values

In [37]:
dataset.isna().sum()

Suburb               0
Rooms                0
Type                 0
Method               0
SellerG              0
Regionname           3
Propertycount        3
Distance             1
CouncilArea          3
Bedroom2          8217
Bathroom          8226
Car               8728
Landsize         11810
BuildingArea     21115
Price             7610
dtype: int64

#### Handling Missing values

In [38]:
# Missing values in some features can be treated as zero (a separate class meaning "unknown" or "absent"):
# e.g. 0 for Propertycount or Bedroom2 acts as a separate class for missing values,
# and 0 for Car means the house has no car parking
cols_to_fill_zero = ['Propertycount', 'Distance', 'Bedroom2', 'Bathroom', 'Car']
dataset[cols_to_fill_zero] = dataset[cols_to_fill_zero].fillna(0)

# The other continuous features are imputed with the column mean for speed, since our focus is on
# reducing overfitting with Lasso and Ridge regression
dataset['Landsize'] = dataset['Landsize'].fillna(dataset.Landsize.mean())
dataset['BuildingArea'] = dataset['BuildingArea'].fillna(dataset.BuildingArea.mean())
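
To make the two imputation strategies above concrete, here is a minimal sketch on a made-up three-row frame. The `toy` frame and its values are illustrative assumptions, not rows from the Melbourne data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not taken from the Melbourne dataset
toy = pd.DataFrame({
    "Car": [1.0, np.nan, 2.0],
    "Landsize": [100.0, np.nan, 300.0],
})

# Strategy 1: treat a missing count as "none" and fill it with zero
toy["Car"] = toy["Car"].fillna(0)

# Strategy 2: fill a continuous feature with the mean of its observed values
# (mean() skips NaN, so the fill value here is (100 + 300) / 2)
toy["Landsize"] = toy["Landsize"].fillna(toy["Landsize"].mean())

print(toy)
```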

**Drop rows with a missing Price: it is our target variable, so we won't impute it. The call below also drops the few rows with a missing Regionname or CouncilArea.**

In [39]:
dataset.dropna(inplace=True)

In [40]:
dataset.shape

(27244, 15)

#### Let's one-hot encode the categorical features

In [41]:
dataset = pd.get_dummies(dataset, drop_first=True)
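
As a small illustration of what `drop_first=True` does, the sketch below encodes a made-up frame with a single categorical column. The `toy` frame is an assumption for illustration; only the `pd.get_dummies` call mirrors the cell above.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and one categorical column
toy = pd.DataFrame({"Rooms": [2, 3, 4], "Type": ["h", "u", "t"]})

# Each category becomes its own indicator column; drop_first=True drops the
# first category (alphabetically "h"), which becomes the implicit baseline
encoded = pd.get_dummies(toy, drop_first=True)

print(encoded.columns.tolist())
```

A row of type `h` is then represented by `False` in both remaining indicator columns. Dropping one level per feature avoids a redundant column, since the dropped level is implied by the others.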

In [42]:
dataset.head()

,Rooms,Propertycount,Distance,Bedroom2,Bathroom,Car,Landsize,BuildingArea,Price,Suburb_Aberfeldie,...,CouncilArea_Moorabool Shire Council,CouncilArea_Moreland City Council,CouncilArea_Nillumbik Shire Council,CouncilArea_Port Phillip City Council,CouncilArea_Stonnington City Council,CouncilArea_Whitehorse City Council,CouncilArea_Whittlesea City Council,CouncilArea_Wyndham City Council,CouncilArea_Yarra City Council,CouncilArea_Yarra Ranges Shire Council
1,2,4019.0,2.5,2.0,1.0,1.0,202.0,160.2564,1480000.0,False,...,False,False,False,False,False,False,False,False,True,False
2,2,4019.0,2.5,2.0,1.0,0.0,156.0,79.0,1035000.0,False,...,False,False,False,False,False,False,False,False,True,False
4,3,4019.0,2.5,3.0,2.0,0.0,134.0,150.0,1465000.0,False,...,False,False,False,False,False,False,False,False,True,False
5,3,4019.0,2.5,3.0,2.0,1.0,94.0,160.2564,850000.0,False,...,False,False,False,False,False,False,False,False,True,False
6,4,4019.0,2.5,3.0,1.0,2.0,120.0,142.0,1600000.0,False,...,False,False,False,False,False,False,False,False,True,False


In [43]:
dataset.shape

(27244, 745)

In [44]:
dataset.corr()['Price']

Rooms                                     0.465231
Propertycount                            -0.059017
Distance                                 -0.211415
Bedroom2                                  0.301524
Bathroom                                  0.339020
                                            ...   
CouncilArea_Whitehorse City Council       0.024560
CouncilArea_Whittlesea City Council      -0.108886
CouncilArea_Wyndham City Council         -0.102752
CouncilArea_Yarra City Council            0.015450
CouncilArea_Yarra Ranges Shire Council   -0.023038
Name: Price, Length: 745, dtype: float64

In [45]:
corr_mat=dataset.corr()
print(corr_mat["Price"].sort_values(ascending=False))

Price                               1.000000
Rooms                               0.465231
Regionname_Southern Metropolitan    0.363670
Bathroom                            0.339020
SellerG_Marshall                    0.307554
                                      ...   
CouncilArea_Hume City Council      -0.141549
Regionname_Western Metropolitan    -0.172641
Regionname_Northern Metropolitan   -0.187410
Distance                           -0.211415
Type_u                             -0.346403
Name: Price, Length: 745, dtype: float64


In [46]:
print(dataset.shape)
n = dataset.shape[0]      # number of samples
p = dataset.shape[1] - 1  # number of features (every column except the target, Price)
print(n, p)

(27244, 745)
27244 744


#### Let's split our dataset into train and test sets

In [47]:
X = dataset.drop('Price', axis=1)
Y = dataset['Price']

In [48]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=2)

#### Let's train our Linear Regression model on the training set and check its R2 score on the test set

In [49]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

reg = LinearRegression().fit(X_train, Y_train)

In [50]:
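def adjusted_r2_score(r2, n, p):
    # Adjusted R2 for a score computed on n samples with p features
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

def model_performance_matrix(model, X_train, Y_train, X_test, Y_test, n, p):
    print('Coefficients: ', model.coef_)
    print('Intercept:', model.intercept_)
    Y_pred = model.predict(X_test)
    test_r2 = r2_score(Y_test, Y_pred)
    mse = mean_squared_error(Y_test, Y_pred)
    print("Train R2 Score:", model.score(X_train, Y_train))
    print("Test R2 Score:", model.score(X_test, Y_test))
    print("R2 Score:", test_r2)
    print("MAE:", mean_absolute_error(Y_test, Y_pred))
    print("MSE:", mse)
    # np.sqrt works on every scikit-learn version; squared=False is deprecated/removed in recent releases
    print("RMSE:", np.sqrt(mse))
    # R2 is computed on the test set, so use the test-set size rather than the full-dataset n
    # (n stays in the signature so the existing calls keep working)
    print("Adj R2 Score:", adjusted_r2_score(test_r2, X_test.shape[0], p))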
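
For reference, the adjusted R² computed by `adjusted_r2_score` above is 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of samples and p the number of features. A quick worked check with made-up numbers (these are assumptions for illustration, not results from this notebook):

```python
def adjusted_r2_score(r2, n, p):
    # n: number of samples the R2 was computed on, p: number of features
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical values: R2 = 0.5 on 101 samples with 10 features
adj_few = adjusted_r2_score(0.5, 101, 10)   # 1 - 0.5 * 100 / 90

# Same R2 and sample count, but 50 features: the correction is larger
adj_many = adjusted_r2_score(0.5, 101, 50)  # 1 - 0.5 * 100 / 50

print(adj_few, adj_many)
```

The more features the model uses relative to the number of samples, the further the adjusted score falls below the raw R².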
    

In [51]:
model_performance_matrix(reg,X_train, Y_train,X_test,Y_test,n,p)

Coefficients:  [ 2.64001655e+05  4.92118905e+00 -4.64732744e+04 -8.22349031e+04
  1.17151153e+05  4.29703140e+04  2.35173742e+00  4.70024530e+02
  2.61342069e+05 -4.61008827e+04 -1.32312259e+05  2.15453995e+05
  1.20133839e+05  2.72683980e+05  1.74702166e+05 -8.00201053e+04
 -1.50967673e+05 -4.94461444e+04  1.17711126e+05 -1.28880253e+05
 -3.49349656e+04 -8.54261329e+03  6.19157896e+04 -2.57258028e+05
 -1.10868116e+05 -2.42391674e+05  1.79856957e+05 -1.22395100e+05
  2.04625284e+05  3.78258911e+04  1.46681364e+05  2.32198628e+03
  7.35395733e+04 -5.29198412e+04  1.90482393e+05 -3.20451670e+05
  7.49515013e+04 -2.83019804e+04  2.69103300e+04  2.78824270e+05
  1.51386730e+05 -1.06324212e+05 -8.45087700e+04  2.96568248e+05
  1.50644588e+05  3.47383320e-07  1.75999576e+05 -1.20057166e+05
 -9.64544651e+03  3.68377378e+05 -1.35478625e+05 -6.38092125e+04
  3.53493139e+04  1.12364756e+05  1.50210824e+04  3.80702958e+04
 -2.07783975e+04 -9.60193574e-07 -2.80289620e+05  3.62799617e+06
 -1.708131

**The training R2 score is 68%, but the test R2 score is only 13.85%, which is very low.**

<h4 style='color:purple'>Plain linear regression is clearly overfitting the data, so let's try regularized models</h4>
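
For reference, these are the objectives that scikit-learn's `Lasso` and `Ridge` estimators minimize, as given in the library's documentation. Here $w$ is the coefficient vector, $n$ is the number of training samples, and $\alpha$ is the regularization strength (the `alpha` argument):

$$
\text{Lasso (L1):}\quad \min_{w}\ \frac{1}{2n}\,\lVert y - Xw \rVert_2^2 + \alpha\,\lVert w \rVert_1
$$

$$
\text{Ridge (L2):}\quad \min_{w}\ \lVert y - Xw \rVert_2^2 + \alpha\,\lVert w \rVert_2^2
$$

Because the squared-error term is scaled by $1/(2n)$ in the Lasso objective but not in the Ridge one, the same numeric `alpha` does not mean the same amount of regularization in both models.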

#### Using Lasso (L1 Regularized) Regression Model

In [52]:
from sklearn import linear_model
lasso_reg = linear_model.Lasso(alpha=50, max_iter=100, tol=0.1)
lasso_reg.fit(X_train, Y_train)

In [53]:
model_performance_matrix(lasso_reg,X_train, Y_train,X_test,Y_test,n,p)

Coefficients:  [ 2.70967779e+05  4.63745956e+00 -3.03357705e+04 -8.48946624e+04
  1.23220366e+05  4.17503159e+04  2.56283488e+00  8.65129010e+01
  2.19220733e+05 -1.20065548e+05 -0.00000000e+00  3.24095470e+05
  1.55485391e+05  2.97071682e+05  1.38140476e+05 -9.41251645e+04
 -1.30121499e+05 -0.00000000e+00  1.51646664e+05 -9.60503347e+04
 -1.74328672e+04 -0.00000000e+00  2.29652095e+04 -1.42906742e+05
 -0.00000000e+00 -2.72295256e+05  0.00000000e+00 -3.51581901e+04
  2.31085764e+05  6.08260117e+04  3.20753304e+04 -0.00000000e+00
  0.00000000e+00 -0.00000000e+00  1.90154504e+05 -1.90416055e+05
  3.35876186e+04 -8.78863211e+04  5.12664874e+04  2.57628933e+05
  1.86342954e+05 -5.53309467e+03 -0.00000000e+00  6.29880653e+04
  1.70718008e+04  0.00000000e+00  2.46076632e+05 -1.20882624e+05
 -0.00000000e+00  4.88763467e+05 -0.00000000e+00  1.27428350e+04
  0.00000000e+00  1.92776878e+05  3.58671817e+04  6.42019523e+04
 -0.00000000e+00  0.00000000e+00 -1.73551008e+05  0.00000000e+00
 -1.948158

#### Using Ridge (L2 Regularized) Regression Model

In [54]:
from sklearn.linear_model import Ridge
ridge_reg= Ridge(alpha=50, max_iter=100, tol=0.1)
ridge_reg.fit(X_train, Y_train)

In [55]:
model_performance_matrix(ridge_reg,X_train, Y_train,X_test,Y_test,n,p)

Coefficients:  [ 2.74565399e+05  1.43900376e+00 -3.08679934e+04 -8.54802356e+04
  1.30784473e+05  3.79897031e+04  3.01027203e+00  3.48784768e+01
  1.28807037e+05 -6.43936116e+04 -2.20191147e+04  1.41225935e+05
  6.01541283e+04  1.36604921e+05  6.82450737e+04 -4.48758586e+04
 -1.18261081e+05 -1.40375061e+04  9.47702189e+04 -5.29144382e+04
 -4.41704919e+04 -4.98005912e+04  2.40852122e+04 -3.83936672e+04
 -1.58353300e+04 -1.68201754e+05  8.08493825e+03 -6.27880702e+04
  1.48801899e+05  3.01077775e+04  2.59806529e+04 -1.38083800e+04
  1.17008911e+04  9.42462899e+02  7.84154905e+04 -6.16676997e+04
  3.86961098e+04 -5.72485218e+04  4.67434700e+04  6.78005371e+04
  7.05287853e+04 -3.93760752e+04 -2.74948996e+04  2.49850015e+04
  2.99162838e+04  0.00000000e+00  1.14984883e+05 -7.42920919e+04
 -1.36014255e+04  3.69878814e+05 -4.72146099e+04 -8.03342124e+03
  2.80914008e+03  5.43349020e+04  5.11640785e+04  4.72233711e+04
  2.35059755e+03  0.00000000e+00 -9.56251249e+04  2.17239495e+04
 -1.351305

**We see that Lasso and Ridge regularization are beneficial when a plain Linear Regression model overfits. The improvement may not always be dramatic, but it is significant in most cases. L1 and L2 regularization are also used in neural networks.**
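
The qualitative difference between the two penalties can be sketched on synthetic data rather than the housing dataset. Everything below is an assumption chosen for illustration: the random data, the five "true" coefficients, and the `alpha` values. The `count_zeros` helper is hypothetical, not part of scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, Ridge

# Synthetic data: 50 features, of which only the first 5 affect the target
rng = np.random.default_rng(0)
n_samples, n_features = 200, 50
X = rng.normal(size=(n_samples, n_features))
true_coef = np.zeros(n_features)
true_coef[:5] = [3.0, -2.0, 1.5, 4.0, -1.0]
y = X @ true_coef + rng.normal(scale=0.5, size=n_samples)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=10.0).fit(X, y)  # L2 penalty

def count_zeros(model):
    # Number of coefficients that are exactly zero
    return int(np.sum(model.coef_ == 0))

print("Zero coefficients (OLS, Lasso, Ridge):",
      count_zeros(ols), count_zeros(lasso), count_zeros(ridge))
print("Coefficient L2 norm (OLS, Ridge):",
      np.linalg.norm(ols.coef_), np.linalg.norm(ridge.coef_))
```

In this setup the L1 penalty sets some coefficients exactly to zero, as in the Lasso output above, while the L2 penalty shrinks the coefficient vector without zeroing it out.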