L1 and L2 Regularization | Lasso Regression (L1 Regularization) | Ridge Regression (L2 Regularization)

In [1]:
""" 
In this notebook, we will look at:

1) What are overfitting and underfitting?
2) How can we address overfitting using L1 and L2 regularization?
3) Code in Python with scikit-learn for housing price prediction, where we will see a plain linear regression model overfit. We will then use Lasso regression (L1 regularization) and Ridge regression (L2 regularization) to address the overfitting.

Signal: 
    The true underlying pattern in the data, which is what we want the machine learning model to learn.

Noise: 
    Random or irrelevant variation in the data that does not reflect the underlying pattern. A model that learns the noise performs worse on new data.

Bias: 
    The error introduced when a model is too simple to capture the underlying pattern, for example when a straight line is fitted to a curved relationship. A high-bias model makes systematically wrong predictions, even on the training data.

Variance: 
    How much the model's predictions change when it is trained on a different sample of the data. A high-variance model performs well on the training dataset but poorly on the test dataset.

Underfitting:
    When a model has not learned the patterns in the training data well and is unable to generalize well on the new data, it is known as underfitting. An underfit model has poor performance on the training data and will result in unreliable predictions. Underfitting occurs due to high bias and low variance.

Overfitting:
    When a model performs very well for training data but has poor performance with test data (new data), it is known as overfitting. In this case, the machine learning model learns the details and noise in the training data such that it negatively affects the performance of the model on test data. Overfitting can happen due to low bias and high variance.

Balanced fit:
    To find a well-fitted model, we look at how the error on the training data and the error on held-out (validation or test) data change as the model trains. At first, both errors fall as the algorithm learns.
    
    If we train the model for too long, it may learn the unnecessary details and the noise in the training set: the training error keeps falling, but the held-out error starts to rise, which is overfitting. To achieve a good fit, we need to stop training at the point where the held-out error starts to increase.

"""

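The definitions above can be made concrete with a small experiment. The sketch below is not part of the housing example: it assumes a synthetic noisy sine curve and uses NumPy's `polyfit` to fit polynomials of increasing degree, then reports the mean squared error on the training points and on held-out points. The data, the degrees, and the helper name `mse` are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): a sine curve plus noise
x_train = np.sort(rng.uniform(-1, 1, 30))
y_train = np.sin(3 * x_train) + rng.normal(0, 0.2, x_train.size)
x_test = np.sort(rng.uniform(-1, 1, 200))
y_test = np.sin(3 * x_test) + rng.normal(0, 0.2, x_test.size)

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on the points (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

train_mse, test_mse = {}, {}
for degree in (1, 3, 12):
    # Least-squares polynomial fit on the training points only
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse[degree] = mse(coeffs, x_train, y_train)
    test_mse[degree] = mse(coeffs, x_test, y_test)
    print(f"degree={degree:2d}  train MSE={train_mse[degree]:.4f}  "
          f"test MSE={test_mse[degree]:.4f}")
```

The training error can only go down as the degree grows, because each higher-degree polynomial contains the lower-degree ones as special cases. The straight line (degree 1) is too simple for this curve and is the high-bias case. How large the gap between training and test error becomes for the highest degree depends on the noise level and the sample size.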

In [2]:
""" 
Problem:

We are going to use the Melbourne housing market dataset to predict house prices from various features.

Dataset: https://www.kaggle.com/anthonypino/melbourne-housing-market
    
"""

In [3]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
# Suppress warnings for clean notebook
import warnings
warnings.filterwarnings('ignore')

In [5]:
# Read dataset
dataset = pd.read_csv('./Melbourne_housing_FULL.csv')
dataset.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,68 Studley St,2,h,,SS,Jellis,3/09/2016,2.5,3067.0,...,1.0,1.0,126.0,,,Yarra City Council,-37.8014,144.9958,Northern Metropolitan,4019.0
1,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,...,1.0,1.0,202.0,,,Yarra City Council,-37.7996,144.9984,Northern Metropolitan,4019.0
2,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra City Council,-37.8079,144.9934,Northern Metropolitan,4019.0
3,Abbotsford,18/659 Victoria St,3,u,,VB,Rounds,4/02/2016,2.5,3067.0,...,2.0,1.0,0.0,,,Yarra City Council,-37.8114,145.0116,Northern Metropolitan,4019.0
4,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra City Council,-37.8093,144.9944,Northern Metropolitan,4019.0


In [6]:
dataset.nunique()

Suburb             351
Address          34009
Rooms               12
Type                 3
Price             2871
Method               9
SellerG            388
Date                78
Distance           215
Postcode           211
Bedroom2            15
Bathroom            11
Car                 15
Landsize          1684
BuildingArea       740
YearBuilt          160
CouncilArea         33
Lattitude        13402
Longtitude       14524
Regionname           8
Propertycount      342
dtype: int64

In [7]:
dataset.shape

(34857, 21)

In [8]:
# Keep only the columns that are useful for this exercise.
# .copy() gives an independent DataFrame, so the later in-place edits do not act on a slice of the original.
cols_to_use = ['Suburb', 'Rooms', 'Type', 'Method', 'SellerG', 'Regionname', 'Propertycount', 
               'Distance', 'CouncilArea', 'Bedroom2', 'Bathroom', 'Car', 'Landsize', 'BuildingArea', 'Price']
dataset = dataset[cols_to_use].copy()
dataset.head()

Unnamed: 0,Suburb,Rooms,Type,Method,SellerG,Regionname,Propertycount,Distance,CouncilArea,Bedroom2,Bathroom,Car,Landsize,BuildingArea,Price
0,Abbotsford,2,h,SS,Jellis,Northern Metropolitan,4019.0,2.5,Yarra City Council,2.0,1.0,1.0,126.0,,
1,Abbotsford,2,h,S,Biggin,Northern Metropolitan,4019.0,2.5,Yarra City Council,2.0,1.0,1.0,202.0,,1480000.0
2,Abbotsford,2,h,S,Biggin,Northern Metropolitan,4019.0,2.5,Yarra City Council,2.0,1.0,0.0,156.0,79.0,1035000.0
3,Abbotsford,3,u,VB,Rounds,Northern Metropolitan,4019.0,2.5,Yarra City Council,3.0,2.0,1.0,0.0,,
4,Abbotsford,3,h,SP,Biggin,Northern Metropolitan,4019.0,2.5,Yarra City Council,3.0,2.0,0.0,134.0,150.0,1465000.0


In [9]:
dataset.shape

(34857, 15)

In [10]:
# List the columns that contain NaN / missing values
dataset.columns[dataset.isna().any()]

Index(['Regionname', 'Propertycount', 'Distance', 'CouncilArea', 'Bedroom2',
       'Bathroom', 'Car', 'Landsize', 'BuildingArea', 'Price'],
      dtype='object')

In [11]:
dataset.isna().sum()

Suburb               0
Rooms                0
Type                 0
Method               0
SellerG              0
Regionname           3
Propertycount        3
Distance             1
CouncilArea          3
Bedroom2          8217
Bathroom          8226
Car               8728
Landsize         11810
BuildingArea     21115
Price             7610
dtype: int64

In [12]:
# Fill missing values in these numeric columns with 0
cols_to_fill_zero = ['Propertycount', 'Distance', 'Bedroom2', 'Bathroom', 'Car']
dataset[cols_to_fill_zero] = dataset[cols_to_fill_zero].fillna(0)
dataset.isna().sum()

Suburb               0
Rooms                0
Type                 0
Method               0
SellerG              0
Regionname           3
Propertycount        0
Distance             0
CouncilArea          3
Bedroom2             0
Bathroom             0
Car                  0
Landsize         11810
BuildingArea     21115
Price             7610
dtype: int64

In [13]:
"""
The remaining continuous features are imputed with the column mean. This is a quick simplification, since our focus is on reducing overfitting with Lasso and Ridge regression.
"""

dataset['Landsize'] = dataset['Landsize'].fillna(dataset.Landsize.mean())
dataset['BuildingArea'] = dataset['BuildingArea'].fillna(dataset.BuildingArea.mean())
dataset.isna().sum()

Suburb              0
Rooms               0
Type                0
Method              0
SellerG             0
Regionname          3
Propertycount       0
Distance            0
CouncilArea         3
Bedroom2            0
Bathroom            0
Car                 0
Landsize            0
BuildingArea        0
Price            7610
dtype: int64
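To see what the two imputation steps do, here is a minimal sketch on a hypothetical three-row table (the values are made up, not taken from the dataset). `fillna(0)` replaces missing entries with zero, while `fillna(column.mean())` replaces them with the mean of the observed values, since `mean()` skips NaN by default.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "Car": [1.0, np.nan, 2.0],
    "Landsize": [100.0, np.nan, 300.0],
})

# Zero imputation: a missing count is treated as 0
toy["Car"] = toy["Car"].fillna(0)

# Mean imputation: the mean is computed over the non-missing values (100 and 300)
landsize_mean = toy["Landsize"].mean()
toy["Landsize"] = toy["Landsize"].fillna(landsize_mean)

print(toy)
```

Mean imputation leaves the column mean unchanged but reduces its variance, which is acceptable here because imputation is not the focus of this notebook.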

In [14]:
# Drop the remaining rows with missing values (mostly rows without a Price)
dataset.dropna(inplace = True)
dataset.isna().sum()

Suburb           0
Rooms            0
Type             0
Method           0
SellerG          0
Regionname       0
Propertycount    0
Distance         0
CouncilArea      0
Bedroom2         0
Bathroom         0
Car              0
Landsize         0
BuildingArea     0
Price            0
dtype: int64

In [15]:
dataset.shape

(27244, 15)

Let's one-hot encode the categorical features

In [16]:
# drop_first=True drops the first category of each feature to avoid the dummy-variable trap.
dataset = pd.get_dummies(dataset, drop_first = True)
dataset.head()

Unnamed: 0,Rooms,Propertycount,Distance,Bedroom2,Bathroom,Car,Landsize,BuildingArea,Price,Suburb_Aberfeldie,...,CouncilArea_Moorabool Shire Council,CouncilArea_Moreland City Council,CouncilArea_Nillumbik Shire Council,CouncilArea_Port Phillip City Council,CouncilArea_Stonnington City Council,CouncilArea_Whitehorse City Council,CouncilArea_Whittlesea City Council,CouncilArea_Wyndham City Council,CouncilArea_Yarra City Council,CouncilArea_Yarra Ranges Shire Council
1,2,4019.0,2.5,2.0,1.0,1.0,202.0,160.2564,1480000.0,0,...,0,0,0,0,0,0,0,0,1,0
2,2,4019.0,2.5,2.0,1.0,0.0,156.0,79.0,1035000.0,0,...,0,0,0,0,0,0,0,0,1,0
4,3,4019.0,2.5,3.0,2.0,0.0,134.0,150.0,1465000.0,0,...,0,0,0,0,0,0,0,0,1,0
5,3,4019.0,2.5,3.0,2.0,1.0,94.0,160.2564,850000.0,0,...,0,0,0,0,0,0,0,0,1,0
6,4,4019.0,2.5,3.0,1.0,2.0,120.0,142.0,1600000.0,0,...,0,0,0,0,0,0,0,0,1,0
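As a small illustration of what `drop_first=True` does, the sketch below encodes a hypothetical three-row frame. The `Type` codes `h` and `u` appear in the dataset preview; `t` is an assumed third code, and the rows themselves are made up.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Type": ["h", "u", "t"],
})

# Numeric columns are kept as they are; each categorical column becomes
# one dummy column per category, minus the first category (here "h")
encoded = pd.get_dummies(toy, drop_first=True)

print(encoded)
print(list(encoded.columns))
```

The dropped category `h` is represented by all of its dummy columns being zero. Depending on the pandas version, the dummy columns are printed as 0/1 or as False/True.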


In [17]:
# axis = 1 tells drop() to remove a column; axis = 0 would refer to rows
X = dataset.drop('Price', axis = 1)
y = dataset.Price

In [18]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2)

Linear Regression

In [19]:
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(X_train, y_train)

In [20]:
reg.score(X_train, y_train)

0.6827792395792723

In [21]:
reg.score(X_test, y_test)

0.13853683161562136
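For a scikit-learn regressor, `score` returns the coefficient of determination R², not an accuracy percentage. As a rough sketch of what that number means, the snippet below computes R² by hand on made-up values; the function name `r_squared` is just an illustrative helper.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]
good_pred = [2.5, 5.5, 7.0, 9.5]
mean_pred = [6.0, 6.0, 6.0, 6.0]  # always predicting the mean of y_true

r2_good = r_squared(y_true, good_pred)
r2_mean = r_squared(y_true, mean_pred)
print(r2_good, r2_mean)
```

A score of 1 means perfect predictions, 0 means no better than always predicting the mean, and the score can be negative for a model that does worse than that.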

In [22]:
"""
Here the training R² score is about 0.68 but the test R² score is only about 0.14, which is very low. Plain linear regression is clearly overfitting the data, so let's try other models to address this overfitting issue.
"""

Regularization

In [23]:
"""
Regularization adds a penalty to the loss function that grows with the size of the model's coefficients, which discourages overly complex models. The regularization parameter (lambda, called alpha in scikit-learn) controls the strength of the penalty. It penalizes all the parameters except the intercept, so that the model generalizes better and is less likely to overfit.
"""
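Written out, and assuming the usual least-squares loss with a weight vector $w$ over $n$ training rows (these symbols are not used elsewhere in this notebook), the two penalized objectives look as follows. The scaling shown is the one given in the scikit-learn documentation for `Lasso` and `Ridge`, where the penalty strength is named `alpha`.

```latex
\text{Lasso (L1):}\quad \min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1,
\qquad \lVert w \rVert_1 = \sum_j \lvert w_j \rvert

\text{Ridge (L2):}\quad \min_{w}\; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2,
\qquad \lVert w \rVert_2^2 = \sum_j w_j^2
```

Because the two objectives are scaled differently, the same numeric value of alpha does not mean the same amount of regularization in both models.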

Lasso Regression (L1 Regularization)

In [24]:
"""
Lasso adds the sum of the absolute values of the coefficients as a penalty term to the loss function. This can shrink the coefficients of less important features all the way to zero, removing those features altogether. So, it works well for feature selection when we have a huge number of features.
"""
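One way to see why an L1 penalty produces exact zeros is the soft-thresholding operator, which gives the Lasso solution in a simplified setting (an orthonormal design, which is an assumption and does not hold for the housing data). The sketch below applies it to made-up coefficients; `soft_threshold` is an illustrative helper, not a scikit-learn function.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink each value towards zero by t, and set it to zero if |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

ols_coefs = np.array([3.0, -0.5, 1.2, -2.0])   # made-up unpenalized coefficients
lasso_like = soft_threshold(ols_coefs, 1.0)    # threshold of 1.0 (arbitrary)

print(lasso_like)
print("coefficients set to zero:", int(np.sum(lasso_like == 0)))
```

Every coefficient is pulled towards zero by the same amount, and any coefficient whose magnitude is below the threshold is removed completely.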

In [25]:
from sklearn.linear_model import Lasso
# alpha sets the penalty strength; a low max_iter and a loose tol keep the fit fast
lasso_reg = Lasso(alpha=50, max_iter=100, tol=0.1)
lasso_reg.fit(X_train, y_train)

In [26]:
lasso_reg.score(X_train, y_train)

0.6766985624766824

In [27]:
lasso_reg.score(X_test, y_test)

0.6636111369404488

Ridge Regression (L2 Regularization)

In [28]:
"""
Ridge regression adds the squared magnitude of the coefficients as a penalty term to the loss function. It shrinks the coefficients towards zero but, unlike Lasso, does not usually set them exactly to zero.
"""
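Without an intercept, Ridge has a closed-form solution, $w = (X^\top X + \alpha I)^{-1} X^\top y$. The sketch below uses synthetic data and an illustrative helper `ridge_closed_form` (neither comes from the housing example) to show the coefficient vector shrinking as alpha grows.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y for w (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                    # synthetic features
true_w = np.array([4.0, -3.0, 2.0, 0.0, 0.0])    # last two features are irrelevant
y = X @ true_w + rng.normal(scale=0.5, size=100)

norms = []
for alpha in (0.0, 10.0, 1000.0):
    w = ridge_closed_form(X, y, alpha)
    norms.append(float(np.linalg.norm(w)))
    print(f"alpha={alpha:7.1f}  w={np.round(w, 3)}  ||w||={norms[-1]:.3f}")
```

With alpha = 0 this is ordinary least squares. Larger values of alpha shrink all coefficients together rather than removing individual features.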

In [29]:
from sklearn.linear_model import Ridge
ridge_reg = Ridge(alpha=50, max_iter=100, tol=0.1)
ridge_reg.fit(X_train, y_train)

In [30]:
ridge_reg.score(X_train, y_train)

0.6622376739684328

In [31]:
ridge_reg.score(X_test, y_test)

0.6670848945194959
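The difference between the two penalties shows up in the fitted coefficients. The following sketch does not use the housing data: it assumes a synthetic problem from scikit-learn's `make_regression` in which only 5 of 50 features are informative, and the alpha values are arbitrary. It counts how many coefficients each model sets exactly to zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (assumption): 50 features, only 5 of which affect the target
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=5.0).fit(X, y)
ridge = Ridge(alpha=5.0).fit(X, y)

# Count coefficients that are exactly zero in each fitted model
lasso_zeros = int(np.sum(lasso.coef_ == 0))
ridge_zeros = int(np.sum(ridge.coef_ == 0))
print("Lasso coefficients set to zero:", lasso_zeros)
print("Ridge coefficients set to zero:", ridge_zeros)
```

On the housing data, the same kind of check could be run on the fitted models, for example `np.sum(lasso_reg.coef_ == 0)`.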

In [32]:
"""
Conclusion:
    We see that Lasso and Ridge regularization are beneficial when our plain linear regression model overfits: the test R² score rose from about 0.14 to about 0.66 with Lasso and about 0.67 with Ridge, close to the corresponding training scores. L1 and L2 regularization are also used in neural networks.

"""