# Recap:

1. Implementation of Linear Regression on Cereal data and on Airbnb Data
2. Errors - Mean Absolute Error and Root Mean Squared Error

# Agenda:

1. Bias and Variance
2. Regularization
3. Bias-Variance Trade-off

## Loading the standard libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

## Loading the data

In [3]:
data = pd.read_csv('Melbourne_housing_Full.csv')
data.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,68 Studley St,2,h,,SS,Jellis,3/09/2016,2.5,3067.0,...,1.0,1.0,126.0,,,Yarra City Council,-37.8014,144.9958,Northern Metropolitan,4019.0
1,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,...,1.0,1.0,202.0,,,Yarra City Council,-37.7996,144.9984,Northern Metropolitan,4019.0
2,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra City Council,-37.8079,144.9934,Northern Metropolitan,4019.0
3,Abbotsford,18/659 Victoria St,3,u,,VB,Rounds,4/02/2016,2.5,3067.0,...,2.0,1.0,0.0,,,Yarra City Council,-37.8114,145.0116,Northern Metropolitan,4019.0
4,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra City Council,-37.8093,144.9944,Northern Metropolitan,4019.0


In [4]:
data.shape

(34857, 21)

In [5]:
data.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

In [6]:
## Let's use a limited set of columns to keep the example easy to follow

cols_to_use = ['Suburb', 'Rooms', 'Type', 'Method', 'SellerG', 'Regionname', 'Propertycount', 'Distance', 'CouncilArea', 
              'Bedroom2', 'Bathroom', 'Car', 'Landsize', 'BuildingArea', 'Price']

data = data[cols_to_use]
data.head()

Unnamed: 0,Suburb,Rooms,Type,Method,SellerG,Regionname,Propertycount,Distance,CouncilArea,Bedroom2,Bathroom,Car,Landsize,BuildingArea,Price
0,Abbotsford,2,h,SS,Jellis,Northern Metropolitan,4019.0,2.5,Yarra City Council,2.0,1.0,1.0,126.0,,
1,Abbotsford,2,h,S,Biggin,Northern Metropolitan,4019.0,2.5,Yarra City Council,2.0,1.0,1.0,202.0,,1480000.0
2,Abbotsford,2,h,S,Biggin,Northern Metropolitan,4019.0,2.5,Yarra City Council,2.0,1.0,0.0,156.0,79.0,1035000.0
3,Abbotsford,3,u,VB,Rounds,Northern Metropolitan,4019.0,2.5,Yarra City Council,3.0,2.0,1.0,0.0,,
4,Abbotsford,3,h,SP,Biggin,Northern Metropolitan,4019.0,2.5,Yarra City Council,3.0,2.0,0.0,134.0,150.0,1465000.0


In [7]:
data.shape

(34857, 15)

## Step 3 : Data preprocessing

## Check missing values in the data

In [8]:
data.isnull().sum()

Suburb               0
Rooms                0
Type                 0
Method               0
SellerG              0
Regionname           3
Propertycount        3
Distance             1
CouncilArea          3
Bedroom2          8217
Bathroom          8226
Car               8728
Landsize         11810
BuildingArea     21115
Price             7610
dtype: int64

In [9]:
data.isnull().sum() / len(data) * 100

Suburb            0.000000
Rooms             0.000000
Type              0.000000
Method            0.000000
SellerG           0.000000
Regionname        0.008607
Propertycount     0.008607
Distance          0.002869
CouncilArea       0.008607
Bedroom2         23.573457
Bathroom         23.599277
Car              25.039447
Landsize         33.881286
BuildingArea     60.576068
Price            21.832057
dtype: float64

## Observation: 

- Landsize and BuildingArea each have more than 30% missing values (about 34% and 61%), so we drop both columns
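
The same decision can be made programmatically instead of by reading the table. The following is a minimal sketch on a made-up toy frame; the `threshold` value and the toy data are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are made up)
toy = pd.DataFrame({
    "Rooms": [2, 3, 3, 4, 2],
    "Landsize": [126.0, np.nan, np.nan, 134.0, 202.0],  # 40% missing
    "Car": [1.0, 1.0, np.nan, 0.0, 2.0],                # 20% missing
})

threshold = 30.0  # percent; hypothetical cutoff mirroring the observation above

# Percentage of missing values per column
missing_pct = toy.isnull().mean() * 100

# Names of the columns whose missing share exceeds the cutoff
cols_to_drop = missing_pct[missing_pct > threshold].index.tolist()

toy_reduced = toy.drop(columns=cols_to_drop)
print(cols_to_drop)
print(toy_reduced.columns.tolist())
```

On the real data the same pattern would be applied to `data` instead of `toy`.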

In [10]:
data = data.drop(['Landsize', 'BuildingArea'], axis = 1)
data.head()

Unnamed: 0,Suburb,Rooms,Type,Method,SellerG,Regionname,Propertycount,Distance,CouncilArea,Bedroom2,Bathroom,Car,Price
0,Abbotsford,2,h,SS,Jellis,Northern Metropolitan,4019.0,2.5,Yarra City Council,2.0,1.0,1.0,
1,Abbotsford,2,h,S,Biggin,Northern Metropolitan,4019.0,2.5,Yarra City Council,2.0,1.0,1.0,1480000.0
2,Abbotsford,2,h,S,Biggin,Northern Metropolitan,4019.0,2.5,Yarra City Council,2.0,1.0,0.0,1035000.0
3,Abbotsford,3,u,VB,Rounds,Northern Metropolitan,4019.0,2.5,Yarra City Council,3.0,2.0,1.0,
4,Abbotsford,3,h,SP,Biggin,Northern Metropolitan,4019.0,2.5,Yarra City Council,3.0,2.0,0.0,1465000.0


In [11]:
data.isnull().sum()

Suburb              0
Rooms               0
Type                0
Method              0
SellerG             0
Regionname          3
Propertycount       3
Distance            1
CouncilArea         3
Bedroom2         8217
Bathroom         8226
Car              8728
Price            7610
dtype: int64

## Missing value imputation on other columns

In [12]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34857 entries, 0 to 34856
Data columns (total 13 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Suburb         34857 non-null  object 
 1   Rooms          34857 non-null  int64  
 2   Type           34857 non-null  object 
 3   Method         34857 non-null  object 
 4   SellerG        34857 non-null  object 
 5   Regionname     34854 non-null  object 
 6   Propertycount  34854 non-null  float64
 7   Distance       34856 non-null  float64
 8   CouncilArea    34854 non-null  object 
 9   Bedroom2       26640 non-null  float64
 10  Bathroom       26631 non-null  float64
 11  Car            26129 non-null  float64
 12  Price          27247 non-null  float64
dtypes: float64(6), int64(1), object(6)
memory usage: 3.5+ MB


In [14]:
## Impute numeric columns with the mean and categorical columns with the mode

data['Propertycount'] = data['Propertycount'].fillna(data['Propertycount'].mean())
data['Bedroom2'] = data['Bedroom2'].fillna(data['Bedroom2'].mean())
data['Bathroom'] = data['Bathroom'].fillna(data['Bathroom'].mean())
data['Car'] = data['Car'].fillna(data['Car'].mean())
data['Price'] = data['Price'].fillna(data['Price'].mean())
data['Distance'] = data['Distance'].fillna(data['Distance'].mean())
data['Regionname'] = data['Regionname'].fillna(data['Regionname'].mode()[0])
data['CouncilArea'] = data['CouncilArea'].fillna(data['CouncilArea'].mode()[0])

In [15]:
data.isnull().sum()

Suburb           0
Rooms            0
Type             0
Method           0
SellerG          0
Regionname       0
Propertycount    0
Distance         0
CouncilArea      0
Bedroom2         0
Bathroom         0
Car              0
Price            0
dtype: int64
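
One caveat: Price is the target, so filling it with the mean creates rows whose label was never observed. A common alternative is to drop the rows with a missing target and impute only the features. The sketch below shows that idea on a made-up toy frame; it is an assumption about how one might proceed, not what the cells above do.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with gaps in a feature, a category and the target
toy = pd.DataFrame({
    "Bathroom": [1.0, np.nan, 2.0, 3.0],
    "Regionname": ["Northern Metropolitan", "Northern Metropolitan", None, None],
    "Price": [1480000.0, 1035000.0, np.nan, 1465000.0],
})

# Keep only the rows where the target was actually observed
toy = toy.dropna(subset=["Price"]).copy()

# Numeric feature: fill with the mean; categorical feature: fill with the mode
toy["Bathroom"] = toy["Bathroom"].fillna(toy["Bathroom"].mean())
toy["Regionname"] = toy["Regionname"].fillna(toy["Regionname"].mode()[0])

print(toy)
```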

## Encoding the categorical variables

In [22]:
data['CouncilArea'].unique()

array(['Yarra City Council', 'Moonee Valley City Council',
       'Port Phillip City Council', 'Darebin City Council',
       'Hobsons Bay City Council', 'Stonnington City Council',
       'Boroondara City Council', 'Monash City Council',
       'Glen Eira City Council', 'Whitehorse City Council',
       'Maribyrnong City Council', 'Bayside City Council',
       'Moreland City Council', 'Manningham City Council',
       'Melbourne City Council', 'Banyule City Council',
       'Brimbank City Council', 'Kingston City Council',
       'Hume City Council', 'Knox City Council', 'Maroondah City Council',
       'Casey City Council', 'Melton City Council',
       'Greater Dandenong City Council', 'Nillumbik Shire Council',
       'Cardinia Shire Council', 'Whittlesea City Council',
       'Frankston City Council', 'Macedon Ranges Shire Council',
       'Yarra Ranges Shire Council', 'Wyndham City Council',
       'Moorabool Shire Council', 'Mitchell Shire Council'], dtype=object)

## All the object (string) columns in the data are nominal, so apply one-hot encoding to each of them


In [23]:
data_ohe = pd.get_dummies(data[['Suburb', 'Type', 'Method', 'SellerG', 'Regionname', 'CouncilArea']])
data_ohe.head()

Unnamed: 0,Suburb_Abbotsford,Suburb_Aberfeldie,Suburb_Airport West,Suburb_Albanvale,Suburb_Albert Park,Suburb_Albion,Suburb_Alphington,Suburb_Altona,Suburb_Altona Meadows,Suburb_Altona North,...,CouncilArea_Moorabool Shire Council,CouncilArea_Moreland City Council,CouncilArea_Nillumbik Shire Council,CouncilArea_Port Phillip City Council,CouncilArea_Stonnington City Council,CouncilArea_Whitehorse City Council,CouncilArea_Whittlesea City Council,CouncilArea_Wyndham City Council,CouncilArea_Yarra City Council,CouncilArea_Yarra Ranges Shire Council
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [24]:
data = pd.concat([data, data_ohe], axis = 1)
data.head()

Unnamed: 0,Suburb,Rooms,Type,Method,SellerG,Regionname,Propertycount,Distance,CouncilArea,Bedroom2,...,CouncilArea_Moorabool Shire Council,CouncilArea_Moreland City Council,CouncilArea_Nillumbik Shire Council,CouncilArea_Port Phillip City Council,CouncilArea_Stonnington City Council,CouncilArea_Whitehorse City Council,CouncilArea_Whittlesea City Council,CouncilArea_Wyndham City Council,CouncilArea_Yarra City Council,CouncilArea_Yarra Ranges Shire Council
0,Abbotsford,2,h,SS,Jellis,Northern Metropolitan,4019.0,2.5,Yarra City Council,2.0,...,0,0,0,0,0,0,0,0,1,0
1,Abbotsford,2,h,S,Biggin,Northern Metropolitan,4019.0,2.5,Yarra City Council,2.0,...,0,0,0,0,0,0,0,0,1,0
2,Abbotsford,2,h,S,Biggin,Northern Metropolitan,4019.0,2.5,Yarra City Council,2.0,...,0,0,0,0,0,0,0,0,1,0
3,Abbotsford,3,u,VB,Rounds,Northern Metropolitan,4019.0,2.5,Yarra City Council,3.0,...,0,0,0,0,0,0,0,0,1,0
4,Abbotsford,3,h,SP,Biggin,Northern Metropolitan,4019.0,2.5,Yarra City Council,3.0,...,0,0,0,0,0,0,0,0,1,0


In [28]:
data = data.drop(['Suburb', 'Type', 'Method', 'SellerG', 'Regionname', 'CouncilArea'], axis = 1)
data.head()

Unnamed: 0,Rooms,Propertycount,Distance,Bedroom2,Bathroom,Car,Price,Suburb_Abbotsford,Suburb_Aberfeldie,Suburb_Airport West,...,CouncilArea_Moorabool Shire Council,CouncilArea_Moreland City Council,CouncilArea_Nillumbik Shire Council,CouncilArea_Port Phillip City Council,CouncilArea_Stonnington City Council,CouncilArea_Whitehorse City Council,CouncilArea_Whittlesea City Council,CouncilArea_Wyndham City Council,CouncilArea_Yarra City Council,CouncilArea_Yarra Ranges Shire Council
0,2,4019.0,2.5,2.0,1.0,1.0,1050173.0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
1,2,4019.0,2.5,2.0,1.0,1.0,1480000.0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
2,2,4019.0,2.5,2.0,1.0,0.0,1035000.0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
3,3,4019.0,2.5,3.0,2.0,1.0,1050173.0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
4,3,4019.0,2.5,3.0,2.0,0.0,1465000.0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
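
The encode, concat and drop steps above can also be collapsed into a single `pd.get_dummies` call by passing `columns=`. The sketch below uses a made-up toy frame and also shows `drop_first=True`, which leaves out one dummy per category so the dummies are no longer perfectly collinear. Whether dropping a level is appropriate here is a judgment call, not something the notebook does.

```python
import pandas as pd

# Toy frame (made-up values) with one nominal column
toy = pd.DataFrame({"Type": ["h", "u", "t", "h"], "Rooms": [2, 3, 3, 4]})

# Encode and replace the original column in one step;
# dtype=int gives 0/1 columns regardless of the pandas version
full = pd.get_dummies(toy, columns=["Type"], dtype=int)

# Same, but without the first level of each encoded column
reduced = pd.get_dummies(toy, columns=["Type"], drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
```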


In [29]:
data.describe()

Unnamed: 0,Rooms,Propertycount,Distance,Bedroom2,Bathroom,Car,Price,Suburb_Abbotsford,Suburb_Aberfeldie,Suburb_Airport West,...,CouncilArea_Moorabool Shire Council,CouncilArea_Moreland City Council,CouncilArea_Nillumbik Shire Council,CouncilArea_Port Phillip City Council,CouncilArea_Stonnington City Council,CouncilArea_Whitehorse City Council,CouncilArea_Whittlesea City Council,CouncilArea_Wyndham City Council,CouncilArea_Yarra City Council,CouncilArea_Yarra Ranges Shire Council
count,34857.0,34857.0,34857.0,34857.0,34857.0,34857.0,34857.0,34857.0,34857.0,34857.0,...,34857.0,34857.0,34857.0,34857.0,34857.0,34857.0,34857.0,34857.0,34857.0,34857.0
mean,3.031012,7572.888306,11.184929,3.084647,1.624798,1.728845,1050173.0,0.00393,0.002295,0.004648,...,0.000201,0.060877,0.002525,0.036721,0.041885,0.01773,0.023754,0.017902,0.034025,0.002926
std,0.969933,4427.89975,6.788795,0.857337,0.633013,0.875119,567135.7,0.06257,0.047853,0.068015,...,0.01417,0.239109,0.050183,0.18808,0.20033,0.131969,0.152285,0.132596,0.181295,0.054016
min,1.0,83.0,0.0,0.0,0.0,0.0,85000.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2.0,4385.0,6.4,3.0,1.0,1.0,695000.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,3.0,6763.0,10.3,3.0,1.624798,1.728845,1050173.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,4.0,10412.0,14.0,3.084647,2.0,2.0,1150000.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,16.0,21650.0,48.1,30.0,12.0,26.0,11200000.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


## Feature Scaling on Propertycount and Distance variables

In [30]:
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()
mms

In [32]:
data[['Propertycount', 'Distance']]= mms.fit_transform(data[['Propertycount', 'Distance']])
data.head()

Unnamed: 0,Rooms,Propertycount,Distance,Bedroom2,Bathroom,Car,Price,Suburb_Abbotsford,Suburb_Aberfeldie,Suburb_Airport West,...,CouncilArea_Moorabool Shire Council,CouncilArea_Moreland City Council,CouncilArea_Nillumbik Shire Council,CouncilArea_Port Phillip City Council,CouncilArea_Stonnington City Council,CouncilArea_Whitehorse City Council,CouncilArea_Whittlesea City Council,CouncilArea_Wyndham City Council,CouncilArea_Yarra City Council,CouncilArea_Yarra Ranges Shire Council
0,2,0.182501,0.051975,2.0,1.0,1.0,1050173.0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
1,2,0.182501,0.051975,2.0,1.0,1.0,1480000.0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
2,2,0.182501,0.051975,2.0,1.0,0.0,1035000.0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
3,3,0.182501,0.051975,3.0,2.0,1.0,1050173.0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
4,3,0.182501,0.051975,3.0,2.0,0.0,1465000.0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
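
Min-max scaling maps each value to (x - min) / (max - min). As a quick check, the sketch below recomputes the scaled values of the first row by hand, using the min and max reported by `data.describe()` above, and compares one of them with `MinMaxScaler` on a tiny made-up column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Min and max of Distance and Propertycount, copied from data.describe() above
distance_min, distance_max = 0.0, 48.1
count_min, count_max = 83.0, 21650.0

# First row of the data: Distance = 2.5, Propertycount = 4019
distance_scaled = (2.5 - distance_min) / (distance_max - distance_min)
count_scaled = (4019.0 - count_min) / (count_max - count_min)
print(round(distance_scaled, 6), round(count_scaled, 6))

# The scaler gives the same value when the column contains the min, the value and the max
col = np.array([[0.0], [2.5], [48.1]])
scaled_col = MinMaxScaler().fit_transform(col)
print(scaled_col.ravel())
```

The two rounded values agree with the 0.051975 and 0.182501 shown in the table.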


## Step 4 : Separate X and y

In [33]:
X = data.drop('Price', axis = 1)
y = data['Price']

## Step 5 : Split the data into train and test sets

In [34]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
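
In the cells above the scaler is fitted on the full dataset before the split, so the test rows influence the min and max used for scaling. A stricter ordering is to split first and fit the scaler on the training rows only. The sketch below shows that ordering on synthetic data; the array names and sizes are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 50, size=(100, 2))  # synthetic stand-in for two numeric features
y_toy = rng.normal(size=100)               # synthetic target

# Split first, with the same settings as the notebook
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn min/max from the training rows only
X_te_scaled = scaler.transform(X_te)      # reuse those statistics on the test rows

print(X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
```

With this ordering the scaled test values may fall slightly outside [0, 1], which is expected.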

## Step 6 : Apply Linear Regression on train set

In [35]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr

In [36]:
lr.fit(X_train, y_train)

## Step 7: Perform Predictions

In [37]:
y_pred = lr.predict(X_test)
y_pred

array([ 714496.,  829184., 1227264., ..., 1535744.,  637696.,  676352.])

## Check the R² score on the train set and the test set

In [38]:
lr.score(X_train, y_train)

0.4938046467174002

In [40]:
lr.score(X_test, y_test)

-1.7133638126869647e+21

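## The R² score on the train set is far higher than on the test set. This is a case of overfitting

- For a regression model, score returns R², not classification accuracy. A negative test R² means the model predicts worse than always predicting the mean of the test targets.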
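
To see why the score can go below zero, recall that R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the targets from their mean. The sketch below computes it by hand on made-up numbers and compares with `sklearn.metrics.r2_score`; the helper name `r_squared` is hypothetical.

```python
import numpy as np
from sklearn.metrics import r2_score

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (hypothetical helper for illustration)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])

good = np.array([1.1, 1.9, 3.2, 3.8])            # close to the targets
mean_only = np.full(4, y_true.mean())            # always predict the mean
bad = np.array([10.0, -10.0, 10.0, -10.0])       # wildly off

r2_good = r_squared(y_true, good)
r2_mean = r_squared(y_true, mean_only)
r2_bad = r_squared(y_true, bad)
print(r2_good, r2_mean, r2_bad)
```

Predictions that are far from the targets make SS_res much larger than SS_tot, which is how a score such as the test value above can become hugely negative.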

- Ways to avoid overfitting:
1. Increase the data size (add more rows to the data) - More data often helps, but there is no guarantee that it will solve the overfitting problem.

2. Regularization Techniques -   
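    a. Lasso Regression  (L1 Regularization)  - Penalizes the sum of the absolute values of the coefficients. It can shrink the coefficients of less important features to exactly 0, which effectively removes those features from the model.  
    b. Ridge Regression  (L2 Regularization) - Penalizes the sum of the squared coefficients. It keeps all the features but shrinks their coefficients towards 0, usually without making them exactly 0.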
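
The difference between the two penalties is easy to see on synthetic data where only a few features matter. The sketch below is an illustration under assumed settings (10 standard-normal features, only the first two informative, `alpha=0.5` for both models); none of these values come from the notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X_syn = rng.normal(size=(500, 10))

# Only the first two features influence the target; the other eight are noise
y_syn = 3.0 * X_syn[:, 0] - 2.0 * X_syn[:, 1] + rng.normal(scale=0.1, size=500)

lasso_syn = Lasso(alpha=0.5).fit(X_syn, y_syn)
ridge_syn = Ridge(alpha=0.5).fit(X_syn, y_syn)

# Count the coefficients that are exactly zero in each model
n_zero_lasso = int(np.sum(lasso_syn.coef_ == 0))
n_zero_ridge = int(np.sum(ridge_syn.coef_ == 0))

print("Lasso coefficients:", np.round(lasso_syn.coef_, 3))
print("Ridge coefficients:", np.round(ridge_syn.coef_, 3))
print("exact zeros:", n_zero_lasso, n_zero_ridge)
```

Lasso sets the coefficients of uninformative features to exactly zero, while Ridge leaves them small but nonzero.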

In [42]:
from sklearn.linear_model import Lasso
lasso = Lasso()
lasso

In [43]:
from sklearn.linear_model import Ridge
ridge = Ridge()
ridge

In [44]:
lasso.fit(X_train, y_train)

In [45]:
lasso.score(X_train, y_train)

0.49570632707964046

In [46]:
lasso.score(X_test, y_test)

0.47173907802314774

In [47]:
ridge.fit(X_train, y_train)

In [48]:
ridge.score(X_train, y_train)

0.49477863583862225

In [49]:
ridge.score(X_test, y_test)

0.4754289708755264
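
Both models above use the default regularization strength (`alpha=1.0`). The strength can be tuned, for example with `RidgeCV`, which picks an alpha from a candidate list by cross-validation. The sketch below runs on synthetic data; the candidate grid and the data are assumptions, and the chosen alpha on the housing data may well differ.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_syn = rng.normal(size=(300, 20))

# Only the first three features carry signal
y_syn = X_syn[:, :3] @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=0)

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]  # hypothetical candidate grid
ridge_cv = RidgeCV(alphas=alphas).fit(X_tr, y_tr)

train_r2 = ridge_cv.score(X_tr, y_tr)
test_r2 = ridge_cv.score(X_te, y_te)
print("chosen alpha:", ridge_cv.alpha_)
print("train R^2:", train_r2)
print("test R^2:", test_r2)
```

`LassoCV` offers the same kind of search for the L1 penalty.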