**Underfitting:**<br>
The line fits the data too loosely (e.g. a straight line drawn through data points that follow a curve).

**Overfitting:**<br>
The line fits the data too tightly (e.g. a winding curve that passes exactly through every data point).

**Balanced fit:**<br>
The line matches the overall shape of the data and gives a good estimate of where future data points will lie.
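
The three kinds of fit described above can be sketched with polynomial fits of different degrees. This is a minimal illustration on synthetic data, not part of the housing analysis: the sine-wave data, the noise level, and the degrees 1, 4 and 12 are all assumptions chosen for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic curved data (an assumption for illustration): a sine wave plus noise.
x_train = np.sort(rng.uniform(0, 6, 20))
y_train = np.sin(x_train) + rng.normal(0, 0.2, x_train.size)
x_test = np.sort(rng.uniform(0, 6, 200))
y_test = np.sin(x_test) + rng.normal(0, 0.2, x_test.size)


def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return float(np.mean((y_true - y_pred) ** 2))


# Degree 1 stands in for an underfit line, degree 4 for a balanced fit,
# and degree 12 for an overfit curve.
results = {}
for degree in (1, 4, 12):
    poly = np.polynomial.Polynomial.fit(x_train, y_train, degree)
    results[degree] = (mse(y_train, poly(x_train)), mse(y_test, poly(x_test)))
    print(f"degree {degree:2d}: train MSE {results[degree][0]:.4f}, "
          f"test MSE {results[degree][1]:.4f}")
```

The training error can only go down as the degree increases, because each higher-degree polynomial contains the lower-degree ones as special cases. The test error is the number to watch: an overfit curve typically does well on the training points and poorly on unseen ones.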

**Reducing overfitting:**<br>
Overfit lines tend to have many parameters in their equation. To reduce overfitting, shrink the parameters toward 0 and remove the ones that end up at (or very close to) 0 from the line function.

**Regularization:**<br>
Adds a 'penalty' term to the Mean Squared Error (MSE) summation that grows with the size of the theta parameters, so an equation with many large higher-order polynomial terms costs more. That way, minimizing the cost balances finding an equation with low error against keeping the theta parameters small and few.
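
Written out, the penalized cost might look like the following. The symbols are assumptions introduced here for illustration: $n$ is the number of samples, $\hat{y}_i$ the prediction for sample $i$, $p$ the number of theta parameters, and $\lambda \ge 0$ the penalty strength. The two usual choices of penalty $P(\theta)$ are the L1 and L2 forms.

```latex
J(\theta) = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2 + \lambda\,P(\theta),
\qquad
P_{\text{L1}}(\theta) = \sum_{j=1}^{p}\lvert\theta_j\rvert,
\qquad
P_{\text{L2}}(\theta) = \sum_{j=1}^{p}\theta_j^{2}
```

In scikit-learn the `alpha` argument plays the role of $\lambda$, although the library scales the error term slightly differently from the plain MSE shown here.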

**L1 Regularization:**<br>
The penalty is the sum of the absolute values of the theta parameters. It can shrink some parameters all the way to 0.

**L2 Regularization:**<br>
The penalty is the sum of the squared theta parameters. It shrinks parameters toward 0 but rarely makes them exactly 0.
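
As a small worked example, the two penalties can be computed by hand for a made-up parameter vector. The values in `theta` are hypothetical and only serve to show the arithmetic.

```python
import numpy as np

# Hypothetical theta parameters (not taken from any fitted model).
theta = np.array([3.0, -0.5, 0.0, 2.0])

# L1 penalty: sum of absolute values -> 3 + 0.5 + 0 + 2 = 5.5
l1_penalty = float(np.sum(np.abs(theta)))

# L2 penalty: sum of squares -> 9 + 0.25 + 0 + 4 = 13.25
l2_penalty = float(np.sum(theta ** 2))

print("L1 penalty:", l1_penalty)
print("L2 penalty:", l2_penalty)
```

Note how the large parameter (3.0) dominates the L2 penalty much more than the L1 penalty, since squaring amplifies large values.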

## Import dataset. Drop irrelevant columns. Impute columns with 'NA' values. Convert columns with string values to use one hot encoding.

In [20]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

dataset = pd.read_csv('./Melbourne_housing_FULL.csv')

In [21]:
cols_to_use = ['Suburb', 'Rooms', 'Type', 'Method', 'SellerG', 'Regionname', 'Propertycount', 
               'Distance', 'CouncilArea', 'Bedroom2', 'Bathroom', 'Car', 'Landsize', 'BuildingArea', 'Price']
dataset = dataset[cols_to_use]
dataset.head()

,Suburb,Rooms,Type,Method,SellerG,Regionname,Propertycount,Distance,CouncilArea,Bedroom2,Bathroom,Car,Landsize,BuildingArea,Price
0,Abbotsford,2,h,SS,Jellis,Northern Metropolitan,4019.0,2.5,Yarra City Council,2.0,1.0,1.0,126.0,,
1,Abbotsford,2,h,S,Biggin,Northern Metropolitan,4019.0,2.5,Yarra City Council,2.0,1.0,1.0,202.0,,1480000.0
2,Abbotsford,2,h,S,Biggin,Northern Metropolitan,4019.0,2.5,Yarra City Council,2.0,1.0,0.0,156.0,79.0,1035000.0
3,Abbotsford,3,u,VB,Rounds,Northern Metropolitan,4019.0,2.5,Yarra City Council,3.0,2.0,1.0,0.0,,
4,Abbotsford,3,h,SP,Biggin,Northern Metropolitan,4019.0,2.5,Yarra City Council,3.0,2.0,0.0,134.0,150.0,1465000.0


In [22]:
dataset.isna().sum()

Suburb               0
Rooms                0
Type                 0
Method               0
SellerG              0
Regionname           3
Propertycount        3
Distance             1
CouncilArea          3
Bedroom2          8217
Bathroom          8226
Car               8728
Landsize         11810
BuildingArea     21115
Price             7610
dtype: int64

In [23]:
cols_to_fill_zero = ['Propertycount', 'Distance', 'Bedroom2', 'Bathroom', 'Car']
dataset[cols_to_fill_zero] = dataset[cols_to_fill_zero].fillna(0)

dataset['Landsize'] = dataset['Landsize'].fillna(dataset.Landsize.mean())
dataset['BuildingArea'] = dataset['BuildingArea'].fillna(dataset.BuildingArea.mean())

In [24]:
dataset.dropna(inplace=True)
dataset.shape

(27244, 15)
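
The imputation steps above can be traced on a tiny made-up frame. The three rows and their values are assumptions for illustration only; the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with one gap per column (hypothetical values).
toy = pd.DataFrame({
    "Car": [1.0, np.nan, 2.0],
    "Landsize": [100.0, np.nan, 300.0],
    "Price": [500000.0, 750000.0, np.nan],
})

# Count-like column: a missing value is treated as 0.
toy["Car"] = toy["Car"].fillna(0)

# Area-like column: a missing value is replaced by the column mean (200.0 here).
toy["Landsize"] = toy["Landsize"].fillna(toy["Landsize"].mean())

# Any row that still has a gap (here the one without a Price) is dropped.
toy = toy.dropna()
print(toy)
```

Only the row with a missing `Price` is removed, which mirrors why the real dataset loses most of its dropped rows to the target column rather than to the features.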

In [25]:
dataset.head()

,Suburb,Rooms,Type,Method,SellerG,Regionname,Propertycount,Distance,CouncilArea,Bedroom2,Bathroom,Car,Landsize,BuildingArea,Price
1,Abbotsford,2,h,S,Biggin,Northern Metropolitan,4019.0,2.5,Yarra City Council,2.0,1.0,1.0,202.0,160.2564,1480000.0
2,Abbotsford,2,h,S,Biggin,Northern Metropolitan,4019.0,2.5,Yarra City Council,2.0,1.0,0.0,156.0,79.0,1035000.0
4,Abbotsford,3,h,SP,Biggin,Northern Metropolitan,4019.0,2.5,Yarra City Council,3.0,2.0,0.0,134.0,150.0,1465000.0
5,Abbotsford,3,h,PI,Biggin,Northern Metropolitan,4019.0,2.5,Yarra City Council,3.0,2.0,1.0,94.0,160.2564,850000.0
6,Abbotsford,4,h,VB,Nelson,Northern Metropolitan,4019.0,2.5,Yarra City Council,3.0,1.0,2.0,120.0,142.0,1600000.0


In [26]:
dataset = pd.get_dummies(dataset, drop_first=True)
dataset.head()

,Rooms,Propertycount,Distance,Bedroom2,Bathroom,Car,Landsize,BuildingArea,Price,Suburb_Aberfeldie,...,CouncilArea_Moorabool Shire Council,CouncilArea_Moreland City Council,CouncilArea_Nillumbik Shire Council,CouncilArea_Port Phillip City Council,CouncilArea_Stonnington City Council,CouncilArea_Whitehorse City Council,CouncilArea_Whittlesea City Council,CouncilArea_Wyndham City Council,CouncilArea_Yarra City Council,CouncilArea_Yarra Ranges Shire Council
1,2,4019.0,2.5,2.0,1.0,1.0,202.0,160.2564,1480000.0,0,...,0,0,0,0,0,0,0,0,1,0
2,2,4019.0,2.5,2.0,1.0,0.0,156.0,79.0,1035000.0,0,...,0,0,0,0,0,0,0,0,1,0
4,3,4019.0,2.5,3.0,2.0,0.0,134.0,150.0,1465000.0,0,...,0,0,0,0,0,0,0,0,1,0
5,3,4019.0,2.5,3.0,2.0,1.0,94.0,160.2564,850000.0,0,...,0,0,0,0,0,0,0,0,1,0
6,4,4019.0,2.5,3.0,1.0,2.0,120.0,142.0,1600000.0,0,...,0,0,0,0,0,0,0,0,1,0
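
To see what `pd.get_dummies(..., drop_first=True)` does, here is a minimal sketch on a made-up frame with a single string column. The four rows are hypothetical.

```python
import pandas as pd

# Toy frame: one numeric column and one string column (hypothetical values).
toy = pd.DataFrame({
    "Rooms": [2, 3, 3, 4],
    "Type": ["h", "u", "t", "h"],
})

# Each distinct string becomes its own indicator column. drop_first=True
# drops the first category ('h'), which is then implied when all the
# remaining indicator columns are 0.
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded)
```

Numeric columns pass through unchanged. Depending on the pandas version, the indicator columns print as 0/1 or as False/True; both encode the same information.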


## Perform training/testing sample set splits.

In [27]:
x = dataset.drop('Price', axis=1)
y = dataset['Price']

In [10]:
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.3, random_state=2)
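
A quick sketch of what the split produces, using a made-up ten-row frame (the data is an assumption for illustration). With `test_size=0.3`, three rows go to the test set and seven to the training set, and fixing `random_state` makes the split repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten hypothetical samples.
toy_x = pd.DataFrame({"Rooms": np.arange(10)})
toy_y = pd.Series(np.arange(10) * 100_000, name="Price")

tr_x, te_x, tr_y, te_y = train_test_split(
    toy_x, toy_y, test_size=0.3, random_state=2
)

# Running the split again with the same random_state selects the same rows.
tr_x2, te_x2, tr_y2, te_y2 = train_test_split(
    toy_x, toy_y, test_size=0.3, random_state=2
)

print(len(tr_x), len(te_x))
print(sorted(te_x.index.tolist()))
```

Features and targets are shuffled together, so each row in `tr_x` still lines up with its price in `tr_y`.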

## Train linear regression model. The score on the testing sample set is very low.
## The value returned by `score` is R² (the coefficient of determination), not a classification accuracy. When the score for training data is much higher than the score for testing data (0.68 vs 0.14), that means the model 'overfit' to the training data.

In [11]:
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(train_x, train_y)

In [12]:
reg.score(test_x, test_y)

0.1385368316179305

In [13]:
reg.score(train_x, train_y)

0.6827792395792723
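
For a regressor, `score` returns the R² value, which can be reproduced by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks this on synthetic data; the data-generating coefficients and noise level are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic regression problem (hypothetical coefficients 3 and -2).
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

model = LinearRegression().fit(X, y)
pred = model.predict(X)

ss_res = np.sum((y - pred) ** 2)      # error left after the fit
ss_tot = np.sum((y - y.mean()) ** 2)  # error of always predicting the mean
manual_r2 = 1.0 - ss_res / ss_tot

print(manual_r2, model.score(X, y))
```

An R² of 1 means perfect predictions and 0 means no better than predicting the mean. On unseen data it can even be negative, so a low test score next to a high training score is a strong hint of overfitting.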

## Using Lasso (L1 Regularized) Regression Model.

In [14]:
from sklearn import linear_model
lasso_reg = linear_model.Lasso(alpha=50, max_iter=100, tol=0.1)
lasso_reg.fit(train_x, train_y)

Lasso(alpha=50, max_iter=100, tol=0.1)

In [15]:
lasso_reg.score(test_x, test_y)

0.6636111369404488

In [16]:
lasso_reg.score(train_x, train_y)

0.6766985624766824
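
One way to see why Lasso helps is to count how many coefficients it sets to exactly 0. The sketch below uses synthetic data rather than the housing set: 3 informative features, 17 pure-noise features, and `alpha=0.5` are all assumptions chosen for the demo.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)

# 200 samples, 20 features; only the first three actually drive y.
X = rng.normal(size=(200, 20))
true_coef = np.zeros(20)
true_coef[:3] = [5.0, -4.0, 3.0]
y = X @ true_coef + rng.normal(scale=1.0, size=200)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

n_zero_ols = int(np.sum(ols.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))

print("zero coefficients, plain linear regression:", n_zero_ols)
print("zero coefficients, Lasso:", n_zero_lasso)
```

Plain linear regression gives every noise feature a small non-zero weight, while Lasso drops features that do not earn their penalty. The same check on the fitted housing model would be `np.sum(lasso_reg.coef_ == 0)`.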

## Using Ridge (L2 Regularized) Regression Model.

In [17]:
from sklearn.linear_model import Ridge
ridge_reg = Ridge(alpha=50, max_iter=100, tol=0.1)
ridge_reg.fit(train_x, train_y)

Ridge(alpha=50, max_iter=100, tol=0.1)

In [18]:
ridge_reg.score(test_x, test_y)

0.6670848945194959

In [19]:
ridge_reg.score(train_x, train_y)

0.6622376739684328
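
Ridge behaves differently from Lasso: it shrinks the coefficients as a group but generally leaves them non-zero. A minimal sketch on synthetic data (the data and its dimensions are assumptions; `alpha=50` matches the cell above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# Small synthetic problem: 50 samples, 10 features.
X = rng.normal(size=(50, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=1.0, size=50)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=50).fit(X, y)

# Overall size of the coefficient vector before and after the L2 penalty.
ols_norm = float(np.linalg.norm(ols.coef_))
ridge_norm = float(np.linalg.norm(ridge.coef_))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print("coefficient norm, plain linear regression:", ols_norm)
print("coefficient norm, Ridge:", ridge_norm)
print("zero coefficients, Ridge:", n_zero_ridge)
```

Smaller coefficients make the model less sensitive to any single feature, which is one reason the gap between training and testing scores narrows.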

## Now the R² scores are similar for both the training samples and the testing samples, so the regularized models generalize much better than the plain linear regression.