### Introduction

When we train a linear regression model on complex data, it can overfit or underfit.

<div align='center'>
    <img src='overfit.png' width=600>
</div>

*In some cases the model will overfit, and its score on unseen data may be* **lower**.
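To make the idea concrete, here is a minimal sketch on synthetic data (not the housing dataset used below). It assumes a made-up noisy sine curve and fits polynomials of increasing degree with NumPy; the helper names `make_data` and `mse` are hypothetical. A low degree tends to underfit (high error everywhere), while a high degree can fit the training points closely yet typically does worse on fresh points.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of a smooth curve (a made-up stand-in for real data)."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = make_data(15)    # small training set
x_test, y_test = make_data(200)     # fresh points from the same process

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (1, 3, 9):
    # Least-squares polynomial fit on the training points only
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree={degree}: train MSE={results[degree][0]:.4f}, "
          f"test MSE={results[degree][1]:.4f}")
```

The training error can only go down as the degree grows, so the gap between the training and test error is the quantity to watch.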

In [2]:
# importing the modules
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings('ignore')

In [3]:
# loading the dataset
dataset = pd.read_csv('Melbourne_housing_FULL.csv')
dataset.head(3)

| | Suburb | Address | Rooms | Type | Price | Method | SellerG | Date | Distance | Postcode | ... | Bathroom | Car | Landsize | BuildingArea | YearBuilt | CouncilArea | Lattitude | Longtitude | Regionname | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abbotsford | 68 Studley St | 2 | h | NaN | SS | Jellis | 3/09/2016 | 2.5 | 3067.0 | ... | 1.0 | 1.0 | 126.0 | NaN | NaN | Yarra City Council | -37.8014 | 144.9958 | Northern Metropolitan | 4019.0 |
| 1 | Abbotsford | 85 Turner St | 2 | h | 1480000.0 | S | Biggin | 3/12/2016 | 2.5 | 3067.0 | ... | 1.0 | 1.0 | 202.0 | NaN | NaN | Yarra City Council | -37.7996 | 144.9984 | Northern Metropolitan | 4019.0 |
| 2 | Abbotsford | 25 Bloomburg St | 2 | h | 1035000.0 | S | Biggin | 4/02/2016 | 2.5 | 3067.0 | ... | 1.0 | 0.0 | 156.0 | 79.0 | 1900.0 | Yarra City Council | -37.8079 | 144.9934 | Northern Metropolitan | 4019.0 |


In [4]:
# keep only the columns that are relevant for predicting the price
cols_to_use = ['Suburb', 'Rooms', 'Type', 'Method', 'SellerG', 'Regionname', 'Propertycount',
               'Distance', 'CouncilArea', 'Bedroom2', 'Bathroom', 'Car', 'Landsize', 'BuildingArea', 'Price']
# .copy() gives an independent DataFrame, so later in-place edits do not act on a slice of the original
dataset = dataset[cols_to_use].copy()
dataset.head(2)

| | Suburb | Rooms | Type | Method | SellerG | Regionname | Propertycount | Distance | CouncilArea | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abbotsford | 2 | h | SS | Jellis | Northern Metropolitan | 4019.0 | 2.5 | Yarra City Council | 2.0 | 1.0 | 1.0 | 126.0 | NaN | NaN |
| 1 | Abbotsford | 2 | h | S | Biggin | Northern Metropolitan | 4019.0 | 2.5 | Yarra City Council | 2.0 | 1.0 | 1.0 | 202.0 | NaN | 1480000.0 |


**`note`** : *check for <b>NaN</b> values, especially when the dataset is large*

In [5]:
dataset.isna().any(), dataset.isna().sum()

(Suburb           False
 Rooms            False
 Type             False
 Method           False
 SellerG          False
 Regionname        True
 Propertycount     True
 Distance          True
 CouncilArea       True
 Bedroom2          True
 Bathroom          True
 Car               True
 Landsize          True
 BuildingArea      True
 Price             True
 dtype: bool,
 Suburb               0
 Rooms                0
 Type                 0
 Method               0
 SellerG              0
 Regionname           3
 Propertycount        3
 Distance             1
 CouncilArea          3
 Bedroom2          8217
 Bathroom          8226
 Car               8728
 Landsize         11810
 BuildingArea     21115
 Price             7610
 dtype: int64)

If a value is **NaN** in one of `['Propertycount', 'Distance', 'Bedroom2', 'Bathroom', 'Car']`, we assume the house simply does not have that feature, so we treat these **NaN** values as `0`.

In [6]:
cols_to_fill_zero = ['Propertycount', 'Distance', 'Bedroom2', 'Bathroom', 'Car']
dataset[cols_to_fill_zero] = dataset[cols_to_fill_zero].fillna(0)
dataset.head()

| | Suburb | Rooms | Type | Method | SellerG | Regionname | Propertycount | Distance | CouncilArea | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abbotsford | 2 | h | SS | Jellis | Northern Metropolitan | 4019.0 | 2.5 | Yarra City Council | 2.0 | 1.0 | 1.0 | 126.0 | NaN | NaN |
| 1 | Abbotsford | 2 | h | S | Biggin | Northern Metropolitan | 4019.0 | 2.5 | Yarra City Council | 2.0 | 1.0 | 1.0 | 202.0 | NaN | 1480000.0 |
| 2 | Abbotsford | 2 | h | S | Biggin | Northern Metropolitan | 4019.0 | 2.5 | Yarra City Council | 2.0 | 1.0 | 0.0 | 156.0 | 79.0 | 1035000.0 |
| 3 | Abbotsford | 3 | u | VB | Rounds | Northern Metropolitan | 4019.0 | 2.5 | Yarra City Council | 3.0 | 2.0 | 1.0 | 0.0 | NaN | NaN |
| 4 | Abbotsford | 3 | h | SP | Biggin | Northern Metropolitan | 4019.0 | 2.5 | Yarra City Council | 3.0 | 2.0 | 0.0 | 134.0 | 150.0 | 1465000.0 |


In [7]:
dataset.isna().sum()

Suburb               0
Rooms                0
Type                 0
Method               0
SellerG              0
Regionname           3
Propertycount        0
Distance             0
CouncilArea          3
Bedroom2             0
Bathroom             0
Car                  0
Landsize         11810
BuildingArea     21115
Price             7610
dtype: int64

For `Landsize` and `BuildingArea`, we fill the <b>NaN</b> values with the column mean.

In [8]:
dataset['Landsize'] = dataset['Landsize'].fillna(dataset['Landsize'].mean())
dataset['BuildingArea'] = dataset['BuildingArea'].fillna(dataset['BuildingArea'].mean())

In [9]:
dataset.isna().sum()

Suburb              0
Rooms               0
Type                0
Method              0
SellerG             0
Regionname          3
Propertycount       0
Distance            0
CouncilArea         3
Bedroom2            0
Bathroom            0
Car                 0
Landsize            0
BuildingArea        0
Price            7610
dtype: int64
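The two fill strategies used so far can be illustrated on a tiny, made-up frame. This is only a sketch: the column names mirror the notebook, but the values are invented so the effect of each `fillna` call is easy to check by eye.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Car": [1.0, np.nan, 2.0],
    "Landsize": [100.0, np.nan, 300.0],
})

# Count-like column: a missing value is assumed to mean "none", so fill with 0.
toy["Car"] = toy["Car"].fillna(0)

# Size-like column: fill with the mean of the observed values
# (mean() skips NaN by default, so this is (100 + 300) / 2 = 200).
toy["Landsize"] = toy["Landsize"].fillna(toy["Landsize"].mean())

print(toy)
```

Filling with the mean keeps the column average unchanged, whereas filling a size column with 0 would pull it down; that is one reason to treat the two groups of columns differently.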

`Price` is the column we want to predict, so rows with a missing `Price` cannot be used for training. `Regionname` and `CouncilArea` have only 3 **NaN** values each. We drop every row that still contains a **NaN**; the dataset that remains is still large enough to train on.

In [10]:
dataset.dropna(inplace=True)

In [11]:
dataset.isna().sum()

Suburb           0
Rooms            0
Type             0
Method           0
SellerG          0
Regionname       0
Propertycount    0
Distance         0
CouncilArea      0
Bedroom2         0
Bathroom         0
Car              0
Landsize         0
BuildingArea     0
Price            0
dtype: int64

In [12]:
dataset.head(2)

| | Suburb | Rooms | Type | Method | SellerG | Regionname | Propertycount | Distance | CouncilArea | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Abbotsford | 2 | h | S | Biggin | Northern Metropolitan | 4019.0 | 2.5 | Yarra City Council | 2.0 | 1.0 | 1.0 | 202.0 | 160.2564 | 1480000.0 |
| 2 | Abbotsford | 2 | h | S | Biggin | Northern Metropolitan | 4019.0 | 2.5 | Yarra City Council | 2.0 | 1.0 | 0.0 | 156.0 | 79.0 | 1035000.0 |


Since ML models require numeric inputs, we convert the categorical columns with one-hot encoding:

In [13]:
dataset = pd.get_dummies(dataset, drop_first=True)
dataset.head()

| | Rooms | Propertycount | Distance | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | Price | Suburb_Aberfeldie | ... | CouncilArea_Moorabool Shire Council | CouncilArea_Moreland City Council | CouncilArea_Nillumbik Shire Council | CouncilArea_Port Phillip City Council | CouncilArea_Stonnington City Council | CouncilArea_Whitehorse City Council | CouncilArea_Whittlesea City Council | CouncilArea_Wyndham City Council | CouncilArea_Yarra City Council | CouncilArea_Yarra Ranges Shire Council |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 4019.0 | 2.5 | 2.0 | 1.0 | 1.0 | 202.0 | 160.2564 | 1480000.0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 2 | 4019.0 | 2.5 | 2.0 | 1.0 | 0.0 | 156.0 | 79.0 | 1035000.0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 3 | 4019.0 | 2.5 | 3.0 | 2.0 | 0.0 | 134.0 | 150.0 | 1465000.0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 5 | 3 | 4019.0 | 2.5 | 3.0 | 2.0 | 1.0 | 94.0 | 160.2564 | 850000.0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 6 | 4 | 4019.0 | 2.5 | 3.0 | 1.0 | 2.0 | 120.0 | 142.0 | 1600000.0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
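What `pd.get_dummies(..., drop_first=True)` does is easier to see on a small, made-up frame than on the wide output above. The sketch below uses a hypothetical three-row frame with one numeric and one categorical column.

```python
import pandas as pd

# Hypothetical toy frame: one numeric column and one categorical column.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Type": ["h", "u", "t"],
})

# Categorical columns are expanded into indicator columns; numeric ones pass through.
# drop_first=True drops the first category in sorted order ("h" here); a row of
# type "h" is then represented by all remaining Type_* indicators being 0.
encoded = pd.get_dummies(toy, drop_first=True)

print(list(encoded.columns))  # ['Rooms', 'Type_t', 'Type_u']
print(encoded)
```

Dropping one indicator per categorical column avoids a redundant column: with all indicators present, they always sum to 1 and duplicate the intercept of a linear model.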


In [14]:
x = dataset.drop('Price', axis=1)
y = dataset['Price']

In [15]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.22, random_state=7)

In [16]:
# creating the model
model_linear_regression = LinearRegression().fit(x_train, y_train)

In [17]:
model_linear_regression.score(x_test, y_test)

0.6779817300837394

In [18]:
model_linear_regression.score(x_train, y_train)

0.6786918635341249

The training and test scores are almost the same, but both are low.
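For scikit-learn regressors, `score` returns the coefficient of determination R², that is, 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks this by hand on synthetic data (the data and coefficients are made up, not taken from the housing set).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic regression problem with made-up coefficients and noise.
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=200)

model = LinearRegression().fit(X, y)
pred = model.predict(X)

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y - pred) ** 2)       # error left after the fit
ss_tot = np.sum((y - y.mean()) ** 2)   # error of always predicting the mean
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, model.score(X, y))    # the two values agree
```

Read this way, a score near 0.68 means the model explains roughly two thirds of the variance in the target, on the training and the test split alike.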

## L1 Regularization (Lasso)

In [19]:
model_lasso = Lasso(alpha=50, max_iter=100, tol=0.1).fit(x_train, y_train)

In [20]:
model_lasso.score(x_test, y_test)

0.6805606903066399

In [21]:
model_lasso.score(x_train, y_train)

0.6742338086673998

Since Lasso scores about the same as plain linear regression, we try L2 regularization next.
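Even when the scores look alike, the two penalties behave differently: the L1 penalty can set coefficients to exactly zero, while the L2 penalty only shrinks them. A minimal sketch on synthetic data, assuming a made-up problem in which only 3 of 20 features matter (the `alpha` values are arbitrary choices for illustration, not tuned):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 20 features, but only the first 3 influence the target.
rng = np.random.default_rng(0)
n_samples, n_features = 200, 20
X = rng.normal(size=(n_samples, n_features))
true_coef = np.zeros(n_features)
true_coef[:3] = [5.0, -3.0, 2.0]
y = X @ true_coef + rng.normal(scale=0.5, size=n_samples)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

# Count the coefficients that each model sets to exactly zero.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print("zero coefficients, Lasso:", n_zero_lasso)
print("zero coefficients, Ridge:", n_zero_ridge)
```

On a one-hot encoded dataset with many indicator columns, inspecting `coef_` in this way shows how many features a Lasso model is actually using.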

## L2 Regularization (Ridge)

In [22]:
model_ridge = Ridge(alpha=50, max_iter=100, tol=0.1).fit(x_train, y_train)

In [23]:
model_ridge.score(x_test, y_test)

0.6760762634813163

In [24]:
model_ridge.score(x_train, y_train)

0.6623329557178814
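The models above use a single value, `alpha=50`. One way to see how the regularization strength affects the fit is to compare training and test scores over several values. The sketch below does this on synthetic data; the data, the list of `alpha` values, and the `scores` dictionary are assumptions for illustration, and the same loop could be run on `x_train` and `x_test` from the notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic regression problem with made-up coefficients and noise.
rng = np.random.default_rng(7)
X = rng.normal(size=(300, 30))
coef = rng.normal(size=30)
y = X @ coef + rng.normal(scale=2.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.22, random_state=7
)

scores = {}
for alpha in (0.1, 1.0, 10.0, 50.0, 100.0):
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    # Store (training R^2, test R^2) for each regularization strength
    scores[alpha] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"alpha={alpha:>5}: train R2={scores[alpha][0]:.4f}, "
          f"test R2={scores[alpha][1]:.4f}")
```

The training score can only fall as `alpha` grows, because a stronger penalty restricts the fit; the value worth keeping is the one with the best score on held-out data.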