<b>Using the data, create a linear regression model to predict a
house's value. We want to understand what creates value
in a house, as though we were real estate developers. The project
should follow these guidelines:</b>
1. Examine and explore the data (visualization, interactions among features)
2. Apply the model for prediction with holdout and cross-validation
3. Using PCA, apply the model with holdout and cross-validation
4. Visualize the residuals and check for homoscedasticity
5. Tune the model if necessary
6. Write up an analysis for each section (for example, explain why the model is overfitting, explain why applying PCA is better, etc.)
7. Include conclusions as a summary

In [1]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from matplotlib import rc
import matplotlib.ticker as ticker
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
%matplotlib inline

# 1. Examine and explore data (visualization, interactions among features)

In [2]:
data = pd.read_csv("Melbourne_housing_FULL.csv")

In [3]:
data.shape

(34857, 21)

# There are 34,857 rows and 21 columns, which is enough data to support both a holdout split and cross-validation.

In [7]:
data.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,68 Studley St,2,h,,SS,Jellis,3/09/2016,2.5,3067.0,2.0,1.0,1.0,126.0,,,Yarra City Council,-37.8014,144.9958,Northern Metropolitan,4019.0
1,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,2.0,1.0,1.0,202.0,,,Yarra City Council,-37.7996,144.9984,Northern Metropolitan,4019.0
2,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,2.0,1.0,0.0,156.0,79.0,1900.0,Yarra City Council,-37.8079,144.9934,Northern Metropolitan,4019.0
3,Abbotsford,18/659 Victoria St,3,u,,VB,Rounds,4/02/2016,2.5,3067.0,3.0,2.0,1.0,0.0,,,Yarra City Council,-37.8114,145.0116,Northern Metropolitan,4019.0
4,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,3.0,2.0,0.0,134.0,150.0,1900.0,Yarra City Council,-37.8093,144.9944,Northern Metropolitan,4019.0


In [15]:
# Show up to 500 columns and 500 rows in notebook output instead of truncating
pd.set_option('display.max_columns', 500, 'display.max_rows', 500)

In [16]:
data.dtypes

Suburb            object
Address           object
Rooms              int64
Type              object
Price            float64
Method            object
SellerG           object
Date              object
Distance         float64
Postcode         float64
Bedroom2         float64
Bathroom         float64
Car              float64
Landsize         float64
BuildingArea     float64
YearBuilt        float64
CouncilArea       object
Lattitude        float64
Longtitude       float64
Regionname        object
Propertycount    float64
dtype: object
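
The dtypes above show `Date` stored as `object` and `Postcode` stored as `float64`. Below is a minimal sketch of one possible cleanup. It assumes the dates are written day/month/year and that postcodes are better treated as labels than as quantities; neither choice is made in the notebook itself. The small frame `sample` is a hypothetical stand-in for `data`, filled with values copied from `data.head()`.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of `data` (values from data.head()).
sample = pd.DataFrame({
    "Date": ["3/09/2016", "3/12/2016", "4/02/2016"],
    "Postcode": [3067.0, 3067.0, 3067.0],
    "Type": ["h", "h", "h"],
})

# Assumption: dates are day/month/year, so "3/09/2016" is 3 September 2016.
sample["Date"] = pd.to_datetime(sample["Date"], format="%d/%m/%Y")

# Assumption: Postcode is an identifier, not a measurement.
# Go through the nullable Int64 dtype so missing postcodes survive the cast.
sample["Postcode"] = sample["Postcode"].astype("Int64").astype("category")

# Type takes only a few distinct values, so a categorical dtype fits it too.
sample["Type"] = sample["Type"].astype("category")

print(sample.dtypes)
```

With `Date` parsed, fields such as `sample["Date"].dt.year` become available as candidate features for the regression.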

In [17]:
data.describe(include="all")

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
count,34857,34857,34857.0,34857,27247.0,34857,34857,34857,34856.0,34856.0,26640.0,26631.0,26129.0,23047.0,13742.0,15551.0,34854,26881.0,26881.0,34854,34854.0
unique,351,34009,,3,,9,388,78,,,,,,,,,33,,,8,
top,Reservoir,5 Charles St,,h,,S,Jellis,28/10/2017,,,,,,,,,Boroondara City Council,,,Southern Metropolitan,
freq,844,6,,23980,,19744,3359,1119,,,,,,,,,3675,,,11836,
mean,,,3.031012,,1050173.0,,,,11.184929,3116.062859,3.084647,1.624798,1.728845,593.598993,160.2564,1965.289885,,-37.810634,145.001851,,7572.888306
std,,,0.969933,,641467.1,,,,6.788892,109.023903,0.98069,0.724212,1.010771,3398.841946,401.26706,37.328178,,0.090279,0.120169,,4428.090313
min,,,1.0,,85000.0,,,,0.0,3000.0,0.0,0.0,0.0,0.0,0.0,1196.0,,-38.19043,144.42379,,83.0
25%,,,2.0,,635000.0,,,,6.4,3051.0,2.0,1.0,1.0,224.0,102.0,1940.0,,-37.86295,144.9335,,4385.0
50%,,,3.0,,870000.0,,,,10.3,3103.0,3.0,2.0,2.0,521.0,136.0,1970.0,,-37.8076,145.0078,,6763.0
75%,,,4.0,,1295000.0,,,,14.0,3156.0,4.0,2.0,2.0,670.0,188.0,2000.0,,-37.7541,145.0719,,10412.0
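
The summary shows a minimum `YearBuilt` of 1196 and a minimum `BuildingArea` of 0, both of which look implausible for a house. The sketch below shows one way such values could be flagged and set to missing before modeling. The cutoff `MIN_YEAR` and the frame `sample` are illustrative assumptions, not values taken from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data`, including the suspicious values seen above.
sample = pd.DataFrame({
    "YearBuilt": [1900.0, 1196.0, np.nan, 1970.0],
    "BuildingArea": [79.0, 150.0, 0.0, np.nan],
})

MIN_YEAR = 1800  # assumed lower bound for a plausible construction year

# Comparisons against NaN are False, so missing rows are not flagged here.
bad_year = sample["YearBuilt"] < MIN_YEAR
bad_area = sample["BuildingArea"] <= 0

# Replace the flagged values with NaN rather than dropping whole rows.
cleaned = sample.copy()
cleaned.loc[bad_year, "YearBuilt"] = np.nan
cleaned.loc[bad_area, "BuildingArea"] = np.nan

print(int(bad_year.sum()), int(bad_area.sum()))
```

Setting flagged values to missing keeps the rest of each row usable; whether to impute or drop them afterwards is a separate decision.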


In [18]:
data.isnull().sum()

Suburb               0
Address              0
Rooms                0
Type                 0
Price             7610
Method               0
SellerG              0
Date                 0
Distance             1
Postcode             1
Bedroom2          8217
Bathroom          8226
Car               8728
Landsize         11810
BuildingArea     21115
YearBuilt        19306
CouncilArea          3
Lattitude         7976
Longtitude        7976
Regionname           3
Propertycount        3
dtype: int64
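
The raw counts above are easier to compare as shares of the 34,857 rows. The sketch below converts the counts reported by `data.isnull().sum()` into percentages; on the full frame, `data.isnull().mean() * 100` should give the same figures directly. The names `N_ROWS`, `missing`, and `missing_pct` are introduced here for illustration.

```python
import pandas as pd

N_ROWS = 34857  # row count reported by data.shape

# Missing-value counts copied from the data.isnull().sum() output above
# (columns with at most a handful of missing values are left out).
missing = pd.Series({
    "Price": 7610,
    "Bedroom2": 8217,
    "Bathroom": 8226,
    "Car": 8728,
    "Landsize": 11810,
    "BuildingArea": 21115,
    "YearBuilt": 19306,
    "Lattitude": 7976,
    "Longtitude": 7976,
})

# Share of rows missing per column, as a percentage, largest first.
missing_pct = (missing / N_ROWS * 100).round(1).sort_values(ascending=False)
print(missing_pct)
```

`BuildingArea` and `YearBuilt` are missing in more than half of the rows, so they are candidates for imputation or exclusion. `Price` is missing in roughly 22% of rows; since it is the prediction target, those rows cannot be used for training, which leaves the 27,247 priced rows reported by `describe`.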