## Housing Prices in Melbourne, Australia

## Objective: Create a model to predict a house's value. I want to understand what creates value in a house, as if I were a real estate developer.

In [1]:
%matplotlib inline
import math
import operator
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy
import seaborn as sns
from sklearn import preprocessing
sns.set_style('white')

In [2]:
#import file
df = pd.read_csv("MEL_housing_prices.csv", encoding='latin1')

Column Details in csv

Suburb: Suburb

Address: Address

Rooms: Number of rooms

Type: br - bedroom(s); h - house, cottage, villa, semi, terrace; u - unit, duplex; t - townhouse; dev site - development site; o res - other residential.

Price: Price in Australian dollars

Method: S - property sold; SP - property sold prior; PI - property passed in; PN - sold prior not disclosed; SN - sold not disclosed; NB - no bid; VB - vendor bid; W - withdrawn prior to auction; SA - sold after auction; SS - sold after auction price not disclosed. N/A - price or highest bid not available.

SellerG: Real Estate Agent

Date: Date sold

Postcode: Postal code

Regionname: General region (West, North West, North, North East, etc.)

Propertycount: Number of properties that exist in the suburb.

Distance: Distance from CBD in Kilometres

CouncilArea: Governing council for the area

# Data Cleaning Section

In [3]:
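# Make the column names easier to type
# regex=False treats '(' and ')' as literal characters rather than regex syntax
df.columns = (df.columns.str.strip().str.lower()
              .str.replace(' ', '_', regex=False)
              .str.replace('(', '', regex=False)
              .str.replace(')', '', regex=False))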
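
Passing `regex=False` makes each `str.replace` treat its pattern as literal text. This matters for `(`, because a lone parenthesis is not a valid regular expression. Below is a minimal sketch of the same cleaning steps on made-up headers. The header names are illustrative assumptions, not taken from the CSV.

```python
import pandas as pd

# Hypothetical raw headers, used only to illustrate the cleaning steps
raw = pd.DataFrame(columns=[" Suburb ", "Property Count", "Distance (km)"])

cleaned = (
    raw.columns.str.strip()
    .str.lower()
    .str.replace(" ", "_", regex=False)   # literal space
    .str.replace("(", "", regex=False)    # literal opening parenthesis
    .str.replace(")", "", regex=False)    # literal closing parenthesis
)
print(list(cleaned))
```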

In [4]:
df.dtypes

suburb            object
address           object
rooms              int64
type              object
price            float64
method            object
sellerg           object
date              object
postcode           int64
regionname        object
propertycount      int64
distance         float64
councilarea       object
dtype: object

## We can get rid of a few redundant columns:

## The specific address can be dropped since we have information about the suburb, postcode, and the propertycount of the suburb. The real estate agent (sellerg) is also irrelevant to the broader real estate market conditions.

In [5]:
# Name the axis explicitly; a positional axis argument is not accepted by recent pandas
df.drop(columns=['address', 'sellerg'], inplace=True)

In [6]:
df.head(1)

Unnamed: 0,suburb,rooms,type,price,method,date,postcode,regionname,propertycount,distance,councilarea
0,Abbotsford,3,h,1490000.0,S,1/04/2017,3067,Northern Metropolitan,4019,3.0,Yarra City Council


## Looking at the data, most of the `method` classifications describe how the property was sold. Since we're trying to predict a house's price, the method of sale isn't relevant to my objective.

In [9]:
# count of non-null values in each column
df.count()

suburb           63023
rooms            63023
type             63023
price            48433
method           63023
date             63023
postcode         63023
regionname       63023
propertycount    63023
distance         63023
councilarea      63023
dtype: int64

In [10]:
# how many times each value occurs in the column
df['method'].value_counts()

S     34063
PI     9790
SP     8916
VB     5956
SN     2674
PN      651
W       484
SA      416
SS       73
Name: method, dtype: int64

Method:
S - property sold;
PI - property passed in;
SP - property sold prior;
VB - vendor bid;
SN - sold not disclosed;
PN - sold prior not disclosed;
W - withdrawn prior to auction;
SA - sold after auction;
SS - sold after auction price not disclosed;
NB - no bid;
N/A - price or highest bid not available.

In [11]:
df.drop(columns=['method'], inplace=True)

# Narrowing the location

## There are 8 regions in Melbourne...

In [12]:
# Shows how many times each region appears.
df['regionname'].value_counts()

Southern Metropolitan         17559
Northern Metropolitan         16781
Western Metropolitan          11717
Eastern Metropolitan          10396
South-Eastern Metropolitan     5212
Eastern Victoria                564
Northern Victoria               556
Western Victoria                238
Name: regionname, dtype: int64

In [19]:
# counts the number of unique regions
df['regionname'].nunique()

8
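
A caveat on counting categories: `Series.nunique()` counts distinct labels, whereas `value_counts().nunique()` counts distinct frequencies. The two agree only when no two labels occur equally often, which happens to hold for the eight regions listed above. A small sketch on toy data (the series below is made up for illustration):

```python
import pandas as pd

# Toy data: three categories, two of which occur equally often
s = pd.Series(["a", "a", "b", "b", "c"])

n_categories = s.nunique()                   # distinct labels: a, b, c
n_frequencies = s.value_counts().nunique()   # distinct counts: 2 and 1

print(n_categories, n_frequencies)
```

Here the second expression undercounts because "a" and "b" share the same frequency.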

## 34 council areas...

In [18]:
df['councilarea'].value_counts()

Boroondara City Council           5132
Darebin City Council              4182
Banyule City Council              3656
Monash City Council               3592
Bayside City Council              3331
Brimbank City Council             3296
Moreland City Council             3030
Hume City Council                 2939
Glen Eira City Council            2934
Melbourne City Council            2728
Whittlesea City Council           2545
Moonee Valley City Council        2512
Kingston City Council             2378
Manningham City Council           2225
Maribyrnong City Council          2083
Stonnington City Council          1991
Whitehorse City Council           1811
Port Phillip City Council         1771
Yarra City Council                1698
Wyndham City Council              1542
Maroondah City Council            1451
Hobsons Bay City Council          1351
Knox City Council                 1043
Greater Dandenong City Council     948
Frankston City Council             835
Melton City Council      

In [20]:
# counts the number of unique council areas
df['councilarea'].nunique()

34

## But 182 unique postcodes! That is too much variation for a first attempt at the model, so I will drop the column for now. If the initial model isn't predictive, I can reconsider including it.

In [17]:
# counts the number of unique postcodes
df['postcode'].nunique()

182

In [21]:
df.drop(columns=['postcode'], inplace=True)
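
To compare the variation across all columns at once rather than one column at a time, `DataFrame.nunique()` can be sorted by cardinality. This is a sketch on a toy stand-in for the housing frame; the values are illustrative only.

```python
import pandas as pd

# Toy stand-in for the housing frame; values are illustrative only
toy = pd.DataFrame({
    "regionname": ["North", "North", "South", "South"],
    "postcode": [3067, 3040, 3042, 3067],
    "rooms": [3, 2, 4, 3],
})

# Number of distinct values per column, highest cardinality first
cardinality = toy.nunique().sort_values(ascending=False)
print(cardinality)
```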

# Addressing missing data (NaN's)

## We have 63,023 records, yet 14,590 of them (23.15%) are missing data.

In [9]:
#Count total number of rows
total = df.shape[0]
total

63023

In [10]:
# number of rows with at least one NaN value
nan_rows = df.isnull().any(axis=1).sum()
nan_rows

14590

In [11]:
# fraction of rows with missing values
nan_rows / total

0.23150278469765007
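
The same ratio (14590 / 63023 ≈ 0.2315) can be obtained per column in one step, since the mean of a boolean mask is the fraction of `True` values. A minimal sketch on toy data, with made-up values:

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in one column; values are illustrative only
toy = pd.DataFrame({
    "rooms": [3, 2, 4, 3],
    "price": [1490000.0, np.nan, 670000.0, np.nan],
})

missing_share = toy.isnull().mean()              # fraction of NaN per column
rows_with_gap = toy.isnull().any(axis=1).sum()   # rows with at least one NaN

print(missing_share)
print(rows_with_gap)
```

Counting NaN cells and counting rows with a NaN give the same number only when each affected row has a single gap, as is the case when all gaps sit in one column.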

## Unfortunately, all of those NaN values are in the price column, which is the very value we're trying to predict!

In [6]:
df.isnull().sum()

suburb               0
address              0
rooms                0
type                 0
price            14590
method               0
sellerg              0
date                 0
postcode             0
regionname           0
propertycount        0
distance             0
councilarea          0
dtype: int64

In [33]:
df.head()

Unnamed: 0,suburb,rooms,type,price,date,regionname,propertycount,distance,councilarea,is_nan
0,Abbotsford,3,h,1490000.0,1/04/2017,Northern Metropolitan,4019,3.0,Yarra City Council,0
1,Abbotsford,3,h,1220000.0,1/04/2017,Northern Metropolitan,4019,3.0,Yarra City Council,0
2,Abbotsford,3,h,1420000.0,1/04/2017,Northern Metropolitan,4019,3.0,Yarra City Council,0
3,Aberfeldie,3,h,1515000.0,1/04/2017,Western Metropolitan,1543,7.5,Moonee Valley City Council,0
4,Airport West,2,h,670000.0,1/04/2017,Western Metropolitan,3464,10.4,Moonee Valley City Council,0


## The major question is whether prices are missing purely at random or because of some unseen factor. The answer will dictate how to proceed.

In [34]:
df['is_nan'] = df['price'].isnull()

In [50]:
df.head(1)

Unnamed: 0,suburb,rooms,type,price,date,regionname,propertycount,distance,councilarea,is_nan
0,Abbotsford,3,h,1490000.0,1/04/2017,Northern Metropolitan,4019,3.0,Yarra City Council,False
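
One way to probe whether the gaps are random is to compare the share of missing prices across groups. If the share differs sharply between categories, the gaps are unlikely to be random. The sale method is a natural candidate to check, because several codes (SN, PN, SS) are defined as "price not disclosed". That is an assumption to verify, not a finding, and it would require keeping or re-reading the `method` column. The sketch below uses made-up rows:

```python
import numpy as np
import pandas as pd

# Toy data; the real frame would need the `method` column kept for this check
toy = pd.DataFrame({
    "method": ["S", "S", "SN", "PI", "SN", "S"],
    "price": [1490000.0, 1220000.0, np.nan, np.nan, np.nan, 670000.0],
})
toy["is_nan"] = toy["price"].isnull()

# Share of missing prices within each sale method
missing_by_method = toy.groupby("method")["is_nan"].mean()
print(missing_by_method)
```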


In [57]:
# df1: rows with a missing price; df2: rows with a known price
df1 = df.loc[df['price'].isnull()].drop(columns='price')
df2 = df.loc[df['price'].notnull()].drop(columns='price')

In [59]:
sns.heatmap(df2)

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
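
The `TypeError` arises because `sns.heatmap` expects a rectangular array of numbers, while `df2` still contains string columns such as `suburb` and `regionname`. A possible workaround is to build a purely numeric table first, for example counts of present versus missing prices per region with `pd.crosstab`. The sketch uses toy data; the region labels and prices are illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing frame; values are illustrative only
toy = pd.DataFrame({
    "regionname": ["North", "North", "South", "South", "South"],
    "price": [1490000.0, np.nan, 670000.0, np.nan, np.nan],
})
toy["is_nan"] = toy["price"].isnull()

# Rows: region; columns: price present (False) / missing (True); cells: counts
table = pd.crosstab(toy["regionname"], toy["is_nan"])
print(table)

# With seaborn imported as sns, this numeric table can be plotted:
# sns.heatmap(table, annot=True, fmt="d")
```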

In [39]:
df.head()

Unnamed: 0,suburb,rooms,type,price,date,regionname,propertycount,distance,councilarea,is_nan
0,Abbotsford,3,h,1490000.0,1/04/2017,Northern Metropolitan,4019,3.0,Yarra City Council,False
1,Abbotsford,3,h,1220000.0,1/04/2017,Northern Metropolitan,4019,3.0,Yarra City Council,False
2,Abbotsford,3,h,1420000.0,1/04/2017,Northern Metropolitan,4019,3.0,Yarra City Council,False
3,Aberfeldie,3,h,1515000.0,1/04/2017,Western Metropolitan,1543,7.5,Moonee Valley City Council,False
4,Airport West,2,h,670000.0,1/04/2017,Western Metropolitan,3464,10.4,Moonee Valley City Council,False


# ToDo: I need to convert the categorical data
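
One common option for this step is one-hot encoding with `pd.get_dummies`. This is a sketch of that option on toy rows, not a decision about which encoding the final model should use. The `type` codes come from the column notes; the rest is made up.

```python
import pandas as pd

# Toy rows using the `type` codes from the column notes (h, u, t)
toy = pd.DataFrame({
    "type": ["h", "u", "t", "h"],
    "rooms": [3, 2, 3, 4],
})

# One indicator column per category; drop_first removes the redundant first level
encoded = pd.get_dummies(toy, columns=["type"], drop_first=True)
print(encoded.columns.tolist())
```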

# ToDo: date adjustment?
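
A possible starting point is to parse the strings into datetimes and derive year and month features. The sketch assumes the `date` column is in day/month/year order, as is usual in Australia; that should be checked against the data before relying on it.

```python
import pandas as pd

# Sample strings in the same shape as the `date` column
dates = pd.Series(["1/04/2017", "15/07/2017"])

# Assumption: day/month/year order
parsed = pd.to_datetime(dates, format="%d/%m/%Y")

year = parsed.dt.year
month = parsed.dt.month
print(year.tolist(), month.tolist())
```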