# Melbourne Housing Market

A classic data science challenge: analysis and prediction of housing prices in Melbourne, Australia.  
Source: [Kaggle](https://www.kaggle.com/datasets/anthonypino/melbourne-housing-market)

## 0. Imports

In [14]:
import pandas as pd

In [15]:
# Set pandas options
pd.set_option('display.float_format', '{:.2f}'.format)

## 1. Data analysis

In [22]:
data = pd.read_csv('Melbourne_housing_FULL.csv')
data_raw = data.copy()

In [17]:
data_raw.shape

(34857, 21)

In [18]:
data_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34857 entries, 0 to 34856
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Suburb         34857 non-null  object 
 1   Address        34857 non-null  object 
 2   Rooms          34857 non-null  int64  
 3   Type           34857 non-null  object 
 4   Price          27247 non-null  float64
 5   Method         34857 non-null  object 
 6   SellerG        34857 non-null  object 
 7   Date           34857 non-null  object 
 8   Distance       34856 non-null  float64
 9   Postcode       34856 non-null  float64
 10  Bedroom2       26640 non-null  float64
 11  Bathroom       26631 non-null  float64
 12  Car            26129 non-null  float64
 13  Landsize       23047 non-null  float64
 14  BuildingArea   13742 non-null  float64
 15  YearBuilt      15551 non-null  float64
 16  CouncilArea    34854 non-null  object 
 17  Lattitude      26881 non-null  float64
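 18  Longtitude     26881 non-null  float64
 19  Regionname     34854 non-null  object 
 20  Propertycount  34854 non-null  float64
dtypes: float64(12), int64(1), object(8)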

In [19]:
data_raw.head()

| | Suburb | Address | Rooms | Type | Price | Method | SellerG | Date | Distance | Postcode | ... | Bathroom | Car | Landsize | BuildingArea | YearBuilt | CouncilArea | Lattitude | Longtitude | Regionname | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abbotsford | 68 Studley St | 2 | h | NaN | SS | Jellis | 3/09/2016 | 2.5 | 3067.0 | ... | 1.0 | 1.0 | 126.0 | NaN | NaN | Yarra City Council | -37.8 | 145.0 | Northern Metropolitan | 4019.0 |
| 1 | Abbotsford | 85 Turner St | 2 | h | 1480000.0 | S | Biggin | 3/12/2016 | 2.5 | 3067.0 | ... | 1.0 | 1.0 | 202.0 | NaN | NaN | Yarra City Council | -37.8 | 145.0 | Northern Metropolitan | 4019.0 |
| 2 | Abbotsford | 25 Bloomburg St | 2 | h | 1035000.0 | S | Biggin | 4/02/2016 | 2.5 | 3067.0 | ... | 1.0 | 0.0 | 156.0 | 79.0 | 1900.0 | Yarra City Council | -37.81 | 144.99 | Northern Metropolitan | 4019.0 |
| 3 | Abbotsford | 18/659 Victoria St | 3 | u | NaN | VB | Rounds | 4/02/2016 | 2.5 | 3067.0 | ... | 2.0 | 1.0 | 0.0 | NaN | NaN | Yarra City Council | -37.81 | 145.01 | Northern Metropolitan | 4019.0 |
| 4 | Abbotsford | 5 Charles St | 3 | h | 1465000.0 | SP | Biggin | 4/03/2017 | 2.5 | 3067.0 | ... | 2.0 | 0.0 | 134.0 | 150.0 | 1900.0 | Yarra City Council | -37.81 | 144.99 | Northern Metropolitan | 4019.0 |


In [21]:
data_raw.isnull().sum()

Suburb               0
Address              0
Rooms                0
Type                 0
Price             7610
Method               0
SellerG              0
Date                 0
Distance             1
Postcode             1
Bedroom2          8217
Bathroom          8226
Car               8728
Landsize         11810
BuildingArea     21115
YearBuilt        19306
CouncilArea          3
Lattitude         7976
Longtitude        7976
Regionname           3
Propertycount        3
dtype: int64
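
Raw counts are easier to compare as shares of the 34857 rows: from the output above, roughly 61% of 'BuildingArea', 55% of 'YearBuilt' and 22% of 'Price' are missing. The following is a minimal sketch of how such a summary could be computed. The helper name `missing_summary` and the toy frame are hypothetical and only stand in for `data_raw`.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages per column, worst first."""
    summary = pd.DataFrame({
        "missing": df.isnull().sum(),
        "missing_pct": df.isnull().mean() * 100,
    })
    return summary.sort_values("missing_pct", ascending=False)


# Hypothetical toy frame standing in for data_raw
toy = pd.DataFrame({
    "Rooms": [2, 3, 3, 4],
    "Price": [1480000.0, np.nan, 1465000.0, np.nan],
    "BuildingArea": [np.nan, np.nan, 150.0, np.nan],
})
toy_summary = missing_summary(toy)
print(toy_summary)
```

On the real data, `missing_summary(data_raw)` should list 'BuildingArea' and 'YearBuilt' at the top, which may matter when deciding which columns to impute and which to drop.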

In [28]:
# Split the data into numerical and categorical columns
data_raw_num = data_raw.select_dtypes(include=['int64', 'float64'])
data_raw_cat = data_raw.select_dtypes(include=['object'])
print(f"shape of data_raw_num:  {data_raw_num.shape}")
print(f"shape of data_raw_cat:  {data_raw_cat.shape}")
print(f"shape of full dataset : {data_raw.shape}")

shape of data_raw_num:  (34857, 13)
shape of data_raw_cat:  (34857, 8)
shape of full dataset : (34857, 21)
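
A split by dtype alone has two side effects visible in the outputs above: 'Postcode' is stored as float64 and therefore counts as numerical although it is a label, and 'Date' is stored as object and therefore counts as categorical. The sketch below shows one possible way to adjust for this before splitting. It runs on a hypothetical toy frame, and it assumes that the date strings are day/month/year, which the sample rows alone do not prove.

```python
import pandas as pd

# Hypothetical toy rows using column names from the dataset
toy = pd.DataFrame({
    "Suburb": ["Abbotsford", "Abbotsford"],
    "Rooms": [2, 3],
    "Postcode": [3067.0, 3067.0],
    "Date": ["3/09/2016", "4/03/2017"],
})

prepared = toy.copy()
# Assumption: dates are written as day/month/year
prepared["Date"] = pd.to_datetime(prepared["Date"], format="%d/%m/%Y")
# Postcode is a label, not a quantity, so store it as a category
prepared["Postcode"] = prepared["Postcode"].astype("Int64").astype("category")

num_cols = prepared.select_dtypes(include="number").columns.tolist()
date_cols = prepared.select_dtypes(include="datetime").columns.tolist()
# Everything that is neither numeric nor a datetime is treated as categorical
cat_cols = [c for c in prepared.columns if c not in num_cols + date_cols]

print(num_cols, date_cols, cat_cols)
```

Whether 'Postcode' is better treated as a category or dropped in favour of 'Suburb' is a modelling choice that this sketch does not settle.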


In [29]:
data_raw_num.describe()

| | Rooms | Price | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 34857.0 | 27247.0 | 34856.0 | 34856.0 | 26640.0 | 26631.0 | 26129.0 | 23047.0 | 13742.0 | 15551.0 | 26881.0 | 26881.0 | 34854.0 |
| mean | 3.03 | 1050173.34 | 11.18 | 3116.06 | 3.08 | 1.62 | 1.73 | 593.6 | 160.26 | 1965.29 | -37.81 | 145.0 | 7572.89 |
| std | 0.97 | 641467.13 | 6.79 | 109.02 | 0.98 | 0.72 | 1.01 | 3398.84 | 401.27 | 37.33 | 0.09 | 0.12 | 4428.09 |
| min | 1.0 | 85000.0 | 0.0 | 3000.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1196.0 | -38.19 | 144.42 | 83.0 |
| 25% | 2.0 | 635000.0 | 6.4 | 3051.0 | 2.0 | 1.0 | 1.0 | 224.0 | 102.0 | 1940.0 | -37.86 | 144.93 | 4385.0 |
| 50% | 3.0 | 870000.0 | 10.3 | 3103.0 | 3.0 | 2.0 | 2.0 | 521.0 | 136.0 | 1970.0 | -37.81 | 145.01 | 6763.0 |
| 75% | 4.0 | 1295000.0 | 14.0 | 3156.0 | 4.0 | 2.0 | 2.0 | 670.0 | 188.0 | 2000.0 | -37.75 | 145.07 | 10412.0 |
| max | 16.0 | 11200000.0 | 48.1 | 3978.0 | 30.0 | 12.0 | 26.0 | 433014.0 | 44515.0 | 2106.0 | -37.39 | 145.53 | 21650.0 |


Some observations that will need further investigation:
* The minimum of 'Bedroom2' is 0.
* The minimum of 'Bathroom' is 0.
* The minimum of 'Landsize' is 0 [m²].
* The maximum of 'Landsize' is 433014 [m²].
* The minimum of 'BuildingArea' is 0 [m²].
* The maximum of 'BuildingArea' is 44515 [m²].
* The minimum of 'YearBuilt' is 1196. (Melbourne was founded in 1835.)
* The maximum of 'YearBuilt' is 2106. (The dataset was created in 2018.)

In [46]:
print(f"Number of entries with bedroom2 < 1: {(data_raw_num['Bedroom2'] < 1).sum()}")
print(f"Number of entries with bathroom < 1: {(data_raw_num['Bathroom'] < 1).sum()}")
print(f"Number of entries with landsize = 0: {(data_raw_num['Landsize'] == 0).sum()}")
print(f"Number of entries with buildingarea = 0: {(data_raw_num['BuildingArea'] == 0).sum()}")
print(f"Number of entries with YearBuilt < 1835: {(data_raw_num['YearBuilt'] < 1835).sum()}")
print(f"Number of entries with YearBuilt > 2018: {(data_raw_num['YearBuilt'] > 2018).sum()}")
print(f"Number of entries with buildingarea > landsize: {(data_raw_num['BuildingArea'] > data_raw_num['Landsize']).sum()}")

Number of entries with bedroom2 < 1: 17
Number of entries with bathroom < 1: 46
Number of entries with landsize = 0: 2437
Number of entries with buildingarea = 0: 76
Number of entries with YearBuilt < 1835: 4
Number of entries with YearBuilt > 2018: 2
Number of entries with buildingarea > landsize: 1751


In [45]:
print(f"Number of entries with buildingarea = 0 and landsize = 0: {((data_raw_num['Landsize'] == 0) & (data_raw_num['BuildingArea'] == 0)).sum()}")
print(f"Number of entries with buildingarea > landsize and landsize = 0: {((data_raw_num['BuildingArea'] > data_raw_num['Landsize']) & (data_raw_num['Landsize'] == 0)).sum()}")

Number of entries with buildingarea = 0 and landsize = 0: 0
Number of entries with buildingarea > landsize and landsize = 0: 1327
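
Of the 1751 rows where 'BuildingArea' exceeds 'Landsize', 1327 have a 'Landsize' of 0, which leaves 424 rows where a non-zero land size is smaller than the building area. One possible reading is that a 'Landsize' of 0 acts as a placeholder rather than a measurement, but that is an interpretation, not something the data confirms. The checks above can be collected into one table of boolean flags, as in the following sketch. The helper name `flag_implausible`, the flag names and the toy frame are hypothetical. Unlike the cell above, the last flag deliberately ignores rows with a 'Landsize' of 0.

```python
import numpy as np
import pandas as pd


def flag_implausible(df: pd.DataFrame,
                     min_year: int = 1835,
                     max_year: int = 2018) -> pd.DataFrame:
    """Return one boolean column per sanity check (True = suspicious row)."""
    return pd.DataFrame({
        "zero_bedrooms": df["Bedroom2"] < 1,
        "zero_bathrooms": df["Bathroom"] < 1,
        "zero_landsize": df["Landsize"] == 0,
        "zero_buildingarea": df["BuildingArea"] == 0,
        "year_out_of_range": (df["YearBuilt"] < min_year)
                             | (df["YearBuilt"] > max_year),
        # Assumption: only compare areas when the land size is non-zero
        "building_exceeds_land": (df["BuildingArea"] > df["Landsize"])
                                 & (df["Landsize"] > 0),
    })


# Hypothetical toy frame standing in for data_raw_num
toy = pd.DataFrame({
    "Bedroom2": [2.0, 0.0, 3.0, np.nan],
    "Bathroom": [1.0, 1.0, 0.0, 2.0],
    "Landsize": [126.0, 0.0, 134.0, 500.0],
    "BuildingArea": [79.0, 60.0, 150.0, np.nan],
    "YearBuilt": [1900.0, 2106.0, 1196.0, np.nan],
})
flags = flag_implausible(toy)
print(flags.sum())
```

Comparisons against NaN evaluate to False, so rows with missing values are never flagged here; they would need separate handling. A row-level mask such as `flags.any(axis=1)` could then be used to inspect the suspicious rows before deciding whether to correct, impute or drop them.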
