### Capstone Project: Melbourne Housing Market

 Data source: Kaggle. (The data was scraped from publicly available results posted every week on Domain.com.au.)
 
 This project uses some of the features in the dataset to predict house prices.


    
    Attribute Information:
    
    Suburb         Suburb
    Address        Address
    Rooms          Number of rooms
    Price          Price in dollars
    Method         S - property sold; 
                   SP - property sold prior;
                   PI - property passed in; 
                   PN - sold prior not disclosed; 
                   SN - sold not disclosed; 
                   NB - no
    Type           br - bedroom(s); 
                   h - house, cottage, villa, semi, terrace; 
                   u - unit, duplex; 
                   t - townhouse; 
                   dev site - development site; 
                   o res - other residential.
    SellerG        Real Estate Agent
    Date           Date sold
    Distance       Distance from CBD
    Regionname     General Region (West, North West, North, North East, etc.)
    Bedroom2       Scraped # of Bedrooms (from different source)
    Bathroom       Number of Bathrooms
    Car            Number of carspots
    Landsize       Land Size
    BuildingArea   Building Size
    YearBuilt      Year the house was built
    CouncilArea    Governing council for the area
    Lattitude      Self-explanatory
    Longtitude     Self-explanatory
    Propertycount  Number of properties that exist in the suburb.


## Load modules

In [3]:
# data modules
import numpy as np
import scipy.stats as stats
import pandas as pd

# plotting modules
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

# make sure charts appear in the notebook:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

## Load data

In [4]:
df = pd.read_csv('datasets/Melbourne_housing_data.csv')

In [5]:
df.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,68 Studley St,2,h,,SS,Jellis,3/09/2016,2.5,3067.0,...,1.0,1.0,126.0,,,Yarra,-37.8014,144.9958,Northern Metropolitan,4019.0
1,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,...,1.0,1.0,202.0,,,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0
2,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0
3,Abbotsford,18/659 Victoria St,3,u,,VB,Rounds,4/02/2016,2.5,3067.0,...,2.0,1.0,0.0,,,Yarra,-37.8114,145.0116,Northern Metropolitan,4019.0
4,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019.0


## Gl

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19740 entries, 0 to 19739
Data columns (total 21 columns):
Suburb           19740 non-null object
Address          19740 non-null object
Rooms            19740 non-null int64
Type             19740 non-null object
Price            15396 non-null float64
Method           19740 non-null object
SellerG          19740 non-null object
Date             19740 non-null object
Distance         19732 non-null float64
Postcode         19732 non-null float64
Bedroom2         15327 non-null float64
Bathroom         15327 non-null float64
Car              15327 non-null float64
Landsize         14944 non-null float64
BuildingArea     8617 non-null float64
YearBuilt        9351 non-null float64
CouncilArea      15296 non-null object
Lattitude        15448 non-null float64
Longtitude       15448 non-null float64
Regionname       19732 non-null object
Propertycount    19732 non-null float64
dtypes: float64(12), int64(1), object(8)
memory usage: 3.2+ MB


In [13]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Rooms,19740.0,2.947163,0.981048,1.0,2.0,3.0,4.0,12.0
Price,15396.0,1054957.0,645255.717547,85000.0,630000.0,880000.0,1301000.0,9000000.0
Distance,19732.0,9.861509,5.554233,0.0,6.1,9.2,12.6,47.4
Postcode,19732.0,3106.534,88.429928,3000.0,3046.0,3101.0,3147.0,3978.0
Bedroom2,15327.0,2.900568,1.007491,0.0,2.0,3.0,3.0,30.0
Bathroom,15327.0,1.548509,0.713385,0.0,1.0,1.0,2.0,12.0
Car,15327.0,1.578065,0.972221,0.0,1.0,2.0,2.0,26.0
Landsize,14944.0,583.9171,3785.423175,0.0,166.0,420.0,663.0,433014.0
BuildingArea,8617.0,196.807,561.558007,0.0,94.0,132.0,199.0,40468.0
YearBuilt,9351.0,1874.166,393.354888,1.0,1930.0,1965.0,1997.0,2106.0
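
Some extremes in this summary look like data-entry errors, for example a YearBuilt minimum of 1 and maximum of 2106. A minimal sketch of how such rows could be flagged on a stand-in frame. The names `toy`, `MIN_YEAR` and `MAX_YEAR` are hypothetical, and the bounds 1800 and 2018 are assumptions picked for illustration, not values from the dataset documentation.

```python
import numpy as np
import pandas as pd

# Assumed plausibility bounds, for illustration only
MIN_YEAR, MAX_YEAR = 1800, 2018

# Stand-in for df['YearBuilt'], including the min and max reported by describe()
toy = pd.DataFrame({'YearBuilt': [1900.0, 1.0, 2106.0, np.nan, 1965.0]})

# between() returns False for NaN, so require a non-null value before flagging;
# otherwise missing years would be reported as suspect too
suspect = toy['YearBuilt'].notnull() & ~toy['YearBuilt'].between(MIN_YEAR, MAX_YEAR)
print(toy[suspect])
```

Whether flagged rows should be dropped, corrected, or set to missing is a separate decision that depends on the model being built.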


In [10]:
df.isnull().sum()

Suburb               0
Address              0
Rooms                0
Type                 0
Price             4344
Method               0
SellerG              0
Date                 0
Distance             8
Postcode             8
Bedroom2          4413
Bathroom          4413
Car               4413
Landsize          4796
BuildingArea     11123
YearBuilt        10389
CouncilArea       4444
Lattitude         4292
Longtitude        4292
Regionname           8
Propertycount        8
dtype: int64
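
The raw counts are easier to compare as a share of all rows. A minimal sketch, using a small stand-in frame built from the first four rows shown by `df.head()`; the helper name `missing_share` and the frame `toy` are hypothetical.

```python
import numpy as np
import pandas as pd

def missing_share(frame: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values per column, largest first."""
    return (frame.isnull().mean() * 100).sort_values(ascending=False)

# Stand-in for df: three columns from the first four rows of df.head()
toy = pd.DataFrame({
    'Rooms': [2, 2, 2, 3],
    'Price': [np.nan, 1480000.0, 1035000.0, np.nan],
    'BuildingArea': [np.nan, np.nan, 79.0, np.nan],
})

share = missing_share(toy)
print(share)
```

On the full frame, `missing_share(df)` would rank BuildingArea (11123 of 19740 rows, about 56%) and YearBuilt (10389 rows, about 53%) at the top.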

In [None]:
# Since the goal is to predict house price,
# rows with a null value in the Price column need to be removed.
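
One way to do this is `DataFrame.dropna` with `subset=['Price']`, which removes only the rows whose price is missing and leaves gaps in other columns untouched. A minimal sketch on a stand-in frame built from the five rows shown by `df.head()`; the names `toy` and `priced` are hypothetical.

```python
import numpy as np
import pandas as pd

# Stand-in for df: three columns from the five rows of df.head()
toy = pd.DataFrame({
    'Suburb': ['Abbotsford'] * 5,
    'Rooms': [2, 2, 2, 3, 3],
    'Price': [np.nan, 1480000.0, 1035000.0, np.nan, 1465000.0],
})

# Keep only rows with a known target value, then renumber the index
priced = toy.dropna(subset=['Price']).reset_index(drop=True)
print(len(toy), '->', len(priced))
```

Applied to the full data, the same call should leave 15396 rows, matching the non-null count that `df.info()` reports for Price.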

