# Housing Prices

Using this [Kaggle data](https://www.kaggle.com/anthonypino/melbourne-housing-market), create a model to predict a house's value. The goal is to understand what drives a house's value, from the perspective of a real estate developer.

---

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
from sklearn import ensemble

%matplotlib inline

In [2]:
file = 'C:/Users/Carter Carlson/Documents/Thinkful/Large Databases/Housing prices.csv'
df = pd.read_csv(file)
df.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,68 Studley St,2,h,,SS,Jellis,3/09/2016,2.5,3067.0,...,1.0,1.0,126.0,,,Yarra City Council,-37.8014,144.9958,Northern Metropolitan,4019.0
1,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,...,1.0,1.0,202.0,,,Yarra City Council,-37.7996,144.9984,Northern Metropolitan,4019.0
2,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra City Council,-37.8079,144.9934,Northern Metropolitan,4019.0
3,Abbotsford,18/659 Victoria St,3,u,,VB,Rounds,4/02/2016,2.5,3067.0,...,2.0,1.0,0.0,,,Yarra City Council,-37.8114,145.0116,Northern Metropolitan,4019.0
4,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra City Council,-37.8093,144.9944,Northern Metropolitan,4019.0


In [3]:
print(df.groupby('Type')['Type'].count())

Type
h    23980
t     3580
u     7297
Name: Type, dtype: int64


In [4]:
# Remove columns we won't use as features
df = df.drop(columns=['Lattitude', 'Longtitude', 'Car', 'Address', 'Date', 
                      'CouncilArea', 'Regionname', 'Method', 'SellerG', 'Suburb'])

# Convert Type into numerical values
type_list = { 'h': 1, 
              't': 2,
              'u': 3 }
df['Type'] = df['Type'].map(type_list)

nullcount = df.isnull().sum()
print('\n NaN as a fraction of rows, by column\n\n{}'.format(nullcount[nullcount>0]/len(df)))


 NaN as a fraction of rows, by column

Price            0.218321
Distance         0.000029
Postcode         0.000029
Bedroom2         0.235735
Bathroom         0.235993
Landsize         0.338813
BuildingArea     0.605761
YearBuilt        0.553863
Propertycount    0.000086
dtype: float64
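
The integer codes assigned to `Type` above imply an ordering (h < t < u) that the property types do not really have. Tree ensembles often cope with this, but one-hot encoding is a common alternative. A minimal sketch on a made-up frame (`toy` and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data's 'Type' column.
toy = pd.DataFrame({'Type': ['h', 'u', 't', 'h'],
                    'Rooms': [3, 2, 3, 4]})

# One 0/1 indicator column per property type instead of a single 1/2/3 code.
encoded = pd.get_dummies(toy, columns=['Type'], prefix='Type', dtype=int)

print(encoded)
```

Whether this helps the model here is an open question; it mainly avoids suggesting that a unit is "three times" a house.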


In [5]:
# Over 50% of the values for BuildingArea and YearBuilt are NaN.
# If the rows that do have values are concentrated in a particular price range,
# keeping these columns could skew the results, so I'll remove them.

df = df.drop(columns=['BuildingArea', 'YearBuilt'])

# Now that we've removed the columns with the most NaNs, drop any row that still contains a NaN.
df = df.dropna()

# How many rows do we have left?
print('Rows left: {}'.format(len(df)))

Rows left: 17973
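
The worry in the comments above, that rows with recorded values may cluster at particular prices, can be checked directly by comparing prices between rows where a column is missing and rows where it is present. The sketch below uses a small made-up frame (`toy` is hypothetical); in the notebook the same check would have to run before `BuildingArea` and `YearBuilt` are dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not values from the Kaggle file.
toy = pd.DataFrame({
    'Price':        [500000, 650000, 700000, 1200000, 1500000, np.nan],
    'BuildingArea': [np.nan, np.nan, 90.0, 150.0, 210.0, 120.0],
})

# Flag rows where BuildingArea is missing, then summarize Price per group.
missing = toy['BuildingArea'].isna()
summary = (toy.groupby(missing)['Price']
              .agg(['count', 'median'])
              .rename(index={False: 'present', True: 'missing'}))

print(summary)
```

A large gap between the two medians would suggest the values are not missing at random, which is one argument for dropping the column rather than imputing it.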


In [7]:
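X = df.drop(columns='Price')
Y = df['Price']

# These settings are defined but never passed to the estimator below
# ('loss' is a gradient boosting parameter, not a random forest one).
params = {'n_estimators': 500,
          'max_depth': 2,
          'loss': 'deviance'}

# Note: Price is continuous, so this classifier treats every distinct price as its own class.
regr = ensemble.RandomForestClassifier()
regr.fit(X, Y)

feature_importance = regr.feature_importances_

print('Relative importance by feature:\n')
for col, importance in zip(X.columns, feature_importance):
    print(col, ':  ', importance)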

Relative importance by feature:

Rooms :   0.0280930569745666
Type :   0.015095938949003445
Distance :   0.12845481456306557
Postcode :   0.09326364636894494
Bedroom2 :   0.029411794142440745
Bathroom :   0.04976720006178763
Landsize :   0.5475959732881284
Propertycount :   0.10831757565206279
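
Since `Price` is a continuous target, a regressor is arguably a more natural fit than a classifier, and a held-out split gives a rough sense of how well the model generalizes. The following is only a sketch under stated assumptions: the data is synthetic (column names mirror the notebook, values are generated), and the hyperparameters are arbitrary choices. Importances from a model like this would not be expected to match the numbers printed above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned frame; values are made up.
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    'Rooms': rng.integers(1, 6, n),
    'Distance': rng.uniform(1, 40, n),
    'Landsize': rng.uniform(0, 1000, n),
})
# Assumed price relationship, for illustration only.
y = 200_000 * X['Rooms'] - 10_000 * X['Distance'] + rng.normal(0, 50_000, n)

# Hold out a quarter of the rows to evaluate on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

regr = RandomForestRegressor(n_estimators=100, random_state=0)
regr.fit(X_train, y_train)

r2 = regr.score(X_test, y_test)
importances = pd.Series(regr.feature_importances_, index=X.columns)

print('R^2 on held-out rows: {:.3f}'.format(r2))
print(importances.sort_values(ascending=False))
```

Fixing `random_state` makes the importances reproducible between runs, which the default random forest is not.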
