## Property Price Prediction -- Building the first baseline
Written by Pengcheng Xie on 05-10-2018

### Main tasks in this notebook
1. Feature engineering
2. Building the training and test datasets
3. Building the model (XGBoost)
4. Testing the model (scored with RMSE, computed from *mean_squared_error*)

### TODO:
Build other models such as Lasso regression, Ridge regression and SVR.

Then stack the models to improve the test score.

Data cleaning and feature engineering can also be revisited to improve the model's performance.
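
As a rough illustration of the stacking idea in the TODO, the sketch below combines Lasso, Ridge and SVR with scikit-learn's `StackingRegressor`. It runs on synthetic data from `make_regression`, not on the property dataset, and every hyperparameter and variable name here (`base_models`, `stack`, `stack_rmse`) is an assumption chosen for illustration rather than a tuned or recommended setting.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the property data (assumption: NOT the real dataset).
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=123)

# Base learners; each is scaled first because Lasso, Ridge and SVR are scale-sensitive.
base_models = [
    ("lasso", make_pipeline(StandardScaler(), Lasso(alpha=1.0))),
    ("ridge", make_pipeline(StandardScaler(), Ridge(alpha=1.0))),
    ("svr", make_pipeline(StandardScaler(), SVR(C=100.0))),
]

# The final estimator is trained on out-of-fold predictions of the base learners.
stack = StackingRegressor(estimators=base_models, final_estimator=Ridge(alpha=1.0), cv=5)
stack.fit(X_tr, y_tr)

stack_pred = stack.predict(X_te)
stack_rmse = np.sqrt(mean_squared_error(y_te, stack_pred))
print(stack_rmse)
```

On the real data, the same pattern would take the one-hot encoded feature matrix in place of `X_demo`, and a gradient-boosted model could be added to `base_models`.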

In [42]:
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sb

import re

### 1. Import the cleaned dataset
It is stored in *./data/data.csv*.

In [72]:
# Show all columns when displaying DataFrames
pd.set_option('display.max_columns', 1000)

In [3]:
# Read the cleaned data
trainf = pd.read_csv('./data/data.csv')

trainf.head()

Unnamed: 0.1,Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,Bathroom,Car,Landsize,CouncilArea,Regionname,Propertycount
0,1,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,1.0,1.0,202.0,Yarra City Council,Northern Metropolitan,4019.0
1,2,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,1.0,0.0,156.0,Yarra City Council,Northern Metropolitan,4019.0
2,4,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,2.0,0.0,134.0,Yarra City Council,Northern Metropolitan,4019.0
3,5,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,4/03/2017,2.5,3067.0,2.0,1.0,94.0,Yarra City Council,Northern Metropolitan,4019.0
4,6,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,3067.0,1.0,2.0,120.0,Yarra City Council,Northern Metropolitan,4019.0


### 2. Feature Engineering

In [74]:
# Drop the columns that are not used as features
train = trainf.drop(['Postcode', 'Unnamed: 0', 'Address', 'Suburb'], axis=1)
train.shape

(27244, 13)

In [47]:
train.head()

Unnamed: 0,Suburb,Rooms,Type,Price,Method,SellerG,Date,Distance,Bathroom,Car,Landsize,CouncilArea,Regionname,Propertycount
0,Abbotsford,2,h,1480000.0,S,Biggin,3/12/2016,2.5,1.0,1.0,202.0,Yarra City Council,Northern Metropolitan,4019.0
1,Abbotsford,2,h,1035000.0,S,Biggin,4/02/2016,2.5,1.0,0.0,156.0,Yarra City Council,Northern Metropolitan,4019.0
2,Abbotsford,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,2.0,0.0,134.0,Yarra City Council,Northern Metropolitan,4019.0
3,Abbotsford,3,h,850000.0,PI,Biggin,4/03/2017,2.5,2.0,1.0,94.0,Yarra City Council,Northern Metropolitan,4019.0
4,Abbotsford,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,1.0,2.0,120.0,Yarra City Council,Northern Metropolitan,4019.0


In [75]:
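# Keep only the four-digit year of the sale date; the column is renamed to Sold_Year in the next cell.
# The extracted year stays a string, so it is treated as a categorical feature later on.
train['Date'] = train['Date'].apply(lambda x: re.search(r'(\d{4})', str(x)).group())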

In [76]:
train.rename(columns={'Date': 'Sold_Year'}, inplace=True)
train.head()

Unnamed: 0,Rooms,Type,Price,Method,SellerG,Sold_Year,Distance,Bathroom,Car,Landsize,CouncilArea,Regionname,Propertycount
0,2,h,1480000.0,S,Biggin,2016,2.5,1.0,1.0,202.0,Yarra City Council,Northern Metropolitan,4019.0
1,2,h,1035000.0,S,Biggin,2016,2.5,1.0,0.0,156.0,Yarra City Council,Northern Metropolitan,4019.0
2,3,h,1465000.0,SP,Biggin,2017,2.5,2.0,0.0,134.0,Yarra City Council,Northern Metropolitan,4019.0
3,3,h,850000.0,PI,Biggin,2017,2.5,2.0,1.0,94.0,Yarra City Council,Northern Metropolitan,4019.0
4,4,h,1600000.0,VB,Nelson,2016,2.5,1.0,2.0,120.0,Yarra City Council,Northern Metropolitan,4019.0
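
For comparison with the regex above, here is a minimal sketch of extracting the year by parsing the dates. It assumes the `Date` strings are in day/month/year order, which the sample rows suggest but the source does not state. The series `dates` is a hand-copied toy sample, not the full column.

```python
import pandas as pd

# Toy sample copied from the first rows shown above.
dates = pd.Series(["3/12/2016", "4/02/2016", "4/03/2017"])

# Assumption: day/month/year order. Parsing yields integer years,
# whereas the regex approach yields strings.
sold_year = pd.to_datetime(dates, format="%d/%m/%Y").dt.year
print(sold_year.tolist())
```

The difference matters downstream: an integer year is kept as a single numeric column by `pd.get_dummies`, while a string year is expanded into one indicator column per distinct year.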


In [57]:
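# Correlation of each numeric column with Price (non-numeric columns are excluded explicitly,
# because DataFrame.corr() raises on text columns in pandas 2.0 and later)
corr = train.select_dtypes(include='number').corr()['Price']
corr.sort_values()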

Distance        -0.211415
Propertycount   -0.059017
Landsize         0.027101
Car              0.239317
Bathroom         0.422082
Rooms            0.465231
Price            1.000000
Name: Price, dtype: float64

### 3. Make the training and test sets
I split the dataset into 80% for training and 20% for testing.

In [82]:
import xgboost as xgb

from sklearn.metrics import mean_squared_error

import numpy as np

from sklearn.model_selection import train_test_split

In [83]:
y=train['Price']
y.head()

0    1480000.0
1    1035000.0
2    1465000.0
3     850000.0
4    1600000.0
Name: Price, dtype: float64

In [84]:
# One-hot encode the categorical (text) features
train1 = train.drop(['Price'], axis=1)
X = pd.get_dummies(train1).reset_index(drop=True)

In [85]:
X.shape

(27244, 407)
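
To make the jump from 12 feature columns to 407 easier to follow, the toy sketch below shows what `pd.get_dummies` does to a small frame. The frame `toy` and its values are made up for illustration and are not rows from the dataset.

```python
import pandas as pd

# Hypothetical three-row frame with one numeric and two text columns.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Type": ["h", "u", "h"],
    "Method": ["S", "SP", "S"],
})

# Numeric columns pass through unchanged; each text column becomes
# one indicator column per distinct value.
toy_encoded = pd.get_dummies(toy)
print(list(toy_encoded.columns))
print(toy_encoded.shape)
```

In the real data, text columns such as `SellerG`, `CouncilArea` and the string-valued `Sold_Year` each contribute one column per distinct value, which is where the extra columns come from.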

In [86]:
# 80% for training and 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

In [88]:
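# DMatrix is the input format of the native xgb.train / xgb.cv API.
# It is not used by the scikit-learn style XGBRegressor below.
matrix = xgb.DMatrix(data=X, label=y)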

### 4. Building, training and testing the model
I use an XGBoost model for the regression.

In [89]:
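# XGBoost regressor. 'reg:linear' is the squared-error objective (named
# 'reg:squarederror' in newer XGBoost releases); alpha is the L1 penalty.
xg_reg = xgb.XGBRegressor(objective='reg:linear', colsample_bytree=0.4, learning_rate=0.01,
                          max_depth=8, alpha=10, n_estimators=600, subsample=0.7)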

In [90]:
xg_reg.fit(X_train,y_train)

XGBRegressor(alpha=10, base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=0.4, gamma=0, learning_rate=0.01, max_delta_step=0,
       max_depth=8, min_child_weight=1, missing=None, n_estimators=600,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=0.7)

In [91]:
# RMSE on the held-out 20%, in the same units as Price
pred = xg_reg.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, pred))
rmse

288059.91951198713

In [92]:
# RMSE on the log scale; this requires every prediction to be strictly positive
logrmse = np.sqrt(mean_squared_error(np.log(y_test), np.log(pred)))
logrmse

0.2110988203011321
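
The two scores above can be written out explicitly. The sketch below is a plain NumPy version of both metrics, applied to two made-up prices. The helper names `rmse_score` and `log_rmse_score` are hypothetical and are not part of scikit-learn or XGBoost.

```python
import numpy as np

def rmse_score(y_true, y_pred):
    """Root of the mean squared difference between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def log_rmse_score(y_true, y_pred):
    """RMSE of the natural logs; inputs must be strictly positive."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if np.any(y_true <= 0) or np.any(y_pred <= 0):
        raise ValueError("log RMSE needs strictly positive values")
    return rmse_score(np.log(y_true), np.log(y_pred))

# Made-up prices, for illustration only.
toy_true = [1_000_000.0, 500_000.0]
toy_pred = [1_100_000.0, 450_000.0]
print(rmse_score(toy_true, toy_pred))
print(log_rmse_score(toy_true, toy_pred))
```

RMSE is dominated by errors on expensive properties, while the log version measures relative error. Read that way, a log RMSE near 0.21 corresponds to predictions that are typically off by a factor of about exp(0.21) ≈ 1.23.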