# Decision Tree Regressor

In [41]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt 

## Repeat the data import process

In [42]:
data = pd.read_csv('flat-prices.csv')
data.head()

| Unnamed: 0 | month | town | flat_type | block | street_name | storey_range | floor_area_sqm | flat_model | lease_commence_date | resale_price |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1990-01 | ANG MO KIO | 1 ROOM | 309 | ANG MO KIO AVE 1 | 10 TO 12 | 31.0 | IMPROVED | 1977 | 9000 |
| 1 | 1990-01 | ANG MO KIO | 1 ROOM | 309 | ANG MO KIO AVE 1 | 04 TO 06 | 31.0 | IMPROVED | 1977 | 6000 |
| 2 | 1990-01 | ANG MO KIO | 1 ROOM | 309 | ANG MO KIO AVE 1 | 10 TO 12 | 31.0 | IMPROVED | 1977 | 8000 |
| 3 | 1990-01 | ANG MO KIO | 1 ROOM | 309 | ANG MO KIO AVE 1 | 07 TO 09 | 31.0 | IMPROVED | 1977 | 6000 |
| 4 | 1990-01 | ANG MO KIO | 3 ROOM | 216 | ANG MO KIO AVE 1 | 04 TO 06 | 73.0 | NEW GENERATION | 1976 | 47200 |


In [43]:
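# Note: the same encoder object is refit for each column, so afterwards
# it only remembers the classes of the last column ('flat_model').
label = LabelEncoder()

# Replace each categorical column with integer codes (labels are sorted first)
data['town'] = label.fit_transform(data['town'])
data['flat_type'] = label.fit_transform(data['flat_type'])
data['storey_range'] = label.fit_transform(data['storey_range'])
data['flat_model'] = label.fit_transform(data['flat_model'])

# Feature matrix: six predictors, in this column order
selected_features = data[['town', 'flat_type', 'storey_range', 'floor_area_sqm', 'flat_model', 'lease_commence_date']]
X = selected_features.values

# Target: resale price
y = data['resale_price']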

## Summary of Encoding

Town:
0 - ANG MO KIO
1 - BEDOK
2 - BISHAN
3 - BUKIT BATOK
4 - BUKIT MERAH
5 - BUKIT PANJANG
6 - BUKIT TIMAH
7 - CENTRAL AREA
8 - CHOA CHU KANG
9 - CLEMENTI
10 - GEYLANG
11 - HOUGANG
12 - JURONG EAST
13 - JURONG WEST
14 - KALLANG/WHAMPOA
15 - LIM CHU KANG
16 - MARINE PARADE
17 - PASIR RIS
18 - QUEENSTOWN
19 - SEMBAWANG
20 - SENGKANG
21 - SERANGOON
22 - TAMPINES
23 - TOA PAYOH
24 - WOODLANDS
25 - YISHUN

Flat Type:
0 - 1 ROOM
1 - 2 ROOM
2 - 3 ROOM
3 - 4 ROOM
4 - 5 ROOM
5 - EXECUTIVE 
6 - MULTI GENERATION

Storey Range:
0 - 01 TO 03 
1 - 04 TO 06 
2 - 07 TO 09
3 - 10 TO 12
4 - 13 TO 15
5 - 16 TO 18
6 - 19 TO 21
7 - 22 TO 24
8 - 25 TO 27

Flat Model:
0 - 2-ROOM
1 - APARTMENT
2 - IMPROVED
3 - IMPROVED-MAISONETTE
4 - MAISONETTE
5 - MODEL A
6 - MODEL A-MAISONETTE
7 - MULTI GENERATION
8 - NEW GENERATION
9 - PREMIUM APARTMENT
10 - SIMPLIFIED
11 - STANDARD
12 - TERRACE
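
A table like the one above can be read directly from a fitted encoder instead of being typed by hand. The sketch below is illustrative only: the `towns` list is a hypothetical sample rather than the real column, and it assumes one `LabelEncoder` per column (the notebook reuses a single encoder, which keeps only the classes of the last column it was fitted on).

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample; the notebook encodes the full 'town' column instead.
towns = ["BEDOK", "ANG MO KIO", "BISHAN", "BEDOK"]

encoder = LabelEncoder()
codes = encoder.fit_transform(towns)

# classes_ holds the sorted unique labels; each label's index is its code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}

print(mapping)
print(codes.tolist())
```

Keeping a separate fitted encoder for each column also makes it possible to call `transform` on new raw values later, for example when building a single row to predict on.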

## Training the model

A decision tree model has many hyperparameters that affect how it fits the data, such as max_depth, min_samples_split, min_samples_leaf and max_features.

In this model, the max_depth and min_samples_split parameters are optimised using a grid search.

max_depth = maximum depth of the tree
min_samples_split = minimum number of samples required to split an internal node
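
To see what these two parameters do in practice, the following minimal sketch fits three trees on synthetic data (the data, the variable names and the specific parameter values are assumptions chosen for illustration, not taken from the flat-price dataset) and compares their size.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic, noisy 1-D regression problem (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(1000, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.1, size=1000)

# max_depth caps how many levels of splits the tree may have
shallow = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_demo, y_demo)

# min_samples_split stops splitting nodes that hold fewer than 500 samples
coarse = DecisionTreeRegressor(min_samples_split=500, random_state=0).fit(X_demo, y_demo)

# No constraints: the tree keeps splitting until the leaves are pure
unconstrained = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)

for name, tree in [("shallow", shallow), ("coarse", coarse), ("unconstrained", unconstrained)]:
    print(name, "depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
```

Both constraints shrink the tree, which generally trades some fit on the training data for less risk of overfitting.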

In [44]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

In [45]:
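# Hold out 25% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Candidate values for the two hyperparameters being tuned
grid_values = {'max_depth': [10, 20, 30, 40, 50], 'min_samples_split': [100, 500, 1000, 5000, 10000]}
reg = DecisionTreeRegressor()
# Cross-validated search over every combination; higher (less negative) score is better
grid_reg = GridSearchCV(reg, param_grid=grid_values, scoring='neg_mean_squared_error')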
grid_reg.fit(X_train, y_train)

best_params = grid_reg.best_params_

best_params

{'max_depth': 20, 'min_samples_split': 500}
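
Because the search is scored with `neg_mean_squared_error`, `best_score_` is a negated MSE, so a cross-validated RMSE can be recovered as the square root of its negation. The sketch below shows this on synthetic data with a deliberately small grid; the data, the grid values and the explicit `cv=5` are assumptions for illustration.

```python
import numpy as np
from math import sqrt
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustrative only): y depends linearly on the first feature
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(400, 2))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(0, 1.0, size=400)

param_grid = {"max_depth": [2, 5], "min_samples_split": [10, 50]}
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid=param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_demo, y_demo)

# best_score_ is the mean negated MSE across folds for the best combination
cv_rmse = sqrt(-search.best_score_)

print(search.best_params_)
print(cv_rmse)
```

The full per-combination results are available in `search.cv_results_` if more than the single best setting is of interest.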

## Evaluate the model using train data

In [46]:
best_model = grid_reg.best_estimator_
y_pred_train = best_model.predict(X_train)
RMSE_train = sqrt(mean_squared_error(y_train, y_pred_train))

RMSE_train

69034.9971775791

## Evaluate the model using test data

In [47]:
y_pred_test = best_model.predict(X_test)
RMSE_test = sqrt(mean_squared_error(y_test, y_pred_test))

RMSE_test

69953.17312318627
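
The RMSE used in both evaluations is the square root of the mean squared difference between actual and predicted values. As a quick check of that definition, this sketch compares the scikit-learn result with a manual NumPy calculation on three made-up values (`y_true` and `y_pred` here are hypothetical, not from the dataset).

```python
import numpy as np
from math import sqrt
from sklearn.metrics import mean_squared_error

# Made-up values for illustration
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

# Same approach as the notebook: square root of sklearn's MSE
rmse_sklearn = sqrt(mean_squared_error(y_true, y_pred))

# Manual definition: sqrt(mean((actual - predicted)^2))
rmse_manual = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

print(rmse_sklearn, rmse_manual)
```

RMSE is expressed in the same units as the target, so the values above are in the same currency units as resale_price.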

## Prediction Example 

A 5 ROOM flat in BEDOK, storey range 04 TO 06, flat model IMPROVED, floor area = 121 sqm, lease commenced in 1980.


The feature vector, in the same column order as the training data, is:
1 - BEDOK
4 - 5 ROOM
1 - 04 TO 06
121 - FLOOR AREA (sqm)
2 - IMPROVED
1980 - LEASE COMMENCEMENT DATE


Actual Resale Price: 145,000

In [48]:
X_example = np.array([1, 4, 1, 121, 2, 1980])
y_example = best_model.predict(X_example.reshape(1, -1))

y_example[0]

300002.457002457
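
The gap between this prediction and the actual resale price can be quantified with simple arithmetic. The sketch below uses only the two figures shown in this section; it describes a single example and says nothing about typical error, which the RMSE covers.

```python
# Figures taken from the example above
actual_price = 145_000
predicted_price = 300002.457002457

# Absolute error in price units, and error relative to the actual price
abs_error = abs(predicted_price - actual_price)
rel_error = abs_error / actual_price

print(f"absolute error: {abs_error:,.0f}")
print(f"relative error: {rel_error:.1%}")
```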

## Conclusion

The Decision Tree Regressor with optimised parameters (max_depth = 20, min_samples_split = 500) performs better than the Simple Linear Regression model, as it has a lower RMSE on both the training data (about 69,035) and the test data (about 69,953).