# Problem Statement

### We want to develop a model that predicts the resale price of a flat from its other features

# Data Preprocessing

### Using dataset from https://beta.data.gov.sg/collections/189/datasets/d_8b84c4ee58e3cfc0ece0d773c8ca6abc/view

In [2]:
import pandas
loadeddata = pandas.read_csv('ResaleflatpricesbasedonregistrationdatefromJan2017onwards.csv')
loadeddata.head()

Unnamed: 0,month,town,flat_type,block,street_name,storey_range,floor_area_sqm,flat_model,lease_commence_date,remaining_lease,resale_price
0,2017-01,ANG MO KIO,2 ROOM,406,ANG MO KIO AVE 10,10 TO 12,44.0,Improved,1979,61 years 04 months,232000.0
1,2017-01,ANG MO KIO,3 ROOM,108,ANG MO KIO AVE 4,01 TO 03,67.0,New Generation,1978,60 years 07 months,250000.0
2,2017-01,ANG MO KIO,3 ROOM,602,ANG MO KIO AVE 5,01 TO 03,67.0,New Generation,1980,62 years 05 months,262000.0
3,2017-01,ANG MO KIO,3 ROOM,465,ANG MO KIO AVE 10,04 TO 06,68.0,New Generation,1980,62 years 01 month,265000.0
4,2017-01,ANG MO KIO,3 ROOM,601,ANG MO KIO AVE 5,01 TO 03,67.0,New Generation,1980,62 years 05 months,265000.0


### First, check for any empty values

In [3]:
loadeddata.isna().any()

month                  False
town                   False
flat_type              False
block                  False
street_name            False
storey_range           False
floor_area_sqm         False
flat_model             False
lease_commence_date    False
remaining_lease        False
resale_price           False
dtype: bool

### Next, check for any values that are not possible

In [4]:
loadeddata.max()

month                             2024-02
town                               YISHUN
flat_type                MULTI-GENERATION
block                                  9B
street_name                       ZION RD
storey_range                     49 TO 51
floor_area_sqm                      249.0
flat_model                        Type S2
lease_commence_date                  2022
remaining_lease        97 years 09 months
resale_price                    1568888.0
dtype: object

In [5]:
loadeddata.min()

month                             2017-01
town                           ANG MO KIO
flat_type                          1 ROOM
block                                   1
street_name                  ADMIRALTY DR
storey_range                     01 TO 03
floor_area_sqm                       31.0
flat_model                         2-room
lease_commence_date                  1966
remaining_lease        41 years 09 months
resale_price                     140000.0
dtype: object
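Note that `max()` and `min()` compare string columns lexicographically, which is why `block` tops out at `9B` above, so only the numeric columns give a meaningful range check. A minimal sketch of an explicit range check follows. The small `sample` frame and the `bounds` values are assumptions made up for illustration and do not come from the dataset documentation.

```python
import pandas

# Tiny stand-in for loadeddata; the values are illustrative only
sample = pandas.DataFrame({
    'block': ['406', '9B', '10'],
    'floor_area_sqm': [44.0, 67.0, 249.0],
    'resale_price': [232000.0, 250000.0, 1568888.0],
})

# String comparison is lexicographic, so '9B' sorts after '406' and '10'
print(sample['block'].max())

# Hypothetical plausibility bounds (assumptions, not official limits)
bounds = {'floor_area_sqm': (20, 400), 'resale_price': (50_000, 5_000_000)}

# Flag every row where any checked column falls outside its interval
outofrange = pandas.Series(False, index=sample.index)
for column, (low, high) in bounds.items():
    outofrange |= ~sample[column].between(low, high)

print(sample[outofrange])
```

An empty result from the last line means every checked value lies inside its assumed interval.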

# Feature Engineering

We will use the columns below: five features, plus resale_price as the target we want to predict

In [6]:
# Take a copy so that adding columns later does not act on a slice of loadeddata
data = loadeddata[['flat_type', 'town', 'storey_range', 'floor_area_sqm', 'remaining_lease', 'resale_price']].copy()
data.head()

Unnamed: 0,flat_type,town,storey_range,floor_area_sqm,remaining_lease,resale_price
0,2 ROOM,ANG MO KIO,10 TO 12,44.0,61 years 04 months,232000.0
1,3 ROOM,ANG MO KIO,01 TO 03,67.0,60 years 07 months,250000.0
2,3 ROOM,ANG MO KIO,01 TO 03,67.0,62 years 05 months,262000.0
3,3 ROOM,ANG MO KIO,04 TO 06,68.0,62 years 01 month,265000.0
4,3 ROOM,ANG MO KIO,01 TO 03,67.0,62 years 05 months,265000.0


Since we want remaining_lease to be an integer, we add a new column that expresses it in months

In [7]:
# Split strings such as "61 years 04 months" once; token 0 is the years, token 2 the months
leaseparts = data.remaining_lease.str.split(' ')
yearinmonths = leaseparts.str[0].astype(int) * 12
# A lease with a whole number of years has no months token, so treat it as 0
months = leaseparts.str[2].fillna(0).astype(int)
data['remaining_lease_in_months'] = yearinmonths + months
data.head(5)


Unnamed: 0,flat_type,town,storey_range,floor_area_sqm,remaining_lease,resale_price,remaining_lease_in_months
0,2 ROOM,ANG MO KIO,10 TO 12,44.0,61 years 04 months,232000.0,736
1,3 ROOM,ANG MO KIO,01 TO 03,67.0,60 years 07 months,250000.0,727
2,3 ROOM,ANG MO KIO,01 TO 03,67.0,62 years 05 months,262000.0,749
3,3 ROOM,ANG MO KIO,04 TO 06,68.0,62 years 01 month,265000.0,745
4,3 ROOM,ANG MO KIO,01 TO 03,67.0,62 years 05 months,265000.0,749
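The conversion above can be checked on a few hand-made strings. This is a minimal sketch: the `leases` values are illustrative, and the third one (no months part) is an assumed edge case included to show why `fillna(0)` is needed.

```python
import pandas

# Illustrative lease strings, including one with no months part
leases = pandas.Series(['61 years 04 months', '62 years 01 month', '61 years'])

parts = leases.str.split(' ')
years = parts.str[0].astype(int)
# .str[2] is NaN when the string has no months token, so fall back to 0
months = parts.str[2].fillna(0).astype(int)

leaseinmonths = years * 12 + months
print(leaseinmonths.tolist())  # [736, 745, 732]
```

The first value matches row 0 of the table above: 61 * 12 + 4 = 736.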


# Model Selection

Since we are trying to predict a continuous value (the resale price) from the other features, this is a regression problem. I have chosen polynomial regression and random forest as the 2 models to solve the problem statement.

First, we split the data into 2 sets: a training set (80%) and a test set (20%).

## Polynomial Regression

In [8]:
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(data[['floor_area_sqm', 'remaining_lease_in_months']], data['resale_price'], test_size = 0.2)
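Because `train_test_split` shuffles the rows randomly, the scores reported in this notebook will vary slightly from run to run. A minimal sketch of making the split repeatable is shown below. It uses synthetic arrays, and `random_state=0` is an arbitrary choice that was not part of the original run.

```python
import numpy
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature columns and the target
features = numpy.arange(20).reshape(10, 2)
target = numpy.arange(10)

# Any fixed integer works; 0 is an arbitrary choice
first = train_test_split(features, target, test_size=0.2, random_state=0)
second = train_test_split(features, target, test_size=0.2, random_state=0)

# Both calls return identical train and test partitions
print(all(numpy.array_equal(a, b) for a, b in zip(first, second)))
# 10 rows with test_size=0.2 gives 8 training rows and 2 test rows
print(len(first[0]), len(first[1]))
```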

In [31]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model

# Try polynomial degrees 0 to 9 and report the test-set R squared for each
for x in range(10):
    poly = PolynomialFeatures(degree=x)
    polyxtrain = poly.fit_transform(xtrain)
    # Reuse the transformer fitted on the training data instead of refitting it on the test data
    polyxtest = poly.transform(xtest)

    regression = linear_model.LinearRegression()
    polymodel = regression.fit(polyxtrain, ytrain)
    r = polymodel.score(polyxtest, ytest)

    print(f'The coefficient of determination of our Polynomial Regression Model is {round(r, 5)} at degree {x}')

The coefficient of determination of our Polynomial Regression Model is -0.0 at degree 0
The coefficient of determination of our Polynomial Regression Model is 0.50914 at degree 1
The coefficient of determination of our Polynomial Regression Model is 0.56908 at degree 2
The coefficient of determination of our Polynomial Regression Model is 0.5986 at degree 3
The coefficient of determination of our Polynomial Regression Model is 0.61979 at degree 4
The coefficient of determination of our Polynomial Regression Model is 0.62747 at degree 5
The coefficient of determination of our Polynomial Regression Model is 0.41533 at degree 6
The coefficient of determination of our Polynomial Regression Model is 0.53873 at degree 7
The coefficient of determination of our Polynomial Regression Model is 0.47547 at degree 8
The coefficient of determination of our Polynomial Regression Model is 0.42729 at degree 9


#### From this, we can tell that our model starts to overfit when its degree is higher than 5: the coefficient of determination on the test set rises steadily up to degree 5 (0.62747), then drops and fluctuates from degree 6 onwards

## Random Forest

Since a random forest is an ensemble of decision trees, which split on thresholds rather than fitting coefficients, converting the strings into integer category codes should not affect the results much. RandomForestRegressor cannot take strings as input, so we convert each string column into integer codes.

In [10]:
# Encode each string column as integer codes, numbered in order of first appearance
for column in ['flat_type', 'town', 'storey_range']:
    data[column] = pandas.factorize(data[column])[0]
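To make the integer encoding concrete, here is a minimal sketch on a handful of illustrative values. It uses `pandas.factorize`, which numbers categories in order of first appearance, and shows one-hot encoding with `pandas.get_dummies` as an alternative that avoids implying an order between categories. Neither the toy frame nor the one-hot variant is part of the original analysis.

```python
import pandas

# Illustrative values only
flats = pandas.DataFrame({'flat_type': ['2 ROOM', '3 ROOM', '3 ROOM', '4 ROOM', '2 ROOM']})

# factorize returns (codes, uniques); codes follow the order of first appearance
codes, uniques = pandas.factorize(flats['flat_type'])
print(codes.tolist())  # [0, 1, 1, 2, 0]
print(list(uniques))   # ['2 ROOM', '3 ROOM', '4 ROOM']

# One-hot alternative: one indicator column per category
onehot = pandas.get_dummies(flats['flat_type'])
print(onehot.shape)    # (5, 3)
```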


In [11]:
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(data[['flat_type', 'town', 'storey_range', 'floor_area_sqm', 'remaining_lease_in_months']], data['resale_price'], test_size = 0.2)

In [12]:
from sklearn.ensemble import RandomForestRegressor

for x in range(10, 26):
    forestmodel = RandomForestRegressor(max_depth=x)
    forestmodel.fit(xtrain, ytrain)
    r = forestmodel.score(xtest, ytest)
    
    print('The coefficient of determination of our Random Forest Model is: ' + str(round(r, 5)) + f' at depth {x}')

The coefficient of determination of our Random Forest Model is: 0.77628 at depth 10
The coefficient of determination of our Random Forest Model is: 0.80362 at depth 11
The coefficient of determination of our Random Forest Model is: 0.82697 at depth 12
The coefficient of determination of our Random Forest Model is: 0.84713 at depth 13
The coefficient of determination of our Random Forest Model is: 0.86421 at depth 14
The coefficient of determination of our Random Forest Model is: 0.87691 at depth 15
The coefficient of determination of our Random Forest Model is: 0.88624 at depth 16
The coefficient of determination of our Random Forest Model is: 0.89263 at depth 17
The coefficient of determination of our Random Forest Model is: 0.89663 at depth 18
The coefficient of determination of our Random Forest Model is: 0.89857 at depth 19
The coefficient of determination of our Random Forest Model is: 0.89895 at depth 20
The coefficient of determination of our Random Forest Model is: 0.8987 at de

# Model Evaluation

Based on the numbers above, it appears that the best parameters are degree 5 for the polynomial regression and a depth of around 20 for the random forest.

We will now use 3 metrics to evaluate which model is better: R squared, mean squared error and maximum error.

## R squared

In [32]:
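# xtrain and xtest now refer to the five-feature split created in the Random Forest section
poly = PolynomialFeatures(degree=5)
polyxtrain = poly.fit_transform(xtrain)
# Reuse the transformer fitted on the training data
polyxtest = poly.transform(xtest)

regression = linear_model.LinearRegression()
polymodel = regression.fit(polyxtrain, ytrain)
polymodel.score(polyxtest, ytest)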

0.6274659430056416

In [33]:
forestmodel = RandomForestRegressor(max_depth=20)
forestmodel.fit(xtrain, ytrain)
forestmodel.score(xtest, ytest)

0.8988984015008162

## Mean Squared Error

In [34]:
from sklearn.metrics import mean_squared_error
mean_squared_error(ytest, polymodel.predict(polyxtest))

10955361426.892527

In [35]:
mean_squared_error(ytest, forestmodel.predict(xtest))

2973163209.107365

## Maximum Error

In [36]:
from sklearn.metrics import max_error
max_error(ytest, polymodel.predict(polyxtest))

2633914.267997654

In [37]:
max_error(ytest, forestmodel.predict(xtest))

706613.1204794436

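##### On all 3 metrics, the random forest model was much better than the polynomial regression model: a higher R squared (0.899 against 0.627), a lower mean squared error (about 2.97e9 against about 1.10e10) and a lower maximum error (about 707,000 against about 2,634,000). Therefore, we conclude that the random forest is the better model for this problem.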
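For reference, the three metrics can be computed by hand from the residuals. The sketch below uses four made-up price pairs (not taken from the dataset) and then takes the square root of the two mean squared errors reported above, which expresses them in the same units as resale_price.

```python
import numpy

# Illustrative true and predicted prices, not taken from the dataset
ytrue = numpy.array([232000.0, 250000.0, 262000.0, 265000.0])
ypred = numpy.array([240000.0, 245000.0, 262000.0, 285000.0])

residuals = ytrue - ypred

# Mean squared error: the average of the squared residuals
mse = numpy.mean(residuals ** 2)
# Maximum error: the single worst absolute residual
maxerror = numpy.max(numpy.abs(residuals))
# R squared: 1 minus the residual sum of squares over the total sum of squares
rsquared = 1 - numpy.sum(residuals ** 2) / numpy.sum((ytrue - ytrue.mean()) ** 2)

print(mse, maxerror, rsquared)

# Root mean squared error of the polynomial and random forest models reported above
rmsepoly = numpy.sqrt(10955361426.892527)
rmseforest = numpy.sqrt(2973163209.107365)
print(rmsepoly, rmseforest)
```

The two square roots come to roughly 104,700 and 54,500, so a typical error of the random forest is about half that of the polynomial model.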

# Hyperparameter Tuning

For hyperparameter tuning, we will run a grid search over n_estimators, max_depth and max_leaf_nodes.

In [69]:
from sklearn.model_selection import GridSearchCV
grid = {'n_estimators': range(25, 101, 25),
        'max_depth': range(19, 22),
        'max_leaf_nodes': range(3, 13, 3)}
clf = GridSearchCV(estimator = RandomForestRegressor(), param_grid = grid)
clf.fit(xtrain, ytrain)

In [70]:
clf.best_params_

{'max_depth': 19, 'max_leaf_nodes': 12, 'n_estimators': 75}
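To get a feel for the size of this search, the grid can be enumerated with `ParameterGrid`. This is a minimal sketch that only counts combinations and does not fit any model. The factor of 5 reflects the default 5-fold cross-validation of `GridSearchCV`, which applies here because no `cv` argument was passed.

```python
from sklearn.model_selection import ParameterGrid

grid = {'n_estimators': range(25, 101, 25),
        'max_depth': range(19, 22),
        'max_leaf_nodes': range(3, 13, 3)}

# 4 values of n_estimators * 3 of max_depth * 4 of max_leaf_nodes
combinations = list(ParameterGrid(grid))
print(len(combinations))  # 48

# Each combination is fitted once per cross-validation fold
totalfits = len(combinations) * 5
print(totalfits)  # 240

# Largest leaf budget the search was allowed to try
print(max(grid['max_leaf_nodes']))  # 12
```

The selected `max_leaf_nodes` of 12 sits on the upper edge of the grid, which usually suggests that larger values are worth trying.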

Now that we have all of the parameters, we can input them into our final model and compare it against our initial one

In [73]:
# Initial model: only max_depth is set, everything else is left at its default
forestmodel = RandomForestRegressor(max_depth=20)
forestmodel.fit(xtrain, ytrain)

# Final model: the best parameter combination found by the grid search
final_model = RandomForestRegressor(**clf.best_params_)
final_model.fit(xtrain, ytrain)

In [78]:
finalRsquared = round(final_model.score(xtest, ytest), 5)
finalMSE = round(mean_squared_error(ytest, final_model.predict(xtest)), 2)
finalmaxerror = round(max_error(ytest, final_model.predict(xtest)), 2)

forestRsquared = round(forestmodel.score(xtest, ytest), 5)
forestMSE = round(mean_squared_error(ytest, forestmodel.predict(xtest)), 2)
forestmaxerror = round(max_error(ytest, forestmodel.predict(xtest)), 2)

print(f'Our final model had a coefficient of determination of {finalRsquared} while our initial model had a coefficient of determination of {forestRsquared}')
print(f'Our final model had a mean squared error of {finalMSE} while our initial model had a mean squared error of {forestMSE}')
print(f'Our final model had a maximum error of {finalmaxerror} while our initial model had a maximum error of {forestmaxerror}')

Our final model had a coefficient of determination of 0.55445 while our initial model had a coefficient of determination of 0.89906
Our final model had a mean squared error of 13102553185.27 while our initial model had a mean squared error of 2968336023.25
Our final model had a maximum error of 783661.95 while our initial model had a maximum error of 718334.55


# Model Interpretation

Based on the metrics we have chosen, the initial model with no tuning is much more accurate: it has a higher R squared value (0.89906 against 0.55445), a lower mean squared error and a lower maximum error. The most likely reason is the search grid itself: max_leaf_nodes was limited to at most 12, so every tree in the tuned forest has at most 12 leaves, and the search picked the largest value it was allowed to try. From this, we conclude that the initial model is better than the model tuned over this grid.

We also want to rank the features by their importance
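To illustrate how `max_leaf_nodes` limits a tree regardless of `max_depth`, here is a minimal sketch on synthetic data. The data, the single `DecisionTreeRegressor` standing in for one tree of the forest, and the `random_state` values are all assumptions made for illustration and are not part of the original analysis.

```python
import numpy
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a noisy linear relationship with 2000 samples
rng = numpy.random.default_rng(0)
x = rng.uniform(0, 100, size=(2000, 1))
y = 5000 * x[:, 0] + rng.normal(0, 10000, size=2000)

# Same kind of settings as the tuned and initial forests, applied to a single tree
capped = DecisionTreeRegressor(max_depth=19, max_leaf_nodes=12, random_state=0).fit(x, y)
uncapped = DecisionTreeRegressor(max_depth=20, random_state=0).fit(x, y)

# The leaf cap is reached long before the depth limit matters
print(capped.get_n_leaves(), capped.get_depth())
print(uncapped.get_n_leaves(), uncapped.get_depth())

# A tree can only predict one value per leaf
distinctpredictions = len(numpy.unique(capped.predict(x)))
print(distinctpredictions)
```

A tree with at most 12 leaves can output at most 12 distinct prices, which is a very coarse approximation for a target that ranges from 140,000 to over 1,500,000.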

In [96]:
for a, b in zip(list(xtrain.columns), forestmodel.feature_importances_):
    print(f'{round(b * 100, 3)}%', a)

3.215% flat_type
21.157% town
10.959% storey_range
47.209% floor_area_sqm
17.461% remaining_lease_in_months


In [98]:
for a, b in zip(list(xtrain.columns), final_model.feature_importances_):
    print(f'{round(b * 100, 3)}%', a)

2.498% flat_type
2.808% town
13.455% storey_range
70.479% floor_area_sqm
10.759% remaining_lease_in_months
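The two sets of importances printed above are easier to compare once each is sorted. The sketch below copies the printed percentages into plain dictionaries (the names `initial`, `final` and `rankings` are chosen for this example) and ranks the features for each model.

```python
# Importances printed above, in percent
initial = {'flat_type': 3.215, 'town': 21.157, 'storey_range': 10.959,
           'floor_area_sqm': 47.209, 'remaining_lease_in_months': 17.461}
final = {'flat_type': 2.498, 'town': 2.808, 'storey_range': 13.455,
         'floor_area_sqm': 70.479, 'remaining_lease_in_months': 10.759}

# Rank the feature names from most to least important for each model
rankings = {}
for name, importances in (('initial', initial), ('final', final)):
    rankings[name] = sorted(importances, key=importances.get, reverse=True)
    print(name, rankings[name])

# Importances are normalised, so each set sums to about 100 percent
print(sum(initial.values()), sum(final.values()))
```

Both rankings start with floor_area_sqm and end with flat_type, while town moves from second place in the initial model to fourth in the final one.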


From this, we can conclude that both models agree that floor_area_sqm is the most important factor in determining the resale price of a flat and that flat_type contributes the least. The low importance of flat_type does not necessarily mean that the flat type has no effect on price, since it is likely to be closely related to floor_area_sqm.

Both models consider that storey_range and remaining_lease_in_months somewhat affect the resale price, but the initial model places much more importance on the town the flat is in (21.157%) than the final model does (2.808%).