# Problem Statement

### We want to develop a model that predicts the resale price of a flat from its other features

# Data Preprocessing

### Using the dataset from https://beta.data.gov.sg/collections/189/datasets/d_8b84c4ee58e3cfc0ece0d773c8ca6abc/view

In [159]:
import pandas
loadeddata = pandas.read_csv('ResaleflatpricesbasedonregistrationdatefromJan2017onwards.csv')
loadeddata.head()

Unnamed: 0,month,town,flat_type,block,street_name,storey_range,floor_area_sqm,flat_model,lease_commence_date,remaining_lease,resale_price
0,2017-01,ANG MO KIO,2 ROOM,406,ANG MO KIO AVE 10,10 TO 12,44.0,Improved,1979,61 years 04 months,232000.0
1,2017-01,ANG MO KIO,3 ROOM,108,ANG MO KIO AVE 4,01 TO 03,67.0,New Generation,1978,60 years 07 months,250000.0
2,2017-01,ANG MO KIO,3 ROOM,602,ANG MO KIO AVE 5,01 TO 03,67.0,New Generation,1980,62 years 05 months,262000.0
3,2017-01,ANG MO KIO,3 ROOM,465,ANG MO KIO AVE 10,04 TO 06,68.0,New Generation,1980,62 years 01 month,265000.0
4,2017-01,ANG MO KIO,3 ROOM,601,ANG MO KIO AVE 5,01 TO 03,67.0,New Generation,1980,62 years 05 months,265000.0


### First, check for any empty values

In [160]:
loadeddata.isna().any()

month                  False
town                   False
flat_type              False
block                  False
street_name            False
storey_range           False
floor_area_sqm         False
flat_model             False
lease_commence_date    False
remaining_lease        False
resale_price           False
dtype: bool

### Next, check for any values that are not plausible

In [161]:
loadeddata.max()

month                             2024-02
town                               YISHUN
flat_type                MULTI-GENERATION
block                                  9B
street_name                       ZION RD
storey_range                     49 TO 51
floor_area_sqm                      249.0
flat_model                        Type S2
lease_commence_date                  2022
remaining_lease        97 years 09 months
resale_price                    1568888.0
dtype: object

In [162]:
loadeddata.min()

month                             2017-01
town                           ANG MO KIO
flat_type                          1 ROOM
block                                   1
street_name                  ADMIRALTY DR
storey_range                     01 TO 03
floor_area_sqm                       31.0
flat_model                         2-room
lease_commence_date                  1966
remaining_lease        41 years 09 months
resale_price                     140000.0
dtype: object
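
Note that `max()` and `min()` compare text columns alphabetically (for example `block` reports `9B` as its maximum), so only the numeric columns say anything about plausibility. A minimal sketch of a numeric-only check follows. It assumes a tiny hand-made frame, `sample`, built from values in the preview above in place of the full dataset; the `<= 0` rule is an illustrative assumption, not a check taken from this notebook.

```python
import pandas as pd

# Tiny stand-in for loadeddata; values copied from the preview rows above.
sample = pd.DataFrame({
    "town": ["ANG MO KIO", "ANG MO KIO", "ANG MO KIO"],
    "floor_area_sqm": [44.0, 67.0, 68.0],
    "lease_commence_date": [1979, 1978, 1980],
    "resale_price": [232000.0, 250000.0, 265000.0],
})

# Keep only numeric columns so min/max are numeric comparisons.
numeric_summary = sample.select_dtypes(include="number").agg(["min", "max"])
print(numeric_summary)

# Assumed rule: a flat cannot have a non-positive area or price.
implausible = sample[(sample["floor_area_sqm"] <= 0) | (sample["resale_price"] <= 0)]
print(len(implausible))
```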

# Feature Engineering

We select the features below to predict the resale price of flats.

In [163]:
data = loadeddata[['flat_type', 'town', 'storey_range', 'floor_area_sqm', 'remaining_lease', 'resale_price']]
data.head()

Unnamed: 0,flat_type,town,storey_range,floor_area_sqm,remaining_lease,resale_price
0,2 ROOM,ANG MO KIO,10 TO 12,44.0,61 years 04 months,232000.0
1,3 ROOM,ANG MO KIO,01 TO 03,67.0,60 years 07 months,250000.0
2,3 ROOM,ANG MO KIO,01 TO 03,67.0,62 years 05 months,262000.0
3,3 ROOM,ANG MO KIO,04 TO 06,68.0,62 years 01 month,265000.0
4,3 ROOM,ANG MO KIO,01 TO 03,67.0,62 years 05 months,265000.0
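
The `flat_type`, `town` and `storey_range` columns are text, so they would need a numeric encoding before a scikit-learn model could use them. One possible approach, sketched here on three made-up rows, is to replace each storey range with its midpoint and one-hot encode the other two columns. The column name `storey_mid` and the choice of encoding are assumptions for illustration, not steps taken in this notebook.

```python
import pandas as pd

# Made-up rows in the same format as the dataset, for illustration only.
sample = pd.DataFrame({
    "flat_type": ["2 ROOM", "3 ROOM", "3 ROOM"],
    "town": ["ANG MO KIO", "ANG MO KIO", "YISHUN"],
    "storey_range": ["10 TO 12", "01 TO 03", "04 TO 06"],
})

# '10 TO 12' -> lower bound 10, upper bound 12 -> midpoint 11.0
bounds = sample["storey_range"].str.split(" TO ", expand=True).astype(int)
sample["storey_mid"] = bounds.mean(axis=1)

# One-hot encode the remaining text columns.
encoded = pd.get_dummies(
    sample.drop(columns="storey_range"), columns=["flat_type", "town"]
)
print(encoded.columns.tolist())
```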


Since we want `remaining_lease` as an integer, we add a new column that expresses it in months.

In [164]:
# Split each lease string once, e.g. '61 years 04 months' -> ['61', 'years', '04', 'months']
lease_parts = data.remaining_lease.str.split(' ')
yearinmonths = lease_parts.str[0].astype(int) * 12
# Strings with no month part yield NaN at position 2, so treat them as 0 months
months = lease_parts.str[2].fillna(0).astype(int)
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
data = data.assign(remaining_lease_in_months=yearinmonths + months)
data


Unnamed: 0,flat_type,town,storey_range,floor_area_sqm,remaining_lease,resale_price,remaining_lease_in_months
0,2 ROOM,ANG MO KIO,10 TO 12,44.0,61 years 04 months,232000.0,736
1,3 ROOM,ANG MO KIO,01 TO 03,67.0,60 years 07 months,250000.0,727
2,3 ROOM,ANG MO KIO,01 TO 03,67.0,62 years 05 months,262000.0,749
3,3 ROOM,ANG MO KIO,04 TO 06,68.0,62 years 01 month,265000.0,745
4,3 ROOM,ANG MO KIO,01 TO 03,67.0,62 years 05 months,265000.0,749
...,...,...,...,...,...,...,...
173329,EXECUTIVE,YISHUN,04 TO 06,142.0,63 years 05 months,820000.0,761
173330,EXECUTIVE,YISHUN,01 TO 03,154.0,63 years 10 months,850000.0,766
173331,EXECUTIVE,YISHUN,10 TO 12,142.0,62 years 11 months,795000.0,755
173332,EXECUTIVE,YISHUN,07 TO 09,146.0,62 years 10 months,935000.0,754
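
The conversion can be checked by hand on a few strings: `61 years 04 months` is 61 × 12 + 4 = 736 months, matching the first row of the table. The sketch below does the same with a regular expression instead of `str.split`, which is an alternative technique rather than the method used in the cell above. The third string, with no month part, is a made-up example.

```python
import pandas as pd

# First two strings appear in the dataset preview; '70 years' is a made-up edge case.
leases = pd.Series(["61 years 04 months", "62 years 01 month", "70 years"])

# Named groups capture the year count and the optional month count.
parts = leases.str.extract(r"(?P<years>\d+) years?(?: (?P<months>\d+) months?)?")

# A missing month group is NaN; treat it as zero months.
total_months = parts["years"].astype(int) * 12 + parts["months"].fillna("0").astype(int)
print(total_months.tolist())  # [736, 745, 840]
```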


# Model Selection

Since the target, resale price, is a continuous value that we predict from the other features, this is a regression problem. I have chosen polynomial regression and random forest as the two models to address the problem statement.

## Polynomial Regression

In [165]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# Expand the two numeric features into all polynomial terms up to degree 20
poly = PolynomialFeatures(degree=20)
poly_variables = poly.fit_transform(data[['remaining_lease_in_months', 'floor_area_sqm']])

# Hold out 20% of the rows for testing. Note that the target frame contains the two
# input columns as well as resale_price, so this fits a three-output regression.
poly_var_train, poly_var_test, res_train, res_test = train_test_split(poly_variables, data[['remaining_lease_in_months', 'floor_area_sqm', 'resale_price']], test_size = 0.2)

regression = linear_model.LinearRegression()

model = regression.fit(poly_var_train, res_train)
# score() returns R^2 (averaged uniformly over the three target columns), not an accuracy
score = model.score(poly_var_test, res_test)

print('The R^2 score of our Polynomial Regression Model is: ' + str(round(score * 100, 2)) + '%')

The R^2 score of our Polynomial Regression Model is: 62.95%
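
For comparison, here is a minimal sketch of a polynomial regression that scores against a single target, with scaling, feature expansion and fitting wrapped in one scikit-learn pipeline. It runs on synthetic data, not the resale dataset: the linear relationship, the noise level, the degree of 2 and the use of `StandardScaler` are all assumptions made for illustration, so the printed score says nothing about the real data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for (remaining_lease_in_months, floor_area_sqm).
X = np.column_stack([rng.uniform(500, 1170, 400), rng.uniform(31, 249, 400)])
# Invented price relationship plus noise; a single target column.
y = 300 * X[:, 0] + 3000 * X[:, 1] + rng.normal(0, 20000, 400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scaling first keeps the polynomial terms at comparable magnitudes.
poly_model = make_pipeline(
    StandardScaler(), PolynomialFeatures(degree=2), LinearRegression()
)
poly_model.fit(X_train, y_train)

# score() is R^2 for the single target on the held-out rows.
r2 = poly_model.score(X_test, y_test)
print(f"R^2 on held-out data: {r2:.3f}")
```

Because the pipeline is fitted on the training split only, the scaler and the polynomial expansion never see the test rows.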


## Random Forest

Maximum tree depth: 30 (`max_depth=30`).
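
A minimal sketch of how such a model could be set up with scikit-learn's `RandomForestRegressor`, assuming the depth limit of 30 noted above. The data is synthetic, and `n_estimators=50` and `random_state=0` are arbitrary choices for the example rather than settings from this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins for (remaining_lease_in_months, floor_area_sqm).
X = np.column_stack([rng.uniform(500, 1170, 400), rng.uniform(31, 249, 400)])
# Invented price relationship plus noise.
y = 300 * X[:, 0] + 3000 * X[:, 1] + rng.normal(0, 20000, 400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Each tree may grow to at most 30 levels.
forest = RandomForestRegressor(n_estimators=50, max_depth=30, random_state=0)
forest.fit(X_train, y_train)

# score() is R^2 on the held-out rows.
forest_r2 = forest.score(X_test, y_test)
print(f"R^2 on held-out data: {forest_r2:.3f}")
```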

# Model Evaluation

TODO: comment on the performance of the two models.
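
One way to compare regression models is to report error in dollars alongside R². The sketch below computes the mean absolute error, root mean squared error and R² with `sklearn.metrics`. The true prices are the first four from the dataset preview, while the predictions are made-up numbers, not output from either model.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# True prices from the dataset preview; predictions are hypothetical.
y_true = np.array([232000.0, 250000.0, 262000.0, 265000.0])
y_pred = np.array([240000.0, 245000.0, 262000.0, 275000.0])

mae = mean_absolute_error(y_true, y_pred)           # average absolute miss
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalises large misses more
r2 = r2_score(y_true, y_pred)                       # share of variance explained

print(f"MAE:  {mae:.0f}")
print(f"RMSE: {rmse:.0f}")
print(f"R^2:  {r2:.3f}")
```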

# Hyperparameter Tuning

See the scikit-learn guide to tuning hyperparameters with grid search: https://scikit-learn.org/stable/modules/grid_search.html
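
The guide linked above describes `GridSearchCV`, which tries every combination in a parameter grid and scores each one with cross-validation. A minimal sketch for a random forest follows. The data is synthetic, and the grid values, `cv=3` and `scoring="r2"` are assumptions chosen to keep the example quick, not tuned settings for the resale dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Synthetic stand-ins for (remaining_lease_in_months, floor_area_sqm).
X = np.column_stack([rng.uniform(500, 1170, 200), rng.uniform(31, 249, 200)])
# Invented price relationship plus noise.
y = 300 * X[:, 0] + 3000 * X[:, 1] + rng.normal(0, 20000, 200)

# Hypothetical grid: two depths and two forest sizes, four combinations in total.
param_grid = {"max_depth": [5, 30], "n_estimators": [20, 50]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,          # 3-fold cross-validation per combination
    scoring="r2",
)
search.fit(X, y)

print(search.best_params_)
print(f"Best mean cross-validated R^2: {search.best_score_:.3f}")
```

After fitting, `search.best_estimator_` holds a model refitted on all of `X` and `y` with the best parameters found.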

# Model Interpretation