<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px"> 

# Project 2: Singapore Housing Data and Kaggle Challenge

--- 
# Part 2 Modeling and refining
---

# Contents:
- [Problem Statement](#Problem-Statement)
- [All imports](#All-imports)
- [Preprocessing](#Preprocessing)
- [Data Dictionary](#Data-Dictionary)
- [Train-Test Split](#Train-Test-Split)
- [Linear Regression](#Linear-Regression)
- [Ridge Regression](#Ridge-Regression)
- [Lasso Regression](#Lasso-Regression)
- [Predictions](#Predictions)

## Problem Statement

Housing prices shape how buyers assess a unit. This project builds a linear regression model using the data contained in the dataset folder. The goal is for the model to accurately predict the sale price of the houses in the test set, evaluated with common metrics such as R2 and RMSE.
This gives those affected by housing prices, e.g. owners, buyers and agents, an additional avenue to gauge the price of a unit and supports their decision making.

## All imports
Libraries and data imports

In [1]:
# Imports:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression, LassoCV, Lasso, RidgeCV, Ridge
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures, PowerTransformer
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn.metrics import r2_score, mean_squared_error
import statsmodels.api as sm

In [2]:
df = pd.read_csv('../../data/output/cleaner_train.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149695 entries, 0 to 149694
Data columns (total 45 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   flat_type                  149695 non-null  object 
 1   street_name                149695 non-null  object 
 2   floor_area_sqm             149695 non-null  float64
 3   flat_model                 149695 non-null  object 
 4   lease_commence_date        149695 non-null  int64  
 5   resale_price               149695 non-null  float64
 6   Tranc_Year                 149695 non-null  int64  
 7   Tranc_Month                149695 non-null  int64  
 8   mid                        149695 non-null  int64  
 9   max_floor_lvl              149695 non-null  int64  
 10  commercial                 149695 non-null  int64  
 11  market_hawker              149695 non-null  int64  
 12  multistorey_carpark        149695 non-null  int64  
 13  precinct_pavilion          14

## Preprocessing
As mentioned in Part 1 (EDA), some features have to be engineered, e.g. the remaining lease at the point of sale. The storey level and the max floor level could also be combined to hopefully improve the fit of the feature.

### Initial Feature engineering

In [4]:
df['transaction_age'] = df['Tranc_Year'] - df['lease_commence_date']
df.drop(columns = ['lease_commence_date'], inplace = True)
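
The age feature above is simply the transaction year minus the lease commencement year. A minimal sketch on made-up rows shows the arithmetic. The `remaining_lease` column is illustrative only and is not created in this notebook; it assumes the 99-year lease stated in the data dictionary.

```python
import pandas as pd

# Toy rows (made-up values) with the two columns used in the cell above
toy = pd.DataFrame({
    "Tranc_Year": [2016, 2021],
    "lease_commence_date": [1990, 2015],
})

# Age of the flat at the point of sale, as computed in the notebook
toy["transaction_age"] = toy["Tranc_Year"] - toy["lease_commence_date"]

# Hypothetical extra feature: years left on an assumed 99-year lease
toy["remaining_lease"] = 99 - toy["transaction_age"]

print(toy)
```

Because `remaining_lease` is a linear function of `transaction_age`, adding both to a linear model would be redundant; one of the two is enough.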

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149695 entries, 0 to 149694
Data columns (total 45 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   flat_type                  149695 non-null  object 
 1   street_name                149695 non-null  object 
 2   floor_area_sqm             149695 non-null  float64
 3   flat_model                 149695 non-null  object 
 4   resale_price               149695 non-null  float64
 5   Tranc_Year                 149695 non-null  int64  
 6   Tranc_Month                149695 non-null  int64  
 7   mid                        149695 non-null  int64  
 8   max_floor_lvl              149695 non-null  int64  
 9   commercial                 149695 non-null  int64  
 10  market_hawker              149695 non-null  int64  
 11  multistorey_carpark        149695 non-null  int64  
 12  precinct_pavilion          149695 non-null  int64  
 13  total_dwelling_units       14

In [6]:
#from OLS of previous iteration + EDA 
drop_list = ['flat_model',
             'Tranc_Month',
             'max_floor_lvl',
             '2room_sold',
             'multigen_sold',
             '5room_sold',
             'total_dwelling_units',
             '4room_sold',
             '1room_rental',
             '3room_sold',
             'studio_apartment_sold',
             '3room_rental',
             '1room_sold',
             'exec_sold',
             '2room_rental',
             'other_room_rental']

The drop list is based on the list generated by OLS in the previous iteration. flat_model is dropped as well, since the models are tied to the house units. Tranc_Month is removed because the month, separated from its year, does not sufficiently capture the timing of the sale, and max_floor_lvl was seen in the EDA to be quite similar to the mid storey level. These features may be the reason for the discrepancies in the linear regression, and dropping them is another attempt to reduce that.

In [7]:
df.drop(columns = drop_list, inplace = True)
print(df.shape)
df.info()

(149695, 29)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149695 entries, 0 to 149694
Data columns (total 29 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   flat_type                  149695 non-null  object 
 1   street_name                149695 non-null  object 
 2   floor_area_sqm             149695 non-null  float64
 3   resale_price               149695 non-null  float64
 4   Tranc_Year                 149695 non-null  int64  
 5   mid                        149695 non-null  int64  
 6   commercial                 149695 non-null  int64  
 7   market_hawker              149695 non-null  int64  
 8   multistorey_carpark        149695 non-null  int64  
 9   precinct_pavilion          149695 non-null  int64  
 10  planning_area              149695 non-null  object 
 11  Mall_Nearest_Distance      149695 non-null  float64
 12  Mall_Within_500m           149695 non-null  float64
 13  Mall_Within_1km 

## Data Dictionary

| Feature Name | Feature Description per data source |
|:---:|:---:|
| resale_price | the property's sale price in Singapore dollars. This is the target variable. |
| flat_type | type of the resale flat unit, e.g. 3 ROOM |
| street_name | street name where the resale flat resides, e.g. TAMPINES ST 42 |
| floor_area_sqm | floor area of the resale flat unit in square metres |
| lease_commence_date | commencement year of the flat unit's 99-year lease (dropped after deriving transaction_age) |
| transaction_age | engineered feature: Tranc_Year minus lease_commence_date |
| Tranc_Year | year of resale transaction |
| mid | middle value of storey_range |
| commercial | boolean value if resale flat has commercial units in the same block |
| market_hawker | boolean value if resale flat has a market or hawker centre in the same block |
| multistorey_carpark | boolean value if resale flat has a multistorey carpark in the same block |
| precinct_pavilion | boolean value if resale flat has a pavilion in the same block |
| planning_area | Government planning area that the flat is located in |
| Mall_Nearest_Distance | distance (in metres) to the nearest mall |
| Mall_Within_500m | number of malls within 500 metres |
| Mall_Within_1km | number of malls within 1 kilometre |
| Mall_Within_2km | number of malls within 2 kilometres |
| Hawker_Nearest_Distance | distance (in metres) to the nearest hawker centre |
| Hawker_Within_500m | number of hawker centres within 500 metres |
| Hawker_Within_1km | number of hawker centres within 1 kilometre |
| Hawker_Within_2km | number of hawker centres within 2 kilometres |
| mrt_nearest_distance | distance (in metres) to the nearest MRT station |
| bus_interchange | boolean value if the nearest MRT station is also a bus interchange |
| mrt_interchange | boolean value if the nearest MRT station is a train interchange station |
| bus_stop_nearest_distance | distance (in metres) to the nearest bus stop |
| pri_sch_nearest_distance | distance (in metres) to the nearest primary school |
| pri_sch_name | name of the nearest primary school |
| pri_sch_affiliation | boolean value if the nearest primary school has a secondary school affiliation |
| sec_sch_name | name of the nearest secondary school |
| cutoff_point | PSLE cutoff point of the nearest secondary school |

### Train-Test Split
The dataframe is split into a train set and a hold-out 'test' set. <br>
As in the EDA, the features are split into numerical and categorical groups for processing before being combined again later.

In [8]:
features = df.columns.to_list()
features.remove('resale_price')

In [9]:
#Create features matrix (X) and target vector (y)
y = df['resale_price']
X = df[features]

In [10]:
#train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

In [11]:
cat_feat = ['flat_type', 'street_name', 'commercial', 'market_hawker',   
            'multistorey_carpark', 'precinct_pavilion', 'planning_area', 'bus_interchange', 
            'mrt_interchange', 'pri_sch_name', 'pri_sch_affiliation', 'sec_sch_name'
           ]
num_feat = list(set(features) - set(cat_feat))
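
A set difference does not guarantee any particular column order, and the order of a set of strings can change between Python sessions. That matters here because the fitted scaler is later reused on other frames selected with the same `num_feat` list. A minimal sketch of an order-preserving alternative, using shortened toy lists rather than the full feature lists:

```python
# Toy subsets of the notebook's lists, for illustration only
features = ["flat_type", "floor_area_sqm", "Tranc_Year", "mid", "commercial"]
cat_feat = ["flat_type", "commercial"]

# Keep the original column order while excluding the categorical features
cat_set = set(cat_feat)
num_feat = [col for col in features if col not in cat_set]

print(num_feat)
```

The list comprehension returns the numerical columns in the same order as `df.columns` on every run, so saved scalers and models stay aligned with their inputs.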

In [12]:
df.columns

Index(['flat_type', 'street_name', 'floor_area_sqm', 'resale_price',
       'Tranc_Year', 'mid', 'commercial', 'market_hawker',
       'multistorey_carpark', 'precinct_pavilion', 'planning_area',
       'Mall_Nearest_Distance', 'Mall_Within_500m', 'Mall_Within_1km',
       'Mall_Within_2km', 'Hawker_Nearest_Distance', 'Hawker_Within_500m',
       'Hawker_Within_1km', 'Hawker_Within_2km', 'mrt_nearest_distance',
       'bus_interchange', 'mrt_interchange', 'bus_stop_nearest_distance',
       'pri_sch_nearest_distance', 'pri_sch_name', 'pri_sch_affiliation',
       'sec_sch_name', 'cutoff_point', 'transaction_age'],
      dtype='object')

### Numerical features
StandardScaler will be applied to scale the individual features for modelling, since their values span a wide range.

In [13]:
# use sklearn StandardScaler for train set
ss = StandardScaler()
X_train_num_scaled = pd.DataFrame(ss.fit_transform(X_train[num_feat]))
X_test_num_scaled = pd.DataFrame(ss.transform(X_test[num_feat]))
X_train_num_scaled.columns = X_train[num_feat].columns
X_test_num_scaled.columns = X_test[num_feat].columns
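
The key point of the cell above is that the scaler learns its mean and standard deviation from the train set only and then reuses them on the test set. The arithmetic can be sketched with plain NumPy on made-up floor areas (the values are illustrative, not taken from the dataset):

```python
import numpy as np

# Made-up floor areas in square metres
train = np.array([[60.0], [90.0], [120.0]])
test = np.array([[90.0], [150.0]])

# Statistics come from the train set only; StandardScaler uses the
# population standard deviation (ddof=0), which is NumPy's default
mean = train.mean(axis=0)
std = train.std(axis=0)

train_z = (train - mean) / std
test_z = (test - mean) / std  # test rows reuse the train statistics

print(train_z.ravel())
print(test_z.ravel())
```

The scaled test values do not need to have zero mean or unit variance; they are only expressed on the scale learned from the train set.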

In [14]:
X_train_num_scaled

Unnamed: 0,Mall_Within_500m,mrt_nearest_distance,transaction_age,floor_area_sqm,bus_stop_nearest_distance,Hawker_Within_2km,Mall_Within_2km,Hawker_Within_500m,Hawker_Nearest_Distance,Mall_Nearest_Distance,pri_sch_nearest_distance,cutoff_point,mid,Tranc_Year,Hawker_Within_1km,Mall_Within_1km
0,0.702171,-0.613129,0.072858,-0.541699,-1.154799,-0.449616,0.808221,-0.638824,-0.231883,-0.459972,-0.466729,-0.308567,-0.594812,-1.627976,-0.239898,0.850506
1,-0.672096,1.120651,0.326780,1.013054,-0.725771,-0.698875,0.519910,-0.638824,-0.365588,2.057940,1.217344,-1.108322,0.501553,-0.170781,-0.239898,-1.264725
2,2.076439,-0.136382,-0.011783,0.235678,-0.210831,-0.948134,0.519910,-0.638824,2.061511,-0.839730,-0.740956,1.690821,0.501553,-0.899379,-0.823760,0.850506
3,-0.672096,0.163551,-1.112112,0.522080,-0.674254,-0.449616,-0.345026,-0.638824,-0.114057,0.047441,-0.068970,-1.108322,-1.142994,0.922115,-0.823760,0.850506
4,0.702171,-0.868952,0.242139,0.276592,0.059141,-0.449616,-0.345026,-0.638824,-0.315097,-0.513236,-0.220496,-0.508505,0.867008,-1.627976,-0.239898,-0.559648
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
112266,3.450707,-0.932453,1.003906,-1.237246,0.608036,0.298160,-0.345026,3.276387,-0.904470,-1.114592,-0.308308,1.041020,-1.142994,-0.899379,1.511688,0.850506
112267,-0.672096,0.297706,0.242139,0.358421,-0.453163,-0.698875,-0.056714,-0.638824,-0.379281,0.432855,0.672200,-1.108322,0.501553,0.193518,-0.239898,0.145429
112268,3.450707,-1.141542,0.411421,-1.196332,0.488096,-0.449616,0.231598,-0.638824,-0.555852,-1.340450,-0.950931,-0.358551,-1.142994,-0.899379,0.343964,2.260660
112269,-0.672096,0.098324,-1.535315,-1.114503,1.069723,0.298160,-0.345026,0.666246,-0.652393,-0.182235,1.370085,1.640836,1.597917,0.922115,0.927826,1.555583


### Categorical features
One Hot encoding will be used to dummify the categorical columns into a matrix.

In [15]:
#calling out categories to look for non binary cat.
train_cat_df = X_train[cat_feat].reset_index()
test_cat_df = X_test[cat_feat].reset_index()
train_cat_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112271 entries, 0 to 112270
Data columns (total 13 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   index                112271 non-null  int64 
 1   flat_type            112271 non-null  object
 2   street_name          112271 non-null  object
 3   commercial           112271 non-null  int64 
 4   market_hawker        112271 non-null  int64 
 5   multistorey_carpark  112271 non-null  int64 
 6   precinct_pavilion    112271 non-null  int64 
 7   planning_area        112271 non-null  object
 8   bus_interchange      112271 non-null  int64 
 9   mrt_interchange      112271 non-null  int64 
 10  pri_sch_name         112271 non-null  object
 11  pri_sch_affiliation  112271 non-null  int64 
 12  sec_sch_name         112271 non-null  object
dtypes: int64(8), object(5)
memory usage: 11.1+ MB


Columns with object dtype likely hold string values, so they will be one-hot encoded. The encoding is refined so that categories with few data points are combined and modelled as a single group.

In [16]:
#set list with object dtype to OHE
cat_list = list(train_cat_df.dtypes[train_cat_df.dtypes == object].index)
cat_list

['flat_type', 'street_name', 'planning_area', 'pri_sch_name', 'sec_sch_name']

In [17]:
#set threshold
round(train_cat_df.shape[0]*0.0001)

11

In [18]:
#setup OHE
OHE = OneHotEncoder(sparse_output = False, handle_unknown = 'infrequent_if_exist', min_frequency = 0.0001)
train_OHE_cat_df = pd.DataFrame(OHE.fit_transform(train_cat_df[cat_list]), columns = OHE.get_feature_names_out())
test_OHE_cat_df = pd.DataFrame(OHE.transform(test_cat_df[cat_list]), columns = OHE.get_feature_names_out())
train_OHE_cat_df.tail()

Unnamed: 0,flat_type_1 ROOM,flat_type_2 ROOM,flat_type_3 ROOM,flat_type_4 ROOM,flat_type_5 ROOM,flat_type_EXECUTIVE,flat_type_MULTI-GENERATION,street_name_ADMIRALTY DR,street_name_ADMIRALTY LINK,street_name_AH HOOD RD,...,sec_sch_name_Yio Chu Kang Secondary School,sec_sch_name_Yishun Secondary School,sec_sch_name_Yishun Town Secondary School,sec_sch_name_Yuan Ching Secondary School,sec_sch_name_Yuhua Secondary School,sec_sch_name_Yusof Ishak Secondary School,sec_sch_name_Yuying Secondary School,sec_sch_name_Zhenghua Secondary School,sec_sch_name_Zhonghua Secondary School,sec_sch_name_infrequent_sklearn
112266,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
112267,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
112268,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
112269,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
112270,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [19]:
#concat categories train sets
train_cat_df = pd.concat([train_cat_df, train_OHE_cat_df], axis = 1)
train_cat_df.drop(columns = cat_list, inplace = True)
#concat categories test set
test_cat_df = pd.concat([test_cat_df, test_OHE_cat_df], axis = 1)
test_cat_df.drop(columns = cat_list, inplace = True)

In [20]:
#concat train set
Z_train = pd.concat([X_train_num_scaled, train_cat_df], axis = 1)
Z_train.set_index('index', inplace = True)

#concat test set
Z_test = pd.concat([X_test_num_scaled, test_cat_df], axis = 1)
Z_test.set_index('index', inplace = True)

Z_train.head()

Unnamed: 0_level_0,Mall_Within_500m,mrt_nearest_distance,transaction_age,floor_area_sqm,bus_stop_nearest_distance,Hawker_Within_2km,Mall_Within_2km,Hawker_Within_500m,Hawker_Nearest_Distance,Mall_Nearest_Distance,...,sec_sch_name_Yio Chu Kang Secondary School,sec_sch_name_Yishun Secondary School,sec_sch_name_Yishun Town Secondary School,sec_sch_name_Yuan Ching Secondary School,sec_sch_name_Yuhua Secondary School,sec_sch_name_Yusof Ishak Secondary School,sec_sch_name_Yuying Secondary School,sec_sch_name_Zhenghua Secondary School,sec_sch_name_Zhonghua Secondary School,sec_sch_name_infrequent_sklearn
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
5021,0.702171,-0.613129,0.072858,-0.541699,-1.154799,-0.449616,0.808221,-0.638824,-0.231883,-0.459972,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
27234,-0.672096,1.120651,0.32678,1.013054,-0.725771,-0.698875,0.51991,-0.638824,-0.365588,2.05794,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
33321,2.076439,-0.136382,-0.011783,0.235678,-0.210831,-0.948134,0.51991,-0.638824,2.061511,-0.83973,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
102732,-0.672096,0.163551,-1.112112,0.52208,-0.674254,-0.449616,-0.345026,-0.638824,-0.114057,0.047441,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
17312,0.702171,-0.868952,0.242139,0.276592,0.059141,-0.449616,-0.345026,-0.638824,-0.315097,-0.513236,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Modelling
- [Linear Regression](#Linear-Regression)
- [Ridge Regression](#Ridge-Regression)
- [Lasso Regression](#Lasso-Regression)

### Linear Regression

In [21]:
#Instantiate your model
lr = LinearRegression()

#Obtain Cross-validation scores
print(cross_val_score(lr, Z_train, y_train, cv =5))

#Obtain MEAN Cross-validation score
cross_val_score(lr, Z_train, y_train).mean()

[ 9.16561101e-01  9.15181524e-01  9.15189905e-01 -3.60055375e+15
  9.17081416e-01]


-54586253712926.805

In [22]:
# Train your model
lr.fit(Z_train, y_train)

#Generate predictions
y_train_preds = lr.predict(Z_train)
y_test_preds = lr.predict(Z_test)

In [23]:
# Train score:
lr.score(Z_train, y_train)

0.9176220341478379

In [24]:
# Test score:
lr.score(Z_test, y_test)

0.9166728697546245

In [25]:
#MSE
train_MSE = mean_squared_error(y_train, y_train_preds)
test_MSE = mean_squared_error(y_test, y_test_preds)
#RMSE 
print(np.sqrt(train_MSE))
print(np.sqrt(test_MSE))

41113.990245483656
41296.11609844952
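
For reference, the two metrics used throughout can be written out directly. The sketch below uses made-up prices, and the helper names `rmse` and `r2` are illustrative; the notebook itself relies on `mean_squared_error` and the estimators' `score` method.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same unit as the target (dollars)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up prices for illustration
y_true = [300_000, 400_000, 500_000]
y_pred = [310_000, 390_000, 520_000]

print(rmse(y_true, y_pred))
print(r2(y_true, y_pred))
```

R2 has no lower bound, so predictions that are far worse than simply guessing the mean price produce a large negative score, as seen in one of the linear regression cross-validation folds above.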


### Ridge Regression

In [26]:
#alpha_list = np.logspace(-3, 1, 10)
ridgecv = RidgeCV(alphas = np.logspace(-3, 1, 10) , scoring='r2', cv = 5)
ridgecv.fit(Z_train, y_train)

In [27]:
#optimal alpha
ridgecv.alpha_

0.007742636826811269

In [28]:
#Instantiate the model
ridge = Ridge(alpha = ridgecv.alpha_)
ridge_scores = cross_val_score(ridge, Z_train, y_train, cv = 5)

print (ridge_scores)
print (np.mean(ridge_scores))

[0.9165408  0.91519027 0.91519581 0.91656822 0.91707942]
0.9161149040328785


In [29]:
#Retrain ridge
ridge.fit(Z_train, y_train)

#Generate predictions
y_train_preds = ridge.predict(Z_train)
y_test_preds = ridge.predict(Z_test)

In [30]:
# Train score:
ridge_train_score = ridge.score(Z_train, y_train)
ridge_train_score

0.9176216835245917

In [31]:
# Test  score:
ridge_test_score = ridge.score(Z_test, y_test)
ridge_test_score

0.9166717909827738

In [32]:
#MSE
train_MSE = mean_squared_error(y_train, y_train_preds)
test_MSE = mean_squared_error(y_test, y_test_preds)
#RMSE 
print(np.sqrt(train_MSE))
print(np.sqrt(test_MSE))

41114.07774160739
41296.38341200796


In [75]:
coefs_ridge = pd.DataFrame(list(zip(Z_train.columns, ridgecv.coef_)))
coefs_ridge.sort_values(by = 1, ascending = True)

Unnamed: 0,0,1
205,street_name_GHIM MOH RD,-193307.174717
120,street_name_C'WEALTH CL,-176218.865917
121,street_name_C'WEALTH CRES,-174197.455285
317,street_name_LOWER DELTA RD,-170315.312211
492,street_name_WHAMPOA DR,-163360.459948
...,...,...
635,pri_sch_name_Fairfield Methodist School,317695.115633
426,street_name_STRATHMORE AVE,343532.563942
652,pri_sch_name_Henry Park Primary School,385031.837075
177,street_name_DAWSON RD,427051.803571


### Lasso Regression

In [33]:
l_alphas = np.logspace(-3, 1, 10)

lassocv = LassoCV(alphas=l_alphas, cv=5, max_iter=1000)

lassocv.fit(Z_train, y_train)

  model = cd_fast.enet_coordinate_descent_gram(

In [34]:
lassocv.alpha_

0.001
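
The selected lasso alpha of 0.001 is the smallest value in the search grid, whereas the ridge alpha of about 0.0077 falls inside it. When the best value sits on a grid boundary, a wider grid may be worth trying. A minimal sketch of that check; the helper name `on_grid_edge` is hypothetical:

```python
import numpy as np

# Same grid as used for RidgeCV and LassoCV in this notebook
alphas = np.logspace(-3, 1, 10)

def on_grid_edge(best_alpha, grid):
    """Return True if the selected alpha is the smallest or largest grid value."""
    grid = np.sort(np.asarray(grid, dtype=float))
    return bool(np.isclose(best_alpha, grid[0]) or np.isclose(best_alpha, grid[-1]))

lasso_on_edge = on_grid_edge(0.001, alphas)                 # alpha reported for lasso
ridge_on_edge = on_grid_edge(0.007742636826811269, alphas)  # alpha reported for ridge

print(lasso_on_edge, ridge_on_edge)
```

A boundary result only says the search range may be too narrow; it does not by itself show that a smaller alpha would score better.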

In [35]:
lasso = Lasso(alpha=lassocv.alpha_)

lasso_scores = cross_val_score(lasso, Z_train, y_train)

print (lasso_scores)
print (np.mean(lasso_scores))

  model = cd_fast.enet_coordinate_descent(


[0.91621178 0.91522901 0.91516345 0.91667102 0.91698163]
0.9160513788485725


In [36]:
#Retrain lasso
lasso.fit(Z_train, y_train)

#Generate predictions
y_train_preds = lasso.predict(Z_train)
y_test_preds = lasso.predict(Z_test)

  model = cd_fast.enet_coordinate_descent(


In [37]:
# Train score:
lasso_train_score = lasso.score(Z_train, y_train)
lasso_train_score

0.9175799469522133

In [38]:
# Test  score:
lasso_test_score = lasso.score(Z_test, y_test)
lasso_test_score

0.9165583637195707

In [39]:
#MSE
train_MSE = mean_squared_error(y_train, y_train_preds)
test_MSE = mean_squared_error(y_test, y_test_preds)
#RMSE 
print(np.sqrt(train_MSE))
print(np.sqrt(test_MSE))

41124.49154625952
41324.480396604806


In [78]:
coefs_lasso = pd.DataFrame(list(zip(Z_train.columns, lassocv.coef_)))
coefs_lasso.sort_values(by = 1, ascending = True)

Unnamed: 0,0,1
810,sec_sch_name_Fairfield Methodist School,-186868.402591
253,street_name_JLN TECK WHYE,-176391.883039
148,street_name_CHOA CHU KANG ST 51,-172655.347971
325,street_name_MARSILING RD,-171297.945344
859,sec_sch_name_Queenstown Secondary School,-167347.687632
...,...,...
425,street_name_STIRLING RD,401628.787189
328,street_name_MEI LING ST,423580.369616
426,street_name_STRATHMORE AVE,492498.150013
155,street_name_CLARENCE LANE,573910.857816


### Results Analysis

| Version | Score Description | Linear Regression | Ridge Regression | Lasso Regression |
|---------|-------------------|-------------------|------------------|------------------|
| V3      | Train Score       | 0.92552           | 0.92551          | 0.92411          |
| V3      | Test Score        | -1564517453.13769 | 0.92415          | 0.92446          |
| V3      | Train RMSE        | 39089.89          | 39090.38         | 39097.75         |
| V3      | Test RMSE         | 5664718809        | 39443.09         | 39452.04         |
| V4      | Train Score       | 0.917622034       | 0.917622         | 0.91758          |
| V4      | Test Score        | 0.91667287        | 0.916672         | 0.916558         |
| V4      | Train RMSE        | 41113.99025       | 41114.08         | 41124.49         |
| V4      | Test RMSE         | 41296.1161        | 41296.38         | 41324.48         |

## Conclusion

The modelling process has shown a fairly linear relationship between the features and price, even though the model is not perfect. In fact, through the iterative process of versioning, removing multiple features does not seem to affect the model scores greatly (the ridge and lasso test scores moved from about 0.924 in V3 to about 0.917 in V4). Thus a house price is evidently determined not by a single factor but by many factors combined, since there was no clear correlation between resale_price and any one factor.

While a common understanding is that price scales with floor area, as evident in the EDA, looking at the coefficients of the models gives a better understanding of the other drivers of resale prices, such as street address. This points to the areas where the model can be refined further to be as accurate and as simple as possible.

In the meantime, the current model version has RMSE scores (which represent the pricing error) of around \$41k, less than 10% of the average resale price (\$448,661), which gives it sufficient utility to gauge housing prices. This gives those affected by housing prices, e.g. owners, buyers and agents, an additional avenue to gauge the price of a unit and supports their decision making.

Therefore, for future improvements to the resale price model, the recommendation is to look into reducing the current features and, if possible, to include other features not in the current set, such as the actual state of the house (e.g. degree of renovation done, furnishing provided, etc.).
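
The "less than 10%" figure quoted above can be checked from the V4 test RMSE values in the results table and the average resale price stated in the conclusion:

```python
# Values copied from the V4 rows of the results table and the conclusion
mean_price = 448_661
test_rmse = {"linear": 41296.1161, "ridge": 41296.38, "lasso": 41324.48}

# RMSE expressed as a share of the average resale price
share = {name: value / mean_price for name, value in test_rmse.items()}

for name, s in share.items():
    print(f"{name}: {s:.1%}")
```

All three models land at roughly 9.2% of the average resale price.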


### Pickling the models

In [41]:
import pickle

In [42]:
with open('lr.pkl', 'wb') as f:
    pickle.dump(lr, f)

In [43]:
with open('ridgecv.pkl', 'wb') as f:
    pickle.dump(ridgecv, f)

In [44]:
with open('lassocv.pkl', 'wb') as f:
    pickle.dump(lassocv, f)
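
A saved model is only useful if it can be loaded again and gives the same predictions. The sketch below shows a round trip on a toy ridge model written to a temporary folder; the toy data and the file name `ridge_toy.pkl` are illustrative and unrelated to the files saved above.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import Ridge

# Toy data: y = 2x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
model = Ridge(alpha=0.01).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "ridge_toy.pkl")

    # Context managers close the file handles even if an error occurs
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions
same = np.allclose(model.predict(X), restored.predict(X))
print(same)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.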

## Predictions

In [45]:
exam = pd.read_csv('../../data/test.csv')


In [46]:
def bool_converter(df, feature):
    #map 'Y'/'N' flags to 1/0; assign back instead of a chained inplace replace
    for n in feature:
        df[n] = df[n].replace({'Y': 1, 'N': 0})

In [47]:
object_features = ['commercial', 'market_hawker', 'multistorey_carpark', 'precinct_pavilion']
bool_converter(exam, object_features)

In [48]:
#treat missing counts as 0 (no hawker centre / mall within that radius)
within_cols = ['Hawker_Within_500m', 'Hawker_Within_1km', 'Hawker_Within_2km',
               'Mall_Within_500m', 'Mall_Within_1km', 'Mall_Within_2km']
exam[within_cols] = exam[within_cols].fillna(0)

In [49]:
choose_mask = pd.read_csv('../../data/dataset_description_chosen.csv')
mask_list = list(choose_mask.loc[choose_mask['Decision( 1 = Accept, 0 = Reject)'] == 0, 'Codebook / Data Dictionary'])

In [50]:
#feature engineer
exam['transaction_age'] = exam['Tranc_Year'] - exam['lease_commence_date']
exam.drop(columns = ['lease_commence_date'], inplace = True)

In [51]:
#drop the columns rejected in the data dictionary CSV (mask_list), then the modelling drop_list; id is kept for the submission
exam.drop(columns = mask_list, inplace = True)
exam.drop(columns = drop_list, inplace = True)

In [52]:
exam.isnull().sum()

id                            0
flat_type                     0
street_name                   0
floor_area_sqm                0
Tranc_Year                    0
mid                           0
commercial                    0
market_hawker                 0
multistorey_carpark           0
precinct_pavilion             0
planning_area                 0
Mall_Nearest_Distance        84
Mall_Within_500m              0
Mall_Within_1km               0
Mall_Within_2km               0
Hawker_Nearest_Distance       0
Hawker_Within_500m            0
Hawker_Within_1km             0
Hawker_Within_2km             0
mrt_nearest_distance          0
bus_interchange               0
mrt_interchange               0
bus_stop_nearest_distance     0
pri_sch_nearest_distance      0
pri_sch_name                  0
pri_sch_affiliation           0
sec_sch_name                  0
cutoff_point                  0
transaction_age               0
dtype: int64

In [53]:
exam['Mall_Nearest_Distance'] = exam['Mall_Nearest_Distance'].fillna(0)

In [54]:
id_frame = exam[['id']]
id_frame

Unnamed: 0,id
0,114982
1,95653
2,40303
3,109506
4,100149
...,...
16732,23347
16733,54003
16734,128921
16735,69352


In [55]:
X_exam_num = pd.DataFrame(ss.transform(exam[num_feat]))
X_exam_num.columns = exam[num_feat].columns

In [56]:
X_exam_cat_df = exam[cat_feat].reset_index()
X_exam_cat_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16737 entries, 0 to 16736
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   index                16737 non-null  int64 
 1   flat_type            16737 non-null  object
 2   street_name          16737 non-null  object
 3   commercial           16737 non-null  int64 
 4   market_hawker        16737 non-null  int64 
 5   multistorey_carpark  16737 non-null  int64 
 6   precinct_pavilion    16737 non-null  int64 
 7   planning_area        16737 non-null  object
 8   bus_interchange      16737 non-null  int64 
 9   mrt_interchange      16737 non-null  int64 
 10  pri_sch_name         16737 non-null  object
 11  pri_sch_affiliation  16737 non-null  int64 
 12  sec_sch_name         16737 non-null  object
dtypes: int64(8), object(5)
memory usage: 1.7+ MB


In [57]:
exam_OHE_cat_df = pd.DataFrame(OHE.transform(X_exam_cat_df[cat_list]), columns = OHE.get_feature_names_out())
exam_OHE_cat_df.tail()

Unnamed: 0,flat_type_1 ROOM,flat_type_2 ROOM,flat_type_3 ROOM,flat_type_4 ROOM,flat_type_5 ROOM,flat_type_EXECUTIVE,flat_type_MULTI-GENERATION,street_name_ADMIRALTY DR,street_name_ADMIRALTY LINK,street_name_AH HOOD RD,...,sec_sch_name_Yio Chu Kang Secondary School,sec_sch_name_Yishun Secondary School,sec_sch_name_Yishun Town Secondary School,sec_sch_name_Yuan Ching Secondary School,sec_sch_name_Yuhua Secondary School,sec_sch_name_Yusof Ishak Secondary School,sec_sch_name_Yuying Secondary School,sec_sch_name_Zhenghua Secondary School,sec_sch_name_Zhonghua Secondary School,sec_sch_name_infrequent_sklearn
16732,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
16733,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
16734,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
16735,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
16736,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [58]:
X_exam_cat_df = pd.concat([X_exam_cat_df, exam_OHE_cat_df], axis = 1)
X_exam_cat_df.drop(columns = cat_list, inplace = True)

In [59]:
#concat exam set
Z_exam = pd.concat([X_exam_num, X_exam_cat_df], axis = 1)
Z_exam.set_index('index', inplace = True)

In [60]:
Z_exam.isna().sum()

Mall_Within_500m                             0
mrt_nearest_distance                         0
transaction_age                              0
floor_area_sqm                               0
bus_stop_nearest_distance                    0
                                            ..
sec_sch_name_Yusof Ishak Secondary School    0
sec_sch_name_Yuying Secondary School         0
sec_sch_name_Zhenghua Secondary School       0
sec_sch_name_Zhonghua Secondary School       0
sec_sch_name_infrequent_sklearn              0
Length: 902, dtype: int64

In [61]:
y_exam_preds = ridge.predict(Z_exam)

In [62]:
y_exam_preds_df = pd.DataFrame(y_exam_preds, columns = ['resale_price'])

In [63]:
submission_df = pd.concat([id_frame, y_exam_preds_df], axis = 1)

In [64]:
submission_df

Unnamed: 0,id,resale_price
0,114982,334742.948942
1,95653,485274.553227
2,40303,342931.255004
3,109506,317057.277038
4,100149,436652.165198
...,...,...
16732,23347,352470.636968
16733,54003,501597.297368
16734,128921,383210.588763
16735,69352,512622.009746


In [65]:
submission_df.rename(columns={'id': 'Id','resale_price': 'Predicted'}, inplace = True)

In [66]:
#data export to already created output folder
submission_df.to_csv('../../data/output/submission_df.csv', index = False)