<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px"> 

# Project 2: Singapore Housing Data and Kaggle Challenge

--- 
# Part 2 Modeling and refining
---

# Contents:
- [Problem Statement](#Problem-Statement)
- [All imports](#All-imports)
- [Preprocessing](#Preprocessing)
- [Data Dictionary](#Data-Dictionary)
- [Train-Test Split](#Train-Test-Split)
- [Linear Regression](#Linear-Regression)
- [Ridge Regression](#Ridge-Regression)
- [Lasso Regression](#Lasso-Regression)
- [Predictions](#Predictions)

## Problem Statement

Housing prices affect how buyers assess a unit and make their purchase decisions. This project attempts to build a linear regression model using the data contained in the dataset folder. The goal is for the model to accurately predict the sale price of the houses in the test set, evaluated with common metrics such as R2 and RMSE.
This will give those affected by housing prices, e.g. owners, buyers and agents, additional data to inform their own decision making.

## All imports
Libraries and data imports

In [1]:
# Imports:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression, LassoCV, Lasso, RidgeCV, Ridge
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures, PowerTransformer
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn.metrics import r2_score, mean_squared_error
import statsmodels.api as sm

In [2]:
df = pd.read_csv('../../data/output/cleaner_train.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149772 entries, 0 to 149771
Data columns (total 45 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   flat_type                  149772 non-null  object 
 1   street_name                149772 non-null  object 
 2   floor_area_sqm             149772 non-null  float64
 3   flat_model                 149772 non-null  object 
 4   lease_commence_date        149772 non-null  int64  
 5   resale_price               149772 non-null  float64
 6   Tranc_Year                 149772 non-null  int64  
 7   Tranc_Month                149772 non-null  int64  
 8   mid                        149772 non-null  int64  
 9   max_floor_lvl              149772 non-null  int64  
 10  commercial                 149772 non-null  int64  
 11  market_hawker              149772 non-null  int64  
 12  multistorey_carpark        149772 non-null  int64  
 13  precinct_pavilion          14

## Preprocessing
As mentioned in Part 1 (EDA), some features have to be engineered, e.g. the remaining lease at the point of sale, while the storey level and the max floor level could be combined in the hope of improving the fit.

### Initial Feature engineering

With the baseline in v1 done, it is possible to test if some features would help.

In [4]:
df['transaction_age'] = df['Tranc_Year'] - df['lease_commence_date']
df.drop(columns = ['lease_commence_date'], inplace = True)
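
The other candidates mentioned above (remaining lease, and storey level relative to the block height) could be derived in a similar way. The sketch below is a minimal illustration on a toy frame and is not part of this notebook's pipeline. The helper `add_lease_features` and the column names `remaining_lease` and `storey_ratio` are hypothetical, and the 99-year lease length is taken from the data dictionary.

```python
import pandas as pd

LEASE_YEARS = 99  # lease length stated in the data dictionary


def add_lease_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with illustrative engineered columns (names are hypothetical)."""
    out = frame.copy()
    # Age of the flat at the time of sale, as computed in the cell above
    out["transaction_age"] = out["Tranc_Year"] - out["lease_commence_date"]
    # Years left on the lease at the time of sale
    out["remaining_lease"] = LEASE_YEARS - out["transaction_age"]
    # Position of the unit relative to the height of its block
    out["storey_ratio"] = out["mid"] / out["max_floor_lvl"]
    return out


# Toy rows, made up for illustration only
toy = pd.DataFrame({
    "Tranc_Year": [2020, 2015],
    "lease_commence_date": [1990, 2000],
    "mid": [5, 11],
    "max_floor_lvl": [10, 22],
})
engineered = add_lease_features(toy)
print(engineered[["transaction_age", "remaining_lease", "storey_ratio"]])
```

Whether such columns actually help would still need to be checked against the cross-validation scores.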

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149772 entries, 0 to 149771
Data columns (total 45 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   flat_type                  149772 non-null  object 
 1   street_name                149772 non-null  object 
 2   floor_area_sqm             149772 non-null  float64
 3   flat_model                 149772 non-null  object 
 4   resale_price               149772 non-null  float64
 5   Tranc_Year                 149772 non-null  int64  
 6   Tranc_Month                149772 non-null  int64  
 7   mid                        149772 non-null  int64  
 8   max_floor_lvl              149772 non-null  int64  
 9   commercial                 149772 non-null  int64  
 10  market_hawker              149772 non-null  int64  
 11  multistorey_carpark        149772 non-null  int64  
 12  precinct_pavilion          149772 non-null  int64  
 13  total_dwelling_units       14

### Data Dictionary

|        Feature Name       |                        Feature Description per data source                        |
|:-------------------------:|:---------------------------------------------------------------------------------:|
| resale_price              | the property's sale price in Singapore dollars. This is the target variable.      |
| flat_type                 |  type of the resale flat unit, e.g.   3 ROOM                                      |
| street_name               |  street name where the resale flat   resides, e.g. TAMPINES ST 42                 |
| floor_area_sqm            |  floor area of the resale flat unit   in square metres                            |
| flat_model                |  HDB model of the resale flat, e.g.   Multi Generation                            |
| lease_commence_date       |  commencement year of the flat   unit's 99-year lease                             |
| Tranc_Year                |  year of resale transaction                                                       |
| Tranc_Month               |  month of resale transaction                                                      |
| mid                       |  middle value of storey_range                                                     |
| max_floor_lvl             |  highest floor of the resale flat                                                 |
| commercial                |  boolean value if resale flat has   commercial units in the same block            |
| market_hawker             |  boolean value if resale flat has a   market or hawker centre in the same block   |
| multistorey_carpark       |  boolean value if resale flat has a   multistorey carpark in the same block       |
| precinct_pavilion         |  boolean value if resale flat has a   pavilion in the same block                  |
| total_dwelling_units      |  total number of residential   dwelling units in the resale flat                  |
| 1room_sold                |  number of 1-room residential units   in the resale flat                          |
| 2room_sold                |  number of 2-room residential units   in the resale flat                          |
| 3room_sold                |  number of 3-room residential units   in the resale flat                          |
| 4room_sold                |  number of 4-room residential units   in the resale flat                          |
| 5room_sold                |  number of 5-room residential units   in the resale flat                          |
| exec_sold                 |  number of executive type   residential units in the resale flat block            |
| multigen_sold             |  number of multi-generational type   residential units in the resale flat block   |
| studio_apartment_sold     |  number of studio apartment type   residential units in the resale flat block     |
| 1room_rental              |  number of 1-room rental   residential units in the resale flat block             |
| 2room_rental              |  number of 2-room rental   residential units in the resale flat block             |
| 3room_rental              |  number of 3-room rental   residential units in the resale flat block             |
| other_room_rental         |  number of "other" type   rental residential units in the resale flat block       |
| planning_area             |  Government planning area that the   flat is located                              |
| Mall_Nearest_Distance     |  distance (in metres) to the   nearest mall                                       |
| Mall_Within_500m          |  number of malls within 500 metres                                                |
| Mall_Within_1km           |  number of malls within 1 kilometre                                               |
| Mall_Within_2km           |  number of malls within 2   kilometres                                            |
| Hawker_Nearest_Distance   |  distance (in metres) to the   nearest hawker centre                              |
| Hawker_Within_500m        |  number of hawker centres within   500 metres                                     |
| Hawker_Within_1km         |  number of hawker centres within 1   kilometre                                    |
| Hawker_Within_2km         |  number of hawker centres within 2   kilometres                                   |
| mrt_nearest_distance      |  distance (in metres) to the   nearest MRT station                                |
| bus_interchange           |  boolean value if the nearest MRT   station is also a bus interchange             |
| mrt_interchange           |  boolean value if the nearest MRT   station is a train interchange station        |
| bus_stop_nearest_distance |  distance (in metres) to the   nearest bus stop                                   |
| pri_sch_nearest_distance  |  distance (in metres) to the   nearest primary school                             |
| pri_sch_name              |  name of the nearest primary school                                               |
| pri_sch_affiliation       |  boolean value if the nearest   primary school has a secondary school affiliation |
| sec_sch_name              |  name of the nearest secondary   school                                           |
| cutoff_point              |  PSLE cutoff point of the nearest   secondary school                              |
| transaction_age           |  age of the flat in years at the time of sale (Tranc_Year - lease_commence_date)  |

### Train-Test Split
The dataframe is split into a train set and a hold-out 'test' set. <br>
As was done in the EDA, the features are split into numerical and categorical groups for processing and are combined again later on.

In [6]:
features = df.columns.to_list()
features.remove('resale_price')

In [7]:
#Create features matrix (X) and target vector (y)
y = df['resale_price']
X = df[features]

In [8]:
#train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)
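
Because no `test_size` is passed, `train_test_split` falls back to its default of holding out 25% of the rows. The short check below is a sketch on a plain index array, not on the actual data. It reproduces the split sizes implied by the 149772 rows reported by `df.info()`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 149772  # row count reported by df.info() above
row_ids = np.arange(n_rows)

# Same call signature as the notebook: default test_size, fixed random_state
train_ids, test_ids = train_test_split(row_ids, random_state=123)

print(len(train_ids), len(test_ids))
```

The train size matches the 112329 entries shown later for the categorical training frame.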

In [9]:
cat_feat = ['flat_type', 'street_name', 'flat_model', 'commercial', 'market_hawker',   
            'multistorey_carpark', 'precinct_pavilion', 'planning_area', 'bus_interchange', 
            'mrt_interchange', 'pri_sch_name', 'pri_sch_affiliation', 'sec_sch_name'
           ]
num_feat = list(set(features) - set(cat_feat))

### Numerical features
`StandardScaler` will be applied to scale the individual features for modelling, since their values span a wide range.

In [10]:
# use sklearn StandardScaler for train set
ss = StandardScaler()
X_train_num_scaled = pd.DataFrame(ss.fit_transform(X_train[num_feat]))
X_test_num_scaled = pd.DataFrame(ss.transform(X_test[num_feat]))
X_train_num_scaled.columns = X_train[num_feat].columns
X_test_num_scaled.columns = X_test[num_feat].columns
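
The cell above wraps the scaled arrays in new dataframes, which resets the row labels to a fresh `RangeIndex`. An alternative, sketched here on toy values, is to pass the original columns and index back in so the scaled rows stay aligned with `y_train` and `y_test` without a later `reset_index`/`set_index` round trip. The toy frames and their index values are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frames with non-contiguous row labels, as produced by a shuffled split
train = pd.DataFrame({"floor_area_sqm": [60.0, 90.0, 120.0]}, index=[30484, 87242, 79640])
test = pd.DataFrame({"floor_area_sqm": [90.0]}, index=[12])

scaler = StandardScaler()
# Fit on the train set only, then reuse the fitted statistics on the test set
train_scaled = pd.DataFrame(
    scaler.fit_transform(train), columns=train.columns, index=train.index
)
test_scaled = pd.DataFrame(
    scaler.transform(test), columns=test.columns, index=test.index
)
print(train_scaled)
print(test_scaled)
```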

In [11]:
X_train_num_scaled

Unnamed: 0,Mall_Within_2km,cutoff_point,2room_rental,mid,1room_sold,2room_sold,Mall_Within_500m,3room_sold,4room_sold,exec_sold,...,Hawker_Within_500m,Hawker_Within_2km,transaction_age,other_room_rental,Tranc_Month,studio_apartment_sold,mrt_nearest_distance,Mall_Within_1km,pri_sch_nearest_distance,5room_sold
0,-0.056933,-1.107259,-0.056527,-0.048992,-0.024062,-0.151432,-0.671314,-0.572480,-0.483135,-0.310921,...,-0.639332,-0.451629,-0.436830,-0.013751,-0.172644,-0.082435,1.217885,0.142231,-0.982031,0.889710
1,1.097444,-1.107259,-0.056527,1.045130,-0.024062,-0.151432,3.424053,-0.572480,-0.040249,-0.310921,...,-0.639332,-0.700918,-0.352343,-0.013751,-1.071769,-0.082435,-1.182624,1.548442,-0.438532,2.682960
2,0.231661,-1.057314,-0.056527,-0.048992,-0.024062,-0.151432,0.693809,-0.572480,-1.103175,2.824966,...,-0.639332,-0.700918,0.577014,-0.013751,0.127064,-0.082435,-1.174373,-0.560874,0.641285,-0.783989
3,-0.634122,-1.107259,-0.056527,-0.048992,-0.024062,-0.151432,0.693809,1.020995,-0.616000,-0.310921,...,1.958586,0.046949,0.577014,-0.013751,1.325897,-0.082435,-0.293837,-0.560874,-0.394824,-0.783989
4,-0.634122,1.240134,-0.056527,-1.143114,-0.024062,-0.151432,-0.671314,0.716786,0.004040,-0.310921,...,0.659627,1.791971,0.745988,-0.013751,-1.371477,-0.082435,0.022902,-0.560874,0.121074,-0.783989
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
112324,0.520256,0.690745,-0.056527,1.592191,-0.024062,-0.151432,-0.671314,-0.572480,1.022677,-0.310921,...,0.659627,2.290548,-1.281700,-0.013751,-1.071769,-0.082435,-1.011052,-1.263980,3.172083,0.590835
112325,-0.634122,-1.107259,-0.056527,-0.048992,-0.024062,-0.151432,-0.671314,2.411663,-1.103175,-0.310921,...,0.659627,-0.451629,1.590857,-0.013751,-0.772061,-0.082435,0.711000,-1.263980,2.155829,-0.754102
112326,-0.056933,1.939358,-0.056527,-0.048992,-0.024062,-0.151432,-0.671314,-0.572480,3.547125,-0.310921,...,1.958586,1.293393,1.590857,-0.013751,1.026189,-0.082435,0.514377,-0.560874,-0.720921,-0.783989
112327,0.231661,-1.057314,-0.056527,-0.596053,-0.024062,-0.151432,-0.671314,1.281745,-0.616000,-0.258656,...,1.958586,1.791971,1.590857,-0.013751,0.426773,-0.082435,-0.683963,-1.263980,-0.431725,-0.754102


### Categorical features
One Hot encoding will be used to dummify the categorical columns into a matrix.

In [12]:
#calling out categories to look for non binary cat.
train_cat_df = X_train[cat_feat].reset_index()
test_cat_df = X_test[cat_feat].reset_index()
train_cat_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112329 entries, 0 to 112328
Data columns (total 14 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   index                112329 non-null  int64 
 1   flat_type            112329 non-null  object
 2   street_name          112329 non-null  object
 3   flat_model           112329 non-null  object
 4   commercial           112329 non-null  int64 
 5   market_hawker        112329 non-null  int64 
 6   multistorey_carpark  112329 non-null  int64 
 7   precinct_pavilion    112329 non-null  int64 
 8   planning_area        112329 non-null  object
 9   bus_interchange      112329 non-null  int64 
 10  mrt_interchange      112329 non-null  int64 
 11  pri_sch_name         112329 non-null  object
 12  pri_sch_affiliation  112329 non-null  int64 
 13  sec_sch_name         112329 non-null  object
dtypes: int64(8), object(6)
memory usage: 12.0+ MB


Columns with object dtype most likely hold string values, so they will be one-hot encoded. The encoding is refined so that categories with few data points are combined into a single group for modelling.

In [13]:
#set list with object dtype to OHE
cat_list = list(train_cat_df.dtypes[train_cat_df.dtypes == object].index)
cat_list

['flat_type',
 'street_name',
 'flat_model',
 'planning_area',
 'pri_sch_name',
 'sec_sch_name']

In [14]:
#set threshold
round(train_cat_df.shape[0]*0.0001)

11

In [15]:
#setup OHE
OHE = OneHotEncoder(sparse_output = False, handle_unknown = 'infrequent_if_exist', min_frequency = 0.0001)
train_OHE_cat_df = pd.DataFrame(OHE.fit_transform(train_cat_df[cat_list]), columns = OHE.get_feature_names_out())
test_OHE_cat_df = pd.DataFrame(OHE.transform(test_cat_df[cat_list]), columns = OHE.get_feature_names_out())
train_OHE_cat_df.tail()

Unnamed: 0,flat_type_1 ROOM,flat_type_2 ROOM,flat_type_3 ROOM,flat_type_4 ROOM,flat_type_5 ROOM,flat_type_EXECUTIVE,flat_type_MULTI-GENERATION,street_name_ADMIRALTY DR,street_name_ADMIRALTY LINK,street_name_AH HOOD RD,...,sec_sch_name_Yio Chu Kang Secondary School,sec_sch_name_Yishun Secondary School,sec_sch_name_Yishun Town Secondary School,sec_sch_name_Yuan Ching Secondary School,sec_sch_name_Yuhua Secondary School,sec_sch_name_Yusof Ishak Secondary School,sec_sch_name_Yuying Secondary School,sec_sch_name_Zhenghua Secondary School,sec_sch_name_Zhonghua Secondary School,sec_sch_name_infrequent_sklearn
112324,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
112325,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
112326,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
112327,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
112328,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [16]:
#concat categories train sets
train_cat_df = pd.concat([train_cat_df, train_OHE_cat_df], axis = 1)
train_cat_df.drop(columns = cat_list, inplace = True)
#concat categories test set
test_cat_df = pd.concat([test_cat_df, test_OHE_cat_df], axis = 1)
test_cat_df.drop(columns = cat_list, inplace = True)

In [17]:
#concat train set
Z_train = pd.concat([X_train_num_scaled, train_cat_df], axis = 1)
Z_train.set_index('index', inplace = True)

#concat test set
Z_test = pd.concat([X_test_num_scaled, test_cat_df], axis = 1)
Z_test.set_index('index', inplace = True)

Z_train.head()

Unnamed: 0_level_0,Mall_Within_2km,cutoff_point,2room_rental,mid,1room_sold,2room_sold,Mall_Within_500m,3room_sold,4room_sold,exec_sold,...,sec_sch_name_Yio Chu Kang Secondary School,sec_sch_name_Yishun Secondary School,sec_sch_name_Yishun Town Secondary School,sec_sch_name_Yuan Ching Secondary School,sec_sch_name_Yuhua Secondary School,sec_sch_name_Yusof Ishak Secondary School,sec_sch_name_Yuying Secondary School,sec_sch_name_Zhenghua Secondary School,sec_sch_name_Zhonghua Secondary School,sec_sch_name_infrequent_sklearn
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
30484,-0.056933,-1.107259,-0.056527,-0.048992,-0.024062,-0.151432,-0.671314,-0.57248,-0.483135,-0.310921,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
87242,1.097444,-1.107259,-0.056527,1.04513,-0.024062,-0.151432,3.424053,-0.57248,-0.040249,-0.310921,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
79640,0.231661,-1.057314,-0.056527,-0.048992,-0.024062,-0.151432,0.693809,-0.57248,-1.103175,2.824966,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
63804,-0.634122,-1.107259,-0.056527,-0.048992,-0.024062,-0.151432,0.693809,1.020995,-0.616,-0.310921,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
146273,-0.634122,1.240134,-0.056527,-1.143114,-0.024062,-0.151432,-0.671314,0.716786,0.00404,-0.310921,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Modelling #2
- [OLS](#OLS) - quick review of the numerical features
- [Linear Regression](#Linear-Regression)
- [Ridge Regression](#Ridge-Regression)
- [Lasso Regression](#Lasso-Regression)
- [Results Analysis](#Results-Analysis)

#### OLS

OLS allows a quick review of the numerical features currently included and a check for features that have little impact and can be dropped in the next iteration. The limitation is that it covers only the numerical features and not the categorical ones, but it at least points to some areas for improvement.

In [18]:
# Statsmodels OLS 
X_train_num = sm.add_constant(X_train[num_feat])
ols = sm.OLS(y_train, X_train_num).fit() 

In [19]:
ols.pvalues[ols.pvalues > 0.005]

2room_rental             0.018370
1room_sold               0.006674
3room_sold               0.010931
4room_sold               0.010897
exec_sold                0.030395
3room_rental             0.207591
total_dwelling_units     0.028192
1room_rental             0.089188
multigen_sold            0.801153
other_room_rental        0.033103
studio_apartment_sold    0.014717
dtype: float64

A p-value threshold of 0.005 is used here. For the features listed above, the null hypothesis of a zero coefficient cannot be rejected at the 0.5% significance level, so they are candidates to be dropped in the next iteration.

### Linear Regression

In [20]:
# Instantiate the model
lr = LinearRegression()

# Obtain 5-fold cross-validation scores (computed once and reused)
lr_scores = cross_val_score(lr, Z_train, y_train, cv=5)
print(lr_scores)

# Mean cross-validation score
lr_scores.mean()

[0.92574967 0.92498421 0.9241191  0.92540709 0.92347876]


0.9247474569568096

In [21]:
# Train your model
lr.fit(Z_train, y_train)

#Generate predictions
y_train_preds = lr.predict(Z_train)
y_test_preds = lr.predict(Z_test)

In [22]:
# Train score:
lr.score(Z_train, y_train)

0.926179126360245

In [23]:
# Test score:
lr.score(Z_test, y_test)

-267759160690952.22

In [24]:
#MSE
train_MSE = mean_squared_error(y_train, y_train_preds)
test_MSE = mean_squared_error(y_test, y_test_preds)
#RMSE 
print(np.sqrt(train_MSE))
print(np.sqrt(test_MSE))

38915.30427526979
2343474971158.487
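
The train score is in line with the regularised models while the test score is extremely negative. One possible contributor, offered here as an unverified assumption and not a diagnosis of this dataset, is that one-hot encoding every level of a category makes the design matrix rank deficient once an intercept is included, which leaves the unregularised coefficients poorly determined. The toy check below illustrates only that general property. The helper `design_rank` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy categorical column, made up for illustration
cats = pd.Series(["A", "B", "C", "A", "B", "C"], name="flat_type")

dummies_full = pd.get_dummies(cats, dtype=float)                   # one column per level
dummies_drop = pd.get_dummies(cats, drop_first=True, dtype=float)  # one level dropped


def design_rank(dummies: pd.DataFrame) -> tuple:
    """Return (rank, number of columns) of the design matrix with an intercept."""
    X = np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])
    return int(np.linalg.matrix_rank(X)), X.shape[1]


rank_full, ncols_full = design_rank(dummies_full)
rank_drop, ncols_drop = design_rank(dummies_drop)
print(rank_full, ncols_full)
print(rank_drop, ncols_drop)
```

With all levels kept, the dummy columns sum to the intercept column, so the rank is one less than the number of columns. Ridge and lasso add a penalty that resolves this ambiguity, which is consistent with their stable test scores.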


### Ridge Regression

In [25]:
#alpha_list = np.logspace(-3, 1, 10)
ridgecv = RidgeCV(alphas = np.logspace(-3, 1, 10) , scoring='r2', cv = 5)
ridgecv.fit(Z_train, y_train)

In [26]:
#optimal alpha
ridgecv.alpha_

0.021544346900318832

In [27]:
#Instantiate the model
ridge = Ridge(alpha = ridgecv.alpha_)
ridge_scores = cross_val_score(ridge, Z_train, y_train)

print (ridge_scores)
print (np.mean(ridge_scores))

[0.92573897 0.92495868 0.92414662 0.92541301 0.92349769]
0.924750994006662


In [28]:
#Retrain ridge
ridge.fit(Z_train, y_train)

#Generate predictions
y_train_preds = ridge.predict(Z_train)
y_test_preds = ridge.predict(Z_test)

In [29]:
# Train score:
ridge_train_score = ridge.score(Z_train, y_train)
ridge_train_score

0.926177755657659

In [30]:
# Test  score:
ridge_test_score = ridge.score(Z_test, y_test)
ridge_test_score

0.9247602589130399

In [31]:
#MSE
train_MSE = mean_squared_error(y_train, y_train_preds)
test_MSE = mean_squared_error(y_test, y_test_preds)
#RMSE 
print(np.sqrt(train_MSE))
print(np.sqrt(test_MSE))

38915.66556238254
39283.64671544856
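
Since the same MSE-to-RMSE steps are repeated for every model, they could be wrapped in a small helper. The function name `rmse` is hypothetical and the numbers below are made up to show the arithmetic.

```python
import numpy as np
from sklearn.metrics import mean_squared_error


def rmse(y_true, y_pred) -> float:
    """Root mean squared error, in the same units as the target."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


# Errors of 3 and 4 give MSE = (9 + 16) / 2 = 12.5
example = rmse([0.0, 0.0], [3.0, 4.0])
print(example)
```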


### Lasso Regression

In [32]:
l_alphas = np.logspace(-3, 1, 10)

lassocv = LassoCV(alphas=l_alphas, cv=5, max_iter=1000)
lassocv.fit(Z_train, y_train)

  model = cd_fast.enet_coordinate_descent_gram(

In [33]:
lassocv.alpha_

0.001
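
The selected lasso alpha of 0.001 is the smallest value in `l_alphas`, whereas the ridge alpha of about 0.0215 falls inside the grid. When the best value sits on the edge of the search grid, the grid may be worth extending in that direction. The helper `on_grid_edge` below is a hypothetical sketch of that check, using the alpha values reported in this notebook.

```python
import numpy as np

alphas = np.logspace(-3, 1, 10)  # same grid as used for RidgeCV and LassoCV


def on_grid_edge(best_alpha: float, grid) -> bool:
    """True when the selected alpha equals the smallest or largest grid value."""
    grid = np.sort(np.asarray(grid))
    return bool(np.isclose(best_alpha, grid[0]) or np.isclose(best_alpha, grid[-1]))


ridge_best = 0.021544346900318832  # ridgecv.alpha_ reported above
lasso_best = 0.001                 # lassocv.alpha_ reported above

print(on_grid_edge(ridge_best, alphas))
print(on_grid_edge(lasso_best, alphas))
```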

In [34]:
#Instantiate the model
lasso = Lasso(alpha=lassocv.alpha_)

lasso_scores = cross_val_score(lasso, Z_train, y_train)

print (lasso_scores)
print (np.mean(lasso_scores))

  model = cd_fast.enet_coordinate_descent(


[0.92571281 0.92485418 0.92414614 0.92536617 0.92344554]
0.9247049699867944


In [35]:
#Retrain lasso
lasso.fit(Z_train, y_train)

#Generate predictions
y_train_preds = lasso.predict(Z_train)
y_test_preds = lasso.predict(Z_test)

  model = cd_fast.enet_coordinate_descent(


In [36]:
# Train score:
lasso_train_score = lasso.score(Z_train, y_train)
lasso_train_score

0.9261502160893221

In [37]:
# Test  score:
lasso_test_score = lasso.score(Z_test, y_test)
lasso_test_score

0.9247308274126094

In [38]:
#MSE
train_MSE = mean_squared_error(y_train, y_train_preds)
test_MSE = mean_squared_error(y_test, y_test_preds)
#RMSE 
print(np.sqrt(train_MSE))
print(np.sqrt(test_MSE))

38922.92367751651
39291.32924867723


In [39]:
coefs_df = pd.DataFrame(list(zip(Z_train.columns, abs(lassocv.coef_))))
coefs_df.sort_values(by = 1, ascending = True)

Unnamed: 0,0,1
682,pri_sch_name_Greendale Primary School,2.620115
863,sec_sch_name_Jurongville Secondary School,30.620776
549,street_name_YISHUN AVE 6,90.122283
13,3room_rental,103.369827
310,street_name_KIM TIAN PL,125.439067
...,...,...
333,street_name_MARINE CRES,278768.333303
334,street_name_MARINE DR,293413.570642
195,street_name_DOVER CL EAST,301359.634028
169,street_name_CLARENCE LANE,325401.778038
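
A labelled `Series` can make a coefficient ranking like the one above easier to read, and it also allows counting how many coefficients lasso has shrunk to exactly zero. The sketch below runs on synthetic data where only columns `a` and `b` drive the target. The column names and the choice of `alpha=0.5` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)

# Synthetic features: 'c' and 'd' are unrelated to the target
X_demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=["a", "b", "c", "d"])
y_demo = 5.0 * X_demo["a"] - 2.0 * X_demo["b"] + rng.normal(scale=0.1, size=200)

lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)

# Keep feature names attached to the coefficients
coefs = pd.Series(lasso_demo.coef_, index=X_demo.columns)
ranked = coefs.abs().sort_values(ascending=False)
n_zero = int((coefs == 0).sum())

print(ranked)
print(n_zero)
```

Features with a zero coefficient are effectively removed by the model, which gives another list of candidates to drop alongside the OLS p-values.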


### Results Analysis

| Version | Score Description | Linear Regression | Ridge Regression | Lasso Regression |
|---------|-------------------|-------------------|------------------|------------------|
| V1      | Train Score       | 0.92618           | 0.92618          | 0.92615          |
| V1      | Test Score        | -6846639454563    | 0.92476          | 0.92473          |
| V1      | Train RMSE        | 38915.20          | 38915.67         | 38922.83         |
| V1      | Test RMSE         | 374737262352      | 39283.65         | 39291.22         |
| V2      | Train Score       | 0.92618           | 0.92618          | 0.92615          |
| V2      | Test Score        | -267759160690952  | 0.92476          | 0.92473          |
| V2      | Train RMSE        | 38915.30          | 38915.67         | 38922.92         |
| V2      | Test RMSE         | 2343474971158     | 39283.65         | 39291.33         |

Similar to v1, replacing `lease_commence_date` with the engineered `transaction_age` feature does not remove the discrepant linear regression test results, and the ridge and lasso scores are essentially unchanged.

In this iteration, to reduce the features more aggressively, an additional Ordinary Least Squares (OLS) model is fitted on the numeric features to determine whether any of them can be dropped.

A p-value threshold of 0.005 is used: for features with a p-value above 0.005, a non-zero effect cannot be established at the 0.5% significance level, so they may be dropped in the next iteration.


# Pickling in progress

In [40]:
import pickle

In [41]:
with open('lr.pkl', 'wb') as f:
    pickle.dump(lr, f)

In [42]:
with open('ridgecv.pkl', 'wb') as f:
    pickle.dump(ridgecv, f)

In [43]:
with open('lassocv.pkl', 'wb') as f:
    pickle.dump(lassocv, f)
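
The pickled models expect inputs that have already been scaled and encoded, so the fitted preprocessors are needed again at prediction time. One option, sketched here on synthetic data and not done in this notebook, is to pickle the preprocessor and the model together. The file name `ridge_bundle.pkl` and the dictionary keys are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data, generated only for this round-trip example
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

scaler = StandardScaler().fit(X_demo)
model = Ridge(alpha=0.1).fit(scaler.transform(X_demo), y_demo)

# Store the fitted scaler next to the model so both can be restored together
bundle = {"scaler": scaler, "model": model}
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "ridge_bundle.pkl"
    with open(path, "wb") as f:
        pickle.dump(bundle, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

original_preds = model.predict(scaler.transform(X_demo))
restored_preds = restored["model"].predict(restored["scaler"].transform(X_demo))
print(np.allclose(original_preds, restored_preds))
```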