<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px"> 

# Project 2: Singapore Housing Data and Kaggle Challenge

--- 
# Part 2 Modeling and refining
---

# Contents:
- [Problem Statement](#Problem-Statement)
- [All imports](#All-imports)
- [Preprocessing](#Preprocessing)
- [Data Dictionary](#Data-Dictionary)
- [Train-Test Split](#Train-Test-Split)
- [Linear Regression](#Linear-Regression)
- [Ridge Regression](#Ridge-Regression)
- [Lasso Regression](#Lasso-Regression)
- [Predictions](#Predictions)

## Problem Statement

Housing prices affect the decision-making process of buyers as they assess a unit. This project attempts to build a linear regression model using the data contained in the dataset folder. The goal is for the model to accurately predict the sale price of the houses in the test set, evaluated with common metrics such as R2 and RMSE.
This will give those who are affected by housing prices, e.g. owners, buyers and agents, additional data to inform their own decisions.

## All imports
Libraries and data imports

In [57]:
# Imports:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression, LassoCV, Lasso, RidgeCV, Ridge
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures, PowerTransformer
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn.metrics import r2_score, mean_squared_error
import statsmodels.api as sm

In [58]:
df = pd.read_csv('../../data/output/cleaner_train.csv')

In [59]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149772 entries, 0 to 149771
Data columns (total 45 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   flat_type                  149772 non-null  object 
 1   street_name                149772 non-null  object 
 2   floor_area_sqm             149772 non-null  float64
 3   flat_model                 149772 non-null  object 
 4   lease_commence_date        149772 non-null  int64  
 5   resale_price               149772 non-null  float64
 6   Tranc_Year                 149772 non-null  int64  
 7   Tranc_Month                149772 non-null  int64  
 8   mid                        149772 non-null  int64  
 9   max_floor_lvl              149772 non-null  int64  
 10  commercial                 149772 non-null  int64  
 11  market_hawker              149772 non-null  int64  
 12  multistorey_carpark        149772 non-null  int64  
 13  precinct_pavilion          14

## Preprocessing
- Train-Test split - Split into a train set and a hold-out 'test' set.
- Numerical transformation - Standard scaling (StandardScaler) of the numerical features, which span a wide range of values.
- Categorical transformation - One hot encoding of object categories

### Initial Feature engineering

As mentioned in Part 1 (EDA), some features will need to be engineered, e.g. the remaining lease at the point of sale could be derived as a single feature in the hope of improving the fit. However, for the first iteration, the model is run as is with the two component features, 'lease_commence_date' and 'Tranc_Year'.
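
As a rough sketch of what such an engineered feature might look like, the remaining lease can be derived from the two component columns. The column name `remaining_lease`, the toy values, and the formula are assumptions for illustration (the lease length of 99 years is taken from the data dictionary); this is not a step performed in this iteration.

```python
import pandas as pd

LEASE_YEARS = 99  # assumed from the data dictionary's "99-year lease"

# Toy rows with made-up years
toy_lease = pd.DataFrame({
    "lease_commence_date": [1985, 2000, 2015],
    "Tranc_Year": [2015, 2020, 2021],
})

# Remaining lease at the point of sale = lease length - age of the lease at sale
toy_lease["remaining_lease"] = LEASE_YEARS - (
    toy_lease["Tranc_Year"] - toy_lease["lease_commence_date"]
)
print(toy_lease)
```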

### Data Dictionary

|        Feature Name       |                        Feature Description per data source                        |
|:-------------------------:|:---------------------------------------------------------------------------------:|
| resale_price              | the property's sale price in Singapore dollars. This is the target variable.      |
| flat_type                 |  type of the resale flat unit, e.g.   3 ROOM                                      |
| street_name               |  street name where the resale flat   resides, e.g. TAMPINES ST 42                 |
| floor_area_sqm            |  floor area of the resale flat unit   in square metres                            |
| flat_model                |  HDB model of the resale flat, e.g.   Multi Generation                            |
| lease_commence_date       |  commencement year of the flat   unit's 99-year lease                             |
| Tranc_Year                |  year of resale transaction                                                       |
| Tranc_Month               |  month of resale transaction                                                      |
| mid                       |  middle value of storey_range                                                     |
| max_floor_lvl             |  highest floor of the resale flat                                                 |
| commercial                |  boolean value if resale flat has   commercial units in the same block            |
| market_hawker             |  boolean value if resale flat has a   market or hawker centre in the same block   |
| multistorey_carpark       |  boolean value if resale flat has a   multistorey carpark in the same block       |
| precinct_pavilion         |  boolean value if resale flat has a   pavilion in the same block                  |
| total_dwelling_units      |  total number of residential   dwelling units in the resale flat                  |
| 1room_sold                |  number of 1-room residential units   in the resale flat                          |
| 2room_sold                |  number of 2-room residential units   in the resale flat                          |
| 3room_sold                |  number of 3-room residential units   in the resale flat                          |
| 4room_sold                |  number of 4-room residential units   in the resale flat                          |
| 5room_sold                |  number of 5-room residential units   in the resale flat                          |
| exec_sold                 |  number of executive type   residential units in the resale flat block            |
| multigen_sold             |  number of multi-generational type   residential units in the resale flat block   |
| studio_apartment_sold     |  number of studio apartment type   residential units in the resale flat block     |
| 1room_rental              |  number of 1-room rental   residential units in the resale flat block             |
| 2room_rental              |  number of 2-room rental   residential units in the resale flat block             |
| 3room_rental              |  number of 3-room rental   residential units in the resale flat block             |
| other_room_rental         |  number of "other" type   rental residential units in the resale flat block       |
| planning_area             |  Government planning area that the   flat is located                              |
| Mall_Nearest_Distance     |  distance (in metres) to the   nearest mall                                       |
| Mall_Within_500m          |  number of malls within 500 metres                                                |
| Mall_Within_1km           |  number of malls within 1 kilometre                                               |
| Mall_Within_2km           |  number of malls within 2   kilometres                                            |
| Hawker_Nearest_Distance   |  distance (in metres) to the   nearest hawker centre                              |
| Hawker_Within_500m        |  number of hawker centres within   500 metres                                     |
| Hawker_Within_1km         |  number of hawker centres within 1   kilometre                                    |
| Hawker_Within_2km         |  number of hawker centres within 2   kilometres                                   |
| mrt_nearest_distance      |  distance (in metres) to the   nearest MRT station                                |
| bus_interchange           |  boolean value if the nearest MRT   station is also a bus interchange             |
| mrt_interchange           |  boolean value if the nearest MRT   station is a train interchange station        |
| bus_stop_nearest_distance |  distance (in metres) to the   nearest bus stop                                   |
| pri_sch_nearest_distance  |  distance (in metres) to the   nearest primary school                             |
| pri_sch_name              |  name of the nearest primary school                                               |
| pri_sch_affiliation       |  boolean value if the nearest   primary school has a secondary school affiliation |
| sec_sch_name              |  name of the nearest secondary   school                                           |
| cutoff_point              |  PSLE cutoff point of the nearest   secondary school                              |

### Train-Test Split
The dataframe is split into a train set and a hold out 'test' set. <br>
As was done in the EDA, the features are split into numerical and categorical for processing before being combined again later.

In [60]:
features = df.columns.to_list()
features.remove('resale_price')

In [61]:
#Create features matrix (X) and target vector (y)
y = df['resale_price']
X = df[features]

In [62]:
#train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

### Numerical features
StandardScaler will be applied to scale the individual features for modelling, since they span a wide range of values.

In [63]:
cat_feat = ['flat_type', 'street_name', 'flat_model', 'commercial', 'market_hawker',   
            'multistorey_carpark', 'precinct_pavilion', 'planning_area', 'bus_interchange', 
            'mrt_interchange', 'pri_sch_name', 'pri_sch_affiliation', 'sec_sch_name'
           ]
num_feat = list(set(features) - set(cat_feat))

In [64]:
# use sklearn StandardScaler for train set
ss = StandardScaler()
X_train_num_scaled = pd.DataFrame(ss.fit_transform(X_train[num_feat]))
X_test_num_scaled = pd.DataFrame(ss.transform(X_test[num_feat]))
X_train_num_scaled.columns = X_train[num_feat].columns
X_test_num_scaled.columns = X_test[num_feat].columns

In [65]:
X_train_num_scaled

Unnamed: 0,Mall_Within_500m,bus_stop_nearest_distance,Hawker_Nearest_Distance,1room_sold,4room_sold,1room_rental,studio_apartment_sold,lease_commence_date,Hawker_Within_500m,2room_rental,...,3room_rental,Mall_Nearest_Distance,3room_sold,total_dwelling_units,exec_sold,Mall_Within_1km,Hawker_Within_1km,max_floor_lvl,mrt_nearest_distance,multigen_sold
0,-0.671314,0.043367,-0.231896,-0.024062,-0.483135,-0.031834,-0.082435,0.389746,-0.639332,-0.056527,...,-0.090780,0.697390,-0.572480,-0.696521,-0.310921,0.142231,-0.241507,-0.017258,1.217885,-0.023618
1,3.424053,-0.552277,0.082751,-0.024062,-0.040249,-0.031834,-0.082435,0.555643,-0.639332,-0.056527,...,-0.090780,-1.295222,-0.572480,0.674524,-0.310921,1.548442,-0.824898,2.401785,-1.182624,-0.023618
2,0.693809,-0.215541,0.134040,-0.024062,-1.103175,-0.031834,-0.082435,-0.356790,-0.639332,-0.056527,...,-0.090780,-0.869960,-0.572480,-1.107834,2.824966,-0.560874,-0.824898,-0.662337,-1.174373,-0.023618
3,0.693809,-0.772176,-0.757778,-0.024062,-0.616000,-0.031834,-0.082435,-0.854482,1.958586,-0.056527,...,-0.090780,-0.603098,1.020995,0.126106,-0.310921,-0.560874,0.341884,-0.501067,-0.293837,-0.023618
4,-0.671314,-0.251724,-0.699184,-0.024062,0.004040,-0.031834,-0.082435,-0.771533,0.659627,-0.056527,...,1.455899,0.620261,0.716786,0.263210,-0.310921,-0.560874,0.341884,-0.662337,0.022902,-0.023618
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
112324,-0.671314,-1.542915,-0.944423,-0.024062,1.022677,-0.031834,-0.082435,1.136283,0.659627,-0.056527,...,-0.090780,1.166863,-0.572480,0.297487,-0.310921,-1.263980,0.925274,1.595437,-1.011052,-0.023618
112325,-0.671314,-0.028837,-0.869708,-0.024062,-1.103175,-0.031834,-0.082435,-1.352173,0.659627,-0.056527,...,-0.090780,0.963408,2.411663,1.411460,-0.310921,-1.263980,-0.241507,-0.178528,0.711000,-0.023618
112326,-0.671314,1.086486,-0.758644,-0.024062,3.547125,-0.031834,-0.082435,-1.269224,1.958586,-0.056527,...,-0.090780,0.180434,-0.572480,1.462875,-0.310921,-0.560874,1.508665,0.144011,0.514377,-0.023618
112327,-0.671314,1.057931,-0.770636,-0.024062,-0.616000,-0.031834,-0.082435,-1.352173,1.958586,-0.056527,...,-0.090780,1.086927,1.281745,0.468867,-0.258656,-1.263980,0.925274,-0.501067,-0.683963,-0.023618


### Categorical features
One Hot encoding will be used to dummify the categorical columns into a matrix.

In [66]:
#calling out categories to look for non binary cat.
train_cat_df = X_train[cat_feat].reset_index()
test_cat_df = X_test[cat_feat].reset_index()
train_cat_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112329 entries, 0 to 112328
Data columns (total 14 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   index                112329 non-null  int64 
 1   flat_type            112329 non-null  object
 2   street_name          112329 non-null  object
 3   flat_model           112329 non-null  object
 4   commercial           112329 non-null  int64 
 5   market_hawker        112329 non-null  int64 
 6   multistorey_carpark  112329 non-null  int64 
 7   precinct_pavilion    112329 non-null  int64 
 8   planning_area        112329 non-null  object
 9   bus_interchange      112329 non-null  int64 
 10  mrt_interchange      112329 non-null  int64 
 11  pri_sch_name         112329 non-null  object
 12  pri_sch_affiliation  112329 non-null  int64 
 13  sec_sch_name         112329 non-null  object
dtypes: int64(8), object(6)
memory usage: 12.0+ MB


Columns with object dtype most likely hold string values, so they will be one-hot encoded. However, the encoding will be refined so that categories with few data points are combined into a single group for modelling.

In [67]:
#set list with object dtype to OHE
cat_list = list(train_cat_df.dtypes[train_cat_df.dtypes == object].index)
cat_list

['flat_type',
 'street_name',
 'flat_model',
 'planning_area',
 'pri_sch_name',
 'sec_sch_name']

In [68]:
#set threshold
round(train_cat_df.shape[0]*0.0001)

11

In [69]:
#setup OHE
OHE = OneHotEncoder(sparse_output = False, handle_unknown = 'infrequent_if_exist', min_frequency = 0.0001)
train_OHE_cat_df = pd.DataFrame(OHE.fit_transform(train_cat_df[cat_list]), columns = OHE.get_feature_names_out())
test_OHE_cat_df = pd.DataFrame(OHE.transform(test_cat_df[cat_list]), columns = OHE.get_feature_names_out())
train_OHE_cat_df.tail()

Unnamed: 0,flat_type_1 ROOM,flat_type_2 ROOM,flat_type_3 ROOM,flat_type_4 ROOM,flat_type_5 ROOM,flat_type_EXECUTIVE,flat_type_MULTI-GENERATION,street_name_ADMIRALTY DR,street_name_ADMIRALTY LINK,street_name_AH HOOD RD,...,sec_sch_name_Yio Chu Kang Secondary School,sec_sch_name_Yishun Secondary School,sec_sch_name_Yishun Town Secondary School,sec_sch_name_Yuan Ching Secondary School,sec_sch_name_Yuhua Secondary School,sec_sch_name_Yusof Ishak Secondary School,sec_sch_name_Yuying Secondary School,sec_sch_name_Zhenghua Secondary School,sec_sch_name_Zhonghua Secondary School,sec_sch_name_infrequent_sklearn
112324,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
112325,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
112326,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
112327,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
112328,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [70]:
#concat categories train sets
train_cat_df = pd.concat([train_cat_df, train_OHE_cat_df], axis = 1)
train_cat_df.drop(columns = cat_list, inplace = True)
#concat categories test set
test_cat_df = pd.concat([test_cat_df, test_OHE_cat_df], axis = 1)
test_cat_df.drop(columns = cat_list, inplace = True)

In [71]:
#concat train set
Z_train = pd.concat([X_train_num_scaled, train_cat_df], axis = 1)
Z_train.set_index('index', inplace = True)

#concat test set
Z_test = pd.concat([X_test_num_scaled, test_cat_df], axis = 1)
Z_test.set_index('index', inplace = True)

Z_train.head()

Unnamed: 0_level_0,Mall_Within_500m,bus_stop_nearest_distance,Hawker_Nearest_Distance,1room_sold,4room_sold,1room_rental,studio_apartment_sold,lease_commence_date,Hawker_Within_500m,2room_rental,...,sec_sch_name_Yio Chu Kang Secondary School,sec_sch_name_Yishun Secondary School,sec_sch_name_Yishun Town Secondary School,sec_sch_name_Yuan Ching Secondary School,sec_sch_name_Yuhua Secondary School,sec_sch_name_Yusof Ishak Secondary School,sec_sch_name_Yuying Secondary School,sec_sch_name_Zhenghua Secondary School,sec_sch_name_Zhonghua Secondary School,sec_sch_name_infrequent_sklearn
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
30484,-0.671314,0.043367,-0.231896,-0.024062,-0.483135,-0.031834,-0.082435,0.389746,-0.639332,-0.056527,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
87242,3.424053,-0.552277,0.082751,-0.024062,-0.040249,-0.031834,-0.082435,0.555643,-0.639332,-0.056527,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
79640,0.693809,-0.215541,0.13404,-0.024062,-1.103175,-0.031834,-0.082435,-0.35679,-0.639332,-0.056527,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
63804,0.693809,-0.772176,-0.757778,-0.024062,-0.616,-0.031834,-0.082435,-0.854482,1.958586,-0.056527,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
146273,-0.671314,-0.251724,-0.699184,-0.024062,0.00404,-0.031834,-0.082435,-0.771533,0.659627,-0.056527,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Modelling #1
- [Linear Regression](#Linear-Regression)
- [Ridge Regression](#Ridge-Regression)
- [Lasso Regression](#Lasso-Regression)
- [Results Analysis](#Results-Analysis) - of the models tested

### Linear Regression

In [72]:
#Instantiate your model
lr = LinearRegression()

#Obtain cross-validation scores (5-fold; R2 is the default scorer for regressors)
lr_scores = cross_val_score(lr, Z_train, y_train, cv = 5)
print(lr_scores)

#Obtain MEAN cross-validation score from the same folds
lr_scores.mean()

[0.92574948 0.92498302 0.92412123 0.9254021  0.92348035]


0.9247465997170679

In [73]:
# Train your model
lr.fit(Z_train, y_train)

#Generate predictions
y_train_preds = lr.predict(Z_train)
y_test_preds = lr.predict(Z_test)

In [74]:
# Train score:
lr.score(Z_train, y_train)

0.9261795088793212

In [75]:
# Test score:
lr.score(Z_test, y_test)

-76711889067531.97

In [76]:
#MSE
train_MSE = mean_squared_error(y_train, y_train_preds)
test_MSE = mean_squared_error(y_test, y_test_preds)
#RMSE 
print(np.sqrt(train_MSE))
print(np.sqrt(test_MSE))

38915.203451039495
1254352181060.5674


### Ridge Regression

In [77]:
#alpha_list = np.logspace(-3, 1, 10)
ridgecv = RidgeCV(alphas = np.logspace(-3, 1, 10) , scoring='r2', cv = 5)
ridgecv.fit(Z_train, y_train)

In [78]:
ridgecv.alpha_

0.021544346900318832

In [79]:
#Instantiate the model
ridge = Ridge(alpha = ridgecv.alpha_)
ridge_scores = cross_val_score(ridge, Z_train, y_train, cv = 5)

print (ridge_scores)
print (np.mean(ridge_scores))

[0.92573897 0.92495868 0.92414661 0.92541301 0.92349769]
0.9247509940333934


In [80]:
#Retrain ridge
ridge.fit(Z_train, y_train)

#Generate predictions
y_train_preds = ridge.predict(Z_train)
y_test_preds = ridge.predict(Z_test)

In [81]:
# Train score:
ridge_train_score = ridge.score(Z_train, y_train)
ridge_train_score

0.9261777556570372

In [82]:
# Test  score:
ridge_test_score = ridge.score(Z_test, y_test)
ridge_test_score

0.9247602588881998

In [83]:
#MSE
train_MSE = mean_squared_error(y_train, y_train_preds)
test_MSE = mean_squared_error(y_test, y_test_preds)
#RMSE 
print(np.sqrt(train_MSE))
print(np.sqrt(test_MSE))

38915.665562546455
39283.646721933226


### Lasso Regression

In [87]:
l_alphas = np.logspace(-3, 1, 10)

lassocv = LassoCV(alphas=l_alphas, cv=5, max_iter=1000)
lassocv.fit(Z_train, y_train)

  model = cd_fast.enet_coordinate_descent_gram(

In [88]:
lassocv.alpha_

0.001

In [89]:
lasso = Lasso(alpha=lassocv.alpha_)

lasso_scores = cross_val_score(lasso, Z_train, y_train)

print (lasso_scores)
print (np.mean(lasso_scores))

  model = cd_fast.enet_coordinate_descent(


[0.92571231 0.92485553 0.9241447  0.92536717 0.92344725]
0.9247053877726905


In [90]:
#Retrain lasso
lasso.fit(Z_train, y_train)

#Generate predictions
y_train_preds = lasso.predict(Z_train)
y_test_preds = lasso.predict(Z_test)

  model = cd_fast.enet_coordinate_descent(


In [91]:
# Train score:
lasso_train_score = lasso.score(Z_train, y_train)
lasso_train_score

0.9261505725726171

In [92]:
# Test  score:
lasso_test_score = lasso.score(Z_test, y_test)
lasso_test_score

0.924731232168656

In [93]:
#MSE
train_MSE = mean_squared_error(y_train, y_train_preds)
test_MSE = mean_squared_error(y_test, y_test_preds)
#RMSE 
print(np.sqrt(train_MSE))
print(np.sqrt(test_MSE))

38922.829734189305
39291.22360499915


In [94]:
#check whether any lasso coefficients were reduced to 0 (sorted by absolute value, smallest first)
coefs_df = pd.DataFrame(list(zip(Z_train.columns, abs(lassocv.coef_))))
coefs_df.sort_values(by = 1, ascending = True)

Unnamed: 0,0,1
682,pri_sch_name_Greendale Primary School,37.259727
6,studio_apartment_sold,59.172796
16,2room_sold,62.183298
869,sec_sch_name_Manjusri Secondary School,92.006903
784,pri_sch_name_White Sands Primary School,103.761078
...,...,...
38,flat_type_1 ROOM,283643.583597
334,street_name_MARINE DR,290998.649285
195,street_name_DOVER CL EAST,299953.031463
169,street_name_CLARENCE LANE,324669.590649


### Results Analysis


| Score Description | Linear Regression                      | Ridge Regression | Lasso Regression |
|-------------------|----------------------------------------|------------------|------------------|
| Train R2 Score    | 0.92618                                | 0.92618          | 0.92615          |
| Test R2 Score     | <mark> -76711889067531.97 </mark>      | 0.92476          | 0.92473          |
| Train RMSE        | 38915.20                               | 38915.67         | 38922.83         |
| Test RMSE         | <mark> 1254352181060.57 </mark>        | 39283.65         | 39291.22         |

With linear regression, the test R2 (highlighted in yellow) is a very large negative number, which does not make sense as a coefficient of determination for a usable model. The root-mean-square error (RMSE) is therefore reported as a second gauge for comparison, and the test RMSE for linear regression is likewise implausibly high (highlighted in yellow). Likely, the split left the train and test sets imbalanced in a way that skews the results of the simpler, unregularised linear regression model.
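
To make the two metrics concrete, the sketch below computes R2 and RMSE by hand on made-up prices and shows that R2 has no lower bound: a single wildly wrong prediction is enough to push it far below zero. The numbers and helper names (`r2_manual`, `rmse`) are hypothetical, and the example does not claim to reproduce the cause of the result above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([300_000.0, 450_000.0, 600_000.0])    # made-up resale prices
y_good = np.array([310_000.0, 440_000.0, 590_000.0])    # close predictions
y_bad = np.array([300_000.0, 450_000.0, 60_000_000.0])  # one extreme prediction

def r2_manual(y, y_hat):
    """R2 = 1 - SS_res / SS_tot, where SS_tot is measured around the mean of y."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

def rmse(y, y_hat):
    """Square root of the mean squared error, in the same units as y."""
    return np.sqrt(mean_squared_error(y, y_hat))

r2_good, r2_bad = r2_manual(y_true, y_good), r2_manual(y_true, y_bad)
print(r2_good, rmse(y_true, y_good))
print(r2_bad, rmse(y_true, y_bad))
```

A negative R2 simply means the predictions are worse than always predicting the mean of the observed values.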

Looking for alternatives, Ridge and Lasso regression models were tested as well. Both scored similarly, but the cross-validated Lasso model raised convergence warnings, which means the alpha value it selected may not be the true optimum. The Ridge regression results will therefore be used as the point of comparison for other iterations unless the convergence warnings are resolved.
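
One common way to approach such warnings is to give the coordinate-descent solver more iterations or a looser tolerance and then count how many warnings remain. The sketch below does this on synthetic data; the `max_iter` and `tol` values are assumptions for illustration, not settings validated on the housing data.

```python
import warnings

import numpy as np
from sklearn.datasets import make_regression
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LassoCV

# Synthetic regression problem standing in for the real feature matrix
X_toy, y_toy = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=123)
toy_alphas = np.logspace(-3, 1, 10)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always", ConvergenceWarning)
    # Larger max_iter and a looser tol than the defaults (assumed values)
    lasso_toy = LassoCV(alphas=toy_alphas, cv=5, max_iter=10_000, tol=1e-3)
    lasso_toy.fit(X_toy, y_toy)

# Count only the convergence warnings that were raised during the fit
n_conv_warnings = sum(issubclass(w.category, ConvergenceWarning) for w in caught)
print(lasso_toy.alpha_, n_conv_warnings)
```

If the selected alpha sits at the edge of the grid, as 0.001 does here in the notebook, widening the grid is another option worth considering.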

For the Ridge regression model, the train and test R2 scores are high at 0.926 and 0.925, meaning the model explains about 92.5% of the variance in resale price on the hold-out set. The RMSE indicates that predictions deviate from the actual value by almost $40,000 in the root-mean-square sense. On both metrics, the test score is slightly worse than the train score, which may indicate a small amount of overfitting.

Since the models use the features as is, this iteration serves as a baseline against which the other iterations can be compared. To improve the model, another iteration with feature engineering will be tested.


# Pickling in progress

In [None]:
import pickle

In [None]:
with open('lr.pkl', 'wb') as f:  # context manager closes the file after writing
    pickle.dump(lr, f)

In [None]:
with open('ridgecv.pkl', 'wb') as f:  # context manager closes the file after writing
    pickle.dump(ridgecv, f)

In [None]:
with open('lassocv.pkl', 'wb') as f:  # context manager closes the file after writing
    pickle.dump(lassocv, f)
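
A pickled model can be checked by loading it back and comparing predictions. The sketch below does a round trip with a small stand-in Ridge model in a temporary directory; the toy data, the file name `ridge_toy.pkl`, and the variable names are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import Ridge

# Small stand-in model fitted on made-up data
X_pkl = np.array([[0.0], [1.0], [2.0], [3.0]])
y_pkl = np.array([1.0, 3.0, 5.0, 7.0])
toy_model = Ridge(alpha=0.01).fit(X_pkl, y_pkl)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "ridge_toy.pkl")
    # Write the fitted model to disk
    with open(path, "wb") as f:
        pickle.dump(toy_model, f)
    # Read it back into a new object
    with open(path, "rb") as f:
        restored_model = pickle.load(f)

# The restored model should give the same predictions as the original
same_predictions = np.allclose(toy_model.predict(X_pkl), restored_model.predict(X_pkl))
print(same_predictions)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.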