# Project 2 - Singapore Housing Data and Kaggle Challenge

## Part 5 - Production Model and Insights

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from scipy import stats

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import (LinearRegression, Ridge, Lasso, ElasticNet,
                                  RidgeCV, LassoCV, ElasticNetCV)
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.pipeline import Pipeline, make_pipeline

import pickle

We did not find evidence of overfitting in our Linear Regression model. Nevertheless, we will fit two regularised models (RidgeCV and LassoCV) to check whether shrinking the coefficients improves how well the model generalises.

## 1. Train-test-split

In [2]:
# Importing data.
df = pd.read_csv('../data/04_cleaned_df.csv')

In [3]:
# Defining numerical columns and categorical columns

num_columns = ['floor_area_sqm', 'mid_storey',
               'mall_nearest_distance', 'hawker_nearest_distance',
               'mrt_nearest_distance',
               'pri_sch_nearest_distance',
               'sec_sch_nearest_dist',
               'age_when_sold']

cat_columns = ['full_flat_type', 'commercial',
               'planning_area', 'mrt_interchange',
               'pri_sch_affiliation', 'pri_sch_name', 'sec_sch_name']

In [4]:
X = df.drop(columns= 'resale_price')
y = df['resale_price']

In [5]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

In [6]:
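# Instantiating the one-hot encoder (OHE), standard scaler (SS) and column transformer (CT).
# min_frequency=10 groups categories seen fewer than 10 times into a single "infrequent" column.
ohe = OneHotEncoder(min_frequency=10)
ss = StandardScaler()
ct = make_column_transformer(
    (ohe, cat_columns),
    (ss, num_columns))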


## 2. Modelling

### 2.1 Linear Regression

In [7]:
# Instantiating Linear Regression
lr = LinearRegression()
# Creating a pipeline that starts off with the column transformer, followed by the LinearRegression
lr_pipe = make_pipeline(ct, lr)

In [8]:
lr_pipe

In [9]:
lr_pipe.fit(X_train, y_train)
# Linear Regression Scores
print(f"The train score is: {lr_pipe.score(X_train, y_train)}")
print(f"The test score is: {lr_pipe.score(X_test, y_test)}")

y_test_preds_lr = lr_pipe.predict(X_test)

# Linear Regression Root Mean Squared Error
lr_rmse = np.sqrt(mean_squared_error(y_test, y_test_preds_lr))
print(f'The root mean squared error is {lr_rmse}.')

The train score is: 0.9094461971305231
The test score is: 0.9085714834872526
The root mean squared error is 43304.097016923675.


These results repeat the baseline from the previous notebook. Now, let us see whether RidgeCV and LassoCV can improve on the R2 score and RMSE.

### 2.2 RidgeCV

Ridge regression aims to reduce overfitting and improve the model's ability to generalise. It does this by adding a penalty on the squared size of the coefficients to the cost function, which shrinks the coefficients relative to plain linear regression.
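To make the shrinkage concrete, here is a minimal sketch on synthetic data (not the HDB dataset). It uses the closed-form ridge solution for centred data without an intercept, which is an assumption made for brevity; the helper names `fit_ols` and `fit_ridge` are hypothetical and are not part of scikit-learn.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients for centred data (no intercept)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def fit_ridge(X, y, alpha):
    """Closed-form ridge coefficients: solve (X'X + alpha * I) b = X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Synthetic, centred data with known coefficients plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X -= X.mean(axis=0)
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=50)
y -= y.mean()

ols_coef = fit_ols(X, y)
ridge_coef = fit_ridge(X, y, alpha=10.0)

# The ridge coefficient vector is shorter than the OLS one.
print(np.linalg.norm(ols_coef), np.linalg.norm(ridge_coef))
```

With `alpha=0` the penalty vanishes and the two solutions coincide, which is why a very small selected alpha leaves the model close to the unregularised baseline.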

In [10]:
# Setting alpha range
r_alphas = np.logspace(-2, 1, 100)
# Instantiating RidgeCV
ridge = RidgeCV(alphas=r_alphas, scoring='r2', cv=5)
# Creating a pipeline that starts off with the column transformer, followed by the RidgeCV
ridge_pipe = make_pipeline(ct, ridge)

In [11]:
ridge_pipe

In [12]:
ridge_pipe.fit(X_train, y_train)
# RidgeCV Scores
print(f"The train score is: {ridge_pipe.score(X_train, y_train)}")
print(f"The test score is: {ridge_pipe.score(X_test, y_test)}")

y_test_preds_ridge = ridge_pipe.predict(X_test)

# RidgeCV Root Mean Squared Error
ridge_rmse = np.sqrt(mean_squared_error(y_test, y_test_preds_ridge))
print(f'The root mean squared error is {ridge_rmse}.')

The train score is: 0.909418642603309
The test score is: 0.908508229789419
The root mean squared error is 43319.07413114957.


In [13]:
ridge.alpha_

0.023101297000831605

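There is very little difference between the RidgeCV and Linear Regression results in terms of R2 and RMSE. The selected alpha of about 0.023 sits near the low end of the alpha range, so only a light penalty is being applied. Let us see the LassoCV score.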

### 2.3 LassoCV

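Similar to Ridge regression, Lasso regression aims to reduce overfitting and improve the model's ability to generalise. The main difference between the two is the type of penalty they impose, which causes different effects on the coefficient estimates: Ridge penalises the squared coefficients and shrinks them towards zero, while Lasso penalises their absolute values and can set some coefficients to exactly zero.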
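The different effects of the two penalties can be sketched for the simplest possible case. The snippet below assumes orthonormal features and the objectives ½‖y − Xb‖² + (α/2)‖b‖² for Ridge and ½‖y − Xb‖² + α‖b‖₁ for Lasso; under those assumptions each penalised coefficient is a simple function of the unpenalised one. The scaling differs from scikit-learn's, and the function names and example coefficients are hypothetical.

```python
def ridge_shrink(coef, alpha):
    """Ridge with orthonormal features: proportional shrinkage, never exactly zero."""
    return coef / (1.0 + alpha)

def lasso_shrink(coef, alpha):
    """Lasso with orthonormal features: soft-thresholding, small values become zero."""
    magnitude = max(abs(coef) - alpha, 0.0)
    return magnitude if coef >= 0 else -magnitude

# Hypothetical unpenalised coefficients.
ols_coefs = [4.0, -1.5, 0.3]
alpha = 0.5

ridge_coefs = [ridge_shrink(c, alpha) for c in ols_coefs]
lasso_coefs = [lasso_shrink(c, alpha) for c in ols_coefs]

print("ridge:", ridge_coefs)
print("lasso:", lasso_coefs)
```

Ridge keeps every feature with a smaller weight, while Lasso drops the weakest one entirely, which is why Lasso is often used for feature selection.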

In [14]:
# Setting alpha range
l_alphas = np.logspace(-20, 5, 25)
#l_alphas = [0.00001, 0.00005, 0.0001, 0.0005, 0.001]
# Instantiating LassoCV
lasso = LassoCV(alphas=l_alphas, cv=5, max_iter=10000, tol=1e-1)
# Creating a pipeline that starts off with the column transformer, followed by the LassoCV
lasso_pipe = make_pipeline(ct, lasso)

In [15]:
lasso_pipe

In [16]:
lasso_pipe.fit(X_train, y_train)
# LassoCV Scores
print(f"The train score is: {lasso_pipe.score(X_train, y_train)}")
print(f"The test score is: {lasso_pipe.score(X_test, y_test)}")

y_test_preds_lasso = lasso_pipe.predict(X_test)

# LassoCV Root Mean Squared Error
lasso_rmse = np.sqrt(mean_squared_error(y_test, y_test_preds_lasso))
print(f'The root mean squared error is {lasso_rmse}.')

The train score is: 0.8994305424066004
The test score is: 0.8984566030864874
The root mean squared error is 45636.6749272274.


In [17]:
# Best alpha
lasso.alpha_

1e-20

Based on these results, I would not recommend using my LassoCV model, for two reasons:
1. Fitting with the default tolerance raised a convergence warning. Increasing the tolerance to `1e-1` made the warning disappear, but a looser tolerance lets the solver stop earlier, so the fitted coefficients may be further from the optimum.
2. I could not find a meaningful best alpha. As seen above, the selected alpha is consistently the smallest value in my alpha range, even when I extended the range down to an extremely small number. An alpha of `1e-20` applies almost no penalty, so this fit is effectively an unregularised regression solved with a loose tolerance, which may explain why it scores below the baseline.

As such, I will not be using my LassoCV model.
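A selected alpha that lands on the boundary of the search grid is a signal that the grid did not bracket an optimum. A small check along the following lines can flag this automatically. It is only a sketch: `alpha_on_grid_edge` is a hypothetical helper, and the grid below is a simplified stand-in for the `np.logspace` range used above.

```python
import math

def alpha_on_grid_edge(best_alpha, alphas):
    """Return 'lower' or 'upper' if best_alpha sits on an edge of the grid, else None."""
    lowest, highest = min(alphas), max(alphas)
    if math.isclose(best_alpha, lowest, rel_tol=1e-9):
        return "lower"
    if math.isclose(best_alpha, highest, rel_tol=1e-9):
        return "upper"
    return None

# Simplified grid: 1e-20, 1e-15, 1e-10, 1e-5, 1e0, 1e5.
grid = [10.0 ** exponent for exponent in range(-20, 6, 5)]

print(alpha_on_grid_edge(1e-20, grid))
```

A result of `'lower'` suggests that the data want as little regularisation as the grid allows, which is consistent with the baseline showing no overfitting.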

## 3. Model Evaluation

All three models performed well, with R2 values between roughly 0.90 and 0.91. The gap between the train and test R2 values is small, less than 0.01 for every model. The gap between the train and test Root Mean Squared Error values is small too. The results are as follows:


| Regression Type            | Train R2 | Test R2 | Test RMSE |
|----------------------------|----------|---------|-----------|
| Baseline Linear Regression | 0.909    | 0.909   | 43,304.10 |
| Ridge Regression           | 0.909    | 0.909   | 43,319.07 |
| Lasso Regression           | 0.899    | 0.898   | 45,636.67 |

Overall, the baseline linear regression model has the best performance in terms of R2 score and Root Mean Squared Error, although Ridge is practically tied with it. Regularisation did not improve on the baseline's predictive performance. For the Kaggle submission, I will use the linear regression model.
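A comparison like the one in the table can be collected with a small helper instead of copying numbers from separate cells. The sketch below runs on synthetic data rather than the HDB dataset, and `summarise` and the `toy`-suffixed variables are hypothetical names introduced for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def summarise(model, X_tr, X_te, y_tr, y_te):
    """Fit a model and return its train R2, test R2 and test RMSE."""
    model.fit(X_tr, y_tr)
    test_rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
    return {
        "train_r2": model.score(X_tr, y_tr),
        "test_r2": model.score(X_te, y_te),
        "test_rmse": float(test_rmse),
    }

# Synthetic regression data standing in for the real features.
rng = np.random.default_rng(123)
X_toy = rng.normal(size=(200, 4))
y_toy = X_toy @ np.array([5.0, -3.0, 2.0, 0.0]) + rng.normal(size=200)
splits = train_test_split(X_toy, y_toy, random_state=123)

models = {"linear": LinearRegression(), "ridge": Ridge(alpha=1.0)}
results = {name: summarise(model, *splits) for name, model in models.items()}

for name, row in results.items():
    print(f"{name:<8} train R2={row['train_r2']:.3f} "
          f"test R2={row['test_r2']:.3f} test RMSE={row['test_rmse']:.2f}")
```

The same helper would accept the fitted pipelines from this notebook, since they expose the same `fit`, `score` and `predict` methods.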

## 4. Recommendations and Next Steps

Overall, we managed to create a strong model with an R2 value of 0.91, which means that 91% of the variation in the resale price of an HDB house can be explained by our model. Our Root Mean Squared Error is approximately 43,000. This means that a typical prediction differs from the actual price by roughly $43,000, bearing in mind that RMSE weights large errors more heavily than a simple average of the errors would.
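The distinction between RMSE and a plain average error matters when interpreting the $43,000 figure. The sketch below uses made-up prediction errors (not values from our model) to show that a single large miss pulls the RMSE above the mean absolute error (MAE).

```python
import math

def rmse(errors):
    """Root mean squared error: square, average, then take the square root."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def mae(errors):
    """Mean absolute error: plain average of the error sizes."""
    return sum(abs(e) for e in errors) / len(errors)

# Hypothetical prediction errors in dollars; one prediction is far off.
errors = [10_000, -10_000, 10_000, -50_000]

print(f"MAE:  {mae(errors):,.0f}")
print(f"RMSE: {rmse(errors):,.0f}")
```

The RMSE is never smaller than the MAE, so $43,000 is best read as an upper-leaning estimate of the typical miss.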

Based on the heatmap in the previous notebook, we found five main characteristics of an HDB house that would affect its resale value the most.
1. Size of the house - The larger the size of the house, the more expensive it is.
2. The storey of the house - The higher the storey of the house, the more expensive it is.
3. The age of the house when sold - The older the house is when sold, the cheaper it is.
4. Commercial - If the resale flat does not have commercial units in the same block, it is more expensive.
5. Its distance from the nearest MRT station - The nearer it is to an MRT station, the more expensive it is.

While there are many Singapore-specific columns, the five characteristics that we identified appear to generalise to other cities and countries as well.

Going back to the problem statement:
**<center>Can we accurately predict the resale value of a Housing & Development Board (HDB) flat based on various housing-related data?</center>**

I'd say that this model can reasonably predict the resale value of an HDB flat, with a typical error of approximately $43,000.

For our next steps, I'd recommend using this model as a baseline for us to build solutions for our primary audience (i.e. home buyers).

For example, home buyers who want a suggested resale price for a flat they are interested in could fill in a form describing that flat. The form gives us the values of the independent variables (e.g. town, age of house, size of house), which we can feed into the model to produce a suggested price. This suggested price, together with the RMSE, gives home buyers a range to work with when deciding how much to offer the seller. For example, a desperate buyer can offer a price at the high end of the range, while a buyer who is looking for a bargain can offer a price at the low end of the range. After all, the buying and selling of resale flats is still a very human process.
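One simple way to turn a suggested price into a range is to add and subtract the RMSE. This is a sketch under a strong assumption: a band of one RMSE on either side is a rough guide, not a calibrated prediction interval. The function name and the $500,000 predicted price are hypothetical.

```python
def suggested_price_range(predicted_price, rmse):
    """Return a (low, high) band of one RMSE on either side of the prediction."""
    return predicted_price - rmse, predicted_price + rmse

# Hypothetical model output for one flat, with the test RMSE reported above.
low, high = suggested_price_range(500_000, 43_304.10)

print(f"Suggested range: ${low:,.0f} to ${high:,.0f}")
```

A bargain hunter would start near `low`, while a buyer who badly wants the flat might go towards `high`.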

In [18]:
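# Save the fitted pipeline (preprocessing + linear regression) to a file,
# so that the saved model can predict directly from raw feature rows.
with open('../data/05_final_lr_model.pkl', 'wb') as file:
    pickle.dump(lr_pipe, file)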
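A saved model is only usable on raw rows if the preprocessing travels with it. The sketch below illustrates this on toy data with hypothetical column values: it fits a small pipeline, round-trips it through `pickle` in memory, and predicts on an unseen row. It is an illustration of the idea, not a reproduction of the model above.

```python
import pickle

import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with one categorical and one numerical feature.
toy = pd.DataFrame({
    "town": ["A", "B", "A", "B", "A", "B"],
    "floor_area_sqm": [60.0, 90.0, 70.0, 110.0, 65.0, 100.0],
    "resale_price": [300_000, 450_000, 340_000, 560_000, 320_000, 500_000],
})
X_toy = toy.drop(columns="resale_price")
y_toy = toy["resale_price"]

toy_pipe = make_pipeline(
    make_column_transformer(
        (OneHotEncoder(handle_unknown="ignore"), ["town"]),
        (StandardScaler(), ["floor_area_sqm"]),
    ),
    LinearRegression(),
)
toy_pipe.fit(X_toy, y_toy)

# Serialise and restore the whole pipeline in memory.
restored = pickle.loads(pickle.dumps(toy_pipe))

# The restored pipeline encodes and scales the raw row before predicting.
new_row = pd.DataFrame({"town": ["A"], "floor_area_sqm": [80.0]})
print(restored.predict(new_row))
```

A bare regressor restored on its own would instead expect already-encoded and scaled inputs, so the same transformer would have to be rebuilt separately.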