<img src="https://gallery.mailchimp.com/f98d5ac0a3fbbdcdda35136ab/images/2002af76-5fd4-4185-9d49-28558b6b8772.png">

# `sg-hdb-resale-bokeh` 
# Part 2: Model Training
So far, we have extracted data from the .csv files, applied some preliminary transformations, and loaded everything into an SQLite database. The next step is to build a simple predictive model for the price of a resale HDB unit.

In [None]:
import pandas as pd
from sqlalchemy import create_engine

In [None]:
sql_engine = create_engine('sqlite:///../data/processed/sg_hdb.db')
# Simple query to get the whole table
query = "SELECT * FROM sg_hdb_resale"
# Store result of query in a pandas dataframe
sg_hdb_resale_df = pd.read_sql_query(query, sql_engine)

In [None]:
# Observe result of query executed
sg_hdb_resale_df

In [None]:
# Do the same for the other table containing the resale HDB price index
query = "SELECT * FROM sg_hdb_pi"
sg_hdb_pi = pd.read_sql_query(query, sql_engine)

In [None]:
sg_hdb_pi

We are first going to inspect the data types of the imported dataframes.

In [None]:
sg_hdb_resale_df.dtypes

In [None]:
sg_hdb_pi.dtypes

The next steps are to associate a price index from `sg_hdb_pi` with each observation in `sg_hdb_resale_df`, and then to adjust the resale prices according to those indexes. This is akin to a left join, but the two dataframes do not yet share a key column to join on.

We first have to create a column for `sg_hdb_resale_df` stating the year and quarter for each transaction/observation.

In [None]:
# Converting the 'month' column to a datetime format
sg_hdb_resale_df['month'] = pd.to_datetime(sg_hdb_resale_df.month, format='%Y-%m')
sg_hdb_resale_df.dtypes

In [None]:
sg_hdb_resale_df

The following function will take in the year and month properties from the 'month' column to get a single output containing the year and quarter of the transaction.

In [None]:
def get_year_quarter(x):
    year = x.year
    # Floor division for the month property to get the month's quarter
    quarter = ((x.month-1)//3)+1
    # Combining the year and quarter properties into a single output
    year_quarter = '{}-Q{}'.format(year, quarter)
    return year_quarter
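
As a quick sanity check, the sketch below runs the same function on a few made-up months (hypothetical values, not taken from the dataset) and compares the result with a vectorised alternative built on the pandas `.dt` accessor. The vectorised form is only an illustration of an equivalent approach, not something the notebook depends on.

```python
import pandas as pd

# Hypothetical sample of transaction months (not from the real dataset)
months = pd.Series(
    pd.to_datetime(['2017-01', '2017-06', '2018-09', '2018-12'], format='%Y-%m')
)

def get_year_quarter(x):
    # Months 1-3 -> Q1, 4-6 -> Q2, 7-9 -> Q3, 10-12 -> Q4
    quarter = ((x.month - 1) // 3) + 1
    return '{}-Q{}'.format(x.year, quarter)

# Element-by-element version, as used in this notebook
mapped = months.map(get_year_quarter)

# Vectorised equivalent using the .dt accessor
vectorised = months.dt.year.astype(str) + '-Q' + months.dt.quarter.astype(str)

print(mapped.tolist())
print(vectorised.tolist())
```

Both approaches should produce the same labels; the vectorised one tends to be faster on large dataframes because it avoids calling a Python function per row.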

Here we apply the function above to each observation using the `map` method.

In [None]:
sg_hdb_resale_df['resale_quarter'] = sg_hdb_resale_df['month'].map(get_year_quarter)

In [None]:
sg_hdb_resale_df.head()

In [None]:
# Current no. of observations
len(sg_hdb_resale_df)

Since we only have the resale price index up until Q4 of 2018, we would have to filter out transactions recorded after 2018.

In [None]:
sg_hdb_resale_df = sg_hdb_resale_df[sg_hdb_resale_df['month'] < pd.Timestamp('2019-01-01')]
# Now the current no. of observations has changed since some have been filtered out
len(sg_hdb_resale_df)

Now that `sg_hdb_resale_df` has a column that can be matched against `sg_hdb_pi`, we can do a left join.

In [None]:
sg_hdb_resale_df = pd.merge(sg_hdb_resale_df, sg_hdb_pi, left_on='resale_quarter', right_on='quarter', how='left')
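
A left join keeps every transaction even when no matching quarter exists in the index table, in which case the index columns are filled with `NaN`. The sketch below uses two tiny made-up tables (the prices and index values are hypothetical) and the `indicator=True` option of `pd.merge` to show how unmatched rows could be spotted.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
resale = pd.DataFrame({
    'resale_quarter': ['2018-Q3', '2018-Q4', '2019-Q1'],
    'resale_price': [300000.0, 420000.0, 380000.0],
})
price_index = pd.DataFrame({
    'quarter': ['2018-Q3', '2018-Q4'],
    'index': [131.6, 131.4],  # illustrative values only
})

# indicator=True adds a '_merge' column describing where each row came from
merged = pd.merge(
    resale, price_index,
    left_on='resale_quarter', right_on='quarter',
    how='left', indicator=True
)

# Rows found only in the left table have no price index attached
unmatched = merged[merged['_merge'] == 'left_only']
print(len(merged), len(unmatched))
```

In this toy case the 2019-Q1 transaction has no index, which is the situation the earlier filtering step is meant to avoid.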

In [None]:
sg_hdb_resale_df

After the join, we apply the indexes to the recorded resale prices. The function below takes the dataframe and computes the adjusted price for every observation at once.

In [None]:
def adjust_resale_price(x):
    # Converting indexes to multipliers
    index_multiplier = x['index']/100
    # Applying multiplier to observation's resale price
    adjusted_price = x['resale_price'] * index_multiplier
    return adjusted_price
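
To make the arithmetic concrete, here is a small worked example that follows the same formula as the function above (price multiplied by index/100). The prices and index values are made up for illustration. Because the function only uses column arithmetic, it works on the whole dataframe in one call rather than row by row.

```python
import pandas as pd

def adjust_resale_price(x):
    # Same formula as in this notebook: price * (index / 100)
    index_multiplier = x['index'] / 100
    return x['resale_price'] * index_multiplier

# Hypothetical observations (values are not from the real dataset)
toy = pd.DataFrame({
    'resale_price': [300000.0, 400000.0],
    'index': [100.0, 130.0],
})

# An index of 100 leaves the price unchanged; 130 scales it by 1.3
toy['adjusted_resale_price'] = adjust_resale_price(toy)
print(toy)
```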

Create a new column 'adjusted_resale_price' to contain these new derived values.

In [None]:
sg_hdb_resale_df['adjusted_resale_price'] = adjust_resale_price(sg_hdb_resale_df)

In [None]:
sg_hdb_resale_df.head()

Now that we have adjusted the resale prices according to the matched price indexes, we are going to build a simple linear regression model that predicts the adjusted resale price of a unit from its `floor_area_sqm`.

In [None]:
sg_hdb_resale_df

In [None]:
# Check number of null values across all columns
sg_hdb_resale_df.isnull().sum()

The first step is to create two separate NumPy arrays containing the predictor and response values to train our model on.

In [None]:
sg_hdb_X = sg_hdb_resale_df['floor_area_sqm'].values
sg_hdb_Y = sg_hdb_resale_df['adjusted_resale_price'].values

In [None]:
# Import the relevant packages
import numpy as np
import sklearn
from sklearn import model_selection
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import pickle
from matplotlib import pyplot as plt

When training a model, we have to create a train-test split to check the accuracy/performance of the model.

In [None]:
sg_hdb_X_train, sg_hdb_X_test, sg_hdb_Y_train, sg_hdb_Y_test = sklearn.model_selection.train_test_split(
    sg_hdb_X, sg_hdb_Y,
    test_size=0.3, random_state=7
)

In [None]:
# Reshaping needed when using a single variable for predictor
sg_hdb_X_train = sg_hdb_X_train.reshape(-1,1)
sg_hdb_X_test = sg_hdb_X_test.reshape(-1,1)
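
The reshape is needed because scikit-learn estimators expect the predictors as a 2-D array of shape `(n_samples, n_features)`, while a single column pulled out with `.values` is 1-D. The sketch below illustrates this on ten made-up values; the variable names are hypothetical and the data is synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic predictor and response (not the HDB data)
X = np.arange(10, dtype=float)
y = 2.0 * X + 1.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=7
)

# 1-D arrays of shape (n,) become 2-D arrays of shape (n, 1)
X_train_2d = X_train.reshape(-1, 1)
X_test_2d = X_test.reshape(-1, 1)

print(X_train.shape, '->', X_train_2d.shape)
print(X_test.shape, '->', X_test_2d.shape)
```

The `-1` tells NumPy to infer the number of rows, so the same call works regardless of how many observations end up in each split.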

We are going to fit a simple linear regression model (a.k.a. a best-fit line) to the training set.

In [None]:
# Initialise model
lm = LinearRegression()
# Create model from the train sets
lm.fit(sg_hdb_X_train, sg_hdb_Y_train)

In [None]:
# To observe the model's coefficients
print('Coefficients: \n X:', lm.coef_,'\n c:', lm.intercept_)
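
With a single predictor, the fitted model is just the line `y = coef * x + intercept`. The sketch below fits a model to synthetic points that lie exactly on a known line (the slope of 3000 and intercept of 50000 are arbitrary choices for illustration) and checks that `predict` agrees with the manual calculation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data lying exactly on y = 3000 * x + 50000 (arbitrary numbers)
X = np.array([50.0, 70.0, 90.0, 110.0]).reshape(-1, 1)
y = 3000.0 * X.ravel() + 50000.0

lm_toy = LinearRegression().fit(X, y)

# Manual prediction from the fitted slope and intercept
x_new = 65.0
manual = lm_toy.coef_[0] * x_new + lm_toy.intercept_

# Prediction through the model API (input must be 2-D)
predicted = lm_toy.predict(np.array([[x_new]]))[0]

print(manual, predicted)
```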

After creating the model, we are going to evaluate its performance by pitting it against the test set.

First, we use the linear model to provide us with the predictions derived from the values in the test set.

In [None]:
sg_hdb_Y_pred = lm.predict(sg_hdb_X_test)
sg_hdb_Y_pred

Next, we measure how well the model fits by comparing the predicted values with the actual values in the test set, using the R² score (coefficient of determination).

In [None]:
# Examine goodness of fit of the model (R² score)
r2_score(sg_hdb_Y_test, sg_hdb_Y_pred)
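
For reference, R² is defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on a few made-up values and compares it with `r2_score`. It also derives the root mean squared error from `mean_squared_error` (imported earlier), which expresses the typical error in the same units as the price. All numbers here are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted prices
y_true = np.array([300000.0, 350000.0, 420000.0, 500000.0])
y_pred = np.array([320000.0, 340000.0, 400000.0, 470000.0])

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# RMSE: square root of the mean squared error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

print(r2_manual, r2_score(y_true, y_pred), rmse)
```

An R² close to 1 means the line explains most of the variation in price, while a value near 0 means it does little better than always predicting the mean.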

Why does the model have such a score? Well, let's check how the linear model was created.

In [None]:
plt.scatter(sg_hdb_X_test, sg_hdb_Y_test, color='red')
plt.plot(sg_hdb_X_test, sg_hdb_Y_pred, color='blue')
plt.title(" SG HDB Resale ")
plt.xlabel('floor_area_sqm')
plt.ylabel("adjusted_resale_price")
plt.show()

Data points that are this widely scattered can hardly be described by a single straight line, hence the low R² score.

For the sake of this exercise, let us proceed and export (serialise) this model for deployment. Save the model under a name such as 'sg_hdb_lm_v1.pkl'.

In [None]:
# Specify output location of model to be serialised
file_loc_name = '../models/sg_hdb_lm_v1.pkl'
# Use a context manager so the file is closed after writing
with open(file_loc_name, 'wb') as model_file:
    pickle.dump(lm, model_file)

Here, we do a quick test by loading the saved model and making a single prediction with it.

In [None]:
# Test loading saved model
with open(file_loc_name, 'rb') as model_file:
    loaded_model = pickle.load(model_file)
# Create a test value for test prediction
# Test value has to be contained in a numpy array format hence np.array
test_val = np.array(65)
# Reshaping value before feeding to .predict function
test_val_reshape = test_val.reshape(-1, 1)
# Conduct prediction
result = loaded_model.predict(test_val_reshape)
# Print out result
print(result)
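
A slightly stricter check is to confirm that the reloaded model gives the same prediction as the one still in memory. The sketch below does this round trip on a toy model written to a temporary directory, so it does not touch the real `../models` folder. The data, file name, and variable names are all hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy model trained on synthetic data (not the HDB data)
X = np.array([50.0, 70.0, 90.0, 110.0]).reshape(-1, 1)
y = 3000.0 * X.ravel() + 50000.0
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_lm.pkl')  # hypothetical file name
    # Serialise the model to disk
    with open(path, 'wb') as f:
        pickle.dump(model, f)
    # Load it back into a separate object
    with open(path, 'rb') as f:
        restored = pickle.load(f)

# Both objects should give the same prediction for the same input
sample = np.array([[65.0]])
same = np.allclose(model.predict(sample), restored.predict(sample))
print(same)
```

Note that pickled scikit-learn models are generally only safe to load with the same library version they were saved with, and pickle files from untrusted sources should not be loaded.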

Now that we have exported this model, it is time to create a simple API (Application Programming Interface) that allows us to use the model, potentially remotely.