# 0.0 Problem Statement

Predict the sold price / square foot of the property that sold on 2020

# 1.1 Take a Quick Look at Data Structure

The goal of this section is to analyze the data and only keep data that is likely to be useful to us. Processing data for ML algorithm will be done in later sections

We have the housing data as before. See PricingEstimateV1 for my reasons on including or excluding certain features.

We are only interested in listings that sold in 2020. We will also add additional features such as Interest Rate, and assessement value of the properties from 2020 and 2019. We will also add price per square foot of nearest 5 properties of every listing. See the get_data variable in PreprocessingPipelineV3 for details on the data we will use for our purpose

In [2]:
import PreprocessingPipelineV3

ModuleNotFoundError: No module named 'geopy'

In [None]:
house_data_dir = r"C:\Users\KI PC\OneDrive\Documents\Software Engineering and Computer Science\Internships\Riipen - KnockNow\BC-House-Pricing-Model"
seasons_data_path = r"C:\Users\KI PC\OneDrive\Documents\Software Engineering and Computer Science\Internships\Riipen - KnockNow\BC-House-Pricing-Model\month_seasons.csv"
#mortgage rate from here: https://www.ratehub.ca/historical-mortgage-rates-widget
interest_rate_2020_path = r"C:\Users\KI PC\OneDrive\Documents\Software Engineering and Computer Science\Internships\Riipen - KnockNow\BC-House-Pricing-Model\2020_month_by_month_interest_rate.csv"
assesement_data_path = r"C:\Users\KI PC\OneDrive\Documents\Software Engineering and Computer Science\Internships\Riipen - KnockNow\BC-House-Pricing-Model\West-van-assessments.csv"
longitude_latitude_data_path = r"C:\Users\KI PC\OneDrive\Documents\Software Engineering and Computer Science\Internships\Riipen - KnockNow\BC-House-Pricing-Model\longitude_latitude_data.csv"

In [None]:
#get data
import pandas as pd
house_data = PreprocessingPipelineV3.load_data(house_data_dir, "Spreadsheet")
assesement_data = pd.read_csv(assesement_data_path)
seasons_data = pd.read_csv(seasons_data_path)
interest_rate_2020_data = pd.read_csv(interest_rate_2020_path)

In [None]:
house_data_sold = PreprocessingPipelineV3.get_data.fit_transform([house_data, assesement_data, seasons_data, interest_rate_2020_data])
house_data_sold.to_csv('data_v3.csv')

In [None]:
house_data_sold.head()

We have the following columns:

In [None]:
house_data_sold.columns

We have seen them in PricingEstimateV1. Sold Price is not actually Sold Price / Floor Area The new columns are described below

- 2019/2020 Total Value, 2019/2020 Land Value, 2019/2020 Buildings Value: These are the assessed value of the properties from https://www.bcassessment.ca/

- Season: Fall / Summer / Winter / Spring

- Interest Rate: 5 year Mortgage Rate in the month that the property sold (https://www.ratehub.ca/historical-mortgage-rates-widget)

- price_sq_ft: These are the price/sq_ft of the closest 5 properties

- date_diff_tot: Total number of days difference between the sold date of subject property and the sold date of 5 nearest properties

- distance_total: total distance between the five properties to the subject property

## 1.2 Train and Test Split

We will again use stratified sampling





In [None]:
strat_train, strat_train_labels, strat_test, strat_test_labels =  PreprocessingPipelineV3.create_train_test_set(df = house_data_sold)

In [3]:
print(strat_train.shape)
print(strat_train_labels.shape)
print(strat_test.shape)
print(strat_test_labels.shape)

NameError: name 'strat_train' is not defined

Before, we calculate the error on test set, we might have a problem where some of the labels in categorical columns such as "S/A", "TypeDwel", "Showing Appts" don't have the same number of categories in training and test set. Let's see if that is the case here

In [None]:
for column in PreprocessingPipelineV3.nominal_columns:
    print('Result for :' + column)
    print(" ")
    print('Categories of ' + column + ' that is present in train set but not in test set')
    print(set(strat_train[column].unique()) - set(strat_test[column].unique()))
    print('Categories of ' + column + ' that is present in test set but not in train set')
    print(set(strat_test[column].unique()) - set(strat_train[column].unique()))
    print(" ")

As we can see, the test set don't have any rows where S/A is 'VWVRR' and where Showing Appts is 'Phone Seller First'. This will create a problem as we will have different number of features in training set vs. test set. To deal with this, we simply need to add enough rows (maximum of 2) from training set to train set so we don't have this problem. This is not an ideal situation as some of the rows will be used for both training and testing but the number of such rows at most will be 2. So we should be fine

In [None]:
rows_containing_vwval = (strat_train['S/A'] == 'VWVRR')
rows_containing_phone_seller_first = strat_train['Showing Appts'] == 'Phone Seller First'

In [None]:
(rows_containing_vwval & rows_containing_phone_seller_first).sum()

So, there are no rows that contains the missing categories in both columns. We need to add two rows to test set

In [None]:
#add predictor features and labels to test dataset
strat_test = strat_test.append(strat_train[rows_containing_vwval].head(1))
strat_test = strat_test.append(strat_train[rows_containing_phone_seller_first].head(1))  
strat_test_labels = strat_test_labels.append(strat_train_labels.loc[strat_test.index[-2:]])

In [None]:
strat_test.tail()

In [None]:
strat_test_labels.tail()

In [None]:
print(strat_train.shape)
print(strat_train_labels.shape)
print(strat_test.shape)
print(strat_test_labels.shape)

# 2.0 Prepare the Data for Machine Learning Algorithms

See the pipeline data_processing_pipeline in PreprocessingPipelineV3 for details on how the data was transformed for machine learning algorithms. Most of them stayed the same from PreprocessingPipeline

# 3.0 Selecting, Fine Tuning and Training a Model

As of now, we have two data sets: Training Set and Test Set. Typically training set is further divided into a training set and a validation set. The reduced training set is used to train the model, and the validation set is used to fine-tune the model. For our purpose, we will use repeated cross-validation instead. In repeated cross-validation, we use many small validation sets. Each model is evaluated once per validation set after it is trained on the rest of the data. By averaging out all the evaluations of a model, we get a much accurate measure of its performance. The drawback is that the training time is multiplied by the number of validation sets.

We will use the following models for testing: Decision Tree Regressor, Random Forest Regressor, and SVR

## 3.1 Initial Model(s) Selection

First, we need to preprocess the data

In [None]:
strat_train_prepared = PreprocessingPipelineV3.data_preprocessing_pipeline.fit_transform(strat_train)

In [None]:
strat_train_prepared.shape

The columns are as follows

In [None]:
attributes = (PreprocessingPipelineV3.numerical_columns +
              PreprocessingPipelineV3.nearest_property_data_columns +
              PreprocessingPipelineV3.ordinal_columns + 
              PreprocessingPipelineV3.one_hot_encoding_cols_catgs)

In [None]:
len(attributes)

In [None]:
attributes

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
#Function to show cross validation scores
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

### 3.1.1 Decision Tree Regressor

In [None]:
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(strat_train_prepared, strat_train_labels)

In [None]:
import numpy as np
from sklearn.model_selection import cross_val_score

scores_tree = cross_val_score(tree_reg, strat_train_prepared, strat_train_labels, scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores_tree)
display_scores(tree_rmse_scores)

### 3.1.2 Random Forest Regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor(n_estimators=100, random_state=42)
forest_reg.fit(strat_train_prepared, strat_train_labels)

In [None]:
scores_forest = cross_val_score(forest_reg, strat_train_prepared, strat_train_labels, scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-scores_forest)
display_scores(forest_rmse_scores)

The results are still not great but The Random Forest Regressor is outperforming the Decision Tree Regressor by a signifciant margin

### 3.1.3 SVR

In [None]:
from sklearn.svm import SVR
svm_reg = SVR(kernel="linear")
svm_reg.fit(strat_train_prepared, strat_train_labels)

In [None]:
scores_svr = cross_val_score(svm_reg, strat_train_prepared, strat_train_labels, scoring="neg_mean_squared_error", cv=10)
svm_reg_rmse_scores = np.sqrt(-scores_svr)
display_scores(svm_reg_rmse_scores)

The Random Forest Regressor is outperfomring by quite a margin but we may get better results with SVR when we use Grid Search or Randomized Search. So for next stage, we will only keep SVR and Random Forest Regressor

In addition, Random Forest Algorithm combines the results of many decision trees to come up with results. Thus, it is unlikely to be helpful with ensemble learning. In addition, as shown above, it is likely to give better results than Decision Trees. So we will not consider Decision Tree any further.

# 3.2 Final Model Selection

### 3.2.1 Grid Search with Random Forest Regressor

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid_forest = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
  ]

forest_reg = RandomForestRegressor(random_state=42)
# train across 5 folds, that's a total of (12+6)*5=90 rounds of training 
grid_search_forest = GridSearchCV(forest_reg, param_grid_forest, cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)
grid_search_forest.fit(strat_train_prepared, strat_train_labels)

In [None]:
print(grid_search_forest.best_params_)
print(grid_search_forest.best_estimator_)
print("Best Score: {0}".format(np.sqrt(-grid_search_forest.best_score_)))

In [None]:
pd.DataFrame(grid_search_forest.cv_results_).head()

Better way to see the result is as following:

In [None]:
cvres = grid_search_forest.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

### 3.2.2 Randomized Search with Random Forest Regressor

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs_forest = {
        'n_estimators': randint(low=1, high=200),
        'max_features': randint(low=1, high=8),
    }

forest_reg = RandomForestRegressor(random_state=42)
rnd_search_forest = RandomizedSearchCV(forest_reg, param_distributions=param_distribs_forest,
                                n_iter=50, cv=5, scoring='neg_mean_squared_error', random_state=42, verbose = 2)
rnd_search_forest.fit(strat_train_prepared, strat_train_labels)

In [None]:
print(rnd_search_forest.best_params_)
print(rnd_search_forest.best_estimator_)
print("Best Score: {0}".format(np.sqrt(-rnd_search_forest.best_score_)))

In [None]:
cvres = rnd_search_forest.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

In [None]:
feature_importances = rnd_search_forest.best_estimator_.feature_importances_
feature_importances

In [None]:
sorted(zip(feature_importances, attributes), reverse = True)

### 3.2.3 Grid Search with SVR

In [None]:
param_grid_svr = [
        {'kernel': ['linear'], 'C': [10., 30., 100., 300., 1000., 3000., 10000., 30000.0]},
        {'kernel': ['rbf'], 'C': [1.0, 3.0, 10., 30., 100., 300., 1000.0],
         'gamma': [0.01, 0.03, 0.1, 0.3, 1.0, 3.0]},
    ]
svm_reg = SVR()
grid_search_svr = GridSearchCV(svm_reg, param_grid_svr, cv=5,
                           scoring='neg_mean_squared_error',
                           verbose = 2) 

grid_search_svr.fit(strat_train_prepared, strat_train_labels)

In [None]:
print(grid_search_svr.best_params_)
print(grid_search_svr.best_estimator_)
print("Best Score: {0}".format(np.sqrt(-grid_search_svr.best_score_)))

In [None]:
cvres = grid_search_svr.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

### 3.2.4 Randomized Search with SVR

In [None]:
from scipy.stats import expon, reciprocal
param_distribs_svr = {
        'kernel': ['linear', 'rbf'],
        'C': reciprocal(20, 200000),
        'gamma': expon(scale=1.0),
    } 

svm_reg = SVR()
rnd_search_svr = RandomizedSearchCV(svm_reg, param_distributions=param_distribs_svr,
                                n_iter=50, cv=5, scoring='neg_mean_squared_error',
                                verbose=2, random_state=42)
rnd_search_svr.fit(strat_train_prepared, strat_train_labels)

In [None]:
print(rnd_search_svr.best_params_)
print(rnd_search_svr.best_estimator_)
print("Best Score: {0}".format(np.sqrt(-rnd_search_svr.best_score_)))

In [None]:
cvres = rnd_search_svr.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

The best model is the SVR model obtaind from Grid Search with mean error of 131.05 $/sq_Ft

## 3.3 Generalization Error For Final Model

Before, we calculate the error on test set, we might have a problem where some of the labels in categorical columns such as "S/A", "TypeDwel", "Showing Appts" don't have the same number of categories in training and test set. Let's see if that is the case here

In [None]:
from sklearn.metrics import mean_squared_error
#final_model = grid_search_svr.best_estimator_
final_model = grid_search_svr.best_estimator_
house_data_test_prepared = PreprocessingPipelineV3.data_preprocessing_pipeline.fit_transform(strat_test)
final_predictions = final_model.predict(house_data_test_prepared)
final_mse = mean_squared_error(strat_test_labels, final_predictions)
final_rmse = np.sqrt(final_mse)
print("Final RMSE: {}".format(final_rmse))

The mean error on test set is $1750.58 per square foot. The result is not great and we need better features and more data for the model to learn from

In [None]:
#export final model
import joblib
joblib.dump(final_model, "final_model.pkl")