# Introduction to Machine Learning for the Built Environment - Supervised Classification Models

- Created by Clayton Miller - clayton@nus.edu.sg - miller.clayton@gmail.com

This notebook is an introduction to the machine learning concept of classification. We will use the ASHRAE Thermal Comfort Database II data set to predict what makes a person feel comfortable.

In [None]:
import pandas as pd
from google.colab import drive
import os
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

In [None]:
from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_curve, auc, precision_recall_curve
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from sklearn.model_selection import RandomizedSearchCV
from sklearn.feature_selection import SelectKBest

## Load the IEQ Data and find a classification objective

We load the data and then narrow it down to a single attribute to predict.

In [None]:
drive.mount('/content/gdrive')
os.chdir("/content/gdrive/My Drive/EDX Data Science for Construction, Architecture and Engineering/4 - Operations - Statistics and Visualization/")

In [None]:
ieq_data = pd.read_csv("ashrae_thermal_comfort_database_2.csv", index_col='Unnamed: 0')

In [None]:
ieq_data.head()

In [None]:
ieq_data.info()

In [None]:
ieq_data["ThermalSensation_rounded"].value_counts()

## Classification Objective -- Predict Thermal Sensation using a Random Forest Model

Let's use many of the other variables to predict thermal sensation as the classification objective.

To do this we can use the Random Forest Classification Model. This model is a good all-purpose model that is able to ingest input features of various types. It is an ensemble of many [decision-tree models](https://en.wikipedia.org/wiki/Decision_tree_learning) whose predictions are combined.
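As a rough illustration of how a forest combines its trees, the sketch below fits a small forest on synthetic data and checks that the forest's class probabilities are the average of the individual trees' probabilities. The data and the names `X_demo`, `y_demo`, and `forest_demo` are assumptions made for this example and are not part of the ASHRAE workflow.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; not the ASHRAE data set)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3,
    n_classes=3, n_clusters_per_class=1, random_state=2)

forest_demo = RandomForestClassifier(n_estimators=10, random_state=2)
forest_demo.fit(X_demo, y_demo)

# Each fitted tree is available in estimators_; stack their class probabilities
tree_probabilities = np.array(
    [tree.predict_proba(X_demo) for tree in forest_demo.estimators_])

# Averaging over the trees should reproduce the forest's own probabilities
averaged = tree_probabilities.mean(axis=0)
print(len(forest_demo.estimators_))
print(np.allclose(averaged, forest_demo.predict_proba(X_demo)))
```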


## Creating Feature and Target Data Sets
The first thing we need to do is create the feature data set and the target variable.

In [None]:
ieq_data.head()

In [None]:
list(ieq_data.columns)

Let's use the following columns as input features for the classification model. These features will be used by the model to try to predict `ThermalSensation_rounded`.

Several of the features are related to the building context (e.g., `Country`, `City`), the environmental conditions (e.g., `Air temperature (C)`, `Relative humidity (%)`), and personal factors (e.g., `Sex`, `Clo`).


In [None]:
feature_columns = [
 'Year',
 'Season',
 'Climate',
 'City',
 'Country',
 'Building type',
 'Cooling startegy_building level',
 'Sex',
 'Clo',
 'Met',
 'Air temperature (C)',
 'Relative humidity (%)',
 'Air velocity (m/s)']

In [None]:
features = ieq_data[feature_columns]

In [None]:
features.info()

The **target** variable is the column that we want to predict, in this case thermal sensation. We will use the "rounded" version to minimize the number of categories.

In [None]:
target = ieq_data['ThermalSensation_rounded']

In [None]:
target.head()

## Create dummy variables for the categories

Once again, we need to convert the categorical variables to dummy variables, as that is the input the model expects.

In [None]:
features_withdummies = pd.get_dummies(features)


In [None]:
features_withdummies.head()
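To make the conversion concrete, here is a minimal sketch on a tiny hypothetical table (the values and the names `toy` and `toy_dummies` are made up for illustration). Numeric columns pass through unchanged, while each category of a text column becomes its own indicator column.

```python
import pandas as pd

# Hypothetical toy data, not taken from the ASHRAE database
toy = pd.DataFrame({
    "Season": ["Summer", "Winter", "Summer"],
    "Clo": [0.5, 1.0, 0.6],
})

# Text columns are expanded into one indicator column per category
toy_dummies = pd.get_dummies(toy)
print(list(toy_dummies.columns))
print(toy_dummies)
```

The new column names follow the pattern `<original column>_<category>`, which is why `features_withdummies` has many more columns than `features`.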

## Create the Train and Test Split using scikit-learn

Now we will use scikit-learn's `train_test_split` function to divide the data set into random training and test subsets.

In [None]:
features_train, features_test, target_train, target_test = train_test_split(features_withdummies, target, test_size=0.3, random_state=2)


In [None]:
features_train.head()

In [None]:
features_train.info()

In [None]:
features_test.info()
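The effect of `test_size` and `random_state` can be seen on a small hypothetical array. This is only a sketch; the names `X_toy` and `y_toy` are assumptions for the example. With `test_size=0.3`, 30% of the rows go to the test set, and reusing the same `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10 hypothetical samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Split twice with the same seed to check reproducibility
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=2)
X_tr_again, X_te_again, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=2)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te_again))
```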

## Train the Random Forest Model and make the classification prediction

We can now create the Random Forest model from the `sklearn` class that was imported earlier and specify various parameters (often called hyperparameters) that influence the way the model is constructed.

These parameters can be tuned in order to achieve the best accuracy.

In [None]:
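# max_features='auto' was removed in scikit-learn 1.3; 'sqrt' is the equivalent setting for classifiers
model_rf = RandomForestClassifier(oob_score = True, max_features = 'sqrt', n_estimators = 100, min_samples_leaf = 2, random_state = 2)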

In [None]:
model_rf.fit(features_train, target_train)
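One possible way to tune these parameters is the `RandomizedSearchCV` class imported at the top of the notebook. The sketch below is not the notebook's own procedure: it runs on synthetic data, and the search ranges, `n_iter`, and `cv` values are arbitrary assumptions chosen to keep it fast.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data; in the notebook this would be features_train / target_train
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=2)

# Hypothetical search space: each candidate samples one value per parameter
param_distributions = {
    "n_estimators": randint(20, 60),
    "min_samples_leaf": randint(1, 6),
    "max_features": ["sqrt", "log2"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=2),
    param_distributions=param_distributions,
    n_iter=5,            # number of random parameter combinations to try
    cv=3,                # 3-fold cross-validation for each combination
    scoring="accuracy",
    random_state=2,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds a forest refit on all the supplied data with the best parameters found.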


## Out-of-Bag (OOB) Error Calculation

The out-of-bag (OOB) score estimates the model's accuracy by scoring each tree on the training samples that were left out of that tree's bootstrap sample. The fact that we have six classes to predict makes this classification a bit of a challenge.


In [None]:
mean_model_accuracy = model_rf.oob_score_

print("Model accuracy: "+str(mean_model_accuracy))
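The sketch below illustrates why an OOB score is possible at all, using synthetic data rather than the ASHRAE set (all names here are assumptions for the example). Each tree is trained on a bootstrap sample, so roughly 37% of the rows are left out of any given tree and can serve as unseen data for it. The resulting OOB accuracy is usually close to the accuracy on a held-out test set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Chance that one particular row is never drawn in a bootstrap sample of size n_rows
n_rows = 1000
left_out_fraction = (1 - 1 / n_rows) ** n_rows
print(round(left_out_fraction, 3))

# Synthetic stand-in data (an assumption; not the ASHRAE data set)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=5, n_classes=3, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=2)

forest_demo = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=2)
forest_demo.fit(X_tr, y_tr)

# Compare the OOB estimate with the accuracy on rows the forest never saw
print("OOB accuracy: ", round(forest_demo.oob_score_, 3))
print("Test accuracy:", round(forest_demo.score(X_te, y_te), 3))
```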

The model predicts the correct thermal sensation class about half the time. That seems low, but let's find where the baseline is.

## Create a Baseline Model to compare the accuracy of the model

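scikit-learn allows you to create a baseline model whose accuracy reflects random guessing. With `strategy='stratified'`, the guesses are drawn at random according to the class proportions in the training data.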



In [None]:
# Dummy classifier model to get a baseline
baseline_rf = DummyClassifier(strategy='stratified', random_state=0)
baseline_rf.fit(features_train, target_train)
baseline_model_accuracy = baseline_rf.score(features_test, target_test)
print("Baseline accuracy: "+str(baseline_model_accuracy))

The baseline model is only about 28% accurate, so our model is almost twice as accurate at predicting the right value.
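To see where a baseline number like this comes from, consider class proportions p_k. Stratified random guessing is correct with probability equal to the sum of p_k squared, while always guessing the most frequent class is correct with probability max(p_k), which is never lower. The sketch below checks this on simulated labels; the proportions are hypothetical and are not the ASHRAE values.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

rng = np.random.default_rng(0)

# Hypothetical class proportions, for illustration only
classes = np.array([-1, 0, 1])
proportions = np.array([0.2, 0.5, 0.3])
y_toy = rng.choice(classes, size=20000, p=proportions)
X_toy = np.zeros((len(y_toy), 1))  # dummy classifiers ignore the features

# Expected accuracies derived from the proportions alone
expected_stratified = np.sum(proportions ** 2)
expected_most_frequent = proportions.max()

stratified = DummyClassifier(strategy="stratified", random_state=0).fit(X_toy, y_toy)
most_frequent = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)

stratified_score = stratified.score(X_toy, y_toy)
most_frequent_score = most_frequent.score(X_toy, y_toy)

print("stratified:    expected", round(expected_stratified, 2), "observed", round(stratified_score, 2))
print("most_frequent: expected", round(expected_most_frequent, 2), "observed", round(most_frequent_score, 2))
```

Because the most-frequent strategy is the harder of the two to beat, it can be worth reporting both baselines.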

## Classification Report

Classification is often evaluated by more than just accuracy; several other metrics are calculated to understand how successful the classification is. We can create a report that outlines the `precision`, `recall`, `f1-score`, and `support` metrics for each of the classes being predicted.

In [None]:
y_pred = model_rf.predict(features_test)
y_true = np.array(target_test)
categories = np.array(target.sort_values().unique())
print(classification_report(y_true, y_pred))
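The per-class numbers in the report can be reproduced by hand. The sketch below uses ten made-up labels (not ASHRAE data) and computes the metrics for class `1` from counts of true positives, false positives, and false negatives, then compares them with scikit-learn's own functions.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical true and predicted labels for three classes
y_true_toy = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
y_pred_toy = np.array([0, 0, 1, 1, 1, 1, 0, 2, 2, 1])

label = 1
tp = np.sum((y_pred_toy == label) & (y_true_toy == label))  # predicted 1, truly 1
fp = np.sum((y_pred_toy == label) & (y_true_toy != label))  # predicted 1, truly another class
fn = np.sum((y_pred_toy != label) & (y_true_toy == label))  # truly 1, predicted another class

precision = tp / (tp + fp)   # share of "1" predictions that were right
recall = tp / (tp + fn)      # share of true "1" samples that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
support = np.sum(y_true_toy == label)                # number of true "1" samples

print(precision, recall, round(f1, 3), support)

# The same values from scikit-learn, restricted to class 1
print(precision_score(y_true_toy, y_pred_toy, labels=[label], average=None)[0])
print(recall_score(y_true_toy, y_pred_toy, labels=[label], average=None)[0])
print(round(f1_score(y_true_toy, y_pred_toy, labels=[label], average=None)[0], 3))
```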

## Feature Importance

With Random Forest models, there is the built-in capability to calculate the **Feature Importance**. In scikit-learn, this value measures how much each feature reduces impurity across the splits in all trees, normalized so that the importances sum to 1.



In [None]:
importances = model_rf.feature_importances_
std = np.std([tree.feature_importances_ for tree in model_rf.estimators_], axis=0)
indices = np.argsort(importances)[::-1]

In [None]:
# Print the feature ranking
print("Feature ranking:")

for f in range(features_withdummies.shape[1]):
    print("%d. feature %s (%f)" % (f + 1, features_withdummies.columns[indices[f]], importances[indices[f]]))

According to the feature importance analysis, it seems that the conventional environmental metrics are the best predictors of comfort, followed by the personal factors.
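Because `pd.get_dummies` spreads each categorical variable over several columns, its importance is also spread over several entries in the ranking. One possible way to compare original variables is to sum the importances of their dummy columns, as sketched below on synthetic data. The helper `original_feature` and the toy data are hypothetical, and summing impurity-based importances is a convenience for reading the ranking rather than an established part of the notebook's method.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
n = 400

# Synthetic stand-in data; the target depends on temperature only
toy = pd.DataFrame({
    "Season": rng.choice(["Summer", "Winter", "Autumn"], size=n),
    "Air temperature (C)": rng.normal(24, 3, size=n),
    "Clo": rng.uniform(0.3, 1.2, size=n),
})
toy_target = (toy["Air temperature (C)"] > 24).astype(int)

toy_dummies = pd.get_dummies(toy)
forest_demo = RandomForestClassifier(n_estimators=50, random_state=2)
forest_demo.fit(toy_dummies, toy_target)

toy_importances = pd.Series(forest_demo.feature_importances_, index=toy_dummies.columns)

def original_feature(dummy_column, original_columns):
    """Map a get_dummies column name back to the column it came from."""
    for col in original_columns:
        if dummy_column == col or dummy_column.startswith(col + "_"):
            return col
    return dummy_column

# Group by the original column name and add up the dummy-column importances
grouped = (toy_importances
           .groupby(lambda c: original_feature(c, toy.columns))
           .sum()
           .sort_values(ascending=False))
print(grouped)
```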

## Plot Feature Importance

We can also plot the importance of the top 15 features in a horizontal bar chart to get a better visual sense.

In [None]:
# Plot the feature importances of the forest
plt.figure(figsize=(15,6))
plt.title("Feature Importances")
plt.barh(range(15), importances[indices][:15], align="center")
plt.yticks(range(15), features_withdummies.columns[indices][:15])
plt.gca().invert_yaxis()
plt.tight_layout(pad=0.4)
plt.show()


## Classification Confusion Matrix Visualization

A confusion matrix is a visualization that helps a user understand which classes are being misclassified.

In this case we will look at absolute numbers of misclassifications and a normalized version of misclassification.

In [None]:
def plot_confusion_matrix(cm, categories, title='Confusion matrix', cmap='Reds'):
    # Draw the matrix as a heat map with the class labels on both axes
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(categories))
    plt.xticks(tick_marks, categories, rotation=90)
    plt.yticks(tick_marks, categories)
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()

In [None]:
# Compute confusion matrix: http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
sns.set(font_scale=1.4)
cm = confusion_matrix(y_true, y_pred)
np.set_printoptions(precision=2)
print('Confusion matrix, without normalization')
print(cm)
plt.figure(figsize=(12,10))
plot_confusion_matrix(cm, categories)

In [None]:
# Normalize the confusion matrix by row (i.e., by the number of samples
# in each true class) so that every row sums to 1
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print('Normalized confusion matrix')
print(cm_normalized)
plt.figure(figsize=(12,10))
plot_confusion_matrix(cm_normalized, categories, title='Normalized Classification Error Matrix')
plt.show()
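The row normalization can be checked on a small example. In the sketch below (made-up labels, not ASHRAE data), each row of the normalized matrix sums to 1 and the diagonal holds the recall of each class. Recent scikit-learn versions can also do this directly with the `normalize="true"` argument of `confusion_matrix`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for three classes
y_true_toy = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred_toy = [0, 0, 1, 2, 1, 1, 2, 2, 2, 0]

# Rows are true classes, columns are predicted classes
cm_toy = confusion_matrix(y_true_toy, y_pred_toy)
print(cm_toy)

# Divide each row by its total so that entries become fractions of the true class
cm_toy_normalized = cm_toy / cm_toy.sum(axis=1, keepdims=True)
print(cm_toy_normalized)

# The diagonal of the row-normalized matrix is the per-class recall
recall_per_class = np.diag(cm_toy_normalized)
print(recall_per_class)

# Equivalent built-in normalization
cm_builtin = confusion_matrix(y_true_toy, y_pred_toy, normalize="true")
```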