# Baseline Model

Before any machine learning, it is prudent to establish a baseline model against which trained models can be compared. If none of the trained models can beat this "naive" model, then either machine learning is not suitable for the predictive task or a different learning approach is needed. Our goal here is to create a *rules-based classifier* that can serve as a baseline for machine learning classifiers.

In [None]:
import operator

import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import matthews_corrcoef

from src.features.features_utils import convert_categoricals_to_numerical
from src.features.features_utils import convert_target_to_numerical
from src.models.metrics_utils import confusion_matrix_to_dataframe
from src.models.metrics_utils import print_matthews_corrcoef

## Reading in the Data

First, let's read in the training and validation features and targets. Let's also convert the categorical fields to a numerical form that is suitable for performance evaluation.

In [None]:
train_features = pd.read_csv('../data/processed/train-features.csv')
train_features = convert_categoricals_to_numerical(train_features)
train_features.head()

In [None]:
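# `squeeze=True` was removed from `read_csv` in pandas 2.0; squeeze the single column instead.
train_target = pd.read_csv('../data/processed/train-target.csv', index_col='full_name').squeeze('columns')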
train_target = convert_target_to_numerical(train_target)
train_target.head()

In [None]:
validation_features = pd.read_csv('../data/processed/validation-features.csv')
validation_features = convert_categoricals_to_numerical(validation_features)
validation_features.head()

In [None]:
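# `squeeze=True` was removed from `read_csv` in pandas 2.0; squeeze the single column instead.
validation_target = pd.read_csv('../data/processed/validation-target.csv',
                                index_col='full_name').squeeze('columns')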
validation_target = convert_target_to_numerical(validation_target)
validation_target.head()

## Performance Measure

Before building a baseline classifier, we first need to address how to compare and assess the quality of different classifiers. A **performance measure** is clearly needed. But which one? [Accuracy](https://en.wikipedia.org/wiki/Accuracy_and_precision) depends on the class distribution of the target, so it is not a suitable metric for this problem: there are many more non-laureates than laureates, and a classifier can score highly simply by favouring the majority class. In such situations accuracy can be very misleading.

The [Matthews Correlation Coefficient](https://en.wikipedia.org/wiki/Matthews_correlation_coefficient) (MCC) (also known as the [phi coefficient](https://en.wikipedia.org/wiki/Phi_coefficient)) is a suitable performance measure that can be used when there is a class imbalance. It is widely regarded as a balanced measure of binary classification performance. [Predicting Protein-Protein Interaction by the Mirrortree Method Possibilities and Limitations](https://www.researchgate.net/publication/259354929_Predicting_Protein-Protein_Interaction_by_the_Mirrortree_Method_Possibilities_and_Limitations) says that "MCC is a more robust measure of effectiveness of binary classification methods than such measures as precision, recall, and F-measure because it takes into account in a balanced way of all four factors contributing to the effectiveness; true positives, false positives, true negatives and false negatives". The MCC can be calculated directly from the [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix) using the formula:

\begin{equation}
MCC = \frac{TP \times TN - FP \times FN}{{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}}
\end{equation}

where TP is the number of [true positives](https://en.wikipedia.org/wiki/True_positive), TN the number of [true negatives](https://en.wikipedia.org/wiki/True_negative), FP the number of [false positives](https://en.wikipedia.org/wiki/False_positive) and FN the number of [false negatives](https://en.wikipedia.org/wiki/False_negative). If any of the four sums in the denominator is zero, the denominator can be arbitrarily set to one; this results in a Matthews correlation coefficient of zero, which can be shown to be the correct limiting value.
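
As a minimal sketch of the formula and the zero-denominator convention described above, the function below computes the MCC directly from the four counts. The function name `mcc_from_counts` and the example counts are hypothetical and are not taken from the laureate data.

```python
import math


def mcc_from_counts(tp, tn, fp, fn):
    """Compute the MCC from the four confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # If any of the four sums is zero, set the denominator to one,
    # which yields an MCC of zero.
    if denominator == 0:
        denominator = 1.0
    return numerator / denominator


# Hypothetical counts for a classifier that finds some of the positives.
some_skill = mcc_from_counts(tp=40, tn=900, fp=30, fn=30)

# Hypothetical counts for a classifier that never predicts the positive class,
# so TP + FP is zero.
always_negative = mcc_from_counts(tp=0, tn=930, fp=0, fn=70)

print(round(some_skill, 3), always_negative)
```

The second call illustrates why a classifier that only ever predicts one class ends up with an MCC of zero, however many of its predictions happen to be correct.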

The MCC is the [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) between the observed and predicted binary classifications. It has an upper limit of +1, indicating a perfect prediction, and a lower limit of -1, indicating total disagreement between prediction and observation. A value of 0 indicates a prediction that is no better than random.
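
The equivalence with the Pearson correlation can be checked numerically. The sketch below uses a small made-up pair of label vectors (not the laureate data) and two hypothetical helper functions, `pearson` and `mcc`, written with only the standard library.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(variance_x * variance_y)


def mcc(observed, predicted):
    """MCC from 0/1 labels; assumes no sum in the denominator is zero."""
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o == 1 and p == 1)
    tn = sum(1 for o, p in pairs if o == 0 and p == 0)
    fp = sum(1 for o, p in pairs if o == 0 and p == 1)
    fn = sum(1 for o, p in pairs if o == 1 and p == 0)
    return (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))


# Made-up labels: 3 positives and 7 negatives, with one miss and one false alarm.
observed = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

pearson_value = pearson(observed, predicted)
mcc_value = mcc(observed, predicted)
print(math.isclose(pearson_value, mcc_value))
```

On the same 0/1 labels, `sklearn.metrics.matthews_corrcoef` should agree with both values.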

## Baseline Classifier

How should we go about creating this baseline classifier? One idea is a classifier that always predicts the majority class. Let's go ahead and look at the MCC and confusion matrix for such a classifier on the training and validation data.

In [None]:
majority = DummyClassifier(strategy='most_frequent')
majority.fit(train_features, train_target)
majority_train_predict = majority.predict(train_features)
majority_train_predict_mcc = matthews_corrcoef(y_true=train_target, y_pred=majority_train_predict)
print_matthews_corrcoef(majority_train_predict_mcc, 'Majority class classifier', data_label='train')
majority_confusion_matrix_train = confusion_matrix_to_dataframe(
    confusion_matrix(y_true=train_target, y_pred=majority_train_predict))
majority_confusion_matrix_train

In [None]:
majority_validation_predict = majority.predict(validation_features)
majority_validation_predict_mcc = matthews_corrcoef(y_true=validation_target,
                                                    y_pred=majority_validation_predict)
print_matthews_corrcoef(majority_validation_predict_mcc, 'Majority class classifier',
                        data_label='validation')
index = ['Observed non-laureate', 'Observed laureate']
columns = ['Predicted non-laureate', 'Predicted laureate']
majority_confusion_matrix_validation = confusion_matrix_to_dataframe(
    confusion_matrix(y_true=validation_target, y_pred=majority_validation_predict), index=index,
    columns=columns)
majority_confusion_matrix_validation

We can see that a classifier which always predicts the negative class is no better than random guessing and is therefore completely useless. The runtime warning makes this plain: it is raised because the sum of TP and FP is zero. Note that if we had instead used accuracy as the performance measure, we would have been misled into believing that this is a reasonably good classifier!

In [None]:
print('Majority class classifier accuracy (train):',
      round(accuracy_score(y_true=train_target, y_pred=majority_train_predict), 2))
print('Majority class classifier accuracy (validation):',
      round(accuracy_score(y_true=validation_target,
                           y_pred=majority_validation_predict), 2))

Surely we can do better than this classifier. The function below takes a brute-force approach to creating a baseline classifier. It uses each predictor in turn as the prediction and reports the best one, as judged by MCC on the validation set.

In [None]:
def find_feature_with_highest_mcc(train_features, train_target, validation_features, validation_target):
    """Find the feature with the highest Matthews Correlation Coefficient (MCC)
    on the validation set.

    Prints the feature, its MCC values on the training and validation sets, as
    well as the confusion matrices. The confusion matrices are labelled with
    the module-level `index` and `columns` lists defined above.

    Args:
        train_features (pandas.DataFrame): Training features.
        train_target (pandas.Series): Training target.
        validation_features (pandas.DataFrame): Validation features.
        validation_target (pandas.Series): Validation target.

    """

    validation_mccs = {}
    for feature in train_features.columns:
        validation_mccs[feature] = round(matthews_corrcoef(y_true=validation_target,
                                                           y_pred=validation_features[feature]), 2)
    highest_mcc = sorted(validation_mccs.items(), key=operator.itemgetter(1), reverse=True)[0]
    classifier_label = highest_mcc[0] + ' classifier'
    
    # Report the training MCC for the selected feature, not the last feature in the loop.
    print_matthews_corrcoef(round(matthews_corrcoef(y_true=train_target,
                                                    y_pred=train_features[highest_mcc[0]]), 2),
                            classifier_label, data_label='train')
    confusion_matrix_train = confusion_matrix_to_dataframe(
        confusion_matrix(y_true=train_target, y_pred=train_features[highest_mcc[0]]), index=index,
        columns=columns)
    display(confusion_matrix_train)
    
    print_matthews_corrcoef(highest_mcc[1], classifier_label, data_label='validation')
    confusion_matrix_validation = confusion_matrix_to_dataframe(
        confusion_matrix(y_true=validation_target, y_pred=validation_features[highest_mcc[0]]),
        index=index, columns=columns)
    display(confusion_matrix_validation)

In [None]:
find_feature_with_highest_mcc(train_features, train_target, validation_features, validation_target)

This classifier is not great, but it's much better than the previous one. The MCCs are low for the training and validation sets, but they are definitely better than chance-level performance. The confusion matrices show that this classifier is slightly better than 50-50 at identifying the positive class and quite good at identifying the negative class. The precision, recall and F1-score of each class confirm this.

In [None]:
print(classification_report(y_true=validation_target,
                            y_pred=validation_features.num_workplaces_at_least_2))
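
For reference, the per-class precision, recall and F1-score in a report like this follow directly from the confusion matrix counts. The sketch below uses hypothetical counts (not the values from this notebook) and a hypothetical helper name, `precision_recall_f1`.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1-score for the positive class."""
    precision = tp / (tp + fp)  # fraction of predicted positives that are correct
    recall = tp / (tp + fn)     # fraction of observed positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1


# Hypothetical counts for the positive class.
precision, recall, f1 = precision_recall_f1(tp=40, fp=30, fn=30)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```

The same function gives the figures for the negative class if the roles of the two classes are swapped, that is, with TN in place of TP and with FP and FN exchanged.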

This classifier is far from perfect, but it's not too bad for a "naive" rules-based classifier. It seems a reasonable benchmark against which to compare machine learning classifiers.