# Baseline Model

Before any machine learning, it is prudent to establish a baseline model against which trained models can be compared. If none of the trained models can beat this "naive" model, then either machine learning is not suitable for the predictive task or a different learning approach is needed. My goal here is to create a rules-based baseline classifier that can serve as a benchmark for machine learning classifiers.

In [None]:
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import matthews_corrcoef

from src.features.features_utils import convert_categoricals_to_numerical
from src.features.features_utils import convert_target_to_numerical
from src.models.metrics_utils import confusion_matrix_to_dataframe
from src.models.metrics_utils import print_matthews_corrcoef
from src.models.models_utils import baseline_model_predict

## Reading in the Data

First let's read in the training and validation features and target and convert the categorical fields to a numerical form that is suitable for building machine learning models.

In [None]:
train_features = pd.read_csv('../data/processed/train-features.csv')
train_features = convert_categoricals_to_numerical(train_features)
train_features.head()

In [None]:
# The squeeze keyword of read_csv was removed in pandas 2.0, so squeeze the frame instead
train_target = pd.read_csv('../data/processed/train-target.csv').squeeze('columns')
train_target = convert_target_to_numerical(train_target)
train_target.head()

In [None]:
validation_features = pd.read_csv('../data/processed/validation-features.csv')
validation_features = convert_categoricals_to_numerical(validation_features)
validation_features.head()

In [None]:
# The squeeze keyword of read_csv was removed in pandas 2.0, so squeeze the frame instead
validation_target = pd.read_csv('../data/processed/validation-target.csv').squeeze('columns')
validation_target = convert_target_to_numerical(validation_target)
validation_target.head()

## Performance Measure

Before building a baseline classifier, I first need to address the issue of how to compare and assess the quality of different classifiers. A *performance measure* is clearly needed. But which one? [Accuracy](https://en.wikipedia.org/wiki/Accuracy_and_precision) depends on the class distribution of the target, so it is not a suitable metric for this problem, as there are many more non-laureates than laureates. In situations such as this, accuracy can be very misleading.

The [Matthews Correlation Coefficient](https://en.wikipedia.org/wiki/Matthews_correlation_coefficient) (MCC) is a suitable performance measure as it is generally regarded as a balanced measure of binary classification performance that can be used when there is a class imbalance. [Predicting Protein-Protein Interaction by the Mirrortree Method Possibilities and Limitations](https://www.researchgate.net/publication/259354929_Predicting_Protein-Protein_Interaction_by_the_Mirrortree_Method_Possibilities_and_Limitations) says that "MCC is a more robust measure of effectiveness of binary classification methods than such measures as precision, recall, and F-measure because it takes into account in a balanced way of all four factors contributing to the effectiveness; true positives, false positives, true negatives and false negatives". The MCC can be calculated directly from the [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix) using the formula:

\begin{equation}
MCC = \frac{TP \times TN - FP \times FN}{{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}}
\end{equation}

where TP is the number of [true positives](https://en.wikipedia.org/wiki/True_positive), TN the number of [true negatives](https://en.wikipedia.org/wiki/True_negative), FP the number of [false positives](https://en.wikipedia.org/wiki/False_positive) and FN the number of [false negatives](https://en.wikipedia.org/wiki/False_negative). If any of the four sums in the denominator is zero, the denominator can be arbitrarily set to one; this results in a Matthews correlation coefficient of zero, which can be shown to be the correct limiting value.
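
As a quick sanity check of the formula, here is a minimal sketch that computes the MCC by hand from the four counts and compares it with `matthews_corrcoef` from scikit-learn. The helper `mcc_from_counts` and the labels are made up for illustration and are not part of this project.

```python
import math

from sklearn.metrics import confusion_matrix, matthews_corrcoef


def mcc_from_counts(tp, tn, fp, fn):
    """Compute the MCC from the four confusion matrix counts (illustrative helper)."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention described in the text: a zero denominator is set to one,
    # which results in an MCC of zero.
    if denominator == 0:
        denominator = 1.0
    return numerator / denominator


# Made-up labels: 1 = laureate, 0 = non-laureate.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = (int(count) for count in confusion_matrix(y_true, y_pred).ravel())

manual_mcc = mcc_from_counts(tp, tn, fp, fn)
sklearn_mcc = matthews_corrcoef(y_true, y_pred)
print('Manual MCC:', round(manual_mcc, 4))
print('scikit-learn MCC:', round(sklearn_mcc, 4))
```

The two values should agree up to floating point error, since both are derived from the same four counts.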

The MCC is the [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) between the observed and predicted binary classifications. It has an upper limit of +1 indicating a perfect prediction, a lower limit of -1 indicating total disagreement between prediction and observation, and a mid value of 0 indicating a prediction that is no better than random.
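
The equivalence with the Pearson correlation coefficient, and the two limits, can be checked numerically. The sketch below uses made-up binary labels and compares `np.corrcoef` with `matthews_corrcoef`; it is only an illustration, not part of the analysis.

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

# Made-up labels: 1 = laureate, 0 = non-laureate.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# Pearson correlation between the observed and predicted 0/1 vectors.
pearson = np.corrcoef(y_true, y_pred)[0, 1]
mcc = matthews_corrcoef(y_true, y_pred)
print('Pearson:', round(pearson, 4), 'MCC:', round(mcc, 4))

# Limits: identical predictions versus fully inverted predictions.
perfect_mcc = matthews_corrcoef(y_true, y_true)
inverted_mcc = matthews_corrcoef(y_true, 1 - y_true)
print('Perfect:', perfect_mcc, 'Inverted:', inverted_mcc)
```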

## Baseline Classifier

How should I go about creating this baseline classifier? Well, my first idea is a classifier that always predicts the majority class. Let's go ahead and look at the MCC and confusion matrix for such a classifier on the training and validation data.

In [None]:
def majority_class_predict(target):
    """Predict the most frequent class in target for every sample."""
    majority = target.value_counts().idxmax()
    predict = np.full(len(target), majority)
    return pd.Series(predict)

In [None]:
majority_train_predict = majority_class_predict(train_target)
majority_train_predict_mcc = matthews_corrcoef(y_true=train_target,
                                               y_pred=majority_train_predict)
print_matthews_corrcoef(majority_train_predict_mcc, 'Majority class classifier',
                        data_label='train')
majority_confusion_matrix_train = confusion_matrix_to_dataframe(
    confusion_matrix(y_true=train_target, y_pred=majority_train_predict))
majority_confusion_matrix_train

In [None]:
majority_validation_predict = majority_class_predict(validation_target)
majority_validation_predict_mcc = matthews_corrcoef(y_true=validation_target,
                                                    y_pred=majority_validation_predict)
print_matthews_corrcoef(majority_validation_predict_mcc, 'Majority class classifier',
                        data_label='validation')
index = ['Observed non-laureate', 'Observed laureate']
columns = ['Predicted non-laureate', 'Predicted laureate']
majority_confusion_matrix_validation = confusion_matrix_to_dataframe(
    confusion_matrix(y_true=validation_target, y_pred=majority_validation_predict),
    index=index, columns=columns)
majority_confusion_matrix_validation

We can see that a classifier which always predicts the negative class is equivalent to random guessing and is therefore completely useless. The runtime warning makes this plain: it is raised because the sum of TP and FP is zero. Note that if I had instead used accuracy as the performance measure, I would have been completely misled into believing that this was a reasonable classifier!

In [None]:
print('Majority class classifier accuracy (train):',
      round(accuracy_score(y_true=train_target, y_pred=majority_train_predict), 2))
print('Majority class classifier accuracy (validation):',
      round(accuracy_score(y_true=validation_target,
                           y_pred=majority_validation_predict), 2))

Surely I can do better than this classifier. If you recall, during the [exploratory data analysis](4.0-exploratory-data-analysis.ipynb), I saw that being an [experimental physicist](https://htmlpreview.github.io/?https://github.com/covuworie/nobel-physics-prizes/blob/master/nobel_physics_prizes/notebooks/html_output/4.0-exploratory-data-analysis.html) has a big positive effect on becoming a Nobel Laureate in Physics. So let's try to leverage this information now by creating a classifier that predicts a physicist is a laureate only when they are an experimental physicist.

In [None]:
experimental_physicist_train_predict_mcc = matthews_corrcoef(
    y_true=train_target, y_pred=train_features.is_experimental_physicist)
print_matthews_corrcoef(experimental_physicist_train_predict_mcc,
                        'Experimental physicist classifier', data_label='train')
experimental_physicist_confusion_matrix_train = confusion_matrix_to_dataframe(
    confusion_matrix(y_true=train_target, y_pred=train_features.is_experimental_physicist),
    index=index, columns=columns)
experimental_physicist_confusion_matrix_train

In [None]:
experimental_physicist_validation_predict_mcc = matthews_corrcoef(
    y_true=validation_target, y_pred=validation_features.is_experimental_physicist)
print_matthews_corrcoef(experimental_physicist_validation_predict_mcc,
                        'Experimental physicist classifier', data_label='validation')
experimental_physicist_confusion_matrix_validation = confusion_matrix_to_dataframe(
    confusion_matrix(y_true=validation_target,
                     y_pred=validation_features.is_experimental_physicist),
    index=index, columns=columns)
experimental_physicist_confusion_matrix_validation

This classifier is better than the previous one. An MCC of 0.59 is a moderate correlation for the training set, while an MCC of 0.24 is a low correlation for the validation set. Examination of the confusion matrices shows that this classifier is excellent at identifying the negative class, but not very good at identifying the positive class. In fact, on the validation set, it is appalling at predicting the laureates. As such, it is also useless as a classifier.

Also, recall that during the exploratory data analysis, I saw that the ratio of the number of alma mater and the ratio of the number of workplaces appear to be reasonable predictors of the target, as the median values are consistently higher for laureates than for non-laureates. It seems reasonable to create a classifier that predicts a physicist is a laureate whenever the value of either of these variables is above the lower quartile for laureates. This is a value of approximately 0.8 for the ratio of the number of alma mater and 0 for the ratio of the number of workplaces. OK, let's go ahead and look at the performance of this classifier.
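
To make the rule concrete, here is a rough sketch of how such a classifier could look. It is not the implementation of `baseline_model_predict`: the function name `rule_based_predict`, the column names `alma_mater_ratio` and `workplaces_ratio`, and the use of a strict "greater than" comparison are all assumptions made for illustration.

```python
import pandas as pd

# Approximate lower quartiles for laureates quoted in the text.
ALMA_MATER_RATIO_THRESHOLD = 0.8
WORKPLACES_RATIO_THRESHOLD = 0.0


def rule_based_predict(features):
    """Predict laureate (1) when either ratio exceeds its threshold, else 0.

    The column names are hypothetical and may differ from the real features.
    """
    above_alma_mater = features['alma_mater_ratio'] > ALMA_MATER_RATIO_THRESHOLD
    above_workplaces = features['workplaces_ratio'] > WORKPLACES_RATIO_THRESHOLD
    return (above_alma_mater | above_workplaces).astype(int)


# Made-up feature values for four physicists.
toy_features = pd.DataFrame({
    'alma_mater_ratio': [0.9, 0.5, 0.8, 0.0],
    'workplaces_ratio': [0.0, 0.1, 0.0, 0.0],
})
toy_predictions = rule_based_predict(toy_features)
print(toy_predictions.tolist())
```

Because the two conditions are combined with a logical "or", a rule like this leans towards predicting the positive class, trading precision for recall.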

In [None]:
baseline_train_predict = baseline_model_predict(train_features)
baseline_train_predict_mcc = matthews_corrcoef(
    y_true=train_target, y_pred=baseline_train_predict)
print_matthews_corrcoef(baseline_train_predict_mcc, 'Baseline classifier',
                        data_label='train')
baseline_confusion_matrix_train = confusion_matrix_to_dataframe(
    confusion_matrix(y_true=train_target, y_pred=baseline_train_predict))
baseline_confusion_matrix_train

In [None]:
baseline_validation_predict = baseline_model_predict(validation_features)
baseline_validation_predict_mcc = matthews_corrcoef(
    y_true=validation_target, y_pred=baseline_validation_predict)
print_matthews_corrcoef(baseline_validation_predict_mcc, 'Baseline classifier',
                        data_label='validation')
baseline_confusion_matrix_validation = confusion_matrix_to_dataframe(
    confusion_matrix(y_true=validation_target, y_pred=baseline_validation_predict))
baseline_confusion_matrix_validation

This classifier is better than the previous two. Although the MCCs are weak for the training and validation sets, examination of the confusion matrices shows that this classifier is excellent at identifying the positive class. In other words, it has excellent recall on the positive class, although its precision has plenty of scope for improvement.
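
As a reminder of what these two measures capture, recall is TP / (TP + FN) and precision is TP / (TP + FP). The sketch below uses made-up labels, not the project data, to show how a classifier can have perfect recall on the positive class while its precision and MCC remain modest.

```python
from sklearn.metrics import matthews_corrcoef, precision_score, recall_score

# Made-up labels: 1 = laureate, 0 = non-laureate.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
# An over-eager classifier: it finds every laureate but also flags
# four of the six non-laureates as laureates.
y_pred = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]

recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
mcc = matthews_corrcoef(y_true, y_pred)

print('Recall:', recall)
print('Precision:', precision)
print('MCC:', round(mcc, 4))
```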

In [None]:
print(classification_report(y_true=validation_target, y_pred=baseline_validation_predict))

This classifier is far from perfect, but it's not too bad for a "naive" rules-based classifier. It seems like a reasonable benchmark against which to compare machine learning classifiers.