# Predict Model

In [None]:
import ast

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import matthews_corrcoef

from src.features.features_utils import convert_categoricals_to_numerical
from src.features.features_utils import convert_target_to_numerical
from src.models.metrics_utils import confusion_matrix_to_dataframe
from src.models.metrics_utils import print_matthews_corrcoef
from src.visualization.visualization_utils import plot_logistic_regression_odds_ratio

## Reading in the Data

First let's read in the classifier parameters and metadata that we saved so that we can reconstruct the classifier.

In [None]:
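# The `squeeze` keyword of `read_csv` was removed in pandas 2.0, so squeeze the single column afterwards.
classifier_params = pd.read_csv('../models/LR.csv', index_col=0).squeeze('columns')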
classifier_params

Next let's read in the training, validation and test features and targets. We make sure to convert the categorical fields to a numerical form that is suitable for building machine learning models.

In [None]:
train_features = pd.read_csv('../data/processed/train-features.csv')
X_train = convert_categoricals_to_numerical(train_features)
X_train.head()

In [None]:
train_target = pd.read_csv('../data/processed/train-target.csv', index_col='full_name').squeeze('columns')
y_train = convert_target_to_numerical(train_target)
y_train.head()

In [None]:
validation_features = pd.read_csv('../data/processed/validation-features.csv')
X_validation = convert_categoricals_to_numerical(validation_features)
X_validation.head()

In [None]:
validation_target = pd.read_csv('../data/processed/validation-target.csv',
                                index_col='full_name').squeeze('columns')
y_validation = convert_target_to_numerical(validation_target)
y_validation.head()

In [None]:
test_features = pd.read_csv('../data/processed/test-features.csv')
X_test = convert_categoricals_to_numerical(test_features)
X_test.head()

In [None]:
test_target = pd.read_csv('../data/processed/test-target.csv', index_col='full_name').squeeze('columns')
y_test = convert_target_to_numerical(test_target)
y_test.head()

## Retraining on the Training and Validation Data

It makes sense to retrain the model on both the training and validation data so that we can obtain as good a predictive performance as possible. So let's combine the training and validation features and targets, reconstruct the classifier and retrain the model.

In [None]:
# `DataFrame.append` was removed in pandas 2.0; `pd.concat` stacks the rows in the same way.
X_train_validation = pd.concat([X_train, X_validation])
assert len(X_train_validation) == len(X_train) + len(X_validation)
X_train_validation.head()

In [None]:
# `Series.append` was removed in pandas 2.0; `pd.concat` stacks the values in the same way.
y_train_validation = pd.concat([y_train, y_validation])
assert len(y_train_validation) == len(y_train) + len(y_validation)
y_train_validation.head()

In [None]:
classifier = LogisticRegression(**ast.literal_eval(classifier_params.params))
classifier.fit(X_train_validation, y_train_validation)

## Predicting on the Test Data

Here comes the moment of truth! We will soon see just how good the model is by predicting on an unseen dataset, the test data. However, first it makes sense to look once again at the performance of our "naive" [baseline model](5.0-baseline-model.ipynb) on the test data. Recall that this is a model that predicts the physicist is a laureate whenever the number of workplaces is at least 2.

In [None]:
y_train_pred = X_train_validation.num_workplaces_at_least_2
y_test_pred = X_test.num_workplaces_at_least_2
mcc_train_validation = matthews_corrcoef(y_train_validation, y_train_pred)
mcc_test = matthews_corrcoef(y_test, y_test_pred)
name = 'Baseline Classifier'
print_matthews_corrcoef(mcc_train_validation, name, data_label='train + validation')
print_matthews_corrcoef(mcc_test, name, data_label='test')

Unsurprisingly, this classifier exhibits very poor performance on the test data. The gap between the test and train + validation MCCs is again evidence of covariate shift. Either physicists started working in more workplaces in general, or the records of where physicists have worked are better in modern times. The confusion matrix and classification report indicate that the classifier is terrible in terms of both precision and recall when identifying laureates.

In [None]:
display(confusion_matrix_to_dataframe(confusion_matrix(y_test, y_test_pred)))
print(classification_report(y_test, y_test_pred))

OK, let's see how our logistic regression model does on the test data.

In [None]:
y_train_pred = (classifier.predict_proba(X_train_validation)[:, 1] > ast.literal_eval(
    classifier_params.threshold)).astype('int64')
y_test_pred = (classifier.predict_proba(X_test)[:, 1] > ast.literal_eval(
    classifier_params.threshold)).astype('int64')
mcc_train_validation = matthews_corrcoef(y_train_validation, y_train_pred)
mcc_test = matthews_corrcoef(y_test, y_test_pred)
print_matthews_corrcoef(mcc_train_validation, classifier_params.name, data_label='train + validation')
print_matthews_corrcoef(mcc_test, classifier_params.name, data_label='test')

This classifier performs much better on the test data than the baseline classifier. Note, however, that comparing it to the baseline does not tell us whether it is a good or bad classifier in absolute terms. There is very little in the literature, even as a rule of thumb, saying what MCC to expect from a "well-performing classifier", as this depends heavily on the context and usage. As I said before, predicting Physics Nobel Laureates is a difficult task due to the many complex factors involved, so we certainly should not expect stellar performance from *any* classifier. This includes both machine classifiers, whether machine-learning-based or rules-based, and human classifiers without inside knowledge. However, let me try to get off the fence just a little now.

The MCC is a contingency matrix method of calculating the Pearson product-moment correlation coefficient and so it has the [same interpretation](https://stats.stackexchange.com/questions/118219/how-to-interpret-matthews-correlation-coefficient-mcc). If the values in the link are to be believed, then our classifier has a "moderate positive relationship" with the target. This [statistical guide](https://statistics.laerd.com/statistical-guides/pearson-correlation-coefficient-statistical-guide.php) also seems to agree with this assessment. However, we can easily find examples that indicate there is a [low positive correlation](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3576830/) or a [weak uphill (positive) linear relationship](https://www.dummies.com/education/math/statistics/how-to-interpret-a-correlation-coefficient-r/) between the classifier's predictions and the target.
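
The equivalence between the MCC and the Pearson correlation of two binary variables can be checked numerically. The following is a minimal sketch in plain Python; the labels `y_true` and `y_pred` and the helper functions are hypothetical illustrations, not the notebook's data or utilities.

```python
import math


def mcc_from_counts(tp, fp, fn, tn):
    """Matthews correlation coefficient computed from confusion matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the MCC is taken to be 0 when any marginal total is 0.
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


def pearson(x, y):
    """Pearson product-moment correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    variance_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(variance_x * variance_y)


# Hypothetical binary labels (1 = laureate, 0 = non-laureate), for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1, 0, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)
fp = sum(t == 0 and p == 1 for t, p in pairs)
fn = sum(t == 1 and p == 0 for t, p in pairs)
tn = sum(t == 0 and p == 0 for t, p in pairs)

mcc = mcc_from_counts(tp, fp, fn, tn)
r = pearson(y_true, y_pred)
print('MCC = {:.4f}, Pearson r = {:.4f}'.format(mcc, r))
```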

So should we conclude that the classifier has low or moderate performance? Asking this question misses the purpose of this study. Instead we should ask: based on the classifier's performance, would we be willing to make recommendations to the Nobel Committee about any biases that may be present when deciding Physics Laureates? We can see from the confusion matrix and classification report that although this classifier has reasonable recall of laureates, it is contaminated by too many false positives. In other words, it is not precise enough. As a result, the answer to the question is very likely no.
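
To make the distinction between recall and precision concrete, here is a small sketch of how both are derived from confusion matrix counts. The counts are hypothetical and chosen only to show how reasonable recall can coexist with low precision; they are not this classifier's results.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall for the positive (laureate) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts: most true laureates are found (few false negatives),
# but many non-laureates are also flagged (many false positives).
tp, fp, fn = 8, 12, 2
precision, recall = precision_recall(tp, fp, fn)
print('precision = {:.2f}, recall = {:.2f}'.format(precision, recall))
```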

In [None]:
display(confusion_matrix_to_dataframe(confusion_matrix(y_test, y_test_pred)))
print(classification_report(y_test, y_test_pred))

## Most Important Features

Out of interest, let's determine the features that are most important to the prediction by looking at the coefficients of the logistic regression model. Each coefficient represents the impact that the presence vs. absence of a predictor has on the log odds of a physicist being a Nobel Laureate. The odds ratio for each predictor can simply be computed by exponentiating its associated coefficient. The top fifteen most important features are plotted in the chart below.
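
As a rough illustration of the exponentiation step, the sketch below converts a few coefficients into odds ratios and ranks them. The feature names and coefficient values are hypothetical, and ranking by absolute coefficient is an assumption about what "most important" means here, not necessarily what `plot_logistic_regression_odds_ratio` does internally.

```python
import math

# Hypothetical coefficients (changes in log odds), for illustration only.
coefficients = {
    'feature_a': 1.2,
    'feature_b': -0.7,
    'feature_c': 0.05,
    'feature_d': 0.0,
}

# Exponentiating a coefficient gives the odds ratio for presence vs. absence of the
# predictor: above 1 raises the odds, below 1 lowers them, exactly 1 has no effect.
odds_ratios = {name: math.exp(value) for name, value in coefficients.items()}

# Assumed ranking rule: largest absolute coefficient first.
top_n = 2
ranked = sorted(coefficients, key=lambda name: abs(coefficients[name]), reverse=True)[:top_n]

for name in ranked:
    print('{}: odds ratio = {:.2f}'.format(name, odds_ratios[name]))
```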

In [None]:
top_n = 15
ax = plot_logistic_regression_odds_ratio(classifier.coef_, top_n=top_n, columns=X_train_validation.columns,
    title='Top {} most important features in prediction of Physics Nobel Laureates'.format(top_n))
ax.figure.set_size_inches(10, 8)

By far the most important feature is being an experimental physicist. This matches what we observed during the [exploratory data analysis](4.0-exploratory-data-analysis.ipynb). Next comes having at least one physics laureate doctoral student and then living for 65-79 years. Some of the other interesting top features we see are being a citizen of France or Switzerland, working at [Bell Labs](https://en.wikipedia.org/wiki/Bell_Labs#Discoveries_and_developments) or [The University of Cambridge](https://en.wikipedia.org/wiki/List_of_Nobel_laureates_by_university_affiliation#University_of_Cambridge_(2nd)), being an alumnus in Asia and having at least two alma maters.