# Train Model

It is finally time to use the features we have created to train some classifiers to predict *Physics Nobel Laureates*. We will be training the following classifiers:
- [Logistic Regression](https://en.wikipedia.org/wiki/Logistic_regression)
- [Support Vector Machine](https://en.wikipedia.org/wiki/Support_vector_machine)
- [Random Forest](https://en.wikipedia.org/wiki/Random_forest)

These classifiers are chosen due to their simplicity, appropriateness for the input features and target data types, ease of interpretability and their well-established performance on a range of classification tasks.

We will be selecting hyperparameters and evaluating the performance of the models on the validation set. This will be done in a principled manner using a couple of performance measures. The first of these, the [Matthews Correlation Coefficient (MCC)](https://en.wikipedia.org/wiki/Matthews_correlation_coefficient), you have already seen when we created the [baseline model](5.0-baseline-model.ipynb). We will discuss the second measure in detail, and explain why it is needed, when we get to that point. OK let's get going!

In [None]:
import ast
import operator
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import ParameterGrid
from sklearn.model_selection import ParameterSampler
from sklearn.svm import SVC

from src.data.progress_bar import progress_bar
from src.features.features_utils import convert_categoricals_to_numerical
from src.features.features_utils import convert_target_to_numerical
from src.models.metrics_utils import confusion_matrix_to_dataframe
from src.models.metrics_utils import mcc_auc_score
from src.models.metrics_utils import mcc_curve

%matplotlib inline

## Reading in the Data

First let's read in both sets of training and validation features and targets as well as the sample weights we created for covariate shift adaptation. We make sure to convert the categorical fields to a numerical form that is suitable for building machine learning models.

In [None]:
train_features = pd.read_csv('../data/processed/train-features.csv')
X_train = convert_categoricals_to_numerical(train_features)
X_train.head()

In [None]:
sample_weights = pd.read_csv('../models/train-features-sample-weights.csv')
sample_weights.head()

In [None]:
train_features_topics = pd.read_csv('../data/processed/train-features-topics.csv')
X_train_topics = convert_categoricals_to_numerical(train_features_topics)
X_train_topics.head()

In [None]:
sample_weights_topics = pd.read_csv('../models/train-features-topics-sample-weights.csv')
sample_weights_topics.head()

In [None]:
train_target = pd.read_csv('../data/processed/train-target.csv', index_col='full_name').squeeze('columns')
y_train = convert_target_to_numerical(train_target)
y_train.head()

In [None]:
validation_features = pd.read_csv('../data/processed/validation-features.csv')
X_validation = convert_categoricals_to_numerical(validation_features)
X_validation.head()

In [None]:
validation_features_topics = pd.read_csv('../data/processed/validation-features-topics.csv')
X_validation_topics = convert_categoricals_to_numerical(validation_features_topics)
X_validation_topics.head()

In [None]:
validation_target = pd.read_csv('../data/processed/validation-target.csv',
                                index_col='full_name').squeeze('columns')
y_validation = convert_target_to_numerical(validation_target)
y_validation.head()

## Hyperparameter Selection

The hyperparameters of the models that we will be fitting are critical to their predictive performance. We will use an exhaustive grid search to select them in a principled manner. The optimal hyperparameter values will be chosen according to the set of values that maximize the Matthews Correlation Coefficient (MCC) on the validation set. The function below will be used to accomplish this task.

In [None]:
def evaluate_classifier(
    X_train, y_train, X_validation, y_validation, clf=LogisticRegression(),
    param_grid=ParameterGrid(dict(C=np.logspace(-5, 15, 21, base=2.0))), score_func=matthews_corrcoef,
    greater_score_is_better=True, solver='lbfgs', sample_weight=None, max_iter=1000,
    random_state=None, n_jobs=None, name='classifier', progress_bar=None):
    
    """Evaluate an `sklearn` classifier over a parameter grid using a scoring function. 
    
    Args:
        X_train ({array-like, sparse matrix}, shape (n_samples, n_features)): Training features matrix, 
            where n_samples is the number of samples and n_features is the number of features.
        y_train (array-like, shape (n_samples,)): Training data target vector.
        X_validation ({array-like, sparse matrix}, shape (n_samples, n_features)): Validation features
            matrix, where n_samples is the number of samples and n_features is the number of features.
        y_validation (array-like, shape (n_samples,)): Validation data target vector.
        clf (sklearn.base.BaseEstimator, optional): Defaults to LogisticRegression(). Classifier.
        param_grid (sklearn.model_selection.ParameterGrid, optional): Defaults to 
            ParameterGrid(dict(C=np.logspace(-5, 15, 21, base=2.0))). Grid of parameters with a discrete
            number of values for each.
        score_func (callable, optional): Defaults to matthews_corrcoef. Score function (or loss function)
            with signature score_func(y, y_pred, **kwargs).
        greater_score_is_better (bool, optional): Defaults to True. Whether score_func is a score function
            (default), meaning high is good, or a loss function, meaning low is good. In the latter case,
            the parameter set with the lowest validation score is selected as the best.
        solver (str, optional): Defaults to 'lbfgs'. Algorithm to use in the optimization problem.
        sample_weight (array-like, shape (n_samples,), optional): Defaults to None. Array of weights
            that are assigned to individual samples. If not provided, then each sample is given unit
            weight.
        max_iter (int, optional): Defaults to 1000. Maximum number of iterations taken for the solvers
            to converge.
        random_state (int, RandomState instance or None, optional): Defaults to None. The seed of the
            pseudo random number generator to use when shuffling the data.
        n_jobs (int or None, optional): Defaults to None. Number of CPU cores used when parallelizing.
            None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
        name (str, optional): Defaults to 'classifier'. Name of classifier.
        progress_bar (progressbar.ProgressBar, optional): Defaults to None. Progress bar.
        
    Returns:
        dict: Results containing the name, best classifier estimator, best parameters, best score,
            training scores, and validation scores.
    
    Raises:
        NotImplementedError: If classifier is not LogisticRegression, SVC or RandomForestClassifier.

    """
    
    if progress_bar:
        progress_bar.start()

    train_scores = {}
    validation_scores = {}
    classifiers = {}
    num_iters = 0
    for params in param_grid:
        
        num_iters += 1
        if progress_bar:
            progress_bar.update(num_iters)
        
        # fit the model to training set
        if isinstance(clf, LogisticRegression):
            classifier = LogisticRegression(
                penalty=params.get('penalty', 'l2'), C=params.get('C', 1.0), solver=solver,
                random_state=random_state, class_weight=params.get('class_weight'), max_iter=max_iter)
        elif isinstance(clf, SVC):
            # 'scale', 100 and 'sqrt' are the current scikit-learn defaults; the placeholder values
            # 'auto_deprecated', 'warn' and 'auto' are no longer accepted by recent releases
            classifier = SVC(
                C=params.get('C', 1.0), kernel=params.get('kernel', 'rbf'),
                gamma=params.get('gamma', 'scale'), random_state=random_state,
                class_weight=params.get('class_weight'), max_iter=max_iter)
        elif isinstance(clf, RandomForestClassifier):
            classifier = RandomForestClassifier(
                n_estimators=params.get('n_estimators', 100),
                max_features=params.get('max_features', 'sqrt'),
                min_samples_leaf=params.get('min_samples_leaf', 1), random_state=random_state,
                class_weight=params.get('class_weight'), n_jobs=n_jobs)
        else:
            raise NotImplementedError
        classifier.fit(X_train, y_train, sample_weight=sample_weight)
        classifiers[str(params)] = classifier

        # predict on validation set and evaluate scores
        y_train_predict = classifier.predict(X_train)
        y_validation_predict = classifier.predict(X_validation)
        with warnings.catch_warnings():  # ignore runtime warnings caused by zero MCC
            warnings.filterwarnings('ignore', category=RuntimeWarning)
            train_scores[str(params)] = score_func(y_true=y_train, y_pred=y_train_predict)
            validation_scores[str(params)] = score_func(y_true=y_validation,
                                                        y_pred=y_validation_predict)
            
    if progress_bar:
        progress_bar.finish()
    
    # find the best scoring model
    sorted_validation_scores = sorted(
        validation_scores.items(), key=operator.itemgetter(1), reverse=greater_score_is_better)
    best_params = ast.literal_eval(sorted_validation_scores[0][0])
    best_score = sorted_validation_scores[0][1]
    best_classifier = classifiers[str(best_params)]
    
    # return results
    results = dict(name=name, best_classifier=best_classifier, best_params=best_params,
                   best_score=best_score, train_scores=train_scores, validation_scores=validation_scores) 
    return results


def print_best_classifier(results):
    """Print the best classifier.
    
    Args:
        results (dict): Results of classifier evaluation.
    """

    print(results['name'])
    print('Best params: ', results['best_params'])
    print('Training score: ', round(results['train_scores'][str(results['best_params'])], 3))
    print('Validation score: ', round(results['best_score'], 3))

It's now time to select the best parameters for the two feature sets with and without the sample weights.

### Logistic Regression (LR)

The hyperparameters to be selected for the logistic regression model are:
- The `penalty`, which specifies whether the $L1$ or $L2$ norm is used in the regularization. The former favors sparse solutions and naturally performs feature selection.
- `C`, the inverse of regularization strength. Smaller values specify stronger regularization.
- `class_weight`, the weights associated with the classes. It penalizes mistakes in samples of a class with its associated class_weight. So a higher value indicates more emphasis is put on a class.
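
To make the effect of the `penalty` concrete, here is a minimal sketch on synthetic data (the dataset from `make_classification` and the value `C=0.05` are assumptions chosen purely for illustration, they are not taken from the Nobel data). It fits the same model with both penalties and counts how many coefficients end up exactly zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumed for illustration): 20 features, only 3 of them informative.
X, y = make_classification(n_samples=300, n_features=20, n_informative=3, n_redundant=0,
                           random_state=0)

# Fit with the same, fairly strong, regularization for both penalties.
l1_model = LogisticRegression(penalty='l1', C=0.05, solver='liblinear', random_state=0).fit(X, y)
l2_model = LogisticRegression(penalty='l2', C=0.05, solver='liblinear', random_state=0).fit(X, y)

# Count the coefficients that were driven exactly to zero.
n_zero_l1 = int(np.sum(l1_model.coef_ == 0.0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0.0))
print('Zero coefficients with L1:', n_zero_l1)
print('Zero coefficients with L2:', n_zero_l2)
```

The $L1$ model should discard a number of the uninformative features outright, whereas the $L2$ model only shrinks their coefficients towards zero.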

Let's perform the grid search now.

In [None]:
penalty = ['l1', 'l2']
Cs = np.logspace(-5, 15, 21, base=2.0)
class_weight = ([{0: weight, 1: 1.0 - weight} for weight in np.linspace(0.0, 1.0, 21)] +
                 [{0: 1.0, 1: 1.0}] + ['balanced'])
param_grid = ParameterGrid(dict(penalty=penalty, C=Cs, class_weight=class_weight))

clf = LogisticRegression()
solver = 'liblinear'
bar = progress_bar(len(param_grid), banner_text_begin='Running: ', banner_text_end=' param sets')

In [None]:
logit_results = evaluate_classifier(
    X_train, y_train, X_validation, y_validation, clf=clf, param_grid=param_grid, solver=solver,
    random_state=0, name='LR', progress_bar=bar)
print_best_classifier(logit_results)

In [None]:
logit_results_weights = evaluate_classifier(
    X_train, y_train, X_validation, y_validation, clf=clf, param_grid=param_grid, solver=solver,
    sample_weight=sample_weights['weight'], random_state=1, name='LR + sample weights', progress_bar=bar)
print_best_classifier(logit_results_weights)

In [None]:
logit_results_topics = evaluate_classifier(
    X_train_topics, y_train, X_validation_topics, y_validation, clf=clf, param_grid=param_grid,
    solver=solver, random_state=2, name='LR (topics)', progress_bar=bar)
print_best_classifier(logit_results_topics)

In [None]:
logit_results_topics_weights = evaluate_classifier(
    X_train_topics, y_train, X_validation_topics, y_validation, clf=clf, param_grid=param_grid,
    solver=solver, sample_weight=sample_weights_topics['weight'], random_state=3,
    name='LR (topics) + sample weights', progress_bar=bar)
print_best_classifier(logit_results_topics_weights)

We can make the following observations about the results:
- Since none of the models selected uniform class weights, we can see that the choice of this hyperparameter is very important.
- Unsurprisingly, $L1$ regularization is chosen for the original features and $L2$ regularization for the topics features.
- Models fitted with the original features are overfitting and those with the topics features are underfitting (the validation MCCs are higher than the training MCCs).
- Applying strong regularization does not improve performance for the original features.

### Support Vector Machine (SVM)

The hyperparameters to be selected for the support vector machine model are:
- The regularization parameter `C` of the error term. This parameter trades off correct classification of training examples against maximization of the separating hyperplane's margin. For larger values of `C`, a smaller margin will be accepted if the separating hyperplane is better at classifying training points correctly. Lower values of `C` encourage a larger margin at the cost of misclassifying more training points.
- `class_weight`, as defined above for logistic regression.
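
The trade-off controlled by `C` can be sketched on a toy problem. The two overlapping clusters from `make_blobs` and the helper `margin_width` below are assumptions made for illustration only. For a linear kernel the margin width is $2 / \lVert w \rVert$, so we can compare it directly for a small and a large value of `C`.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters (synthetic data, assumed purely for illustration).
X, y = make_blobs(n_samples=100, centers=[[-1.0, -1.0], [1.0, 1.0]], cluster_std=1.5,
                  random_state=0)


def margin_width(C):
    """Fit a linear SVM (hypothetical helper) and return the width of its margin, 2 / ||w||."""
    model = SVC(kernel='linear', C=C).fit(X, y)
    return 2.0 / np.linalg.norm(model.coef_)


margin_small_C = margin_width(2.0 ** -5)
margin_large_C = margin_width(2.0 ** 5)
print('Margin width for C = 2^-5:', round(margin_small_C, 3))
print('Margin width for C = 2^5: ', round(margin_large_C, 3))
```

The smaller value of `C` should give the wider margin, at the cost of more training points falling inside it.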

Note that the `kernel` parameter, which is used to specify the kernel type to be used in the algorithm, will be fixed as the *linear* kernel. The reason for not considering the *RBF* or *poly* kernels is interpretability of the model. For the *RBF* and *poly* kernels, the [separating hyperplane and the weights that define it exist in a transformed space](https://stackoverflow.com/questions/21260691/how-to-obtain-features-weights) that is not directly related to the input feature space. OK let's perform the grid search now.

In [None]:
Cs = np.logspace(-5, 10, 16, base=2.0)
param_grid = ParameterGrid(dict(kernel=['linear'], C=Cs, class_weight=class_weight))

clf = SVC()
max_iter = -1
bar = progress_bar(len(param_grid), banner_text_begin='Running: ', banner_text_end=' param sets')

In [None]:
svm_results = evaluate_classifier(
    X_train, y_train, X_validation, y_validation, clf=clf, param_grid=param_grid, random_state=4,
    max_iter=max_iter, name='SVM', progress_bar=bar)
print_best_classifier(svm_results)

In [None]:
svm_results_weights = evaluate_classifier(
    X_train, y_train, X_validation, y_validation, clf=clf, param_grid=param_grid,
    sample_weight=sample_weights['weight'], random_state=5, max_iter=max_iter,
    name='SVM + sample weights', progress_bar=bar)
print_best_classifier(svm_results_weights)

In [None]:
svm_results_topics = evaluate_classifier(
    X_train_topics, y_train, X_validation_topics, y_validation, clf=clf, param_grid=param_grid,
    random_state=6, max_iter=max_iter, name='SVM (topics)', progress_bar=bar)
print_best_classifier(svm_results_topics)

In [None]:
svm_results_topics_weights = evaluate_classifier(
    X_train_topics, y_train, X_validation_topics, y_validation, clf=clf, param_grid=param_grid,
    sample_weight=sample_weights_topics['weight'], random_state=7, max_iter=max_iter,
    name='SVM (topics) + sample weights', progress_bar=bar)
print_best_classifier(svm_results_topics_weights)

We can make the following observations about the results:
- Once again, since none of the models selected uniform class weights, we can see that the choice of this hyperparameter is very important. Interestingly, the balanced class weight is chosen for the topics features.
- Again we see overfitting and underfitting of the models. However, for the classifiers fitted with the sample weights, the effect is far less pronounced than for the logistic regression models.
- The sample-weighted classifiers seem to favor larger values of `C` (smaller margin) whereas the unweighted ones favor smaller values of `C` (larger margin).

### Random Forest (RF)

There are a multitude of hyperparameters that can be selected for a random forest model. The hyperparameters control the randomness of the forest. The aim is to set them such that reasonable predictive power of individual trees is achieved without excessive correlation between the trees (bias-variance trade-off). Based on the advice of Probst et al. in [Hyperparameters and Tuning Strategies for Random Forest](https://arxiv.org/pdf/1804.03515.pdf), we will restrict ourselves to the following most influential hyperparameters:

- The number of features `max_features` to consider when looking for the best split. Lower values lead to more diverse and less correlated trees, which results in more stable aggregation of predictions. However, lower values also lead to the individual trees being weaker predictors of the target variable.
- The minimum number of samples required to be at a leaf node `min_samples_leaf`. Lower values lead to trees of larger depth, which means that more splits are performed until the terminal nodes. More splits means the model is more complex and this can potentially lead to overfitting.
- `class_weight`, as defined above for logistic regression.
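
As a quick check of the claim about `min_samples_leaf`, the sketch below grows two small forests on synthetic data and compares the average depth of their trees. The dataset, the forest size of 25 trees and the helper `mean_tree_depth` are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumed for illustration only).
X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)


def mean_tree_depth(min_samples_leaf):
    """Fit a small forest (hypothetical helper) and return the mean depth of its trees."""
    forest = RandomForestClassifier(n_estimators=25, min_samples_leaf=min_samples_leaf,
                                    random_state=0).fit(X, y)
    return np.mean([tree.tree_.max_depth for tree in forest.estimators_])


depth_leaf_1 = mean_tree_depth(1)
depth_leaf_64 = mean_tree_depth(64)
print('Mean tree depth with min_samples_leaf=1: ', depth_leaf_1)
print('Mean tree depth with min_samples_leaf=64:', depth_leaf_64)
```

With `min_samples_leaf=64` and 500 samples, no tree can have more than a handful of leaves, so the trees stay shallow.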

Note that the number of trees in the forest `n_estimators` will be fixed at 500. Following the recommendations of Probst et al., "the number of trees should be set high: the higher the number of trees, the better the results in terms of performance and precision of variable importances. However, the improvement obtained by adding trees diminishes as more and more trees are added." 500 trees in the forest seems like a reasonable compromise based on the stability of the solution and computational resources available.  

Let's perform the search now. Note that this takes a considerable amount of time due to the number of trees in the forest and the large hyperparameter space being searched. To reduce the computation time, we will perform a *random search* over one-quarter of the combinations in the hyperparameter space, sampled uniformly at random. Bergstra and Bengio show in [Random Search for Hyper-Parameter Optimization](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf) that, for neural networks, random search is more efficient at finding good hyperparameter specifications than grid search.

In [None]:
n_estimators = [500]
max_features = [0.01] + list(np.linspace(0.05, 0.95, 7)) + ['sqrt', None]
min_samples_leaf = np.logspace(0, 6, 7, base=2.0).astype('int64')

class_weight = ([{0: weight, 1: 1.0 - weight} for weight in np.linspace(0.0, 1.0, 11)] +
                 [{0: 1.0, 1: 1.0}] + ['balanced', 'balanced_subsample'])
param_grid = dict(n_estimators=n_estimators, max_features=max_features,
                  min_samples_leaf=min_samples_leaf, class_weight=class_weight)
param_grid_sampler = ParameterSampler(param_grid, n_iter=int(0.25 * len(list(ParameterGrid(param_grid)))),
                                      random_state=8)

clf = RandomForestClassifier()
n_jobs = -1
bar = progress_bar(len(param_grid_sampler), banner_text_begin='Running: ', banner_text_end=' param sets')

In [None]:
rf_results = evaluate_classifier(
    X_train, y_train, X_validation, y_validation, clf=clf, param_grid=param_grid_sampler, random_state=9,
    n_jobs=n_jobs, name='RF', progress_bar=bar)
print_best_classifier(rf_results)

In [None]:
rf_results_weights = evaluate_classifier(
    X_train, y_train, X_validation, y_validation, clf=clf, param_grid=param_grid_sampler,
    random_state=10, sample_weight=sample_weights['weight'], n_jobs=n_jobs, name='RF + sample weights',
    progress_bar=bar)
print_best_classifier(rf_results_weights)

In [None]:
rf_results_topics = evaluate_classifier(
    X_train_topics, y_train, X_validation_topics, y_validation, clf=clf, param_grid=param_grid_sampler,
    random_state=11, n_jobs=n_jobs, name='RF (topics)', progress_bar=bar)
print_best_classifier(rf_results_topics)

In [None]:
rf_results_topics_weights = evaluate_classifier(
    X_train_topics, y_train, X_validation_topics, y_validation, clf=clf, param_grid=param_grid_sampler,
    sample_weight=sample_weights_topics['weight'], random_state=12, n_jobs=n_jobs,
    name='RF (topics) + sample weights', progress_bar=bar)
print_best_classifier(rf_results_topics_weights)

We can make the following observations about the results:
- Once again, since none of the models selected uniform class weights, we can see that the choice of this hyperparameter is very important. In fact, all four models select class weights that weight the laureate class much higher than the non-laureate class.
- The original features classifiers seem to favor lower values of `min_samples_leaf` (deeper trees) and are overfitting considerably.
- The `max_features` is low for the majority of the classifiers, which means the individual trees are diverse, but weak predictors.

## Model Selection 

Despite the Matthews Correlation Coefficient (MCC) being a balanced measure, which is computed using the entire confusion matrix, it is not a suitable metric for choosing between classifiers. The reason is that its value depends on the *threshold*, $T$, that is used for separating the positive from the negative class. As such, it is quite possible that two classifiers can be made to perform identically by simply adjusting one of their thresholds. To avoid such artifacts, a nonparametric performance measure such as a [Receiver Operating Characteristic](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) (ROC) curve is generally applied.

The ROC curve is constructed by using different values of the threshold, $T$, to plot the [false positive rate](https://en.wikipedia.org/wiki/False_positive_rate) (FPR), or alternatively the [true negative rate](https://en.wikipedia.org/wiki/Sensitivity_and_specificity) (TNR = 1 - FPR), on the x-axis against the [true positive rate](https://en.wikipedia.org/wiki/Sensitivity_and_specificity) (TPR) on the y-axis. The [area under the ROC curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve) is typically used as a measure of the predictive performance of a classifier. An alternative is to construct the [precision-recall](https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html) (PR) curve by using different values of the threshold, $T$, to plot the [recall](https://en.wikipedia.org/wiki/Precision_and_recall) on the x-axis against the [precision](https://en.wikipedia.org/wiki/Precision_and_recall) on the y-axis. In this case, the [average precision](https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html) is typically used as a measure of the predictive performance of a classifier. For [imbalanced datasets the PR curve is more informative](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4349800/).
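
To see why the PR curve can be more informative under class imbalance, here is a small hand-made example (the labels and scores are invented for illustration). There are 98 negatives and 2 positives, and only two of the negatives are ranked above the positives.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# 98 negatives and 2 positives (invented scores, for illustration only).
y_true = np.array([0] * 98 + [1] * 2)
y_score = np.concatenate([
    np.linspace(0.0, 0.5, 96),  # 96 negatives with low scores
    [0.9, 0.95],                # 2 negatives ranked above every positive
    [0.8, 0.85],                # the 2 positives
])

roc_auc = roc_auc_score(y_true, y_score)
avg_precision = average_precision_score(y_true, y_score)
print('Area under the ROC curve:', round(roc_auc, 3))
print('Average precision:       ', round(avg_precision, 3))
```

The area under the ROC curve is close to 1 because almost all of the many negatives are ranked below the positives, whereas the average precision is much lower because the top of the ranking is dominated by false positives.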

I'd prefer not to resort to new measures and would like to continue working directly with the MCC. I'll follow the methodology of Zhou and Jakobsson in [Predicting Protein-Protein Interaction by the Mirrortree Method: Possibilities and Limitations](https://www.researchgate.net/publication/259354929_Predicting_Protein-Protein_Interaction_by_the_Mirrortree_Method_Possibilities_and_Limitations) by constructing the MCC curve. This curve plots different values of the threshold, $T$, on the x-axis against the [Matthews Correlation Coefficient](https://en.wikipedia.org/wiki/Matthews_correlation_coefficient) (MCC) on the y-axis. The area under the MCC curve seems like an intuitive measure of the predictive performance of a classifier and is defined by:

\begin{equation}
AUC_{MCC} = \int_{-\infty}^{\infty} MCC(T) dT
\end{equation}

However, the problem with this measure is that it is not easy to discern good performance versus bad performance. So let's normalize by the area under the MCC curve assuming $MCC \equiv 1$ for *all* thresholds, $T$. Thus we define the **normalized area under the MCC curve** as: 

\begin{equation}
NAUC_{MCC} = \frac{AUC_{MCC}}{AUC_{MCC \equiv 1}}
\end{equation}

As a result, the interpretation of the $NAUC_{MCC}$ is analogous to that of the MCC. It has an upper limit of +1 indicating a perfect prediction, a lower limit of -1 indicating total disagreement between prediction and observation and a mid value of 0 representing a random prediction. To the best of my knowledge this is the first time the normalized area under the MCC curve has been defined and used in a study.

An interesting case is that of a probabilistic classifier where $T \in [0, 1]$. By definition, for such a classifier, $AUC_{MCC \equiv 1} = 1$ and $NAUC_{MCC} = AUC_{MCC} = \int_{0}^{1} MCC(T) dT$. So we see that the AUC is already normalized. For all other classifiers we should perform the normalization by integrating over the finite range of thresholds.
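
For the probabilistic case, the integral can be approximated numerically. The sketch below is only an illustration under stated assumptions: `nauc_mcc_sketch` is a hypothetical helper and not the `mcc_auc_score` function imported at the top of this notebook, and the evenly spaced grid of 101 thresholds and the trapezoidal rule are my own choices.

```python
import warnings

import numpy as np
from sklearn.metrics import matthews_corrcoef


def nauc_mcc_sketch(y_true, y_score, n_thresholds=101):
    """Approximate the area under the MCC curve for scores in [0, 1] (hypothetical helper)."""
    thresholds = np.linspace(0.0, 1.0, n_thresholds)
    mccs = np.empty(n_thresholds)
    with warnings.catch_warnings():
        # scikit-learn returns an MCC of 0 (possibly with a warning) if only one class is predicted
        warnings.simplefilter('ignore')
        for i, threshold in enumerate(thresholds):
            y_pred = (y_score >= threshold).astype(int)
            mccs[i] = matthews_corrcoef(y_true, y_pred)
    # Trapezoidal rule. The threshold range has length 1, so the area is already normalized.
    return float(np.sum(np.diff(thresholds) * (mccs[1:] + mccs[:-1]) / 2.0))


y_true = np.array([0, 0, 0, 0, 1, 1])
perfect = nauc_mcc_sketch(y_true, y_true.astype(float))  # scores agree with the labels
inverted = nauc_mcc_sketch(y_true, 1.0 - y_true)         # scores disagree with the labels
print('Perfect scores: ', round(perfect, 3))
print('Inverted scores:', round(inverted, 3))
```

The two extremes land just inside +1 and -1. They fall slightly short only because at $T = 0$ every sample is predicted positive, which gives an MCC of 0 at that single grid point.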

Now let's proceed and plot the MCC curves for each of the classifiers, evaluate their normalized AUC and see how they perform compared to our [baseline classifier](5.0-baseline-model.ipynb). Note that for the support vector machine (SVM) classifiers, I have normalized the thresholds $T \in [0, 1]$ so that their performance can be visualized on the same scale as the logistic regression (LR) and random forest (RF) classifiers.

In [None]:
def plot_mccs(
    clfs, X, linestyle='solid', colors = ['black', 'blue', 'green', 'red', 'orange', 'purple'],
    normalize_thresholds=True, ax=None):
    
    """Plot the Matthews Correlation Coefficient (MCC) as a function of the classification threshold for
        a list of classifiers.

    Args:
        clfs (list): Results of classifier evaluation (dicts), one per classifier.
        X ({array-like, sparse matrix}, shape (n_samples, n_features)): Features matrix, where n_samples
            is the number of samples and n_features is the number of features.
        linestyle (str, optional): Defaults to 'solid'. Linestyle to plot.
        colors (list, optional): Defaults to ['black', 'blue', 'green', 'red', 'orange', 'purple']. Colors
            to plot the lines in. They will be colored according to this order.
        normalize_thresholds (bool, optional): Defaults to True. Normalize the thresholds to lie in [0, 1].
        ax (matplotlib.axes.Axes, optional): Defaults to None. axes.
    
    Raises:
        NotImplementedError: Raises if classifier does not have `predict_proba` or `decision_function`
            attribute.
    
    Returns:
        matplotlib.axes.Axes: axes.
    """

    if not ax:
        fig, ax = plt.subplots(figsize=(10, 8))
    
    for color, clf in enumerate(clfs):
        best_classifier = clf['best_classifier']
        if hasattr(best_classifier, 'predict_proba'):
            y_score = best_classifier.predict_proba(X)[:, 1]
            probability = True
        elif hasattr(best_classifier, 'decision_function'):
            y_score = best_classifier.decision_function(X)
            probability = False
        else:
            raise NotImplementedError
        mccs, _, _, thresholds = mcc_curve(y_true=y_validation, y_score=y_score,
                                           probability=probability)
        mcc_auc = mcc_auc_score(y_true=y_validation, y_score=y_score, sample_weight=None,
                                probability=probability)
        
        if normalize_thresholds and (np.min(thresholds) < 0.0 or np.max(thresholds) > 1.0):
            scaler = MinMaxScaler()
            thresholds = scaler.fit_transform(thresholds.reshape(-1, 1))

        ax.plot(thresholds, mccs, linestyle=linestyle, color=colors[color],
                label=clf['name'] + ' - {} NAUC'.format(round(mcc_auc, 3)))

    ax.set_xlabel('Classification Threshold')
    ax.set_ylabel('Matthews Correlation Coefficient (MCC)')
    ax.set_title('Matthews Correlation Coefficient (MCC) vs Classification Threshold')
    ax.legend()

    return ax

In [None]:
classifiers = [logit_results, logit_results_weights, svm_results, svm_results_weights, rf_results,
               rf_results_weights]
classifiers_topics = [logit_results_topics, logit_results_topics_weights, svm_results_topics,
                      svm_results_topics_weights, rf_results_topics, rf_results_topics_weights]

In [None]:
ax = plot_mccs(classifiers, X_validation)
ax = plot_mccs(classifiers_topics, X_validation_topics, linestyle='dashed', ax=ax)
baseline_mcc = matthews_corrcoef(y_validation, X_validation.num_workplaces_at_least_2)
ax.axhline(y=baseline_mcc, linestyle='-.',
           label='Baseline' + ' - {} MCC'.format(round(baseline_mcc, 3)))
ax.set_xlim(0.0, 1.0)
ax.set_ylim(-0.2, 1.0)
ax.set_xlabel('Normalized Classification Threshold')
ax.set_title('Matthews Correlation Coefficient (MCC) vs Normalized Classification Threshold')
ax.legend(ncol=2);

We can make the following observations about the chart:

1. All of the models beat the naive baseline classifier over at least some range of classification thresholds. As such we can conclude that machine learning is appropriate for this task.
2. All $NAUC_{MCC}$'s are significantly greater than zero. This lends further support to the first point. It looks like there is some signal in the data for predicting the target. However, the low values indicate that the majority of the classifiers are weak.
3. The logistic regression classifiers perform better than the random forest classifiers, which, in turn, perform better than the support vector machine classifiers.
4. The performance of the classifiers trained with the topics features is worse than that of those trained with the original features. It seems like some important information was lost during topic modeling.
5. The effect of the sample weights varies with classifier type and feature type. Using sample weights improves the performance of the logistic regression and support vector machine classifiers when using the original features. The improvement in performance of the logistic regression model is really impressive! However, sample weights worsen the performance when the topics features are used or the classifier is a random forest.

Clearly we should select the logistic regression with sample weights (LR + sample weights) classifier as it is the standout performer based on $NAUC_{MCC}$. It is also very interesting and a good characteristic that this is the only classifier which is relatively insensitive to the classification threshold. Whether or not it is a good classifier is open to interpretation as it depends on the context and purposes. The task of predicting Nobel Physics Laureates is a difficult one, so I think this is most likely a moderate to good classifier at best.

## Optimal Classification Threshold

Our work is not quite done yet as we still have to choose an optimal classification threshold for the LR + sample weights classifier. Below we plot the classification threshold against the Matthews Correlation Coefficient (MCC), true positive rate (TPR) and true negative rate (TNR) to help us with this selection. Naturally, there is a tradeoff between the TPR and the TNR. Minimizing false positives (i.e. maximizing the TNR) is more important than minimizing false negatives (i.e. maximizing the TPR) when classifying the physicists as laureates and non-laureates. This is because we are trying to gain insight into any kinds of biases that may be involved when a physicist is awarded the Nobel Prize. So any threshold to the right of the intersection of the TPR and TNR would satisfy this criterion. However, we must also ensure that we retrieve as many of the true positives as possible so that the conclusions we draw are based on as large a sample of *actual* laureates as possible. So intuitively, the threshold corresponding to the maximum value of the MCC seems to be a good choice for the optimal classification threshold.

In [None]:
def plot_mcc_tpr_tnr(clf, X, ax=None):
    """Plot the Matthews Correlation Coefficient (MCC), true positive rate (TPR) and true negative rate (TNR)
        as a function of the classification threshold for a classifier.

    Args:
        clf (dict-like): Classifier results. The fitted estimator is read from the `best_classifier` key.
        X ({array-like, sparse matrix}, shape (n_samples, n_features)): Features matrix, where n_samples
            is the number of samples and n_features is the number of features.
        ax (matplotlib.axes.Axes, optional): Defaults to None. Axes to plot on. A new figure is
            created if omitted.

    Raises:
        NotImplementedError: Raises if classifier does not have `predict_proba` or `decision_function`
            attribute.

    Returns:
        matplotlib.axes.Axes: Axes containing the plot.

        float: Threshold corresponding to maximum MCC.
    """

    if ax is None:
        fig, ax = plt.subplots(figsize=(8, 6))
    
    best_classifier = clf['best_classifier']
    if hasattr(best_classifier, 'predict_proba'):
        y_score = best_classifier.predict_proba(X)[:, 1]
        probability = True
    elif hasattr(best_classifier, 'decision_function'):
        y_score = best_classifier.decision_function(X)
        probability = False
    else:
        raise NotImplementedError
    mcc, tnr, tpr, thresholds = mcc_curve(y_true=y_validation, y_score=y_score, probability=probability)

    ax.plot(thresholds, mcc, color='black', label='Matthews Correlation Coefficient (MCC)')
    ax.plot(thresholds, tpr, color='blue', label='True Positive Rate (TPR)')
    ax.plot(thresholds, tnr, color='green', label='True Negative Rate (TNR)')
    
    max_mcc_idx = np.argmax(mcc)
    ax.axvline(thresholds[max_mcc_idx], color='red', linestyle='dashed', linewidth=1.0,
               label='Max MCC = {} (threshold = {})'.format(round(np.max(mcc), 3),
                                                            round(thresholds[max_mcc_idx], 3)))
    ax.set_xlim(0.0, 1.0)
    ax.set_ylim(0.0, 1.0)
    
    ax.set_xlabel('Classification Threshold')
    ax.set_title('Matthews Correlation Coefficient (MCC) / True Positive Rate (TPR) \n'
                 '/ True Negative Rate (TNR) vs Classification Threshold')
    ax.legend()

    return ax, thresholds[max_mcc_idx]

In [None]:
_, optimal_threshold = plot_mcc_tpr_tnr(logit_results_weights, X_validation);

Alternatively, if you prefer, the confusion matrix and the classification report of this classifier tell a similar story to the one I've described above.

In [None]:
best_classifier = logit_results_weights['best_classifier']
y_pred = (best_classifier.predict_proba(X_validation)[:, 1] > optimal_threshold).astype('int64')
display(confusion_matrix_to_dataframe(confusion_matrix(y_validation, y_pred)))
print(classification_report(y_validation, y_pred))
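To make the threshold sweep behind the plot above more concrete, here is a minimal sketch on toy data. The labels, scores, threshold grid and the helper name `rates_at_threshold` are all invented for illustration and are not part of this notebook's data or utilities. It computes the MCC, TPR and TNR directly from the confusion matrix counts at each threshold and then picks the threshold with the largest MCC.

```python
import numpy as np

# Toy labels and scores, invented for illustration (not the notebook's data).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.03, 0.12, 0.22, 0.31, 0.47, 0.62, 0.41, 0.57, 0.83, 0.91])


def rates_at_threshold(y_true, y_score, threshold):
    """Return (MCC, TPR, TNR) when scores above `threshold` are labelled positive."""
    y_pred = (y_score > threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    # The MCC is undefined when a row or column of the confusion matrix is
    # empty; it is conventionally set to 0 in that case.
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    tpr = tp / (tp + fn)  # fraction of actual positives retrieved
    tnr = tn / (tn + fp)  # fraction of actual negatives retrieved
    return mcc, tpr, tnr


# Hypothetical threshold grid; a finer grid gives a smoother curve.
thresholds = np.linspace(0.0, 1.0, 21)
curve = np.array([rates_at_threshold(y_true, y_score, t) for t in thresholds])
mcc, tpr, tnr = curve.T

# np.argmax returns the first threshold at which the maximum MCC is reached.
best_threshold = thresholds[np.argmax(mcc)]
```

Raising the threshold can only turn predicted positives into predicted negatives, which is why the TPR falls and the TNR rises as we move to the right, while the MCC peaks somewhere in between.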

## Persisting the Best Classifier Parameters

Now that we have the best classifier, let's persist its parameters and associated metadata, which will allow us to reconstruct the classifier at any point in the future. Note that I am not persisting the actual sklearn estimator using [joblib](https://pypi.org/project/joblib/) or [pickle](https://docs.python.org/3/library/pickle.html) as I'd like to avoid any [compatibility or security issues](https://stackabuse.com/scikit-learn-save-and-restore-models/#compatibilityissues).

In [None]:
best_classifier_params = pd.Series(
    type(best_classifier), name=logit_results_weights['name'].replace(' ', '_'), index=['estimator'])
best_classifier_params['params'] = best_classifier.get_params()
best_classifier_params['threshold'] = optimal_threshold
best_classifier_params['sample_weight'] = '../models/train-features-sample-weights.csv'
best_classifier_params

In [None]:
best_classifier_params.to_csv('../models/' + best_classifier_params.name + '.csv', header=True)

As a sanity check, let's recreate the estimator and make sure that we get the same results as before.

In [None]:
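# Squeeze the single-column frame into a Series (the `squeeze` argument of
# `read_csv` is not available in pandas >= 2.0).
best_classifier_params_check = pd.read_csv('../models/' + best_classifier_params.name + '.csv',
                                           index_col=0).squeeze('columns')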
best_classifier_params_check

In [None]:
best_classifier_check = LogisticRegression()
best_classifier_check.set_params(**ast.literal_eval(best_classifier_params_check.params))
best_classifier_check.fit(
    X_train, y_train, sample_weight=pd.read_csv(best_classifier_params_check.sample_weight)['weight'])

In [None]:
assert np.array_equal(best_classifier_check.predict(X_validation), best_classifier.predict(X_validation))
np.testing.assert_allclose(best_classifier_check.predict_proba(X_validation),
                           best_classifier.predict_proba(X_validation))
y_pred_check = (
    (best_classifier_check.predict_proba(
        X_validation)[:, 1] > best_classifier_params.threshold).astype('int64'))
assert np.array_equal(y_pred_check, y_pred)

Great, everything looks good.
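The reconstruction above works because the parameter dictionary ends up as plain text in the CSV file and `ast.literal_eval` can turn that text back into a dictionary. Below is a minimal sketch of that round trip, using a hypothetical parameter dictionary as a stand-in for the output of `get_params()` (the keys and values are illustrative only). It also shows why `ast.literal_eval` is a safer choice than `eval` here: it only accepts Python literals and refuses to execute anything else.

```python
import ast

# Hypothetical parameter dictionary, standing in for the output of
# `best_classifier.get_params()`; the values are illustrative only.
params = {'C': 0.5, 'class_weight': None, 'fit_intercept': True,
          'penalty': 'l2', 'max_iter': 100}

# Once written to a text file, the dictionary is just its string representation.
serialized = str(params)

# `ast.literal_eval` parses Python literals only, so it rebuilds the
# dictionary without executing arbitrary code.
restored = ast.literal_eval(serialized)
assert restored == params

# Anything that is not a plain literal is rejected rather than evaluated.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
```

One caveat of this approach is that it only works while every parameter value is a literal (numbers, strings, booleans, `None`, and containers of these). A parameter holding an arbitrary object would not survive the round trip.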