In [1]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
from sklearn.metrics import (make_scorer, confusion_matrix, precision_score,
                             f1_score, roc_auc_score, accuracy_score,
                             recall_score)
from IPython.display import display, HTML
from sklearn.model_selection import StratifiedKFold, ParameterGrid

import sys
sys.path.append('/Users/samrelins/Documents/LIDA/ace_project/')
from src.data_prep import *
from src.train_test import *

# Initial Models: Analysis and Features

Below is a short(ish) script and summary of the data preparation and
modelling experiments I have performed. A more in-depth analysis of a small selection of the best performing models will follow, although, as we'll see, none of the models show impressive predictive accuracy at this point.

## Data Preparation Methods

The ACE dataset presents a couple of prominent considerations that need to be
accounted for when preparing the data for modelling:

### 1. Categorical Encoding Methods:

Machine learning methods require categorical data to be represented
numerically for it to be interpretable. There are a number of ways to
approach this, some of which are not possible in this setting because of the
small amount of training data. I've focussed on two approaches:

**One Hot Encoding** - Each categorical feature is split into its
individual categories and each category is assigned a binary value: a
1 indicates the category is present and a 0 that it is not. For example, the
following data on the time of referral:

In [3]:
pd.DataFrame({
    "Referral Time": ["morning", "afternoon", "morning", "evening"],
}, index=[1,2,3,4])

Unnamed: 0,Referral Time
1,morning
2,afternoon
3,morning
4,evening


could be one-hot encoded as follows:

In [4]:
pd.DataFrame({
    "Referral Time Morning": [1, 0, 1, 0],
    "Referral Time Afternoon": [0, 1, 0, 0],
    "Referral Time Evening": [0, 0, 0, 1],
}, index=[1,2,3,4])

Unnamed: 0,Referral Time Morning,Referral Time Afternoon,Referral Time Evening
1,1,0,0
2,0,1,0
3,1,0,0
4,0,0,1
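
The encoding above was typed out by hand. As a minimal sketch, and assuming pandas is available, `pd.get_dummies` produces the same kind of table automatically. The variable names `referrals` and `one_hot` are illustrative and not part of the project code, and pandas names the new columns `<feature>_<level>` rather than the labels used above.

```python
import pandas as pd

# Same toy referral data as in the example above
referrals = pd.DataFrame(
    {"Referral Time": ["morning", "afternoon", "morning", "evening"]},
    index=[1, 2, 3, 4],
)

# One binary column per level of the categorical feature
one_hot = pd.get_dummies(referrals, columns=["Referral Time"], dtype=int)
print(one_hot)
```

Each row contains exactly one 1, in the column matching its original category.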


An issue with one-hot encoding is the creation of a large number of extra
features (one for each level of each categorical feature), i.e. the above took
one feature and made it into three. This creates a very "sparse" dataset
(one that contains a lot of zeros that don't add much info) and can result in
an unwieldy model of the data, e.g. a tree model that has to make hundreds
of yes / no decisions on different binary categories before it can make a
prediction. An alternative to this approach is to encode each category with
a numerical representation of its value.

**Mean Encoding / Target Encoding**

Target encoding takes the target feature, in this case the need for hospital
treatment, and encodes each categorical feature with the mean / proportion
of the target that applies to the individual "levels" of that category. Using the above example, we would calculate the proportion of referrals made in the morning / afternoon / evening that required hospital treatment, and use those proportions as numerical representations of the feature. For example, if 15% of children referred in the morning required hospital treatment, along with 5% of those referred in the afternoon and 18% of those referred in the evening, then the feature would look like this:

In [5]:
pd.DataFrame({
    "Referral Time": ["morning", "afternoon", "morning", "evening"],
    "Target Encoded Referral Time": [.15, .05, .15, .18],
}, index=[1,2,3,4])


Unnamed: 0,Referral Time,Target Encoded Referral Time
1,morning,0.15
2,afternoon,0.05
3,morning,0.15
4,evening,0.18


Note: One must be careful when using this approach that "leakage" isn't
introduced into the dataset, that is, information about the target value of
an example being included in the explanatory variables for that same example.
This can be avoided by ensuring that the target value for each example is
left out when calculating its encodings.
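
One way to leave each example's own target out of its encoding is a "leave-one-out" mean. The following is only a sketch of that idea and not the implementation in `data_prep.py`: the function name, the toy data and the fallback to the overall mean for categories seen only once are all assumptions made for illustration.

```python
import pandas as pd

def leave_one_out_target_encode(categories, target):
    """Encode each row with the mean target of the OTHER rows in its category."""
    group_sum = target.groupby(categories).transform("sum")
    group_count = target.groupby(categories).transform("count")
    # Remove the row's own target value from its category totals
    encoded = (group_sum - target) / (group_count - 1)
    # A category seen only once gives 0 / 0 = NaN; fall back to the overall mean
    return encoded.fillna(target.mean())

# Hypothetical toy data: 1 = required hospital treatment
referral_time = pd.Series(
    ["morning", "morning", "morning", "afternoon", "afternoon", "evening"])
needs_hospital = pd.Series([1, 0, 0, 0, 1, 1])

encoded = leave_one_out_target_encode(referral_time, needs_hospital)
print(encoded.tolist())
```

Note that the first morning referral, which did require hospital treatment, is encoded using only the other two morning referrals, so its own outcome does not leak into its features.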

Target encoding fixes the sparsity issue, as each categorical feature remains
one feature rather than expanding, but it often results in overfitting, that
is, the model maps too closely to the examples it has seen in training
and doesn't then generalise well when given new data.

### 2. Balancing Positive / Negative Examples:

The ACE dataset is heavily imbalanced i.e. only 16.5% of examples require
hospital treatment. Left as is, models can easily achieve high (83.5%)
accuracy by simply predicting ALL children can be treated by ACE. This
wouldn't be a very useful model!
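
To make the problem concrete, here is a small sketch using made-up labels with the same 16.5% positive rate (these are not the ACE data). A "model" that always predicts the majority label reaches 83.5% accuracy while identifying none of the children who need hospital treatment.

```python
# Hypothetical labels: 1 = requires hospital treatment, 0 = treatable by ACE
labels = [1] * 165 + [0] * 835

# A "model" that always predicts the majority (negative) label
predictions = [0] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
true_pos = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_pos / sum(labels)

print(f"accuracy: {accuracy:.3f}, recall: {recall:.3f}")
```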

To avoid this, efforts need to be made to balance the predictions made by
each model. Again, I have used two basic approaches to achieve this:

**1. Weighting Labels**:

The penalty a model is given for making an incorrect prediction can be
weighted so that incorrect guesses on the minority label are penalised more
heavily. This discourages the model from simply guessing the majority label
over and over, as it gets a heavier penalty when it gets one of the minority
examples wrong. The weight is usually chosen to be proportional to the
imbalance, i.e. if there are 5 times more negative examples (children that
can be treated by ACE) than positive (children that require hospital
treatment), then wrongly predicting "negative" for a positive example is
penalised 5 times more than wrongly predicting "positive" for a negative
example.
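
As a rough illustration of the 5-to-1 example, the sketch below computes weights with the formula `n_samples / (n_classes * class_count)`, which is the heuristic scikit-learn documents for its "balanced" option. The label array is invented for the example.

```python
import numpy as np

# Hypothetical labels with 5 times more negatives (0) than positives (1)
y = np.array([0] * 10 + [1] * 2)

counts = np.bincount(y)                      # examples per class: [10, 2]
weights = len(y) / (len(counts) * counts)    # "balanced" weight per class
sample_weight = weights[y]                   # one weight per example

print(weights)
```

With these weights the two classes contribute equally to the total penalty, even though the negative class has five times as many examples.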

**2. SMOTE - Synthetic Minority Oversampling TEchnique**

SMOTE creates synthetic examples of the minority label to balance the number
of positive / negative examples to 50/50. The simplest form of oversampling
is to duplicate the minority examples over and over. SMOTE instead
interpolates between nearby minority examples to create synthetic examples
that roughly preserve the distribution of the original examples.
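
The interpolation step can be sketched in a few lines of NumPy. This is a simplified illustration of the idea rather than the `add_synthetic_examples` function or a full SMOTE implementation: the function name, its arguments and the toy points are assumptions, and it handles numeric features only.

```python
import numpy as np

def smote_like_sample(minority, k=2, n_new=4, seed=0):
    """Create n_new points by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        # Distances from the chosen point to every minority point
        dists = np.linalg.norm(minority - minority[i], axis=1)
        # k nearest neighbours, skipping the point itself (assumes no duplicates)
        neighbours = np.argsort(dists)[1:k + 1]
        j = rng.choice(neighbours)
        gap = rng.random()  # random position along the segment between i and j
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)

# Hypothetical minority-class examples with two numeric features
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])
new_points = smote_like_sample(minority)
print(new_points)
```

Every synthetic point lies on a line segment between two real minority examples, which is why the overall shape of the minority class is roughly preserved.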

### Data Preparation Pipeline

I've spent some time developing a "pipeline" or group of functions that can
automatically apply the above encoding and balancing techniques to the
data "at the flick of a switch". This means that, during training, the
different data preparation methods can be easily and consistently
applied to the data at runtime, without having to store many different
versions of the ACE dataset. The importance of this will become clear in the
discussion of model evaluation and cross validation below. Given the general
utility of these functions, and the fact they are fairly verbose, I have
extracted them into a separate module, `data_prep.py`, in which the code and
detailed documentation can be found.

A quick summary of the pipeline functions is as follows:

* `clean_data`: converts raw excel / csv data into a more python friendly format
* `fill_nas`: fills missing values in dataset with group means
* `add_features`: adds the extra categorical features discussed in the data
analysis
* `return_train_test`: divides the dataset into consistent train and test
dataframes
* `add_synthetic_examples`: generates SMOTE (synthetic) examples and adds to
dataset
* `encode_and_scale`: applies various categorical encoding techniques and
min/max scales the data (for modelling techniques that require scaled data)

## Model Training and Evaluating Performance:

Having considered data preparation, we now need to define models to predict
the hospital / community outcomes.

### Models and Parameters

The modelling techniques used are too numerous to attempt any discussion
here, but further exploration of the most successful techniques will be
included in the more detailed discussion that will follow. Each modelling
technique includes a number of parameters or "assumptions" that need to be
specified when defining the model, and these have a downstream effect on
performance and prediction accuracy. To simplify and compartmentalise each
of the models we wish to test along with its parameters (an extension of
Ruaridh's work), we have a number of functions that return a model and a
"parameter grid" of the parameter values we wish to test.

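Note: each function has a "balanced" argument allowing for the
calculation and use of balanced weights. Some models do not implement this
as a parameter (`class_weight`) and need sample weights passed when fitting
instead (hence the separate `weight_y` key in the returned dictionary);
others have no balancing technique available at all.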

In [6]:
from sklearn.neighbors import KNeighborsClassifier

def return_knn_params(balanced=False):
    clf = KNeighborsClassifier()
    # param_grid = {'n_neighbors': np.arange(1,10),
    #                 'weights': ['uniform','distance'],
    #                 'p': [1,2],
    #                 'metric':['minkowski','euclidean','manhattan'],
    #                 'n_jobs':[-2]}
    param_grid = {'n_neighbors': [3],
                    'weights': ['uniform'],
                    'p': [1,],
                    'metric':['minkowski'],
                    'n_jobs':[-2]}
    if balanced:
        print("no available balancing technique for nearest neighbours")
    return {"clf": clf, "param_grid": param_grid, "scaled": True}

In [7]:
from sklearn.svm import SVC
def return_svm_params(balanced=False):
    clf = SVC()
    # param_grid = {'kernel': ['linear','rbf'],
    #               'C': np.logspace(2,4,2), # np.logspace(2,5,6)
    #               'gamma': np.logspace(-4,0.5,1)} # np.logspace(-4,0.5,10)}
    param_grid = {'kernel': ['linear'],
                  'C': [0.1], # np.logspace(2,5,6)
                  'gamma': [0.1]} # np.logspace(-4,0.5,10)}
    if balanced:
        # grid values must be sequences, so wrap the single option in a list
        param_grid["class_weight"] = ["balanced"]
    return {"clf": clf, "param_grid": param_grid, "scaled": True}

In [8]:
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

def return_gaussian_process_params(balanced=False):
    clf = GaussianProcessClassifier(random_state=0, n_jobs=-2)
    kernels = [mul * RBF(length_scale)
                    for mul in np.arange(0.5, 2.5, 0.5)
                    for length_scale in np.arange(0.5, 2.5, 0.5)]
    param_grid = {'kernel': kernels,
                  'n_jobs': [-2]}
    if balanced:
        print("no available balancing technique for Gaussian Process")
    return {"clf": clf, "param_grid": param_grid, "scaled": True}

In [9]:
from sklearn.ensemble import RandomForestClassifier

def return_random_forest_params(balanced=False):
    clf = RandomForestClassifier(n_estimators=100)
    # param_grid = {'max_depth': [4, 6, 10, 14, 20],
    #               'n_estimators': [30, 100, 130, 300],
    #               'min_samples_split': [2, 3, 10, 13, 30],
    #               'max_features': [0.3, 0.4, 0.5, "auto"],
    #               'n_jobs': [-2]}
    param_grid = {'max_depth': [4],
                  'n_estimators': [30],
                  'min_samples_split': [2],
                  'max_features': [0.3],
                  'n_jobs': [-2]}
    if balanced:
        # grid values must be sequences, so wrap the single option in a list
        param_grid["class_weight"] = ["balanced"]
    return {"clf": clf, "param_grid": param_grid, "scaled": False}

In [10]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_class_weight

def return_grad_boost_params(balanced=False):
    clf = GradientBoostingClassifier(n_estimators=100,random_state=0)
    # param_grid = {'learning_rate': [0.1, 0.05, 0.02, 0.01],
    #               'n_estimators': [30, 100, 130, 300],
    #               'max_depth': [4, 6, 10, 14, 20],
    #               'min_samples_split': [3, 10, 13, 30],
    #               'max_features': [x for x in np.linspace(0.2,0.4,4)]}
    param_grid = {'learning_rate': [0.1],
                  'n_estimators': [30],
                  'max_depth': [4],
                  'min_samples_split': [3],
                  'max_features': [3]}
    return {"clf": clf, "param_grid": param_grid, "scaled": False,
            "weight_y": balanced}

In [11]:
from sklearn.ensemble import AdaBoostClassifier

def return_ada_boost_params(balanced=False):
    clf = AdaBoostClassifier(random_state=0)
    # param_grid = {'n_estimators': [30, 100, 130, 300],
    #               'learning_rate': [0.001,0.01,0.1,0.2,0.5]}
    param_grid = {'n_estimators': [30],
                  'learning_rate': [0.001]}
    return {"clf": clf, "param_grid": param_grid, "scaled": False,
            "weight_y":balanced}

In [12]:
from sklearn.naive_bayes import GaussianNB

def return_naive_bayes_params(balanced=False):
    clf = GaussianNB()
    # param_grid = {'var_smoothing':  np.logspace(-11,-3,9,base=10)}
    param_grid = {'var_smoothing': [0.3]}
    return {"clf": clf, "param_grid": param_grid, "scaled": False,
            "weight_y":balanced}

In [13]:
from sklearn.linear_model import LogisticRegression

def return_lr_params(balanced=False):
    clf = LogisticRegression(random_state=0, max_iter=10000)
    # param_grid = {'penalty' : ['l2'],
    #               'solver': ["liblinear"],
    #               'C' : np.logspace(-4, 4, 20)}
    param_grid = {'penalty' : ['l2'],
                  'solver': ["liblinear"],
                  'C' : [0.1]}
    if balanced:
        # grid values must be sequences, so wrap the single option in a list
        param_grid["class_weight"] = ["balanced"]
    return {"clf": clf, "param_grid": param_grid, "scaled": True}

In [14]:
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

def return_qda_params(balanced=False):
    clf = QuadraticDiscriminantAnalysis()
    # param_grid = {'reg_param':  [0.0, 0.01, 0.03, 0.1, 0.3]}
    param_grid = {'reg_param':  [0.0]}
    if balanced:
        print("no available balancing technique for QDA")
    return {"clf": clf, "param_grid": param_grid, "scaled": True}

In [15]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def return_lda_params(balanced=False):
    clf = LinearDiscriminantAnalysis()
    # param_grid = {'solver':  ["svd", "lsqr", "eigen"],
    #               "shrinkage": [None, "auto", 0.1, 0.3, 0.8, 1]}
    param_grid = {'solver':  ["svd"],
                  "shrinkage": [None]}
    if balanced:
        print("no available balancing technique for LDA")
    return {"clf": clf, "param_grid": param_grid, "scaled": True}


### Cross Validation and Grid Search

For those keeping score, we now have several different data preparation
techniques, models and parameter settings. We can't use them all
simultaneously (well, technically that is actually possible but would be a
needlessly complex solution) and so we need to decide on the best
combination. To evaluate each possible combination of data preparation /
model / parameters we can use a technique called "cross validation":
dividing the training data into k groups, training the model on k-1 of these
groups while leaving one group aside, and evaluating the model's predictions
against the group held aside. This is repeated so that each group is held
aside once, and the scores are averaged. Doing this ensures the model is
never evaluated on examples it has already seen. This technique is used in
conjunction with a parameter optimisation method called "grid search", which
iterates through each possible combination of the specified parameters,
scoring each individually. The best combination of parameters can then be
established.
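
The two ideas can be sketched with the standard library alone. The helpers below are illustrative stand-ins (their names are assumptions, and they are not the functions used later in this notebook): `expand_grid` lists every combination in a parameter grid, and `k_fold_indices` splits example indices so that each example is held out exactly once. Unlike `StratifiedKFold`, this simple splitter makes no attempt to preserve the class balance in each fold.

```python
from itertools import product

def expand_grid(param_grid):
    """Return every combination of the values in a {name: [values]} grid."""
    keys = sorted(param_grid)
    value_lists = [param_grid[key] for key in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]

def k_fold_indices(n_examples, k):
    """Yield (train, validation) index lists; each example is validated once."""
    indices = list(range(n_examples))
    for fold in range(k):
        val = indices[fold::k]          # every k-th example, offset by fold
        held_out = set(val)
        train = [i for i in indices if i not in held_out]
        yield train, val

grid = {"n_neighbors": [3, 5], "weights": ["uniform", "distance"]}
combos = expand_grid(grid)
folds = list(k_fold_indices(n_examples=7, k=3))

for params in combos:
    print(params)
for train, val in folds:
    print("train:", train, "validate:", val)
```

A full search would fit and score a model once for every (combination, fold) pair, so the cost grows with the product of the two.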

There are several "out of the box" implementations of these methods that can
be applied in most use cases. However, these functions require you to specify
 a pipeline that can be compartmentalised into distinct stages and applied
 across the whole training dataset. This isn't practical in this case as:

 * The synthesizing and encoding stages can't be divorced from one another,
 otherwise the SMOTE examples may interpolate between target-encoded
 features to produce nonsense examples
 * Only the training splits can include synthetic data - otherwise model
 performance will be based in part on its ability to predict synthetic data
 and will result in a biased cross validation score

Therefore, I've spent some time developing a custom cross validation and grid
 search loop that produces unbiased validation scores. The function includes
 the different data preparation techniques and pipeline functions discussed
 above:

In [33]:
import warnings
warnings.filterwarnings("ignore")
from sklearn.metrics import (make_scorer, confusion_matrix, precision_score,
                             f1_score, roc_auc_score, accuracy_score,
                             recall_score)
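from sklearn.model_selection import StratifiedKFold, ParameterGrid
# needed for the sample weights calculated in cv_score_classifier
from sklearn.utils.class_weight import compute_class_weight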

import sys
sys.path.append('/Users/samrelins/Documents/LIDA/ace_project/')
from src.data_prep import *

# custom scoring functions for CV loop
true_neg = make_scorer(lambda y, y_pred: confusion_matrix(y, y_pred)[0][0])
false_neg = make_scorer(lambda y, y_pred: confusion_matrix(y, y_pred)[1][0])
true_pos = make_scorer(lambda y, y_pred: confusion_matrix(y, y_pred)[1][1])
false_pos = make_scorer(lambda y, y_pred: confusion_matrix(y, y_pred)[0][1])
precision = make_scorer(precision_score, zero_division=0)

# dict of scoring functions
SCORING = {
    "f1": make_scorer(f1_score),
    "roc_auc": make_scorer(roc_auc_score),
    "accuracy": make_scorer(accuracy_score),
    "recall": make_scorer(recall_score),
    # use the scorer defined above so folds with no positive predictions
    # score 0 rather than triggering a division warning
    "precision": precision,
    "true_pos": true_pos,
    "true_neg": true_neg,
    "false_pos": false_pos,
    "false_neg": false_neg
}


def score_classifier(clf, X, y):
    """
    Scores a classifier against metrics in SCORING dict

    :param clf: (object: sklearn classifier) classifier to be scored
    :param X: (object: pandas DataFrame) matrix of training vectors
    :param y: (object: pandas Series) vector of target labels
    :return: (dict) group of {score function name: score} pairs
    """

    scores = {}
    for name, scorer in SCORING.items():
        scores[name] = scorer(clf, X, y)
    return scores


def cv_score_classifier(clf, X_train, y_train, params,
                        cat_encoder="one_hot",
                        add_synthetic=False,
                        scaled=False,
                        n_splits=3,
                        weight_y=False):
    """
    Custom CV loop to score classifier functions

    Implemented to account for SMOTE example generation and target encoding -
    both should only be performed on training data and not validation data -
    not possible to achieve this separation using the sklearn pipeline and
    GridSearchCV.

    :param clf: (object: sklearn classifier) classifier to train and score
    :param X_train: (object: pandas DataFrame) Explanatory Training data
    :param y_train: (object: pandas Series) Training data labels
    :param params: (dict) parameters for classifier
    :param cat_encoder: (str: "one_hot") categorical encoder for data
    either "one_hot" / "target"
    :param add_synthetic: (bool: False) set True to add SMOTE examples before
    training
    :param scaled: (bool: False) set True to scale numeric features
    :param n_splits: (int: 3) number of splits for CV loop
    :param weight_y: (bool: False) set True if clf requires sample_weights
    :return: (dict) mean of each SCORING metric across the CV splits
    """

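    # create splits for CV loop (random_state only has an effect when
    # shuffle=True, and recent scikit-learn versions raise an error otherwise)
    splitter = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=1)
    splits = list(splitter.split(X_train, y_train))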

    total_cv_scores = {} # dict to store cumulative CV scores
    for train_idxs, val_idxs  in splits:
        # divide data into train and validation sets for this cv loop
        cv_X_train, cv_y_train, X_val, y_val = (X_train.iloc[train_idxs],
                                                y_train.iloc[train_idxs],
                                                X_train.iloc[val_idxs],
                                                y_train.iloc[val_idxs])

        if add_synthetic: # add SMOTE examples to balance data if required
            cv_X_train, cv_y_train = add_synthetic_examples(cv_X_train, cv_y_train)

        # encode categorical features and scale numeric if required
        cv_X_train, X_val = encode_and_scale(
            cv_X_train, cv_y_train, X_val,
            cat_encoder=cat_encoder,
            scaled=scaled)

        if weight_y:
            # calculate array of weights for y labels
            pos_weight, neg_weight = compute_class_weight(
                class_weight="balanced",
                classes=[1,0],
                y=cv_y_train)
            y_weights = cv_y_train.apply(lambda y: pos_weight if y else neg_weight)
            # train model using parameters, weights and cv loop data
            cv_clf = (clf
                      .set_params(**params)
                      .fit(cv_X_train, cv_y_train, sample_weight=y_weights))
        else:
            # train model using parameters and cv loop data
            cv_clf = (clf
                      .set_params(**params)
                      .fit(cv_X_train, cv_y_train))

        # score classifier on cv validation set and add scores to total
        scores = score_classifier(cv_clf, X_val, y_val)
        if total_cv_scores:
            for key, value in scores.items():
                total_cv_scores[key] += value
        else:
            total_cv_scores = scores

    mean_cv_scores = {}
    for key, value in total_cv_scores.items():
        mean_cv_scores[key] = value / n_splits

    return mean_cv_scores


def param_search_classifier(param_grid, **kwargs):
    """
    Custom param grid search to complement cv_score_classifier function

    :param param_grid: (dict) parameters on which to perform grid search
    :param kwargs: arguments for cv_score_classifier function
    :return: (dict: best_scores, dict: best_params) scores and parameters for
    highest scoring model
    """
    param_grid = ParameterGrid(param_grid)
    # variable to store best param combo and relevant scores
    best_scores = {}
    best_params = {}
    for params in param_grid:
        mean_cv_scores = cv_score_classifier(params=params,
                                             **kwargs)
        if not best_scores:
            best_scores = mean_cv_scores
            best_params = params
        elif mean_cv_scores["f1"] > best_scores["f1"]:
            best_scores = mean_cv_scores
            best_params = params

    return best_scores, best_params

## Performance Metrics

The cross validation loop outputs the following metrics, used to measure
model performance:

* **True Positive / False Positive / True Negative / False Negative**: Fairly
self explanatory. A true positive in this context is an example a model
correctly states requires hospital treatment, a false positive is an example
the model states needs hospital treatment when it doesn't, and so on.
* **Accuracy**: Again fairly self explanatory. The proportion of
 correct predictions
* **Precision**: the proportion of positive guesses that are
correct i.e. if a model has a precision of 75%, 3 out of every 4 times it
predicts that hospital treatment is needed it is correct.
* **Recall**: the proportion of positive examples in the dataset that the
model correctly predicts i.e. if there are 50 examples requiring hospital
treatment and the model correctly identifies 40 of them, it has an 80% recall.
* **ROC/AUC**: this is a measure of the tradeoff between the true positive
rate (recall) and the false positive rate, but is a little complex to define
here. A 0.5 ROC/AUC is representative of random chance and 1 is a perfect
model.
* **F1 Score**: the f1 score is a measure of the tradeoff between precision and recall. It is the harmonic mean of the two and ranges from 0 (worst) to 1 (perfect).
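
As a worked example of how these metrics relate to one another, the sketch below computes them from a set of invented confusion-matrix counts (they are not results from the ACE data). The recall matches the 40-out-of-50 example given above.

```python
# Hypothetical counts for 200 validation examples
true_pos, false_neg = 40, 10     # 50 children needing hospital treatment
false_pos, true_neg = 20, 130    # 150 children treatable by ACE

total = true_pos + false_neg + false_pos + true_neg
accuracy = (true_pos + true_neg) / total
precision = true_pos / (true_pos + false_pos)   # correct share of positive guesses
recall = true_pos / (true_pos + false_neg)      # share of positives found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Because the harmonic mean is pulled towards the smaller of its two inputs, a model cannot reach a high f1 score by doing well on only one of precision and recall.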

## Tests

The following is the (perhaps long awaited!) output from the training /
validation. Scores are broken down by data preparation methods and then model
 type - the best performing model for each is selected from the cv loop and
 displayed in the results:

In [16]:
techniques_dict = {'K Nearest Neighbours': return_knn_params,
                   'Support Vector Machines': return_svm_params,
                   'Gaussian Process': return_gaussian_process_params,
                   'Random Forest Classifier': return_random_forest_params,
                   'Gradient Boosting Classifier': return_grad_boost_params,
                   'Ada Boost classifier': return_ada_boost_params,
                   'Gaussian Naieve Bayes': return_naive_bayes_params,
                   'Logistic Regression': return_lr_params,
                   'Quadratic Discriminant Analysis': return_qda_params}

data_prep_types = ["one_hot_balanced", "target_balanced",
                    "one_hot_resampled", "target_resampled"]

# data_loc = "/Users/samrelins/Documents/LIDA/ace_project/data/ace_data_orig.csv"
data_loc = "/Users/samrelins/Documents/LIDA/ace_project/data/ace_data_extra.csv"
ace_data_orig = pd.read_csv(data_loc)
ace_data_orig.drop(["medical_history", "examination_summary",
                    "recommendation"],
                   axis=1, inplace=True)
X_train, y_train, X_test, y_test = return_train_test(ace_data_orig)

best_params = {}
best_model_scores = {}

for data_prep_type in data_prep_types:
    ### comment out this and other print statements to silence loop progress
    print(50* '=')
    print(f"Data Prep: {data_prep_type}")
    print(50* '=')

    cat_encoder = "one_hot" if "one_hot" in data_prep_type else "target"
    balanced = "balanced" in data_prep_type
    # NOTE: no entry in data_prep_types above contains "undersample" or
    # "smote", so resample is None for every run as written
    resample = "undersample" if "undersample" in data_prep_type else None
    resample = "smote" if "smote" in data_prep_type else resample

    scores_list = []
    best_loop_params = {}
    for model_name, model_params_f in techniques_dict.items():
        print(f"fitting {model_name}......")
        model_best_scores, model_best_params = param_search_classifier(
            **model_params_f(balanced=balanced),
            X_train=X_train,
            y_train=y_train,
            cat_encoder=cat_encoder,
            resample=resample,
            verbose=False)
        scores_list.append(model_best_scores)
        best_loop_params[model_name] = model_best_params
        print("done.")

    best_model_scores[data_prep_type] = pd.DataFrame(scores_list,
                                                     index=techniques_dict.keys())
    best_params[data_prep_type] = best_loop_params


Required features missing to run <function add_free_text_features at 0x7fe766f93e50>
Data Prep: one_hot_balanced
fitting K Nearest Neighbours......
no available balancing technique for nearest neighbours
Testing KNeighborsClassifier() classifier with one_hot encoded features.
done.
fitting Support Vector Machines......
Testing SVC() classifier with one_hot encoded features.
done.
fitting Gaussian Process......
no available balancing technique for Gaussian Process
Testing GaussianProcessClassifier(n_jobs=-2, random_state=0) classifier with one_hot encoded features.
done.
fitting Random Forest Classifier......
Testing RandomForestClassifier() classifier with one_hot encoded features.
done.
fitting Gradient Boosting Classifier......
Testing GradientBoostingClassifier(random_state=0) classifier with one_hot encoded features.
done.
fitting Ada Boost classifier......
Testing AdaBoostClassifier(random_state=0) classifier with one_hot encoded features.
done.
fitting Gaussian Naieve Bayes......


In [19]:
def highlight_good_scores_green(row):
    # highlight rows (models) that clear all three score thresholds
    good_accuracy = row["mean_accuracy"] > 0.6
    good_recall = row["mean_recall"] > 0.4
    good_precision = row["mean_precision"] > 0.2
    highlight = good_recall and good_accuracy and good_precision
    colour = "green" if highlight else ""
    # one style string per column in the row
    return [f"background-color: {colour}"] * len(row)


for name, df in best_model_scores.items():
    display(HTML(f"<h2>{name.replace('_', ' ').title()}</h2>"))
    display(df.style.apply(highlight_good_scores_green, axis=1))


Unnamed: 0,mean_f1,std_f1,mean_roc_auc,std_roc_auc,mean_accuracy,std_accuracy,mean_recall,std_recall,mean_precision,std_precision,mean_true_pos,std_true_pos,mean_true_neg,std_true_neg,mean_false_pos,std_false_pos,mean_false_neg,std_false_neg
K Nearest Neighbours,0.144819,0.042427,0.518926,0.00952,0.792049,0.021624,0.111111,0.045361,0.233333,0.037495,2.0,0.816497,84.333333,3.091206,6.666667,3.091206,16.0,0.816497
Support Vector Machines,0.223449,0.073343,0.498575,0.052939,0.559633,0.027008,0.407407,0.171734,0.155368,0.043618,7.333333,3.091206,53.666667,6.01849,37.333333,6.01849,10.666667,3.091206
Gaussian Process,0.0,0.0,0.5,0.0,0.834862,0.0,0.0,0.0,0.0,0.0,0.0,0.0,91.0,0.0,0.0,0.0,18.0,0.0
Random Forest Classifier,0.125483,0.091246,0.498779,0.036341,0.75841,0.033778,0.111111,0.078567,0.152632,0.122531,2.0,1.414214,80.666667,4.027682,10.333333,4.027682,16.0,1.414214
Gradient Boosting Classifier,0.158322,0.042224,0.484534,0.022682,0.685015,0.0354,0.185185,0.06929,0.14243,0.026586,3.333333,1.247219,71.333333,4.714045,19.666667,4.714045,14.666667,1.247219
Ada Boost classifier,0.249982,0.043867,0.517399,0.042994,0.529052,0.134487,0.5,0.20787,0.1789,0.02113,9.0,3.741657,48.666667,17.98765,42.333333,17.98765,9.0,3.741657
Gaussian Naieve Bayes,0.242619,0.031142,0.508038,0.024087,0.538226,0.114424,0.462963,0.183324,0.171074,0.019338,8.333333,3.299832,50.333333,15.627611,40.666667,15.627611,9.666667,3.299832
Logistic Regression,0.218896,0.062146,0.503765,0.042312,0.605505,0.052436,0.351852,0.13858,0.162159,0.035133,6.333333,2.494438,59.666667,7.586538,31.333333,7.586538,11.666667,2.494438
Quadratic Discriminant Analysis,0.027778,0.039284,0.494607,0.004611,0.813456,0.011442,0.018519,0.026189,0.055556,0.078567,0.333333,0.471405,88.333333,1.699673,2.666667,1.699673,17.666667,0.471405


Unnamed: 0,mean_f1,std_f1,mean_roc_auc,std_roc_auc,mean_accuracy,std_accuracy,mean_recall,std_recall,mean_precision,std_precision,mean_true_pos,std_true_pos,mean_true_neg,std_true_neg,mean_false_pos,std_false_pos,mean_false_neg,std_false_neg
K Nearest Neighbours,0.045977,0.065021,0.48372,0.015829,0.782875,0.00865,0.037037,0.052378,0.060606,0.08571,0.666667,0.942809,84.666667,1.885618,6.333333,1.885618,17.333333,0.942809
Support Vector Machines,0.300911,0.025771,0.550061,0.043348,0.397554,0.13196,0.777778,0.136083,0.188462,0.020217,14.0,2.44949,29.333333,16.519349,61.666667,16.519349,4.0,2.44949
Gaussian Process,0.0,0.0,0.494505,0.004486,0.825688,0.007491,0.0,0.0,0.0,0.0,0.0,0.0,90.0,0.816497,1.0,0.816497,18.0,0.0
Random Forest Classifier,0.107882,0.027525,0.485857,0.014474,0.749235,0.02408,0.092593,0.026189,0.133155,0.035156,1.666667,0.471405,80.0,2.828427,11.0,2.828427,16.333333,0.471405
Gradient Boosting Classifier,0.114472,0.046862,0.462251,0.019814,0.685015,0.033778,0.12963,0.06929,0.106162,0.036172,2.333333,1.247219,72.333333,4.642796,18.666667,4.642796,15.666667,1.247219
Ada Boost classifier,0.202363,0.028085,0.472629,0.034001,0.553517,0.127637,0.351852,0.145815,0.152926,0.029276,6.333333,2.624669,54.0,16.329932,37.0,16.329932,11.666667,2.624669
Gaussian Naieve Bayes,0.241628,0.029976,0.506207,0.023269,0.535168,0.118519,0.462963,0.183324,0.170314,0.019279,8.333333,3.299832,50.0,16.083117,41.0,16.083117,9.666667,3.299832
Logistic Regression,0.29599,0.04761,0.563492,0.048091,0.568807,0.029963,0.555556,0.120014,0.202453,0.029643,10.0,2.160247,52.0,4.242641,39.0,4.242641,8.0,2.160247
Quadratic Discriminant Analysis,0.141923,0.077249,0.446581,0.050949,0.559633,0.15695,0.277778,0.240027,0.113808,0.034334,5.0,4.320494,56.0,20.92845,35.0,20.92845,13.0,4.320494


Unnamed: 0,mean_f1,std_f1,mean_roc_auc,std_roc_auc,mean_accuracy,std_accuracy,mean_recall,std_recall,mean_precision,std_precision,mean_true_pos,std_true_pos,mean_true_neg,std_true_neg,mean_false_pos,std_false_pos,mean_false_neg,std_false_neg
K Nearest Neighbours,0.144819,0.042427,0.518926,0.00952,0.792049,0.021624,0.111111,0.045361,0.233333,0.037495,2.0,0.816497,84.333333,3.091206,6.666667,3.091206,16.0,0.816497
Support Vector Machines,0.0,0.0,0.5,0.0,0.834862,0.0,0.0,0.0,0.0,0.0,0.0,0.0,91.0,0.0,0.0,0.0,18.0,0.0
Gaussian Process,0.0,0.0,0.5,0.0,0.834862,0.0,0.0,0.0,0.0,0.0,0.0,0.0,91.0,0.0,0.0,0.0,18.0,0.0
Random Forest Classifier,0.0,0.0,0.496337,0.00518,0.828746,0.00865,0.0,0.0,0.0,0.0,0.0,0.0,90.333333,0.942809,0.666667,0.942809,18.0,0.0
Gradient Boosting Classifier,0.0,0.0,0.498168,0.00259,0.831804,0.004325,0.0,0.0,0.0,0.0,0.0,0.0,90.666667,0.471405,0.333333,0.471405,18.0,0.0
Ada Boost classifier,0.0,0.0,0.496337,0.00518,0.828746,0.00865,0.0,0.0,0.0,0.0,0.0,0.0,90.333333,0.942809,0.666667,0.942809,18.0,0.0
Gaussian Naieve Bayes,0.0,0.0,0.5,0.0,0.834862,0.0,0.0,0.0,0.0,0.0,0.0,0.0,91.0,0.0,0.0,0.0,18.0,0.0
Logistic Regression,0.0,0.0,0.5,0.0,0.834862,0.0,0.0,0.0,0.0,0.0,0.0,0.0,91.0,0.0,0.0,0.0,18.0,0.0
Quadratic Discriminant Analysis,0.027778,0.039284,0.494607,0.004611,0.813456,0.011442,0.018519,0.026189,0.055556,0.078567,0.333333,0.471405,88.333333,1.699673,2.666667,1.699673,17.666667,0.471405


Unnamed: 0,mean_f1,std_f1,mean_roc_auc,std_roc_auc,mean_accuracy,std_accuracy,mean_recall,std_recall,mean_precision,std_precision,mean_true_pos,std_true_pos,mean_true_neg,std_true_neg,mean_false_pos,std_false_pos,mean_false_neg,std_false_neg
K Nearest Neighbours,0.045977,0.065021,0.48372,0.015829,0.782875,0.00865,0.037037,0.052378,0.060606,0.08571,0.666667,0.942809,84.666667,1.885618,6.333333,1.885618,17.333333,0.942809
Support Vector Machines,0.0,0.0,0.5,0.0,0.834862,0.0,0.0,0.0,0.0,0.0,0.0,0.0,91.0,0.0,0.0,0.0,18.0,0.0
Gaussian Process,0.0,0.0,0.494505,0.004486,0.825688,0.007491,0.0,0.0,0.0,0.0,0.0,0.0,90.0,0.816497,1.0,0.816497,18.0,0.0
Random Forest Classifier,0.0,0.0,0.5,0.0,0.834862,0.0,0.0,0.0,0.0,0.0,0.0,0.0,91.0,0.0,0.0,0.0,18.0,0.0
Gradient Boosting Classifier,0.031746,0.044896,0.487281,0.022712,0.801223,0.022885,0.018519,0.026189,0.111111,0.157135,0.333333,0.471405,87.0,2.160247,4.0,2.160247,17.666667,0.471405
Ada Boost classifier,0.0,0.0,0.496337,0.00518,0.828746,0.00865,0.0,0.0,0.0,0.0,0.0,0.0,90.333333,0.942809,0.666667,0.942809,18.0,0.0
Gaussian Naieve Bayes,0.0,0.0,0.5,0.0,0.834862,0.0,0.0,0.0,0.0,0.0,0.0,0.0,91.0,0.0,0.0,0.0,18.0,0.0
Logistic Regression,0.0,0.0,0.5,0.0,0.834862,0.0,0.0,0.0,0.0,0.0,0.0,0.0,91.0,0.0,0.0,0.0,18.0,0.0
Quadratic Discriminant Analysis,0.141923,0.077249,0.446581,0.050949,0.559633,0.15695,0.277778,0.240027,0.113808,0.034334,5.0,4.320494,56.0,20.92845,35.0,20.92845,13.0,4.320494


In [25]:
all_results = pd.DataFrame([])
for name, df in best_model_scores.items():
    idx_tuples = [(name, classifier)
                  for classifier in df.index]
    new_idx = pd.MultiIndex.from_tuples(
        idx_tuples, names=["data prep method", "classifier"]
    )
    new_idx_df = df.set_index(new_idx)
    all_results = pd.concat([all_results, new_idx_df])
# pandas 2.0 removed the .xls writer, so save as .xlsx (needs openpyxl)
all_results.to_excel("blah.xlsx")

In [40]:
all_results = pd.DataFrame([])
for data_prep_method, scores_df in best_model_scores.items():

    results_idx_tuples = [(data_prep_method, classifier)
                          for classifier in scores_df.index]
    new_results_idx = pd.MultiIndex.from_tuples(
        results_idx_tuples,
        names=["data prep method", "classifier"]
    )
    new_idx_scores_df = scores_df.set_index(new_results_idx)
    all_results = pd.concat([all_results, new_idx_scores_df])
all_results


data prep method,classifier,mean_f1,std_f1,mean_roc_auc,std_roc_auc,mean_accuracy,std_accuracy,mean_recall,std_recall,mean_precision,std_precision,mean_true_pos,std_true_pos,mean_true_neg,std_true_neg,mean_false_pos,std_false_pos,mean_false_neg,std_false_neg
one_hot_balanced,K Nearest Neighbours,0.144819,0.042427,0.518926,0.00952,0.792049,0.021624,0.111111,0.045361,0.233333,0.037495,2.0,0.816497,84.333333,3.091206,6.666667,3.091206,16.0,0.816497
one_hot_balanced,Support Vector Machines,0.223449,0.073343,0.498575,0.052939,0.559633,0.027008,0.407407,0.171734,0.155368,0.043618,7.333333,3.091206,53.666667,6.01849,37.333333,6.01849,10.666667,3.091206
one_hot_balanced,Gaussian Process,0.0,0.0,0.5,0.0,0.834862,0.0,0.0,0.0,0.0,0.0,0.0,0.0,91.0,0.0,0.0,0.0,18.0,0.0
one_hot_balanced,Random Forest Classifier,0.125483,0.091246,0.498779,0.036341,0.75841,0.033778,0.111111,0.078567,0.152632,0.122531,2.0,1.414214,80.666667,4.027682,10.333333,4.027682,16.0,1.414214
one_hot_balanced,Gradient Boosting Classifier,0.158322,0.042224,0.484534,0.022682,0.685015,0.0354,0.185185,0.06929,0.14243,0.026586,3.333333,1.247219,71.333333,4.714045,19.666667,4.714045,14.666667,1.247219
one_hot_balanced,Ada Boost classifier,0.249982,0.043867,0.517399,0.042994,0.529052,0.134487,0.5,0.20787,0.1789,0.02113,9.0,3.741657,48.666667,17.98765,42.333333,17.98765,9.0,3.741657
one_hot_balanced,Gaussian Naieve Bayes,0.242619,0.031142,0.508038,0.024087,0.538226,0.114424,0.462963,0.183324,0.171074,0.019338,8.333333,3.299832,50.333333,15.627611,40.666667,15.627611,9.666667,3.299832
one_hot_balanced,Logistic Regression,0.218896,0.062146,0.503765,0.042312,0.605505,0.052436,0.351852,0.13858,0.162159,0.035133,6.333333,2.494438,59.666667,7.586538,31.333333,7.586538,11.666667,2.494438
one_hot_balanced,Quadratic Discriminant Analysis,0.027778,0.039284,0.494607,0.004611,0.813456,0.011442,0.018519,0.026189,0.055556,0.078567,0.333333,0.471405,88.333333,1.699673,2.666667,1.699673,17.666667,0.471405
target_balanced,K Nearest Neighbours,0.045977,0.065021,0.48372,0.015829,0.782875,0.00865,0.037037,0.052378,0.060606,0.08571,0.666667,0.942809,84.666667,1.885618,6.333333,1.885618,17.333333,0.942809


In [45]:
pd.DataFrame(best_params)

Unnamed: 0,one_hot_balanced,target_balanced,one_hot_resampled,target_resampled
K Nearest Neighbours,"{'metric': 'minkowski', 'n_jobs': -2, 'n_neigh...","{'metric': 'minkowski', 'n_jobs': -2, 'n_neigh...","{'metric': 'minkowski', 'n_jobs': -2, 'n_neigh...","{'metric': 'minkowski', 'n_jobs': -2, 'n_neigh..."
Support Vector Machines,"{'C': 0.1, 'class_weight': 'balanced', 'gamma'...","{'C': 0.1, 'class_weight': 'balanced', 'gamma'...","{'C': 0.1, 'gamma': 0.1, 'kernel': 'linear'}","{'C': 0.1, 'gamma': 0.1, 'kernel': 'linear'}"
Gaussian Process,"{'kernel': 0.707**2 * RBF(length_scale=0.5), '...","{'kernel': 0.707**2 * RBF(length_scale=0.5), '...","{'kernel': 0.707**2 * RBF(length_scale=0.5), '...","{'kernel': 0.707**2 * RBF(length_scale=0.5), '..."
Random Forest Classifier,"{'class_weight': 'balanced', 'max_depth': 4, '...","{'class_weight': 'balanced', 'max_depth': 4, '...","{'max_depth': 4, 'max_features': 0.3, 'min_sam...","{'max_depth': 4, 'max_features': 0.3, 'min_sam..."
Gradient Boosting Classifier,"{'learning_rate': 0.1, 'max_depth': 4, 'max_fe...","{'learning_rate': 0.1, 'max_depth': 4, 'max_fe...","{'learning_rate': 0.1, 'max_depth': 4, 'max_fe...","{'learning_rate': 0.1, 'max_depth': 4, 'max_fe..."
Ada Boost classifier,"{'learning_rate': 0.001, 'n_estimators': 30}","{'learning_rate': 0.001, 'n_estimators': 30}","{'learning_rate': 0.001, 'n_estimators': 30}","{'learning_rate': 0.001, 'n_estimators': 30}"
Gaussian Naieve Bayes,{'var_smoothing': 0.3},{'var_smoothing': 0.3},{'var_smoothing': 0.3},{'var_smoothing': 0.3}
Logistic Regression,"{'C': 0.1, 'class_weight': 'balanced', 'penalt...","{'C': 0.1, 'class_weight': 'balanced', 'penalt...","{'C': 0.1, 'penalty': 'l2', 'solver': 'libline...","{'C': 0.1, 'penalty': 'l2', 'solver': 'libline..."
Quadratic Discriminant Analysis,{'reg_param': 0.0},{'reg_param': 0.0},{'reg_param': 0.0},{'reg_param': 0.0}


It is clear that none of the methods have really "cracked this nut", so to
speak. Most models identify fewer than 50% of the patients who need hospital
treatment, and when they do predict the need for hospital treatment they are
right well under 25% of the time.
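
To make those two figures concrete, here is a minimal sketch that recomputes recall and precision from confusion-matrix counts. The helper name `recall_precision` is my own, and the counts are the mean per-fold values from the Logistic Regression row with 10 true positives, 39 false positives and 8 false negatives. Because it works from fold-averaged counts, the precision can differ slightly from the `mean_precision` column, which averages the per-fold ratios instead.

```python
def recall_precision(true_pos, false_pos, false_neg):
    """Return (recall, precision) from confusion-matrix counts."""
    actual_pos = true_pos + false_neg      # patients who needed hospital
    predicted_pos = true_pos + false_pos   # patients the model flagged
    recall = true_pos / actual_pos if actual_pos else 0.0
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    return recall, precision


# Mean counts per validation fold, copied from the results table
recall, precision = recall_precision(true_pos=10.0, false_pos=39.0,
                                     false_neg=8.0)
print(f"recall:    {recall:.3f}")     # share of hospitalised patients found
print(f"precision: {precision:.3f}")  # share of positive predictions that were right
```

This prints a recall of 0.556 and a precision of 0.204, in line with that row's `mean_recall` (0.555556) and `mean_precision` (0.202453).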

Almost all of the best results come from one-hot encoded data with a
weighted target (rather than synthesised new examples). The best performing
models appear to be the tree-based methods (Random Forest Classifier,
Gradient Boosting Classifier, and Ada Boost Classifier), classic Logistic
Regression (a version of linear regression optimised for classification
tasks) and Support Vector Machines (trained using target encoded data).

We can test how well these models perform with new data by evaluating their
predictions on the holdout test set: a dataset that has not been used at any
point during training and is therefore a good indicator of a model's ability
to generalise. I'll only evaluate the models identified above, as
indiscriminately evaluating every possible model against the holdout test set
risks biasing our selection - the more models we test, the more likely it is
that a good result occurred by chance rather than through accurate modelling:
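
The selection risk is easy to see in a small simulation. The sketch below is purely illustrative: every "model" is a coin flip with a true accuracy of 0.5, the holdout size of 100 is a made-up number rather than the size of the real ACE test set, and 36 stands for the 9 classifiers times 4 data preparation methods tried above.

```python
import random


def best_of_k_accuracy(n_models, n_test, rng):
    """Best holdout accuracy among n_models coin-flip classifiers.

    Each 'model' guesses every label correctly with probability 0.5, so any
    score above 0.5 is luck rather than skill.
    """
    best = 0.0
    for _ in range(n_models):
        correct = sum(rng.random() < 0.5 for _ in range(n_test))
        best = max(best, correct / n_test)
    return best


rng = random.Random(0)
n_test = 100      # hypothetical holdout size, not the real ACE test set
n_repeats = 200   # repeat the experiment to average out noise

for n_models in (1, 5, 36):
    mean_best = sum(best_of_k_accuracy(n_models, n_test, rng)
                    for _ in range(n_repeats)) / n_repeats
    print(f"{n_models:>2} models tried -> mean best accuracy {mean_best:.3f}")
```

The more useless models we score against the same holdout set, the better the "best" one looks, which is why only a short list goes forward here.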

In [None]:
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

X_train_ohe, X_test_ohe = encode_and_scale(X_train, y_train, X_test,
                                           cat_encoder="one_hot")
X_train_ohe_scaled, X_test_ohe_scaled = encode_and_scale(X_train, y_train,
                                                         X_test,
                                                         cat_encoder="one_hot",
                                                         scaled=True)
# The original cell passed cat_encoder="one_hot" for the target-encoded data
# and never used the result. "target" is assumed to be the value that
# encode_and_scale expects for target encoding.
X_train_target, X_test_target = encode_and_scale(X_train, y_train, X_test,
                                                 cat_encoder="target")
X_train_target_scaled, X_test_target_scaled = encode_and_scale(X_train, y_train,
                                                               X_test,
                                                               cat_encoder="target",
                                                               scaled=True)

# (encoding, scaled) -> (train features, test features)
datasets = {
    ("one_hot", False): (X_train_ohe, X_test_ohe),
    ("one_hot", True): (X_train_ohe_scaled, X_test_ohe_scaled),
    ("target", False): (X_train_target, X_test_target),
    ("target", True): (X_train_target_scaled, X_test_target_scaled),
}

best_performing_models = [
    ("one_hot_balanced", 'Random Forest Classifier'),
    ("one_hot_balanced", 'Gradient Boosting Classifier'),
    ("one_hot_balanced", 'Ada Boost classifier'),
    ("one_hot_balanced", 'Logistic Regression'),
    ("target_balanced", "Support Vector Machines")
]
test_scores = {}
for data_prep_type, model in best_performing_models:
    model_args = techniques_dict[model](balanced=True)
    clf = model_args["clf"].set_params(**best_params[data_prep_type][model])

    # pick the features matching the encoding the parameters were tuned on
    encoding = "target" if data_prep_type.startswith("target") else "one_hot"
    X_tr, X_te = datasets[(encoding, bool(model_args["scaled"]))]

    fit_kwargs = {}
    if model_args.get("weight_y", False):
        # weight each training example inversely to its class frequency
        pos_weight, neg_weight = compute_class_weight(
            class_weight="balanced",
            classes=np.array([1, 0]),
            y=y_train)
        fit_kwargs["sample_weight"] = y_train.apply(
            lambda y: pos_weight if y else neg_weight)

    # train on the full training set, then score on the untouched holdout set
    clf.fit(X_tr, y_train, **fit_kwargs)
    test_scores[model] = score_classifier(clf, X_te, y_test)

In [None]:
pd.DataFrame(test_scores).T

The test scores agree reasonably well with the validation scores, so the
cross-validation results look like fair estimates of how these models
perform (which is a relief given the amount of time it took to write the
custom cv / grid search loop!)

## Closing Comments:

As I've mentioned a few times, I'm working on a follow-up notebook with a
more in-depth analysis of the above models, including:
  * discussion of the model algorithms
  * an explanation of the features the models are using to make predictions
  * analysis of individual examples the models are getting right / wrong

As things stand, none of these models are exhibiting a level of accuracy that
would leave us confident they could provide much in the way of useful
inference, or be taken forward into production. Even with better data, we
have a very long way to go to see really high levels of accuracy (the likes
of 80-90% precision and recall).

It's because of this that I'm keen on moving away from the "rigid" single
estimate of probability approach, to a model that can say more about the
uncertainty of a given estimate. That way, even if the model is only right
one third of the time, if it's confident about that third and unsure the
rest of the time then we can still provide some useful inference when taking
ACE referrals.
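
As a rough picture of what that could look like, the sketch below lets a model abstain whenever its predicted probability falls between two cut-offs. This is a deliberate simplification (thresholding a single probability is not the same as modelling uncertainty properly), and the function name `triage`, the 0.2 / 0.8 thresholds and the example probabilities are all made up for illustration.

```python
def triage(probabilities, low=0.2, high=0.8):
    """Label each predicted probability, abstaining when the model is unsure.

    The thresholds are illustrative assumptions, not tuned values.
    """
    labels = []
    for p in probabilities:
        if p >= high:
            labels.append("hospital likely")
        elif p <= low:
            labels.append("hospital unlikely")
        else:
            labels.append("uncertain")
    return labels


# Made-up predicted probabilities of needing hospital treatment
example_probs = [0.05, 0.35, 0.55, 0.91, 0.15, 0.62]
for p, label in zip(example_probs, triage(example_probs)):
    print(f"p={p:.2f} -> {label}")
```

Only the confident predictions would then be acted on, with the "uncertain" referrals handled as they are today.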

In [None]:
interesting_models = [
    ("one_hot_balanced", 'Gradient Boosting Classifier'),
    ("one_hot_balanced",  'Logistic Regression'),
    ("target_balanced", "Support Vector Machines")
]

for data_prep_type, model in interesting_models:
    print('=' * 50)
    print(model)
    print('=' * 50)
    print("Params:")
    for param, value in best_params[data_prep_type][model].items():
        print(f"{param}: {value}")
