In [1]:
import pandas as pd

# Initial Models: Analysis and Features

The following is a short analysis of the models that have performed best on
the initial ACE dataset, and an explanation of the features of the data that
the models highlight as explanatory / predictive of the need for hospital treatment.

## Data Preparation Methods

As part of the training pipeline, I have tried a number of different
approaches to data preparation. These are worth mentioning quickly, as they
have a huge bearing on the performance of each model. The discussion can be
sub-divided into two major considerations:

### 1. Categorical Encoding Methods:

Most machine learning models require categorical data to be represented
numerically before they can use it. There are a number of ways of doing this,
some of which are not possible in this setting because of the small amount of
training data. I've focussed on two approaches:

**One Hot Encoding** - Each categorical feature is split into its
individual categories, and each category is assigned a binary value: 1 if
the category is present and 0 if it is not. For example, consider the
following data on the time of referral:

In [2]:
pd.DataFrame({
    "Referral Time": ["morning", "afternoon", "morning", "evening"],
}, index=[1,2,3,4])

Unnamed: 0,Referral Time
1,morning
2,afternoon
3,morning
4,evening


This could be one-hot encoded as follows:


In [3]:
pd.DataFrame({
    "Referral Time Morning": [1, 0, 1, 0],
    "Referral Time Afternoon": [0, 1, 0, 0],
    "Referral Time Evening": [0, 0, 0, 1],
}, index=[1,2,3,4])

Unnamed: 0,Referral Time Morning,Referral Time Afternoon,Referral Time Evening
1,1,0,0
2,0,1,0
3,1,0,0
4,0,0,1
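
In practice the encoded columns do not need to be typed out by hand. The following is a minimal sketch using `pd.get_dummies`; it is an illustration on the toy data above, not necessarily the method used in the actual training pipeline. Note that pandas names the new columns after the raw category values, so they are lower case and sorted alphabetically.

```python
import pandas as pd

referrals = pd.DataFrame(
    {"Referral Time": ["morning", "afternoon", "morning", "evening"]},
    index=[1, 2, 3, 4],
)

# One new column per level of "Referral Time".
# dtype=int gives 0 / 1 values rather than True / False.
one_hot = pd.get_dummies(
    referrals, columns=["Referral Time"], prefix_sep=" ", dtype=int
)
print(one_hot)
```

Each row has exactly one 1 across the three new columns, matching the hand-built table above.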


An issue with one-hot encoding is that it creates a large number of extra
features (one for each level of each categorical feature); the example above
took one feature and turned it into three. This creates a very "sparse" dataset
(one that contains a lot of zeros that don't add much information) and can
result in an inefficient model of the data, e.g. a tree model that has to make
hundreds of yes / no decisions on different binary features before it can make
a prediction. An alternative to this approach is to encode each category with
a numerical representation of its value.

**Mean Encoding / Target Encoding**

Target encoding takes the target feature, in this case the need for hospital
treatment, and encodes each categorical feature with the mean / proportion of
the target for each individual "level" of that category. Using the above
example, we would calculate the proportion of referrals made in the morning /
afternoon / evening that required hospital treatment, and use those
proportions as numerical representations of the levels. For example, if 15% of
children referred in the morning required hospital treatment, compared with 5%
of those referred in the afternoon and 18% of those referred in the evening,
then the feature would look like this:

In [9]:
pd.DataFrame({
    "Referral Time": ["morning", "afternoon", "morning", "evening"],
    # morning -> 0.15, afternoon -> 0.05, evening -> 0.18
    "Target Encoded Referral Time": [.15, .05, .15, .18],
}, index=[1,2,3,4])


Unnamed: 0,Referral Time,Target Encoded Referral Time
1,morning,0.15
2,afternoon,0.05
3,morning,0.15
4,evening,0.18


Note: One must be careful when using this approach that "leakage" isn't
introduced into the dataset, that is, information about the target value of an
example being included in the explanatory variables for that same example.
This can be avoided by ensuring that the target value for each example is
left out when calculating its encoding.
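
The leave-one-out idea described in the note can be sketched in pandas as follows. The data and the column names (`referral_time`, `hospital_required`) are hypothetical, and the fallback to the global mean for levels with only one example is an assumption of this sketch rather than something taken from the pipeline.

```python
import pandas as pd

# Hypothetical toy data: 1 = needed hospital treatment, 0 = treated by ACE.
df = pd.DataFrame({
    "referral_time": ["morning", "morning", "morning",
                      "afternoon", "afternoon", "evening", "evening"],
    "hospital_required": [1, 0, 0, 0, 1, 1, 1],
})

group = df.groupby("referral_time")["hospital_required"]
level_sum = group.transform("sum")      # positives in each row's level
level_count = group.transform("count")  # examples in each row's level

# Naive target encoding: each row's own target is part of its encoding (leaky).
df["naive_encoding"] = level_sum / level_count

# Leave-one-out: remove the row's own target before taking the mean.
# A level with a single example gives 0 / 0 = NaN, so fall back to the
# global mean there (an assumption of this sketch).
global_mean = df["hospital_required"].mean()
loo = (level_sum - df["hospital_required"]) / (level_count - 1)
df["loo_encoding"] = loo.fillna(global_mean)

print(df)
```

This only applies to the training examples. New examples have no known target, so they would be encoded with the plain per-level means calculated from the training data.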

### 2. Balancing Positive / Negative Examples:

The ACE dataset is heavily imbalanced i.e. only 16.5% of examples require
hospital treatment. Left as is, models can easily achieve high (83.5%)
accuracy by simply predicting ALL children can be treated by ACE. This
wouldn't be a very useful model!

To avoid this, efforts need to be made to balance the predictions made by
each model. Again, I have used two basic approaches to achieve this:

**1. Weighting Labels**:

The penalty a model is given for making an incorrect prediction can be
weighted so that mistakes on minority-label examples are penalised more
heavily. This discourages the model from simply guessing the majority label
over and over, as it gets a heavier penalty when it does so incorrectly. The
weight is usually chosen to be proportional to the imbalance, i.e. if there
are 5 times more negative examples than positive, then an incorrect negative
guess (a missed positive) is penalised 5 times more than an incorrect
positive guess.
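
As a rough illustration of the weighting described above, the sketch below computes weights that are inversely proportional to label frequency and applies them to a log loss. The labels are made up, `weighted_log_loss` is a hypothetical helper, and log loss is only one example of a penalty; the models discussed here may use a different one. The weight formula is the same "balanced" heuristic that scikit-learn uses for its `class_weight="balanced"` option.

```python
import math
from collections import Counter

# Hypothetical labels: 1 = hospital treatment needed, 0 = treated by ACE.
# There are 5 times more negative examples than positive ones.
labels = [0] * 10 + [1] * 2

counts = Counter(labels)
n_samples, n_classes = len(labels), len(counts)

# Weight each label inversely to its frequency.
class_weight = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}


def weighted_log_loss(y_true, y_prob, weights):
    """Mean log loss with each example scaled by the weight of its true label."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, 1e-15), 1 - 1e-15)  # avoid log(0)
        total -= weights[y] * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# A "model" that always leans towards the majority (negative) label.
always_negative = [0.1] * len(labels)
unweighted = weighted_log_loss(labels, always_negative, {0: 1.0, 1: 1.0})
weighted = weighted_log_loss(labels, always_negative, class_weight)
print(class_weight, unweighted, weighted)
```

With these weights a missed positive costs 5 times as much as a wrongly flagged negative, so the always-negative strategy is penalised far more heavily than it is under the unweighted loss.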

**2. SMOTE - Synthetic Minority Oversampling TEchnique**

This creates synthetic examples of the minority label until the number of
positive / negative examples is balanced 50/50. The simplest form of
oversampling is to duplicate the minority examples over and over. SMOTE
instead interpolates between each minority example and its nearest
minority-label neighbours, creating synthetic examples that resemble the
distribution of the originals.
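
The interpolation step can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not the implementation used in the pipeline: `smote_sketch`, its parameters and the example points are all hypothetical, and it only handles numeric features. Full implementations are available in libraries such as imbalanced-learn.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=3, seed=0):
    """Create synthetic points by interpolating towards near minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Pick a random position on the line segment from base to neighbour.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical minority examples with two numeric features each.
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.5), (1.2, 2.2)]
new_points = smote_sketch(minority, n_synthetic=6)
print(new_points)
```

Every synthetic point lies on a line between two real minority examples, which is why the new examples stay within the region the originals occupy.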

## Performance Metrics

The following metrics are used to measure model performance:

* **True Positive / False Positive / True Negative / False Negative**: Fairly
self-explanatory. A true positive in this context is an example a model
correctly states requires hospital treatment, a false positive is an example
the model states needs hospital treatment when it doesn't, and so on.
* **Accuracy**: Again fairly self-explanatory. The proportion of all
predictions that are correct.
* **Precision**: the proportion of positive guesses that are
correct i.e. if a model has a precision of 75%, 3 out of every 4 times it
predicts that hospital treatment is needed it is correct.
* **Recall**: the proportion of positive examples in the dataset that the
model correctly predicts i.e. if there are 50 examples requiring hospital
treatment and the model correctly identifies 40 of them, it has an 80% recall.
* **ROC/AUC**: this is a measure of the tradeoff between the true positive
rate (recall) and the false positive rate across all decision thresholds, but
is a little complex to define here. A ROC/AUC of 0.5 is representative of
random chance and 1 is a perfect model.
* **F1 Score**: the F1 score combines precision and recall into a single
number. It is the harmonic mean of the two and ranges from 0 (worst) to
1 (perfect).
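
To make these definitions concrete, here is a minimal sketch that computes them from lists of true and predicted labels. The helper names and the example predictions are hypothetical; the only figure taken from the dataset is the 16.5% positive rate, represented here as 33 positives in 200 examples.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn), where 1 = hospital treatment needed."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def summarise(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(y_true),
            "precision": precision, "recall": recall, "f1": f1}


# 200 hypothetical children, 33 of whom (16.5%) need hospital treatment.
y_true = [1] * 33 + [0] * 167

# Baseline: predict that every child can be treated by ACE.
baseline = summarise(y_true, [0] * 200)

# Made-up model: finds 24 of the 33 positives and raises 8 false alarms.
y_pred = [1] * 24 + [0] * 9 + [1] * 8 + [0] * 159
example_model = summarise(y_true, y_pred)

print(baseline)
print(example_model)
```

The baseline reaches 83.5% accuracy with a recall of zero, which is why accuracy alone is a poor guide on this dataset.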

## Models and Performance:

Different iterations of each model were evaluated based on a technique called
cross validation. This samples the training data and holds back a certain
proportion to test model predictions, ensuring the model is never evaluated
on examples it has already seen. These scores were used to select the best
parameters for each model, and as a guide to overall performance. Overall test
scores were taken from a holdout test set that was not used at any point
during the training process to ensure an unbiased estimate of how well a model
can generalise to new data.
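
The evaluation scheme described above can be sketched as follows. The number of examples, the 20% holdout size and the 5 folds are assumptions made for illustration, and `k_fold_indices` is a hypothetical helper; the actual split sizes used are not stated here.

```python
import random


def k_fold_indices(n_examples, k=5, seed=0):
    """Yield (train, validation) index lists for k-fold cross validation."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# Set the holdout test set aside first; it is never touched during training.
all_indices = list(range(100))
random.Random(42).shuffle(all_indices)
test_idx, train_pool = all_indices[:20], all_indices[20:]

# Cross validate on the remaining examples only.
for train, validation in k_fold_indices(len(train_pool), k=5):
    train_rows = [train_pool[i] for i in train]
    validation_rows = [train_pool[i] for i in validation]
    # Fit on train_rows and score on validation_rows here.
    assert not set(train_rows) & set(validation_rows)
    assert not set(test_idx) & set(train_rows + validation_rows)
```

Each example in the training pool is used for validation exactly once, and the holdout rows never appear in any fold.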

The best models were trained with **one-hot encoded** data with **weighted labels** used to balance the dataset. It is likely that the mean / target encoded data and the synthetic examples were causing the models to overfit to the training data, and thus not generalise well to new examples not seen in training. The following are the scores for the one-hot encoded examples with weighted labels: