[Table of Contents](00.00-Learning-ML.ipynb#Table-of-Contents) &bull; [&larr; *Chapter 2 - Classification*](02.00-Classification.ipynb) &bull; [*Chapter 2.02 - ?* &rarr;](02.02-?.ipynb)

---

# 02.01 - Starting Simple

To really understand how classifiers work, we're going to start with two very basic models called dummy classifiers. 

Technically these aren't machine learning models, as they use simple rules defined by the user. As you will see, they provide a good baseline for performance and demonstrate the importance of using various performance measures to evaluate models. If you've ever witnessed someone *'wow'* an audience by describing a predictive model with an impressively high *accuracy* (such as 95%), you will see why this may not be as impressive as it sounds. (Accuracy has a special definition when we are talking about classification.)

## Mode

*Mode* is a statistical term for the most frequent value in a set of data. For example, in the set `a, b, b, b, c, d`, the value `b` occurs most frequently, so it is the mode. You can calculate the mode of a given dataset in Python with the `statistics.mode` function:

In [35]:
from statistics import mode

# our simple data set
data = ['a','b','b','b','c','d']

mode(data)

'b'

In the above example there are four unique values. In classification, when there are just two unique values in the labels, this is called *binary classification*. For example, consider the set `True, False, False, False, False`. The mode of this set is `False`, and in binary classification this is also known as the *majority class*.

We can use the mode to create a very simple model for predicting a value (and without requiring any input). In binary classification, this can be called a *majority class classifier*. Using the example above, a majority class classifier would always predict `False`, and given the example data it would achieve an **accuracy of 80%**! A great achievement for such a simple model.

This terminology is potentially problematic. When talking about classification, accuracy is a measure of the proportion of predictions that are predicted correctly. The colloquial meaning of accuracy could mislead others about the performance of your model. Consider a set of 100 True and False labels that flag whether a loan has defaulted or not. Perhaps in this set, only 5 of the loans defaulted. With a basic model such as this, we could trivially achieve 95% accuracy by always predicting False.
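The loan example can be reproduced in a few lines. This is a minimal sketch: the `labels` list is made up to match the 5-in-100 default rate described above, and the variable names are illustrative only.

```python
from statistics import mode

# Hypothetical loan labels: True means the loan defaulted (5 of 100)
labels = [True] * 5 + [False] * 95

# A majority class classifier ignores its input and always predicts the mode
majority_class = mode(labels)
predictions = [majority_class] * len(labels)

# Accuracy: the proportion of predictions that match the actual labels
correct = sum(p == a for p, a in zip(predictions, labels))
accuracy = correct / len(labels)

print(majority_class, accuracy)  # False 0.95
```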

### Additional performance measures

Using the table below (called a *confusion matrix*), we can define some additional useful measures of performance:

| | Predicted = True | Predicted = False |
|---|---|---|
| Actual = True | True Positive (TP)  | False Negative (FN) |
| Actual = False | False Positive (FP) |  True Negative (TN) |

*Accuracy* measures how often the model is correct, calculated as:
> (TP + TN) / (TP + TN + FP + FN)

*Sensitivity* (also called Recall or True Positive Rate) measures how often the model is correct when the actual value is true, calculated as:
> TP / (FN + TP)

*Fallout* (also called False Positive Rate) measures how often the model is incorrect when the actual value is false, calculated as:
> FP / (TN + FP)

*Precision* (also called Positive Predictive Value) measures the proportion of predictions that are correct when the predicted value is true, calculated as:
> TP / (FP + TP)

Generally, the goal is to maximise accuracy, sensitivity and precision, and minimise fallout.
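The four formulas translate directly into code. The sketch below is one possible helper, not part of any library: the function name `classification_measures` and the example counts are assumptions chosen purely to illustrate the arithmetic. A measure whose denominator is zero is returned as NaN.

```python
def classification_measures(tp, fn, fp, tn):
    """Return accuracy, sensitivity, fallout and precision from confusion matrix counts."""
    def ratio(numerator, denominator):
        # A measure is undefined (NaN) when its denominator is zero
        return numerator / denominator if denominator else float('nan')

    return {
        'accuracy': ratio(tp + tn, tp + tn + fp + fn),
        'sensitivity': ratio(tp, fn + tp),
        'fallout': ratio(fp, tn + fp),
        'precision': ratio(tp, fp + tp),
    }

# Hypothetical counts, chosen only to illustrate the arithmetic
measures = classification_measures(tp=40, fn=10, fp=5, tn=45)
for name, value in measures.items():
    print(f'{name}: {value:.3f}')
```

With these made-up counts the accuracy is 0.850, sensitivity 0.800, fallout 0.100 and precision 0.889.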

Continuing with our loan defaults example above, let's calculate sensitivity, fallout and precision (remembering our model always predicts False):

| | Predicted = True | Predicted = False |
|---|---|---|
| Actual = True | 0 (TP)  | 5 (FN) |
| Actual = False | 0 (FP) |  95 (TN) |

* We already know accuracy is 95%
* Sensitivity = TP / (FN + TP) = 0 / (5 + 0) = 0%
* Fallout = FP / (TN + FP) = 0 / (95 + 0) = 0%
* Precision = TP / (FP + TP) = 0 / (0 + 0) = NaN

Considering these additional metrics, we can now see that while the model's accuracy is high, it's actually plain garbage at predicting loan defaults: it never identifies a single one. Where this model is strong, however, is as a baseline. Hopefully your real model will achieve higher accuracy (or precision, or sensitivity - depending on what is most important).


## Stratified

Before we continue to *real* models, let's consider one more dummy classifier - the stratified classifier. In statistics, stratified sampling takes samples from each group in the population. The stratified classifier borrows this idea: it assigns predictions at random according to the probability distribution of the underlying groups of labels. This classifier is slightly more complex than the mode variant, and with luck it can achieve a higher accuracy.

Continuing with the example of loan defaults, where in a set of 100, 5 default (True labels) and 95 do not (False labels): for every 100 predictions this model makes, it will randomly select 5 to be True and the rest False.

Best case, our stratified classifier makes 5 True predictions that align with the 5 actual True values by chance.

| | Predicted = True | Predicted = False |
|---|---|---|
| Actual = True | 5 (TP)  | 0 (FN) |
| Actual = False | 0 (FP) |  95 (TN) |

* Accuracy = (TP + TN) / (TP + TN + FP + FN) = (5 + 95) / (5 + 95 + 0 + 0) = 100%
* Sensitivity = TP / (FN + TP) = 5 / (0 + 5) = 100%
* Fallout = FP / (TN + FP) = 0 / (95 + 0) = 0%
* Precision = TP / (FP + TP) = 5 / (5 + 0) = 100%

Worst case, none of the classifier's 5 True predictions is correct.

| | Predicted = True | Predicted = False |
|---|---|---|
| Actual = True | 0 (TP)  | 5 (FN) |
| Actual = False | 5 (FP) |  90 (TN) |

* Accuracy = (TP + TN) / (TP + TN + FP + FN) = (0 + 90) / (0 + 90 + 5 + 5) = 90%
* Sensitivity = TP / (FN + TP) = 0 / (5 + 0) = 0%
* Fallout = FP / (TN + FP) = 5 / (90 + 5) = 5.26%
* Precision = TP / (FP + TP) = 0 / (5 + 0) = 0%

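In this example, we could achieve an accuracy of between 90% (all five True predictions wrong) and 100% (all five right). These scenarios are not equally likely, though. Each True prediction has only a 5 in 100 chance of landing on an actual default, so on average just 0.25 of the 5 True predictions are true positives, and the worst case occurs about 77% of the time. Weighting each scenario (0 true positives, 1 true positive, ... , 5 true positives) by its probability, the classifier gives on average an accuracy of 90.5% (lower than the mode classifier's 95%), sensitivity of 5%, fallout of 5% and precision of 5%. This means the stratified classifier trades a little accuracy for a non-zero sensitivity and a defined precision, which makes it a more informative baseline than the mode dummy classifier for our additional performance measures.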
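The averages can be checked by weighting each scenario by how likely it is. The sketch below assumes the classifier labels exactly 5 of the 100 loans True, chosen uniformly at random, so the number of true positives follows a hypergeometric distribution. Because the scenarios are far from equally likely, the weighted averages differ from a simple average over the six scenarios. The constant and function names are illustrative.

```python
from math import comb

TOTAL, DEFAULTS, PREDICTED_TRUE = 100, 5, 5

def scenario_probability(tp):
    # Ways to hit `tp` defaults and miss with the remaining True predictions,
    # divided by all ways to choose which loans are predicted True
    hits = comb(DEFAULTS, tp)
    misses = comb(TOTAL - DEFAULTS, PREDICTED_TRUE - tp)
    return hits * misses / comb(TOTAL, PREDICTED_TRUE)

expected = {'accuracy': 0.0, 'sensitivity': 0.0, 'fallout': 0.0, 'precision': 0.0}
for tp in range(PREDICTED_TRUE + 1):
    fp = PREDICTED_TRUE - tp
    fn = DEFAULTS - tp
    tn = TOTAL - tp - fp - fn
    p = scenario_probability(tp)
    expected['accuracy'] += p * (tp + tn) / TOTAL
    expected['sensitivity'] += p * tp / (tp + fn)
    expected['fallout'] += p * fp / (fp + tn)
    expected['precision'] += p * tp / (tp + fp)

for name, value in expected.items():
    print(f'{name}: {value:.1%}')
```

Under these assumptions the expected accuracy is 90.5%, and the expected sensitivity, fallout and precision are each 5.0%.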

## Predicting probabilities

As discussed in Chapter 2's introduction to classification, classifiers are typically able to output a probability or likelihood of the positive class occurring. This means that instead of predicting True or False (or 1 or 0), predictions will be output as values such as 0.23, 0.67 or 0.94, where each value represents the probability of a True or 1 value occurring.

Notice that in each confusion matrix above, the model predictions are discrete (meaning either True or False, and nothing in between).

Before we can begin measuring the accuracy (or sensitivity, fallout or precision) of predicted probabilities, we need to define a threshold (called a discrimination threshold) below which we assume probabilities to be False or 0, and above which they become True or 1. For example, if this threshold is 0.5, our values 0.23, 0.67 and 0.94 would become 0, 1 and 1 respectively. If it were 0.75, they would become 0, 0 and 1 respectively.
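Applying a threshold is a one-line operation. In this sketch the helper name `apply_threshold` is hypothetical, and treating a probability exactly equal to the threshold as True is an assumption, since the text only covers values below and above it.

```python
def apply_threshold(probabilities, threshold):
    # Probabilities at or above the threshold become 1 (True), the rest 0 (False)
    return [1 if p >= threshold else 0 for p in probabilities]

probabilities = [0.23, 0.67, 0.94]

print(apply_threshold(probabilities, 0.5))   # [0, 1, 1]
print(apply_threshold(probabilities, 0.75))  # [0, 0, 1]
```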

But how do we know where to set this threshold? One common method is to maximise what is called the *F1 score*. The F1 score is an alternative performance measure which is the harmonic mean of precision and sensitivity (or recall), and is calculated as:

    2 * (precision * sensitivity) / (precision + sensitivity)

The F1 score is calculated for each threshold from 0.01 to 0.99, and the threshold with the highest score is chosen. Because the F1 score is only high when both precision, TP / (FP + TP), *and* sensitivity, TP / (FN + TP), are high, this threshold gives the best balance between the two.
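The threshold search can be sketched as a simple sweep. The labels and probabilities below are invented for illustration, `f1_score_at` is a hypothetical helper, and treating the F1 score as 0 when there are no true positives is an assumption made to avoid dividing by zero.

```python
# Hypothetical labels and predicted probabilities, for illustration only
actual = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
predicted = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]

def f1_score_at(threshold):
    labels = [1 if p >= threshold else 0 for p in predicted]
    tp = sum(1 for label, a in zip(labels, actual) if label == 1 and a == 1)
    fp = sum(1 for label, a in zip(labels, actual) if label == 1 and a == 0)
    fn = sum(1 for label, a in zip(labels, actual) if label == 0 and a == 1)
    if tp == 0:
        # Precision or sensitivity is zero (or undefined), so treat F1 as 0
        return 0.0
    precision = tp / (fp + tp)
    sensitivity = tp / (fn + tp)
    return 2 * (precision * sensitivity) / (precision + sensitivity)

# Evaluate thresholds 0.01, 0.02, ..., 0.99 and keep the first best one
thresholds = [i / 100 for i in range(1, 100)]
best_threshold = max(thresholds, key=f1_score_at)

print(best_threshold, round(f1_score_at(best_threshold), 3))
```

For this made-up data, any threshold above 0.30 and up to 0.35 captures all four positives with two false positives, giving the best F1 score of 0.8.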

There will potentially be times where your problem places a different value on the occurrences of FP and FN. Wouldn't it be nice if there was a performance measure that is useful for all binary classification problems (that is, without the need to explicitly define a threshold)?

We can achieve this by calculating the sensitivity (true positive rate) and fallout (false positive rate) of our model at every threshold increment, and comparing these to those of random predictions.

Random predictions are unrelated to the actual labels, so on average they flag the same share of actual True values as of actual False values. Observe the expected sensitivity and fallout as we change the number of True predictions in a total set of 100 predictions:

* 0% True predictions: sensitivity = 0% and fallout = 0%
* 10% True predictions: sensitivity = 10% and fallout = 10%
* 20% True predictions: sensitivity = 20% and fallout = 20%
* ...
* 100% True predictions: sensitivity = 100% and fallout = 100%

We can plot the sensitivity and fallout of our random predictions to form a straight line from (0,0) to (1,1). This line is our baseline: a model whose predictions sit on it is doing no better than chance. A plot of sensitivity against fallout is called a receiver operating characteristic (ROC) curve, though for random predictions it is not a curve at all.
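A quick simulation illustrates why random predictions trace the diagonal. This is a sketch under stated assumptions: the labels are made up (5,000 of each class, so the averages are stable), and each prediction is drawn independently with a fixed chance of being True.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical labels: 5,000 actual positives followed by 5,000 actual negatives
actual = [True] * 5000 + [False] * 5000

points = []
for true_rate in (0.0, 0.1, 0.2, 0.5, 1.0):
    # Predict True with probability `true_rate`, ignoring the actual label
    predicted = [random.random() < true_rate for _ in actual]
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    sensitivity = tp / actual.count(True)
    fallout = fp / actual.count(False)
    points.append((true_rate, sensitivity, fallout))
    print(f'{true_rate:.0%} True: sensitivity={sensitivity:.2f}, fallout={fallout:.2f}')
```

Each printed pair should sit close to the share of True predictions, which is exactly the straight line from (0,0) to (1,1).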

As the discrimination threshold is varied and the resulting sensitivity and fallout are plotted, a 'good' model will produce a curve reaching up towards the top left corner. The closer the ROC curve is to the top left, the better the model. A perfect model will push this curve all the way to the top left corner, effectively making a right angle.

If the curve crosses over the random baseline, the model is doing worse than chance at some thresholds, which may indicate a problem with the model. If the curve is completely below the random baseline, simply inverting the model (replacing all True predictions with False predictions and vice versa) is a trivial improvement.

The area below the curve is a performance metric called AUC (or area under the curve). The random baseline has an AUC of 0.5, and a perfect model has an AUC of 1. Because it compares the true positive rate to the false positive rate at each value of the discrimination threshold, the AUC provides an effective measure of how well a binary classifier can distinguish Trues from Falses, and allows us to measure model performance without explicitly defining a threshold.
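To make the idea concrete, the ROC points and the area under them can be computed by hand. The sketch below uses invented labels and probabilities, and the helpers `roc_points` and `area_under_curve` are hypothetical names. It uses each distinct predicted probability as a threshold and the trapezoidal rule for the area.

```python
# Hypothetical labels and predicted probabilities, for illustration only
actual = [0, 0, 1, 0, 1, 1]
predicted = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]

def roc_points(actual, predicted):
    positives = sum(actual)
    negatives = len(actual) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold one distinct score at a time, from highest to lowest
    for threshold in sorted(set(predicted), reverse=True):
        tp = sum(1 for a, p in zip(actual, predicted) if p >= threshold and a == 1)
        fp = sum(1 for a, p in zip(actual, predicted) if p >= threshold and a == 0)
        points.append((fp / negatives, tp / positives))  # (fallout, sensitivity)
    return points

def area_under_curve(points):
    # Trapezoidal rule over consecutive (fallout, sensitivity) points
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

points = roc_points(actual, predicted)
auc = area_under_curve(points)
print(round(auc, 3))  # 0.889
```

The result of 8/9 matches another way of reading the AUC: of the nine possible (True, False) pairs in this data, the True example has the higher probability in eight.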


## Implementing dummy classifiers

Having a baseline as described above is genuinely useful. Luckily, these models are straightforward to implement in Python using the Scikit-learn Machine Learning package. Conveniently, the `DummyClassifier` class conforms to the same API (i.e. code style) as the rest of the algorithms, so what we learn for applying these baseline models to a dataset is largely transferable to real machine learning!

First, let's quickly generate some sample data to work with:

In [21]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# create a sample data set, where X are the features and y are the labels
X, y = make_classification(n_samples=100, n_classes=2)

Now we can build our first dummy classifier, using the mode method described above, and compute the AUC metric:

In [25]:
from sklearn.dummy import DummyClassifier
from sklearn import metrics

# define and fit our dummy model
dummy_mode = DummyClassifier(strategy='most_frequent')
dummy_mode.fit(X, y)

# generate predictions and calculate fallout and sensitivity
# (roc_curve returns the false positive rate first, then the true positive rate)
predictions = dummy_mode.predict(X)
fallout, sensitivity, thresholds = metrics.roc_curve(y, predictions)

print(metrics.auc(fallout, sensitivity))

0.5


And the second dummy classifier, using the probability distribution method:

In [29]:
# define and fit our dummy model
dummy_strat = DummyClassifier(strategy='stratified')
dummy_strat.fit(X, y)

# generate predictions and calculate fallout and sensitivity
# (roc_curve returns the false positive rate first, then the true positive rate)
predictions = dummy_strat.predict(X)
fallout, sensitivity, thresholds = metrics.roc_curve(y, predictions)

print(metrics.auc(fallout, sensitivity))

0.509203681473


Notice the AUC will increase and decrease if you re-run the stratified dummy classifier. This is because the predictions are assigned randomly according to the probability distribution, with a range of possible sensitivity and fallout measures as described above.
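The run-to-run variation can be seen directly by fitting the stratified classifier several times with different seeds. This is a sketch, assuming scikit-learn is installed: the `random_state` values are arbitrary and are fixed only so that the output is repeatable.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier

# Sample data; random_state is fixed here only so the data is repeatable
X, y = make_classification(n_samples=100, n_classes=2, random_state=0)

aucs = []
for seed in range(5):
    # A different random_state gives a different set of random predictions
    dummy_strat = DummyClassifier(strategy='stratified', random_state=seed)
    dummy_strat.fit(X, y)
    predictions = dummy_strat.predict(X)
    # roc_curve returns the false positive rate first, then the true positive rate
    fallout, sensitivity, thresholds = metrics.roc_curve(y, predictions)
    aucs.append(metrics.auc(fallout, sensitivity))

print([round(a, 3) for a in aucs])
```

The individual values scatter around 0.5, the AUC of the random baseline.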


## Where to now?

Given the two dummy classifiers as a baseline, our goal with machine learning is to create a model capable of out-performing these. The rest of Chapter 2 is going to explain and demonstrate various approaches (i.e. types of models) to help you achieve this for your given classification problem.
