# Classification

Classification is a supervised learning task in which a model maps an input to a discrete output, often referred to as the class. It is closely related to regression, where the target is continuous rather than discrete. More formally, a classification problem can be defined as learning a function $f$ that maps input variables $X = (x_0, x_1, \dots, x_{m-1}, x_m)$ to a discrete target variable $y$ such that $f(X) = y$.

So for instance, let's say that we have the following data:

| Variable 1 | Variable 2 | Variable 3 | Variable 4 | Target variable |
|------------|------------|------------|------------|:---------------:|
| 1          | 2          | 3          | 4          | Yes              |
| 2          | 3          | 4          | 5          | No              |
| 3          | 4          | 5          | 6          | Yes              |
| ...        | ...        | ...        | ...        | ...             |
| 2000       | 2001       | 2002       | 2003       | No            |

We would want to learn some function such that $f(1,2,3,4) = \textrm{Yes}$, $f(2,3,4,5) = \textrm{No}$, and so on.
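
To make the idea of "learning a function" concrete, here is a minimal sketch using a 1-nearest-neighbour rule on the three fully specified rows of the table. The choice of nearest neighbour is an assumption made for illustration, not a method prescribed by this text, and the names `X_train`, `y_train` and `f` are hypothetical.

```python
import numpy as np

# The three fully specified rows from the table above.
X_train = np.array([[1, 2, 3, 4],
                    [2, 3, 4, 5],
                    [3, 4, 5, 6]])
y_train = np.array(["Yes", "No", "Yes"])

def f(x):
    """Hypothetical 1-nearest-neighbour classifier: return the label of the closest training row."""
    distances = np.linalg.norm(X_train - np.asarray(x), axis=1)
    return y_train[np.argmin(distances)]

print(f([1, 2, 3, 4]))  # Yes
print(f([2, 3, 4, 5]))  # No
```

Such a rule simply memorises the training rows; a useful classifier should also predict sensible classes for inputs it has never seen.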

Classification is often seen as finding decision boundaries in the data, that is, boundaries that separate the classes from one another.

![Title](../../../source/visualization/images/classification.png)

## Evaluating classification

Machine learning is all about getting better and better at a task. Therefore, we need to define what it means to be _good_.

For instance, given the output of different models compared to the target variable, which model would you say is better, and why?

| Target |0|0|0|3|1|0|1|1|0|1|1|0|3|3|0|2|1|1|3|3|
|:-----:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|Model A|3|0|3|2|1|3|2|2|0|1|1|3|1|0|0|0|0|1|0|3|
|Model B|3|3|0|1|0|1|2|0|1|0|1|0|1|3|1|3|3|1|0|0|
|Model C|3|0|1|2|0|0|1|1|3|1|3|3|3|0|1|2|0|1|0|0|
|Model D|0|3|0|1|3|1|1|0|1|3|1|3|0|0|3|0|2|1|1|0|
|Model E|3|1|3|0|0|0|2|1|1|0|1|3|0|1|1|0|0|3|1|3|

This might be difficult to tell, especially if there are more models and predictions. Thankfully, when it comes to classification, there exist several commonly used metrics to tackle this problem. Let's use the data from the table as an example.

In [118]:
import numpy as np

target = np.array([0, 0, 0, 3, 1, 0, 1, 1, 0, 1, 1, 0, 3, 3, 0, 2, 1, 1, 3, 3])

predictions = {'A': np.array([3, 0, 3, 2, 1, 3, 2, 2, 0, 1, 1, 3, 1, 0, 0, 0, 0, 1, 0, 3]),
               'B': np.array([3, 3, 0, 1, 0, 1, 2, 0, 1, 0, 1, 0, 1, 3, 1, 3, 3, 1, 0, 0]),
               'C': np.array([3, 0, 1, 2, 0, 0, 1, 1, 3, 1, 3, 3, 3, 0, 1, 2, 0, 1, 0, 0]),
               'D': np.array([0, 3, 0, 1, 3, 1, 1, 0, 1, 3, 1, 3, 0, 0, 3, 0, 2, 1, 1, 0]),
               'E': np.array([3, 1, 3, 0, 0, 0, 2, 1, 1, 0, 1, 3, 0, 1, 1, 0, 0, 3, 1, 3])}

### Accuracy

$Accuracy = \frac{\textrm{tp} + \textrm{tn}}{\textrm{tp} + \textrm{tn} + \textrm{fp} + \textrm{fn}}$

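Here tp, tn, fp and fn are the numbers of true positives, true negatives, false positives and false negatives. Accuracy is a straightforward metric that tells us what fraction of instances were assigned the correct class. The formula above is written for the binary case; with more than two classes, accuracy is simply the number of correct predictions divided by the total number of predictions. It gives a clear idea of how well a model performs when the classes are balanced. When they are not, say 95% class 1 and 5% class 2, a model that always predicts class 1 already reaches an accuracy of 95% even though it never identifies class 2, which is rarely what we want.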

In [119]:
def accuracy(predicted_target, target):
    
    return np.mean(target == predicted_target)

for model_name, predicted_target in predictions.items():
    print(f"{model_name}: {accuracy(predicted_target, target):.4f}")

A: 0.4000
B: 0.2500
C: 0.4000
D: 0.2500
E: 0.2000
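
The imbalance problem mentioned above can be reproduced in a few lines. This is a minimal sketch on a made-up data set (95 instances of class 1 and 5 of class 2); the names `imbalanced_target` and `always_one` are hypothetical.

```python
import numpy as np

# Hypothetical imbalanced data set: 95 instances of class 1, 5 of class 2.
imbalanced_target = np.array([1] * 95 + [2] * 5)

# A "model" that ignores its input and always predicts class 1.
always_one = np.ones_like(imbalanced_target)

imbalanced_accuracy = np.mean(imbalanced_target == always_one)
print(f"accuracy: {imbalanced_accuracy:.2f}")  # accuracy: 0.95
```

The score looks excellent even though not a single instance of class 2 is ever detected.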


### Recall

$Recall = \frac{\textrm{tp}}{\textrm{tp} + \textrm{fn}}$

To remedy the flaws of simple accuracy, there exist metrics like recall. The recall for each class is defined as the number of instances of that class that were correctly predicted (true positives), divided by the total number of instances of that class (true positives and false negatives).

In other words, if a classifier only predicts class 1 and never class 2, it will have a recall of 1 for class 1 but a recall of 0 for class 2.

Recall is a metric that is calculated for each class. In order to provide a single metric for the classification problem as a whole, we typically compute a weighted average based on how many times each class appears.

In [120]:
def recall(predicted_target, target):

    recall_per_class = {}
    classes, counts = np.unique(target, return_counts=True)

    for c in classes:
        recall_per_class[c] = np.sum((target == c) & (predicted_target == c)) / np.sum(target == c)

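    # Pair each class with its own count; indexing counts by label only works for labels 0..k-1
    weighted_recall = np.sum([recall_per_class[c] * n / len(target) for c, n in zip(classes, counts)])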

    return weighted_recall, recall_per_class

for model_name, predicted_target in predictions.items():
    weighted_recall, recall_per_class = recall(predicted_target, target)
    
    print(f"{model_name}: weighted recall({weighted_recall:.2f})", end="")
    print(f"| class 0 ({recall_per_class[0]:.2f})", end="")
    print(f"| class 1 ({recall_per_class[1]:.2f})", end="")
    print(f"| class 2 ({recall_per_class[2]:.2f})", end="")
    print(f"| class 3 ({recall_per_class[3]:.2f})")

A: weighted recall(0.40)| class 0 (0.43)| class 1 (0.57)| class 2 (0.00)| class 3 (0.20)
B: weighted recall(0.25)| class 0 (0.29)| class 1 (0.29)| class 2 (0.00)| class 3 (0.20)
C: weighted recall(0.40)| class 0 (0.29)| class 1 (0.57)| class 2 (1.00)| class 3 (0.20)
D: weighted recall(0.25)| class 0 (0.29)| class 1 (0.43)| class 2 (0.00)| class 3 (0.00)
E: weighted recall(0.20)| class 0 (0.14)| class 1 (0.29)| class 2 (0.00)| class 3 (0.20)
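
Weighting each class's recall by how often the class appears has a side effect: the weights cancel the per-class denominators, so the weighted recall equals the overall accuracy (compare the numbers above with the accuracy results). The sketch below checks this for Model A and also computes an unweighted, or "macro", average, in which every class counts equally. The macro average is an alternative shown for comparison, not something used elsewhere in this text.

```python
import numpy as np

target = np.array([0, 0, 0, 3, 1, 0, 1, 1, 0, 1, 1, 0, 3, 3, 0, 2, 1, 1, 3, 3])
model_a = np.array([3, 0, 3, 2, 1, 3, 2, 2, 0, 1, 1, 3, 1, 0, 0, 0, 0, 1, 0, 3])

classes, counts = np.unique(target, return_counts=True)

# Recall of class c: fraction of the instances of class c that were predicted as c.
recall_per_class = np.array([np.mean(model_a[target == c] == c) for c in classes])

weighted_recall = np.sum(recall_per_class * counts) / len(target)
macro_recall = np.mean(recall_per_class)  # every class counts equally
overall_accuracy = np.mean(model_a == target)

print(f"weighted: {weighted_recall:.2f}, macro: {macro_recall:.2f}, accuracy: {overall_accuracy:.2f}")
# weighted: 0.40, macro: 0.30, accuracy: 0.40
```

The macro average is lower here because class 2, which Model A never gets right, weighs as much as the frequent classes.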


### Precision

$Precision = \frac{\textrm{tp}}{\textrm{tp} + \textrm{fp}}$

Another metric that remedies the flaws of simple accuracy is precision. The precision for each class is the number of instances correctly predicted as that class (true positives), divided by the total number of instances predicted as that class (true positives and false positives). In other words, if a classifier predicts class 1 only once and that prediction is correct, it will have a precision of 1 for class 1, whereas if only half of its class 1 predictions are correct, its precision will be 0.5.

Precision is a metric that is calculated for each class in a classification problem. In order to provide a single metric, we typically compute the weighted average of the precision of each class, based on how many times each class appears.

In [121]:
def precision(predicted_target, target):

    precision_per_class = {}
    classes, counts = np.unique(target, return_counts=True)

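    for c in classes:
        n_predicted = np.sum(predicted_target == c)
        true_positives = np.sum((target == c) & (predicted_target == c))
        # A class that is never predicted has no defined precision; report 0 to avoid dividing by zero
        precision_per_class[c] = true_positives / n_predicted if n_predicted > 0 else 0.0

    # Pair each class with its own count; indexing counts by label only works for labels 0..k-1
    weighted_precision = np.sum([precision_per_class[c] * n / len(target) for c, n in zip(classes, counts)])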

    return weighted_precision, precision_per_class

for model_name, predicted_target in predictions.items():
    weighted_precision, precision_per_class = precision(predicted_target, target)
    
    print(f"{model_name}: weighted precision({weighted_precision:.2f})", end="")
    print(f"| class 0 ({precision_per_class[0]:.2f})", end="")
    print(f"| class 1 ({precision_per_class[1]:.2f})", end="")
    print(f"| class 2 ({precision_per_class[2]:.2f})", end="")
    print(f"| class 3 ({precision_per_class[3]:.2f})")

A: weighted precision(0.48)| class 0 (0.43)| class 1 (0.80)| class 2 (0.00)| class 3 (0.20)
B: weighted precision(0.25)| class 0 (0.29)| class 1 (0.29)| class 2 (0.00)| class 3 (0.20)
C: weighted precision(0.41)| class 0 (0.29)| class 1 (0.67)| class 2 (0.50)| class 3 (0.20)
D: weighted precision(0.25)| class 0 (0.29)| class 1 (0.43)| class 2 (0.00)| class 3 (0.00)
E: weighted precision(0.20)| class 0 (0.14)| class 1 (0.29)| class 2 (0.00)| class 3 (0.20)
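
Recall and precision can both be read off a confusion matrix, which counts how often each true class was predicted as each class. The following sketch builds one for Model A; the variable names are hypothetical, and the number of classes is assumed to be 4 with labels 0 to 3, as in the example data.

```python
import numpy as np

target = np.array([0, 0, 0, 3, 1, 0, 1, 1, 0, 1, 1, 0, 3, 3, 0, 2, 1, 1, 3, 3])
model_a = np.array([3, 0, 3, 2, 1, 3, 2, 2, 0, 1, 1, 3, 1, 0, 0, 0, 0, 1, 0, 3])

n_classes = 4
confusion = np.zeros((n_classes, n_classes), dtype=int)
# Rows are the true class, columns the predicted class.
np.add.at(confusion, (target, model_a), 1)

true_positives = np.diag(confusion)
recall_from_matrix = true_positives / confusion.sum(axis=1)     # divide by row totals
precision_from_matrix = true_positives / confusion.sum(axis=0)  # divide by column totals

print(confusion)
print(np.round(recall_from_matrix, 2))
print(np.round(precision_from_matrix, 2))
```

Dividing the diagonal by the row totals gives recall, and dividing it by the column totals gives precision. Every class is predicted at least once by Model A, so no column total is zero here.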


### $F_\beta$ Score

$F_{\beta} = (1 + \beta^2) \cdot \frac{\textrm{precision} \cdot \textrm{recall}}{(\beta^2 \cdot \textrm{precision}) + \textrm{recall}}$

A good model should have both a high recall and a high precision, although in some cases one matters more than the other. The $F_\beta$ measure is defined as the weighted harmonic mean of the precision and the recall and therefore provides a single metric combining the two. The $\beta$ parameter controls the weight given to the recall relative to the precision. If it is set to 1, precision and recall are weighted equally. If it is less than 1, precision is favoured. If it is more than 1, recall is favoured. Typically, $\beta$ is set to 1 and the measure is then called the $F_1$ score.

In [122]:
def fbeta_score(predicted_target, target, beta=1.0):

    classes, counts = np.unique(target, return_counts=True)
    fbeta_score_per_class = {c:0 for c in classes}

    # Computes the recall and precision of these classes
    _, p = precision(predicted_target, target)
    _, r = recall(predicted_target, target)

    # Computes the F-beta score as the harmonic mean between precision and recall
    for c in classes:
        if beta**2 * p[c] + r[c] == 0:
            fbeta_score_per_class[c] = 0 # if precision and recall are 0, then f-beta should also be zero
        else:
            fbeta_score_per_class[c] = (1 + beta**2) * (p[c] * r[c]) / (beta**2 * p[c] + r[c])

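    # Pair each class with its own count; indexing counts by label only works for labels 0..k-1
    weighted_fbeta_score = np.sum([fbeta_score_per_class[c] * n / len(target) for c, n in zip(classes, counts)])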

    return weighted_fbeta_score, fbeta_score_per_class

for model_name, predicted_target in predictions.items():
    weighted_fbeta_score, fbeta_score_per_class = fbeta_score(predicted_target, target)
    
    print(f"{model_name}: weighted fbeta_score({weighted_fbeta_score:.2f})", end="")
    print(f"| class 0 ({fbeta_score_per_class[0]:.2f})", end="")
    print(f"| class 1 ({fbeta_score_per_class[1]:.2f})", end="")
    print(f"| class 2 ({fbeta_score_per_class[2]:.2f})", end="")
    print(f"| class 3 ({fbeta_score_per_class[3]:.2f})")

A: weighted fbeta_score(0.43)| class 0 (0.43)| class 1 (0.67)| class 2 (0.00)| class 3 (0.20)
B: weighted fbeta_score(0.25)| class 0 (0.29)| class 1 (0.29)| class 2 (0.00)| class 3 (0.20)
C: weighted fbeta_score(0.40)| class 0 (0.29)| class 1 (0.62)| class 2 (0.67)| class 3 (0.20)
D: weighted fbeta_score(0.25)| class 0 (0.29)| class 1 (0.43)| class 2 (0.00)| class 3 (0.00)
E: weighted fbeta_score(0.20)| class 0 (0.14)| class 1 (0.29)| class 2 (0.00)| class 3 (0.20)
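
To see what $\beta$ does, it helps to fix a precision and a recall and vary $\beta$ alone. The sketch below applies the formula above to a made-up model with a precision of 0.8 and a recall of 0.4; the helper `f_beta` is hypothetical and works on a single precision/recall pair rather than on predictions.

```python
def f_beta(precision_value, recall_value, beta):
    """F-beta for a single precision/recall pair, using the formula above."""
    denominator = beta**2 * precision_value + recall_value
    if denominator == 0:
        return 0.0
    return (1 + beta**2) * precision_value * recall_value / denominator

# Hypothetical model: precise (0.8) but with low recall (0.4).
for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: {f_beta(0.8, 0.4, beta):.3f}")
# beta=0.5: 0.667
# beta=1.0: 0.533
# beta=2.0: 0.444
```

As $\beta$ grows, the score moves away from the precision (0.8) and towards the recall (0.4), so a large $\beta$ penalises this model for its low recall.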


### Understanding the difference between precision and recall


Precision and recall express related but competing goals. To really understand the difference between the two, let's use an example. Imagine that you are an admissions officer at a university. Your goal is to classify the applications you receive such that you admit all of the good candidates, and none of the bad ones.

If you decide to focus only on recall, it means that your goal is to admit as many good candidates as possible, regardless of how many bad candidates you admit by mistake. If you decide to focus only on precision, it means that your goal is that every single candidate you admit is a good candidate, regardless of whether you rejected some good candidates by mistake.

Accepting all applications maximises the recall, since every good candidate is admitted, but it reduces the precision. Similarly, being extremely strict about which applications to accept will increase the precision, but will reduce the recall.
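
The trade-off can be sketched numerically. Assume, purely for illustration, that each application receives a score and that the officer admits everyone at or above a threshold; the scores, the labels and the `admit` helper below are all made up.

```python
import numpy as np

# Hypothetical applications: a score per candidate and whether the candidate is truly good.
scores = np.array([0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10])
is_good = np.array([1, 1, 0, 1, 1, 0, 1, 0, 0, 0], dtype=bool)

def admit(threshold):
    """Admit every candidate scoring at least `threshold`; return (precision, recall).

    Assumes at least one candidate is admitted, otherwise precision is undefined.
    """
    admitted = scores >= threshold
    true_positives = np.sum(admitted & is_good)
    precision_value = true_positives / np.sum(admitted)
    recall_value = true_positives / np.sum(is_good)
    return precision_value, recall_value

for threshold in (0.0, 0.55, 0.85):
    p, r = admit(threshold)
    print(f"threshold={threshold:.2f}: precision={p:.2f}, recall={r:.2f}")
# threshold=0.00: precision=0.50, recall=1.00
# threshold=0.55: precision=0.80, recall=0.80
# threshold=0.85: precision=1.00, recall=0.40
```

Admitting everyone (threshold 0) finds all good candidates at the price of many bad admissions, while the strictest threshold admits only good candidates but rejects most of them.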


### Custom metrics

Of course, it is completely possible to use custom metrics.

A simple example would be to use versions of the aforementioned metrics with class weights of your own choosing, which makes it more important to perform well on certain classes than on others. It is also possible to define a fully custom metric based on a custom error function. For instance, your application may make it much worse to confuse class 1 with class 2 than to confuse it with class 3.

The metric should ultimately represent what it means for your classification to be good, whatever it may mean in your application.
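
One way to express such preferences is a cost matrix that assigns a penalty to every (true class, predicted class) pair. The sketch below is an assumption made for illustration: the penalty values are arbitrary and `mean_cost` is a hypothetical helper. It compares Models A and C, which have the same accuracy.

```python
import numpy as np

# Hypothetical cost matrix: cost[i, j] is the penalty for predicting class j when the truth is class i.
# Correct predictions cost 0, ordinary mistakes cost 1, and predicting class 2 for a true class 1 costs 5.
cost = np.ones((4, 4)) - np.eye(4)
cost[1, 2] = 5.0

def mean_cost(predicted_target, target):
    """Average penalty per prediction (lower is better)."""
    return np.mean(cost[target, predicted_target])

target = np.array([0, 0, 0, 3, 1, 0, 1, 1, 0, 1, 1, 0, 3, 3, 0, 2, 1, 1, 3, 3])
model_a = np.array([3, 0, 3, 2, 1, 3, 2, 2, 0, 1, 1, 3, 1, 0, 0, 0, 0, 1, 0, 3])
model_c = np.array([3, 0, 1, 2, 0, 0, 1, 1, 3, 1, 3, 3, 3, 0, 1, 2, 0, 1, 0, 0])

print(f"A: {mean_cost(model_a, target):.2f}")  # A: 1.00
print(f"C: {mean_cost(model_c, target):.2f}")  # C: 0.60
```

Both models make 12 mistakes out of 20, but Model A twice predicts class 2 for a true class 1, so under this particular cost matrix Model C is clearly preferable.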

## Practical Examples

### Detecting spam e-mails

_Input variables_: Number of times certain words appear in the e-mail
_Target variable_: Whether an e-mail is spam

| "Spam" count | "Mom" count | "Loan" count | "Hello" count | Is spam? |
|:------------:|:-----------:|:------------:|:-------------:|:--------:|
|       4      |      0      |       3      |       0       |    Yes   |
|       0      |      1      |       0      |       1       |    No    |
|       0      |      3      |       1      |       2       |    No    |
|      ...     |     ...     |      ...     |      ...      |    ...   |
|       5      |      0      |       0      |       1       |    Yes   |

This could be useful to create a spam filter.
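
As a minimal sketch of how such input variables could be obtained, the snippet below counts the occurrences of the four words from the table in a piece of text. The tokenisation rule (lowercase runs of letters), the helper `count_words` and the sample e-mail are assumptions made for illustration.

```python
import re
from collections import Counter

VOCABULARY = ["spam", "mom", "loan", "hello"]  # the four words used as columns above

def count_words(email_text):
    """Return how many times each vocabulary word occurs in the e-mail (case-insensitive)."""
    words = Counter(re.findall(r"[a-z]+", email_text.lower()))
    return [words[word] for word in VOCABULARY]

# Hypothetical e-mail, made up for illustration.
word_counts = count_words("Hello! Need a loan? Cheap loan offers, best loan rates.")
print(word_counts)  # [0, 0, 3, 1]
```

Each e-mail becomes one row of counts, which a classifier can then map to "Yes" or "No".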

### Facial recognition

_Input variables_: Intensity of pixels in a 100x100 photo
_Target variable_: Person

| (0,0) | (0,1) | ... | (99,98) | (99,99) | Person   |
|:-----:|:-----:|:---:|:-------:|:-------:|----------|
|  0.81 |  0.72 | ... |   0.41  |   0.55  | Valentin |
|  0.23 |  0.12 | ... |   0.07  |   0.92  | Jack     |
|  0.54 |  0.48 | ... |    0    |   0.31  | Robin    |
|  ...  |  ...  | ... |   ...   |   ...   | ...      |
| 0.71  | 0.79  | ... | 0.37    | 0.81    | Lisa     |

This could be used by an application that tags pictures automatically.
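
A 100x100 photo yields 10,000 input variables, one per pixel. The sketch below shows one way to turn an image array into such a row, assuming a grayscale image stored as a NumPy array; random values stand in for a real photo.

```python
import numpy as np

# Stand-in for a 100x100 grayscale photo with intensities in [0, 1); random values, not a real face.
rng = np.random.default_rng(seed=0)
photo = rng.random((100, 100))

# Row-major flattening: pixel (row, col) becomes feature number row * 100 + col.
pixel_features = photo.reshape(-1)

print(pixel_features.shape)  # (10000,)
```

With this layout, the column labelled (99,98) in the table corresponds to `pixel_features[99 * 100 + 98]`.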

### Predicting whether a team will win a basketball game

_Input variables_: Win percentage, win percentage of the opponent, rebounds per game, rebounds per game by the opponent
_Target variable_: Whether the team will win the game

| Win % | Opponent Win % |  RPG | Opponent RPG | Will win the game |
|:-----:|:--------------:|:----:|:------------:|:-----------------:|
|  0.65 |      0.33      | 45.8 |     42.2     |        Yes        |
|  0.54 |      0.47      | 37.6 |     44.3     |         No        |
|  0.28 |      0.77      | 38.1 |     48.7     |         No        |
|  ...  |       ...      |  ... |      ...     |        ...        |
| 0.38  | 0.43           | 37.8 | 36.9         | Yes               |

This could be used by a team to know when they could rest their star players.