# Multiclass and Multilabel Classification

## Introduction

There is some confusion about the difference between multiclass, multilabel, and multioutput classification. Scikit-Learn's documentation about these topics is rigorous, but a bit terse. In what follows, we aim to provide an unambiguous definition of these three types of classification and to show how one can build classifiers for each of them. We will see that some classifiers are able to perform certain types of classification out of the box, while others need to be *wrapped* in special *meta-estimators* in order to do the same thing.

In this article we will first define the three types of classification, then show how to generate synthetic datasets for each type. Finally, we will see how to build classifiers for each data type, which classifiers need to be wrapped in a meta-estimator, and which ones do not.

**TODO**: find out how to include a TOC in this notebook.

## Difference between multiclass, multilabel and multioutput

We start by setting some terminology that we will use throughout the article. A classification problem involves a set of *samples* or *observations* or *instances*. Each sample has a number of **outputs**, and each output is characterized by a number of **classes**. In a binary classification problem, e.g. classifying pictures of cats and dogs, each image has one underlying output (it depicts an animal) and two classes (cat and dog).

### multiclass data

The difference between multiclass, multilabel and multioutput data is described in [section 1.12](https://scikit-learn.org/stable/modules/multiclass.html) of the Scikit-Learn User Guide. In a multiclass classification problem we still have only *one output* but *more than two classes*. For example, if our dataset consists of images of cats and dogs, and our goal is to build a classifier for these images, we are dealing with a *binary* classification problem. If, instead, our dataset contains images of cats, dogs and horses, we have a *multiclass* classification problem. A subtle but crucial point here is that we are assuming that each image can contain one and only one animal. An animal in a picture can be one and only one of either a cat, a dog or a horse, but it cannot be a cat and a dog at the same time, nor can the picture contain two cats. In other words, each instance:

1. Contains only one entity (a cat, a dog, or a horse).
2. The entities are mutually exclusive.

In what follows, we will always assume point 1, but we will relax point 2.

If we had 5 images of these animals, we could represent the labels in a few ways.

1. An array of labels like `['dog', 'horse', 'cat', 'horse', 'dog']`.
2. An array of integers, where `cat`=0, `dog`=1, `horse`=2. The array at point 1 could therefore be represented as `[1, 2, 0, 2, 1]`.
3. A sparse binary matrix of shape `(n_samples, n_classes)`, where columns 0 to 2 are, respectively, `cat`, `dog`, `horse`, and each of them contains a binary variable indicating whether the instance belongs to the class.

```
[[0, 1, 0],
 [0, 0, 1],
 [1, 0, 0],
 [0, 0, 1],
 [0, 1, 0]]
```

This last representation emphasizes the fact that the three classes are mutually exclusive, as each row can only have one entry equal to 1.
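As a small sketch of how the three representations relate, the conversion can be reproduced with Scikit-Learn's `LabelEncoder` and `LabelBinarizer`. The variable names are illustrative only, and the integer codes depend on the alphabetical order of the class names, which happens to match the convention used above.

```python
from sklearn.preprocessing import LabelBinarizer, LabelEncoder

labels = ['dog', 'horse', 'cat', 'horse', 'dog']

# LabelEncoder sorts the class names, so cat=0, dog=1, horse=2.
encoder = LabelEncoder()
int_labels = encoder.fit_transform(labels)

# LabelBinarizer builds the (n_samples, n_classes) indicator matrix
# (dense by default; pass sparse_output=True for a sparse matrix).
binarizer = LabelBinarizer()
indicator = binarizer.fit_transform(labels)

print(encoder.classes_)
print(int_labels)
print(indicator)
```

Each row of `indicator` sums to 1, which is the mutual exclusivity property discussed above.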

### multilabel data

If we still assume that there is only one output, but we relax the constraint that classes be mutually exclusive, we have a *multilabel* classification problem. A typical example is provided by text documents. Each sample (i.e., document) has one output, its *topic*. The article topic can be classified, for example, into politics, sport, entertainment and crime. Therefore we have a classification problem with one output and four classes. Unlike the multiclass case, however, some articles may contain multiple topics. An article about a corrupt politician facing trial would belong to both the crime *and* politics classes. Similarly, an article about an athlete taking part in a charity TV show would belong to the sport *and* entertainment classes.

More formally, we can say that given $K$ classes (4, in our example above), each document can belong to $x$ of them.
If an article touches all of politics, sport, entertainment and crime, then $x = K$. What is, however, the smallest number of classes a document can belong to? This depends on how we collect our data. If our dataset also contains articles about computer programming, a subject that does not appear in the four labels above, then we must admit that some articles may not be labeled at all. The lowest possible value of $x$ would, in this case, be 0.
If, conversely, we take care of including in our dataset only articles which refer to at least one of the four topics above, then the lowest possible value of $x$ is 1. The Scikit-Learn function `make_multilabel_classification` covers these two scenarios via the `allow_unlabeled` argument, as we will see below.

In Scikit-Learn, multilabel datasets must be represented as either sparse or dense matrices of shape `(n_samples, n_classes)`, where each column represents a class. If sample $i$ belongs to class $j, j \in \{0, \ldots, K-1\}$, then label $y_{ij} = 1$. Since classes are not mutually exclusive, each row can have multiple, or even all, entries equal to 1. If we index our classes as `crime` = 0, `entertainment` = 1, `politics` = 2 and `sport` = 3, a document covering both `politics` and `crime` would be represented as `[1, 0, 1, 0]`, while a document about `entertainment` and `sport` would be represented as `[0, 1, 0, 1]`. A dataset consisting of these two samples would be a dense or sparse matrix of the form

```
[[1, 0, 1, 0],
 [0, 1, 0, 1]]
```
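One way to build such an indicator matrix from lists of topic names is `MultiLabelBinarizer`. This is only a sketch: the list `docs_topics` is a made-up stand-in for the two documents described above, and the column order follows the sorted class names.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical topic lists for the two example documents.
docs_topics = [['politics', 'crime'], ['entertainment', 'sport']]

# Classes are sorted alphabetically: crime, entertainment, politics, sport.
mlb = MultiLabelBinarizer()
y_docs = mlb.fit_transform(docs_topics)

print(mlb.classes_)
print(y_docs)
```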

**TODO** include the example about `MultiOutputClassifier`.

### multioutput (a.k.a. multitask) classification

In a multilabel dataset, each sample can belong to multiple classes, but each individual class is binary. Therefore, an article can be about politics and crime, but when restricting to a given class, say crime, the article is either about crime, or it is not. Put differently, each entry of the multilabel array can only be 0 or 1, to indicate the absence or presence of the corresponding label.

**TODO** understand whether multioutput generalizes only multiclass or also multilabel.

A multioutput (more precisely, multiclass-multioutput) problem drops this binary restriction:

1. Each sample has > 1 output.
2. Each output has more than 2 classes.

Note, however, that the classes in each output are mutually exclusive. The Scikit-Learn documentation shows an example where, given a set of images, we measure two outputs: the type of fruit and its color. The type of fruit is one of `['apple', 'pear', 'orange']`, and the color is one of `['green', 'red', 'yellow', 'orange']`. These classes are mutually exclusive (we assume, for simplicity, that a fruit is only one color). As such, this would be a generalization of the multiclass case. This example would be represented by associating with each output a column vector containing the class label of each instance for that particular output. To clarify, if our dataset contains a red apple, an orange orange and a green pear, we would create two column vectors, one for the fruit and one for the color, and we would combine them as follows.

In [1]:
import numpy as np

fruits = ['apple', 'orange', 'pear']
colors = ['red', 'orange', 'green']
dataset = np.stack([fruits, colors], axis=1)
dataset

array([['apple', 'red'],
       ['orange', 'orange'],
       ['pear', 'green']], dtype='<U6')

**TODO** remember to mention that "all classifiers handling multioutput-multiclass (also known as multitask classification) tasks, support the multilabel classification task as a special case."
**NOTE**: it seems that the recommended way to handle multilabel classification problems is via `MultiOutputClassifier`. Is this the case?
**TODO** mention the difference between methods that treat each label independently, like `MultiOutputClassifier`, and those that consider them jointly.


**TODO** confirm that the case below is not covered.

Let's now consider the text document example, with the same four classes as before, but with an additional output describing the mood of the article. For simplicity let's imagine three classes: `happy`, `sad` and `neutral`. An article about an unjustly convicted athlete who is found innocent after spending years in prison would have a sad beginning and a happy ending. This article would therefore belong both to the `sport` and `crime` classes of the `subject` output and to the `happy` and `sad` classes of the `mood` output. This is a generalization of the multilabel case. If, as before, we index the classes for the `subject` output as `crime` = 0, `entertainment` = 1, `politics` = 2 and `sport` = 3 and the classes for the `mood` output as `happy` = 0, `neutral` = 1, `sad` = 2, the document we just described would be represented as `[1, 0, 0, 1]` for the subject and `[1, 0, 1]` for the mood.

At the time of writing, there is no metric in the `sklearn.metrics` module that supports multioutput-multiclass classification.

Summarizing, we can use the following taxonomy (from the Scikit-Learn User Guide) of the models described so far.

| Type        | Number of Targets | Target Cardinality | Valid Type of Target     |
| ---         | ---               | ---                | ---                      |
| Multiclass  | 1                 | > 2                | `multiclass`             |
| Multilabel  | > 1               | 2 (0 or 1)         | `multilabel-indicator`   |
| Multioutput | > 1               | > 2                | `multiclass-multioutput` |
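The last column of the table can be checked with `type_of_target` from `sklearn.utils.multiclass`, which reports how Scikit-Learn interprets a target array. The arrays below are small hand-made examples; in particular, `y_multioutput` is a hypothetical integer encoding of the fruit/color data, not something taken from the documentation.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

y_binary = np.array([0, 1, 1, 0])
y_multiclass = np.array([1, 2, 0, 2, 1])
y_multilabel = np.array([[1, 0, 1, 0],
                         [0, 1, 0, 1]])
# Hypothetical encoding: column 0 is the fruit, column 1 is the color.
y_multioutput = np.array([[0, 2],
                          [1, 1],
                          [2, 0]])

target_types = {
    'binary': type_of_target(y_binary),
    'multiclass': type_of_target(y_multiclass),
    'multilabel': type_of_target(y_multilabel),
    'multioutput': type_of_target(y_multioutput),
}
print(target_types)
```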


## Multiclass classification

### Fitting a multiclass model

Let's start by creating a synthetic multiclass dataset containing 100 samples, each belonging to one of three possible classes.

In [2]:
from sklearn.datasets import make_classification

x1, y1 = make_classification(n_samples=100, n_features=20, n_informative=5,
                             n_classes=3, random_state=42)
y1[:5]

array([0, 1, 1, 2, 1])

We split this dataset into a training and test set comprising 80% and 20% of the dataset, respectively.

In [3]:
from sklearn.model_selection import train_test_split

x1_train, x1_test, y1_train, y1_test = train_test_split(x1, y1, test_size=0.2, stratify=y1,
                                                       random_state=42)
x1_train.shape, y1_train.shape, x1_test.shape, y1_test.shape

((80, 20), (80,), (20, 20), (20,))

`LogisticRegression` has a `multi_class` argument equal to `auto` by default. This estimator is described as "inherently multiclass". Let's fit this estimator with default settings. As we can see, the model infers from the format of the targets that this is a multiclass problem, and returns predictions in the same format.

In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

clf = LogisticRegression()
clf.fit(x1_train, y1_train)
yhat1_test = clf.predict(x1_test)
y1_test[:5], yhat1_test[:5]

(array([0, 2, 0, 0, 1]), array([0, 1, 2, 0, 1]))

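We can be more explicit and set `multi_class='multinomial'`. The result is the same. (Note that the `multi_class` argument is deprecated since Scikit-Learn 1.5, so recent versions emit a warning here.)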

In [5]:
clf = LogisticRegression(multi_class='multinomial')
clf.fit(x1_train, y1_train)
tmp = clf.predict(x1_test)
np.all(tmp == yhat1_test)

True

### Assessing the performance of a multiclass classifier

The accuracy of a multiclass classifier is assessed the same way as a binary classifier: we simply count the fraction of samples where our prediction is identical to the actual target.

In [6]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

accuracy_score(y1_test, yhat1_test)

0.75

In [7]:
print(np.sum(y1_test == yhat1_test) / len(y1_test))

0.75


On the other hand, scores like precision and recall, that rely on the idea of "positive" and "negative", are less obvious. We must therefore provide information on how such terms should be interpreted. If we write `f1_score(y1_test, yhat1_test)` we get an error, which tells us that

```
Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted']
```

In other words, if we consider each of the 3 classes in turn as the "positives" and the other classes as the "negatives", we end up with three scores, one per class. The question is how we should average these three scores. The possibilities are `None`, `micro`, `macro`, `weighted`.

The first option, `None`, does not average at all, but rather returns the score (here, F1) for each class when considered as the positive one.

In [8]:
f1_none = f1_score(y1_test, yhat1_test, average=None)
f1_none

array([0.83333333, 0.76923077, 0.66666667])

`macro` computes the score of each of the three classes and takes their unweighted mean, without taking the class cardinality into consideration, i.e., classes with a very large number of samples have the same weight in the average (in our case 1/3) as classes with very few samples.

In [9]:
f1_macro = f1_score(y1_test, yhat1_test, average='macro')
f1_macro

0.7564102564102564

According to the definition above, the value of `f1_macro` should simply be the average of the values of `f1_none`. Let's verify that this is the case.

In [10]:
f1_none.mean()

0.7564102564102564

To understand how the `micro` average works, let's consider a simpler example: we have 3 classes and two samples per class. We compare the true labels with our predictions, `yhat_simple`.

In [11]:
y_simple    = np.array([0, 1, 0, 2, 1, 2])
yhat_simple = np.array([0, 1, 0, 0, 2, 2])

accuracy_score(y_simple, yhat_simple)

0.6666666666666666

Three samples are predicted as class 0: the two true 0s, plus one sample of class 2 that is misclassified as a 0. Therefore precision for class 0 is 2/3.
Only one sample is predicted as class 1, and it is correct, so precision for class 1 is 1. (The other class 1 sample is missed, which lowers recall but not precision.)
Two samples are predicted as class 2: one is correct, and the other is a class 1 sample misclassified as a 2, so precision for this class is 1/2. Therefore, if $p_k$ represents precision for class $k, k \in \{0, 1, 2\}$, we have:

$p_0 = 2/3$, $p_1 = 1$ and $p_2 = 1/2$.
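These per-class values can also be read off the confusion matrix. The following is a minimal sketch, assuming the usual Scikit-Learn convention that rows are true classes and columns are predicted classes; the helper variable names are ours.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_simple = np.array([0, 1, 0, 2, 1, 2])
yhat_simple = np.array([0, 1, 0, 0, 2, 2])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_simple, yhat_simple)

true_positives = np.diag(cm)          # correct predictions per class
predicted_per_class = cm.sum(axis=0)  # column sums: predictions per class
per_class_precision = true_positives / predicted_per_class

print(cm)
print(per_class_precision)
```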

If we average the three precisions we get the `macro` estimate.

In [12]:
(2/3 + 1 + 1/2)/3

0.7222222222222222

In [13]:
precision_score(y_simple, yhat_simple, average='macro')

0.7222222222222222

**TODO** re-read this paragraph and make sure it is correct.

In the `micro` case we pool the counts over all classes before computing the metric. Let $n_k$ be the number of samples *predicted* as class $k$ (not the number of true samples of class $k$, which is what Scikit-Learn calls the support). Then $n_k p_k$ is the number of correct predictions for class $k$. More precisely:

For class 0 we have 3 predictions, 2 of which are correct: $n_0 = 3$ and $p_0 = 2/3$.
For class 1 we have 1 prediction, which is correct: $n_1 = 1$ and $p_1 = 1$.
For class 2 we have 2 predictions, one of which is correct: $n_2 = 2$ and $p_2 = 1/2$.

$$
\frac{n_0 p_0 + n_1 p_1 + n_2 p_2}{n_0 + n_1 + n_2} =
\frac{3 \cdot 2/3 + 1 \cdot 1 + 2 \cdot 1/2}{3 + 1 + 2} = \frac{4}{6} \approx 0.667
$$

In [14]:
precision_score(y_simple, yhat_simple, average='micro')

0.6666666666666666
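It is no coincidence that this value equals the accuracy computed earlier. In a single-label multiclass problem every sample receives exactly one prediction, so the pooled number of predictions is the number of samples, and micro-averaged precision reduces to accuracy. A short sketch of that check, using the confusion matrix (variable names are ours):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

y_simple = np.array([0, 1, 0, 2, 1, 2])
yhat_simple = np.array([0, 1, 0, 0, 2, 2])

cm = confusion_matrix(y_simple, yhat_simple)

# Pool the counts: all correct predictions over all predictions.
micro_by_hand = np.diag(cm).sum() / cm.sum()

micro_sklearn = precision_score(y_simple, yhat_simple, average='micro')
accuracy = accuracy_score(y_simple, yhat_simple)
print(micro_by_hand, micro_sklearn, accuracy)
```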

The weighted case computes the per-class metric and uses the class cardinality as a weight to compute the average. Since in our case all classes have size 2, this ends up being the same as the macro average.

In [15]:
precision_score(y_simple, yhat_simple, average='weighted')

0.7222222222222222
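With balanced classes the difference between `macro` and `weighted` is invisible, so here is a sketch on a deliberately imbalanced toy example. The arrays `y_imb` and `yhat_imb` are made up for illustration and do not come from the datasets used elsewhere in this article.

```python
import numpy as np
from sklearn.metrics import precision_score

# Hypothetical imbalanced data: four samples of class 0, one each of 1 and 2.
y_imb = np.array([0, 0, 0, 0, 1, 2])
yhat_imb = np.array([0, 0, 0, 1, 1, 2])

per_class = precision_score(y_imb, yhat_imb, average=None)
support = np.bincount(y_imb)  # number of true samples per class

# Weighted average: per-class precision weighted by the true class counts.
weighted_by_hand = np.average(per_class, weights=support)
weighted_sklearn = precision_score(y_imb, yhat_imb, average='weighted')
macro_sklearn = precision_score(y_imb, yhat_imb, average='macro')

print(per_class, support)
print(weighted_by_hand, weighted_sklearn, macro_sklearn)
```

Here the majority class has perfect precision, so the weighted average comes out higher than the macro average.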

`classification_report` prints this information, and more, in tabular form.

In [16]:
print(classification_report(y_simple, yhat_simple))

              precision    recall  f1-score   support

           0       0.67      1.00      0.80         2
           1       1.00      0.50      0.67         2
           2       0.50      0.50      0.50         2

    accuracy                           0.67         6
   macro avg       0.72      0.67      0.66         6
weighted avg       0.72      0.67      0.66         6



We can produce the same report for our synthetic dataset.

In [17]:
print(classification_report(y1_test, yhat1_test))

              precision    recall  f1-score   support

           0       1.00      0.71      0.83         7
           1       0.71      0.83      0.77         6
           2       0.62      0.71      0.67         7

    accuracy                           0.75        20
   macro avg       0.78      0.75      0.76        20
weighted avg       0.78      0.75      0.76        20



### When to use micro, macro, weighted etc.

According to the Scikit-Learn User Guide (section 3.3.2.1):

-  In problems where infrequent classes are nonetheless important, `macro`-averaging may be a means of highlighting their performance. On the other hand, the assumption that all classes are equally important is often untrue, such that macro-averaging will over-emphasize the typically low performance on an infrequent class.
- `micro` averaging may be preferred in multilabel settings, including multiclass classification where a majority class is to be ignored.
- `weighted` averaging accounts for class imbalance by computing the average of binary metrics in which each class’s score is weighted by its presence in the true data sample.

## Multilabel classification

We can create a multilabel dataset with the `make_multilabel_classification` function from `sklearn.datasets`. If we use the function without specifying any argument, we will get 100 samples and 5 binary classes, where each sample can belong to multiple classes.

In [18]:
from sklearn.datasets import make_multilabel_classification

x2, y2 = make_multilabel_classification(random_state=42)
y2[:5]

array([[0, 0, 0, 1, 0],
       [1, 1, 1, 0, 0],
       [0, 0, 1, 1, 0],
       [1, 0, 0, 0, 0],
       [1, 0, 1, 0, 0]])

We can get the same targets as a list by setting the `return_indicator` argument to `False`.

In [19]:
_, y2 = make_multilabel_classification(return_indicator=False, random_state=42)
y2[:5]

[[3], [0, 1, 2], [2, 3], [0], [0, 2]]

There is an `n_labels` argument, but setting it to, say, 3 does not mean that every sample gets exactly 3 labels.

In this example we have 100 samples and 5 classes. Each sample belongs to a number of classes $x$, where $x$ is sampled from a Poisson distribution with mean `n_labels`. If we allow some samples to be unlabeled, by setting the `allow_unlabeled` option to `True`, the lower bound of $x$ is zero. If `allow_unlabeled=False`, the lower bound on $x$ is 1, i.e., each sample must belong to at least one class. The maximum number of classes a sample can belong to is the total number of available classes. `n_labels` therefore determines the distribution of the number of classes each sample belongs to. If we set `n_labels=1` and we allow for unlabeled samples, many samples will be unlabeled.

In [20]:
_, y2 = make_multilabel_classification(n_labels=1, random_state=42, allow_unlabeled=True)
y2[:5]

array([[0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0]])

If we impose that all samples must belong to at least one class, and `n_labels=1`, most samples will belong to only one class.

In [21]:
_, y2 = make_multilabel_classification(n_labels=1, random_state=42, allow_unlabeled=False)
y2[:5]

array([[0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [1, 0, 1, 0, 0],
       [0, 1, 1, 0, 0],
       [0, 1, 0, 0, 0]])

Conversely, if we set `n_labels=5`, samples will tend to belong to many classes.

In [22]:
_, y2 = make_multilabel_classification(n_labels=5, random_state=42, allow_unlabeled=False)
y2[:5]

array([[1, 1, 1, 1, 0],
       [0, 1, 1, 0, 0],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 0, 1],
       [1, 1, 1, 0, 0]])
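Looking at five rows only gives a rough impression. As a sketch, the effect of `n_labels` and `allow_unlabeled` can be summarized by counting the labels per sample on a larger draw; the sample size of 1000 and the seed are arbitrary choices made for this illustration.

```python
from sklearn.datasets import make_multilabel_classification

summary = {}
for n_labels in (1, 3, 5):
    _, y = make_multilabel_classification(n_samples=1000, n_classes=5,
                                          n_labels=n_labels,
                                          allow_unlabeled=False,
                                          random_state=0)
    counts = y.sum(axis=1)  # number of labels assigned to each sample
    summary[n_labels] = (counts.min(), counts.max(), counts.mean())
    print(n_labels, summary[n_labels])

# With allow_unlabeled=True some rows may contain no label at all.
_, y_unl = make_multilabel_classification(n_samples=1000, n_classes=5,
                                          n_labels=1, allow_unlabeled=True,
                                          random_state=0)
n_unlabeled = int((y_unl.sum(axis=1) == 0).sum())
print(n_unlabeled)
```

The average number of labels per sample grows with `n_labels` but does not equal it exactly, because the counts are bounded below by 1 (or 0) and above by the number of classes.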

Let's create a dataset of 100 samples with 20 features and 4 classes, where on average each sample belongs to about 2 classes.

In [23]:
x2, y2 = make_multilabel_classification(n_samples=100, n_features=20, n_labels=2, n_classes=4,
                                       random_state=42)
print(y2[:5])
x2_train, x2_test, y2_train, y2_test = train_test_split(x2, y2, test_size=0.2, random_state=42)

[[0 0 1 0]
 [1 0 1 0]
 [0 1 0 1]
 [0 0 1 0]
 [0 1 1 0]]


Logistic regression is not listed among the estimators that support multilabel classification. If we create a `LogisticRegression` estimator and we fit it to `x2_train` and `y2_train`, we get an error:

> `y should be a 1d array, got an array of shape (80, 4) instead.`

We will try two approaches:

1. Wrapping the estimator into `OneVsRestClassifier()`.
2. Wrapping the estimator into `MultiOutputClassifier()`.

### `OneVsRestClassifier()`

In [24]:
from sklearn.multiclass import OneVsRestClassifier

clf_ovr = OneVsRestClassifier(LogisticRegression())
clf_ovr.fit(x2_train, y2_train)
yhat2_test_ovr = clf_ovr.predict(x2_test)

In [25]:
yhat2_test_ovr[:5]

array([[1, 0, 1, 1],
       [1, 1, 0, 1],
       [0, 0, 0, 1],
       [0, 0, 0, 0],
       [0, 0, 0, 0]])

### `MultiOutputClassifier()`

If we wrap `LogisticRegression()` into `MultiOutputClassifier()` we obtain the same result.

In [26]:
from sklearn.multioutput import MultiOutputClassifier

clf_moc = MultiOutputClassifier(LogisticRegression())
clf_moc.fit(x2_train, y2_train)
yhat2_test_moc = clf_moc.predict(x2_test)

In [27]:
print(yhat2_test_moc[:5])
print(np.all(yhat2_test_moc == yhat2_test_ovr))

[[1 0 1 1]
 [1 1 0 1]
 [0 0 0 1]
 [0 0 0 0]
 [0 0 0 0]]
True
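A plausible explanation for the identical predictions is that, on a multilabel indicator target, both wrappers fit one independent binary classifier per label column. The sketch below rebuilds the dataset used above and checks that fitting one `LogisticRegression` per column by hand gives the same predictions as `OneVsRestClassifier`. The `max_iter=1000` setting is our own choice, made only to avoid convergence warnings.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

x2, y2 = make_multilabel_classification(n_samples=100, n_features=20,
                                        n_labels=2, n_classes=4,
                                        random_state=42)
x2_train, x2_test, y2_train, y2_test = train_test_split(
    x2, y2, test_size=0.2, random_state=42)

# Fit one independent binary classifier per label column.
manual_pred = np.column_stack([
    LogisticRegression(max_iter=1000)
    .fit(x2_train, y2_train[:, k])
    .predict(x2_test)
    for k in range(y2_train.shape[1])
])

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr_pred = ovr.fit(x2_train, y2_train).predict(x2_test)

same = np.array_equal(manual_pred, ovr_pred)
print(same)
```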


One difference between these two approaches is the representation of the classification probabilities. The model wrapped into `OneVsRestClassifier()` returns an array, while the model wrapped into `MultiOutputClassifier()` returns a list.

In [28]:
prob2_test_ovr = clf_ovr.predict_proba(x2_test)
prob2_test_moc = clf_moc.predict_proba(x2_test)

type(prob2_test_ovr), type(prob2_test_moc)

(numpy.ndarray, list)

The element $p_{ij}$ of the NumPy array representation contains the probability that sample $i$ belongs to class $j$. Looking at the first row of the array below, we see that the probabilities of belonging to classes 0 and 2 are above 0.9, while the probability for class 3 is barely above 0.5, but this is enough to predict that the sample also belongs to this class.

In [29]:
prob2_test_ovr[:5]

array([[0.92785382, 0.01756551, 0.97444888, 0.51609456],
       [0.996253  , 0.92493145, 0.00850985, 0.67339801],
       [0.00191349, 0.25360644, 0.01994427, 0.9999983 ],
       [0.0628209 , 0.00164815, 0.04906532, 0.00534206],
       [0.00197308, 0.42121543, 0.24990504, 0.00276496]])

The list representation contains as many elements as we have classes (4 in our case). If we look into the first element, we see that:

1. The sum of the elements in each row is 1.
2. The values in the second column are identical to the values in the first column of the array representation.

The interpretation is that the $k$th element of this list contains, for each sample, the probabilities of being a negative (column 0) or a positive (column 1) for class $k$.

In [30]:
print(len(prob2_test_moc))
prob2_test_moc[0][:5]

4


array([[0.07214618, 0.92785382],
       [0.003747  , 0.996253  ],
       [0.99808651, 0.00191349],
       [0.9371791 , 0.0628209 ],
       [0.99802692, 0.00197308]])
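Given this interpretation, the list can be converted into the array layout by keeping the "positive" column of each element. This is a sketch that rebuilds the same dataset and both wrappers; `max_iter=1000` is again our own addition to avoid convergence warnings.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import MultiOutputClassifier

x2, y2 = make_multilabel_classification(n_samples=100, n_features=20,
                                        n_labels=2, n_classes=4,
                                        random_state=42)
x2_train, x2_test, y2_train, y2_test = train_test_split(
    x2, y2, test_size=0.2, random_state=42)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
moc = MultiOutputClassifier(LogisticRegression(max_iter=1000))

prob_ovr = ovr.fit(x2_train, y2_train).predict_proba(x2_test)
prob_moc_list = moc.fit(x2_train, y2_train).predict_proba(x2_test)

# Keep column 1 (probability of the positive class) for each label.
prob_moc = np.column_stack([p[:, 1] for p in prob_moc_list])

print(prob_moc.shape)
print(np.allclose(prob_moc, prob_ovr))
```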

## Metrics in the multilabel case

The evaluation of a multilabel model adds one additional layer of complexity, in that we must decide how to handle the various labels. Samples will usually belong to different numbers of classes. Suppose we have the targets and the predictions shown below.

In [31]:
tgts = [[0, 0, 0, 1],
        [1, 1, 1, 1]]
preds = [[1, 1, 1, 1],
         [1, 1, 1, 0]]

In the first case we are correctly predicting the positive class, but misclassifying all the others. In the second case we are misclassifying one class, but correctly predicting the other 3. Is the first one correctly classified or not? Should we somehow account for the fact that the first prediction is 3/4 wrong and the second is 3/4 right?

### Accuracy

If we use `accuracy_score` without further arguments, we get a value of zero. This happens because the default behavior is that if the prediction is not identical to the target, the whole sample is counted as incorrect. Therefore we have two misclassifications over two samples, and an accuracy of 0.0.

In [32]:
accuracy_score(tgts, preds)

0.0
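If partial credit is wanted, one option (not discussed further in this article) is `hamming_loss` from `sklearn.metrics`, which counts the fraction of individual label entries that are wrong rather than the fraction of samples. A minimal sketch on the same targets and predictions:

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

tgts = np.array([[0, 0, 0, 1],
                 [1, 1, 1, 1]])
preds = np.array([[1, 1, 1, 1],
                  [1, 1, 1, 0]])

subset_accuracy = accuracy_score(tgts, preds)  # exact-match accuracy
label_error = hamming_loss(tgts, preds)        # fraction of wrong entries

# The same quantity computed directly: 4 wrong entries out of 8.
label_error_by_hand = (tgts != preds).mean()

print(subset_accuracy, label_error, label_error_by_hand)
```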

### Precision, recall, f1 score

For multilabel problems, the notion of precision and recall can be applied to each label independently. As before, the question becomes how to average the class-specific values. Let's start with `average=None`, where we compute the per-class precision score. Precision is defined as the fraction of actual positives among what we classify as positive. For class 0 we predict that both samples are positive, but only one is, therefore $p_0 = 0.5$. The same holds for $p_1$ and $p_2$. The last label is different, in that we predict one positive, which is an actual positive, and one negative. Only predicted positives are considered in the computation of precision, therefore $p_3 = 1$. Let's verify this is the case.

In [36]:
p_none = precision_score(tgts, preds, average=None)
p_none

array([0.5, 0.5, 0.5, 1. ])

The `macro` average of precision is the average of the entries of `p_none`.

In [35]:
precision_score(tgts, preds, average='macro')

0.625

In [37]:
p_none.mean()

0.625

The `micro` version pools the counts over all labels. Let $n_k$ be the number of samples predicted as positive for class $k$. We observe that:

1. For each of classes 0, 1, and 2 we predict 2 positives, one of which is correct: $p = 0.5$ and $n = 2$.
2. For class 3 we predict 1 positive, and it is correct: $p = 1$ and $n = 1$.

Therefore, $p_0 = p_1 = p_2 = 0.5$, $p_3 = 1$, $n_0 = n_1 = n_2 = 2$ and $n_3 = 1$, so that:

$$ \frac{n_0 p_0 + n_1 p_1 + n_2 p_2 + n_3 p_3}{n_0 + n_1 + n_2 + n_3} = \frac{4}{7} \approx 0.571 $$

In [39]:
(2*0.5 + 2*0.5 + 2*0.5 + 1) / 7

0.5714285714285714

In [40]:
precision_score(tgts, preds, average='micro')

0.5714285714285714
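The same pooled computation can be written directly on the indicator matrices with NumPy. This is a sketch of the counting argument rather than of Scikit-Learn's internals; the variable names are ours.

```python
import numpy as np
from sklearn.metrics import precision_score

tgts = np.array([[0, 0, 0, 1],
                 [1, 1, 1, 1]])
preds = np.array([[1, 1, 1, 1],
                  [1, 1, 1, 0]])

# Per label: predicted positives that are actual positives.
tp_per_label = (tgts * preds).sum(axis=0)
# Per label: all predicted positives.
predicted_per_label = preds.sum(axis=0)

per_label_precision = tp_per_label / predicted_per_label
micro_by_hand = tp_per_label.sum() / predicted_per_label.sum()

print(per_label_precision)
print(micro_by_hand, precision_score(tgts, preds, average='micro'))
```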

## TODO Log-loss