# Evaluation

This notebook discusses multi-label evaluation methods using the [academia.stackexchange.com](https://academia.stackexchange.com/) data dump.

Evaluation metrics for multi-label classification differ from binary classification metrics because each example can carry several labels at once, so a prediction can be partially correct. The metrics are therefore divided into two main groups: example-based and label-based.

## Table of Contents
* [Example-based](#example_based)
* [Label-based](#label_based)

<a id='example_based'/>

## Example-based

> The example-based evaluation measures are based on the average differences of the actual and the predicted sets of labels over all examples of the evaluation dataset. <cite>[(Madjarov et al. 2012)][0]</cite>


**Hamming loss**
> The Hamming loss is the fraction of labels that are incorrectly predicted.
<cite>[scikit-learn][1]</cite>
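
As a minimal sketch of the definition above, the snippet below computes the Hamming loss with `sklearn.metrics.hamming_loss` and checks it against a manual count. The indicator matrices `y_true` and `y_pred` are toy data invented for illustration; they are not taken from the Stack Exchange dump.

```python
import numpy as np
from sklearn.metrics import hamming_loss

# Toy indicator matrices (rows = examples, columns = labels); invented for illustration.
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [1, 0, 1, 1]])

# 3 of the 12 label slots disagree, so the loss is 3 / 12.
loss = hamming_loss(y_true, y_pred)

# The same value as the mean of element-wise mismatches.
manual_loss = np.mean(y_true != y_pred)

print(loss)  # 0.25
```

Lower is better here: a perfect prediction has a Hamming loss of 0.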

**Accuracy**
> In multilabel classification, this function computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in y_true.
<cite>[scikit-learn][2]</cite>
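
To illustrate how strict subset accuracy is, the following sketch applies `sklearn.metrics.accuracy_score` to toy indicator matrices (invented for illustration). Only rows that match label for label count as correct.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy indicator matrices (rows = examples, columns = labels); invented for illustration.
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [1, 0, 1, 1]])

# Only the second example is predicted exactly, so subset accuracy is 1 / 3.
subset_accuracy = accuracy_score(y_true, y_pred)

# Manual check: fraction of rows in which every label matches.
manual_accuracy = np.mean(np.all(y_true == y_pred, axis=1))

print(subset_accuracy)
```

On the same toy data the Hamming loss is 0.25, which shows that a model can get most individual labels right and still score low on subset accuracy.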

**Precision**
> The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.
<cite>[scikit-learn][3]</cite>

**Recall**
> The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.
<cite>[scikit-learn][4]</cite>
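
One way to use precision and recall as example-based measures is to compute them per example and average the results, which scikit-learn exposes through `average='samples'`. The sketch below assumes that setting and uses toy indicator matrices invented for illustration.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Toy indicator matrices (rows = examples, columns = labels); invented for illustration.
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [1, 0, 1, 1]])

# Per-example precision is 1, 1 and 2/3; their mean is 8/9.
precision = precision_score(y_true, y_pred, average='samples')

# Per-example recall is 1/2, 1 and 2/3; their mean is 13/18.
recall = recall_score(y_true, y_pred, average='samples')

print(f"precision={precision:.4f} recall={recall:.4f}")
```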

**F1 score**
> The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. The relative contribution of precision and recall to the F1 score are equal. The formula for the F1 score is: *F1 = 2 * (precision * recall) / (precision + recall)*
<cite>[scikit-learn][5]</cite>
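
The following sketch computes an example-based F1 score with `average='samples'` and compares it with a manual per-example calculation. It relies on the fact that F1 equals 2·tp / (|true labels| + |predicted labels|) for each example. The toy matrices are invented for illustration, and the manual version assumes every example has at least one true or predicted label.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy indicator matrices (rows = examples, columns = labels); invented for illustration.
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [1, 0, 1, 1]])

# F1 is computed for every example and then averaged.
f1_samples = f1_score(y_true, y_pred, average='samples')

# Manual check: true positives per example, then 2*tp / (|true| + |pred|).
tp = (y_true & y_pred).sum(axis=1)
per_example_f1 = 2 * tp / (y_true.sum(axis=1) + y_pred.sum(axis=1))
manual_f1 = per_example_f1.mean()

print(f"{f1_samples:.4f}")  # 0.7778
```

Note that this averaged F1 (7/9) is generally not the same as plugging the averaged precision and recall into the F1 formula.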

[0]: https://doi.org/10.1016/j.patcog.2012.03.004
[1]: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.hamming_loss.html?highlight=hamming_loss
[2]: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html?highlight=accuracy#sklearn.metrics.accuracy_score
[3]: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html?highlight=precision#sklearn.metrics.precision_score
[4]: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html?highlight=recall_score
[5]: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html?highlight=f1_score#sklearn.metrics.f1_score

<a id='label_based'/>

## Label-based

> The label-based evaluation measures [...] assess the predictive performance for each label separately and then average the performance over all labels. 
<cite>[(Madjarov et al. 2012)][0]</cite>

**Micro**
> Calculate metrics globally by counting the total true positives, false negatives and false positives.
<cite>[scikit-learn][5]</cite>

**Macro**
> Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
<cite>[scikit-learn][5]</cite>
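
The difference between the two averaging modes can be sketched with `sklearn.metrics.f1_score` and its `average` parameter. The toy indicator matrices below are invented for illustration; the third label is never predicted correctly, which pulls the macro average down more than the micro average.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy indicator matrices (rows = examples, columns = labels); invented for illustration.
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [1, 0, 1, 1]])

# Micro: pool all label decisions (tp=4, fp=1, fn=2), so F1 = 8 / 11.
f1_micro = f1_score(y_true, y_pred, average='micro')

# Macro: per-label F1 is 1, 2/3, 0 and 1; the unweighted mean is 2/3.
f1_macro = f1_score(y_true, y_pred, average='macro')

# average=None returns the per-label scores that macro averages over.
f1_per_label = f1_score(y_true, y_pred, average=None)

print(f"micro={f1_micro:.4f} macro={f1_macro:.4f}")
```

Because macro averaging weights every label equally, a rare label that is predicted poorly affects it as much as a frequent one; micro averaging is dominated by the frequent labels.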
