# Evaluation

Measuring the quality of a classifier is a necessary step towards improving it. The main metrics for text classification are:

- Precision: Number of documents correctly assigned to a category out of the total number of documents predicted to belong to that category
- Recall: Number of documents correctly assigned to a category out of the total number of documents that actually belong to that category
- F1: Metric that combines precision and recall using a harmonic mean
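
To make the three definitions concrete, here is a minimal sketch that computes them from raw counts in plain Python. The helper name `binary_metrics` and the convention of reporting 0.0 when a denominator is empty are assumptions made for illustration, not part of any library.

```python
def binary_metrics(labels, predictions, positive=1):
    """Return (precision, recall, f1) with respect to the `positive` class."""
    pairs = list(zip(labels, predictions))
    # True positives: predicted positive and actually positive
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    # False positives: predicted positive but actually something else
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    # False negatives: actually positive but predicted something else
    fn = sum(1 for y, p in pairs if y == positive and p != positive)

    # Assumption: a metric with an empty denominator is reported as 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


precision, recall, f1 = binary_metrics([1, 0, 1], [1, 0, 0])
print("Precision: {}, Recall: {}, F1-measure: {}".format(precision, recall, f1))
```

With these toy labels there is one true positive, no false positive and one false negative, so precision is 1.0 and recall is 0.5.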

In [30]:
from sklearn.metrics import f1_score, precision_score, recall_score

# Binary problem
binary_labels = [1, 0, 1]
binary_predictions = [1, 0, 0]

# Quality values (with respect to class 1 by default)
print("Binary quality")
print("Precision: {}, Recall: {}, F1-measure: {}".format(precision_score(binary_labels, binary_predictions),
                                                         recall_score(binary_labels, binary_predictions),
                                                         f1_score(binary_labels, binary_predictions)))

binary_labels = ["A", "B", "A"]
binary_predictions = ["A", "B", "B"]

# Quality values (with respect to class A)
print("Precision: {}, Recall: {}, F1-measure: {}".format(precision_score(binary_labels, binary_predictions, pos_label="A"),
                                                         recall_score(binary_labels, binary_predictions, pos_label="A"),
                                                         f1_score(binary_labels, binary_predictions, pos_label="A")))

Binary quality
Precision: 1.0, Recall: 0.5, F1-measure: 0.6666666666666666
Precision: 1.0, Recall: 0.5, F1-measure: 0.6666666666666666


Evaluation in multi-class scenarios is slightly more complicated because the quality metrics have to be either reported per category or aggregated across categories. There are two main aggregation approaches:

- Micro-average: Every assignment (document, label) has the same importance. Common categories have more effect on the aggregate quality than rarer ones.
- Macro-average: The quality for each category is calculated independently and the average over categories is reported. Therefore, all categories are equally important.
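
The difference between the two aggregations can be sketched by hand for precision. The helper `per_class_counts` and the toy labels are illustrative assumptions. The point is only that micro-averaging pools the counts before dividing, while macro-averaging divides per category and then averages.

```python
def per_class_counts(labels, predictions):
    """Return {category: (true_positives, number_of_predictions)}."""
    counts = {}
    for category in sorted(set(labels) | set(predictions)):
        tp = sum(1 for y, p in zip(labels, predictions)
                 if y == category and p == category)
        predicted = sum(1 for p in predictions if p == category)
        counts[category] = (tp, predicted)
    return counts


# Toy data: category 0 is common, categories 1 and 2 are rare
labels = [0, 0, 0, 0, 0, 1, 1, 2]
predictions = [0, 0, 0, 0, 0, 1, 2, 1]
counts = per_class_counts(labels, predictions)

# Micro: pool the counts over all categories, then divide once
micro_precision = (sum(tp for tp, _ in counts.values())
                   / sum(n for _, n in counts.values()))

# Macro: compute precision per category, then take the unweighted mean
# (assumption: a category that is never predicted contributes 0.0)
per_class = [tp / n if n else 0.0 for tp, n in counts.values()]
macro_precision = sum(per_class) / len(per_class)

print("Per category:", per_class)
print("Micro:", micro_precision, "Macro:", macro_precision)
```

The common category 0 is predicted perfectly, so it lifts the micro-average, while the two poorly predicted rare categories pull the macro-average down.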

In [31]:
# Multi-Class
multi_class_labels = [0, 0, 0, 0, 0, 1, 1, 2]
multi_class_predictions = [0, 0, 0, 0, 0, 1, 2, 1]

# Quality must be given per category or aggregated when dealing with multiclass data

print("Precision per category (0, 1, 2) {}".format(precision_score(multi_class_labels, multi_class_predictions, average=None)))
print("Micro-average Precision {}".format(precision_score(multi_class_labels, multi_class_predictions, average='micro')))
print("Macro-average Precision {}".format(precision_score(multi_class_labels, multi_class_predictions, average='macro')))
print()


print("Micro-average quality numbers")
print("Precision: {}, Recall: {}, F1-measure: {}".format(precision_score(multi_class_labels, multi_class_predictions, average='micro'),
                                                         recall_score(multi_class_labels, multi_class_predictions, average='micro'),
                                                         f1_score(multi_class_labels, multi_class_predictions, average='micro')))

Precision per category (0, 1, 2) [ 1.   0.5  0. ]
Micro-average Precision 0.75
Macro-average Precision 0.5

Micro-average quality numbers
Precision: 0.75, Recall: 0.75, F1-measure: 0.75


In [32]:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import f1_score, precision_score, recall_score

# Multi-label: each document may carry zero, one, or several labels
multi_label_labels = [[0], [0], [0], [0], [0], [1], [1], [2]]
multi_label_predictions = [[0], [0], [0], [0], [0], [], [2], [1, 2]]

# Convert the label lists to binary indicator matrices (one column per label)
mlb = MultiLabelBinarizer()
binarised_labels = mlb.fit_transform(multi_label_labels)
binarised_decisions = mlb.transform(multi_label_predictions)

print("Micro-average Precision {}".format(precision_score(binarised_labels, binarised_decisions, average='micro')))
print("Macro-average Precision {}".format(precision_score(binarised_labels, binarised_decisions, average='macro')))
print()

print("Micro-average quality numbers")
print("Precision: {}, Recall: {}, F1-measure: {}".format(precision_score(binarised_labels, binarised_decisions, average='micro'),
                                                         recall_score(binarised_labels, binarised_decisions, average='micro'),
                                                         f1_score(binarised_labels, binarised_decisions, average='micro')))

Micro-average Precision 0.75
Macro-average Precision 0.5

Micro-average quality numbers
Precision: 0.75, Recall: 0.75, F1-measure: 0.75
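
The binarisation step in the multi-label cell above can be pictured as building a 0/1 indicator matrix with one row per document and one column per label. The metrics are then computed column by column. The following plain-Python sketch mimics that idea without scikit-learn. The helper `to_indicator` is hypothetical and is not meant to reproduce `MultiLabelBinarizer` exactly.

```python
def to_indicator(label_sets, classes):
    """One row per document, one 0/1 column per class."""
    return [[1 if c in labels else 0 for c in classes] for labels in label_sets]


true_sets = [[0], [0], [0], [0], [0], [1], [1], [2]]
predicted_sets = [[0], [0], [0], [0], [0], [], [2], [1, 2]]
classes = [0, 1, 2]

y_true = to_indicator(true_sets, classes)
y_pred = to_indicator(predicted_sets, classes)

per_class = []
total_tp = 0
total_predicted = 0
for j in range(len(classes)):
    # A true positive is a cell that is 1 in both matrices
    tp = sum(t[j] * p[j] for t, p in zip(y_true, y_pred))
    predicted = sum(p[j] for p in y_pred)
    # Assumption: a class that is never predicted gets precision 0.0
    per_class.append(tp / predicted if predicted else 0.0)
    total_tp += tp
    total_predicted += predicted

micro_precision = total_tp / total_predicted
macro_precision = sum(per_class) / len(per_class)

print("Per label:", per_class)
print("Micro:", micro_precision, "Macro:", macro_precision)
```

Note that a document with an empty prediction list becomes an all-zero row, so it adds no predictions to any column and affects recall rather than precision.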
