In [1]:
import numpy as np
from mlcevaluator1 import mlcEvaluator1
from mlcevaluator2 import mlcEvaluator2
from mlctensor import mlcTensor
from sklearn.metrics import multilabel_confusion_matrix


## Data set
Select a data set by uncommenting either the synthetic data set with 3 labels and 9 instances or the movie poster data set (true labels and predictions are loaded from files).

In [2]:
# Example GT and Prediction matrices
'''
gt=np.asarray([[1,1,0], [1,1,1], [0,0,0],
               [1,0,0], [1,1,0], [0,0,0],
               [1,0,0], [1,1,0], [1,1,0]])
              
pred=np.asarray([[1,1,0],[1,0,1],[0,0,0],
                 [1,1,1], [1,1,1], [0,1,1],
                 [0,1,1], [1,0,1], [0,0,1]])
'''
# Load GT and prediction from file
gt=np.load('data/posters/gt.npy')
gt.shape

# Uncomment one set of predictions
#
#pred=np.load('data/posters/pred_t05.npy')
pred=np.load('data/posters/pred_t09.npy')
pred.shape

(7209, 18)

The equations for computing the contribution of a single data instance $i$ to the confusion tensor implicitly assume that $\lvert T_i\rvert > 0$ and $\lvert P_i\rvert > 0$, i.e. that both the true labels and the predictions for instance $i$ contain at least one assigned label. To handle cases where the true labels or the predictions have no label assigned, an additional class is included when computing the confusion tensor. This label, *unknown*, is added as the last element of each $T_i$ and $P_i$ vector, which is why the per-class results for the 18-label poster data list 19 classes (indices 0 to 18).
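
The idea of the extra *unknown* class can be sketched as follows. This is only an illustration under the assumption that the added element is set to 1 exactly when an instance has no label assigned; `add_unknown_label` is a hypothetical helper and not part of `mlcTensor`, whose internal handling may differ.

```python
import numpy as np

def add_unknown_label(y):
    """Append an 'unknown' column that is 1 only for rows without any label.

    Hypothetical helper for illustration; not the mlcTensor implementation.
    """
    y = np.asarray(y)
    # 1 where the instance has no label assigned, 0 otherwise
    is_empty = (y.sum(axis=1) == 0).astype(y.dtype)
    # column_stack turns the 1-D vector into an extra last column
    return np.column_stack([y, is_empty])

gt_small = np.array([[1, 1, 0],
                     [0, 0, 0]])
gt_ext = add_unknown_label(gt_small)
print(gt_ext)
```

With this convention every extended vector has at least one non-zero element, so the assumption $\lvert T_i\rvert > 0$ and $\lvert P_i\rvert > 0$ holds for all instances.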

## Multi-label Confusion Tensor
Compute the raw Multi-Label Confusion Tensor and the normalized Recall and Precision confusion matrices.

In [3]:
evalT = mlcTensor(gt, pred)
MT = evalT.computeConfusionTensor()
RT = evalT.getRecall()
PT = evalT.getPrecision()

## Per-class Precision/Recall
Per-class Recall and Precision are defined as:
$$
R(k) = \frac{TP(k)}{TP(k)+FN(k)},\qquad P(k) = \frac{TP(k)}{TP(k)+FP(k)}
$$
where $k$ is the class index, $TP(k)$ is the number of instances with correctly assigned label $k$, $FN(k)$ is the number of instances where the relevant label $k$ was not assigned, and $FP(k)$ is the number of instances with incorrectly assigned label $k$.
The $F_1$ score is the harmonic mean of precision and recall:
$$
F_1(k) = \frac{2\,P(k)\,R(k)}{P(k)+R(k)} = \frac{2\,TP(k)}{2\,TP(k)+FP(k)+FN(k)}
$$
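
As a point of reference, the formulas above can be evaluated with conventional element-wise counting on the synthetic 3-label example. This is a minimal sketch that counts each label independently of the others; it does not use the confusion tensor, so its counts may not coincide with the tensor-based values computed in this notebook. The lower-case names `tp`, `fn`, `fp` are chosen here only to avoid clashing with the notebook's variables.

```python
import numpy as np

# Synthetic example: 9 instances, 3 labels
gt = np.asarray([[1,1,0], [1,1,1], [0,0,0],
                 [1,0,0], [1,1,0], [0,0,0],
                 [1,0,0], [1,1,0], [1,1,0]], dtype=bool)
pred = np.asarray([[1,1,0], [1,0,1], [0,0,0],
                   [1,1,1], [1,1,1], [0,1,1],
                   [0,1,1], [1,0,1], [0,0,1]], dtype=bool)

# Element-wise counts per label (column)
tp = (gt & pred).sum(axis=0)    # label relevant and assigned
fn = (gt & ~pred).sum(axis=0)   # label relevant but not assigned
fp = (~gt & pred).sum(axis=0)   # label assigned but not relevant

recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * tp / (2 * tp + fp + fn)

# Both forms of the F1 formula agree
assert np.allclose(f1, 2 * precision * recall / (precision + recall))

for k in range(gt.shape[1]):
    print('%2d | %.2f  %.2f  %.2f' % (k, recall[k], precision[k], f1[k]))
```

Note that the divisions are safe here only because every label has at least one relevant and one predicted instance; real data may require guarding against zero denominators.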

Recall for each class is given by the corresponding diagonal element of the recall matrix. Precision is given by the corresponding diagonal element of the precision matrix.

In [4]:
R=RT.diagonal()
P=PT.diagonal()

The true positive values $TP(k)$, $k=1, \ldots, q$, for each class are given by the diagonal elements of the raw confusion tensor. They can be read from either the raw Recall or the raw Precision matrix, since the diagonal elements of both matrices are identical.

The false negative value $FN(k)$ for label $k$ is the sum of the corresponding row of the raw Recall matrix (the first element of the Confusion Tensor) minus its diagonal element, i.e. the number of true positives for the same label.

Similarly, the false positive value $FP(k)$ is the sum of the corresponding column of the raw Precision matrix (the second element of the Confusion Tensor) minus its diagonal element, i.e. the number of true positives for the same label.

In [5]:
TP = MT[0,:,:].diagonal()
FN = MT[0,:,:].sum(axis=1)-TP
FP = MT[1,:,:].sum(axis=0)-TP

F1 = 2*TP/(2*TP+FP+FN)

print(' k\t| R(k)\t P(k)\t F1(k)')
print('---------------------------------')
for k in range(R.shape[0]):
    print('%2d\t| %.2f\t %.2f\t| %.2f' %(k, R[k], P[k], F1[k]))



 k	| R(k)	 P(k)	 F1(k)
---------------------------------
 0	| 0.10	 0.14	| 0.12
 1	| 0.06	 0.10	| 0.08
 2	| 0.01	 0.03	| 0.01
 3	| 0.03	 0.05	| 0.04
 4	| 0.31	 0.32	| 0.31
 5	| 0.09	 0.11	| 0.10
 6	| 0.06	 0.08	| 0.07
 7	| 1.00	 0.51	| 0.67
 8	| 0.02	 0.05	| 0.03
 9	| 0.01	 0.02	| 0.02
10	| 0.00	 0.00	| 0.00
11	| 0.06	 0.09	| 0.07
12	| 0.00	 0.00	| 0.00
13	| 0.03	 0.06	| 0.04
14	| 0.15	 0.16	| 0.15
15	| 0.02	 0.06	| 0.03
16	| 0.11	 0.12	| 0.12
17	| 0.01	 0.03	| 0.01
18	| 0.00	 0.00	| 0.00


## Evaluating Classifier performance over all labels 
Let $\boldsymbol{T}_i$ be the vector representing the set of true labels for data instance $i$ and $\boldsymbol{P}_i$ be the vector of predicted labels for the same instance. The vectors $\boldsymbol{T}_{i1}=\boldsymbol{P}_{i1}=\boldsymbol{T}_i\cap \boldsymbol{P}_i$ represent the correctly predicted labels, $\boldsymbol{T}_{i2}=\boldsymbol{T}_i\backslash \boldsymbol{P}_i$ is the set of true labels not predicted by the classifier, and $\boldsymbol{P}_{i2} = \boldsymbol{P}_i\backslash \boldsymbol{T}_i$ is the set of incorrectly predicted labels. It follows that $\boldsymbol{T}_i = \boldsymbol{T}_{i1} + \boldsymbol{T}_{i2}$ and $\boldsymbol{P}_i = \boldsymbol{P}_{i1}+\boldsymbol{P}_{i2}$.<br>
 - $TP(k)$ (True Positive) is the number of instances with correctly assigned label $k$
 - $FP(k)$ (False Positive) is the number of instances with incorrectly assigned label $k$
 - $FN(k)$ (False Negative) is the number of instances where the relevant label $k$ was not assigned
 - $TN(k)$ (True Negative) is the number of instances where label $k$ is neither relevant nor assigned
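
The decomposition of the label vectors can be checked on a single instance with boolean arrays. The sketch below uses a made-up instance with four labels; the variable names simply mirror the symbols in the text.

```python
import numpy as np

# Hypothetical instance with 4 possible labels
T_i = np.array([1, 1, 0, 0], dtype=bool)   # true labels
P_i = np.array([1, 0, 1, 0], dtype=bool)   # predicted labels

T_i1 = T_i & P_i     # correctly predicted labels (equal to P_i1)
T_i2 = T_i & ~P_i    # true labels missed by the classifier
P_i2 = P_i & ~T_i    # incorrectly predicted labels

# The parts are disjoint, so their union (sum) restores the original vectors
assert np.array_equal(T_i1 | T_i2, T_i)
assert np.array_equal(T_i1 | P_i2, P_i)

print(T_i1.astype(int), T_i2.astype(int), P_i2.astype(int))
```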

Let $B(TP(k), FP(k), TN(k), FN(k))$ be a specific binary classification metric, $k = 1, \ldots, q$, where $q$ is the number of possible labels.

Label-based classification metrics for a classifier can be obtained using either the Macro-averaging or the Micro-averaging approach.

### Macro-averaging
Macro-averaging averages the metric over all categories, thus giving each category equal weight:<br>
$B_{Macro} = \frac{1}{q}\sum\limits_{k=1}^q B\big[ TP(k), FP(k), TN(k), FN(k)\big]$

Macro-averaged Recall and Precision for a classifier can be computed as:<br>
$R_{Macro} = \frac{1}{q}\sum\limits_{k=1}^q \frac{TP(k)}{TP(k)+FN(k)} = \frac{1}{q}\sum\limits_{k=1}^q R(k)$

$P_{Macro} = \frac{1}{q}\sum\limits_{k=1}^q \frac{TP(k)}{TP(k)+FP(k)} = \frac{1}{q}\sum\limits_{k=1}^q P(k)$

$F_{1Macro} = \frac{1}{q}\sum\limits_{k=1}^q \frac{2TP(k)}{2TP(k)+FP(k)+FN(k)} = \frac{1}{q}\sum\limits_{k=1}^qF_1(k)$

In [6]:
q = R.shape[0]  # number of classes as a scalar (R.shape itself is a tuple)

RMacro = R.sum()/q
PMacro = P.sum()/q
F1Macro = F1.sum()/q

print('Macro-averaged Classifier Recall:   ', RMacro.round(decimals=2))
print('Macro-averaged Classifier Precision:', PMacro.round(decimals=2))
print('Macro-averaged Classifier F1 score: ', F1Macro.round(decimals=2))


Macro-averaged Classifier Recall:    0.11
Macro-averaged Classifier Precision: 0.1
Macro-averaged Classifier F1 score:  0.1


### Micro-averaging
Micro-averaging pools the counts over all labels before computing the metric, thus giving each individual label assignment equal weight:<br>
$B_{Micro} = B\big[\sum\limits_{k=1}^q TP(k), \sum\limits_{k=1}^q FP(k), \sum\limits_{k=1}^q TN(k), \sum\limits_{k=1}^q FN(k)\big]$

Micro-averaged Recall and Precision for a classifier can be computed as:<br>
$R_{Micro} = \frac{\sum\limits_{k=1}^q TP(k)}{\sum\limits_{k=1}^q TP(k)+\sum\limits_{k=1}^q FN(k)}$

$P_{Micro} = \frac{\sum\limits_{k=1}^q TP(k)}{\sum\limits_{k=1}^q TP(k)+ \sum\limits_{k=1}^qFP(k)}$

$F_{1Micro} = \frac{2\sum\limits_{k=1}^q TP(k)}{2\sum\limits_{k=1}^q TP(k)+\sum\limits_{k=1}^q FP(k)+\sum\limits_{k=1}^q FN(k)}$

In [7]:
TPs = TP.sum()
FPs = FP.sum()
FNs = FN.sum()
RMicro = TPs/(TPs+FNs)
PMicro = TPs/(TPs+FPs)
F1Micro = 2*TPs/(2*TPs+FPs+FNs)

print('Micro-averaged Classifier Recall:', RMicro.round(decimals=2))
print('Micro-averaged Classifier Precision:', PMicro.round(decimals=2))
print('Micro-averaged Classifier F1 score:', F1Micro.round(decimals=2))

Micro-averaged Classifier Recall: 0.34
Micro-averaged Classifier Precision: 0.33
Micro-averaged Classifier F1 score: 0.33
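
The difference between the two averaging schemes can be made explicit by writing them as generic functions of a binary metric $B$. The sketch below is an illustration only: the function names are hypothetical, the $TN$ argument is omitted because neither recall nor precision uses it, and the per-label counts are made-up values for three labels.

```python
import numpy as np

def recall(tp, fp, fn):
    # fp is unused; kept so that all metrics share one signature
    return tp / (tp + fn)

def precision(tp, fp, fn):
    # fn is unused; kept so that all metrics share one signature
    return tp / (tp + fp)

def macro_average(metric, tp, fp, fn):
    """Evaluate the metric per label, then average: each label weighs the same."""
    return metric(tp, fp, fn).mean()

def micro_average(metric, tp, fp, fn):
    """Sum the counts over labels, then evaluate the metric once."""
    return metric(tp.sum(), fp.sum(), fn.sum())

# Illustrative per-label counts for three labels (not the poster data)
tp = np.array([5, 2, 1])
fp = np.array([0, 3, 6])
fn = np.array([2, 3, 0])

r_macro = macro_average(recall, tp, fp, fn)
r_micro = micro_average(recall, tp, fp, fn)
p_macro = macro_average(precision, tp, fp, fn)
p_micro = micro_average(precision, tp, fp, fn)

print('Recall    macro %.3f  micro %.3f' % (r_macro, r_micro))
print('Precision macro %.3f  micro %.3f' % (p_macro, p_micro))
```

In this toy case the rare third label has perfect recall, which lifts the macro average above the micro average; with pooled counts, frequent labels dominate the result.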
